Trace travel debugging

While working in a large codebase, from time to time I have encountered bugs that are difficult to reproduce. I know that something has gone wrong, but it is difficult to pinpoint the failure; by the time the failure has occurred, the interesting values/faulty state is gone. For a rarely hit bug, stepping through with a debugger won't work, and crash dumps for ship builds can be difficult to interpret.

One way to attack a problem like this is to use Time Travel Debugging, where the program state is recorded over time, and so a developer can re-play the exact sequence of events. Time Travel Debugging is costly to set up, though, and in addition the overhead it introduces alters timing and potentially altering the repro. (These issues are commonly thread or cross-process concurrency bugs).

I recently thought of a way to assist these investigations. I wrote a tool that I call "Trace Travel Debugging". Trace Travel Debugging uses a logging approach. By comparing logs during a failure with logs during success, one can potentially see the point of divergence. Instead of manually adding log statements, though, my tool automatically traces all function calls, making it much more likely that the point of divergence will be seen.

An example trace for one thread of a program using the Anti-Grain graphics library:
  Agg2D::Agg2D(
    void Agg2D::lineCap(LineCap cap)
    void Agg2D::lineJoin(LineJoin join)
  void Agg2D::attach(Image& img)
    void Agg2D::attach(unsigned char* buf, unsigned width, unsigned height, int stride)
    height =64 
    width =256 
      void Agg2D::resetTransformations()
      void Agg2D::lineWidth(real w)
      void Agg2D::lineColor(unsigned r, unsigned g, unsigned b, unsigned a)
        void Agg2D::lineColor(Color c)
      void Agg2D::fillColor(unsigned r, unsigned g, unsigned b, unsigned a)
        void Agg2D::fillColor(Color c)
      void Agg2D::textAlignment(TextAlignment alignX, TextAlignment alignY)
      void Agg2D::clipBox(real x1, real y1, real x2, real y2)
      void Agg2D::lineCap(LineCap cap)
      void Agg2D::lineJoin(LineJoin join)
      void Agg2D::flipText(bool flip)
      void Agg2D::imageFilter(ImageFilter f)
      void Agg2D::imageResample(ImageResample f)
  void Agg2D::lineWidth(real w)
  void Agg2D::lineColor(unsigned r, unsigned g, unsigned b, unsigned a)
    void Agg2D::lineColor(Color c)
  void Agg2D::line(real x1, real y1, real x2, real y2)
    void Agg2D::addLine(real x1, real y1, real x2, real y2)
    void Agg2D::drawPath(DrawPathFlag flag)
      void Agg2D::addStrokePath()
      void Agg2D::render(bool fillColor)
      time at Agg2D::render is (2013-11-3-22-7-54-170790)
        static void render(Agg2D& gr,
          void blend_solid_hspan(int x, int y,
            void blend_pix(value_type* p,
            void blend_pix(value_type* p,
            void blend_pix(value_type* p,
            void blend_pix(value_type* p,
          void blend_solid_hspan(int x, int y,
            void blend_pix(value_type* p,
            void blend_pix(value_type* p,
            void blend_pix(value_type* p,
Seeing the flow of execution like this can also be generally useful for learning how a large program works.

Usage:

1) Add the following line to the headers section of a .cpp or .h file to log: #define TTD_TRACELOG(s,n)

2) If desired, add some manual logging statements in these files. For example, TTD_TRACELOG("the height is", height); or TTD_TRACELOG("in function main, time is {TIME}", 0);

3) Add each file to a list in tracetraveldebugging.py:
srcs = [ r'~/fastdev/fastpixel/fastpixel_v3/sdl/agg24/src', 
 '~/fastdev/fastpixel/fastpixel_v3/sdl/agg24/include/agg_pattern_filters_rgba.h', 
 '~/fastdev/fastpixel/fastpixel_v3/sdl/agg24/include/agg_pixfmt_rgb.h' ]
outputdir = r'~/output'


Now, run tracetraveldebugging.py. It will modify the specified sources (after backing up the original) and add logging statements for each function/method. Rebuild and run your program. Run tracetraveldebugging_print.py, and you will see traces for each thread.

Details:

Because race conditions / thread concurrency issues could be the cause, this tool needs to have minimal impact on the program. Each trace results in only one 32bit integer written to an already-open file handle. The resulting overhead was sufficiently small in the test projects I used.

The Python script uses a heuristic to find all functions and methods. The script associates a 32 bit integer with the current function and writes this to a persisted lookup file to use later. It injects code that creates a class instance. The class's constructor and destructor both write traces, and so we can trace both entry and exit.

For each trace statement, the current thread id is hashed to a 32-bit integer, the lower 14 bits are taken and used as an index into an array of 16384=2^14 file handles. If the file handle isn't set, it is opened. The integer and any associated data (like the current time if specified by the string "{TIME}") are then written in binary to the file. (On Windows, _setmaxstdio will work around file limits in long-running thread spawning scenario, and fwrite_no_lock can be used to ensure a critsec is not taken). Buffered output dramatically reduces the cost of these many fwrite calls.

tracetraveldebugging_print.py simply reads the binary output, finding each tag in the lookup file that was persisted earlier.

In Windows, ETW is a very-low-overhead tracing mechanism that would fit well here, but it is not cross-platform, and requires an ETW consumer module to be written to output events to disk.

I'll add this to GitHub soon.