
One of the most difficult complications to overcome when debugging a problem is that it occurred some time in the past. There is always some delay between when an issue occurs and when it is noticed or reported, but as that delay stretches to days or weeks, the probability of easily diagnosing the problem diminishes dramatically. In this interim period:

  • Log files and other system state get archived, buried under irrelevant new entries, or simply lost. This increases the time and effort needed to recreate the system state, if that is still possible at all.
  • Users' memories of the experience, including error messages, system actions, and other transient state, fade. This makes it less likely that questioning users will yield interesting and useful data.
  • Changes to hardware and software pile up, meaning that it can become increasingly difficult to recreate the problem or make the current state match a previous known bad state.
  • Changes to your own code pile up, forcing you to jump back to an arbitrary point in your development stream in order to make sense of an old error. Changes in the code also often make it difficult to establish whether an error in the older code could still manifest itself in the current version.

These effects show how important it is to minimize temporal distance. Techniques that help include good logging and error capture to help recreate old system state, effective use of source code management tools, strict error-reporting procedures, and an environment in which users and developers are encouraged to notice bugs quickly and rewarded for doing so.
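As a rough illustration of the "error capture" idea, here is a minimal Python sketch that records an exception together with enough environment detail to help reconstruct the system state later. The function name `capture_error_snapshot`, the `app_version` and `context` parameters, and the `parse_quantity` example are all hypothetical, invented for this illustration; a real system would probably record more (configuration, recent log lines, the source revision) and write the result somewhere durable.

```python
import json
import platform
import traceback
from datetime import datetime, timezone


def capture_error_snapshot(exc, app_version="unknown", context=None):
    """Build a JSON-serialisable record of an exception and its environment.

    app_version and context are hypothetical fields: supply whatever
    identifies your build and the operation that was in progress.
    """
    return {
        # A timezone-aware UTC timestamp pins down when the failure happened.
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "app_version": app_version,
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "error_type": type(exc).__name__,
        "error_message": str(exc),
        # The formatted traceback is a list of strings, one per frame or line.
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "context": context or {},
    }


def parse_quantity(text):
    """Stand-in for application code that can fail on bad input."""
    return int(text)


try:
    parse_quantity("twelve")
except ValueError as exc:
    snapshot = capture_error_snapshot(
        exc, app_version="1.4.2", context={"input": "twelve"}
    )
    # Printed here for brevity; in practice, append it to a log or error store.
    print(json.dumps(snapshot, indent=2))
```

Recording the timestamp and version at the moment of failure means that a report arriving days later can still be matched to the exact build and environment that produced it, without relying on anyone's memory.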

Tomorrow: Operational Distance, and Summary