What does an ideal debugging situation look like?
- The person reporting the problem gives a clear definition of what is going wrong, the severity, and instructions about how to reproduce.
- The person reporting the problem remains available and willing to answer additional questions.
- You can run their steps to reproduce on the actual system where it originally failed.
- You can easily automate this process so that it can be tested without a long data generation lead-up or 10 minutes of GUI clicks.
- The bug occurs in code that you wrote, recently.
- When you run the same test case off of the actual system, it still fails exactly as expected.
- When you fix the problem in your code, and after sufficient review, you can immediately install it on an actual system for testing.
- When you rerun the test case on the actual system, the bug disappears as expected.
- You can quickly and easily roll this out to other systems or users and it fixes the problem for the other systems and users as expected.
- You are hailed as a genius and hero for your quick fix turnaround (optional).
If you have ever had a situation like this, I would be shocked, yet the field of debugging tends to assume these conditions (with many notable exceptions of course). I have nothing against the current set of texts that provide the basic tools to find and fix bugs under ideal circumstances, but it's kind of like trying to use a book on Java style and syntax to actually develop an application. You can't get by without it, but it's not even close to the whole story.
The concept of Distance Debugging is an attempt to fill this gap and offer a theory of debugging problems in real-world situations. I use distance as a recurring theme because I think it captures the essence of what is hard about debugging. Looking at the list above, here is what you are more likely to encounter, along with the type of distance:
- The person reporting the problem gives a vague description and no indication of frequency or severity, and is possibly quite angry with you about it; worse, they may report it several days or weeks after it occurred [Social, Temporal Distance].
- The person reporting the problem is swamped and unable to help (and the organization refuses to make them available), is not interested in helping, or is simply unknown or inaccessible [Social, Operational, Physical Distance].
- You do not have any access to the actual system [Physical and Operational Distance].
- The problem requires an extensive set of manual steps to reproduce [Mental Distance, possibly caused by Physical or others].
- The problem appears to be in a piece of code that you did not write (such as a third-party library or the operating system), or that you wrote several years ago [Mental Distance].
- The problem stubbornly resists replication off of the actual system [Mental Distance].
- You manage to find and fix the problem, but you are prevented from installing the fix for 6 months [Procedural Distance].
- Despite the bug disappearing from your system, the fix fails to affect the problem on the real system [Mental Distance, and possibly others].
- You roll out the fix, and while it fixes the bug for a third of the users, it hangs around for the remainder [Mental, Procedural Distance].
- With your slow response time and fixes partially or completely failing, your reputation suffers and users become increasingly unwilling to report problems or assist with the process of fixing existing bugs [Social Distance].
It's a cycle that I've seen play out many times. It's what makes people so dissatisfied with technical support. You go into it assuming that they won't be able to fix it anyway, so why bother being especially helpful? This isn't to blame customers, since they generally have every right to be upset, but to illustrate the consequences. Over the next 5 days, I will cover the types of distance in individual posts.
Tomorrow: Mental Distance.
