
I wanted to break out of the Distance Debugging discussion for a moment (to return later today) to highlight a new feature I just added up top, the Distance Debugging Help Desk. In brief, I want to help you fix your problems with computer hardware and software in order to get practice debugging. The only thing I ask in return is that I can post about it, and if you have a blog, that you link back to me if I fix your problem. Anyway, there are a few caveats that are listed in full in the page linked at the top (basically, I can't fix everything, I'm very busy, and please don't sue me), so check that out and then email me at helpdesk (at) distancedebugging.com if you are so inclined.

For one very brief example, I stumbled across Alex Hopmann's Blog a few days ago and he had noted a problem where hard drive accesses caused his computer to slow way down. It just so happened that recently I was investigating problems with DMA on Linux, and although it was a Windows computer, the symptoms were exactly the same. I emailed him and with that clue, as he explains in this post, he was able to resolve it with no trouble and got his laptop much more functional again.

Anyway, I can't guarantee a fix for you would be that simple, but it can't hurt, can it?

I am a software engineer, developer, and designer by trade. In my experience, I am especially good at debugging and that's probably because I spent a ridiculous amount of time doing it. You know how some people are obsessive about keeping a perfectly clean and orderly house? I am that way about my computer's operation. If I notice anything broken or simply not working the way I would expect it to, I will endlessly investigate until it is solved. Over the years this has meant that I've developed an oddly comprehensive knowledge of things that go wrong with computers. I have also, with a bit of introspection, worked up a theory to try to teach others how to get better at debugging in general. My hope for this blog, and this site in general, is to improve readers' debugging skills. If it accumulates interesting stories of my own and others' debugging travails along the way, so much the better.

One of the most difficult complications to overcome when debugging a problem is that it occurred some time in the past. There is almost always some delay between when an issue actually occurs and when it is reported or noticed, but as that delay stretches to days or weeks, the probability of easily diagnosing the problem diminishes dramatically. In this interim period:

  • Log files and other system state get archived, filled with lots of additional information that is not relevant, or simply lost. This increases the amount of time and effort necessary to recreate the system state, if it is even still possible.
  • User memories of the experience, including error messages, system actions, and other transient state, fade. This makes it less likely that interrogation of users will result in interesting and useful data.
  • Changes to hardware and software pile up, meaning that it can become increasingly difficult to recreate the problem or make the current state match a previous known bad state.
  • Changes to your own code pile up, causing you to have to jump back to an arbitrary point in your development stream in order to make sense of an old error. It's also often true that changes in the code mean that it is difficult to establish whether an error in the older code could manifest itself in the current version of the code.

These effects show how important it is to eliminate temporal distance. Techniques to help in this regard include good logging and error capture to help recreate old system state, effective use of source code management tools, strict procedures in regard to error reporting, and creating an environment where users and developers are encouraged to and rewarded for noticing bugs quickly.
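To make "good logging and error capture" a little more concrete, here is a minimal sketch in Python. The function names (`capture_error_context`, `run_with_capture`, `to_log_line`) and the `extra` field are hypothetical, invented for this illustration. The idea is simply to record the system state at the moment of failure so it does not have to be reconstructed weeks later.

```python
import json
import platform
import sys
import traceback
from datetime import datetime, timezone


def capture_error_context(exc, extra=None):
    """Build a JSON-serializable record of an exception plus basic system state."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "error_type": type(exc).__name__,
        "message": str(exc),
        "traceback": traceback.format_exception(type(exc), exc, exc.__traceback__),
        "python_version": sys.version,
        "platform": platform.platform(),
        # Hypothetical application-specific state, e.g. version or user action.
        "extra": extra or {},
    }


def run_with_capture(func, *args, extra=None):
    """Call func(*args); on failure return (None, record) instead of losing the state."""
    try:
        return func(*args), None
    except Exception as exc:
        return None, capture_error_context(exc, extra)


def to_log_line(record):
    """Serialize a record as one line of JSON, suitable for appending to a log file."""
    return json.dumps(record, sort_keys=True)


if __name__ == "__main__":
    result, record = run_with_capture(lambda x: 1 / x, 0, extra={"app_version": "1.2.3"})
    if record is not None:
        print(to_log_line(record))
```

A real system would capture far more (configuration, recent user actions, input data), but even this much narrows the temporal gap: the record is written when the error happens, not when someone finally reports it.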

Tomorrow: Operational Distance, and Summary

The Social Distance factor is often overlooked when attempting to debug a problem. It takes several forms, but it mostly boils down to a single issue: you are not a member of the same group as the people using the system. You belong to the software developer group, while the people using it belong to the accounting, or medical, or whatever group has contracted your services. You think about the software in a different way and you use different terms, which can make communication about issues difficult. These differences cause an initial distrust that gets reinforced through these communication problems.

Social Distance causes problems in a variety of ways, for instance:

  • Bugs are reported in terms of the user experience, not in terms of the programmer understanding. This leads to confusion about what is actually wrong, and worse, the severity of the problem.
  • It can be hard to get access to the right people when you are part of the "out" group. You are often not seen as a priority.
  • The users' perception of the job you are doing can vary significantly from your own. You may believe that they see you as delivering key functionality in a timely fashion and fixing problems rapidly, while they actually see trivial updates and constant breakdowns.
  • It is often difficult to know whom to trust when receiving conflicting information from different people within a customer's organization. Being on the outside, you won't know useful information like, "So-and-so always has a million complaints about everything, just ignore him/her", or "If So-and-so says fix it, you better fix it".

Social Distance is the main reason we hate technical support and the applications we use. When things go wrong or don't work as anticipated, we feel that either the authors don't understand our problems, or they don't care. It also tends to feed back on itself quickly: once you are perceived by users to be an asset, your subsequent actions will be seen as good, and if you are perceived as a problem, your actions will be seen as detrimental or annoying.

Overcoming social distance and getting to be "one of the team" can go a very long way towards solving debugging problems. You will get better information, get reports more quickly, and will be given more opportunities to get things right. The keys to doing this include practicing good information hygiene, managing your reputation, and, as with physical distance, identifying and cultivating relationships with high-quality intermediaries, then additionally using them to help gain the trust of others.

Tomorrow: Temporal Distance

Physical Distance, while somewhat self-explanatory, can be used to describe both an issue with physical proximity, and also with access. There are servers many miles away on which I have shell access and can therefore directly observe and query the system. There are also servers that run my code that I am several steps removed from and into which I have no direct visibility. Physical Distance is a problem because it means that manipulating the system requires some local intermediary. All of your data collection and actions are then filtered through them, and their oversights and mistakes can derail the most careful debugging process.

Physical Distance also acts as a complicating or driving factor in the other kinds of distance. In yesterday's post I mentioned the unobserved changing of hardware or software as a kind of mental distance. However, with physical access, even if you aren't notified of the change directly, once you note its possibility, it is trivial to rule in or out. Without it, you have to rely on your intermediary to attempt to verify, and they may not always be correct. Physical Distance can also be a big factor in Social Distance (more on this tomorrow) since it can be hard to gain user trust without being in physical proximity, and Temporal Distance (Monday), since problems can go unreported if you are not around to observe them.

The keys to overcoming physical distance include techniques such as developing systems that provide multiple, mutually supporting data points to help root out errors made by the intermediary, and identifying and cultivating relationships with high-quality intermediaries. More on this in the discussion about tools and techniques.

Tomorrow: Social Distance

When you sit down to debug a problem, there is a certain amount of information missing. Many point to this as the essence of debugging: filling in those gaps until you have a clear enough picture of what is wrong to identify the cause and fix the problem. My problem with this description is that it assumes that you can simply gather the necessary information until you have enough. In my experience, there is a certain amount of information that is simply not available due to reasons such as:

  • The person or people who know about the system or code no longer work there, or were never directly accessible in the first place as when using commercial software.
  • It is unclear who actually knows, especially in the case of large third-party libraries or open-source software.
  • The information is closely guarded for intellectual property or reverse engineering prevention reasons.
  • The cost of acquiring the necessary information is simply too high.

So while debugging is partially about figuring out what you don't know and filling in gaps, it is just as much about reasoning well with partial information, or using what you do know to constrain the missing information to a set of possibilities that can be exhaustively tested. In some cases, debugging is even about deciding whether it is faster to research and acquire the information directly (with some estimated probability of success), or to try to guess at the missing information in the hopes that with a small number of possibilities, one will become obviously correct. You want to find the mental distance gap that is easiest, cheapest, or fastest to close, depending on the situation.
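The research-versus-guess decision can be sketched as a rough expected-cost comparison. This is a toy model, not a rule: it assumes every candidate explanation is equally likely, that each costs the same to test, and that failed research falls back to guessing. All names and numbers below are hypothetical.

```python
def expected_guess_cost(num_candidates, cost_per_test):
    """Expected cost of testing equally likely candidates one at a time.

    With a uniform prior, the right answer turns up after (n + 1) / 2 tests
    on average.
    """
    return cost_per_test * (num_candidates + 1) / 2


def expected_research_cost(research_cost, p_success, num_candidates, cost_per_test):
    """Expected cost of researching first and falling back to guessing on failure."""
    fallback = expected_guess_cost(num_candidates, cost_per_test)
    return research_cost + (1 - p_success) * fallback


if __name__ == "__main__":
    # Hypothetical numbers: 5 candidate causes at 10 minutes per test, versus
    # 20 minutes of research with a 50% chance of finding the answer.
    guess = expected_guess_cost(5, 10)
    research = expected_research_cost(20, 0.5, 5, 10)
    print("guess first" if guess <= research else "research first")
```

Under these assumptions, research only pays off when its cost is less than the chance of success times the expected cost of guessing. In practice the estimates are rough, but writing them down makes it easier to see which gap is cheapest to close.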

There is a second kind of mental distance, and that is not knowing about changes that take place that affect your system. Since change tracking should drive a debugging investigation (more on this in a few days), this can be very troublesome. Some examples of "hidden" changes that result in baffling bugs:

  • A piece of hardware is added or removed from a machine.
  • A user decides to try data inputs that are outside the original specifications of the system.
  • A user decides to try a sequence of operations that they have never tried before (and maybe no one has tried before).
  • A piece of conflicting software is installed.
  • A piece of dependent software or the OS is upgraded.

When a bug is reported on a running system and one of these types of things has taken place unbeknownst to you, it can be nearly impossible to determine why the system has failed. Being aware of this mental gap, and keeping in mind that machines change configuration and people change their behavior can be critical to quickly asking the right questions in a debugging situation.
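One way to shrink this second kind of mental gap is to record a snapshot of the machine's configuration periodically and compare it against the last known good one. The sketch below assumes a snapshot is just a dictionary of names to values. The `diff_snapshots` helper and the sample data are hypothetical, and how you actually collect the snapshot (package lists, hardware inventory, OS version) depends on the platform.

```python
def diff_snapshots(old, new):
    """Compare two {name: value} configuration snapshots and report what changed."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    # Hypothetical snapshots taken before and after a bug report.
    last_known_good = {"os": "10.1", "libfoo": "2.3", "usb_scanner": "present"}
    current = {"os": "10.2", "libfoo": "2.3", "antivirus": "5.0"}
    for kind, items in diff_snapshots(last_known_good, current).items():
        print(kind, items)
```

Here the diff would surface an OS upgrade, a removed piece of hardware, and a newly installed piece of software, which are exactly the kinds of hidden changes listed above and the first things worth asking about.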

In case you were looking for help right away, I'm not going to cover how to overcome mental distance in this post. Instead, I will be laying out the techniques, tools and strategies in a separate series of posts after the types of distance are described.

Tomorrow: Physical Distance

What does an ideal debugging situation look like?

  • The person reporting the problem gives a clear definition of what is going wrong, the severity, and instructions about how to reproduce.
  • The person reporting the problem remains available and willing to answer additional questions.
  • You can run their steps to reproduce on the actual system where it originally failed.
  • You can easily automate this process so that it can be tested without a long data generation lead-up or 10 minutes of GUI clicks.
  • The bug occurs in code that you wrote, recently.
  • When you run the same test case off of the actual system, it still fails exactly as expected.
  • When you fix the problem in your code, and after sufficient review, you can immediately install it on an actual system for testing.
  • When you rerun the test case on the actual system, the bug disappears as expected.
  • You can quickly and easily roll this out to other systems or users and it fixes the problem for the other systems and users as expected.
  • You are hailed as a genius and hero for your quick fix turnaround (optional).

If you have ever had a situation like this, I would be shocked, yet the field of debugging tends to assume these conditions (with many notable exceptions of course). I have nothing against the current set of texts that provide the basic tools to find and fix bugs under ideal circumstances, but it's kind of like trying to use a book on Java style and syntax to actually develop an application. You can't get by without it, but it's not even close to the whole story.

The concept of Distance Debugging is an attempt to fill in this gap, and offer a theory of debugging problems in real-world situations. I use distance as a recurring theme because I think it captures the essence of what is hard about debugging. Looking at the list above, here is what you are more likely to encounter, along with the type of distance:

  • The person reporting the problem gives a vague description, gives no indication of frequency or severity, and is possibly quite angry with you about it, or worse, they report it several days or weeks after it occurred. [Social, Temporal Distance]
  • The person reporting the problem is swamped and unable to help (and the organization refuses to make them available), is not interested in helping, or is simply unknown or inaccessible [Social, Operational, Physical Distance]
  • You do not have any access to the actual system [Physical and Operational Distance].
  • The problem requires an extensive set of manual steps to reproduce [Mental Distance, possibly caused by Physical or others].
  • The problem appears to be in a piece of code that you did not write (such as a third-party library or the operating system), or that you wrote several years ago [Mental Distance].
  • The problem stubbornly resists replication off of the actual system [Mental Distance].
  • You manage to find and fix the problem, but you are prevented from installing the fix for 6 months [Procedural Distance].
  • Despite the bug disappearing from your system, the fix fails to affect the problem on the real system [Mental Distance, and possibly others].
  • You roll out the fix, and while it fixes the bug for a third of the users, it hangs around for the remainder [Mental, Procedural Distance].
  • With your slow response time and fixes partially or completely failing, your reputation suffers and users become increasingly unwilling to report problems or assist with the process of fixing existing bugs [Social Distance].

It's a cycle that I've seen play out many times. It's what makes people so dissatisfied with technical support. You go into it assuming that they won't be able to fix it anyway, so why bother being especially helpful? This isn't to blame customers, since they generally have every right to be upset, but to illustrate the consequences. Over the next 5 days, I will cover the types of distance in individual posts.

Tomorrow: Mental Distance.

In an attempt to lay out a lot of ideas all at once, I decided to join up with the National Blog Posting Month, the lazy (or let's say, time-challenged) stepcousin of National Novel Writing Month. The idea is pretty straightforward: just post every day for the entire month. I decided that I will use this opportunity to lay out what I see as the major pieces of the skill and theory of Distance Debugging in the hopes that others will comment and improve on the ideas, or at least tell me if this stuff is trivial or plain wrong.

I'd like to start by talking about the skill of debugging, leaving distance debugging aside for the moment. I was having a conversation with my dad the other day about how he was dissatisfied with the company that provides his firm with IT support. I made the offhand comment to him that the problem is not that they don't care about problems, it's that they aren't any good at finding and fixing them. To me, this is a critical but hidden issue. Everyone has their favorite horror story about technical support, but no one seems to be stopping to say, "hey, maybe we should think about how to get better at fixing things."

In the world of software development, the problem gets worse. Estimates vary widely on the amount of time the average developer spends debugging, but there is general agreement that it's at least 50% of their time (and up to 80% or more). However, while there are a zillion books that can teach you the basics of programming, and hundreds of titles devoted to every minuscule aspect of design, there are only a handful of titles devoted to debugging. Why is it that the thing we probably spend the most time doing is the hardest to get any good information about?

I honestly don't know, but I have a few ideas:

  1. Debugging is seen as something you either know how to do or you don't. It's not teachable, so why write or read a book about it? This is probably totally false.
  2. Debugging is seen as something that you pick up through experience, and that's the only way to do it. There's no point in trying to come up with a curriculum. It's true that experience helps, but this kind of argument is always made in nascent domains.
  3. There isn't a good theory of debugging, so we wouldn't know what to teach. This is partially true, but as the few books on the market show, you can have a set of rules or guiding principles as you do in any scientific field.

So I'm starting with the assumptions that debugging can be taught, that it is very important to teach, and it's really just a matter of figuring out what to teach. Having worked in the field and displayed a knack for finding and fixing problems, I think I have a coherent way of describing the process. I come from a different perspective because I believe that most real-world problems have an element of distance in them (more on this tomorrow) and this is where the current thinking in the field falls short.

Tomorrow's entry: Distance is Everything.