Distance Debugging

To briefly recap Part I, the idea is to try to establish a rapport while getting more information, as well as learning about what kind of user is behind the report. By now, you have heard "the story of the bug".

  1. After hearing the basic story, decide whether you have a clear enough understanding of the problem. If so, restate it to the user to see if they agree. At that point it mostly becomes a negotiation to make sure that any remaining disfluencies are ironed out.
  2. If you feel that there are still gaps or confusing elements, it might be because of one of a few things:
    • There is some word that you both think is clearly defined but which you are actually using differently, and that is preventing good communication. The key here is to focus on the phrase that is leading you astray. The user might say "and then the mouse stops working", and they really mean that they can't click on something that they think should be clickable, but in your head you are imagining some kind of driver or hardware failure. Try to get them to state the problem in different words or, better yet, see if they can demonstrate the problem for you live (assuming you don't have a major physical distance issue).
    • The user keeps restating the problem in terms of the way that the system is currently operating (which they think of as wrong) but is not stating what they actually want. The user may not really know what they want; they only know what they don't want. In that case, it is often helpful to start offering up your own countersuggestions to play a little game of "warmer/colder". If they say, "the reports are generated biweekly", you might counter with, "We could easily generate them every week". If the problem is that they want more frequent reporting, they might continue along that line and say, "How about twice a week?", but if the problem is that their inbox is choked with stuff already and they don't need that much information, they might say, "no, no, that's even worse". At least you can start to see the direction in which the change should be made.
  3. If you reach a state where the problem is not clear, but at least the steps to reproduce are, then it is often useful to end your conversation with the user and move on to replication, as that can give you a more direct understanding of the problem. Once you can see it firsthand, you can always call the user back to gather more information.
  4. If you determine that the user is making a feature request and not a bug report, be upfront about it. "I agree that what you are describing would be very nice to have and would save you time. It would probably take a fair amount of time to build, but I'm sure that we could get it out in the next release if you can get it added to the list of features." Encourage them to discuss it with other users to create a critical mass, and point them to the customer contact for your project.
  5. If all else fails, and you have already developed another trusted user contact (see next item), end your conversation with the current user and talk it over with your trusted contact. "I got this bug report from another user that I just can't make heads or tails of. Does this make any sense to you?" Since they likely speak the same language as the reporter, they may pick up on language that you did not, or they might even say, "yeah, I have that problem all the time, but I didn't really know how to explain it". In any case, they will likely give you additional assistance in deciphering it.
  6. At the end of the conversation, make sure you are clear on the severity and timetable for this problem. State it to the user: "I know that this is really keeping you from getting any work done. I will get back to you in an hour or so with a status report. Hopefully I'll have a fix ready to go by then.", or "Thanks for the info. I'll check in next week and let you know if I've thought of anything." Also, make sure to be clear if you think you will need more information: "I may have a few more questions for you as I look into this. Would you mind if I gave you a call tomorrow or the next day?"
  7. Finally, if after discussing the report, it becomes clear that the reporting user is committed to/passionate about your system, has a natural rapport with you over the phone, and is somewhat technically savvy, you will want to think about escalating that relationship into a trusted contact or intermediary. That will be the subject of tomorrow's post.

Tomorrow: Developing a Trusted Contact

When you receive a bug report that is hard to understand from a person with whom you have never had contact, it can be difficult to get a complete picture of what is wrong. Assuming that you can make some contact with them, here is a good process for clarifying a report:

  1. Start by communicating your desire to be helpful and fix things. This will start the relationship out with the right tone. Users are often so used to being treated like nuisances that they will be very reticent to discuss things at first and will almost be apologetic for things they notice. Once they figure out that your top priority is fixing things, communication will become easier.
  2. Try to gauge the priority of this bug for this user in terms of their overall workload. This might be something very small to them and they are swamped with work, or it might be preventing them from completing something important. Say something like, "I saw your bug report and I wanted to get a little more information, if you have time. I couldn't tell from your report whether it was actually blocking your work, or just an annoyance."
  3. If the user has time and interest, start with the "Tell Me a Story" bit described a few days ago, but in addition to focusing on the details of the bug, try to gauge the user's level of technical skill, and their overall feeling about the system. Your goal is not only to fix this bug, but also to get them to feel good about the system. It sounds touchy-feely, but perception is everything, and if they want to vent about lots of things they think are wrong, it can help if they know you are listening, and you can get invaluable information about how things are actually used.
  4. As you are talking with the user, it can sometimes help to categorize them for future reference, because it can help you understand where they are coming from:
    • The Power User - The most valuable kind of user because they push the limits of the system. They will report a lot of bugs, and will care if they are fixed.
    • The Squeaky Wheel - Reports a lot of bugs but often of the preference type. Reports need to be read with more skepticism than usual.
    • The Boss - While someone in charge might report bugs that you otherwise would push off, fixing their bugs might have a stronger effect on the overall use of the system because of their influence. Plus, they might be the ones controlling your budget.
    • The Geek - A tech-savvy user can be an invaluable source of bug reports, and can also act as a kind of translator between the way that you think and the way that the users think. However, watch out for geeks trying to be overly helpful and offering endless "free advice".

Tomorrow: Clarifying the Bug Report, Part II

One of the hardest things to do when you receive a bug report is outright rejecting it. As keepers of the system, we want everything to be perfect under every circumstance. However, wanting to please users to this degree can actually hurt your reputation rather than help it.

There are a few good reasons to reject a bug report on its face:

  • The bug report involves running the system in a totally untested and out-of-scope configuration. For example, if a user reported a bug where the server wouldn't run inside the Windows emulator on a Mac, I might briefly investigate as an intellectual exercise, but I'm not going to treat this as a bug.
  • The bug report expresses a preference, and other users might have different preferences. If every user expresses the same sentiment, then I might accept it as a bug. If users differ, then it starts to fall into the realm of customizability, which generally involves new feature development.
  • The bug cannot be replicated and is not severe. In this case, it's just not worth spending time on. If it can be replicated, it's probably relatively easy to fix. If it's severe (i.e. results in data loss or an unusable system), then it's probably worth spending some time looking into.

Problems that can arise from accepting any bug report as a command to fix include:

  • Your reputation can suffer because it appears that you can't fix some of these "unfixable" bugs, especially the unreplicable ones.
  • Your reputation can suffer because you are spending too much time developing new features under the guise of bug fixing, and so you are perceived as having a buggy system rather than a full-featured one.
  • Your bug tracking system gets cluttered with lots of unreplicable bugs, which never get fixed since they can't be replicated.

Saying No is not that hard if you start doing it from the outset, and you communicate clearly with the reporter your reasons for rejecting the bug. Most users will accept an explanation that says, "I'm sure that was very annoying when it happened, but it hasn't happened again, and no one seems to be able to make it happen again. If it happens again, let me know", or "We've never tried running it that way before, so I'm not surprised that it didn't work. I'll make a note of it, and if you want to do this in the future, make sure it gets into the list of requirements". You can't just start rejecting bugs 3 years into a project, though; it will just make users frustrated. You must implement this strategy from the beginning.

Tomorrow: Clarifying the Bug Report

What if you look at that initial bug report and no possibilities jump to mind? It happens for many reasons. Sometimes, the nature of the bug being reported is so bizarre and unlikely that you can't even imagine why it might happen. Other times, the bug report itself is impenetrable and you can't even tell what might be wrong. How do you start the process of formulating a theory? Here are a few techniques for getting the investigation rolling:

  • Tell me a Story - Contact the person reporting the bug (if available) and ask them to tell you the story of what happened leading up to and immediately after the bug occurred in an informal way. Often it will reveal details that the reporter did not think were relevant originally but which turn out to be crucial.
  • Give it a Shot - If the bug report lays out a set of steps to reproduce and you have the ability to attempt them in some fashion, give it a shot. Often there is an initial psychological barrier of doubt where a bug is hard to take seriously because it seems implausible. The effect of making it happen can catalyze your thinking when it forces you to accept the reality of the problem.
  • Ask for Corroboration - This is useful in the case where the bug appears to be in a heavily-used piece of code, in which case it would be surprising that no one else is encountering it. It can help to send out a message to other users asking if they have seen this problem, or if they would be willing to try the steps. The results can tell you one of a few things: the reported activity is actually uncommon (in the case that it is easily replicable, but no one else has hit it); it is very common, and users just haven't been reporting it (happens more than you think); the reporting user's installation might be corrupted or broken (when no one else can replicate except the reporter); or the steps to replicate are slightly or majorly incomplete or incorrect, which happens because people can't quite remember what they did.
  • Ask for Replication - Often a user will report something the first time it happens, which is good. The problem is, often it's an odd one-time occurrence that never happens again. You will tear your hair out trying to replicate and track down something that very, very rarely occurs. If no one can replicate after a few attempts, make a note of it and move on (see tomorrow's post, Saying No).
  • Check the History - Bug tracking will be covered in another post, but assuming you are keeping good records, check for similar bugs. Perhaps the combination of the newly found problem and a previously unsolved bug will give you enough evidence to suddenly find a solution for both.
  • Look for Non-obvious Changes - If all else fails and you suddenly have a repeatable bug and there is no obvious cause, start to look into non-obvious changes. Was some unexpected system maintenance performed? Was a piece of hardware upgraded or swapped out? This can be a somewhat open-ended investigation, but start by looking at the software and hardware that would have the most obvious effect on the failure.

Tomorrow: Saying No to a Bug Report

So you have a bug report with some information. It may not be complete and you may not understand all of it. However, chances are, when you read it, an idea of where the problem lies will immediately jump into your head. There is an excellent (and very popular) book by Malcolm Gladwell called Blink. While it covers a lot of ground, the primary focus is on how the brain processes things subconsciously, which in some cases is good and productive, and in others is bad and even dangerous. The problem is knowing when to trust that instantaneous gut reaction. In the case of the bug report, that initial gut reaction is often invaluable, but it can also lead you astray. I follow a simple procedure to try to weed out misleading reactions:

  1. Make a note of what the initial thought about the bug is ("it sounds like an SQL error").
  2. Run that idea through a mini-gauntlet of reasons to throw it out (these will probably sound familiar):
    • Is this a plausible theory?
    • Is this a probable theory?
    • Am I just trying to confirm something I already believe like blaming a faulty or poorly understood component?
  3. If it fails any of these tests, keep the theory around in the tracker (more on the bug tracker later), but start with more data collection.
  4. If it passes the tests, begin with that theory as a going assumption and look at the code for an obvious mistake (if available) or create test cases that would match the theory, rather than collect more data.

It is surprising how often this process allows me to bypass an extended debugging session. It is also surprising how often I've talked myself out of that initial idea and wasted a lot of time before coming back around to it.

Tomorrow: When the Blink Fails

The bug report is your initial contact with a bug and it often heavily influences the way that you approach your investigation. What does a standard, high-quality bug report contain?

  1. Clear statement of what is wrong
  2. Steps to reproduce
  3. Specific version information about relevant software (and possibly hardware)
  4. Severity of the problem

Having that information is a great start, but it's not always available, especially items 1 and 2. Often instead of a clear statement of the problem, you get a long description of current functionality like, "It prints a biweekly report", or "I get an error when I do (known error-causing action)". The person making the report really wants it to do something else, but there is no way of knowing what from the report itself.

Instead of steps to reproduce you will get a general statement of what they were doing when it happened: "I was working on my PDQ report when I double-clicked the icon that brings up the report query interface, but it gave me an error and quit." The problem is, it can be hard to understand what they were doing because they don't use the same terms that you would. In this example, you may not have any idea what the "icon that brings up the report query interface" is. Remember that a bug is in the eye of the beholder and will be framed within the user's perception of the system. In these cases, you will need to get a lot more information to determine what is actually going wrong.

Tomorrow: Read the Bug Report and Blink

The one thing that I stress over and over again when working with people fixing bugs is to look at the changes. I base this on a pretty simple fact: if it was working fine before, then it's probably working fine now. We like to anthropomorphize computers and imagine that a little homunculus is inside pulling levers and pushing electrons around, and that like humans, this little guy might zig when he should have zagged. Occasionally, computers suddenly fail in spectacularly bizarre ways, but the vast majority of the time, a human has changed something and that's why your program is suddenly failing.

What does this mean for your daily debugging practice? There are several ramifications:

  • When searching for the most probable explanations, a good heuristic is to include at least one thing that has changed in every explanation. An explanation that relies on an unchanged thing failing or behaving differently than previously known is extremely improbable.
  • Keeping careful track of things that have changed is of absolutely critical importance. This means not only within your own code, but across all aspects of the system. This is where distance can often cause the most problems.
  • When gathering data about a bug where it is not readily clear what has changed, make establishing that fact your first goal.
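One way to make "what changed?" the first question is to keep snapshots of the system's state and diff them. The sketch below is only an illustration: the snapshot keys and values (app_version, db_driver, os_patch, nic) are made up, and a real system would have to collect them from wherever its configuration actually lives.

```python
def diff_snapshots(before, after):
    """Return a dict of keys whose values differ between two snapshots.

    Each entry maps a key to a (before, after) pair; a key missing from
    one side is reported with None on that side.
    """
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots taken before and after the bug first appeared.
last_known_good = {"app_version": "2.3.1", "db_driver": "10.2", "os_patch": "SP1"}
first_failure = {
    "app_version": "2.3.1",
    "db_driver": "10.2",
    "os_patch": "SP2",
    "nic": "replaced",
}

for key, (old, new) in diff_snapshots(last_known_good, first_failure).items():
    print(f"{key}: {old} -> {new}")
```

Anything that shows up in the diff is a candidate for inclusion in an explanation; anything that does not is, by the heuristic above, a much less likely culprit.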

When looking for changes you will likely first check for software and hardware changes, and that is a reasonable place to start. However, there are two commonly overlooked sources of change: user behavior, and the passage of time. User behavior may change for several reasons. It could be that there has been a customer policy change, or it just may be due to changes in their business environment. For example, let's say that your system has a comment field where users rarely enter information. Suddenly, the powers-that-be mandate that every record modification must be tagged with the name and date of the person doing the modification, so users begin tacking on their names and dates in the comment field, to them, a logical place. Suddenly people are unable to save records. You might be totally stumped until you realize that you limited the comment field to 100 characters, and that limit was quickly exceeded after several user edits. This policy change had the indirect effect of creating a "bug" where there previously was none.
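To see how quickly a policy like that can run into a field limit, here is a minimal sketch. The 100-character limit comes from the example above; the tag format, the user name, and the date are assumptions made up for illustration.

```python
COMMENT_LIMIT = 100  # the field limit from the example above


def append_edit_tag(comment, name, date):
    """Append a '[name date]' tag to the comment (hypothetical tag format)."""
    tag = f"[{name} {date}]"
    return f"{comment} {tag}".strip()


def can_save(comment):
    """Mimic a save that rejects comments longer than the field allows."""
    return len(comment) <= COMMENT_LIMIT


comment = ""
edits = 0
# Keep editing the same record until the save would be rejected.
while can_save(comment):
    comment = append_edit_tag(comment, "J. Smith", "2007-03-14")
    edits += 1

print(f"save fails on edit {edits} at {len(comment)} characters")
```

With a tag of this size, the record stops saving after only a handful of edits, even though no code and no hardware has changed.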

In terms of business environment changes, a classic bug problem is that of using a numerical field (instead of a character field) for zip codes. The bug appears when zip codes that begin with '0' suddenly start getting truncated because the leading zero is meaningless in a numerical field. An interesting complication of this problem would be a deployed system for a company that has done business in nothing but California for several years and so the problem goes unnoticed. One day, they get a new client in Massachusetts and the system can't print invoices correctly for that new client. You might wonder why the system is "suddenly" failing, but looking at the change in clientele would give you the hint you need.
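The zip code problem is easy to demonstrate. In this sketch the two storage functions are stand-ins for a numeric column and a character column (they are not any particular database's API), and the two zip codes are just sample values for each state.

```python
def store_as_number(zip_code):
    """Simulate a numeric column: the value round-trips through int."""
    return str(int(zip_code))


def store_as_text(zip_code):
    """Simulate a character column: the value is kept exactly as entered."""
    return zip_code


for state, zip_code in [("California", "90210"), ("Massachusetts", "02134")]:
    print(state, store_as_number(zip_code), store_as_text(zip_code))
```

The California value survives either way, which is exactly why the problem can sit unnoticed for years.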

The passage of time can be much trickier to notice and catch. The passage of time often means that hidden assumptions are exposed and limits are exceeded after a set period of time, and we fail to realize the significance of that time period. The classic example of this type of bug is the much-discussed Y2K bug, which wasn't a bug until enough time had passed. This type of problem crops up more often than you would think. I was working on a system that after exactly 4 months (I now know) suddenly completely failed. The cause turned out to be the logging system, which I had designed using Oracle's partitioning feature, where it can divide up the database data based on a criterion such as the value of a date column. This allows for easier maintenance in terms of log rolling because you can drop individual partitions without having to take the entire table offline. It turns out that I had made a mistake in designing the table, though, and it only had 4 partitions, each holding a month's worth of data. I had assumed that we would institute a policy of rolling logs within that 4-month window, thereby avoiding the issue, but in the fog of those early months after a new deployment, it was totally forgotten. Of course, once that magic date was exceeded, Oracle threw up its hands saying "I have no place to put this data!" and that brought the whole system to a screeching halt. While the fix of putting in an additional "everything else" partition was no trouble, it certainly was not my finest hour.
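The failure mode can be simulated in a few lines. This is a toy model in Python, loosely patterned on date-based partitions with upper bounds; it is not Oracle syntax, and the class, its names, and the dates are all invented for the illustration.

```python
from datetime import date


class NoPartitionError(Exception):
    """Raised when a row's date falls outside every defined partition."""


class PartitionedLog:
    def __init__(self, upper_bounds, catch_all=False):
        # Each partition holds rows dated strictly before its upper bound.
        self.upper_bounds = sorted(upper_bounds)
        self.catch_all = catch_all
        self.rows = {bound: [] for bound in self.upper_bounds}
        self.overflow = []

    def insert(self, when, message):
        for bound in self.upper_bounds:
            if when < bound:
                self.rows[bound].append(message)
                return
        if self.catch_all:
            # The "everything else" partition from the fix described above.
            self.overflow.append(message)
            return
        raise NoPartitionError(f"no partition for {when}")


# Four monthly partitions, as in the story; the dates are made up.
bounds = [date(2007, 2, 1), date(2007, 3, 1), date(2007, 4, 1), date(2007, 5, 1)]

log = PartitionedLog(bounds)
log.insert(date(2007, 4, 30), "fine")  # lands in the fourth partition
try:
    log.insert(date(2007, 5, 1), "too late")  # first day past the last bound
except NoPartitionError as err:
    print("insert failed:", err)

fixed = PartitionedLog(bounds, catch_all=True)
fixed.insert(date(2007, 5, 1), "kept")  # goes to the catch-all instead
```

Nothing in the code changes between the last successful insert and the first failed one; only the date does.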

Now that some problems with debugging have been discussed, along with one powerful heuristic for locating bugs quickly, tomorrow I'll begin discussing the actual process of debugging itself, from bug report to deployed fix.

Tomorrow: The Bug Report

Yesterday's post covered two big problems in debugging, mostly having to do with theories. There are two other big problems with theories that you should be on the lookout for: improbability and missing walls. Scientists, when attempting to explain a phenomenon, use an interesting criterion, parsimony. Essentially it says, if you have two theories and one is simpler, it should be preferred. Parsimony is ultimately about probability, since nature seems to prefer simple solutions. Computer problems don't always wind up having the most simple explanations, since they are built by humans and not the Flying Spaghetti Monster, but they do often wind up having the most probable explanation.

When trying to debug a problem, you need to ask yourself, "is the theory that I am proposing the most likely thing that could be going wrong here?" Novice debuggers tend to forget about what's probable in their hope that they will have an opportunity to find and fix that killer bug that they can brag about in a war story. However, you will get traction more quickly if you look at the most probable causes first.

The last issue that I'd like to cover is that of theories that are missing a wall. Often, a developer will approach me and say, "I've narrowed it down to cause X". I'll look over the data they've collected and talk through the issues. Along the way, I notice something interesting: while the cause they describe matches the data, it occurs to me that there is at least one other probable and plausible theory. Essentially, they haven't ruled enough out and so I tell them that their theory is missing a wall. There is clearly a piece of data that could be collected that would quickly distinguish between these alternatives. Noticing a missing wall is a learned skill, but it's one reason why it's always nice to have a trusted coworker whom you can run theories by. It's always surprising when someone else points out an obvious alternative explanation.
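One rough way to spot a missing wall is to write down what each surviving theory predicts and look for the observations on which they disagree. The following is a small sketch of that idea; the two theories and the checks attached to them are hypothetical examples, not data from a real investigation.

```python
def distinguishing_checks(predictions):
    """Return the checks on which the candidate theories disagree.

    `predictions` maps a theory name to a dict of {check: expected result}.
    A check is worth running only if at least two theories predict
    different results for it (a theory that says nothing about a check
    counts as a different prediction).
    """
    checks = set()
    for expected in predictions.values():
        checks.update(expected)
    return sorted(
        check
        for check in checks
        if len({expected.get(check) for expected in predictions.values()}) > 1
    )


# Hypothetical theories for a slow application, with what each predicts.
theories = {
    "faulty network card": {
        "app is slow": True,
        "other apps on machine are slow": True,
    },
    "bad query in latest release": {
        "app is slow": True,
        "other apps on machine are slow": False,
    },
}

print(distinguishing_checks(theories))
```

Both theories explain the slow application equally well, so that observation builds no wall; only the check they disagree on is worth collecting next.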

Tomorrow: Look at the Changes

The common wisdom about debugging suggests that you need to develop a theory of the problem and collect data around that theory. If the data does not match it, you need to revise your theory and try again until all the data fits. This is great in theory, but in practice it goes wrong in two ways.

The first comes from the world of psychology and sociology. People have a weakness in their reasoning that stems from a tendency that normally serves us well: it is difficult to have one's mind changed. If we could be convinced of things too easily, we would fall prey to a number of schemes or simply always be carried along at the whim of whatever ideas we happened to encounter (not that we don't do this too, but that's a different problem). The real problem arises when we hold tightly to our beliefs in the face of overwhelming contrary evidence. Psychologists use the term cognitive dissonance to describe the feeling of our internal beliefs not matching observed data. It's that pit-of-the-stomach feeling that says "oh no, how can it be that I was wrong this whole time?". Rather than experience an unnecessary amount of cognitive dissonance, we instead slightly or even greatly reframe the incoming facts so that we don't have to either change our minds or experience cognitive dissonance.

Confirmation Bias is the technical term to describe that reframing of facts: we are biased to see things in a way that confirms our existing beliefs. Normally it is used to describe problems of social interaction, for example, if you believe that a certain employee where you work is lazy, you will tend to interpret the things they say and do in a way that supports that notion, even after significant counterexamples. If they complete a large, complex task, even spending long hours and late nights on it, you might be inclined to dismiss it by saying "oh, he/she was working with so-and-so who is such a hard worker, otherwise he/she would have never gotten it done", confirming your previous view rather than reevaluating your opinion. It's incredibly insidious and something we should all be on the lookout for.

In the world of debugging, confirmation bias leads the scientific method astray because we are more likely to cling to our theory by reframing the evidence as support rather than throw out the theory and start again. It doesn't help if one starts by just collecting data and then trying to fit a theory. As soon as a theory is presented, we start fitting data to it. I've wasted so much debugging time coming up with more and more elaborate explanations to explain away data rather than give up and start looking around for a new and better theory. When you are spending more time trying to fit facts into your theory than generating new data, take a hard look at that theory and see if you are really just succumbing to confirmation bias.

The other big mistake that I see made all the time is the implausible theory. I will step into a debugging situation and say, "tell me what you think the problem is", and the person will lay out a very clear theory that matches all their observed data. The problem is that there is other data readily available that directly contradicts the theory but which they are not including in their analysis. For example, you collect a lot of data about a sudden network performance issue and come to the conclusion that a faulty network card is to blame. You have a lot of nice graphs showing the performance before and after a certain date, and showing how the application runs as expected on another machine. Seems like a decent analysis, until I point out that none of the other applications on the machine have shown any performance change. That information was clearly available for the analysis and quickly eliminates a simple network card issue (although you can't rule out some more complex interaction), but with a kind of blinders on, it gets missed. The result is that you often spend a lot of time putting in a fix and then testing something that simply cannot have any impact so it's a huge time sink. In this example, you might have requisitioned and installed a new network card only to discover absolutely no change, and that is incredibly frustrating not to mention bad for your reputation.

It often happens in desperation, when we just want some theory that actually fits our data and we either willfully or subconsciously ignore key facts. Confirmation bias plays a big role too, as we cling to our implausible theory in the face of contradictory evidence. When you take confirmation bias and implausibility to the extreme, you get what I call the "One-Track Mind". These are the computer people who believe that all computer badness comes from some particular thing, whether it's Windows, Java, databases, video cards, etc. They had some bad experiences a long time ago with whatever it was, and from now on, they will perceive any innocent or even positive operation of their hated target as buggy or incorrect (the confirmation part), and they will immediately blame it when something goes wrong despite tangential or non-existent evidence (the implausibility part). I would stay clear of these people as much as possible and never ask them to help you debug something.

Tomorrow: Improbable Theories & Missing Walls

Operational Distance is the gap between you and the power structure that has control over the system you are trying to debug. Sometimes, that gap is essentially zero, as when you are the administrator and arbiter of the system. Other times, it is a gulf requiring you to navigate dozens of people on the way to obtaining permission to even view a log file. Operational Distance is often overlooked in the pre-production stages because at that point it doesn't really exist. However, once the system is fielded and users are either reliant on it, or are simply naturally and somewhat rightfully resistant to change, it's too late. Without the right structures and agreements in place beforehand, it can be difficult or impossible to debug a problem because of all the organizational roadblocks in the way, not to mention that even if you fixed it, you wouldn't be allowed near the system to actually install new code for months.

Operational Distance can be overcome with careful design and lots of communication between you and your customer, but you have to be explicit about what you will need to be able to do after the system is deployed.

I have covered the five basic types of distance: Mental, Physical, Social, Temporal and Operational. I would now like to transition to a discussion of some general concepts and ideas about debugging that I think are undercovered or underemphasized in the current literature. The first series of posts will relate to common, avoidable problems that many debuggers (the people, not the software) run into. The second series of posts will cover some uncommon general techniques and skills that are relevant to a wide variety of debugging situations.

Tomorrow: The Big Two: Confirmation Bias & Implausibility
