Distance Debugging Logo

So you have a bug report with some information. It may not be complete and you may not understand all of it. However, chances are, when you read it, an idea of where the problem lies will immediately jump into your head. There is an excellent (and very popular) book by Malcolm Gladwell called Blink. While it covers a lot of ground, the primary focus is on how the brain processes things subconsciously, which in some cases is good and productive, and in others is bad and even dangerous. The problem is knowing when to trust that instantaneous gut reaction. In the case of the bug report, that initial gut reaction is often invaluable, but it can also lead you astray. I follow a simple procedure to try to weed out misleading reactions:

  1. Make a note of what the initial thought about the bug is ("it sounds like an SQL error").
  2. Run that idea through a mini-gauntlet of reasons to throw it out (these will probably sound familiar):
    • Is this a plausible theory?
    • Is this a probable theory?
    • Am I just trying to confirm something I already believe, like blaming a faulty or poorly understood component?
  3. If it fails any of these tests, keep the theory around in the tracker (more on the bug tracker later), but start with more data collection.
  4. If it passes the tests, begin with that theory as a going assumption and look at the code for an obvious mistake (if available) or create test cases that would match the theory, rather than collect more data.
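
The procedure above can be sketched as a small decision function. This is only an illustration of the four steps: the `Theory` class, its field names, and the wording of the returned strings are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    """An initial gut reaction to a bug report (hypothetical structure)."""
    description: str
    plausible: bool            # does it survive the data I already have?
    probable: bool             # is it among the likeliest causes?
    confirms_old_belief: bool  # am I just blaming a component I distrust?


def next_step(theory: Theory) -> str:
    """Decide where to start, following the procedure in the list above."""
    passes = theory.plausible and theory.probable and not theory.confirms_old_belief
    if passes:
        # Step 4: run with the theory instead of collecting more data.
        return "check the code or build test cases for: " + theory.description
    # Step 3: the theory is kept in the tracker, not thrown away.
    return "log in tracker, then collect more data: " + theory.description


gut = Theory("it sounds like an SQL error", plausible=True, probable=True,
             confirms_old_belief=False)
print(next_step(gut))
```

The point of writing it down this way is that a failed gut reaction still gets recorded, so it is there to come back to later.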

It is surprising how often this process allows me to bypass an extended debugging session. It is also surprising how often I've talked myself out of that initial idea and wasted a lot of time before coming back around to it.

Tomorrow: When the Blink Fails

The bug report is your initial contact with a bug and it often heavily influences the way that you approach your investigation. What does a standard, high-quality bug report contain?

  1. Clear statement of what is wrong
  2. Steps to reproduce
  3. Specific version information about relevant software (and possibly hardware).
  4. Severity of the Problem

Having that information is a great start, but it's not always available, especially items 1 and 2. Often instead of a clear statement of the problem, you get a long description of current functionality like, "It prints a biweekly report", or "I get an error when I do (known error-causing action)". The person making the report really wants it to do something else, but there is no way of knowing what from the report itself.

Instead of steps to reproduce you will get a general statement of what they were doing when it happened: "I was working on my PDQ report when I double-clicked the icon that brings up the report query interface, but it gave me an error and quit." The problem is, it can be hard to understand what they were doing because they don't use the same terms that you would. In this example, you may not have any idea what the "icon that brings up the report query interface" is. Remember that a bug is in the eye of the beholder and will be framed within the user's perception of the system. In these cases, you will need to get a lot more information to determine what is actually going wrong.
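
One way to make the gaps visible is to treat the four items as fields and list whichever ones a report leaves empty. This is a minimal sketch, not a real tracker schema: the `BugReport` class, its field names, and the sample severity value are assumptions made up for the example.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class BugReport:
    """The four items of a high-quality report (hypothetical field names)."""
    what_is_wrong: Optional[str] = None
    steps_to_reproduce: Optional[str] = None
    version_info: Optional[str] = None
    severity: Optional[str] = None


def missing_fields(report: BugReport) -> List[str]:
    """Return the names of the items the reporter did not supply."""
    return [f.name for f in fields(report) if not getattr(report, f.name)]


# A report that only says how bad things are, and nothing else.
report = BugReport(severity="high")
print(missing_fields(report))
```

Whatever comes back from `missing_fields` is the list of questions to take back to the person who filed the report.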

Tomorrow: Read the Bug Report and Blink

In the NaBloPoMo review for 'G', Rashenbo mentioned the following:

I liked [blog] and wanted to comment... but Wordpress was making me register and I get tired of registering and entering captcha codes to make a comment... so I just scooted along!

That gave me something of an epiphany. I have only had one comment on this blog since its inception. I know it isn't the most interesting of topics or aimed at the widest of audiences, but I really would like feedback on the ideas. I didn't think that my fear of comment spam was restricting things, but I should have thought about it. I also hate registering with some site I don't really know, for them to do FSM-knows-what with, so why am I forcing people to do the same? I just changed the policy and now anyone can comment, without having to register. I'll rely on the magic of Akismet to stop comment spam for me. Enjoy the new and improved Shouting Distance Blog, Now With Less Tyranny(TM).

The one thing that I stress over and over again when working with people fixing bugs is to look at the changes. I base this on a pretty simple fact: if it was working fine before, then it's probably working fine now. We like to anthropomorphize computers and imagine that a little homunculus is inside pulling levers and pushing electrons around, and that like humans, this little guy might zig when he should have zagged. Occasionally, computers suddenly fail in spectacularly bizarre ways, but the vast majority of the time, a human has changed something and that's why your program is suddenly failing.

What does this mean for your daily debugging practice? There are several ramifications:

  • When searching for the most probable explanations, a good heuristic is to include at least one thing that has changed in every explanation. An explanation that relies on an unchanged thing failing or behaving differently than previously known is extremely improbable.
  • Keeping careful track of things that have changed is of absolutely critical importance. This means not only within your own code, but across all aspects of the system. This is where distance can often cause the most problems.
  • When gathering data about a bug where it is not readily clear what has changed, make establishing that fact your first goal.
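
The first heuristic in the list can be sketched as a simple sort: explanations that touch at least one changed thing go to the front. The dictionary layout and the sample theories below are invented for illustration only.

```python
def rank_explanations(explanations, changed):
    """Put explanations that involve at least one changed thing first."""
    def involves_change(explanation):
        return bool(set(explanation["depends_on"]) & changed)

    # sorted() is stable and False sorts before True, so explanations
    # that involve a change keep their relative order at the front.
    return sorted(explanations, key=lambda e: not involves_change(e))


# Hypothetical data: what is known to have changed, and two candidate theories.
changed = {"database driver", "user policy"}
explanations = [
    {"theory": "network card is faulty", "depends_on": ["network card"]},
    {"theory": "new driver mishandles long strings", "depends_on": ["database driver"]},
]

ranked = rank_explanations(explanations, changed)
print(ranked[0]["theory"])
```

If the `changed` set is empty, nothing gets promoted, which is a hint that establishing what changed should be the first goal.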

When looking for changes you will likely first check for software and hardware changes, and that is a reasonable place to start. However, there are two commonly overlooked sources of change: user behavior, and the passage of time. User behavior may change for several reasons. It could be that there has been a customer policy change, or it just may be due to changes in their business environment. For example, let's say that your system has a comment field where users rarely enter information. Suddenly, the powers-that-be mandate that every record modification must be tagged with the name and date of the person doing the modification, so users begin tacking on their names and dates in the comment field, to them, a logical place. Suddenly people are unable to save records. You might be totally stumped until you realize that you limited the comment field to 100 characters, and that limit was quickly exceeded after several user edits. This policy change had the indirect effect of creating a "bug" where there previously was none.
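
Here is a rough simulation of that comment-field example. Only the 100-character limit comes from the story; the function names, the tag format, and the editor name and date are made up for the sketch.

```python
MAX_COMMENT_LENGTH = 100  # the limit from the example above


def save_comment(comment: str) -> str:
    """Refuse to save a comment that exceeds the field limit."""
    if len(comment) > MAX_COMMENT_LENGTH:
        raise ValueError(
            f"comment is {len(comment)} characters; limit is {MAX_COMMENT_LENGTH}"
        )
    return comment


def tag_edit(comment: str, name: str, date: str) -> str:
    """Append the editor's name and date, as the new policy requires."""
    return f"{comment} [{name} {date}]".strip()


comment = ""
saved = 0
for _ in range(10):
    try:
        comment = save_comment(tag_edit(comment, "Alice Smith", "2006-11-01"))
        saved += 1
    except ValueError as err:
        print(f"save failed after {saved} edits: {err}")
        break
```

Nothing in the code changed between the last successful save and the first failure. The only thing that changed was how the users were using the field.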

In terms of business environment changes, a classic bug problem is that of using a numerical field (instead of a character field) for zip codes. The bug appears when zip codes that begin with '0' suddenly start getting truncated because the leading zero is meaningless in a numerical field. An interesting complication of this problem would be a deployed system for a company that has done business in nothing but California for several years and so the problem goes unnoticed. One day, they get a new client in Massachusetts and the system can't print invoices correctly for that new client. You might wonder why the system is "suddenly" failing, but looking at the change in clientele would give you the hint you need.
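
The zip code problem is easy to reproduce. In this sketch the two storage functions stand in for a numerical column and a character column; the function names and the sample zip codes are just illustrations.

```python
def store_as_number(zip_code: str) -> int:
    """Stand-in for a numerical column: the leading zero is lost."""
    return int(zip_code)


def store_as_text(zip_code: str) -> str:
    """Stand-in for a character column: the value is kept as typed."""
    return zip_code


california = "94105"
massachusetts = "02134"

print(store_as_number(california))     # 94105
print(store_as_number(massachusetts))  # 2134
print(store_as_text(massachusetts))    # 02134
```

The California value survives the round trip, which is exactly why the problem can go unnoticed for years.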

The passage of time can be much trickier to notice and catch. The passage of time often means that hidden assumptions are exposed and limits are exceeded after a set period of time, and we fail to realize the significance of that time period. The classic example of this type of bug is the much-discussed Y2K bug, which wasn't a bug until enough time had passed. This type of problem crops up more often than you would think. I was working on a system that after exactly 4 months (I now know) suddenly completely failed. The cause turned out to be the logging system I designed using Oracle's partitioning feature, where it can divide up the database data based on a criterion such as the value of a date column. This allows for easier maintenance in terms of log rolling because you can drop individual partitions without having to take the entire table offline. It turns out that I had made a mistake in designing the table though, and it only had 4 partitions, each holding a month's worth of data. I had assumed that we would institute a policy of rolling logs within that 4-month window, thereby avoiding the issue, but in the fog of those early months after a new deployment, it was totally forgotten. Of course, once that magic date was exceeded, Oracle threw up its hands saying "I have no place to put this data!" and that brought the whole system to a screeching halt. While the fix of putting in an additional "everything else" partition was no trouble, it certainly was not my finest hour.
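
To make the failure concrete, here is a toy model of a table split into date ranges. It is plain Python, not Oracle's actual syntax or behavior, and the class name, the dates, and the error message are all assumptions chosen for the illustration.

```python
from datetime import date


class PartitionedLog:
    """Toy model of a date-range-partitioned log table (not Oracle's API)."""

    def __init__(self, upper_bounds, catch_all=False):
        # Each partition holds rows dated strictly before its upper bound.
        self.upper_bounds = sorted(upper_bounds)
        self.catch_all = catch_all
        self.rows = {}

    def insert(self, day, message):
        for bound in self.upper_bounds:
            if day < bound:
                self.rows.setdefault(bound, []).append(message)
                return
        if self.catch_all:
            # The "everything else" partition takes whatever is left over.
            self.rows.setdefault("everything else", []).append(message)
            return
        raise ValueError(f"no partition can hold a row dated {day}")


# Four monthly partitions, with hypothetical dates.
monthly_bounds = [date(2006, 2, 1), date(2006, 3, 1),
                  date(2006, 4, 1), date(2006, 5, 1)]

log = PartitionedLog(monthly_bounds)
log.insert(date(2006, 4, 30), "last day that works")
try:
    log.insert(date(2006, 5, 1), "first day of month five")
except ValueError as err:
    print("insert failed:", err)

fixed = PartitionedLog(monthly_bounds, catch_all=True)
fixed.insert(date(2006, 5, 1), "lands in the catch-all partition")
```

Every insert works until the calendar crosses the last boundary, and no code or configuration change is involved.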

Now that some problems with debugging have been discussed, along with one powerful heuristic for locating bugs quickly, tomorrow I'll begin discussing the actual process of debugging itself, from bug report to deployed fix.

Tomorrow: The Bug Report

I couldn't keep up with the hectic two-posts-a-day schedule, so I'm only doing one today, and since my brain is burned out, it's going to be an off-topic one. I'd like to turn to the subject of memes for a second. Apparently, in the world of bloggers, the term meme has come to mean a formulaic type of post where you answer a bunch of funny personal questions about yourself. I find this a bit annoying because the word meme, as originally defined, describes something interesting and useful, even if it isn't the most scientific thing in the world. The things you see on the internet are templates, or maybe ideas, but I find it hard to see them as memes. For one thing, they aren't "competing" with anything else in terms of mindspace. There is room for them all, so they lose the analogy with genes. I might agree that the concept of "a list of items as a way to fill blog space and learn interesting things about yourself and others" is a meme, but any particular implementation of that idea seems like it would not be a meme in and of itself, based on the original definition that Dawkins gave: (from Wikipedia) a unit of cultural information transferable from one mind to another...Examples of memes are tunes, catch-phrases, clothes fashions, ways of making pots or of building arches.

Anyway, I've mostly given up on this since things mean what we decide they mean. I guess if you can't beat 'em...

5 things I've never eaten for breakfast

  1. Tires
  2. Acorns
  3. Gravity
  4. Shark
  5. Creosote

Yesterday's post covered two big problems in debugging, mostly having to do with theories. There are two other big problems with theories that you should be on the lookout for: improbability and missing walls. Scientists, when attempting to explain a phenomenon, use an interesting criterion, parsimony. Essentially it says, if you have two theories and one is simpler, it should be preferred. Parsimony is ultimately about probability, since nature seems to prefer simple solutions. Computer problems don't always wind up having the most simple explanations, since they are built by humans and not the Flying Spaghetti Monster, but they do often wind up having the most probable explanation.

When trying to debug a problem, you need to ask yourself, "is the theory that I am proposing the most likely thing that could be going wrong here?" Novice debuggers tend to forget about what's probable in their hope that they will have an opportunity to find and fix that killer bug that they can brag about in a war story. However, you will get traction more quickly if you look at the most probable causes first.

The last issue that I'd like to cover is that of theories that are missing a wall. Often, a developer will approach me and say, "I've narrowed it down to cause X". I'll look over the data they've collected and talk through the issues. Along the way, I notice something interesting: while the cause they describe matches the data, it occurs to me that there is at least one other probable and plausible theory. Essentially, they haven't ruled enough out and so I tell them that their theory is missing a wall. There is clearly a piece of data that could be collected that would quickly distinguish between these alternatives. Noticing a missing wall is a learned skill, but it's one reason why it's always nice to have a trusted coworker whom you can run theories by. It's always surprising when someone else points out an obvious alternative explanation.
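
Finding the missing wall amounts to looking for a check on which the competing theories disagree. Here is a minimal sketch of that idea; the theory names, the checks, and the predicted outcomes are hypothetical.

```python
def distinguishing_checks(theories):
    """Return the checks whose predicted outcome differs between theories.

    `theories` maps a theory name to a dict of {check: predicted outcome}.
    A check that a theory makes no prediction about counts as a difference.
    """
    all_checks = set()
    for predictions in theories.values():
        all_checks |= predictions.keys()
    return sorted(
        check for check in all_checks
        if len({predictions.get(check) for predictions in theories.values()}) > 1
    )


theories = {
    "cause X": {"fails on a second machine": True, "fails after restart": True},
    "alternative": {"fails on a second machine": False, "fails after restart": True},
}
print(distinguishing_checks(theories))
```

Checks that every theory predicts the same way add nothing, however much data they generate. The one where the theories disagree is the wall.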

Tomorrow: Look at the Changes

Wired ran a feature in this month's magazine where they asked a group of writers and designers to write "Really Short Stories", 6 words or fewer. What I found interesting is that the best writers (in my opinion) are also the best really short story writers. For example, they open with Hemingway's original (not part of the writers, but the inspiration):

"For sale: baby shoes, never worn."

In general though, they range from the syntactically clever:

Steve ignores editor's word limit and
- Steven Meretzky

To the punny, self-referential:

Machine. Unexpectedly, I’d invented a time
- Alan Moore

To the unnerving:

The baby’s blood type? Human, mostly.
- Orson Scott Card

I thought I'd try my hand at it:

I crossed; their blinker was misleading

"Tim?" "Yeah?" "Don't you have twins?"

This just in I hate colons

The nachos ended my culinary career

If you'd like to join in the fun, add your own in the comments...

The common wisdom about debugging suggests that you need to develop a theory of the problem and collect data around that theory. If the data does not match it, you need to revise your theory and try again until all the data fits. This is great in theory, but in practice it goes wrong in two ways.

The first comes from the world of psychology and sociology. People have a weakness in their reasoning that stems from a tendency that normally serves us well: it is difficult to have one's mind changed. If we could be convinced of things too easily, we would fall prey to a number of schemes or simply always be carried along at the whim of whatever ideas we happened to encounter (not that we don't do this too, but that's a different problem). The real problem arises when we hold tightly to our beliefs in the face of overwhelming contrary evidence. Psychologists use the term cognitive dissonance to describe the feeling of our internal beliefs not matching observed data. It's that pit-of-the-stomach feeling that says "oh no, how can it be that I was wrong this whole time?". Rather than experience an unnecessary amount of cognitive dissonance, we instead slightly or even greatly reframe the incoming facts so that we have to neither change our minds nor experience cognitive dissonance.

Confirmation Bias is the technical term to describe that reframing of facts: we are biased to see things in a way that confirms our existing beliefs. Normally it is used to describe problems of social interaction, for example, if you believe that a certain employee where you work is lazy, you will tend to interpret the things they say and do in a way that supports that notion, even after significant counterexamples. If they complete a large, complex task, even spending long hours and late nights on it, you might be inclined to dismiss it by saying "oh, he/she was working with so-and-so who is such a hard worker, otherwise he/she would have never gotten it done", confirming your previous view rather than reevaluating your opinion. It's incredibly insidious and something we should all be on the lookout for.

In the world of debugging, confirmation bias leads the scientific method astray because we are more likely to cling to our theory by reframing the evidence as support rather than throw out the theory and start again. It doesn't help if one starts by just collecting data and then trying to fit a theory. As soon as a theory is presented, we start fitting data to it. I've wasted so much debugging time coming up with more and more elaborate explanations to explain away data rather than give up and start looking around for a new and better theory. When you are spending more time trying to fit facts into your theory than generating new data, take a hard look at that theory and see if you are really just succumbing to confirmation bias.

The other big mistake that I see made all the time is the implausible theory. I will step into a debugging situation and say, "tell me what you think the problem is", and the person will lay out a very clear theory that matches all their observed data. The problem is that there is other data readily available that directly contradicts the theory but which they are not including in their analysis. For example, you collect a lot of data about a sudden network performance issue and come to the conclusion that a faulty network card is to blame. You have a lot of nice graphs showing the performance before and after a certain date, and showing how the application runs as expected on another machine. Seems like a decent analysis, until I point out that none of the other applications on the machine have shown any performance change. That information was clearly available for the analysis and quickly eliminates a simple network card issue (although you can't rule out some more complex interaction), but with a kind of blinders on, it gets missed. The result is that you often spend a lot of time putting in a fix and then testing something that simply cannot have any impact so it's a huge time sink. In this example, you might have requisitioned and installed a new network card only to discover absolutely no change, and that is incredibly frustrating not to mention bad for your reputation.
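
A plausibility check can be as simple as comparing what the theory predicts against every observation on hand, not just the ones collected to support it. The sketch below uses the network card example; the function name and the wording of the observations are invented for the illustration.

```python
def contradictions(predictions, observations):
    """Return the observations that disagree with what the theory predicts."""
    return {
        name: observed
        for name, observed in observations.items()
        if name in predictions and predictions[name] != observed
    }


# What a faulty network card would lead us to expect.
faulty_card_predicts = {
    "this application slowed down": True,
    "application runs as expected on another machine": True,
    "other applications on the machine slowed down": True,
}

# Everything that is actually known, including the fact left out of the analysis.
observed = {
    "this application slowed down": True,
    "application runs as expected on another machine": True,
    "other applications on the machine slowed down": False,
}

print(contradictions(faulty_card_predicts, observed))
```

A single entry in the result is enough to stop and rethink before requisitioning new hardware.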

It often happens in desperation, when we just want some theory that actually fits our data and we either willfully or subconsciously ignore key facts. Confirmation bias plays a big role too, as we cling to our implausible theory in the face of contradictory evidence. When you take confirmation bias and implausibility to the extreme, you get what I call the "One-Track Mind". These are the computer people who believe that all computer badness comes from some particular thing, whether it's Windows, Java, databases, video cards, etc. They had some bad experiences a long time ago with whatever it was, and from now on, they will perceive any innocent or even positive operation of their hated target as buggy or incorrect (the confirmation part), and they will immediately blame it when something goes wrong despite tangential or non-existent evidence (the implausibility part). I would stay clear of these people as much as possible and never ask them to help you debug something.

Tomorrow: Improbable Theories & Missing Walls

So it looks like I am getting a fair number of people coming around from the NaBloPoMo Randomizer. Having viewed many of the other blogs, it is clear that I am sort of out of the mainstream here with my geeky technobabble. So I thought I might write something less boring or more useful to the blogging community (or hopefully both) in a first post, and continue my other topic for those who are interested in a second post.

Today my topic is sitemeter. Sitemeter is how I know people are using the randomizer to get to me and not that I suddenly have a readership. I recommend it. It's free and it answers useful questions like, "does anyone care what I have to say?" by showing you who visits (just by generic info, so I don't really know who you are) and how long they stay around. For example, I know that this morning, I've gotten 4 randomizer hits, but no one chose to click beyond the first page. Maybe I need more alluring titles for old posts.

Sitemeter has a nice feature where you can block your own browser or your own IP address so that you don't show up in your own statistics. I stumbled across a funny sort-of bug (you know I couldn't stay off the topic that long) in the way they block you. I have it set up to block my IP address since I work out of the house and am at home 95% of the time. One morning I was actually working out of a coffee shop and I pulled up the page without thinking about the fact that I would show up. No big deal, I could ignore that 1 random hit from myself.

The funny part was after I got back home. I check in every so often during the day to see if anyone has commented (they haven't). On the third or fourth visit of the day I pulled up my stats. I was astounded! Someone was visiting the blog like every hour! They must be desperately visiting the blog in a vain hope that I might have posted something that they can read. I must quickly satisfy their hunger! So I posted, and checked back an hour or so later to see if they had visited again.

Looking at the stats I see yes, they had been back! Then I noticed something odd. The mystery visitor's referral page was one of the administrative pages on my blog, something that only I have access to. Further investigation revealed that indeed, the visitor could only be me. How was this possible given that I have myself blocked? Well, it turns out that once I had visited in the morning from an unblocked location, sitemeter must have sent me back a cookie. Then, when I visited later from home, rather than re-looking me up to see where I was visiting from, it just associated that same data with my browser so I appeared to be coming from the coffee shop even though I was at home.
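
Here is a sketch of the behavior I am guessing at. I have no idea how sitemeter is actually implemented, so everything in it is an assumption: the function, the cookie name, and the IP addresses (taken from the ranges reserved for documentation) are all made up to show how a cached cookie could defeat an IP block.

```python
HOME_IP = "203.0.113.7"           # made-up address, blocked in the settings
COFFEE_SHOP_IP = "198.51.100.20"  # made-up address, not blocked
BLOCKED_IPS = {HOME_IP}


def record_visit(request_ip, cookies, stats):
    """Count a visit unless the visitor's IP is blocked (suspected logic)."""
    # Suspected bug: a location cached in the cookie wins over the
    # address the current request actually came from.
    visitor_ip = cookies.get("visitor_ip", request_ip)
    cookies["visitor_ip"] = visitor_ip
    if visitor_ip in BLOCKED_IPS:
        return
    stats.append(visitor_ip)


stats = []
browser_cookies = {}

record_visit(COFFEE_SHOP_IP, browser_cookies, stats)  # morning, coffee shop
record_visit(HOME_IP, browser_cookies, stats)         # later, back at home
print(stats)

browser_cookies.clear()                               # delete the cookie
record_visit(HOME_IP, browser_cookies, stats)         # now the block applies
print(stats)
```

In this model the home visit is counted as a second coffee shop visit, and clearing the cookie makes the IP block work again.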

I was disappointed that no one was obsessively checking my blog, but at least got to investigate and solve an interesting problem. To fix it, I just went to sitemeter and had them block my browser, in addition to IP address. I probably could have also just gone in and deleted the cookie. If someone else out there has experienced the same thing and was totally baffled, I hope this helps.

Operational Distance is the gap between you and the power structure that has control over the system you are trying to debug. Sometimes, that gap is essentially zero, as when you are the administrator and arbiter of the system. Other times, it is a gulf requiring you to navigate dozens of people on the way to obtaining permission to even view a log file. Operational Distance is often overlooked in the pre-production stages because at that point it doesn't really exist. However, once the system is fielded and users are either reliant on it, or are simply naturally and somewhat rightfully resistant to change, it's too late. Without the right structures and agreements in place beforehand, it can be difficult or impossible to debug a problem because of all the organizational roadblocks in the way, not to mention that even if you fixed it, you wouldn't be allowed near the system to actually install new code for months.

Operational Distance can be overcome with careful design and lots of communication between you and your customer, but you have to be explicit about what you will need to be able to do after the system is deployed.

I have covered the five basic types of distance: Mental, Physical, Social, Temporal and Operational. I would now like to transition to a discussion of some general concepts and ideas about debugging that I think are undercovered or underemphasized in the current literature. The first series of posts will relate to common problems that many debuggers (the people, not the software) run into that are avoidable. The second series of posts will cover some uncommon general techniques and skills that are relevant to a wide variety of debugging situations.

Tomorrow: The Big Two: Confirmation Bias & Implausibility