
What if you "blink" and come up with nothing? This same question was posed a few weeks ago in a similar post, but this post will focus more on the fundamental reasons that you might fail to produce a theory, along with specific ideas for solving the problem.

There are many reasons that you may be unable to produce a viable theory. Here are three of the primary reasons:

  1. The problem is overconstrained. Given the symptoms you are seeing, there is no theory that you can imagine that would match those conditions.
  2. The problem is underconstrained. There are dozens of theories that would match the conditions and so no one thing jumps out.
  3. The problem itself seems improbable. While you have a theory that explains it, it doesn't explain why it is suddenly happening now.

A problem can seem overconstrained for several reasons:

  • The biggest one is a bad observation. A single data point that is out of line with everything else that you know, or that directly contradicts something you had assumed was true, can throw off your whole investigation. If one piece of data is keeping you from formulating a theory, then try temporarily assuming that the data is bad, or spend some time investigating that one piece in depth in an attempt to really establish or discard it. More often than not, you will discover that it was erroneous, or that you misunderstood its ramifications and it does not in fact contradict your theory.
  • The next reason for overconstraint is subtle, and I like to refer to it as the "sampling" problem. If you take measurements of a time-varying phenomenon at just the right (or really, wrong) intervals, you can perceive something totally incorrectly. In the world of signal analysis, you can have the problem of "aliasing", where a high-frequency waveform appears as a much lower-frequency wave because the sampling rate is too low to capture it. The problem of "aliasing" occurs in debugging data collection as well. Here's a direct example: imagine that you get a call from a system administrator telling you that a server has to be restarted every Tuesday morning. You think to yourself, "What kind of bug happens on the same day every week?" Actually, the system has to be restarted every morning, but that system administrator is generally the first one there on Tuesdays, and she is the only one who has mentioned anything to you. That is an example of an "aliased" piece of data. Anytime you are relying on a scattered set of reports over time, or across a set of different people, look out for sampling problems and make an attempt to fill in those gaps.
  • The final common cause of overconstraint is one or more unquestioned assumptions getting in the way. We often overconstrain ourselves with a thought process in which we say, "Well, we are certain that X is true, and Y is true, so I can't possibly see how the bug would happen." When you look more deeply, though, you can easily imagine a condition under which X would be false. For example, most of the world is still using single-processor machines at this point (although that is rapidly changing). You might get a bug report about a threading condition that you just can't imagine happening on a single-processor machine. However, once you realize that there is no reason that the problem machine in question has to be single-processor (and it's very easy to check), it seems obvious what is going wrong. Learning to notice and question those normally tacit assumptions is a skill that can save significant time and anguish.
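
To make the signal-analysis version of aliasing concrete, here is a minimal Python sketch. It is an illustration rather than anything from the original investigation: the function names `apparent_frequency` and `sample_sine`, and the 9 Hz / 10 Hz figures, are assumptions chosen for the demo. It shows that a 9 Hz sine wave sampled only 10 times per second produces exactly the samples of a (sign-flipped) 1 Hz wave, so an observer limited to those samples would reasonably conclude the signal is slow.

```python
import math


def apparent_frequency(signal_hz: float, sample_hz: float) -> float:
    """Frequency an observer would infer from samples taken at sample_hz.

    The true frequency "folds" toward the nearest multiple of the
    sampling rate; what remains is the aliased frequency.
    """
    nearest_multiple = round(signal_hz / sample_hz) * sample_hz
    return abs(signal_hz - nearest_multiple)


def sample_sine(signal_hz: float, sample_hz: float, count: int) -> list[float]:
    """Take `count` evenly spaced samples of a unit sine wave."""
    return [
        math.sin(2 * math.pi * signal_hz * n / sample_hz)
        for n in range(count)
    ]


# Hypothetical numbers: a fast 9 Hz signal observed only 10 times a second.
fast = sample_sine(9.0, 10.0, 20)
slow = sample_sine(1.0, 10.0, 20)

# The fast signal's samples match a sign-flipped 1 Hz wave point for point.
indistinguishable = all(
    math.isclose(f, -s, abs_tol=1e-9) for f, s in zip(fast, slow)
)
print(apparent_frequency(9.0, 10.0), indistinguishable)
```

The Tuesday-morning report has the same shape: the event happens daily, but the "sampling rate" of the reports is weekly, so the data suggests a weekly bug.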

Underconstraint and improbable bugs will be covered tomorrow.

Tomorrow: The Distance Bug Investigation, Part IV