
Debuggers are hindered by the lack of a language for talking about the stages of attacking a problem. When someone says, "I'm debugging that server crash", is it almost fixed? Do they know what the problem is but are unsure how to fix it? Do they even know what the problem is?

To address this problem, I am proposing the following six stages of debugging a problem:

Instantiation - A bug has been found, but it has not yet been clearly defined. In other words, someone has told you something is wrong, but the nature of the problem is not yet understood. The simple declaration of a bug is enough to get into this stage, and the bug remains instantiated until the verification process has begun.

Verification - After the bug has been instantiated, its existence must be verified. This means giving the bug the prima facie test: does the described behavior, on its face, actually constitute a bug? Many bug reports can be thrown out in this stage because they describe the expected behavior of the system (in which case the bug may be a request for change, or a simple misunderstanding), because they describe problems originating outside the application, or because they are so vague as to be impossible to fix, such as "System was slow". If the bug appears reasonable, the recreation stage is entered. Otherwise, the bug is either rejected outright or sent back to its creator for more information.

Recreation - The next stage is recreating the problem in some inspectable way. Originally, I wanted to call this stage "replication", but I don't want to overload that term. Some bugs don't have a natural "replication" mode, but can be recreated. Take, for instance, "query performance is bad on query X": there is not much to replicate, other than to confirm that the problem exists as stated. However, in most cases, this stage will consist of replicating the stated bug through a series of specific steps.

Isolation - This is the process of filtering out all the stuff that is not wrong, and narrowing down to the point or points of failure. For many bugs, especially those that were easy to replicate, this is where the bulk of the time is spent. When isolation is complete, you should have a very clear understanding of what is wrong, and how to go about fixing it.

Repair - Once the bug has been isolated, one or more fixes must be applied. It may turn out that the isolation was incorrect, and in many cases, a debugging session will bounce back and forth between isolation and repair.

Validation - Finally, once the bug has been repaired, the fix has to be validated. In some instances, this stage will be trivial due to steps taken in the repair or isolation stages, such as when a test case that was used to isolate the problem now passes, or when a page refresh is all that is needed to see the improvement. In other cases, the fix must be tried in an operational setting, to verify that the thing that you fixed is the thing that was actually broken.

To recap:

  1. Instantiation
  2. Verification
  3. Recreation
  4. Isolation
  5. Repair
  6. Validation

So when someone asks you where you are with the server crash, you can now say, "I've verified the problem and am working on recreation", or "I've isolated the problem and I'm working on repair". This lets others better understand how much progress is being made, and improves communication with peers and with management.
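
If you wanted to track these stages in a bug tracker or a script, one way to do it is as a small state machine. The following is only a rough sketch in Python, not part of the proposal itself: the names Stage, ALLOWED, advance and status are made up for illustration, and the only backward move it allows is the bounce from repair back to isolation mentioned above. Rejecting a bug or sending it back during verification is left out to keep the sketch short.

```python
from enum import Enum


class Stage(Enum):
    """The six stages, numbered in the order given in the recap."""
    INSTANTIATION = 1
    VERIFICATION = 2
    RECREATION = 3
    ISOLATION = 4
    REPAIR = 5
    VALIDATION = 6


# Hypothetical transition table: each stage leads to the next one,
# and repair may fall back to isolation if the isolation was wrong.
ALLOWED = {
    Stage.INSTANTIATION: {Stage.VERIFICATION},
    Stage.VERIFICATION: {Stage.RECREATION},
    Stage.RECREATION: {Stage.ISOLATION},
    Stage.ISOLATION: {Stage.REPAIR},
    Stage.REPAIR: {Stage.ISOLATION, Stage.VALIDATION},
    Stage.VALIDATION: set(),
}


def advance(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target


def status(stage):
    """Build a short progress line for a status report."""
    return f"working on {stage.name.lower()} (stage {stage.value} of {len(Stage)})"


if __name__ == "__main__":
    stage = Stage.INSTANTIATION
    stage = advance(stage, Stage.VERIFICATION)
    stage = advance(stage, Stage.RECREATION)
    print(status(stage))
```

Keeping the transitions in a table rather than in code makes it easy to add other moves later, such as a path from verification back to the person who filed the bug.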