Distance Debugging

The final step in the debugging process is putting a fix into place. It's often much harder than it sounds, especially when you don't feel that your going theory really captures the problem. A good fix must clearly arise from that theory, and too often this part gets botched. A debugger will convince themselves that their theory is wrong because their attempted fix didn't appear to do anything, when really the fix didn't address the cause as described by the theory.

Ultimately, fixes are a kind of data collection, albeit a special one. It is the data collection that tells you whether you've ultimately understood the problem or not. When you apply a fix, one of a few things happens:

  1. The problem disappears, and never reappears
  2. The problem disappears, but reappears under an apparently different set of circumstances. In this case, it's possible that your theory and therefore your fix was too circumscribed.
  3. The problem goes away but another problem appears. At this point there is a tough judgement call to be made: is this new problem something that was being covered up by the previous issue and therefore should be treated as a fresh investigation, or is your fix itself faulty, simply causing the problem to morph instead of disappear?
  4. The problem appears to be unaffected. In this case you need to decide whether your theory or your fix is the problem.

Returning to the question of what a good bug tracking system ought to do, it should do a better job of helping you track this information in order to make a decision. If the problem reappears under a different set of circumstances, or morphs into a different problem that you think is similar in nature, you may want to open a new bug at first, but then merge it back into the original bug once you figure out what is really going on. On the other hand, when a bug apparently reappears based on observed symptoms, often the underlying cause is totally different, and being able to break that new information out into a new bug would be ideal, but perhaps with a link based on the symptoms so that if they reappear, you can quickly look at previous theories along with their fixes. That last piece is really the most important thing: bug tracking systems generally don't give you a way to list all the things you've tried that you thought might work along with the outcomes. Many times I've found myself thinking that I had a great idea for a change to the system that wasn't possible at the time, but if a certain fact were no longer true, it would be a much better solution.

In the end, keeping a good record of what you have fixed and how will pay dividends, both in terms of beating recurring problems when you accidentally undo something, and in terms of brainstorming when you are stumped. I'm hoping to incorporate some or all of these ideas into a new bug analysis system to be described in this space shortly.

Each type of data collection comes with its own risks and problems that you need to keep in mind:

  • Probe Execution - While it tends to be the most reliable of the techniques, it also happens to be the one that is least often available to you, and it also tends to give you very little information. It is only available to you if you have all the parts of the system handy, and in the case of something like a debugger, is only really useful if you have source code. The other problem is that while you step through 1000 mind-numbing lines of code looking for a slight divergence, you are likely to give up before you find it. It ends up as too much of a fishing expedition. However, if you have narrowed down the problem to a specific region or a specific problematic operation, a runtime probe can work. The use of automated debugging tools such as the Delta Debugger (see the link in the right side bar) can also make your runtime probes significantly more powerful and time-effective.
  • Human Inquiry - Generally the least reliable technique; just ask any attorney. Humans are biased, and they are easily influenced, so they will often tell you what you want to hear, or will fill in gaps in their knowledge with invented content so as to appear helpful. In many cases, the person who knows is not available, making the technique useless. The key is to make sure that you are talking to the person who would know, and to keep your questions very specific. Asking something like, "Did the error message say anything about a network problem?" is generally more valuable than, "What did the error message say?"
  • Test Case Execution - Very reliable and generally available, test cases are my recommended data collection technique in most instances. The biggest caveat is that your test case may have a bug that hides data from you, so don't ignore counterintuitive results. Also, as a matter of course, write more test cases that you expect to fail than ones you expect to succeed. This might sound silly, but I often see a bug and ask to see what tests were written only to discover that they test the functionality so meekly that they barely test anything at all. Tests that fail mean more work to do, but they will mean less work overall. On a related note, I like to say that a test that you expect to fail but actually succeeds is 10x as valuable as one that you expect to succeed but actually fails. When you think something will fail, but it succeeds, you can be sure that you are not really understanding the problem. When you expect success but get failure, you've just found another thing that fails and that may or may not be new information.
  • Artifact Examination - Artifacts suffer from 2 major issues: their reliability is totally unknown, and they are static. Like test cases, the production of system artifacts can suffer from bugs such as failing to write out important information or writing out incorrect information. A good example of this is inconsistent error logging, a problem that plagues many systems. If you have an error that is actually a cascading effect from a previous error but only the second one gets recorded, you will be missing a huge piece of the puzzle, and the artifact might actually lead you astray. In terms of being static, think of all of the times that you have looked at an error log and thought, "I wish I had written out the value of [important variable]". It's extremely hard to predict what information will be useful at some point in the future and there is no way to go back and fix it. In a few days I will discuss techniques to make your artifacts more reliable.
  • Differential Computation - Suffers from an assortment of issues. The biggest is knowing what to diff. You can't run the calculation on everything; you need to pick things that you think have changed. Next is knowing how to diff. Text files and even most binaries can be easily compared, although you may not always understand the results. Diffing something like hardware configurations is a lot tougher and generally has to be done manually, and that introduces human error. Finally, the problem of spurious or ignorable differences can cloud your findings. Simple things like line breaks will complicate the results of a text diff. Trying to figure out whether a difference in CPU chip speed is important is significantly more difficult.
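To make the point about spurious differences concrete, here is a minimal sketch of a text diff that normalizes line endings and trailing whitespace before comparing. It uses Python's standard `difflib` module; the function names and the sample configuration strings are made up for illustration, and deciding which differences are safe to ignore remains a judgement call for your own system.

```python
import difflib


def normalized_lines(text: str) -> list[str]:
    """Split text into lines, discarding line-ending style and trailing whitespace."""
    return [line.rstrip() for line in text.splitlines()]


def meaningful_diff(old: str, new: str) -> list[str]:
    """Return the unified-diff lines that remain after normalization."""
    return list(
        difflib.unified_diff(
            normalized_lines(old),
            normalized_lines(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


# Hypothetical config files: one saved with Windows line endings, one with Unix.
old_config = "timeout=30\r\nretries=3\r\n"
new_config = "timeout=30\nretries=5\n"

for line in meaningful_diff(old_config, new_config):
    print(line)
```

A naive byte-for-byte comparison would flag every line of these two files as changed. After normalization, only the `retries` line shows up as a difference.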

Tomorrow: Putting it all together, Diagnosing in Action

There are a handful of data collection techniques that you will use again and again. Each of these will help you fill certain knowledge gaps, and each has its own set of pitfalls and gotchas that will waste your time and energy. I touched on these in a general sense yesterday, but here is a more detailed list:

  • Runtime probe - Data collections that involve directly querying or observing the running system. Typical probes include running the system in a debugger for the purpose of observing control flow or the value of variables, or using a more general-purpose query language such as DTrace for Solaris, or system utilities like strace or netstat on Linux.
  • Human inquiry - Data collection that involves asking another person for information. This might be asking a user who discovered a bug for more information about what they experienced or asking a system administrator if they've applied a certain OS patch.
  • Test Case Execution - Data collection that involves crafting a specific test case meant to replicate a bug, or produce a specific error.
  • Artifact Examination - Data collection by looking at some artifact produced by the running (or crashing) system to glean specific information. This includes looking in a log for an error message, or opening a core dump file in a debugger.
  • Differential Computation - Data collection that consists of simply determining the difference between two or more things in order to know first if there is a difference, and second, what the specific difference is. Comparing the set of dependent libraries, as described a few days ago, is a good example. Other common activities include diffing a configuration file, and diffing a set of data inputs to determine why one version of a system is working and another is not.

Tomorrow: The Caveats and Issues with Each Approach

Data collection is a complex and critical piece of bug investigation. Data can come from many sources:

  • The results of specific "experiments" such as the success or failure of a unit test.
  • Observations about the state of the environment or the system made by you, your team, or by end users.
  • Artifacts produced by the running system such as log files or error messages.

One common mistake when collecting data is producing information for the sake of keeping busy, or in the hope that a general fishing expedition will reveal a critical piece of information. It almost never does. Using your theory or theories (or the reason for the lack thereof, as described previously) to choose specific data collections that will isolate the correct one is by far the most critical skill needed for debugging. It is also the part of the scientific method that is hardest to get right, and the success of many experiments hinges on the question of whether the results ultimately support or refute the theory being tested.

Putting that problem aside for a moment, there is a lower hurdle of simply making sure that the result of any data collection at least in conception is supposed to confirm or deny a theory. This means:

  1. Start by determining what pieces of information are needed, and then gear your collections to that information specifically.
  2. For each data collection, have a clear notion of the possible outcomes and whether each outcome supports, refutes, or is irrelevant to each theory.
  3. Watch out for data collections that seem discriminating, but in fact have outcomes that support all or most of your theories. Also watch out for collections whose only interesting outcome would refute all of your theories; assuming that you are a reasonably good theorizer, that outcome is probably extremely unlikely, and so the collection may not be worth doing at all.
  4. If possible, talk through these outcomes and your reasoning with others on your team to help find gaps in your thinking.
  5. If after this analysis you determine that a data collection is not going to provide you with good information, just don't do it. Don't convince yourself that just doing it might provide you with useful information somewhere down the line. 99% of the time you will just have extra data that serves no purpose, and may actually confuse things later on.
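One way to apply steps 2 and 3 is to write the outcome-to-theory mapping down explicitly before running anything. The sketch below is only an illustration of that bookkeeping: the `Verdict` enum, the `is_discriminating` helper, and the example theories and outcomes are all hypothetical, and the rule used (a collection is worth running only if at least one outcome treats your theories differently) is a simplification of the judgement described above.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTS = "supports"
    REFUTES = "refutes"
    IRRELEVANT = "irrelevant"


def is_discriminating(outcome_map: dict[str, dict[str, Verdict]]) -> bool:
    """True if some possible outcome gives different verdicts to different theories.

    outcome_map maps each possible outcome to a {theory name: verdict} table.
    """
    return any(len(set(verdicts.values())) > 1 for verdicts in outcome_map.values())


# Hypothetical collection 1: ask a contact how much disk space is free.
check_disk_space = {
    "disk nearly full": {"log disk full": Verdict.SUPPORTS, "deadlock": Verdict.IRRELEVANT},
    "plenty of space": {"log disk full": Verdict.REFUTES, "deadlock": Verdict.IRRELEVANT},
}

# Hypothetical collection 2: restart the server and see whether it recovers.
restart_server = {
    "recovers": {"log disk full": Verdict.SUPPORTS, "deadlock": Verdict.SUPPORTS},
    "does not recover": {"log disk full": Verdict.REFUTES, "deadlock": Verdict.REFUTES},
}

print(is_discriminating(check_disk_space))
print(is_discriminating(restart_server))
```

The second collection looks like an experiment, but every outcome says the same thing about both theories, so by this rule it cannot help you choose between them.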

I'm stopping this whole "Distance Bug Investigation, Part X" thing since it makes the titles meaningless. I might even go back and change the older posts so that they have reasonable titles as well, so don't be surprised...
Tomorrow: Common Data Collection Techniques, with Caveats

Continuing with the discussion of what to do when you have no theory, the next likely cause is underconstraint, when the problem suggests a large number of possible causes. In this case, your best course of action is to figure out a few data collections that are the ideal combination of distinguishing (i.e. they would support some theories and not others) and low-cost.

Let's say that you are beta-testing a new version of your software, a standalone desktop application. It runs fine in your test rig, but for every user that installs it, it crashes on startup. Since the crash could have many causes, and dozens if not hundreds of things have changed in the newest version, you are totally underconstrained. You might try the following set of data collections to try to narrow things down:

  • Go through the list of dependent libraries for the application and compare the version on the customer machine to the version on the test machine.
  • Attempt an installation on several additional machines in your local environment in the hope that one of them has the same crash, thereby giving you a local machine to use for comparison.
  • Perform an installation of the software yourself on a customer computer, with a user present. Perhaps in your testing you are making a different set of choices or are otherwise installing it in a fashion that differs from their actions.

These three data collections are likely fast, as in they could be performed the same day, and fairly accurate. The first would tell you if the environment might be to blame, the second would give you a leg up on replication, and the third would tell you if installation procedures were to blame. In each case, a meaningful result would give you a significantly more constrained problem, even though you might still not have a theory at that point.
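The first of these collections, comparing dependent library versions between the two machines, can be sketched as a simple comparison of two name-to-version tables. How you gather those tables depends entirely on your platform and is not shown; the function name, library names, and version strings below are invented for illustration.

```python
from typing import Optional


def compare_versions(
    test_machine: dict[str, str],
    customer_machine: dict[str, str],
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {library: (test version, customer version)} for every mismatch.

    A version of None means the library is missing on that machine.
    """
    differences = {}
    for name in sorted(set(test_machine) | set(customer_machine)):
        ours = test_machine.get(name)
        theirs = customer_machine.get(name)
        if ours != theirs:
            differences[name] = (ours, theirs)
    return differences


# Hypothetical inventories collected from each machine.
test_rig = {"libreport": "2.4", "libui": "1.9", "libnet": "3.1"}
customer = {"libreport": "2.4", "libui": "1.7"}

for name, (ours, theirs) in compare_versions(test_rig, customer).items():
    print(f"{name}: test rig has {ours}, customer has {theirs}")
```

A short list of mismatches like this does not give you a theory by itself, but it does shrink the set of things that could be to blame.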

The final common cause of having no theory is the improbable problem. It can be hard to come up with a theory when we believe that the situation being described simply can't happen. Imagine that you get a bug report that a user is receiving two identical email notifications every time a report is generated by your system. You however know that you just put in a piece of code that is explicitly looking for duplicate notifications and throwing them out because of a known previous issue. You will convince yourself that the problem is everywhere but in your code. You will blame the outgoing mail server, the incoming mail server, user error, etc. before coming up with a theory. As described in the post from a couple of weeks ago on this same topic, there is one good thing to start with when you encounter a problem like this, and that is to replicate. The replication will force you to take it seriously. Another good tactic is to get another opinion from someone on your team. They might say, "oh yeah, I can totally see how that would happen even with your check in there", and then you will have something to go on.

For more general suggestions about trying to generate a theory, see Day 14: When the Blink Fails. Now that we have covered theorizing, the next few posts will cover data collection in more detail.

Tomorrow: The Distance Bug Investigation, Part V

What if you "blink" and come up with nothing? This same question was posed a few weeks ago in a similar post, but this post will focus more on the fundamental reasons that you will fail to produce a theory in addition to specific ideas for solving the problem.

There are many reasons that you may be unable to produce a viable theory. Here are three of the primary reasons:

  1. The problem is overconstrained. Given the symptoms you are seeing, there is no theory that you can imagine that would match those conditions.
  2. The problem is underconstrained. There are dozens of theories that would match the conditions and so no one thing jumps out.
  3. The problem itself seems improbable. While you have a theory that explains it, it doesn't explain why it is suddenly happening now.

A problem can seem overconstrained for several reasons:

  • The biggest one is a bad observation. A single data point that is out of line with everything else that you know, or which directly contradicts something that you had assumed was true, can throw off your whole investigation. If one piece of data is keeping you from formulating a theory, then try temporarily assuming that the data you have is bad, or spend some time investigating that one piece in depth in an attempt to really establish or discard it. More often than not, you will discover that it was erroneous, or that you really misunderstood its ramifications and in fact it does not contradict your theory.
  • The next reason for overconstraint is subtle, and I like to refer to it as the "sampling" problem. If you take measurements of a time-varying phenomenon at just the right (or really, wrong) intervals, you can perceive something totally incorrectly. In the world of signal analysis, you can have the problem of "aliasing", where a high-frequency waveform appears as a much lower frequency wave because of the interference of the sampling rate. The problem of "aliasing" occurs in debugging data collection as well. Here's a direct example: imagine that you get a call from a system administrator telling you that a server has to be restarted every Tuesday morning. You think to yourself, "what kind of bug happens on the same day every week?" Actually, the system has to be restarted every morning, but that system administrator is generally the first one there on Tuesdays, and she is the only one who has mentioned anything to you. That is an example of an "aliased" piece of data. Anytime you are relying on a scattered set of reports over time, or across a set of different people, look out for sampling problems and make an attempt to fill in those gaps.
  • The final common cause of overconstraint is one or more unquestioned assumptions getting in the way. We often overconstrain ourselves with our thought process in which we say, "Well, we are certain that X is true, and Y is true, so I can't possibly see how the bug would happen." When you look more deeply though, you can easily imagine a condition under which X would be false. For example, most of the world is still using a single-processor machine at this point (although that is rapidly changing). You might get a bug report about a threading condition that you just can't imagine happening on a single-processor machine. However, once you realize that there is no reason that the problem machine in question has to be single-processor (and it's very easy to check), it seems obvious what is going wrong. Learning to notice and question those normally tacit assumptions is a skill that can save significant time and anguish.
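The Tuesday restart example can be simulated in a few lines to show how a sparse set of reports distorts the picture. This is purely an illustration: the start date is arbitrary, and the assumption that only one administrator ever reports, and only on Tuesdays, is taken straight from the story above.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def weekdays_seen(days: list[date]) -> list[str]:
    """Return the distinct weekday names present in a list of dates, in week order."""
    present = {d.weekday() for d in days}
    return [name for index, name in enumerate(WEEKDAYS) if index in present]


start = date(2024, 1, 1)  # an arbitrary Monday
four_weeks = [start + timedelta(days=i) for i in range(28)]

# Ground truth: the server needs a restart every single morning.
actual_failures = list(four_weeks)

# What reaches you: reports from the one administrator who is first in on Tuesdays.
reported_failures = [d for d in actual_failures if d.weekday() == 1]

print("Days with reports: ", weekdays_seen(reported_failures))
print("Days with failures:", weekdays_seen(actual_failures))
```

The reports suggest a weekly bug, while the underlying data shows a daily one. Asking who else restarts the server, and when, is the debugging equivalent of raising the sampling rate.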

Underconstraint and improbable bugs will be covered tomorrow.

Tomorrow: The Distance Bug Investigation, Part IV

This post will start from the assumption that you have at least one theory to work with. Tomorrow's post will cover developing a theory when you have none. There are a few questions you need to ask yourself about the theory in order to determine how to proceed:

  • Is the theory testable, given the current distance constraints?
  • If not, could the distance issue be overcome by one of your trusted contacts?
  • If not, could the theory be testable in some simulated environment with some likelihood of success?

The problem is that while you may have an excellent theory, it might be quite difficult to determine whether or not it's true, and you may want to investigate some less probable theories that are quickly testable before looking at the most probable one.

Here is an example: you have built and deployed a client-server system for a customer. You receive a phone call on Friday morning that the server mysteriously stopped responding to client requests on Wednesday. The local administrator simply brought the server application down and back up again, and the server appeared to go back to normal. However, the server was displaying the same symptoms again that morning, and they decided to call you. Upon the problem being described, you have a few ideas that immediately jump to mind, in order of estimated probability:

  1. The system is designed to write a log file to disk, and rolls to a new log every 100MB. It is possible that the disk has filled up, and when the server runs out of log space it gets wedged in that state. The reboot causes it to clear the most recent log and start again, but when the disk fills up again, it has the same problem. This would explain the long lead-up to the first problem, but the rapid reoccurrence.
  2. Perhaps a deadlock is occurring because a greater number of users are using the system and a section of code that was not properly protected is now causing a problem. While it is unlikely to occur, it will become increasingly likely as the load on the server increases. This also would match a condition with a long lead-up, and then a relatively rapid reoccurrence.
  3. The disk that the server is running on is failing. Whenever a bad sector is accessed, the server goes into a long read/retry loop until it finally fails, leaving the system in a bad state. It hit that bad sector for the first time on Wednesday, and then hit it again this morning.

While they all are ultimately testable, they have different issues. Despite the fact that 2 is more likely than 3, it might make sense to investigate 3 first since it very quickly can be ruled in or out. Here is how I might proceed in this investigation:

  1. Start with the first theory, which has two prerequisites. First, the disk has to be almost out of space. This will require a trusted contact to verify, if the system is physically remote. Second, you will have to mimic the condition of the system running out of disk space in a local capacity to see what actually happens. If both of these things turn out to be true, then you are almost certainly correct. If the disk has plenty of space, it's probably not even worth checking the second condition. If the disk is almost out of space, but it doesn't fail in the same way when you test it locally, it still might be a viable theory, and you will have to judge whether it's worth freeing up disk space and hoping for the best.
  2. If there is plenty of disk space, then checking with a trusted contact to determine if there are disk errors occurring is probably your best next step. This is usually easy to establish by looking at operating system logs.
  3. If there is plenty of disk space, and no errors occurring, it's probably time to start doing some more code investigation to determine if a deadlock or other thread issue is the problem, and you can proceed from there.
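The trade-off between how likely a theory is and how quickly it can be ruled in or out can be sketched as a simple ranking. To be clear, the scoring rule here (estimated probability divided by hours needed to test) is an assumption made for illustration, not a method stated above, and the probabilities and hour estimates are invented numbers for the three theories in this example.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    name: str
    probability: float    # rough gut estimate, 0.0 to 1.0
    hours_to_test: float  # rough cost of ruling the theory in or out


def investigation_order(theories: list[Theory]) -> list[Theory]:
    """Order theories by estimated probability per hour of testing, highest first."""
    return sorted(theories, key=lambda t: t.probability / t.hours_to_test, reverse=True)


# Invented estimates for the three theories in the client-server example.
theories = [
    Theory("log disk full", probability=0.5, hours_to_test=1.0),
    Theory("deadlock under load", probability=0.3, hours_to_test=8.0),
    Theory("failing disk", probability=0.2, hours_to_test=0.5),
]

for theory in investigation_order(theories):
    print(theory.name)
```

With these particular numbers the ranking matches the order used above: disk space first, then disk errors, then the code investigation for a deadlock. Different estimates would of course produce a different order, which is the point of writing them down.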

Tomorrow: The Distance Bug Investigation, Part III

So you have a bug reported, some network of contacts that you have developed over time, and maybe a theory that developed when you "Blinked" at the report. How do you actually start to confirm or deny that theory, or even develop a theory if you came up with nothing? The first thing to do is open a "case file" for the bug, and that will require a new kind of bug tracking.
Traditional bug reporting systems such as Bugzilla and TestTrack, while offering a wide range of capabilities, have fundamentally the same goal: allow a manager or team to see what is currently wrong, who is responsible for fixing it, and possibly when it will be fixed. Strangely, they offer very little support for the person actually doing the fixing in terms of tracking and augmenting their work. Imagine if development environments had more support for keeping track of who was responsible for writing each piece of code than for actually writing the code itself.  That is the state of bug tracking.

There are other problems with current bug tracking systems, with their emphasis on workflow and assignment of responsibility. They encourage finger-pointing rather than fixing. We should be creating an atmosphere that says, "every bug is everyone's problem". Often bugs take forever to get fixed simply because the bug is constantly being reassigned in tiny steps to different users rather than having them attack it quickly as a team. In a more subtle way, they orient us towards the surface features of a bug rather than theories of the problem. This is most pronounced when a bug is "reopened" when a problem reappears because of a totally different root cause. This makes it appear that the original fixer was lax or incorrect in their solution, which is totally false.

What should a better bug tracking system provide?

  • It should provide a way of entering basic information about the bug as current systems do, including a description of the problem as first described, the initial reporter, and the severity.
  • It should, at a glance, allow a developer to determine the current state of a bug. Do we already know what's wrong but haven't had time to fix it? Do we not even have any theories of the problem? Have we not even begun collecting any data about the problem? There should be more information than a state like OPEN, ASSIGNED, or CLOSED.
  • It should provide a way to track and quickly review all the observables for a bug, i.e. direct data collections, user observations, etc. along with an estimate of the certainty of that information so that we can discount information that the team did not obtain directly if necessary.
  • It should provide a way to track and quickly review any theories that we have developed. It should allow the assignment of likelihoods to theories so that we can see what our most probable theories are. We should also be able to quickly see what theories were rejected and whether any theories remain that have not been disproved.
  • It should provide a place to list any tests that have been tried and the data they produced, and the ability to link that information to the theories in terms of whether the results support, refute or do not affect them.
  • It should provide a place to record any additional assumptions being made about the bug or the system in question: those that remain to be tested, those that are untestable, and those that are usually tacit but might be called into question in the current bug investigation.
  • It should provide a clear place to list possible fixes based on the current theory, with a description of how it addresses the problem, and any associated information about the fix.
  • It should provide a clear statement of the final resolution and any follow-up caveats or assumptions built in to the fix that was chosen, such as side-effects or reasons that the fix might be undone or redone if circumstances change in the future.
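As a rough illustration of the requirements above, a "case file" could be modeled with a handful of record types. This is a hypothetical sketch, not the system I am building: every class, field, and status string here is an assumption chosen to mirror the bullets, and it needs Python 3.9 or later for the built-in generic type hints.

```python
from dataclasses import dataclass, field
from enum import Enum


class Effect(Enum):
    SUPPORTS = "supports"
    REFUTES = "refutes"
    NO_EFFECT = "no effect"


@dataclass
class Observation:
    description: str
    source: str
    certainty: float  # 0.0 (secondhand) to 1.0 (collected directly by the team)


@dataclass
class Theory:
    description: str
    likelihood: float
    rejected: bool = False


@dataclass
class Experiment:
    description: str
    result: str
    # Maps a theory description to how this result bears on it.
    effects: dict[str, Effect] = field(default_factory=dict)


@dataclass
class CaseFile:
    summary: str
    reporter: str
    severity: str
    observations: list[Observation] = field(default_factory=list)
    theories: list[Theory] = field(default_factory=list)
    experiments: list[Experiment] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)
    possible_fixes: list[str] = field(default_factory=list)
    resolution: str = ""

    def open_theories(self) -> list[Theory]:
        """Theories not yet rejected, most likely first."""
        live = [t for t in self.theories if not t.rejected]
        return sorted(live, key=lambda t: t.likelihood, reverse=True)

    def status(self) -> str:
        """A richer at-a-glance state than OPEN, ASSIGNED, or CLOSED."""
        if self.resolution:
            return "resolved"
        if not self.observations and not self.experiments:
            return "no data collected"
        if not self.theories:
            return "no theories yet"
        if not self.open_theories():
            return "all theories rejected"
        return "theories under investigation"


case = CaseFile(
    summary="Server stops responding to client requests",
    reporter="local administrator",
    severity="high",
)
case.observations.append(
    Observation("Restart restored service on Wednesday", "local administrator", 0.7)
)
case.theories.append(Theory("Deadlock under load", likelihood=0.3))
case.theories.append(Theory("Log disk is full", likelihood=0.5))

print(case.status())
print([t.description for t in case.open_theories()])
```

Even a plain document organized under these same headings gives you most of the benefit; the structure matters more than the tooling.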

The overarching goal of a system like this is to act as a persistent debugging memory, where you can relate new issues to previous ones, figure out which underlying causes recur in your system to point out weak links, and gain other insight into how and why your system tends to fail. In my copious free time, I am actually working on developing a bug tracking system that meets the above requirements, so look for that on this site sometime in the (not-so) near future. However, I've discovered that this type of information can be tracked in a document without too much trouble.  I will reference this type of tracking as the investigation process is covered in the next few posts.
Tomorrow: The Distance Bug Investigation, Part II

Now that you have a contact who has delivered results on at least one task, you should figure out their commitment level and domain of expertise. Here are some general levels from least to most commitment:

Level 0: The user will field the occasional question, but will not perform any actions on your behalf.

Level 1: The user will field questions, and is willing to test patches, beta releases, and other code changes to verify fixes and produce extra debugging information.

Level 2: Everything from Level 1, plus is willing to spend regular time working through issues with you. They will set aside time (maybe averaging an hour every 1-2 weeks) to actually work with you over the phone or IM (or even email if necessary) to collect information in real time.

Level 3: Everything from Level 2, plus will continue to assist you with some task autonomously. They will accept some level of tasking from you to perform during their daily work and will report back results to you. There are very few users that are willing to do this out of the goodness of their heart, or for love of the system. Generally you will get a Level 3 user when some higher-up compels an employee to assist you for the good of the project.

In some cases, it will be immediately clear what level a particular user is willing to work at just based on what they offer to do. In other cases, a user will progress to higher levels over time simply because they feel rewarded for the work they do.

Beyond determining the level, you will want to get a mix of contacts from various departments or teams to get a wider set of perspectives and testers. For example, if you are making a billing system, you will want a contact from the sales team, the accountants, and any other group that might have its own perspective. In general, you shouldn't turn away an additional contact from a group from which you already have a contact, but it makes sense to actively look for users from groups for which you don't currently have a good contact.

Tomorrow: The Distance Bug Investigation, Part I

In a Distance Debugging situation, especially one that involves a large physical distance, nothing is more valuable than a trusted contact. A trusted contact ideally is another member of your team that is working with you to solve problems, but more likely it is someone from your customer, either a user or local administrator, that will act as your eyes and fingers.

The qualities that you need in a trusted contact include (in approximate order of importance):

  1. Attention to Detail - This can't be stressed enough. I would much prefer a totally naive user with no technical skill that will faithfully report error messages and follow directions perfectly than the opposite. Sometimes a highly technical contact will try to out-think you to the detriment of the debugging effort.
  2. Enthusiasm/Passion - I am a firm believer that we ultimately do not commit to things that we do not enjoy. A tepid contact who doesn't care about the fate of the system will not make the extra effort to make sure that you solve the tough problem.
  3. A Thick Skin - This is needed for two reasons. First, when time is short and tempers flare, you need to know that your contact can take a little sarcasm. Second, you will need them to check and recheck something, and then check it again just for your peace of mind. They need to understand that you are being thorough, and not that you are questioning their competence.
  4. Domain Knowledge - They should understand the world of the users of the system, in order to answer certain questions, although this is less important.
  5. Technical Knowledge - Some technical knowledge can be helpful, especially simple utility things like how to open the Windows Control Panel. However, this is generally not very important.

The first thing to do when looking for a trusted contact is look for someone with these qualities. Often, I will look at a very detailed bug report and think, "This person looks like they would be a good contact to make." I will immediately try to get in touch to let them know that I appreciate their report and to gauge their interest in working with me on a regular basis. One of the easiest ways to determine this is by simply giving them a small task such as getting the answer to a question for you. For example, I might ask, "Can you check around with other users in your area and see if any of them use the X feature? We're trying to decide if we want to keep it around." If I check in with them the next week and they've polled 5 or 10 users, then I know I can count on them in the future. On their end, they usually realize that this gives them more of an opportunity to influence and guide future development, and it gives them buy-in.

Tomorrow: Developing a Trusted Contact, Part II
