
As a final send-off to NaBloPoMo, I thought I would give a brief example showing how I might put all the things I've been talking about together when solving a problem. Just to let everyone know, while I likely won't be posting every day going forward, I will try to post at least every other day, on average, so look for plenty of new content here.

A month or so ago, I posted about the terrible tech support that my grandmother received regarding issues with her aging laptop. While I disagree with her decision to keep the poor thing alive, they failed in a basic way to provide her with anything approximating distance debugging. I want to talk about one of those issues, starting from the bug report.

Report: My grandmother told me that, "when I tried to go to the New York Times page, it just opens my Documents folder".

Blink: Nothing yet. This made no sense to me, so I asked her to tell me the story.

Story: "I subscribe to the NYT news updates, and they send me articles by email with links, but when I click it it just opens my documents folder."

Blink: That triggered something: I bet her preferred browser got screwed up in her network settings. Would that actually result in this kind of behavior?

Theory: The network settings preferred browser got screwed up and that is making it impossible for her to follow links embedded in emails.

Data Collection #1: Check the setting of the preferred browser, artifact examination. Luckily I had access to the computer so I could check, but I probably could have navigated her through it over the phone. Confirmed that the browser was unset.

Fix Attempted (we haven't covered this yet, but coming soon): Changed it to Internet Explorer. Clicked link in email, browser opened with page. Fix successful.

However, I was still bothered by the fact that it was changed. Settings don't just flip themselves. Maybe there was some deeper problem of which this was actually just a symptom.

Data Collection #2: Human Inquiry. I asked my grandmother if she'd changed the setting for some reason, or installed any plugins or upgrades that might have affected it. She mentioned that she'd let my cousin install Firefox. Mystery solved.

So in a complete telling of the problem and fix, I would say that the attempt to install a second browser somehow changed her network settings, but not in a proper way, causing it to fail to load any browser when a link was clicked within another application. This would often result in just popping open the My Documents folder for reasons still unexplained, but likely unimportant.

That's a relatively trivial example, where the blink moment was the right answer, but I think it illustrates my overall approach to debugging in a nutshell.

Coming Soon: Let's Talk about Fix, Baby

Each type of data collection comes with its own risks and problems that you need to keep in mind:

  • Probe Execution - While it tends to be the most reliable of the techniques, it also happens to be the one that is least often available to you, and it also tends to give you very little information. It is only available to you if you have all the parts of the system handy, and in the case of something like a debugger, is only really useful if you have source code. The other problem is that while you step through 1000 mind-numbing lines of code looking for a slight divergence, you are likely to give up before you find it. It ends up as too much of a fishing expedition. However, if you have narrowed down the problem to a specific region or a specific problematic operation, a runtime probe can work. The use of automated debugging tools such as the Delta Debugger (see the link in the right side bar) can also make your runtime probes significantly more powerful and time-effective.
  • Human Inquiry - Generally the least reliable technique, just ask any attorney. Humans are biased, and they are easily influenced so they will often tell you what you want to hear, or will fill in gaps in their knowledge with invented content so as to appear helpful. In many cases, the person who knows is also often not available making the technique useless. The key is to make sure that you are talking to the person who would know, and to keep your questions very specific. Asking something like, "Did the error message say anything about a network problem?" is generally more valuable than, "What did the error message say?"
  • Test Case Execution - Very reliable, and generally available, test cases are my recommended data collection technique in most instances. The biggest caveat is that your test case may have a bug that hides data from you, so don't ignore counterintuitive results. Also, as a matter of course, write more test cases that you expect to fail than ones you expect to succeed. This might sound silly, but I often see a bug and ask to see what tests were written only to discover that they test the functionality so meekly that they barely test anything at all. Tests that fail mean more work to do now, but they will mean less work overall. On a related note, I like to say that a test that you expect to fail but actually succeeds is 10x as valuable as one that you expect to succeed but actually fails. When you think something will fail, but it succeeds, you can be sure that you are not really understanding the problem. When you expect success but get failure, you've just found another thing that fails and that may or may not be new information.
  • Artifact Examination - Artifacts suffer from 2 major issues: their reliability is totally unknown, and they are static. Like test cases, the production of system artifacts can suffer from bugs such as failing to write out important information or writing out incorrect information. A good example of this is inconsistent error logging, a problem that plagues many systems. If you have an error that is actually a cascading effect from a previous error but only the second one gets recorded, you will be missing a huge piece of the puzzle, and the artifact might actually lead you astray. In terms of being static, think of all of the times that you have looked at an error log and thought, "I wish I had written out the value of [important variable]". It's extremely hard to predict what information will be useful at some point in the future and there is no way to go back and fix it. In a few days I will discuss techniques to make your artifacts more reliable.
  • Differential Computation - Suffers from an assortment of issues. The biggest is knowing what to diff. You can't run the calculation on everything, so you need to pick things that you think have changed. Next is knowing how to diff. Text files and even most binaries can be easily compared, although you may not always understand the results. Diffing something like hardware configurations is a lot tougher and generally has to be done manually, and that introduces human error. Finally, the problem of spurious or ignorable differences can cloud your findings. Simple things like line breaks will complicate the results of a text diff. Trying to figure out whether differences in CPU chip speed are important is significantly more difficult.
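To make the point about spurious differences concrete, here is a minimal sketch of a text diff that ignores line-ending and trailing-whitespace noise. The function names and the sample configuration text are my own assumptions for illustration; only Python's standard difflib module is used.

```python
import difflib


def normalize(text):
    """Split into lines and drop line-ending and trailing-whitespace noise."""
    return [line.rstrip() for line in text.splitlines()]


def meaningful_diff(old_text, new_text):
    """Return only the lines that were removed ('- ') or added ('+ ')."""
    delta = difflib.ndiff(normalize(old_text), normalize(new_text))
    return [line for line in delta if line.startswith(("- ", "+ "))]


# Hypothetical config files: one saved with Windows line endings and a
# stray trailing space, the other with Unix line endings.
old = "host=db1\r\nport=5432  \r\ntimeout=30\r\n"
new = "host=db1\nport=5432\ntimeout=60\n"

for line in meaningful_diff(old, new):
    print(line)
```

Without the normalization step every line would show up as changed, and the one difference that matters (the timeout) would be buried in the noise.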

Tomorrow: Putting it all together, Diagnosing in Action

Evidence:

  1. I had to go on travel for business, which is generally inconvenient and tiring, but mostly unavoidable.
  2. My flight was slightly delayed, but not excessively so. I played a video game on my computer to fill the extra time. A 6- or 7-year-old kid noticed and stood next to my chair watching intently. All I could think about was what my son will be like at that age.
  3. The same kid was sitting in the row behind me and was dramatically overtired and very upset. I broke out the same game on the flight and he stood in the aisle of the airplane watching me and it calmed him down. The family also had an infant, so I was glad to give them a break.
  4. I had two messages when I arrived. #1: my wife calling me to say that a pipe burst at my son's day care so she is home with him all day tomorrow.
  5. The second message was the call from the school telling me that pipe burst. Glad I didn't get that one first.
  6. There were several put-out people waiting in the car rental kiosk because they didn't want minivans and that was all that was left. I told the guy I didn't care and he smiled and said "we have an Explorer for you".
  7. I agree that when the car stub says FORD EXPR, guessing Explorer is reasonable. However, it was actually a Ford Express, which is a full-size van that could easily seat 11. It could probably fit an Explorer inside of it. I drove it anyway because I didn't want to be one of the put-out people.
  8. I arrived at my hotel and got to my room with no issue. While the card worked fine and the door opened, it was immediately stopped by the little hinge thing that prevents unauthorized access. As far as I know, the room was vacant, i.e. no one jumped up and started shouting about who was barging in at 11:30 at night, so the whole thing had kind of a ship-in-a-bottle quality to it. I spent more time wondering how it was possible than caring about the inconvenience. Does this happen a lot? Was it intentional, like some misguided practical joke by the previous tenant? I will never know.
  9. The front desk was very apologetic and upgraded me to one of the "executive" rooms. He pointed me to some elevators off to the side that I had not noticed before. Unfortunately, this appears to be the only perk, your own set of elevators. Well, unless you count the hideous two-color silver/gold faucets, and the fact that your door has some kind of weird engraved pineapple insignia on it. I kid you not.  I figured I would at least get two bathrooms.

So here I sit in my "executive" room with the party van parked nearby, instead of at home with my son and wife. My hat's off to you, universe. I have absolutely no idea what to expect from the rest of this trip.

There are a handful of data collection techniques that you will use again and again. Each of these will help you fill certain knowledge gaps, and each has its own set of pitfalls and gotchas that will waste your time and energy. I touched on these in a general sense yesterday, but here is a more detailed list:

  • Runtime probe - Data collections that involve directly querying or observing the running system. Typical probes include running the system in a debugger for the purpose of observing control flow or the value of variables, or using a more general-purpose query language such as DTrace for Solaris, or system utilities like strace or netstat on Linux.
  • Human inquiry - Data collection that involves asking another person for information. This might be asking a user who discovered a bug for more information about what they experienced or asking a system administrator if they've applied a certain OS patch.
  • Test Case Execution - Data collection that involves crafting a specific test case meant to replicate a bug, or produce a specific error.
  • Artifact Examination - Data collection by looking at some artifact produced by the running (or crashing) system to glean specific information. This includes looking in a log for an error message, or opening a core dump file in a debugger.
  • Differential Computation - Data collection that consists of simply determining the difference between two or more things in order to know first if there is a difference, and second, what the specific difference is. The example given a few days ago of comparing the set of dependent libraries is a good example. Other common activities include diffing a configuration file, and diffing a set of data inputs to determine why one version of a system is working and another is not.

Tomorrow: The Caveats and Issues with Each Approach

Another brief departure today since I would rather watch Monday Night Football than think real hard. Last Tuesday, I took my car in for service because it was starting very rough in the cold weather, and for probably the first time ever, I actually understand what they did to fix it. Let me be very clear about one thing: I know very, very little about cars. I understand the basics of an engine from a physics standpoint, and I have rudimentary knowledge of things like how brakes work and so on, but I've never tried to fix a car, and frankly, I think it's in my best interest that I do not attempt to become a "car" guy. If I started to learn about cars, I would have to know absolutely everything about cars. I would spend a lot of time obsessively tweaking cars that I own, and I don't have time or mindspace for any of that.

However, I do understand computers, and in many cases, hardware. That's why I was pleased to read the diagnosis (I'm paraphrasing): "Tech determined that firmware version for [unknown car thing] was version 1.07 which is known to cause misfire condition. Computer showed newer version from July 06, which was installed and should correct condition and hopefully cold start issues." So they upgraded the software to hopefully fix the problem. This I understand.

As cars become more software-oriented, especially in electric or hybrid cars that are mostly "drive-by-wire", where doing things like pushing the brake actually controls a signal to the computer rather than physically controlling a friction-creating device, I may actually start to be able to fix cars simply because they have become computers. This brings up a larger idea for me, that of the people with computer skills becoming kind of "universal mechanics". As everything in our lives becomes a computer, from cable boxes to refrigerators to cars, anyone with some knowledge of operating systems and networking should be able to fix a certain portion of the things that go wrong with these devices. I think it may end up giving a certain power to these skills that will make them valuable in a way that the ability to write once was for the ruling classes, only hopefully, more widely attainable.

Data collection is a complex and critical piece of bug investigation. Data can come from many sources:

  • The results of specific "experiments" such as the success or failure of a unit test.
  • Observations about the state of the environment or the system made by you, your team, or by end users.
  • Artifacts produced by the running system such as log files or error messages.

One common mistake when collecting data is the production of information for the sake of keeping busy or in the hope that a general fishing expedition will reveal a critical piece of information. It almost never does. Using your theory or theories (or the reason for the lack thereof, as described previously) to choose specific data collections that will isolate the correct one is by far the most critical skill needed for debugging. It is also the part of the scientific method that is hardest to get right, and the success of many experiments hinges on the question of whether the results ultimately support or refute the theory being tested.

Putting that problem aside for a moment, there is a lower hurdle of simply making sure that the result of any data collection at least in conception is supposed to confirm or deny a theory. This means:

  1. Start by determining what pieces of information are needed, and then gearing your collections to that information specifically.
  2. For each data collection, have a clear notion of possible outcomes and how each outcome maps to supporting, refuting, or is irrelevant to each theory.
  3. Watch out for data collections that seem discriminating, but in fact have outcomes that support all or most of your theories. Also watch out for outcomes that would refute all of your theories, because assuming that you are a reasonably good theorizer, that outcome is probably extremely unlikely, and so the collection may not be worth doing at all.
  4. If possible, talk through these outcomes and your reasoning with others on your team to help find gaps in your thinking.
  5. If after this analysis you determine that a data collection is not going to provide you with good information, just don't do it. Don't convince yourself that just doing it might provide you with useful information somewhere down the line. 99% of the time you will just have extra data that serves no purpose, and may actually confuse things later on.
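One way to force yourself through steps 2 and 3 is to write the outcome-to-theory mapping down as a small table and check it mechanically. The sketch below is only an illustration under my own assumptions: the theory names, the two candidate collections, and the rule "a collection is discriminating if at least one outcome treats the theories differently" are all hypothetical choices, not a standard method.

```python
def is_discriminating(outcome_map):
    """outcome_map maps each possible outcome to a dict of
    {theory: 'supports' | 'refutes' | 'irrelevant'}.

    A collection is worth doing only if at least one outcome gives
    different verdicts to different theories."""
    for verdicts in outcome_map.values():
        if len(set(verdicts.values())) > 1:
            return True
    return False


# Hypothetical theories for a hung server.
check_free_disk_space = {
    "disk almost full": {"disk_full": "supports", "deadlock": "irrelevant", "bad_sector": "irrelevant"},
    "plenty of space": {"disk_full": "refutes", "deadlock": "irrelevant", "bad_sector": "irrelevant"},
}

restart_and_watch = {
    "server recovers": {"disk_full": "supports", "deadlock": "supports", "bad_sector": "supports"},
    "server stays hung": {"disk_full": "refutes", "deadlock": "refutes", "bad_sector": "refutes"},
}

print(is_discriminating(check_free_disk_space))  # separates one theory from the others
print(is_discriminating(restart_and_watch))      # every outcome treats all theories alike
```

The second collection looks like an experiment, but every outcome says the same thing about every theory, so it cannot help you choose between them.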

I'm stopping this whole "Distance Bug Investigation, Part X" thing since it makes the titles meaningless. I might even go back and change the older posts so that they have reasonable titles as well, so don't be surprised...

Tomorrow: Common Data Collection Techniques, with Caveats

Continuing with the discussion of what to do when you have no theory, the next likely cause is underconstraint, when the problem suggests a large number of possible causes. In this case, your best course of action is to figure out a few data collections that are the ideal combination of distinguishing (i.e. they would support some theories and not others) and low-cost.

Let's say that you are beta-testing a new version of your software, a standalone desktop application. It runs fine in your test rig, but for every user that installs it, it crashes on startup. Since the crash could have many causes, and dozens if not hundreds of things have changed in the newest version, you are totally underconstrained. You might try the following set of data collections to try to narrow things down:

  • Go through the list of dependent libraries for the application and compare the version on the customer machine to the version on the test machine.
  • Attempt an installation on several additional machines in your local environment in the hope that one of them has the same crash, thereby giving you a local machine to use for comparison.
  • Perform an installation of the software yourself on a customer computer, with a user present. Perhaps in your testing you are making a different set of choices or are otherwise installing it in a fashion that differs from their actions.

These three data collections are likely fast, as in they could be performed the same day, and fairly accurate. The first would tell you if the environment might be to blame, the second would give you a leg up on replication, and the third would tell you if installation procedures were to blame. In each case, a meaningful result would give you a significantly more constrained problem, even though you might still not have a theory at that point.
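The first of those collections, comparing dependent library versions, is simple enough to sketch. This assumes you have already gathered a name-to-version listing from each machine by whatever means your platform offers; the library names and version numbers below are made up for illustration.

```python
def compare_versions(test_libs, customer_libs):
    """Return {library: (test_version, customer_version)} for every library
    whose version differs or which is missing on one side (None)."""
    names = sorted(set(test_libs) | set(customer_libs))
    return {
        name: (test_libs.get(name), customer_libs.get(name))
        for name in names
        if test_libs.get(name) != customer_libs.get(name)
    }


# Hypothetical listings gathered from the test rig and a customer machine.
test_rig = {"libfoo": "2.1.0", "libbar": "1.4.2", "libbaz": "0.9"}
customer = {"libfoo": "2.1.0", "libbar": "1.3.9"}

for name, (ours, theirs) in compare_versions(test_rig, customer).items():
    print(f"{name}: test={ours} customer={theirs}")
```

Anything that shows up in the result is a candidate for the environment being to blame; an empty result lets you set that theory aside quickly.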

The final common cause of having no theory is the improbable problem. It can be hard to come up with a theory when we believe that the situation being described simply can't happen. Imagine that you get a bug report that a user is receiving two identical email notifications every time a report is generated by your system. You however know that you just put in a piece of code that is explicitly looking for duplicate notifications and throwing them out because of a known previous issue. You will convince yourself that the problem is everywhere but in your code. You will blame the outgoing mail server, the incoming mail server, user error, etc. before coming up with a theory. As described in the post from a couple of weeks ago on this same topic, there is one good thing to start with when you encounter a problem like this, and that is to replicate. The replication will force you to take it seriously. Another good tactic is to get another opinion from someone on your team. They might say, "oh yeah, I can totally see how that would happen even with your check in there", and then you will have something to go on.

For more general suggestions about trying to generate a theory, see Day 14: When the Blink Fails. Now that we have covered theorizing, the next few posts will cover data collection in more detail.

Tomorrow: The Distance Bug Investigation, Part V

What if you "blink" and come up with nothing? This same question was posed a few weeks ago in a similar post, but this post will focus more on the fundamental reasons that you will fail to produce a theory in addition to specific ideas for solving the problem.

There are many reasons that you may be unable to produce a viable theory. Here are three of the primary reasons:

  1. The problem is overconstrained. Given the symptoms you are seeing, there is no theory that you can imagine that would match those conditions.
  2. The problem is underconstrained. There are dozens of theories that would match the conditions and so no one thing jumps out.
  3. The problem itself seems improbable. While you have a theory that explains it, it doesn't explain why it is suddenly happening now.

A problem can seem overconstrained for several reasons:

  • The biggest one is a bad observation. A single data point that is out of line with everything else that you know, or which directly contradicts something that you had assumed was true can throw off your whole investigation. If one piece of data is keeping you from formulating a theory, then try temporarily assuming that the data you have is bad, or spend some time investigating that one piece in depth in an attempt to really establish or discard it. More often than not, you will discover that it was erroneous, or that you really misunderstood its ramifications and in fact it does not contradict your theory.
  • The next reason for overconstraint is subtle, and I like to refer to it as the "sampling" problem. If you take measurements of a time-varying phenomenon at just the right (or really, wrong) intervals, you can perceive something totally incorrectly. In the world of signal analysis, you can have the problem of "aliasing", where a high-frequency waveform appears as a much lower frequency wave because of the interference of the sampling rate. The problem of "aliasing" occurs in debugging data collection as well. Here's a direct example: imagine that you get a call from a system administrator telling you that a server has to be restarted every Tuesday morning. You think to yourself, "what kind of bug happens on the same day every week?" Actually, the system has to be restarted every morning, but that system administrator is generally the first one there on Tuesdays, and she is the only one who has mentioned anything to you. That is an example of an "aliased" piece of data. Anytime you are relying on a scattered set of reports over time, or across a set of different people, look out for sampling problems and make an attempt to fill in those gaps.
  • The final common cause of overconstraint is one or more unquestioned assumptions getting in the way. We often overconstrain ourselves with our thought process in which we say, "Well, we are certain that X is true, and Y is true, so I can't possibly see how the bug would happen." When you look more deeply though, you can easily imagine a condition under which X would be false. For example, most of the world is still using a single-processor machine at this point (although that is rapidly changing). You might get a bug report about a threading condition that you just can't imagine happening on a single-processor machine. However, once you realize that there is no reason that the problem machine in question has to be single-processor (and it's very easy to check), it seems obvious what is going wrong. Learning to notice and question those normally tacit assumptions is a skill that can save significant time and anguish.
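The Tuesday restart story from the sampling bullet can be simulated in a few lines. This is only a toy model under my own assumptions: the server needs a restart every single morning, and a restart reaches you only on the weekdays when the one administrator who talks to you is first in. The function name and the start date are arbitrary.

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() numbers Monday as 0


def observed_restarts(start, days, reporter_weekdays):
    """The server is restarted every morning, but you only hear about the
    mornings on which your one reporter was the first person in."""
    every_morning = [start + timedelta(days=i) for i in range(days)]
    return [d for d in every_morning if d.weekday() in reporter_weekdays]


reports = observed_restarts(date(2006, 11, 1), 28, {TUESDAY})

print("actual restarts:   28")
print("reported restarts:", len(reports))
print("weekdays reported: ", sorted({d.strftime("%A") for d in reports}))
```

The data that reaches you says "fails weekly, on Tuesdays", while the system itself fails daily. The weekly pattern is a property of who is reporting, not of the bug.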

Underconstraint and improbable bugs will be covered tomorrow.

Tomorrow: The Distance Bug Investigation, Part IV

This post will start from the assumption that you have at least one theory to work with. Tomorrow's post will cover developing a theory when you have none. There are a few questions you need to ask yourself about the theory in order to determine how to proceed:

  • Is the theory testable, given the current distance constraints?
  • If not, could the distance issue be overcome by one of your trusted contacts?
  • If not, could the theory be testable in some simulated environment with some likelihood of success?

The problem is that while you may have an excellent theory, it might be quite difficult to determine whether or not it's true, and you may want to investigate some less probable theories that are quickly testable before looking at the most probable one.

Here is an example: you have built and deployed a client-server system for a customer. You receive a phone call on Friday morning that the server mysteriously stopped responding to client requests on Wednesday. The local administrator simply brought the server application down and back up again, and the server appeared to go back to normal. However, the server was displaying the same symptoms again that morning, and they decided to call you. Upon the problem being described, you have a few ideas that immediately jump to mind, in order of estimated probability:

  1. The system is designed to write a log file to disk, and rolls to a new log every 100MB. It is possible that the disk has filled up and when it runs out of log space gets wedged in that state. The reboot causes it to clear the most recent log and start again, but when it fills up, it has the same problem. This would explain the long lead-up to the first problem, but the rapid reoccurrence.
  2. Perhaps a deadlock is occurring because a greater number of users are using the system and a section of code that was not properly protected is now causing a problem. While it is unlikely to occur, it will become increasingly likely as the load on the server increases. This also would match a condition with a long lead-up, and then a relatively rapid reoccurrence.
  3. The disk that the server is running on is failing. Whenever a bad sector is accessed, the server goes into a long read/retry loop until it finally fails, leaving the system in a bad state. It hit that bad sector for the first time on Wednesday, and then hit it again this morning.

While they all are ultimately testable, they have different issues. Despite the fact that 2 is more likely than 3, it might make sense to investigate 3 first since it very quickly can be ruled in or out. Here is how I might proceed in this investigation:

  1. Start with the first theory, which has two prerequisites. First, the disk has to be almost out of space. This will require a trusted contact to verify, if the system is physically remote. Second, you will have to mimic the condition of the system running out of disk space in a local capacity to see what actually happens. If both of these things turn out to be true, then you are almost certainly correct. If the disk has plenty of space, it's probably not even worth checking the second condition. If the disk is almost out of space, but it doesn't fail in the same way when you test it locally, it still might be a viable theory, and you will have to judge whether it's worth freeing up disk space and hoping for the best.
  2. If there is plenty of disk space, then checking with a trusted contact to determine if there are disk errors occurring is probably your best next step. This is usually easy to establish by looking at operating system logs.
  3. If there is plenty of disk space, and no errors occurring, it's probably time to start doing some more code investigation to determine if a deadlock or other thread issue is the problem, and you can proceed from there.
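The first prerequisite in step 1 is the kind of check a trusted contact can run for you in seconds. Here is a minimal sketch, assuming the only thing that matters is whether the volume holding the logs has room for one more 100MB log file; the function names and the path you pass in are my own placeholders, and the real server may need more headroom than this.

```python
import shutil

ROLL_SIZE_BYTES = 100 * 1024 * 1024  # the server rolls to a new log every 100MB


def can_roll_log(free_bytes, roll_size=ROLL_SIZE_BYTES):
    """True if there is room on the volume for one more full log file."""
    return free_bytes >= roll_size


def check_log_volume(path):
    """Report free space on the volume that holds `path` (a hypothetical
    log directory) and whether another log roll would fit."""
    usage = shutil.disk_usage(path)
    return {"free_bytes": usage.free, "can_roll": can_roll_log(usage.free)}


if __name__ == "__main__":
    # "." stands in for the real log directory on the remote server.
    print(check_log_volume("."))
```

If this reports plenty of room, the disk-full theory can be set aside without ever building the local out-of-space simulation.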

Tomorrow: The Distance Bug Investigation, Part III

Looking to chat about debugging issues, or have a comment, suggestion, or idea that you don't want to post about publicly?  Feel free to email me at holman (at) distancedebugging (dot) com.  I check that email fairly regularly, but I do have a day job that prevents me from rapid responses during the day.  However, I'll try to get back to you as soon as I can.