
Debuggers are hindered by the lack of a language for talking about the stages of attacking a problem. When someone says, "I'm debugging that server crash", is it almost fixed? Do they know what the problem is but are unsure how to fix it? Do they even know what the problem is?

To address this problem, I am proposing the following six stages of debugging a problem:

Instantiation - A bug has been found, but it has not yet been clearly defined. In other words, someone has told you something is wrong, but the nature of the problem is not yet understood. The simple declaration of a bug is enough to get into this stage, and the bug remains instantiated until the verification process has begun.

Verification - After the bug has been instantiated, its existence must be verified. This means giving the bug the prima facie test: does the described behavior, on its face, actually constitute a bug? Many bug reports can be thrown out in this stage because they describe the expected behavior of the system (in which case the bug may be a request for change, or a simple misunderstanding), because they describe problems originating outside the application, or because they are so vague as to be impossible to fix, such as "System was slow". If the bug appears reasonable, the recreation stage is entered. Otherwise, it is either rejected outright or sent back to the creator for more information.

Recreation - The next stage is recreating the problem in some inspectable way. Originally, I wanted to call this stage "replication", but I don't want to overload that term. Some bugs don't have a natural "replication" mode, but can be recreated. For instance, "query performance is bad on query X". There is not much to replicate, other than to confirm that the problem exists as stated. However, in most cases, this stage will consist of the process of replicating the stated bug through a series of specific steps.

Isolation - This is the process of filtering out all the stuff that is not wrong, and reducing down to the point or points of failure. For many bugs, especially those that were easy to replicate, this is where the bulk of the work is spent. When isolation is complete, you should have a very clear understanding of what is wrong, and how to go about fixing it.

Repair - Once the bug has been isolated, one or more fixes must be applied. It may turn out that the isolation was incorrect, and in many cases, a debugging session will bounce back and forth between isolation and repair.

Validation - Finally, once the bug has been repaired, the fix has to be validated. In some instances, this stage will be trivial due to steps taken in the repair or isolation stages, such as when a test case is used to isolate the problem which now passes, or when a page refresh is all that is needed to see the improvement. In other cases, the fix must be tried in an operational setting, to verify that the thing that you fixed is the thing that was actually broken.

To recap:

  1. Instantiation
  2. Verification
  3. Recreation
  4. Isolation
  5. Repair
  6. Validation

So when someone asks you where you are with the server crash, you can now say, "I've verified the problem and am working on recreation", or "I've isolated the problem and I'm working on repair". This allows others to better understand how much progress is being made, and to increase communication with peers and with management.
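One way to make these stages part of a team's shared vocabulary is to encode them in whatever tool tracks the work. The following is only a minimal sketch in Python, not a prescribed implementation; the names `Stage` and `status_report` are hypothetical and exist purely for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six debugging stages, in the order they are normally entered."""
    INSTANTIATION = 1
    VERIFICATION = 2
    RECREATION = 3
    ISOLATION = 4
    REPAIR = 5
    VALIDATION = 6


def status_report(bug: str, current: Stage) -> str:
    """Describe progress as 'finished X, working on Y' (hypothetical helper)."""
    if current is Stage.INSTANTIATION:
        return f"{bug}: reported, but not yet verified"
    previous = Stage(current - 1)  # the stage that was just completed
    return (f"{bug}: finished {previous.name.lower()}, "
            f"working on {current.name.lower()}")


print(status_report("server crash", Stage.RECREATION))
```

Because a session can bounce back and forth between isolation and repair, the sketch stores only the current stage and does not assume that stages are visited exactly once.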

During my long blogging hiatus, I've been up to a few things:

  • Kicking off my new business, Distance Software. The website isn't much to look at yet, but I've got the new logo (one is also coming for Distance Debugging shortly), and I'm working with that same designer to create the new site.
  • I've started a new Drupal setup to start to try to create a community debugging site/forum. There isn't much there yet, but I'm going to start porting over content that I've posted here, as well as new material. Check out FixIt!
  • Ported distancedebugging.com and distancesoftware.com over to new digs on a dedicated server at SuperbInternet.
  • Helping with the planning for BarCampMilwaukee. I had a lot of fun at last year's event, and I'm hoping to contribute a lot more time and energy this year. I'm going to have Distance Software sponsor, and I'm also running the BCM site here on my box to give it some bandwidth and horsepower. We'll see how it holds up!

Look for lots of new posts now that I'm back in the saddle.

After witnessing a string of situations where people proposed what were, to me at least, wildly improbable theories of a problem, I began to question why it was that people kept throwing out these elaborate explanations of seemingly straightforward problems. Upon reflection, I believe it has a lot to do with people's egos, and their desire to be part of something historical. These three factors drive the generation of complex, unlikely theories in favor of simple, probable ones:

  1. The War Story Factor - You can't constantly retell a story about the time that you had a frustrating off-by-one error, even if the debugging process was arduous and the stakes were high. People only want to hear "fix-it" stories where the solution to the problem either required some seemingly mystical leap of logic, or where the actual underlying cause turned out to be incredibly bizarre and improbable. Thus, in proposing an unlikely hypothesis, we hope that it's true so that we get a new war story to add to our arsenal.
  2. The Genius Factor - We all want to look smart in front of our peers, and what better way to do so than by proposing a chimerical theory of the problem, only to be proven right in the end! If your theory is wrong, then you shrug your shoulders and move on; it's likely that no one will hold you to account for the hours or days you spent pursuing this theory. In this way, improbable theories have really only an upside for the proposers: you'll look like a genius if you are right, and just another wrong guesser if you're not.
  3. The Hero Factor - Related to the genius factor is a desire by many software people (and I assume trade and service people of all stripes) to be seen as the savior or hero. You want people to say, "We were really stuck. Thank goodness that you came up with that crazy theory that we hadn't considered. We would still be struggling with this problem!" You get to swoop in and save the day with your fantastic theory.

So we want our problems to be solved, and solved quickly, but we secretly hope they have complex, mind-bending solutions so that we can boost our egos when we solve them. We are rewarded for doing the opposite of what is generally effective.

How can we change this behavior from a team standpoint? For a start, keep track of who proposes which theories, and who tends to be right more often than not. Offer kudos to people who solve problems quickly and efficiently rather than those who solve "hard" or "unusual" problems. We tend to judge the "hardness" retroactively based on the perceived unlikeliness or difficulty of the solution, even though most debugging problems look hard when you don't know the solution, and this penalizes people who try simple solutions first and plow through lots of fixes in a shorter period of time.
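Keeping track of who proposes which theories does not need anything elaborate. As a rough sketch, assuming each theory is eventually marked as confirmed or not, a small tally like the one below would do; the `TheoryLog` class and the names in the example are hypothetical.

```python
from collections import defaultdict


class TheoryLog:
    """Tally how often each person's proposed theory turned out to be right."""

    def __init__(self) -> None:
        # proposer -> [theories confirmed, theories proposed]
        self._tally = defaultdict(lambda: [0, 0])

    def record(self, proposer: str, confirmed: bool) -> None:
        entry = self._tally[proposer]
        entry[1] += 1
        if confirmed:
            entry[0] += 1

    def hit_rate(self, proposer: str) -> float:
        """Fraction of this person's theories that were confirmed."""
        confirmed, proposed = self._tally.get(proposer, (0, 0))
        return confirmed / proposed if proposed else 0.0


log = TheoryLog()
log.record("alice", confirmed=True)   # simple theory, turned out right
log.record("alice", confirmed=True)
log.record("bob", confirmed=False)    # elaborate theory, dead end
log.record("bob", confirmed=True)
print(log.hit_rate("alice"), log.hit_rate("bob"))
```

A hit rate alone says nothing about how long each theory took to check, so in practice you would probably want to record the time spent as well.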

The second thing to do is to make people feel bad about improbable theories. This can be as simple as encouraging mild ridicule of team members who consistently make bizarre leaps of faith in their hypotheses. At the very least, force people to make their reasoning explicit. Is their hypothesis really the one that is least inconsistent with the evidence, or is there a much more parsimonious explanation? With a carrot for rapid debugging and a stick for improbable theorizers it should be possible to eliminate the ego factor and improve your team's hypothesis-making.

I've encountered a frustrating roadblock in a few distance debugging situations recently, and I have been looking for a term to describe the problem. Here's the issue, in a nutshell:

A sticky problem on a remote system has been narrowed down to a likely cause, such as the interaction of two different applications. A test has been proposed to verify this assessment, which is to remove one of the applications and run the test procedure again. However, the remote resource balks, saying "But we'll need both of those applications on the real system!" The problem is that they have conflated testing and fixing. In other words, they have taken the suggestion of "remove an application to test the theory" as "remove the application to fix the problem", when one does not have to imply the other. Unfortunately, mixing these two issues together creates a mental roadblock where you can no longer make headway by ruling a particular cause in or out.

So what can be done? I've found that the first thing to do is make the test/fix separation explicit. Tell the remote resource, "I understand that this is not a permanent solution, but until we've verified that the application interaction is the problem, there is no point in pursuing other solutions." Another tactic, which addresses their fears directly, is to offer a test procedure that involves undoing and then redoing the piece that your remote resource insists is untouchable. In the conflicting applications example, this might be, "Remove one of the applications and check to see if the problem still occurs. Then reinstall that application and try it again to verify that the problem is back again." Finally, make it clear that you understand the constraints by saying things like, "I know that you will need both applications in the long run. I'm sure that once we've narrowed it down, we can find out what the conflict is and resolve it." Once you have shown your focus on getting to the cause rather than proposing a solution, your remote resource should be more receptive to allowing you to try something out.
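The undo-then-redo procedure can be written down as an explicit test, which sometimes helps show the remote resource that nothing is being removed permanently. This is a hedged sketch only: `toggle_test` and its three callbacks are hypothetical stand-ins for whatever manual or scripted steps apply on the real system, and the "remote system" here is just a simulated set of installed applications.

```python
from typing import Callable


def toggle_test(remove: Callable[[], None],
                reinstall: Callable[[], None],
                problem_occurs: Callable[[], bool]) -> str:
    """Remove the suspect, test, put it back, and test again."""
    remove()
    gone_when_removed = not problem_occurs()
    reinstall()  # always restore the original setup before concluding
    back_when_restored = problem_occurs()

    if gone_when_removed and back_when_restored:
        return "suspect implicated"
    if not gone_when_removed:
        return "suspect ruled out"
    return "inconclusive"


# Simulated system: the problem appears only while both apps are installed.
installed = {"A", "B"}
verdict = toggle_test(
    remove=lambda: installed.discard("B"),
    reinstall=lambda: installed.add("B"),
    problem_occurs=lambda: {"A", "B"} <= installed,
)
print(verdict)
```

Note that the procedure is a test, not a fix: the system ends in the same state it started in, and only the verdict is new.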

When testing an application, because of slight differences between the test environment or usage pattern and the real system, we often end up discovering "bugs" that would never happen under normal conditions. These bugs tend to be surprising because we wonder how the problem could have escaped our notice for so long or how it could have been introduced. Here are two examples of these bugs, followed by an explanation of how the "artifact" was created, identified, and resolved:

  1. I was working on a system where users had a foldering structure stored on the server. We were testing server performance, and were simulating a large number of users creating lots of folders over a long period of time. Things were getting slower and slower and it looked like there was a serious performance problem.
  2. Recently, I was working on porting an application from an older version of WebLogic to a newer version (9.2). We have a load testing rig that simulates the effects of many users calling the system over time. Everything was going smoothly with the port until we started ramping up our testing with the simulated clients. Each client was using SSL to connect and a certificate to authenticate, and the server should keep track of who was authenticated for a given call so that their actions can be associated with them. Our test rig relied on multiple simulated users connecting from a single physical machine (a fairly standard practice for load-testing), and when we tried it with the updated version, calls coming from the same machine suddenly seemed to have somewhat arbitrary credentials associated with them, as if the server code was not thread-safe and the authentication-related code was totally broken.

Now, the thrilling conclusion:

  1. We had to first cut apart the size of the data being created from other factors, such as length of time since test initiation (since data sizes tend to grow as time goes on). When we went and looked at the actual data being created, we noticed that we hadn't set any limits on how many subfolders should be created for a given folder, with some folders winding up with 1000s of child folders, something that was deemed very unlikely to happen in practice (and in fact it never has). We made a note of the fact that a performance problem could arise if a user chose to create a huge number of child folders, and changed our test rig to create deeper folder nesting rather than wider folder nesting keeping the number of folders the same while avoiding an unlikely usage pattern.
  2. While the original theory was that we had somehow failed to port our code to the new WebLogic version correctly, this simply caused us to chase down a lot of dead ends. We decided to start running only one client per physical machine to see if the problem appeared (after putting in lots of extra logging on the server and writing a very simple, repeatable test to demonstrate the problem). The problem disappeared in the multiple machine test, and it became clear that the issue was related to running multiple clients on the same machine. At this point we were tired of dealing with the issue and accepted this workaround, since in practice, we never had a situation where multiple clients would be connecting from the same machine and authenticating as different users. We still don't know if WebLogic somehow associates credentials with a particular IP address, and if so, if there is some way to turn this off. To really verify this theory we would need to set up a machine with multiple IP addresses assigned to the same NIC, and somehow get different clients to use different IP addresses.
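The change to the test rig in the first story (deeper nesting instead of wider nesting, with the same number of folders) can be illustrated with a small generator. This is only a sketch of the idea, not the rig we actually used; `build_tree`, `depth`, and the figures of 5000 folders and 10 children per folder are assumptions made for the example.

```python
from collections import deque


def build_tree(total_folders: int, max_children: int) -> dict:
    """Create `total_folders` folders under root 0, filling level by level
    and never giving any folder more than `max_children` subfolders."""
    children = {0: []}
    open_parents = deque([0])  # folders that can still take a child
    for folder_id in range(1, total_folders + 1):
        parent = open_parents[0]
        children[parent].append(folder_id)
        children[folder_id] = []
        open_parents.append(folder_id)
        if len(children[parent]) == max_children:
            open_parents.popleft()  # this parent is full
    return children


def depth(children: dict, node: int = 0) -> int:
    """Number of levels in the tree, counting the root as level 1."""
    return 1 + max((depth(children, c) for c in children[node]), default=0)


def widest(children: dict) -> int:
    """Largest number of direct subfolders under any single folder."""
    return max(len(kids) for kids in children.values())


wide = build_tree(5000, max_children=5000)  # the accidental test pattern
deep = build_tree(5000, max_children=10)    # the revised test pattern
print(widest(wide), depth(wide))
print(widest(deep), depth(deep))
```

Both trees hold the same number of folders, so the overall data volume is unchanged; only the shape differs.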

What's the moral of the story?

  • When you uncover a bug during testing that surprises you in that you would have expected to see it under production conditions, go back and verify that you are actually trying to do something that the production system does.
  • In the case where the bug is something that hinders your ability to test, but would have no effect on the actual system (as in bug #2 above), it can be a very tough call to determine how much energy to put into fixing it.
  • While testing for a broad range of conditions and situations can be beneficial for a system, especially in cases where you might anticipate a future problem (as in bug #1), you can also wind up plugging a lot of holes that won't ever leak.

I think my favorite artifact story is the one retold by Steve McConnell about a team trying to get better performance out of their OS using some profiler data:

Bentley also reports the case of a team that discovered that half an operating system's time was spent in a small loop. They rewrote the loop in microcode and made the loop 10 times faster, but it didn't change the system's performance - they had rewritten the system's idle loop.

A method eating up 50% of the execution time sure looks like a nasty bug, but it was only an artifact of the system design. Keep this lesson in mind next time you see something so shocking.

In the late 90s, I started hearing a lot about Linux and wanted to give it a shot. A friend of mine had the CDs for Red Hat 4-point-something and he lent it and a copy of his 2.5-inch thick Red Hat Linux Unleashed (or some such title) book to me, and I undertook a project that would ultimately change my life as a hacker and a computer scientist: I tried to get Linux installed on my "state-of-the-art" Pentium II computer. Installing Linux back then, while I'm sure it was light-years ahead of where many practitioners started, was still difficult enough that I actually had to learn something about computers to make it happen. The whole process took a couple of weeks, filled with reading, research, trial and error, and ultimately, it was the beginning of a process of knowledge and skill acquisition that continues today.

Here are a few of the things that I learned in those weeks:

  • Despite having an installation process that walked me through the steps necessary to install and configure the system, there were many questions along the way. How did I want my disk partitioned? How much swap space? Which packages did I want installed? How did I want my network configured? Each question sent me off on an investigation about the possibilities, and the advantages and drawbacks of each option.
  • I wanted a dual-boot system since I wasn't really ready to abandon my Windows 95 use yet given it was all I really knew. This meant learning about the boot sequence, boot loaders, the MBR, lilo, disk partitioning, and even a little about disk geometry.
  • Of course, I wanted an X Windows based system so that I could run graphical apps. Back then, autodetection of video card and monitor settings was dicey. To get it up and running meant learning about video timings, how to modify the XF86Config file, and reading the arcane spec sheets that came with my monitor and video card to find a compatible setting so that it actually started up.
  • Once I had gotten things installed, the next step was figuring out what I could do with the thing. I had some familiarity with a CLI from my DOS days, but a real shell is a little different. Even simple things such as "how do I run a program?" turned out to be tricky, and learning that I needed to prepend "./" to run a command in the current directory was a revelation.

Like most things, I got better with experience. Pretty soon I was figuring out how to burn CDs (and understanding file mounting, CD disc structure, and some basics of device drivers), download and compile software using the configure/make/make install pattern, and much more. Within a year I was setting up a Linux box to provide NAT for my cable modem (I was lucky to have access to an early cable modem service), meaning I learned a ton about the nitty-gritty of real world networking, iptables (it was probably ipchains then), and how to set up and configure a small home network including local DHCP and DNS services, port forwarding, and much more.

The freely available nature of Linux and its subcomponents, coupled with the vast resources of documentation and community, means that any self-motivated person can spin up on pretty much any technical topic and actually try out a working implementation to get a feel for how the idea plays out in practice. In the early days of computing, it was this way for the vast majority of users, with the tradeoff being that there was no easy alternative for non-technical users. Since the rise of Windows and Mac as the dominant operating systems, this option has been hidden or even taken off the table for many up and coming programmers.

The fact that you have to "think" to use Linux has been criticized, but few people seem to note the danger of having many developers learn on an operating system that requires little or no thought. Whether or not you agree with the sentiment that Linux makes you think too hard, there is no reason we can't have one OS for the "masses" and another for developers who actually care about what is going on on their system. So this isn't so much a criticism of Windows as it is a criticism of the complacency of computer science programs and software development shops in accepting Windows as their standard platform.

My advice to young programmers is, rather than always working with software that doesn't make you think, spend some time with some that does. Like learning a new programming language, it's a great way to expand your knowledge of computer science concepts, as well as develop important problem solving skills. You will be surprised at what you don't know when the configuration tools and installation wizards are stripped away.

I worked as a computer administrator for a small Mac-based network during my college years. Things ran fairly smoothly most of the time, but one event sticks out in my head from my time there. I was sitting at my desk doing some routine maintenance when one of the staff ran up to me saying, "One of the students said that the computer she was working at has a virus!" Panicked and fearing that I'd forgotten to update the virus definitions or otherwise fallen asleep at the switch, I rushed over to the computer in question. Nothing immediately appeared out of the ordinary, but my first self-preservation instinct was to yank the network cord out of the back, shut all non-essential programs down, and run a fresh virus scan to see what we had been infected with. I paused though, because something didn't seem right.

It occurred to me: what did this student see that made them think there was a virus infection? I know what one looks like because I've had to fix infected computers and seen the bizarre unkillable processes, random pop-up windows, sluggishness, etc. However, most people, when they think of a virus, imagine a giant window popping up with critters dancing around the screen and giant text reading "You have gotten the PDQ virus! I will now delete all your files!" I wish that viruses so readily advertised their presence, as it would save me a lot of time. While considering what might have conveyed the presence of a virus, I glanced at the browser window the student had left open, showing a page with a banner ad at the top. The banner was flashing blue and purple and said "YOUR COMPUTER IS INFECTED WITH A VIRUS!!!". With a chuckle, I explained what had happened to the staff and went back to my normal routine.

Distance Debugging often means that you have to take someone else's account of a situation. It can be easy to forget that you are working from second-hand data and from someone else's interpretation of what they observed. What can you do to help understand others' perspectives and observations?

  1. Be very aware of the "layman's" terminology in common technical domains in order to help clarify seemingly bizarre support requests. For instance, I've noticed that many people use a kind of synecdoche and say "Internet" where they mean "Web", as in "the Internet is down" to indicate that they can't get to websites.
  2. When working with other technical people, think about their background and biases. Do they likely know what they are talking about in the domain they are working in? What evidence do you have that their mental model of the system matches or does not match your own? Are they naturally distrustful of certain applications or systems? Would there be any reason for them to obfuscate or otherwise manipulate the information being presented (for instance to cover their own or another's mistake)?
  3. Reflect on communications negotiations, successes, and failures.  Did you successfully solve a problem because you looked at it through another's eyes?  Were you able to translate from their description to a correct representation of the problem?  Did you get frustrated? Did you miss critical details? What words were used that might be useful to file away in a "translation guide" for dealing with a particular individual or a class of individuals the next time?
  4. Be careful of chronology.  We have a tendency to forget when a certain piece of knowledge became known to a certain person, and can either come to erroneous conclusions or dismiss valid ones by saying, "They wouldn't have done X because they knew Y", when in fact Y couldn't have been known by them at the time.

Cultivating theory of mind skills will not only serve you well in a debugging setting, but can help in almost any interpersonal situation.  Most of the time, we take for granted our ability to consider the minds of others but when we fail to do so, we risk making serious errors in judgment.

Let's say that you want to win the lottery (who doesn't?). You have two problems that need to be overcome:

  1. You have to guess the right numbers.
  2. Even if you guess right, you have to split the jackpot with a bunch of other lucky folks who also guess right.

Most people only worry about part 1, but if you are going for the best expected value, which is a combination of odds and payoff, then you need to worry about 2. For instance, you would probably rather be the sole winner of a $200 million jackpot than be forced to split it with 20 other people (even though roughly $10 million is still a ton of money).

The ironic part is that there is little or nothing that you can do about 1, but you can do something about 2. Here's a simple tip: play numbers greater than 31. A common strategy is playing birthdays as one's numbers because they are easily mapped to lottery selections and perhaps in some primordial way, they provide a symbolic "offering" of your loved ones to the lottery gods. Assuming every combination of numbers has the same odds of winning, you might as well play some numbers that are unlikely to be played by others so that if you win, you reduce the likelihood of having to split the payout. Since birthdays must consist of numbers that are less than or equal to 31, you increase your expected value with some well-chosen numbers.
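To put rough numbers on this, the expected value of a ticket is the probability of winning multiplied by your share of the jackpot. The sketch below is illustrative only: the function `expected_payout` is hypothetical, the odds of 1 in 100 million are an assumed figure rather than those of any real lottery, and smaller prizes and taxes are ignored.

```python
def expected_payout(jackpot: float, win_probability: float,
                    other_winners: int) -> float:
    """Expected winnings when the jackpot is split evenly among all winners."""
    return win_probability * jackpot / (1 + other_winners)


JACKPOT = 200_000_000
P_WIN = 1 / 100_000_000  # assumed odds, identical for every combination

# Same odds of winning either way; only the number of co-winners changes.
alone = expected_payout(JACKPOT, P_WIN, other_winners=0)
shared = expected_payout(JACKPOT, P_WIN, other_winners=20)

print(f"sole winner:   ${alone:.2f} per ticket")
print(f"split 21 ways: ${shared:.4f} per ticket")
```

The probability of winning is the same in both cases. What avoiding popular numbers changes is the likely value of `other_winners`, and with it the expected payout.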

What does this have to do with debugging? It illustrates that you can often improve your chances of success, in addition to saving time and energy, by remembering that computers are built, programmed, and used by people. Psychologists use the term "Theory of Mind" to describe people's ability to conceive of others as having mental states, intentions, beliefs and so on. The minds of those involved in the system, particularly users, should be taken into account when trying to understand reported problems.

Tomorrow: Using assumptions about mental states to fix things faster

I found a nifty drop-down menu that I want to use to conserve space, but right now everything is a little wonky so bear with me. Eventually I'm going to compress everything down, remove the bottom boxes, and create more space for the content, which I think will be easier on the eyes. Until then, there is some duplication and the menu is hard to navigate, so please accept my apologies.

Follow up: I've finished the main revision and I'm very satisfied with the results. I think there is less screen clutter, at the expense of less information. If you have problems viewing or have other suggestions, please leave them in the comments.
