
When choosing an avenue of attack during the Isolation stage, it's important to keep in mind two different dimensions: probability and testability. Probability is your informed estimate of how likely you believe a particular problem is the cause. Testability is how much time and effort you suspect it will take to rule that particular cause in or out. Ideally, your most probable causes would be the most testable, but it rarely works out so nicely. Ultimately, you have a simple 2x2 matrix of possibilities, and you can place each theory in one of the sectors:

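[Figure: Probability vs. Testability, a 2x2 matrix with the quadrants "Ideal", "Why Not?", "Necessary Evil", and "Last Resort"]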

The first theories to try are of course the Highly Probable-Easily Testable ones, labeled "Ideal" in the matrix. Next is a judgement call. If you have some very Easily Testable theories that are fairly Improbable, it might make sense to take an hour to knock them all out. These are labeled "Why Not?". On the other hand, if you have a Highly Probable theory that might take some effort to test, it could be much more valuable. These are labeled "Necessary Evil". Finally, if you've exhausted all other possibilities, it's time for the Low Probability and Hard to Test theories, labeled "Last Resort". Before you start trying to follow up on these, take another long look at what has already been tried, the data you've already collected, and any other information that might help you see a glimmer of another possibility before wasting a lot of effort on an unlikely theory. However, sometimes there's no other choice.

Rather than just assigning a label to each theory, it can also help to simply draw out the matrix above and plot your theories on an X/Y axis, where upper-left is best and lower-right is worst. This can help you easily see both where you ought to start, and how much work you are in for before starting in on an extended isolation exercise.
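If you'd rather keep the matrix in a script than on a whiteboard, something like the following might do. It is only a sketch: the 0-to-1 scores, the 0.5 threshold, and the `Theory`, `quadrant`, and `triage` names are my own assumptions for illustration, and the choice to run "Why Not?" ahead of "Necessary Evil" is exactly the judgement call described above, so swap the order if your situation calls for it.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    name: str
    probability: float  # your estimate, 0.0 (unlikely) to 1.0 (near certain)
    testability: float  # 0.0 (very hard to test) to 1.0 (trivial to test)


def quadrant(theory, threshold=0.5):
    """Map a theory onto one of the four labels from the matrix."""
    probable = theory.probability >= threshold
    testable = theory.testability >= threshold
    if probable and testable:
        return "Ideal"
    if testable:
        return "Why Not?"
    if probable:
        return "Necessary Evil"
    return "Last Resort"


# Assumed working order; the middle two are a judgement call.
ORDER = ["Ideal", "Why Not?", "Necessary Evil", "Last Resort"]


def triage(theories):
    """Sort theories by quadrant, most probable first within a quadrant."""
    return sorted(
        theories,
        key=lambda t: (ORDER.index(quadrant(t)), -t.probability),
    )


theories = [
    Theory("race condition in cache", 0.8, 0.2),
    Theory("disk is full", 0.2, 0.9),
    Theory("bad config value", 0.9, 0.9),
    Theory("faulty memory", 0.05, 0.05),
]
plan = [(t.name, quadrant(t)) for t in triage(theories)]
```

Here `plan` lists the theories in the order you would attack them, each paired with its quadrant label.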

Debuggers are hindered by the lack of a language for talking about the stages of attacking a problem. When someone says, "I'm debugging that server crash", is it almost fixed? Do they know what the problem is but are unsure how to fix it? Do they even know what the problem is?

To address this problem, I am proposing the following six stages of debugging a problem:

Instantiation - A bug has been found, but it has not yet been clearly defined. In other words, someone has told you something is wrong, but the nature of the problem is not yet understood. The simple declaration of a bug is enough to get into this stage, and the bug remains instantiated until the verification process has begun.

Verification - After the bug has been instantiated, its existence must be verified. This means giving the bug the prima facie test: does the described behavior, on its face, actually constitute a bug? Many bug reports can be thrown out in this stage because they describe the expected behavior of the system (in which case the bug may be a request for change, or a simple misunderstanding), because they describe problems originating outside the application, or because they are so vague as to be impossible to fix, such as "System was slow". If the bug appears reasonable, the recreation stage is entered. Otherwise, it can be rejected outright or sent back to its creator for more information.

Recreation - The next stage is recreating the problem in some inspectable way. Originally, I wanted to call this stage "replication", but I don't want to overload that term. Some bugs don't have a natural "replication" mode, but can be recreated. For instance, "query performance is bad on query X". There is not much to replicate, other than to confirm that the problem exists as stated. However, in most cases, this stage will consist of the process of replicating the stated bug through a series of specific steps.

Isolation - This is the process of filtering out all the stuff that is not wrong, and reducing down to the point or points of failure. For many bugs, especially those that were easy to replicate, this is where the bulk of the work is spent. When isolation is complete, you should have a very clear understanding of what is wrong, and how to go about fixing it.

Repair - Once the bug has been isolated, one or more fixes must be applied. It may turn out that the isolation was incorrect, and in many cases, a debugging session will bounce back and forth between isolation and repair.

Validation - Finally, once the bug has been repaired, the fix has to be validated. In some instances, this stage will be trivial due to steps taken in the repair or isolation stages, such as when a test case is used to isolate the problem which now passes, or when a page refresh is all that is needed to see the improvement. In other cases, the fix must be tried in an operational setting, to verify that the thing that you fixed is the thing that was actually broken.

To recap:

  1. Instantiation
  2. Verification
  3. Recreation
  4. Isolation
  5. Repair
  6. Validation

So when someone asks you where you are with the server crash, you can now say, "I've verified the problem and am working on recreation", or "I've isolated the problem and I'm working on repair". This allows others to better understand how much progress is being made, and to increase communication with peers and with management.
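If your team tracks bugs in code or in a script, the six stages map naturally onto an ordered enumeration. This is a minimal sketch, not part of any real bug tracker; the `Stage` and `status` names are made up for illustration.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six stages, numbered in the order given above."""
    INSTANTIATION = 1
    VERIFICATION = 2
    RECREATION = 3
    ISOLATION = 4
    REPAIR = 5
    VALIDATION = 6


def status(bug, stage):
    """Produce the kind of one-line progress report described above."""
    if stage is Stage.INSTANTIATION:
        return f"{bug}: reported, not yet verified"
    previous = Stage(stage - 1)
    return (
        f"{bug}: finished {previous.name.lower()}, "
        f"working on {stage.name.lower()}"
    )


report = status("server crash", Stage.REPAIR)
```

Because the stages are ordered, a session that bounces from repair back to isolation is just a move to a lower-numbered stage, which is easy to spot in a history of status reports.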

I've encountered a frustrating roadblock in a few distance debugging situations recently, and I have been looking for a term to describe the problem. Here's the issue, in a nutshell:

A sticky problem on a remote system has been narrowed down to a likely cause, such as the interaction of two different applications. A test has been proposed to verify this assessment, which is to remove one of the applications, and run the test procedure again. However, the remote resource balks, saying "But we'll need both of those applications on the real system!" The problem is that they have conflated testing and fixing. In other words, they have taken the suggestion of "remove an application to test the theory" as "remove the application to fix the problem", when one does not have to imply the other. Unfortunately, mixing these two issues together creates a mental roadblock where you can no longer make headway by ruling a particular cause in or out.

So what can be done? I've found that the first thing to do is make the test/fix separation explicit. Tell the remote resource, "I understand that this is not a permanent solution, but until we've verified that the application interaction is the problem, there is no point in pursuing other solutions." Another tactic, which addresses their fears directly, is to offer a test procedure that involves undoing and then redoing the piece that your remote resource insists is untouchable. In the conflicting applications example, this might be, "Remove one of the applications and check to see if the problem still occurs. Then reinstall that application and try it again to verify that the problem is back again." Finally, make it clear that you understand the constraints by saying things like, "I know that you will need both applications in the long run. I'm sure that once we've narrowed it down, we can find out what the conflict is and resolve it." Once you have shown that your focus is on getting to the cause rather than proposing a solution, your remote resource should be more receptive to allowing you to try something out.

I worked as a computer administrator for a small Mac-based network during my college years. Things ran fairly smoothly most of the time, but one event sticks out in my head from my time there. I was sitting at my desk doing some routine maintenance when one of the staff ran up to me saying, "One of the students said that the computer she was working at has a virus!" Panicked and fearing that I'd forgotten to update the virus definitions or otherwise fallen asleep at the switch, I rushed over to the computer in question. Nothing immediately appeared out of the ordinary, but my first self-preservation instinct was to yank the network cord out of the back, shut all non-essential programs down, and run a fresh virus scan to see what we had been infected with. I paused though, because something didn't seem right.

It occurred to me: what did this student see that made them think there was a virus infection? I know what one looks like because I've had to fix infected computers and seen the bizarre unkillable processes, random pop-up windows, sluggishness, etc. However, most people, when they think of a virus, picture a giant window popping up, critters dancing around the screen, and giant text reading "You have gotten the PDQ virus! I will now delete all your files!" I wish viruses advertised their presence so readily, as it would save me a lot of time. While considering what might have conveyed the presence of a virus, I glanced at the browser window the student had left open, showing a page with a banner ad at the top. The banner was flashing blue and purple and said "YOUR COMPUTER IS INFECTED WITH A VIRUS!!!". With a chuckle, I explained what had happened to the staff and went back to my normal routine.

Distance Debugging often means that you have to take someone else's account of a situation. It can be easy to forget that you are working from second-hand data and from someone else's interpretation of what they observed. What can you do to help understand others' perspectives and observations?

  1. Be very aware of the "layman's" terminology in common technical domains in order to help clarify seemingly bizarre support requests. For instance, I've noticed that many people use a kind of synecdoche and say "Internet" where they mean "Web", as in "the Internet is down" to indicate that they can't get to websites.
  2. When working with other technical people, think about their background and biases. Do they likely know what they are talking about in the domain they are working in? What evidence do you have that their mental model of the system matches or does not match your own? Are they naturally distrustful of certain applications or systems? Would there be any reason for them to obfuscate or otherwise manipulate the information being presented (for instance to cover their own or another's mistake)?
  3. Reflect on communications negotiations, successes, and failures.  Did you successfully solve a problem because you looked at it through another's eyes?  Were you able to translate from their description to a correct representation of the problem?  Did you get frustrated? Did you miss critical details? What words were used that might be useful to file away in a "translation guide" for dealing with a particular individual or a class of individuals the next time?
  4. Be careful of chronology.  We have a tendency to forget when a certain piece of knowledge became known to a certain person, and can either come to erroneous conclusions or dismiss valid ones by saying, "They wouldn't have done X because they knew Y", when in fact Y couldn't have been known by them at the time.

Cultivating theory of mind skills will not only serve you well in a debugging setting, but can help in almost any interpersonal situation. Most of the time, we take for granted our ability to consider the minds of others, but when we fail to do so, we risk making serious errors in judgment.

Yesterday covered some "don't"s, and today we'll cover the "do"s:

  • Segmented Logging - Besides rolling the logs at intervals to allow for multiple files, it also makes sense to segment your logging into different tiers by seriousness and verbosity. I like to use at least these four logs as available places for content:
  1. The Standard Out Jungle - Anything goes in this log. Here, developers can spit out pretty much whatever they'd like without fear of cluttering up the important logs. Data structure dumps, "Here", anything that helps them observe and diagnose the running system. However, it's a shared resource, so expect to go digging through other people's Standard Out junk as well. This log is not serious, but it may be verbose.
  2. System-at-a-glance - A concise summary of every logged message that the running system produces. Includes a timestamp, severity, short summary, and a reference number for each message. This is the serious, but not verbose log. A remote, non-technical resource should be able to quickly skim this file and visually determine if anything important/worrisome is happening.
  3. System-debug - A verbose explanation of the information being logged in System-at-a-glance. If a skim of the at-a-glance log seems to indicate a problem or something that requires further investigation, the reference number associated with the message can be used to cross reference the message in the verbose log. In fact, these two logs receive the exact same set of messages (and this is done automatically by the logging layer, not relying on users to write to them both) but with different information culled from the message to keep them in sync. This is the serious and verbose log.
  4. Critical-at-a-glance - This is the log that you can check every morning. In general, it should have nothing in it. The appearance of anything means that there is a serious problem that needs to be addressed immediately because it is unresolvable without human intervention and will have deleterious consequences. The information in the System logs might be beneficial for digging into the problem, but they contain significantly more information and so are not ideal for a daily review.
  • Think About your Reader - Who do you expect to read the log file? You will probably see it, eventually, but there may be a lot of other eyes on it first: the user, other admin or IT support staff who are local to the system, and possibly more. Ideally, you'll never have to see the log because either a) the information is so clear that someone else can handle it (not very likely) or b) the log has the important information carefully outlined and packaged so that a relevant section can be sent off to you (hopefully very likely). Automated tools for log mining have their place, but they often strike me as arising from sloth, i.e. they imply that you'd rather spend hours slicing and dicing a 500MB log file every time an error occurs than spend a day or two upfront cleaning up and organizing your logging. That approach also fails to take into account the reality of a distance debugging situation. Big logs don't email well, and relying on log mining totally rules out the possibility that a system-local resource might be able to tell you the important stuff, unless they want to become log mining gurus themselves. As was hinted at yesterday with Alarmist Logging, you also don't want the user to open up the log to find thousands of lines of their personal data interspersed with other random outputs about fatal errors and who knows what. It certainly will not instill them with a sense of confidence.
  • Practice Debugging with Logs - It's hard to know what information will prove useful, but if you implement a logging policy and infrastructure early on, you can start using it to try to debug problems in the development phase, to prove that you will be able to do it in production. This will show what you need to put in the logs, and how much logging is sufficient.
  • Institute a Logging Clean-up Phase of Development - To avoid logorrhea, make sure that before any release is cut, the code is inspected for useless or misleading log statements. This can be executed just like a standard code review, but it is pretty quick since you can just jump around from log() call to log() call and simply question the necessity and validity of each. In many cases, all that is needed is a redirection of the message from the System log to the Standard Out Jungle, which can be disabled in the production system.
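To make the segmented logging idea concrete, here is a rough sketch using Python's standard `logging` module. Treat it as one possible arrangement rather than a prescription: the logger name `system`, the `ref` and `detail` fields, the `ReferenceFilter` and `build_system_logger` names, and the message formats are all my own assumptions. The point it illustrates is that a single call feeds the at-a-glance, debug, and critical logs, and the logging layer (not the developer) keeps them in sync through a shared reference number.

```python
import io
import itertools
import logging


class ReferenceFilter(logging.Filter):
    """Give every record a reference number and a default (empty) detail."""

    def __init__(self):
        super().__init__()
        self._numbers = itertools.count(1)

    def filter(self, record):
        record.ref = next(self._numbers)
        if not hasattr(record, "detail"):
            record.detail = ""
        return True


SUMMARY = "%(asctime)s %(levelname)-8s #%(ref)d %(message)s"
VERBOSE = SUMMARY + " | %(pathname)s:%(lineno)d | %(detail)s"


def build_system_logger(glance, debug, critical):
    """Attach three handlers to one logger; call this once at startup."""
    logger = logging.getLogger("system")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addFilter(ReferenceFilter())
    for stream, fmt, level in (
        (glance, SUMMARY, logging.INFO),        # System-at-a-glance
        (debug, VERBOSE, logging.INFO),         # System-debug
        (critical, SUMMARY, logging.CRITICAL),  # Critical-at-a-glance
    ):
        handler = logging.StreamHandler(stream)
        handler.setLevel(level)
        handler.setFormatter(logging.Formatter(fmt))
        logger.addHandler(handler)
    return logger


# In-memory streams keep the sketch self-contained; a real system would
# more likely use logging.FileHandler or a rotating handler per log.
glance_out, debug_out, critical_out = io.StringIO(), io.StringIO(), io.StringIO()
log = build_system_logger(glance_out, debug_out, critical_out)
log.info("Nightly import finished", extra={"detail": "rows=1042 elapsed=3.2s"})
log.critical("Disk full on /data", extra={"detail": "free=0 bytes, writes failing"})
```

After these two calls, the at-a-glance and debug streams each hold two lines with matching reference numbers, and the critical stream holds only the second message. The Standard Out Jungle would be a separate logger entirely, so that it can be switched off in production without touching the serious logs.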

The moral of the story is: treat logs as a key debugging resource. You can significantly improve their value to you and others with a small amount of time spent on the details of what gets logged and how it is recorded.

Most applications produce some kind of log. Often it is the primary or only resource available to debug a problem. Yet few applications have any unified approach to figure out what should be logged, how it should be logged, and why it should be logged. Most of the time, while there is a shared logging API available, developers use it haphazardly and to record a wide variety of information. There are a few documents on the web such as this one that try to indicate some logging best practices. Over the next few posts I will try to explain some of my own best logging practices and contribute to this body of knowledge.

At the most basic level, you need to ask yourself, if my log files were all I had available, how confident would I be in my ability to fix bugs? If you have little confidence, then you are probably doing it wrong. Before discussing how to create logs that build confidence, let's talk about some common mistakes:

  • One Log File Syndrome - There is a misguided desire to put all that is logged into a single file. This is a very bad idea, for reasons that will soon become clear. Having a logging standard that assumes more than one log file offers a built-in way to segregate data, and quick access to the data that you need is critical in a distance debugging situation.
  • Logorrhea - Developers like to put things in the log file, and they hate to take them out. This often leads to a situation where the log becomes an indiscriminate mess of random garbage. Worse, it kills system performance, often mysteriously (until you put it in a profiler and see 100M calls to log()).
  • Alarmist Logging - Have you ever opened a log file to see messages like "FATAL Error: File Already Open"? The terminology used within the log can send your distance resource into fits if they see a bunch of "fatal" errors, even if the error was in some sense "fatal" in that it caused the operation to fail. The way that logs are phrased, and the overall tone of the messages, can help guide a remote reader to the actual issues in the log. Alarmist logging will make them think that their system is milliseconds away from disaster.
  • Logging what ain't broke - Logs are for capturing information that will probably be useful for debugging, should the need arise. Sometimes it's not clear what is useful, and so having examples of success to compare against can help. But I've seen too many logs that are 99.9% lines that say "Commit Successful". If you have to spend more time ignoring success data than you do reading and analyzing failure data, you need to revisit your logging.
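A quick way to find out whether a log suffers from the last problem is to measure it. The helper below is a throwaway sketch; the `success_ratio` name and the default marker text are assumptions for illustration, so substitute whatever your own success messages look like.

```python
def success_ratio(lines, marker="Commit Successful"):
    """Fraction of non-blank log lines that only report success."""
    lines = [line for line in lines if line.strip()]
    if not lines:
        return 0.0
    return sum(marker in line for line in lines) / len(lines)


sample = [
    "10:00:01 Commit Successful",
    "10:00:02 Commit Successful",
    "10:00:03 ERROR: connection timed out",
    "10:00:04 Commit Successful",
]
ratio = success_ratio(sample)
```

For the sample above the ratio is 0.75. If a production log comes back close to 1.0, most of what your reader has to wade through is success noise.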

Tomorrow: Logging at a Distance, Part 2

Skepticism and Cynicism may not seem like opposites, but in the world of debugging, they often are. As discussed yesterday, the Cynical Debugger says "I believe that things can and will go wrong, and I want to plan for that." Balancing that attitude, the Skeptical Debugger says "I don't believe that bug report" and "There's probably a much simpler explanation". An expert debugger knows how to manage the tension between wanting to accept the crazy things that happen to an application in the wild, and wanting to dismiss bad or highly improbable data.

When a bug report is submitted to me, one of the first questions I like to ask myself is: what of this information, if any, seems totally out of line? As described in this post about saying 'no' to a bug report, an application can't be all things to all people, and it is necessary to say 'no' at times. Reading the report with a cynical eye will give you plenty of reasons to say 'yes', but reading the report with a skeptical eye will tell you why you can say 'no'. It is what prevents you both from catering to the whims of particular users, and from embarking on a wild goose chase for a bug that occurs under questionable circumstances. Take for example a bug report like this:

"I was testing the latest build of the client software, and when I sent a login request to the server it spit out a stack trace back at me. Please fix the server."

It's quite possible that the latest build of the client software uncovered some bug in the login procedures on the server. But it seems much more likely that the thing that changed, i.e. the client software, is to blame and is now doing the login procedure incorrectly. Spending a lot of time investigating this "server" bug is very likely going to result in simply figuring out what the client is doing wrong and then telling them about it. Unfortunately, people often want to blame the component that is producing the error and not think about what has changed. With a skeptical eye, you can step back and ask "why is this bug happening now?", and try to communicate that you are not absolving the server of responsibility, you are simply skeptical that a properly functioning server that has not been changed would suddenly fail.

The same goes for bug reports that look like, "My computer has been acting flaky all day, and your application keeps crashing. Please fix your application." Users want something to blame, and if you have a reputation as a responsive team that can fix things in a hurry, these kinds of bug reports will only increase. The key is again, remaining skeptical, and communicating your skepticism to others. "I'm guessing that your computer acting flaky is the cause of the application crashing, and not the other way around. You normally don't have any problems with our application, and your computer has acted flaky many times in the past, right?" These kinds of skeptical statements help clarify that you are open to fixing real problems with your and even others' applications, but you are not all-powerful.

Distance Debugging is ultimately about figuring out where to apply your limited resources in a highly uncertain environment. Being appropriately skeptical can help you avoid getting trapped into expending these resources in inappropriate, counterproductive ways while still letting your team and your users understand that you will invest the necessary resources when the situation warrants.

One of the qualities that I look for in other developers to signal that they have the requisite experience is what I call "Healthy Cynicism". Programmers fresh from school, or who have only been given token work in the past have an attitude that says, "Things will work out". That's a great attitude to have when you are building something new, but it's counterproductive when you need to debug something. Debugging is about expecting the worst. I have a lot of conversations with people that go like this:

Me: "So if the user decides to minimize and then maximize the window in the three seconds while the application is loading, the system will segfault."

Cheery Programmer: "Will that ever happen?"

Me: "I'm reading to you from the bug report database."

Certainly cynicism can get out of hand, and lead to a situation where a group becomes mired in inaction because everything seems likely to result in failure. The attitude of a Healthy Cynic is not that failure is inevitable, but that any misuse of the system you can think of will probably be done, and any bad system state you can imagine will be entered at some point. The Healthy Cynic says, let's be serious about what bad things can go wrong and make sure our system can handle them.

In debugging, the attitude is invaluable. As I've alluded to before, part of the success in debugging is simply accepting that a bug exists in the first place. For example, I see replication as much about being able to prove you fixed the bug as it is about proving the bug exists. A Healthy Cynic looks at a bug and says, "Yeah, that could probably happen", and has a jump on others who are still trying to explain the bug away. A Healthy Cynic also assumes that users are mostly stumbling through your application and will do ridiculous things with it. It would be nice if they didn't, but pretending that they will stay on your nice rutted paths is a recipe for disaster.

This combination of considering worst-case scenarios and then trying to build a system which anticipates and handles them, and treating bugs as real and likely right from the start is what makes Healthy Cynicism a core debugging skill. Unfortunately, it's not enough without an offsetting force. Tomorrow's post will discuss the flip side, Healthy Skepticism.

When good designers build systems, they take a lot of criteria into account, including common ones such as how well the design meets the requirements, maintainability, and extensibility. One aspect that is often missed is debuggability, or how easy it will be to fix a problem with the system when it occurs. Like other criteria, debuggability can mean making sacrifices in the early stages, for example giving up potentially beneficial complexity that would have improved performance.

Consider the problem of using an artificial neural network for classification. They often give excellent results, and can learn basically an arbitrary association between input features and output classification given enough nodes. They can generalize from a set of training inputs to a set of new inputs, and are often an ideal solution for large data set partitioning. They have one big drawback though: if they fail to classify something correctly you have pretty much no idea why. The problem is buried somewhere in the weights and connections of the nodes in the network. There is little debugging that can be done directly, with your only option being more training of the network in the hopes that it will solve whatever issue it is having. At the opposite end of debuggability is rule-based classification. While it may be time-consuming to create an appropriate set of rules to classify all documents correctly, and newly arriving documents might require new rules to be added in the future, it should be perfectly clear how the resulting classification was reached.

If you were building a system with a classification component, you might be inclined to use the ANN solution because of the speed and power, but the possibility looms that you will pay for it with counterintuitive, hard-to-fix bad classifications. If you take debuggability into account, you would likely avoid this type of design solution, opting for either the rule-based approach in the early going, or possibly a hybrid solution that uses the statistical approach for a quick classification, and then a rule-based approach to prevent common mistakes.
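As a rough illustration of the hybrid idea, the sketch below lets an opaque scorer make the quick call and then lets a short list of explicit rules override it, recording why. Everything specific here is assumed for the sake of the example: the spam/ham labels, the keyword scorer standing in for a trained network, the `ALLOW_LIST`, and the two rules.

```python
ALLOW_LIST = {"boss@example.com"}  # hypothetical trusted senders


def statistical_guess(text):
    """Stand-in for an opaque model: it gives a label but no explanation."""
    words = text.lower().split()
    hits = sum(word in {"free", "prize", "winner"} for word in words)
    return "spam" if hits >= 2 else "ham"


# Each rule: (human-readable reason, predicate, label to force).
RULES = [
    ("sender is on the allow list",
     lambda text, meta: meta.get("sender") in ALLOW_LIST,
     "ham"),
    ("message body is empty",
     lambda text, meta: not text.strip(),
     "spam"),
]


def classify(text, meta):
    """Return (label, explanation); rules override the statistical guess."""
    guess = statistical_guess(text)
    for reason, applies, label in RULES:
        if applies(text, meta):
            return label, f"rule: {reason} (model said {guess})"
    return guess, "model only"


overridden = classify("free prize winner inside", {"sender": "boss@example.com"})
untouched = classify("free prize winner inside", {"sender": "stranger@example.net"})
```

The debuggability payoff is in the second element of the result: when a rule fires you can read exactly why the label was chosen, and when it says "model only" you at least know which classifications you cannot explain.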

The ANN solution is an extreme example of something that is not debuggable, but many sophisticated algorithms suffer from this problem to one degree or another. This isn't to say that these solutions cannot be used, but they must be used with caution and when you have a working fallback. On the other hand, there are other cases where you can build capabilities into your system that help with debugging and don't require a functional tradeoff in general. A summary of some of those capabilities will be the subject of the next post.

In my post on Thursday, I gave some example interview questions with the third one being a question about the detailed workings of one of several different computer scenarios including loading a web page, booting a computer, and a couple others. These questions are looking for domain knowledge, i.e. information about general fields of knowledge within computer science and practice versus information about a specific piece of hardware or software. You might well ask, why does he care if someone knows this stuff? Isn't the knowledge about that particular system better? Sure, but it would be way too time consuming to verify, for example, the details of the networking stack for an application, and in some cases the information isn't even available. You have to assume that things have been done in a sensible way or according to a known specification or pattern, and once you make those kinds of assumptions, your mental model of how that specification or pattern operates comes into play.

Mental model theory is a well-known psychological theory of human reasoning that says that rather than using deductive reasoning to solve most problems ("if X then Y" type stuff), humans build little models of situations in their head and refer to them to answer questions. It makes intuitive sense if you reflect on your own thinking. It is important, though, that when you are forced to rely on your mental model instead of being able to test things directly, your model be accurate, or you will jump to nonsensical conclusions about how and why bugs are occurring.

For instance, suppose you are working on a networked system and you are investigating a bug where some data is being lost in transmission. I have a mental model of network transmission where I imagine the data being snipped into little packets, tagged with an endpoint location and then relayed through a bunch of other computers to get to its destination. If pushed, I can get into the details of TCP stacks, how routing works, even the low-level details of ethernet transmission, but most of that doesn't really matter and so my default mental model is "good enough". In that model, I can easily see situations under which some data might get through but other data is lost, since I know that the data is being turned into packets, all of which may take a different set of hops through a network. However, if your mental model of a network imagines some sort of direct connection between the two machines with data simply being passed from one to the other in a big chunk, a partial transmission might seem totally baffling. This is an extreme example, and I assume that most developers with a basic CS education understand at least about packetizing, but I've been wrong before.
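The packet-based mental model is simple enough to write down as a toy. To be clear, this is not how any real protocol is implemented (TCP, for one, would notice the gap and retransmit); it only shows why "some of the data arrived" is an unsurprising outcome once you picture the data as independent packets. The function names and the four-byte packet size are invented for the example.

```python
def packetize(data, size):
    """Snip the data into numbered chunks of at most `size` bytes."""
    return [
        (seq, data[start:start + size])
        for seq, start in enumerate(range(0, len(data), size))
    ]


def deliver(packets, lost):
    """Pretend network: packets whose sequence number is in `lost` vanish."""
    return [(seq, chunk) for seq, chunk in packets if seq not in lost]


def reassemble(packets, total):
    """Join whatever arrived and report which sequence numbers are missing."""
    received = dict(packets)
    missing = [seq for seq in range(total) if seq not in received]
    payload = b"".join(received[seq] for seq in sorted(received))
    return payload, missing


sent = packetize(b"ABCDEFGHIJ", 4)
arrived = deliver(sent, lost={1})
payload, missing = reassemble(arrived, total=len(sent))
```

With the middle packet dropped, `payload` is `b"ABCDIJ"` and `missing` is `[1]`. Under the "one big chunk over a direct wire" model there is no way to even describe that result.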

So in the example interview question, what I want to hear about is their mental model, and to a lesser extent, their direct technical knowledge of a particular specification or implementation, although I don't expect them to quote RFCs to me. Also, I want to see how aware they are of their own thinking. I would much rather hear, "I don't really know what happens between here and here", than have them make something up that is patently false. We take these mental models for granted and it is easy to overlook flaws or gaps for a long time because we are never called upon to reason about a situation with that level of fidelity. In the world of debugging though, you will need to be very aware of how you are imagining the situation and what conclusions you are drawing from that world inside your head.
