Distance Debugging

When good designers build systems, they take a lot of criteria into account, including common ones such as how well the system meets its requirements, maintainability, and extensibility. One aspect that is often missed is debuggability, or how easy it will be to fix a problem with the system when it occurs. Like other criteria, debuggability comes at a cost: it can mean giving up potentially beneficial complexity in the early stages, such as complexity that would improve performance.

Consider the problem of using an artificial neural network (ANN) for classification. Neural networks often give excellent results, and given enough nodes they can learn basically an arbitrary association between input features and output classification. They can generalize from a set of training inputs to a set of new inputs, and are often an ideal solution for partitioning large data sets. They have one big drawback though: if they fail to classify something correctly, you have pretty much no idea why. The problem is buried somewhere in the weights and connections of the nodes in the network. There is little debugging that can be done directly, and your only option is more training of the network in the hope that it will solve whatever issue it is having. At the opposite end of the debuggability spectrum is rule-based classification. While it may be time-consuming to create an appropriate set of rules to classify all documents correctly, and newly arriving documents might require new rules to be added in the future, it should be perfectly clear how the resulting classification was reached.

If you were building a system with a classification component, you might be inclined to use the ANN solution because of the speed and power, but the possibility looms that you will pay for it with counterintuitive, hard-to-fix bad classifications. If you take debuggability into account, you would likely avoid this type of design solution, opting for either the rule-based approach in the early going, or possibly a hybrid solution that uses the statistical approach for a quick classification, and then a rule-based approach to prevent common mistakes.
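The hybrid idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `statistical_guess` is a hypothetical stand-in for an opaque model such as an ANN, and the `Rule` and `Decision` types and the example rule are invented for the sketch. The point is that every result carries a reason, so a bad classification can be traced to either the model or a specific rule.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    name: str                       # human-readable label used in the explanation
    matches: Callable[[str], bool]  # predicate over the document text
    label: str                      # classification assigned when the rule fires


@dataclass
class Decision:
    label: str
    reason: str                     # why this label was chosen


def statistical_guess(text: str) -> str:
    """Hypothetical stand-in for an opaque statistical model."""
    return "spam" if text.count("!") >= 3 else "ham"


def classify(text: str, overrides: List[Rule]) -> Decision:
    # Quick first pass from the opaque model.
    guess = statistical_guess(text)
    # Rules run afterwards to correct known mistakes; the first match wins.
    for rule in overrides:
        if rule.matches(text):
            return Decision(rule.label, f"rule '{rule.name}' overrode model guess '{guess}'")
    return Decision(guess, "model guess, no rule matched")


# Hypothetical rule: mail mentioning an internal address is never spam.
RULES = [Rule("internal sender", lambda t: "@example.com" in t, "ham")]

print(classify("Lunch at noon!!! from bob@example.com", RULES))
print(classify("You won!!! Claim now", RULES))
```

When a rule fires, the reason names it, which is exactly the kind of traceability the pure ANN approach lacks.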

The ANN solution is an extreme example of something that is not debuggable, but many sophisticated algorithms suffer from this problem to one degree or another. This isn't to say that these solutions cannot be used, but they must be used with caution and when you have a working fallback. On the other hand, there are other cases where you can build capabilities into your system that help with debugging and don't require a functional tradeoff in general. A summary of some of those capabilities will be the subject of the next post.

Google has recently announced a new open source crash reporting client-server library called Airbag. As described in this article, the idea is to replace the closed source TalkBack crash reporting system currently used in Firefox, to begin with, and then extend it further to other applications. The idea of a client-server crash reporting system is not new. On Windows, the Dr. Watson "Application X has quit unexpectedly. Send report to Microsoft?" box is somewhat ubiquitous, and there are equivalents on other operating systems such as Bug Buddy on Linux.

While Google will probably make a few friends with this new reporting system, I have to wonder whether these types of blind crash reporting tools are really all that valuable. They suffer from some serious problems despite their advantages:

  • The staggering amount of data that is collected
  • The lack of context information
  • User annoyance and perception

One of the links in the side bar is to the Cooperative Bug Isolation Project, which aims to resolve the first issue by applying statistical techniques to a cluster of reports to help isolate faulty regions. I'm not really doing the project justice, but I recommend checking it out.
In terms of the second issue, I'm not really sure what there is to do. I've seen some debate in the bug reporting tools' mailing lists about whether or not to ask the user what they were doing at the time of the crash. The pro is that it provides more information, the con is that it forces a disgruntled user to stop what they were doing for even more time, and probably increases the likelihood that they will send no report at all. That is true of pretty much any non-automated data collection technique.

The third issue relates to getting the user even more upset about something that they were annoyed with in the first place. For example, every so often, my Windows computer reboots for no apparent reason, always when I am not using it (like in the middle of the night). I only know because when I come in the next day, it is sitting at the login screen instead of where I left it. When I log in, the little "Windows has encountered a serious error!" box pops up and asks if I want to send a report. The thing is, I don't honestly believe they care, and I wonder if that's something that a lot of people submitting these bug reports come to believe. It crashes all the time and I'm constantly sending these bug reports, yet nothing ever seems to get done about it. It reinforces the notion that developers don't have any desire to fix problems.

So what can be done to improve centralized crash reporting?

  • Something I've been thinking about for a while is introducing the notion of a "semantic" stack within a program. Essentially this would be adding code markup that gives a series of method calls a semantic tag like "Opening an Existing Document". Then, in addition to providing the raw stack trace, the system could dump out the semantic stack and also get a sense of what the user was doing in more context. It might look like "Clicked Existing Document Menu Item -> Selected "Foo.txt" from File Chooser -> Crash"
  • Using some of the statistical techniques from cooperative bug isolation, allow a user to look at how common their particular problem is, and from that get a sense of its priority. If each crash that was produced was associated with a key that could be used to query a database, then the user could submit that key to a website that showed them how that bug clustered with other reports. If it is an outlier, perhaps their own system is to blame. If it is one of the most frequently reported, it will likely be fixed.
  • Perhaps link the bug reporting system in with the automatic update checking that many systems do (like Windows Automatic Updates or pup on Fedora) and tie incoming updates to specific bug reporting clusters/keys. That way, a user knows that when they get a particular update, it will fix a particular issue. This helps with the user perception that developers are paying attention.

Crash reporting as it stands seems to be somewhat useful for developers, but mostly a pain for users. I think that a few small changes could go a long way towards improving its utility for both sides.

Sorry about the week-long absence. I took a family vacation and was out of the clutches of the internet for a few days. While it was a nice change of pace, even when I am ostensibly on vacation, I tend to stumble across interesting problems, or in this case, solutions to problems.

I always wonder about credit card fraud prevention when I am on vacation. Basically, I normally have a bunch of charges from one physical location, and then I suddenly have a bunch of charges in a new physical location. On its face, it seems like it would be very difficult to separate out the vacation credit card usage pattern from the stolen CC# pattern. However, using my credit card in another state rarely seems to trigger a fraud alert from my credit card company, while buying a computer that is shipped to the billing address does, for whatever reason. I guess this is what the Bayesian or whatever predictive model they use tells them. This always concerns me though.

This trip, I stumbled upon what I think is a new wrinkle in the out-of-state credit card fraud prevention system. When I went to fill up my rental car's gas tank, I used my card as normal, only this time it asked me for my zip code. At first, this prompt had me totally befuddled since I was expecting it to simply ask me to "lift nozzle" as usual. Then it occurred to me: what an ingenious way to prevent out-of-state usage of a stolen credit card number! If my credit card is stolen and used by some unknown third party, the likelihood that they will have any idea of the billing zip code for that card is almost zero, while a legitimate card user will have a close to 100% chance of knowing it (barring odd edge cases like borrowing a friend's card). It's a great example of getting something right that supposed security systems screw up all the time: a security policy that is at most a small nuisance to legitimate users, but a significant hurdle to illegitimate users. I might be reading too much into this simple prompt, but I'd be curious to hear if any readers have had this experience in their day-to-day credit card usage, and if it is a relatively new enhancement.

Testing is a poorly understood concept within the world of software development, especially unit and blackbox testing. I think the fundamental problem is in understanding the purpose of testing. When I first started as a developer, I thought of testing as something that someone else does to my code to find problems in it, and I fell into the trap of seeing myself as adversarial with testing. Now having built systems with large, extensive unit and functional tests written concurrently with the code, I have a totally different conception of testing. I treat my test cases like formal requirements. If all the tests pass, then I am meeting my requirements. If the system breaks and my tests didn't catch it, then my requirements were not completely specified and I need to modify my test cases.

This philosophy has some interesting consequences. First, developing, testing and debugging end up working in lockstep: every time a feature is added or a bug is found, a test is added for it. One thing I really like about Ruby on Rails is that this idea is built in to the "generator" mechanism in that it creates test fixtures for every controller that you create. Second, I don't code "scared" anymore. I define coding scared as a refusal to make changes or add features because you are afraid of breaking your system in unexpected ways. I hated coding scared, but it's generally where you end up without tests. Since you have no way of quickly and comprehensively verifying the result when you make changes, you have to rely on expensive and time-consuming manual usage to tell you what you've done wrong. Finally, this approach has made me much less resistant to change. In a traditional software development model, you spend a lot of time trying to determine how risky a change is and whether the risk is worth it. I worry about that less than I used to, because if something is going to screw my system up badly, I'll know it right away, and with good iterative development practices, I'll know exactly where things went wrong.
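As a small illustration of tests acting as requirements, here is a sketch using Python's standard `unittest` module. The function `parse_port` and the bug report it mentions are hypothetical, invented only to show the pattern of adding one test per feature and one per bug found.

```python
import unittest


def parse_port(value):
    """Hypothetical function under test: parse a TCP port from user input."""
    port = int(value)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


class ParsePortRequirements(unittest.TestCase):
    # Each test reads as one requirement on parse_port.

    def test_accepts_plain_number(self):
        self.assertEqual(parse_port("8080"), 8080)

    def test_rejects_port_above_range(self):
        with self.assertRaises(ValueError):
            parse_port("70000")

    def test_rejects_port_zero(self):
        # Added after a (hypothetical) bug report that "0" was being accepted:
        # the requirement was missing, so the test suite grows to cover it.
        with self.assertRaises(ValueError):
            parse_port("0")


if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParsePortRequirements)
    unittest.TextTestRunner(verbosity=2).run(suite)
```

Once the third test exists, any later change that lets port 0 through again fails immediately, which is what makes it possible to stop coding scared.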

In my head, I imagine each piece of software as being built on a giant empty surface. As I make changes and add features to the system I am drawing and redrawing a shape on that surface. The problem is, the shape needs to avoid certain regions of the surface, regions which represent bugs or other problems. The tests that I write are little fences that keep me out of those regions, and as I write more tests, the shape that I can fill in becomes more and more clear. That is the true purpose of tests: they are the best way to constrain our systems and guide us to the ideal shape. With the fences in place you can worry less about making missteps and worry more about how to fill in the correct regions.

Lots going on this week. Last Wednesday was one of those days where it just seems like everything is on the fritz. They parked several large trucks out in the alley behind my house and set up a bunch of tables and all kinds of tools and appeared to be basically ripping out the phone network access point and putting in a brand new one. Considering that I've seen a guy out there working on the old one about once a week since I moved in 6 months ago, I'm not surprised they finally decided to scrap and replace it. I don't know if phone service was affected because I have a VoIP line and digital phone service through my cable provider, so basically all the data to and from my house goes through the one pipe.

At the same time though, a bunch of stuff seemed to be affected. Without a better understanding of what they were doing, I'm pretty sure it was just coincidence, but still. DNS was taking forever to resolve, and my VoIP line, which is usually pretty solid, was popping and clicking like crazy, which is what happens when it can't keep the circuit at full speed, so maybe there were some network effects. Also, our HD TiVo decided to freak out and start freezing up every 45 minutes. This is apparently a known issue with the new 6.3a software rollout, but why it hadn't manifested until that day I'll never know. After some searching I discovered that one recommendation was to do a reboot from the TiVo menus rather than wait for it to crash and then power cycle it. That seemed to clear it, so we'll see. Anyway, the next day everything seemed back to normal, so who knows.

The other big task was trying to get my music shared via DAAP from Linux to iTunes and vice versa. I seem to have gotten the first part to work (Linux to iTunes) using mt-daapd, although I had to wrangle up some old Howl RPMs to provide the multicast DNS part for Rendezvous or Bonjour or whatever they are calling it these days. For some reason all the DAAP packages seem to be built with a Howl dependency, and Howl is no longer active, so that was annoying. Then I was able to see my music library from inside of a Windows machine with iTunes.

However, I can't seem to get the iTunes music from inside Amarok (and I can't use the sharing function from inside Amarok itself; iTunes reports an unexpected error in those cases). The ultimate goal was to create a mixed playlist from music from two different machines. However, while I could share music to iTunes, iTunes won't let you add shared songs to your own playlists for whatever stupid reason, and I couldn't go the other direction, which rendered Amarok's ability to do this moot. In the end, I just wound up being happy with what I had, but I'm sure I'll revisit in the future.

In my post on Thursday, I gave some example interview questions with the third one being a question about the detailed workings of one of several different computer scenarios including loading a web page, booting a computer, and a couple others. These questions are looking for domain knowledge, i.e. information about general fields of knowledge within computer science and practice versus information about a specific piece of hardware or software. You might well ask, why does he care if someone knows this stuff? Isn't the knowledge about that particular system better? Sure, but it would be way too time consuming to verify, for example, the details of the networking stack for an application, and in some cases the information isn't even available. You have to assume that things have been done in a sensible way or according to a known specification or pattern, and once you make those kinds of assumptions, your mental model of how that specification or pattern operates comes into play.

Mental model theory is a well-known psychological theory of human reasoning that says that rather than using deductive reasoning to solve most problems ("if X then Y" type stuff), humans build little models of situations in their head and refer to them to answer questions. It makes intuitive sense if you reflect on your own thinking. It is important, though, that when you are forced to rely on your mental model instead of being able to test things directly, your model be accurate, or you will jump to nonsensical conclusions about how and why bugs are occurring.

For instance, suppose you are working on a networked system and you are investigating a bug where some data is being lost in transmission. I have a mental model of network transmission where I imagine the data being snipped into little packets, tagged with an endpoint location and then relayed through a bunch of other computers to get to its destination. If pushed, I can get into the details of TCP stacks, how routing works, even the low-level details of Ethernet transmission, but most of that doesn't really matter and so my default mental model is "good enough". In that model, I can easily see situations under which some data might get through but other data is lost, since I know that the data is being turned into packets, all of which may take a different set of hops through a network. However, if your mental model of a network imagines some sort of direct connection between the two machines with data simply being passed from one to the other in a big chunk, a partial transmission might seem totally baffling. This is an extreme example, and I assume that most developers with a basic CS education understand at least about packetizing, but I've been wrong before.
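The packet-based mental model is easy to play with in code. The following Python sketch is deliberately a toy and makes several assumptions: the function names are invented, packets are lost according to a fixed set rather than a real network, and there is no retransmission (so it behaves more like raw unreliable delivery than like TCP).

```python
def packetize(data: bytes, size: int):
    """Split data into (sequence_number, chunk) packets of at most `size` bytes."""
    return [(seq, data[i:i + size]) for seq, i in enumerate(range(0, len(data), size))]


def deliver(packets, lost):
    """Simulate an unreliable network by dropping packets whose sequence number is in `lost`."""
    return [p for p in packets if p[0] not in lost]


def reassemble(packets, total):
    """Rebuild the message from whatever arrived, marking the gaps."""
    received = dict(packets)
    return b"".join(received.get(seq, b"[lost]") for seq in range(total))


message = b"the quick brown fox jumps"
packets = packetize(message, 5)           # five packets of five bytes each
arrived = deliver(packets, lost={2})      # the third packet never shows up
print(reassemble(arrived, len(packets)))  # b'the quick [lost] fox jumps'
```

In this model a partial transmission is unsurprising: each packet travels independently, so losing one in the middle leaves the rest intact.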

So in the example interview question, what I want to hear about is their mental model, and to a lesser extent, their direct technical knowledge of a particular specification or implementation, although I don't expect them to quote RFCs to me. Also, I want to see how aware they are of their own thinking. I would much rather hear, "I don't really know what happens between here and here", than hear something made up that is patently false. We take these mental models for granted and it is easy to overlook flaws or gaps for a long time because we are never called upon to reason about a situation with that level of fidelity. In the world of debugging though, you will need to be very aware of how you are imagining the situation and what conclusions you are drawing from that world inside your head.

There is a culture of job interview "riddles" among the big tech companies like Microsoft and Google. You can find a collection at this site if you are interested. These are all fine and good, but I always have an important question: what does this have to do with the job? I've railed about the need for authentic assessment in this space before, and this seems to be the exact same problem. Why have me do things in the interview that I won't do in the job? They seem to be falling into the IQ trap, assuming that you can design a test that somehow gets at the fundamental kernel of general intelligence when a) you are very likely only testing a very specific set of skills and b) high-stakes testing almost never tells you anything about a subject's ability. Maybe there is a decent correlation between people who excel at those riddles and people who excel within the company, but I kind of doubt it, and I'm guessing it tends to eliminate good candidates who are just bad at on-the-spot puzzle solving.

So, can I do any better? What would I ask a potential hire for my notional debugging company?

  1. Do you own a set of Torx screwdrivers?
  2. In as much detail as you are able, describe one of the following:
    1. What happens between the time that you enter the URL of a website into your browser and that page displays (or fails to display)?
    2. What happens between the time that you enter an SQL query into a database and the results are returned to you?
    3. What happens between the time that you press the power button on your computer, and the login screen appears on your desktop (in the OS of your choice)?
    4. What happens between the time that you hit send on an email message and the time that the recipient receives the message?
  3. This computer won't boot up. What's wrong with it?
  4. Are there devices in your house that have either software or hardware that they did not ship with, and how and why did they get modified?

The justifications:

  1. This might seem like a silly question, and it borders on a non-authentic question. However, the only people I've ever met who own them are people who like to open things up and find out what's going on inside.
  2. This is the first big hurdle. I am consistently shocked at the gaps and misconceptions people have in their technical knowledge. I figure that a good interview subject could probably spend upwards of an hour on any one of these questions, and the choices are broad enough (networking, databases, operating systems/computer theory, general IT) and common enough that a potential hire should be able to speak about at least one of them at length. They all have interesting possibilities for discussion along the way, and the questions that they ask in response would be informative as well.
  3. This is the critical portion of the interview, and the key to authentic assessment since this is really what they would be doing. I don't care so much if they manage to fix it or not, but I care about how they approach the problem. Do they start by asking me some background questions, or do they just pop the thing open (both of which might have merit)? Where do they start looking if they pop it open? If they ask questions, do they seem to be trying to work on a theory? Are they just totally overwhelmed or bored by the task or do they seem excited to work on it?
  4. Again, bordering on non-authentic, but this speaks to the other thing I look for besides skill: enthusiasm. I like to say that you can teach skill but you can't teach enthusiasm. Someone who hacks the stuff in their house or likes to have hacked stuff in their house is interested in learning about and fixing things independent of their work, and that speaks to engagement with the subject.

So that's my interview in a nutshell. I think it would appropriately identify the people with the requisite skill and interest for a job debugging full-time.

Besides a knack for finding bugs, I also seem to have the ability to find missing "stuff" like a misplaced document or a piece of clothing. My wife on the other hand, for all her other excellent qualities, is terrible at finding things to the point where it has become something of a running gag between us. Additionally, I think that my finding skills have caused hers to completely atrophy and she will make only a token search for something before turning it over to me. As I've looked at the way that we each approach the process of finding, I've noticed some key differences that have led me to believe that one of the underlying skills for good debugging is a good sense of how to find something. Here are a few differences between us:

  1. Confidence - When I go to look for something, I am certain that it will be found. My wife is always convinced that it is lost forever and I think that colors her approach.
  2. Systematicity - I check and recheck a sequence of areas, moving from the area of highest likelihood to the area of lowest likelihood. My wife tends to look quickly through a series of places hoping that it will be easily found and often fails to recheck high likelihood areas. This probably goes back to the confidence issue.
  3. Region Expansion - If after exhaustive search of likely areas I turn up nothing, I will expand the search area to include places where it "couldn't possibly be", and am not surprised to find things in those places. My wife always tells me I'm "wasting my time" since she "never would have put it there." But after turning something up in an unusual place, she generally has a perfectly good explanation for how it wound up there.

Believe it or not, I think I was actually trained to be a good finder at a young age. For instance, my mother created a game called "hiding in plain sight" where I would leave a room and she would hide a very conspicuous item, like a large doll or a colorful block, somewhere in the room. The only rule was that the item had to be completely visible as you walked around the room, in other words, it might not be visible when you first stepped in, but it would never be in a drawer or buried under a pillow. This should give you an idea of what my childhood was like.

Anyway, you might think this greatly limits your possibilities of places to hide something but in fact it really does not. You would be surprised how hard it can be to find even a very noticeable item in a room full of other stuff. I remember going through a series of increasingly sophisticated strategies. At first, I just kind of walked around hoping to spot the item. Then I started focusing more on figuring out where the good hiding places were and checking them first. I also got better at picking out the shape and color of the item and focusing on those features. As I got better, my mom also came up with little things to make it trickier, like hiding the item in the same place in succession, or adding similar items to the room.

The tricks I learned from this game have carried over into my work finding bugs. I always like to reiterate that debugging is a teachable skill, and I think it's clear that many of the underlying skills that constitute the process are teachable as well.  As I do this exploration of debugging concepts on this website, I will try to illuminate and discuss these underlying skills as well.

Two quick things that I have been working on this week:

  1. A couple of months ago, my Linux server crashed because the memory failed (I bought cheap memory, a mistake I will not make again), and when I tried to reboot the machine, the memory errors caused it to scribble all over the file system, thereby hosing my boot partition. Anyway, I just reinstalled the OS since the machine hadn't been up long enough for me to start daily backups, and all my real data was on another disk so there wasn't much to be gained from some sort of file forensics. However, there was one lingering issue that was bothering me. I have my DHCP server set up so that if a client passes in a hostname, it looks to see if that machine should be assigned a particular IP address. This common trick guarantees that my servers always receive the same IP address, while still keeping everyone on DHCP, avoiding IP clashes, etc. This also hooks into the DNS, so that I can request machines by name rather than by IP address without having to hack the hosts file on every machine. The problem is, my rebuilt server kept receiving some random IP instead of the IP it was supposed to be getting.

    After much reconfiguration and fiddling, I noticed that on the log on my router (which serves DHCP and DNS) it was complaining because I had specified the hostname as foo.domain.com instead of just foo and it gave me some complaint about "Ignoring hostname due to illegal domain part". I'm sure some of you are nodding your head and saying, yup, you can only pass in a hostname, not a qualified hostname. Well, now I know, but that's a stupid rule. Why can't it just look at the first part? Changing that setting on the server fixed the problem and now all is happy.

  2. To further sharpen my debugging skills, I thought I would join up with one of those free tech support sites (since I'm not getting any requests directly here) and see if I might be able to help some people out. I decided to go with Suggest-a-fix, since it seems like a nice low-frills, focus-on-the-help kind of site. I've added it to the Links on the side, so feel free to check it out when you have a second.
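On the DHCP hostname issue from the first item: the "just look at the first part" behavior I was wishing for is simple to express. The Python sketch below is purely illustrative and is not how my router's DHCP server works; the function names, the reservation table, and the IP address are all hypothetical.

```python
def short_hostname(name: str) -> str:
    """Return the first label of a possibly fully qualified hostname."""
    label = name.strip().rstrip(".").split(".", 1)[0]
    if not label:
        raise ValueError(f"no hostname in {name!r}")
    return label.lower()


# Hypothetical reservation table keyed by unqualified hostname.
RESERVATIONS = {"foo": "192.168.1.10"}


def reserved_ip(client_hostname: str):
    """Look up a fixed address, tolerating a qualified name such as foo.domain.com."""
    return RESERVATIONS.get(short_hostname(client_hostname))


print(reserved_ip("foo.domain.com"))  # same reservation as plain "foo"
print(reserved_ip("foo"))
print(reserved_ip("bar.domain.com"))  # no reservation, so None
```

A real server may have good reasons to reject a qualified name instead, for example when the domain part conflicts with the domain it is configured to serve.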

I have one rule of thumb that serves me well in my general work as a developer, but also as a debugger: nobody reads anything. Ever. This is generally accepted wisdom among software people, with the conclusion being that you need to build software in such a way that a user could stumble through it without ever having to read a word of documentation. Overall, I think this has a detrimental effect on software because it forces interfaces to focus too heavily on certain stereotyped ways of working at the expense of a better overall task-oriented interface. However, it has become a necessary evil with the alternative being that no one will use your stuff because it's too "confusing".

As a backlash to this mindset, I have cultivated an almost obsessive tendency to read every piece of documentation that comes with things that I purchase or whenever I'm asked to look at a problem. This is a very simple trick that often makes me seem far more capable than I actually am. For example, most digital thermostats have fairly clear instructions right inside the cover about how to set the basic start and stop times, temperature, etc. But most people don't ever look at that, instead attempting to puzzle through it by randomly pressing buttons until they get where they think they need to be. Having a reputation as someone who can "figure things out", I am often asked to "decipher" someone's new thermostat and get it set up for them. I start by flipping open the case and reading those tiny instructions which usually tell me everything I need to know. I then proceed to follow the directions and set the thing up, to the delight of the asking party. They say, "how did you figure it out so quickly? I couldn't get anywhere with that thing!", and unwilling to share my secret I say something like, "oh, I've seen one like this before".

Reading the available documentation is a stunningly simple yet effective technique in any debugging situation. There is often quite a bit available that no one has bothered to look at thinking that it would be tedious or a waste of time. I can't guarantee that it won't be, but you will be surprised at the nuggets of wisdom buried deep in user and administrator manuals. Even if it is not helpful for the situation at hand, the information may prove useful in the future, and it is a great habit to develop.