
As a final send-off to NaBloPoMo, I thought I would give a brief example showing how I might put all the things I've been talking about together when solving a problem. Just to let everyone know, while I likely won't be posting every day going forward, I will try to post at least every other day, on average, so look for plenty of new content here.

A month or so ago, I posted about the terrible tech support that my grandmother received regarding issues with her aging laptop. While I disagree with her decision to keep the poor thing alive, they failed in a basic way to provide her with anything approximating distance debugging. I want to talk about one of those issues, starting from the bug report.

Report: My grandmother told me that, "when I tried to go to the New York Times page, it just opens my Documents folder".

Blink: Nothing yet. This made no sense to me, so I asked her to tell me the story.

Story: "I subscribe to the NYT news updates, and they send me articles by email with links, but when I click it it just opens my documents folder."

Blink: That triggered something: I bet her preferred browser got screwed up in her network settings. Would that actually result in this kind of behavior?

Theory: The network settings preferred browser got screwed up and that is making it impossible for her to follow links embedded in emails.

Data Collection #1: Check the setting of the preferred browser, artifact examination. Luckily I had access to the computer so I could check, but I probably could have navigated her through it over the phone. Confirmed that the browser was unset.

Fix Attempted (we haven't covered this yet, but coming soon): Changed it to Internet Explorer. Clicked link in email, browser opened with page. Fix successful.

However, I was still bothered by the fact that it was changed. Settings don't just flip themselves. Maybe there was some deeper problem of which this was actually just a symptom.

Data Collection #2: Human Inquiry. I asked my grandmother if she'd changed the setting for some reason, or installed any plugins or upgrades that might have affected it. She mentioned that she'd let my cousin install Firefox. Mystery solved.

So in a complete telling of the problem and fix, I would say that the attempt to install a second browser somehow changed her network settings, but not in a proper way, causing it to fail to load any browser when a link was clicked within another application. This would often result in just popping open the My Documents folder for reasons still unexplained, but likely unimportant.

That's a relatively trivial example, where the blink moment was the right answer, but I think it illustrates my overall approach to debugging in a nutshell.

Coming Soon: Let's Talk about Fix, Baby

Looking to chat about debugging issues, or have a comment, suggestion, or idea that you don't want to post about publicly?  Feel free to email me at holman (at) distancedebugging (dot) com.  I check that email fairly regularly, but I do have a day job that prevents me from rapid responses during the day.  However, I'll try to get back to you as soon as I can.

I am a software engineer, developer, and designer by trade. In my experience, I am especially good at debugging and that's probably because I spent a ridiculous amount of time doing it. You know how some people are obsessive about keeping a perfectly clean and orderly house? I am that way about my computer's operation. If I notice anything broken or simply not working the way I would expect it to, I will endlessly investigate until it is solved. Over the years this has meant that I've developed an oddly comprehensive knowledge of things that go wrong with computers. I have also, with a bit of introspection, worked up a theory to try to teach others how to get better at debugging in general. My hope for this blog, and this site in general, is to improve readers' debugging skills. If it accumulates interesting stories of my own and others' debugging travails along the way, so much the better.

Besides my work with computers, my other main interest is education, and I actually went so far as to obtain an Ed. M a few years back. One notion that stuck with me long after I left was that of authenticity, specifically authentic assessment. In contrast to the more prevalent high-stakes testing, authentic assessment is about testing people by making them do the thing itself rather than asking questions about it. The example we used in the class is that of a chef. If you were hiring a chef, how would you gauge their ability? It seems fairly obvious that you would have them cook you a meal. Contrast that with most classrooms, where they would be asked to draw a map to the grocery store or tell how many teaspoons are in a tablespoon. It is ridiculous in that context, but it's not too far from the reality of school testing.

I've discovered that authenticity is a generally useful concept, and shows up (although not under that name) in the testing literature. Proponents of Extreme Programming refer to test-first programming, where, as you would expect, you write your tests first and then write the code. The idea is that the tests tell you how things ought to work and then you make the code fit that mold. I like this because it tries to impose an authenticity constraint on testing: make your tests do the things your code should do. I do this quite a bit, and I've discovered that the value of the approach is not so much in the fact that you find and fix bugs in your code quickly this way (you do) but that you find and fix bugs in your API and your thinking. If you write your API first, you will write tests that call it the way it was written and you aren't as likely to notice how awkward a certain set of method calls is, or how you never really need to use some methods. If you write your tests first with an imaginary "perfect" API, you will then code up that nice, simple, authentic interface. Try it some time on a new project. You will be surprised how different your methods end up looking since you have the freedom to write them as you might use them, rather than testing some predetermined API.

You can also impose authenticity on your debugging. Try to imagine how a real user would be using the system, and use the most realistic data available. Think through a day in the life of a user, including what other applications they might be running or how they might easily perform an action that you thought would never happen. You will be surprised how many times you can end up replicating a seemingly unreproducible problem just by making it a little more real.
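
To make the test-first idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `ShoppingCart` class, its `add` and `total` methods, prices in integer cents); it is not taken from any real project. The point is only the order of work: the test class was imagined first, calling the API the way a caller would want to, and the class was then written to satisfy it.

```python
import unittest


class ShoppingCart:
    """Hypothetical class, written second to satisfy the tests below."""

    def __init__(self):
        self._items = {}  # name -> (price_cents, quantity)

    def add(self, name, price_cents, quantity=1):
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        _, old_quantity = self._items.get(name, (price_cents, 0))
        self._items[name] = (price_cents, old_quantity + quantity)
        return self  # returning self allows the chained calls the test wanted

    def total(self):
        return sum(price * qty for price, qty in self._items.values())


class ShoppingCartTest(unittest.TestCase):
    """Written first, against the API I wished I had."""

    def test_total_of_chained_adds(self):
        cart = ShoppingCart().add("coffee", 300).add("bagel", 200, quantity=2)
        self.assertEqual(cart.total(), 700)

    def test_rejects_nonpositive_quantity(self):
        with self.assertRaises(ValueError):
            ShoppingCart().add("coffee", 300, quantity=0)
```

Saved to a file, this can be run with `python -m unittest`. Notice that the chained `add(...).add(...)` style only exists because the test asked for it; an API written first would probably not have returned `self`.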

I read an interesting article in Wired a few weeks ago (I get the print edition, but you can read it here.) I thought it was relevant to Distance Debugging because of the comparison of the traditional approach to map-making versus that of the up-and-comer. Basically, it boils down to driving all the roads to see what's going on, versus collecting huge amounts of data such as instant email alerts regarding road changes, scouring local media for information, and looking at satellite imagery to try to infer road status (speed, construction, one-way, etc) without ever leaving the office.

This new approach has started to gain significant ground despite being used by a smaller player. The problem with the "close" approach is that it is about exhaustion, where the drivers are going out and directly observing the state of roads and feeding that into their model. The "distance" approach relies on the fact that people local to the changes will be noting them anyway, and so there is no need to drive roads over and over again.

This is analogous to Distance Debugging versus a close approach such as running the program in a debugger. The debugger is like driving the roads: as you go along you will eventually come along to something worth noting, but will spend a lot of time looking at things you already knew. In the distance approach, you determine what information seems relevant and collect it directly, or have the system dump out information at certain key points. In my experience, and as has been illustrated by the growing market share of Tele Atlas, the distance approach can be more effective and less costly than "driving the roads" since you spend so much less time filtering out redundant data. It will be interesting to see if this trend will spread to other industries.
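
As a rough illustration of having the system "dump out information at certain key points", here is a small Python sketch using the standard `logging` module. The function, the logger name, and the coupon table are all made up for the example; the idea is simply to record the inputs and the decision taken, so the story can be reconstructed later without stepping through every line in a debugger.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


def apply_discount(total_cents, coupon_code, known_coupons):
    """Return the discounted total in cents, logging at the key decision points.

    known_coupons maps a coupon code to a whole-number percent discount.
    """
    logger.info("apply_discount start total=%s coupon=%r", total_cents, coupon_code)
    percent = known_coupons.get(coupon_code)
    if percent is None:
        # The interesting case: say what we saw and what we expected.
        logger.warning("unknown coupon %r; known=%s", coupon_code, sorted(known_coupons))
        return total_cents
    discounted = total_cents * (100 - percent) // 100
    logger.info("apply_discount done percent=%s result=%s", percent, discounted)
    return discounted
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup is enough to see these messages on the console. The log lines are the "email alerts about road changes": they arrive only where something worth noting happened.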

I stayed at a hotel recently that offered wireless internet for free. However, it didn't work very well. Despite having good signal strength (iwconfig showed a 70/100 or higher), and the fact that I had no trouble talking to the router itself, it was dropping packets like crazy, and it would just plain stop responding. I would have IM open, and then I would suddenly be disconnected, or the person I was talking to would stop receiving messages or vice versa. It was very annoying.

As is my habit, I started poking around on the network a little bit just to see what was going on. Nothing seemed awry, so I entered the address of the router itself into my web browser seeing if I could get some sort of status page. With the page that came up, I knew immediately that I was looking at a Linksys router. When I clicked on the status page, it asked for a username and password, and just out of curiosity and fully expecting it to fail, I entered the defaults. Of course, I was immediately dropped into administrative mode.

So now I could get a little more information about what was going on. Looking at the admin console, it became clear that they were running an ancient firmware and that an upgrade was desperately overdue. However, I started to feel weird about the whole thing. Partially it's because I always feel ambivalent about even benignly poking around on someone else's system, and partially because I know that in the current litigious culture, you can get thrown in jail for even thinking about cracking something. I knew it would be so easy to just upgrade their firmware, reboot, and no one would be the wiser (except for maybe the front desk that would note a drop-off in complaints about their service), but it just didn't seem right. I tried to think of the analogous situation in real life: I figure it's like coming over to someone's house, noticing that their door is open and walking in to discover a leaky faucet that you fix before leaving. I like that analogy because, as in the real-life situation, performing a good deed exposes you to legal trouble such as being accused of breaking and entering.

We as a society have a very anti-intrusion bias, and I like that even if it means occasionally good deeds cannot be performed. However, I think the emergence of Wikis and other community controlled resources shows that under the right circumstances we are willing to sacrifice control for the possibility of greater positive outcomes. Perhaps someday we will trust each other enough to open up this idea to a larger set of environments.

I had a funny distance debugging experience while traveling in Boston. We were staying with in-laws who had a high-speed connection, but had it hooked up to only one computer. I decided to go into my old office in Somerville, and while I was there, my wife called because she wanted to hook up her computer so she could check her email. She told me that it looked like a cable modem (which it was), so I knew from experience that generally they are relatively simple to hook up to. You don't need a username and password most of the time since it is a LAN-style connection (i.e. plug in your ethernet cable and go) rather than PPPoE or something else more complicated. However, I also knew, from a very frustrating past experience, that the cable modem learns your MAC address, and so you can't just disconnect one computer and hook up another; you have to turn the modem off and back on again. Here is a transcript of our conversation:

Me: Plug the ethernet cable into your computer.

Wife: Okay, done.

Me: Now turn the modem off and back on again.

Wife: So I should just hit the on/off button on the top and then hit it again?

Me: Yeah, wait 15 seconds or so before turning it back on again.

Wife: Okay...now what?

Me: (long-winded description of configuring dhcp on Mac OS X)

Wife: It won't give me an IP address

Me: Hmmm

I was pretty much stumped and chalked it up to some system I hadn't seen before or which required a password, as some cable services do. Later on when I got home, I looked at the modem and saw the "on/off" button. I realized then that it was actually the "standby" button, which wasn't what we needed at all. The modem has no on/off button; you have to unplug and replug it.

It's funny because at the time I thought, "Gee, these things almost never have on/off switches since it saves 3 cents. This must be a model I've never seen before," instead of asking better questions like "Are all the lights off now?" (In standby mode they wouldn't be; the standby light would be lit.) To me it's a classic example of falling into the trap of assuming that the information being given by the remote person (and my wife is very technically savvy, so I had no reason to doubt her) is completely accurate, rather than relying on the observables to verify information. I had done everything right, except for asking her to push a button that didn't exist.

I had a very interesting and somewhat terrifying experience traveling home from Boston yesterday. When I arrived at the ticket counter with my wife and toddler, they proceeded to inform us that while his ticket had been issued, our seats had been "reserved, but not ticketed". It turned out there had been some agent error and they had put our seats on "Courtesy Hold" instead of just booking them, and that was combined with some sort of computer error where the hold was automatically cleared. Ultimately, we got on the flight, but after the fact, I was in a distance debugging mindset and tried to think of the fundamental issues, and how they might be prevented in the future.

  1. The system was in an essentially impossible state according to the average person's (i.e. my) mental model. I wasn't aware that we could exist in a limbo state where we had reservations but not tickets, although I do now. Developers tend to put those kinds of states into programs all the time, sometimes directly, sometimes implicitly. Usually they are meant as a temporary "holding" state to allow a certain transition to take place, as in this case in the form of a "courtesy hold", whatever that means. However, in my experience, these states are the source of most problems, because the people on the inside (the gate agent) not only have to understand the state themselves, they also have to communicate it to a stultified outside party. Also, since these states are generally poorly understood, humans almost always do The Wrong Thing when they encounter them, leading to a worse situation, as in this case where our hold was unceremoniously cancelled at some point.
  2. This state was indistinguishable from the normal, ticketed state. We received confirmation emails with an itinerary, etc. The only clue I might have had is that we were billed only for my son's ticket. If you are going to allow in-between states that exist only in software, at least make a huge note of it in any communication so that we know.
  3. It turned out that my 18-month-old son had in fact been ticketed, so there was actually a ticket in the system for a solo infant flyer. I'm guessing that should not have been allowed or should have been flagged immediately.
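
The three points above can be sketched in a few lines of Python. I have no idea how the airline's system actually models bookings, so the names here (`SeatStatus`, `Passenger`, `booking_problems`) and the `min_solo_age` threshold are all assumptions made up for illustration. The sketch makes the in-between state an explicit value, reports it loudly, and flags a booking where only a small child holds a ticket.

```python
from dataclasses import dataclass
from enum import Enum


class SeatStatus(Enum):
    HELD = "held"          # reserved, but no ticket issued
    TICKETED = "ticketed"  # paid for and issued


@dataclass
class Passenger:
    name: str
    age_years: int
    status: SeatStatus


def booking_problems(passengers, min_solo_age=5):
    """Return human-readable warnings for a booking.

    min_solo_age is an assumed threshold, not a real airline rule.
    """
    problems = []
    for p in passengers:
        if p.status is SeatStatus.HELD:
            problems.append(f"{p.name}: seat is on HOLD and is NOT ticketed")
    ticketed = [p for p in passengers if p.status is SeatStatus.TICKETED]
    if ticketed and all(p.age_years < min_solo_age for p in ticketed):
        problems.append("only young children are ticketed; no accompanying adult")
    return problems
```

Run against our itinerary (two adults on hold, one ticketed toddler), this would produce three warnings, any one of which belonged in the confirmation email.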

Once the gate agent determined that we had done nothing wrong and that we truly had all the stuff you would have had if you had actually been ticketed, she proceeded to try to get us our tickets. Oddly, the real problem wasn't that the plane was full, it was that she wanted to get us our original fare, as well she should. She kept saying things like "the fare no longer exists", which to me brought up another key point: many times software keeps users from doing necessary things, I assume in an effort to avoid fraud. For example, you can't arbitrarily change the price of a piece of clothing or a hamburger at the register. This makes sense from the corporate point of view; they don't have to worry about someone giving all their friends a 50% discount. On the other hand, if you've ever waited for 15+ minutes when an item rings up wrong and no one on site has the power to change it, you can easily see the downside. This is the state we found ourselves in. The agent was very nice about it and was able to get our original fare eventually, but it took multiple phone calls and lots of typing.

It seems like there is a better way to handle these circumstances: auditing. Allow users to make justified changes at the "register", with a required explanation and an audit timestamp and user credential. If people knew that any time they access these features it raises a flag, they would be unlikely to try them for fun and profit. Or better yet, stop sweating it so much. The small amounts you lose in employee theft would be compensated by greater customer satisfaction. I am hesitant to fly this airline again (although I'm sure I will due to their greater availability of direct flights), but I will likely think twice when there are competing fares and routes. As it stands, employees are restricted from certain types of fraud, but they are also prevented from meeting customer needs.
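
Here is a minimal sketch of what I mean by auditing, in Python. The `AuditedRegister` class and its field names are hypothetical; a real point-of-sale or ticketing system would store the log somewhere durable and tie `employee_id` to an actual login. The essential parts are that the override is allowed, that an explanation is mandatory, and that who, when, and why are recorded.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedRegister:
    """Hypothetical register that permits overrides but records every one."""

    audit_log: list = field(default_factory=list)

    def override_price(self, item, old_cents, new_cents, employee_id, reason):
        # Refuse silent overrides: the explanation is the price of admission.
        if not reason.strip():
            raise ValueError("an explanation is required for a price override")
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),  # audit timestamp
            "who": employee_id,                              # user credential
            "item": item,
            "old_cents": old_cents,
            "new_cents": new_cents,
            "reason": reason.strip(),
        })
        return new_cents
```

With something like this, the agent could have restored our original fare in one step, and anyone reviewing the log later would see exactly who did it and why.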

I came across this post, linked from Slashdot. In it, Steven Levy speaks somewhat philosophically about the iPod shuffle feature, and the well-documented non-randomness problem. He comes to the conclusion that there is no deep conspiracy to play certain artists and the iPod is not telepathic; we are simply illustrating two well-known cognitive phenomena: our general failure to understand statistics (see John Allen Paulos's Innumeracy for an excellent discussion), and our desire to see patterns where there are none.

The article is interesting, but does not discuss what to me is the most interesting part of the "problem". Does their shuffle feature have a bug? Their development team originally said no. They insisted their random number generator was a perfectly valid algorithm. I'm sure that it is. However, that isn't the bug. The bug is giving people "good" randomness instead of what they want, which is something that feels random to a person. Of course, they are a customer-oriented company and fixed the bug. From the article above:

But the non-randomness illusion was so prevalent that ultimately Apple felt compelled to address it. In the version of iTunes rolled out in September 2005, there appeared a new feature: smart shuffle. It presents iPodders with a scroll bar that "allows you to control how likely you are to hear multiple songs in a row by the same artists or on the same album". If you pull the lever to the right, the iPod will mess with its usual distribution pattern, intentionally spacing out songs by a given artist. As Jobs explained it in his presentation the day the new iTunes rolled out, he gave what he hoped would be the last word on the Great iPod Randomness Controversy: "We're making it less random to make it feel more random."

I think the last quote really sums it up. They added a feature to fix a bug in perception despite the fact that there is no bug in execution. This is a lesson that many of us have to learn the hard way, when we continue to fight a losing battle to avoid fixing a bug because we did things "right", but it turns out it isn't what was wanted.
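
To show what "less random to feel more random" could look like, here is a small Python sketch. To be clear, this is not Apple's algorithm, which I have not seen; `spaced_shuffle` is my own made-up greedy approach. It does an ordinary shuffle and then, at each step, prefers a song whose artist differs from the one just played.

```python
import random


def spaced_shuffle(songs, rng=None):
    """Shuffle (artist, title) pairs, avoiding back-to-back songs by the same
    artist whenever a song by a different artist is still available."""
    if rng is None:
        rng = random.Random()
    remaining = list(songs)
    rng.shuffle(remaining)  # start from a genuinely random order
    playlist = []
    while remaining:
        last_artist = playlist[-1][0] if playlist else None
        # Index of the first remaining song by a different artist, else 0.
        index = next(
            (i for i, (artist, _) in enumerate(remaining) if artist != last_artist),
            0,
        )
        playlist.append(remaining.pop(index))
    return playlist
```

The result is strictly less random than a plain shuffle, since some orderings can no longer occur, yet it is much closer to what a listener means by "random". Being greedy, it can still end with a run of one artist when nothing else is left to play.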

One of the things that I try to avoid when debugging a problem, especially one into which I don't have real insight, is assuming that because two things co-occur, one must be the cause of the other. More often than not, the two things are actually both caused by a third, unseen event. However, when my Roadrunner email (which I rarely use) suddenly started telling me "account disabled" immediately after I had a new digital phone line installed, it seemed to be too coincidental.

The real fun was trying to figure out how to get Time Warner to listen to me without having to sit in a long tech support queue and argue with some level 1 tech about why it wasn't working as they walk me through unplugging and replugging my cable modem, forcing me to narrate my pretend actions so that they will finally transfer me to the next level.

I have learned that you will always get more leeway if you buy something, or ask about available services and their prices when you actually need tech support. This works for two reasons, in my experience:

  1. They have a more customer-service oriented mindset, and so they want to make you happy.
  2. They tend to be less technical, or at least have fewer notions about being a technical person and so they will listen when you offer up a problem diagnosis.

I remembered that I had asked for voicemail on the line but they did not add it for whatever reason, and this provides a clear "in" to the sales line. I try the tech line first just to see if I can quickly resolve it, but the wait is > 45 minutes. I then try the sales line; I get a person right away and I explain that I want to add VM. No problem, they can add that for me.

Me: Oh, and BTW, my email account stopped working right after the phone was installed.

Sales Guy: Oh really, let me look at that...yeah it looks like it got confused here. Let me fix that for you.

Me: (astonished silence)...uh, thanks

10 minutes later my VM was active, and my email was back online. So anyway, the moral of the story is: to get good tech support, tell them that you want to buy something.
