Distance Debugging

I was recently reading The Pragmatic Programmer and thinking about the concept of Good Enough software. Briefly, the idea is that when you learn to build software that meets users' high-priority needs rather than some standard of technical perfection, you and your users will be much happier. The critical piece to making it work is engaging them in the dialog about where corners can be cut and making them aware of the resource constraints that exist (time, money, ennui, etc.) so that they can make an informed decision.

What I found interesting, though, is that even these authors, writing from the pragmatic point of view, felt the need to add a caveat about high-performance, zero-fault software, such as the stuff running pacemakers or controlling aircraft, to say that there are cases where the Good Enough theory and process don't apply. Why do we feel the need for a methodology or process concept to span all types of projects when looking at its validity? People will dismiss extreme programming by saying "well that would never work if you needed to build software for the space shuttle". I would probably agree with that, but that's beside the point; in my experience, the level of implied quality and criticality of the system is generally inversely proportional to the scope of the requirements. In other words, if it really has to work all the time perfectly, chances are that you also have been told exactly what it needs to do and how it has to do it. No one ever says, "Can you build me a pacemaker, and while it's being used for humans right now, we might also use it for mice, and it might need to perform dialysis functions also."

We don't need agile techniques and rapid prototyping and usability studies and iterative development to build a pacemaker because from the get-go it's clear exactly what it needs to do, and all the time and energy is spent on making sure it does that one thing perfectly. Extreme Programming, and the idea of Good Enough software which is closely related to the XP planning game, is for what has been in my experience the vast majority of cases, where you really don't know exactly what you want, and where the feature set and degree of robustness are up for debate. Building systems that have nebulous requirements but a variable margin for error can be as difficult as, if not more difficult than, building those with strict requirements but zero margin for error. The multitude of software project management theories and the endless stories of projects being killed for failing to deliver the right (or any) functionality attest to that.

In most instances, it should be possible to look at the number and detail of your requirements and figure out where your project is on the scope vs. quality management curve and select your methodology accordingly. Of course, you will likely end up drifting along the curve during your project's lifetime and should adapt accordingly. Project management theories and companies that adopt one methodology as part of their identity ("We do Scrum here") tend to have trouble with one of the two inversely compatible goals, or at least they end up burning a lot of management overhead to solve the problem they aren't having. Conversely, we should stop making excuses or offering caveats for our management tools that do provide us a process to tackle the problems we are having.

I used to work as a projectionist in a movie theater that showed primarily classic films. This involved significantly more work than is required with modern films because of the need to switch between reels. In a nutshell, films used to be distributed in a stack of canisters, each containing about 20 minutes of film, called a reel. Showing a film consisting of reels requires two projectors, and the projectionist has to load up each reel on the currently unused projector and wait for a cue to start up that next reel, which is loaded up such that the actual picture begins exactly 10 seconds after the projector is started. Then, a second cue tells them exactly when to swap between the two projectors (in my case there was a floor pedal that would instantaneously close one projector and open the other). If done correctly, the movie would appear to seamlessly transition from one reel to the next and the audience would never be the wiser.

So what are the cues that the projectionist uses? In every movie you have ever seen (chances are) a small black dot appears in the upper right-hand corner of the film at 10 seconds remaining, and again at 1/4 second remaining. Now that I know it's there, I can't "unsee" it, so I notice it in pretty much every film I see in the theaters. Despite the fact that most films released nowadays are made of stronger materials and put together into one huge reel, making the projectionist's job significantly easier and the dots superfluous, they are still there.

Even if you do know they are there, they can be very hard to notice unless you have spent a fair amount of time looking for them. To me, the most interesting part about the dots is that I've yet to meet someone who when I tell them about the phenomenon for the first time says "oh yeah, I always wondered what those were for!" Most people say, "No way, I would have noticed that" and are somewhat incredulous that this could have been happening in every film without them consciously processing it.

It just goes to show how powerful the brain's capacity to filter and make sense of the world is. Since those dots appear to have no meaning in terms of the film content, they are dismissed at some lower level of processing as perhaps some dirt on the film or a one-time anomaly to be corrected for. The brain is compensating for the eye's blind spot all the time, so throwing out a little black dot every so often is a piece of cake.

The black dot problem, as it relates to debugging, is the problem of failing to perceive useful information because it is grouped in with a larger, consistent data set. For example, a rare but recurring error buried in hundreds of megabytes of uniform log files could be the black dot that you are missing when searching for the cause of a system crash. Sometimes, simply being told that the "black dot" is there is enough to allow you to find it. Other times, it's just a matter of stepping back and repeating to yourself that the information is there and trying to force those subconscious filters off for a while to really see the data in front of you. However it comes about, when you finally notice the dot, you will wonder how you could have missed it for so long.
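One way to force those filters off is to let a script do the looking for you. Here is a minimal sketch in Python, assuming plain-text logs with ISO-style timestamps. The names normalize and rare_shapes and the placeholder tokens are made up for illustration, not taken from any library. The idea is to collapse the parts of each line that always vary, count the message "shapes" that remain, and print only the rare ones.

```python
import re
from collections import Counter

# Parts of a log line that change every time and hide the message shape.
# The timestamp format here is an assumption; adjust it to your logs.
_TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?")
_HEX = re.compile(r"0x[0-9a-fA-F]+")
_NUMBER = re.compile(r"\d+")


def normalize(line):
    """Collapse timestamps, hex values, and numbers into placeholders."""
    line = _TIMESTAMP.sub("<ts>", line)
    line = _HEX.sub("<hex>", line)
    line = _NUMBER.sub("<n>", line)
    return line.strip()


def rare_shapes(lines, max_count=3):
    """Return (shape, count) pairs seen at most max_count times, rarest first."""
    counts = Counter(normalize(line) for line in lines if line.strip())
    rare = [(shape, n) for shape, n in counts.items() if n <= max_count]
    return sorted(rare, key=lambda item: (item[1], item[0]))
```

Feeding it an open log file, as in rare_shapes(open(path)), gives you a short list to read instead of hundreds of megabytes. The max_count threshold is a judgment call: too low and a dot that recurs a dozen times slips through, too high and you are back to reading noise.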

Remember in math class the teacher admonishing you to go back and "Check Your Work" after completing a problem set? It's a simple request that can have amazing results for your overall score when you go back and notice simple errors, or identify answers that don't pass the smell test. Of course, this never helped me much in math because I am terrible at arithmetic, and no amount of work-checking ever seemed to allow me to notice that I'd added numbers that I meant to multiply or vice-versa. However, checking work is an important part of what I do when debugging.

Basically, if I am executing a set of steps to replicate or gather data about a problem, the first thing I do is to verify, to the extent possible, that I am really doing the steps I think I am. This is harder than it sounds for two main reasons: silent success and silent failure. Silent success means that things work as expected, but produce no output. This is fairly common since we tend not to write out logging information for successful operations. Silent failure can be an attempt at fault-tolerance, or sometimes the error happens in a low-level library and can't be handled correctly, or the message is just buried or piped to an obscure log file (apparently silent failure).

Both of these problems need to be addressed if you want to check your work properly. In the case that the system can be run in a debugger, work-checking is trivial since you can step through the sequence in question. If not, some judicious print-outs to verify can work instead. Checking your work is more than that, though. It also means looking for the evidence of execution. This can be reviewing log files after startup to see if everything looks kosher. It can be looking at process tables to verify that something is running. It can be pinging ports to verify that something is alive. Anything that gives you some level of confidence that things are doing what they say they are.
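As an example of looking for evidence of execution, here is a small sketch of the port check using only Python's standard socket module. The function name port_is_open is my own, and a successful connection only proves that something is listening, not that it is the process you think it is.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling something like port_is_open("localhost", 8080) right after a restart is a cheap way to turn a silent success into a visible one before you start the real debugging session. The host and port there are placeholders for whatever your system uses.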

Without taking these kinds of actions and checking your work, you run the risk of burning many hours debugging a shadow bug, which hints at the real problems in an indirect way like Plato's shadows on the cave wall. For example, you build a new version of an application, but the build script outputs a warning that a properties file is missing and so it is skipping part of the build. Since you are not expecting the build script to fail, this warning goes unnoticed. You run the latest version and suddenly weird things are going wrong, and you are convinced that you have broken something in the code. By the time you finally notice the build script output, you may have spent significant time debugging a non-problem, or at least trying to debug an epiphenomenal problem rather than the real thing. Checking work means more time fixing real problems, and less time chasing after specters of bugs that are faults of the process and not of the application.
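The build script scenario is the kind of thing that is easy to guard against mechanically. The sketch below is one possible wrapper, not a feature of any particular build tool: it runs a command, echoes the output, and refuses to report success if any line looks like a warning. The marker strings and the names suspicious_lines and run_build are assumptions for illustration, and you would tune the markers to whatever your build actually prints.

```python
import subprocess
import sys

# Assumed markers; real build tools word their warnings differently.
WARNING_MARKERS = ("warning", "skipping")


def suspicious_lines(output):
    """Return output lines that contain any warning marker, ignoring case."""
    return [
        line for line in output.splitlines()
        if any(marker in line.lower() for marker in WARNING_MARKERS)
    ]


def run_build(command):
    """Run the build, echo its output, and return non-zero if it warned."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout + result.stderr
    print(output, end="")
    flagged = suspicious_lines(output)
    for line in flagged:
        print("CHECK YOUR WORK:", line, file=sys.stderr)
    if result.returncode != 0:
        return result.returncode
    # The build "succeeded", but a warning means it may not be the build you expect.
    return 1 if flagged else 0
```

Substring matching like this is crude and will produce the occasional false alarm, but a false alarm at build time is much cheaper than an afternoon spent chasing a shadow bug.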

A recent publication by Edward A. Lee at UC Berkeley called "The Problem with Threads" is an interesting look at why multithreaded programming is hard, and more specifically, why the Thread abstraction makes it harder. While I'm not going to disagree with these sentiments in general, I am shocked by the common paralyzing fear of multithreaded programming among otherwise competent, confident programmers.

There seems to be a disconnect between the actual difficulty level of writing a multithreaded system and the perceived difficulty. The reasoning goes like this:

  1. Single-threaded programs are deterministic.
  2. It is possible to exhaustively test only deterministic programs.
  3. Multi-threaded programs are non-deterministic.
  4. Therefore, it is not possible to exhaustively test multi-threaded programs.

Take, for example, this passage from Lee's paper:

A part of the Ptolemy Project experiment was to see whether effective software engineering practices could be developed for an academic research setting. We developed a process that included a code maturity rating system (with four levels, red, yellow, green, and blue), design reviews, code reviews, nightly builds, regression tests, and automated code coverage metrics... The reviewers included concurrency experts, not just inexperienced graduate students...We wrote regression tests that achieved 100 percent code coverage. The nightly build and regression tests ran on a two processor SMP machine, which exhibited different thread behavior than the development machines, which all had a single processor. The Ptolemy II system itself began to be widely used, and every use of the system exercised this code. No problems were observed until the code deadlocked on April 26, 2004, four years later.

It is certainly true that our relatively rigorous software engineering practice identified and fixed many concurrency bugs. But the fact that a problem as serious as a deadlock that locked up the system could go undetected for four years despite this practice is alarming. How many more such problems remain? How long do we need test before we can be sure to have discovered all such problems? Regrettably, I have to conclude that testing may never reveal all the problems in nontrivial multithreaded code.

There are a few elements that bother me here. My primary complaint is the final sentence: testing may never reveal all the problems in nontrivial multithreaded code. It implies that testing may reveal all the problems in nontrivial single-threaded code, which I believe is totally false. Testing will never reveal all the problems in any nontrivial system. Multithreaded programs are no different, but that's no reason to assign any particular menace to them.

My second complaint is this statement: "No problems were observed until the code deadlocked on April 26, 2004, four years later." Seriously? If your system had no observable defects whatsoever during a 4-year active usage period by a large and diverse group of users, then my hat is off to you. I assume what he meant is "No problems were observed that appeared to be related to threading until the code deadlocked". It would be shocking if none of these users found bugs in the UI, or errors related to pointer logic, or any of the dozen other problems that commonly occur in complex systems. I believe that the Ptolemy system was well-written and well-designed, so I am not trying to claim that it is buggy or problematic. I am simply claiming that singling out the 4-year gap before the first thread-related bug was found is misleading. I could argue that multithreading is in fact the least of their worries if the first bug was found 4 years after the release of the system. How many UI bugs were found in that period? 50? 500?

My final complaint is that papers like this are what stoke the fear burning in ordinary programmers: that they will somehow be exposed by the complexity of building a multithreaded system. Perhaps it is that programmers are ultimately control freaks, and somehow, the nondeterminism of multithreading seems more out of control than the ordinary nondeterminism of the ridiculous things human users do to every application. Whatever the reason, I recommend to all programmers that they become familiar with the tools of multithreaded coding in the same way they might learn graphics or databases, and just start writing lots and lots of multithreaded programs. This type of exposure is the only way to beat this phobia that plagues the industry.
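If you want a first exercise, deadlock is a good place to start, since it is the bug in Lee's story and one of the better-understood ones. What follows is a toy sketch in Python and has nothing to do with the Ptolemy code. The Account class and the function names are invented for illustration. It shows one standard defense: when a piece of code needs two locks, always acquire them in the same global order, so that two threads can never each hold one lock while waiting for the other.

```python
import threading


class Account:
    """A toy shared resource guarded by its own lock."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()


def transfer(source, target, amount):
    """Move amount between two accounts, always locking in name order."""
    if source is target:
        return  # Nothing to do, and taking the same lock twice would hang.
    first, second = sorted((source, target), key=lambda account: account.name)
    with first.lock, second.lock:
        source.balance -= amount
        target.balance += amount


def run_opposing_transfers(rounds=1000):
    """Hammer two accounts from two threads moving money in opposite directions."""
    a, b = Account("a", 100), Account("b", 100)

    def worker(source, target):
        for _ in range(rounds):
            transfer(source, target, 1)

    threads = [
        threading.Thread(target=worker, args=(a, b)),
        threading.Thread(target=worker, args=(b, a)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return a.balance, b.balance
```

Try changing transfer so that it locks source first and target second, and run it a few times. Whether and when it hangs will vary from run to run, which is exactly the nondeterminism people are afraid of, and also exactly the experience that makes it less frightening.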

I have always wanted to invent a new, commonly used expression or cliche. Most of the things I come up with refer to events or states that don't happen that much, or are just not catchy enough to endure. Here are some of my creations. I apologize if I actually stole them from somewhere else, but I've seen no evidence on the web:

  • Throwing rocks down the well - There is a fable by Aesop called The Crow and the Pitcher. In short, a crow uses stones to slowly raise the level of water in a pitcher until he can drink it. It's supposed to be about ingenuity and perseverance. I always took it to be about recognizing when brute force is your only option. I use this more dramatic sounding variant to describe situations where you have been doing things in a slow, grinding way because there is simply no (known) alternative. For example, if you are trying to build a new line of business for your company, you can't just create it through a flash of insight. You have to find new customers, and convince them of your worth, etc. In short, you can only build a new business by throwing rocks down the well.
  • All '5's and 'yes's - I've purchased two cars from two different car companies. In both cases, the sales and service staff admonished me, "<car company> will be calling you for a follow up survey. Please, please, please do not give us a rating other than a 5 or a yes; PLEASE, ALL '5's AND 'YES'S!!!", with the idea being that they need a 5 on a scale from 1 to 5 for the numerical questions, and yes on the yes or no questions. On a side note, this notion of customer satisfaction where anything besides perfection is failure is patently ridiculous, leading to the situation described; the company actually gets no feedback at all, but that's a topic for another post. Anyway, I've adapted this expression to describe a situation where you want an honest critique, but you just get superlatives or gladhanding. "I wanted her feedback on the latest design document, but she was all '5's and 'yes's."

So that brings me to my latest creation: the Check Engine Light. My old car had the little orange light that many cars have, which is generally referred to as the Check Engine light. In my limited understanding of cars, I have only a vague notion of what this light is supposed to tell you, especially with a title like "Check Engine". Basically, I was told, it is supposed to come on when one of the multitudes of sensors that track the efficiency and emission level of the engine and exhaust detects an out-of-bounds condition. So when that happens, you should dutifully take it over to the nearest official service shop and have them look at it because, at least on my car, it stores a code that indicates what was wrong.

Here's the problem: if you had the car up at highway speeds, that little light would come on for about 10 minutes and then it would turn off for 10 minutes, over and over. We'd take it to the dealer and every time it would read some random code and they couldn't find anything wrong related to that code. One out of the dozens of times we had it in, they found a hose that had a leak and replaced it. Eventually, the theory was that the engine had a timing problem causing it to misfire occasionally at higher speeds, and this would trigger the light. Overall though, this car had very few problems and was an excellent vehicle, so it was just annoying. It got to the point that I began referring to it as the "Everything is Fine" light because it would come on when we were happily cruising along. I would have been worried if it didn't start its off-on cycle.

Ultimately, I understood this light though, because I have built "Check Engine" lights into software many times, and I see it in software that I use. It happens like this: you have a piece of software that is doing a fair number of complex things, and there are hundreds of possible things that can go wrong, so in general, you are better off monitoring a limited set of outputs for problems than you are putting lots of checks in the code itself. The trouble is, you are just not sure what constitutes "normal" values for outputs in every case.

A good example is something like a server thread stuck timer. It's good practice to put in a timer that waits a certain period of time before declaring a thread "stuck". Since threads can become stuck for many reasons, this is much easier than trying to detect every cause: you can just kill the thread and let someone know. The problem is, if you set the threshold too high, then threads will be stuck for a long period of time before being noticed, and if load is heavy, the system might lock up in a cascading effect. If it's too low, then jobs that take a long time might be misclassified as "stuck". So it has to be calibrated to the application, but there will always be cases of this "Check Engine" light coming on for no apparent reason.
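To make the stuck timer concrete, here is a rough sketch in Python. The StuckTimer class and its method names are hypothetical and not from any server framework. It only reports suspects rather than killing them, since Python offers no safe way to kill a thread, so the "kill it and let someone know" step is left to whatever your server provides.

```python
import threading
import time


class StuckTimer:
    """Track when each job started and report the ones past a threshold."""

    def __init__(self, threshold_seconds, clock=time.monotonic):
        self.threshold = threshold_seconds
        self.clock = clock  # Injectable so the timer can be tested without sleeping.
        self._started = {}
        self._lock = threading.Lock()

    def start(self, job_id):
        """Record that a worker has begun the given job."""
        with self._lock:
            self._started[job_id] = self.clock()

    def finish(self, job_id):
        """Forget a job that completed normally."""
        with self._lock:
            self._started.pop(job_id, None)

    def stuck_jobs(self):
        """Return ids of jobs running longer than the threshold, sorted."""
        now = self.clock()
        with self._lock:
            return sorted(
                job_id for job_id, began in self._started.items()
                if now - began > self.threshold
            )
```

Workers call start and finish around each job, and a monitoring thread polls stuck_jobs every so often. The single threshold_seconds value is the whole calibration problem in one number: there is no setting that catches every real hang without occasionally flagging a job that was merely slow.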

So into your arsenal of debugging terms, I hope you will add "Check Engine Light", defined as errors which indicate non-specific, recurring fault conditions, and which may have a reasonable cause, or may be false positives. Or at least, the next time you are poring over a log file with someone and they dismiss a stack trace with a wave of their hand and a reference to "The Check Engine Light", you'll know what they mean.

Now that I've covered the basics, I'd like to talk about using Linux with the Dash. It turned out to be a learning experience in many ways. First of all, Windows Mobile wants you to use Windows, and has a lot of elements that make it especially painful to use with any other operating system. Take, for example, application installation. In my previous dealings with Palm and Symbian OS, it was generally a matter of copying the application onto the memory card from the computer (which could be done from anywhere) and then either installing the application from the card, or copying the application from the card into the proper location on the phone or PDA file system to get it to be recognized. Windows Mobile applications appear to be distributed as a Windows executable (.exe) that you run on your Windows machine, which then installs the actual application on your PDA when you sync. So essentially, unless you can run Windows software and have ActiveSync (the synchronization app) installed, you are SOL. More on that in a second. Here are the highlights of what I've been able to do thus far:

  • I've gotten hooked up with the excellent SynCE project, which aims to provide tools for working with Windows Mobile devices on Linux. They had been supporting lots of devices, but after a long drought where there were few developers, it looks like attention has been focused on WM5. It also appears that the community is once again picking up steam. With a very active mailing list where I got a very quick response to a question, and new stuff being added every day, it looks like there may be some serious momentum for getting these devices fully supported. I'd like to offer my technical assistance too, so I'm going to dig into the code and see what's happening.
  • Using the info and tools on the SynCE site, I was able to set up my machine so that I get desktop notifications when the phone is plugged in and unplugged, I can list the contents of the device and copy files to and from it, and it appears that synchronization of contacts and calendar from Evolution really wants to work. Unfortunately, while both the computer and the phone think that they are exchanging information, my Evolution contacts just seem to get a bunch of blank entries. I'll post more when I get that sorted out. Word on the mailing list is that Task support is very close to being ready as well, so that would cover the big three PIM applications (I don't need email sync since I've just got both the phone and my desktop using the same IMAP account).
  • In terms of the Windows dependency for application installation described earlier, there is a tool called "Orange" on the SynCE site that can extract the .cab files, which can be installed on the phone directly, from some Windows installers, but I believe it is limited to self-extracting installers that were used in the past. It doesn't seem to handle this new breed of "Windows application as Windows Mobile Installer", which is quite frustrating. I've tried a few things such as running the installers under WINE, which works, except that they all want ActiveSync to be installed, and I've tried installing ActiveSync under WINE, which fails at the moment for reasons unknown. I think I'm going to have to temporarily resort to installing applications with my Windows machine (shudder) until I come up with something better.

So, Linux support is moving along.  I'll post again when I start getting things synchronized or if I find a solution to the Windows installer problem.

Besides big projects eating up all my time, I did have one fun side pastime: a brand new T-Mobile Dash smartphone. The device, which is the same as the HTC Excalibur, runs Windows Mobile 5 (or 2005 as it's also called), has a full QWERTY keyboard, and WiFi, so it's a pretty serious little thing. I spent a lot of time fiddling with it, and with associated tools. Here is a quick summary:

Pros:

  • This is well covered in other places, but it just looks cool. It's got a soft textured rubber exterior on the back and it's easy to grip, and a brushed metal look on the front.
  • Lots of useful built in applications, stuff for viewing word docs, windows media player (more on that in a second), an IM client, and mobile outlook.
  • I bought it as a replacement music player after the untimely demise of my iPod, so I went out and bought a big (2GB) microSD card which, quite frankly, I could easily inhale if I weren't careful. It's about the size of a quarter of a postage stamp. Anyway, it comes with earphones that plug into the "micro USB" port on the bottom, and I have been totally shocked (in a good way) at the quality of the audio. I'm not exactly an audiophile, but compared to my old iPod, the bass is much better, and it just has a nice clear, rich sound. That really surprised me. I have been using the built-in music player, but I'm going to try out some of the other players and compare. Overall, I would highly recommend it as an iPod replacement thus far.
  • It's a little slow switching between applications, but the applications themselves run without a hitch.

Cons:

  • There don't seem to be that many applications available for the Windows Smartphone platform. Anyone who has tried to produce an application for the mobile world knows that each platform has its own set of capabilities and quirks, and so it's time-consuming and often not worth the trouble to develop for multiple platforms. Windows Mobile itself is actually divided into two branches: the smartphone branch and the PocketPC branch. The main philosophical difference is that the smartphone branch does not use a touchscreen, but there are other subtle differences. Therefore, you can't just grab from the huge slate of existing PocketPC applications; there has to be a smartphone version.
  • Windows Mobile, like its big brother on the PC, is kind of uptight. I've spent a lot of time trying to figure out where it wants me to put things and how to get things installed. Part of this is probably because I refuse to use the standard Windows tools and want to make everything work on Linux (more on this tomorrow), so Windows users may have less trouble with this. There is talk that some people have gotten Linux to run on these phones, but I'm not quite that brave yet.
  • I got my email set up on it, which is very cool, but I have it check every 15 minutes or so for new mail and it insists on using the EDGE connection rather than the WiFi, and I have absolutely no idea how to change that. Overall the whole connectivity thing works, but mostly I just end up having the data service up all the time, and I only turn on the WiFi when I am browsing the web.

Tomorrow: Linux Tools for Windows Mobile

My apologies to those looking for new content here.  I am currently embroiled in a set of big projects that is occupying all my time.  Look for a bunch of new posts starting on Thursday:

- These (2.5) weeks in Debugging

- Is Multithreaded Debugging Really that Hard?

-  Fun with the Java Media Framework

I apologize to those that have come here recently to look at an individual post and found it unreadable. A helpful comment noted that while the main site looked fine, the individual post page was screwed up. I had forgotten to apply the same trick to the Single Post template (page.php) that I had applied to the Main Template (index.php). It should be fixed now. If this problem shows up anywhere else, please let me know.

For those who are interested, getting a scrollable DIV is fairly straightforward. The key is to set a fixed height, and then set overflow to "auto". Like this:

.scrolldiv {
  height: 500px;
  overflow: auto;
}

Then, if the content exceeds that height, it will give you a scroll bar automatically. Look for a belated "this week in debugging" with more info later today...

I've tweaked the layout a bit so that you can scroll the posts while keeping the sidebars and header in place. My goal is to ultimately make it so that all the sidebar content fits onto a typical screen so that you can see all the archives, search, etc. without having to scroll the page, and you only have to scroll the post content. Let me know if this is good/bad for you. I'm probably going to keep playing around with this for a few more iterations, so please excuse me in advance if you stop by and things are a little out of whack.
