Distance Debugging

I was recently reading The Pragmatic Programmer and was thinking about the concept of Good Enough software. Briefly, the idea is that when you learn to build software that meets users' high-priority needs rather than some standard of technical perfection, you and your users will be much happier. The critical piece to making it work is engaging them in the dialog about where corners can be cut and making them aware of the resource constraints that exist (time, money, ennui, etc.) so that they can make an informed decision.

What I found interesting, though, is that even these authors, writing from the pragmatic point of view, felt the need to add a caveat about high-performance, zero-fault software, such as the stuff running pacemakers or controlling aircraft, saying that there are cases where the Good Enough theory and process don't apply. Why do we feel the need for a methodology or process concept to span all types of projects when looking at its validity? People will dismiss extreme programming by saying "well that would never work if you needed to build software for the space shuttle". I would probably agree with that, but that's beside the point; in my experience, the level of implied quality and criticality of the system is generally inversely proportional to the scope of the requirements. In other words, if it really has to work all the time perfectly, chances are that you also have been told exactly what it needs to do and how it has to do it. No one ever says, "Can you build me a pacemaker, and while it's being used for humans right now, we might also use it for mice, and it might need to perform dialysis functions also."

We don't need agile techniques and rapid prototyping and usability studies and iterative development to build a pacemaker because from the get-go it's clear exactly what it needs to do, and all the time and energy is spent on making sure it does that one thing perfectly. Extreme Programming, and the idea of Good Enough software which is closely related to the XP planning game, is for what has been in my experience the vast majority of cases, where you really don't exactly know what you want, and where the feature set and degree of robustness are up for debate. Building systems that have nebulous requirements but a variable margin for error can be as difficult as, if not more difficult than, building those with strict requirements but zero margin for error. The multitude of software project management theories and the endless stories of projects being killed for failing to deliver the right (or any) functionality attest to that.

In most instances, it should be possible to look at the number and detail of your requirements and figure out where your project is on the scope vs. quality management curve and select your methodology accordingly. Of course, you will likely end up drifting along the curve during your project's lifetime and should adapt accordingly. Project Management theories and companies that adopt one methodology as part of their identity ("We do Scrum here") tend to have trouble with one of the two inversely compatible goals, or at least they end up burning a lot of management overhead to solve the problem they aren't having. Conversely, we should stop making excuses or offering caveats for our management tools that do provide us a process to tackle the problems we are having.

I used to work as a projectionist in a movie theater that showed primarily classic films. This involved significantly more work than is required with modern films because of the need to switch between reels. In a nutshell, films used to be distributed in a stack of canisters each containing about 20 minutes of film, called a reel. Showing a film consisting of reels requires two projectors, and the projectionist has to load up each reel on the currently unused projector and wait for a cue to start up that next reel, which was loaded up such that the actual picture would begin exactly 10 seconds after the projector was started. Then, a second cue tells them exactly when to swap between the two projectors (in my case there was a floor pedal that would instantaneously close one projector and open the other). If done correctly, the movie would appear to seamlessly transition from one reel to the next and the audience would never be the wiser.

So what are the cues that the projectionist uses? In every movie you have ever seen (chances are) a small black dot appears in the upper-right hand corner of the film at 10 seconds remaining, and again at 1/4 second remaining. Now that I know it's there, I can't "unsee" it, so I notice it in pretty much every film I see in the theaters. Despite the fact that most films released nowadays are made of stronger materials and put together into one huge reel, making the projectionist's job significantly easier and the dots superfluous, they are still there.

Even if you do know they are there, they can be very hard to notice unless you have spent a fair amount of time looking for them. To me, the most interesting part about the dots is that I've yet to meet someone who when I tell them about the phenomenon for the first time says "oh yeah, I always wondered what those were for!" Most people say, "No way, I would have noticed that" and are somewhat incredulous that this could have been happening in every film without them consciously processing it.

It just goes to show how powerful the brain's capacity to filter and make sense of the world is. Since those dots appear to have no meaning in terms of the film content, they are dismissed at some lower level of processing as perhaps some dirt on the film or a one-time anomaly to be corrected for. The brain is compensating for the eye's blind spot all the time, so throwing out a little black dot every so often is a piece of cake.

The black dot problem, as it relates to debugging, is the problem of failing to perceive useful information because it is grouped in with a larger, consistent data set. For example, a rare but recurring error buried in hundreds of megabytes of uniform log files could be the black dot that you are missing when searching for the cause of a system crash. Sometimes, simply being told that the "black dot" is there is enough to allow you to find it. Other times, it's just a matter of stepping back and repeating to yourself that the information is there and trying to force those subconscious filters off for a while to really see the data in front of you. However it comes about, when you finally notice the dot, you will wonder how you could have missed it for so long.
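One way to force those filters off is to let a script do the looking. The following is only a rough sketch in Python, not a tool I am claiming anyone uses: the function names, the normalization rules, and the 1% cutoff are all assumptions made for illustration. It collapses the variable parts of each log line (numbers, hex values) so that lines of the same shape group together, then reports the shapes that occur only rarely.

```python
import re
from collections import Counter

def normalize(line):
    """Collapse variable parts so lines with the same shape group together."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # addresses, handles
    line = re.sub(r"\d+", "<N>", line)               # timestamps, ids, counts
    return line.strip()

def rare_patterns(lines, max_share=0.01):
    """Return (pattern, count) pairs whose share of all lines is <= max_share.

    max_share is an arbitrary illustrative cutoff; tune it for your logs.
    """
    counts = Counter(normalize(line) for line in lines if line.strip())
    total = sum(counts.values())
    rare = [(p, c) for p, c in counts.items() if c / total <= max_share]
    # Rarest first: these are the candidate "black dots".
    return sorted(rare, key=lambda pair: pair[1])
```

Feeding it the lines of a log file (for example `rare_patterns(open(path))`, with `path` being whatever log you are staring at) gives a short list to read with fresh eyes instead of hundreds of megabytes of sameness.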

Remember in math class the teacher admonishing you to go back and "Check Your Work" after completing a problem set? It's a simple request that can have amazing results for your overall score when you go back and notice simple errors, or identify answers that don't pass the smell test. Of course, this never helped me much in math because I am terrible at arithmetic, and no amount of work-checking ever seemed to allow me to notice that I'd added numbers that I meant to multiply or vice-versa. However, checking work is an important part of what I do when debugging.

Basically, if I am executing a set of steps to replicate or gather data about a problem, the first thing I do is to verify, to the extent possible, that I am really doing the steps I think I am. This is harder than it sounds for two main reasons: silent success and silent failure. Silent success means that things work as expected, but produce no output. This is fairly common since we tend not to write out logging information for successful operations. Silent failure can be an attempt at fault-tolerance, or sometimes the message happens in a low-level library and can't be handled correctly, or is just buried or piped to an obscure log file (apparently silent failure).

Both of these problems need to be addressed if you want to check your work properly. In the case that the system can be run in a debugger, work-checking is trivial since you can step through the sequence in question. If not, some judicious print-outs to verify can work instead. Checking your work is more than that, though; it also means looking for the evidence of execution. This can be reviewing log files after startup to see if everything looks kosher. It can be looking at process tables to verify that something is running. It can be pinging ports to verify that something is alive. Anything that gives you some level of confidence that things are doing what they say they are.
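As an example of looking for evidence of execution, here is a minimal sketch in Python of the "ping the port" check. The function name `port_is_open` is my own invention for illustration; the only library call it relies on is the standard `socket.create_connection`. It only tells you that something accepted a TCP connection on that port, not that it is the process you think it is.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts a TCP connection on host:port."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is listening or the host cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling something like `port_is_open("localhost", 8080)` right after you believe you have started a server is a ten-second check that can save an hour of debugging a process that never came up.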

Without taking these kinds of actions and checking your work, you run the risk of burning many hours debugging a shadow bug, which hints at the real problems in an indirect way like Plato's shadows on the cave wall. For example, you build a new version of an application, but the build script outputs a warning that a properties file is missing and so it is skipping part of the build. Since you are not expecting the build script to fail, this warning goes unnoticed. You run the latest version and suddenly weird things are going wrong, and you are convinced that you have broken something in the code. When you finally notice the build script output, you may have spent significant time debugging a non-problem, or at least trying to debug an epiphenomenal problem rather than the real thing. Checking work means more time fixing real problems, and less time chasing after specters of bugs that are faults of the process and not of the application.
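The build script scenario is easy to guard against mechanically. The sketch below, in Python, is an assumption-laden illustration rather than a recipe: the keyword list and the function name `suspicious_lines` are made up, and a real build tool may phrase its warnings differently. It scans captured build output and returns the lines that deserve a second look even when the build claims success.

```python
import re

# Hypothetical keyword list; adjust it to whatever your build tool prints.
WARNING_PATTERN = re.compile(r"\b(warning|skipping|not found|missing)\b",
                             re.IGNORECASE)

def suspicious_lines(build_output):
    """Return (line_number, text) for build output lines that hint at trouble."""
    return [
        (number, line)
        for number, line in enumerate(build_output.splitlines(), start=1)
        if WARNING_PATTERN.search(line)
    ]
```

Printing the result at the very end of the build, after the success message, puts the warning where you are actually looking instead of three screens up.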

A recent publication by Edward A. Lee at UC Berkeley called "The Problem with Threads" is an interesting look at why multithreaded programming is hard, and more specifically, why the Thread abstraction makes it harder. While I'm not going to disagree with these sentiments in general, I am shocked by the common paralyzing fear of multithreaded programming among otherwise competent, confident programmers.

There seems to be a disconnect between the actual difficulty level of writing a multithreaded system and the perceived difficulty. The reasoning goes like this:

  1. Single-threaded programs are deterministic.
  2. It is possible to exhaustively test only deterministic programs.
  3. Multi-threaded programs are non-deterministic.
  4. Therefore, it is not possible to exhaustively test multi-threaded programs.

Take, for example, this passage from Lee's paper:

A part of the Ptolemy Project experiment was to see whether effective software engineering practices could be developed for an academic research setting. We developed a process that included a code maturity rating system (with four levels, red, yellow, green, and blue), design reviews, code reviews, nightly builds, regression tests, and automated code coverage metrics... The reviewers included concurrency experts, not just inexperienced graduate students...We wrote regression tests that achieved 100 percent code coverage. The nightly build and regression tests ran on a two processor SMP machine, which exhibited different thread behavior than the development machines, which all had a single processor. The Ptolemy II system itself began to be widely used, and every use of the system exercised this code. No problems were observed until the code deadlocked on April 26, 2004, four years later.

It is certainly true that our relatively rigorous software engineering practice identified and fixed many concurrency bugs. But the fact that a problem as serious as a deadlock that locked up the system could go undetected for four years despite this practice is alarming. How many more such problems remain? How long do we need test before we can be sure to have discovered all such problems? Regrettably, I have to conclude that testing may never reveal all the problems in nontrivial multithreaded code.

There are a few elements that bother me here. My primary complaint is the final sentence: testing may never reveal all the problems in nontrivial multithreaded code. It implies that testing may reveal all the problems in nontrivial single-threaded code, which I believe is totally false. Testing will never reveal all the problems in any nontrivial system. Multithreaded programs are no different, but that's no reason to assign any particular menace to them.

My second complaint is this statement: "No problems were observed until the code deadlocked on April 26, 2004, four years later." Seriously? If your system had no observable defects whatsoever during a 4-year active usage period by a large and diverse group of users, then my hat is off to you. I assume what he meant is "No problems were observed that appeared to be related to threading until the code deadlocked". It would be shocking if none of these users found bugs in the UI, or errors related to pointer logic, or any of the dozen other problems that commonly occur in complex systems. I believe that the Ptolemy system was well-written and well-designed, so I am not trying to claim that it is buggy or problematic; I am simply claiming that singling out the 4-year gap before the first thread-related bug was found is misleading. I could argue that multithreading is in fact the least of their worries if the first bug was found 4 years after the release of the system. How many UI bugs were found in that period? 50? 500?

My final complaint is that papers like this are what stoke the fear burning in ordinary programmers: that they will somehow be exposed by the complexity of building a multithreaded system. Perhaps it is that programmers are ultimately control freaks, and somehow, the nondeterminism of multithreading seems more out of control than the ordinary nondeterminism of the ridiculous things human users do to every application. Whatever the reason, I recommend to all programmers that they become familiar with the tools of multithreaded coding in the same way they might learn graphics or databases, and just start writing lots and lots of multithreaded programs. This type of exposure is the only way to beat this phobia that plagues the industry.
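As a small practice exercise of the kind I mean, here is a sketch in Python of one classic deadlock and its standard remedy. To be clear about the assumptions: I have no idea whether this is the kind of deadlock that bit Ptolemy II, and the `Account`, `transfer`, and `run_demo` names are purely illustrative. Two threads that take the same pair of locks in opposite orders can each end up holding one lock while waiting for the other; acquiring locks in a single agreed-upon order removes that possibility.

```python
import threading

class Account:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always take the two locks in one global order (here: by account name),
    # regardless of the direction of the transfer. Locking src first and dst
    # second instead would let two opposite transfers block each other.
    first, second = sorted((src, dst), key=lambda account: account.name)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

def run_demo(rounds=1000):
    a, b = Account("a", 100), Account("b", 100)

    def a_to_b():
        for _ in range(rounds):
            transfer(a, b, 1)

    def b_to_a():
        for _ in range(rounds):
            transfer(b, a, 1)

    threads = [threading.Thread(target=a_to_b), threading.Thread(target=b_to_a)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a.balance, b.balance
```

Because every transfer in one direction is matched by one in the other, the balances should end where they started. Changing `transfer` to lock `src` before `dst` and running it a few times is a cheap way to see a deadlock first-hand, which is exactly the sort of exposure I am recommending.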

I have always wanted to invent a new, commonly used expression or cliche. Most of the things I come up with refer to events or states that don't happen that much, or are just not catchy enough to endure. Here are some of my creations. I apologize if I actually stole them from somewhere else, but I've seen no evidence on the web:

  • Throwing rocks down the well - There is a fable by Aesop called The Crow and the Pitcher. In short, a crow uses stones to slowly raise the level of water in a pitcher until he can drink it. It's supposed to be about ingenuity and perseverance. I always took it to be about recognizing when brute force is your only option. I use this more dramatic sounding variant to describe situations where you have been doing things in a slow, grinding way because there is simply no (known) alternative. For example, if you are trying to build a new line of business for your company, you can't just create it through a flash of insight. You have to find new customers, and convince them of your worth, etc. In short, you can only build a new business by throwing rocks down the well.
  • All '5's and 'yes's - I've purchased two cars from two different car companies. In both cases, the sales and service staff has admonished me, "<car company> will be calling you for a follow up survey. Please, please, please do not give us a rating other than a 5 or a yes; PLEASE, ALL '5's AND 'YES'S!!!", with the idea being that they need a 5 on a scale from 1 to 5 for the numerical questions, and yes on the yes or no questions. On a side note, this notion of customer satisfaction where anything besides perfection is failure is patently ridiculous, leading to the situation described; the company actually gets no feedback at all, but that's a topic for another post. Anyway, I've adapted this expression to describe a situation where you want an honest critique, but you just get superlatives or gladhanding. "I wanted her feedback on the latest design document, but she was all '5's and 'yes's."

So that brings me to my latest creation: the Check Engine Light. My old car had a little orange light that many cars have, which is generally referred to as the Check Engine light. In my limited understanding of cars, I have only a vague notion of what this light is supposed to tell you, especially with a title like "Check Engine". Basically, I was told, it is supposed to come on when one of the multitudes of sensors that track the efficiency and emission level of the engine and exhaust detects an out-of-bounds condition. So when that happens, you should dutifully take it over to the nearest official service shop and have them look at it because, at least on my car, it stores a code that indicates what was wrong.

Here's the problem: if you had the car up at highway speeds, that little light would come on for about 10 minutes and then it would turn off for 10 minutes, over and over. We'd take it to the dealer and every time it would read some random code and they couldn't find anything wrong with that code. One out of the dozens of times we had it in, they found a hose that had a leak and replaced it. Eventually, the theory was that the engine had a timing problem causing it to misfire occasionally at higher speeds, and this would trigger the light. Overall though, this car had very few problems and was an excellent vehicle, so it was just annoying. It got to the point that I began referring to it as the "Everything is Fine" light because it would come on when we were happily cruising along. I would have been worried if it didn't start its off-on cycle.

Ultimately, I understood this light though, because I have built "Check Engine" lights into software many times, and I see it in software that I use. It happens like this: you have a piece of software that is doing a fair number of complex things, and there are hundreds of possible things that can go wrong, so in general, you are better off monitoring a limited set of outputs for problems than you are putting lots of checks in the code itself. The trouble is, you are just not sure what constitutes "normal" values for outputs in every case.

A good example is something like a server thread stuck timer. It's good practice to put in a timer that waits a certain period of time before declaring a thread "stuck". Since threads can become stuck for many reasons, this is much easier than trying to detect every cause: you can just kill the thread and let someone know. The problem is, if you set the threshold too high, then threads will be stuck for a long period of time before being noticed, and if load is heavy, the system might lock up in a cascading effect. If it's too low, then jobs that take a long time might be misclassified as "stuck". So it has to be calibrated to the application, but there will always be cases of this "Check Engine" light coming on for no apparent reason.
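To make the trade-off concrete, here is a bare-bones sketch of such a stuck timer in Python. It is an illustration under assumptions, not how any particular server does it: the class name `StuckTaskMonitor`, its methods, and the injectable `clock` parameter are all hypothetical, and it only reports suspects rather than killing anything.

```python
import time

class StuckTaskMonitor:
    """Flags tasks that have been running longer than a fixed threshold."""

    def __init__(self, threshold_seconds, clock=time.monotonic):
        self.threshold = threshold_seconds
        self.clock = clock      # injectable so the monitor can be tested
        self.started = {}       # task id -> start time

    def task_started(self, task_id):
        self.started[task_id] = self.clock()

    def task_finished(self, task_id):
        self.started.pop(task_id, None)

    def stuck_tasks(self):
        """Return ids of tasks running longer than the threshold.

        A legitimately slow job looks exactly like a hung one here,
        which is the false-positive case described above.
        """
        now = self.clock()
        return sorted(task_id for task_id, start in self.started.items()
                      if now - start > self.threshold)
```

With a threshold of 30 seconds, a report job that honestly needs 45 seconds will light up this monitor every single time it runs. That is the software version of the light coming on while you are happily cruising along.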

So into your arsenal of debugging terms,  I hope you will add "Check Engine Light", defined as errors which indicate non-specific, recurring fault conditions, and which may have a reasonable cause, or may be false positives.  Or at least the next time you are poring over a log file with someone and they dismiss a stack trace with a wave of their hand and a reference to "The Check Engine Light", you'll know what they mean.

Now that I've covered the basics, I'd like to talk about using Linux with the Dash. It turned out to be a learning experience in many ways. First of all, Windows Mobile wants you to use Windows, and has a lot of elements that make it especially painful to use with any other operating system. Take, for example, application installation. In my previous dealings with Palm and Symbian OS, it was generally a matter of copying the application onto the memory card from the computer (which could be done from anywhere) and then either installing the application from the card, or copying the application from the card into the proper location on the phone or PDA file system to get it to be recognized. Windows Mobile applications appear to be distributed as a Windows executable (.exe) that you execute on your Windows machine and it installs the actual application on your PDA when you sync. So essentially, unless you can run Windows software and have ActiveSync (the synchronization app) installed, you are SOL. More on that in a second. Here are the highlights of what I've been able to do thus far:

  • I've gotten hooked up with the excellent SynCE project, which aims to provide tools for working with Windows Mobile devices on Linux. They had been supporting lots of devices, but after a long drought where there were few developers, it looks like attention has been focused on WM5. It also appears that the community is once again picking up steam. With a very active mailing list where I got a very quick response to a question, and new stuff being added every day, it looks like there may be some serious momentum for getting these devices fully supported. I'd like to offer my technical assistance too, so I'm going to dig into the code and see what's happening.
  • Using the info and tools on the SynCE site, I was able to set up my machine so that I get desktop notifications when it's plugged in and unplugged, I can list the contents of the device and copy files to and from it, and it appears that synchronization of contacts and calendar from Evolution really wants to work. Unfortunately, while both the computer and the phone think that they are exchanging information, my Evolution contacts just seem to get a bunch of blank entries. I'll post more when I get that sorted out. Word on the mailing list is that Task support is very close to being ready as well, so that would be the big three PIM applications (I don't need email sync since I've just got both the phone and my desktop using the same IMAP account).
  • In terms of the Windows-dependency for application installation described earlier, there is a tool called "Orange" on the SynCE site that can extract the .cab files, which can be installed on the phone directly, from some Windows installers, but I believe it is limited to self-extracting installers that were used in the past. It doesn't seem to handle this new breed of "Windows application as Windows Mobile Installer", which is quite frustrating. I've tried a few things such as running the installers under WINE, which works, except that they all want ActiveSync to be installed, and I've tried installing ActiveSync under WINE, which fails at the moment for reasons unknown. I think I'm going to have to temporarily resort to installing applications with my Windows machine (shudder) until I come up with something better.

So, Linux support is moving along.  I'll post again when I start getting things synchronized or if I find a solution to the Windows installer problem.

My apologies to those looking for new content here.  I am currently embroiled in a set of big projects that is occupying all my time.  Look for a bunch of new posts starting on Thursday:

- These (2.5) weeks in Debugging

- Is Multithreaded Debugging Really that Hard?

-  Fun with the Java Media Framework

I have not worked at that many different companies, although I've been in the industry for a while and I've seen the day-to-day workings of many workplaces. A recurring theme in many of these workplaces, which always surprises me, is the failure of the information technology to meet the needs of the users on a day-to-day basis. I've often asked why this is such a problem. Is it because it's hard to know what users want? Do the solutions not exist? Are they too expensive to implement? Is it not a priority?

I've concluded that businesses are not serious about IT. By not serious, I mean that they treat it as a nuisance, or that, in general, the least possible effort is expended on providing employees with the tools and technologies that would help them do their jobs. What would taking IT seriously mean?

  • Problems with the computer infrastructure would be treated like an emergency. When users go without email for 4 hours, this is often shrugged off like it was inevitable. What if the heat went off in the middle of the day for 4 hours, or the water stopped running for 4 hours? It would be just about as disruptive to business, but it would be treated like a crisis.
  • Solutions would not just be, "Don't let anyone do anything". Most big businesses seem to have an IT policy which is unconcerned with stomping on users' legitimate needs in the interest of preventing possible problems. For example, consider the prevalence of ridiculously aggressive email filters that don't let any zip file in or out. Yes, it stops a certain class of viruses from spreading, but it probably gets in the way of transmitting legitimate files 10 times for every virus stopped. Why not just not let people use computers at all? Then you'd have no computer problems.
  • Intermittent problems would be dealt with instead of glossed over. Have you ever worked at a company where a critical server went down once a week and instead of just fixing the problem, it was rebooted? That's what I'm talking about.
  • Users' requests would be treated as requirements instead of burdens. Most IT policies are a top-down affair, with some sort of group deciding which capabilities will be offered to which users, and how those services will be delivered. Unfortunately, since the people who make the policy are generally the people who have to implement it, the decisions often lean towards "easy-to-administer" instead of "good-for-users". When users complain about policies that interfere with their work, or offer alternatives for software or hardware to help them do their jobs more effectively, the message is "but that will mean more work for us!". The result is that large gains in user efficiency are sacrificed for much smaller gains in IT efficiency.

That last point is the most important. The IT staff often has a conflict of interest because they want to reduce their burden (which honestly is hard to blame them for, given that IT departments are often dramatically understaffed) but by doing so they create a suboptimal environment for the day-to-day users.  Businesses need to recognize this and provide a better mandate for their IT staff. Besides raising staffing levels and constantly trying to improve the staff, management needs to judge the IT department on user satisfaction. Those two steps would be a serious commitment to  IT, and ultimately, they can help improve morale, and make the business more competitive.

I've been reading the book Collapse: How Societies Choose to Fail or Succeed, by Jared Diamond for the last few weeks. It's a fascinating read, and I highly recommend it. One of the key points that I've taken away has to do with the correlation between degree of environmental fragility and a society's level of risk aversion. Societies living in fragile environments often make a series of missteps until they hit upon something that works, but then become incredibly resistant to change. This makes sense since, in their experience, things that they have tried have been much more likely to result in disastrous failure than in improvement. This has carried over into even modern societies that live in these environments, such as Iceland. For contrast, societies in places such as New Guinea, which have a somewhat forgiving environment, tend to seek out improvements and are constantly looking for ways to get more out of their resources. In their experience, constant tinkering tends to result in successive improvement, and changes that result in failure are rarely tragic.

What does this have to do with software engineering? In my experience, a culture of risk aversion has arisen within many successful organizations. I believe it plays out like this: an early group of risk-tolerant developers (such as in a startup) go out on a limb to do things differently, whether it's developing a new genre of software, or simply attempting a new style of development, as happened with the rise of agile methodologies. They either experience success and continue to exist, or they fail and disappear. It would seem like this results in software organizations that are built on risk-taking and would thrive in this manner, but the key is what happens next.

Slowly, this original group of risk takers retires (especially if they were made wealthy by their early success), or moves on to other projects. The people that come in to replace them are often more "serious", and are devoted to keeping the company going rather than innovation. Like the inhabitants of fragile environments, they excel at keeping a steady hand, and since they are taking over successful enterprises, they are rewarded for preventing change, since in their experience, change can lead to disaster as frequently as it can reward. Doing nothing is for them an excellent strategy; the missed opportunities won't show up on the balance sheet the way that failed attempts will.

I'm sure many of you have worked for large or even medium-sized enterprises that exhibit these symptoms, which I call Management by Saying 'No'. It's the default answer for any change in policy, any innovative program, or any request to go beyond the circumscribed activities that have brought success in the past. Since the managers who have been the best at saying 'No' are seen as "good at managing risk", they are often promoted into higher positions of power, and the idea permeates deeper.

Is this attitude justified? Is the world of software Iceland, or is it New Guinea? I would argue that companies that aren't afraid of being unorthodox, or of flying in the face of common wisdom, are big winners right now. One only has to look as far as Apple (you'll never build a business selling mp3 players!) or Google (you can't possibly earn enough revenue from advertising!) to see that it might be a little more New Guinea than many companies make it out to be. As a developer, that makes me happy, because I ultimately want to believe that software is a fertile world where, unlike architects or doctors or aerospace engineers, we have the freedom to take risks and innovate in a forgiving world. That aspect is one of the things that attracts creative people, and I fear that management by saying 'No' will squeeze this element out.

A recent issue of CACM covered software product lines in depth. The articles tended to be a glowing review of fabulous benefits being gained through the practice, and lots of technical jargon about how they do so. I should start by saying that I am extremely skeptical of one of the main proponents of SPLs, SEI, ever since I had the misfortune to experience high-level CMM first-hand. To briefly summarize, software product line development is a set of techniques or practices oriented towards building a family of related products that differ in a factorable set of ways. From the SEI page:

A software product line (SPL) is a set of software-intensive systems that share a common, managed set of features satisfying the specific needs of a particular market segment or mission and that are developed from a common set of core assets in a prescribed way.

Here's my core problem with the idea: most of the suggestions seem to boil down to: "spend some time/money making sure that you reuse code", basically, be explicit about code sharing and don't just make it an afterthought. I'm simplifying, but unfortunately, I don't think that much. From SEI's "About SPL" page:

Software product lines represent a significant departure from software reuse schemes in which attempts are made to make assets as general as possible without the context provided by an architecture and a scope definition, and from opportunistic reuse schemes in which low-payoff assets are scavenged adhoc from a reuse repository.

In other words, create an architecture that encompasses your family of products instead of trying to build totally general components, and don't just try to paste together a bunch of random code that may or may not have good reuse potential. I feel like I must be missing something because that seems to be the very basics of good architectural design as captured by the concept of extensibility, i.e. that it should be easy to extend it in the future to do the things that you might want it to do. Perhaps there is an interesting kernel related to the idea of taking into account a fundamental set of features to be mixed and matched, sort of a limited componentization, but it seems to be ignoring the key aspect of the problem. You often have absolutely no idea what a related system might look like at the time you build the original system, so you are often mostly guessing about where things might go. The SEI documents seem to admit that, and the theme of "hire a good architect" recurs throughout.

So if Software Product Lines amount to planned reuse and smart architecture, I fail to see what the actual theory or research is about beyond just reiterating things that good software engineers have been doing for years. Perhaps I have had the good fortune to work with people who take these things as tacit goals and this idea is big news to other software companies that have been writing dozens of one-off programs or throwing money at totally general frameworks, in which case, I hope they are getting value from them. To me, Software Product Lines appear to be a fuzzy restatement of clear software engineering principles, or else I must be missing something.
