Distance Debugging

Often in the course of a longer bug investigation, you will discover one or more errors or problems that appear to be unrelated to the original bug. You may be tempted to make a note of these problems and get back to the problem you were working on. I would instead advise that you fix what you find. Here is why:

  • Unless you are absolutely certain you know what the problem is, you can't know that the problem you found has no bearing on your primary issue. It might be contributing to the problem, hiding the problem, creating a first-order problem for which your problem is a second-order effect, etc. I am shocked at how often fixing what appears to be an innocuous or seemingly totally unrelated bug will reveal critical information about or even fix the bug I started with.
  • Sometimes the process of fixing the found bug will refresh your memory about some other section of code or otherwise give you a mental break from the primary bug, and that can often trigger new approaches, new ideas, or make you look at something you hadn't previously considered looking at in the original investigation.
  • You've got the bug state "swapped in" at the time you find the bug. In other words, the information about what is wrong, how you found it, how you demonstrated the problem, etc., is all in your working memory. Even if you do a good job of recording all the details, chances are that you will need to spend some time swapping this information back in and walking through the code when you come back to it, so it's inefficient to come back later.
  • As long as you have another known bug that turned up in the course of the investigation, you can never rule it out as a possible factor. This might not seem like a big deal, but speaking from experience, nothing is more annoying than exhausting a series of fixes that have no effect on a bug while having a gnawing feeling in the back of your mind that the secondary bug you found is really to blame, for reasons not currently clear. Leaving this "trapdoor" in your reasoning unnecessarily complicates your investigation.

Fixing what you find will lead to better code overall, and I guarantee that you will save time and energy in the long run.

A recent post on Slashdot discussed Microsoft's response to a group that used a fuzzer to discover several of what they call denial-of-service exploits in Word 2007 (since that term is somewhat loaded, I am reserving judgment). Microsoft responded that crashing on these ill-formed, unparsable inputs was in fact the correct behavior, and so it constituted a feature and not a bug. The follow-up debate on Slashdot was, as usual, enlightening and frustrating all at once. There was the usual Microsoft bashing, a lot of discussion about bugs vs. features, and quite a bit of argument about what constitutes good programming and what standard of perfection a program should aspire to.

The core of the debate is a big question: when is crashing the right thing to do? On the one hand, you have many people arguing that crashing is never the correct behavior, since a crash implies that the program attempted to do something bad or illegal. On the other hand, a well-respected book like "The Pragmatic Programmer" contains an extended discussion ("Dead Programs Tell No Lies") of why simply crashing, instead of allowing second-order data corruption or other damage to occur, is a Good Thing. Ultimately, the answer is probably somewhere in between.

There are three real issues that need to be addressed when something goes wrong: notification, consistency, and longevity. Notification to users or developers that something went wrong should be orthogonal to the question of crashing or not crashing. Certainly crashing tells people that something went wrong, but it's not a very useful message, especially if there is no good reason to crash. I've covered the topic of silent error-handling ad nauseam in this space, so my viewpoint should be clear, but as long as your strategy of notification is not dependent on crashing, you should be fine.

Consistency is trickier. In a perfect world, every system you build would be able to meet the Consistency requirement of an ACID-type transactional system. In other words, you could at any point choose to begin tracking your system state, and if something went wrong, perfectly undo the sequence of operations that it took to get there. If you could make that guarantee, you could eliminate the concerns of the Pragmatic Programmer since you could say, "It doesn't matter if I let things continue after an error, because I know I've rolled back to the state before I started to do the thing that failed". This is unrealistic for several reasons. First, many operations have no good notion of undo. Second, you would likely have to graft a very sophisticated set of nested transactions onto your system, imposing a massive overhead in performance, and possibly code complexity. Third, fail or don't fail is not really a valid outcome for most systems. For instance, you couldn't build a browser where the system either successfully rendered the entire page, or refused to render anything. Graceful degradation in the face of invalid inputs is also desirable, even if it opens up the possibility of leaving things in an inconsistent state.

Finally, there is the issue of longevity. Crashing is simply not an option for some systems. I have built servers for which "it got an error and exited" is not a valid outcome, and in general, no one wants their applications crashing all the time, even for semi-valid reasons of prevention of data corruption.

So are there any good rules for when to crash? Crashing seems like the right option under the following three conditions:

  1. The problem is fatal - Set the bar high. Crashing should equal program panic, not program annoyed or program concerned.
  2. A human must take some action to correct the problem - It can be hard to know when this is necessarily true, but if the condition encountered does not appear to be programmatically addressable, and it is fatal, crashing makes sense. For instance, many systems require configuration files to exist, be well-formatted, and live in certain places on disk. If they are missing, it's often better to crash than to try to come up with sensible defaults, which often just serves to confuse the user who can't understand why their configuration is being "ignored" when they put it in the wrong place.
  3. Rollback is impossible or meaningless - If the application's best effort to contain the damage has failed, or the extent of the problem is totally unknown, or rolling back would mean rolling back to essentially nothing, and it meets the above conditions, it's time to crash.
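
The three conditions above can be read as a single predicate. Here is a minimal sketch; the `Failure` type, its field names, and the two example failures are hypothetical and only serve to make the rule concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical description of an error condition."""
    fatal: bool              # the program cannot do its job at all
    needs_human: bool        # no programmatic fix is available
    rollback_possible: bool  # a known-good earlier state can be restored


def should_crash(failure: Failure) -> bool:
    """Crash only when all three conditions hold at once."""
    return (
        failure.fatal
        and failure.needs_human
        and not failure.rollback_possible
    )


# A required configuration file is missing at startup: fatal, only a human
# can put it in place, and there is no earlier state to roll back to.
missing_config = Failure(fatal=True, needs_human=True, rollback_possible=False)

# One request to a long-running server failed: annoying, not fatal.
bad_request = Failure(fatal=False, needs_human=False, rollback_possible=True)

print(should_crash(missing_config))  # True
print(should_crash(bad_request))     # False
```

Note that notification is deliberately absent from the predicate: whether to tell someone is a separate decision from whether to crash.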

So does the Word 2007 crash meet these criteria?

  1. Fatal? Sort of. The malformed doc means that Word's only raison d'être, to load a document for display and editing, is impossible.
  2. Human-addressable? Yes, this one it clearly meets. Word can't possibly know what the "intent" of the document was in order to repair it programmatically.
  3. Rollback Failure? No, in this case there is a clear rollback position which is to restore the application to the state it was in before the load was attempted. Taking the whole application down in this situation is overkill, to put it mildly.
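
To make the rollback position in point 3 concrete, here is a rough sketch of a loader that parses into a local value and only touches application state on success. It says nothing about how Word is actually built; `Editor`, `parse`, `MalformedDocument`, and the toy `DOC:` header format are all made up for illustration.

```python
class MalformedDocument(Exception):
    """Raised by the (hypothetical) parser when input cannot be parsed."""


def parse(raw: bytes) -> str:
    """Stand-in parser: accepts only UTF-8 text that starts with 'DOC:'."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise MalformedDocument("not valid UTF-8") from exc
    if not text.startswith("DOC:"):
        raise MalformedDocument("missing DOC: header")
    return text[len("DOC:"):]


class Editor:
    def __init__(self):
        self.document = ""   # the currently open document
        self.messages = []   # notifications shown to the user

    def load(self, raw: bytes) -> bool:
        """Parse first; commit to editor state only if parsing succeeded."""
        try:
            parsed = parse(raw)
        except MalformedDocument as exc:
            # Notify the user, leave self.document untouched, keep running.
            self.messages.append(f"Could not open document: {exc}")
            return False
        self.document = parsed
        return True


editor = Editor()
editor.load(b"DOC:hello")         # succeeds, document is now "hello"
editor.load(b"\xff\xfe garbage")  # fails, previous document is kept
print(editor.document)            # hello
print(len(editor.messages))       # 1
```

Because the half-parsed result never escapes `load`, the state before the load attempt is the rollback position, and the failure is still reported rather than swallowed.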

So in my opinion, the Word 2007 crashes fail to meet the necessary standard. Ultimately, the problem with crashing has to do with intent. It seems fairly clear that the Microsoft rep is claiming that the behavior was something they put in intentionally, when in fact it was very likely totally unexpected. If you are going to adopt a crash-sometimes-ok policy, you need to make sure you only crash under the right conditions and not because you simply failed to build a robust system.

How is task management like gambling? There is a direct connection in selecting a set of tasks to fill a timebox. You are essentially "betting" a certain number of work hours, and therefore dollars, will be enough to complete a particular task. Let's leave aside for the moment the question of whether that task will result in something of value and assume that every completed task produces an amount of value that is linear with the cost of execution. That way we can say that 8 hours of effort that results in a completed task is worth 8 units of value.

The problem is, we don't know a priori how many hours will be necessary to complete a task, which is where the risk comes in. In general, we hope that increasing the number of hours spent on a task increases the likelihood of completion, although that isn't guaranteed. Standard project estimation techniques rely on high-low ranges such as "Task 8 will take from 2-8 hours to complete". What we might really be saying is, "There is a 10% chance that this will only take me 2 hours, and a 90% chance it will take me less than 8 hours". The percentages change from task to task, and it could be 90% less than 2 hours and 95% less than 8 hours for another task, which is why the high-low estimate leaves out a lot of information.

I like to turn this around and state the range as: "If you give me 2 hours, there is a 10% chance that I will complete the task, and if you give me 8 hours, there is a 90% chance I will complete it". That's nice from a management standpoint, because the number of hours you have is generally fixed, so you can look across your tasks and decide how high you want to raise the percentage for any given task. The output of a timebox, then, is a set of completion probabilities. Ideally, you group things by low, medium, and high probability of success in order to get a sense of what is likely to be the state of your system at the end of the timebox.
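
As a rough sketch of what "a set of completion probabilities" might look like, the snippet below turns two (hours, probability) points per task into a probability for the hours actually allotted. The linear interpolation between the two points, the clamping outside them, the bucket thresholds, and the numbers for the second and third tasks are all assumptions made for illustration, not part of the method described above.

```python
def completion_probability(hours, low, high):
    """Estimate the chance of finishing within `hours`.

    `low` and `high` are (hours, probability) pairs such as (2, 0.10) and
    (8, 0.90). Linear interpolation between them is an assumption of this
    sketch; outside the range the nearest endpoint is used.
    """
    (h0, p0), (h1, p1) = low, high
    if hours <= h0:
        return p0
    if hours >= h1:
        return p1
    return p0 + (p1 - p0) * (hours - h0) / (h1 - h0)


def bucket(probability):
    """Group a probability as low, medium, or high (arbitrary thresholds)."""
    if probability < 1 / 3:
        return "low"
    if probability < 2 / 3:
        return "medium"
    return "high"


# task name -> (hours allotted in this timebox, low point, high point)
timebox = {
    "Implement Feature X": (5, (2, 0.10), (8, 0.90)),
    "Fix Bug Y": (4, (1, 0.50), (4, 0.95)),
    "Refactor API section Z": (8, (8, 0.10), (24, 0.90)),
}

report = {
    name: bucket(completion_probability(hours, low, high))
    for name, (hours, low, high) in timebox.items()
}
print(report["Implement Feature X"])     # medium
print(report["Fix Bug Y"])               # high
print(report["Refactor API section Z"])  # low
```

Moving hours from one task to another and re-running the report shows directly which probabilities you are raising and which you are giving up.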

What about when you don't have any idea how many hours are needed for a particular probability of success? That's where the metatasks that involve improving your task estimates come in. Whether or not you think of these estimation tasks as metatasks, most projects do quite a bit of them. Processes such as CMM are oriented around helping you provide the best estimates possible. But how much are better estimates worth?

Let's say you have a task that you think will take between 8 and 24 hours to perform. How much is it worth to you to know that it will actually take between 6 and 15 hours (reducing the max-to-min ratio from 3 to 2.5)? It depends on the cost of a bad estimate. The costs of estimating too low include deadline slippage, loss of trust from stakeholders, and overwork of employees trying to use heroics to meet impossible estimates, to name a few. Estimating too high has other costs, the main one being that tasks tend to fill the space allocated to them, so high estimates lead to developers spending unnecessary time on things. Even if this effect is tempered, there are other problems, such as having to constantly rebalance to add more tasks to a timebox, and simply looking less efficient than other comparable groups.

Your decision about whether or not to improve your estimate depends on how expensive you believe over- and underestimation to be. In my experience, underestimation is significantly more costly than overestimation, so I like to improve my estimates when they look suspiciously low, since that tends to result in the most benefit. Overall, I don't like to spend more than about a quarter of the low-estimate time on better estimation, since you are usually better off spending that time on doing the work.
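
One way to put numbers on this trade-off is to weight the hours you miss by in each direction. The sketch below is only an illustration: the cost weights (underestimation three times as costly as overestimation) and the three-outcome distribution for the 8-to-24-hour task are invented, and only the quarter-of-the-low-estimate cap comes from the text.

```python
def expected_cost(estimate, outcomes, under_cost=3.0, over_cost=1.0):
    """Expected cost of committing to `estimate` hours.

    `outcomes` is a list of (actual_hours, probability) pairs. Each hour the
    task runs past the estimate costs `under_cost`; each hour of padding
    costs `over_cost`. Both weights are assumptions of this sketch.
    """
    total = 0.0
    for actual, probability in outcomes:
        shortfall = max(actual - estimate, 0)  # we underestimated
        padding = max(estimate - actual, 0)    # we overestimated
        total += probability * (under_cost * shortfall + over_cost * padding)
    return total


def estimation_budget(low_estimate_hours, fraction=0.25):
    """Cap on hours spent improving an estimate: a quarter of the low end."""
    return low_estimate_hours * fraction


# Invented distribution for a task estimated at 8 to 24 hours.
outcomes = [(8, 0.25), (16, 0.50), (24, 0.25)]

print(expected_cost(8, outcomes))   # 24.0
print(expected_cost(24, outcomes))  # 8.0
print(estimation_budget(8))         # 2.0
```

With these weights, committing to the low end is three times as expensive as committing to the high end, which is why a suspiciously low estimate is the one worth spending the two-hour budget on.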

Integrating these estimate-improvement tasks into your timebox alongside other metatasks, plus adding success probabilities to every element, means that you can conceive of your outcome as "what we will likely have done, and what we will likely know". Making this switch can often help you communicate better with stakeholders about what you are accomplishing, where you see the risks, and where you are taking your gambles.

When you think of generating a list of tasks for a particular timebox, you generally think of things that are construction-oriented. For instance:

  • Implement Feature X
  • Fix Bug Y
  • Refactor API section Z

which all involve actually constructing or changing a piece of code. However, there are a bunch of other tasks that are process-oriented, which are usually verbally stated in the minutes of meetings or implicitly executed by the management team, but which show up on no formal task list.

For some reason, software managers (and maybe managers in general?) have an aversion to commingling development and process tasks in a single planning system. While some management tasks don't make sense in this way, like ongoing activities or recurring reporting, I've noticed that by keeping everything in one place and by using some simple metatask designations (tasks that involve working with other tasks instead of with code), the schedule becomes much more transparent and manageable.

Here are a few metatask designations that I commonly use:

  • Task Scope Analysis - This is the general heading for tasks whose outcome is simply more knowledge about the scope of another task. There are a few subtypes of this:
    • Increase Estimate Quality - The most generic form of analysis attempts to take a task that has a wildly varying or low-confidence estimate and either narrow the estimate range, or increase your confidence level.
    • Cap Scope - This is a very specific kind of task that involves taking a broadly-defined task (such as a typical Chop) and breaking out a more limited set of well-defined and time-capped (i.e. this must take no more than 2 hours) tasks. This metatask is useful when you have a strongly fixed amount of time, but nebulous tasks that need to be tightly managed.
    • Go/No-Go - Many development tasks are not required for the success of a project, and as any software manager will tell you, you spend much of your time figuring out what you don't need to do. This metatask makes the work that goes into those decisions explicit.
  • Triage - This idea should be well-known to most managers, but it's rarely explicitly stated. In short, going through some set of tasks and organizing them by priority.
  • Schedule - We generate schedules all the time, but we rarely put "generate the next schedule" as an item on our current schedule. It is implied that by the time one timebox finishes, the next one will be ready to go.
  • Balance/Rebalance - During a timebox, do two things: 1) look at each developer's task list, determine if they have too much, not enough, etc., and redistribute tasks as necessary; 2) if you are overscheduled, knock some tasks out of the timebox, and if you are underscheduled, bring in some tasks from the on-deck circle.
  • Purge - I don't know if I've ever seen this as an explicit task, but the idea is to go over your task list and just get rid of tasks. This can be for many reasons: it was a dupe of something that's already done, the task is OBE (overtaken by events) but was never discarded, or the task is so poorly described or understood that it will never get scheduled in its current form.
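
To show how metatasks can sit in the same list as construction tasks, here is a small sketch of a task model plus the second half of Balance/Rebalance. The `Task` fields, the `Kind` names, the hour figures, and the greedy fill from the on-deck list are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    CONSTRUCTION = "construction"   # implement, fix, refactor
    SCOPE_ANALYSIS = "scope analysis"
    TRIAGE = "triage"
    SCHEDULE = "schedule"
    BALANCE = "balance"
    PURGE = "purge"


@dataclass
class Task:
    name: str
    hours: float
    priority: int            # 1 is most important
    kind: Kind = Kind.CONSTRUCTION
    targets: tuple = ()      # names of the tasks a metatask works on


def rebalance(timebox, on_deck, capacity_hours):
    """Fit a timebox to its capacity; returns (timebox, on_deck)."""
    timebox = sorted(timebox, key=lambda t: t.priority)
    on_deck = sorted(on_deck, key=lambda t: t.priority)
    # Overscheduled: knock the least important tasks out of the timebox.
    while timebox and sum(t.hours for t in timebox) > capacity_hours:
        on_deck.insert(0, timebox.pop())
    # Underscheduled: bring in on-deck tasks for as long as they fit.
    used = sum(t.hours for t in timebox)
    remaining = []
    for task in on_deck:
        if used + task.hours <= capacity_hours:
            timebox.append(task)
            used += task.hours
        else:
            remaining.append(task)
    return timebox, remaining


timebox = [
    Task("Implement Feature X", 16, priority=1),
    Task("Fix Bug Y", 8, priority=2),
    Task("Refactor API section Z", 12, priority=3),
]
on_deck = [
    Task("Go/No-Go: Refactor API section Z", 2, priority=4,
         kind=Kind.SCOPE_ANALYSIS, targets=("Refactor API section Z",)),
]

scheduled, deferred = rebalance(timebox, on_deck, capacity_hours=30)
print([t.name for t in scheduled])
# ['Implement Feature X', 'Fix Bug Y', 'Go/No-Go: Refactor API section Z']
print([t.name for t in deferred])
# ['Refactor API section Z']
```

In this example the refactoring no longer fits, but the two-hour metatask that decides whether to do it at all does, so the timebox still produces knowledge about the deferred work.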

Using metatasks has many benefits. I've already mentioned the transparency aspect. To me, the biggest benefit is the transformation of a timebox outcome from simply "we added these features, and fixed these bugs" to "we added these features, and fixed these bugs, and acquired this knowledge". In many cases, gaining the knowledge of the scope of a task is as valuable as the task itself. Without a metatask, we are forced to schedule the task directly, and treat the analysis of its scope as part of the task. Another benefit is the ability to schedule when information will become available. A great example is the Go/No-Go metatask. In timebox N, you schedule a series of Go/No-Go metatasks on a handful of tasks that you might schedule in timebox N + 1. This guarantees that by the time you need to add the task or tasks to the schedule, the information about which one or ones you plan to do is already available to you.

With or without metatasks, the problem of how much time is spent analyzing a task or set of tasks, versus doing them, is a constant struggle, and is the topic of the next post.

Next: Task Management III: Task Management as Gambling

Well, I've made it to 100 in only 6 months. I've installed a new plugin called Bad Behavior, that is supposed to block SPAM bot access attempts, which I guess appear as distinctly different accesses from regular folks. If you get blocked, I apologize, but I've been forced to look through 100+ comment SPAM messages a day in Akismet and I'm more afraid that some legitimate content will get flushed accidentally at this point, so I've stepped it up a notch. Time will tell if this solution meets its promises.

I've been thinking a lot about software project management lately, specifically all the pieces that they seem to leave unspecified. These next few posts attempt to fill in the gaps with some of my experiences and thoughts.

One discussion that is conspicuously absent from most, if not all, software project management methodologies is the question of task scoping. The two agile methodologies with which I'm most familiar, XP and Scrum, both have notions of organizing tasks into timeboxes based on priority and estimated time (i.e. do the most important things in the current timebox, and select a set of tasks that will fit into that box). When I've tried to apply these ideas in practice, I'm always left with the same question: how do I handle tasks with a totally unknown or wildly varying estimate? Most tasks can be scoped as some value multiplied or divided by 2, but some tasks have a range of 10 or even 100 to 1 from the high estimate to the low estimate. I've seen too many projects where the desire to get something done and fit it into a timebox causes people to lowball estimates or put an arbitrary cap on the high estimate in order to make it work, thereby defeating the point of the timebox.

I've developed a simple system that uses basic task classifications and metatasks (more on this tomorrow) to handle this issue without sacrificing good estimation or timeboxing. To begin with, every task is given a designation: Chop, Craft, or File. Imagine you are making a piece of sculpture from a block of stone. When you begin, you must "chop" large sections of the block away in order to approximate the shape of the final figure. Once that is done, you execute the more artistic and careful "crafting" of the actual figure, including the shapes of the arms and legs, the torso, and face. Once a particular portion of the figure has been crafted, it must be fine-tuned through controlled, meticulous "filing" such as creating fingernails, adding detail to the hair, and so on.

These steps can be translated directly into software tasks:

  • Chop tasks, which tend to cluster towards the beginning of a project, are the more open-ended infrastructure design and construction tasks that provide the foundation for the main system. They are the tasks that most methodologies seem to pretend don't exist, but which lead to the most headaches for projects because they are so variable in scope.
  • Craft tasks are the most fun, in general, because they tend to be clearly scoped and deliver clear functionality. Craft tasks generally produce the most code per unit time because developers can crank out features, building on top of what was "chopped" out earlier.
  • File tasks are the least fun, in general, because they are all the little annoying and often tedious things that separate a quick-and-dirty prototype that has been crafted but never filed from a real usable system. These are tasks like "Fix the logic that disables the buttons at the correct time" or "Make sure every database connection error is properly reported". They can be especially unpopular because the result is often mostly invisible or rarely encountered, leaving developers little to show for their time other than a more robust or usable system. File tasks also consume a massive amount of time on any project. The 80/20 rule (or whatever split you may use as a rule of thumb) results from the fact that Craft tasks churn out so much code in so little time that they give you a false sense of the remaining work, most of which is time-sink File tasks.
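
The designations can be attached to tasks as plain data, together with the high-to-low estimate range discussed earlier. The sketch below is one possible encoding: the `Task` fields, the example hours, the "Design the persistence layer" task, and the threshold of 4 (a value multiplied or divided by 2 spans a 4-to-1 range) are assumptions, not part of the system itself.

```python
from dataclasses import dataclass
from enum import Enum


class Designation(Enum):
    CHOP = "chop"    # open-ended infrastructure work
    CRAFT = "craft"  # clearly scoped feature work
    FILE = "file"    # tedious finishing work


@dataclass
class Task:
    name: str
    designation: Designation
    low_hours: float
    high_hours: float

    def ratio(self) -> float:
        """High estimate divided by low estimate."""
        return self.high_hours / self.low_hours


def wildly_varying(task: Task, threshold: float = 4.0) -> bool:
    """Flag tasks whose range is wider than 'a value times or over 2'."""
    return task.ratio() > threshold


tasks = [
    Task("Design the persistence layer", Designation.CHOP, 8, 80),
    Task("Implement Feature X", Designation.CRAFT, 6, 12),
    Task("Fix the logic that disables the buttons at the correct time",
         Designation.FILE, 2, 4),
]

for task in tasks:
    print(task.designation.value, task.ratio(), wildly_varying(task))
# chop 10.0 True
# craft 2.0 False
# file 2.0 False
```

A flagged task is one that should not be dropped into a timebox as-is, whatever its designation turns out to be.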

There is a second problem though, which is that you often don't know a priori what kind of task you actually have. Sometimes a Craft will look like a Chop, or a File like a Craft, or vice-versa. That's where the introduction of tasks to help you understand and organize your tasks comes into play.

Tomorrow: Metatasks