The one thing that I stress over and over again when working with people fixing bugs is to look at the changes. I base this on a pretty simple fact: if it was working fine before, then it's probably working fine now. We like to anthropomorphize computers and imagine that a little homunculus is inside pulling levers and pushing electrons around, and that like humans, this little guy might zig when he should have zagged. Occasionally, computers suddenly fail in spectacularly bizarre ways, but the vast majority of the time, a human has changed something and that's why your program is suddenly failing.
What does this mean for your daily debugging practice? There are several ramifications:
- When searching for the most probable explanations, a good heuristic is to include at least one thing that has changed in every explanation. An explanation that relies on an unchanged thing failing or behaving differently than previously known is extremely improbable.
- Keeping careful track of things that have changed is of absolutely critical importance. This means not only within your own code, but across all aspects of the system. This is where distance can often cause the most problems.
- When gathering data about a bug where it is not readily clear what has changed, make establishing that fact your first goal.
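One way to make "what has changed?" a concrete question is to compare a snapshot of the system from when it last worked against a snapshot from now. The following is only a minimal sketch: the `changed_items` helper and the snapshot keys and values are hypothetical, invented for illustration, and a real system would need to record far more than three items.

```python
from typing import Optional

Snapshot = dict[str, str]


def changed_items(
    before: Snapshot, after: Snapshot
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return every key whose value differs between the two snapshots."""
    changes = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots: one taken while things worked, one taken now.
last_known_good = {"app_version": "2.3.1", "db_driver": "11.1", "os_patch_level": "7"}
failing_now = {"app_version": "2.3.1", "db_driver": "11.1", "os_patch_level": "8"}

for name, (old, new) in changed_items(last_known_good, failing_now).items():
    print(f"{name}: {old} -> {new}")
```

Anything this comparison reports is a candidate for your first explanation; anything it does not report is, by the heuristic above, a much less likely culprit.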
When looking for changes you will likely first check for software and hardware changes, and that is a reasonable place to start. However, there are two commonly overlooked sources of change: user behavior and the passage of time. User behavior may change for several reasons. It could be that there has been a customer policy change, or it may just be due to changes in their business environment. For example, let's say that your system has a comment field where users rarely enter information. Suddenly, the powers-that-be mandate that every record modification must be tagged with the name of the person making it and the date, so users begin tacking their names and dates onto the comment field, which to them is a logical place. Soon after, people are unable to save records. You might be totally stumped until you realize that you limited the comment field to 100 characters, and that limit was quickly exceeded after several user edits. This policy change had the indirect effect of creating a "bug" where there previously was none.
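The comment-field scenario can be sketched in a few lines. This is an illustration under assumptions, not the actual system: `save_record`, `edits_until_failure`, and the format of the name-and-date tag are all hypothetical; only the 100-character limit comes from the example.

```python
MAX_COMMENT_LENGTH = 100  # the limit from the example


def save_record(comment: str) -> None:
    """Hypothetical save routine that rejects over-long comments."""
    if len(comment) > MAX_COMMENT_LENGTH:
        raise ValueError(
            f"comment is {len(comment)} characters; limit is {MAX_COMMENT_LENGTH}"
        )


def edits_until_failure(tag: str) -> int:
    """Count how many tagged edits succeed before a save is rejected."""
    comment = ""
    edits = 0
    while True:
        candidate = comment + tag
        try:
            save_record(candidate)
        except ValueError:
            return edits
        comment = candidate
        edits += 1


# Hypothetical tag that each user appends under the new policy.
tag = "[J. Smith 2009-03-14] "
print(f"tag length: {len(tag)}, successful edits: {edits_until_failure(tag)}")
```

With a 22-character tag, four edits fit and the fifth is rejected. Nothing in the code changed between the first edit and the fifth; only the users' behavior did.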
In terms of business environment changes, a classic problem is that of using a numerical field (instead of a character field) for zip codes. The bug appears when zip codes that begin with '0' suddenly start getting truncated because the leading zero is meaningless in a numerical field. An interesting complication of this problem would be a deployed system for a company that has done business in nothing but California for several years, so the problem goes unnoticed. One day, they get a new client in Massachusetts and the system can't print invoices correctly for that new client. You might wonder why the system is "suddenly" failing, but looking at the change in clientele would give you the hint you need.
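Here is a small sketch of the zip code problem. The two functions are hypothetical stand-ins for a numeric database column and the invoice printer; the zip codes are just sample values from each state.

```python
def store_zip_as_number(zip_code: str) -> int:
    """Stand-in for writing a zip code into a numeric column."""
    return int(zip_code)


def print_zip_from_number(stored: int) -> str:
    """Stand-in for the invoice printer reading the numeric column back."""
    return str(stored)


def round_trip(zip_code: str) -> str:
    return print_zip_from_number(store_zip_as_number(zip_code))


california = "94105"     # no leading zero, survives the round trip
massachusetts = "02139"  # leading zero is lost once stored as a number

for original in (california, massachusetts):
    printed = round_trip(original)
    status = "ok" if printed == original else "CORRUPTED"
    print(f"{original} -> {printed} ({status})")
```

The California zip code survives the round trip, which is why the flaw can sit unnoticed for years. The Massachusetts one comes back as a four-digit value the first time it is stored.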
The passage of time can be much trickier to notice and catch. It often exposes hidden assumptions or pushes a system past a limit after a set period, and we fail to realize the significance of that period. The classic example of this type of bug is the much-discussed Y2K bug, which wasn't a bug until enough time had passed. This type of problem crops up more often than you would think. I once worked on a system that suddenly and completely failed after (I now know) exactly 4 months. The cause turned out to be the logging system, which I had designed around Oracle's partitioning feature. Partitioning lets the database divide up a table's data based on a criterion such as the value of a date column. This allows for easier maintenance in terms of log rolling because you can drop individual partitions without having to take the entire table offline. It turns out that I had made a mistake in designing the table, though: it had only 4 partitions, each holding a month's worth of data. I had assumed that we would institute a policy of rolling logs within that 4-month window, thereby avoiding the issue, but in the fog of those early months after a new deployment, it was totally forgotten. Of course, once that magic date was exceeded, Oracle threw up its hands saying "I have no place to put this data!" and that brought the whole system to a screeching halt. While the fix of putting in an additional "everything else" partition was no trouble, it certainly was not my finest hour.
Now that I have discussed some problems with debugging, and one powerful heuristic for locating bugs quickly, tomorrow I'll begin discussing the actual process of debugging itself, from bug report to deployed fix.
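The partition failure can be modeled in plain Python. To be clear, this is not Oracle syntax or Oracle's implementation; it is a toy model of date-range partitioning in which the class, the partition names, and the dates are all assumptions made for illustration.

```python
from datetime import date


class NoPartitionError(Exception):
    """Raised when a row's date falls outside every partition."""


class PartitionedLog:
    """Toy model of a log table partitioned by date range."""

    def __init__(self, upper_bounds: list[date], catch_all: bool = False) -> None:
        # Each partition holds rows dated strictly before its upper bound.
        self.upper_bounds = sorted(upper_bounds)
        self.catch_all = catch_all

    def partition_for(self, row_date: date) -> str:
        for index, bound in enumerate(self.upper_bounds):
            if row_date < bound:
                return f"p{index + 1}"
        if self.catch_all:
            return "p_everything_else"
        raise NoPartitionError(f"no partition for {row_date}")


# Four monthly partitions, January through April (hypothetical dates).
bounds = [date(2009, 2, 1), date(2009, 3, 1), date(2009, 4, 1), date(2009, 5, 1)]
log = PartitionedLog(bounds)
fixed_log = PartitionedLog(bounds, catch_all=True)

print(log.partition_for(date(2009, 4, 30)))        # last day that still fits
print(fixed_log.partition_for(date(2009, 5, 1)))   # the "everything else" fix
try:
    log.partition_for(date(2009, 5, 1))            # the "magic date"
except NoPartitionError as error:
    print(f"insert failed: {error}")
```

In this model the table works for every insert until the first day of the fifth month, then rejects every insert after it. No code or hardware changed on that day; only the date did.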
Tomorrow: The Bug Report
