A recent post on Slashdot discussed Microsoft's response to a group that used a fuzzer to find several of what they call denial-of-service exploits in Word 2007 (since that term is somewhat loaded, I am reserving judgment). Microsoft responded that crashing on these ill-formed, unparsable inputs was in fact the correct behavior, and so it constituted a feature and not a bug. The follow-up debate on Slashdot was, as usual, enlightening and frustrating all at once. There was the usual Microsoft bashing, a lot of discussion about bugs vs. features, and quite a bit of argument about what constitutes good programming and what standard of perfection a program should aspire to.
The core of the debate is a big question: when is crashing the right thing to do? On the one hand, you have many people arguing that crashing is never the correct behavior, since a crash implies that the program attempted to do something bad or illegal. On the other hand, a well-respected book like "The Pragmatic Programmer" contains an extended discussion ("Dead Programs Tell No Lies") of why simply crashing, instead of allowing second-order data corruption or other damage to occur, is a Good Thing. Ultimately, the answer is probably somewhere in between.
There are three real issues that need to be addressed when something goes wrong: notification, consistency, and longevity. Notification to users or developers that something went wrong should be orthogonal to the question of crashing or not crashing. Certainly crashing tells people that something went wrong, but it's not a very useful message, especially if there is no good reason to crash. On the other hand, I've covered the topic of silent error handling ad nauseam in this space, so my viewpoint should be clear. As long as your notification strategy does not depend on crashing, you should be fine.
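To make "orthogonal" concrete, here is a minimal sketch in Python, assuming a hypothetical `handle_failure` helper and an in-memory log handler that stands in for whatever notification channel you actually use. The point is only that the report happens first and unconditionally, and the decision to exit is made separately.

```python
import logging


class ListHandler(logging.Handler):
    """Illustrative handler that keeps messages in memory instead of a real sink."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


log = logging.getLogger("app.errors")  # hypothetical logger name
log.setLevel(logging.ERROR)
log.propagate = False
handler = ListHandler()
log.addHandler(handler)


def handle_failure(exc: Exception, fatal: bool) -> None:
    """Always notify; crash only if the caller has judged the failure fatal."""
    log.error("operation failed: %s", exc)  # notification never depends on crashing
    if fatal:
        raise SystemExit(1)  # the crash decision is a separate step


# A non-fatal failure is reported and the program carries on.
handle_failure(ValueError("bad input"), fatal=False)
```

Whether the second argument is ever `True` is exactly the question the rest of this post is about.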
Consistency is trickier. In a perfect world, every system you build would be able to meet the atomicity and consistency requirements of an ACID-type transactional system. In other words, you could at any point choose to begin tracking your system state, and if something went wrong, perfectly undo the sequence of operations that it took to get there. If you could make that guarantee, you could eliminate the concerns of the Pragmatic Programmer, since you could say, "It doesn't matter if I let things continue after an error, because I know I've rolled back to the state before I started to do the thing that failed". This is unrealistic for several reasons. First, many operations have no good notion of undo. Second, you would likely have to graft a very sophisticated set of nested transactions onto your system, imposing a massive overhead in performance, and possibly in code complexity. Third, a binary fail-or-don't-fail outcome is not really valid for most systems. For instance, you couldn't build a browser where the system either successfully rendered the entire page or refused to render anything. Graceful degradation in the face of invalid inputs is also desirable, even if it opens up the possibility of leaving things in an inconsistent state.
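For a sense of what the "perfect world" version looks like in the small, here is a rough Python sketch, assuming the entire state fits in one in-memory dict. The `rollback_on_error` name is made up for illustration. It also hints at why this does not scale: the snapshot copies everything up front, and it can do nothing about side effects outside the dict, such as a file already written or a message already sent.

```python
import copy
from contextlib import contextmanager


@contextmanager
def rollback_on_error(state: dict):
    """Snapshot `state` on entry; restore it in place if the body raises."""
    snapshot = copy.deepcopy(state)  # full copy: simple, but costly for large state
    try:
        yield state
    except Exception:
        state.clear()
        state.update(snapshot)  # undo every change made inside the block
        raise  # still report the failure to the caller


settings = {"theme": "light", "recent_files": ["a.doc"]}
try:
    with rollback_on_error(settings) as s:
        s["theme"] = "dark"
        s["recent_files"].append("b.doc")
        raise ValueError("simulated failure halfway through")
except ValueError:
    pass

# settings is back to its pre-block contents here.
```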
Finally, there is the issue of longevity. Crashing is simply not an option for some systems. I have built servers for which "it got an error and exited" is not a valid outcome, and in general, no one wants their applications crashing all the time, even for semi-valid reasons such as preventing data corruption.
So are there any good rules for when to crash? Crashing seems like the right option under the following three conditions:
- The problem is fatal - Set the bar high. Crashing should equal program panic, not program annoyed or program concerned.
- A human must take some action to correct the problem - It can be hard to know when this is necessarily true, but if the condition encountered does not appear to be programmatically addressable, and it is fatal, crashing makes sense. For instance, many systems require configuration files to exist, be well-formed, and live in certain places on disk. If they do not, it's often better to crash than to try to come up with sensible defaults, which often just serves to confuse the user, who can't understand why their configuration is being "ignored" when they put it in the wrong place.
- Rollback is impossible or meaningless - If the application's best effort to contain the damage has failed, or the extent of the problem is totally unknown, or rolling back would mean rolling back to essentially nothing, and it meets the above conditions, it's time to crash.
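The three conditions above can be written down as a single predicate. This is only a sketch of the rule as I have stated it, in Python; the `Failure` type and its field names are invented for illustration, and deciding the value of each field is of course the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Failure:
    """Hypothetical description of an error condition."""
    fatal: bool          # panic, not annoyance or concern
    needs_human: bool    # nothing the program can do to fix it
    can_roll_back: bool  # a meaningful earlier state exists and is reachable


def should_crash(failure: Failure) -> bool:
    """Crash only when all three conditions hold at once."""
    return failure.fatal and failure.needs_human and not failure.can_roll_back


# Required configuration file is missing at startup: there is nothing to roll back to.
missing_config = Failure(fatal=True, needs_human=True, can_roll_back=False)

# A single request failed, but the server's prior state is intact.
bad_request = Failure(fatal=False, needs_human=False, can_roll_back=True)
```

With these inputs, `should_crash(missing_config)` is true and `should_crash(bad_request)` is false.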
So does the Word 2007 crash meet these criteria?
- Fatal? Sort of. The malformed doc means that Word's only raison d'être, to load a document for display and editing, is impossible.
- Human-addressable? Yes, this one it clearly meets. Word can't possibly know what the "intent" of the document was in order to repair it programmatically.
- Rollback Failure? No. In this case there is a clear rollback position: restore the application to the state it was in before the load was attempted. Taking the whole application down in this situation is overkill, to put it mildly.
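The rollback position for a failed load is easy to sketch. The following Python toy is not a claim about how Word is built; the `Editor` class, the `parse_document` function, and the `DOC:` header format are all invented for illustration. The idea is just to parse into a temporary value and commit it only on success, so a malformed file leaves the application exactly where it was.

```python
from typing import Optional


class MalformedDocument(Exception):
    """Raised by the toy parser when the input cannot be understood."""


def parse_document(data: bytes) -> str:
    # Hypothetical format: a valid document starts with the bytes b"DOC:".
    if not data.startswith(b"DOC:"):
        raise MalformedDocument("missing DOC: header")
    return data[4:].decode("utf-8")


class Editor:
    def __init__(self) -> None:
        self.current: Optional[str] = None
        self.last_error: Optional[str] = None

    def open(self, data: bytes) -> bool:
        """Try to load a document; on failure keep the previous one open."""
        try:
            parsed = parse_document(data)  # nothing is committed yet
        except (MalformedDocument, UnicodeDecodeError) as exc:
            self.last_error = str(exc)  # notify, but do not take the app down
            return False
        self.current = parsed  # commit only after a successful parse
        self.last_error = None
        return True


editor = Editor()
editor.open(b"DOC:hello")     # loads normally
editor.open(b"\x00garbage")   # rejected; "hello" is still the open document
```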
So in my opinion, the Word 2007 crashes fail to meet the necessary standard. Ultimately, the problem with crashing has to do with intent. It seems fairly clear that the Microsoft rep is claiming that the behavior was something they put in intentionally, when in fact it was very likely totally unexpected. If you are going to adopt a crash-sometimes-ok policy, you need to make sure you only crash under the right conditions and not because you simply failed to build a robust system.