Each type of data collection comes with its own risks and problems that you need to keep in mind:
- Probe Execution - While it tends to be the most reliable of the techniques, it is also the one least often available to you, and it tends to give you very little information. It is only available if you have all the parts of the system handy, and in the case of something like a debugger, it is only really useful if you have the source code. The other problem is that if you step through 1000 mind-numbing lines of code looking for a slight divergence, you are likely to give up before you find it. It ends up being too much of a fishing expedition. However, once you have narrowed the problem down to a specific region or a specific problematic operation, a runtime probe can work. Automated debugging tools such as the Delta Debugger (see the link in the right sidebar) can also make your runtime probes significantly more powerful and time-effective.
- Human Inquiry - Generally the least reliable technique; just ask any attorney. Humans are biased and easily influenced, so they will often tell you what you want to hear, or fill in gaps in their knowledge with invented content so as to appear helpful. In many cases, the person who knows is not available, which makes the technique useless. The key is to make sure that you are talking to the person who would know, and to keep your questions very specific. Asking something like, "Did the error message say anything about a network problem?" is generally more valuable than, "What did the error message say?"
- Test Case Execution - Very reliable and generally available, test cases are my recommended data collection technique in most instances. The biggest caveat is that your test case may itself have a bug that hides data from you, so don't ignore counterintuitive results. Also, as a matter of course, write more test cases that you expect to fail than ones that you expect to succeed. This might sound silly, but I often see a bug and ask to see what tests were written, only to discover that they test the functionality so meekly that they barely test anything at all. Tests that fail mean more work to do now, but they mean less work overall. On a related note, I like to say that a test that you expect to fail but that actually succeeds is 10x as valuable as one that you expect to succeed but that actually fails. When you think something will fail but it succeeds, you can be sure that you do not really understand the problem. When you expect success but get failure, you have just found another thing that fails, and that may or may not be new information.
- Artifact Examination - Artifacts suffer from two major issues: their reliability is unknown, and they are static. Like test cases, the code that produces system artifacts can have bugs of its own, such as failing to write out important information or writing out incorrect information. A good example of this is inconsistent error logging, a problem that plagues many systems. If you have an error that is actually a cascading effect of a previous error, but only the second one gets recorded, you will be missing a huge piece of the puzzle, and the artifact might actually lead you astray. As for being static, think of all of the times that you have looked at an error log and thought, "I wish I had written out the value of [important variable]". It's extremely hard to predict what information will be useful at some point in the future, and there is no way to go back and add it. In a few days I will discuss techniques to make your artifacts more reliable.
- Differential Computation - Suffers from an assortment of issues. The biggest is knowing what to diff. You can't run the comparison on everything, so you need to pick the things that you think have changed. Next is knowing how to diff. Text files and even most binaries can be easily compared, although you may not always understand the results. Diffing something like hardware configurations is a lot tougher and generally has to be done manually, which introduces human error. Finally, spurious or ignorable differences can cloud your findings. Simple things like line breaks will complicate the results of a text diff. Trying to figure out whether a difference in CPU speed is important is significantly more difficult.
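To make the point about ignorable differences concrete, here is a minimal sketch of a text diff that strips out line-ending and trailing-whitespace noise before comparing. It assumes Python and its standard `difflib` module; the function names `normalize` and `meaningful_diff` and the sample config values are made up for illustration, and which differences count as "ignorable" is an assumption you would have to adjust for your own files.

```python
import difflib


def normalize(text: str) -> list[str]:
    """Split text into lines, discarding line-ending style and trailing whitespace.

    Assumption: these two kinds of difference are noise for the files being compared.
    """
    return [line.rstrip() for line in text.splitlines()]


def meaningful_diff(old: str, new: str) -> list[str]:
    """Return unified-diff lines for the differences that survive normalization."""
    return list(
        difflib.unified_diff(
            normalize(old),
            normalize(new),
            fromfile="old",
            tofile="new",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical config files: one saved with Windows line endings, one with Unix.
    old_config = "timeout=30\r\nretries=3\r\n"
    new_config = "timeout=30\nretries=5\n"
    for line in meaningful_diff(old_config, new_config):
        print(line)
```

A raw comparison of these two strings would flag every line because of the line endings; after normalization only the `retries` change remains. The trade-off is that normalization is itself a guess about what matters: if trailing whitespace is significant in your file format, this filter would hide a real change.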
Tomorrow: Putting it all together, Diagnosing in Action
