This post starts from the assumption that you have at least one theory to work with. Tomorrow's post will cover developing a theory when you have none. There are a few questions you need to ask yourself about the theory in order to determine how to proceed:
- Is the theory testable, given the current distance constraints?
- If not, could the distance issue be overcome by one of your trusted contacts?
- If not, could the theory be testable in some simulated environment with some likelihood of success?
The problem is that while you may have an excellent theory, it might be quite difficult to determine whether or not it's true, and you may want to investigate some less probable theories that are quickly testable before looking at the most probable one.
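One way to make this trade-off concrete is to rank theories by estimated probability per hour of testing effort, rather than by probability alone. The sketch below is only an illustration of that idea: the `Theory` class, the `investigation_order` function, and all of the numbers are hypothetical, and the ratio is just one reasonable heuristic, not a rule.

```python
from dataclasses import dataclass


@dataclass
class Theory:
    name: str
    probability: float    # rough guess between 0 and 1 (an assumption, not a measurement)
    hours_to_test: float  # rough guess at the effort needed to rule it in or out


def investigation_order(theories):
    """Sort theories so that the best 'probability per hour' comes first."""
    return sorted(
        theories,
        key=lambda t: t.probability / t.hours_to_test,
        reverse=True,
    )


# Hypothetical estimates: the likelier theory is much slower to test.
theories = [
    Theory("most probable, slow to test", probability=0.5, hours_to_test=8.0),
    Theory("less probable, quick to test", probability=0.2, hours_to_test=0.5),
]

for theory in investigation_order(theories):
    print(theory.name)
```

With these made-up numbers, the quick test comes first even though its theory is less likely, because ruling it in or out costs so little.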
Here is an example: you have built and deployed a client-server system for a customer. You receive a phone call on Friday morning that the server mysteriously stopped responding to client requests on Wednesday. The local administrator simply brought the server application down and back up again, and the server appeared to go back to normal. However, the server was displaying the same symptoms again that morning, and they decided to call you. Once the problem has been described to you, a few ideas immediately come to mind, in order of estimated probability:
- The system is designed to write a log file to disk, and rolls to a new log every 100MB. It is possible that the disk has filled up, and that the server gets wedged when it runs out of log space. The restart causes it to clear the most recent log and start again, but when the disk fills up again, it hits the same problem. This would explain the long lead-up to the first problem, but the rapid reoccurrence.
- Perhaps a deadlock is occurring because a greater number of users are using the system, and a section of code that was not properly protected is now causing a problem. While it is unlikely to occur at low load, it will become increasingly likely as the load on the server increases. This would also match a condition with a long lead-up, and then a relatively rapid reoccurrence.
- The disk that the server is running on is failing. Whenever a bad sector is accessed, the server goes into a long read/retry loop until it finally fails, leaving the system in a bad state. It hit that bad sector for the first time on Wednesday, and then hit it again this morning.
While all three are ultimately testable, each comes with different issues. Even though the second theory (the deadlock) is more likely than the third (the failing disk), it might make sense to investigate the third first, since it can be ruled in or out very quickly. Here is how I might proceed in this investigation:
- Start with the first theory, which has two prerequisites. First, the disk has to be almost out of space. This will require a trusted contact to verify if the system is physically remote. Second, you will have to mimic the system running out of disk space locally to see what actually happens. If both of these things turn out to be true, then you are almost certainly correct. If the disk has plenty of space, it's probably not even worth checking the second condition. If the disk is almost out of space, but the system doesn't fail in the same way when you test it locally, it still might be a viable theory, and you will have to judge whether it's worth freeing up disk space and hoping for the best.
- If there is plenty of disk space, then checking with a trusted contact to determine if there are disk errors occurring is probably your best next step. This is usually easy to establish by looking at operating system logs.
- If there is plenty of disk space, and no errors occurring, it's probably time to start doing some more code investigation to determine if a deadlock or other thread issue is the problem, and you can proceed from there.
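The order of checks above can be written down as a small decision function, which is sometimes handy when you have to walk a remote contact through the steps. This is a minimal sketch under stated assumptions: `disk_nearly_full`, `next_step`, and the 5% free-space threshold are all hypothetical names and values chosen for illustration, and the returned strings simply paraphrase the steps described above.

```python
import shutil


def disk_nearly_full(path, min_free_fraction=0.05):
    """Return True if the filesystem holding `path` has less than
    `min_free_fraction` of its space free. The 5% default is an
    arbitrary assumption; pick a threshold that suits the system."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total < min_free_fraction


def next_step(nearly_full, wedges_locally, disk_errors_found):
    """Map what has been learned so far to the next action.

    nearly_full       -- the remote disk is almost out of space
    wedges_locally    -- a local out-of-space test reproduced the hang
                         (ignored when the disk has plenty of space)
    disk_errors_found -- the operating system logs show disk errors
    """
    if nearly_full:
        if wedges_locally:
            return "log theory almost certainly correct"
        return "log theory still viable: judge whether to free space"
    if disk_errors_found:
        return "suspect the failing disk"
    return "investigate the code for a deadlock"


# Example: plenty of space, no disk errors reported by the contact.
print(next_step(nearly_full=False, wedges_locally=None, disk_errors_found=False))
```

A trusted contact could run something like `disk_nearly_full` against the directory the server logs to, and report the result back along with whatever the operating system logs show.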
Tomorrow: The Distance Bug Investigation, Part III
