Yesterday covered some "don't"s, and today we'll cover the "do"s:

  • Segmented Logging - In addition to rolling the logs at intervals so that they are split across multiple files, it makes sense to segment your logging into different tiers by seriousness and verbosity. I like to have at least these four logs available as places for content:
  1. The Standard Out Jungle - Anything goes in this log. Here, developers can spit out pretty much whatever they'd like without fear of cluttering up the important logs. Data structure dumps, "Here", anything that helps them observe and diagnose the running system. However, it's a shared resource, so expect to go digging through other people's Standard Out junk as well. This log is not serious, but it may be verbose.
  2. System-at-a-glance - A concise summary of every logged message that the running system produces. Includes a timestamp, severity, short summary, and a reference number for each message. This is the serious, but not verbose log. A remote, non-technical resource should be able to quickly skim this file and visually determine if anything important/worrisome is happening.
  3. System-debug - A verbose explanation of the information being logged in System-at-a-glance. If a skim of the at-a-glance log seems to indicate a problem or something that requires further investigation, the reference number associated with the message can be used to cross-reference the message in the verbose log. In fact, these two logs receive the exact same set of messages, each showing different information culled from the message. The logging layer does this automatically, rather than relying on users to write to both, which keeps the two logs in sync. This is the serious and verbose log.
  4. Critical-at-a-glance - This is the log that you can check every morning. In general, it should have nothing in it. The appearance of anything means that there is a serious problem that needs to be addressed immediately because it is unresolvable without human intervention and will have deleterious consequences. The System logs might be beneficial for digging into the problem, but they contain significantly more information and so are not ideal for a daily review.
  • Think About your Reader - Who do you expect to read the log file? You will probably see it, eventually, but there may be a lot of other eyes on it first: the user, other admins or IT support staff who are local to the system, and possibly more. Ideally, you'll never have to see the log because either a) the information is so clear that someone else can handle it (not very likely) or b) the log has the important information carefully outlined and packaged so that a relevant section can be sent off to you (hopefully very likely). Automated tools for log mining have their place, but they often strike me as arising from sloth, i.e. relying on them implies that you'd rather spend hours slicing and dicing a 500MB log file every time an error occurs than spend a day or two upfront cleaning up and organizing your logging. That approach also fails to take into account the reality of a distance debugging situation. Big logs don't email well, and mining them totally rules out the possibility that a system-local resource might be able to tell you the important stuff, unless they want to become log mining gurus themselves. As was hinted at yesterday with Alarmist Logging, you also don't want the user to open up the log to find thousands of lines of their personal data interspersed with other random outputs about fatal errors and who knows what. It certainly will not instill them with a sense of confidence.
  • Practice Debugging with Logs - It's hard to know what information will prove useful, but if you implement a logging policy and infrastructure early on, you can start using it to try to debug problems in the development phase, to prove that you will be able to do it in production. This will show what you need to put in the logs, and how much logging is sufficient.
  • Institute a Logging Clean-up Phase of Development - To avoid logorrhea, make sure that, before any release is cut, the code is inspected for useless or misleading log statements. This can be executed just like a standard code review, but it is pretty quick since you can just jump around from log() call to log() call and simply question the necessity and validity of each. In many cases, all that is needed is a redirection of the message from the System log to the Standard Out Jungle, which can be disabled in the production system.
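
As a rough illustration of the Segmented Logging idea, here is a minimal sketch using Python's standard logging module. The post doesn't prescribe any particular implementation, so everything named here is an assumption made for the example: the build_loggers function, the RefNumberFilter class, the "system" and "jungle" logger names, and the "detail" field. Each message is written once, and the logging layer fans it out to the at-a-glance, debug, and critical logs under a shared reference number. Plain streams stand in for files to keep the sketch self-contained; in practice you would likely use file handlers that roll at intervals.

```python
import itertools
import logging


class RefNumberFilter(logging.Filter):
    """Hypothetical helper: stamps each record with a reference number.

    The filter runs once per record on the logger, before any handler
    sees it, so every log file shows the same number for a message.
    """

    def __init__(self):
        super().__init__()
        self._counter = itertools.count(1)

    def filter(self, record):
        record.ref = next(self._counter)
        # 'detail' is optional verbose context, passed as extra={"detail": ...}
        if not hasattr(record, "detail"):
            record.detail = ""
        return True


def build_loggers(glance_stream, debug_stream, critical_stream, jungle_stream):
    """Return (system, jungle) loggers wired to the four logs."""
    system = logging.getLogger("system")
    system.setLevel(logging.INFO)
    system.propagate = False
    system.handlers.clear()
    system.filters.clear()
    system.addFilter(RefNumberFilter())

    # System-at-a-glance: serious, not verbose.
    short_format = logging.Formatter(
        "%(asctime)s %(levelname)s #%(ref)d %(message)s")
    glance = logging.StreamHandler(glance_stream)
    glance.setFormatter(short_format)

    # System-debug: same messages, plus the verbose detail.
    debug = logging.StreamHandler(debug_stream)
    debug.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s #%(ref)d %(message)s | %(detail)s"))

    # Critical-at-a-glance: normally empty, only CRITICAL gets through.
    critical = logging.StreamHandler(critical_stream)
    critical.setLevel(logging.CRITICAL)
    critical.setFormatter(short_format)

    for handler in (glance, debug, critical):
        system.addHandler(handler)

    # The Standard Out Jungle: anything goes, kept away from the System logs.
    jungle = logging.getLogger("jungle")
    jungle.setLevel(logging.DEBUG)
    jungle.propagate = False
    jungle.handlers.clear()
    jungle.addHandler(logging.StreamHandler(jungle_stream))

    return system, jungle
```

A call such as system.error("Import failed", extra={"detail": "file=orders.csv line=812"}) would then put the short summary in the at-a-glance log and the summary plus detail in the debug log, both carrying the same reference number. Under this sketch, disabling the Standard Out Jungle in production could be as simple as raising the "jungle" logger's level.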

The moral of the story is: treat logs as a key debugging resource. You can significantly improve their value to you and others with a small amount of time spent on the details of what gets logged and how it is recorded.