Distance Debugging Logo

I have not worked at that many different companies, although I've been in the industry for a while and I've seen the day-to-day workings of many workplaces. A recurring theme in many of these workplaces, which always surprises me, is the failure of the information technology to meet the needs of the users on a day-to-day basis. I've often asked why this is such a problem. Is it because it's hard to know what users want? Do the solutions not exist? Are they too expensive to implement? Is it not a priority?

I've concluded that businesses are not serious about IT. By not serious, I mean that they treat it as a nuisance, or that, in general, the least possible effort is expended on providing employees with the tools and technologies that would help them do their jobs. What would taking IT seriously mean?

  • Problems with the computer infrastructure would be treated like an emergency. When users go without email for 4 hours, this is often shrugged off like it was inevitable. What if the heat went off in the middle of the day for 4 hours, or the water stopped running for 4 hours? It would be just about as disruptive to business, but it would be treated like a crisis.
  • Solutions would not just be, "Don't let anyone do anything". Most big business seem to have an IT policy which is unconcerned with stomping on users legitimate needs in the interest of preventing possible problems. For example, the prevalence of ridiculously aggressive email filters that don't let any zip file in or out. Yes, it stops a certain class of viruses from spreading, but it gets in the way of transmitting legitmate files probably 10 times for every virus stopped? Why not just not let people use computers at all? Then you'd have no computer problems.
  • Intermittent problems would be dealt with instead of glossed over. Have you ever worked at a company where a critical server went down once a week and instead of just fixing the problem, it was rebooted? That's what I'm talking about.
  • Users' requests would be taken treated as requirements instead of burdens. Most IT policies are a top-down affair, with some sort of group deciding which capabilities will be offered to which users, and how those services will be delivered. Unfortunately, since the people who make the policy are generally the people who have to implement it, the decisions often lean towards "easy-to-administer" instead of "good-for-users". When users complain about policies that interfere with their work, or offer alternatives for software or hardware to help them do their jobs more effectively, the message is "but that will mean more work for us!". The result is that large gains in user efficiency are sacrificed for much smaller gains in IT efficiency.

That last point is the most important. The IT staff often has a conflict of interest because they want to reduce their burden (which honestly is hard to blame them for, given that IT departments are often dramatically understaffed) but by doing so they create a suboptimal environment for the day-to-day users.  Businesses need to recognize this and provide a better mandate for their IT staff. Besides raising staffing levels and constantly trying to improve the staff, management needs to judge the IT department on user satisfaction. Those two steps would be a serious commitment to  IT, and ultimately, they can help improve morale, and make the business more competitive.

Two items this week:

- The WordPress 2.1 upgrade has already been covered here and here.

- I don't discuss my day job very often, but I had a major investigation that finally got resolved today that I hope that the information will be helpful for others. It all started because I was trying to get a coworker set up with Subclipse, the Eclipse subversion plugin, on a new repository over ssh (using an svn+ssh URL). I'd gotten set up with no trouble, and was able to get a few others set up with no issue. However, whenever he tried to connect, it came back with a message saying

svn: Network Connection Closed Unexpectedly

The few hits on this on the web were in relation to failing to have svnserve in ones non-interactive shell path. Basically, when you use svn+ssh, it opens up an ssh non-interactive sesson, and then invokes svnserve in non-interactive mode. That means that often a different environment is created and it can mean that svnserve is not in ones path and so the command fails. However, that's easy enough to test, and it was clearly not the issue in this case. After uninstalling and reinstalling Subclipse, trying new keys, and dozen other small things, we were fairly stumped. As an experiment, I had him send me his private key and I tried to set up a connection posing as him. This worked fine so the problem was very likely tied to his machine, and not to his account.

Grasping at straws, I googled around a bit more and stumbled across this post from a few weeks ago on the Subclipse-users mailing list. It describes a similar problem and a workaround. Basically, when you try to connect to a repository for the first time, you are asked to enter an Author Name, which is associated with your commits and such. This can be different than your login name, if you so choose. My coworker chose to enter his full name, separated by a space, and it turns out that if you choose to do so, it will result in this connection closed problem. The real issue is that once you have entered it, Eclipse will store it deep down in its config and without explicitly deleting it, it is nearly impossible to get it to prompt you again. Because it is in Eclipse itself, reinstalling Subclipse, restarting Eclipse, and deleting and recreating the connection will have no effect. You have to delete the keyring file that stores the info.

So we started by deleting the keyring file, which is in ECLIPSE_HOME/configuration/org.eclipse.runtime.core/.keyring. If you are following these directions, I would make a copy of this file before deleting. I wasn't concerned about losing data in this case, and it's probably all information that ought to be recreateable anyway, but you never know. He then tried to connect again. This time, it did prompt him for his author name. He entered just his last name this time, and voila, the repository was all set. I posted a message to the Subclipse-Users list reporting the problem as replicable, so hopefully this will be fixed in a subsequent release. Hope this helps someone!

Yesterday covered some "don't"s, and today we'll cover the "do"s:

  • Segmented Logging - Besides rolling the logs at intervals to allow for multiple files, it also makes sense to segment your logging into different tiers by seriousness and verbosity. I like to use at least these four logs as available places for content:
  1. The Standard Out Jungle - Anything goes in this log. Here, developers can spit out pretty much whatever they'd like without fear of cluttering up the important logs. Data structure dumps, "Here", anything that helps them observe and diagnose the running system. However, it's a shared resource, so expect to go digging through other people's Standard Out junk as well. This log is not serious, but it may be verbose.
  2. System-at-a-glance - A concise summary of every logged message that the running system produces. Includes a timestamp, severity, short summary, and a reference number for each message. This is the serious, but not verbose log. A remote, non-technical resource should be able to quickly skim this file and visually determine if anything important/worrisome is happening.
  3. System-debug - A verbose explanation of the information being logged in System-at-a-glance. If a skim of the at-a-glance log seems to indicate a problem or something that requires further investigation, the reference number associated with the message can be used to cross reference the message in the verbose log. In fact, these two logs receive the exact same set of messages (and this is done automatically by the logging layer, not relying on users to write to them both) but with different information culled from the message to keep them in sync. This is the serious and verbose log.
  4. Critical-at-a-glance - This is the log that you can check every morning. In general, it should have nothing in it. The appearance of anything means that there is a serious problem that needs to be addressed immediately because it is unresolvable without human intervention and will have deleterious consequences. The information in the System logs might be beneficial for digging into the problem, but they contain significantly more information and so are not ideal for a daily review.
  • Think About your Reader - Who do you expect to read the log file? You will probably see it, eventually, but there may be a lot of other eyes on it first. The user, other admin or IT support staff who are local to the system, and possibly more. Ideally, you'll never have to see the log because either a) the information is so clear that someone else can handle it (not very likely) or b) the log has the important information carefully outlined and packaged so that a relevant section can be sent off to you (hopefully very likely).Automated tools for log mining have their place, but they often strike me as arising from sloth, i.e. it implies that you'd rather spend hours slicing and dicing a 500MB log file every time an error occurs rather than spend a day or two upfront cleaning up and organzing your logging. It also fails to take into account the reality of a distance debugging situation. Big logs don't email well, and it totally rules out the possibility that a system-local resource might be able to tell you the important stuff, unless they want to become log mining gurus themselves.As was hinted at yesterday with Alarmist Logging, you also don't want the user to open up the log to find thousands of lines of their personal data interspersed with other random outputs about fatal errors and who knows what. It certainly will not instill them with a sense of confidence.
  • Practice Debugging with Logs - It's hard to know what information will prove useful, but if you implement a logging policy and infrastructure early on, you can start using it to try to debug problems in the development phase, to prove that you will be able to do it in production. This will show what you need to put in the logs, and how much logging is sufficient.
  • Institute a Logging Clean-up Phase of Development - To avoid logorrhea, make sure that before any release is cut that the code is inspected for useless or misleading log statements. This can be executed just like a standard code review, but it is pretty quick since you can just jump around from log() call to log() call and simply question the necessity and validity of each. In many cases, all that is needed is a redirection of the message from the System log to the Standard Out Jungle, which can be disabled in the production system.

The moral of the story is: treat logs as a key debugging resource. You can signficantly improve their value to you and others with a small amount of time spent on the details of what gets logged and how it is recorded.

I noticed some people winding up here because they searched for "WordPress 2.1" and "broken" and "Tiga" (the theme that I adapted for my site) or some combination of those terms. I wanted to provide a slightly more detailed explanation of what I did to fix it. In the Tiga theme, the links list is generated by a foreach that loops over the links, printing them out in category blocks. The problematic call in Wordpress 2.1 is the one that assigns the original variable to iterate over:

$link_cats = $wpdb->get_results("SELECT cat_id, cat_name FROM $wpdb->linkcategories");

As I mentioned before, linkcategories no longer exists, so instead you need to query the categories table.

$link_cats = $wpdb->get_results("SELECT cat_id, cat_name FROM $wpdb->categories");

Except if you do that, you get all of your post categories showing up as well. So you want to filter on only the categories that contain links and that was the tricky part. By looking at the MySQL table structures that are printed in the WordPress Codex, I noticed that there is a link_count field. I changed the query to filter on anything that has a link_count > 0, like this:

$link_cats = $wpdb->get_results("SELECT cat_id, cat_name FROM $wpdb->categories WHERE link_count > 0");

That did the trick, and it doesn't suffer from the potential problems that I described in the original post where I was filtering on link_id. If this helps you, or if you have a better solution, would you mind leaving a comment? I'll eventually post this somewhere more permanent if it seems to be helping people. Thanks!

Most applications produce some kind of log. Often it is the primary or only resource available to debug a problem. Yet few applications have any unified approach to figure out what should be logged, how it should be logged, and why it should be logged. Most of the time, while there is a shared logging API available, developers use it haphazardly and to record a wide variety of information. There are a few documents on the web such as this one that try to indicate some logging best practices. Over the next few posts I will try to explain some of my own best logging practices and contribute to this body of knowledge.

At the most basic level, you need to ask yourself, if my log files were all I had available, how confident would I be in my ability to fix bugs? If you have little confidence, then you are probably doing it wrong. Before discussing how to create logs that build confidence, let's talk about some common mistakes:

  • One Log File Syndrome - There is a misguided desire to put all that is logged into a single file. This is a very bad idea, for reasons that will soon become clear. Having a logging standard that assumes more than one log file offers a built-in way to segregate data, and quick access to the data that you need is critical in a distance debugging situation.
  • Logorrhea - Developers like to put things in the log file, and they hate to take them out. This often leads to a situation where the log becomes an indiscriminant mess of random garbage. Worse, it kills system performance, often mysteriously (until you put it in a profiler and see 100M calls to log()).
  • Alarmist Logging - Have you ever opened a log file to see messages like "FATAL Error: File Already Open"? The terminiology used within the log can send your distance resource into fits if they see a bunch of "fatal" errors, even if the error was in some sense "fatal" in that it caused the operation to fail. The way that logs are phrased, and the overall tone of the messages can help guide a remote reader to the actual issues in the log. Alarmist logging will make them think that their system is milliseconds away from disaster.
  • Logging what ain't broke - Logs are for capturing information that will probably be useful for debugging, should the need arise. Sometimes it's not clear what is useful, and so having examples of success to compare against are useful. But I've seen too many logs that are 99.9% lines that say "Commit Successful". If you have to spend more time ignoring success data than you do reading and analyzing failure data, you need to revisit your logging.

Tomorrow: Logging at a Distance, Part 2

1/27/07: I've updated my fix for the link categories issue, so check out this more recent post if you are looking for help.

I've just upgraded to WordPress 2.1 and there were a few glitches that I've mostly resolved, but I thought I would make a note of things for others who might be having trouble.

  • The big problem is one that appears to be somewhat widespread. There are no longer separate database tables for link categories and post categories, just a single categories table. WordPress is nice, and it is free, and it works great, so I'm not here to complain, but their justification that they were somehow changing this to take advantage of some elegant abstraction seems questionable to me at best. It broke my template, and I think it broke a lot of other people's templates, and most annoyingly, the API doesn't seem to have been modified to offer a method that only returns just the link categories, or at least I didn't have time to puzzle through and figure it out. I hacked it by looking at the range of category IDs that constitute the link categories and changing the "get links" query to filter out anything lower than the lowest link category. This only works because the post categories are a contiguous block, and the link categories are a contiguous block and as soon as I add one or the other it will probably break. There is clearly some other way to determine what is a link category, but I haven't had the time to get serious about figuring it out. Anyway, the links on the front page are now accurate, but the display is screwy because the wp_get_links() method seems to have changed as well, so bear with me... [Ed note: I have a better fix now, see this post.]
  • The admin interface has some nice enhancements; more AJAX background loading so things are quicker, autosave while editing a post, plus integrated spellchecker! That all makes me very happy.
  • All my plugins appear to still be functioning, so kudos to the developers.
  • They changed the name of the Links tab to Blogroll, which is a weird narrowing, considering that most blogs I read link to a few blogs, but lots of other miscellaneous sites. At first I thought the functionality had changed, but it's just a renaming of the existing Links functionality.
  • There's a little bit more confusion between what the "Write" tab does and what the "Manage" tab does. I'm sure I'll get it figured out.

Overall, it was a snap to upgrade, after I'd sorted out the category confusion, and it seems to be a bit faster. I would definitely recommend the new version.

Another quiet-ish week. Spent most debugging time helping out family members:

  • My dad received some data at work that was formatted, as he put it, as a "qz" file. I was fairly certain that this was not in fact, a "qz" file, but was in fact a "gz", i.e. gzip file. I based this on the assumption that a) I've never heard of qz, so unless it was highly specific to some particular application, that was probably wrong, b) gz is a very common format but not that well known among non-Unix people so it would make sense that he would have received one from a third party and c) "qz" and "gz" look very similar, especially in certain fonts. He reexamined the file and verified that it was indeed a "gz" file. The next question became how to open it. I was fairly certain that winzip would know what to do, which turned out to be the case. Misreading the extension sent him on a bit of a wild goose chase, so I'm glad I was able to help.
  • My parents have a new HDTV, but with just the off-air HD signal, so I have spent some time directly working on and remotely advising them to get it set up so that they can use their DVD player, VCR, HD signal, etc. with everything routed correctly. They have this weird setup where their DVD player is also the amplifier for the sound system, which I've never encountered before, so it ended up requiring lots of back and forth wiring to get everything run through the DVD player, and then to the TV, and then to the VCR so that stuff can be recorded off of live TV, etc. I wasn't able to sell them on TiVo despite all the hassle.
  • My wife had intended to use the hide column feature in Excel and instead accidentally selected "Hide Document". This is surprisingly easy to do and she was baffled/terrified as to what had happened to the document. I'm not totally sure what "Hide Document" is supposed to do when you only have a single document open. I can see the logic when you are trying to manage multiple documents but to hide the only document you have open with no warning, and no clear on-screen indication that you have something hidden seems like a recipe for disaster.

Skepticism and Cynicism may not seem like opposites, but in the world of debugging, they often are. As discussed yesterday, the Cynical Debugger says "I believe that things can and will go wrong, and I want to plan for that."  Balancing that attitutde, the Skeptical Debugger says "I don't believe that bug report" and "There's probably a much simpler explanation". An expert debugger knows how to manage the tension between wanting to accept the crazy things that happen to an application in the wild, and wanting to dismiss bad or highly improbable data.

When a bug report is submitted to me, one of the first questions I like to ask myself is: what of this information, if any, seems totally out of line? As described in this post about saying 'no' to a bug report, an application can't be all things to all people, and it is necessary to say 'no' at times. Reading the report with a cynical eye will give you plenty of reasons to say 'yes', but reading the report with a skeptical eye will tell you why you can say 'no'. It is what prevents you both from catering to the whims of particular users, and from embarking on a wild goose chase for a bug that occurs under questionable circumstances. Take for example a bug report like this:

"I was testing the latest build of the client software, and when I sent a login request to the server it spit out a stack trace back at me. Please fix the server."

It's quite possible that the latest build of the client software uncovered some bug in the login procedures on the server. But it seems much more likely that the thing that changed, i.e. the client software, is to blame and is now doing the login procedure incorrectly. Spending a lot of time investigating this "server" bug is very likely going to result in simply figuring out what the client is doing wrong and then telling them about it. Unfortunately, people often want to blame the component that is producing the error and not think about what has changed. With a skeptical eye, you can step back and ask "why is this bug happening now?", and try to communicate that you are not absolving the server of responsibility, you are simply skeptical that a properly functioning server that has not been changed would suddenly fail.

The same goes for bug reports that look like, "My computer has been acting flaky all day, and your application keeps crashing. Please fix your application." Users want something to blame, and if you have a reputation as a responsive team that can fix things in a hurry, these kinds of bug reports will only increase. The key is again, remaining skeptical, and communicating your skepticism to others. "I'm guessing that your computer acting flaky is the cause of the application crashing, and not the other way around. You normally don't have any problems with our application, and your computer has acted flaky many times in the past, right?" These kinds of skeptical statements help clarify that you are open to fixing real problems with your and even others' applications, but you are not all-powerful.

Distance Debugging is ultimately about figuring out where to apply your limited resources in a highly uncertain environment. Being appropriately skeptical can help you avoid getting trapped into expending these resources in inappropriate, counterproductive ways while still letting your team and your users understand that you will invest the necessary resources when the situation warrants.

One of the qualities that I look for in other developers to signal that they have the requisite experience is what I call "Healthy Cynicism". Programmers fresh from school, or who have only been given token work in the past have an attitude that says, "Things will work out". That's a great attitude to have when you are building something new, but it's counterproductive when you need to debug something. Debugging is about expecting the worst. I have a lot of conversations with people that go like this:

Me: "So if the user decides to minimize and then maximize the window in the three seconds while the application is loading, the system will segfault."

Cheery Programmer: "Will that ever happen?"

Me: "I'm reading to you from the bug report database"

Certainly cynicism can get out of hand, and lead to a situation where a group becomes mired in inaction because everything seems likely to result in failure. The attitude of a Healthy Cynic is not that failure is inevitable, but that any misuse of the system you can think of will probably be done, and any bad system state you can imagine will be entered at some point. The Healthy Cynic says, let's be serious about what bad things can go wrong and make sure our system can handle them.

In debugging, the attitude is invaluable. As I've alluded to before, part of the success in debugging is simply accepting that a bug exists in the first place. For example, I see replication as much about being able to prove you fixed the bug as it is about proving the bug exists. A Healthy Cynic looks at a bug and says, yeah, that could probably happen and has a jump on others who are still trying to explain the bug away. A Healthy Cynic also assumes that users are mostly stumbling through your application and will do ridiculous things with it. It would be nice if they didn't, but pretending that they will stay on your nice rutted paths is a recipe for disaster.

This combination of considering worst-case scenarios and then trying to build a system which anticipates and handles them, and treating bugs as real and likely right from the start is what makes Healthy Cynicism a core debugging skill. Unfortunately, it's not enough without an offsetting force. Tomorrow's post will discuss the flip side, Healthy Skepticism.

In a long-standing organization, the line between the things that are formally stated in policy documents and things that are just accreted tendencies often gets blurred. The former represent agreed-upon standards and practices that have (hopefully) undergone some formal review process and have been signed off at one or more levels of management. The latter represent some combination of once-upon-a-time best practices, inertia, and whim. In many cases, they are simply generally accepted intepretations of policy.

Jewish law is an interesting case of this dichotomy. There is the Torah, which is the policy, and the oral tradition, which is the convention. They are considered to be essentially equal in their binding nature. From Wikipedia:

Jewish tradition holds that the Torah has been transmitted in parallel with an oral tradition. Jews point to texts of the Torah, where many words and concepts are left undefined and many procedures mentioned without explanation or instructions; the reader is required to seek out the missing details from the oral sources. For example, many times in the Torah it says that/as you are/were shown on the mountain in reference of how to do a commandment (Exodus 25:40).

This certainly works for Jewish law, where in many ways the conventions have simply been rebranded as policy. Unfortunately, in most cases, confusion of policy and convention can lead to serious problems.

Consider the following situation: a project manager produces a set of reports every month for her boss laying out costs, work performed, and goals. The boss is happy with these reports, but they are being produced by convention, i.e. there is no document that could be referred to that would explain what the reports are, what should and should not be in them, who should or should not be reading them, etc. That original boss leaves or is promoted, and a new boss comes in. The project manager produces the same report she has always produced, but the new boss does not like these reports. She has her own expectations for project reporting, and thinks that the reports cover the wrong subjects or are at the wrong level of detail and gets a negative opinion of the capabilities of the project manager. This is clearly not fair to the project manager, who is following her previous convention and doesn't understand why the new boss is unhappy. The problem is that convention is being treated as policy. There is an unspoken assumption that things have been done a certain way, and therefore there must be a good reason, but it's really just the whim of the person in charge.

Convention/Policy confusion is also very difficult for new employees who want to understand how things are done. As they learn about the company's practices, they have to spend a significant amount of time figuring out of what they are being told, how much is actual policy, and how much is convention. They fail to do so at their own peril because conventions can often be broken while policies cannot.

Finally, Convention/Policy confusion is dangerous for an organization's cohesiveness over time. Primarily, this is because employees are coming and going and the turnover can lead to a "lurching" of conventions from one form to another wasting time and energy, Additionally it leads to a situation where each part of a company does things in a slightly (or not so slightly) different way, causing huge headaches when parts need to be integrated. A more insidious effect is the weakening of the role of policies in general. When it is no longer clear which things are being done because someone decided it was so, and which things were certified as policies, it stops being possible to drive an organization with policy at all.

It's worth a look around your own business to ask the question: why do we do the things we do? Is it because each level of hierarchy is making up conventions for the level below them, or is it because we all got together and agreed upon the way to do something? You will be shocked by the number of things that you do every day that have no basis beyond historical precedent.

Syndicate content