Distance Debugging

SATA Saga

Returning to the subject of Sunday's post, I have gotten to the bottom of all of the issues regarding SATA vs. PATA and the missing DVD drive. I am grateful to those on the web who have posted various solutions regarding these issues. I'm going to go one step further and document my thought process a bit in order to assist others who don't even know where to begin looking at these issues.
1) I started from the issue that my DVD drive was not being enumerated by the ata_piix driver. I got hooked on the notion that I needed to enable ATAPI support in the driver. It's experimental so it's not enabled by default. Depending on whether it's compiled into the kernel or loaded as a module, you have to either pass a boot parameter (libata.atapi_enabled=1) or do the same via the modprobe.conf file.

2) I spent an inordinate amount of time working on this solution. Trying it each way, compiling new kernels, rebuilding my initrd, etc. All trying to get it to accept the parameter. One thing that really hindered me was the fact that there is no way that I know of to query the loaded modules and ask "what parameters did you receive at the time you were loaded?" to sanity check that I was doing the right thing.
3) After reading a few dozen web pages, I noticed one mentioned that you should set your BIOS to AHCI if possible to get this to work. While I think this is actually in error (since if you were set to AHCI it would use a different driver), it led me to the eventual correct conclusion. My BIOS (on the Abit AW8D) supports a bunch of different modes, and I for some reason had it set to Combined Mode. I now know that Combined Mode has issues, but I didn't know that at the time I set things up. What I really wanted was Enhanced Mode, and to set the IDE mode to AHCI. Since I have an ICH7 chipset, this should be no big deal, and Fedora will use the ahci driver instead of ata_piix. This is exactly how it works on my laptop with the same chipset.
4) However, I couldn't get the system to boot in that configuration. I assumed that I was doing something totally wrong and that it wasn't recognizing the drives. I was totally stumped because it would boot and then print out

"GRUB GRUB GRUB" or something like that, as if Grub was trying to load and getting screwed up.

5) I tried booting in AHCI mode with the rescue disk. No problem. System comes up, drives can be mounted.

6) Now I'm starting to suspect that it's just an issue with the grub configuration. Sure enough, I run:

grub-install --recheck /dev/sda (since my MBR and boot partition are on the SATA drive)

and notice that previously, the device map looked like:

hd0 /dev/sda (the SATA drive)

hd1 /dev/hda (the PATA drive)

and now it's swapped. Apparently the switch from Combined to Enhanced (required for AHCI) caused the drives to be enumerated in a different order. This wouldn't be such a big deal, except that grub has to be installed on the first drive, or you have to add a chainloader directive. But what's with the GRUB GRUB GRUB stuff? Then I remember, I used the PATA drive as the main drive on a previous installation so it has an old (and now totally corrupted) grub installation. That was the big red herring this whole time.

7) So I grub-install onto /dev/hda and then change all the hd0s to hd1 (since the boot partition is still on the second drive) in the grub.conf file. System boots, but hangs since I still have the ide0=noprobe directive. Take that out, try again, system boots to login.
8) It now comes up with the ahci driver (I guess kudzu on Red Hat figures this all out), and the PATA drive and the DVD drive are handled by the ide subsystem. I look at the drives, and everything is golden. DMA is enabled on everything, and I'm seeing hdparm timing results consistent with what I was seeing with the ata_piix driver.
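
On the sanity check wished for in step 2: on kernels with sysfs mounted, parameters that a module declares as readable show up as files under /sys/module/<module>/parameters/. Whether a given parameter (such as atapi_enabled) is exposed depends on the kernel version and on how the driver declared it, so treat the following as a minimal sketch rather than a guaranteed answer. The helper name read_module_params and its sysfs_root argument are made up for illustration.

```python
from pathlib import Path


def read_module_params(module, sysfs_root="/sys/module"):
    """Return {parameter_name: value} for a loaded kernel module.

    Hypothetical helper: reads <sysfs_root>/<module>/parameters/*.
    Returns an empty dict if the module is not loaded or exports no
    readable parameters.
    """
    params_dir = Path(sysfs_root) / module / "parameters"
    if not params_dir.is_dir():
        return {}
    values = {}
    for entry in sorted(params_dir.iterdir()):
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            # Unreadable entries (permissions, directories) are skipped.
            continue
    return values


if __name__ == "__main__":
    # Module name taken from the post; prints {} where it is not loaded.
    print(read_module_params("libata"))
```

If the parameter appears there with the value you passed, the boot option or modprobe.conf line was accepted; if the file is missing, that alone does not prove the option was ignored.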

So that ends the saga of the slow or missing drives. What is frustrating is how much time I spent on the problem of trying to keep the ata_piix driver from fighting with the ide-generic driver, rather than noticing that my real problem was that I was just in the wrong mode. The ahci driver doesn't have this problem so it's not even an issue. I was so focused on thinking that I was clever with the initial solution of setting the combined_mode flag that I ignored all the evidence that was telling me that I was doing things wrong from the beginning. One last note, I think that everyone using Linux on SATA drives owes Jeff Garzik a note of thanks for all his hard work, and if you'd like to learn more about this stuff, his webpages have a wealth of information.

The Magic of DMA

I have posted previously about my new Linux server, which started somewhat inauspiciously with a bad motherboard. I've had some other odd problems which I have mostly solved (or am in the process of solving) that I thought might be of use to others.

To start with, my server runs in so-called "Combined Mode", where the SATA and PATA channels are separated on the motherboard. This is the only mode in which I can boot my server currently. This may have something to do with the fact that the boot partition (/boot) and the MBR are on the SATA drive, and I would have to build the kernel differently since SATA support is built as a module rather than built into the kernel. This is just speculation though. Mostly I am taking an "if it ain't broke" position and leaving it alone.

I actually have two HDs in the machine, a 160GB SATA that holds most of my real data, the boot partition, and the swap area, and an 80GB PATA drive that holds the Fedora Core (now 6) installation. The Abit mobo only has a single IDE connector, so I have the PATA drive and my DVD writer on that cable as master-slave, and the SATA drive is separate. I noticed right away that the server seemed blazing fast for a lot of things, but it was jerky and slow for others. I immediately thought about the mishmash of drives and figured something was up. So I did a little hdparm analysis:

hdparm -tT /dev/hdc (the PATA drive)

Timing cached reads: 3496 MB in 2.00 seconds = 1748.62 MB/sec
Timing buffered disk reads: 7 MB in 3.02 seconds = 2.31 MB/sec

Ugh. There's the problem. I should be getting in the range of 40-50 MB/sec at least. So I checked the status, and of course there was no DMA or anything, so I tried the obvious:

hdparm -d1 /dev/hdc

And I get the common "Operation not Permitted" error that usually means that the specific driver for your motherboard chipset is not available in the kernel. However, I verified that my chipset (ICH7) was there. Long story short, I found my way once again to the incredibly useful ThinkWiki, to the page about Linux and SATA:

http://www.thinkwiki.org/wiki/Problems_with_SATA_and_Linux#No_DMA_on_DVD...

along with a Red Hat Bugzilla report along the same lines. It turns out that the issue is that you want the libata driver to grab the PATA drives as well, but usually the regular ide driver grabs them and in combined mode that means you are unable to do anything like manipulate DMA (I am paraphrasing here; the reality is much more complex). Anyway, by adding the flag combined_mode=libata to the kernel boot parameters, I now get:

Timing cached reads: 3496 MB in 2.00 seconds = 1748.62 MB/sec
Timing buffered disk reads: 168 MB in 3.02 seconds = 55.64 MB/sec

So about 24x faster with a simple boot-parameter change. The whole system is just so much faster now, it's hard to believe. Unfortunately, this had the side-effect of making the DVD drive disappear. It looks like I somehow am not enabling ATAPI in the libata driver, so that's what I'm working on. I'll post again when I've got that solved.
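
The speedup is just the ratio of the two "buffered disk reads" rates, which works out to roughly 24x. As a small sketch, the two hdparm outputs quoted above can be parsed and compared like this (the regular expression and the function name buffered_read_rate are assumptions made for this example, not part of hdparm):

```python
import re

# Matches lines like:
#   "Timing buffered disk reads: 7 MB in 3.02 seconds = 2.31 MB/sec"
RATE_RE = re.compile(r"Timing buffered disk reads:.*=\s*([\d.]+)\s*MB/sec")


def buffered_read_rate(hdparm_output):
    """Return the buffered disk read rate in MB/sec, or None if absent."""
    match = RATE_RE.search(hdparm_output)
    return float(match.group(1)) if match else None


# The two outputs exactly as quoted in the post.
before = """Timing cached reads: 3496 MB in 2.00 seconds = 1748.62 MB/sec
Timing buffered disk reads: 7 MB in 3.02 seconds = 2.31 MB/sec"""
after = """Timing cached reads: 3496 MB in 2.00 seconds = 1748.62 MB/sec
Timing buffered disk reads: 168 MB in 3.02 seconds = 55.64 MB/sec"""

speedup = buffered_read_rate(after) / buffered_read_rate(before)
print(f"{speedup:.1f}x faster")  # prints "24.1x faster"
```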

Besides my work with computers, my other main interest is education, and I actually went so far as to obtain an Ed. M a few years back. One notion that stuck with me long after I left was that of authenticity, specifically authentic assessment. In contrast to the more prevalent high-stakes testing, authentic assessment is about testing people by making them do the thing itself rather than asking questions about it. The example we used in the class is that of a chef. If you were hiring a chef, how would you gauge their ability? It seems fairly obvious that you would have them cook you a meal. Contrast that with most classrooms, where they would be asked to draw a map to the grocery store or tell how many teaspoons are in a tablespoon. It is ridiculous in that context, but it's not too far from the reality of school testing.

I've discovered that authenticity is a generally useful concept, and shows up (although not under that name) in the testing literature. Proponents of Extreme Programming refer to test-first programming, where, as you would expect, you write your tests first and then write the code. The idea is that the tests tell you how things ought to work and then you make the code fit that model. I like this because it tries to impose an authenticity constraint on testing: make your tests do the things your code should do. I do this quite a bit, and I've discovered that the value of the approach is not so much in the fact that you find and fix bugs in your code quickly this way (you do) but that you find and fix bugs in your API and your thinking. If you write your API first, you will write tests that call it the way it was written and you aren't as likely to notice how awkward a certain set of method calls is, or how you never really need to use some methods. If you write your tests first with an imaginary "perfect" API, you will then code up that nice, simple, authentic interface. Try it some time on a new project. You will be surprised how different your methods end up looking since you have the freedom to write them as you might use them, rather than testing some predetermined API.

You can also impose authenticity on your debugging. Try to imagine how a real user would be using the system, and use the most realistic data available. Think through a day in the life of a user, including what other applications they might be running or how they might easily perform an action that you thought would never happen. You will be surprised how many times you can end up replicating a seemingly unreproducible problem just by making it a little more real.
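
To make the test-first idea concrete, here is a minimal sketch. The test is written first, against an imagined parse_cmdline function (a hypothetical name and API, invented for this example), and only then is the simplest implementation written to satisfy it.

```python
# Step 1: written first, calling the API the way we wish it worked.
def test_parse_cmdline():
    params = parse_cmdline("ro quiet libata.atapi_enabled=1")
    assert params["libata.atapi_enabled"] == "1"
    assert params["quiet"] is True  # bare flags read as True
    assert "ide0" not in params


# Step 2: the simplest implementation that makes the test pass.
def parse_cmdline(cmdline):
    """Split a kernel-style command line into a dict (hypothetical helper)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


test_parse_cmdline()
print("ok")
```

Writing the test first is what settled the shape of the return value here: a plain dict with True for bare flags is what the calling code wanted, so that is what got built.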

I read an interesting article in Wired a few weeks ago (I get the print edition, but you can read it here.) I thought it was relevant to Distance Debugging because of the comparison of the traditional approach to map-making versus that of the up-and-comer. Basically, it boils down to driving all the roads to see what's going on, versus collecting huge amounts of data such as instant email alerts regarding road changes, scouring local media for information, and looking at satellite imagery to try to infer road status (speed, construction, one-way, etc) without ever leaving the office.

This new approach has started to gain significant ground despite being used by a smaller player. The problem with the "close" approach is that it is about exhaustion, where the drivers are going out and directly observing the state of roads and feeding that into their model. The "distance" approach relies on the fact that people local to the changes will be noting them anyway, and so there is no need to drive roads over and over again.

This is analogous to Distance Debugging versus a close approach such as running the program in a debugger. The debugger is like driving the roads: as you go along you will eventually come along to something worth noting, but will spend a lot of time looking at things you already knew. In the distance approach, you determine what information seems relevant and collect it directly, or have the system dump out information at certain key points. In my experience, and as has been illustrated by the growing market share of Tele Atlas, the distance approach can be more effective and less costly than "driving the roads" since you spend so much less time filtering out redundant data. It will be interesting to see if this trend will spread to other industries.
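
As a small sketch of "have the system dump out information at certain key points", here is a function instrumented with Python's standard logging module. The checkout function, its arguments, and the logger name are all hypothetical; the point is only that the entry, the unusual branch, and the result are recorded, so the run can be understood afterwards without stepping through it.

```python
import logging

log = logging.getLogger("distance")


def checkout(cart, tax_rate):
    """Hypothetical function instrumented at its key decision points."""
    log.info("checkout start items=%d tax_rate=%.3f", len(cart), tax_rate)
    subtotal = sum(price for _, price in cart)
    if subtotal == 0:
        # The unusual branch is exactly what a remote debugger wants to see.
        log.warning("empty or zero-priced cart: %r", cart)
    total = round(subtotal * (1 + tax_rate), 2)
    log.info("checkout done subtotal=%.2f total=%.2f", subtotal, total)
    return total


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(name)s %(message)s")
    checkout([("widget", 9.99), ("gadget", 5.01)], tax_rate=0.05)
```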

My grandmother has an old laptop (going on 5 years) that has started to have a lot of odd problems. Of course, she bought the long warranty so it's the manufacturer's problem for the most part. I haven't been able to convince her that all the time she spends fretting over it, on the phone with them, trying to fix the problem herself, calling me to get my advice on fixing the problem, etc. is far more costly than running out to Best Buy or whatever and picking up a new laptop, which now can be had for less than $500 easily. It's been another interesting exercise in Distance Debugging both with her on the phone, and listening to the results of her latest conversation with the manufacturer. Some highlights:

  • The initial problem as she described it, was that she saw a flash of light and then smoke came out of the side of it. I don't know what really happened, because there is very little inside of a modern computer that dies in such a dramatic fashion.
  • As it was under warranty, she sent it off to the manufacturer after I backed up the hard drive. They claimed that the screen was cracked (which it most certainly was not when it left here) and so they were sending it back.
  • When she got it back, they had replaced the screen. So I guess they invented/discovered and then fixed a problem that as far as we know was unrelated to the original problem. It was some of the worst distance debugging I've ever seen.
  • When she got it back, she started having a weird problem with not being able to click links in applications and have them open a web browser properly. While this had to be unrelated, she spent hours on the phone with their tech support who were completely stumped. Fortunately, I knew that she had tried to install Firefox and I guessed (correctly) that it screwed up her "Applications to Use" settings for HTML mime types and http:// URLs. I fixed it in about 30 seconds much to her amazement.
  • A few days ago, the screen went black while she was working on it. The lights come on, it gets hot, etc. So something is happening, but it's impossible to tell what is going on. I assume that the brand-new screen they quietly installed failed, but my grandmother insisted that it was getting hotter than it used to. However, there is no proof either way.

To me, the basic principle being illustrated again and again here is looking at what has changed. In the case of the bad linking, I first ascertained what if anything was different and only after I learned that a new browser had been installed was it clear what the likely problem was. On the second issue, I assume that the new screen was to blame both because it was the thing that changed, and because of the famous "bathtub" failure curve that electronic components tend to follow. While the heat issue might in actuality have something to do with the problem, it has to be thrown out because we have no hard data about whether or not the computer is actually hotter than it used to be. I've wasted so much time in my debugging life because I've decided that some error or condition that I've noticed is responsible for a problem, when in fact, it was there before and so it probably has little or nothing to do with the problem. In the long run, this whole process showed me again why I don't get the warranties on anything. Companies are so bad at debugging problems that I might as well put that saved warranty money towards a replacement and save all the time and hassle. Warranties also convince you to keep something alive long after it should have been discarded. Keep that in mind the next time they try to upsell you.
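
The "what has changed?" question can be made mechanical whenever there are snapshots to compare. A minimal sketch, assuming the state can be written down as simple setting/value pairs; the snapshot contents below are invented placeholders, since the real settings on the laptop were never recorded:

```python
def what_changed(before, after):
    """Compare two {setting: value} snapshots (hypothetical format)."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return added, removed, changed


# Illustrative values only.
before = {"http_handler": "old browser", "screen": "original"}
after = {"http_handler": "new browser", "screen": "replacement",
         "firefox": "installed"}

added, removed, changed = what_changed(before, after)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Anything that is in neither "added", "removed", nor "changed" was there before the problem started, which is a good reason to look elsewhere first.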

I stayed at a hotel recently that offered wireless internet for free. However, it didn't work very well. Despite having good signal strength (iwconfig showed a 70/100 or higher), and the fact that I had no trouble talking to the router itself, it was dropping packets like crazy, and it would just plain stop responding. I would have IM open, and then I would suddenly be disconnected, or the person I was talking to would stop receiving messages or vice versa. It was very annoying.

As is my habit, I started poking around on the network a little bit just to see what was going on. Nothing seemed awry, so I entered the address of the router itself into my web browser seeing if I could get some sort of status page. With the page that came up, I knew immediately that I was looking at a Linksys router. When I clicked on the status page, it asked for a username and password, and just out of curiosity and fully expecting it to fail, I entered the defaults. Of course, I was immediately dropped into administrative mode.

So now I could get a little more information about what was going on. Looking at the admin console, it became clear that they were running an ancient firmware and that an upgrade was desperately overdue. However, I started to feel weird about the whole thing. Partially it's because I always feel ambivalent about even benignly poking around on someone else's system, and partially because I know that in the current litigious culture, you can get thrown in jail for even thinking about cracking something. I knew it would be so easy to just upgrade their firmware, reboot, and no one would be the wiser (except for maybe the front desk, which would note a drop-off in complaints about their service), but it just didn't seem right. I tried to think of the analogous situation in real life: I figure it's like coming over to someone's house, noticing that their door is open and walking in to discover a leaky faucet that you fix before leaving. I like that analogy because like the real life situation, performing a good deed subjects you to legal issues such as being accused of breaking and entering.

We as a society have a very anti-intrusion bias, and I like that even if it means occasionally good deeds cannot be performed. However, I think the emergence of Wikis and other community controlled resources shows that under the right circumstances we are willing to sacrifice control for the possibility of greater positive outcomes. Perhaps someday we will trust each other enough to open up this idea to a larger set of environments.

I had a funny distance debugging experience while traveling in Boston. We were staying with in-laws who had a high-speed connection, but had it hooked up to only one computer. I decided to go into my old office in Somerville, and while I was there, my wife called because she wanted to hook up her computer so she could check her email. She told me that it looked like a cable modem (which it was), so I knew from experience that generally they are relatively simple to hook up to. You don't need a username and password most of the time since it is a LAN-style connection (i.e. plug in your ethernet cable and go) rather than PPPoE or something else more complicated. However, I also knew, from a very frustrating past experience, that the cable modem learns your MAC address, and so you can't just disconnect one computer and hook up another; you have to turn the modem off and back on again. Here is a transcript of our conversation:

Me: Plug the ethernet cable into your computer.

Wife: Okay, done.

Me: Now turn the modem off and back on again.

Wife: So I should just hit the on/off button on the top and then hit it again?

Me: Yeah, wait 15 seconds or so before turning it back on again.

Wife: Okay...now what?

Me: (long-winded description of configuring dhcp on Mac OS X)

Wife: It won't give me an IP address

Me: Hmmm

I was pretty much stumped and chalked it up to some system I hadn't seen before or which required a password as some cable services do. Later on when I got home, I looked at the modem and saw the "on/off" button. I realized then that it was actually the "standby" button, which wasn't what we needed at all. The modem has no on/off button; you have to unplug and replug it.

It's funny because at the time I thought, "Gee, these things almost never have on/off switches since it saves 3 cents. This must be a model I've never seen before." Instead of asking better questions like "are all the lights off now?" (they wouldn't be in standby mode, the standby light would be lit). To me it's a classic example of falling into a trap of assuming that the information being given by the remote person (and my wife is very technically savvy, so I had no reason to doubt her) is completely accurate, rather than relying on the observables to verify information. I had done everything right, except for asking her to push a button that didn't exist.

My new Abit board seems to work great. I was able to boot it with all peripherals hooked up the first time I launched it, so I'm going to assume that something was genuinely wrong with the old one (now on its way back to Abit).

I had a very interesting and somewhat terrifying experience traveling home from Boston yesterday. When I arrived at the ticket counter with my wife and toddler, they proceeded to inform us that while his ticket had been issued, our seats had been "reserved, but not ticketed". It turned out there had been some agent error and they had put our seats on "Courtesy Hold" instead of just booking them, and that was combined with some sort of computer error where the hold was automatically cleared. Ultimately, we got on the flight, but after the fact, I was in a distance debugging mindset and tried to think of the fundamental issues, and how they might be prevented in the future.

  1. The system was in an essentially impossible state according to the average person's (i.e. my) mental model. I wasn't aware that we could exist in a limbo state where we had reservations but not tickets, although I do now. Developers tend to put those kinds of states into programs all the time, sometimes directly, sometimes implicitly. Usually they are meant as a temporary "holding" state to allow a certain transition to take place, as in this case in the form of a "courtesy hold", whatever that means. However, in my experience, these states are the source of most problems, because the people on the inside (the gate agent) have the problem of not only trying to understand it themselves, but they have to communicate it to a stultified outside party. Also, since they are generally poorly understood, humans almost always do The Wrong Thing when they are encountered, leading to a worse situation as in this case where our hold was unceremoniously cancelled at some point.
  2. This state was indistinguishable from the normal, ticketed state. We received confirmation emails with an itinerary, etc. The only clue I might have had is that we were billed only for my son's ticket. If you are going to allow in-between states that exist only in software, at least make a huge note of it in any communication so that we know.
  3. It turned out that my 18-month-old son had in fact been ticketed, so there was actually a ticket in the system for a solo infant flyer. I'm guessing that should not have been allowed or should have been flagged immediately.
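
Points 2 and 3 can be sketched in code: make the in-between state explicit, say so loudly in every confirmation, and flag an itinerary where the only ticketed passenger is a small child. Everything here is hypothetical (the class names, the wording of the messages, and especially min_solo_age, which is a placeholder and not a real airline rule).

```python
from dataclasses import dataclass
from enum import Enum


class BookingState(Enum):
    HELD = "courtesy hold"
    TICKETED = "ticketed"


@dataclass
class Passenger:
    name: str
    age_years: float
    state: BookingState


def confirmation_lines(passengers):
    """Spell out each passenger's state instead of implying 'ticketed'."""
    lines = []
    for p in passengers:
        if p.state is BookingState.HELD:
            lines.append(f"{p.name}: ON HOLD, NOT TICKETED")
        else:
            lines.append(f"{p.name}: ticketed")
    return lines


def validation_errors(passengers, min_solo_age=5):
    """Flag itineraries where only young children hold tickets.

    min_solo_age is an assumed placeholder, not a real airline rule.
    """
    ticketed = [p for p in passengers if p.state is BookingState.TICKETED]
    if ticketed and all(p.age_years < min_solo_age for p in ticketed):
        return ["every ticketed passenger is below the solo-travel age"]
    return []


family = [
    Passenger("Parent 1", 35, BookingState.HELD),
    Passenger("Parent 2", 35, BookingState.HELD),
    Passenger("Toddler", 1.5, BookingState.TICKETED),
]
print("\n".join(confirmation_lines(family)))
print(validation_errors(family))
```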

Once the gate agent determined that we had done nothing wrong and that we truly had all the stuff you would have had if you had actually been ticketed, she proceeded to try to get us our tickets. Oddly, the real problem wasn't that the plane was full, it was that she wanted to get us our original fare, as well she should. She kept saying things like "the fare no longer exists", which to me brought up another key point: many times software keeps users from doing necessary things, I assume in an effort to avoid fraud. For example, you can't arbitrarily change the price of a piece of clothing or a hamburger at the register. This makes sense from the corporate point of view; they don't have to worry about someone giving all their friends a 50% discount. On the other hand, if you've ever waited for 15+ minutes when an item rings up wrong and no one on site has the power to change it, you can easily see the downside. This is the state we found ourselves in. The agent was very nice about it and was able to get our original fare eventually, but it took multiple phone calls and lots of typing.

It seems like there is a better way to handle these circumstances: auditing. Allow users to make justified changes at the "register", with a required explanation and an audit timestamp and user credential. If people knew that any time they access these features it raises a flag, they would be unlikely to try them for fun and profit. Or better yet, stop sweating it so much. The small amounts you lose in employee theft would be compensated by greater customer satisfaction. I am hesitant to fly this airline again (although I'm sure I will due to their greater availability of direct flights), but I will likely think twice when there are competing fares and routes. As it stands, employees are restricted from certain types of fraud, but they are also prevented from meeting customer needs.
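
A minimal sketch of that auditing idea, assuming a hypothetical Register class: the override is allowed, but only with a non-empty explanation, and every override leaves a record of who, what, why, and when. The item name and prices in the usage line are invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    user: str
    item: str
    old_price: float
    new_price: float
    reason: str
    timestamp: datetime


class Register:
    """Hypothetical point-of-sale register that allows audited overrides."""

    def __init__(self, prices):
        self.prices = dict(prices)
        self.audit_log = []

    def override_price(self, user, item, new_price, reason):
        if not reason.strip():
            raise ValueError("an explanation is required for a price override")
        entry = AuditEntry(user, item, self.prices[item], new_price,
                           reason.strip(), datetime.now(timezone.utc))
        self.audit_log.append(entry)  # record first, then apply the change
        self.prices[item] = new_price
        return entry


register = Register({"fare": 400.00})  # made-up price
register.override_price("agent42", "fare", 300.00,
                        "restoring original fare after hold was cleared")
print(register.audit_log[0])
```

The design choice is that the software never says "no"; it says "yes, and this is on the record", which keeps the agent able to fix the customer's problem on the spot.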

I came across this post, linked from Slashdot. In it, Steven Levy speaks somewhat philosophically about the iPod shuffle feature, and the well-documented non-randomness problem. He comes to the conclusion that there is no deep conspiracy to play certain artists and the iPod is not telepathic; we are simply illustrating two well-known cognitive phenomena: our general failure to understand statistics (see John Allen Paulos's Innumeracy for an excellent discussion), and our desire to see patterns where there are none.

The article is interesting, but does not discuss what to me is the most interesting part of the "problem". Does their shuffle feature have a bug? Their development team originally said no. They insisted their random number generator was a perfectly valid algorithm. I'm sure that it is. However, that isn't the bug. The bug is giving people "good" randomness instead of what they want, which is something that feels random to a person. Of course, they are a customer-oriented company and fixed the bug. From the article above:

But the non-randomness illusion was so prevalent that ultimately Apple felt compelled to address it. In the version of iTunes rolled out in September 2005, there appeared a new feature: smart shuffle. It presents iPodders with a scroll bar that "allows you to control how likely you are to hear multiple songs in a row by the same artists or on the same album". If you pull the lever to the right, the iPod will mess with its usual distribution pattern, intentionally spacing out songs by a given artist. As Jobs explained it in his presentation the day the new iTunes rolled out, he gave what he hoped would be the last word on the Great iPod Randomness Controversy: "We're making it less random to make it feel more random."

I think the last quote really sums it up. They added a feature to fix a bug in perception despite the fact that there is no bug in execution. This is a lesson that many of us have to learn the hard way, when we continue to fight a losing battle to avoid fixing a bug because we did things "right", but it turns out it isn't what was wanted.
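
To show what "less random to feel more random" might look like, here is a greedy sketch that avoids playing the same artist twice in a row whenever another artist is still available. This is not Apple's algorithm, which the article does not describe in detail; spaced_shuffle and the (artist, title) tuple format are assumptions made for this example.

```python
import random


def spaced_shuffle(songs, rng=None):
    """Shuffle (artist, title) pairs, avoiding back-to-back artists.

    Greedy sketch: at each step pick randomly among the remaining songs
    whose artist differs from the previous pick, falling back to any
    remaining song when no such candidate exists.
    """
    rng = rng or random.Random()
    remaining = list(songs)
    result = []
    while remaining:
        last_artist = result[-1][0] if result else None
        candidates = [s for s in remaining if s[0] != last_artist] or remaining
        choice = rng.choice(candidates)
        remaining.remove(choice)
        result.append(choice)
    return result


library = [("A", "one"), ("A", "two"), ("A", "three"),
           ("B", "one"), ("B", "two"), ("C", "one")]
for artist, title in spaced_shuffle(library):
    print(artist, title)
```

A uniform shuffle of this library will often put two "A" songs next to each other, which is statistically unremarkable but feels wrong to a listener. The greedy version can still end on a repeat if only one artist is left, so it reduces clumping rather than eliminating it.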