Two tales of "smart" software gone awry
Friday, July 3, 2009
Keywords: Technology
Don't mess with my data
In an era when our lives are increasingly defined by and stored as data, it's important to keep that data safe. Backups are not always enough, because before you can use a backup to recover when something goes wrong, you first need to know that something has gone wrong. While the catastrophic death of a hard drive is hard to miss, subtler forms of corruption can be much more difficult to detect.
A couple of years ago, one of the sticks of memory in my desktop computer developed a defect. It was a very, very small defect: a single bit (not even a full byte) that had gone bad. While one bit does not sound like much, it can still be problematic, especially when dealing with compressed data in which even one bit of corruption has the potential to ruin all of the data that comes after it. But since this was just one bit that had gone bad, it was not readily apparent that there was a problem. The effect of a bit of bad memory depends on what that section of memory is currently being used for, which meant that the effect of this bad bit could vary frequently as memory got shuffled around and reallocated to various tasks. Most of the time, there was no discernible effect. Sometimes, there might be minor glitches in software that could easily be misattributed to a programming bug. And sometimes, such as when that piece of memory was used as part of a file buffer, it could corrupt a file on the disk, which, depending on the type of the file, could easily go unnoticed.
Fortunately, I had a habit of hashing important files and periodically checking those hashes, as a sort of trip wire to alert me to potential problems with data integrity. When I started getting hash mismatches, I knew that something was wrong. I had originally suspected a faulty disk, but when I considered that I had also run into a small handful of minor but odd software glitches, I decided to test my memory, which was how I discovered the defect. Without my practice of hashing important files, it would almost certainly have taken me months longer to realize that there was a problem.
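For readers who want a similar trip wire, here is a minimal sketch in Python. It is not the tool I use; the function names, the choice of SHA-256, and the manifest layout (a plain dictionary of relative path to hex digest) are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its current hash."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hash_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_mismatches(root, manifest):
    """List manifest entries that are now missing or hash differently."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or hash_file(target) != expected:
            bad.append(rel)
    return bad
```

The idea is to call build_manifest once while the files are known to be good, store the result somewhere safe, and run find_mismatches against it periodically. Files that are legitimately edited need their entries refreshed, or they will show up as false alarms.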
This afternoon, when I checked the hashes of my music files, I was a little alarmed to see 20 hash mismatches. I quickly noticed that only two folders were affected: every file within those two folders failed the check, while every file outside of those two folders was fine. This obviously did not look like random data corruption. I recalled that, while testing Windows 7 RC on my desktop computer, I had played these two folders of music (which were located on my laptop; this was done over the network) using Windows Media Player instead of Winamp, which is what I usually use to play my music. Comparing the altered files with my backup revealed that WMP had altered the files' metadata (stuff like tags, ratings, etc.).
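The "does this look random?" check can be automated. The sketch below is a hypothetical helper, not something from my actual setup: it groups mismatched files by folder and reports how many files in each folder failed out of the total, assuming paths are given as forward-slash relative paths.

```python
from collections import Counter
from pathlib import PurePosixPath


def summarize_by_folder(all_files, mismatched):
    """Return {folder: (failed_count, total_count)} for folders with failures."""
    totals = Counter(str(PurePosixPath(f).parent) for f in all_files)
    failed = Counter(str(PurePosixPath(f).parent) for f in mismatched)
    return {
        folder: (failed[folder], totals[folder])
        for folder in totals
        if failed[folder]
    }
```

A result in which a few folders fail completely while everything else is clean points to something systematic, such as a program rewriting files, rather than to failing hardware.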
Now, I understand that WMP is just trying to be "smart" and "help" me organize my music, but I already have my music organized just the way I want it using the file system hierarchy, thank you very much. And most importantly, I didn't ask for its help. I didn't tag the music, I didn't click any of those silly rating stars, I didn't do anything beyond dropping a couple of folders into WMP and hitting the play button. The expectation should be that when a user opens/reads/plays some file, it should be a read-only operation: in other words, "don't mess with my data!" Only when the user is explicitly editing the file or its associated metadata should programs open a file in anything other than read-only mode.
Among the many problems raised by this "smart" behavior is excess disk activity. The metadata that WMP altered was located at the start of each file, which meant that any alteration that changes the size of the metadata would require rereading and rewriting the entire file to disk. That is an expensive operation when done locally, and it is a downright idiotic thing to do unnecessarily over a network where, aside from incurring an expensive disk cost, you will also incur an even more expensive cost to shoot the entire file over the network and back again.* This "smart" behavior also altered the timestamp of the file, which ruins the ability to search for and identify files based on timestamps. And, of course, this erodes my ability to detect file integrity problems by altering files and thus raising false alarms.
An unwanted wake-up call
While testing Windows 7 RC on my old desktop, I ran into a problem where the system would not stay in hibernation. I would hibernate the system, and within minutes, it would boot itself back up. I quickly discovered that this was a problem with Wake-on-LAN (WOL). The default setting for the network driver in Windows XP was to wake only on a magic packet, while the default setting for the driver in Windows 7 was to wake on any directed packet, which was problematic since the brain-dead router provided by the ISP was probing machines on the network every few minutes. This problem was easily diagnosed and fixed by setting the WOL to respond only to magic packets.
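For those unfamiliar with it, a magic packet is a very specific payload: six bytes of 0xFF followed by the target's 6-byte MAC address repeated 16 times, which is why ordinary traffic does not trigger it by accident. Here is a minimal sketch in Python; the function names are made up for illustration, and UDP broadcast on port 9 is only a common convention, not a requirement.

```python
import socket


def build_magic_packet(mac):
    """Build a WOL magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("MAC address must contain 12 hex digits")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet over UDP (port 9 is an assumed convention)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

A network card set to wake only on this pattern ignores a router's periodic probes, whereas one set to wake on any directed packet does not.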
That night, shortly after I had gone to bed, my desktop booted itself up. I crawled out of bed, disabled the WOL completely, and hibernated the system again; having just dealt with the system inappropriately waking from WOL, I had incorrectly assumed that this was somehow related to that. Then it happened again the next night. Increasingly frustrated at the system waking up, I disconnected the network cable the third night and double-checked to make sure that I did not have any system wakeups scheduled in the system BIOS and that Windows Update was indeed set to my usual setting of manual. I also checked the operating system's logs, which did log the wakeup, but unhelpfully noted that the source of the wakeup was "Unknown". Still, it happened again the next night. This time, I glanced at my bedside clock when the system woke up: it was half past the hour. That's an unusually round time for the system to be waking up randomly, I thought. I examined the operating system's logs again, and while they still unhelpfully indicated that the source was "Unknown", they did contain one very important piece of information: the system woke up at exactly half past the hour (well, it was off by 0.3 seconds, but that's close enough). These wakeups were almost certainly scheduled.
Having already ruled out Windows Update and BIOS-scheduled wakeups, I turned to the Task Scheduler service, which I always have disabled on Windows XP (in Windows 7, it is not possible to disable it, in part because it now plays a far more prominent role than it did back in the days of NT 5.x). I was shocked to see the enormous list of items in the task scheduler; it took a long time for me to wade through all of them. Some were triggered events along the lines of "do A if B happens", and some were scheduled events along the lines of "do A at 10 AM every day". Almost every one of the scheduled events was an instance of Windows trying to be "smart" and "helpful" by performing periodic self-diagnostics, checks, updates, etc. I got trigger-happy and purged well over half of the items in the task scheduler, including virtually all of the scheduled events; almost none of the items in the task scheduler were necessary or even particularly useful. Among those purged were two events (or was it just one? I don't recall the specifics) that were scheduled for the time when the computer was waking up. After purging the task scheduler, I never experienced the late-night wakeup problem again.
It should be noted, however, that Microsoft did not intend for the system to be woken up for these mundane tasks. While events have the option of waking the computer from sleep or hibernation, none of these events were configured to exercise that option: they were configured to fire only if the computer was awake. So I had encountered what appears to be a bug with the task scheduler (well, this is pre-release software, after all). Nevertheless, this highlights the pitfalls of a "smart" system that operates on autopilot. I can understand the usefulness for the average computer user who would appreciate the system looking after itself, but I have always preferred keeping the reins in my hands.
________________
* Some readers may wonder why WMP does not simply make use of file system streams. After all, storing metadata is one of the purposes of file system streams. The problem with out-of-file storage is that it requires file system support, which means that such metadata would be lost if a file is transmitted over the Internet, burned to a CD, or copied to a USB drive that uses FAT (many devices can't afford the overhead of anything but the simplest of file systems, which is why FAT is still so common).
