debugging

Sit back and grab a drink: it’s time for the story of one of my favorite bugs.

‘Twas my very first IT job, a software developer summer internship at a manufacturer of very serious medical equipment, primarily anesthesia delivery systems and hospital patient monitors. The latter are the boxes you see bleeping next to bed-ridden patients, drawing graphs of pulse, blood pressure, respiration etc and summoning nurses when they flatline. The office was full of 2-meter cylinders of laughing gas, luxuriously bearded gurus expert in embedded systems, and entire rooms devoted to the documentation needed for the FDA certification of various devices. Hushed voices still spoke of a bug that had slipped through testing ten years ago in one piece of anesthesia kit and caused the machine to reset in the middle of surgery, and needless to say a teenage greenhorn like myself was kept well clear of all production systems.

Instead I was given a plum prototyping project to test out the hot buzzwords of 1997: build a C++ server that listens to the serial port of a patient monitor and dumps a selection of interesting events to a SQL Server database, then send the data via CORBA to a Java applet, so doctors, relatives etc could see how the patient was doing via the Internet, with both the ability to see realtime data and browse previous records. Spiffy! Especially since I had zero practical experience in any of these languages or systems!

After a few weeks of butcherous hackery, much of it spent on inhaling the migraine-inducing Visibroker ORB manual and squishing utterly mundane type conversion bugs, I had my “Simpson” system more or less running, with “Homer” recording and serving the data and “Bart” displaying it. Along the way I had learned out that CORBA was hopelessly overcomplicated, that AWT was painful (GridBagLayouts, brr), that applets were slow as molasses, and that Java otherwise seemed like a pretty decent language. But there was a niggling bug: every now and then, the C++ server would crash hard, and I set about figuring why.

Since patient monitor whose output I was listening to was in another cubicle, I had done most of my development and testing using the handy “Demo” mode where it happily played back simulated cardiac arrests and such in a loop, and as far as I could tell my server never crashed while using this. It did sometimes crash when I or somebody else twiddled with the controls manually, especially during live demos, but for the life of me I couldn’t work out a way to make it crash repeatably, no matter what I did. I logged all events to disk to try to find out what had happened just before a crash, but slow and careful manual replication of the exact preceding event sequence (“set filter to X, twist control knob exactly three notches to the right, press button…”), complete with running back and forth because I couldn’t see my PC’s logs when using the patient monitor, failed to elicit a crash. Whatever “the evil event”, as I took to calling it, was, it seemed to cause crashes while eluding all logging. Was there some sort of serial I/O or hardware issue corrupting the events? Were cosmic rays flipping bits in my PC?!

After several weeks of frustration, where entire days devoted to experimentation had produced no results, I ended up basically adding printf statements to every single line between receiving the event from the serial port and writing it in the database… and in the process, as I revisited every line of that code, a sinking realization dawned on me.

When setting up the database schema, in a misguided attempt at saving space, I had made the amateur mistake of using the event timestamp as the primary key — so if two events happened to come in at the same millisecond, the DB would throw a unique key constraint violation. I noticed this early on, but determined that this only seemed to happen in unusual and uninteresting circumstances like when futzing about in the monitor’s internal configuration settings, so I added a catch block that logged a warning and continued on its merry way.

But. This was old-school code, and the log string was written C-style into a fixed-length 80-character buffer. While the length of the unique key violation message itself was invariable, the log timestamp happened to be formatted so that it used the long-form English weekday name (%E), so the output looked like “Monday, July 17, 1997, 10:38:47.123”. Turns out English weekday names have an interesting property:

Name	Length
Sunday	6
Monday	6
Friday	6
Tuesday	7
Thursday	8
Saturday	8
Wednesday	9

Caught on yet? On Wednesdays, and only on Wednesdays, if somebody manually twiddled certain bits in the monitor settings in a certain way, two events would occur during the same millisecond and cause the DB to throw an exception, and the error message that logged this would be exactly 81 bytes long including the null terminator, overflowing the 80-char buffer and causing the program to crash!

Ever since, I’ve made sure to use a dedicated and preferably auto-incremented integer ID as the primary key of every database table I need, and to log everything in ISO format, YYYY-MM-DD, no weekdays needed. And over the years, I’ve learned that no matter how seemingly random and unpredictable the behavior of the bug, there’s always a logical explanation for it that you can find if you dig deep enough; that “unrelated” errors rarely are; and that it’s almost certainly your own damn fault.

Update: This story is now available in Russian! Когда программа падает только по средам

Gyrovague

vaguely heretical musings on travel and technology

Crashes only on Wednesdays