In addition to the regular posts documenting features of 0.6 and giving hints and tips about its usage, release announcements and so forth, I’ll also be posting insights and anecdotes about Upstart’s ongoing development. A particular story cropped up again this month, and I thought I’d share it with you.
When I began work on Upstart, one of the earliest decisions I made was to make sure the code was very well covered by a comprehensive test suite. I’d been working with Robert Collins a lot in the previous couple of years, and he is very much an advocate of practices such as Extreme Programming (XP) and Agile Development, especially the discipline of Test Driven Development.
I’d also recently seen a keynote by Andrew Tridgell in which he talked about some of the development of Samba 4, in particular the high use of both test cases and code generation in that code-base. Something he said in the keynote stuck with me: “untested code is broken code”.
Statistics obviously depend on exactly how you count lines of code, but using a simple semi-colon count the combined source code of libnih and Upstart is slightly over 20,000 lines of code. The combined source code of the test suite for both is slightly over 120,000 lines of code.
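For the curious, a semi-colon count really is as crude as it sounds. The following is only a sketch of the idea, not the script I actually used, and the function name count_semicolons is made up for illustration:

```c
#include <stddef.h>

/* Count every ';' in a buffer of source text.
 * This is deliberately naive: it also counts semi-colons inside
 * comments, string literals and for (;;) headers, which is why the
 * resulting figure is only a rough guide. */
long
count_semicolons (const char *src)
{
	long count = 0;

	for (const char *p = src; *p != '\0'; p++)
		if (*p == ';')
			count++;

	return count;
}
```

Feeding it the text of each source file and summing the results gives a figure that is at least independent of blank lines, comments and brace style.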
The init daemon is an extremely important part of a Linux system: if it crashes, you’re left with a kernel panic; if it simply misbehaves, you’re left with just severe problems. Not only was I changing it, but I was replacing a very simple, dumb system (Sys V init) with something comparatively complex, with rules and behaviours that needed rigorous testing.
It would have been very scary to have developed it without the careful testing, and I would have been very worried if anyone had agreed to replace such a core component of the system without this test suite to back up its behaviour.
That being said, maintaining the test suite can be a huge burden. Don’t believe what anybody tells you: if you’re writing test cases as well as code, your pace of development slows. They’re right that you spend a lot less time debugging, of course, but unlike in the commercial software business, free software developers tend to release first and debug later. If you use a similarly high test-to-code ratio in your own project, you’ll find that the time until your first release will be pretty long, and the time between releases longer as well.
Another decision is whether to do Test Driven Development or not; that discipline requires that you always write the tests first, to fail, and only write code in order to make the tests pass. I’m not a fan of TDD, and I’ve no problem admitting that I mostly did not use it for Upstart. My gut feel is that TDD produces code that hangs, swings and loops just to deal with testing. It also just doesn’t suit my coding style: I like to write code from the middle outwards, so the function API is the last thing I tend to fix, where TDD forces it to be the first.
I’m also not convinced TDD is really suitable for a language like C; it’s pretty hard to get a test case to compile, run and fail without writing any supporting code such as a header file, etc.
I have found TDD useful when I have code that really does break down into a single unit with a well-defined and obvious API, where the inputs and outputs were obvious but the algorithm for getting between them wasn’t at the time.
What I’ve tended to do instead is write code the way I naturally would, and write test cases alongside to run the code and make sure it’s working. As the code grows more complex, more test cases appear for it. One big advantage of this is that I don’t need to reboot or fire up a VM as much; I can exercise a large proportion of Upstart’s operation through the test suite.
Now, onto the stories. There are two similar ones.
One of the side-effects of testing Upstart so strongly is that the tests are not only driving the code I’ve written but also code in libraries and even in the kernel. One particular set of tests was covering the code in libnih and Upstart that handles watching the configuration directory for changes; it’s this code that means Upstart automatically reloads jobs when you edit them, without needing an explicit signal.
One day these test cases started failing without warning. Investigation showed that they passed fine under older kernels, but with the newest kernel update to Ubuntu, they failed.
The inotify subsystem in the kernel had undergone a radical overhaul and rewrite. Rather than being its own code, it was completely rebased onto the new fsnotify system. Fortunately I was aware of this, and after carefully checking that it was indeed the kernel behaviour that was now incorrect (and that it wasn’t incorrect before), I got in touch with Eric Paris, the author of the new code, and was able to give him minimal example code to replicate the problem.
inotify: check filename before dropping repeat events
That kernel bug was a while ago, but pretty much the same story happened again recently, just this time not with the kernel.
Again, the story started with Upstart’s test suite failing. The engineer who first noticed it assumed it was an issue with the new build daemon and disabled the test for the time being. The test was in the part of the code testing Upstart’s interaction with D-Bus.
Now, sometimes I tend to write tests to deal with corner-cases and “what if” scenarios that I dream up. This isn’t always about testing my code; often it’s a case of finding out whether something is really possible or whether that thing misbehaves. These tests still stay in the suite of course.
A particular set of tests was intended to find out what happened if the D-Bus daemon crashed during initial connection. I considered this fairly important because at times the libdbus library has called exit() or abort() when things happened that it didn’t like. If you call that from the init daemon, the kernel panics.
These tests had worked fine for a couple of years (actually at the time I had to fix bugs in libdbus to make them pass) but now one of these tests was breaking. The disconnection was causing SIGPIPE to be delivered to the test.
Again, this turned out to be due to a change to D-Bus. Lennart Poettering had been working on some changes to avoid libdbus’s awkward SIGPIPE handling and replace it with the use of the MSG_NOSIGNAL flag. Unfortunately he’d missed a case in the authentication code. The side-effect was that if the D-Bus daemon had crashed, been killed, OOM’d, etc. during initial connection – the connecting application would have gone too. Especially bad for an init daemon.
Fortunately Upstart’s test suite caught it, and the fix was simple.
sysdeps-unix: use MSG_NOSIGNAL when sending creds
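The difference the flag makes is easy to see in isolation. This is a minimal sketch using a plain socketpair() rather than a real D-Bus connection, so it is not the Upstart test itself; the function name send_after_peer_exit is made up for illustration, and closing one end of the pair merely stands in for the daemon dying.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send one byte to a peer that has already gone away.
 * Returns 0 if the send unexpectedly succeeded, otherwise the errno
 * value from the failing call. */
int
send_after_peer_exit (void)
{
	int  fds[2];
	char byte = 0;
	int  err = 0;

	if (socketpair (AF_UNIX, SOCK_STREAM, 0, fds) < 0)
		return errno;

	/* Closing one end stands in for the daemon crashing. */
	close (fds[1]);

	/* MSG_NOSIGNAL asks the kernel to report the broken connection
	 * as an EPIPE error rather than raising SIGPIPE. */
	if (send (fds[0], &byte, 1, MSG_NOSIGNAL) < 0)
		err = errno;

	close (fds[0]);
	return err;
}
```

With the flag, the caller gets EPIPE back and can carry on. Without it, the same send() raises SIGPIPE, whose default action is to terminate the process, which is exactly what you don’t want happening to init.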
(reposted from http://upstart.at/2010/12/20/the-importance-of-being-tested/ – post comments there)