Saturday, September 22, 2007
Software performance after a few hours with large sets of data
As a corollary to yesterday's entry about testing—make sure you test for several hours.
I found a bug in the latest version of the greylist daemon that only manifests itself after about six hours of running. For some as yet unknown reason, the program just stops responding. It doesn't segfault (if it did, it would automatically restart). It just doesn't quit (if it did, I wouldn't see it running in the process list). It just gets into a weird state. When I attach gdb
to the running instance the stack frame is somewhere in the weeds (that's a technical term) so its hard to isolate the problem.
This type of bug is very difficult to diagnose.
Although I do have an idea of what it might be. The latest feature (as a request by Smirk) is to checkpoint the program every hour or so—it dumps its internal state so it can pick up again when it restarts. When I checked the logs, the last two times it crashed (after running for about six hours) it was just as it was checkpointing itself (which is logged).
I removed the checkpoint feature from the “production” version, and hopefully, I won't get another influx of spam in six hours (the Postfix module accepts the incoming email if it doesn't get a response from the greyist daemon after five seconds—I figure a) it's better to receive spam than lose email and b) getting a ton of spam is a clear indication something is wrong).
Meanwhile, I'm running the grueling test slowly (one tuple per second), with the hopes of triggering (or at least, reproducing the problem) in six hours.