Tuesday, May 01, 2007
Which version was I using again?
Sorry if you've signed up for email notification and gotten some wierd notifications over the past twelve hours. That's because there was a show-stopping bug in the code that only now was triggered.
You see, I rewrote the email notification code some time ago, but hadn't fully tested it yet. Then, a few days ago I decided to add in custom error pages and kept having issues. I knew I had done it before, then I realized I did, over there, but the custom error pages weren't working over here. I thought maybe it was a different version of the code.
So I checked out a fresh copy over at the other place to see if I broke whatever it was that made the custom pages work (it was using an older code base there), and it still worked. But since I'm not using email notification over there, the new email notification code was still untested.
So I checked out a fresh copy here. The custom error pages were still broken.
Turns out it was the installation of Apache on the server here (I used the distro-installed version for a change)—seems that it has a directory alias for
/error/, which just so happens is what I named the directory. Had I named it
/errors/ (plural) things would have worked the first time and all of this wouldn't have happened (or rather, would have happened at a later time).
But I did.
But notice that I hadn't tested the email notification code yet.
Until yesterday and today.
And boy, did things blow up.
The program kept crashing. Normally, that isn't much of an issue (well, okay, it is an issue) because normally, one is in front of the program to watch it crash, but in this case, I wasn't aware of the crash, because I used the email interface.
mod_blog takes the entry (from a web page, a file, or through email), adds it to the archive, generates the new page, and then, if enabled, sends the email notifications.
It was crashing (unbeknowst to me) during the email notifications.
Postfix has been instructed to send emails to a particular address to
mod_blog, which it does. But Postfix also notices that the program crashed. So it requeues the email.
And when the program crashes, the update lock it has is released (technically, it's a file lock that the kernel maintains, but hey, the program goes away, so does the lock).
Next entry comes along, process repeats, and Postfix realizes it has some email queued up, and tries to send that at the same time. Now we have two processes trying to update the blog. Normally, this isn't an issue because of the update lock. But the lock is released in an uncontrolled state and …
Things blow up and get ugly.
But what I'm seeing is a blog that failed a validation check, I fix it, only to have it fail the validation check again, which I fix, only to fail the validation yet again, only now I notice that the same entry I thought I fixed twice keeps coming back to haunt me and even worse, the entire front page is a mess, and is remaining a mess no matter what I do.
I tracked the problem down, but I may have sent some spurious notifications in doing so. Once I realized it was the new email notification code (and man, how did I ever check that code in? Leaks memory like you wouldn't believe! Sheesh) I reverted to the old version, cleared out the mail queue and hopefully, got this situation under control.