Wednesday, May 05, 2010
Millions of moving parts
In a system of a million parts, if each part malfunctions only one time out of a million, a breakdown is certain.
—Stanislaw Lem
In between paying work, I'm getting syslogintr
ready for release—cleaning up
the Lua scripts, adding
licensing information, making sure everything I have actually works, that
type of thing. I have quite a few scripts that isolate some aspects of
working scripts (for instance, checking for ssh attempts and
blocking the offending IP) but that
weren't fully tested. A few were tested (as I'm using them at home), but
not all.
I updated the code on my private server and rewrote its script to use the new modules (as I'm calling them), only to watch the server seize up tight. After a few hours of debugging, I fixed the issue.
Only it wasn't with my code.
But first, the scenario I'm working with. Every hour,
syslogintr
will check to see if the webserver and nameserver
are still running (why here? Because I can, that's why) and log some stats
gathered from those processes. The checks are fairly easy—for the
webserver I query mod_status
and log the results; for the nameserver, I pull the PID from /var/run/named.pid
and from that, check
to see if the process exists. If they're both running, everything is fine.
It was when both were not running that syslogintr
froze.
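For illustration, here is a minimal sketch of the nameserver check described above. It is written in Python rather than the Lua the real scripts use, the function names are made up for the example, and it assumes the usual Unix trick of sending signal 0 to test whether a PID exists.

```python
import os


def read_pid(pidfile):
    """Return the PID stored in pidfile, or None if it can't be read."""
    try:
        with open(pidfile) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def process_running(pid):
    """Check for a process by sending it signal 0 (nothing is delivered)."""
    if pid <= 0:
        # PIDs of 0 or below address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # it exists, but belongs to another user
    return True


def nameserver_running(pidfile="/var/run/named.pid"):
    """True if the PID file is readable and names a live process."""
    pid = read_pid(pidfile)
    return pid is not None and process_running(pid)


if __name__ == "__main__":
    print("named is running" if nameserver_running() else "named is NOT running")
```

A stale PID file can point at an unrelated process that happens to have reused the number, so a check like this only says that something with that PID exists.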
Now, when the appropriate check determines that the process isn't running
it not only logs the situation, but sends me an email to alert me of the
situation. If only one of the two processes was down,
syslogintr
would work fine. It was only when both
were down that it froze up solid.
I thought it was another type of syslog deadlock: Postfix spews forth multiple log entries
for each email going through the system, and it could be that too much data
is logged before syslogintr
can read it, and thus Postfix
blocks, causing syslogintr
to block, and thus, deadlock.
Sure, I could maybe increase the socket buffer size, but that only pushes
the problem out a bit; it doesn't fix the issue once and for all. But any
real fix would probably have to deal with threads, one to just read data
continuously from the sockets and queue them up, and another one to pull the
queued results and process them, and that would require a major restructure
of the whole program (and I can't stand the pthreads
API). Faced with that,
I decided to see what Stevens
has to say about socket buffers:
With UDP, however, when a datagram arrives that will not fit in the socket receive buffer, that datagram is discarded. Recall that UDP has no flow control: It is easy for a fast sender to overwhelm a slower receiver, causing datagrams to be discarded by the receiver's UDP …
Hmm … okay, according to this, I shouldn't get deadlocks
because nothing should block. And when I checked the socket receive buffer
size, it was way larger than I expected it to be (around 99K if you
can believe it) so even if a process could be blocked sending a
UDP packet, Postfix (and
certainly syslogintr)
wasn't sending that much
data.
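For anyone wanting to check their own receive buffer, here is a minimal sketch in Python (not the code I actually ran). It assumes a plain UDP socket, and the helper name is made up; the number reported depends on the operating system and its configuration.

```python
import socket


def receive_buffer_size(sock):
    """Ask the kernel for the socket's receive buffer size, in bytes."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # A throwaway UDP socket; a real syslog daemon would query the
    # socket it is actually reading from.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print("receive buffer:", receive_buffer_size(s), "bytes")
```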
And on my side, there wasn't much code to check (around 2300 lines of
code for everything). And when a process list showed that
sendmail
was hanging, I decided to start looking there.
Now, I use Postfix, but Postfix comes with a “sendmail” executable that's compatible (command line wise) with the venerable sendmail. Imagine my surprise then:
[spc]brevard:~>ls -l /usr/sbin/sendmail
lrwxrwxrwx 1 root root 21 Feb 2 2007 /usr/sbin/sendmail -> /etc/alternatives/mta
[spc]brevard:~>ls -l /etc/alternatives/mta
lrwxrwxrwx 1 root root 26 May 5 16:30 /etc/alternatives/mta -> /usr/sbin/sendmail.sendmail
Um … what the … ?
[spc]brevard:~>ls -l /usr/sbin/sendmail*
lrwxrwxrwx 1 root root 21 Feb 2 2007 /usr/sbin/sendmail -> /etc/alternatives/mta
-rwxr-xr-x 1 root root 157424 Aug 12 2006 /usr/sbin/sendmail.postfix
-rwxr-sr-x 1 root smmsp 733912 Jun 14 2006 /usr/sbin/sendmail.sendmail
Oh.
I was using sendmail's sendmail
instead of
Postfix's sendmail
all this time.
Yikes!
When I used Postfix's sendmail
everything worked
perfectly.
Sigh.
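In hindsight, a script could have followed that chain of symlinks for me. Here is a minimal sketch in Python; the function name is made up for the example, and it only follows links on the final path component, the same hops the ls listings above show.

```python
import os


def symlink_chain(path):
    """Return every hop from path to whatever it finally points at."""
    chain = [path]
    seen = {path}
    while os.path.islink(path):
        target = os.readlink(path)
        # A relative target is relative to the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        if path in seen:
            break  # symlink loop; stop rather than spin forever
        seen.add(path)
        chain.append(path)
    return chain


if __name__ == "__main__":
    print(" -> ".join(symlink_chain("/usr/sbin/sendmail")))
```

The last entry in the list is the file that actually gets executed.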
mod_lua patched
And speaking of bugs, the bug I submitted to Apache was fixed!
Woot!