Sunday, April 18, 2010
Off to the races
I mentioned briefly yesterday about the issue I was having with syslogintr
while booting the
computer. On my system, the system would hang just after loading
syslogintr
. I tracked it down to initlog
hanging. Further investigation revealed that both
syslogintr
and initlog
were hanging, but the
significance of that escaped me until an epiphany I had while sleeping: I
was experiencing yet another race
condition!
A quick test today proved that yes, it was a race condition. A particularly nasty race condition too, since once again, I wasn't explicitly writing multi-threaded code.
syslogintr
creates a local socket (/dev/log
for
those that who are curious) and then waits to receive logging
messages sent to said socket, something like:
local = socket(...) while(!interrupted) { read(socket,buffer,sizeof(buffer)); process(buffer); }
But in the process of processing the incoming message,
syslogintr
may itself call syslog()
:
while(!interrupted) { read(socket,buffer,sizeof(buffer)); process(buffer); syslog(LOG_DEBUG,"processed message"); }
syslog()
(which is part of the standard library under Unix)
sends the message to the local socket (/dev/log
). The data is
queued up in the socket, but it's okay because it'll cycle around quickly
enough to pick up the new data. Unless there's too much data already queued
in the local socket, at which point whoever calls syslog()
will
block until the backlogged data in the local socket is dealt with.
The startup script (/etc/init.d/syslog
for those of you
following along at home) starts both syslogintr
and
klogd
. klogd
is a program that pulls the data
from the kernel logging queue (logging messages the kernel itself generates,
but the kernel can't use /dev/log
as that's just a Unix convention, not
something enforced by the kernel itself) and logs that data via
syslog()
. And by the time klogd
starts up,
there's quite a bit of logging data generated by the kernel. So that data
gets blasted at syslogintr
(and in the process, so much data is
being sent that klogd
is blocked from running). But
syslogintr
is still coming up to speed and generating a bunch
of internal messages and suddenly, its calls to syslog()
are
blocking, thus causing a deadlock:
while(!interrupted) { read(socket,buffer,sizeof(buffer)); process(buffer); /* this takes some time */ /*-------------------------------------------------- ; meanwhile, something like klogd could be blasting ; data to the local socket, filling it up, thus when ; we get to: ;---------------------------------------------------*/ syslog(LOG_DEBUG,"processed message"); /*------------------------------------------------- ; the call to syslog() blocks, thus blocking the ; program until something else (in this case, *us*) ; drains the data waiting in the socket. But we ; can't drain the data because we're waiting (via ; syslog()) for the data to be drained! ; ; Can you say, *deadlock* boys and girls? I knew ; you could. ;--------------------------------------------------*/ }
This also explains why it only happened when booting—because that's
about the only time so much data is pushed to syslogintr
that
socket operations (reading, writing) are blocked. It also explains why I
haven't seen it on any other system I'm running it on, since those systems
don't run klogd
(being virtual hosts, they don't have
klogd
).
If you've ever wondered why software tends to crash all the time, it's odd interactions like this that are the cause (and this was an easy problem to diagnose, all things considered).
So now I internally queue any logging messages and handle them in the main loop, something along the lines of:
while(!interrupted) { foreach msg in queued_messages process(msg); read(socket,buffer,sizeof(buffer)); process(buffer); queuelog(LOG_DEUBG,"processed message"); }