Thursday, October 18, 2007
Everything always works the first time on Star Trek
They always made it look so easy on Star Trek: The Next Generation.
The Enterprise (NCC-1701-D) is facing certain doom by the imminent implosion of a nearby nebula because the Thalaxians stole the warp core because they need it for a new face-lift technology that's a little less than enviromentally sound.
Wesley barges onto the bridge. “Hey guys! What if—”
“Get that kid off the bridge,” says Captain Jean-Luc “I'm a Shakespearian actor, not a ham” Picard, as he does his patented Picard Maneuver.
“But guys,” says Wesley, as he flips Worf onto his back and steps on his throat to prevent being forcibly removed from the bridge, “what if we were to reverse the tachyon field and reroute it through the shield generators to generate an anti-inertial damping field while simultaneously firing a photon torpedo loaded with an anti-matter implosion detonation device?”
“Why is that kid still—”
“And because we've generated an anti-inertial damping field
outside the ship,” says Chief Engineer Geordi “Kunta
‘Reading Rainbow’ Kinte” La Forge, interrupting the
French British French British
(aw, the heck with it) stuffy shirt Captain.
“The force generated by the anti-matter implosion detonation device will push us,” says Lt. “Fully functional, if you know what I mean, and I think you do, wink wink nudge nudge say no more say no more” Data, as he's rapidly calculating something on the computer console in front of him, “into Warp two point seven eight.”
“Which is fast enough to get us out of range of the imminent implosion
of a nearby nebula!” says Wesley, still holding down a struggling Worf.
Captain stuffy shirt Picard looks momentarily confused as he
quickly glances at all three expository spewing characters.
“And if we can place the photon torpedo loaded with the anti-matter implosion detonation device 22,453 kilometers at zero one five mark seven two, then we'll end up right next to the Thalaxian ship.”
“Make is so,” says Picard. “And hurry—we only have one more commercial break before the end of the show.”
And so, in twenty minutes, Wesley, Data and Geordi have managed to execute this complicated technobabblish plan flawlessly. On the first attempt. Based only on a hunch. Of a teenaged geek.
Ah, if only real life were as simple.
But I got the heisenbug nailed.
Start | Wed Oct 17 23:48:32 2007 |
---|---|
End | Thu Oct 18 17:01:55 2007 |
Running time | 17h 13m 23s |
IPs | 117 |
Requests | 3381 |
Requests-Cu | 3 |
Tuples | 1031 |
Graylisted | 3240 |
Whitelisted | 18 |
Graylist-Expired | 2216 |
Whitelist-Expired | 0 |
So, what was the fix?
It was a comment from Mark about signal handlers under Unix that lead the way.
Now, on to your bug. I would say 99% it is timing. Furthermore if I were a betting man I would say your bug rests within a signal handler. The only thing you can ever do inside a signal handler is set the value of a
volatile sig_atomic_t
variable really. Nothing else is safe.
That's pretty much how I wrote the signal handlers:
static volatile sig_atomic_t mf_sigint; static volatile sig_atomic_t mf_sigquit; static volatile sig_atomic_t mf_sigterm; static volatile sig_atomic_t mf_sigpipe; static volatile sig_atomic_t mf_sigalrm; static volatile sig_atomic_t mf_sigusr1; static volatile sig_atomic_t mf_sigusr2; static volatile sig_atomic_t mf_sighup; /**********************************************/ void sighandler_sigs(int sig) { switch(sig) { case SIGINT : mf_sigint = 1; break; case SIGQUIT : mf_sigquit = 1; break; case SIGTERM : mf_sigterm = 1; break; case SIGPIPE : mf_sigpipe = 1; break; case SIGALRM : mf_sigalrm = 1; break; case SIGUSR1 : mf_sigusr1 = 1; break; case SIGUSR2 : mf_sigusr2 = 1; break; case SIGHUP : mf_sighup = 1; break; default: _exit(EXIT_FAILURE); /* because I'm stupid */ } } /*************************************************/ void check_signals(void) { if (mf_sigalrm) handle_sigalrm(); if (mf_sigint) handle_sigint(); if (mf_sigquit) handle_sigquit(); if (mf_sigterm) handle_sigterm(); if (mf_sigpipe) handle_sigpipe(); if (mf_sigusr1) handle_sigusr1(); if (mf_sigusr2) handle_sigusr2(); if (mf_sighup) handle_sighup(); }
check_signals()
is called during the main loop of the
program; it's only sighandlers_sigs()
that's called in a signal
context. The only two signals that get (or rather, got) special treatment
are SIGSEGV
and SIGCHLD
. The handler for
SIGSEGV
logs that it was called, unblocks any signals and re-executes itself.
The other signal is SIGCHLD
. SIGCHLD
is sent
when a child process ends. It's then up to the parent process to call
wait()
(or one of its variants) to obtain the return code of
the child process. If the parent doesn't do that, the child process becomes
a zombie. The
SIGCHLD
signal handler called waitpid()
(which is
supposedly “safe” to call from a signal handler) as well as some logging routines. What may be
happening (or was happening) is some very subtle race condition. The
logging routine does allocate some memory (temporarily). The problem comes
in if we're in the middle of allocating some memory when the program
receives a SIGCHLD
, at which point, we immediately enter the
signal handler, which attempts (indirectly, through the logging routines) to
allocate memory. But we were already in the memory allocation code
when it got interrupted, thus leaving things in an undefined (or
inconsistent) state.
I'll expound on what Mark said, and say that once you start dealing with signals, you're dealing with a multi-threaded application, and all the hurt that implies.
I rewrote the SIGCHLD
handler to work like the other
handlers (signal handler sets a flag, which is checked later from the main
execution loop) and well, 17 hours later it's still running, on the virtual
server, checkpointing itself every hour.
This only took, what? About a month to find?
But then again, I wasn't facing down the Thalaxians.