The Boston Diaries

Thursday, October 18, 2007

Everything always works the first time on Star Trek

They always made it look so easy on Star Trek: The Next Generation.

The Enterprise (NCC-1701-D) is facing certain doom by the imminent implosion of a nearby nebula because the Thalaxians stole the warp core because they need it for a new face-lift technology that's a little less than enviromentally sound.

Wesley barges onto the bridge. “Hey guys! What if—”

“Get that kid off the bridge,” says Captain Jean-Luc “I'm a Shakespearian actor, not a ham” Picard, as he does his patented Picard Maneuver.

“But guys,” says Wesley, as he flips Worf onto his back and steps on his throat to prevent being forcibly removed from the bridge, “what if we were to reverse the tachyon field and reroute it through the shield generators to generate an anti-inertial damping field while simultaneously firing a photon torpedo loaded with an anti-matter implosion detonation device?”

“Why is that kid still—”

“And because we've generated an anti-inertial damping field outside the ship,” says Chief Engineer Geordi “Kunta ‘Reading Rainbow’ Kinte” La Forge, interrupting the ~~French~~ ~~British~~ ~~French~~ ~~British~~ (aw, the heck with it) stuffy shirt Captain.

“The force generated by the anti-matter implosion detonation device will push us,” says Lt. “Fully functional, if you know what I mean, and I think you do, wink wink nudge nudge say no more say no more” Data, as he's rapidly calculating something on the computer console in front of him, “into Warp two point seven eight.”

“Which is fast enough to get us out of range of the imminent implosion of a nearby nebula!” says Wesley, still holding down a struggling Worf. Captain ~~stuffy shirt~~ Picard looks momentarily confused as he quickly glances at all three expository spewing characters.

“And if we can place the photon torpedo loaded with the anti-matter implosion detonation device 22,453 kilometers at zero one five mark seven two, then we'll end up right next to the Thalaxian ship.”

“Make is so,” says Picard. “And hurry—we only have one more commercial break before the end of the show.”

And so, in twenty minutes, Wesley, Data and Geordi have managed to execute this complicated technobabblish plan flawlessly. On the first attempt. Based only on a hunch. Of a teenaged geek.

Ah, if only real life were as simple.

But I got the heisenbug nailed.

Latest statistics from the greylist daemon
Start	Wed Oct 17 23:48:32 2007
End	Thu Oct 18 17:01:55 2007
Running time	17h 13m 23s
IPs	117
Requests	3381
Requests-Cu	3
Tuples	1031
Graylisted	3240
Whitelisted	18
Graylist-Expired	2216
Whitelist-Expired	0

So, what was the fix?

It was a comment from Mark about signal handlers under Unix that lead the way.

Now, on to your bug. I would say 99% it is timing. Furthermore if I were a betting man I would say your bug rests within a signal handler. The only thing you can ever do inside a signal handler is set the value of a volatile sig_atomic_t variable really. Nothing else is safe.

That's pretty much how I wrote the signal handlers:

static volatile sig_atomic_t mf_sigint;
static volatile sig_atomic_t mf_sigquit;
static volatile sig_atomic_t mf_sigterm;
static volatile sig_atomic_t mf_sigpipe;
static volatile sig_atomic_t mf_sigalrm;
static volatile sig_atomic_t mf_sigusr1;
static volatile sig_atomic_t mf_sigusr2;
static volatile sig_atomic_t mf_sighup;

/**********************************************/

void sighandler_sigs(int sig)
{
  switch(sig)
  {
    case SIGINT   : mf_sigint   = 1; break;
    case SIGQUIT  : mf_sigquit  = 1; break;
    case SIGTERM  : mf_sigterm  = 1; break;
    case SIGPIPE  : mf_sigpipe  = 1; break;
    case SIGALRM  : mf_sigalrm  = 1; break;
    case SIGUSR1  : mf_sigusr1  = 1; break;
    case SIGUSR2  : mf_sigusr2  = 1; break;
    case SIGHUP   : mf_sighup   = 1; break;
    default: _exit(EXIT_FAILURE); /* because I'm stupid */
  }
}

/*************************************************/

void check_signals(void)
{
  if (mf_sigalrm)  handle_sigalrm();
  if (mf_sigint)   handle_sigint();
  if (mf_sigquit)  handle_sigquit();
  if (mf_sigterm)  handle_sigterm();
  if (mf_sigpipe)  handle_sigpipe();
  if (mf_sigusr1)  handle_sigusr1();
  if (mf_sigusr2)  handle_sigusr2();
  if (mf_sighup)   handle_sighup();
}

check_signals() is called during the main loop of the program; it's only sighandlers_sigs() that's called in a signal context. The only two signals that get (or rather, got) special treatment are SIGSEGV and SIGCHLD. The handler for SIGSEGV logs that it was called, unblocks any signals and re-executes itself.

The other signal is SIGCHLD. SIGCHLD is sent when a child process ends. It's then up to the parent process to call wait() (or one of its variants) to obtain the return code of the child process. If the parent doesn't do that, the child process becomes a zombie. The SIGCHLD signal handler called waitpid() (which is supposedly “safe” to call from a signal handler) as well as some logging routines. What may be happening (or was happening) is some very subtle race condition. The logging routine does allocate some memory (temporarily). The problem comes in if we're in the middle of allocating some memory when the program receives a SIGCHLD, at which point, we immediately enter the signal handler, which attempts (indirectly, through the logging routines) to allocate memory. But we were already in the memory allocation code when it got interrupted, thus leaving things in an undefined (or inconsistent) state.

I'll expound on what Mark said, and say that once you start dealing with signals, you're dealing with a multi-threaded application, and all the hurt that implies.

I rewrote the SIGCHLD handler to work like the other handlers (signal handler sets a flag, which is checked later from the main execution loop) and well, 17 hours later it's still running, on the virtual server, checkpointing itself every hour.

This only took, what? About a month to find?

But then again, I wasn't facing down the Thalaxians.

Thursday, October 18, 2007

Everything always works the first time on Star Trek

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer