The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Thursday, October 18, 2007

Everything always works the first time on Star Trek

They always made it look so easy on Star Trek: The Next Generation.

The Enterprise (NCC-1701-D) is facing certain doom by the imminent implosion of a nearby nebula because the Thalaxians stole the warp core because they need it for a new face-lift technology that's a little less than enviromentally sound.

Wesley barges onto the bridge. “Hey guys! What if—”

“Get that kid off the bridge,” says Captain Jean-Luc “I'm a Shakespearian actor, not a ham” Picard, as he does his patented Picard Maneuver.

“But guys,” says Wesley, as he flips Worf onto his back and steps on his throat to prevent being forcibly removed from the bridge, “what if we were to reverse the tachyon field and reroute it through the shield generators to generate an anti-inertial damping field while simultaneously firing a photon torpedo loaded with an anti-matter implosion detonation device?”

“Why is that kid still—”

“And because we've generated an anti-inertial damping field outside the ship,” says Chief Engineer Geordi “Kunta ‘Reading Rainbow’ Kinte” La Forge, interrupting the French British French British (aw, the heck with it) stuffy shirt Captain.

“The force generated by the anti-matter implosion detonation device will push us,” says Lt. “Fully functional, if you know what I mean, and I think you do, wink wink nudge nudge say no more say no more” Data, as he's rapidly calculating something on the computer console in front of him, “into Warp two point seven eight.”

“Which is fast enough to get us out of range of the imminent implosion of a nearby nebula!” says Wesley, still holding down a struggling Worf. Captain stuffy shirt Picard looks momentarily confused as he quickly glances at all three expository spewing characters.

“And if we can place the photon torpedo loaded with the anti-matter implosion detonation device 22,453 kilometers at zero one five mark seven two, then we'll end up right next to the Thalaxian ship.”

“Make is so,” says Picard. “And hurry—we only have one more commercial break before the end of the show.”

And so, in twenty minutes, Wesley, Data and Geordi have managed to execute this complicated technobabblish plan flawlessly. On the first attempt. Based only on a hunch. Of a teenaged geek.

Ah, if only real life were as simple.

But I got the heisenbug nailed.

Latest statistics from the greylist daemon
Start Wed Oct 17 23:48:32 2007
End Thu Oct 18 17:01:55 2007
Running time 17h 13m 23s
IPs 117
Requests 3381
Requests-Cu 3
Tuples 1031
Graylisted 3240
Whitelisted 18
Graylist-Expired 2216
Whitelist-Expired 0

So, what was the fix?

It was a comment from Mark about signal handlers under Unix that lead the way.

Now, on to your bug. I would say 99% it is timing. Furthermore if I were a betting man I would say your bug rests within a signal handler. The only thing you can ever do inside a signal handler is set the value of a volatile sig_atomic_t variable really. Nothing else is safe.

That's pretty much how I wrote the signal handlers:

static volatile sig_atomic_t mf_sigint;
static volatile sig_atomic_t mf_sigquit;
static volatile sig_atomic_t mf_sigterm;
static volatile sig_atomic_t mf_sigpipe;
static volatile sig_atomic_t mf_sigalrm;
static volatile sig_atomic_t mf_sigusr1;
static volatile sig_atomic_t mf_sigusr2;
static volatile sig_atomic_t mf_sighup;

/**********************************************/

void sighandler_sigs(int sig)
{
  switch(sig)
  {
    case SIGINT   : mf_sigint   = 1; break;
    case SIGQUIT  : mf_sigquit  = 1; break;
    case SIGTERM  : mf_sigterm  = 1; break;
    case SIGPIPE  : mf_sigpipe  = 1; break;
    case SIGALRM  : mf_sigalrm  = 1; break;
    case SIGUSR1  : mf_sigusr1  = 1; break;
    case SIGUSR2  : mf_sigusr2  = 1; break;
    case SIGHUP   : mf_sighup   = 1; break;
    default: _exit(EXIT_FAILURE); /* because I'm stupid */
  }
}

/*************************************************/

void check_signals(void)
{
  if (mf_sigalrm)  handle_sigalrm();
  if (mf_sigint)   handle_sigint();
  if (mf_sigquit)  handle_sigquit();
  if (mf_sigterm)  handle_sigterm();
  if (mf_sigpipe)  handle_sigpipe();
  if (mf_sigusr1)  handle_sigusr1();
  if (mf_sigusr2)  handle_sigusr2();
  if (mf_sighup)   handle_sighup();
}

check_signals() is called during the main loop of the program; it's only sighandlers_sigs() that's called in a signal context. The only two signals that get (or rather, got) special treatment are SIGSEGV and SIGCHLD. The handler for SIGSEGV logs that it was called, unblocks any signals and re-executes itself.

The other signal is SIGCHLD. SIGCHLD is sent when a child process ends. It's then up to the parent process to call wait() (or one of its variants) to obtain the return code of the child process. If the parent doesn't do that, the child process becomes a zombie. The SIGCHLD signal handler called waitpid() (which is supposedly “safe” to call from a signal handler) as well as some logging routines. What may be happening (or was happening) is some very subtle race condition. The logging routine does allocate some memory (temporarily). The problem comes in if we're in the middle of allocating some memory when the program receives a SIGCHLD, at which point, we immediately enter the signal handler, which attempts (indirectly, through the logging routines) to allocate memory. But we were already in the memory allocation code when it got interrupted, thus leaving things in an undefined (or inconsistent) state.

I'll expound on what Mark said, and say that once you start dealing with signals, you're dealing with a multi-threaded application, and all the hurt that implies.

I rewrote the SIGCHLD handler to work like the other handlers (signal handler sets a flag, which is checked later from the main execution loop) and well, 17 hours later it's still running, on the virtual server, checkpointing itself every hour.

This only took, what? About a month to find?

But then again, I wasn't facing down the Thalaxians.

Obligatory Picture

An abstract representation of where you're coming from]

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.