Friday, March 09, 2007
Cheating your way to a robust daemon
Programs written in Erlang have minimal (if any) error checking. The designers of Erlang intend for buggy code to crash early and hard. No defensive programming for these guys. That seems odd, given that Erlang is used primarily in phone switches, which have ridiculous uptime and reliability requirements, but it really isn't.
You see, most Erlang programs are watched over by an even simpler program that just waits for a program to crash, logs the incident, and restarts it automatically.
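To make the watcher idea concrete, here is a minimal sketch of what such a program might look like in C on a Unix system. This is not how Erlang does it; it is only an illustration of the "wait for a crash, log it, restart" loop. The names run_once() and supervise(), and the restart limit, are my own assumptions for the example.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run the program named by argv[0] as a child
 * process and wait for it to finish.  Returns the signal number if
 * a signal killed the child, 0 if it exited on its own, and -1 if
 * fork() or waitpid() failed. */
int run_once(char *const argv[])
{
  int   status;
  pid_t pid = fork();

  if (pid < 0)
    return -1;

  if (pid == 0)
  {
    execvp(argv[0], argv);
    _exit(127);                 /* only reached if execvp() failed */
  }

  if (waitpid(pid, &status, 0) < 0)
    return -1;

  return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Hypothetical watcher: keep restarting the program for as long as
 * it dies from a signal, giving up after max_restarts restarts.
 * Returns the number of restarts performed. */
int supervise(char *const argv[], int max_restarts)
{
  int restarts = 0;
  int sig;

  while (((sig = run_once(argv)) > 0) && (restarts < max_restarts))
  {
    fprintf(stderr, "child killed by signal %d, restarting\n", sig);
    restarts++;
  }

  return restarts;
}
```

A caller would build an argv-style array (ending in NULL) for the daemon and pass it to supervise() along with whatever restart limit seems sane. A real watcher would probably log through syslog() rather than stderr.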
It's a pretty neat concept, and for the daemon I'm writing, I've done just that.
Well, I don't actually have a separate process watching, because one isn't needed. No, what I've done is catch a few of the signals that end up killing the program (like SIGSEGV) and, instead of terminating the program, restart it.
#include <limits.h>
#include <signal.h>
#include <stdlib.h>
#include <syslog.h>
#include <unistd.h>

extern char **environ;

char **global_argv;

void crash_recovery(int);

int main(int argc,char *argv[])
{
  /*----------------------------------
  ; save our command line.  I do this
  ; so *if* we re-exec ourselves, we
  ; re-exec ourselves as we were initially
  ; exec'ed.
  ;----------------------------------*/

  global_argv = argv;

  /*--------------------------------------
  ; Wrote my own signal() function that ensures
  ; reliable signal semantics (via W. Richard Stevens'
  ; Advanced Programming in the UNIX Environment).
  ;
  ; Here, I'm capturing those signals that
  ; may be the result of bad programming and
  ; end up generating a core file, and catching
  ; those to restart the program.
  ;----------------------------------------*/

  set_signal(SIGSEGV,crash_recovery);
  set_signal(SIGBUS, crash_recovery);
  set_signal(SIGFPE, crash_recovery);

  /* ... */

  return(EXIT_SUCCESS);
}

void crash_recovery(int sig)
{
  sigset_t sigset;
  int      i;

  syslog(LOG_ERR,"received sig %d---restarting",sig);

  /*--------------------------------------------
  ; The signal we're handling may very well be
  ; blocked, which will persist across the execve()
  ; call.  This results in the first crash being
  ; caught, but not subsequent crashes.  By unblocking
  ; all signals, we assure we can catch further
  ; crashes.
  ;----------------------------------------------*/

  sigfillset(&sigset);
  sigprocmask(SIG_UNBLOCK,&sigset,NULL);

  /*---------------------------------------
  ; close all the files, but keep the standard
  ; STDIN, STDOUT and STDERR open.  Sure, we
  ; lose any connections, but we'll lose them
  ; anyway if the program were to go away.
  ; Note the loop stops at the first descriptor
  ; that fails to close.
  ;----------------------------------------*/

  for (i = 3 ; (i < OPEN_MAX) && (close(i) == 0) ; i++)
    ;

  /*---------------------------------------
  ; restart myself.
  ;------------------------------------*/

  execve(global_argv[0],global_argv,environ);

  /*----------------------------------
  ; well ... if we get here, we're screwed
  ; so might as well give up.
  ;------------------------------------*/

  _exit(EXIT_FAILURE);
}
This could be a very bad idea, but I'll see how well it works out.