Thursday, April 08, 2004
It works, but mysteriously crashes after a day or so …
A program I'm trying to run (for a small side project) keeps crashing.
Well, “crashing” isn't the right term; it technically doesn't crash, but calls exit() when certain errors occur. The error in question happens with the following code:
    x = fcntl(fd, F_GETFL, &fl);
    if (x < 0)
    {
      syslog(LOG_ERR, "fcntl F_GETFL: FD %d: %s", fd, strerror(errno));
      exit(1);
    }
and the error in question is:
fcntl F_GETFL: FD -1: Bad file descriptor
It's in a function called set_nonblock(), which pretty much takes a file descriptor (a reference to an open file) as a parameter and makes two calls to fcntl(), and it's failing with an invalid file descriptor on the first call. So I check the code that calls set_nonblock(); there are only two locations where set_nonblock() is called, and in both cases the file descriptor is checked before the call to set_nonblock(), which means that the file descriptor is being clobbered between the initial test and the call.
Not good.
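For reference, a function like the one described (one fcntl() call to read the flags, a second to write them back with O_NONBLOCK set) might look roughly like the sketch below. This is not the actual set_nonblock() from the program; the name set_nonblock_sketch() and the return-a-status convention are assumptions made for illustration (the real one logs and calls exit()).

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical stand-in for set_nonblock(): returns 0 on success,
   -1 on failure with errno left as set by fcntl(). */
int set_nonblock_sketch(int fd)
{
  int flags;

  /* First call: fetch the current file status flags. F_GETFL takes no
     third argument; the flags come back as the return value. */
  flags = fcntl(fd, F_GETFL);
  if (flags < 0)
    return -1;                  /* EBADF if fd is -1 or not open */

  /* Second call: write the flags back with O_NONBLOCK added. */
  if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
    return -1;

  return 0;
}
```

Handing this a descriptor of -1 fails on the first fcntl() with EBADF, which is the "Bad file descriptor" text in the log line above.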
So I add more logging, and run again (mind you, this is over the course of several days).
I finally get a location:
stp.c:233: failed assertion newsock >= 0
Okay, check the code:
    int wait_for_connection(int s)
    {
      int                newsock;
      int                len;
      struct sockaddr_in peer;

      ddt(s > -1);
      len     = sizeof(struct sockaddr_in);
      newsock = accept(s, (struct sockaddr *) &peer, &len);
      /* dump_sockaddr (peer, len); */
      if (newsock < 0)
      {
        if (errno != EINTR)
          perror("accept");
      }

      get_hinfo_from_sockaddr(peer, len, client_hostname);
      ddt(newsock >= 0);        /* line 233 */
      set_nonblock(newsock);
      return (newsock);
    }
Line 233 is the ddt(newsock >= 0) check, and ddt() (which is a function I wrote) basically checks the condition and, if it's false, logs it (via syslog()) and exits the program. And I see the error. It's subtle, but it's there. The fragment:
    newsock = accept(s, (struct sockaddr *) &peer, &len);
    if (newsock < 0)
    {
      if (errno != EINTR)
        perror("accept");
    }
is the culprit.
Under Unix, a system call (like accept()) can be interrupted, and if so, the call fails with an error code of EINTR. Why could a system call be interrupted? Well, say a program creates a child process (which this one does), and that child does its job and exits; then the parent process (which created the child process) is “interrupted” with a message (the SIGCHLD signal): “your child process has finished.”
Normally, if a system call is interrupted, you want to try the system call again, only this code doesn't do that! It carries on with newsock still negative, and the assertion on line 233 fires. (It looks like the author intended to call accept() again but forgot to write that code.)
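To see the interruption happen in isolation, here is a small sketch that uses a blocking read() on a pipe in place of accept(). Everything in it (the names read_with_retry_demo(), on_sigchld() and wake_fd, the 200 ms delay) is made up for the illustration and is not from the program above. The parent installs a SIGCHLD handler without SA_RESTART, forks a child that exits shortly afterwards, and blocks in read(). The handler writes one byte into the pipe, so the retried read() has something to return.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static int wake_fd = -1;        /* write end of the pipe, used by the handler */

static void on_sigchld(int sig)
{
  char c     = 'x';
  int  saved = errno;           /* write() may change errno; restore it below */

  (void)sig;
  if (write(wake_fd, &c, 1) < 0) { /* nothing useful to do in a handler */ }
  errno = saved;
}

/* Returns how many times read() failed with EINTR before it succeeded,
   or -1 if something else went wrong. */
int read_with_retry_demo(void)
{
  struct sigaction sa, old_sa;
  struct timespec  delay = { 0, 200000000L };   /* 200 ms */
  int              p[2];
  int              eintr_count = 0;
  char             c;
  ssize_t          n;
  pid_t            pid;

  if (pipe(p) < 0)
    return -1;
  wake_fd = p[1];

  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = on_sigchld;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;              /* no SA_RESTART: interrupted calls fail with EINTR */
  if (sigaction(SIGCHLD, &sa, &old_sa) < 0)
  {
    close(p[0]);
    close(p[1]);
    return -1;
  }

  pid = fork();
  if (pid == 0)
  {
    nanosleep(&delay, NULL);    /* give the parent time to block in read() */
    _exit(0);                   /* the exit delivers SIGCHLD to the parent */
  }

  n = -1;
  if (pid > 0)
  {
    for (;;)
    {
      n = read(p[0], &c, 1);    /* blocks until the handler writes a byte */
      if (n >= 0 || errno != EINTR)
        break;
      eintr_count++;            /* interrupted: go around and try again */
    }
    waitpid(pid, NULL, 0);
  }

  sigaction(SIGCHLD, &old_sa, NULL);
  close(p[0]);
  close(p[1]);
  return (n == 1) ? eintr_count : -1;
}
```

On a typical run this returns 1: the first read() is interrupted when the child exits, and the second succeeds. If the child manages to exit before the parent reaches read(), the byte is already waiting and the result is 0. Either way, the loop is the same try-again pattern that accept() needs.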
Patch the code:
    int wait_for_connection(int s)
    {
      int                newsock;
      int                len;
      struct sockaddr_in peer;

      ddt(s > -1);

      /* keep calling accept() until it succeeds or fails for real */
      do
      {
        len     = sizeof(struct sockaddr_in);
        newsock = accept(s,(struct sockaddr *) &peer,&len);
        if (newsock < 0)
        {
          if (errno != EINTR)   /* EINTR just means "try again" */
          {
            perror("accept");
            return(-1);
          }
        }
      } while (newsock < 0);

      get_hinfo_from_sockaddr(peer,sizeof(struct sockaddr_in),client_hostname);
      set_nonblock(newsock);
      return(newsock);
    }
and try again. Hopefully, this (and some other minor cleanup) will fix the problem.