Monday, October 15, 2007
“I coulda had a V8!”
Why didn't I think of that sooner?
Anyway, I installed valgrind
on the virtual server and started up a test run to see if I
could locate the current bug.
Not two minutes in, and I got the following:
==27967== Use of uninitialised value of size 4 ==27967== at 0x80508DF: crc32 (crc32.c:83) ==27967== by 0x804A6C1: send_packet (main.c:566) ==27967== by 0x80496FC: mainloop (main.c:151) ==27967== by 0x80493A5: main (main.c:68) ==27967== ==27967== Syscall param socketcall.sendto(msg) points to uninitialised byte(s) ==27967== at 0x40EDFD1: sendto (in /lib/tls/libc-2.3.4.so) ==27967== by 0x80496FC: mainloop (main.c:151) ==27967== by 0x80493A5: main (main.c:68) ==27967== Address 0xBEEDB2EA is on thread 1's stack ==27967== ==27967== Source and destination overlap in memcpy(0x4251028, 0x4251028, 256) ==27967== at 0x401D236: memcpy (mac_replace_strmem.c:394) ==27967== by 0x804F00D: tuple_expire (tuple.c:231) ==27967== by 0x804BC3D: handle_sigalrm (signals.c:278) ==27967== by 0x804B6BF: check_signals (signals.c:50) ==27967== by 0x80493BD: mainloop (main.c:83) ==27967== by 0x80493A5: main (main.c:68)
I was mystified by that first warning, as the function in question isn't all that big:
CRC32 crc32(CRC32 crc,const void *data,size_t size) { const byte *p = data; while(size--) { crc = ((crc >> 8) & 0x00FFFFFF) ^ m_crc32_table[ (crc ^ *p++) & 0xFF ]; } return(crc); }
What could it possibly be complaining about? Only when
investigating the second warning did I realize what the problem was—while
I wasn't calculating the CRC over the entire structure, valgrind
didn't
know that. That explains the first two warnings. The third one deals with
this bit of code:
for (i = j = 0 ; i < g_poolnum ; i++) { /* other code */ if (difftime(Tao,g_pool[i].atime) < c_timeout_gray) { g_pool[j++] = g_pool[i]; continue; } /* rest of loop */ }
g_pool
is an array of structures, and I'm using the C89
feature of structure assignment to avoid having to write
memcpy(&g_pool[j++], &g_pool[i], sizeof(struct tuple))
. But
memcpy()
(which is what the compiler changes the code to
internally) has undefined semantics when the memory blocks overlap, and they
do when i == j
. This doesn't seem to cause a problem,
but who knows?
I also found out that valgrind
stops reporting errors after
100,000 are found:
==27967== More than 100000 total errors detected. I'm not reporting any more. ==27967== Final error counts will be inaccurate. Go fix your program! ==27967== Rerun with --error-limit=no to disable this cutoff. Note ==27967== that errors may occur in your program without prior warning from ==27967== Valgrind, because errors are no longer being displayed.
Most of these deal with the child process created to save the state. I
don't bother with cleaning up too much since upon calling
_exit()
all memory is reclaimed anyway, so I'm not terribly
concerned about it. But I did lose two hours of testing to this.
Update on Tuesday, October 1616, 2007, 6 hours, 20 minutes and 9 seconds after starting the current test run of the greylist daemon
Bad news—it's still running.
You might think this is good news, but it's not. It means that the debugging conditions have changed the conditions the code runs under, so something very subtle is happening.
I did, however, make one small change to the code, clearing up
two of the warnings mentioned above. I would hate to think that
making that one small change fixed the program. If it's still
running in the morning, then I'll run it again without
valgrind
and see if it works.
Sigh.