Tuesday, October 16, 2007
Heisenbugs … they're everywhere!
So I ran the greylist daemon for over eight hours under valgrind without it once hanging. I then restarted the server, this time running alone.
A few hours later, it hung.
And just for the record, when I normally attach to the running processing
using gdb
, it's where I would expect it to be:
(gdb) where #0 0x008067a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x003c2dd1 in recvfrom () from /lib/tls/libc.so.6 #2 0x08049411 in mainloop (sock=0) at src/main.c:88 #3 0x080493a6 in main (argc=1, argv=0xbfe5c084) at src/main.c:68
but when the process hangs:
(gdb) where #0 0x00dff7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2 #1 0x00955e5e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6 #2 0x008e7e4f in _L_mutex_lock_10231 () from /lib/tls/libc.so.6 #3 0x00000000 in ?? ()
I have no clue as to what's going on (and neither does gdb
apparently). Running the program under valgrind
obviously
changes the environment, enough to mask the condition that causes the bug in
the first place.
This is proving to be a very difficult bug to find.
Help! My compiler is leaking!
I fixed the problems that valgrind
was complaining about in
the greylist daemon (and I had hoped such problems would fix the problem,
but alas, it doesn't seem to be the case, but I'm still testing), but the
very nature of one of the complaints is interesting in and of itself.
A design pattern I've long used in code is the following:
/*-------------------------------------------- ; the following is written to clarify ; the issue I'm writing about. This ; is not how I would normally write such ; code. It presents a simplified, yet ; realistic bit of code. ; ; And yes, this is the format I use for block ; comments in C. ;------------------------------------------*/ char buffer[BUFSIZ]; size_t i; size_t j; for (i = j = 0 ; i < BUFSIZ ; i++) { if (!isctrl(buffer[i])) buffer[j++] = buffer[i]; }
Basically, I'm going through a character array, removing certain elements (in this example, control characters) and doing so in place, to conserve memory consumption. Yes, there are instances where I'm reading and writing the same value to the same location, but in the scheme of things, it's not a bad thing. And, at least for this example, the overhead of trying to avoid unnecessary copying of data overwhelms the amount of time it takes to just do the copy.
So when I was writing the code to clean up the tuple array in the greylist daemon, I naturally wrote it as:
for (i = j = 0 ; i < g_poolnum ; i++) { /* other code */ if (difftime(Tao,g_pool[i].atime) < c_timeout_gray) { g_pool[j++] = g_pool[i]; continue; } /* rest of loop */ }
It follows a successful pattern I've used plenty of times before. I saw
nothing necessarily wrong with it, yet valgrind
complained bitterly about this fragment of code. And I was a
mildly surprised to see a call to memcpy()
when I never
explicitly called memcpy()
.
I just got bit with a leaky abstraction, and a rather insidious one at that.
Pre-ANSI C, I wouldn't have been able to write that code, since back
then, C didn't have the concept of structure assignments, and thus, I would
have had to explicitly call memcpy()
:
if (difftime(Tao,g_pool[i].atime) < c_timeout_gray) { memcpy(&g_pool[j++],&g_pool[i],sizeof(struct tuple)); continue; }
But one of the advantages of working up the abstraction scale (so to
speak) is that you can let the compiler take care of the grunt work for you.
I mean, what if struct tuple
was the size of an
int
? The overhead of calling memcpy()
would swamp
the actual act of copying the data. In fact, if struct tuple
was a small multiple of an int
in size, it might still not be
worth the overhead of calling memcpy()
. And the compiler is a
better place to push such knowledge, since it can keep track of not only
structure sizes, but the overhead of calling a function to copy memory and
handle things accordingly.
So ANSI C allowed the compiler to handle structure assignment. And it
can do a pretty fine job of it too. For instance, using a recent version of
gcc
with the
compiler options -O3 -fomit-frame-pointer
(some heavy duty
optimization), it compiled the following bit of code:
struct foo { int f; }; struct foo A[256]; size_t i; size_t j; for (i = j = 0 ; i < 256 ; i++) { if (A[i].f == 56) A[j++] = A[i]; }
to the following bit of x86 code (and frankly, I was surprised at the
quality—and it's been translated from the alien AT&T syntax
gcc
uses to proper Intel syntax):
xor edx,edx xor eax,eax jmps .L6 .L4: inc eax cmp eax,255 ja .L12 .L6: cmp [A + eax*4],56 jne .L4 inc eax mov [A + edx*4],56 ; A[j].f = 56 inc edx ; j++ cmp eax,255 jbe .L6 .L12:
It didn't even copy the data since the compiler figured out it didn't need to. Even if we increased the size of the structure a bit:
struct foo { size_t f; size_t f2; char f3a; char f3b; char f3c; char f3d; short f4a; short f4b; }; struct foo A[256]; size_t i; size_t j; for (i = j = 0 ; i < 256 ; i++) { if (A[i].f == 56) A[j++] = A[i]; }
gcc
still has yet to call memcpy()
:
push ebx xor edx,edx xor ebx,ebx mov ecx,255 jmps .L18 .L16: add edx,16 dec ecx js .L23 .L18: cmp [A + edx],56 jne .L16 mov [A + ebx],56 ; A[j].f = 56 mov eax,[A + 4 + edx] ; A[j].f2 = A[i].f2 mov [A + 4 + ebx],eax mov eax,[A + 8 + edx] ; A[j].f3(a-d) = A[i].f3(a-d) mov [A + 8 + ebx],eax mov eax,[A + 12 + edx] ; A[j].f4(a,b) = A[i].f4(a,b) mov [A + 12 + ebx],eax add edx,16 add ebx,16 dec ecx jns .L18 .L23: pop ebx
It just copies the 16 bytes of the structure as one assignment (because of constant propagation) and three four-byte moves. It's not until the structure gets significantly large:
struct foo { size_t f; char b1[124]; size_t s2; char b2[124]; }; struct foo A[256]; size_t i; size_t j; for (i = j = 0 ; i < 256 ; i++) { if (A[i].f == 56) A[j++] = A[i]; }
that the compiler switches to calling memcpy()
:
push edi push esi push ebx xor esi,esi mov edi,offset A mov ebx,255 jmps .L40 .L38: add esi,256 dec ebx js .L45 .L40: cmp [A + esi],56 jne .L38 push ecx ; ??? push 256 lea eax,[A + esi] mov edx,edi push eax push edx call memcpy add edi,256 add esp,16 add esi,256 dec ebx jns .L40 .L45: pop ebx pop esi pop edi
(The only anomaly in the code is the push ecx
. The register
isn't initialized, and memcpy()
only takes three parameters,
not four. My guess is that this “additional parameter” exists to keep the
stack aligned along a cache line and it's this type of detail that compilers
exist to keep track of. It's also interesting to note that when compiling
for an 80486, gcc
never bothers to call memcpy()
and instead inlines the operation using REP MOVS
.)
By now, you're probably asking yourself, “So? Where's the leaky abstraction?”
Well, the leaky abstraction comes in the call to memcpy()
,
which happened inherently.
Synopsis
#include <string.h> void *memcpy(void *s1,void *s2,size_t n);Description
The
memcpy
function copiesn
characters from the object pointed to bys2
into the object pointed to bys1
. If copying takes place between objects that overlap, the behavior is undefined.Returns
The
memcpy
function returns the value ofs1
.The C Standard (emphasis added)
Well, how about that? I'm inadvertantly invoking undefined behavior in
the greylist daemon! (actually, I was hoping this was the cause of the
problem, but I don't think it is—sigh) Technically speaking, I don't see
how there could be a problem when I'm copying a block of memory over itself,
except for consuming some extra CPU time. But I would think that a compiler could see I was
modifying an array of items, and either include a check to skip the
operation if they overlapped completely, or switch to using
memmove()
(which allows the objects to overlap).
But such is the nature of working with high abstractions. When they leak, they leak!
I suppose I could have realized I was ultimately calling
memcpy()
, and that memcpy()
has undefined
semantics when the source and destination overlap, but I also expected the
compiler to inline the code to copy the structure (much like
gcc
did when compiling on an 80486), not actually call
memcpy()
!
Sheesh.