The Boston Diaries

Tuesday, October 16, 2007

Heisenbugs … they're everywhere!

So I ran the greylist daemon for over eight hours under valgrind without it once hanging. I then restarted the server, this time running alone.

A few hours later, it hung.

And just for the record, when I normally attach to the running processing using gdb, it's where I would expect it to be:

(gdb) where
#0  0x008067a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x003c2dd1 in recvfrom () from /lib/tls/libc.so.6
#2  0x08049411 in mainloop (sock=0) at src/main.c:88
#3  0x080493a6 in main (argc=1, argv=0xbfe5c084) at src/main.c:68

but when the process hangs:

(gdb) where
#0  0x00dff7a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00955e5e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#2  0x008e7e4f in _L_mutex_lock_10231 () from /lib/tls/libc.so.6
#3  0x00000000 in ?? ()

I have no clue as to what's going on (and neither does gdb apparently). Running the program under valgrind obviously changes the environment, enough to mask the condition that causes the bug in the first place.

This is proving to be a very difficult bug to find.

Help! My compiler is leaking!

I fixed the problems that valgrind was complaining about in the greylist daemon (and I had hoped such problems would fix the problem, but alas, it doesn't seem to be the case, but I'm still testing), but the very nature of one of the complaints is interesting in and of itself.

A design pattern I've long used in code is the following:

/*--------------------------------------------
; the following is written to clarify 
; the issue I'm writing about.  This
; is not how I would normally write such 
; code.  It presents a simplified, yet
; realistic bit of code.
;
; And yes, this is the format I use for block
; comments in C.
;------------------------------------------*/

char   buffer[BUFSIZ];
size_t i;
size_t j;

for (i = j = 0 ; i < BUFSIZ ; i++)
{
  if (!isctrl(buffer[i]))
    buffer[j++] = buffer[i];
}

Basically, I'm going through a character array, removing certain elements (in this example, control characters) and doing so in place, to conserve memory consumption. Yes, there are instances where I'm reading and writing the same value to the same location, but in the scheme of things, it's not a bad thing. And, at least for this example, the overhead of trying to avoid unnecessary copying of data overwhelms the amount of time it takes to just do the copy.

So when I was writing the code to clean up the tuple array in the greylist daemon, I naturally wrote it as:

for (i = j = 0 ; i < g_poolnum ; i++)
{
  /* other code */

  if (difftime(Tao,g_pool[i].atime) < c_timeout_gray)
  {
    g_pool[j++] = g_pool[i];
    continue;
  }

  /* rest of loop */
}

It follows a successful pattern I've used plenty of times before. I saw nothing necessarily wrong with it, yet valgrind complained bitterly about this fragment of code. And I was a mildly surprised to see a call to memcpy() when I never explicitly called memcpy().

I just got bit with a leaky abstraction, and a rather insidious one at that.

Pre-ANSI C, I wouldn't have been able to write that code, since back then, C didn't have the concept of structure assignments, and thus, I would have had to explicitly call memcpy():

if (difftime(Tao,g_pool[i].atime) < c_timeout_gray)
{
  memcpy(&g_pool[j++],&g_pool[i],sizeof(struct tuple));
  continue;
}

But one of the advantages of working up the abstraction scale (so to speak) is that you can let the compiler take care of the grunt work for you. I mean, what if struct tuple was the size of an int? The overhead of calling memcpy() would swamp the actual act of copying the data. In fact, if struct tuple was a small multiple of an int in size, it might still not be worth the overhead of calling memcpy(). And the compiler is a better place to push such knowledge, since it can keep track of not only structure sizes, but the overhead of calling a function to copy memory and handle things accordingly.

So ANSI C allowed the compiler to handle structure assignment. And it can do a pretty fine job of it too. For instance, using a recent version of gcc with the compiler options -O3 -fomit-frame-pointer (some heavy duty optimization), it compiled the following bit of code:

struct foo
{
  int f;
};

struct foo A[256];
size_t     i;
size_t     j;
  
for (i = j = 0 ; i < 256 ; i++)
{
  if (A[i].f == 56)
    A[j++] = A[i];
}

to the following bit of x86 code (and frankly, I was surprised at the quality—and it's been translated from the alien AT&T syntax gcc uses to proper Intel syntax):

	xor	edx,edx
	xor	eax,eax
	jmps	.L6

.L4:	inc	eax
	cmp	eax,255
	ja	.L12   

.L6:	cmp	[A + eax*4],56
	jne	.L4
	inc	eax   

	mov	[A + edx*4],56	; A[j].f = 56
	inc	edx		; j++
	cmp	eax,255
	jbe	.L6 
                
.L12:

It didn't even copy the data since the compiler figured out it didn't need to. Even if we increased the size of the structure a bit:

struct foo
{
  size_t f;
  size_t f2;
  char   f3a;
  char   f3b;
  char   f3c;
  char   f3d;
  short  f4a;
  short  f4b;
};

struct foo A[256];
size_t     i;
size_t     j;
  
for (i = j = 0 ; i < 256 ; i++)
{
  if (A[i].f == 56)
    A[j++] = A[i];
}

gcc still has yet to call memcpy():

	push	ebx
	xor	edx,edx
	xor	ebx,ebx
	mov	ecx,255
	jmps	.L18   

.L16:	add	edx,16
	dec	ecx   
	js	.L23  

.L18:	cmp	[A + edx],56
	jne	.L16
     
	mov	[A + ebx],56		; A[j].f   = 56
	mov	eax,[A + 4 + edx]	; A[j].f2 = A[i].f2
	mov	[A + 4 + ebx],eax
	mov	eax,[A + 8 + edx]	; A[j].f3(a-d) = A[i].f3(a-d)
	mov	[A + 8 + ebx],eax
	mov	eax,[A + 12 + edx]	; A[j].f4(a,b) = A[i].f4(a,b)
	mov	[A + 12 + ebx],eax

	add	edx,16
	add	ebx,16
	dec	ecx   
	jns	.L18  

.L23:	pop	ebx

It just copies the 16 bytes of the structure as one assignment (because of constant propagation) and three four-byte moves. It's not until the structure gets significantly large:

struct foo
{
  size_t f;
  char   b1[124];
  size_t s2;
  char   b2[124];
};

struct foo A[256];
size_t     i;
size_t     j;

for (i = j = 0 ; i < 256 ; i++)
{
  if (A[i].f == 56)
    A[j++] = A[i];
}

that the compiler switches to calling memcpy():

	push	edi
	push	esi
	push	ebx
	xor	esi,esi
	mov	edi,offset A
	mov	ebx,255
	jmps	.L40

.L38:	add	esi,256
	dec	ebx
	js	.L45

.L40:	cmp	[A + esi],56
	jne	.L38
	push	ecx	; ???
	push	256
	lea	eax,[A + esi]
	mov	edx,edi
	push	eax
	push	edx
	call	memcpy
	add	edi,256
	add	esp,16
	add	esi,256
	dec	ebx
	jns	.L40

.L45:	pop	ebx
	pop	esi
	pop	edi

(The only anomaly in the code is the push ecx. The register isn't initialized, and memcpy() only takes three parameters, not four. My guess is that this “additional parameter” exists to keep the stack aligned along a cache line and it's this type of detail that compilers exist to keep track of. It's also interesting to note that when compiling for an 80486, gcc never bothers to call memcpy() and instead inlines the operation using REP MOVS.)

By now, you're probably asking yourself, “So? Where's the leaky abstraction?”

Well, the leaky abstraction comes in the call to memcpy(), which happened inherently.

Synopsis
#include <string.h>
void *memcpy(void *s1,void *s2,size_t n);
Description

The memcpy function copies n characters from the object pointed to by s2 into the object pointed to by s1. If copying takes place between objects that overlap, the behavior is undefined.

Returns

The memcpy function returns the value of s1.

The C Standard (emphasis added)

Well, how about that? I'm inadvertantly invoking undefined behavior in the greylist daemon! (actually, I was hoping this was the cause of the problem, but I don't think it is—sigh) Technically speaking, I don't see how there could be a problem when I'm copying a block of memory over itself, except for consuming some extra CPU time. But I would think that a compiler could see I was modifying an array of items, and either include a check to skip the operation if they overlapped completely, or switch to using memmove() (which allows the objects to overlap).

But such is the nature of working with high abstractions. When they leak, they leak!

I suppose I could have realized I was ultimately calling memcpy(), and that memcpy() has undefined semantics when the source and destination overlap, but I also expected the compiler to inline the code to copy the structure (much like gcc did when compiling on an 80486), not actually call memcpy()!

Sheesh.

Tuesday, October 16, 2007

Heisenbugs … they're everywhere!

Help! My compiler is leaking!

Synopsis

Description

Returns

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

Obligatory AI Disclaimer