Obsessing over stupid benchmarks (and you thought my obsessing over a greylist daemon was bad … )

Friday, November 30, 2007

I'm still obsessing over stupid benchmarks. This time, I was curious about the overhead of calling functions and system calls, to simulate a typical locking scenario, like:

int gv;

void t(void)
{
  while(gv < 1000000000)
  {
    lock();
    gv++;
    unlock();
  }
}

So, I wrote some test code:

int gv

void null(void)
{
}

void t1(void)
{
  while(gc < 1000000000)
  {
    null();
    gv++;
    null();
  }
}

void t2(void)
{
  pid_t me;

  me = getpid();
  while(gc < 1000000000)
  {
    kill(me,0);
    gv++;
    kill(me,0);
  }
}

The first test is pretty simple—it just tests the function call overhead, and that particular piece of code takes 9 seconds (give or take a few tenths of a second) to run. But let me explain about the second test, and therein lies the interesting bit of this tale.

I wanted to test the system call overhead, and thus, I needed a system call that did very little. I found this bit in the kill() manpage:

If sig is 0, then no signal is sent, but error checking is still performed.

That's about as close to a null system call as I could find. No real bad side effects, and I don't particularly care about the return code (since this is just measuring overhead). So I ran that bit of code. Fifteen minutes later it finished.

Fifteen minutes.

Either system calls are slow, or that particular system call does a vast amount of nothing. Curious, I delved into that part of the kernel source (I'm running all this under Linux, by the way) when I came across this bit of kernel code:

asmlinkage long sys_getpid(void)
{
	return current->tgid;
}

Now, I knew that getpid() was a system call, but I didn't realize it was a system call. What I mean by that is, even though it's documented in section 2 of the manual (usually reserved for system calls), it may not actually call into the kernel. For instance, alarm() is documented in section 2, but it's a library call that eventually calls other system calls. And I thought that getpid() was a library call that just returned a value (the current process ID) stored in userspace (which makes sense to me—it's not like it changes or anything as the process runs). I didn't expect it to be an actual system call.

Well then.

And it's about as low overhead of a system call as you can get.

So I rewrote t2():

void t2(void)
{
  while(gc < 1000000000)
  {
    getpid();
    gv++;
    getpid();
  }
}

And the runtime dropped to 17 seconds.

Okay, much better. Not quite twice the function call overhead. But what the heck was kill() doing that warranted fifteen minutes? I checked the Linux source code and it didn't appear to be that much. What the heck was going on?

So what's the actual code to getpid()? Was it perhaps making the system call once and storing the result? Time to check the code to getpid(). Easiest way to do that was to statically compile the test program, and disassemble it (using objdump):

0804e380 <__getpid>:
 804e380:       b8 14 00 00 00          mov    $0x14,%eax
 804e385:       cd 80                   int    $0x80
 804e387:       c3                      ret

Okay. Not much to it. What if I make the system call directly then? And here, I resorted to assembly (since the inline assembly directives for gcc make no sense to me whatsoever):

t41:	mov	eax,20
	int	$80    
	mov	eax,[gv]
	inc	eax
	mov	[gv],eax
	mov	eax,20
	int	$80
	mov	eax,[gv]
	cmp	eax,1000000000
	jl	t41
	ret

(oh, the reason the code looks a bit different? objdump uses AT&T syntax, while nasm uses Intel syntax)

It was a bit surprising to find myself waiting for 11 minutes for this bit of code to run.

Okay, that's more on par with using the system call kill(), but that doesn't explain why my version took 66 times longer. Perhaps it's a CPU pipeline issue? Perhaps it is faster to call a function that invokes a system call than it is to inline the actual system call. So I'll called the function getpid():

	extern	__getpid

t42:	call	__getpid
	inc	dword [gv]
	call	__getpid
	cmp	dword [gv],1000000000
	jl	t42
	ret

Back to 17 seconds. Okay, what if I supply my own wrapper function around the getpid() system call?

t43:            call    getpid
                inc     dword [gv]
                call    getpid
                cmp     dword [gv],1000000000
                jl      t43
                ret

getpid:         mov     eax,20
                int     $80
                ret

And just to make sure, check with objdump:

080484b0 <getpid>:
 80484b0:       b8 14 00 00 00          mov    $0x14,%eax
 80484b5:       cd 80                   int    $0x80
 80484b7:       c3                      ret

 ...

0804e420 <__getpid>:
 804e420:       b8 14 00 00 00          mov    $0x14,%eax
 804e425:       cd 80                   int    $0x80
 804e427:       c3                      ret

So, other than the memory location of the routine (and the name, but the name isn't part of the executable) they're the same. So why does my version of the getpid() function take 11 minutes to run? Something odd is going on somewhere.

I then wrote a version of getpid() that caches the return value from the first call (and this time, I made it callable by C functions):

		global	my_getpid

my_getpid:	mov	eax,[pid]
		or	eax,eax
		jz	.10
		ret
.10		mov	eax,20
		int	$80
		mov	[pid],eax
		ret

Two tests, one calling this routine from C, another test calling this routine from another assembly routine. And both take around 10 seconds to run, which is just a shade slower than calling a null routine.

But calling __getpid() (the system supplied C library routine that calls the system call) is faster than calling getpid() (the function I wrote to call the system call), even though they're identical functions.

Is there something else going on that I'm not aware of?

Weird …

The Boston Diaries

Friday, November 30, 2007

Obsessing over stupid benchmarks (and you thought my obsessing over a greylist daemon was bad … )

Obligatory Picture

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous