Friday, November 30, 2007
Obsessing over stupid benchmarks (and you thought my obsessing over a greylist daemon was bad … )
I'm still obsessing over stupid benchmarks. This time, I was curious about the overhead of calling functions and system calls, to simulate a typical locking scenario, like:
int gv;

void t(void)
{
  while(gv < 1000000000)
  {
    lock();
    gv++;
    unlock();
  }
}
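Here lock() and unlock() are just stand-ins; with pthreads, a concrete (if pointless) version of that loop would look roughly like this (only to make the scenario concrete, not part of the benchmark below):

#include <pthread.h>

int gv;
static pthread_mutex_t gv_lock = PTHREAD_MUTEX_INITIALIZER;

void t(void)
{
  /* same racy read of gv outside the lock as in the sketch above */
  while(gv < 1000000000)
  {
    pthread_mutex_lock(&gv_lock);
    gv++;
    pthread_mutex_unlock(&gv_lock);
  }
}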
So, I wrote some test code:
int gv;

void null(void)
{
}

void t1(void)
{
  while(gv < 1000000000)
  {
    null();
    gv++;
    null();
  }
}

void t2(void)
{
  pid_t me;

  me = getpid();
  while(gv < 1000000000)
  {
    kill(me,0);
    gv++;
    kill(me,0);
  }
}
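(For timing, a wrapper along these lines is enough; this is an illustrative sketch, not necessarily the exact harness I used:)

#include <stdio.h>
#include <time.h>

extern void t1(void);   /* one of the test routines above */

int main(void)
{
  clock_t start = clock();

  t1();
  printf("%.1f seconds\n",(double)(clock() - start) / CLOCKS_PER_SEC);
  return 0;
}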
The first test is pretty simple—it just tests the function call overhead, and that particular piece of code takes 9 seconds (give or take a few tenths of a second) to run. But let me explain about the second test, and therein lies the interesting bit of this tale.
I wanted to test the system call overhead, and thus, I needed a system call that did very little. I found this bit in the kill() manpage:

    If sig is 0, then no signal is sent, but error checking is still performed.
That's about as close to a null system call as I could find. No real bad side effects, and I don't particularly care about the return code (since this is just measuring overhead). So I ran that bit of code. Fifteen minutes later it finished.
Fifteen minutes.
Either system calls are slow, or that particular system call does a vast amount of nothing. Curious, I delved into that part of the kernel source (I'm running all this under Linux, by the way) when I came across this bit of kernel code:
asmlinkage long sys_getpid(void)
{
  return current->tgid;
}
Now, I knew that getpid() was a system call, but I didn't realize it really was a system call. What I mean by that is, even though it's documented in section 2 of the manual (usually reserved for system calls), it may not actually call into the kernel. For instance, alarm() is documented in section 2, but it's a library call that eventually calls other system calls. And I thought that getpid() was a library call that just returned a value (the current process ID) stored in userspace (which makes sense to me; it's not like the process ID changes as the process runs). I didn't expect it to be an actual system call.
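For reference, here's roughly what I had assumed the C library was doing behind the scenes; this is a sketch of my assumption, not actual C library source:

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>

static pid_t cached_pid;                /* hypothetical userspace cache */

pid_t assumed_getpid(void)
{
  if (cached_pid == 0)
    cached_pid = syscall(SYS_getpid);   /* only the first call traps into the kernel */
  return cached_pid;
}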
Well then.
And it's about as low-overhead a system call as you can get.
So I rewrote t2():

void t2(void)
{
  while(gv < 1000000000)
  {
    getpid();
    gv++;
    getpid();
  }
}
And the runtime dropped to 17 seconds.
Okay, much better. Not quite twice the function call overhead. But what the heck was kill() doing that warranted fifteen minutes? I checked the Linux source code and it didn't appear to be doing all that much. What the heck was going on?
So what's the actual code to getpid()? Was it perhaps making the system call once and storing the result? Time to check. The easiest way to do that was to statically compile the test program and disassemble it (using objdump):
0804e380 <__getpid>:
 804e380:   b8 14 00 00 00          mov    $0x14,%eax
 804e385:   cd 80                   int    $0x80
 804e387:   c3                      ret
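(Roughly: compile with gcc -static so the C library code ends up in the binary, then run objdump -d over the resulting executable; the exact flags may vary.)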
Okay. Not much to it. What if I make the system call directly then? And here, I resorted to assembly (since the inline assembly directives for gcc make no sense to me whatsoever):
t41:            mov     eax,20
                int     $80
                mov     eax,[gv]
                inc     eax
                mov     [gv],eax
                mov     eax,20
                int     $80
                mov     eax,[gv]
                cmp     eax,1000000000
                jl      t41
                ret
(Oh, the reason the code looks a bit different? objdump uses AT&T syntax, while nasm uses Intel syntax.)
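(For the record, the gcc inline-assembly version of that raw system call would, I believe, look something like the following; since the constraint syntax is exactly the part that makes no sense to me, treat it as a sketch:)

static long raw_getpid(void)
{
  long result;

  /* 20 is the getpid system call number on 32-bit x86 Linux;
     the "a" constraints pin the value in and out of EAX       */
  __asm__ __volatile__ ("int $0x80" : "=a" (result) : "a" (20L));
  return result;
}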
It was a bit surprising to find myself waiting for 11 minutes for this bit of code to run.
Okay, that's more on par with using the system call kill(), but that doesn't explain why my version took 66 times longer. Perhaps it's a CPU pipeline issue?
Perhaps it is faster to call a function that invokes a system call than it is to inline the actual system call. So I called the C library function __getpid():
                extern  __getpid

t42:            call    __getpid
                inc     dword [gv]
                call    __getpid
                cmp     dword [gv],1000000000
                jl      t42
                ret
Back to 17 seconds. Okay, what if I supply my own wrapper function around the getpid() system call?
t43:            call    getpid
                inc     dword [gv]
                call    getpid
                cmp     dword [gv],1000000000
                jl      t43
                ret

getpid:         mov     eax,20
                int     $80
                ret
And just to make sure, check with objdump:
080484b0 <getpid>:
 80484b0:   b8 14 00 00 00          mov    $0x14,%eax
 80484b5:   cd 80                   int    $0x80
 80484b7:   c3                      ret

 ...

0804e420 <__getpid>:
 804e420:   b8 14 00 00 00          mov    $0x14,%eax
 804e425:   cd 80                   int    $0x80
 804e427:   c3                      ret
So, other than the memory location of the routine (and the name, but the name isn't part of the executable) they're the same. So why does my version of the getpid() function take 11 minutes to run? Something odd is going on somewhere.
I then wrote a version of getpid() that caches the return value from the first call (and this time, I made it callable by C functions):
                global  my_getpid

my_getpid:      mov     eax,[pid]
                or      eax,eax
                jz      .10
                ret

.10             mov     eax,20
                int     $80
                mov     [pid],eax
                ret
Two tests: one calling this routine from C, the other calling it from another assembly routine. Both take around 10 seconds to run, which is just a shade slower than calling a null routine.
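The C side of that test is nothing fancy; a minimal sketch (the names here are illustrative) would be:

#include <sys/types.h>

extern pid_t my_getpid(void);   /* the caching assembly routine above */

int gv;

void t5(void)
{
  while(gv < 1000000000)
  {
    my_getpid();
    gv++;
    my_getpid();
  }
}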
But calling __getpid() (the system-supplied C library routine that calls the system call) is faster than calling getpid() (the function I wrote to call the system call), even though they're identical functions.
Is there something else going on that I'm not aware of?
Weird …