Sunday, January 12, 2003
When shell scripts are faster than C programs
My stats program for The Boston Diaries
basically consists of a shell script that calls a custom program (in C) to
print out only certain fields from the website logfile, which is then fed
into a pipeline of some twenty invocations of grep. It
basically looks like:
cat logfile | \
escanlog -status 200 -host -command -agent | \
grep Mozilla | \
grep -v 'Slurp/cat' | \
grep -v 'ZyBorg' | \
grep -v 'bdstyle.css' | \
grep -v 'screen.css' | \
grep -v '^12.148.209.196' | \
grep -v '^4.64.202.64' | \
grep -v '^213.60.99.73' | \
grep -v 'Ask Jeeves' | \
grep -v 'rfp@gmx.net' | \
grep -v '"Mozilla"' | \
grep -v 'Mozilla/4.5' | \
grep -v '.gif ' | \
grep -v '.png ' | \
grep -v '.jpg ' | \
grep -v 'bostondiaries.rss' | \
grep -v 'bd.rss' | \
grep -v 'favicon.ico' | \
grep -v 'robots.txt' | \
grep -v $HOMEIP
It's serviceable, but it does filter out Lynx and possibly Opera users, since I
filter for Mozilla and then reject what I don't want. Twenty invocations of
grep is pretty harsh, especially on my server. And given that more and
more robots seem to be hiding themselves, the list of exclusions can only
get longer and longer.
I think that at this point, a custom program would be much better.
So I wrote one. In C. Why not Perl? Well, I don't know Perl,
and I already have all the code I really need in C; there's even a
regex library installed on the systems that I can call, so that,
mixed with the code I already have to parse a website logfile and an
extensive library of C code to handle higher-level data structures, it
wouldn't take me all that long to write the program I wanted.
First, start out with a list of rules:
# configuration file for filtering web log files
reject host '^XXXXXXXXXXXXX$' # home ip
filter status '^2[0-9][0-9]$'
reject command '^HEAD.*'
reject command '.*bostondiaries\.rss.*'
reject command '.*/robots\.txt.*'
reject command '.*/bd\.rss.*'
reject command '.*/CSS/screen\.css.*'
reject command '.*/bdstyle\.css.*'
reject command '.*\.gif .*'
reject command '.*\.png .*'
reject command '.*\.jpg .*'
reject command '.*favicon\.ico.*'
accept agent '.*Lynx.*'
filter agent '.*Mozilla.*'
reject agent '.*Slurp/cat.*'
reject agent '.*ZyBorg.*'
reject agent '.*la2@unspecified\.mail.*'
reject agent '.*Ask Jeeves.*'
reject agent '.*Gulper Web Bot.*'
reject agent '^Mozilla$'
reject agent '^Mozilla/4.7$'
reject agent '^Mozilla/4.5 \[en\] (Win98; I)$'
reject agent '.*efp@gmx.net.*'
reject host '^4\.64\.202\.64$'
reject host '^63\.148\.99.*'
The second column is which field you want to check from the logfile, while
the last column is the regular expression to match on. The first column is
the rule to apply: filter will continue processing
if the field matches the regular expression, otherwise the request is
discarded; reject will automatically reject the request if the
given field matches the regular expression; and accept will
automatically accept the request (the last two are forms of short-circuiting
the evaluation). Once all the rules have finished processing (or
an accept is run), the resulting request is printed.
So, for example, the first line there rejects any request from my home network (as requests from there would skew the results), and the second line rejects any request that doesn't result in a valid page. And we go on from there.
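To make the rule semantics concrete, here's a minimal sketch of what such a rule loop could look like. The types and names (struct rule, RULE_FILTER and so on) are illustrative only, not the actual program, and it assumes each pattern was already compiled with regcomp() while reading the configuration file:

#include <stddef.h>
#include <regex.h>

/* Hypothetical rule representation; the real program's types may differ. */
enum rule_type { RULE_FILTER , RULE_REJECT , RULE_ACCEPT };

struct rule
{
  enum rule_type type;    /* filter, reject or accept           */
  int            field;   /* which logfile field to examine     */
  regex_t        regex;   /* pattern, compiled with regcomp()   */
};

/* Return 1 if the request (an array of field strings) should be
   printed, 0 if it should be discarded.                         */
int process_rules(struct rule *rules,size_t nrules,char **fields)
{
  size_t i;

  for (i = 0 ; i < nrules ; i++)
  {
    int match = (regexec(&rules[i].regex,fields[rules[i].field],0,NULL,0) == 0);

    switch(rules[i].type)
    {
      case RULE_FILTER:              /* keep going only on a match      */
           if (!match) return(0);
           break;
      case RULE_REJECT:              /* a match discards the request    */
           if (match) return(0);
           break;
      case RULE_ACCEPT:              /* a match short-circuits to keep  */
           if (match) return(1);
           break;
    }
  }
  return(1);   /* ran out of rules; the request survives */
}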
The programming itself goes rather quickly. That is good.
The program itself goes rather slowly. That is not good.
It is, in fact, slower than the shell script with twenty
invocations of grep. It is, in fact, a C program that is
beaten by a shell script.
That is really not good.
Time to profile the code. Now, it might be that the
regex
library I was calling was slow, but I discounted
that—it should be well tested, right?
Right?
| % time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | function |
|---|---|---|---|---|---|---|
| 65.69 | 0.90 | 0.90 | 11468 | 0.08 | 0.08 | read_line |
| 8.03 | 1.01 | 0.11 | 11467 | 0.01 | 0.02 | process_rules |
| 5.84 | 1.09 | 0.08 | 265271 | 0.00 | 0.00 | NodeValid |
| 5.84 | 1.17 | 0.08 | 11467 | 0.01 | 0.01 | read_entry |
| 2.92 | 1.21 | 0.04 | 11495 | 0.00 | 0.00 | mem_alloc |
So right away we can see that read_line()
is sucking up a
lot of time here. Let's see what it could be:
char *read_line(FILE *fpin)
{
  int    c;
  char  *buffer = NULL;
  char  *np;
  size_t size   = 0;
  size_t ns     = 0;

  while(!feof(fpin))
  {
    if (size == ns)
    {
      ns = size + LINESIZE;
      np = realloc(buffer,ns);
      if (np == NULL) return(buffer);
      buffer = np;
      buffer[size] = '\0';
    }
    c = fgetc(fpin);
    if (c == '\n') return(buffer);
    buffer[size]   = c;
    buffer[++size] = '\0';
  }
  return(buffer);
}
Not much going on. About the only thing that might be sucking up time is
the allocation of memory (since LINESIZE
was set to 256). I
ran a quick test and found that the longest line in the logfiles is under
1,000 characters, while the average size was about 170. Given the number of
lines in some of the files (40,000 lines, which, as logfiles go, isn't
that big) and given that the code I have makes a duplicate of each
line (since I break one of the copies up into individual fields), I'm calling
malloc() at least twice per line, which works out to upwards of 80,000
times, and more if the lines are exceptionally long.
Fast forward over rewriting and profiling. Since malloc()
can be an expensive operation, we skip it; instead, the memory to use is
passed in to read_line(), and since we're going to make a copy of the line
anyway, why not pass in two blocks of memory?
void read_line(FILE *fpin,char *p1,char *p2,size_t size)
{
  int c;

  while((size-- > 0) && (c = fgetc(fpin)) != EOF)
  {
    if (c == '\n') break;
    *p1++ = *p2++ = c;
  }
  *p1 = *p2 = '\0';
}
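For illustration, a caller could then look something like this; BUFSIZE and the surrounding loop are made up for the sketch, not lifted from the actual program:

#include <stdio.h>

#define BUFSIZE 1024   /* comfortably larger than the longest observed line (~1,000 characters) */

void read_line(FILE *fpin,char *p1,char *p2,size_t size);   /* as defined above */

int main(void)
{
  char line[BUFSIZE];   /* copy that gets split into individual fields */
  char raw [BUFSIZE];   /* untouched copy of the original line         */

  for ( ; ; )
  {
    read_line(stdin,line,raw,BUFSIZE - 1);
    if (feof(stdin) && (line[0] == '\0'))   /* nothing left to read */
      break;
    /* ... split `line' into fields, run the rules, print `raw' if it survives ... */
  }
  return(0);
}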
Now, run it under the profiler:
| % time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | function |
|---|---|---|---|---|---|---|
| 80.93 | 2.97 | 2.97 | 1 | 2970.00 | 3670.00 | process |
| 8.99 | 3.30 | 0.33 | 11468 | 0.03 | 0.03 | read_line |
| 5.18 | 3.49 | 0.19 | 11467 | 0.02 | 0.02 | read_entry |
| 4.90 | 3.67 | 0.18 | 11467 | 0.02 | 0.02 | process_rules |
| 0.00 | 3.67 | 0.00 | 31 | 0.00 | 0.00 | empty_string |
Much better. Now the actual processing is taking most of the time, so time to test it again on the server.
Still slower than the shell script.
Grrrrrrr
I'm beginning to suspect that it's the call to regexec()
(the regular expression engine) that is slowing the program down, but there
is still more I can do to speed the program up.
I can remove the call to fgetc()
. Under Unix, I can map the
file into memory using mmap()
and then it just becomes
searching through memory looking for each line instead of having to
explicitly call the I/O routines.
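As a rough sketch of the idea (the struct and names here are mine for illustration, not the actual File object from the program):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical mmap()-backed file object */
struct mfile
{
  char   *data;   /* start of the mapped file  */
  size_t  size;   /* size of the file in bytes */
  size_t  off;    /* current read position     */
};

int mfile_open(struct mfile *f,const char *name)
{
  struct stat st;
  int         fd;

  fd = open(name,O_RDONLY);
  if (fd == -1) return(-1);
  if (fstat(fd,&st) == -1) { close(fd); return(-1); }

  f->data = mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,fd,0);
  close(fd);                               /* the mapping survives the close() */
  if (f->data == MAP_FAILED) return(-1);

  f->size = st.st_size;
  f->off  = 0;
  return(0);
}

/* read_line() then becomes a scan through memory for the next '\n' */
void read_line(struct mfile *f,char *p1,char *p2,size_t size)
{
  while((size-- > 0) && (f->off < f->size))
  {
    char c = f->data[f->off++];
    if (c == '\n') break;
    *p1++ = *p2++ = c;
  }
  *p1 = *p2 = '\0';
}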
Okay, so I code up my own File
object, which
mmap()
's the file into memory, and modify
read_line()
appropriately, compile and profile:
| % time | cumulative seconds | self seconds | calls | self us/call | total us/call | function |
|---|---|---|---|---|---|---|
| 42.86 | 0.06 | 0.06 | 11467 | 5.23 | 5.23 | process_rules |
| 35.71 | 0.11 | 0.05 | 11467 | 4.36 | 4.36 | read_entry |
| 7.14 | 0.12 | 0.01 | 11468 | 0.87 | 0.87 | file_eof |
| 7.14 | 0.13 | 0.01 | 11467 | 0.87 | 0.87 | read_line |
| 7.14 | 0.14 | 0.01 | 1 | 10000.00 | 140000.00 | process |
Finally! process_rules() shows up as
sucking up the majority of the runtime. Test it on the server and … it's
still slow.
Okay, so now I know that regexec() is making a C
program slower than a shell script with twenty invocations of
grep. Just to satisfy my own curiosity, I crank up the
optimizations the compiler uses (gcc -O4 -fomit-frame-pointer
...) and create two versions: one with the call to
regexec() stubbed out and one with the call still in. I then
run both on the server, timing the execution of each.
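For illustration, a compile-time switch along these lines would do the stubbing; MATCH() is just a made-up name here, with the rule-processing code calling it instead of calling regexec() directly:

#include <regex.h>

#ifdef STUB_REGEX
  /* Stubbed build: skip the regular expression engine entirely, so
     everything *except* regexec() gets timed.  The answer it gives is
     meaningless; only the timing matters.  Build with -DSTUB_REGEX.  */
#  define MATCH(preg,str)  (REG_NOMATCH)
#else
  /* Normal build: really call regexec(). */
#  define MATCH(preg,str)  (regexec((preg),(str),0,NULL,0))
#endif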
| | Stubbed version | regexec() version |
|---|---|---|
| User time (seconds) | 6.16 | 2842.71 |
| System time (seconds) | 0.21 | 41.02 |
| Elapsed time (mm:ss) | 00:06.38 | 48:41.89 |
Not quite seven seconds for the stubbed version, and almost an
hour for the one calling regexec()
. And this on this
month's logfile, which isn't even complete yet (approximately 11,400 lines).
I'm beginning to wonder just what options RedHat used to compile the
regex
library.
I then search the net for a later version of the library. It seems there really is only one that nearly everybody doing regular expressions in C uses, written by Henry Spencer and last updated in 1997 (which means it's very stable code, or no one is using C anymore; both of which may be true). I sucked down a copy, compiled it and ran the regression tests it came with. It passed, so I recompiled with heavy optimization (gcc -O4 -fomit-frame-pointer ...) and used that in my program:
| % time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | function |
|---|---|---|---|---|---|---|
| 60.68 | 1.96 | 1.96 | 424573 | 0.00 | 0.00 | sstep |
| 21.67 | 2.66 | 0.70 | 137497 | 0.01 | 0.02 | smatcher |
| 10.84 | 3.01 | 0.35 | 20124 | 0.02 | 0.11 | sfast |
| 2.79 | 3.10 | 0.09 | 137497 | 0.00 | 0.02 | regexec |
| 2.17 | 3.17 | 0.07 | 11467 | 0.01 | 0.28 | process_rules |
Okay, now we're getting somewhere. Try it on the server and it runs in less than a minute (which is tolerable, given the machine).
Finally!
I guess the main point of this is that when you profile code, make
sure that any libraries you may be using are also being
profiled! I could have saved quite a bit of time if I had known for a fact
that it was the regex library that was slowing me down (I could
have, in fact, just done the stub thing first and seen whether it was indeed my
code, but alas … ). But even so, it was a fun exercise to do.