Sunday, January 12, 2003
When shell scripts are faster than C programs
My stats program for The Boston Diaries
basically consists of a shell script that calls a custom program (in C) to
print out only certain fields from the website logfile, which is then fed
into a pipeline of some twenty invocations of grep. It
basically looks like:
cat logfile | \
escanlog -status 200 -host -command -agent | \
grep Mozilla | \
grep -v 'Slurp/cat' | \
grep -v 'ZyBorg' | \
grep -v 'bdstyle.css' | \
grep -v 'screen.css' | \
grep -v '^12.148.209.196' | \
grep -v '^4.64.202.64' | \
grep -v '^213.60.99.73' | \
grep -v 'Ask Jeeves' | \
grep -v 'rfp@gmx.net' | \
grep -v '"Mozilla"' | \
grep -v 'Mozilla/4.5' | \
grep -v '.gif ' | \
grep -v '.png ' | \
grep -v '.jpg ' | \
grep -v 'bostondiaries.rss' | \
grep -v 'bd.rss' | \
grep -v 'favicon.ico' | \
grep -v 'robots.txt' | \
grep -v $HOMEIP
It's serviceable, but it does filter out Lynx and possibly Opera users, since I
filter for Mozilla and then reject what I don't want. Twenty invocations of
grep is pretty harsh, especially on my server. And given that more and
more robots seem to be hiding themselves, the list of exclusions can only
get longer and longer.
I think that at this point, a custom program would be much better.
So I wrote one. In C. Why not Perl? Well, I don't know Perl,
and I already have all the code I really need in C; there's even a
regex library installed on the systems that I can call, so that,
mixed with the code I already have to parse a website logfile and an
extensive library of C code to handle higher-level data structures, it
wouldn't take me all that long to write the program I wanted.
First, start out with a list of rules:
# configuration file for filtering web log files
reject host '^XXXXXXXXXXXXX$' # home ip
filter status '^2[0-9][0-9]$'
reject command '^HEAD.*'
reject command '.*bostondiaries\.rss.*'
reject command '.*/robots\.txt.*'
reject command '.*/bd\.rss.*'
reject command '.*/CSS/screen\.css.*'
reject command '.*/bdstyle\.css.*'
reject command '.*\.gif .*'
reject command '.*\.png .*'
reject command '.*\.jpg .*'
reject command '.*favicon\.ico.*'
accept agent '.*Lynx.*'
filter agent '.*Mozilla.*'
reject agent '.*Slurp/cat.*'
reject agent '.*ZyBorg.*'
reject agent '.*la2@unspecified\.mail.*'
reject agent '.*Ask Jeeves.*'
reject agent '.*Gulper Web Bot.*'
reject agent '^Mozilla$'
reject agent '^Mozilla/4.7$'
reject agent '^Mozilla/4.5 \[en\] (Win98; I)$'
reject agent '.*efp@gmx.net.*'
reject host '^4\.64\.202\.64$'
reject host '^63\.148\.99.*'
The second column is which field you want to check from the logfile, while
the last column is the regular expression to match on. The first column is
the rule to apply: filter will continue processing
if the field matches the regular expression, otherwise the request is
discarded; reject will automatically reject the request if the
given field matches the regular expression; and accept will
automatically accept the request (the last two are forms of short-circuiting
the evaluation). Once all the rules have finished processing (or
an accept is run), the resulting request is printed.
So, for example, the first line there rejects any request from my home network (as requests from there would skew the results), and the second line rejects any request that doesn't result in a valid page. And we go on from there.
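To make the rule semantics concrete, here's a minimal sketch of what such a rule loop could look like. The types and names (struct rule, RULE_FILTER and so on) are illustrative only, not the actual program, and it assumes each pattern was already compiled with regcomp() while reading the configuration file:

#include <stddef.h>
#include <regex.h>

/* Hypothetical rule representation; the real program's types may differ. */
enum rule_type { RULE_FILTER , RULE_REJECT , RULE_ACCEPT };

struct rule
{
  enum rule_type type;    /* filter, reject or accept           */
  int            field;   /* which logfile field to examine     */
  regex_t        regex;   /* pattern, compiled with regcomp()   */
};

/* Return 1 if the request (an array of field strings) should be
   printed, 0 if it should be discarded.                         */
int process_rules(struct rule *rules,size_t nrules,char **fields)
{
  size_t i;

  for (i = 0 ; i < nrules ; i++)
  {
    int match = (regexec(&rules[i].regex,fields[rules[i].field],0,NULL,0) == 0);

    switch(rules[i].type)
    {
      case RULE_FILTER:              /* keep going only on a match      */
           if (!match) return(0);
           break;
      case RULE_REJECT:              /* a match discards the request    */
           if (match) return(0);
           break;
      case RULE_ACCEPT:              /* a match short-circuits to keep  */
           if (match) return(1);
           break;
    }
  }
  return(1);   /* ran out of rules; the request survives */
}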
The programming itself goes rather quickly. That is good.
The program itself goes rather slowly. That is not good.
It is, in fact, slower than the shell script with twenty
invocations of grep. It is, in fact, a C program that is
beaten by a shell script.
That is really not good.
Time to profile the code. Now, it might be that the
regex
library I was calling was slow, but I discounted
that—it should be well tested, right?
Right?
| % time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | function |
|---|---|---|---|---|---|---|
| 65.69 | 0.90 | 0.90 | 11468 | 0.08 | 0.08 | read_line |
| 8.03 | 1.01 | 0.11 | 11467 | 0.01 | 0.02 | process_rules |
| 5.84 | 1.09 | 0.08 | 265271 | 0.00 | 0.00 | NodeValid |
| 5.84 | 1.17 | 0.08 | 11467 | 0.01 | 0.01 | read_entry |
| 2.92 | 1.21 | 0.04 | 11495 | 0.00 | 0.00 | mem_alloc |
So right away we can see that read_line()
is sucking up a
lot of time here. Let's see what it could be:
char *read_line(FILE *fpin)
{
  int    c;
  char  *buffer = NULL;
  char  *np;
  size_t size   = 0;
  size_t ns     = 0;

  while(!feof(fpin))
  {
    if (size == ns)
    {
      ns = size + LINESIZE;
      np = realloc(buffer,ns);
      if (np == NULL) return(buffer);
      buffer = np;
      buffer[size] = '\0';
    }
    c = fgetc(fpin);
    if (c == '\n') return(buffer);
    buffer[size]   = c;
    buffer[++size] = '\0';
  }
  return(buffer);
}
Not much going on. About the only thing that might be sucking up time is
the allocation of memory (since LINESIZE
was set to 256). I
ran a quick test and found that the longest line in the logfiles is under
1,000 characters, while the average size was about 170. Given the number of
lines in some of the files (40,000 lines, which, as logfiles go, isn't
that big) and given that the code I have makes a duplicate of each
line (since I break one of the copies up into individual fields), I'm calling
malloc() at least twice per line, which works out to upwards of 80,000
times, and more if the lines are exceptionally long.
Fast forward over rewriting and profiling. Since malloc()
can be an expensive operation, we skip it; instead, the memory to use is
passed in to read_line(), and since we're going to make a copy of the line
anyway, why not pass in two blocks of memory?
void read_line(FILE *fpin,char *p1,char *p2,size_t size)
{
  int c;

  while((size-- > 0) && (c = fgetc(fpin)) != EOF)
  {
    if (c == '\n') break;
    *p1++ = *p2++ = c;
  }
  *p1 = *p2 = '\0';
}
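For illustration, a caller could then look something like this; BUFSIZE and the surrounding loop are made up for the sketch, not lifted from the actual program:

#include <stdio.h>

#define BUFSIZE 1024   /* comfortably larger than the longest observed line (~1,000 characters) */

void read_line(FILE *fpin,char *p1,char *p2,size_t size);   /* as defined above */

int main(void)
{
  char line[BUFSIZE];   /* copy that gets split into individual fields */
  char raw [BUFSIZE];   /* untouched copy of the original line         */

  for ( ; ; )
  {
    read_line(stdin,line,raw,BUFSIZE - 1);
    if (feof(stdin) && (line[0] == '\0'))   /* nothing left to read */
      break;
    /* ... split `line' into fields, run the rules, print `raw' if it survives ... */
  }
  return(0);
}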
Now, run it under the profiler:
| % time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | function |
|---|---|---|---|---|---|---|
| 80.93 | 2.97 | 2.97 | 1 | 2970.00 | 3670.00 | process |
| 8.99 | 3.30 | 0.33 | 11468 | 0.03 | 0.03 | read_line |
| 5.18 | 3.49 | 0.19 | 11467 | 0.02 | 0.02 | read_entry |
| 4.90 | 3.67 | 0.18 | 11467 | 0.02 | 0.02 | process_rules |
| 0.00 | 3.67 | 0.00 | 31 | 0.00 | 0.00 | empty_string |
Much better. Now the actual processing is taking most of the time, so time to test it again on the server.
Still slower than the shell script.
Grrrrrrr
I'm beginning to suspect that it's the call to regexec()
(the regular expression engine) that is slowing the program down, but there
is still more I can do to speed the program up.
I can remove the call to fgetc()
. Under Unix, I can map the
file into memory using mmap()
and then it just becomes
searching through memory looking for each line instead of having to
explicitly call the I/O routines.
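As a rough sketch of the idea (the struct and names here are mine for illustration, not the actual File object from the program):

#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical mmap()-backed file object */
struct mfile
{
  char   *data;   /* start of the mapped file  */
  size_t  size;   /* size of the file in bytes */
  size_t  off;    /* current read position     */
};

int mfile_open(struct mfile *f,const char *name)
{
  struct stat st;
  int         fd;

  fd = open(name,O_RDONLY);
  if (fd == -1) return(-1);
  if (fstat(fd,&st) == -1) { close(fd); return(-1); }

  f->data = mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,fd,0);
  close(fd);                               /* the mapping survives the close() */
  if (f->data == MAP_FAILED) return(-1);

  f->size = st.st_size;
  f->off  = 0;
  return(0);
}

/* read_line() then becomes a scan through memory for the next '\n' */
void read_line(struct mfile *f,char *p1,char *p2,size_t size)
{
  while((size-- > 0) && (f->off < f->size))
  {
    char c = f->data[f->off++];
    if (c == '\n') break;
    *p1++ = *p2++ = c;
  }
  *p1 = *p2 = '\0';
}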
Okay, so I code up my own File
object, which
mmap()
's the file into memory, and modify
read_line()
appropriately, compile and profile:
| % time | cumulative seconds | self seconds | calls | self us/call | total us/call | function |
|---|---|---|---|---|---|---|
| 42.86 | 0.06 | 0.06 | 11467 | 5.23 | 5.23 | process_rules |
| 35.71 | 0.11 | 0.05 | 11467 | 4.36 | 4.36 | read_entry |
| 7.14 | 0.12 | 0.01 | 11468 | 0.87 | 0.87 | file_eof |
| 7.14 | 0.13 | 0.01 | 11467 | 0.87 | 0.87 | read_line |
| 7.14 | 0.14 | 0.01 | 1 | 10000.00 | 140000.00 | process |
Finally! process_rules() shows up as
sucking up the majority of the runtime. Test it on the server and … it's
still slow.
Okay, so now I know that regexec() is making a C
program slower than a shell script with twenty invocations of
grep. Just to satisfy my own curiosity, I crank up the
optimizations the compiler uses (gcc -O4 -fomit-frame-pointer
...) and create two versions: one with the call to
regexec() stubbed out and one with the call still in. I then
run both on the server, timing the execution of each.
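For illustration, a compile-time switch along these lines would do the stubbing; MATCH() is just a made-up name here, with the rule-processing code calling it instead of calling regexec() directly:

#include <regex.h>

#ifdef STUB_REGEX
  /* Stubbed build: skip the regular expression engine entirely, so
     everything *except* regexec() gets timed.  The answer it gives is
     meaningless; only the timing matters.  Build with -DSTUB_REGEX.  */
#  define MATCH(preg,str)  (REG_NOMATCH)
#else
  /* Normal build: really call regexec(). */
#  define MATCH(preg,str)  (regexec((preg),(str),0,NULL,0))
#endif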
| | Stubbed version | regexec() version |
|---|---|---|
| User time (seconds) | 6.16 | 2842.71 |
| System time (seconds) | 0.21 | 41.02 |
| Elapsed time (mm:ss) | 00:06.38 | 48:41.89 |
Not quite seven seconds for the stubbed version, and almost an
hour for the one calling regexec()
. And this on this
month's logfile, which isn't even complete yet (approximately 11,400 lines).
I'm beginning to wonder just what options RedHat used to compile the
regex
library.
I then search the net for a later version of the library. It seems there really is only one that nearly everybody doing regular expressions in C uses, written by Henry Spencer and last updated in 1997 (which means it's very stable code, or no one is using C anymore; both of which may be true). I sucked down a copy, compiled it and ran the regression tests it came with. It passed, so I recompiled with heavy optimization (gcc -O4 -fomit-frame-pointer ...) and used that in my program:
| % time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | function |
|---|---|---|---|---|---|---|
| 60.68 | 1.96 | 1.96 | 424573 | 0.00 | 0.00 | sstep |
| 21.67 | 2.66 | 0.70 | 137497 | 0.01 | 0.02 | smatcher |
| 10.84 | 3.01 | 0.35 | 20124 | 0.02 | 0.11 | sfast |
| 2.79 | 3.10 | 0.09 | 137497 | 0.00 | 0.02 | regexec |
| 2.17 | 3.17 | 0.07 | 11467 | 0.01 | 0.28 | process_rules |
Okay, now we're getting somewhere. Try it on the server and it runs in less than a minute (which is tolerable, given the machine).
Finally!
I guess the main point of this is that when you profile code, make
sure that any libraries you may be using are also being
profiled! I could have saved quite a bit of time if I had known for a fact
that it was the regex library that was slowing me down (I could
have, in fact, just done the stub thing first and seen whether it was indeed my
code, but alas … ). But even so, it was a fun exercise to do.