Saturday, July 10, 2004
Profiling is still viable, if you run the program long enough
So last night I said, “I might have to find a slower machine to get a better feel for how to improve the speed of [mod_blog].” Well, I found one other way—increase the amount of work the program does.
mod_blog
, or at least the program I was working with last
night, bp
(for “build page”) is responsible for building the
HTML pages served up
the webserver. In its default mode (which is what I was working with last
night) it generates the main index page, and the RSS file, which only
(only!) generates about 100,000 bytes of output and happened too
quickly to get any meaningful data out of profiling the program. But it can
also generate pages with
hundreds of entries given the right options, so by having it generate a
page with every entry in 2000 through 2003 inclusive (which
generates a page that is 3,352,028 bytes in size) I was able to get
meaningful profiling information:
% time | cumulative seconds | self seconds | calls | self ms/call | total ms/call | name |
---|---|---|---|---|---|---|
29.73 | 0.22 | 0.22 | 5817769 | 0.00 | 0.00 | line_ioreq |
14.86 | 0.33 | 0.11 | 6650539 | 0.00 | 0.00 | BufferIOCtl |
10.81 | 0.41 | 0.08 | 2908866 | 0.00 | 0.00 | ht_readchar |
9.46 | 0.48 | 0.07 | 790760 | 0.00 | 0.00 | ufh_write |
5.41 | 0.52 | 0.04 | 2908398 | 0.00 | 0.00 | line_readchar |
Yes, with a run time of nearly 3 seconds (2.992) I was able to generate
consistant, meaningful profiling information (meaning, we went from calling
BufferIOCtl()
141,779 times to 6,650,539 times) which was good
enough to see if the GCC optimzation of using -freorder-blocks
with __builtin_expect()
would help any, at least on the Intel
x86 line.
Well, after running a dozen tests, I can say that it does help
on the Intel platform, but at best, using __builtin_expect()
with -freorder-blocks
will only give you a few percent boost in
speed. As in, single digit percentage boost.
In certain cases.
Certainly, when I did tests generating the output to
/dev/null
you could see the boost, but in running tests
generating an actual file (as opposed to just tossing the data in the bit
bucket), it's not quite so clear cut (and for each test, I ran it five
times, taking the timings from the fifth run—this to help smooth out any
system caching effects). The best improvement came when using -O3
-march=pentium3
-fomit-frame-pointer
to compile the program, but it was still hardly
noticable from a user perspective (maybe about a tenth or two tenths of a
second).
Mark is expecting __builtin_expect()
to have better impact
on systems where the CPU
will use the hint.