The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Saturday, July 10, 2004

Is profiling even viable now?

Mark brought up (in email) an interesting optimization technique using GCC 3:

I came across an interesting optimization that is GCC specific but quite clever.

In lots of places in the Linux kernel you will see something like:

p = get_some_object();
if (unlikely(p == NULL))
{
  kill_random_process();
  return (ESOMETHING);
}

do_stuff(p);

The conditional is clearly an error path and as such means it is rarely taken. This is actually a macro defined like this:

#define unlikely(b)   __builtin_expect(b, 0)

On newer versions of GCC this tells the compiler to expect the condition not to be taken. You could also tell the compiler that the branch is likely to be taken:

#define likely(b)     __builtin_expect(b, 1)

So how does this help GCC anyhow? Well, on some architectures (PowerPC) there is actually a bit in the branch instruction to tell the CPU's speculative execution unit if the branch is likely to be taken. On other architectures it avoids conditional branches to make the “fast path” branch free (with -freorder-blocks).

I was curious to see if this would actually help any, so I found a machine that had GCC 3 installed (swift), compiled a version of mod_blog with profiling information, ran it, found a function that looked good to speed up, added some calls to __builtin_expect(), reran the code and got a rather encouragine interesting result.

I then reran the code, and got a completely different result.

In fact, each time I run the code, the profiling information I get is nearly useless—well, to a degree. For instance one run:

Each sample counts as 0.01 seconds.
% time cumulative seconds self seconds calls self ms/call total ms/call name
100.00 0.01 0.01 119529 0.00 0.00 line_ioreq
0.00 0.01 0.00 141779 0.00 0.00 BufferIOCtl
0.00 0.01 0.00 60991 0.00 0.00 line_readchar
0.00 0.01 0.00 59747 0.00 0.00 ht_readchar

Then another run:

Each sample counts as 0.01 seconds.
% time cumulative seconds self seconds calls self ms/call total ms/call name
33.33 0.01 0.01 119529 0.00 0.00 line_ioreq
33.33 0.02 0.01 60991 0.00 0.00 line_readchar
33.33 0.03 0.01 21200 0.00 0.00 ufh_write
0.00 0.03 0.00 141779 0.00 0.00 BufferIOCtl

Yet another run:

Each sample counts as 0.01 seconds. no time accumulated
% time cumulative seconds self seconds calls self ms/call total ms/call name
0.00 0.00 0.00 141779 0.00 0.00 BufferIOCtl
0.00 0.00 0.00 119529 0.00 0.00 line_ioreq
0.00 0.00 0.00 60991 0.00 0.00 line_readchar
0.00 0.00 0.00 59747 0.00 0.00 ht_readchar

And still another one:

Each sample counts as 0.01 seconds.
% time cumulative seconds self seconds calls self ms/call total ms/call name
50.00 0.01 0.01 60991 0.00 0.00 line_readchar
50.00 0.02 0.01 1990 0.01 0.01 HtmlParseNext
0.00 0.02 0.00 141779 0.00 0.00 BufferIOCtl
0.00 0.02 0.00 119529 0.00 0.00 line_ioreq

Like I said, nearly useless. Sure, there are the usual suspects, like BufferIOCtl() and line_ioreq(), but it's impossible to say what improvements I'm getting by doing this. And by today's standards, swift isn't a fast machine being only (only!) a 1.3GHz Pentium III with half a gig of RAM. I could only imagine the impossibility of profiling under a faster machine, or even imagining what could be profiled under a faster machine.

I have to wonder what the Linux guys are smoking to even think, in the grand scheme of things, if __builtin_expect() will even improve things all that much.

Unless they have access to better profiling mechanics than I do.

Looks like I might have to find a slower machine to get a better feel for how to improve the speed of the program.


Profiling is still viable, if you run the program long enough

So last night I said, “I might have to find a slower machine to get a better feel for how to improve the speed of [mod_blog].” Well, I found one other way—increase the amount of work the program does.

mod_blog, or at least the program I was working with last night, bp (for “build page”) is responsible for building the HTML pages served up the webserver. In its default mode (which is what I was working with last night) it generates the main index page, and the RSS file, which only (only!) generates about 100,000 bytes of output and happened too quickly to get any meaningful data out of profiling the program. But it can also generate pages with hundreds of entries given the right options, so by having it generate a page with every entry in 2000 through 2003 inclusive (which generates a page that is 3,352,028 bytes in size) I was able to get meaningful profiling information:

Each sample counts as 0.01 seconds.
% time cumulative seconds self seconds calls self ms/call total ms/call name
29.73 0.22 0.22 5817769 0.00 0.00 line_ioreq
14.86 0.33 0.11 6650539 0.00 0.00 BufferIOCtl
10.81 0.41 0.08 2908866 0.00 0.00 ht_readchar
9.46 0.48 0.07 790760 0.00 0.00 ufh_write
5.41 0.52 0.04 2908398 0.00 0.00 line_readchar

Yes, with a run time of nearly 3 seconds (2.992) I was able to generate consistant, meaningful profiling information (meaning, we went from calling BufferIOCtl() 141,779 times to 6,650,539 times) which was good enough to see if the GCC optimzation of using -freorder-blocks with __builtin_expect() would help any, at least on the Intel x86 line.

Well, after running a dozen tests, I can say that it does help on the Intel platform, but at best, using __builtin_expect() with -freorder-blocks will only give you a few percent boost in speed. As in, single digit percentage boost.

In certain cases.

Certainly, when I did tests generating the output to /dev/null you could see the boost, but in running tests generating an actual file (as opposed to just tossing the data in the bit bucket), it's not quite so clear cut (and for each test, I ran it five times, taking the timings from the fifth run—this to help smooth out any system caching effects). The best improvement came when using -O3 -march=pentium3 -fomit-frame-pointer to compile the program, but it was still hardly noticable from a user perspective (maybe about a tenth or two tenths of a second).

Mark is expecting __builtin_expect() to have better impact on systems where the CPU will use the hint.

Obligatory Picture

[The future's so bright, I gotta wear shades]

Obligatory Contact Info

Obligatory Feeds

Obligatory Links

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links are simple: Start with the base link for this site: https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.