The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Friday, September 21, 2007

Software performance with large sets of data

I have seen this many, many times: something runs fast during development, and maybe even during testing, because the data used in testing was too small and didn't match real-world conditions. If you are working on a small set of data, everything is fast, even slow things.

The primary test I use with the greylist daemon is (I think) brutal. I have a list of 27,155 tuples (real tuples, logged from my own SMTP server) of which 25,261 are unique. When I run the greylist daemon, I use an embargo timeout of one second (to ensure a significant number of tuples make it to the whitelist), a greylist timeout of at least two minutes, with the cleanup code (which checks for expired records and removes them) running every minute. Then to run the actual test, I pump the tuple list through a small program that reformats the tuples into the form the Postfix module expects, which then connects to the daemon. There is no delay in the sending of these tuples; we're talking thousands of tuples per minute, for several minutes, being pumped through the greylist daemon.
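To make the timing parameters concrete, here is a minimal sketch of how an embargo timeout, a greylist timeout, and a periodic cleanup might interact. It is not the daemon's actual code. The class name `Greylist`, the method names, the `DEFER`/`ACCEPT` replies, and the tuple layout (IP address, sender, recipient) are all assumptions made for illustration, as is the reading that a retry after the embargo promotes a tuple to the whitelist.

```python
class Greylist:
    """Hypothetical greylist state; names and semantics are assumptions."""

    def __init__(self, embargo=1.0, greylist_ttl=120.0):
        self.embargo = embargo            # seconds a new tuple must wait
        self.greylist_ttl = greylist_ttl  # seconds an unconfirmed tuple lives
        self.grey = {}                    # tuple -> time it was first seen
        self.white = set()                # tuples that retried after the embargo

    def check(self, tup, now):
        """Return "ACCEPT" or "DEFER" for a tuple seen at time `now`."""
        if tup in self.white:
            return "ACCEPT"
        first_seen = self.grey.get(tup)
        if first_seen is None:
            self.grey[tup] = now          # first sighting: start the embargo
            return "DEFER"
        if now - first_seen < self.embargo:
            return "DEFER"                # retried too soon
        del self.grey[tup]                # retried after the embargo:
        self.white.add(tup)               # promote to the whitelist
        return "ACCEPT"

    def cleanup(self, now):
        """Drop greylisted tuples that never retried; return how many."""
        expired = [t for t, seen in self.grey.items()
                   if now - seen >= self.greylist_ttl]
        for t in expired:
            del self.grey[t]
        return len(expired)
```

Under these assumptions, a one-second embargo means that any repeated tuple in the replayed log that arrives more than a second after its first appearance lands on the whitelist, which is how a list with duplicates can exercise both paths.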

Now, there are several lists the tuple is compared against. I have a list of IP addresses that will cause the daemon to accept or reject the tuple. I can check the sender email address or sender domain, and the recipient email address or domain. If it passes all those, then I check the actual tuple list. The IP list is a trie (nice for searching through IP blocks). The other lists are all sorted arrays, using a custom binary search to help with inserting new records.
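The two data structures can be sketched roughly as follows. This is an illustration under stated assumptions, not the daemon's implementation: the names `find_slot`, `add_record`, and `IPTrie` are hypothetical, the trie handles IPv4 only, and the rule that the most specific matching block wins is an assumption. The binary search is "custom" in the sense that a miss reports where the key belongs, so an insert needs no second search.

```python
import ipaddress


def find_slot(records, key):
    """Binary search a sorted list. Return (found, index); on a miss,
    index is where key must go to keep the list sorted."""
    lo, hi = 0, len(records)
    while lo < hi:
        mid = (lo + hi) // 2
        if records[mid] < key:
            lo = mid + 1
        elif records[mid] > key:
            hi = mid
        else:
            return True, mid
    return False, lo


def add_record(records, key):
    """Insert key in sorted position unless present; return True if added."""
    found, index = find_slot(records, key)
    if not found:
        records.insert(index, key)
    return not found


class IPTrie:
    """Binary trie keyed on IPv4 address bits, one level per prefix bit."""

    def __init__(self):
        self.root = [None, None, None]    # [child for 0, child for 1, verdict]

    def add(self, cidr, verdict):
        net = ipaddress.IPv4Network(cidr)
        bits = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):    # walk only the prefix bits
            bit = (bits >> (31 - i)) & 1
            if node[bit] is None:
                node[bit] = [None, None, None]
            node = node[bit]
        node[2] = verdict

    def lookup(self, address):
        """Return the verdict of the longest matching block, or None."""
        bits = int(ipaddress.IPv4Address(address))
        node = self.root
        verdict = node[2]
        for i in range(32):
            node = node[(bits >> (31 - i)) & 1]
            if node is None:
                break
            if node[2] is not None:
                verdict = node[2]         # a more specific block matched
        return verdict
```

A lookup in the trie costs at most 32 steps regardless of how many blocks are stored, while the sorted arrays cost a logarithmic number of comparisons per lookup plus the shifting done by each insert.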

Any request that the server can handle immediately (say, checking a tuple, or returning the current config or statistics to the Greylist Daemon Master Control Program) is handled in the main processing loop; for longer operations (like sending back a list of tuples to the Master Control Program), the server calls fork() and the child process handles the request.
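The split between inline and forked handling might look something like the sketch below. The request names `"stats"` and `"list"`, the function `handle`, and the reply format are hypothetical; only the use of fork() for long replies comes from the description above. It relies on `os.fork`, so it assumes a POSIX system.

```python
import os


def handle(request, reply_fd, tuples):
    """Answer quick requests in the calling process; fork for long replies.

    Returns None for an inline reply, or the child's pid when the reply
    was handed off to a forked child.
    """
    if request == "stats":
        # Cheap to answer, so do it in the main loop.
        os.write(reply_fd, f"tuples={len(tuples)}\n".encode())
        return None
    if request == "list":
        pid = os.fork()
        if pid == 0:
            # Child: works from its own snapshot of the tuple list, so the
            # parent can keep answering requests while this runs.
            try:
                for tup in tuples:
                    os.write(reply_fd, (" ".join(tup) + "\n").encode())
            finally:
                os._exit(0)               # never fall back into the main loop
        return pid                        # parent: back to the loop at once
    raise ValueError(f"unknown request: {request!r}")
```

One appeal of this design is that the child sees the tuple list exactly as it was at the moment of the fork, so a long dump needs no locking against later inserts. A long-running server would also have to reap its children (for example with `os.waitpid`), which the sketch leaves to the caller.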

I haven't actually profiled the program, but at this point, I haven't had a need to. It doesn't drop a request, even when I run the same grueling test on a 150MHz PC.

I just might though … it would be interesting to see the results.

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, then add the date you are interested in, say 2000/08/01, to make the final URL.

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.