Wednesday, October 24, 2007
144 points of failure
I'm not even sure where to begin with this.
A customer is having a problem with duplicate emails being sent about a month after they were originally sent, and it's causing the recipients to freak out (since they can't be bothered to check the date and see that it's either a duplicate or a very late email message).
Our problem is obtaining the information we need to troubleshoot it. Our customer has no idea what “email headers” are (but then again, our customer has no idea what “a program” is, or how she even checks her email) and doesn't want to bother the recipients with such details.
The real problem?
The sheer number of participants involved in exchanging an email between two parties. Between the two, in this case, are at least four operating systems (running on the customer's computer, our computer, the recipient's email server, and the recipient's computer), six networks (the customer's local network, their ISP, our network, the network of the recipient's email server, the recipient's ISP, and the recipient's local network) spanning an unknown number of routers, and at least six programs (the customer's email client, the incoming and outgoing mail daemons on our server, the incoming mail daemon on the recipient's server, the mailbox daemon on the recipient's email server, and the recipient's email client). Any one of those could hiccup in a way that causes duplicate emails to be sent (and I'll spare you those details).
It's amazing that this crazy patchwork of servers, networks and software works at all, but boy, when it breaks, it breaks in very odd ways. I'm sure that the problem is understandable once we figure out what went wrong, but how to determine what went wrong? Especially after the fact?
Our log files don't go back that far, and what we do have is 10 GB worth (and due to how sendmail logs emails, an individual email generates, at minimum, three lines of logging information, and good luck trying to piece all that together).
I think what I'm grousing about is my inability to fully troubleshoot the issue. The participants aren't necessarily technically inclined (which makes it difficult to get help from them, or even real solid information), and it involves more than just us. And somehow, it's our fault.
Oops. Gotta go. Yet another email issue to troubleshoot.
Now, where did I put my gun?
Thoughts on optimizing a greylist daemon
Speaking of optimization, I thought it might be fun to profile the greylist daemon. It's not difficult—just recompile the program with the appropriate compiler option (-pg with gcc), run it, and feed the results to gprof.
Now, when I wrote the greylist daemon, I did give thought to how I was going to search through the vast amount of information, and wrote the code with that in mind (oooh, premature optimization? Or was that a mature optimization?). I was also curious about any hotspots in the code, as I tend to max out at 130 requests per second on my development server.
The results were a bit surprising:
% time | cumulative seconds | self seconds | calls | self Ts/call | total Ts/call | name |
---|---|---|---|---|---|---|
50.00 | 0.01 | 0.01 | 661477 | 0.00 | 0.00 | tuple_cmp_ift |
50.00 | 0.02 | 0.01 | 1 | 10.00 | 10.00 | whitelist_dump_stream |
0.00 | 0.02 | 0.00 | 140892 | 0.00 | 0.00 | edomain_cmp |
0.00 | 0.02 | 0.00 | 108648 | 0.00 | 0.00 | crc32 |
0.00 | 0.02 | 0.00 | 95853 | 0.00 | 0.00 | edomain_search |
0.00 | 0.02 | 0.00 | 28270 | 0.00 | 0.00 | StreamEOF |
0.00 | 0.02 | 0.00 | 27439 | 0.00 | 0.00 | report_stderr |
0.00 | 0.02 | 0.00 | 27273 | 0.00 | 0.00 | ipv4 |
0.00 | 0.02 | 0.00 | 27165 | 0.00 | 0.00 | check_signals |
0.00 | 0.02 | 0.00 | 27162 | 0.00 | 0.00 | send_packet |
0.00 | 0.02 | 0.00 | 27155 | 0.00 | 0.00 | ip_match |
0.00 | 0.02 | 0.00 | 27155 | 0.00 | 0.00 | send_reply |
0.00 | 0.02 | 0.00 | 27155 | 0.00 | 0.00 | type_graylist |
0.00 | 0.02 | 0.00 | 25458 | 0.00 | 0.00 | tuple_search |
0.00 | 0.02 | 0.00 | 24692 | 0.00 | 0.00 | tuple_add |
0.00 | 0.02 | 0.00 | 24692 | 0.00 | 0.00 | tuple_allocate |
The big surprise was that the execution time was split evenly between two functions, tuple_cmp_ift() (which compares two tuples) and whitelist_dump_stream(), and the amusing bit is the disparity in the number of calls between the two: 661,477 calls to the former vs. the one call to the latter.

But in this case, the one call to whitelist_dump_stream() was made when the program ended (all it does is write out the tuples that have been whitelisted, so that single call does all of that I/O in one go). Remove that from the program, and we pretty much have our hotspot—tuple_cmp_ift().
The other surprise is that there weren't any real surprises. The test data consists of 27,155 tuples with some duplicates, and you can see that the various checks were called about that many times. The only reason edomain_search() was called as many times as it was is that there are four lists, each checked some 24,000 times. crc32() is called twice for each packet (yes, that's intentional), and each transaction involves both a request and a response, so divide its calls by four and it fits.

The code is pretty much even except for tuple_cmp_ift(), which is our obvious hotspot.
```c
int tuple_cmp_ift(const void *left,const void *right)
{
  const struct tuple *l = left;
  const struct tuple *r = right;
  int                 rc;

  ddt(left  != NULL);            /* similar to assert()       */
  ddt(right != NULL);            /* but logs to syslogd       */
  ddt(l->pad == 0xDECAFBAD);     /* compiled out of profiled  */
  ddt(r->pad == 0xDECAFBAD);     /* code                      */

  /*-------------------------------------------
  ; sizeof(l->ip) is 16 bytes, enough
  ; space to hold an IPv6 address; not
  ; that we use IPv6 currently, but planning
  ; for the future.
  ;-------------------------------------------*/

  if ((rc = memcmp(l->ip,r->ip,sizeof(l->ip))) != 0) return(rc);

  if (l->fromsize < r->fromsize)
    return(-1);
  else if (l->fromsize > r->fromsize)
    return(1);

  if ((rc = memcmp(l->from,r->from,l->fromsize)) != 0) return(rc);

  if (l->tosize < r->tosize)
    return(-1);
  else if (l->tosize > r->tosize)
    return(1);

  rc = memcmp(l->to,r->to,l->tosize);
  return(rc);
}
```
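Since tuple_cmp_ift() takes const void * arguments, it has the shape of a standard C comparison callback. I'm not showing tuple_search() here, and the profile alone doesn't say how it uses the comparator, but a typical arrangement is a sorted array probed with bsearch(), something like the sketch below (the header name, array and count variables are made-up names for illustration, not the daemon's actual code):

```c
#include <stdlib.h>
#include "tuple.h"   /* hypothetical header defining struct tuple
                        and declaring tuple_cmp_ift() */

/* Hypothetical lookup over a sorted array of tuples, using the
 * comparator above.  'tuple_table' and 'tuple_table_count' are
 * illustrative; the real daemon may store its tuples differently.
 */
extern struct tuple *tuple_table;        /* kept sorted by tuple_cmp_ift() */
extern size_t        tuple_table_count;

struct tuple *tuple_lookup(const struct tuple *key)
{
  /* each bsearch() probe calls the comparator once, which is how a
   * comparison routine racks up so many more calls than there are
   * searches */
  return bsearch(key,
                 tuple_table,
                 tuple_table_count,
                 sizeof(struct tuple),
                 tuple_cmp_ift);
}
```

Whether the real implementation is bsearch(), a binary tree, or something else entirely, the shape is the same: every lookup costs a handful of comparator calls, so a cheap tuple_cmp_ift() matters more than anything else in the search path.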
Now, before I spend any time trying to optimize this bit of code, I thought I'd rerun the program without it. I changed tuple_search() (which calls tuple_cmp_ift()) to simply return “not found,” and I removed the call to whitelist_dump_stream(). The results were even more amusing:
% time | cumulative seconds | self seconds | calls | self Ts/call | total Ts/call | name |
---|---|---|---|---|---|---|
0.00 | 0.00 | 0.00 | 140892 | 0.00 | 0.00 | edomain_cmp |
0.00 | 0.00 | 0.00 | 108668 | 0.00 | 0.00 | crc32 |
0.00 | 0.00 | 0.00 | 95853 | 0.00 | 0.00 | edomain_search |
0.00 | 0.00 | 0.00 | 27438 | 0.00 | 0.00 | report_stderr |
0.00 | 0.00 | 0.00 | 27273 | 0.00 | 0.00 | ipv4 |
0.00 | 0.00 | 0.00 | 27169 | 0.00 | 0.00 | check_signals |
0.00 | 0.00 | 0.00 | 27167 | 0.00 | 0.00 | send_packet |
0.00 | 0.00 | 0.00 | 27155 | 0.00 | 0.00 | ip_match |
0.00 | 0.00 | 0.00 | 27155 | 0.00 | 0.00 | send_reply |
0.00 | 0.00 | 0.00 | 27155 | 0.00 | 0.00 | type_graylist |
0.00 | 0.00 | 0.00 | 25458 | 0.00 | 0.00 | tuple_add |
0.00 | 0.00 | 0.00 | 25458 | 0.00 | 0.00 | tuple_allocate |
0.00 | 0.00 | 0.00 | 25458 | 0.00 | 0.00 | tuple_search |
0.00 | 0.00 | 0.00 | 3577 | 0.00 | 0.00 | StreamEOF |
Yeah, a lot of functions get called, but none of them accumulate enough time to even survive rounding.
I'm beginning to think that the 130 requests per second limit I'm seeing isn't a function of the code, but of the network stack. Given this result, I doubt I'll bother optimizing tuple_cmp_ift() any time soon, which seems to reinforce the whole “premature optimization is eeeeevil” thought, but in reality, I wrote the code around a few optimizations (namely, well-considered data structures) in the beginning and didn't need to optimize it later.
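If I wanted to test the network-stack theory, one crude way would be to see how many small request/reply round trips the stack alone can push through with no greylisting logic at all. The sketch below is my own illustration, not part of the daemon: it assumes the requests and replies are small UDP datagrams (a guess on my part, based on the send_packet/send_reply pattern) and just ping-pongs them over the loopback interface in a single process, counting round trips per second.

```c
/* Crude loopback benchmark: how many small UDP request/reply round
 * trips per second can the stack alone sustain?  Illustration only;
 * this is not the greylist daemon's protocol, and error checking is
 * omitted for brevity.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(void)
{
  struct sockaddr_in a = { .sin_family = AF_INET };
  a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  a.sin_port        = htons(0);              /* let the kernel pick ports */
  struct sockaddr_in b = a;

  int sa = socket(AF_INET,SOCK_DGRAM,0);     /* "client" side */
  int sb = socket(AF_INET,SOCK_DGRAM,0);     /* "daemon" side */
  bind(sa,(struct sockaddr *)&a,sizeof(a));
  bind(sb,(struct sockaddr *)&b,sizeof(b));

  socklen_t alen = sizeof(a);
  socklen_t blen = sizeof(b);
  getsockname(sa,(struct sockaddr *)&a,&alen);  /* recover chosen ports */
  getsockname(sb,(struct sockaddr *)&b,&blen);

  char   req[64] = "request";
  char   rep[64] = "reply";
  char   buf[64];
  time_t start   = time(NULL);
  long   count   = 0;

  while (time(NULL) - start < 5)             /* run for roughly 5 seconds */
  {
    sendto(sa,req,sizeof(req),0,(struct sockaddr *)&b,blen);
    recv  (sb,buf,sizeof(buf),0);            /* "daemon" reads the request */
    sendto(sb,rep,sizeof(rep),0,(struct sockaddr *)&a,alen);
    recv  (sa,buf,sizeof(buf),0);            /* "client" reads the reply   */
    count++;
  }

  printf("%ld round trips in ~5 seconds (about %ld/second)\n",count,count/5);

  close(sa);
  close(sb);
  return 0;
}
```

Loopback in a single blocking process isn't the same thing as real traffic hitting the daemon, but comparing a number like this against the 130 requests per second would at least say whether the stack itself is the ceiling or whether the limit lives somewhere else in the round trip.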