Wednesday, October 24, 2007
144 points of failure
I'm not even sure where to begin with this.
A customer is having a problem with duplicate emails going out about a month after the originals were sent, and it's causing the recipients to freak out (since they can't be bothered to check the date and see that it's either a duplicate or a very late email message).
Our problem is obtaining the information we need to troubleshoot this. Our customer has no idea what “email headers” are (but then again, our customer has no idea what “a program” is or how she even checks her email) and doesn't want to bother the recipients with such details.
The real problem?
The sheer number of participants in exchanging an email between two parties. Between the two, in this case, are at least four operating systems (running on the customer's computer, our computer, the recipient's email server, and the recipient's computer), six networks (customer's local network, their ISP, our network, the network of the recipient's email server, the recipient's ISP, and the recipient's local network) across an unknown number of routers, and at least six programs (customer's email client, incoming and outgoing mail daemons on our server, incoming mail daemon on recipient's server, the mailbox daemon on the recipient's email server, and the recipient's email client). Any one of those could have a minor problem that causes duplicate emails to be sent (and I'll spare you those details).
It's amazing that this crazy patchwork of servers, networks and software works at all, but boy, when it breaks, it breaks in very odd ways. I'm sure that the problem is understandable once we figure out what went wrong, but how to determine what went wrong? Especially after the fact?
Our log files don't go back that far, and what we do have is 10G worth (and due to how sendmail logs emails, an individual email generates at minimum three lines of logging information, and good luck trying to piece all that together).
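For what it's worth, the piecing-together part can at least be roughed out. Below is a minimal Python sketch, assuming the usual syslog shape of a sendmail line (`sendmail[pid]: queue-id: key=value, ...`) with the `msgid=` field showing up on the `from=` line. The sample lines, hostnames, addresses, and queue IDs are all made up for illustration, and the function names are my own. It groups lines by queue ID, then reports any Message-ID that was logged under more than one queue ID, which is one way a duplicate might show up in the logs.

```python
import re
from collections import defaultdict

# Assumed line shape: "<timestamp> <host> sendmail[<pid>]: <queue-id>: <fields>"
LINE_RE = re.compile(r"sendmail\[\d+\]: (\w+): (.*)$")
MSGID_RE = re.compile(r"msgid=(<[^>]*>)")

# Hypothetical sample data; real logs will carry more fields per line.
SAMPLE = [
    "Sep 20 09:15:01 mail sendmail[1001]: l8KDF1aa001001: from=<alice@example.com>, size=1024, nrcpts=1, msgid=<abc123@example.com>, relay=client.example.com [192.0.2.10]",
    "Sep 20 09:15:02 mail sendmail[1002]: l8KDF1aa001001: to=<bob@example.org>, delay=00:00:01, stat=Sent (ok)",
    "Oct 22 14:30:00 mail sendmail[2001]: l9MIU0bb002001: from=<alice@example.com>, size=1024, nrcpts=1, msgid=<abc123@example.com>, relay=client.example.com [192.0.2.10]",
    "Oct 22 14:30:01 mail sendmail[2002]: l9MIU0bb002001: to=<bob@example.org>, delay=00:00:01, stat=Sent (ok)",
]


def group_by_queue_id(lines):
    """Collect the field portion of each log line under its queue ID."""
    groups = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line.rstrip("\n"))
        # NOQUEUE lines are not tied to any one message, so skip them.
        if m and m.group(1) != "NOQUEUE":
            groups[m.group(1)].append(m.group(2))
    return dict(groups)


def repeated_message_ids(groups):
    """Return {msgid: [queue IDs]} for Message-IDs seen under several queue IDs."""
    seen = defaultdict(set)
    for qid, entries in groups.items():
        for entry in entries:
            m = MSGID_RE.search(entry)
            if m:
                seen[m.group(1)].add(qid)
    return {msgid: sorted(qids) for msgid, qids in seen.items() if len(qids) > 1}


if __name__ == "__main__":
    for msgid, qids in repeated_message_ids(group_by_queue_id(SAMPLE)).items():
        print(msgid, "was logged under queue IDs:", ", ".join(qids))
```

Of course, this only helps if the logs still cover both the original and the duplicate, and it says nothing about duplicates created somewhere past our server.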
I think what I'm grousing about is my inability to fully troubleshoot the issue. The participants aren't necessarily technically inclined (which makes it difficult to get help from them, or even real solid information), and it involves more than just us. And somehow, it's our fault.
Oops. Gotta go. Yet another email issue to troubleshoot.
Now, where did I put my gun?