Since Friday, I've been trying to track down an email problem, and I still haven't been able to solve it. Part of it was trying to figure out what the problem was from the initial reporting of it:
Happening again with XXXXXXXXXXX XXXX—this time the email was from EVE [I'll be referring to the individual email addresses by name—they have no relation to the actual email addresses involved, but will make it easier to follow along. Trust me on this, it gets confusing] and it did not come through. Any suggestion?
It took about four exchanges and checking the account information before I had a handle on what exactly was happening. The actual events were something like this: DAVE sends a message to CAROL. The address CAROL then automatically forwards the email to ALICE (a local mailbox on the server) and BOB (somewhere else on the Internet). DAVE's messages get through. EVE (from the same domain as DAVE incidently) sends a message to CAROL. The message is delivered to ALICE, but not to BOB.
I have to figure out why.
And that, my friends, is easier said than done.
We have mail logs going back about a month. Five log files (including the current one) in total, and each one averages about 800M in size (about 4,000,000 lines of logging information each). Log files. Which means a linear search through the file for any type of information needed. Yeah, it takes time, but I can do other things while I'm running a search. The major problem is that the information I need to check a single email through the system is spread out throughout the log. For instance, it took me about 10 minutes to fish the following information out of the log files for a single message (manual search, output formatted by hand):
message-id: 200603271643.k2RGhesn032068 queue-id: k2RGheP6012025 from: DAVE to: CAROL stat: queued at: 11:43:40 queue-id: k2RGhesn032068 from: DAVE to: CAROL stat: sent at: 11:43:40 queue-id: k2RGheP6012025 to: ALICE stat: sent at: 11:43:42 queue-id: k2RGhgR2032089 from: DAVE to: ALICE to: BOB stat: sent at: 11:43:45
This single message took a total of five seconds to be processed. In
that five seconds, the information was spread across eight lines of the log
file, among 30 such records made in that five second period. It took me
four searches to retrieve the eight lines from the log file. You see, each
email message (and this is true globally) has a unique
message-id, but sadly, only two of the log records had the
message-id. The message ended up with four different
queue-ids (and that's my term for it) within the system (which
is why it took multiple searches to obtain all the information).
And since we don't have a program to analyse the mail logs, this has to be done manually right now.
It also turns out that the mail from EVE isn't logged as being from EVE,
but from MALLET, due to the software EVE's using to send email. DAVE is
also being logged as coming from MALLET (same software, or email server),
which is further confusing the issue (and since the message
200603271643.k2RGhesn032068 was delivered sucessfully, I'm
assuming it's from DAVE and not EVE).
Even the thought of writing a program to analyze the logs is giving me a headache.
Update on Tuesday, March 28th, 2006
Never did figure out where the supposed missing emails went.