Wednesday, June 03, 2009
Waist deep in emails
I'm having a lot of fun writing the email indexing program, despite having to code around a few broken mbox files. I've also been surprised at what I've found so far (not in the “oh, I forgot about that email!” way but more in the “What the—?” way).
At first, I assumed that no email header would be longer than 64K, but no, turns out that isn't big enough. Turns out I have an email with a header that is 81,162 bytes in size, and it has enough email addresses (in the Cc: header) to populate a small mass-mailing list (and yes, it's spam).
I'm also tracking unique sets of headers and unique message bodies (via the SHA1 hashing function). There are 118 messages with the same body but with different headers, and the amusing bit is that the emails in question weren't spam! They're from a mailing list I used to run years ago where one of the members apparently changed his email address, and for a period of time each message that went out caused his automated system to send an update to the list.
And of course, he didn't unsubscribe his old email address.
Heh.
The tracking was done to keep from indexing duplicate emails (since my testing corpus is 1,600 mbox files, some of which may be backups—I don't know which ones though, which is part of the reason I'm writing this program) so in the end I should end up with a set of unique headers.
I got down to 16 emails with duplicate headers, but unique bodies.
That scared me.
A small digression: at this point, the program pulls each email out of the mbox file, and writes the headers into one file (the original, plus a few I add during processing, like the SHA1 hash results) and the body of the email into another file (my dad likes to send me photos and videos in email, so the bodies of those messages tend to be rather large, and I'm concentrating on the headers at the moment). I currently end up with about 50M of headers and almost a gigabyte-worth of email bodies. Now, continuing on …
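For the curious, the split-and-hash step could be sketched roughly like this in Python. This is not the actual program, just an illustration: the function name `split_and_hash` is made up, and it assumes the header block ends at the first blank line and that lines end in a bare LF.

```python
import hashlib

def split_and_hash(raw: bytes):
    """Split one raw message into headers and body, and SHA-1 each part.

    Assumption (not from the real program): the header block ends at the
    first blank line, and line endings are plain LF.
    """
    headers, _sep, body = raw.partition(b"\n\n")
    # Upper-case hex, to match the look of the X-SHA1-* values in this post.
    header_hash = hashlib.sha1(headers).hexdigest().upper()
    body_hash = hashlib.sha1(body).hexdigest().upper()
    return headers, body, header_hash, body_hash


if __name__ == "__main__":
    message = b"Subject: test\nContent-Type: text/plain\n\nStatus: RO\nHello\n"
    headers, body, header_hash, body_hash = split_and_hash(message)
    print("X-SHA1-Header:", header_hash)
    print("X-SHA1-Body:", body_hash)
```

Anything after that first blank line lands in the body, even if it happens to look like a header.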
I pick one of the duplicate hashes, scan for it, and then check the messages:
>find header_raw/ | xargs grep FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
./000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
./000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
>grep X-SHA1 header_raw/000008069 header_raw/000026823
header_raw/000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
header_raw/000008069:X-SHA1-Body: 5C823DD92D3DCDC5AD43953D72B1D60017A134D6
header_raw/000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
header_raw/000026823:X-SHA1-Body: 85584F0167666BAA506E41A3D9ED927227F0FEF0
>
(Note: I can't just grep PATTERN * because there are simply too many files (over 45,000), which exceeds the command line limit; that's why I use find and xargs.)
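The same search can also be done without building a giant command line at all, by walking the directory from a script. Here is a rough Python sketch of that idea; the function name `files_containing` is hypothetical, and it assumes each header file is small enough to read into memory in one go.

```python
import os

def files_containing(directory: str, needle: str):
    """Yield paths of files under `directory` whose contents include `needle`.

    Assumption: files are small enough to read whole, which seems reasonable
    for per-message header files.
    """
    needle_bytes = needle.encode("ascii")
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            path = os.path.join(root, name)
            with open(path, "rb") as fh:
                if needle_bytes in fh.read():
                    yield path


if __name__ == "__main__":
    wanted = "FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D"
    for path in files_containing("header_raw", wanted):
        print(path)
```

Since the file names never pass through a shell, the number of files doesn't matter.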
Okay, same headers, different body. Just what is going on here? I check the bodies:
>more body/000008069
Status: RO
Accept All Major Credit Cards!!!
Don't be fooled by the copycats. We are one of the original company's offering merchant credit card services for all kinds of business's. [sic]
This isn't looking good—it looks like my header parsing code is missing a header. What about the other email?
>more body/000026823
Status: RO
Content-Length: 2815
Lines: 104
Accept All Major Credit Cards!!!
Don't be fooled by the copycats. We are one of the original company's offering merchant credit card services for all kinds of business's. [sic]
Okay, check the mbox files to see what's messing up the header parsing. What I find actually reassures me:
From cherylg1582@msn.com Wed Dec 12 14:13:00 2001
Return-Path: <cherylg1582@msn.com>
Received: from gig.armigeron.com ([204.29.162.10])
    by conman.org (8.8.7/8.8.7) with ESMTP id OAA06543
    for <spc.wopr@conman.org>; Wed, 12 Dec 2001 14:12:59 -0500
Received: from mercury.aibusiness.net (emi.net [208.10.128.2] (may be forged))
    by gig.armigeron.com (8.11.0/8.11.0) with ESMTP id fBCJ8Aa31356
    for <spc@armigeron.com>; Wed, 12 Dec 2001 14:08:10 -0500
Received: from domainmail.ionet.net (domainmail.ionet.net [206.41.128.18])
    by mercury.aibusiness.net (8.9.3/8.9.3) with ESMTP id NAA19835
    for <spc@emi.net>; Wed, 12 Dec 2001 13:52:26 -0500
Received: from kqyfqkpby.motor.com (r145h250.afo.net [209.149.145.250] (may be forged))
    by domainmail.ionet.net (8.9.1a/8.7.3) with SMTP id MAA02841;
    Wed, 12 Dec 2001 12:38:11 -0600 (CST)
Date: Wed, 12 Dec 2001 12:38:11 -0600 (CST)
Message-Id: <200112121838.MAA02841@domainmail.ionet.net>
From: "griffin" <griffinfpzwrhlllngc@aol.com>
Subject: No fee! Accept Credit Cards for the Holidays! (bbjlm)
Reply-To: elicasabona1787@mailexcel.com
MIME-Version: 1.0
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD NSCPCD47 (Win98; I)
Content-Type: text/plain

Status: RO
Accept All Major Credit Cards!!!
It wasn't my code (thank God! The parsing code is getting a bit convoluted at this point), but some clueless spammer trying to add additional headers in the body of the message (the other one was the same). So I'll assume the other 14 “duplicates” are similar in nature—spammers trying to be clever.
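Rather than grepping for each remaining hash by hand, the "same headers, different body" cases can be pulled out in one pass. The following Python sketch is illustrative only: `conflicting_headers` is a made-up name, and it assumes the (header hash, body hash) pairs have already been collected from the X-SHA1-Header and X-SHA1-Body lines.

```python
from collections import defaultdict

def conflicting_headers(pairs):
    """Map each header hash seen with more than one distinct body hash
    to the sorted list of those body hashes.

    `pairs` is any iterable of (header_hash, body_hash) tuples.
    """
    bodies = defaultdict(set)
    for header_hash, body_hash in pairs:
        bodies[header_hash].add(body_hash)
    return {h: sorted(b) for h, b in bodies.items() if len(b) > 1}


if __name__ == "__main__":
    # The two messages examined above.
    pairs = [
        ("FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D",
         "5C823DD92D3DCDC5AD43953D72B1D60017A134D6"),
        ("FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D",
         "85584F0167666BAA506E41A3D9ED927227F0FEF0"),
    ]
    for header_hash, body_hashes in conflicting_headers(pairs).items():
        print(header_hash, "->", len(body_hashes), "distinct bodies")
```

Each group it reports still has to be eyeballed to confirm the extra "headers" really do sit in the body.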
And now, back to coding …