The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Wednesday, June 03, 2009

Waist deep in emails

I'm having a lot of fun writing the email indexing program, despite having to code around a few broken mbox files. I've also been surprised at what I've found so far (not in the “oh, I forgot about that email!” way but more in the “What the—?” way).

At first, I assumed that no email header would be longer than 64K, but no, that isn't big enough. It turns out I have an email with a header that is 81,162 bytes in size, and it has enough email addresses (in the Cc: header) to populate a small mass-mailing list (and yes, it's spam).
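Here's a rough sketch of reading a header block without a fixed-size buffer. It's in Python, which is not necessarily what the real program is written in, and `read_header_block` plus the example.com addresses are made up for illustration. It assumes the header simply ends at the first blank line.

```python
import io

def read_header_block(stream):
    """Collect every line up to (but not including) the first blank line."""
    lines = []
    for line in stream:
        if line in (b"\n", b"\r\n"):   # a blank line ends the header
            break
        lines.append(line)
    return b"".join(lines)

# Hypothetical message with an oversized, folded Cc: header.
addresses = [b"user%d@example.com" % i for i in range(5000)]
cc = b"Cc: " + b",\n\t".join(addresses) + b"\n"
message = b"From: someone@example.com\n" + cc + b"\nbody text\n"

stream = io.BytesIO(message)
header = read_header_block(stream)   # grows as needed, no 64K ceiling
body = stream.read()                 # whatever follows the blank line
```

Since the lines are accumulated in a list, the only limit on header size is available memory.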

I'm also tracking unique sets of headers and unique message bodies (via the SHA1 hashing function). There are 118 messages with the same body but with different headers, and the amusing bit is that the emails in question weren't spam! They're from a mailing list I used to run years ago where one of the members apparently changed his email address, and for a period of time each message that went out caused his automated system to send an update to the list.

And of course, he didn't unsubscribe his old email address.


The tracking was done to keep from indexing duplicate emails (since my testing corpus is 1,600 mbox files, some of which may be backups—I don't know which ones though, which is part of the reason I'm writing this program) so in the end I should end up with a set of unique headers.
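The duplicate tracking could be sketched like this in Python. The names `fingerprint` and `Deduplicator` are hypothetical, and two things are assumptions rather than facts about the real program: that the hashes cover the raw header and body bytes split at the first blank line, and that a message counts as a duplicate only when both hashes match.

```python
import hashlib

def sha1_hex(data):
    """Uppercase hex SHA1 digest of a bytes object."""
    return hashlib.sha1(data).hexdigest().upper()

def fingerprint(raw_message):
    """Return (header hash, body hash), splitting at the first blank line."""
    header, _, body = raw_message.partition(b"\n\n")
    return sha1_hex(header), sha1_hex(body)

class Deduplicator:
    """Remembers every (header hash, body hash) pair seen so far."""

    def __init__(self):
        self.seen = set()

    def is_new(self, raw_message):
        key = fingerprint(raw_message)
        if key in self.seen:
            return False      # exact duplicate, e.g. from a backup mbox
        self.seen.add(key)
        return True

# Made-up messages: same body, different headers, then a repeat of the first.
first = b"Subject: hi\n\nhello\n"
second = b"Subject: hi (again)\n\nhello\n"
dedup = Deduplicator()
results = [dedup.is_new(m) for m in (first, second, first)]
```

Keeping the two hashes separate is what makes cases like "same body, different headers" visible at all; a single hash over the whole message would hide them.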

I got down to 16 emails with duplicate headers, but unique bodies.

That scared me.

A small digression: at this point, the program pulls each email out of the mbox file, and writes the headers into one file (the original, plus a few I add during processing, like the SHA1 hash results) and the body of the email into another file (my dad likes to send me photos and videos in email, so the bodies of those messages tend to be rather large, and I'm concentrating on the headers at the moment). I currently end up with about 50M of headers and almost a gigabyte-worth of email bodies. Now, continuing on …
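(For the curious, the digression above might look roughly like this in Python. The directory names `header_raw` and `body`, the nine-digit file names, and the `X-SHA1-Header`/`X-SHA1-Body` lines come from the output shown in this entry; the function `store_message`, its parameters, and the exact placement of the added lines are assumptions.)

```python
import hashlib
import os
import tempfile

def sha1_hex(data):
    return hashlib.sha1(data).hexdigest().upper()

def store_message(root, number, header, body):
    """Write one message as root/header_raw/NNNNNNNNN and root/body/NNNNNNNNN.

    `header` is assumed to end with a newline; the two hash lines are
    appended after the original header lines.
    """
    name = "%09d" % number
    extra = b"X-SHA1-Header: " + sha1_hex(header).encode("ascii") + b"\n"
    extra += b"X-SHA1-Body: " + sha1_hex(body).encode("ascii") + b"\n"
    for subdir, data in (("header_raw", header + extra), ("body", body)):
        directory = os.path.join(root, subdir)
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, name), "wb") as out:
            out.write(data)
    return name

# Throwaway directory so the sketch doesn't touch real mail.
root = tempfile.mkdtemp()
name = store_message(root, 8069, b"Subject: test\n",
                     b"Status: RO\n\nAccept All Major Credit Cards!!!\n")
```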

I pick one of the duplicate hashes, scan for it, and then check the messages:

>find header_raw/ | xargs grep FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
./000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
./000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
>grep X-SHA1 header_raw/000008069 header_raw/000026823
header_raw/000008069:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
header_raw/000008069:X-SHA1-Body: 5C823DD92D3DCDC5AD43953D72B1D60017A134D6
header_raw/000026823:X-SHA1-Header: FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D
header_raw/000026823:X-SHA1-Body: 85584F0167666BAA506E41A3D9ED927227F0FEF0

(Note: I can't just grep PATTERN * because there are simply too many files (over 45,000), and the expanded file list exceeds the command line length limit. That's why I use find and xargs.)
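The same check can be done for every header hash at once, and without any command line limit, by walking the directory from a script. This is only a sketch in Python: `read_hashes` and `same_header_different_body` are hypothetical names, and it assumes each header file contains exactly one `X-SHA1-Header` and one `X-SHA1-Body` line as in the output above. The demo uses the two hash pairs quoted above plus one made-up control file.

```python
import os
import tempfile
from collections import defaultdict

WANTED = (b"X-SHA1-Header", b"X-SHA1-Body")

def read_hashes(path):
    """Return {field name: hash} for the X-SHA1-* lines in one header file."""
    found = {}
    with open(path, "rb") as f:
        for line in f:
            name, sep, value = line.partition(b":")
            if sep and name in WANTED:
                found[name] = value.strip().decode("ascii")
    return found

def same_header_different_body(directory):
    """Map header hash -> file names, for hashes seen with 2+ body hashes."""
    files = defaultdict(list)
    bodies = defaultdict(set)
    with os.scandir(directory) as entries:   # lazy iteration, no argv involved
        for entry in entries:
            if not entry.is_file():
                continue
            found = read_hashes(entry.path)
            if len(found) != len(WANTED):
                continue
            header_hash = found[b"X-SHA1-Header"]
            files[header_hash].append(entry.name)
            bodies[header_hash].add(found[b"X-SHA1-Body"])
    return {h: sorted(n) for h, n in files.items() if len(bodies[h]) > 1}

# Demo in a throwaway directory; 000000001 is a made-up control file.
demo = tempfile.mkdtemp()
samples = {
    "000008069": ("FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D",
                  "5C823DD92D3DCDC5AD43953D72B1D60017A134D6"),
    "000026823": ("FFCC3E0BCBF960EBBEA583E77E51CE0CEB59E04D",
                  "85584F0167666BAA506E41A3D9ED927227F0FEF0"),
    "000000001": ("0" * 40, "1" * 40),
}
for name, (header_hash, body_hash) in samples.items():
    with open(os.path.join(demo, name), "w") as f:
        f.write("Subject: demo\nX-SHA1-Header: %s\nX-SHA1-Body: %s\n"
                % (header_hash, body_hash))

report = same_header_different_body(demo)
```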

Okay, same headers, different body. Just what is going on here? I check the bodies:

>more body/000008069
Status: RO

Accept All Major Credit Cards!!!

Don't be fooled by the copycats. We are one of the original company's
offering merchant credit card services for all kinds of business's. [sic]

This isn't looking good—it looks like my header parsing code is missing a header. What about the other email?

>more body/000026823
Status: RO
Content-Length: 2815
Lines: 104

Accept All Major Credit Cards!!!

Don't be fooled by the copycats. We are one of the original company's
offering merchant credit card services for all kinds of business's. [sic]

Okay, check the mbox files to see what's messing up the header parsing. What I find actually reassures me:

From  Wed Dec 12 14:13:00 2001
Return-Path: <>
Received: from ([])
        by (8.8.7/8.8.7) with ESMTP id OAA06543
        for <>; Wed, 12 Dec 2001 14:12:59 -0500
Received: from ( [] 
	(may be forged))
        by (8.11.0/8.11.0) with ESMTP id fBCJ8Aa31356
        for <>; Wed, 12 Dec 2001 14:08:10 -0500
Received: from ( [])
        by (8.9.3/8.9.3) with ESMTP id NAA19835
        for <>; Wed, 12 Dec 2001 13:52:26 -0500
Received: from ( [] 
	(may be forged)) 
	by (8.9.1a/8.7.3) with SMTP id MAA02841;
	Wed, 12 Dec 2001 12:38:11 -0600 (CST)
Date: Wed, 12 Dec 2001 12:38:11 -0600 (CST)
Message-Id: <>
From: "griffin" <>
Subject: No fee! Accept Credit Cards for the Holidays!      (bbjlm)
MIME-Version: 1.0
X-Mailer: Mozilla 4.7 [en]C-CCK-MCD NSCPCD47  (Win98; I)
Content-Type: text/plain

Status: RO

Accept All Major Credit Cards!!!

It wasn't my code (thank God! The parsing code is getting a bit convoluted at this point), but some clueless spammer trying to add additional headers in the body of the message (the other one was the same). So I'll assume the other 14 “duplicates” are similar in nature—spammers trying to be clever.
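One way to confirm that guess for the remaining cases would be to peel header-looking lines off the top of each body and compare what's left. A minimal sketch in Python follows; `split_pseudo_headers` is a hypothetical helper, the pattern for "looks like a header" is an assumption, and the sample bodies contain only the first visible line of the spam, so this doesn't prove the two full bodies are identical.

```python
import re

# A field name (printable ASCII without ':'), a colon, then a space or tab.
HEADER_LINE = re.compile(rb"[!-9;-~]+:[ \t]")

def split_pseudo_headers(body):
    """Return (pseudo_headers, rest) when the body opens with header-like
    lines followed by a blank line; otherwise return ([], body)."""
    lines = body.splitlines(keepends=True)
    pseudo = []
    for index, line in enumerate(lines):
        if line.strip() == b"":
            if pseudo:
                return pseudo, b"".join(lines[index + 1:])
            break
        if not HEADER_LINE.match(line):
            break
        pseudo.append(line.rstrip(b"\r\n"))
    return [], body

# Abbreviated versions of the two bodies shown above.
first = b"Status: RO\n\nAccept All Major Credit Cards!!!\n"
second = (b"Status: RO\nContent-Length: 2815\nLines: 104\n\n"
          b"Accept All Major Credit Cards!!!\n")

first_fake, first_rest = split_pseudo_headers(first)
second_fake, second_rest = split_pseudo_headers(second)
```

With the fake headers stripped, the two abbreviated bodies compare equal, which is what you'd expect if the only difference is what the spammer stuffed in up front.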

And now, back to coding …

Copyright © 1999-2022 by Sean Conner. All Rights Reserved.