The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Monday, June 01, 2009

And here I thought dating was easy …

I'm building on the work of indexing my filesystem by indexing all of my email. I have a ton of it spread across various directories, and whenever I have to search for something (such as the time I flamed an entire department at FAU on a public mailing list; ah, those were the days), it's a long, drawn-out ordeal to find it.

My initial stab at the problem is to index just a few email headers, like From:, To: (and the related Cc:), Date: and Subject: (the primary headers one would be interested in).

I decided to tackle one of the harder fields, From:, first. While the format is specified in RFC-822 and RFC-2822, there's still enough variance in practice to be annoying.

I was able to squish 23 different formats into four cases:

  1. email address and real name aren't delimited, in which case, the only thing to parse is the email address;
  2. email isn't delimited, but the real name is (between parentheses, or quotes), so extract the real name from between the delimiters, and anything that isn't delimited is the email address;
  3. email is delimited (between angle brackets or square brackets), but the real name isn't, so extract the email address, and anything that isn't delimited is the real name;
  4. both the email address and real name are delimited, so it's trivial to extract both.
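The four cases above can be sketched in C roughly as follows. This is only an illustration under some simplifying assumptions, not the code actually used for the index: the function name split_from(), its helpers and the fixed 512-byte scratch buffer are all made up here, and only the first delimited span of each kind is considered.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Locate the first open...close pair in s; b and e point at the delimiters. */
static int find_span(const char *s, char open, char close,
                     const char **b, const char **e)
{
    const char *o = strchr(s, open);
    const char *c = (o != NULL) ? strchr(o + 1, close) : NULL;
    if (c == NULL) return 0;
    *b = o;
    *e = c;
    return 1;
}

/* Copy [b,e) into dst without surrounding white space, truncating to fit. */
static void copy_trimmed(char *dst, size_t dstsz, const char *b, const char *e)
{
    while (b < e && isspace((unsigned char)*b)) b++;
    while (e > b && isspace((unsigned char)e[-1])) e--;
    size_t n = (size_t)(e - b);
    if (n >= dstsz) n = dstsz - 1;
    memcpy(dst, b, n);
    dst[n] = '\0';
}

/* Copy s into dst, leaving out the delimited span cutb..cute (inclusive). */
static void copy_without(char *dst, size_t dstsz, const char *s,
                         const char *cutb, const char *cute)
{
    char tmp[512];                    /* hypothetical limit; longer input is cut */
    size_t n = 0;
    for (const char *p = s; *p != '\0' && n < sizeof tmp - 1; p++)
        if (p < cutb || p > cute) tmp[n++] = *p;
    copy_trimmed(dst, dstsz, tmp, tmp + n);
}

/* Returns which of the four cases (1..4) the From: value fell into. */
int split_from(const char *value, char *name, size_t namesz,
               char *email, size_t emailsz)
{
    const char *nb = NULL, *ne = NULL, *eb = NULL, *ee = NULL;
    assert(value != NULL && name != NULL && email != NULL);
    assert(namesz > 0 && emailsz > 0);

    int has_email = find_span(value, '<', '>', &eb, &ee)
                 || find_span(value, '[', ']', &eb, &ee);
    int has_name  = find_span(value, '"', '"', &nb, &ne)
                 || find_span(value, '(', ')', &nb, &ne);

    name[0] = email[0] = '\0';
    if (has_email) copy_trimmed(email, emailsz, eb + 1, ee);
    if (has_name)  copy_trimmed(name, namesz, nb + 1, ne);

    if (has_email && has_name) return 4;
    if (has_email) { copy_without(name, namesz, value, eb, ee); return 3; }
    if (has_name)  { copy_without(email, emailsz, value, nb, ne); return 2; }
    copy_trimmed(email, emailsz, value, value + strlen(value));
    return 1;
}
```

Called on a value such as `Jane Doe <jane@example.org>` (an invented address), it reports case 3 and fills in both buffers; anything stranger, such as nested comments or several addresses, would need a real RFC-2822 parser.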

Then, I decided to parse the Date: header. Now, this is specified, quite plainly:

5.  DATE AND TIME SPECIFICATION

5.1.  SYNTAX

date-time   =  [ day "," ] date time        ; dd mm yy
                                            ;  hh:mm:ss zzz

day         =  "Mon"  / "Tue" /  "Wed"  / "Thu"
            /  "Fri"  / "Sat" /  "Sun"

date        =  1*2DIGIT month 2DIGIT        ; day month year
                                            ;  e.g. 20 Jun 82

month       =  "Jan"  /  "Feb" /  "Mar"  /  "Apr"
            /  "May"  /  "Jun" /  "Jul"  /  "Aug"
            /  "Sep"  /  "Oct" /  "Nov"  /  "Dec"

time        =  hour zone                    ; ANSI and Military

hour        =  2DIGIT ":" 2DIGIT [":" 2DIGIT]
                                            ; 00:00:00 - 23:59:59

zone        =  "UT"  / "GMT"                ; Universal Time
                                            ; North American : UT
            /  "EST" / "EDT"                ;  Eastern:  - 5/ - 4
            /  "CST" / "CDT"                ;  Central:  - 6/ - 5
            /  "MST" / "MDT"                ;  Mountain: - 7/ - 6
            /  "PST" / "PDT"                ;  Pacific:  - 8/ - 7
            /  1ALPHA                       ; Military: Z = UT;
                                            ;  A:-1; (J not used)
                                            ;  M:-12; N:+1; Y:+12
            / ( ("+" / "-") 4DIGIT )        ; Local differential
                                            ;  hours+min. (HHMM)

STANDARD FOR ARPA INTERNET TEXT MESSAGES, § 5.1

Okay, clear if you're into such things. And from the most recent specification:


date-time       =       [ day-of-week "," ] date FWS time [CFWS]
   
day-of-week     =       ([FWS] day-name) / obs-day-of-week

day-name        =       "Mon" / "Tue" / "Wed" / "Thu" /
                        "Fri" / "Sat" / "Sun"

date            =       day month year
   
year            =       4*DIGIT / obs-year

month           =       (FWS month-name FWS) / obs-month

month-name      =       "Jan" / "Feb" / "Mar" / "Apr" /
                        "May" / "Jun" / "Jul" / "Aug" /
                        "Sep" / "Oct" / "Nov" / "Dec"

day             =       ([FWS] 1*2DIGIT) / obs-day

time            =       time-of-day FWS zone

time-of-day     =       hour ":" minute [ ":" second ]

hour            =       2DIGIT / obs-hour

minute          =       2DIGIT / obs-minute

second          =       2DIGIT / obs-second

zone            =       (( "+" / "-" ) 4DIGIT) / obs-zone

RFC-2822: Internet Message Format, § 3.3

Really, the only things this does are mandate that the year be at least four digits long (4*DIGIT), move to a numeric-only timezone format and clarify a bit where white space appears; otherwise, it's pretty much the same as the older spec.
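The two zone grammars quoted above are small enough to sketch as one lookup in C. This is an illustration only: zone_to_minutes() is a made-up name, the table holds just the named zones listed in the RFC-822 excerpt, and the single-letter military zones are deliberately rejected as a simplification.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Convert a zone token to minutes east of UT; returns 1 on success, 0 if
 * the token is not one of the forms handled here. */
int zone_to_minutes(const char *zone, int *minutes)
{
    static const struct { const char *name; int hours; } named[] = {
        { "UT",   0 }, { "GMT",  0 },
        { "EST", -5 }, { "EDT", -4 },
        { "CST", -6 }, { "CDT", -5 },
        { "MST", -7 }, { "MDT", -6 },
        { "PST", -8 }, { "PDT", -7 },
    };
    assert(zone != NULL && minutes != NULL);

    if (zone[0] == '+' || zone[0] == '-') {
        /* Local differential: sign followed by exactly four digits (HHMM).
         * The loop stops at the first non-digit, which includes the NUL. */
        for (int i = 1; i <= 4; i++)
            if (!isdigit((unsigned char)zone[i])) return 0;
        if (zone[5] != '\0') return 0;
        int hh = (zone[1] - '0') * 10 + (zone[2] - '0');
        int mm = (zone[3] - '0') * 10 + (zone[4] - '0');
        if (mm > 59) return 0;
        *minutes = (zone[0] == '-' ? -1 : 1) * (hh * 60 + mm);
        return 1;
    }

    for (size_t i = 0; i < sizeof named / sizeof named[0]; i++) {
        if (strcmp(zone, named[i].name) == 0) {
            *minutes = named[i].hours * 60;
            return 1;
        }
    }
    return 0;
}
```

With this, `-0400` and `EDT` both come out as -240 minutes, while something like `Pacific Daylight Time` is simply reported as unrecognized.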

So, if I ignore the timezone for now (because the Standard C library has such piss-poor support for it, but that's a rant for another time), the only real issue is handling two or four digit years.

And in poking around the man pages for the various Standard C library routines, I came across strptime(), which is the functional opposite of strftime()—instead of converting the time to a human representation, it'll take a human representation and convert it to a time value. It isn't a Standard C call, but hey, why not use it for now?
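As a rough sketch of what that looks like in C: the wrapper name parse_datetime() and the format string are assumptions made for illustration (four-digit year, seconds present, no leading day of the week), and the feature-test macro is what glibc typically wants before it will declare strptime().

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare strptime() */
#include <assert.h>
#include <string.h>
#include <time.h>

/* Parse "day month year hh:mm:ss" into *out.  Returns a pointer to the
 * first unparsed character (normally the zone), or NULL on failure. */
const char *parse_datetime(const char *text, struct tm *out)
{
    assert(text != NULL && out != NULL);
    memset(out, 0, sizeof *out);    /* strptime() only sets what it parses */
    return strptime(text, "%d %b %Y %H:%M:%S", out);
}
```

Feeding the resulting struct tm back through strftime() with the same format string should reproduce the date and time portion of the input, which is the sense in which the two calls are opposites.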

And it appears that the two-digit/four-digit year isn't a problem for strptime():

When a century is not otherwise specified, values in the range [69,99] shall refer to years 1969 to 1999 inclusive, and values in the range [00,68] shall refer to years 2000 to 2068 inclusive; leading zeros shall be permitted but shall not be required.

man page for strptime()
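The pivot rule in that excerpt is simple enough to write out by hand. The following C sketch assumes the year has already been isolated as a string of digits; parse_year() is a made-up name, and anything that is not exactly two or four digits is rejected rather than guessed at.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Convert a two- or four-digit year token to a full year.
 * Two-digit values use the pivot from the man page excerpt:
 * 69..99 -> 1969..1999, 00..68 -> 2000..2068.
 * Returns 1 on success, 0 if the token has any other shape. */
int parse_year(const char *digits, int *year)
{
    assert(digits != NULL && year != NULL);

    size_t len = strlen(digits);
    if (len != 2 && len != 4) return 0;

    int value = 0;
    for (size_t i = 0; i < len; i++) {
        if (!isdigit((unsigned char)digits[i])) return 0;
        value = value * 10 + (digits[i] - '0');
    }

    if (len == 2)
        value += (value >= 69) ? 1900 : 2000;

    *year = value;
    return 1;
}
```

So `82` becomes 1982 and `05` becomes 2005, while a four-digit year passes through unchanged.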

Sounds perfect!

Only it blew up when it encountered Wen, 2 Mar 2005 01:39:42 +0000.

Sigh.

Okay, make sure I start parsing past the optional day of the week. It then blew up on Sat Mar 5 18:58:36 2005.

What the …? That's not even a standard format! And then there was Wed,19 十二月 2001 20:23:05 (I originally replaced the month with question marks because I couldn't determine the character set that was used; there's nothing in that particular email that even hints at what language it might be. I've since found out which language: Chinese. Figures).

And let's not forget 9/8/99 1:01:12 AM Pacific Daylight Time (lovely) or Fri Jun 28 10:07:44 PDT 2002 or even Wed 8-Jan-2003 08:24:20.

Oh, and we mustn't forget Tue, 23 May 100 22:18:56 -0400.

Double sigh.
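One possible way to cope with a zoo like this is to try a short list of strptime() formats in turn, after skipping past anything up to the first comma. The C sketch below is a guess at such an approach, not a description of the indexer's code: parse_date_lenient() and the format table are invented for illustration, and the table only covers some of the layouts quoted above.

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare strptime() */
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Candidate layouts, tried in order (a hypothetical list). */
static const char *const date_formats[] = {
    "%d %b %Y %H:%M:%S",      /* 2 Mar 2005 01:39:42     */
    "%a %b %d %H:%M:%S %Y",   /* Sat Mar 5 18:58:36 2005 */
    "%a %d-%b-%Y %H:%M:%S",   /* Wed 8-Jan-2003 08:24:20 */
    "%m/%d/%y %I:%M:%S %p",   /* 9/8/99 1:01:12 AM       */
};

/* Returns the index of the format that matched, or -1 if none did.
 * Trailing text (such as the zone) is ignored. */
int parse_date_lenient(const char *text, struct tm *out)
{
    assert(text != NULL && out != NULL);

    /* Whatever precedes the first comma is taken to be the day of the
     * week and skipped unread, so a misspelling like "Wen" does no harm. */
    const char *comma = strchr(text, ',');
    if (comma != NULL) text = comma + 1;
    while (isspace((unsigned char)*text)) text++;

    for (size_t i = 0; i < sizeof date_formats / sizeof date_formats[0]; i++) {
        struct tm tm;
        memset(&tm, 0, sizeof tm);
        if (strptime(text, date_formats[i], &tm) != NULL) {
            *out = tm;
            return (int)i;
        }
    }
    return -1;
}
```

This does nothing for a month name in another language or a zone stuck in the middle of the date, and a year like 100 would still need its own sanity check after parsing.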

I found it amazing—one of the more strictly defined fields in an email and yet there still was an amazing amount of garbage to be found (although to be fair, these anomalies account for less than one per cent of all the emails scanned, but when you have thousands of emails, it can still add up).

(And one more interesting note—I did not see one email use the military time zone format.)

Obligatory Miscellaneous

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site: http://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

http://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2019 by Sean Conner. All Rights Reserved.