Taking away the spam filter's Little Orphan Annie Secret Decoder Ring

Wednesday, April 27, 2005

A few months ago I wrote about some character encoding problems I was having, namely that it was a real mess on the web. But apparently, it's not a mess with email.

We have a dedicated computer that does nothing but filter spam (and the statistics from that are depressing); you can add additional filtering via regular expressions. Smirk has been receiving quite a bit of foreign spam, stuff in Russian, Korean, Chinese, which he can't even read since it's in Cyrillic, Wansung and Hangul. But some (if not most) of that email had subject lines like:

Subject: =?Windows-1251?B?amFlQGxlZWhvbS5uZXQg?=

where the character set is named right in the subject line (Windows-1251 is a Cyrillic encoding, and the B marks the text that follows as Base64). So Smirk thought a regular expression like ^Subject: .*Windows-1251.* would work and filter out the spam in Cyrillic (with appropriate regular expressions for Wansung and Hangul).

Only it didn't work.

It caught subject lines that had “Windows-1251” as part of the legitimate subject line (I sent him a test message with the subject of “Did you get Windows-1251 yet?”) but not if it was part of an encoding. Which meant only one thing: the spam filtering system was applying the regular expressions to the decoded characters!

Well … that's certainly a surprise.

But it doesn't help the current problem. We're now waiting to hear back from the company on whether that “feature” can be turned off.
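
Here is a minimal sketch, in Python, of the behavior described above. It assumes the filter decodes the encoded subject (the format is defined in RFC 2047) before applying the regular expression; how the vendor's product actually works is unknown. The helper decode_subject_line is a hypothetical name made up for this illustration, and only the standard library is used.

```python
import re
from email.header import decode_header

# The regular expression from the post, applied to a whole header line.
PATTERN = re.compile(r"^Subject: .*Windows-1251.*")


def decode_subject_line(line):
    """Hypothetical helper: decode any RFC 2047 encoded words in a header line."""
    name, _, value = line.partition(": ")
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Encoded words come back as bytes plus the declared charset.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # Headers with no encoded words come back as plain str.
            parts.append(chunk)
    return name + ": " + "".join(parts)


spam = "Subject: =?Windows-1251?B?amFlQGxlZWhvbS5uZXQg?="
legit = "Subject: Did you get Windows-1251 yet?"

# Against the raw header line, the charset name is visible to the regex.
raw_match = PATTERN.search(spam) is not None

# After decoding, the charset name is gone, so the regex has nothing to match.
decoded_spam_match = PATTERN.search(decode_subject_line(spam)) is not None

# The test message still matches, since the name is part of the actual subject.
decoded_legit_match = PATTERN.search(decode_subject_line(legit)) is not None

print(raw_match, decoded_spam_match, decoded_legit_match)
```

Matching the raw line catches the encoded subject; matching the decoded line catches only the test message, which is the result Smirk saw.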

The Boston Diaries