Thursday, Debtember 01, 2005
Yet more thoughts on spam
Last month I switched email clients from Thunderbird to mutt (I found Thunderbird to be too sluggish but that's a story for another entry) and configured our primary email server to forward my mail directly to my workstation, where procmail can then filter it.
So now I can burn through mail in about half the time it used to take me.
I get a ton of email, most of it from the various servers (from root mostly) and most of that is generated by the mail system itself, informing me that it's found, yet again, another email infected with a virus (oh, easily 500 a day) or it couldn't deliver a message (another 500 a day easy) or the multi-thousand line output of logwatch (each easily 15,000 lines of summary per day).
So it was a simple matter to set up procmail to filter the messages (and, say, automatically delete the virus warnings; I tried turning that off on the servers themselves, but … well … control panels and hidden configuration files, and I'm stuck getting them even though I don't care for them). Now, since our mail goes through a dedicated spam filtering system that can mark emails as spam, I thought it would be a good idea to simply delete those upon receipt as well.
Only I kept receiving emails marked as spam.
31 N Dec 01 trespassers@gre ( 306) [SPAM] Breaking News
Puzzled, I moved the procmail configuration to delete such marked spam:
:0:
* ^Subject: .*SPAM.*
in-TRASH
to the start of my .procmailrc, and yet, I still get the emails. I bumped up the verbosity of logging, and yes, some of it was actually being caught and trashed, but not all of it.
What the heck?
In mutt I see:
From: <trespassers@greenoblivion.com>
To: <apache@XXXXXXXXXXX>
Subject: [SPAM] Breaking News
Date: Thu, 1 Dec 2005 22:49:10 +0200
But when I checked the actual raw email message …
From: <trespassers@greenoblivion.com>
To: <apache@XXXXXXXXXXX>
Subject: =?ascii?B?W1NQQU1dICBCcmVha2luZyBOZXdz?=
Date: Thu, 1 Dec 2005 22:49:10 +0200
That funky subject line? A form of MIME encoding for email headers. In this case, the subject line uses the US-ASCII character set and is encoded as base-64. procmail knows nothing about MIME encodings. It's looking for “SPAM” in the subject line and not finding it.
Well now …
Obviously, I can add
:0:
* ^Subject: =\?.*\?W1NQQU1dIC.*
in-TRASH
(“[SPAM]” encoded as base-64) to my .procmailrc file, but is there a better way?
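One possibility, sketched here in Python rather than in procmail itself, is to decode the header before matching against it. This is only a rough sketch: the function names `decode_subject` and `is_tagged_spam` are made up for illustration, and how such a script would be hooked into the mail flow is left open. It relies on `email.header.decode_header` from the Python standard library.

```python
from email.header import decode_header


def decode_subject(raw_subject):
    """Turn MIME encoded-words in a header value back into plain text."""
    pieces = []
    for value, charset in decode_header(raw_subject):
        if isinstance(value, bytes):
            try:
                value = value.decode(charset or "ascii", errors="replace")
            except LookupError:
                # Unknown charset name: Latin-1 maps every byte, so it cannot fail.
                value = value.decode("latin-1")
        pieces.append(value)
    return "".join(pieces)


def is_tagged_spam(raw_subject):
    """Hypothetical check: does the decoded subject carry the [SPAM] tag?"""
    return "[SPAM]" in decode_subject(raw_subject)


if __name__ == "__main__":
    raw = "=?ascii?B?W1NQQU1dICBCcmVha2luZyBOZXdz?="
    print(repr(decode_subject(raw)))
    print(is_tagged_spam(raw))
```

The match then runs against the text the mail client displays, so it does not depend on which encoding or charset label the sender happened to choose.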
Sure, Bayesian filtering is pretty cool, but I still think that a few simple heuristics in place would help just as much.
One idea: check the character encoding of the incoming email. In my case, if it isn't US-ASCII, ISO-8859-1 or UTF-8 (oh, might as well include WINDOWS-1252 for those unfortunate friends that are abused by Microsoft), then discard it. It doesn't matter if it's legitimate email if I don't understand the language it's written in.
Now, with ISO-8859-1, UTF-8 or WINDOWS-1252, I still might not be able to read the message (since ISO-8859-1 and WINDOWS-1252 cover western European languages like French and German, and UTF-8 covers just about all written languages), but my second idea should take care of that.
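A minimal sketch of the charset check, assuming Python's standard `email` package does the parsing. The name `has_acceptable_charsets` and the exact contents of `ALLOWED_CHARSETS` are assumptions for illustration. The Windows code page is written as `windows-1252`, the Western European one, on the assumption that it is the one meant.

```python
import email

# Assumed allow-list; windows-1252 is the Western European Windows code page.
ALLOWED_CHARSETS = {"us-ascii", "ascii", "iso-8859-1", "utf-8", "windows-1252"}


def has_acceptable_charsets(raw_message):
    """Return False if any text part declares a charset outside the allow-list."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # get_content_charset() lowercases the name; a missing charset
        # parameter is treated as us-ascii, the default for text/plain.
        charset = part.get_content_charset("us-ascii")
        if charset not in ALLOWED_CHARSETS:
            return False
    return True
```

This only looks at the charset each body part declares. Headers carrying encoded-words in some other charset would need a separate check.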
Second idea: spell check the incoming email.
No, seriously.
Take this bit of spam I received today:
lt is really hard to recollect a company: the market is full of sugqestions and the information is overwhelming; but A GOOD CATCHY LOGO, STYLISH STATlONERY and OUTSTANDING WEBSIT E wilI make the task much easier.
We do not promise that having ordered a loqo your company wiIl automaticaIly become a worId Ieader: it is quite clear that without good products ,effective business orqanization and practicable aim it will be hot at nowadays market; but we do promise that your marketing efforts will become much more effective.
Twelve spelling errors (and one punctuation error, which I marked but am not counting in the following statistic) for a 14% spelling error rate. And if the email is in a different language, the spelling error rate will easily go past 95%. So, if the proportion of misspelled words exceeds, say, 70%, delete it, and if it's above, say, 5% (hey, we all make mistakes sometimes), mark it as possible spam.
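The thresholds above could be sketched roughly as follows. The dictionary is passed in as a plain set of lowercase words. Where that word list comes from (a system word list, a spell-checker's dictionary) is left as an assumption, and the names `misspelled_fraction` and `classify` are invented for this example.

```python
import re

# Runs of letters in any script, optionally joined by apostrophes ("don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def misspelled_fraction(text, dictionary):
    """Fraction of words in text that are not found in the dictionary set."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    unknown = sum(1 for word in words if word.lower() not in dictionary)
    return unknown / len(words)


def classify(text, dictionary, delete_above=0.70, suspect_above=0.05):
    """Apply the two thresholds: over 70% delete, over 5% flag, else keep."""
    rate = misspelled_fraction(text, dictionary)
    if rate > delete_above:
        return "delete"
    if rate > suspect_above:
        return "possible spam"
    return "keep"
```

Because the lookup is a plain set-membership test, a letter swap such as “loqo” for “logo” or a capital I standing in for a lowercase l counts as a miss, which is exactly the trick this kind of spam relies on.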
This would definitely piss off the V1@gr@ pushers.
Third idea: unless the sender is whitelisted, delete any email that contains any type of attachment (well, for me at least).
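A sketch of the attachment rule, again assuming Python's standard `email` package. Keying the whitelist on the address in the From header is an assumption made for the sake of the example (that header is trivially forged), and `has_attachment` and `should_delete` are hypothetical names.

```python
import email
from email.utils import parseaddr


def has_attachment(msg):
    """True if any part is marked as an attachment or carries a filename."""
    for part in msg.walk():
        if part.get_content_disposition() == "attachment" or part.get_filename():
            return True
    return False


def should_delete(raw_message, whitelist):
    """Delete mail with attachments unless the From address is whitelisted."""
    msg = email.message_from_string(raw_message)
    # Assumption: the whitelist is a set of lowercase bare addresses.
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in whitelist:
        return False
    return has_attachment(msg)
```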
And this is before explicit filtering, Bayesian or otherwise.
I wonder just how hard something like that would be to write …