Friday, June 18, 2004
The greylist approach to spam
I've been rather reluctant to add much in the way of anti-spam measures on the mailserver I use, if only because I'm horribly afraid of false positives; losing email is a big thing with me. Also, most of the anti-spam measures are processor intensive, having to do an analysis of the actual email in question and attempt to classify it as “spam” or “non-spam”.
But during a thread on a mailing list I'm on, I came across the concept of “greylisting,” which seems very promising, as the statistics in the whitepaper state:
Analysis of Effectiveness
Based on testing with the example implementation, over a testing period of about 6 weeks, we had raw numbers of:
- Unique triplets seen: 346968
- Unique triplets that passed email: 8950
- Effectiveness (based on triplets): 97.4%
So we have a better than 97 percent efficiency assuming that all email is spam, but it's actually better than that, since most of the email that got through was not spam. Unfortunately, telling exactly how much better we did is impossible without individually inspecting each email, which of course we did not do.
Now let's look at our inefficiency:
- Total emails passed: 85745
- Total deliveries deferred where email was eventually passed: 33586
- Percentage of emails delayed: 39.2%
Unfortunately, this is a pretty poor number. But let's correct it a bit. Almost all of these delayed emails were mailing list traffic which used a unique id for the sender address (see above note regarding VERP). So if we disregard all triplets that passed only one email, we should exclude that type of traffic, and we get a new set of numbers:
- Total emails passed: 85745
- Total deliveries deferred where more than one email was eventually passed: 3512
- Percentage of emails delayed (adjusted): 4.1%
This puts things in a much more favorable light, and merely disregards delays for emails that are generally not timely anyway.
Now let's see what effect greylisting would have on network bandwidth, based on some general averages.
- Average size of spam emails: 5000 bytes
- Average SMTP delivery attempt overhead: 500 bytes
These numbers are based on spam collected via various methods before the testing period. We picked these as nice round numbers that are pretty closely in line with analysis of previously seen spam. As for the SMTP overhead, in most cases it was less than 500 bytes, but we decided to err on the conservative side.
From this, it follows that for every spam blocked using Greylisting, we save enough bandwidth to "pay" for 10 deferred delivery attempts. If we total that up to give a real-world number (using the unadjusted numbers to give a worst case picture):
338018 (# spams) x 5000 bytes = 1.69 Gbytes of bandwidth saved
33586 (# blocks) x 500 bytes = 16.7 Mbytes of bandwidth wasted
This gives us a net gain of over 1.67 Gbytes of traffic that was saved by implementing Greylisting in our tests. And that's just on a fairly small site.
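As a sanity check, the arithmetic in the excerpt can be reproduced in a few lines of Python. This is only a sketch: the figures are copied from the quoted numbers above, and the variable names are mine, not the whitepaper's.

```python
# Figures copied from the whitepaper excerpt quoted above.
triplets_seen = 346968
triplets_passed = 8950
emails_passed = 85745
deferred_all = 33586        # deferrals where email eventually passed
deferred_adjusted = 3512    # same, ignoring triplets that passed only one email

spam_size = 5000            # bytes, the whitepaper's round-number average
smtp_overhead = 500         # bytes per deferred delivery attempt

# Triplets that never passed any email are counted as blocked spam.
blocked = triplets_seen - triplets_passed
effectiveness = 100 * blocked / triplets_seen

delayed = 100 * deferred_all / emails_passed
delayed_adjusted = 100 * deferred_adjusted / emails_passed

saved = blocked * spam_size
wasted = deferred_all * smtp_overhead

print(f"blocked triplets:   {blocked}")
print(f"effectiveness:      {effectiveness:.1f}%")
print(f"delayed:            {delayed:.1f}%")
print(f"delayed (adjusted): {delayed_adjusted:.1f}%")
print(f"bandwidth saved:    {saved / 1e9:.2f} Gbytes")
print(f"bandwidth wasted:   {wasted / 1e6:.1f} Mbytes")
print(f"net gain:           {(saved - wasted) / 1e9:.2f} Gbytes")
```

The 338018 in the bandwidth calculation is simply the triplets seen minus the triplets that passed email.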
Even better is the actual method: there's no modification of SMTP itself, and the processing is dead simple. In fact, the processing doesn't even require looking at the email itself. All it requires is that the SMTP server keep track of the sender, receiver and the client IP address (the “triplets” mentioned above) and, for an IP address never recorded, simply send back a “try again later” response.
It's that simple.
Which is why I like it.
After a period of time (the whitepaper suggests one hour) you can then let the email through, and you keep the IP address on a “whitelist” that allows that IP address to send through without going through the “try again later” phase. The records that comprise the whitelist should expire after a period of inactivity (the whitepaper suggests 36 days).
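To show just how little processing is involved, here is a minimal in-memory sketch of the bookkeeping. It is my own illustration, not the whitepaper's example implementation: the `Greylist` class, the `check` method and the "defer"/"accept" return values are all made-up names, and I key the records on the full triplet since that is what the statistics above count. The one-hour delay and 36-day lifetime are the values mentioned above.

```python
import time

INITIAL_DELAY = 60 * 60                  # one hour, in seconds
WHITELIST_LIFETIME = 36 * 24 * 60 * 60   # 36 days, in seconds


class Greylist:
    """Hypothetical in-memory greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, clock=time.time):
        self.clock = clock      # injectable so the logic can be tested
        self.records = {}       # triplet -> {"first_seen", "last_passed"}

    def check(self, client_ip, sender, recipient):
        """Return "defer" (a temporary SMTP failure) or "accept"."""
        now = self.clock()
        key = (client_ip, sender.lower(), recipient.lower())
        rec = self.records.get(key)

        # A whitelisted triplet that has been idle too long starts over.
        if (rec is not None and rec["last_passed"] is not None
                and now - rec["last_passed"] > WHITELIST_LIFETIME):
            rec = None

        if rec is None:
            # Never seen before: remember it and say "try again later".
            self.records[key] = {"first_seen": now, "last_passed": None}
            return "defer"

        if now - rec["first_seen"] < INITIAL_DELAY:
            # Seen, but the initial delay has not elapsed yet.
            return "defer"

        # Retried after the delay: let it through and refresh the record.
        rec["last_passed"] = now
        return "accept"
```

A real server would also need to persist these records and purge triplets that never retry; both are left out here to keep the sketch short.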
I know the company that Spring works for, Negiyo, started using some anti-spam measures, and for the past two or three weeks they've been getting slammed with both spam and complaints from customers about massive false positives. In fact, their email system is in total meltdown right now. So this “greylisting” method would seem to be something they need to add right now (are you listening, Rob? JeffK? Look into this!). Granted, there are some issues that need to be resolved before implementing this, such as large sites using multiple machines to send out email (recording a range of addresses instead of individual addresses, for instance, would handle that one), but I think that the returns given for such a simple thing are worth it.
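Recording a range of addresses instead of an individual one could be as simple as collapsing the client IP to its enclosing network before using it as a key. The sketch below uses Python's standard `ipaddress` module; the `network_key` helper and the /24 prefix are my own assumptions, not something the whitepaper prescribes.

```python
import ipaddress


def network_key(client_ip, prefix=24):
    """Collapse a client IP to its enclosing network (hypothetical helper).

    The /24 default is an assumption; a site might pick a different prefix.
    """
    # strict=False lets us pass a host address and get its network back.
    return str(ipaddress.ip_network(f"{client_ip}/{prefix}", strict=False))


# Two outbound mail machines in the same /24 now share one record.
print(network_key("192.0.2.17"))
print(network_key("192.0.2.200"))
```

The trade-off is that a wider range also whitelists any other machine that happens to sit in the same block.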
While in the short term this seems like a good, easy way to control spam, one concern is that spammers will adapt their techniques to get around it. Such issues are covered in the whitepaper, but there's another technique mentioned on the mailing list that is just as easy to implement and, used in conjunction with greylisting, raises the cost of mail delivery for spammers.
In RFC 2821, there are recommended timeouts for each phase of email delivery using SMTP (they average around five minutes). This can be used to our advantage, both during the “try again later” responses and during the initial “let mail get through” phase: all it would take is a delay of a minute or two on the server and it gets too expensive for the spammer. In some previous entries I go over just how many emails can be sent by a spammer, and the calculations don't include intentional delays on the part of the server. Delays of even a few seconds can cause a spammer's costs to rise.
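A back-of-the-envelope sketch shows why the delay hurts. The numbers here are made up purely for illustration (they are not the figures from my earlier entries): a spammer holding 1,000 connections open at once and spending 2 seconds per message, against a server that stalls each transaction for 60 seconds.

```python
def messages_per_hour(connections, seconds_per_message, server_delay=0.0):
    """Rough throughput if every connection delivers messages back to back.

    All inputs are hypothetical; server_delay is the extra time the
    receiving server deliberately adds to each transaction.
    """
    return connections * 3600 / (seconds_per_message + server_delay)


baseline = messages_per_hour(1000, 2)
stalled = messages_per_hour(1000, 2, server_delay=60)

print(f"no delay:  {baseline:,.0f} messages/hour")
print(f"60s delay: {stalled:,.0f} messages/hour")
print(f"slowdown:  {baseline / stalled:.0f}x")
```

Under those assumptions the same hardware pushes out a small fraction of the mail, which is the whole point: the delay costs the receiving server almost nothing but ties up the sender's connections.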
I do, however, think I will attempt to get this installed on my mail server and see how it works.