The Boston Diaries

The ongoing saga of a programmer who doesn't live in Boston, nor does he even like Boston, but yet named his weblog/journal “The Boston Diaries.”

Go figure.

Friday, June 18, 2004

The greylist approach to spam

I've been rather reluctant to add much in the way of anti-spam measures on the mailserver I use, if only because I'm horribly afraid of false positives; losing email is a big thing with me. Also, most anti-spam measures are processor intensive, having to analyze the actual email in question and attempt to classify it as “spam” or “non-spam”.

But during a thread on a mailing list I'm on, I came across the concept of “greylisting,” which seems very promising, judging by the statistics in the whitepaper:

Analysis of Effectiveness

Based on testing with the example implementation, over a testing period of about 6 weeks, we had raw numbers of:

So we have a better than 97 percent efficiency assuming that all email is spam, but it's actually better than that, since most of the email that got through was not spam. Unfortunately, telling exactly how much better we did is impossible without individually inspecting each email, which of course we did not do.

Now let's look at our inefficiency:

Unfortunately, this is a pretty poor number. But let's correct it a bit. Almost all of these delayed emails were mailing list traffic which used a unique id for the sender address (see above note regarding VERP). So if we disregard all triplets that passed only one email, we should exclude that type of traffic, and we get a new set of numbers:

This puts things in a much more favorable light, and merely disregards delays for emails that are generally not timely anyway.

Now let's see what effect greylisting would have on network bandwidth, based on some general averages.

These numbers are based on spam collected via various methods before the testing period. We picked these as nice round numbers that are pretty closely in line with analysis of previously seen spam. As for the SMTP overhead, in most cases it was less than 500 bytes, but we decided to err on the conservative side.

From this, it follows that for every spam blocked using Greylisting, we save enough bandwidth to "pay" for 10 deferred delivery attempts. If we total that up to give a real-world number (using the unadjusted numbers to give a worst case picture):

338018 (# spams) x 5000 bytes = 1.69 Gbytes of bandwidth saved
33586 (# blocks) x 500 bytes = 16.7 Mbytes of bandwidth wasted

This gives us a net gain of over 1.67 Gbytes of traffic that was saved by implementing Greylisting in our tests. And that's just on a fairly small site.

Greylisting: Whitepaper

Even better is the actual method—there's no modification of SMTP itself, and the processing is dead simple. In fact, the processing doesn't even require looking at the email itself. All it requires is that the SMTP server keep track of the sender, the receiver, and the client IP address (the “triplets” mentioned above), and for a triplet it hasn't seen before, simply send back a “try again later” response.

It's that simple.

Which is why I like it.

After a period of time (the whitepaper suggests one hour) you can then let the email through, and you keep the IP address on a “whitelist” that allows it to send without going through the “try again later” phase; the records that comprise the whitelist should expire after a period of inactivity (the whitepaper suggests 36 days).
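Just to make the bookkeeping concrete, here's a rough sketch in Python. It's not the whitepaper's sample implementation; the in-memory table and the function names are my own, and a real MTA hook would obviously need persistence. It uses the suggested one-hour delay and 36-day expiry:

import time

GREYLIST_DELAY   = 60 * 60            # initial "try again later" window: 1 hour
WHITELIST_EXPIRY = 36 * 24 * 60 * 60  # drop idle entries after 36 days

# (client_ip, sender, recipient) -> [first_seen, last_seen]
triplets = {}

def check_recipient(client_ip, sender, recipient):
    """Decide the SMTP response for a RCPT TO command."""
    now = time.time()
    key = (client_ip, sender, recipient)
    record = triplets.get(key)

    if record is None:                      # never seen this triplet before
        triplets[key] = [now, now]
        return "451 4.7.1 Greylisted, please try again later"

    record[1] = now                         # note the activity
    if now - record[0] < GREYLIST_DELAY:    # retried too soon
        return "451 4.7.1 Greylisted, please try again later"

    return "250 OK"                         # waited long enough; let it through

def expire_old_entries():
    """Periodically drop triplets that haven't been used in a while."""
    now = time.time()
    for key, (first_seen, last_seen) in list(triplets.items()):
        if now - last_seen > WHITELIST_EXPIRY:
            del triplets[key]

The whole trick relies on legitimate mail servers queueing the message and retrying, while most spamming software never bothers to come back.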

I know the company that Spring works for, Negiyo, started using some anti-spam measures, and for the past two or three weeks they've been getting slammed with both spam and complaints from customers about massive false positives—in fact, their email system is in total meltdown right now. So this “greylisting” method would seem to be something they need to add right now (are you listening, Rob? JeffK? Look into this!). Granted, there are some issues that need to be resolved before implementing this (such as large sites using multiple machines to send out email, which could be handled by recording a range of addresses instead of individual addresses; see the sketch below), but I think that the returns given for such a simple thing are worth it.
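One way to handle the multiple-machine issue (my own quick illustration, not something out of the whitepaper) is to key on the sending network rather than the exact host, say by collapsing the client address down to its /24:

def network_key(client_ip):
    """Reduce an IPv4 dotted-quad to its /24, e.g. '192.0.2.17' -> '192.0.2'."""
    return client_ip.rsplit(".", 1)[0]

# use network_key(client_ip) in place of client_ip when building the triplet

That way all of a large site's outgoing servers look like one sender, so a retry that comes from a different machine in the same farm still clears the greylist.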

While in the short term this seems like a good, easy way to control spam, one concern is that spammers will adapt their techniques to get around it. Such issues are covered in the whitepaper, but there's another technique, mentioned on the mailing list, that is just as easy to implement and, used in conjunction with greylisting, raises the cost of mail delivery for spammers.

RFC 2821 recommends timeouts for each phase of email delivery using SMTP (averaging around five minutes). This can be used to our advantage, both during the “try again later” responses and during the initial “let the mail through” phase: a delay of even a minute or two on the server's part makes delivery too expensive for the spammer. In some previous entries I go over just how many emails a spammer can send, and those calculations don't include intentional delays on the part of the server; delays of even a few seconds can cause a spammer's costs to rise.
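As a back-of-the-envelope illustration, suppose a spammer keeps a hundred connections open at once (the figures here are numbers I pulled out of thin air, just to show the shape of the thing):

connections = 100     # simultaneous SMTP connections (assumed)
normal_time = 1.0     # seconds per delivery when the server answers promptly (assumed)
stall       = 60.0    # seconds of deliberate delay added by the server

per_hour_fast = connections * 3600 / normal_time
per_hour_slow = connections * 3600 / (normal_time + stall)

print(f"prompt responses: {per_hour_fast:>9,.0f} messages/hour")   # 360,000
print(f"one-minute stall: {per_hour_slow:>9,.0f} messages/hour")   # about 5,900

Even a stall that stays well inside RFC 2821's recommended timeouts cuts the spammer's throughput by roughly a factor of sixty, while a legitimate mail server simply waits it out.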

I do, however, think I will attempt to get this installed on my mail server and see how it works.

Obligatory Picture

[An abstract representation of where you're coming from]


Obligatory Miscellaneous

Obligatory AI Disclaimer

No AI was used in the making of this site, unless otherwise noted.

You have my permission to link freely to any entry here. Go ahead, I won't bite. I promise.

The dates are the permanent links to that day's entries (or entry, if there is only one entry). The titles are the permanent links to that entry only. The format for the links is simple: start with the base link for this site, https://boston.conman.org/, then add the date you are interested in, say 2000/08/01, so that would make the final URL:

https://boston.conman.org/2000/08/01

You can also specify the entire month by leaving off the day portion. You can even select an arbitrary portion of time.

You may also note subtle shading of the links, and that's intentional: the “closer” the link is (relative to the page) the “brighter” it appears. It's an experiment in using color shading to denote the distance a link is from here. If you don't notice it, don't worry; it's not all that important.

It is assumed that every brand name, slogan, corporate name, symbol, design element, et cetera mentioned in these pages is a protected and/or trademarked entity, the sole property of its owner(s), and acknowledgement of this status is implied.

Copyright © 1999-2024 by Sean Conner. All Rights Reserved.