Stopping Spam

August 2003

There are many ways to fight spam. Which works best? So far the best single solution is probably Bayesian filtering. But you don't have to choose just one. Many of the following solutions could be used in combination.



Complaining to Spammers' ISPs

Good: Raises cost of spamming.
Bad: Laborious.
Role: Partial solution, for experts.

This was the original spam solution. Believe it or not, complaints can have an effect. True, spammers expect to be shut down, and already have fresh accounts lined up. Constantly switching providers is just a cost of doing business. But the faster they get booted due to complaints, the greater this cost becomes.

Complaining effectively is difficult. Most spammers forge the headers of their emails to disguise their origin. You have to learn to interpret headers to understand where a spam really came from.

Another option is to complain to the ISP hosting the site advertised in the spam (or, if the ISP is a spam hosting service, their upstream provider). But again, it can take some effort to figure out who this is.



Mail Server Blacklists

Good: Block spam right at the server.
Bad: Incomplete, sometimes irresponsible.
Role: A first pass to eliminate up to 50% of spam early on.

Groups of volunteers maintain blacklists of mail servers either used by spammers, or that have security holes that would let spammers use them. Some ISPs subscribe to such blacklists, and automatically refuse any mail from servers on them.

Blacklists have two downsides. One is that they never manage to list more than about half the servers that spam comes from. Another is that a blacklist is only as good as the people running it. Some blacklists are run by vigilantes who shoot first and ask questions later. Using the wrong blacklist could mean bouncing a lot of legitimate mail.

Blacklists are useful as at the ISP level, as long as you (a) use a responsible one (if there are any) and (b) don't expect it to be more than a first cut at the problem.



Signature-Based Filtering

Good: Rarely blocks legitimate mail.
Bad: Catches only 50-70% of spam.
Role: A first-pass filter on big email services.

Signature-Based filters work by comparing incoming email to known spams. Brightmail does it by maintaining a network of fake email addresses. Any email sent to these addresses must be spam. So when they see the same email sent to an address they're protecting, they know they can filter it out.

In order to tell whether two emails are the same, these systems calculate "signatures" for them. One way to calculate a signature for an email would be to assign a number to each character, then add up all the numbers. It would be unlikely that a different email would have exactly the same signature.

The way to attack a signature-based filter is to add random stuff to each copy of a spam, to give it a distinct signature. When you see random junk in the subject line of a spam, that's why it's there-- to trick signature-based filters.

The spammers have always had the upper hand in the battle against signature-based filters. As soon as the filter developers figure out how to ignore one kind of random insertion, the spammers switch to another. So signature-based filters have never had very good performance.



Bayesian (aka Statistical) Filtering

Good: Catch 99% to 99.9% of spam, low false positives.
Bad: Have to be trained.
Role: Best current solution for individual users.

Bayesian filters are the latest in spam filtering technology. They recognize spam by looking at the words (or "tokens") they contain.

A Bayesian filter starts with two collections of mail, one of spam and one of legitimate mail. For every word in these emails, it calculates a spam probability based on the proportion of spam occurrences. In my own email, "Guaranteed" has a spam probability of 98%, because it occurs mostly in spam; "This" has a spam probability of 43%, because it occurs about equally in spam and legitimate mail; and "deduce" has a spam probability of only 3%, because it occurs mostly in legitimate email.

When a new mail arrives, the filter collects the 15 or 20 words whose spam probabilities are furthest (in either direction) from a neutral 50%, and calculates from these an overall probability that the email is a spam.

Because they learn to distinguish spam from legitimate mail by looking at the actual mail sent to each user, Bayesian filters are extremely accurate, and adapt automatically as spam evolves.

Bayesian filters vary in performance. As a rule you can count on filtering rates of 99%. Some, like SpamProbe, deliver filtering rates closer to 99.9%.

Bayesian filters are particularly good at avoiding "false positives"-- legitimate email misclassified as spam. This is because they consider evidence of innocence as well as evidence of guilt. A Bayesian filter is unlikely to reject an otherwise innocent email that happens to contain the word "sex", as a rule-based filter might.

The disadvantage of Bayesian filters is that they need to be trained. The user has to tell them whenever they misclassify a mail. Of course, after the filter has seen a couple hundred examples, it rarely guesses wrong, so in the long term there is little extra work involved.

Another disadvantage of Bayesian filters is that they're new. The technology only became widespread in 2003. Most commercial spam filters are still rule-based.



Rule-Based (aka Heuristic) Filtering

Good: The best catch 90-95% of spam, easy to install.
Bad: Static rules, relatively high false positives.
Role: Easy server-level solution.

Rule-based filters look for patterns that indicate spam: specific words and phrases, lots of uppercase and exclamation points, malformed headers, dates in the future or the past, etc. This is how nearly all spam filters worked until 2002.

The performance of rule-based filters varies hugely. The simplest just reject any email that contains certain "bad" words. These are laughably easy for spammers to beat, and also tend to reject a lot of legitimate email.

On the other hand, sophisticated rule-based filters like Spamassassin can be quite effective. You can probably count on a good rule-based filter catching 90-95% of current spam.

The main disadvantage of rule-based filters is that they tend to have high false positive rates--often as high as .5%. (A trained Bayesian filter's false positive rate would be less than a tenth of that.)

Another disadvantage is that the rules are static. When spammers learn new tricks, the filter's authors have to write new rules to catch them. And because rule-based filters are static targets, spammers can tune their mails to get past them. Sophisticated spammers already test their mails on popular rule-based filters before sending them. In fact, there are sites that will do this for free.

The advantage of rule-based filters over Bayesian filters is that they're easy to install at the mail server level. Bayesian filters require users to train them by telling them when they misclassify an email, so running one on the server is a little more complicated (but probably worth it).



Challenge-Response Filtering

Good: Stops 99.9% of spam.
Bad: Rude, delays or drops legitimate email.
Role: Grandmothers, cranks.

When you get an email from someone you haven't had mail from before, a challenge-response filter sends an email back to them, telling them they must go to a web page and fill out a form before the email can be delivered.

The advantage of challenge-response filters is that they let through very little spam. (At least, so far.) The disadvantage is that they're rude. Spam means extra work for all of us. By using a challenge-reponse filter, you're saying that you expect the extra work of keeping your inbox free of spam to be done by the people who send you mail.

The other disadvantage of challenge-response filters is that much legitimate mail will either be lost, or delayed until it's too late to be useful. Suppose an acquaintance is going to a party tonight and decides to invite you too. Your filter replies with a challenge. But she doesn't see this till she checks her mail again the next day, by which time it's too late.

Occasionally senders will never reply to the challenge, and the email they sent you will be lost. Some respond to this by saying "if they're not willing to do a little work to talk to me, I probably don't want to hear from them anyway." But there are cases where this is clearly not true. I have several email addresses that all get forwarded to one account. If someone using a challenge-response filter sends me a question, the reply will come from an address their filter hasn't seen and I'll get a challenge back. If such challenges seem rude when you're the one initiating contact, imagine how they seem when you get one after replying to someone's question. I never bother to respond, and I'm probably not the only one.

There are also technical objections to challenge-response filters. What happens when spammers happen to use some innocent person's address as the from-address in a spam, for example? What happens when spamware authors (who aren't stupid) figure out how to spoof challenges? How do you create a challenge that blind people can pass?

If such technical objections could be overcome, challenge-response filters would have a place. They'd be suitable for users like my mother, who says that she only gets email from about 10 different addresses, and gets mail from a new address only about once a year. They might also be good in combination with other kinds of spam filter; you could challenge the mails that a Bayesian filter classified as spam, for example, just in case any were legit.

But using just challenge-response on all your incoming email is like putting a ten-foot chain-link fence around your house. Yes, it will keep people out, but it also sends the world a certain message about you.



Laws

Good: Truly threaten spammers.
Bad: Aren't enforced, or are full of loopholes.
Role: Could eliminate 80% of spam, if done right.

There are two problems with laws against spam: they usually contain large loopholes inserted by lobbyists, and the worst class of spammers ignore them.

But it isn't necessarily a waste of time to try to pass laws against spam. Even if some spammers do ignore laws, getting rid of the rest would still be worthwhile.

To be effective, spam laws would have to have criminal and not just civil penalties. The most prolific spammers in the US have made themselves judgment proof-- by putting all their assets in their wife's name, for example, or by buying a house in Florida.

(Florida law protects real estate against civil judgments. This is one of the reasons Florida, and Boca Raton in particular, is the spam center of the country.)

But the main thing any law against spam needs is enforcement. There are plenty of state laws against spam, and they seem to have no effect. One reason is that the states don't enforce them.

One option currently being considered is a do-not-spam list, like the US do-not-call list. I don't think this will work. All it would take is one person to crack the security, and the list would be out there, irretrievably.

The main loophole in spam laws is usually in the definition of spam. Most spam laws allow unsolicited email to recipients who have a prior relationship with the sender. This is reasonable, but you have to define carefully what a prior relationship consists of. There is a whole class of spammers (they currently call themselves "permission-based email marketers") who get email addresses by buying them from websites with unscrupulous privacy policies. By calling the site they bought your address from an "affiliate" or "partner", the spammers claim that they too have a prior relationship with you, and are thus exempt from spam laws.

This loophole would have to be closed for any anti-spam law to work. The way to draw the line between spam and marketing is to look at where the sender got your email address. If they merely bought your address, or harvested it from web pages, chat rooms, or newsgroups, then they don't have a prior relationship with you. If a federal spam law simply said that any email to an address thus obtained had to have ADV in the subject line, that alone could get rid of 50% of spam. Legitimate direct marketers would have little objection to such a measure; they don't want their brands to be tarnished by spamming, and never buy or harvest email addresses.

(Only a few well-known brands use spam. Gevalia, owned by Kraft, is probably the most notorious.)

Some see First Amendment problems with laws against spamming. But there are plenty of precedents. The closest are probably the federal laws against junk faxes and telemarketing with recorded messages. There doesn't seem to have been much protest against these on free speech grounds.

A law against spam could have some effect, even if it wasn't very well enforced, because it would further stigmatize spam. Spammers have families, friends, and neighbors, and these all exert some amount of pressure on them. Alan Ralsky said that he had promised his wife not to send porn spams. (You probably have to pay extra attention to your wife when all your assets are in her name.) Perhaps if there were a federal law against spam, with criminal penalties, she'd make him stop altogether.



FFBs

Good: Raise cost of spamming.
Bad: Involve blacklists.
Role: Speculative idea.

About 95% of spams contain links to web pages. If everyone who received a spam actually followed the links in it, the traffic would be a heavy burden on the spammers' servers.

That's the idea behind FFBs (Filters that Fight Back). If many spam filters automatically crawled sites mentioned in spams, the resulting traffic could generate high server loads and bandwidth costs for spammers.

The biggest spammers could probably protect themselves against FFBs overloading their server, but even in their case the bandwidth would have to be paid for, raising the cost of each spam. Smaller spammers would be crushed by FFBs. A medium-sized spam hosting account allows 50 GB of transfer per month. A moderately popular FFB could drain this in a matter of hours.

To protect against people sending fake spams in order to provoke FFBs to attack innocent sites, this system would have to rely on blacklists. Only sites listed on the blacklist would be crawled when spam mentioning them arrived. Technically, that is the weak point of this solution: blacklists are not always responsibly managed.

Another disadvantage to this plan is the resemblance to a denial of service attack. It isn't a DoS attack, according to most definitions. Even so, some users wouldn't want to do this, even to spammers.

There are no FFBs yet, though there is now one filter that automatically retrieves web pages to improve accuracy.



Slow Senders

Good: Raises cost of spamming.
Bad: Requires new email protocol.
Role: Speculative idea.

Spam has low response rates (on the order of 15 per million) but spammers make up for it with high volumes, sending millions of emails per day. If you could slow down the rate at which they send email, you could put them out of business.

One way to do this would be to make any computer that wanted to send you mail perform a time-consuming computation before you would accept it.

Whatever these computations were, they couldn't be too arduous, because legitimate corporate mail servers have to be able to send high volumes of mail. And corporate mail servers would be running on stock hardware. Many computations can be made hundreds or thousands of times faster by custom hardware. Spammers already have highly tuned systems and would not be deterred by the need for custom hardware.

So for this idea to work, you'd need to figure out a kind of computation that couldn't easily be speeded up by custom hardware.

(You could improve the odds by incorporating Bayesian spam recognition; instead of always requiring the same calculation, require a calculation whose difficulty depends on the spamminess of the incoming mail.)

Even if you could find a suitable computation, this idea would require new email protocols. Any new protocol has a chicken-and-egg problem: no one needs to adopt it till everyone else does. As a result, it is practically impossible to get a new protocol adopted for anything. How are you going to get sysadmins who don't even bother to install patches for years-old security holes to switch to a new email protocol?



Penny per Mail

Good: Raises cost of spamming.
Bad: Requires new email protocol, bureaucracy.
Role: Speculative idea.

There are various ideas floating around for charging some small amount per email sent. If it cost even half a cent to send an email, spam wouldn't pay, and would disappear.

Unfortunately, I think there would be insuperable practical problems in setting up such a system. These proposals run into the same chicken-and-egg problem as anything that requires a new protocol; there is no incentive to be an early adopter.

In this case the protocol would be particularly onerous to administer, because it would involve money. Setting up a mail server would mean establishing a line of credit.

Security would become much more important once money was involved. Mail servers would now in effect transfer funds, like servers within the banking system. This despite the fact that they are connected to the Internet.

Charging per email wouldn't stop the worst spammers. They'd just break into companies' computers and send mail at their expense. And the possibility of a spammer breaking into one's system and racking up big email bills would not make the average sysadmin eager to become an early adopter, to say the least.

For this kind of approach to work, we'd first have to solve the problem of making the average small and medium-sized company's network secure. So we'd just be exchanging a hard problem for a harder one.



Secret Address

Good: Easy.
Bad: Doesn't work.
Role: Facile recommendation for brief news articles.

Some recommend that you keep your address secret in order to avoid spam. But it's hard to keep your address secret, because other people have to know it to send you email. All it takes is one naive friend to enter your address in a web site to send you an electronic greeting card, and it's all over.

Even if no one discloses your address, spammers can still get it through dictionary attacks. In a dictionary attack, spammers try sending a test mail to millions of possible addresses. Any that don't bounce are probably valid. My mother gets spam as a result of a dictionary attack on AOL, even though she only sends email to a handful of people and never uses the Web.



Junk Address

Good: Cuts some spam.
Bad: Can't always use them.
Role: Use on web sites that make you register.

You could in principle avoid spam by giving a different email address to everyone. Then you could just shut off any address that got compromised. And, for what it's worth, you'd know who was responsible.

This is a good idea when you have to enter your address on a web site, e.g. to register for an account to read news articles online. I usually use junk addresses for such purposes.

This is hard to do in all cases, though. You would not be able to just print an email address on your web site or business card. Instead you'd have to have a page you sent people to, where they could request an email address to use to send mail to you.



Thanks to Bill Yerazunis and Brian Burton for reading this and suggesting several fixes.