Is this code available anywhere?
No; it's written in Arc, which is itself not released yet.
Is there a Bayesian filter for Outlook?
I know of nine so far: Spammunition, SpamBayes, Spam Bully, InboxShield,
Junk-Out, Outclass, Disruptor OL, SpamTiger, and JunkChief.
Is there anything that can protect my company's server?
The best commercial server-level Bayesian filter is probably Death2Spam.
SpamProbe, one of the best open-source Bayesian filters, can also be run
on the server.
Most commercial server-level spam filters are still rule-based.
But there are starting to be some
that use Bayesian filtering. The way to find them is probably
to search in Google.
The question to ask the salesman is, does the filter learn to recognize
spam based on the spam and nonspam mail we receive? If it
doesn't learn, it isn't Bayesian.
Does Arc/Lisp have some advantage for writing this kind of software?
Lisp's symbol type is useful in manipulating databases of words,
because it lets you test for equality by just comparing pointers.
(If you find yourself using checksums to identify words, it's symbols
you mean to be using.) Prolog, Erlang, and Ruby also support symbols.
So does Python, to a degree, though I don't think it has any syntax for them yet.
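If you want to see the idea in a more mainstream language, here is a tiny
Python sketch (mine, not part of the filter itself): interned strings
behave much like symbols, so equal words can be compared by identity.

    import sys

    # Interned strings act a bit like Lisp symbols: every equal copy of a
    # word shares one object, so comparison is a single pointer check
    # rather than a character-by-character scan or a checksum.
    w1 = sys.intern("viagra")
    w2 = sys.intern("".join(["via", "gra"]))   # built at runtime, then interned

    print(w1 is w2)   # True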
More generally, Lisp was designed for rapid
prototyping, and this application involved a lot of that. I probably
spent 95% of the development time typing expressions into the
toplevel, trying variously tweaked algorithms on individual emails.
Do you mind if I write filters based on this algorithm?
Of course not. I don't claim to have invented anything new here. Bayesian
text classification is an old and established field. If there is anything new
in this article (at least, it was news to me) it is that such a simple
variant of it is so effective at filtering spam.
So by all means go write spam filters. It is a very rewarding hack. If
you end up creating something other people can use, let me know and
I'll make a link to it.
I don't know Lisp; can you explain the algorithm to me?
It's expressed in math notation in Hackers & Painters.
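But roughly, here is a sketch of the scoring step in Python. It's a
paraphrase of the algorithm, not the original Arc code, and the handling
of rare and unseen tokens is simplified.

    from math import prod

    def token_prob(token, good, bad, ngood, nbad):
        """Per-token spam probability. good/bad map tokens to occurrence
        counts in the nonspam/spam corpora; ngood/nbad are the number of
        messages in each corpus."""
        g = 2 * good.get(token, 0)     # good counts doubled, to bias the
        b = bad.get(token, 0)          # filter against false positives
        if g + b < 5:                  # simplification: rare or unseen
            return 0.4                 # tokens count as mildly innocent
        p = min(1.0, b / nbad) / (min(1.0, g / ngood) + min(1.0, b / nbad))
        return min(0.99, max(0.01, p))

    def spam_score(tokens, good, bad, ngood, nbad, n=15):
        """Combine the n most interesting token probabilities, i.e. the
        ones furthest from a neutral 0.5."""
        probs = [token_prob(t, good, bad, ngood, nbad) for t in set(tokens)]
        top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
        return prod(top) / (prod(top) + prod(1 - p for p in top))

Mail whose score comes out near 1 gets treated as spam; legitimate mail
usually ends up near 0.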
Could you use Bayesian filtering to make Web content filters?
I've never tried it, but yes, I think it would work well.
Will this algorithm filter out 99.5% of my spam with no false positives?
It does filter out 99.5% of my spam. I got similar results for the one
other spam collection I've tested it on so far. (I couldn't measure false
positives, because I only had the spams.)
I do worry a bit that my email might just be especially easy to filter.
A lot of my spam comes from "opt-in" spammers like Qves and Virtumundo,
and that stuff is really easy to detect. Plus my own mail is full of references
to programming that make it easy to distinguish from spam (though not
"programming" itself, ironically, which often occurs in spams selling
satellite dishes).
So if you do try implementing this filter for yourself, I'd appreciate it if you
could let me know how well it works for you. My gut feel is that it will work
pretty effectively for almost everyone, but it would be reassuring to hear
numbers.
Could spammers fool Bayesian filters by filling their spams with random words?
They would have to get rid of the bad words as well as adding neutral ones.
Only the most interesting fifteen words contribute to the probability,
and neutral words like "onion", no matter how many there are of them,
can't compete with the incriminating "viagra" for statistical significance.
To outweigh incriminating words, the spammers would need to dilute
their emails with especially innocent words, i.e. those that are not merely
neutral but occur disproportionately often in the user's legitimate email.
But these words (e.g. the nicknames of one's friends, terms one uses
in one's work) are different for each recipient, and the spammers have
no way of figuring out what they are.
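Here is a made-up illustration of why dilution fails. The score function
just repeats the combining rule from the sketch above, and every
probability in it is invented.

    from math import prod

    def score(probs, n=15):
        # combine the n probabilities furthest from a neutral 0.5
        top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
        return prod(top) / (prod(top) + prod(1 - p for p in top))

    incriminating = [0.99, 0.98, 0.97, 0.96, 0.95] * 3   # fifteen spammy tokens
    print(score(incriminating))                  # ~1.0: clearly spam
    print(score(incriminating + [0.5] * 1000))   # unchanged: the neutral words
                                                 # never make the cut
    print(score(incriminating + [0.01] * 10))    # ~0, but these are the
                                                 # recipient's own innocent
                                                 # words, which spammers can't
                                                 # guess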
Once this software was available, couldn't spammers just tune their spams
to get through it?
For one thing, they couldn't necessarily tune their
emails and still say what they wanted to say. If they wanted to send you to a url that
is known to the filters, for example, they would find it hard to
tune their way around that.
Second, tune using what? Each user's filters will be different, and the innocent
words will vary especially. At most, spammers will be able to dilute
their mails with merely neutral words, and those will not tend to be much use
because they won't be among the fifteen most interesting.
If the spammers did try to get most of the incriminating words
out of their messages, they
would all have to use different euphemisms, because if they
all started saying "adolescents" instead of "teens", then "adolescents"
would start to have a high spam probability.
Finally, even if spammers worked to get all the incriminating words out of the
message body, that wouldn't be enough, because in a typical spam a
lot of the incriminating words are in the headers.
What if spammers sent their messages as images?
They already do sometimes, and we are able to catch them.
Such emails include a lot of damning content, actually. The headers,
to start with, will be as bad as ever. And remember that we scan
html as well as text. Within the message body there
will probably be a link as well as the image,
both containing urls, which would probably score high. "Href" and "img"
themselves both have spam probabilities approaching those of pornographic words.
Within the url, the image has to have some kind of name, and these are
usually far from random.
Can your program deal with the spam trick of inserting html comments
in the middle of words?
Yes, I ignore html comments down at the
level of scanning tokens, not even considering them
as separators.
Spammers sometimes use randomly generated tags to break
up tokens, since html rendering software will usually ignore
meaningless tags. I do allow these to separate tokens, and it
works fine. Broken bits of words simply get high spam probabilities.
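As a rough illustration (a sketch of the idea, not the actual tokenizer,
and the exact set of constituent characters here is just a guess):

    import re

    def tokens(text):
        # Strip html comments first, so they can never split a word;
        # then let any remaining tags act as token separators.
        text = re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL)
        text = re.sub(r"<[^>]*>", " ", text)
        return re.findall(r"[A-Za-z0-9$'-]+", text)

    print(tokens("via<!-- x -->gra"))   # ['viagra']
    print(tokens("te<xqzt>ens"))        # ['te', 'ens'], broken bits that
                                        # soon acquire high spam probabilities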
Would this article itself be filtered out by your filters?
No. Someone sent it to me as a test, and it wasn't.
Although the article contains a lot of
spam words, it also contains a lot of even more provably innocent
words (like "spammers" ironically, which occurs in a lot of my
real email, and never in my spam). Since only the most
interesting words count, the innocent words
crowded out the spam words in the rankings, and this mail
ended up with the minimum possible probability.
That makes sense, because an article someone writes is more likely
to resemble the content of their own email than it would a spam, even
if the article happens to be about spam.
How well can your software filter out mail that's about spam?
This is a problem for anyone who works on spam filtering. In the extreme
case, if someone you've never heard from merely forwards you a spam
with no additional comment, it's going to be hard to filter out. In that case
it will all come down to the headers.
But after all, if someone could forward you spam with
impunity, then spammers could forward you spam too. So for
these situations we may have to have a special password
that people could include in mail to get it by the filters, and
also (I have one already) a special trash folder for metaspam
so that it doesn't contribute to the probabilities.
If there is a good side to this, it's that if we can create filters that work
acceptably well for us, they'll work even better for everyone else.
Will this software filter out good automated email
like order receipts and newsgroup faqs?
Mostly not. That kind of mail may be automated, but
the text usually has a very different character from spam.
The only automated responses that tend to get filtered out are the
ones that contain a lot of additional advertising. In effect, these
emails consist of an automated response with a spam appended to the
end, so it is not surprising that filters catch them.
Will the kind of people who would respond to a spam use filters?
I think so. I'm guessing here, but I suspect that people
stupid enough
to respond to a spam will often get email through one of the free services
like Yahoo Mail or Hotmail, or through big ISPs like AOL or Earthlink.
Once word spreads that it is possible to
filter out most spam, they'll be forced to offer effective filters.
If filters catch most spam, won't spammers just send more to compensate?
Spammers already operate at capacity.
How do you feel about blacklists?
As ISPs use them now, they're equivalent to very bad filters.
But the information they supply could be incorporated into Bayesian filtering.
Doesn't Bayes' Rule assume the predictors are independent?
Yes, strictly speaking it is only valid assuming the probability of e.g.
"sex" occurring in an email is unrelated to the probability of "sexy"
occurring in it. Obviously that is not the case.
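(In symbols, the assumption is that within each class the words occur
independently: P("sex" and "sexy" | spam) = P("sex" | spam) × P("sexy" | spam),
and likewise for legitimate mail.)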
On the other hand, there is a long tradition of violating
this requirement. When you do that it's called a "naive Bayesian" algorithm
and in practice it works pretty well, just as in practice (if you stay away
from the edges of precision) it works pretty well to treat floating point
numbers as if they were reals.
Does this algorithm take into account the prior probability of a mail being a spam?
No. It assumes it's equally likely for a mail to be a spam or not. So accuracy
could be slightly improved by taking into account the actual proportion of spam
received by each user. You would have to do this by hour, though, because
the proportion varies so much depending on the time of day that the
overall prior probability is useless as a predictor.
Why do you consider 15 words? What's special about 15?
Actually, in the latest version of the filter I use 20, because marking
tokens by context means the software now sees more distinct tokens per mail.
Nearly all spams tend to have that many fairly incriminating words
in them. If you
look at more than that, you'll start to include neutral words,
which are noise at best. But you want to look at as many words as you can,
because legitimate emails may well contain two or three words with high
spam probabilities-- if you used only the top 5 words, for example,
you'd start to get false positives.
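Here is a made-up example of the tradeoff, using the same combining rule
as the sketches above and invented probabilities: a legitimate mail that
quotes a few spammy words gets misclassified if you look at only the top
5 tokens, but not if you look at the top 15.

    from math import prod

    def score(probs, n):
        top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
        return prod(top) / (prod(top) + prod(1 - p for p in top))

    # a legitimate mail with three spammy-looking words and many mildly
    # innocent ones
    probs = [0.99, 0.98, 0.96] + [0.15] * 30
    print(score(probs, 5))    # ~0.9997, a false positive
    print(score(probs, 15))   # ~0.0001, correctly looks innocent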
Will this work for languages other than English?
As long as they use whitespace to separate tokens, it should.