April 17, 2003
Bayesian spam zapping with bogofilter

It's now been a week since I started using bogofilter, a Bayesian spam-catching affair by ESR, to filter out the 11-odd spam messages I get per day. I had previously been using an elaborate procmail system called SpamBouncer, which works reasonably well, but blocks some BigPond users (Telstra being a major source of spam), and is generally hard to update.

So far bogofilter has worked very well, with no false positives, and only a few misses. The best part was that I got to use my lovingly hoarded spam collection to 'train' the filter:

cat Mail/spam.incoming | formail -s bogofilter -s

formail, and in particular formail -s <cmd>, is extremely useful for mbox tinkering. The -s option splits an incoming mbox stream and runs <cmd> once for every message; in this case, telling bogofilter to register each message as spam.
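For the curious, here is a rough sketch of what that splitting amounts to, written in plain shell and awk. split_mbox is a name I made up for illustration, and it assumes every message starts with a "From " separator line and that body lines beginning with "From " have been escaped; the real formail is considerably more careful than this.

```shell
# split_mbox MBOX CMD [ARGS...]
# Hypothetical helper: run CMD once per message in MBOX, with the
# message on standard input. Only an approximation of "formail -s".
split_mbox() {
    sm_mbox=$1
    shift
    sm_dir=$(mktemp -d) || return 1
    # Start a new numbered file at every "From " separator line.
    awk -v dir="$sm_dir" '
        /^From / { if (out) close(out); n++; out = sprintf("%s/%06d", dir, n) }
        out      { print > out }
    ' "$sm_mbox"
    # Feed each message file to the command, in order.
    for sm_msg in "$sm_dir"/*; do
        [ -f "$sm_msg" ] || continue
        "$@" < "$sm_msg"
    done
    rm -rf "$sm_dir"
}
```

With that defined, split_mbox Mail/spam.incoming bogofilter -s would do roughly the same job as the formail pipeline above, only slower.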

In the same way, one tells bogofilter what isn't spam:

cat "$MAIL" | formail -s bogofilter -n

bogofilter is now trained and can be used to filter incoming mail. The man page has a sample procmail recipe that positively reinforces whatever decision is made, so the filter is constantly adapting. When bogofilter lets a spam mail through, this can be rectified with (in mutt) '|bogofilter -Ns', which undoes the non-spam registration and registers the message as spam instead. Bouncing the message to oneself then tests the change.
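The reinforcement idea can be sketched outside procmail too. This is not the man page's recipe, just a minimal illustration: classify_and_reinforce is a hypothetical name, and it assumes bogofilter exits with 0 for spam and 1 for non-spam, treating anything else as "leave it alone". Check the exit codes documented for your version before trusting it.

```shell
# classify_and_reinforce FILE
# Hypothetical helper: classify one message, then register it under
# whatever verdict bogofilter reached. Assumes exit status 0 = spam
# and 1 = non-spam; any other status is not reinforced.
classify_and_reinforce() {
    bogofilter < "$1"
    case $? in
        0) bogofilter -s < "$1"; echo spam ;;
        1) bogofilter -n < "$1"; echo ham ;;
        *) echo unsure-or-error ;;
    esac
}
```

Run on a single saved message, it prints the verdict and feeds the same message back in as training data, which is the self-reinforcing behaviour described above.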

Oh, and if bogofilter seems... really uncannily good at classifying existing mail, check that your previous spam software hasn't added custom headers to each mail. Real spammers are very inconsiderate, and don't add headers like 'X-SBClass: spam', so it's no good training on such emails ;)
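A quick way to check for this before training is to count the tell-tale header lines in the mbox. count_tagged is a made-up helper name, and the grep is only an approximation, since it will also count matching lines that happen to sit in a message body.

```shell
# count_tagged MBOX [HEADER]
# Hypothetical helper: count lines in MBOX that look like HEADER
# (default X-SBClass). Body lines that match are counted too.
count_tagged() {
    grep -c -i "^${2:-X-SBClass}:" "$1"
}

# If the count is non-zero, formail can delete the field from each
# message on its way to bogofilter (-I with a bare field name removes it):
#   formail -I "X-SBClass:" -s bogofilter -s < Mail/spam.incoming
```

A result of 0 from count_tagged Mail/spam.incoming means the old filter left no such marks behind, at least under that header name.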

Posted by jefft at April 17, 2003 10:19 PM