Official Versions: In English or French
Maintainer: David Relson <relson@osagesoftware.com>
This document is intended to answer frequently asked questions about bogofilter.
Bogofilter is a fast Bayesian spam filter along the lines suggested by Paul Graham in his article A Plan For Spam. Bogofilter uses Gary Robinson's geometric-mean algorithm with the Fisher's method modification to classify email as spam or non-spam.
The bogofilter home page at SourceForge is the central clearinghouse for bogofilter resources.
Bogofilter was started by Eric S. Raymond on August 19, 2002. It gained popularity in September 2002, and a number of other authors have started to contribute to the project.
The NEWS file describes bogofilter's version history.
Bogofilter is a kind of bogometer or bogon filter; that is, it tries to identify bogus mail by measuring its bogosity.
There are currently four mailing lists for bogofilter:
List Address | Links | Description |
---|---|---|
bogofilter-announce@aotto.com | [subscribe] [archive] | An announcement-only list where new versions are announced. |
bogofilter@aotto.com | [subscribe] [archive] | A discussion list where any conversation about bogofilter may take place. |
bogofilter-dev@aotto.com | [subscribe] [archive] | A list for sharing patches, development, and technical discussions. |
bogofilter-cvs@lists.sourceforge.net | [subscribe] [archive] | Mailing list for announcing code changes to the CVS archive. |
Bogofilter can be instructed to display information on the scoring of a message by running it with flags "-v", "-vv", "-vvv", or "-R".
X-Bogosity: No, tests=bogofilter, spamicity=0.500000
X-Bogosity: No, tests=bogofilter, spamicity=0.500000
 int  cnt   prob   spamicity histogram
0.00   29 0.000209 0.000052 #############################
0.10    2 0.179065 0.003425 ##
0.20    2 0.276880 0.008870 ##
0.30   18 0.363295 0.069245 ##################
0.40    0 0.000000 0.069245
0.50    0 0.000000 0.069245
0.60   37 0.667823 0.257307 #####################################
0.70    5 0.767436 0.278892 #####
0.80   13 0.836789 0.334980 #############
0.90   32 0.984903 0.499835 ################################
Each row shows an interval, the count of tokens with scores in that interval, the average spam probability for those tokens, the message's spamicity score (for those tokens and all lesser valued tokens), and a bar graph corresponding to the token count.
In the above histogram there are many low-scoring tokens and many high-scoring tokens. They "balance" one another to give the spamicity score of 0.500000.
X-Bogosity: No, tests=bogofilter, spamicity=0.500000
                     n    pgood     pbad       fw    U
"which"             10  0.208333  0.000000  0.000041 +
"own"                7  0.145833  0.000000  0.000059 +
"having"             6  0.125000  0.000000  0.000069 +
...
"unsubscribe.asp"    2  0.000000  0.095238  0.999708 +
"million"            4  0.000000  0.190476  0.999854 +
"copy"               5  0.000000  0.238095  0.999883 +
N_P_Q_S_s_x_md     138  0.00e+00  0.00e+00  5.00e-01
                        1.00e-03  4.15e-01  0.100
The columns printed contain the following information:
The final lines show:
The "-R" output is formatted for use with the R language for statistical computing. More information is available at The R Project for Statistical Computing.
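As a cross-check on the "-vvv" listing above, the sketch below recomputes the fw column. It rests on assumptions that are not stated in this FAQ: that fw is Robinson's f(w) = (s*x + n*p) / (s + n), that p = pbad / (pgood + pbad), and that s and x are the 1.00e-03 and 4.15e-01 values in the final lines of the listing. The function name robinson_fw is made up for illustration.

```python
def robinson_fw(n, pgood, pbad, s=1.0e-3, x=0.415):
    """Robinson's f(w): pull the raw token probability toward x when n is small.

    Assumption: s and x are the values printed as 1.00e-03 and 4.15e-01
    in the final lines of the "-vvv" listing.
    """
    total = pgood + pbad
    # Raw probability that a message containing the token is spam;
    # fall back to x for a token that has never been seen.
    p = pbad / total if total > 0 else x
    return (s * x + n * p) / (s + n)


# Rows copied from the listing: (token, n, pgood, pbad)
rows = [
    ("which", 10, 0.208333, 0.000000),
    ("own", 7, 0.145833, 0.000000),
    ("having", 6, 0.125000, 0.000000),
    ("unsubscribe.asp", 2, 0.000000, 0.095238),
    ("million", 4, 0.000000, 0.190476),
    ("copy", 5, 0.000000, 0.238095),
]

for token, n, pgood, pbad in rows:
    print(f"{token:18s} {robinson_fw(n, pgood, pbad):.6f}")
```

Under these assumptions the computed values agree with the fw column to six decimal places, for example 0.000041 for "which" and 0.999708 for "unsubscribe.asp".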
If you have a working SpamAssassin installation (or care to create one), you can use its return codes to train bogofilter. The easiest way is to create a script for your MDA that runs SpamAssassin, tests the spam/non-spam return code, and runs bogofilter to register the message as spam (or non-spam). The sample procmail recipe below shows one way to do this:
BOGOFILTER     = "/usr/bin/bogofilter"
BOGOFILTER_DIR = "training"
SPAMASSASSIN   = "/usr/bin/spamassassin"

# 'c' hands bogofilter a copy so the message continues to the recipes below
:0c
* ? $SPAMASSASSIN -e
#spam yields non-zero
#non-spam yields zero
| $BOGOFILTER -n -d $BOGOFILTER_DIR
#else (E)
:0Ec
| $BOGOFILTER -s -d $BOGOFILTER_DIR

:0fw
| $BOGOFILTER -p -e

:0:
* ^X-Bogosity:.Yes
spam

:0:
* ^X-Bogosity:.No
non-spam
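The decision made by the first two recipes can also be written out as a small function. This is only an illustrative sketch: the function name is hypothetical, the paths and the "training" directory are the ones from the recipe above, and the code builds the command line without running anything.

```python
def bogofilter_training_command(spamassassin_exit_code,
                                bogofilter="/usr/bin/bogofilter",
                                wordlist_dir="training"):
    """Return the argv that registers a message based on SpamAssassin's verdict.

    As in the recipe, "spamassassin -e" exiting with zero means non-spam
    and any non-zero exit means spam.
    """
    flag = "-n" if spamassassin_exit_code == 0 else "-s"
    return [bogofilter, flag, "-d", wordlist_dir]


# Example: SpamAssassin exited non-zero, so the message is registered as spam.
print(bogofilter_training_command(1))
```

A real MDA script would pipe the message to this command on standard input, just as the procmail recipe does.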
Many people get unsolicited email that uses Asian-language charsets. Since they don't know the languages and don't know anyone who writes in them, they assume it's spam.
The good news is that bogofilter detects such messages quite successfully. The bad news is that this can be expensive in terms of database size. You have several options:
You can simply let bogofilter handle it. Just train bogofilter with the Asian-language messages identified as spam. Bogofilter will parse the messages as best it can and will add tokens to the spam wordlist. The wordlist will contain many tokens which don't make sense to you (since the charset cannot be displayed), but bogofilter can work with them and successfully identify Asian-language spam.
A second method is to use the "replace_nonascii_characters" config file option. This will replace high-bit characters, i.e. those between 0x80 and 0xFF, with question marks, '?'. This keeps the database much smaller. Unfortunately, this conflicts with European languages, which have many accented vowels and consonants in the high-bit range.
If you are sure you will not receive any legitimate messages in those languages, you can filter them out before bogofilter runs. This also keeps the database smaller. You can do this with an MDA script.
Here's a procmail recipe that will sideline messages written with Asian charsets:
## Silently drop all asian language mail
UNREADABLE='[^?"]*big5|iso-2022-jp|ISO-2022-KR|euc-kr|gb2312|ks_c_5601-1987'

:0:
* 1^0 $ ^Subject:.*=\?($UNREADABLE)
* 1^0 $ ^Content-Type:.*charset="?($UNREADABLE)
spam-unreadable

:0:
* ^Content-Type:.*multipart
* B ?? $ ^Content-Type:.*^?.*charset="?($UNREADABLE)
spam-unreadable
With the above recipe, bogofilter will never see the message.
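For readers who filter mail from a script rather than from procmail, the same test can be approximated with regular expressions. The sketch below is a simplification, not a translation of the recipe: it reuses the charset names from UNREADABLE, matches case-insensitively (as procmail does by default), and looks for a charset declaration anywhere in the text instead of only in Content-Type headers. The function name looks_unreadable is hypothetical.

```python
import re

# Charset names taken from the UNREADABLE variable in the recipe.
UNREADABLE = r"(?:big5|iso-2022-jp|iso-2022-kr|euc-kr|gb2312|ks_c_5601-1987)"

# Subject containing an encoded word in one of the charsets, e.g. "=?big5?B?...".
SUBJECT_RE = re.compile(r"^Subject:.*=\?" + UNREADABLE,
                        re.IGNORECASE | re.MULTILINE)

# A charset declaration (top-level header or MIME part) naming one of the charsets.
CHARSET_RE = re.compile(r'charset="?' + UNREADABLE, re.IGNORECASE)


def looks_unreadable(message_text):
    """Return True if the raw message text declares one of the listed charsets."""
    return bool(SUBJECT_RE.search(message_text)
                or CHARSET_RE.search(message_text))


sample = 'Subject: hello\nContent-Type: text/plain; charset="GB2312"\n\nbody\n'
print(looks_unreadable(sample))
```

Because the charset check is not anchored to a header, this version can also fire on a message that merely quotes such a declaration in its body; a stricter script would parse the MIME structure first.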
To find the spam and ham counts for a token (word) use bogoutil's '-w' option. For example, "bogoutil -w $BOGOFILTER_DIR example.com" gives the good and bad counts for "example.com".
If you want the spam score in addition to the spam and ham counts for a token (word), use bogoutil's '-p' option. For example, "bogoutil -p $BOGOFILTER_DIR example.com" gives the good and bad counts and the spam score for "example.com".
To find out how many messages are in your wordlists, query the special token .MSG_COUNT, i.e., run the command "bogoutil -w $BOGOFILTER_DIR .MSG_COUNT" to see the counts for the spam and ham word lists.
To tell how many tokens are in your wordlists, pipe the output of bogoutil's dump command to "wc", i.e., use "bogoutil -d $BOGOFILTER_DIR/spamlist.db | wc -l" to display the count for the spamlist and "bogoutil -d $BOGOFILTER_DIR/goodlist.db | wc -l" to display the count for the goodlist.
If you think your word lists are hosed, you can see what BerkeleyDB thinks by running:
db_verify spamlist.db
db_verify goodlist.db
If there is a problem, you may be able to recover some (or all) of the tokens and their counts with the following commands:
bogoutil -d spamlist.db | bogoutil -l spamlist.db.new
or with
db_dump -r spamlist.db > spamlist.txt
db_load spamlist.new < spamlist.txt
If you don't already have a v3.0+ version of BerkeleyDB, then download it, unpack it, and do these commands in the db directory:
$ cd build_unix
$ sh ../dist/configure
$ make
# make install
Next, download a portable version of bogofilter.
On Solaris
Unpack it, and then do:
$ ./configure --with-db=/usr/local/BerkeleyDB.4.1
$ make
# make install-strip
You will either want to put a symlink to libdb.so in /usr/lib, or use a modified LD_LIBRARY_PATH environment variable before you start bogofilter.
$ LD_LIBRARY_PATH=/usr/lib:/usr/local/lib:/usr/local/BerkeleyDB.4.1
Note that some make versions shipped with Solaris break when you try to build bogofilter outside of its source directory. Either build in the source directory (as suggested above) or use GNU make (gmake).
On FreeBSD
The FreeBSD ports and packages carry very recent versions of bogofilter. This approach uses the highly recommended portupgrade and cvsup software packages. To install these two fine pieces, type (you need to do this only once):
# pkg_add -r portupgrade cvsup
To install or upgrade bogofilter, just update your ports tree using cvsup, then type:
# portupgrade -N bogofilter
On HP-UX
See the file doc/programmer/README.hp-ux in the source distribution.
If you are only reading from them, there are no problems. When you are updating them, you need to use the correct file locking to avoid data corruption. When you compile bogofilter, verify that the configure script has set "#define HAVE_FCNTL 1" in your config.h file. Popular UNIX operating systems all support this. If you are running an unusual or older operating system, make sure it supports fcntl(). If "#define HAVE_FCNTL 1" is set, then comment out "#define HAVE_FLOCK 1" so that the locking system uses fcntl() locking instead of the default flock() locking. If your system does not support fcntl(), then you will not be able to share word list files over NFS without the risk of data corruption.
Next, make sure you have NFS set up properly, with "lockd" running. Refer to your NFS documentation for more information about running "lockd" or "rpc.lockd". Most operating systems with NFS turn this on by default.
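To illustrate what fcntl()-style locking looks like from a program's point of view, here is a minimal sketch. It is not bogofilter's own locking code: the file is a plain text stand-in rather than a BerkeleyDB word list, and the function name is hypothetical. Python's fcntl.lockf() is a wrapper around the fcntl() locking calls, whereas fcntl.flock() would use flock() instead.

```python
import fcntl
import os
import tempfile


def append_with_fcntl_lock(path, data):
    """Append bytes to path while holding an exclusive fcntl()-style lock."""
    with open(path, "ab") as handle:
        # Blocks until no other process holds a conflicting lock.
        fcntl.lockf(handle, fcntl.LOCK_EX)
        try:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the update to disk before unlocking
        finally:
            fcntl.lockf(handle, fcntl.LOCK_UN)


# Demonstration on a scratch file (a stand-in, not a real word list).
with tempfile.TemporaryDirectory() as scratch:
    demo_path = os.path.join(scratch, "wordlist-demo.txt")
    append_with_fcntl_lock(demo_path, b"example.com 1\n")
    append_with_fcntl_lock(demo_path, b"example.org 2\n")
    with open(demo_path, "rb") as handle:
        demo_contents = handle.read()

print(demo_contents)
```

The lock is advisory: it only protects against other writers that take the same kind of lock, which is why every program updating the shared files needs to agree on fcntl() locking.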
Most likely you are looking at the raw status reported by waitpid(2), which encodes the return code together with other information. Use the WEXITSTATUS(status) macro from sys/wait.h, or a comparable macro, to extract the correct value.
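The same decoding step exists in scripting languages that expose the raw wait status. The sketch below uses Python's os.WIFEXITED and os.WEXITSTATUS, which correspond to the C macros; a shell that simply exits with code 1 stands in for the filter, and the helper name is made up for illustration.

```python
import os


def exit_code_from_wait_status(raw_status):
    """Decode a raw wait()/waitpid() status into the child's exit code."""
    if os.WIFEXITED(raw_status):
        return os.WEXITSTATUS(raw_status)
    return None  # the child was killed by a signal, so there is no exit code


# On Unix, os.system() returns the status in the format used by wait().
raw_status = os.system("exit 1")
exit_code = exit_code_from_wait_status(raw_status)
print("raw status:", raw_status, "decoded exit code:", exit_code)
```

On typical Unix systems the raw status for an exit code of 1 is 256, which is the kind of unexpected value a caller sees when the decoding step is skipped.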
With version 0.11 bogofilter's options for registering mail as ham or spam have been changed. They now allow registering (or unregistering) messages in the ham and spam word lists. Prior to this, there was no way to unregister a message from a word list (without registering it in the other word list).
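Bogofilter has four registration options: '-s', '-n', '-S', and '-N'. With the release of version 0.11, the meaning of '-S' and '-N' has been changed to allow unregistering messages from the word lists. Here's what the four options mean:
'-s' registers the message as spam.
'-n' registers the message as non-spam (ham).
'-S' unregisters the message from the spam word list.
'-N' unregisters the message from the ham word list.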
Prior to version 0.11, the '-S' option was used to move a message from the ham word list to the spam word list, i.e. there were two actions. Now with 0.11 each of the two actions is invoked by its own option. To get the same effect as the old '-S', you should use '-N -s' (or '-Ns' which means the same thing).
Similarly, the old '-N' option is now '-Sn' (or '-S -n').
MDA scripts typically use '-s' and '-n' and don't need to change. Other scripts which use '-S' and '-N' for fixing registration errors do need to be changed.
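For scripts that build their bogofilter command line programmatically, the change can be captured in a small translation table. This is a hedged sketch: the helper name is hypothetical, and it only rewrites '-S' and '-N' when they appear as standalone arguments, not when combined with other letters in one argument.

```python
# Pre-0.11 correction options and their 0.11 equivalents, as described above.
OLD_TO_NEW = {
    "-S": ["-N", "-s"],  # old '-S': move the message from the ham list to the spam list
    "-N": ["-S", "-n"],  # old '-N': move the message from the spam list to the ham list
}


def upgrade_registration_args(args):
    """Rewrite a pre-0.11 argument list so it has the same effect under 0.11."""
    upgraded = []
    for arg in args:
        upgraded.extend(OLD_TO_NEW.get(arg, [arg]))
    return upgraded


print(upgrade_registration_args(["bogofilter", "-S"]))
print(upgrade_registration_args(["bogofilter", "-n"]))
```

Arguments such as '-s' and '-n' pass through unchanged, matching the note that typical MDA scripts need no modification.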