bogofilter [ help options | classification options | registration options ] [algorithm options] [general options]
where
help options are:
[-h] [-V] [-Q]
classification options are:
[-p] [-e] [-t] [-u] [-2] [-3] [-M] [-b] [-B filename ...] [-F] [-R] [algorithm options] [general options] [parameter options]
registration options are:
[ -s | -n ] [ -S | -N ] [algorithm options] [general options]
general options are:
[-c filename] [-C] [-d dir] [-l] [-L tag] [-I filename] [-O filename]
algorithm options are:
[ -g | -r | -f ]
parsing options are:
[-Pi/-PI] [-Ph/-PH] [-Pt/-PT]
parameter options are:
[-m [value] [,value]] [-o [value] [,value]]
info options are:
[-q] [-v] [-y date] [-D] [-x flags]
Bogofilter is a Bayesian spam filter. In its normal mode of operation, it takes an email message or other text on standard input, does a statistical check against lists of "good" and "bad" words, and returns a status code indicating whether or not the message is spam. Bogofilter is designed with fast algorithms, uses Berkeley DB for fast startup and lookups, is coded directly in C, and is tuned for speed, so it can be used in production by sites that process a lot of mail.
Bogofilter treats its input as a bag of tokens. Each token is checked against "good" and "bad" wordlists, which maintain counts of the number of times it has occurred in non-spam and spam mails. These numbers are used to compute the probability that a mail in which the token occurs is spam. After probabilities for all input tokens have been computed, a fixed number of the probabilities that deviate furthest from average are combined using Bayes's theorem on conditional probabilities. If the computed probability that the input is spam exceeds a cutoff determined at compile time (currently 0.95, for the Robinson-Fisher algorithm), bogofilter returns 0, otherwise 1.
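As a rough illustration of the scoring just described, the following Python sketch estimates a per-token spam probability from the two counts, keeps the probabilities that deviate furthest from 0.5, and combines them with the naive-Bayes product from Graham's essay. The function names, the limit of 15 tokens, the clamping bounds and the value 0.4 for unseen tokens are illustrative assumptions, not bogofilter's actual code or constants.

```python
from math import prod

def token_probability(spam_count, good_count, spam_msgs, good_msgs, unknown=0.4):
    """Estimate P(spam | token) from wordlist counts (illustrative only)."""
    if spam_count + good_count == 0:
        return unknown  # token never seen; the default value is an assumption
    spam_freq = spam_count / max(spam_msgs, 1)
    good_freq = good_count / max(good_msgs, 1)
    p = spam_freq / (spam_freq + good_freq)
    return min(0.99, max(0.01, p))  # clamp so no single token is decisive

def combine(probabilities, keep=15):
    """Combine the `keep` most extreme probabilities with Bayes's theorem."""
    extreme = sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
    spam = prod(extreme)
    ham = prod(1.0 - p for p in extreme)
    return spam / (spam + ham)

def exit_status(score, cutoff=0.95):
    """Return 0 for spam and 1 for non-spam, as described in the text."""
    return 0 if score > cutoff else 1
```

Tokens with a probability near 0.5 carry little information, which is why only the most extreme ones take part in the product.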
While this method sounds crude compared to the more usual pattern-matching approach, it turns out to be extremely effective. Paul Graham's paper A Plan For Spam is recommended reading.
This program substantially improves on Paul's proposal by doing smarter lexical analysis. In particular, hostnames and IP addresses are retained as recognition features rather than broken up. Various kinds of MTA cruft such as dates and message-IDs are discarded so as not to bloat the word lists. Lex's Swiss-army-knife nature rises again.
Another seeming improvement is that this program offers Gary Robinson's suggested modifications (S and f(w) but not g(w)) to the calculations. These modifications are described in Robinson's paper Spam Detection.
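For readers who want to see the shape of these modifications, here is a minimal Python sketch of f(w) and S as they appear in Robinson's essay: f(w) pulls the raw token probability p(w) towards a prior x with strength s, and S compares the geometric means of the f(w) and 1 - f(w) values. The default values of s, x and min_dev below are placeholders chosen for the example, and the function names are hypothetical; this is not bogofilter's source.

```python
from math import exp, log

def f_w(p_w, n, s=1.0, x=0.5):
    """Robinson's f(w): p(w) shrunk towards prior x, given n observations."""
    return (s * x + n * p_w) / (s + n)

def robinson_s(fw_values, min_dev=0.0):
    """Robinson's S in [0, 1], using only tokens at least min_dev from 0.5."""
    used = [f for f in fw_values if abs(f - 0.5) >= min_dev]
    if not used:
        return 0.5  # no usable evidence: neutral score (an assumption)
    n = len(used)
    # Geometric means are computed in log space to avoid underflow.
    p = 1.0 - exp(sum(log(1.0 - f) for f in used) / n)
    q = 1.0 - exp(sum(log(f) for f in used) / n)
    return (1.0 + (p - q) / (p + q)) / 2.0
```

A token that has never been seen (n = 0) gets f(w) = x, so rare words cannot dominate the score.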
Since then, Robinson and others have realized that the S calculation can be further optimized: if a vector of length n contains random, uniformly-distributed probabilities p, then -2 * sum(ln(p)) is distributed as chi-squared with 2n degrees of freedom. This is believed to be the most sensitive test of the hypothesis that the vector of probabilities is, in fact, uniformly distributed. Bogofilter now offers the option of applying this test (known as Fisher's method) to yield P(spam) and P(not spam), and using the difference as the "spamicity" score.
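The chi-squared tail probability has a closed form when the number of degrees of freedom is even, so Fisher's method can be sketched in a few lines of Python. The code below is an illustration under stated assumptions: the function names are hypothetical, and mapping the difference of the two indicators into the range 0 to 1 with (1 + a - b) / 2 is one plausible reading of "using the difference", not necessarily bogofilter's exact formula.

```python
from math import exp, log

def chi2_survival(x2, df):
    """P(X >= x2) for a chi-squared variable with an even number df of degrees of freedom."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    m = x2 / 2.0
    term = exp(-m)          # may underflow to 0.0 for very large x2
    total = term
    for i in range(1, df // 2):
        term *= m / i       # builds exp(-m) * m**i / i!
        total += term
    return min(total, 1.0)

def fisher_spamicity(probs):
    """Combine per-token probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5  # neutral score for no evidence (an assumption)
    spam_ind = 1.0 - chi2_survival(-2.0 * sum(log(1.0 - p) for p in probs), 2 * n)
    ham_ind = 1.0 - chi2_survival(-2.0 * sum(log(p) for p in probs), 2 * n)
    return (1.0 + spam_ind - ham_ind) / 2.0
```

When the token probabilities look uniformly random, both indicators are similar and the score stays near 0.5; strongly spammy or hammy tokens push it towards 1 or 0.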
The input may be one message or many. Messages are broken up on "From " lines. The algorithm is relatively insensitive to message miscounts.
Without command-line options, bogofilter returns 1 if the message is non-spam, 0 if it is spam. The non-spam wordfile is created if absent.
HELP OPTIONS
The -h option prints the help message and exits.
The -V option prints the version number and exits.
The -Q (query) option prints bogofilter's configuration, i.e. registration parameters, parsing options, bogofilter directory, etc.
CLASSIFICATION OPTIONS
The -p (passthrough) option writes a copy of the input mail to the output with an X-Bogosity header (in the style of SpamAssassin) inserted. The header will begin with "Yes" or "No" according to whether the mail is judged to be spam or non-spam. Note: memory consumption depends on whether the input is a regular file that allows seek operations. If it is, the file will be rewound and read a second time, without using much memory. If the input is not a regular file (for example, a pipeline or socket), bogofilter will cache a copy of the entire mail in memory.
The -e (embed) option tells bogofilter to exit with code 0 even if the mail is not spam. This simplifies using bogofilter from procmail or maildrop.
The -t (terse) option tells bogofilter to print an abbreviated spamicity message containing 1 letter and the score. The letter will be "Y" to indicate spam and "N" to indicate non-spam.
The -u option tells bogofilter to register the message's text after classifying it as spam or non-spam. A spam message will be registered on the spamlist and a non-spam message on the goodlist. If using the Robinson-Fisher method and the classification is "unsure", the message will not be registered. Effectively this option runs bogofilter with the -s or -n flag, as appropriate. (Caution is urged in the use of this capability, as any classification errors bogofilter may make will be preserved and accumulated until corrected with the -Sn and -Ns option combinations.)
The -2 option tells bogofilter to binary classify the message as either ham or spam, and never as unsure. When this option is used with -u, a wordlist is always updated.
The -3 option tells bogofilter to use tristate classification for the message, i.e. classify the message as ham, spam, or unsure. This option is effective only if ham_cutoff is non-zero.
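The difference between binary and tristate classification can be sketched as a comparison against two cutoffs. In this Python illustration the function name, the example ham_cutoff of 0.10 and the choice of strict versus inclusive comparisons are assumptions; only the 0.95 spam cutoff and the rule that a zero ham_cutoff disables the "unsure" state come from the text.

```python
def classify(score, spam_cutoff=0.95, ham_cutoff=0.0, tristate=True):
    """Return "spam", "ham" or "unsure" for a spamicity score in [0, 1]."""
    if score > spam_cutoff:
        return "spam"
    # "unsure" exists only in tristate mode with a non-zero ham cutoff.
    if tristate and ham_cutoff > 0.0 and score > ham_cutoff:
        return "unsure"
    return "ham"

def should_register(label):
    """With -u, an "unsure" message is not added to either wordlist."""
    return label in ("spam", "ham")
```

For example, a score of 0.5 is "unsure" with ham_cutoff=0.10, but "ham" once tristate is switched off, which mirrors the statement that -2 combined with -u always updates a wordlist.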
The -M option tells bogofilter to process its input as an mbox-formatted file. If the -v or -t option is also given, a spamicity line will be printed for each message.
The -b (streaming bulk mode) option tells bogofilter to classify multiple messages whose names are read from stdin. If the -v or -t option is also given, bogofilter will print a line giving file name and classification information for each file.
The -B filename ... (bulk mode) option tells bogofilter to classify multiple messages named as files on the command line. If the -v or -t option is also given, bogofilter will print a line giving file name and classification information for each file.
The -F (force) option ignores threshold values when printing spamicity statistics.
The -R option tells bogofilter to output an R data frame in text form on the standard output. See the section on integration with R, below, for further detail.
REGISTRATION OPTIONS
The -s option tells bogofilter to register the text presented on standard input as spam. The spam wordfile is created if absent.
The -n option tells bogofilter to register the text presented on standard input as non-spam.
Bogofilter doesn't detect if a message is registered twice. If you do this by accident, the token counts will be off by 1 from what you really want and the corresponding spam scores will be slightly off. Given a large number of tokens and messages in the wordlists, this doesn't matter. The problem _can_ be corrected by using the -S option or the -N option.
The -S option tells bogofilter to undo a prior registration of the same message as spam. If a message was incorrectly entered in the spam wordfile by '-s' or '-u' and you want to remove it from the spam wordfile and enter it in the non-spam wordfile, use options '-Sn'. If '-S' is used for a message that wasn't registered as spam, the counts will still be decremented.
The -N option tells bogofilter to undo a prior registration of the same message as non-spam. If a message was incorrectly entered in the non-spam wordfile by '-n' or '-u' and you want to remove it from the non-spam wordfile and enter it in the spam wordfile, then use '-Ns'. If '-N' is used for a message that wasn't registered as non-spam, the counts will still be decremented.
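The bookkeeping behind registration and its undo operations can be pictured with two in-memory counters. This Python sketch is a simplified model, not bogofilter's database code: the Wordlist class is hypothetical, counting each token once per message is an assumption, and so is clamping counts at zero when a message that was never registered is removed.

```python
from collections import Counter

class Wordlist:
    """Toy stand-in for a spam or non-spam wordlist."""

    def __init__(self):
        self.counts = Counter()
        self.messages = 0

    def register(self, tokens):
        """Like -s or -n: add one message's tokens."""
        self.counts.update(set(tokens))
        self.messages += 1

    def unregister(self, tokens):
        """Like -S or -N: decrement, even if the message was never registered."""
        for token in set(tokens):
            self.counts[token] = max(self.counts[token] - 1, 0)
        self.messages = max(self.messages - 1, 0)

def move_spam_to_good(spam, good, tokens):
    """Equivalent in spirit to '-Sn': undo a spam registration, register as non-spam."""
    spam.unregister(tokens)
    good.register(tokens)
```

Because nothing records which messages were registered, calling unregister for the wrong message silently skews the counts, which is the caveat the text describes.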
GENERAL OPTIONS
The -c filename option tells bogofilter to read the named config file.
The -C option prevents bogofilter from reading configuration files.
The -d dir option allows you to set the directory under which wordlists will be found to dir. If omitted, the default directory will be $BOGOFILTER_DIR if BOGOFILTER_DIR is set and $HOME/.bogofilter otherwise.
The -l option writes an informational line to the system log each time bogofilter is run. The information logged depends on how bogofilter is run.
The -L tag option configures a tag which can be included in the information being logged by the -l option, but it requires a custom format that includes the %l string for now. This option implies -l.
The -I filename option tells bogofilter to read its input from the specified file, rather than from stdin.
The -O filename option tells bogofilter where to write its output in passthrough mode. Note that this only works when -p is explicitly given.
ALGORITHM OPTIONS
The Robinson-Fisher method is the default algorithm used for computing a message's spamicity score, unless bogofilter has been compiled without it, by using the --disable-robinson-fisher option to the configure script. The method to be used can be specified on the command line or in the configuration file.
The -g option selects the original Graham form of the calculation method.
The -r option selects the Robinson modifications to the calculation method.
The -f option selects the Robinson-Fisher modifications to the calculation method.
The configure script has options --disable-graham-method, --disable-robinson-method, and --disable-robinson-fisher so that bogofilter can be built to support a subset of the available methods.
PARSING OPTIONS
Bogofilter has three special parsing options which can be enabled (or disabled) at the user's discretion. The options are of the form -Px and -PX where x designates an option letter. For the parsing options, a lower case letter enables the option and an upper case letter disables it.
Options -Ph and -PH are for header line markup, i.e. whether to create special tags for header lines. When enabled, tokens in "To:", "From:", "Return-Path:", and "Subject:" lines will be given special prefixes. Enabling this option increases bogofilter's accuracy.
Options -Pi and -PI are for ignoring case, i.e. whether to map upper case to lower case (or not). Disabling this option increases bogofilter's accuracy.
Options -Pt and -PT are for tokenizing the innards of 3 html tags, i.e. <a>, <img>, and <font>. Tokenizing these tags adds URLs and font names to the message's tokens. Enabling this option increases bogofilter's accuracy.
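To make the effect of header markup and case folding concrete, here is a small Python sketch that tokenizes one header line. It is only a model: the prefix strings in HEADER_PREFIXES, the token pattern and the function name are hypothetical choices for illustration, and bogofilter's real lexer is considerably more elaborate.

```python
import re

# Hypothetical prefixes; the text only says that "special prefixes" are used.
HEADER_PREFIXES = {
    "to": "to:",
    "from": "from:",
    "return-path": "rtrn:",
    "subject": "subj:",
}

# Words, plus dotted or hyphenated runs such as hostnames and IP addresses.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")

def tokenize_header_line(line, header_tags=True, fold_case=False):
    """Tokenize one header line, modelling -Ph/-PH and -Pi/-PI."""
    prefix = ""
    name, sep, rest = line.partition(":")
    key = name.strip().lower()
    if header_tags and sep and key in HEADER_PREFIXES:
        prefix = HEADER_PREFIXES[key]
        line = rest  # the header name itself is replaced by the prefix
    if fold_case:
        line = line.lower()
    return [prefix + token for token in TOKEN_RE.findall(line)]
```

With tagging enabled, "Free" in a Subject line and "Free" in the body become different tokens, so each can accumulate its own spam and non-spam counts.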
PARAMETER OPTIONS
The -m [value][,value] option allows setting the min_dev value and, optionally, the robs value. If one value is supplied, then min_dev is set. If a comma followed by one value is supplied, then robs is set. With two values, both min_dev and robs are set. Note that the syntax is misleading: at least one of the values MUST be present, and the comma determines whether min_dev or robs is set. Note: spaces are not allowed after the comma.
The -o [value][,value] option allows setting the spam_cutoff value and, optionally, the ham_cutoff value. If one value is supplied, then spam_cutoff is set. If a comma followed by one value is supplied, then ham_cutoff is set. With two values, both spam_cutoff and ham_cutoff are set. Note that the syntax is misleading: at least one of the values MUST be present, and the comma determines whether the spam or the ham cutoff is set. Note: spaces are not allowed after the comma.
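The value syntax shared by -m and -o is easy to misread, so the following Python sketch spells out the three accepted shapes. The function name is hypothetical, and the parser models only the rules stated above, not bogofilter's own argument handling.

```python
def parse_value_pair(arg):
    """Parse "a", ",b" or "a,b" into (first, second); a missing value is None."""
    first, sep, second = arg.partition(",")
    if sep and second[:1].isspace():
        raise ValueError("spaces are not allowed after the comma")
    if not first.strip() and not second.strip():
        raise ValueError("at least one value must be present")
    return (
        float(first) if first.strip() else None,
        float(second) if second.strip() else None,
    )
```

For -o, the first element would be spam_cutoff and the second ham_cutoff; for -m, min_dev and robs. A result of None means "leave the current setting unchanged".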
INFO OPTIONS
The -q (quiet) option suppresses warning messages.
The -v option produces a report to standard output on bogofilter's analysis of the input. Each additional v will increase the verbosity of the output, up to a maximum of 4. With -vv, the report lists the tokens with highest deviation from a mean of 0.5 association with spam.
The -y date option specifies the date to give to tokens that don't have dates.
The -D option redirects debug output to stdout.
The -x flags option allows setting of debug flags for printing debug information.
Bogofilter will initialize its database directory to $BOGOFILTER_DIR if BOGOFILTER_DIR is set. If it is not set, bogofilter will use $HOME/.bogofilter instead. If neither BOGOFILTER_DIR nor HOME is set, the -d dir option must be present.
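The lookup order for the wordlist directory can be written down as a short function. This Python sketch simply restates the rules above; the function name and the error raised when nothing is set are illustrative assumptions.

```python
import os
import posixpath

def wordlist_dir(cli_dir=None, environ=None):
    """Resolve the wordlist directory: -d dir, then BOGOFILTER_DIR, then HOME."""
    env = os.environ if environ is None else environ
    if cli_dir:
        return cli_dir
    if env.get("BOGOFILTER_DIR"):
        return env["BOGOFILTER_DIR"]
    if env.get("HOME"):
        return posixpath.join(env["HOME"], ".bogofilter")
    raise ValueError("neither BOGOFILTER_DIR nor HOME is set; -d dir is required")
```

Passing a dictionary as environ makes the function easy to exercise without touching the real environment, for example wordlist_dir(None, {"HOME": "/home/alice"}).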
The bogofilter command line allows setting of many options that determine how bogofilter operates. File /usr/local/etc/bogofilter.cf can be used to set additional parameters that affect its operation. File /usr/local/etc/bogofilter.cf.example has samples of all of the parameters. Status and logging messages can be customized for each site (see /usr/local/etc/bogofilter.cf.example).
0 for spam; 1 for non-spam; 2 for I/O or other errors.
If both -p and -e are used, the return values are: 0 for spam or non-spam; 2 for I/O or other errors.
Error 2 usually means that the wordlist files bogofilter wants to read at startup are missing or the hard disk has filled up in -p mode.
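A calling script has to treat these return values with care, because 0 means "spam" rather than "success". The Python sketch below shows one way to interpret them; the function names and the "unknown" label for values not listed above are assumptions made for the example. The classify_message helper only runs if bogofilter is installed and on the PATH.

```python
import subprocess

def interpret_exit_code(code, embed=False):
    """Map an exit code to a label; embed=True models the -e option."""
    if code == 2:
        return "error"
    if embed:
        # With -e, spam and non-spam both exit with 0.
        return "classified" if code == 0 else "unknown"
    if code == 0:
        return "spam"
    if code == 1:
        return "non-spam"
    return "unknown"

def classify_message(raw_message):
    """Feed one message (bytes) to bogofilter on standard input."""
    result = subprocess.run(["bogofilter"], input=raw_message)
    return interpret_exit_code(result.returncode)
```

Treating "error" separately from "non-spam" lets a delivery script defer the mail instead of silently accepting it.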
Use with Procmail
The following procmail rule will take mail on stdin and direct it to Mail/spam if bogofilter thinks it's spam:
:0HB:
* ? bogofilter
Mail/spam
and this similar rule will also register the tokens in the mail according to the bogofilter classification:
:0HB:
* ? bogofilter -u
Mail/spam
If bogofilter fails (returning 2) the message will be treated as non-spam.
The following recipe (a) spam-bins anything that bogofilter rates as spam, (b) adds the words in messages rated as spam to the spam wordlist, and (c) adds the words in messages rated as non-spam to the non-spam wordlist. With this in place, it will normally only be necessary for the user to intervene (with -Ns or -Sn) when bogofilter miscategorizes something.
# filter mail through bogofilter, tagging it as spam and
# updating the word lists
:0fw
| bogofilter -u -e -p

# if bogofilter failed, return the mail to the queue, the MTA will
# retry to deliver it later
# 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h
:0e
{ EXITCODE=75 HOST }

# file the mail to spam-bogofilter if it's spam.
:0:
* ^X-Bogosity: Yes, tests=bogofilter
spam-bogofilter

This one is for maildrop; it automatically defers the mail and retries later when the xfilter command fails. Use this in your ~/.mailfilter:

xfilter "bogofilter -u -e -p"
if (/^X-Bogosity: Yes, tests=bogofilter/)
{
  to "spam-bogofilter"
}
The following .muttrc lines will create mutt macros for dispatching mail to bogofilter.
macro index d "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -n\n\
<enter-command>set wait_key\n\
<delete-message>" "delete message as non-spam"
macro index \ed "<enter-command>unset wait_key\n\
<pipe-entry>bogofilter -s\n\
<enter-command>set wait_key\n\
<delete-message>" "delete message as spam"
Integration with Mail Transport Agent (MTA)
bogofilter can also be integrated into an MTA to filter all incoming mail. While the specific implementation is MTA dependent, the general steps are as follows:
Install bogofilter on the mail server.
Prime the bogofilter databases with a spam and non-spam corpus. Since bogofilter will be serving a larger community, it is important to prime it with a representative set of messages.
Set up the MTA to invoke bogofilter on each message. While this is an MTA specific step, you'll probably need to use the -p, -u, and -e options.
Set up a mechanism for users to register spam/nonspam messages, as well as to correct mis-classifications. The most generic solution is to set up alias email addresses to which users bounce messages.
See the doc and contrib directories for more information.
Use of R to verify Bogofilter calculations
The -R option tells bogofilter to generate an R data frame. The data frame contains one row per token analysed. Each such row contains the token, the sum of its database "good" and "spam" counts, the "good" count divided by the number of non-spam messages used to create the training database, the "spam" count divided by the spam message count, Robinson's f(w) for the token, the natural logs of (1 - f(w)) and f(w), and an indicator character (+ if the token's f(w) value exceeded the minimum deviation from 0.5, - if it didn't). There is one additional row at the end of the table that contains a label in the token field, followed by the number of words actually used (the ones with + indicators), Robinson's P, Q, S, s and x values and the minimum deviation.
The R data frame can be saved to a file and later read into an R session (see the R project website for information about the mathematics package R). Provided with the bogofilter distribution is a simple R script (file bogo.R) that can be used to verify bogofilter's calculations. Instructions for its use are included in the script in the form of comments.
Bogofilter writes messages to the system log when the -l option is used. What is written depends on which other flags are used.
A classification run will generate (we are not showing the date and host part here):
bogofilter[1412]: X-Bogosity: No, spamicity=0.000227
bogofilter[1415]: X-Bogosity: Yes, spamicity=0.998918
Using '-u' to classify a message and update a wordlist will produce (on a single line):
bogofilter[1426]: X-Bogosity: Yes, spamicity=0.998918, register -s, 329 words, 1 messages
Registering words ('-l' and '-s', '-n', '-S', or '-N') will produce:
bogofilter[1440]: register-n, 255 words, 1 messages
A registration run (using '-s', '-n', '-N', or '-S') will generate messages like:
bogofilter[17330]: register-n, 574 words, 3 messages
bogofilter[6244]: register-s, 1273 words, 4 messages
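Log lines of the shapes shown above can be picked apart with a regular expression, for instance to build site statistics. This Python sketch handles only the sample formats in this section (classification, classification with update, and registration); the pattern, the function name and the dictionary keys are assumptions, and a site that customizes its log format would need a different pattern.

```python
import re

LOG_RE = re.compile(
    r"bogofilter\[(?P<pid>\d+)\]:\s*"
    r"(?:X-Bogosity:\s*(?P<verdict>\w+),\s*spamicity=(?P<score>[0-9.]+))?"
    r"(?:,?\s*register\s*-(?P<flag>[snSN]),\s*(?P<words>\d+) words,"
    r"\s*(?P<msgs>\d+) messages)?"
)

def parse_log_line(line):
    """Return a dict describing a bogofilter log line, or None if unrecognised."""
    m = LOG_RE.search(line)
    if not m or (m.group("verdict") is None and m.group("flag") is None):
        return None
    return {
        "pid": int(m.group("pid")),
        "verdict": m.group("verdict"),
        "score": float(m.group("score")) if m.group("score") else None,
        "register": m.group("flag"),
        "words": int(m.group("words")) if m.group("words") else None,
        "messages": int(m.group("msgs")) if m.group("msgs") else None,
    }
```

The pattern uses search rather than match so that the date and host prefix added by syslog does not need to be described.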
System configuration file.
User configuration file.
List of good tokens.
List of spam tokens.
bogofilter counts messages on input by looking for "From " lines. As a special case, a single message without a "From " line is counted correctly. Multiple messages without intervening "From " lines will be counted as one message.
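The counting rule can be modelled in a few lines of Python. The function below is an illustration of the behaviour described here, with a hypothetical name; how bogofilter treats text that precedes the first "From " line is not stated, so that case is not modelled.

```python
def count_messages(text):
    """Count messages by "From " separator lines, as described above."""
    lines = text.splitlines()
    if not any(line.strip() for line in lines):
        return 0  # empty input contains no message
    separators = sum(1 for line in lines if line.startswith("From "))
    # A lone message without a "From " line still counts as one.
    return max(separators, 1)
```

Two messages concatenated without a "From " line between them yield a count of 1, which is the miscount the paragraph warns about.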
Bogofilter does not canonicalize the transport encoding or character set, sacrificing precision. We used to believe that spam with enclosures invariably gives itself away through cues in the headers and non-enclosure parts, but this is not true. This will be fixed in a future version.