Linux E-mail: Customizing SpamAssassin

by Alistair McDonald | November 2009 | Linux Servers Open Source

In this article series by Alistair McDonald, we cover the use of SpamAssassin in conjunction with Procmail to filter out the wide range of spam that afflicts the modern e-mail user.

In this article, we will learn:

  • How to customize SpamAssassin and update its rulesets automatically to keep your system's spam detection well tuned.
  • How to integrate spam filtering with virus recognition using amavisd.

SpamAssassin is very configurable. Almost every setting can be configured on a system-wide or user-specific basis.

Reasons to customize

If SpamAssassin is so good, then why configure it? Well, there are several reasons why it's worth improving spam filtering with SpamAssassin.

  • SpamAssassin by default (that is, when installed but not customized) typically manages to detect over 80% of spam. After adding a few customizations, the detection rate can be greater than 95%.
  • Everyone's spam is different and one user's spam might look like another user's ham. By trying to be general, SpamAssassin may fail to filter spam for every user.
  • Some of the features of SpamAssassin are disabled by default. By enabling them, the spam recognition rate is increased.

The following configuration options are discussed in this article:

  • Altering the scores for rules: This allows rules to be disabled, poor rules to be given less weight, and better rules to be given a higher weight.
  • Obtaining and using new rules: This can improve spam detection.
  • Adding e-mail addresses to white and blacklists: This allows the e-mail from specified senders to always be treated as ham, no matter what the content is, or the opposite.
  • Enabling SpamAssassin's Bayesian filter: This can increase filtering accuracy from 80% to 95% or more.

Rules and scores

The configuration files for standard, sitewide, and user-specific settings are saved in different directories as follows:

  • Standard configuration settings are stored in /usr/share/spamassassin.
  • Site-wide customizations and settings are stored in /etc/mail/spamassassin/. All files matching *.cf are examined by SpamAssassin.
  • User-specific settings are stored in ~/.spamassassin/user_prefs.

The bulk of the standard configuration files is devoted to simple rules and their scores.

A rule is typically a match for letters, numbers, or other printing characters. Rules are written using a technique called regular expressions, or regex for short. This is a shorthand method of specifying that certain combinations of characters will trigger the rule. A rule might try to detect a particular word, such as "Rolex", or it might look for particular words in certain orders, such as "buy Rolex online". The rules are stored in text files.

Default files are stored in /usr/share/spamassassin. These are files that are shipped with SpamAssassin and may change with each release. It's best not to modify these files or place new files in this directory, as an upgrade to SpamAssassin will overwrite these files. Most of the rules that SpamAssassin uses, and the scores applied to each rule, are defined within files in this directory.

The defaults can be overwritten by sitewide configuration files. These are placed in /etc/mail/spamassassin. SpamAssassin will read all files matching *.cf in this directory. Settings made here can overrule those in the default files. They can include defining new rules and new rule scores.

User-specific customizations can be placed in the ~/.spamassassin/user_prefs file. Settings made here can override sitewide settings defined in /etc/mail/spamassassin, and default settings in /usr/share/spamassassin/. New rules may be defined here, and scores for existing rules can be overridden.

SpamAssassin first reads all the files in /usr/share/spamassassin in alphanumerical order; 10_misc.cf will be read before 23_bayes.cf. SpamAssassin then reads all the .cf files in /etc/mail/spamassassin/, again in alphanumeric order. Finally, SpamAssassin reads ~user/.spamassassin/user_prefs. If a rule or score is defined in two files, the setting in the last file read is used. This allows the administrator to override the defaults and a user to override the sitewide settings.
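
The "last file read wins" behaviour can be approximated from the shell. The following is a minimal sketch, not how SpamAssassin itself resolves settings: effective_score is a hypothetical helper that is given the files in read order and prints the last score line it finds for a rule. It ignores conditional blocks and every directive other than score, so treat the result as a hint only.

```shell
# effective_score RULE FILE...
# Hypothetical helper: FILEs must be listed in the order SpamAssassin reads
# them; the last "score RULE ..." line found is the one in effect.
effective_score() {
    rule=$1
    shift
    cat "$@" 2>/dev/null |
        grep -E "^[[:space:]]*score[[:space:]]+${rule}[[:space:]]" |
        tail -n 1
}

# Example call; shell globs expand in sorted order, which normally matches
# the alphanumeric read order described above:
#   effective_score RULE_NAME /usr/share/spamassassin/*.cf \
#       /etc/mail/spamassassin/*.cf ~/.spamassassin/user_prefs
```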

Each line in a rules file can be blank or contain a comment or a command. The hash or pound (#) symbol is used for comments. Rules generally have three parts, the rule definition, a textual description, and the score or series of scores. Convention dictates that all rule scores for rules provided by SpamAssassin should be located together in a separate file. That file is /usr/share/spamassassin/50_scores.cf.
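
To make the three parts concrete, here is a hedged sketch of a small local rule file, written from the shell. The rule name LOCAL_BUY_ROLEX, the file name local_rules.cf, the score of 1.5, and the RULES_DIR variable are all made up for illustration. RULES_DIR defaults to a scratch directory so that nothing under /etc/mail/spamassassin is touched until you choose to point it there.

```shell
# Write an example rule file. All names below are illustrative assumptions.
RULES_DIR=${RULES_DIR:-$(mktemp -d)}

cat > "$RULES_DIR/local_rules.cf" <<'EOF'
# Rule definition: search the message body for a phrase, ignoring case
body     LOCAL_BUY_ROLEX  /\bbuy rolex online\b/i
# Textual description, shown in reports
describe LOCAL_BUY_ROLEX  Body contains the phrase "buy Rolex online"
# Score added to the message total when the rule fires
score    LOCAL_BUY_ROLEX  1.5
EOF

echo "Example rules written to $RULES_DIR/local_rules.cf"
```

After copying such a file into /etc/mail/spamassassin, running spamassassin --lint checks the configuration for syntax errors.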

Altering rule scores

The simplest configuration change is to change a rule score. There are two reasons why this might be done:

  • A rule is very good at detecting spam, but the rule has a low score. E-mails that fire the rule are not being detected as spam.
  • A rule fires on legitimate e-mail (ham). As a result, e-mails that fire the rule are wrongly being detected as spam.

The rules that give a positive result when SpamAssassin is run are listed in the X-Spam-Status: header of the e-mail:

X-Spam-Status: Yes, score=5.8 required=5.0 tests=BAYES_05,HTML_00_10,
HTML_MESSAGE,MPART_ALT_DIFF autolearn=no
version=3.1.0-r54722

The rules applied to the e-mail are listed after tests=. If one continually appears in e-mail that should be marked as spam, but isn't, then the score for the rule should be increased. If a rule often fires in e-mail that is wrongly classified as spam, the score should be decreased.
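
When many messages need to be checked, it helps to pull the rule names out of the header mechanically. The sketch below defines list_fired_rules, a hypothetical helper that reads one X-Spam-Status header on standard input and prints one rule name per line. It assumes rule names consist of upper-case letters, digits, and underscores, as in the example header above.

```shell
# list_fired_rules: read an X-Spam-Status header (possibly folded over
# several lines) on stdin and print one rule name per line.
list_fired_rules() {
    # 1. Remove all whitespace so that folded lines are joined.
    # 2. Keep the comma-separated names that follow "tests=".
    # 3. Split the list on commas.
    tr -d ' \t\n' |
        sed -n 's/.*tests=\([[:upper:][:digit:]_,]*\).*/\1/p' |
        tr ',' '\n'
}
```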

To find the current score, use the grep utility in all the locations where a score can be defined.

grep "score.*RULE_NAME" /usr/share/spamassassin/* /etc/mail/spamassassin/* \
    ~/.spamassassin/user_prefs

/usr/share/spamassassin/50_scores.cf:score RULE_NAME 0 0 1.665 2.599
/etc/mail/spamassassin/local_scores.cf:score RULE_NAME 4.34

In the previous example, the rule has a default score that is overridden in /etc/mail/spamassassin/local_scores.cf.

The original score for the rule had four values. SpamAssassin changes the scores it uses, depending on whether network tests (for example, those that test open relays) are in use and whether the Bayesian Filter is in use. Four scores are listed, which are used in the following circumstances:

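  • First score: used when neither the Bayesian filter nor network tests are in use.
  • Second score: used when network tests are in use but the Bayesian filter is not.
  • Third score: used when the Bayesian filter is in use but network tests are not.
  • Fourth score: used when both the Bayesian filter and network tests are in use.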

If only one score is given, as overridden in /etc/mail/spamassassin/local_scores.cf, it is used in all circumstances.
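
The selection can be sketched as a small shell function. This is an illustration only, assuming the ordering given in the SpamAssassin configuration documentation (neither feature, network tests only, Bayesian filter only, both). pick_score and its argument order are hypothetical and are not part of SpamAssassin.

```shell
# pick_score BAYES NET SCORE...
# BAYES and NET are 1 when the Bayesian filter or network tests are in use,
# 0 otherwise. Hypothetical helper for illustration only.
pick_score() {
    bayes=$1
    net=$2
    shift 2
    # A single score applies in all circumstances.
    if [ "$#" -eq 1 ]; then
        echo "$1"
        return 0
    fi
    # Position in the list: 0 = neither, 1 = network tests only,
    # 2 = Bayesian filter only, 3 = both.
    shift $(( bayes * 2 + net ))
    echo "$1"
}

pick_score 1 1 0 0 1.665 2.599   # Bayes and network tests: prints 2.599
pick_score 1 0 0 0 1.665 2.599   # Bayes only: prints 1.665
pick_score 0 0 4.34              # single score: prints 4.34
```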

In the previous example, the system administrator has overridden the default score from /usr/share/spamassassin/50_scores.cf with a single value in /etc/mail/spamassassin/local_scores.cf. To change this value for a particular user, their ~/.spamassassin/user_prefs might read:

score RULE_NAME 1.2

This changes the score used from 4.34, set in /etc/mail/spamassassin/local_scores.cf, to 1.2. To disable the rule entirely, the score can be set to zero.

score RULE_NAME 0
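
Editing score lines by hand can leave several conflicting entries for the same rule in one file. The following sketch shows one way to keep a single entry per rule. set_rule_score is a hypothetical helper, not a SpamAssassin tool, and it only understands plain "score RULE value" lines.

```shell
# set_rule_score FILE RULE SCORE
# Replace any existing "score RULE ..." line in FILE with a single new value.
# Note: the rewritten file takes the restrictive permissions of the
# temporary file created by mktemp.
set_rule_score() {
    file=$1
    rule=$2
    score=$3
    tmp=$(mktemp) || return 1
    # Keep every line except earlier score lines for this rule.
    if [ -f "$file" ]; then
        grep -v -E "^[[:space:]]*score[[:space:]]+${rule}([[:space:]]|\$)" "$file" > "$tmp"
    fi
    printf 'score %s %s\n' "$rule" "$score" >> "$tmp"
    mv "$tmp" "$file"
}

# Example: disable RULE_NAME for the current user
#   set_rule_score ~/.spamassassin/user_prefs RULE_NAME 0
```

Running spamassassin --lint afterwards checks that the configuration still parses.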

Endless hours can be spent configuring rule scores. SpamAssassin includes tools to recalculate optimal rule scores, by examining existing e-mails, both spam and non spam. They are covered in detail in the book SpamAssassin published by Packt.

Using other rulesets

SpamAssassin has a large following, and the design of SpamAssassin has made it easy to add new rulesets, which are sets of rules and default scores for those rules. There are many different rulesets available. Most are based on a particular theme, for example finding the names of drugs often sold with spam or telephone numbers found in spam e-mails. Most custom rulesets are listed on the Custom Rulesets page of the SpamAssassin Wiki at http://wiki.apache.org/spamassassin/CustomRulesets.

As the battle against spam is so aggressive, some rulesets are updated as often as daily. SpamAssassin provides the sa-update utility to fetch such updates. You can run sa-update on a regular basis, download a particular ruleset once and keep it, or manually update the rulesets that you choose. To obtain the best results in filtering spam, use of sa-update is recommended.

If you wish to install rulesets manually, the Wiki page gives a general description of each ruleset and a URL to download it. Once a ruleset has been chosen, we install it as follows:

  1. In a browser, follow the link on the SpamAssassin Wiki page. In most cases, the link will be to a file with a name matching *.cf, and a browser will open it as a text file.
  2. Save the file using the browser (normally, the File menu has a Save as option).
  3. Copy the file to /etc/mail/spamassassin—the rules will be automatically run if the file is placed in this location.
  4. Check that the file has scores in it, otherwise the rules will not be used.
  5. Monitor spam performance to ensure that legitimate e-mail is not being detected as spam.

Adding rules to SpamAssassin will increase the memory used by SpamAssassin, and the time that it takes to process e-mails. It is best to be cautious and add new rulesets gradually, to ensure that the effect on the machine is understood.

You may manually monitor the ruleset and update it on your system using the same process.

If you choose to use sa-update, you should plan your use of it. sa-update can use several channels, which are basically sources of rulesets. By default, the channel updates.spamassassin.org is used; another popular channel is the OpenProtect channel, called saupdates.openprotect.com.

To enable sa-update, it must be run regularly, for example via cron. Add a cron entry to your system calling the following command, to update the base rulesets:

sa-update

If you use an additional channel, the command might look like:

sa-update --channel saupdates.openprotect.com

To protect against DNS poisoning and impersonation, SpamAssassin allows digital signing of rulesets. To use a signed ruleset, use the --gpgkey parameter to sa-update. The correct value to use with the --gpgkey parameter will be described in the SpamAssassin wiki page for the ruleset.
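
A cron job usually needs to know whether anything changed. As a sketch, and assuming the exit codes documented for sa-update (0 when new rules were installed, 1 when no update was available, anything else on error), a wrapper might look like the following. The function name, the SA_UPDATE variable, the script path, and the schedule are all hypothetical.

```shell
# run_rule_update [SA-UPDATE OPTIONS...]
# Run sa-update and report the outcome. SA_UPDATE is a hypothetical
# override for the command name, useful for dry runs.
run_rule_update() {
    "${SA_UPDATE:-sa-update}" "$@"
    status=$?
    case $status in
        0) echo "new rules installed; restart spamd if you run it" ;;
        1) echo "no new rules available" ;;
        *) echo "sa-update failed with exit code $status" >&2 ;;
    esac
    return "$status"
}

# Example crontab line (script path and schedule are assumptions):
#   30 3 * * * /usr/local/sbin/update-sa-rules
# where that script defines the function above and calls, for example:
#   run_rule_update
#   run_rule_update --channel saupdates.openprotect.com
```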


Whitelists and blacklists

SpamAssassin is very good at detecting spam, but there is always a risk of errors. By using a list of e-mail addresses that are known spam producers (a blacklist), e-mails from spammers who consistently use the same e-mail addresses or domains can be filtered out. With a list of e-mail addresses that are legitimate e-mail senders (a whitelist), e-mails from regular or important correspondents are guaranteed to be filtered as ham. This prevents the delay or non delivery of important e-mails that may otherwise be marked as spam.

Blacklists that list individual e-mail addresses have limited use—spammers normally use different or random e-mail addresses for each spam run. However, some spammers use the same domain for multiple runs. As SpamAssassin allows wildcards in its blacklisting, entire domains can be blacklisted. This is more useful for filtering out spam.

Manual whitelisting and blacklisting involves adding configuration directives to the global configuration file /etc/mail/spamassassin/local.cf and/or in ~/.spamassassin/user_prefs.

The whitelist and blacklist entries allow the ? and * characters to be used to match a single character or many characters respectively. So, if a whitelist entry read *@domain.com, then joe@domain.com and bill@domain.com would both match. For an entry that read *@yahoo?.com, joe@yahoo1.com and bill@yahoo2.com would match, but billy@yahoo22.com would not match. *@yahoo*.com would match all three examples.
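
Shell patterns use ? and * in the same way, so the examples above can be tried out before they are added to a configuration file. This is only an approximation: address_matches is a hypothetical helper, and shell patterns also give special meaning to square brackets, which the whitelist and blacklist entries described here do not.

```shell
# address_matches PATTERN ADDRESS
# Succeeds when ADDRESS matches PATTERN, using shell globbing as a stand-in
# for SpamAssassin's own matching.
address_matches() {
    pattern=$1
    address=$2
    case $address in
        $pattern) return 0 ;;
        *)        return 1 ;;
    esac
}

for addr in joe@yahoo1.com bill@yahoo2.com billy@yahoo22.com; do
    if address_matches '*@yahoo?.com' "$addr"; then
        echo "$addr matches"
    else
        echo "$addr does not match"
    fi
done
```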

The whitelist and blacklist rules do not immediately cause an e-mail to be tagged as spam or ham, even though the scores are heavily weighted. The default score for the USER_IN_WHITELIST rule is -100.0. It is technically possible that an e-mail may match a whitelist entry and still trigger enough other tests to result in it being marked as spam. In practice, this is unlikely to occur unless the scores have been changed from the defaults.

To blacklist an e-mail address or whole domain, use the blacklist_from directive.

blacklist_from user@spammer.com
blacklist_from *@spamdomain.com

To whitelist an e-mail address or domain, use the whitelist_from directive.

whitelist_from user@mycompany.com
whitelist_from *@mytradingpartner.com

SpamAssassin has more complex rules for managing white and blacklists, as well as an automatic whitelist/blacklist. Both blacklists and whitelists can be specified as discrete items (blacklist joe@domain.com and bill@another.com) or as wildcards (blacklist every joe, and blacklist everyone from domain.com). The wildcards are particularly powerful, and care should be taken to ensure that legitimate e-mail is not rejected.

Bayesian filtering

This uses a statistical technique to determine if an e-mail is spam, based on previous e-mails of both types. Before it will work, it needs to be trained with e-mail that is known spam and also e-mail that is known non spam. It is important that the e-mail is correctly categorized, otherwise the effectiveness of the filter will be reduced. The learning process is done on the e-mail server, and the sample e-mails should be stored in an accessible location.

The sa-learn command is used to train the Bayesian filter with e-mail messages that are known ham or spam. The SpamAssassin installation routine will have placed sa-learn in the path, normally in /usr/bin/sa-learn.

It is used on the command line and is passed a directory, file, or series of files. For this to work, the e-mail has to be stored on the server or exported from the client in a suitable format. SpamAssassin recognizes mbox format, and many e-mail clients use a compatible format. To use sa-learn, a directory or series of directories can be passed in to the command:

$ sa-learn --ham ~/.maildir/.Trash/cur/ ~/.maildir/cur
Learned from 75 message(s) (175 message(s) examined).

If the mbox format is used, the --mbox flag should be used so that SpamAssassin searches the file for more than one e-mail.

$ sa-learn --mbox --spam ~/mbox/spam ~/mbox/bad-spam
Learned from 75 message(s) (175 message(s) examined).

If SpamAssassin has already learned from an e-mail, sa-learn detects this and will not process it twice. In the example above, 100 of the 175 e-mails had been processed already and were ignored on this run. The remaining 75 e-mails had not been processed before.

If sa-learn is passed a number of messages, there may be no feedback for some time. The --showdots flag provides feedback in the form of dots (.) whenever an e-mail is processed.

$ sa-learn --spam --showdots ~/.SPAM/cur ~/.SPAM/new
.........................
Learned from 20 message(s) (25 message(s) examined).

Once SpamAssassin has learned enough e-mails, it will begin to use the Bayesian filter automatically. It can be kept up-to-date by using the auto-learn feature.
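
To see whether "enough" has been reached, the message counts can be read from the output of sa-learn --dump magic. The sketch below assumes that the dump contains lines ending in "non-token data: nspam" and "non-token data: nham" with the count in the third column, and that the default minimum is 200 messages of each type (the bayes_min_spam_num and bayes_min_ham_num settings). bayes_ready is a hypothetical helper name.

```shell
# bayes_ready [MINIMUM]
# Read "sa-learn --dump magic" output on stdin, print the number of spam and
# ham messages learned, and succeed only if both reach MINIMUM (default 200).
bayes_ready() {
    awk -v min="${1:-200}" '
        /non-token data: nspam$/ { nspam = $3 }
        /non-token data: nham$/  { nham  = $3 }
        END {
            printf "spam=%d ham=%d\n", nspam, nham
            exit !(nspam >= min && nham >= min)
        }'
}

# Usage:
#   sa-learn --dump magic | bayes_ready && echo "Bayesian filter is active"
```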

Auto-learning should not be used without additional user input. There are two reasons for this:

  • SpamAssassin occasionally gets spam detection wrong, and e-mail that is spam may be learned as an example of non spam. Auto-learning would confuse the Bayesian filter and decrease its effectiveness.
  • The score threshold that an e-mail is auto-learned at is higher than that for detection as spam. In other words, an e-mail may be detected as spam, but not auto-learned. In this case, the rest of SpamAssassin is doing a fairly good job of detecting border-line spam (those with scores close to the threshold for spam), but the Bayesian filter is not being taught the e-mails.

To use automatic learning, set the bayes_auto_learn flag to 1. This can be configured sitewide in the /etc/mail/spamassassin/local.cf file, or can be overridden in a user's ~/.spamassassin/user_prefs file. Two other configuration flags also affect auto-learning, and are the thresholds for learning ham and spam. These values are in the same units as SpamAssassin's score for each e-mail.

bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam 12.0

When auto-learning is enabled, any e-mail that is assigned a score of less than bayes_auto_learn_threshold_nonspam is learned as ham. Any e-mail that is assigned a score greater than bayes_auto_learn_threshold_spam is learned as spam.
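
The threshold logic can be modelled in a few lines. This is a simplified sketch of the rule as described here, using the example thresholds 0.1 and 12.0 as defaults. SpamAssassin applies further conditions before auto-learning that are not modelled, and autolearn_decision is a hypothetical helper name.

```shell
# autolearn_decision SCORE [HAM_THRESHOLD] [SPAM_THRESHOLD]
# Print "ham", "spam", or "no" according to the two thresholds.
autolearn_decision() {
    awk -v s="$1" -v ham="${2:-0.1}" -v spam="${3:-12.0}" 'BEGIN {
        if (s < ham)       print "ham"
        else if (s > spam) print "spam"
        else               print "no"
    }'
}

autolearn_decision -0.5   # below the ham threshold: prints ham
autolearn_decision 5.8    # tagged as spam at 5.0, but not learned: prints no
autolearn_decision 15     # above the spam threshold: prints spam
```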

It is recommended that the bayes_auto_learn_threshold_nonspam threshold is kept low (close to or below zero). This will avoid the situation where a spam e-mail that escapes detection is used as an example to train the Bayesian filter. Keeping the bayes_auto_learn_threshold_spam threshold high is to some extent a matter of choice; however, it should be above the scores of any e-mails that have been wrongly classified as spam in the past. This may occur up to a score of 10 for the default spam threshold of 5. Therefore, using an auto-learn threshold of less than 10 for spam may cause non spam to be accidentally learned as spam. If this happens, the Bayesian database will begin to lose effectiveness, and future Bayesian results will be compromised.

SpamAssassin keeps the Bayesian database in three files in the .spamassassin directory within a user's home directory. The format used is usually Berkeley DB format and the files are named as follows:

bayes_journal
bayes_seen
bayes_toks

The bayes_journal file is used as a temporary storage area. Sometimes it is not present. This file is generally relatively small, with a size of around 10 KB. The bayes_seen and bayes_toks files can each be several megabytes in size.

Other SpamAssassin features

This article has only scratched the surface of SpamAssassin's capabilities. If spam is a problem for an organization, SpamAssassin will reward further study. Some of the other features that it contains are as follows:

  • Network tests: SpamAssassin can integrate with Open Relay Databases. (The 3.x distribution contains tests for over 30 databases, although not all of them are enabled by default.) Open Relay tests do not require a fast machine or lots of RAM, and so are relatively cheap tests to use. They have a fairly successful detection rate.
  • External content databases: SpamAssassin can integrate with external content databases. These work in a participating network. All the participants send details of all the e-mails they receive to central servers. If the e-mails have been sent many times before, the e-mail is probably a spam that has been sent to many users. The services are designed so that no confidential data is sent.
  • Whitelist and blacklist: SpamAssassin includes an automatic whitelist and blacklist, which work in a similar way to the manual lists described earlier. This is particularly effective at preventing regular correspondents from having their e-mail wrongly detected as spam.
  • Creating new rules: New rules can be written and developed. Creating rules is not particularly difficult, with a little imagination and a suitable source of spam. System Administrators can rid their users of any persistent spam that fails to be detected with the default SpamAssassin rules.
  • Customizable headers: The headers that SpamAssassin adds to e-mails can be customized, and new headers can be written. SpamAssassin will also attempt to detect viruses and Trojan software, and will wrap such an e-mail in a special envelope e-mail.
  • Multiple installations: SpamAssassin can be installed on multiple machines, serving one or more e-mail servers. In very high volume e-mail systems, many spam servers may be run, each only processing spam. This leads to a high-throughput, high-availability service.
  • Customizable rule scores: SpamAssassin includes tools to customize rule scores, based on samples of the spam and legitimate e-mail received at an organization. This helps to improve the filtering rate. With SpamAssassin 3.0, the tools were improved significantly, and the procedure to perform this is much less time consuming than it was in earlier versions.

Summary

In this article, we have learnt about:

  1. How to customize SpamAssassin and update its rulesets automatically to keep your system's spam detection well tuned.
  2. How to integrate spam filtering with virus recognition using amavisd.

In this article series, you have seen how SpamAssassin can be obtained and installed. Three different methods of using SpamAssassin were presented, with suggestions on which option to choose for a particular installation.

Configuration of popular e-mail clients was also covered, namely Microsoft Outlook, Microsoft Outlook Express, and Mozilla Thunderbird.

 


About the Author :


Alistair McDonald

Alistair McDonald is a freelance IT consultant based in the UK. He has worked in IT for over 15 years and specializes in C++ and Perl development and IT infrastructure management. He is a strong advocate of open source, and has strong cross-platform skills. He prefers vim over vi, emacs over Xemacs or vim, and bash over ksh or csh.

He is very much a family man and spends as much time as possible with his family enjoying life.
