Linux E-mail: Busting Spam with SpamAssassin

Spam, or unsolicited commercial e-mail (UCE) as it is sometimes called, is the scourge of the Internet. Spam has increased relentlessly over the last ten years and now accounts for over half of all Internet bandwidth. One in six consumers has acted on spam e-mails, so there is a strong business case for keeping spam out of your users' inboxes. There are a variety of spam solutions, ranging from outsourcing your spam filtering entirely to taking no action at all. However, if you have your own e-mail server, you can add spam filtering very easily.

SpamAssassin is a very popular open source anti-spam tool. It won a Linux New Media Award-2006 as the "Best Linux-based Anti-spam Solution", and is considered by many to be the best free, open source, anti-spam tool, and better than many commercial products. In fact, several commercial products and services are based on SpamAssassin or previous versions of it.

Why filter e-mail

If you don't receive any spam, there may be no need to filter spam. However, once one spam message has been received, it is invariably followed by many more. Spammers can sometimes detect if a spam e-mail is viewed, using techniques such as Web bugs, which are tiny images in HTML e-mails that are fetched from web servers, and then know that an e-mail address is valid and vulnerable. If spam is filtered, the initial e-mail may never get seen, and consequently the spammer may not then target the e-mail address with further spam.

Despite legal efforts against spam, it is still on the increase. In Europe and the US, the recent legislation against spam (Directive 2002/58/EC and bill number S.877 respectively) has had little effect in either region.

The main reason for this is that spam is a very good business model. It is very cheap to send spam, as little as one thousandth of a cent per e-mail, and it takes a very low hit rate before a profit is made. The spammer needs to turn just one spam in a hundred thousand or so into a sale to make a profit. As a result, there are many spammers and spam is used to promote a wide range of goods. Spamming costs are also negligible because spammers use malware that makes innocent computers send spam on the spammers' behalf.

In contrast, the costs of spam to the recipient are remarkably high. Estimates have varied, from 10 cents per spam received, through 1,000 dollars per employee per year, up to a total cost of 140 billion dollars globally in 2007 alone. This cost is mainly labor—distracting people from their work by clogging their inboxes and forcing them to deal with many extra e-mails. Spam interferes with day-to-day work and can include material that is offensive to most people. Companies have a duty to protect their employees from such content. Spam filtering is a very cheap way of minimizing the costs and protecting the workforce.

Spam is a moving target

Spam isn't static. It changes on a day-to-day basis, as spammers add new methods to their arsenal and anti-spammers develop countermeasures. Due to this, the anti-spam tools that work best are those that are updated frequently. It's a similar predicament to antivirus software—virus definitions need to be updated regularly or new viruses won't be detected.

SpamAssassin is regularly updated. In addition to new releases of the software, there is a vigorous community creating, critiquing, and testing new anti-spam rules. These rules can be downloaded automatically for up-to-date protection against spam.

Let's discuss some of the measures used by SpamAssassin to fight spam:

  • Open relays: These are e-mail servers that allow spammers to send e-mails even though they are not connected to the owner of the server in any way. To counter this, the anti-spam community has developed blocklists, also known as blacklists, which can be used by anti-spam software to detect spam. Any e-mail that has passed through a server on a blocklist is treated more suspiciously than one that has not. SpamAssassin uses a number of blocklists to test e-mails.
  • Keyword filters: These are useful tools against spam. Spammers tend to repeat the same words and phrases again and again. Rules to detect these phrases are used extensively by SpamAssassin. These make up the bulk of the tests, and the user community rules mentioned previously are normally of this form. They allow specific words, phrases, or sequences of letters, numbers, and punctuation to be detected.
  • Blacklists and whitelists: These are used to list known senders of spam and sources of good e-mail respectively. E-mails from an address on a blacklist are probably spam and are treated accordingly, while e-mails from addresses on a whitelist will be less likely to be treated as spam. SpamAssassin allows the user to enter blacklists and whitelists manually, and also builds up an automatic whitelist and blacklist based on the e-mails that it processes.
  • Statistical filters: These are automated systems that give the probability that an e-mail is spam. This filtration is based on what the filter has seen previously as both spam and non-spam. They generally work by finding words that are present in one type of e-mail but not the other, and using this knowledge to determine which type a new e-mail is. SpamAssassin has a statistical filter called the Bayesian filter that can be very effective in improving detection rates.
  • Content databases: These are mass e-mail detection systems. Many e-mail servers submit details of the e-mails they receive to central servers. If the same e-mail is sent to thousands of recipients, it is probably spam. To avoid sending confidential e-mail contents to the central server, the content databases use a technique called hashing, which also lowers the amount of data sent. SpamAssassin can integrate with several content databases, notably Vipul's Razor, Pyzor, and the Distributed Checksum Clearinghouse (DCC).
  • URL blocklists: These are similar to open relay blocklists, but list the websites used by spammers. Nearly all spam e-mails contain a web address. A database of these addresses is built so that spam e-mails can be quickly detected. This is a very efficient and effective tool against spam. By default, SpamAssassin uses Spam URI Realtime BlockLists (SURBLs), without any further configuration required.
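
The hashing idea behind content databases can be sketched in a few lines of shell. This is only an illustration: the function name body_fingerprint is hypothetical, and plain SHA-256 over whitespace-normalized text is an assumption made for the example, not the signature scheme that Razor, Pyzor, or DCC actually use.

```shell
# Hypothetical helper: reduce a message body (read from stdin) to a short,
# fixed-length fingerprint. Only the fingerprint would leave the server,
# never the text itself.
body_fingerprint() {
    # Collapse every run of whitespace to a single space, then hash.
    tr -s '[:space:]' ' ' | sha256sum | cut -d ' ' -f 1
}

# Two copies of a bulk message that differ only in spacing and line breaks
# produce the same fingerprint.
a=$(printf 'Buy   cheap pills\nnow\n' | body_fingerprint)
b=$(printf 'Buy cheap pills now\n' | body_fingerprint)
[ "$a" = "$b" ] && echo "same fingerprint"
```

A central server that sees the same fingerprint reported by many different sites can conclude that the message is bulk e-mail without ever reading its contents.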

Spam filtering options

Spam can be filtered on the server or the client. The two approaches are explained next. In the first scenario, spam is filtered on the client.

[Figure: spam filtering on the client]

  1. Mail is processed by the MTA.
  2. The e-mail is then placed in the appropriate user's inbox.
  3. The e-mail client reads all new e-mail from the inbox.
  4. The e-mail client then passes the e-mail to the filter.
  5. When the filter returns the results, the client can display the valid e-mail and either discard spam or file it in a separate folder.

In this approach, the spam filtering is always done by the client and is always done when new e-mail is processed. This is often when the user is present, so he or she may either experience a delay before e-mail is visible, or see spam in the inbox for a period before the client software can filter it from view. The amount of spam filtering that can be performed on the client may be limited. In particular, the network tests such as open relay blocklists or SURBLs might be too time consuming or complex to perform on the user's PC. As spam is a moving target, updating many client PCs can become a difficult administrative task.

In the second scenario, the spam filtering is performed on the e-mail server.

[Figure: spam filtering on the e-mail server]

  1. Incoming e-mail is received by the MTA.
  2. It is then passed on to the spam filter.
  3. The results are then sent back to the MTA.
  4. Depending on the results, the MTA places the e-mail in the appropriate user's inbox (4a), or in a separate folder for spam (4b).
  5. The e-mail client accesses e-mails in the user's inbox and it can also access the spam folder if required.

This approach has several advantages:

  • The spam filtering is done when the e-mail is received, which may be any time of the day. The user is less likely to be inconvenienced by delays.
  • The server can specialize in spam filtering. It may use external services such as open relay blocklists, online content databases, and SURBLs.
  • Configuration is centralized, which will ease setup (for example, firewalls may need to be configured to use online spam tests) and also maintenance (updating of rules or software).

On the other hand, the disadvantages include:

  • A single point of failure now exists. However, with care, the e-mail system can be configured to work around a broken spam filtering service: if the service is not available, e-mail will still be delivered but spam will not be filtered.
  • All spam must be processed by one service. If this service is not scalable, large volumes of e-mail may affect mail delivery times, resulting in poor or intermittent filtering, or possibly even the loss of e-mail service.
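
The workaround for the first disadvantage can be sketched as a small wrapper that passes mail through unchanged whenever the filter cannot be run. This is a minimal sketch, not a production delivery script: the function name filter_or_passthrough and the FILTER_CMD variable are hypothetical, and it assumes the filter reads a message on standard input and writes the marked-up message to standard output, as the spamassassin command does.

```shell
# Hypothetical wrapper: run the filter, but never lose the message.
# FILTER_CMD (an assumed setting) names the filter program to run.
filter_or_passthrough() {
    msg=$(mktemp) && out=$(mktemp) || return 1
    cat > "$msg"                      # keep a copy of the original message
    if ${FILTER_CMD:-spamassassin} < "$msg" > "$out" 2>/dev/null; then
        cat "$out"                    # filter worked: pass on its output
    else
        cat "$msg"                    # filter failed: deliver unfiltered
    fi
    rm -f "$msg" "$out"
}

# Simulate a broken filter with the "false" command; the message survives.
FILTER_CMD=false
printf 'Subject: test\n\nhello\n' | filter_or_passthrough
```

Buffering the filter output in a temporary file means a filter that crashes halfway through cannot leave a truncated message in the user's inbox.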

Introduction to SpamAssassin

Spam filtering actually involves two phases—detecting the spam and then doing something with it. SpamAssassin is a spam detector and it modifies the e-mail it processes by putting in headers to mark whether it is spam. It is up to the MTA or the mail delivery agent in the e-mail system to react to the headers that SpamAssassin creates in an e-mail, to filter it out. However, it's possible that another part of the e-mail system could perform this task.
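
To illustrate the second phase, the following sketch picks a destination folder from the X-Spam-Flag header that SpamAssassin adds to spam. The function name choose_folder and the folder names are assumptions made for this example; a real MTA or mail delivery agent would use its own filtering rules instead.

```shell
# Hypothetical delivery step: read a message on stdin and print the folder
# it should be filed in. Only the header block (everything up to the first
# blank line) is inspected, so text in the body cannot fake the header.
choose_folder() {
    if sed '/^$/q' | grep -qi '^X-Spam-Flag: *YES'; then
        echo "spam"
    else
        echo "inbox"
    fi
}

# A message that has already been marked as spam is routed to "spam".
printf 'Subject: offer\nX-Spam-Flag: YES\n\nBuy now\n' | choose_folder
```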

[Figure: schematic representation of SpamAssassin]

The previous figure gives a schematic representation of SpamAssassin. At the heart of SpamAssassin is its Rules Engine, which determines which rules are called. Rules determine whether the various tests are used, including the Bayesian Filter, the network tests, and the auto-whitelists.

SpamAssassin uses various databases to do its work, and these are shown too. The rules and scores are text files. Default rules and scores are included in the SpamAssassin distribution and, as we will see, both system administrators and users can add rules or change the scores of existing rules by adding them to files in specific locations. The Bayesian filter (which is a major part of SpamAssassin, and will be covered later) uses a database of statistical data based on previous spam and non-spam e-mails. The Auto-Blacklist/Whitelist also creates its own database.

Downloading and installing SpamAssassin

SpamAssassin is slightly different from most of the software that is used in this book. It is written in a language called Perl, which has its own distribution method called CPAN (Comprehensive Perl Archive Network). CPAN is a large website of Perl software (normally, Perl modules), and the term CPAN is also the name of the software used to download those modules and install them. Though SpamAssassin is provided as a package by many Linux distributions, we strongly recommend that you install it from source rather than use a package. This way, you will get the latest version of SpamAssassin rather than the one that was current when your Linux distributor created its release.

Most Perl users will build Perl modules using CPAN and experience no difficulties. CPAN can automatically locate and install any dependencies (other components that are required to make the desired component work properly). From a Perl point of view, using CPAN to install Perl modules is like using the rpm or apt-get commands in Linux. The basics are very simple and, once a system is configured, it generally works every time.

However, learning and configuring a new way of installing software may put off some people. A SpamAssassin release is distributed in source form, but administrators of Red Hat Package Manager (RPM) based systems can easily convert the latest SpamAssassin release into rpm format and then the regular rpm command can be used to install the package. The Debian repository is updated fairly quickly when SpamAssassin is updated and the regular apt-get commands can be used to install SpamAssassin. We strongly advise you to install via apt-get, CPAN, or using the rpmbuild command as described next, in preference to using an RPM provided by a distributor.

As SpamAssassin is a Perl Module, it appears on CPAN first. In fact, it is only released when it arrives at CPAN. Users of CPAN can download the latest version of SpamAssassin literally minutes after it has been released.

Support is also easier to obtain if SpamAssassin is built from source. Some distributors make unusual decisions when creating their RPM of SpamAssassin or may modify certain default values. These make obtaining support more difficult.

RPMs also take time to be delivered. Distributors need time to build and test new versions of software before they release them, and most software packages are not updated as quickly as SpamAssassin. So, Linux distributions may not provide the latest software, and what is provided can be several versions out of date.

Using CPAN

The prerequisites for installing SpamAssassin 3.2.5 using CPAN are as follows:

  • Perl version 5.6.1 or later: Most modern Linux distributions will include this as a part of the base package.
  • Several Perl modules: The current version of SpamAssassin needs the Digest::SHA1, HTML::Parser, and the Net::DNS modules. CPAN will install these if you configure it to follow dependencies, but there are many additional Perl modules that are optional and should be installed to get the best spam detection. CPAN will issue warnings with the module names, which will enable you to identify and install them.
  • C compiler: This may not be installed by default and may have to be added using the rpm command. The compiler used will normally be called gcc.
  • Internet connection: CPAN will attempt to download the modules using HTTP or FTP, so the network should be configured to allow this.
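
Before starting, the prerequisites listed above can be checked from the shell. The following is a rough sketch: the helper name check_perl_module is hypothetical, and the script only reports what it finds without installing anything.

```shell
# Hypothetical helper: report whether a Perl module can be loaded.
check_perl_module() {
    if perl -M"$1" -e 1 2>/dev/null; then
        echo "ok $1"
    else
        echo "missing $1"
    fi
}

# Perl 5.6.1 or later ("require VERSION" fails on an older interpreter).
perl -e 'require 5.006_001' 2>/dev/null && echo "ok perl" || echo "missing perl 5.6.1 or later"

# The required modules named above.
for module in Digest::SHA1 HTML::Parser Net::DNS; do
    check_perl_module "$module"
done

# A C compiler, normally gcc.
command -v gcc >/dev/null 2>&1 && echo "ok gcc" || echo "missing gcc"
```

A module reported as missing here is not necessarily a problem, because CPAN will fetch the required modules itself if it is configured to follow dependencies.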

    Configuring CPAN

    If you've used CPAN before, you can skip to the next section, Installing SpamAssassin Using CPAN.

    If a proxy server is required for Internet traffic, CPAN (and other Perl modules and scripts) will use the http_proxy environment variable. If the proxy requires a username and password, these need to be specified using environment variables. As CPAN is normally run as root, these commands should be entered as root:

    # HTTP_proxy=
    # export HTTP_proxy
    # HTTP_proxy_user=username
    # export HTTP_proxy_user
    # HTTP_proxy_pass=password
    # export HTTP_proxy_pass

    Next, enter this command:

    # perl -MCPAN -e shell

    If the output is similar to the following, the CPAN module is already installed and configured, and you can skip to the next section, Installing SpamAssassin Using CPAN.

    cpan shell -- CPAN exploration and modules installation (v1.7601)
    ReadLine support enabled

    If the output prompts for manual configuration, as shown next, the CPAN module is installed but not configured.

    Are you ready for manual configuration? [yes]

    During configuration, the CPAN Perl module prompts for answers to around 30 questions. For most of the questions, selecting the default value is the best response. This initial configuration must be completed before the CPAN Perl module can be used. The questions are mainly about the location of various utilities, and the defaults can be chosen by pressing Enter. The only question for which we should change the default is the one about building prerequisite modules. If we configure CPAN to follow dependencies, it will install the required modules without prompting.

    Policy on building prerequisites (follow, ask or ignore)? [ask] follow

    Once CPAN is configured, exit the shell by typing exit and pressing Enter. We are now ready to use CPAN to install SpamAssassin.

Installing SpamAssassin using CPAN

To install SpamAssassin, enter the CPAN shell by typing the following command:

# cpan

If the CPAN module is correctly configured, the following output (or something similar) will appear:

cpan shell -- CPAN exploration and modules installation (v1.7601)
ReadLine support enabled

Now, at the cpan prompt, enter the following command:

cpan> install Mail::SpamAssassin

The CPAN module will query an online database to find the latest version of SpamAssassin and its dependencies, and then install them. Dependencies will be installed before SpamAssassin. The following is the sample output:

cpan> install Mail::SpamAssassin
CPAN: Storable loaded ok (v2.18)
Going to read '/root/.cpan/Metadata'
Database was generated on Mon, 03 Aug 2009 04:27:49 GMT
Running install for module 'Mail::SpamAssassin'
CPAN: Data::Dumper loaded ok (v2.121_14)
'YAML' not installed, falling back to Data::Dumper and Storable to
read prefs '/root/.cpan/prefs'
Running make for J/JM/JMASON/Mail-SpamAssassin-3.2.5.tar.gz
CPAN: Digest::SHA loaded ok (v5.45)
CPAN: Compress::Zlib loaded ok (v2.015)
Checksum for /root/.cpan/sources/authors/id/J/JM/JMASON/Mail-
SpamAssassin-3.2.5.tar.gz ok
Scanning cache /root/.cpan/build for sizes
CPAN: Archive::Tar loaded ok (v1.38)
Will not use Archive::Tar, need 1.00
.... Going to build F/FE/FELICITY/Mail-SpamAssassin-3.00.tar.gz

SpamAssassin may require the user to respond to a few questions. The responses provided might affect the module configuration or only be part of the testing performed before installation.

Going to build J/JM/JMASON/Mail-SpamAssassin-3
What e-mail address or URL should be used in the suspected-spam report
text for users who want more information on your filter installation?
(In particular, ISPs should change this to a local Postmaster contact)
default text: [the administrator of that system] postmaster@myfomain.
NOTE: settings for "make test" are now controlled using "t/config.
See that file if you wish to customise what tests are run, and how.
checking module dependencies and their versions...

SpamAssassin, as with many Perl modules, is very flexible. It can make use of features if they are available, and will work even if they are not. When using CPAN, you may see messages such as the following:

optional module missing: Mail::SPF
optional module missing: Mail::SPF::Query
optional module missing: IP::Country
optional module missing: Razor2
optional module missing: Net::Ident
optional module missing: IO::Socket::INET6
optional module missing: IO::Socket::SSL
optional module missing: Mail::DomainKeys
optional module missing: Mail::DKIM
optional module missing: DBI
optional module missing: Encode::Detect

If you install the modules mentioned, SpamAssassin will make use of them and this will improve e-mail filtering. You can abort the installation of SpamAssassin and install the modules first, using install Module::Name commands at the cpan prompt.
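
If the build output has been saved to a file, the names of the missing optional modules can be pulled out of it rather than copied by hand. This is a small sketch; the function name missing_optional_modules is hypothetical and it relies only on the message format shown above.

```shell
# Hypothetical helper: print the module name from each
# "optional module missing:" line read on stdin.
missing_optional_modules() {
    sed -n 's/^optional module missing: *//p'
}

# Example using two of the messages shown above.
printf 'optional module missing: Mail::SPF\noptional module missing: DBI\n' |
    missing_optional_modules
```

Each name printed can then be installed from the cpan prompt before SpamAssassin itself is built.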

If you let the build process complete, it will test the capabilities of the C compiler, configure and build the module, create documentation, and test SpamAssassin. At the end of the build, the output should be similar to the following:

chmod 755 /usr/share/spamassassin
/usr/bin/make install -- OK

This indicates that SpamAssassin has been installed correctly. If SpamAssassin installation was successful, you can skip to the Testing the Installation section.

If the installation failed, the output may look like this:

Failed 17/68 test scripts, 75.00% okay. 50/1482 subtests
failed, 96.63% okay.
make: *** [test_dynamic] Error 29
/usr/bin/make test -- NOT OK
Running make install
make test had returned bad status, won't install without force

If the output does not end with the /usr/bin/make install -- OK message, an error has occurred. First, examine all the output for possible warnings and error messages, especially for prerequisite packages. If this does not help, avenues for support are described in the section Testing the installation.

Using the rpmbuild utility

If a version of Linux based on the Red Hat Package Manager format is used, SpamAssassin can be installed using the rpmbuild command. Download the SpamAssassin source into a working directory, then issue the following command to build SpamAssassin:

# rpmbuild -tb Mail-SpamAssassin-3.2.5.tar.gz
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.ORksvX
+ umask 022
+ cd /root/rpmbuild/BUILD
+ cd /root/rpmbuild/BUILD
+ rm -rf Mail-SpamAssassin-3.2.5
+ /usr/bin/gzip -dc /root/Mail-SpamAssassin-3.2.5.tar.gz
+ /bin/tar -xf -
+ '[' 0 -ne 0 ']'
+ cd Mail-SpamAssassin-3.2.5
+ /bin/chmod -Rf a+rX,u+w,g-w,o-w .
+ exit 0
Executing(%build): /bin/sh -e /var/tmp/rpm-tmp.zgpcdd
... (output continues)
Wrote: /usr/src/redhat/RPMS/i386/spamassassin-3.0.4-1.i386.rpm
Wrote: /usr/src/redhat/RPMS/i386/spamassassin-tools-3.0.4-1.i386.rpm
Wrote: /usr/src/redhat/RPMS/i386/perl-Mail-SpamAssassin-3.0.4-1.i386.
Executing(%clean): /bin/sh -e /var/tmp/rpm-tmp.65065
+ umask 022
+ cd /usr/src/redhat/BUILD
+ cd Mail-SpamAssassin-3.0.4
+ '[' /var/tmp/spamassassin-root '!=' / ']'
+ rm -rf /var/tmp/spamassassin-root
+ exit 0

It is possible that the installation will fail due to missing dependencies. These are Perl modules that SpamAssassin uses, and which are installed separately. Error messages often hint at the name of the dependency, as in the following installation:

# rpmbuild -tb Mail-SpamAssassin-3.2.5.tar.gz
error: Failed build dependencies:
perl(Digest::SHA1) is needed by spamassassin-3.2.5-1.i386
perl(HTML::Parser) is needed by spamassassin-3.2.5-1.i386
perl(Net::DNS) is needed by spamassassin-3.2.5-1.i386

In this case, the Perl modules Digest::SHA1, HTML::Parser, and Net::DNS are needed. The solution is to install them using CPAN. In some cases, SpamAssassin may require particular versions of packages, which may require the installed versions to be upgraded.
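
When the list of failed dependencies is long, the module names can be extracted from the rpmbuild error output automatically. The sketch below assumes the message format shown in the example above; the function name failed_perl_deps is hypothetical.

```shell
# Hypothetical helper: turn "perl(Some::Module) is needed by ..." lines
# read on stdin into bare module names, one per line.
failed_perl_deps() {
    sed -n 's/^[[:space:]]*perl(\([^)]*\)) is needed by .*/\1/p'
}

# Example using two lines of the error output shown above.
printf 'error: Failed build dependencies:\nperl(Digest::SHA1) is needed by spamassassin-3.2.5-1.i386\nperl(Net::DNS) is needed by spamassassin-3.2.5-1.i386\n' |
    failed_perl_deps
```

The resulting names are the ones to install with CPAN before running rpmbuild again.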

When installing SpamAssassin using CPAN, all the dependencies are installed automatically. However, while using the rpmbuild command, the dependencies need to be installed manually. Using CPAN is generally less troublesome than rpmbuild.

Using pre-built RPMs

SpamAssassin is packaged with many Linux distributions, and packages of new releases of SpamAssassin are often made available from other sources. As mentioned earlier, RPMs are not the recommended method of installing SpamAssassin but are more reliable than building from source on unusual platforms.

To install an RPM, simply download or locate it on the distribution CD, and install it using the rpm command. The following command can be used to install the RPM for SpamAssassin:

# rpm -ivh /path/to/rpmfile-9.99.rpm

Graphical installers can also be used to install SpamAssassin RPMs. The RPMs listed on the SpamAssassin website are usually the latest version of SpamAssassin and are complete. If these cannot be installed, the RPM provided by the Linux distribution should be installed instead.

Testing the installation

It is worth performing a few tests to ensure that SpamAssassin is installed correctly and the environment is complete. If you want to test a particular user account, you should log in to that account to perform the test.

SpamAssassin includes a sample spam e-mail and a sample non-spam e-mail. It can be tested by processing the sample e-mails. These e-mails are in the root of the SpamAssassin distribution directory. If you used CPAN to install SpamAssassin as the root user, then the path to this directory may be similar to ~root/.cpan/build/Mail-SpamAssassin-3.2.5/, where 3.2.5 is the version of SpamAssassin installed. If the files cannot be located, download the SpamAssassin source and unpack it into a temporary directory. The sample e-mails are in the root of the unpacked source.

To test SpamAssassin, change to the directory containing sample-spam.txt and use the following commands. Example results are shown after each command.

$ spamassassin -t < sample-nonspam.txt | grep X-Spam
[22674] warn: config: created user preferences file: /home/user/.
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
X-Spam-Status: No, score=0.0 required=5.0 tests=none autolearn=haX-
$ spamassassin -t < sample-spam.txt | grep X-Spam
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on
X-Spam-Level: **************************************************
X-Spam-Status: Yes, score=1000.0 required=5.0 tests=GTUBE,NO_RECEIVED,
X-Spam-Prev-Subject: Test spam mail (GTUBE)

The output from the command using sample-nonspam.txt should have X-Spam-Status: No, and that using sample-spam.txt should have X-Spam-Flag: YES and X-Spam-Status: Yes.
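
The same check can be scripted, which is convenient when the test has to be repeated for several user accounts. The following is a sketch only: the function name spam_verdict is hypothetical, and it assumes the X-Spam-Status header has the form shown in the example output above.

```shell
# Hypothetical helper: print the verdict ("Yes" or "No") from the first
# X-Spam-Status header found on stdin.
spam_verdict() {
    sed -n 's/^X-Spam-Status: \([A-Za-z]*\),.*/\1/p' | head -n 1
}

# Intended use (needs SpamAssassin and the sample files):
#   spamassassin -t < sample-spam.txt | spam_verdict
#   spamassassin -t < sample-nonspam.txt | spam_verdict

# Stand-alone demonstration using header lines like those shown above.
printf 'X-Spam-Flag: YES\nX-Spam-Status: Yes, score=1000.0 required=5.0\n' | spam_verdict
```

The demonstration prints Yes; an installation is behaving as expected when the sample spam gives Yes and the sample non-spam gives No.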

SpamAssassin can verify its configuration files with the --lint flag and report any errors. By default, a clean installation of SpamAssassin should not have any errors, but once a site is customized, some rules may fail. In the following example, a score entry does not match a rule:

$ spamassassin --lint
warning: score set for non-existent rule RULE_NAME
lint: 1 issues detected. please run with debug enabled for more

If the output includes warnings, something has gone wrong. It's worth fixing SpamAssassin before going on and using it. The best places to look for help are the SpamAssassin Wiki, the archives of the SpamAssassin mailing lists, and your favorite search engine. In most open source projects, the developers are volunteers and appreciate users who search for the solution to their problem before posting a plea for help, as most problems have been encountered many times before.

Modified e-mails

In addition to the e-mail headers mentioned, SpamAssassin will modify an e-mail if it is thought to be spam. It takes the original e-mail and converts it to an e-mail attachment with a simple e-mail around it. SpamAssassin always wraps an e-mail if it detects a potential virus or other dangerous content. In its default configuration, it will add an envelope e-mail around the spam, but this can be turned off if desired. Consult the SpamAssassin documentation regarding the report_safe directive. The envelope e-mail looks like this:

[Figure: example of the envelope e-mail]


In this article, we have learnt:

  1. Why spam is difficult to deal with and why spam filters require regular updates
  2. How to download, install, and configure SpamAssassin


You've been reading an excerpt of:

Linux Email