You're reading from SpamAssassin: A practical guide to integration and configuration

Product type Book

Published in Sep 2004

Publisher Packt

ISBN-13 9781904811121

Pages 240 pages

Edition 1st Edition

Chapter 5. Detecting Spam

Although humans can easily distinguish between spam and ham, detecting spam with computer programs is not simple. Over the years, several methods have been developed to filter spam from ham. Some anti-spam tools use only a subset of these methods, but SpamAssassin uses almost all of them.

Content Tests

Content tests analyze the message part of the email, and sometimes the headers. These tests typically look for key words or phrases within emails, and usually rely on a scoring system. It is not uncommon for words normally associated with spam emails to also appear in legitimate emails, so a score or count of suspicious words is accumulated for each email. Each word associated with spam increases the overall score of an email. The final score is compared with a predefined threshold to decide whether the email is spam or ham.

Content tests need not focus on single words; phrases and sequences of punctuation are used. The words, phrases, and other symbols tested are normally generated by a developer, who analyzes spam and manually creates tests.
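
The scoring approach described above can be sketched in a few lines of Python. This is only an illustration, not SpamAssassin's rule engine: the patterns, the weights, and the threshold of 5.0 are all assumptions chosen for the example.

```python
import re

# Hypothetical rules as (pattern, score) pairs. The patterns and weights
# are invented for illustration and are not taken from SpamAssassin.
RULES = [
    (re.compile(r"\bfree\b", re.IGNORECASE), 1.0),        # single word
    (re.compile(r"\bact now\b", re.IGNORECASE), 2.0),     # phrase
    (re.compile(r"!{3,}"), 1.5),                          # punctuation run
    (re.compile(r"\bguaranteed\b", re.IGNORECASE), 3.0),  # single word
]

THRESHOLD = 5.0  # assumed cut-off between ham and spam


def score_body(body):
    """Add up the score of every rule that matches the body at least once."""
    return sum(score for pattern, score in RULES if pattern.search(body))


def is_spam(body):
    """Compare the accumulated score with the predefined threshold."""
    return score_body(body) >= THRESHOLD
```

A message such as "Feel free to call me" matches one rule and stays well below the threshold, which is the point of accumulating a score rather than rejecting on a single word.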

Sometimes the message headers are examined as part of a content test. The message headers include dates, time, and other attributes, such as the mail application used. Often, spam-creation programs contain...

Header Tests

Header tests focus on the message headers. The tests are mainly concerned with detecting fake headers and determining whether a message has been routed via an open relay.

For example, a header test could flag all emails that appear to have been sent over 72 hours ago, or sent at a future date. Most email servers have accurate clocks. However, spammers frequently use trojanned PCs, which may have inaccurate clocks, and so spam messages might have dates that are in the past or the future. Examining email headers is described in more detail later in the chapter.
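
A date check of this kind might look like the following sketch. The function name `date_is_suspicious` is hypothetical, and treating an unparseable date as suspicious and a missing time zone as UTC are assumptions of this example rather than behavior described in the text.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

MAX_AGE = timedelta(hours=72)  # the 72-hour limit from the example above


def date_is_suspicious(date_header, now):
    """Return True if the date is unparseable, in the future, or too old.

    `now` must be a timezone-aware datetime.
    """
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        # Assumption: a date that cannot be parsed is itself suspicious.
        return True
    if sent.tzinfo is None:
        # Assumption: treat a date without a usable zone as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    return sent > now or now - sent > MAX_AGE
```

Passing `now` in as a parameter, instead of reading the clock inside the function, keeps the check easy to test with fixed dates.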

Header tests do not form a large proportion of tests that SpamAssassin performs. These tests use up considerable amounts of CPU, memory, and disk I/O resources.

DNS-Based Blacklists

There are many DNS-based blacklists (DNSBLs). These are also known simply as blacklists or blocklists. They provide a service that is used by MTAs and spam filters to indicate sites that are related to spammers. An MTA or spam filter may use one or more blacklists. SpamAssassin can use blacklists to filter spam.

Blacklists can generally be placed in one or more of these categories:

A list of known open relays
A list of known sources of spam
A list of sites hosted by an ISP that encourages spammers in some way

Every blacklist has unique policies for adding and removing domains from the list. Some are very aggressive, and block not only sources of spam, but also any address served by the same Internet service provider. The intention of this approach is to force ISPs to stop doing business with spammers and thus force them off the net. This approach, called the Internet black hole of death, has been used with success against major ISPs in the past.
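
As a rough sketch of how a DNSBL is typically queried: the octets of the IPv4 address are reversed, the blacklist's zone is appended, and the resulting name is looked up in DNS; an answer conventionally means the address is listed. The zone `dnsbl.example.org` below is a placeholder, not a real blacklist, and both function names are hypothetical.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")  # validates the address
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the DNSBL zone returns an address record for this IP.

    Simplification: any resolver failure, including a timeout, is treated
    as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# Example with a placeholder zone (no network access needed for this part):
# dnsbl_query_name("218.14.129.227", "dnsbl.example.org")
```

Only `is_listed` touches the network; real blacklists publish their own zone names and usage policies, which should be checked before querying them.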

Blacklists provide a spam filter...

Statistical Tests

Various statistical techniques can be used to identify spam. These generally involve a training phase, in which a collection of known spam and ham emails is passed through the filter so that it can learn the typical characteristics of each. This allows future emails to be classified based on what was learned from past emails. The techniques differ in their choice of tokens and in the algorithms they use to predict whether an email is spam or ham. The tokens used are normally words, but can include email headers, HTML markup within emails, and other characters such as punctuation marks.

Statistical filters rely on regular training. They use the knowledge gained in training to estimate the probability that new emails are spam. As spam changes, the filter must adapt in order to continue to detect the spam.
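
The train-then-estimate cycle can be illustrated with a very small naive Bayes classifier. This is a teaching sketch under simplifying assumptions (word tokens only, add-one smoothing, only tokens present in the message are considered); it is not the algorithm SpamAssassin implements, and the class name `TinyBayes` is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word-like tokens."""
    return re.findall(r"[a-z0-9$!']+", text.lower())


class TinyBayes:
    def __init__(self):
        # Number of training messages of each class containing each token.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Learn from one message; label is "spam" or "ham"."""
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_probability(self, text):
        """Estimate the probability that a new message is spam."""
        # Start from the smoothed prior odds of spam versus ham.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for token in set(tokenize(text)):
            # Add-one smoothing so unseen tokens never give a zero probability.
            p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][token] + 1) / (self.messages["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))
```

Because the estimate depends entirely on the counts gathered in `train`, the filter only keeps up with changing spam if it continues to be trained on recent messages.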

SpamAssassin contains a statistical filter based on Bayesian analysis. This is enabled by default and, if trained properly, aids in the correct recognition of...

Message Recognition

Often, a spammer will send exactly the same message to many recipients. Although message headers may be different in each email, an email with the same body may be sent to many recipients. This has led to the creation of several anti-spam networks that contain a database of spam emails. By comparing incoming emails with the contents of this database, it is possible to quickly filter out known spam messages. SpamAssassin can use one or more message recognition systems.

To avoid sending the whole email across the network and comparing each character or line, a hash value is calculated and used. Hashing is a mathematical process that creates a small signature from a larger message. It is very unlikely that two different email messages will have the same hash value, and so comparing hashes is statistically the same as comparing the whole messages. As the hashes are much shorter than an email message, comparing hashes is significantly quicker than comparing the whole message.
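
A minimal sketch of the idea, assuming SHA-256 as the hash and a simple whitespace and case normalization of the body; real message recognition services define their own signature schemes, and the function name `body_signature` is hypothetical.

```python
import hashlib


def body_signature(raw_message):
    """Return a short hex signature computed from the message body only."""
    # Headers end at the first blank line; everything after it is the body.
    _, _, body = raw_message.replace("\r\n", "\n").partition("\n\n")
    # Assumed normalization: collapse whitespace and ignore case, so that
    # trivial formatting differences do not change the signature.
    normalized = " ".join(body.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two messages with different headers but the same body produce the same 64-character signature, so only that short string needs to be sent to, and compared by, the remote database.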

The calculation...

URL Recognition

A Uniform Resource Locator (URL), which is one kind of Uniform Resource Identifier (URI), is typically a web link in a spam email. A Spam URI Real-time Block List (SURBL) is a database of URIs that have appeared in spam emails.

SpamAssassin version 2.63 supports SURBLs via a plug-in, and from version 3.0 onwards, SpamAssassin supports SURBLs without the need for a plug-in.

These tests use a small amount of network I/O while querying the database. They are suited to parallel processing.
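
The first half of such a test, finding the hosts that a message links to, can be sketched as follows. The lookup itself is replaced here by a local Python set as a stand-in for the remote database; the regular expression, the function names, and the example host names are all assumptions of this sketch.

```python
import re
from urllib.parse import urlsplit

# Simplified pattern: only http and https links, ending at whitespace or quotes.
URI_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def hosts_in_body(body):
    """Return the set of lowercase host names linked from the message body."""
    hosts = set()
    for uri in URI_PATTERN.findall(body):
        try:
            # Drop punctuation that commonly trails a link in running text.
            host = urlsplit(uri.rstrip(".,;:)!?")).hostname
        except ValueError:
            continue  # malformed link; skip it
        if host:
            hosts.add(host.rstrip("."))
    return hosts


def listed_hosts(body, blocklist):
    """Return the linked hosts that appear in the (local, stand-in) blocklist."""
    return hosts_in_body(body) & blocklist
```

In a real deployment each host would be checked against the remote list, and because the lookups are independent of one another they can be issued in parallel.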

Examining Headers

Email headers have a specific format. They take the form of a series of records. Each record starts on a new line, with a word followed by a colon. The word can be hyphenated, so Return-Path: in the following example indicates the start of an email header.

Email headers can span several lines. If a line starts with a space or a tab character, it is a continuation of the previous header.

The following example shows the headers concerned with overall delivery details—who any bounce messages should be sent to, who the email was destined for, and who it was finally delivered to.

Return-Path: <stockprofile@mymail-info.net>
X-Original-To: sales@domain.com
Delivered-To: sales@domain.com
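
The format rules above (a name followed by a colon, with continuation lines starting with a space or tab) can be sketched as a small parser. This is a simplified illustration with a hypothetical function name; Python's standard `email` package handles the many edge cases that this sketch ignores.

```python
def parse_headers(raw):
    """Parse a raw header block into a list of (name, value) pairs."""
    headers = []
    for line in raw.splitlines():
        if not line:
            break  # a blank line marks the end of the headers
        if line[0] in " \t" and headers:
            # Continuation line: append it to the previous header's value.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
        else:
            # New header: the name is everything before the first colon.
            # A line without a colon is kept as a name with an empty value.
            name, _, value = line.partition(":")
            headers.append((name, value.strip()))
    return headers
```

A list of pairs is used rather than a dictionary because some headers, such as Received:, normally appear several times and their order matters.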

The following lines show the MTAs that the email was routed through on its way from sender to recipient. Each MTA adds a line above the other lines as it processes the email.

Received: from unknown (HELO 81.21.65.156) (218.14.129.227)
    by server25.lb.an-isp.co.uk with SMTP; 29 Jun 2004 04:27:31...

Reporting Spammers

Many ISPs take spam seriously; some do not. To report spam:

Determine the source of spam, discover the netblock owner, and send an email to their abuse@ address.
If a website is advertised, resolve the domain name. Discover the netblock owner and send an email to their abuse@ address.

Sadly, some ISPs ignore email received at their abuse@ addresses, and others actually bounce email sent to their abuse@ address.

Note

The spamcop.net website does a very good job of analyzing email message headers and helps you choose where to send spam reports.

Valid Bulk Email Delivery

Mailing lists can have thousands of recipients, so message recognition services may see the same message often enough to affect the results of a query. Some commercial email is sent only to subscribers who opt in to a list, so it is not spam but desired email. Even so, message recognition services and spam rules may falsely identify desired or valid bulk email as spam, because its content can look like spam, especially to a filter. Any email can suffer from false detection, but emails discussing finance and medical products are particularly likely to, due to the number of spam rules targeting these types of products.

If some ham email is incorrectly detected as spam, try the following:

Personalize each email, or send a unique email to a small number of recipients. If the email body is different, then a message recognition service will treat each email as unique. The emails can also be sent...

Summary

Spam can be identified using a number of techniques. Header and body tests search for known words and other characters; these are the bulk of tests provided by SpamAssassin. Various network tests, including DNS-based blacklists and message recognition databases, can also be used to detect spam.

Statistical filters, which use mathematical techniques to learn the properties of spam and ham, can be used alongside email header analysis to filter spam effectively. Although tools to detect spam are available, it is also the users' responsibility to report spam in order to reduce the constant menace of spam emails.