Chapter 13. Improving Filtering
SpamAssassin has a high spam detection rate, but despite this, some spam emails always escape detection. Conversely, legitimate emails are sometimes marked as spam.
This chapter looks at whitelists and blacklists—techniques for spam filtering that mark known good and bad senders. We then discuss the situation where emails have been wrongly classified, and how to resolve this by altering scoring on rules. Finally, we discuss filtering out certain foreign languages and character sets as a method of reducing spam.
Whitelists and Blacklists
SpamAssassin works very well at detecting spam, but there is always a risk of false positives or false negatives. By using a list of email addresses that are known spam producers (a blacklist), email from spammers who consistently use the same email addresses or domains can be filtered out. With a list of email addresses that are legitimate email senders (a whitelist), emails from regular or important correspondents are guaranteed to be filtered as ham. This prevents the delay or non-delivery of important emails that may otherwise be marked as spam.
Blacklists that list individual email addresses have limited use, because spammers normally use different or random email addresses for each spam run. However, some spammers use the same domain for multiple runs. Because SpamAssassin allows wildcards in its blacklist entries, entire domains can be blacklisted, which is more useful for filtering out spam.
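As a minimal sketch of the manual lists described above, the following configuration fragment whitelists one address and one domain and blacklists an entire domain with a wildcard. The directives whitelist_from and blacklist_from are standard SpamAssassin settings; every address and domain shown is hypothetical and chosen only for illustration.

```
# Hypothetical addresses and domains, for illustration only.
# Always treat mail from this single correspondent as ham.
whitelist_from  alice@example.com
# Wildcard entry: treat mail from any address at this domain as ham.
whitelist_from  *@partner.example.org
# Wildcard entry: treat mail from any address at this domain as spam.
blacklist_from  *@bulk-mailer.example.net
```

These entries match on sender addresses, which can be forged, so a domain-wide whitelist entry is best reserved for correspondents whose mail is not often spoofed.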
SpamAssassin uses a manual blacklist and whitelist, and also manages an automatic whitelist...
SpamAssassin manages an automatic whitelist (AWL). It actually functions as both an auto-blacklist and an auto-whitelist. Generally, an auto-blacklist is ineffective as spammers rarely use the same email address for any period of time. However, SpamAssassin tracks both the IP address of email sources and the email addresses used, adding to its effectiveness.
The auto-whitelist keeps a record of the SpamAssassin scores for emails from senders. Senders that only send ham emails receive a weighting towards ham by the auto-whitelist. If they later send an email marked as spam, then the SpamAssassin score for the new email will be adjusted downwards, due to their past behavior as a source of only ham emails. The converse is true of those who normally send spam.
The AWL works by adjusting the score of the email being processed towards the average of all previous emails received from that sender. The amount or strength of this adjustment can be altered using the auto_whitelist_factor...
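The adjustment can be illustrated with a short configuration fragment and a worked calculation. This is a sketch that assumes the AWL plugin is loaded and that the documented behavior applies, namely that the final score is the message score plus the factor times the difference between the sender's mean score and the message score. The scores in the comments are invented for the example.

```
# Fraction of the distance towards the sender's historical mean
# that each new score is moved. Range 0 to 1; 0.5 is the default.
auto_whitelist_factor 0.5

# Worked example with invented numbers:
#   sender's mean score so far   = 1.0
#   score of the new message     = 9.0
#   adjusted score = 9.0 + (1.0 - 9.0) * 0.5 = 5.0
# A factor of 0 leaves the score unchanged; a factor of 1 replaces
# it with the sender's mean.
```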
Resolving Incorrect Classifications
The consequences of an email being wrongly classified can range from a minor inconvenience to a major catastrophe. If a spam email is marked as ham, then the recipient will only spend a few seconds removing it from their inbox. If an unimportant ham email is marked as spam, it may be no great problem. However, marking an important email as spam could be embarrassing at the least, and could well result in serious consequences, such as financial loss.
Consequently, it is worthwhile making a backup of emails marked as spam before purging them. Emails compress well, so it may be possible to keep this backup on local disk storage rather than tape or other removable media. When it becomes apparent that an email has gone missing, the archive of spam can be searched and the email retrieved.
Due to the potential cost of an incorrect classification, users should be encouraged to review any spam they have received on a regular basis. Any false positives should be brought...
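When a false positive or false negative can be traced to a particular rule, the rule-score adjustment mentioned in this chapter's introduction might look like the sketch below. The directives score and required_score are standard SpamAssassin settings; EXAMPLE_RULE is a placeholder rather than a real rule name, and the values are illustrative only.

```
# Illustrative values only; tune them against your own mail.
# Total score at or above which a message is marked as spam
# (5.0 is the default).
required_score 5.0
# Reduce the weight of a rule that keeps firing on legitimate mail.
# EXAMPLE_RULE is a placeholder for the name of the offending rule.
score EXAMPLE_RULE 0.5
# Setting a rule's score to 0 disables the rule entirely.
# score EXAMPLE_RULE 0
```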
Character Sets and Languages
SpamAssassin can detect certain languages and character sets. Both language and character set information are added by email clients when emails are composed and sent, so that the receiving email client can display the message correctly. There are many languages and character sets in use. If received messages are expected or known to use only some of them, then the others can be filtered out.
SpamAssassin detects languages by using email headers. There is a large list of languages that SpamAssassin can detect; these are listed in the documentation for Mail::SpamAssassin::Conf. Use the man or perldoc commands to view the documentation:
perldoc Mail::SpamAssassin::Conf
man Mail::SpamAssassin::Conf
Many man and perldoc implementations use the / key to search for text. Once the page is displayed, enter /ok_languages to locate the correct part in the documentation. Press the space bar to scroll forward through the documentation.
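As an illustration of the setting that this search locates, a site that expects mail only in English and French might use a fragment like the one below. The directives ok_languages and ok_locales are standard SpamAssassin settings, but the choice of languages is an assumption made for the example, and ok_languages only takes effect when language detection (the TextCat plugin) is enabled.

```
# Hypothetical choice: accept only English and French as message languages.
# Messages detected as any other language gain spam points.
ok_languages en fr
# Accept only Western character sets; other character sets gain spam points.
ok_locales en
```

Both settings default to all, which disables this kind of filtering.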
Once the languages...
Summary
SpamAssassin allows the user and system administrator to improve detection of spam using a number of techniques. Whitelists help to prevent false positives. SpamAssassin's auto-whitelist will prevent an occasional spam-like email from an otherwise spam-free correspondent from being filtered as spam.
The spam threshold can be altered to reduce false positives or false negatives, and individual rule scores can be altered to prevent incorrect classifications. Character sets and languages can be used to filter out spam from other countries or regions.