Recently in Mail and USENET News Category

Mail to WAP Gateways

It's coming up to Valentine's day again, and invariably my thoughts turn back to last year's rather memorable weekend-break to Stockholm, in which I learned two things:

  1. Stockholm makes a great Valentine's destination.
  2. My girlfriend of the time was not happy with me cracking out my iBook and checking my email halfway into the break.

The relationship, predictably, didn't last much longer, but it did occur to me that a quick and easy way to check my email when away from my computer would be very useful. One of the items that travels everywhere with me, and has some limited Internet access is my phone -- although admittedly this has only WAP access. WAP access, it seemed, would have to do...

The tool I ended up building fills my needs very well, but possibly won't be such a great match for others. This article looks at considerations when rendering email for display online, especially when space is very limited.

Overview of Messages

The first challenge is reading the contents of our target mailbox. For this, we turn to the Perl Email Project's Email::Folder:

 use Email::Folder;
 
 my $folder = Email::Folder->new( '/home/sheriff/mbox' );
 
 for my $message ( $folder->messages ) {
         ...
 }

Email::Folder's messages() method returns Email::Simple objects. For my folder-view, I chose to group messages by date, and use the sender's "real name" as the subject. Something like:

 30 Jan 2004
    Michael Roberts
  * Paul Makepeace
    Uri Guttman
 29 Jan 2004
    Kate Pugh

Extracting header fields from Email::Simple objects couldn't be simpler:

 my $from = $message->header('from');

But people familiar with the various email RFCs will know that since email headers have to use only printable US-ASCII, they're very often encoded: your header field might well look like:

  =?iso-8859-1?q?Pete=20Sergeant?= <pete@clueball.com>

This will not look pretty if you use it literally. Thankfully, MIME::WordDecoder exports the function unmime -- rendering the above as "Pete Sergeant <pete@clueball.com>."
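
If MIME::WordDecoder isn't at hand, the core Encode module can do a similar job through its "MIME-Header" encoding. The following is only a sketch of that alternative, not what my tool uses; decode_header is a name I've made up for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: decode RFC 2047 "encoded words" in a raw header
# value and return a Perl character string. A missing header becomes ''.
sub decode_header {
    my ($raw) = @_;
    return '' unless defined $raw;
    return decode( 'MIME-Header', $raw );
}

my $from = decode_header('=?iso-8859-1?q?Pete=20Sergeant?= <pete@clueball.com>');
print "$from\n";
```

Remember that the result is a character string, so it may need encoding again before you print it to a device that expects a particular character set.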

Getting the date from an email is also somewhat nontrivial -- an example "Date" header looks like:

 Fri, 30 Jan 2004 14:09:51 -0000

And that's if you're lucky, and it's well-formed, without starting to think about time zones. If we want to do anything useful with dates, we're going to want the date as an epoch time. Luckily, DateTime::Format::Mail steps in, and not only parses our date, but returns a highly useful DateTime object, allowing us to do all kinds of fun date stuff. To simply reformat the date as Day/Month/Year:

 use DateTime::Format::Mail;

 my $datetime = DateTime::Format::Mail->new( loose => 1 );
 my $time = $datetime->parse_datetime( $message->header('date') );
 my $day_month_year = $time->dmy;
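
To show what the module saves us from, here is a rough sketch of turning a well-formed "Date" header into an epoch time using only core modules. It is far less forgiving than DateTime::Format::Mail: it assumes an English month abbreviation and a numeric zone such as "+0100", and gives up on anything else. The helper name date_header_to_epoch is my own invention.

```perl
use strict;
use warnings;
use Time::Local qw( timegm );

my %MONTH_INDEX = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Hypothetical helper: convert a well-formed Date header with a numeric
# zone into epoch seconds. Returns undef for anything it can't handle.
sub date_header_to_epoch {
    my ($header) = @_;
    return unless defined $header;
    my ( $day, $mon, $year, $hour, $min, $sec, $sign, $zh, $zm ) =
        $header =~ m{
            (\d{1,2}) \s+ ([A-Za-z]{3}) \s+ (\d{4}) \s+   # 30 Jan 2004
            (\d{2}) : (\d{2}) (?: : (\d{2}) )? \s+        # 14:09:51
            ([+-]) (\d{2}) (\d{2})                        # -0000
        }x
        or return;
    $mon = ucfirst lc $mon;
    return unless exists $MONTH_INDEX{$mon};

    # Treat the wall-clock time as UTC, then correct for the zone offset.
    # timegm dies on impossible dates, hence the eval.
    my $epoch = eval {
        timegm( $sec || 0, $min, $hour, $day, $MONTH_INDEX{$mon}, $year );
    };
    return unless defined $epoch;
    my $offset = ( $zh * 60 + $zm ) * 60;
    return $sign eq '+' ? $epoch - $offset : $epoch + $offset;
}

my $epoch = date_header_to_epoch('Fri, 30 Jan 2004 14:09:51 -0000');
print scalar( gmtime $epoch ), "\n";
```

Named zones, two-digit years, and comments in the header all fall through to undef here, which is a fair illustration of why I'd rather let the module do it.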

Finally, we're going to want to know if an email is new or not. Luckily, most MUAs will set/edit an email's status header. Rather than checking if an email is new, we check if it's been read -- denoted by an R in the status header:

 $new_flag++ if ( $message->header('Status') || '' ) !~ m/R/;
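
The same test can be wrapped up in a small function, which makes the "missing header" case explicit. This is a sketch only; is_new is a hypothetical helper, and it takes the raw header value rather than an Email::Simple object so that it can be tried on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: treat a message as new unless its Status header
# contains the "R" (read) flag. A missing header counts as new.
sub is_new {
    my ($status) = @_;
    $status = '' unless defined $status;
    return $status !~ /R/ ? 1 : 0;
}

for my $status ( 'RO', 'O', undef ) {
    my $label = defined $status ? "'$status'" : 'undef';
    printf "%-7s => %s\n", $label, is_new($status) ? 'new' : 'read';
}
```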

Now let's put this all together to produce a listing of a folder. We'll use the well-known Schwartzian transform to make the sorting efficient, but unlike the usual practice, we keep the array reference around, as we'll be using the date as well.

 use Email::Folder;
 use MIME::WordDecoder qw( unmime );
 use DateTime::Format::Mail;

 my $folder = Email::Folder->new( '/home/sheriff/mbox' );
 my $parser = DateTime::Format::Mail->new( loose => 1 );
 my $prev_date = "";
 # Sort on the epoch time, newest first: sorting on the day-month-year
 # string itself would order messages by day of the month.
 for (sort { $b->[1]->epoch <=> $a->[1]->epoch }
      map  { [$_, message2datetime($_) ] }
      $folder->messages) {
     my ($message, $time) = @$_;
     my $date = $time->dmy;
     if ($date ne $prev_date) { print $date, "\n"; $prev_date = $date; }
     # A missing Status header means the message is unread.
     my $status = $message->header('Status') || '';
     print $status =~ m/R/ ? "   " : " * ";
     print unmime($message->header('from')), "\n";
 }

 sub message2datetime {
     my $message = shift;
     return $parser->parse_datetime( $message->header('date') );
 }
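
If you'd like to see the sort-and-group logic on its own, without a mailbox or any CPAN modules, here is a sketch that runs the same transform over hand-made data. The sample records, the folder_listing name, and the use of UTC for the day labels are all my assumptions for the sake of illustration.

```perl
use strict;
use warnings;
use POSIX qw( strftime );

# Hypothetical sample data: [ sender, epoch seconds, already read? ]
my @messages = (
    [ 'Kate Pugh',       1075380000, 1 ],    # 29 Jan 2004
    [ 'Uri Guttman',     1075453200, 1 ],    # 30 Jan 2004
    [ 'Paul Makepeace',  1075460400, 0 ],    # 30 Jan 2004
    [ 'Michael Roberts', 1075471791, 1 ],    # 30 Jan 2004
);

# Schwartzian transform: compute the day label once per message, sort on
# the epoch (newest first), and keep the label around for display.
sub folder_listing {
    my @records = @_;
    my @lines;
    my $prev_day = '';
    for my $entry (
        sort { $b->[1] <=> $a->[1] }
        map  { [ $_, $_->[1], strftime( '%d %b %Y', gmtime( $_->[1] ) ) ] }
        @records
    ) {
        my ( $message, undef, $day ) = @$entry;
        if ( $day ne $prev_day ) {
            push @lines, " $day";
            $prev_day = $day;
        }
        push @lines, ( $message->[2] ? '    ' : '  * ' ) . $message->[0];
    }
    return @lines;
}

print "$_\n" for folder_listing(@messages);
```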

Displaying Individual Messages

Those are the main challenges of a folder-view. Viewing an individual message presents a different set of challenges.

First and foremost is the appalling habit people have of sending each other HTML-"enriched" emails, with all sorts of attachments. If you're trying to read the email on a cell phone over a slow connection, you don't want to be battling with this -- you want a nice plain-text representation of the email. So, Email::StripMIME is your friend. Assuming we have an Email::Simple object, we can simply:

 use Email::Simple;
 use Email::StripMIME;

 my $string = $email_simple_object->as_string();
 $string = Email::StripMIME::strip_mime( $string );
 $email_simple_object = Email::Simple->new( $string );

Of course, if we really wanted to cut down on the amount of content we're receiving, and we're only using this tool to get an overview of our messages, we can cut out quoted text, remnants of the email that the sender was replying to, and so on. Text::Original does just this for us, as well as stripping out attribution lines:

 use Text::Original qw( first_lines );

 my $body = $email_simple_object->body();
 $body = first_lines( $body, 20 );
 $email_simple_object->body_set( $body );
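
To give a feel for the kind of work involved, here is a deliberately crude sketch of the same idea in plain Perl. It is not Text::Original's algorithm, just my own approximation: it drops lines quoted with ">", simple "On ... wrote:" attribution lines, and anything after a signature separator. The original_lines name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: keep at most $max lines of "original" text,
# skipping quoted lines and simple attribution lines, and stopping at
# a signature separator ("-- ").
sub original_lines {
    my ( $body, $max ) = @_;
    my @kept;
    for my $line ( split /\r?\n/, $body ) {
        last if $line =~ /^-- ?$/;                  # signature separator
        next if $line =~ /^\s*>/;                   # quoted text
        next if $line =~ /^On\b.*\bwrote:\s*$/i;    # attribution line
        # Skip leading blank lines and collapse runs of blank lines.
        next if $line !~ /\S/ && ( !@kept || $kept[-1] !~ /\S/ );
        push @kept, $line;
        last if @kept >= $max;
    }
    pop @kept while @kept && $kept[-1] !~ /\S/;     # trim trailing blanks
    return join "\n", @kept;
}

my $body = join "\n",
    'On Fri, 30 Jan 2004, Pete wrote:',
    '> Are you coming to the meeting?',
    '',
    'Yes, see you there.',
    '',
    '-- ',
    'Kate';

print original_lines( $body, 20 ), "\n";
```

Real mail is messier than this (attribution lines wrap, people quote with other characters), which is exactly why I reach for the module instead.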

The final problem is creating real WML. Sadly, this is nontrivial, and in the past, I've tended to resort to outputting it by hand. But it doesn't have to be that way -- CGI::WML just about handles the task for us. CGI::WML is a subclass of CGI, with methods specific to WAP.
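
For a sense of what outputting it by hand involves, here is a minimal sketch that builds a one-card WML deck without CGI::WML. The helper names wml_escape and wml_deck, the card id, and the sample text are all made up for illustration; the one WML-specific wrinkle is that "$" introduces a variable reference and has to be doubled.

```perl
use strict;
use warnings;

# Escape text for inclusion in WML: the usual XML entities, plus "$",
# which WML reserves for variable references and must be doubled.
sub wml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/\$/\$\$/g;
    return $text;
}

# Hypothetical helper: wrap a title and some lines of text in a
# single-card deck.
sub wml_deck {
    my ( $title, @lines ) = @_;
    my $body = join "<br/>\n", map { wml_escape($_) } @lines;
    return <<"WML";
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <card id="msg" title="@{[ wml_escape($title) ]}">
    <p>
$body
    </p>
  </card>
</wml>
WML
}

print "Content-Type: text/vnd.wap.wml\n\n";
print wml_deck( 'Inbox', 'Kate Pugh <kate@example.org>', 'Costs $5 & up' );
```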

Conclusion

There is no fully working demo at the end of this article. My personal tool works in a way that's probably a little too specific for most people's needs. Hopefully however, it's introduced you to one or more modules you didn't know existed, and given you some inspiration to tinker around with Perl and email-handling.

Mail Filtering

There are many ways to filter your e-mail with Perl. Two of the more popular and interesting ways are to use PerlMx or Mail::Audit. I took a long look at both, and this is what I thought of them.

PerlMx

PerlMx is a server product from ActiveState that uses the milter support in recent versions of sendmail to hook in at almost every stage of the mail-handling process.

PerlMx comes with its own copy of Perl, and all the supporting modules it needs - it can't run from a normal Perl, as it needs Perl to be built with various options such as ithreads support and multiplicity. This means you need to install any modules you want to use with PerlMx twice if you already have them installed somewhere else on your system.

PerlMx provides a persistent daemon that processes e-mail for an entire mail-server - it avoids the overhead of starting a Perl process to handle each e-mail by running forever, and by using threads to ensure it can service more than one e-mail at a time.

PerlMx ships with two main filters - the Spam and Virus filters. The Virus filtering looks interesting, but ultimately I don't receive that many viruses in e-mail, so I was unable to test it beyond establishing that it didn't mangle my e-mail.

The Spam filtering in PerlMx is much more interesting - it seems to be based on Mail::SpamAssassin, a popular spam filtering module often used with Mail::Audit, procmail, or other ways of processing e-mail.

In two weeks of testing with PerlMx, using it to process a copy of all my personal e-mail, I found a lot of useful functionality, and a few minor problems.

The first hassles were setup - I don't normally use sendmail, but PerlMx requires it for the milter API, so I installed sendmail, set it up, and hooked it into PerlMx.

Once you have sendmail set up, and built with milter support (as the default build from Debian Linux I used was), it's easy to add a connection to PerlMx with one line in your sendmail.mc file:


INPUT_MAIL_FILTER(`PerlMx', `S=inet:3366@localhost, F=T,
     T=S:3m;R:3m;E:8m')

PerlMx essentially works out of the box - it asks a number of simple questions when you install and set it up, and assuming you get these right, no further configuration will be required.

The INPUT_MAIL_FILTER line also sets several key options, including the timeouts for communication between sendmail and PerlMx - I had to raise these significantly to deal with a problem I found where PerlMx was taking too much time to process spam (it appeared to be doing DNS lookups), so sendmail was timing out the connection to PerlMx and refusing to accept mail.

PerlMx 2.1 even ships with its own sendmail install, pre-configured for use with PerlMx, but you can choose to ignore this and use an existing system sendmail.

Once you've done this, suddenly all the mail that goes through your mail-server is spam filtered and virus checked. Mail that looks likely to be spam, or that contains a virus, is stopped and held in a quarantine queue; the rest is sent to the user, possibly with a spam header added to indicate a score representing how likely it is to be spam. The quarantine queue is a systemwide collection of messages which, for one reason or another, weren't appropriate to deliver to the user - normally because they are suspected to contain viruses or spam.

If the filters supplied with PerlMx aren't to your tastes, then it comes supplied with an extension API, and extensive documentation and samples to allow you to write your own.

While testing PerlMx, I never managed to bounce or accidentally lose my e-mail - I made many configuration errors, which meant mail wasn't processed and a lot of stuff was somewhat over-enthusiastically marked as spam when it was actually valid. But as far as I can tell, nothing bounced or disappeared into the system - this is pretty impressive, as when configuring most new bits of e-mail I usually manage to delete everything I send to it in the first few attempts, or, worse, make myself look stupid by sending errors back to random people unfortunate enough to be on the same mailing list as me.

Mail::Audit

Mail::Audit is very different from PerlMx. For starters, once you've installed it, by default it doesn't do anything. Mail::Audit is just a Perl module - it's a powerful tool for implementing mail filters, but mostly you have to write them yourself. PerlMx ships with spam filtering and virus checking configured by default, whereas Mail::Audit provides duplicate killing, a mailing list processing module (based on Mail::ListDetector), and a few simple spam filtering options based on Realtime Blackhole Lists or Vipul's Razor.

Mail::Audit is not designed to be used with an entire mail-server in the same way as PerlMx. Instead, it allows you to easily write little e-mail filter programs that can be triggered from the .forward file of a particular user. Mail::Audit can be easily configured and used on a per-user basis, whereas PerlMx takes over an entire mail-server and is an all-or-nothing choice.

The default Mail::Audit configuration starts one Perl process for each mail handled - normally this won't be a problem, but if you're processing large volumes of mail, or have a system which is already at or near capacity, it may be enough to tip the balance and cause performance problems (Translation: Long ago I installed Mail::Audit on an old, spare machine I was using as a mail-server, received 200 e-mails in less than a minute, and spent quite a while waiting for the system to stop gazing at its navel and start responding to the outside world again). If your mail comes to you via POP3, or can be made to do so (possibly by installing a POP3 daemon if you do not have one already), then a simple script supplied with Mail::Audit called popread provides a base you can use to feed articles from a POP3 server into Mail::Audit in a single Perl process, improving performance. I didn't do this myself, as I wanted to use what appeared to be the 'recommended' approach to Mail::Audit setup - the one that is, if not actively promoted in the documentation, most strongly suggested by it, of running a Mail::Audit script from a user's .forward file.

A popular Mail::Audit addition is SpamAssassin (the same codebase that PerlMx's spam filtering is loosely based on) - this comes as a Mail::Audit plugin, among other forms.

Mail::Audit makes it easy to write mail filters that work on a per-user basis, whereas PerlMx by default applies to all mail processed on a given mailserver.

If you wanted to install Mail::Audit systemwide, then many mail-servers (such as exim) provide a way to configure a custom local delivery agent on flexible criteria. For example, this article provides some documentation on how to do this with exim.

Testing ... 1 ... 2 ... 3 ...

I decided to do an extended comparison of both PerlMx and Mail::Audit. As one of the most common applications of mail filtering tools is for spam filtering, I set up recent versions of both the tools on my personal e-mail, by various nefarious means, ran them for a week, and compared the results on two main criteria:

  • False positives (legitimate email recognized as spam)
  • False negatives (spam not recognized as spam)

Mail::Audit doesn't come with much spam filtering technology by default, so I decided to add SpamAssassin (http://www.spamassassin.org/) to the testing, as it can be used as a Mail::Audit extension.

I used procmail to copy all my incoming e-mail to two POP3 mailboxes set up for the purposes of testing - one would contain mail to be processed by Mail::Audit, the other mail to be processed by PerlMx's spam filtering. fetchmail was used to pull the mail down into the domain of Mail::Audit and PerlMx.

Once I had Mail::Audit and SpamAssassin set up, I started feeding mail into the test box with fetchmail, and was reminded that the Mail::Audit approach of setting up a Perl program to run from a .forward file has ... unpleasant effects if you receive more than a few e-mails in quick succession. As my test mail-server collapsed under the load, I checked the PerlMx machine, started at roughly the same time, and found that while it was working through the e-mail more slowly, it hadn't put any serious load on the machine.

Due to a PerlMx configuration error on my part, of the first 171 messages processed, 10 were quarantined as spam AND delivered to the inbox of my test user. PerlMx runs by default in 'training mode' when processing spam - in this mode, mail is spamchecked as normal, but even if it is found to be spam and quarantined, it is also delivered to the user.

I decided to keep track of any mail lost or mislaid during initial setup problems, so I could see what problems could arise from the tools being misconfigured. An important aspect of any software is not only how it behaves when configured right, but how much it punishes you when you get the configuration wrong.

Waking up the next morning, I found I'd bounced several hundred e-mails back to the account from which I was forwarding all the test e-mails, some of which appeared to have gone back and forth, or found their way into the PerlMx test mailbox. Most of the problems appeared to be internal errors from within SpamAssassin. My mail-server still hadn't recovered.

I later found this was because of a compatibility issue between SpamAssassin and Mail::Audit, and there was a recommended fix in the SpamAssassin FAQ involving the nomime option to Mail::Audit (but not, sadly, in the documentation for the Mail::SpamAssassin module itself).

The SpamAssassin / Mail::Audit script I ended up using was:


  #!/usr/local/bin/perl -w

  use strict;
  use Mail::Audit;
  use Mail::SpamAssassin;

  # create Mail::Audit object, log to /tmp, disable mime processing
  # for SpamAssassin compatibility, and store mail in ~/emergency_mbox
  # if processing fails
  my $mail = Mail::Audit->new(emergency=>"~/emergency_mbox",
                              log => '/tmp/audit.log',
                              loglevel => 4, nomime => 1);

  my $spamtest = Mail::SpamAssassin->new;
  
  # check mail with SpamAssassin
  my $status = $spamtest->check($mail);
  
  # if it was spam, rewrite to indicate what the problem was, and 
  # store in the file ass-spam in our home directory
  if ($status->is_spam) {
          $status->rewrite_mail;
          $mail->accept("/home/spam1/ass-spam");
  # if it wasn't spam, accept it as normal mail
  } else {
          $mail->accept;
  }
  
  exit 0;

After clearing down all my mail, and losing two days of testing, I started again. It was only the nature of the testing setup that meant the bounce mail went to me and not the original sender. So, at 23:25 on Tuesday, I had another go. This time I knew enough to limit SpamAssassin to receiving messages in batches of five (using fetchmail) - something I could do in testing, but wouldn't be an easy option in most production setups. This meant my test machine could just about cope with delivering mail using SpamAssassin.

At 10 p.m. Sunday, I declared the testing closed, and examined the accuracy or otherwise of each system.

During the testing between Aug. 6 and 11, Mail::Audit marked 16 pieces of e-mail as spam. Seven of these e-mails proved to be false positives - mail that I had actually solicited and would have liked to have received. Six spam emails were accepted into my Inbox. There were 874 e-mails received in all. Mail::Audit appeared to receive 15 pieces of spam mail in total.

PerlMx marked 14 e-mails as spam. Two of these e-mails proved to be false positives - mail that was not spam. Impressively, it received 886 e-mails in the same period that Mail::Audit received 874 e-mails. I was unable to work out the exact cause of this, although the power-cut in the middle of the testing period will always be a major suspect. Eleven spam messages were incorrectly allowed through into my Inbox. PerlMx appeared to receive 23 pieces of spam mail in total.

The sample was small, as all I had was my own personal e-mail to work with, and I get what I'm told is surprisingly little spam, but it shows that Mail::Audit / SpamAssassin seems to decide more mail is spam than PerlMx does, but is also wrong more of the time. PerlMx marked slightly less e-mail as spam, and let more spam through, but when it did claim e-mail was spam it was right more of the time.
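
The comparison can be made concrete with a little arithmetic on the counts reported above. The sketch below works out, for each tool, how much of the flagged mail really was spam (precision) and how much of the spam received was flagged (recall); the spam_stats name and the hash layout are just my choices for illustration.

```perl
use strict;
use warnings;

# Counts taken from the test results above.
my %results = (
    'Mail::Audit' => { flagged => 16, false_pos => 7, missed => 6 },
    'PerlMx'      => { flagged => 14, false_pos => 2, missed => 11 },
);

# Precision: share of flagged mail that really was spam.
# Recall:    share of all spam received that was flagged.
sub spam_stats {
    my ($r) = @_;
    my $caught = $r->{flagged} - $r->{false_pos};
    my $total  = $caught + $r->{missed};
    return (
        precision => $caught / $r->{flagged},
        recall    => $caught / $total,
        total     => $total,
    );
}

for my $tool ( sort keys %results ) {
    my %s = spam_stats( $results{$tool} );
    printf "%-12s precision %.1f%%  recall %.1f%%  (%d spam in total)\n",
        $tool, 100 * $s{precision}, 100 * $s{recall}, $s{total};
}
```

On these figures, Mail::Audit / SpamAssassin was right about 9 of the 16 messages it flagged (roughly 56%) and caught 9 of 15 spams (60%), while PerlMx was right about 12 of 14 (roughly 86%) but caught only 12 of 23 (roughly 52%).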

These tests would benefit significantly from being re-run over a longer period of time on a larger mail-server, but I had neither the time nor the mail-server available.

Both tools can be extensively configured in terms of what is considered spam, and are likely to need regular updating to ensure they keep up to date with new tricks of the spammers. Here I only considered the behavior with the default configuration of the latest release at the time I ran my tests.

Feature Comparison

To help you choose, I've summarized the basic characteristics of both systems below. Some of the points are quite subjective and are more my impressions of the tools rather than hard facts - these are marked separately.

Feature | PerlMx | Mail::Audit
--- | --- | ---
Scalable | Yes - persistent server | Maybe - depends on config - obvious default configurations scale poorly
Ships with wide range of existing filtering functionality | Yes | Limited range, more available from third-parties
Target use | System-wide mail filtering for mailservers | Per-user mail filtering as a replacement for programs like procmail
Extensible? | Yes | Yes
Licensing | Commercial | Open-source
Mail Server Compatibility | Sendmail | Almost any mail server
Spam filtering | Yes | Third-party extension
Virus filtering | Yes | No
Easy to set up | Yes | Not so easy, requires custom code
Efficient and Scalable | Very scalable - easily separated from the mailserver, and no noticeable performance impact during testing | Performance problems during testing in default configuration

Conclusions

During testing, PerlMx was significantly more reliable than Mail::Audit, both in terms of the amount of mail bounced due to configuration problems (none), and in terms of the load put on the mailserver (minimal). Although Mail::Audit appears able to be set up for good performance, the obvious suggested configuration showed extremely poor scalability during testing. Also, as Mail::Audit requires writing some filtering code, bugs, mostly in this code, resulted in nontrivial quantities of mail being bounced during testing due to code/configuration errors, a problem that simply didn't occur with PerlMx's pre-supplied, configuration-file-based system.

Both PerlMx and Mail::Audit provide good mail filtering solutions using Perl, but are targeted at entirely different markets. PerlMx is a systemwide solution providing drop-in functionality on mailservers, with Perl extensibility as well, whereas Mail::Audit is a more low-level tool, mostly focused on use by individuals, designed to let users build their own mail processing tools more easily.

Stopping Spam with SpamAssassin

I receive a lot of spam; an absolutely massive bucket load of spam. I received more than 100 pieces of spam in the first three days of this month. I receive so much spam that Hormel Foods sends trucks to take it away. And I'm convinced that things are getting worse. We're all being bombarded with junk mail more than ever these days.

Well, a couple of days ago, I reached my breaking point, and decided that the simple mail filtering I had in place up until now just wasn't up to the job. It was time to call in an assassin.

SpamAssassin

SpamAssassin is a rule-based spam identification tool. It's written in Perl, and there are several ways of using it: You can call a client program, spamassassin, and have it determine whether a given message is likely to be spam; you can do essentially the same thing but use a client/server approach so that your client isn't always loading and parsing the rules each time mail comes; or, finally, you can use a Perl module interface to filter spam from a Perl program.

SpamAssassin is extremely configurable; you can select which rules you want to use, change the way the rules contribute to a piece of mail's "spam score," and add your own rules. We'll look at some of these features later in the article. First, how do we get SpamAssassin installed and start using it?

If you're using Debian Linux or one of the BSDs, then this couldn't be easier: just install the appropriate package using apt or the ports tree respectively. (The BSD port is called p5-Mail-SpamAssassin.)

Those less fortunate will have to download the latest version of SpamAssassin, and install it themselves.

Vipul's Razor

SpamAssassin uses a variety of ways for testing whether an e-mail is spam, ranging from simple textual checks on the headers or body and detecting missing or misleading headers to network-based checks such as relay blackhole lists and an interesting distributed system called Vipul's Razor.

Vipul's Razor takes advantage of the fact that spam is, by its nature, distributed in bulk. Hence, a lot of the spam that you see, I'm also going to see at some point. If there were a big clearing-house where you could report spam and I could see if my incoming mail matches what you've already reported, then I could have a guaranteed way of determining whether a given mail is spam. Vipul's Razor is that clearing-house.

Why is it a Razor? Because it's a collaborative system, its strength is directly derived from the quality of its database, which comes back to the way it's used by the likes of you and me. If end-users report lots of real spam, the Razor gets better; if the database gets "poisoned" by lots of false or misleading reports, then the efficiency of the whole system drops.

Just like any other spam detection mechanism, Razor isn't perfect. There are two points particularly worth noting. First, while it tries to completely avoid false positives (saying something's spam when it isn't) by requiring that spam be reported, it doesn't do anything about false negatives (saying something's not spam when it is) because it only knows about the mail in its database.

Second, spammers, like all other primitive organisms, are constantly evolving. Vipul's Razor only works for spam that is delivered in bulk without modification. Spam that is "personalized" by the addition of random spaces, letters or the name of the recipient, will produce a different signature that won't match similar spam messages in the Razor database.
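
The effect of personalization is easy to demonstrate with any content-based signature. The sketch below uses a plain SHA-1 digest of the message body as a stand-in; this is not Razor's actual signature scheme, and the body_signature name and sample messages are invented purely to illustrate the principle.

```perl
use strict;
use warnings;
use Digest::SHA qw( sha1_hex );

# Hypothetical signature: a SHA-1 digest of the exact message body.
# This is NOT Razor's real algorithm, only an illustration.
sub body_signature {
    my ($body) = @_;
    return sha1_hex($body);
}

# Pretend someone has already reported this message to the clearing-house.
my %reported = ( body_signature("Buy cheap pills now!\n") => 1 );

for my $body (
    "Buy cheap pills now!\n",               # identical bulk copy
    "Dear Pete, buy cheap pills now!\n",    # "personalized" copy
) {
    my $verdict = $reported{ body_signature($body) } ? 'known spam' : 'unknown';
    printf "%-10s %s", $verdict, $body;
}
```

The identical copy is recognized, while the copy with a name added produces a completely different digest and sails through.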

Nevertheless, the Razor is an excellent addition to the spam fighter's arsenal, since when it marks something as spam, you can be almost positive it's correct. And just like SpamAssassin, it's all pure Perl. Mail::Audit has long supported a Razor plugin, but now we can move to calling Razor as part of a more comprehensive mail filtering system based on SpamAssassin and Mail::Audit.

Installing Vipul's Razor is similar to installing SpamAssassin. Debian and BSD users have packages called "razor" and "razor-clients," respectively; and the rest of the world can download and install from the home page. SpamAssassin will detect whether Razor is available and, by default, use it if so.

Assassinating Spam With Mail::Audit : The Easy Way

So this is the part you've all been waiting for. How do we use these things to trap spam? For those of you who aren't familiar with Mail::Audit, the idea is simple: just like with procmail, you write recipes that determine what happens to your mail. However, in the case of Mail::Audit, you specify the recipe in Perl. For instance, here's a recipe to move all mail sent to perl5-porters@perl.org to another folder:


    use Mail::Audit;
    my $mail = Mail::Audit->new();
    if ($mail->to =~ /perl5-porters\@perl\.org/) {
        $mail->accept("p5p");
    }
    $mail->accept();

For more details on how to construct mail filters with Mail::Audit, see my previous article.

Plugging SpamAssassin into your filters couldn't be simpler. First of all, you absolutely need the latest version of Mail::Audit, version 2.1 from CPAN. Nothing earlier will do! Now write a filter like this:


    use Mail::Audit;
    use Mail::SpamAssassin;
    my $mail = Mail::Audit->new();

    ... the rest of your rules here ...

    my $spamtest = Mail::SpamAssassin->new();
    my $status = $spamtest->check($mail);

    if ($status->is_spam()) {
        $status->rewrite_mail();
        $mail->accept("spam");
    }
    $mail->accept();

As you might be able to guess, the important parts here are the calls to check and is_spam. check produces a "status object" that we can query and use to manipulate the e-mail. is_spam tells us whether the mail has exceeded the number of "spam points" required to flag an e-mail as spam.

The rewrite_mail method adds some headers and rewrites the subject line to include the distinctive string "*****SPAM*****". The additional headers explain why the e-mail was flagged as spam. For instance:


X-Spam-Status: Yes, hits=6.1 required=5.0
    tests=SUBJ_HAS_Q_MARK,REPLY_TO_EMPTY,SUBJ_ENDS_IN_Q_MARK version=2.1

This message had a question mark in the subject, an empty reply-to, and the subject ended in a question mark. The mail wasn't actually spam, but this goes to prove that the technique isn't perfect. Nevertheless, since installing the spam filter, I've only seen about 10 false positives, and zero false negatives. I'm happy enough with this solution.
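
If you want to act on these headers elsewhere, say to log scores or to pick out which rules fire most often, they are easy enough to pull apart. This is a sketch that assumes the header format shown above; parse_spam_status is a hypothetical helper and takes the header value (everything after "X-Spam-Status:") as a string.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the verdict, score, threshold and rule names
# out of an X-Spam-Status header value in the format shown above.
sub parse_spam_status {
    my ($header) = @_;
    $header =~ s/\s*\n\s+/ /g;    # unfold continuation lines
    my %status;
    ( $status{verdict} )  = $header =~ /^\s*(Yes|No)\b/i;
    ( $status{hits} )     = $header =~ /\bhits=(-?[\d.]+)/;
    ( $status{required} ) = $header =~ /\brequired=(-?[\d.]+)/;
    my ($tests) = $header =~ /\btests=(\S*)/;
    $status{tests} = [ split /,/, defined $tests ? $tests : '' ];
    return \%status;
}

my $status = parse_spam_status(
    "Yes, hits=6.1 required=5.0\n"
  . "\ttests=SUBJ_HAS_Q_MARK,REPLY_TO_EMPTY,SUBJ_ENDS_IN_Q_MARK version=2.1"
);

printf "%s: %.1f of %.1f points (%s)\n",
    $status->{verdict}, $status->{hits}, $status->{required},
    join( ', ', @{ $status->{tests} } );
```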

One important point to remember, however, is where in the course of your filtering you should call SpamAssassin's checks. For instance, you want to do so after your mailing list filtering, because mail sent to mailing lists may have munged headers that might confuse SpamAssassin. However, this means that spam sent to mailing lists might slip through the net. Experiment, and find the best solution for your own e-mail patterns.

Assassinating Spam Without Mail::Audit

Of course, there are times when it might not be suitable to use Mail::Audit or you may not want to. Since SpamAssassin is provided as a command line tool as well as a set of Perl modules, it's easy enough to integrate it in whatever mail filtering solution you use.

For instance, here's a procmail recipe that calls out to spamassassin to filter out spam:


:0fw
| spamassassin -P

:0:
* ^X-Spam-Status: Yes
spambox

For the speed-conscious, you can run the spamd daemon and replace calls to spamassassin with spamc; be aware that this is a TCP/IP daemon that you may want to firewall from the rest of the world.

Another approach is to call spamassassin in your mail transport agent, meaning that spam is filtered out before it even attempts to be delivered to you. There's a Sendmail milter library available that allows you to use SpamAssassin, and similar tricks for Exim and other MTAs are available.

Assassinating Spam With Mail::Audit : More Complex Operations

The Mail::SpamAssassin module has many other methods you can use to manipulate e-mail. For instance, if you've identified something as definitely being spam, then you can use


    $spamtest->report_as_spam($mail);

to report it to Vipul's Razor. (Take note of this: As we've mentioned above, the efficiency of the Razor database comes from the fact that e-mails in it are confirmed as spam by a human. Adding false positives to the database would degrade its usefulness for everyone. Only submit mail that you've confirmed personally.)

If you're finding that mail checking is taking too long because SpamAssassin is having to contact the various network-based blacklists and databases, then you can instruct it to only perform "local" checking:


    $spamtest = Mail::SpamAssassin->new({local_tests_only => 1});

There is a wealth of other options available. See the Mail::SpamAssassin documentation for more details, and happy assassinating!
