The Evolution of Perl Email Handling

Jun 10, 2004 by Simon Cozens

I spend the vast majority of my time at a computer working with email, whether it’s working through the ones I send and receive each day, or working on my interest in analyzing, indexing, organizing, and mining email content. Naturally, Perl helps out with this.

There are many modules on the CPAN for slicing and dicing email, and we’re going to take a whistlestop tour of the major ones. We’ll also concentrate on an effort started by myself, Richard Clamp, Simon Wistow, and others, called the Perl Email Project, to produce simple, efficient and accurate mail handling modules.

Message Handling

We’ll begin with those modules that represent an individual message, giving you access to the headers and body, and usually allowing you to modify these.

The granddaddy of these modules is Mail::Internet, originally created by Graham Barr and now maintained by Mark Overmeer. This module offers a constructor that takes either an array of lines or a filehandle, reads a message, and returns a Mail::Internet object representing the message. Throughout these examples, we’ll use the variable $rfc2822 to represent a mail message as a string.

    my $obj = Mail::Internet->new( [ split /\n/, $rfc2822 ] );

Mail::Internet splits a message into a header object in the Mail::Header class, plus a body. You can get and set individual headers through this object:

    my $subject = $obj->head->get("Subject");
    $obj->head->replace("Subject", "New subject");

Reading and editing the body is done through the body method:

    my $old_body = $obj->body;
    $obj->body("Wasn't worth reading anyway.");

I’ve not said anything about MIME yet. Mail::Internet is reasonably handy for simple tasks, but it doesn’t handle MIME at all. Thankfully, MIME::Entity is a MIME-aware subclass of Mail::Internet; it allows you to read individual parts of a MIME message:

    my $num_parts = $obj->parts;
    for (0..$num_parts) {
        my $part = $obj->parts($_);
        ...
    }

If Mail::Internet and MIME::Entity don’t cut it for you, you can try Mark Overmeer’s own Mail::Message, part of the impressive Mail::Box suite. Mail::Message is extremely featureful and comprehensive, but that is not always meant as a compliment.

Mail::Message objects are usually constructed by Mail::Box as part of reading in an email folder, but can also be generated from an email using the read method:

    $obj = Mail::Message->read($rfc2822);

Like Mail::Internet, messages are split into headers and bodies; unlike Mail::Internet, the body of a Mail::Message object is also an object. We read headers like so:

    $obj->head->get("Subject");

Or, for Subject and other common headers:

    $obj->subject;

I couldn’t find a way to set headers directly, and ended up doing this:

    $obj->head->delete($header);
    $obj->head->add($header, $_) for @data;

Reading the body as a string is only marginally more difficult:

    $obj->decoded->string

While setting the body is an absolute nightmare–we have to create a new Mail::Message::Body object and replace our current one with it.

    $obj->body(Mail::Message::Body->new(data => [split /\n/, $body]));

Mail::Message may be slow, but it’s certainly hard to use. It’s also rather complex; the operations we’ve looked at so far involved the use of 16 classes (Mail::Address, Mail::Box::Parser, Mail::Box::Parser::Perl, Mail::Message, Mail::Message::Body, Mail::Message::Body::File, Mail::Message::Body::Lines, Mail::Message::Body::Multipart, Mail::Message::Body::Nested, Mail::Message::Construct, Mail::Message::Field, Mail::Message::Field::Fast, Mail::Message::Head, Mail::Message::Head::Complete, Mail::Message::Part, and Mail::Reporter) and 4400 lines of code. It does have a lot of features, though.

Foolishly, I thought that email parsing shouldn’t be so complex, and so I sat down to write the simplest possible functional mail handling library. The result is Email::Simple, and its interface looks like this:

    my $obj = Email::Simple->new($rfc2822);
    my $subject = $obj->header("Subject");
    $obj->header_set("Subject", "A new subject");
    my $old_body = $obj->body;
    $obj->body_set("A new body\n");
    print $obj->as_string;

It doesn’t do a lot, but it does it simply and efficiently. If you need MIME handling, there’s a subclass called Email::MIME, which adds the parts method.

Realistically, the choice of which mail handling library to use ought to be up to you, the end user, but this isn’t always true. Auxilliary modules, which mess about with email at a higher level, can ask for the mail to be presented in a particular representation. For instance, until recently, the wonderful Mail::ListDetector module, which we’ll examine later, required mails passed in to it to be Mail::Internet objects, since this gave it a known API to work with the objects. I don’t want to work with Mail::Internet objects, but I want to use Mail::ListDetector’s functionality. What can I do?

In order to enable the user to have the choice again, I wrote an abstraction layer across all of the above modules, called Email::Abstract. Given any of the above objects, we can say:

     my $subject = Email::Abstract->get_header($obj, "Subject");
     Email::Abstract->set_header($obj, "Subject", "My new subject");
     my $body = Email::Abstract->get_body($obj);
     Email::Abstract->set_body($message, "Hello\nTest message\n");
     $rfc2822 = Email::Abstract->as_string($obj);

Email::Abstract knows how to perform these operations on the major types of mail representation objects. It also abstracts out the process of constructing a message, and allows you to change the interface of a message using the cast class method:

    my $obj = Email::Abstract->cast($rfc2822, "Mail::Internet");
    my $mm = Email::Abstract->cast($obj, "Mail::Message");

This allows module authors to write their mail handling libraries in an interface-agnostic way, and I’m grateful to Michael Stevens for taking up Email::Abstract in Mail::ListDetector so quickly. Now I can pass in Email::Simple objects to Mail::ListDetector and it will work fine.

Email::Abstract also gives us the opportunity to create some benchmarks for all of the above modules. Here was the benchmarking code I used:

    use Email::Abstract;
    my $message = do { local $/; <DATA>; };
    my @classes =
        qw(Email::MIME Email::Simple MIME::Entity Mail::Internet Mail::Message);

    eval "require $_" or die $@ for @classes;

    use Benchmark;
    my %h;
    for my $class (@classes) {
        $h{$class} = sub {
            my $obj = Email::Abstract->cast($message, $class);
            Email::Abstract->get_header($obj, "Subject");
            Email::Abstract->get_body($obj);
            Email::Abstract->set_header($obj, "Subject", "New Subject");
            Email::Abstract->set_body($obj, "A completely new body");
            Email::Abstract->as_string($obj);
        }
    }
    timethese(1000, \%h);

    __DATA__
    ...

I put a short email in the DATA section and ran the same simple operations a thousand times: construct a message, read a header, read the body, set the header, set the body, and return the message as a string.

    Benchmark: timing 1000 iterations of Email::MIME, Email::Simple,
    MIME::Entity, Mail::Internet, Mail::Message...
    Email::MIME: 10 wallclock secs ( 7.97 usr +  0.24 sys =  8.21 CPU)
        @ 121.80/s (n=1000)
    Email::Simple:  9 wallclock secs ( 7.49 usr +  0.05 sys =  7.54 CPU)
        @ 132.63/s (n=1000)
    MIME::Entity: 33 wallclock secs (23.76 usr +  0.35 sys = 24.11 CPU)
        @ 41.48/s (n=1000)
    Mail::Internet: 24 wallclock secs (17.34 usr +  0.30 sys = 17.64 CPU)
        @ 56.69/s (n=1000)
    Mail::Message: 20 wallclock secs (17.12 usr +  0.27 sys = 17.39 CPU)
        @ 57.50/s (n=1000)

The Perl Email Project was a success: Email::MIME and Email::Simple were twice as fast as their nearest competitors. However, it should be stressed that they’re both very low level; if you’re doing anything more complex than the operations we’ve seen, you might consider one of the older Mail:: modules.

Mailbox Handling

So much for individual messages; let’s move on to handling groups of messages, or folders. We’ve mentioned Mail::Box already, and this is truly the king of folder handling, supporting local and remote folders, editing folders, and all sorts of other things besides. To use it, we first need a Mail::Box::Manager, which is a factory object for creating Mail::Boxes.

    use Mail::Box::Manager
    my $mgr = Mail::Box::Manager->new;

Next, we need to open the folder using the manager:

    my $folder = $mgr->open(folder => $folder_file);

And now we can get at the individual messages as Mail::Message objects:

    for ($folder->messages) {
        print $_->subject,"\n";
    }

With its more minimalist approach, my favorite mail box manager until recently was Mail::Util’s read_mbox function, which takes the name of a Unix mbox file, and returns a list of array references; each reference is the array of lines of a message, suitable for feeding to Mail::Internet->new or similar:

    for (read_mbox($folder_file)) {
        my $obj = Mail::Internet->new($_);
        print $_->head->get("Subject"),"\n";
    }

These two are both really handy, but there seemed to be room for something in between the simplicity of Mail::Util and the functionality of Mail::Box, and so the Email Project struck again with Email::Folder and Email::LocalDelivery. Email::Folder handles mbox and maildir folders, with more types planned, and has a reasonably simple interface:

    my $folder = Email::Folder->new($folder_file);
    for ($folder->messages) {
        print $_->header("Subject"),"\n";
    }

By default it returns Email::Simple objects for the messages, but this can be changed by subclassing. For instance, if we want raw RFC2822 strings, we can do this:

    package Email::Folder::Raw; use base 'Email::Folder';
    sub bless_message { my ($self, $rfc2822) = @_; return $rfc2822; }

Perhaps in the future, we will change bless_message to use Email::Abstract->cast to make the representation of messages easier to select without necessarily having to subclass.

The other side of folder handling is writing to a folder, or “local delivery”. Email::LocalDelivery was written to assist Email::Filter, of which more later. The problem is harder than it sounds, as it has to deal with locking, escaping mail bodies, and specific problems due to mailbox and maildir formats. LocalDelivery hides all of these things beneath a simple interface:

    Email::LocalDelivery->deliver($rfc2822, @mailboxes);

Both Email::LocalDelivery and Email::Folder use the Email::FolderType helper module to determine the type of a folder based on its filename.

Address Handling

To come down to a lower level of abstraction again, there are a number of modules for handling email addresses. The old favorite is Mail::Address. A mail address appearing in the fields of an email can be made up of several elements: the actual address, a phrase or name, and a comment. For instance:

    Example user <example@example.com> (Not a real user)

Mail::Address parses these addresses, separating out the phrase and comments, allowing you to get at the individual components:

    for (Mail::Address->parse($from_line)) {
        print $_->name, "\t", $_->address, "\n";
    }

Unfortunately, like many of the mail modules, it tries really hard to be helpful.

    my ($addr) = Mail::Address->parse('"eBay, Inc." <support@ebay.com>');
    print $addr->name # Inc. eBay

Which, while better than the “Inc Ebay” that previous versions would produce, isn’t really acceptable. Casey West joined our merry band of renegades and produced Email::Address. It has exactly the same interface as Mail::Address, but it works, and is about twice to three times as fast.

One thing we often want to do when handling mail addresses is to make sure that they’re valid. If, for instance, a user is registering for content at a web site, we need to check that the address they’ve given is capable of receiving mail. Email::Valid, the original inhabitant of the Email:: namespace before our bunch of disaffected squatters moved in, does just this. In its most simple use, we can say:

    if (not Email::Valid->address('test@example.com')) {
        die "Not a valid address"
    }

You can turn on additional checks, such as ensuring there’s a valid MX record for the domain, correcting common AOL and Compuserve addressing mistakes, on so on:

    if (not Email::Valid->address(-address => 'test@example.com',
                                  -mxcheck => 1)) {
        die "Not a valid address"
    }

Mail Munging

Once we have our emails, what are we going to do with them? A lot of what I’ve been looking at has been textual analysis of email, and there are three modules that particularly help with this.

This first is Text::Quoted; it takes the body text of an email message, or any other text really, and tries to figure out which parts of the message are quotations from other messages. It then separates these out into a nested data structure. For instance, if we have

    $message = <<EOF
    > foo
    > # Bar
    > baz

    quux
    EOF

Then running extract($message) will return a data structure like this:

    [
      [
        { text => 'foo', quoter => '>', raw => '> foo' },
        [
            { text => 'Bar', quoter => '> #', raw => '> # Bar' }
        ],
        { text => 'baz', quoter => '>', raw => '> baz' }
      ],

      { empty => 1 },
      { text => 'quux', quoter => '', raw => 'quux' }
    ];

This is extremely useful for highlighting different levels of quoting in different colors when displaying a message. A similar concept is Text::Original, which looks for the start of original, non-quoted content in an email. It knows about many kinds of attribution lines, so with:

    $message = <<EOF
    You wrote:
    > Why are there so many different mail modules?

    There's more than one way to do it! Different modules have different
    focuses, and operate at different levels; some lower, some higher.
    EOF

the first_sentence($message) would be There's more than one way to do it!. The Mariachi mailing list archiver uses this technique to give a “prompt” for each message in a thread.

And speaking of threads, the Mail::Thread module is a Perl implementation of Jamie Zawinski’s mail threading algorithm, as used by Mozilla as well as many other mail clients since then. It’s also used by Mariachi, and has recently been updated to use Email::Abstract to handle any kind of mail object you want to throw at it:

    my $threader = Mail::Thread->new(@mails);
    $threader->thread; # Compute threads
    for ($threader->rootset) { # Original mails in a thread
        dump_thread($_);
    }

Mail Filtering

The classic Perl mail filtering tool is Mail::Audit, and I’ve written articles here about using Mail::Audit on its own and using it in conjunction with Mail::SpamAssassin.

We’ve mentioned Mail::ListDetector a couple of times already, and I use this with Mail::Audit to do most of the filtering automatically for me. The Mail::Audit::List plugin uses ListDetector to look for mailing list headers in a message; these are things like List-Id, X-Mailman-Version, and the like, which identify a mail as having come through a mailing list. This means I can filter out all mailing list posts to their own folders, like so:

    my $list = Mail::ListDetector->new($obj);
    if ($list) {
        my $name = $list->listname;
        $item->accept("mail/$name.-$date");
    }

However, Mail::Audit itself is getting a little long in the tooth, and so new installations are encouraged to use the Email Project’s Email::Filter instead; it has the same interface for the most part, although not all of the same features, and it uses the new-fangled Email::Simple mail representation for speed and cleanliness.

Mail Mining

Finally, the most high-level thing I do with email is develop frameworks to automatically categorize, organize, and index mail into a database, and attempt to analyze it for interesting nuggets of information.

My first module to do this with was Mail::Miner, which consists of three major parts. The first part takes an email, removes any attachments, and stores the lot in a database. The second looks over the email and runs a set of “Recogniser” modules on it; these find addresses, phone numbers, keywords and phrases, and so on, and store them in a separate database table. The third part is a command-line tool to query the database for mail and information.

For instance, if I need to find Tim O’Reilly’s postal address, I ask the query tool, mm, to find addresses in emails from him:

 % mm --from "Tim O" --address
 Address found in message 1835 from "Tim O'Reilly" <tim@oreilly.com>:
 Tim O'Reilly @ O'Reilly & Associates, Inc.
 1005 Gravenstein Highway North, Sebastopol, CA 95472

To retrieve the whole email, I’d say

 % mm --id 1835

And if it originally contained an attachment, we’d see something like this as part of the email:

 [ text/xml attachment something.xml detached - use
   mm --detach 208
   to recover ]

I paste that middle line mm --detach 208 into a shell, and hey presto, something.xml is written to disk.

Now Mail::Miner is all very well, but having the three ideas in one tight package–filing mail, mining mail, and interfacing to the database–makes it difficult to develop and extend any one of them. And of course, it uses the old-school Mail:: modules.

This brings us to our final module on the mail modules tour, and the most recently released: Email::Store. This is a framework, based on Class::DBI, for storing email in a database and indexing it in various ways:

   use Email::Store 'dbi:SQLite:mail.db';
   Email::Store->setup;
   Email::Store::Mail->store($rfc2822);

And then later…

   my ($name) = Email::Store::Name->search( name => "Simon Cozens" )
   @mails_from_simon = $name->addressings( role => "From" )->mails;

It can be used to build a mailing list archive tool such as Mariachi, or a data mining setup like Mail::Miner. It’s still very much in development, and makes use of a new idea in module extensibility.

I’ll be bringing more information when we’ve written the first mail archiving and searching tool using Email::Store, which I’m going to be doing as a new interface to the Perl mailing lists at perl.org.

Conclusion

We’ve looked at the major modules for mail handling on CPAN, and there are many more. I am obviously biased towards those which I wrote, and particularly the Perl Email Project modules in the Email::* namespace. These modules are specifically designed to be simple, efficient, and correct, but may not always be a good substitute for the more thorough Mail::* modules, particularly Mail::Box. However, I hope you’re now a little more aware of the diversity of mail handling tools out there, and know where to look next time you need to manipulate email with Perl.

Tags