The Evolution of Perl Email Handling

by Simon Cozens
June 10, 2004

I spend the vast majority of my time at a computer working with email, whether that means working through the messages I send and receive each day or pursuing my interest in analyzing, indexing, organizing, and mining email content. Naturally, Perl helps out with this.

There are many modules on the CPAN for slicing and dicing email, and we're going to take a whistlestop tour of the major ones. We'll also concentrate on the Perl Email Project, an effort started by Richard Clamp, Simon Wistow, me, and others to produce simple, efficient, and accurate mail handling modules.

Message Handling

We'll begin with those modules that represent an individual message, giving you access to the headers and body, and usually allowing you to modify these.

The granddaddy of these modules is Mail::Internet, originally created by Graham Barr and now maintained by Mark Overmeer. This module offers a constructor that takes either an array of lines or a filehandle, reads a message, and returns a Mail::Internet object representing the message. Throughout these examples, we'll use the variable $rfc2822 to represent a mail message as a string.

    # split /^/ keeps each line's trailing newline, as if read from a filehandle
    my $obj = Mail::Internet->new( [ split /^/, $rfc2822 ] );

Mail::Internet splits a message into a header object in the Mail::Header class, plus a body. You can get and set individual headers through this object:

    my $subject = $obj->head->get("Subject");
    $obj->head->replace("Subject", "New subject");

Reading and editing the body is done through the body method:

    my $old_body = $obj->body;
    $obj->body("Wasn't worth reading anyway.");
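
Putting those calls together, here is a minimal sketch of a round trip through Mail::Internet. The sample message and its addresses are made up for illustration. Header values come back from get with their trailing newline still attached, hence the chomp.

```perl
use strict;
use warnings;
use Mail::Internet;

# Made-up message standing in for $rfc2822.
my $rfc2822 = <<'END';
From: alice@example.com
To: bob@example.com
Subject: Lunch?

Are you free at noon?
END

# split /^/ keeps the newline on each line, as if read from a filehandle.
my $obj = Mail::Internet->new( [ split /^/, $rfc2822 ] );

# Header values are returned with their trailing newline.
my $subject = $obj->head->get("Subject");
chomp $subject;

# Replace the header, then swap in a one-line body.
$obj->head->replace("Subject", "Re: $subject");
$obj->body("Wasn't worth reading anyway.\n");

print $obj->as_string;
```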

I've not said anything about MIME yet. Mail::Internet is reasonably handy for simple tasks, but it doesn't handle MIME at all. Thankfully, MIME::Entity is a MIME-aware subclass of Mail::Internet; it allows you to read individual parts of a MIME message:

    my $num_parts = $obj->parts;          # in scalar context, the number of parts
    for (0 .. $num_parts - 1) {           # parts are indexed from zero
        my $part = $obj->parts($_);
        ...
    }
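
For that loop to work, $obj has to be a MIME::Entity rather than a plain Mail::Internet object. One way to get one from a string is MIME::Parser, which ships in the same distribution as MIME::Entity. The following is only a sketch under that assumption; the two-part message is invented for the example.

```perl
use strict;
use warnings;
use MIME::Parser;

# Made-up multipart message standing in for $rfc2822.
my $rfc2822 = <<'END';
From: alice@example.com
Subject: Two parts
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="XYZ"

--XYZ
Content-Type: text/plain

Hello.
--XYZ
Content-Type: text/html

<p>Hello.</p>
--XYZ--
END

my $parser = MIME::Parser->new;
$parser->output_to_core(1);       # keep part bodies in memory, not temp files
my $entity = $parser->parse_data($rfc2822);

my $num_parts = $entity->parts;   # scalar context: number of parts
for my $i (0 .. $num_parts - 1) {
    my $part = $entity->parts($i);
    printf "%d: %s\n", $i, $part->mime_type;
}
```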

If Mail::Internet and MIME::Entity don't cut it for you, you can try Mark Overmeer's own Mail::Message, part of the impressive Mail::Box suite. Mail::Message is extremely featureful and comprehensive, but that is not always meant as a compliment.

Mail::Message objects are usually constructed by Mail::Box as part of reading in an email folder, but can also be generated from an email using the read method:

    $obj = Mail::Message->read($rfc2822);

Like Mail::Internet, messages are split into headers and bodies; unlike Mail::Internet, the body of a Mail::Message object is also an object. We read headers like so:

    $obj->head->get("Subject");

Or, for Subject and other common headers:

    $obj->subject;

I couldn't find a way to set headers directly, and ended up doing this:

    $obj->head->delete($header);
    $obj->head->add($header, $_) for @data;

Reading the body as a string is only marginally more difficult:

    $obj->decoded->string;

Setting the body, on the other hand, is an absolute nightmare: we have to create a new Mail::Message::Body object and replace our current one with it.

    $obj->body(Mail::Message::Body->new(data => [split /\n/, $body]));

Mail::Message may be slow, but it's certainly hard to use. It's also rather complex; the operations we've looked at so far involved the use of 16 classes (Mail::Address, Mail::Box::Parser, Mail::Box::Parser::Perl, Mail::Message, Mail::Message::Body, Mail::Message::Body::File, Mail::Message::Body::Lines, Mail::Message::Body::Multipart, Mail::Message::Body::Nested, Mail::Message::Construct, Mail::Message::Field, Mail::Message::Field::Fast, Mail::Message::Head, Mail::Message::Head::Complete, Mail::Message::Part, and Mail::Reporter) and 4400 lines of code. It does have a lot of features, though.

Foolishly, I thought that email parsing shouldn't be so complex, and so I sat down to write the simplest possible functional mail handling library. The result is Email::Simple, and its interface looks like this:

    my $obj = Email::Simple->new($rfc2822);
    my $subject = $obj->header("Subject");
    $obj->header_set("Subject", "A new subject");
    my $old_body = $obj->body;
    $obj->body_set("A new body\n");
    print $obj->as_string;

It doesn't do a lot, but it does it simply and efficiently. If you need MIME handling, there's a subclass called Email::MIME, which adds the parts method.
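
As a rough illustration of the parts method, the sketch below walks the top-level parts of a message with Email::MIME. The message itself is a made-up example, and only calls inherited from Email::Simple plus parts and content_type are used.

```perl
use strict;
use warnings;
use Email::MIME;

# Made-up multipart message standing in for $rfc2822.
my $rfc2822 = <<'END';
From: carol@example.com
Subject: Report attached
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="frontier"

--frontier
Content-Type: text/plain

See the attached notes.
--frontier
Content-Type: text/csv

id,total
1,42
--frontier--
END

my $mime = Email::MIME->new($rfc2822);
print "Subject: ", $mime->header("Subject"), "\n";

# parts returns a list of Email::MIME objects, one per top-level part.
my @parts = $mime->parts;
for my $part (@parts) {
    printf "%s (%d bytes)\n", $part->content_type, length $part->body;
}
```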

Realistically, the choice of which mail handling library to use ought to be up to you, the end user, but this isn't always true. Auxiliary modules, which mess about with email at a higher level, can ask for the mail to be presented in a particular representation. For instance, until recently, the wonderful Mail::ListDetector module, which we'll examine later, required mails passed to it to be Mail::Internet objects, since this gave it a known API to work with the objects. I don't want to work with Mail::Internet objects, but I want to use Mail::ListDetector's functionality. What can I do?

In order to enable the user to have the choice again, I wrote an abstraction layer across all of the above modules, called Email::Abstract. Given any of the above objects, we can say:

     my $subject = Email::Abstract->get_header($obj, "Subject");
     Email::Abstract->set_header($obj, "Subject", "My new subject");
     my $body = Email::Abstract->get_body($obj);
     Email::Abstract->set_body($obj, "Hello\nTest message\n");
     $rfc2822 = Email::Abstract->as_string($obj);

Email::Abstract knows how to perform these operations on the major types of mail representation objects. It also abstracts out the process of constructing a message, and allows you to change the interface of a message using the cast class method:

    my $obj = Email::Abstract->cast($rfc2822, "Mail::Internet");
    my $mm = Email::Abstract->cast($obj, "Mail::Message");

This allows module authors to write their mail handling libraries in an interface-agnostic way, and I'm grateful to Michael Stevens for taking up Email::Abstract in Mail::ListDetector so quickly. Now I can pass in Email::Simple objects to Mail::ListDetector and it will work fine.
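
To make "interface-agnostic" concrete, here is a small sketch of a library routine written only against Email::Abstract. The tag_subject helper is hypothetical, invented for this example; it is shown with an Email::Simple object but never calls that class's own methods, so it should work equally well with the other representations.

```perl
use strict;
use warnings;
use Email::Abstract;
use Email::Simple;

# Hypothetical helper: prefixes the Subject of any message object that
# Email::Abstract knows how to handle, without caring about its class.
sub tag_subject {
    my ($mail, $tag) = @_;
    my $subject = Email::Abstract->get_header($mail, "Subject");
    $subject = "" unless defined $subject;
    Email::Abstract->set_header($mail, "Subject", "[$tag] $subject");
    return $mail;
}

my $simple = Email::Simple->new("Subject: Hello\n\nBody text\n");
tag_subject($simple, "perl");
print $simple->header("Subject"), "\n";
```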

Email::Abstract also gives us the opportunity to create some benchmarks for all of the above modules. Here is the benchmarking code I used:

    use Email::Abstract;
    my $message = do { local $/; <DATA>; };
    my @classes =
        qw(Email::MIME Email::Simple MIME::Entity Mail::Internet Mail::Message);

    eval "require $_" or die $@ for @classes;

    use Benchmark;
    my %h;
    for my $class (@classes) {
        $h{$class} = sub {
            my $obj = Email::Abstract->cast($message, $class);
            Email::Abstract->get_header($obj, "Subject");
            Email::Abstract->get_body($obj);
            Email::Abstract->set_header($obj, "Subject", "New Subject");
            Email::Abstract->set_body($obj, "A completely new body");
            Email::Abstract->as_string($obj);
        }
    }
    timethese(1000, \%h);

    __DATA__
    ...

I put a short email in the DATA section and ran the same simple operations a thousand times: construct a message, read a header, read the body, set the header, set the body, and return the message as a string.

    Benchmark: timing 1000 iterations of Email::MIME, Email::Simple, 
    MIME::Entity, Mail::Internet, Mail::Message...
    Email::MIME: 10 wallclock secs ( 7.97 usr +  0.24 sys =  8.21 CPU) 
        @ 121.80/s (n=1000)
    Email::Simple:  9 wallclock secs ( 7.49 usr +  0.05 sys =  7.54 CPU) 
        @ 132.63/s (n=1000)
    MIME::Entity: 33 wallclock secs (23.76 usr +  0.35 sys = 24.11 CPU) 
        @ 41.48/s (n=1000)
    Mail::Internet: 24 wallclock secs (17.34 usr +  0.30 sys = 17.64 CPU) 
        @ 56.69/s (n=1000)
    Mail::Message: 20 wallclock secs (17.12 usr +  0.27 sys = 17.39 CPU) 
        @ 57.50/s (n=1000)

The Perl Email Project was a success: Email::MIME and Email::Simple were more than twice as fast as their nearest competitor, Mail::Message. However, it should be stressed that they're both very low-level; if you're doing anything more complex than the operations we've seen, you might consider one of the older Mail:: modules.
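
The comparison can be checked directly from the benchmark output. This short sketch uses only core Perl and the rates reported above, taking Mail::Message, the fastest of the older modules, as the baseline.

```perl
use strict;
use warnings;

# Iterations per second, copied from the benchmark output.
my %rate = (
    "Email::MIME"    => 121.80,
    "Email::Simple"  => 132.63,
    "MIME::Entity"   => 41.48,
    "Mail::Internet" => 56.69,
    "Mail::Message"  => 57.50,
);

# Baseline: the fastest of the older modules.
my $baseline = $rate{"Mail::Message"};

# Print each module's rate and its speed relative to the baseline.
for my $class (sort { $rate{$b} <=> $rate{$a} } keys %rate) {
    printf "%-15s %7.2f/s  %.2fx\n",
        $class, $rate{$class}, $rate{$class} / $baseline;
}
```

On these figures Email::Simple comes out at about 2.31 times the baseline and Email::MIME at about 2.12 times.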
