June 2004 Archives

This Week on Perl 6, Fortnight Ending 2004-06-21

Good evening. You find me typing away in a motel room on the edge of the Lake District on the shortest night of the year. I suppose, by rites, I should be spending the night by some stone circle drinking, dancing, and generally making merry, but as I'm a teetotalling, unfit, atheistical nerd with a mission to summarize, I'll skip that and plough straight into the activities of another arcane sect. Yes, it's perl6-internals again ...

Toward an Accurate README

Andy Dougherty sent in a patch adjusting the Parrot README so that it accurately reflected the toolset needed to build Parrot. For some reason, as well as the requirement for a C++ compiler if you want to build the ICU library, we've moved to needing Perl 5.6.0 rather than 5.005. Nicholas Clark suggested that, rather than bumping the Perl version requirement, we bundle appropriate versions of the modules Parrot requires, but chromatic thought it better to check for the correct versions of said modules at Configure time and bail out if they weren't there.

There was further discussion about whether going to Perl 5.6.0 as a requirement was the right thing to do -- the issue is that some OSes (FreeBSD and Solaris 8 were cited) still only ship with 5.005 family Perls. My feeling (which I didn't mention in the thread, more fool me) is that, as eventually the goal is to eliminate the need for Perl at all in the build process, requiring 5.6.0 rather than 5.005 isn't exactly the end of the world.

http://groups.google.com/groups?selm=rt-3.0.9-30095-89885.16.61061401395@perl.org

http://groups.google.com/groups?selm=Pine.SOL.4.58.0406091317550.12666@maxwell.phys.lafayette.edu

OS X Builds Again

Someone finally summoned up the tuits to fix the OS X build. Step forward Nicholas Clark and accept your plaudits.

http://groups.google.com/groups?selm=a06110404bcee2724f70a@[10.0.1.3]

Basics of the PMC Class Set

Dan announced that he was about to make some changes to the repository, including establishing a basic set of PMC classes. Matt Fowles wondered about Array PMCs, which seemed to be missing from Dan's list. Dan agreed that they should be rationalized as well.

Bernhard Schmalhofer suggested adding a complex number class to Dan's basic menagerie, but Dan was unconvinced.

http://groups.google.com/groups?selm=a06110402bcee766bc83c@[172.24.18.98]

Correct Use of #define

Nicholas Clark was moved to growl by a section of include/parrot/string.h that used a #define to define a type. Actually, he was moved to 'swearing and cursing and time wasted grepping to find out what a STRING is'. Dan agreed that this was bad and decreed: 'Typedefs for types, #defines for constants and bizarre (temporary!) debugging things.'

http://groups.google.com/groups?selm=20040611095043.GB81272@plum.flirble.org

The Next Release

During a lull while Leo Tötsch was away at a conference, Dan mused on the next release of parrot. His goal for that is a rejig of classes/, aiming to make sure that all the PMCs we want are in and working, and that the ones we don't care about are neatly deprecated.

Warnock applies.

http://groups.google.com/groups?selm=a06110409bcefa5f8c6ff@[10.0.1.3]

A Small Task for the Interested

Stretching the definition of 'small' somewhat, Dan asked for volunteers to implement the missing PMCs detailed in PDD 17. He reckoned it should be 'fairly straightforward'. A new entrant in these summaries, Ion Alexandru Morega took a crack at the String PMC and posted a patch to the list which was promptly ignored. Which seems a little unfair really.

http://groups.google.com/groups?selm=a0611040abcefb912413e@[10.0.1.3]

Bignums!

Dan asked for a volunteer to get Bignums working. Alin Iacob stepped up to the plate. Leo suggested that, rather than starting from types/bignum.c, it might be better to use an existing, maintained, maths (Look, I'll spell 'summarise' with a 'z' -- the OED does -- but it will be a cold, cold day in hell when I start abbreviating 'mathematics' as 'math'. Ahem.) package. Dan worried about licence compatibility; the proposed GMP package is licensed under the LGPL, which may (or may not) be compatible with Parrot's Artistic/GPL dual licence. After a closer reading he reckoned that GMP's licence is definitely incompatible with Parrot.

http://groups.google.com/groups?selm=a0611040bbcefb9fb77e6@[10.0.1.3]

Adding Fixed*Array Classes

Matt Fowles posted a patch implementing Fixed Array classes. Dan applied it. Leo wondered about the patch's use of malloc(3) instead of Parrot's memory management. Dan wasn't worried in the case of fixed size arrays.

http://groups.google.com/groups?selm=rt-3.0.9-30230-90377.3.32088317231715@perl.org

Resizeable*Array Classes

Fresh from his Fixed Array triumph, Matt Fowles posted a patch implementing naive Resizeable Arrays. Leo thought it a little too naive, and worried about duplication of existing functionality. Dan wasn't worried about the naivete, or the duplication of functionality. He pointed out that it was more important to get something which could be improved, and that the duplication was okay given that the idea was to get a standard framework in place and then eliminate the duplication. (I admit that I'm a little surprised to hear Dan, who's normally a strong advocate of up-front design, preaching the refactorer's creed...)

http://groups.google.com/groups?selm=rt-3.0.9-30245-90438.1.98115444593533@perl.org

Making PMCs

Nicholas Clark wondered how to go about making a PMC, pointing out that the documentation is rather sparse. He had a bunch of questions for the Men Who Know. Matt Fowles and Dan Sugalski came up with answers. Leo wrote a document and checked it in as docs/pmc.pod. If you're interested in implementing a PMC, you'd do well to read it.

http://groups.google.com/groups?selm=20040613100125.GG81272@plum.flirble.org

Slices and Iterators

Continuing his campaign to spec out more and more of Parrot, Dan set to work on defining and documenting Parrot's handling of slices and iterators. It appears that array slicing syntax is one of those topics where you can't please everyone; Dan's post generated a good deal of discussion.

Dan added more in a later post and, in a later post still, threw up his hands and asked for a volunteer to define the iterator protocol.

http://groups.google.com/groups?selm=a06110409bcf382d914cc@[10.0.1.3]

http://groups.google.com/groups?selm=a06110409bcf4cf7b179c@[172.24.18.98]

http://groups.google.com/groups?selm=a0611040bbcf4ecb9f27f@[172.24.18.98]

Iterator semantics

Having a syntax is all very well, but you need to define what it means too. So Dan did.

http://groups.google.com/groups?selm=a0611040abcf38b01fe20@[10.0.1.3]

Some rationale for Parrot's mixed encoding scheme

Argh! Unicode! Run away!

Dan posted the rationale for his design for Parrot's strings, in particular the decision to defer conversion of strings to Unicode. Amazingly, there was no argument.

http://groups.google.com/groups?selm=a0611041bbcf3f6bd3ee8@[10.0.1.3]

Strings. Finally.

Dan posted "the official, 1.0, final version" of the Parrot Strings document. Modulo a caveat about the use of 'grapheme' it looks like this one might stick.

http://groups.google.com/groups?selm=a06110418bcf3bff76821@[10.0.1.3]

Strings' Internals

Having got a Strings spec that people seemed to agree on, Dan went on to discuss how Parrot's strings should work internally. He and Mark A. Biggar ironed out a few wrinkles.

http://groups.google.com/groups?selm=a06110403bcf61a2185f0@[10.0.1.3]

Fixing pbc2c.pl

Dan asked for volunteers to get build_tools/pbc2c.pl working and/or write a test suite for it.

http://groups.google.com/groups?selm=a06110402bcf79fccdc91@[10.0.1.3]

PIO_unix_pipe

Leo extended Parrot's open opcode to allow for opening a Unix pipe rather than simply a file. There were issues with passing arguments to the program being piped to/from, but Dan came up with a suggested backtick opcode whose semantics were liked, but whose necessity and name were called into question. Leo pointed out that the major credit for the functionality belongs to Melvin Smith; Leo merely implemented a means to get at the functionality from Parrot code.

http://groups.google.com/groups?selm=200406071207.i57C7bh27980@thu8.leo.home

Teapots!

Nick Glencross wants us all to look at his shiny SDL teapots; the poor man's wife is apparently underwhelmed, so let's all make with the "Wooo! Well done that man!".

Simple Trinary Ops

Dan toyed with adding a couple of non-flowcontrol min/max ops along the lines of

    min $P1, $P2, $P3 # Sets $P1 to the lesser of $P2 and $P3

He noted that such ops would make some of the code he was generating much simpler, but wasn't sure if that particular itch needed to be scratched with a new op. He was rapidly convinced that it wasn't a good idea (I wouldn't be surprised to discover that he was convinced of this about 10 seconds after he hit the 'send' button for the thread's root message).

Steve Fink attempted to resurrect the idea by proposing a single choose operator, which looks rather neat, but is probably still unnecessary.

http://groups.google.com/groups?selm=a06110404bcf6254d244e@[10.0.1.3]

Meanwhile, in perl6-language

Cawley's Corollary to Godwin's Law

Given enough time, any thread on perl6-language will end up arguing the toss about Unicode operators.

Perl 6 will have Unicode operators. It will also have ASCII equivalents. You're highly unlikely to convince Larry to change his mind on this.

IDs of Subroutine Wrappers Should be Objects

Ingo Blechschmidt made a convincing argument for making the subroutine wrappers discussed in Apocalypse 6 into first class objects. (Well, I was convinced.)

Later in the thread, Dan invoked "our Quantum Ninjas".

http://groups.google.com/groups?selm=20040608130813.23788.qmail@onion.perl.org

Simple Grammar Example

Aldo Calpini asked for assistance in coming up with a simple, yet impressive, example of Perl 6 grammars. I'm not sure anyone's come up with a suitably satisfactory example yet, but Garrett Goebel pointed out that Damian has released Perl6::Rules. And there was much rejoicing (well, in this summarizer's head there was).

http://groups.google.com/groups?selm=40C71BFD.1050105@perl.it

http://search.cpan.org/~dconway/Perl6-Rules-0.03/Rules.pm

Messages Not Guaranteed to Arrive in the Order Sent

In a message dated 28th of January 2004, Damian Conway addressed the issue of junctive lvalues.

http://groups.google.com/groups?selm=40171197.6070401@conway.org

Apologies, Acknowledgements, Announcements

This summary would have been a great deal trickier to write if it weren't for the efforts of Jeffrey Dik, Sebastian Riedel, Brent Royal-Gordon and Robert Spier, all of whom helped me fill in my message archive while my main server is away being fettled.

Computer menders and British Telecom willing, next week's summary will actually happen. If it doesn't, I'll be along in a fortnight.

If you find these summaries useful or enjoyable, you can show your appreciation by contributing to the Perl Foundation to help support the ongoing development of Perl. Money and time are both appreciated. Or you could drop me a line at mailto:p6summarizer@bofh.org.uk; I'm always happy to get feedback.

http://donate.perlfoundation.org/ -- The Perl Foundation

Profiling Perl

Update: Perl Profiling has evolved since this article was written, please see http://www.perl.org/about/whitepapers/perl-profiling.html for the latest information.

Everyone wants their Perl code to run faster. Unfortunately, without understanding why the code is taking so long to start with, it's impossible to know where to start optimizing it. This is where "profiling" comes in; it lets us know what our programs are doing.

We'll look at why and how to profile programs, and then what to do with the profiling information once we've got it.

Why Profile?

There's nothing worse than setting off a long-running Perl program and then not knowing what it's doing. I've recently been working on a new mail-archiving program for the perl.org mailing lists, and so I've had to import a load of old email into the database. Here's the code I used to do this:

    use File::Slurp;
    use Email::Store "dbi:SQLite:dbname=mailstore.db";
    use File::Find::Rule;

    for (File::Find::Rule->file->name(qr/^\d+$/)->in("perl6-language")) {
        Email::Store::Mail->store(scalar read_file($_));
    }

It's an innocent little program -- it looks for all the files in the perl6-language directory whose names are purely numeric (this is how messages are stored in an ezmlm archive), reads the contents of the files into memory with File::Slurp::read_file, and then uses Email::Store to put them into a database. You start it running, and come back a few hours later and it's done.

All through, though, you have this nervous suspicion that it's not doing the right thing; or at least, not doing it very quickly. Sure there's a lot of mail, but should it really be taking this long? What's it actually spending its time doing? We can add some print statements to help us feel more at ease:

    use File::Slurp;
    use Email::Store "dbi:SQLite:dbname=mailstore.db";
    use File::Find::Rule;

    print "Starting run...\n";
    $|++;
    for (File::Find::Rule->file->name(qr/^\d+$/)->in("perl6-language")) {
        print "Indexing $_...";
        Email::Store::Mail->store(scalar read_file($_));
        print " done\n";
    }

Now we can at least see more progress, but we still don't know if this program is working to full efficiency, and the reason for this is that there's an awful lot going on in the underlying modules that we can't immediately see. Is it the File::Find::Rule that's taking up all the time? Is it the storing process? Which part of the storing process? By profiling the code we'll identify, and hopefully smooth over, some of the bottlenecks.

Simple Profiling

The granddaddy of Perl profiling tools is Devel::DProf. To profile a code run, add the -d:DProf argument to your Perl command line and let it go:

    % perl -d:DProf store_archive

The run will now take slightly longer than normal as Perl collects and writes out information on your program's subroutine calls and exits, and at the end of your job, you'll find a file called tmon.out in the current directory; this contains all the profiling information.

A couple of notes about this:

  • It's important to control the length of the run; in this case, I'd probably ensure that the mail archive contained about ten or fifteen mails to store. (I used seven in this example.) If your run goes on too long, you will end up processing a vast amount of profiling data, and not only will it take a lot of time to read back in, it'll take far too long for you to wade through all the statistics. On the other hand, if the run's too short, the main body of the processing will be obscured by startup and other "fixed costs."
  • The other problem you might face is that Devel::DProf, being somewhat venerable, occasionally has problems keeping up with certain recent Perls (particularly the 5.6.x series) and may end up segfaulting all over the place. If this affects you, download the Devel::Profiler module from CPAN, which is a pure-Perl replacement for it.

The next step is to run dprofpp, the tool that post-processes the profiler's output. With no arguments, it reads the tmon.out file in the current directory:
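
    % dprofpp

This produces a table of where our time has been spent: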

  Total Elapsed Time = 13.89525 Seconds
    User+System Time = 9.765255 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c  Name
   24.1   2.355  4.822     38   0.0620 0.1269  File::Find::_find_dir
   20.5   2.011  2.467  17852   0.0001 0.0001  File::Find::Rule::__ANON__
   7.82   0.764  0.764    531   0.0014 0.0014  DBI::st::execute
   4.73   0.462  0.462  18166   0.0000 0.0000  File::Spec::Unix::splitdir
   2.92   0.285  0.769    109   0.0026 0.0071  base::import
   2.26   0.221  0.402    531   0.0004 0.0008  Class::DBI::transform_sql
   2.09   0.204  0.203   8742   0.0000 0.0000  Class::Data::Inheritable::__ANON__
   1.72   0.168  0.359  18017   0.0000 0.0000  Class::DBI::Column::name_lc
   1.57   0.153  0.153  18101   0.0000 0.0000  Class::Accessor::get
   1.42   0.139  0.139     76   0.0018 0.0018  Cwd::abs_path

The first two lines tell us how long the program ran for: around 14 seconds, but it was actually only running for about 10 of those -- the rest of the time other programs on the system were in the foreground.

Next we have a table of subroutines, in descending order of time spent; perhaps surprisingly, we find that File::Find and File::Find::Rule are the culprits for eating up 20% of running time each. We're also told the number of "exclusive seconds," which is the amount of time spent in one particular subroutine, and "cumulative seconds." This might better be called "inclusive seconds," since it's the amount of time the program spent in a particular subroutine and all the other routines called from it.

From the statistics above, we can guess that File::Find::_find_dir itself took up 2 seconds of time, but during its execution, it called an anonymous subroutine created by File::Find::Rule, and this subroutine also took up 2 seconds, making a cumulative time of 4 seconds. We also notice that we're making an awful lot of calls to File::Find::Rule, splitdir, and some Class::DBI and Class::Accessor routines.

What to Do Now

Now we have some profiling information, and we see a problem with File::Find::Rule. "Aha," we might think, "Let's replace our use of File::Find::Rule with a simple globbing operation, and we can shave 4 seconds off our runtime!". So, just for an experiment, we try it:

    use File::Slurp;
    use Email::Store "dbi:SQLite:dbname=mailstore.db";
    $|=1;
    for (<perl6-language/archive/0/*>) {
        next unless m{/\d+$};
        print "$_ ...";
        Email::Store::Mail->store(scalar read_file($_));
        print "\n";
    }

Now this looks a bit better:

 Total Elapsed Time = 9.559537 Seconds
   User+System Time = 5.329537 Seconds
 Exclusive Times
 %Time ExclSec CumulS #Calls sec/call Csec/c  Name
  13.1   0.703  0.703    531   0.0013 0.0013  DBI::st::execute
  5.54   0.295  0.726    109   0.0027 0.0067  base::import
  5.52   0.294  0.294  18101   0.0000 0.0000  Class::Accessor::get
  3.45   0.184  1.930  19443   0.0000 0.0001  Class::Accessor::__ANON__
  3.13   0.167  0.970    531   0.0003 0.0018  DBIx::ContextualFetch::st::_untain
                                              t_execute
  3.10   0.165  1.324   1364   0.0001 0.0010  Class::DBI::get
  2.98   0.159  0.376    531   0.0003 0.0007  Class::DBI::transform_sql
  2.61   0.139  0.139     74   0.0019 0.0019  Cwd::abs_path
  2.23   0.119  0.119   8742   0.0000 0.0000  Class::Data::Inheritable::__ANON__
  2.06   0.110  0.744   2841   0.0000 0.0003  Class::DBI::__ANON__
  1.95   0.104  0.159   2669   0.0000 0.0001  Class::DBI::ColumnGrouper::group_cols

Now to be honest, I would never have guessed that removing File::Find::Rule would shave 4 seconds off my code run. This is the first rule of profiling: You actually need to profile before optimizing, because you never know where the hotspots are going to turn out to be. We've also exercised the second rule of profiling: Review what you're using. By using another technique instead of File::Find::Rule, we've reduced our running time by a significant amount.

This time, it looks as though we're doing reasonably well -- the busiest thing is writing to a database, and that's basically what this application does, so that's fair enough. There are also a lot of busy calls relating to Class::DBI, and we know that we use Class::DBI as a deliberate tradeoff between convenience and efficiency. If we were ruthlessly determined to make this program faster, we'd start looking at using plain DBI instead of Class::DBI, but that's a tradeoff I don't think is worth making at the moment.

This is the third rule of profiling: Hotspots happen. If you got rid of all the hotspots in your code, it wouldn't do anything. There are a certain reasonable number of things that your program should be doing for it to be useful, and you simply can't get rid of them; additionally there are any number of tradeoffs that we deliberately or subconsciously make in order to make our lives easier at some potential speed cost -- for instance, writing in Perl or C instead of machine code.

From Exclusive to Inclusive

The default report produced by dprofpp is sorted by exclusive subroutine time, and is therefore good at telling us about individual subroutines that are called a lot and take up disproportionate amounts of time. This can be useful, but it doesn't actually give us an overall view of what our code is doing. If we want to do that, we need to move from looking at exclusive to looking at inclusive times, and we do this by adding the -I option to dprofpp. This produces something like this:

 Total Elapsed Time = 9.559537 Seconds
   User+System Time = 5.329537 Seconds
 Inclusive Times
 %Time ExclSec CumulS #Calls sec/call Csec/c  Name
  83.8   0.009  4.468      7   0.0013 0.6383  Email::Store::Mail::store
  80.8   0.061  4.308     35   0.0017 0.1231  Module::Pluggable::Ordered::__ANON
                                              __
  46.3       -  2.472      3        - 0.8239  main::BEGIN
  43.4       -  2.314      7        - 0.3306  Mail::Thread::thread
  43.4       -  2.314      7        - 0.3305  Email::Store::Thread::on_store
  36.2   0.184  1.930  19443   0.0000 0.0001  Class::Accessor::__ANON__
  28.9   0.006  1.543    531   0.0000 0.0029  Email::Store::Thread::Container::_
                                              _ANON__
  27.3   0.068  1.455    105   0.0006 0.0139  UNIVERSAL::require

This tells us a number of useful facts. First, we find that 84% of the program's runtime is spent in the Email::Store::Mail::store subroutine and its descendants, which is the main, tight loop of the program. This means, quite logically, that 16% is not spent in the main loop, and that's a good sign -- this means that we have a 1-second fixed cost in starting up and loading the appropriate modules, and this will amortize nicely against a longer run than 10 seconds. After all, if processing a massive amount of mail takes 20 minutes, the first 1-second startup becomes insignificant. It means we can pretty much ignore everything outside the main loop.

We also find that threading the emails is costly; threading involves a lot of manipulation of Email::Store::Thread::Container objects, which are database backed. This means that a lot of the database stores and executes that we saw in the previous, exclusive report are probably something to do with threading. After all, we now spend 2 seconds out of our 4 seconds of processing time on threading in Mail::Thread::thread, and even though we only call this seven times, we do 531 things with the container objects. This is bad.

Now, I happen to know (because I wrote the module) that Email::Store::Thread::Container uses a feature of Class::DBI called autoupdate. This means that while we do a lot of fetches and stores that we could conceivably do in memory and commit to the database once we're done, we instead hit the database every single time.

So, just as an experiment, we do two things to optimize Email::Store::Thread::Container. First, we know that we're going to be doing a lot of database fetches, sometimes of the same container multiple times, so we cache the fetch. We turn this:

    sub new { 
        my ($class, $id) = @_;
        $class->find_or_create({ message => $id });
    }

Into this:

    my %container_cache = ();
    sub new {
        my ($class, $id) = @_;
        $container_cache{$id} 
            ||= $class->find_or_create({ message => $id });
    }

This is a standard caching technique, and will produce another tradeoff: we trade memory (in filling up %container_cache with a bunch of objects) for speed (in not having to do as many costly database fetches).

Then we turn autoupdate off, and provide a way of updating the database manually. The reason we wanted to turn off autoupdate is that because all these containers form a tree structure (since they represent mails in a thread which, naturally, form a tree structure), it's a pain to traverse the tree and update all the containers once we're done.
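
Class::DBI lets you switch autoupdate off at the class (or object) level; here's a minimal sketch of the class-level switch (exactly where Email::Store makes this call is glossed over here):

    package Email::Store::Thread::Container;
    __PACKAGE__->autoupdate(0);  # column changes now stay in memory until update() is called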

However, with this cache in place, we know that we already have a way to get at all the containers in one go: we just look at the values of %container_cache, and there are all the objects we've used. So we can now add a flush method:

    sub flush {
        (delete $container_cache{$_})->update for keys %container_cache;
    }

This both empties the cache and updates the database. The only remaining problem is working out where to call flush. If we're dealing with absolutely thousands of emails, it might be worth calling flush after every store, or else %container_cache will get huge. However, since we're not, we just call flush in an END block to catch the container objects before they get destroyed by the garbage collector:

    END { Email::Store::Thread::Container->flush; }

Running dprofpp again:

 Total Elapsed Time = 7.741969 Seconds
   User+System Time = 3.911969 Seconds
 Inclusive Times
 %Time ExclSec CumulS #Calls sec/call Csec/c  Name
  65.4       -  2.559      7        - 0.3656  Email::Store::Mail::store
  62.9   0.014  2.461     35   0.0004 0.0703  Module::Pluggable::Ordered::__ANON
                                              __
  56.2   0.020  2.202      3   0.0065 0.7341  main::BEGIN
  31.8   0.028  1.247    105   0.0003 0.0119  UNIVERSAL::require
  29.4   0.004  1.150      7   0.0006 0.1642  Email::Store::Entity::on_store
  22.7   0.025  0.890    100   0.0003 0.0089  Class::DBI::create
  21.0   0.031  0.824    100   0.0003 0.0082  Class::DBI::_create
  18.3   0.235  0.716    109   0.0022 0.0066  base::import
  15.1       -  0.594    274        - 0.0022  DBIx::ContextualFetch::st::execute
  15.1       -  0.592      7        - 0.0846  Mail::Thread::thread
  15.1       -  0.592      7        - 0.0845  Email::Store::Thread::on_store

We find that we've managed to shave another second-and-a-half off, and we've also swapped a per-mail cost (of updating the threading containers every time) to a once-per-run fixed cost (of updating them all at the end of the run). This has taken the business of threading down from two-and-a-half seconds per run to half a second per run, and it means that 35% of our running time is outside the main loop; again, this will amortize nicely on large runs.

We started with a program that runs for 10 seconds, and now it runs for 4. Through judicious use of the profiler, we've identified the hotspots and eliminated the most troublesome ones. We've looked at both exclusive and inclusive views of the profiling data, but there are still a few other things that dprofpp can tell us. For instance, the -S option gives us a call tree, showing what gets called from what. These trees can be incredibly long and tedious, but if the two views we've already looked at haven't identified potential trouble spots, then wading through the tree might be your only option.
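
For reference, the call tree report comes from running dprofpp with the -S switch over the same tmon.out:

    % dprofpp -S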

Writing Your Own Profiler

At least, that is, if you want to use dprofpp; until yesterday, that was the only way of reading profiling data. Yesterday, however, I released Devel::DProfPP, which provides an event-driven interface to reading tmon.out files. I intended to use it to write a new version of dprofpp, because I find the current dprofpp intolerably slow; ironically, though, I haven't profiled it yet.

Anyway, Devel::DProfPP allows you to specify callbacks to be run every time the profiling data shows Perl entering or exiting a subroutine, and provides access to the same timing and call stack information used by dprofpp.

So, for instance, I like visualization of complicated data. I'd prefer to see what's calling what as a graph that I can print out and pore over, rather than as a listing. So, I pull together Devel::DProfPP and the trusty GraphViz module, and create my own profiler:

 use GraphViz;
 use Devel::DProfPP;
 
 my $graph = GraphViz->new();
 my %edges = ();
 Devel::DProfPP->new(enter => sub {
     my $pp = shift;
     my @stack = $pp->stack;
     my $to = $stack[-1]->sub_name;
     my $from = @stack > 1 ? $stack[-2]->sub_name : "MAIN BODY";
     $graph->add_edge($from => $to) unless $edges{$from." -> ".$to}++;
 })->parse;
 
 print $graph->as_png;

Every time we enter a subroutine, we look at the call stack so far. We pick the top frame of the stack and ask for its subroutine name. If there's another frame below it on the stack, we take the caller's name from that; otherwise we're being called from the main body of the code. Then we add an edge on our graph between the two subroutines, unless we've already got one. Finally, we print out the graph as a PNG file for me to print out and stick on the wall.

There are any number of other things you can do with Devel::DProfPP if the ordinary profiler doesn't suit your needs for some reason; but as we've seen, just judicious application of profiling and highlighting hotspots in your code can cut the running time of a long-running Perl program by 50% or so, and can also help you to understand what your code is spending all its time doing.

Perl's Special Variables

One of the best ways to make your Perl code look more like ... well, like Perl code -- and not like C or BASIC or whatever you used before you were introduced to Perl -- is to get to know the internal variables that Perl uses to control various aspects of your program's execution.

In this article we'll take a look at a number of variables that give you finer control over your file input and output.

Counting Lines

I decided to write this article because I am constantly amazed by the number of people who don't know about the existence of $.. I still see people producing code that looks like this:

  my $line_no = 0;

  while (<FILE>) {
    ++$line_no;
    unless (/some regex/) {
      warn "Error in line $line_no\n";
      next;
    }

    # process the record in some way
  }

For some reason, many people seem to completely miss the existence of $., which is Perl's internal variable that keeps track of your current record number. The code above can be rewritten as:

  while (<FILE>) {
    unless (/some regex/) {
      warn "Error in line $.\n";
      next;
    }

    # process the record in some way
  }

I know that it doesn't actually save you very much typing, but why create a new variable if you don't have to?

One other nice way to use $. is in conjunction with Perl's "flip-flop" operator (..). When used in list context, .. is the list construction operator. It builds a list of elements by calculating all of the items between given start and end values like this:

  my @numbers = (1 .. 1000);

But when you use this operator in a scalar context (like, for example, as the condition of an if statement), its behavior changes completely. The first operand (the left-hand expression) is evaluated to see if it is true or false. If it is false then the operator returns false and nothing happens. If it is true, however, the operator returns true and continues to return true on subsequent calls until the second operand (the right-hand expression) returns true.

An example will hopefully make this clearer. Suppose you have a file and you only want to process certain sections of it. The sections that you want to print are clearly marked with the string "!! START !!" at the start and "!! END !!" at the end. Using the flip-flop operator you can write code like this:

  while (<FILE>) {
    if (/!! START !!/ .. /!! END !!/) {
      # process line
    }
  }

Each time around the loop, the current line is checked by the flip-flop operator. If the line doesn't match /!! START !!/ then the operator returns false and the loop continues. When we reach the first line that matches /!! START !!/ then the flip-flop operator returns true and the code in the if block is executed. On subsequent iterations of the while loop, the flip-flop operator checks for matches against /!! END !!/, but it continues to return true until it finds one. This means that all of the lines between the "!! START !!" and "!! END !!" markers are processed, including the marker lines themselves. Once a line has matched /!! END !!/, the flip-flop operator goes back to returning false and starts checking against the first regex again.

So what does all this have to do with $.? Well, there's another piece of magic coded into the flip-flop operator. If either of its operands is a constant value then it is converted to an integer and matched against $.. So to print out just the first 10 lines of a file you can write code like this:

  while (<FILE>) {
    print if 1 .. 10;
  }

One final point on $.: there is only one $. variable. If you are reading from multiple filehandles, then $. contains the current record number from the most recently read filehandle. If you want anything more complex, then you can use something like IO::File objects for your filehandles. These objects each have an input_line_number method.
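
For example, here's a minimal sketch (the filename is invented) using a lexical IO::File handle and its per-handle counter:

  use IO::File;

  my $fh = IO::File->new("quotes.txt") or die "Can't open quotes.txt: $!";
  while (my $line = $fh->getline) {
    # process $line
  }
  print $fh->input_line_number, " lines read\n";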

The Record Separators

Next, we'll look at $/ and $\, which are the input and output record separators respectively. They control what defines a "record" when you are reading or writing data.

Let me explain that in a bit more detail. Remember when you were first learning Perl and you were introduced to the file input operator? Almost certainly you were told that <FILE> read data from the file up to and including the next newline character. Well, that's not true. Well, it is, but it's only a specialized case. Actually, it reads data up to and including the next occurrence of whatever is currently in $/ - the input record separator. Let's look at an example.

Imagine you have a text file which contains amusing quotes. Or lyrics from songs. Or whatever it is that you like to put in your randomly generated signature. The file might look something like this.

    This is the definition of my life
  %%
    We are far too young and clever
  %%
    Stab a sorry heart
    With your favorite finger

Here we have three quotes separated by a line containing just the string %%. How would you go about reading in that file a quote at a time?

One solution would be to read the file a line at a time, checking to see if the new line is just the string %%. You'd need to keep a variable that contains the current quote that you are building up and process a completed quote when you find the termination string. Oh, and you'd need to remember to process the last quote in the file as that doesn't have a termination string (although, it might!)

A simpler solution would be to change Perl's idea of what constitutes a record. We do that by changing the value of $/. The default value is a newline character - which is why <...> usually reads in a line at a time. But we can set it to any value we like. We can do something like this:

  $/ = "%%\n";

  while (<QUOTE>) {
    chomp;
    print;
  }

Now each time we call the file input operator, Perl reads data from the filehandle until it finds %%\n (or the end of file marker). A newline is no longer seen as a special character. Notice, however, that the file input operator always returns the next record with the input record separator still attached. When $/ has its default value of a newline character, you know that you can remove the newline character by calling chomp. Well, it works exactly the same way when $/ has other values. It turns out that chomp doesn't just remove a newline character (that's another "simplification" that you find in beginners' books); it actually removes whatever is the current value of $/. So in our sample code above, the call to chomp is removing the whole string %%\n.

Changing Perl's Special Variables

Before we go on, I just need to alert you to one possible repercussion of changing these variables whenever you want. The problem is that most of these variables are forced into the main package. This means that when you change one of these variables, you are altering the value everywhere in your program. This includes any modules that you use in your program. The reverse is also true. If you're writing a module that other people will use in their programs and you change the value of $/ inside it, then you have changed the value for all of the remaining program execution. I hope you can see why changing variables like $/ in one part of your program can potentially lead to hard-to-find bugs in another part.

So we need to do what we can to avoid this. Your first approach might be to reset the value of $/ after you have finished with it. So you'd write code like this.

  $/ = "%%\n";

  while (<QUOTE>) {
    chomp;
    print;
  }

  $/ = "\n";

The problem with this is you can't be sure that $/ contained \n before you started fiddling with it. Someone else might have changed it before your code was reached. So the next attempt might look like this.

  $old_input_rec_sep = $/;
  $/ = "%%\n";

  while (<QUOTE>) {
    chomp;
    print;
  }

  $/ = $old_input_rec_sep;

This code works and doesn't have the bug that we're trying to avoid, but there's another way that looks cleaner. Remember the local function that you used to declare local variables until someone told you that you should use my instead? Well, this is one of the few places where you can use local to great effect.

It's generally acknowledged that local is badly named. The name doesn't describe what the function does. In Perl 6 the function is likely to be renamed to temp as that's a far better description of what it does - it creates a temporary variable with the same name as an existing variable and restores the original variable when the program leaves the innermost enclosing block. This means that we can write our code like this.

  {
    local $/ = "%%\n";

    while (<QUOTE>) {
      chomp;
      print;
    }
  }

We've enclosed all of the code in another pair of braces to create a naked block. Code blocks are usually associated with loops, conditionals or subroutines, but in Perl they don't need to be. You can introduce a new block whenever you want. Here, we've introduced a block purely to delimit the area where we want $/ to have a new value. We then use local to store the old $/ variable somewhere where it can't be disturbed and set our new version of the variable to %%\n. We can then do whatever we want in the code block and when we exit from the block, Perl automatically restores the original copy of $/ and we never needed to know what it was set to.

For all of these reasons, it's good practice never to change one of Perl's internal variables unless the change is localized in a block.

Other Values For $/

There are a few special values that you can give $/ which turn on interesting behaviours. The first of these is setting it to undef. This turns on "slurp mode" and the next time you read from a filehandle you will get all of the remaining data right up to the end of file marker. This means that you can read a whole file in using code like this.

  my $file = do { local $/; <FILE> };

A do block returns the value of the last expression evaluated within it, which in this case is the file input operator. And as $/ has been set to undef it returns the whole file. Notice that we don't even need to explicitly set $/ to undef as all Perl variables are initialized to undef when they are created.

There is a big difference between setting $/ to undef and setting it to an empty string. Setting it to an empty string turns on "paragraph" mode. In this mode each record is a paragraph of text terminated by one or more empty lines. You might think that this effect can be mimicked by setting $/ to \n\n, but the subtle difference is that paragraph mode acts as though $/ had been set to \n\n+ (although you can't actually set $/ equal to a regular expression).
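
Here's a short sketch of paragraph mode in action (FILE is assumed to be an open filehandle, as in the earlier examples):

  {
    local $/ = "";    # paragraph mode

    while (my $para = <FILE>) {
      chomp $para;    # in paragraph mode, chomp removes all trailing newlines
      # $para now holds one paragraph of text
    }
  }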

The final special value is to set $/ to either a reference to a scalar variable that holds an integer, or to a reference to an integer constant. In these cases the next read from a filehandle will read up to that number of bytes (I say "up to" because at the end of the file there might not be enough data left to give you). So to read a file 2KB at a time you can do this:

  {
    local $/ = \2048;

    while (<FILE>) {
      # $_ contains the next 2048 bytes from FILE
    }
  }

$/ and $.

Note that changing $/ alters Perl's definition of a record and therefore it alters the behavior of $.. $. doesn't actually contain the current line number, it contains the current record number. So in our quotes example above, $. will be incremented for each quote that you read from the filehandle.
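
To see this with the quote file from earlier:

  {
    local $/ = "%%\n";

    while (<QUOTE>) {
      print "Read quote number $.\n";    # $. counts records, i.e. quotes
    }
  }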

What About $\?

Many paragraphs back I mentioned both $/ and $\ as being the input and output record separators. But since then I've just gone on about $/. What happened to $\?

Well, to be honest, $\ isn't anywhere near as useful as $/. It contains a string that is printed at the end of every call to print. It's unset by default, so nothing gets added to data that you display with print. But if, for example, you longed for the days of Pascal, you could write a println function like this.

  sub println {
    local $\ = "\n";
    print @_;
  }

Then every time you called println, all of the arguments would be printed followed by a newline.

Other Print Variables

The next two variables that I want to discuss are very easily confused although they do completely different things. To illustrate them, consider the following code.

  my @arr = (1, 2, 3);

  print @arr;
  print "@arr";

Now, without looking it up do you know what the difference is between the output from the two calls to print?

The answer is that the first one prints the three elements of the array with nothing separating them (like this - 123) whereas the second one prints the elements separated by spaces (like this - 1 2 3). Why is there a difference?

The key to understanding it is to look at exactly what is being passed to print in each case. In the first case print is passed an array. Perl unrolls that array into a list and print actually sees the three elements of the array as separate arguments. In the second case, the array is interpolated into a double quoted string before print sees it. That interpolation has nothing at all to do with the call to print. Exactly the same process would take place if, for example, we did something like this.

  my $string = "@arr";
  print $string;

So in the second case, the print function only sees one argument. The fact that it is the result of interpolating an array in double quotes has no effect on how print treats the string.

We therefore have two cases. When print receives a number of arguments it prints them out with no spaces between them. And when an array is interpolated in double quotes it is expanded with spaces between the individual elements. These two cases are completely unrelated, but from our first example above it's easy to see how people can get them confused.

Of course, Perl allows us to change these behaviors if we want to. The string that is printed between the arguments passed to print is stored in a variable called $, (because you use a comma to separate arguments). As we've seen, the default value for that is an empty string but it can, of course, be changed.

  my @arr = (1, 2, 3);
  {
    local $, = ',';

    print @arr;
  }

This code prints the string 1,2,3.

The string that separates the elements of an array when expanded in a double quoted string is stored in $". Once again, it's simple to change it to a different value.

  my @arr = (1, 2, 3);
  {
    local $" = '+';

    print "@arr";
  }

This code prints 1+2+3.

Of course, $" doesn't necessarily have to used in conjunction with a print statement. You can use it anywhere that you have an array in a doubled quoted string. And it doesn't just work for arrays. Array and hash slices work just as well.

  my %hash = (one => 1, two => 2, three => 3);

  {
    local $" = ' < ';

    print "@hash{qw(one two three)}";
  }

This displays 1 < 2 < 3.

Conclusion

In this article we've just scratched the surface of what you can do by changing the values in Perl's internal variables. If this makes you want to look at this subject in more detail, then you should read the perlvar manual page.

The Evolution of Perl Email Handling

I spend the vast majority of my time at a computer working with email, whether it's working through the ones I send and receive each day, or working on my interest in analyzing, indexing, organizing, and mining email content. Naturally, Perl helps out with this.

There are many modules on the CPAN for slicing and dicing email, and we're going to take a whistlestop tour of the major ones. We'll also concentrate on an effort started by myself, Richard Clamp, Simon Wistow, and others, called the Perl Email Project, to produce simple, efficient and accurate mail handling modules.

Message Handling

We'll begin with those modules that represent an individual message, giving you access to the headers and body, and usually allowing you to modify these.

The granddaddy of these modules is Mail::Internet, originally created by Graham Barr and now maintained by Mark Overmeer. This module offers a constructor that takes either an array of lines or a filehandle, reads a message, and returns a Mail::Internet object representing the message. Throughout these examples, we'll use the variable $rfc2822 to represent a mail message as a string.

    my $obj = Mail::Internet->new( [ split /\n/, $rfc2822 ] );

Mail::Internet splits a message into a header object in the Mail::Header class, plus a body. You can get and set individual headers through this object:

    my $subject = $obj->head->get("Subject");
    $obj->head->replace("Subject", "New subject");

Reading and editing the body is done through the body method:

    my $old_body = $obj->body;
    $obj->body("Wasn't worth reading anyway.");

I've not said anything about MIME yet. Mail::Internet is reasonably handy for simple tasks, but it doesn't handle MIME at all. Thankfully, MIME::Entity is a MIME-aware subclass of Mail::Internet; it allows you to read individual parts of a MIME message:

    my $num_parts = $obj->parts;
    for (0 .. $num_parts - 1) {
        my $part = $obj->parts($_);
        ...
    }

If Mail::Internet and MIME::Entity don't cut it for you, you can try Mark Overmeer's own Mail::Message, part of the impressive Mail::Box suite. Mail::Message is extremely featureful and comprehensive, but that is not always meant as a compliment.

Mail::Message objects are usually constructed by Mail::Box as part of reading in an email folder, but can also be generated from an email using the read method:

    $obj = Mail::Message->read($rfc2822);

Like Mail::Internet, messages are split into headers and bodies; unlike Mail::Internet, the body of a Mail::Message object is also an object. We read headers like so:

    $obj->head->get("Subject");

Or, for Subject and other common headers:

    $obj->subject;

I couldn't find a way to set headers directly, and ended up doing this:

    $obj->head->delete($header);
    $obj->head->add($header, $_) for @data;

Reading the body as a string is only marginally more difficult:

    $obj->decoded->string

Setting the body, meanwhile, is an absolute nightmare -- we have to create a new Mail::Message::Body object and replace our current one with it.

    $obj->body(Mail::Message::Body->new(data => [split /\n/, $body]));

Mail::Message may be slow, but it's certainly hard to use. It's also rather complex; the operations we've looked at so far involved the use of 16 classes (Mail::Address, Mail::Box::Parser, Mail::Box::Parser::Perl, Mail::Message, Mail::Message::Body, Mail::Message::Body::File, Mail::Message::Body::Lines, Mail::Message::Body::Multipart, Mail::Message::Body::Nested, Mail::Message::Construct, Mail::Message::Field, Mail::Message::Field::Fast, Mail::Message::Head, Mail::Message::Head::Complete, Mail::Message::Part, and Mail::Reporter) and 4400 lines of code. It does have a lot of features, though.

Foolishly, I thought that email parsing shouldn't be so complex, and so I sat down to write the simplest possible functional mail handling library. The result is Email::Simple, and its interface looks like this:

    my $obj = Email::Simple->new($rfc2822);
    my $subject = $obj->header("Subject");
    $obj->header_set("Subject", "A new subject");
    my $old_body = $obj->body;
    $obj->body_set("A new body\n");
    print $obj->as_string;

It doesn't do a lot, but it does it simply and efficiently. If you need MIME handling, there's a subclass called Email::MIME, which adds the parts method.
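
As a minimal sketch of the MIME side (assuming $rfc2822 holds a multipart message, as before):

    use Email::MIME;

    my $mime = Email::MIME->new($rfc2822);
    for my $part ($mime->parts) {
        print $part->content_type, "\n";    # e.g. text/plain, text/html
    }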

Realistically, the choice of which mail handling library to use ought to be up to you, the end user, but this isn't always true. Auxiliary modules, which mess about with email at a higher level, can ask for the mail to be presented in a particular representation. For instance, until recently, the wonderful Mail::ListDetector module, which we'll examine later, required mails passed in to it to be Mail::Internet objects, since this gave it a known API to work with the objects. I don't want to work with Mail::Internet objects, but I want to use Mail::ListDetector's functionality. What can I do?

In order to enable the user to have the choice again, I wrote an abstraction layer across all of the above modules, called Email::Abstract. Given any of the above objects, we can say:

     my $subject = Email::Abstract->get_header($obj, "Subject");
     Email::Abstract->set_header($obj, "Subject", "My new subject");
     my $body = Email::Abstract->get_body($obj);
     Email::Abstract->set_body($obj, "Hello\nTest message\n");
     $rfc2822 = Email::Abstract->as_string($obj);

Email::Abstract knows how to perform these operations on the major types of mail representation objects. It also abstracts out the process of constructing a message, and allows you to change the interface of a message using the cast class method:

    my $obj = Email::Abstract->cast($rfc2822, "Mail::Internet");
    my $mm = Email::Abstract->cast($obj, "Mail::Message");

This allows module authors to write their mail handling libraries in an interface-agnostic way, and I'm grateful to Michael Stevens for taking up Email::Abstract in Mail::ListDetector so quickly. Now I can pass in Email::Simple objects to Mail::ListDetector and it will work fine.

Email::Abstract also gives us the opportunity to create some benchmarks for all of the above modules. Here was the benchmarking code I used:

    use Email::Abstract;
    my $message = do { local $/; <DATA>; };
    my @classes =
        qw(Email::MIME Email::Simple MIME::Entity Mail::Internet Mail::Message);

    eval "require $_" or die $@ for @classes;

    use Benchmark;
    my %h;
    for my $class (@classes) {
        $h{$class} = sub {
            my $obj = Email::Abstract->cast($message, $class);
            Email::Abstract->get_header($obj, "Subject");
            Email::Abstract->get_body($obj);
            Email::Abstract->set_header($obj, "Subject", "New Subject");
            Email::Abstract->set_body($obj, "A completely new body");
            Email::Abstract->as_string($obj);
        }
    }
    timethese(1000, \%h);

    __DATA__
    ...

I put a short email in the DATA section and ran the same simple operations a thousand times: construct a message, read a header, read the body, set the header, set the body, and return the message as a string.

    Benchmark: timing 1000 iterations of Email::MIME, Email::Simple, 
    MIME::Entity, Mail::Internet, Mail::Message...
    Email::MIME: 10 wallclock secs ( 7.97 usr +  0.24 sys =  8.21 CPU) 
        @ 121.80/s (n=1000)
    Email::Simple:  9 wallclock secs ( 7.49 usr +  0.05 sys =  7.54 CPU) 
        @ 132.63/s (n=1000)
    MIME::Entity: 33 wallclock secs (23.76 usr +  0.35 sys = 24.11 CPU) 
        @ 41.48/s (n=1000)
    Mail::Internet: 24 wallclock secs (17.34 usr +  0.30 sys = 17.64 CPU) 
        @ 56.69/s (n=1000)
    Mail::Message: 20 wallclock secs (17.12 usr +  0.27 sys = 17.39 CPU) 
        @ 57.50/s (n=1000)

The Perl Email Project was a success: Email::MIME and Email::Simple were twice as fast as their nearest competitors. However, it should be stressed that they're both very low level; if you're doing anything more complex than the operations we've seen, you might consider one of the older Mail:: modules.

Mailbox Handling

So much for individual messages; let's move on to handling groups of messages, or folders. We've mentioned Mail::Box already, and this is truly the king of folder handling, supporting local and remote folders, editing folders, and all sorts of other things besides. To use it, we first need a Mail::Box::Manager, which is a factory object for creating Mail::Boxes.

    use Mail::Box::Manager;
    my $mgr = Mail::Box::Manager->new;

Next, we need to open the folder using the manager:

    my $folder = $mgr->open(folder => $folder_file);

And now we can get at the individual messages as Mail::Message objects:

    for ($folder->messages) {
        print $_->subject,"\n";
    }

At the more minimalist end of the scale, my favorite mailbox handler until recently was Mail::Util's read_mbox function, which takes the name of a Unix mbox file and returns a list of array references; each reference is the array of lines of a message, suitable for feeding to Mail::Internet->new or similar:

    use Mail::Util qw(read_mbox);

    for (read_mbox($folder_file)) {
        my $obj = Mail::Internet->new($_);
        print $obj->head->get("Subject"), "\n";
    }

These two are both really handy, but there seemed to be room for something in between the simplicity of Mail::Util and the functionality of Mail::Box, and so the Email Project struck again with Email::Folder and Email::LocalDelivery. Email::Folder handles mbox and maildir folders, with more types planned, and has a reasonably simple interface:

    my $folder = Email::Folder->new($folder_file);
    for ($folder->messages) {
        print $_->header("Subject"),"\n";
    }

By default it returns Email::Simple objects for the messages, but this can be changed by subclassing. For instance, if we want raw RFC2822 strings, we can do this:

    package Email::Folder::Raw;
    use base 'Email::Folder';
    sub bless_message { my ($self, $rfc2822) = @_; return $rfc2822; }

Perhaps in the future, we will change bless_message to use Email::Abstract->cast to make the representation of messages easier to select without necessarily having to subclass.

The other side of folder handling is writing to a folder, or "local delivery". Email::LocalDelivery was written to assist Email::Filter, of which more later. The problem is harder than it sounds, as it has to deal with locking, escaping mail bodies, and specific problems due to mailbox and maildir formats. LocalDelivery hides all of these things beneath a simple interface:

    Email::LocalDelivery->deliver($rfc2822, @mailboxes);

Both Email::LocalDelivery and Email::Folder use the Email::FolderType helper module to determine the type of a folder based on its filename.
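
Email::FolderType is also usable on its own; a small sketch (the paths are invented):

    use Email::FolderType qw(folder_type);

    print folder_type("/home/simon/mail/p5p"), "\n";    # Mbox
    print folder_type("/home/simon/Maildir/"), "\n";    # Maildir -- a trailing slash marks a maildir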

Address Handling

To come down to a lower level of abstraction again, there are a number of modules for handling email addresses. The old favorite is Mail::Address. A mail address appearing in the fields of an email can be made up of several elements: the actual address, a phrase or name, and a comment. For instance:

    Example user <example@example.com> (Not a real user)

Mail::Address parses these addresses, separating out the phrase and comments, allowing you to get at the individual components:

    for (Mail::Address->parse($from_line)) {
        print $_->name, "\t", $_->address, "\n";
    }

Unfortunately, like many of the mail modules, it tries really hard to be helpful.

    my ($addr) = Mail::Address->parse('"eBay, Inc." <support@ebay.com>');
    print $addr->name;   # prints "Inc. eBay"

Which, while better than the "Inc Ebay" that previous versions would produce, isn't really acceptable. Casey West joined our merry band of renegades and produced Email::Address. It has exactly the same interface as Mail::Address, but it works, and is about twice to three times as fast.
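
Rerunning the problem case through Email::Address shows the difference; the output comment here is what we'd expect given the discussion above:

    use Email::Address;

    my ($addr) = Email::Address->parse('"eBay, Inc." <support@ebay.com>');
    print $addr->name;   # eBay, Inc.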

One thing we often want to do when handling mail addresses is to make sure that they're valid. If, for instance, a user is registering for content at a web site, we need to check that the address they've given is capable of receiving mail. Email::Valid, the original inhabitant of the Email:: namespace before our bunch of disaffected squatters moved in, does just this. In its most simple use, we can say:

    if (not Email::Valid->address('test@example.com')) {
        die "Not a valid address"
    }

You can turn on additional checks, such as ensuring there's a valid MX record for the domain, correcting common AOL and CompuServe addressing mistakes, and so on:

    if (not Email::Valid->address(-address => 'test@example.com',
                                  -mxcheck => 1)) {
        die "Not a valid address"
    }

Mail Munging

Once we have our emails, what are we going to do with them? A lot of what I've been looking at has been textual analysis of email, and there are three modules that particularly help with this.

The first is Text::Quoted; it takes the body text of an email message, or any other text really, and tries to figure out which parts of the message are quotations from other messages. It then separates these out into a nested data structure. For instance, if we have

    $message = <<EOF;
    > foo
    > # Bar
    > baz

    quux
    EOF

Then running extract($message) will return a data structure like this:

    [
      [
        { text => 'foo', quoter => '>', raw => '> foo' },
        [ 
            { text => 'Bar', quoter => '> #', raw => '> # Bar' } 
        ],
        { text => 'baz', quoter => '>', raw => '> baz' }
      ],

      { empty => 1 },
      { text => 'quux', quoter => '', raw => 'quux' }
    ];
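
To give a feel for working with this structure, here's a small walker of my own devising (not from Text::Quoted's documentation) that prints each piece of text indented by its quoting depth:

    use Text::Quoted;

    # recurse through extract()'s nested arrays-of-hashes,
    # indenting each level of quotation a little further
    sub show {
        my ($node, $depth) = @_;
        if (ref $node eq 'ARRAY') {
            show($_, $depth + 1) for @$node;
        }
        elsif (defined $node->{text}) {
            print '  ' x $depth, $node->{text}, "\n";
        }
    }

    show(extract($message), -1);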

Such a structure is extremely useful for highlighting different levels of quoting in different colors when displaying a message. A similar concept is Text::Original, which looks for the start of original, non-quoted content in an email. It knows about many kinds of attribution lines, so with:

    $message = <<EOF;
    You wrote:
    > Why are there so many different mail modules?

    There's more than one way to do it! Different modules have different
    focuses, and operate at different levels; some lower, some higher.
    EOF

first_sentence($message) would return "There's more than one way to do it!". The Mariachi mailing list archiver uses this technique to give a "prompt" for each message in a thread.

And speaking of threads, the Mail::Thread module is a Perl implementation of Jamie Zawinski's mail threading algorithm, as used by Mozilla as well as many other mail clients since then. It's also used by Mariachi, and has recently been updated to use Email::Abstract to handle any kind of mail object you want to throw at it:

    my $threader = Mail::Thread->new(@mails);
    $threader->thread; # Compute threads
    for ($threader->rootset) { # Original mails in a thread
        dump_thread($_);
    }
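
The dump_thread routine isn't defined above; here's one possible version (my own sketch, assuming Mail::Thread's container interface of message, child, and next, and Email::Simple-style message objects):

    # walk a thread container, printing an indented tree of subjects;
    # containers without a message are the algorithm's dummy nodes
    sub dump_thread {
        my ($container, $depth) = @_;
        $depth ||= 0;
        my $message = $container->message;
        print '  ' x $depth,
              ($message ? $message->header('Subject') : '[dummy]'), "\n";
        dump_thread($container->child, $depth + 1) if $container->child;
        dump_thread($container->next,  $depth)     if $container->next;
    }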

Mail Filtering

The classic Perl mail filtering tool is Mail::Audit, and I've written articles here about using Mail::Audit on its own (http://www.perl.com/pub/a/2001/07/17/mailfiltering.html) and using it in conjunction with Mail::SpamAssassin (http://www.perl.com/pub/a/2002/03/06/spam.html).

We've mentioned Mail::ListDetector a couple of times already, and I use this with Mail::Audit to do most of the filtering automatically for me. The Mail::Audit::List plugin uses ListDetector to look for mailing list headers in a message; these are things like List-Id, X-Mailman-Version, and the like, which identify a mail as having come through a mailing list. This means I can filter out all mailing list posts to their own folders, like so:

    my $list = Mail::ListDetector->new($item);
    if ($list) {
        my $name = $list->listname;
        $item->accept("mail/$name-$date");
    }

However, Mail::Audit itself is getting a little long in the tooth, and so new installations are encouraged to use the Email Project's Email::Filter instead; it has the same interface for the most part, although not all of the same features, and it uses the new-fangled Email::Simple mail representation for speed and cleanliness.
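
A filter rule file in Email::Filter looks much like a Mail::Audit one; this sketch follows the shape of Email::Filter's synopsis (the folder names and patterns here are invented):

    use Email::Filter;

    # mail arrives on STDIN, e.g. via procmail or a .forward file;
    # anything that goes wrong lands in the emergency mbox
    my $mail = Email::Filter->new(emergency => "~/emergency_mbox");

    $mail->accept("lists/perl") if $mail->from    =~ /perl\.org/;
    $mail->reject("No spam")    if $mail->subject =~ /viagra/i;
    $mail->accept;  # everything else goes to the default mailbox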

Mail Mining

Finally, the most high-level thing I do with email is develop frameworks to automatically categorize, organize, and index mail into a database, and attempt to analyze it for interesting nuggets of information.

My first module to do this with was Mail::Miner, which consists of three major parts. The first part takes an email, removes any attachments, and stores the lot in a database. The second looks over the email and runs a set of "Recogniser" modules on it; these find addresses, phone numbers, keywords and phrases, and so on, and store them in a separate database table. The third part is a command-line tool to query the database for mail and information.

For instance, if I need to find Tim O'Reilly's postal address, I ask the query tool, mm, to find addresses in emails from him:

 % mm --from "Tim O" --address              
 Address found in message 1835 from "Tim O'Reilly" <tim@oreilly.com>:
 Tim O'Reilly @ O'Reilly & Associates, Inc.
 1005 Gravenstein Highway North, Sebastopol, CA 95472

To retrieve the whole email, I'd say

 % mm --id 1835

And if it originally contained an attachment, we'd see something like this as part of the email:

 [ text/xml attachment something.xml detached - use
   mm --detach 208
   to recover ]

I paste that middle line mm --detach 208 into a shell, and hey presto, something.xml is written to disk.

Now Mail::Miner is all very well, but having the three ideas in one tight package--filing mail, mining mail, and interfacing to the database--makes it difficult to develop and extend any one of them. And of course, it uses the old-school Mail:: modules.

This brings us to our final module on the mail modules tour, and the most recently released: Email::Store. This is a framework, based on Class::DBI, for storing email in a database and indexing it in various ways:

   use Email::Store 'dbi:SQLite:mail.db';
   Email::Store->setup;
   Email::Store::Mail->store($rfc2822);

And then later...

   my ($name) = Email::Store::Name->search( name => "Simon Cozens" );
   my @mails_from_simon = $name->addressings( role => "From" )->mails;

It can be used to build a mailing list archive tool such as Mariachi, or a data mining setup like Mail::Miner. It's still very much in development, and makes use of a new idea in module extensibility.

I'll be bringing more information when we've written the first mail archiving and searching tool using Email::Store, which I'm going to be doing as a new interface to the Perl mailing lists at perl.org.

Conclusion

We've looked at the major modules for mail handling on CPAN, and there are many more. I am obviously biased towards those which I wrote, and particularly the Perl Email Project modules in the Email::* namespace. These modules are specifically designed to be simple, efficient, and correct, but may not always be a good substitute for the more thorough Mail::* modules, particularly Mail::Box. However, I hope you're now a little more aware of the diversity of mail handling tools out there, and know where to look next time you need to manipulate email with Perl.

This Week on Perl 6, Fortnight Ending 2004-06-06

Whee! There's a new graphics card in the G5, which means I can type this at the big screen again, which makes me happy. Well, it would make me far happier if the new card didn't leave horrible artifacts all over the screen like some kind of incontinent puppy attempting to fulfil OpenGL draw instructions. Maybe next week will see a third card in the box.

Dang! It looks like the G5 will be off receiving some TLC from an Apple service centre while I'm on holiday next week. Which means that the 'weekly' summaries will continue on their fortnightly summer holiday schedule for at least one more summary. But then the lists themselves appear to be on summer footing anyway.

Ahem.

As you will probably have worked out by now, we start with perl6-internals.

Library loading

Jens Rieks is working on library loading code that does all the nice things we've come to expect from other languages. The plan being that you'll be able to write (say)

    use('some_library')

and Parrot will go off and search its include paths for some_library.(pbc|imc|pasm|whatever) and load it. As he noted, if you're going to implement that kind of code in Parrot assembler (or PIR, or whatever), you need some way of loading the loading code. It's also a good idea to have a working stat. Jens added a Parrot_load_bytecode_direct function to the Parrot core to support the first part. His pleas for a functional (if not complete) stat were answered by Dan, who set about implementing the stat API he outlined a few weeks ago.

http://groups.google.com/groups?selm=200405211333.22233.parrot@jensbeimsurfen.de

Embedding Parrot

Leo Tötsch and chromatic answered Paul Querna's questions from last week about embedding Parrot.

http://groups.google.com/groups?selm=200405241043.i4OAh7a01618@thu8.leo.home

Using PMCs as PerlHash keys

TOGoS wanted to know how he could use a PMC as a key in a PerlHash. Leo replied that it was as simple as doing

    $P2 = new Key
    $P2 = "key_string"
    $P0 = $P1[$P2]

Piers Cawley did some naive translation into PASM and got himself horribly confused. Leo and TOGoS corrected him.

http://groups.google.com/groups?selm=20040520190352.46387.qmail@web41410.mail.yahoo.com

First draft of IO and event design

Remarking that events and IO are "(kinda, sorta)" the same thing, Dan posted his first draft of a unified IO and events design and asked for comments. This being p6i, he got several (though not as many as usual, maybe everyone likes it).

http://groups.google.com/groups?selm=a0611040abcd7faf9492e@[10.0.1.3]

Freeze, objects, crash, boom

Will Coleda tried to get freezing and objects to play well together and failed. So he asked a bunch of questions. Leo didn't solve the problem, but he did have some pointers to where it was coming from.

http://groups.google.com/groups?selm=40B4D76B.5060306@coleda.com

MMD table setup semantics

Possibly winning an award for the oldest rejuvenated thread, Nicholas Clark had some questions about a post Dan made about MMD at the end of April. He made a suggestion about how to calculate 'distance' for multi dispatch. Dan pointed out that Larry had decreed that the 'distance' would be the 'Manhattan Distance'. (Google has several definitions.)

http://groups.google.com/groups?selm=20040527143518.GD1129@plum.flirble.org

Compiler FAQ entry

Will Coleda posted a possible set of entries for the compiler writers' FAQ. Leo had a few quibbles. Sterling Hughes suggested that having small, runnable source code examples would be really helpful.

http://groups.google.com/groups?selm=40B7ECCF.6030205@coleda.com

Layering PMCs

Dan kicked off a discussion on how to go about layering PMCs. The usual suspects offered suggestions. The aim is to be able to layer behaviours on top of PMCs without massive overhead or combinatorial explosion problems. As usual with these things, there are several possible ways of doing it; the debate is about choosing the best one.

http://groups.google.com/groups?selm=a06110404bcde7bf18ebd@[172.24.18.98]

IO Layers

Leo had some questions about the (not fully implemented) ParrotIOLayerAPI. He laid out a proposal for implementing things. Uri Guttman and Dan joined in a discussion of the issues. (Simple summary of the issues: Asynchronous IO is hard. Possibly more accurate summary of the issues: Asynchronous IO is *not* synchronous)

http://groups.google.com/groups?selm=40B71F53.8000001@toetsch.at

PIO_unix_pipe()

Leo's implemented a PIO_unix_pipe() method which allows you to run an external program and capture the results with a Parrot IO handle. He doctored the open opcode to use it:

    pipe = open "/bin/ls -l", "-|" 

Dan liked it, but proposed also adding a backtick operator.

http://groups.google.com/groups?selm=40C1C64D.5080105@toetsch.at

Register spilling

Dan noted that it's possible to get the register allocator caught up in an infinite loop (or as near infinite as makes no difference) as it tries to work out a register spilling strategy. He proposed there be a 'slow but working' fallback method to use if the normal method goes through too many iterations. Leo suggested a delightfully brute-force approach with some possible elaborations that Dan didn't think would be that useful in the cases he was seeing.

http://groups.google.com/groups?selm=rt-3.0.9-29837-88076.5.39698439413385@perl.org

Stat

Dan implemented a simple stat function that should be enough for Jens Rieks to get a library path based loading system to work. Leo patched his first attempt and most things in the garden were lovely (with a couple of rather less pretty spots that are being worked on as I type).

http://groups.google.com/groups?selm=a06110401bcd836c94dd7@[172.24.18.98]

Bit ops on strings

Argh! The character encoding discussion! Run away!

Nicholas Clark and Dan discussed Parrot's Unicode handling.

http://groups.google.com/groups?selm=20040525113008.GC1129@plum.flirble.org

Standard Library behaviour

Dan commented that, now that Jens was working on the architecture of the standard library, the time had come to discuss how things should work. He outlined the options, asked for comments, and was promptly Warnocked.

http://groups.google.com/groups?selm=a06110407bcd924533939@[10.0.1.3]

Meanwhile, in perl6-language

The Periodic Table of the Operators

Mark Lentczner announced that he'd put together a periodic table of Perl 6 operators. Everyone liked it, though several amendments were proposed and made. Check it out, it's very lovely (and anyone who references the great Edward Tufte in his explanation of the design is likely to be all right in my book).

There will be T-shirts.

http://groups.google.com/groups?selm=486B658A-AF64-11D8-8F60-000393A56BB6@glyphic.com

http://www.ozonehouse.com/mark/blog/code/PeriodicTable.html

Announcements, Apologies, Acknowledgements

Ah well, if you can't beat the system, roll with it and pretend it's deliberate. Next week's summary definitely won't be next week, but there will be another fortnightly summary covering the fortnight ending 2004-06-20. And the next summary will almost certainly be on a fortnightly schedule too, unless BT work some miracle and sort out my ADSL transfer promptly.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send me feedback at mailto:pdcawley@bofh.org.uk

http://donate.perl-foundation.org/ -- The Perl Foundation

http://dev.perl.org/perl6/ -- Perl 6 Development site

Web Testing with HTTP::Recorder

HTTP::Recorder is a browser-independent recorder that records interactions with web sites and produces scripts for automated playback. Recorder produces WWW::Mechanize scripts by default (see WWW::Mechanize by Andy Lester), but provides functionality to use your own custom logger.

Why Use HTTP::Recorder?

Simply speaking, HTTP::Recorder removes a great deal of the tedium from writing scripts for web automation. If you're like me, you'd rather spend your time writing code that's interesting and challenging, rather than digging through HTML files looking for the names of forms and fields so that you can write your automation scripts. HTTP::Recorder records what you do as you do it, so that you can focus on the things you care about.

Automated Testing

We all know that testing our code is good, and that writing automated tests that can be run again and again to check for regressions is even better. However, writing test scripts by hand can be tedious and prone to errors. You're more likely to write tests if it's easy to do so. The biggest obstacle to testing shouldn't be the mechanics of getting the tests written — it should be figuring out what needs to be tested, and how best to test it.

Part of your test suite should be devoted to testing things the way the user uses them, and HTTP::Recorder makes it easy to produce automation to do that, which allows you to put your energy into the parts of your code that need your attention and your expertise.

Automate Repetitive Tasks

When you think about web automation, the first thing you think of may be automated testing, but there are other uses for automation as well:

  • Check your bank balance.
  • Check airline fares.
  • Check movie times.

How to Set It Up

Use It with a Web Proxy

One way to use HTTP::Recorder (as recommended in the POD) is to set it as the user agent of a web proxy (see HTTP::Proxy by Philippe "BooK" Bruhat). Start the proxy running like this:


    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTTP::Proxy;
    use HTTP::Recorder;

    my $proxy = HTTP::Proxy->new();

    # create a new HTTP::Recorder object
    my $agent = HTTP::Recorder->new();

    # set the log file (optional)
    $agent->file("/tmp/myfile");

    # set HTTP::Recorder as the agent for the proxy
    $proxy->agent( $agent );

    # start the proxy
    $proxy->start();

Then, instruct your favorite web browser to use your new proxy for HTTP traffic.

Other Ways to Use It

Since HTTP::Recorder is a subclass of LWP::UserAgent, you can use it in any way that you can use its parent class.
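
For example, you can drive it directly (a sketch; example.com stands in for whatever site you want to record):

    use HTTP::Recorder;

    # use the recorder as an ordinary LWP user agent
    my $agent = HTTP::Recorder->new;
    $agent->file("/tmp/scriptfile");

    my $response = $agent->get("http://www.example.com/");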

How to Use It

Once you've set up HTTP::Recorder, just navigate to web pages, follow links, and fill in forms the way you normally do, with the web browser of your choice. HTTP::Recorder will record your actions and produce a WWW::Mechanize script that you can use to replay those actions.

The script is written to a logfile. By default, this file is /tmp/scriptfile, but you can specify another pathname when you set things up. See Configuration Options for information about configuring the logfile.

HTTP::Recorder Control Panel

The HTTP::Recorder control panel allows you to view and edit scripts as you create them. You can access the control panel by using the HTTP::Recorder UserAgent to fetch the control URL, which defaults to http://http-recorder/ but is configurable. See Configuration Options for more information about setting the control URL.

The control panel won't automatically refresh, but if you create HTTP::Recorder with showwindow => 1, a JavaScript popup window will be opened and refreshed every time something is recorded.

Goto Page. You can enter a URL in the control panel to begin a recording session. For SSL sessions, the initial URL must be entered into this field rather than into the browser.

Current Script. The current script is displayed in a textfield, which you can edit as you create it. Changes you make in the control panel won't be saved until you click the Update button.

Update. Saves changes made to the script via the control panel. If you prefer to edit your script as you create it, you can save your changes as you make them.

Clear. Deletes the current script and clears the text field.

Reset. Reverts the text field to the currently saved version of the script. Any changes you've made to the script won't be applied if you haven't clicked Update.

Download. Displays a plain text version of the script, suitable for saving.

Close. Closes the window (using JavaScript).

Updating Scripts as They're Recorded

You can record many things, and then turn the recordings into scripts later, or you can make changes and additions as you go by editing the script in the Control Panel.

For example, suppose you record filling in a form named form1, which has a name field, and clicking the Submit button. HTTP::Recorder produces the following lines of code:

    $agent->form_name("form1");
    $agent->field("name", "Linda Julien");
    $agent->submit_form(form_name => "form1");

However, if you're writing automated tests, you probably don't want to enter hard-coded values into the form. You may want to re-write these lines of code so that they'll accept a variable for the value of the name field.

You can change the code to look like this:

    my $name = "Linda Julien";

    $agent->form_name("form1");
    $agent->field("name", $name);
    $agent->submit_form(form_name => "form1");

Or even this:

    sub fill_in_name {
      my $name = shift;

      $agent->form_name("form1");
      $agent->field("name", $name);
      $agent->submit_form(form_name => "form1");
    }

    fill_in_name("Linda Julien");

Then click the Update button. HTTP::Recorder will save your changes, and you can continue recording as before.

You may also want to add tests as you go, making sure that the results of submitting the form are what you expected. You can add tests to the script like this:

    sub fill_in_name {
      my $name = shift;

      $agent->form_name("form1");
      $agent->field("name", $name);
      $agent->submit_form(form_name => "form1");
    }

    my $entry = "Linda Julien";
    fill_in_name($entry);

    $agent->content =~ /You entered this name: (.*)/;
    is ($1, $entry);

Using HTTP::Recorder with SSL

In order to do what it does, HTTP::Recorder relies on the ability to see and modify the contents of requests and their resulting responses...and the whole point of SSL is to make sure you can't easily do that. HTTP::Recorder works around this, however, by handling the SSL connection to the server itself, and communicating with your browser via plain HTTP.

Caution: Keep in mind that communication between your browser and HTTP::Recorder isn't encrypted, so take care when recording sensitive information, like passwords or credit card numbers. If you're running the Recorder as a proxy on your local machine, you have less to worry about than if you're running it as a proxy on a remote machine. The script it produces will speak SSL as usual when played back.

If you want to record SSL sessions, here's how you do it:

Start at the control panel, and enter the initial URL there rather than in your browser. Then interact with the web site as you normally would. HTTP::Recorder will record form submissions, following links, etc.

Replaying your Scripts

HTTP::Recorder records getting pages, following links, filling in fields, submitting forms, and so on, but it doesn't (at this point) generate a complete Perl script. Remember that you'll need to add standard script headers and initialize the WWW::Mechanize agent, with something like this:

    #!/usr/bin/perl

    use strict;
    use warnings;
    use WWW::Mechanize;
    use Test::More qw(no_plan);

    my $agent = WWW::Mechanize->new();
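
Putting that header together with recorded lines, a complete replayable test might look like this (a sketch; the URL and the "You entered this name" response are hypothetical, carried over from the earlier form1 example):

    #!/usr/bin/perl

    use strict;
    use warnings;
    use WWW::Mechanize;
    use Test::More qw(no_plan);

    my $agent = WWW::Mechanize->new();

    # hypothetical page holding the recorded form
    $agent->get("http://www.example.com/form.html");

    $agent->form_name("form1");
    $agent->field("name", "Linda Julien");
    $agent->submit_form(form_name => "form1");

    like($agent->content, qr/You entered this name: Linda Julien/,
        "form echoed the name back");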

Configuration Options

Output file. You can change the filename for the scripts that HTTP::Recorder generates with the $recorder->file([$value]) method. The default output file is '/tmp/scriptfile'.

Prefix. HTTP::Recorder adds parameters to link URLs and adds fields to forms. By default, its parameters begin with "rec-", but you can change this prefix with the $recorder->prefix([$value]) method.

Logger. The HTTP::Recorder distribution includes a default logging module, which outputs WWW::Mechanize scripts. You can change the logger with the $recorder->logger([$value]) method, replacing it with a logger that:

  • subclasses the standard logger to provide special functionality unique to your site
  • outputs an entirely different type of script

RT (Request Tracker) 3.1 by Best Practical Solutions has a Query Builder that's a good example of a page that benefits from a custom logger: the page has several Field/Operator/Value groupings. Left to its own devices, the default HTTP::Recorder::Logger will record every field for which a value has been set:

    $agent->form_name("BuildQuery");
    $agent->field("ActorOp", "=");
    $agent->field("AndOr", "AND");
    $agent->field("TimeOp", "<");
    $agent->field("WatcherOp", "LIKE");
    $agent->field("QueueOp", "=");
    $agent->field("PriorityOp", "<");
    $agent->field("LinksOp", "=");
    $agent->field("idOp", "<");
    $agent->field("AttachmentField", "Subject");
    $agent->field("ActorField", "Owner");
    $agent->field("PriorityField", "Priority");
    $agent->field("StatusOp", "=");
    $agent->field("DateField", "Created");
    $agent->field("TimeField", "TimeWorked");
    $agent->field("LinksField", "HasMember");
    $agent->field("WatcherField", "Requestor.EmailAddress");
    $agent->field("AttachmentOp", "LIKE");
    $agent->field("ValueOfAttachment", "foo");
    $agent->field("DateOp", "<");
    $agent->submit_form(form_name => "BuildQuery");

But on this page, there's no need to record setting the values of fields (XField) and operators (XOp) unless a value (ValueOfX) has actually been set. We can do this with a custom logger that checks for the presence of a value, and only records the value of the field and operator fields if the value field has been set:

    package HTTP::Recorder::RTLogger;

    use strict;
    use warnings;
    use HTTP::Recorder::Logger;
    our @ISA = qw( HTTP::Recorder::Logger );

    sub SetFieldsAndSubmit {
        my $self = shift;
        my %args = (
            name          => "",
            number        => undef,
            fields        => {},
            button_name   => {},
            button_value  => {},
            button_number => {},
            @_
        );

        $self->SetForm(name => $args{name}, number => $args{number});

        my %fields = %{ $args{fields} };
        foreach my $field (sort keys %fields) {
            # skip the XOp and XField fields unless ValueOfX was set
            if ($args{name} eq 'BuildQuery'
                && ($field =~ /(.*)Op$/ || $field =~ /(.*)Field$/)
                && !exists $fields{'ValueOf' . $1}) {
                next;
            }
            $self->SetField(name  => $field,
                            value => $args{fields}->{$field});
        }

        $self->Submit(name          => $args{name},
                      number        => $args{number},
                      button_name   => $args{button_name},
                      button_value  => $args{button_value},
                      button_number => $args{button_number});
    }

    1;

Tell HTTP::Recorder to use the custom logger like this:

    my $logger = HTTP::Recorder::RTLogger->new;
    $agent->logger($logger);

And it will record a much more reasonable number of things:

    $agent->form_name("BuildQuery");
    $agent->field("AndOr", "AND");
    $agent->field("AttachmentField", "Subject");
    $agent->field("AttachmentOp", "LIKE");
    $agent->field("ValueOfAttachment", "foo");
    $agent->submit_form(form_name => "BuildQuery");

Control panel. By default, you can access the HTTP::Recorder control panel by using the Recorder to get http://http-recorder. You can change this URL with the $recorder->control([$value]) method.

Logger Options

Agent name. By default, HTTP::Recorder::Logger outputs scripts with the agent name $agent:

     $agent->follow_link(text => "Foo", n => 1);

However, if you prefer a different agent name (in order to drop recorded lines into existing scripts, conform to company conventions, etc.), you can change that with the $logger->agentname([$value]) method:

     $logger->agentname("mech");

will produce the following:

     $mech->follow_link(text => "Foo", n => 1);

How HTTP::Recorder Works

The biggest challenge in writing a web recorder is knowing what the user is doing, so that it can be recorded. A proxy can watch requests and responses go by, but the only thing you'll learn is the URL that was requested and its parameters. HTTP::Recorder solves this problem by rewriting HTTP responses as they come through, adding extra information to the page's links and forms so that it can extract that information again when the next request comes through.

As an example, a page might contain a link like this:

    <a href="http://www.cpan.org/">CPAN</a>

If the user follows the link, and we want to record it, we need to know all of the relevant information about the action, so that we can produce a line of code that will replay the action. This includes:

  • the fact that a link was followed.
  • the text of the link.
  • the URL of the link.
  • the index (in case there are multiple links on the page of the same name).

HTTP::Recorder overrides LWP::UserAgent's send_request method, so that it can see requests and responses as they come through and modify them as needed.

HTTP::Recorder rewrites the link so that it looks like this:

    <a href="http://www.cpan.org/?rec-url=http%3A%2F%2Fwww.cpan.org%2F&rec-action=follow&rec-text=CPAN&rec-index=1">CPAN</a>

So, with the rewritten page, if the user follows this link, the request will contain all of the information needed to record the action.

Forms are handled likewise, with additional fields being added to the form so that the information can be extracted later. HTTP::Recorder then removes the added parameters from the resulting request, and forwards the request along in something close to its originally intended state.

Looking Ahead

HTTP::Recorder won't record 100% of every script you need to write, and while future versions will undoubtedly have more features, they still won't write your scripts for you. However, it will record the simple things, and it will give you example code that you can cut, paste, and modify to write the scripts that you need.

Some ideas for the future include:

  • Choosing from a list of simple tests based on the fields on the page and their current values.
  • "Threaded" recording, so that multiple sessions won't be recorded in the same file, overlapped with each other.
  • "Add script header" feature.
  • Supporting more configuration options from the control panel.
  • Other loggers.
  • JavaScript support.

Where to Get HTTP::Recorder

The latest released version of HTTP::Recorder is available at CPAN.

Contributions, Requests, and Bugs

Patches, feature requests, and problem reports are welcomed at http://rt.cpan.org.

You can subscribe to the mailing list for users and developers of HTTP::Recorder at http://lists.fsck.com/mailman/listinfo/http-recorder, or by sending email to http-recorder-request@lists.fsck.com with the subject "subscribe".

The mailing list archives can be found at http://lists.fsck.com/piper-mail/http-recorder.

See Also

WWW::Mechanize by Andy Lester.

HTTP::Proxy by Philippe "BooK" Bruhat.
