October 2003 Archives

Open Guides

First, a disclaimer.

I'm not a wiki celebrity. I don't look good in StudlyCaps. I'm not part of the wiki culture — I've never contributed to Ward's Wiki, never used TWiki, am baffled by MoinMoin, and every time I look at UseMod code, my brain turns to mashed banana. Most wiki people probably have no idea who I am.

Having said that, I'm going to spend 2,500 words or so advocating the possibilities of Ward Cunningham's simple, potent idea, and explaining how I and a couple of other Perlmongers have applied it to create what I think is one of the most exciting Perl applications currently in existence — OpenGuides.

Beginnings

grubstreet, the predecessor to OpenGuides, was conceived in early 2002 when I asked Earle Martin whether he knew of a London wiki:


It seems that my friend's Wiki is running UseModWiki; I think I rather
like it.  It would be good if there was one of these for info about
London; do you know if anyone's done that?  Things like which pubs
serve food and good beer, etc.; which is the best end of the platform
to stand at to get a seat (discussed this kind of thing with blech
recently).

Earle was enthusiastic and made it so. We both got very excited and started filling the thing with content. Wiki makes this easy! You're reading a web page and spot something that's wrong or missing. Click the "edit" link, add your comment, and it's right there.

Continuings

It didn't take long before we started bumping our heads against the limitations of the usemod software. Even leaving aside its tendency to clamp down on its (custom-format) flat-file database and refuse anyone edit access, I found myself writing umpteen screenscrapers to do simple things like find a nice pub in Soho. I hate screenscraping, but I love my beer.

We tried to patch and amend usemod. We tried very hard. Ivor Williams, in particular, spent a lot of time in its guts. I decided in the end that writing software should only hurt some of the time, and after several beers one night, made a pact with Chris Ball that grubstreet's software would be rewritten in Real Perl. Chris held me to it, and a CPAN-friendly wiki toolkit — CGI::Wiki — resulted. Once we had that to build on, we started on the CGI script that eventually turned into the OpenGuides distribution.

What It Says on the Tin

OpenGuides is a complete web application for managing a collaboratively written guide to a city or town.

Install OpenGuides, and what you get is a blank framework waiting for you to put content into it.

There's an opportunity right here for anyone wanting to join the project team. Write a set of pages for bundling with new installs of the OpenGuides software — how to use the Guide, how to format your entries, maybe stub pages for things that all cities have in common, maybe a category framework for transport pages. You're bound to be able to come up with better ideas than those of us who've been using the software for ages and are blind to its confusing spots.

Just a Skeleton, But a Damned Sturdy One

No, we didn't just give you the equivalent of an empty directory to put your HTML files into. Start adding pages and you'll see.

Suppose I want to add a page about my local pub. I'll click on "Create a new page" and type in the page name. What should I call it? Well, this is a new OpenGuides install, with no established conventions, so I could call it "The Drapers Arms", "Drapers Arms", "The Drapers Arms (Islington)", or whatever. I just need to keep in mind that the name needs to be unique, so if I expect there to be more than one Drapers Arms in my city, I really should add some other kind of identifying information. The Open Guide to London has a convention of including the postcode — thus "Drapers Arms, N1 1ER".

OK, so I've done that, and now I'm presented with an editing form with several boxes for me to type into. The first, Content, is a freeform box where I can put any information that doesn't fit into the more specific boxes below.

Locales and Categories are the next boxes. I can put whatever I like into these, and so can later visitors to this Guide. I don't need to decide right now on a useful way to divide my city into locales; it'll just emerge from the aggregated opinion of all the people who contribute. I can always come back to this, my first page, in a few months and add any later-defined categories or locales that seem to apply to it. Or I may not need to; someone else may have got around to it before me.

Locales and categories are excellent ways to make sure that your newly added content doesn't drift off into a decoupled purgatory of unlinked pages. Just add the Pubs category and the Islington locale to the Drapers Arms page, and anyone doing a search — whether a simple type-it-into-the-box search or a directed locale or category search — will find it.

Next, we get a set of smaller boxes for entering things like more-detailed location information, contact information, and opening hours. These boxes may be completely irrelevant to many, most, or all pages in your Guide. That's OK. They're optional. But if you do fill them in, you get to play with what I feel is one of the most innovative, yet simple, features of OpenGuides — find me everything within half a kilometre of Piccadilly Circus Tube station. Please. Because my feet hurt and I could murder a glass of wine.
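
That distance query is the kind of thing the CGI::Wiki::Plugin::Locator::UK plugin (which also turns up in the pub-search script below) is for. As a rough sketch, assuming a CGI::Wiki object in $wiki and the node name the London guide uses for the station, it boils down to something like this:

use CGI::Wiki::Plugin::Locator::UK;

# Assumptions for illustration: $wiki is an already-configured CGI::Wiki
# object, and "Piccadilly Circus Station" is the node name the guide uses.
my $locator = CGI::Wiki::Plugin::Locator::UK->new;
$wiki->register_plugin( plugin => $locator );

my @nearby = $locator->find_within_distance(
                 node   => "Piccadilly Circus Station",
                 metres => 500,
             );
print "$_\n" for @nearby;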

Customization and Extension

I meant it when I said I wanted to be able to find pubs. I want to find all pubs in Notting Hill that serve food and have a beer garden. The Open Guide to London must have this information! There's no obvious way to get to it directly, though. I may have to write some code.

Given that I'm one of the admins, I have access to the database on the server — so I can call the CGI::Wiki list_nodes_by_metadata method directly to find all pages in Category Pubs, Locale Notting Hill, and Category Pub Food.

I wrote a CGI script to take in options for selecting pubs and output results. It's very useful, so will be in one of the next few official OpenGuides releases. Here's an excerpt. Note that the locale and categories are simply stored as CGI::Wiki metadata. Note also the use of CGI::Wiki::Plugin::Locator::UK to allow searching by nearest Tube station. You could easily adapt this if you live in a city where people navigate by some other kind of landmark.


my %possible_features = (
    "beer gardens"    => "Has beer garden",
    "function room"   => "Has function room",
    "good beer guide" => "Appears in the CAMRA Good Beer Guide",
    "real cider"      => "Serves real cider",
    "belgian beer"    => "Serves Belgian beer",
    "pub food"        => "Serves food of some kind",
);

if ( $action eq "search" ) {
    my @locales       = CGI::param( "locale" );
    my @features      = CGI::param( "feature" );
    my @tube_stations = CGI::param( "tube" );

    # Ignore the blank "any locales" option.
    @locales = grep { $_ } @locales;

    # Ensure that we only look for 'allowed' features.
    @features = grep { $possible_features{$_} } @features;

    # Ensure that we only look for extant Tube stations.
    my %is_tube = map { $_ => 1 } list_tube_stations();
    @tube_stations = grep { $is_tube{$_} } @tube_stations;

    # Grab all the pubs, to start with.
    my @pubs = $wiki->list_nodes_by_metadata(
                   metadata_type  => "category",
                   metadata_value => "pubs",
                   ignore_case    => 1,
    );

    # Filter by locale if specified.
    if ( scalar @locales > 0 ) {
        my @in_locale;
        foreach my $locale ( @locales ) {
            push @in_locale,
                 $wiki->list_nodes_by_metadata(
                     metadata_type  => "locale",
                     metadata_value => $locale,
                     ignore_case    => 1,
                 );
        }
        my %in_locale_hash = map { $_ => 1 } @in_locale;
        @pubs = grep { $in_locale_hash{$_} } @pubs;
    }

    # Filter by Tube station if specified.
    if ( scalar @tube_stations > 0 ) {
        my $locator = CGI::Wiki::Plugin::Locator::UK->new;
        $wiki->register_plugin( plugin => $locator );
        my @near_station;
        foreach my $station ( @tube_stations ) {
            push @near_station,
                $locator->find_within_distance(
                    node   => $station . " Station",
                    metres => 600,
                );
        }
        my %near_station_hash = map { $_ => 1 } @near_station;
        @pubs = grep { $near_station_hash{$_} } @pubs;
    }

    # Filter by features if specified.
    if ( scalar @features > 0 ) {
        my %has_feature = map { $_ => [] } @pubs;
        foreach my $feature ( @features ) {
            my @has_this_feature = $wiki->list_nodes_by_metadata(
                     metadata_type  => "category",
                     metadata_value => $feature,
                     ignore_case    => 1,
                 );
            foreach my $pub ( @has_this_feature ) {
                push @{ $has_feature{$pub} }, $feature;
            }
        }
        # Only keep pubs that have *all* the requested features.
        @pubs = grep { scalar @{ $has_feature{$_} } == scalar @features }
                     @pubs;
    }

    show_results(
                  pubs          => \@pubs,
                  locales       => \@locales,
                  tube_stations => \@tube_stations,
                  features      => [ @possible_features{ @features } ],
                );
}

You Can Do It, Too

Suppose I'd had the idea for this directed pub search but didn't have direct access to any OpenGuides data store? No problem — I can play with the RDF interface. Most OpenGuides pages have a link to an RDF version, and this includes the auto-generated pages like locale or category search results.

I can send a query like http://london.openguides.org/index.cgi?action=index;index_type=category;index_value=Pubs;format=rdf and then use RDF::Core::Parser to parse the returned RDF/XML and get the data that otherwise would have required CGI::Wiki calls.
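
Here's a rough sketch of the fetching half, using LWP::Simple; the URL is the category-index query above, and what you then do with the returned RDF/XML is down to your parser of choice:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw( get );

# The category index query from above, asking for RDF output.
my $url = "http://london.openguides.org/index.cgi"
        . "?action=index;index_type=category;index_value=Pubs;format=rdf";

my $rdf = get( $url ) or die "Couldn't fetch $url\n";

# Hand the RDF/XML to RDF::Core::Parser (or whichever RDF toolkit you
# prefer) to pull out the node names and locations.
print $rdf;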

The RDF interface isn't too well advertised. A list of places where any kind of link to an RDF version is missing would be most useful.

Given the simple data model of an OpenGuides page, such an external add-on would be trivial to incorporate into the core distribution. So once you've written one, send it to us.

The RDF interface is also ideal for people interested in writing IRC bots:


15:12 <Kake> grotbot: things in Chinatown
15:12 <grotbot> OK, working on it
<grotbot> Kake: things in Chinatown: Crispy Duck, W1D 6PR; De Hems,
          W1D 5BW; Golden Harvest, WC2H 7BE; HK Diner; Hung's, W1D 6PR;
          Misato, W1D 6PG; Tai, W1D 4DH; Tokyo Diner; Zipangu, WC2H 7JJ

Caveats

The OpenGuides software is still young. The install procedure, in particular, needs a good going-over, and some of the location features currently only work for guides to cities in the UK.

Live OpenGuides Installs

Similar systems

This week on Perl 6, week ending 2003-10-26

Where does the time go? It seems like only yesterday that I was sat hiding Leon Brocard in the first letters of the first 11 body paragraphs of the last summary. Now, here I am, on the train, typing away in a desperate attempt to get this mailed out before Wednesday. Let's start with perl6-internals again.

Object Freezing

Whoever would have thought that freezing objects would be so controversial? Object freezing is when you take an object and generate a 'frozen' representation of it that can be used to make a copy of the original object. It's what Storable, Data::Dumper and YAML (to name three Perl 5 modules) all do more or less successfully. (Most of the time there's no problem; the tricky case is a composite object that involves a circular data structure. Consider this:


    Ring -> A -> B -> C -> D
            ^              |
            `--------------'

(It's not quite a purely academic example; I can point to at least one text editor that uses a similar data structure.)

When you come to freeze your ring, you need some way of detecting the cycle and generating a finite representation that works. This isn't the only problem; thread safety is hard, for instance.)

Dan's set some fairly stringent requirements on whatever mechanism is used for freezing. The most important (and most stringent) is that, because freezing during object destruction will be a possibility, object traversal is not allowed to use additional memory.

Dan is convinced that we can do this using the Garbage Collector's object traversal system. Leo Tötsch is equally convinced that we can't. The resulting thread is rather meaty and hard to summarize without massive amounts of cut and paste so I'll just point you at the root message. The upshot is that we're doing it Dan's way; Glorious Leader continues to trump Pumpking Patchmonster.

In another subthread there was a good deal of misunderstanding about XML declarations and parsing, which got cleared up surprisingly quickly.

http://groups.google.com/groups

Object Instantiation

Dan had a moment of clarity and declared that the Parrot Way to instantiate an object in class Foo will be:


  new P5, .Foo

All we need now is a working implementation. And, apparently, knowing what class a class is a member of might be handy, but Dan's punting on (``ignoring the heck out of'') that one.

http://groups.google.com/groups

A less controversial API addition

In the middle of the discussion about object freezing, Dan popped up another thread to discuss how to invoke Sub/Method PMCs from C code. So Leo implemented Parrot_runops_fromc_args(), but he's not exactly happy about the name. Regardless of the name, having this implemented is rather spiffy.

http://groups.google.com/groups

Old Big Problems before New Big Problems

Melvin Smith popped up with a rant about the Parrot development effort's tendency to rush off and implement new stuff before making double sure that the 'old' stuff is actually complete and robust. To which I can only say ``Hear! Hear!'' Dan agreed, and threatened to give Melvin the pumpkin after Leo had finished with it.

http://groups.google.com/groups

Class metadata for PIR/assembly files

Donning his designer's cap again, Dan posted a rough spec for class metadata declarations in PIR and pasm files. It looks pretty decent to me. Melvin Smith made a few telling comments though, so it looks like Dan's post isn't quite the final word on the matter.

http://groups.google.com/groups

Class creation in bytecode

Mere moments after the metadata post, Dan went on to spec out the assembly language needed to support it.

http://groups.google.com/groups

Open Patches

Leo pointed everyone at http://www.parrotcode.org/openpatches/, the list of open Parrot patches and asked for a volunteer to go through and make sure that the list is in agreement with reality.

http://groups.google.com/groups

Leo's notes on objects and classes

Leo posted a list of thoughts about how classes and objects will work in Parrot and offered a suggestion about using OrderedHashes to handle things.

http://groups.google.com/groups

Unifying vtables and method invocations

Leo wondered if we should arrange things so that PMC vtables could be invoked in the same way as methods on Parrot-level objects. Dan answered that it won't be quite like that, but it will be close. (Yay! A reflective programmer writes).

http://groups.google.com/groups

More fixed number assignments

So, it turns out that adding PMC classes to the core breaks binary compatibility. Which needs fixing. Dan asked for volunteers. I'm afraid I don't know what fixing it would entail.

http://groups.google.com/groups

Plotz!

Plotz (the Pol(l)y-Lingual Opcode Translation for the Z-machine) is Amir Karger's shot at Parrot immortality by getting Parrot to run Infocom adventures. Right now he's prototyping his design in Perl and he's apparently 10 opcodes away from a tool that will run around two thirds of all released Infocom games. He's rather happy about this.

http://groups.google.com/groups

Halloween Release

Melvin Smith proposed a ``just for fun'' Halloween release of Parrot. Dan christened the putative release ``The Screaming Pumpkin'', and Leo thought there was a little too much stuff in flux at the moment. Me? I think it's going to happen. The codename is too good to resist.

http://groups.google.com/groups

IMCC leaves the languages ghetto

Melvin Smith checked in a large patch to move IMCC from parrot/languages/imcc/ to parrot/imcc, reflecting the intention/reality that IMCC is parrot. As the week progressed, various other directories got moved around in an attempt to make things a little more logical. Everything is now roughly in its place, but at one point during the week it would have been, frankly, foolhardy to attempt to check parrot out of CVS.

http://groups.google.com/groups

IMCC, Classes & Metadata, oh my!

Melvin Smith, who has returned from his Parrot holiday with a seemingly enormous supply of tuits, announced that he'd started work on implementing the class syntax for PIR and had reached a decision point, so he asked for comments. The consensus was to go with a short term hack for the time being, just to get something working, with a big flag for the future.

http://groups.google.com/groups

PIR changes

For those of you who are implementing things in IMCC, Melvin announced that he'd added newsub and newclosure to IMCC, allowing implementors to simply write P0 = newsub <label> and P0 = newclosure <label>, which is rather more efficient than the PASM equivalents.

http://groups.google.com/groups

PMC initialization

Dan decided that the time has come to allow PMCs to be created with initialization data rather than having separate creation and initialization phases. His proposal, pending a Better Idea(tm), was to have two init methods: one taking no arguments and simply creating an empty PMC, and another which assumes that its parameters are in the registers (using the standard calling conventions) and goes from there. Leo thought it was a little heavy. I like it, though; the more things that use standard Parrot calling conventions, the less I have to remember and the happier I am -- but I am a bear of very little brain.

http://groups.google.com/groups

Meanwhile, in perl6-language

Luke Palmer asked a question about named return values and :=. David Storrs boggled slightly.

Next week, Damian Conway answers Luke's question.

http://groups.google.com/groups

Acknowledgements, Announcements, Apologies

Looks like the Wednesday ship date of the last summary was a temporary aberration. Which is nice.

If you found this summary valuable, please consider one or more of:

Database Programming with Perl

An Embarrassing Confession

I'd like to think that I'm a reasonably decent Perl programmer now. I'd like to think that I have a good grasp of how to solve relatively common problems in Perl. But, you know, it hasn't always been this way. Oh no.

A long, long time ago, when I was a tiny little programmer, I worked as a trainee Perl coder and systems administrator for a large database company. Naturally, at a database company, a lot of what we had to do was talking to databases in Perl. As a fresh-faced programmer, the only way I knew to interface with databases was through a command-line SQL client.

I won't embarrass the company in question by giving away the name of the database, so let's call this SQL client sqlstar. Very soon I was writing horrendous Perl programs that did things like:


    my @rows = `sqlstar $database "select * from $table"`;
    for (@rows) {
        ...
    }

Of course, things got terribly confused when we had complex where clauses, and needed to escape metacharacters, and ...


    my @rows = `sqlstar $database "select * from $table where
        $row LIKE \"%pattern%\""`;

The code rapidly got ugly, error-prone, and dangerously unsafe. If someone had decided to search for a value with a double quote in it, I don't know where we'd have been. But for the most part it worked, so nobody really worried about it.

Looking back on programs like that makes me cringe today. Of course, a better solution is obvious -- but only once someone tells you about it. And if nobody tells you about it, you could end up writing horrible code like this until someone does tell you or you get fired.

The Obvious Solution

So in case anyone hasn't told you yet: there's a better way. The better way is called the DBI, the DataBase Interface module. It was initially written between 1992 and 1994, long before I was messing about with sqlstar -- so I really have no excuse.

You see, there were many problems with my code. Not only was it ugly, susceptible to shell breakage, conceptually wrong, and inefficient, but it tied my code to the particular database engine we were using. Now, given we were a database company, it's unlikely that we'd ever be using a different database at any time, but the principle of the thing remains.

Historically, Perl had several different ways to talk to databases; back in the days of Perl 4, the best way to communicate with a database -- even better than my horrible command-line utility hack -- was to use one of the specially compiled Perl binaries that included functions for driving a database. For instance, there was one called oraperl, which allowed you to write code like so:


$lda = &ora_login("master", $user, $pass);
$csr = &ora_open($lda, "select * from $table");
while (@data = &ora_fetch($csr)) {
    # ...
}

This is obviously a bit of an improvement over the code I was writing -- it's a lot more robust, it allows you to do error checking on several different levels, and it saves you a lot of hassle parsing the output of the command-line tool. It's also more efficient, since everything stays inside the one Perl process, which again reduces the number of "moving parts" and things that can go wrong.

So these things were a good solution for Perl 4, but along came Perl 5 and plug-in modules, and suddenly people found a way to solve one of the big problems with these compiled-in database libraries.

We've just seen an example of oraperl, which works great if you're using Oracle -- and, of course, the version of Oracle your Perl was compiled for. If you decide to move your program to Informix, you don't have much option but to rewrite all your database code; this isn't very practical, especially if you want to write code that can be deployed to third parties.

The Perl 5 solution was Tim Bunce's DBI. As well as providing a handy set of functions for all kinds of database access, DBI provides an abstraction layer between the Perl code and the underlying database, allowing you to switch database implementations really easily.

Here's a little conceptual diagram of how the DBI does its stuff:

Your Perl program talks to the DBI, and the DBI talks to whichever Database Driver (DBD) is right for your backend database. This means that to use the DBI, you need to have four things:

  • A C compiler to compile the XS code for DBI and the DBD drivers.
  • A copy of the DBI module compiled and installed.
  • A copy of the relevant client libraries and header files for the database you want to talk to. For instance, on my Debian system, to talk to mysql, I install the libmysqlclient10-dev package.
  • The relevant DBD library compiled and installed -- for example, DBD::mysql.

Once that's all up and working, we can start writing some database code using the DBI.

Using the DBI

To connect to a database with the DBI, we need to first construct a string that identifies the database we want to connect to; this is called a data source name, or DSN. Let's assume we're going to be working with a MySQL database called "phonebill." (Simply because that's what I was working with yesterday.) The DSN for this is made up of three parts, joined by colons: first, dbi, since that's what we're using to get our data; mysql, since that's the name of the driver we want to use to get it; and phonebill, since that's the database we're getting it from.

So, to connect to a database with the DBI, we'd write something like this:


use DBI;
my $dbh = DBI->connect("dbi:mysql:phonebill", $user, $password);

In a lot of cases, you can do without the username and password if you're connecting as a local user. However, for a serious application, you'll probably want to create a specific user or prompt for a password.

When we connect to the database, DBI returns us a database handle, which is typically stored in a variable called $dbh. (Of course, if you're connecting to multiple databases, you may prefer to give each handle a name that identifies which database it belongs to.) Now that we have a database handle, we can use it to make queries.

Making a query in the DBI takes place in three stages. First, you prepare some SQL; then you execute the query; finally, you get the results. Let's do that now:


my $sth = $dbh->prepare(<<SQL);
   select recipient, calldate, calltime, duration
   from call 
   where duration > 60
   order by duration desc
SQL

$sth->execute;

my %calls;
while (my @row = $sth->fetchrow_array()) {
   my ($recipient, $calldate, $calltime, $duration) = @row;
   $calls{$recipient} += $duration;
   print "Called $recipient on $calldate\n";
}

# Now do something with the total times here.

Why, you might think, do we have to go through these three stages just to make an SQL query? Isn't Perl supposed to make things easy? Well, it does, but it makes different things easy from what you might expect. For instance, suppose you're inserting a lot of rows into a table. This is precisely the sort of thing you don't want to do:


while (my $data = <FILE>) {
    my ($recipient, $date, $time, $duration) = split /:/, $data;
    # DON'T DO THIS
    my $sth = $dbh->prepare(<<SQL);
INSERT INTO call (recipient, calldate, calltime, duration)
VALUES ("$recipient", "$date", "$time", "$duration");
SQL
    # NO REALLY, DON'T DO THIS

    $sth->execute;
}

There are two reasons why this is BAD, BAD, BAD. The first, of course, is that the moment someone comes along with a double-quote in the file, we're in big trouble. In fact, the moment someone comes along with "; DROP DATABASE; in the file, we're out of a job.

The second is that it's really inefficient to set up a statement, execute it, tear it down, set up a statement, execute it, tear it down, and round we go again.

The reason for the disconnect between preparing a statement and executing it is to enable us to use the same statement multiple times with slightly different values; we do this by using what DBI calls "bind parameters" -- portions of the SQL that will be replaced later. For instance, the right way to do our mass inserts would be something like this:


    my $sth = $dbh->prepare(<<SQL);
INSERT INTO call (recipient, calldate, calltime, duration)
VALUES (?, ?, ?, ?)
SQL

while (my $data = <FILE>) {
    my ($recipient, $date, $time, $duration) = split /:/, $data;
    $sth->execute($recipient, $date, $time, $duration);
}
    

Isn't that just so much neater? We've hoisted the statement outside the loop, so it only gets prepared once -- much more efficient. We specify the parameters we want bound to the SQL using question marks, and we pass in the values to the execute call.

As an additional bonus, when execute substitutes the bind values into the SQL, it calls the database handle's quote method on each one; this is a database-specific method, which escapes any nasty characters like quotes and semicolons in the input, and makes our code safe against the "; DROP DATABASE" attack.
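
Incidentally, if you ever do have to build a piece of SQL by hand, you can call the quote method yourself. A quick sketch, with $recipient standing in for some user-supplied value:

# Bind values are the normal route, but quote() is there if you need it.
my $safe_recipient = $dbh->quote($recipient);   # $recipient: user input
my $sth = $dbh->prepare(
    "select calldate, duration from call where recipient = $safe_recipient"
);
$sth->execute;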

Making Things Easier

But in many cases, the prepare-execute-fetch process is a pain in the neck. Thankfully, DBI provides some easier ways to perform SQL statements; it has some canned methods that do prepare, execute, and fetch in one go.

The first of these is do, which executes a statement when you don't need any results back, such as a DELETE:


# Ignore short calls.
$dbh->do("delete from call where duration < 5");
    
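
do also accepts bind values, so even the fire-and-forget case can use placeholders; the second argument is an attributes hashref, which can simply be undef. For example, with a hypothetical $threshold:

# Delete calls shorter than a threshold, passing the value as a bind
# parameter; the undef is the (unused) attributes hashref.
my $threshold = 5;
$dbh->do("delete from call where duration < ?", undef, $threshold);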

For SELECT statements, there are a variety of methods that can help out. Perhaps the easiest to use is selectall_arrayref. This returns the results of the SELECT as a reference to an array of arrays:



my $results = $dbh->selectall_arrayref(<<SQL);
   select recipient, calldate, calltime, duration
   from call  
   where duration > 60 
   order by duration desc 
SQL

for my $row (@$results) {
   my ($recipient, $calldate, $calltime, $duration) = @$row;
   ...
}
    

There are many other DBI tricks, too many to go into here; for more information, check out the DBI documentation page or the DBI home page; there's also Programming the Perl DBI, which was co-authored by the creator of DBI.
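
Just to give a flavour of two of them: selectrow_array grabs the first row of results as a list, and fetchrow_hashref hands you each row keyed by column name ($recipient below is, again, just a placeholder value):

# Total duration of calls to one recipient, in a single call.
my ($total) = $dbh->selectrow_array(
    "select sum(duration) from call where recipient = ?",
    undef, $recipient,
);

# Rows as hashes keyed by column name.
my $sth = $dbh->prepare("select recipient, duration from call");
$sth->execute;
while (my $row = $sth->fetchrow_hashref) {
    print "$row->{recipient}: $row->{duration}\n";
}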

Where to from Here?

These days, I actually don't write very much SQL; there are many more Perlish abstraction layers on top of SQL, such as Tony Bowden's Class::DBI, DBIx::RecordSet, DBIx::SearchBuilder, and many more.

Additionally, there are some very interesting things going on in the world of database servers -- SQLite is a very fast embedded SQL engine which doesn't require an external server process, and there are Perl bindings to that in the form of DBD::SQLite.

We'll look at some more of these techniques at a later date, but hopefully this should be enough to get you started using relational databases in Perl ... and of course, the most important lesson of this article: don't worry if you look back at your code after five years and cringe -- so do I!

This week on Perl 6, week ending 2003-10-19

Lumme! Another week, another summary.

Every week (almost) we start with the perl6-internals list, so here goes.

An Evil task for the interested

Our Glorious Leader, Dan Sugalski, last week asked for volunteers to work on making ordered destruction work. (Ordered destruction is where the system tries to ensure that 'parent' objects get destroyed before any of their children). Jeff Clites announced that he'd got a partial implementation working.

Neither Leo Tötsch nor Jürgen Bömmels were sure that the approach Jeff was taking would be the Right Thing in the long run, proposing instead a more general iterator mechanism.

http://groups.google.com/groups

Perl 6 Sub calling

Bringing the perl6 compiler (in languages/) back to life, Steve Fink has committed a 'rather large' patch which implements a subset of the Apocalypse 6 subroutine signature rules. The implementation is apparently very ad hoc and shouldn't be regarded as the final word. But it looks like a very good start to me.

http://groups.google.com/groups

Website timeliness

Responding to Matt Fowles' observation that the parrot website is rather behind the times, Robert Spier let slip that there would soon be a revised website that will be easy for everyone to send patches to, and which will make it much easier to have multiple maintainers for different areas of the site.

Of course, there are still a few 'technical chunks' that need to get finished before it's ready to unveil, but it's good to know that work continues in this area. Thanks Robert.

http://groups.google.com/groups

http://www.vendian.org/parrot/wiki/bin/view.cgi — Mike Scott's Parrot Wiki

Dynamic oplibs

Chances are, you've never really needed a fortytwo operator, or even a what_do_you_get_if_you_multiply_six_by_nine operator, and you certainly don't need them cluttering up the parrot core. Which is why Leo Tötsch has implemented those ops as a dynamically loadable ops library.

Admittedly, you're highly unlikely to load this particular ops library, but the underpinning tools for dynamic loading of ops libraries are new and potentially very useful.

Rather later in the week, once he'd got dynamic loading of ops working in all the runtime cores, Leo posted an 'intermediate summary' which explains how things work.

http://groups.google.com/groups

http://groups.google.com/groups

Oplibs, pmclibs, function libs

Dan has been thinking about the problems that can arise with dynamic loading. One issue is that, if you have separate files for each PMC class, opcode library and parrot function library, things get unwieldy very quickly, and if you're not careful you'll exhaust the OS file descriptor pool. Which would be bad. So, he asked for a sanity check before going on to work out a scheme for bundling libraries into larger files. Leo agreed that Dan was talking sense, so I expect we'll be seeing some design in this area soon.

http://groups.google.com/groups

Instantiating objects

It's been a busy week on the parrot front for Dan. On Wednesday he outlined his thinking on instantiating objects, with the aim of getting single inheritance objects up and running. This sparked a good deal of discussion, but nothing was actually agreed.

http://groups.google.com/groups

Redoing IMCC macros

Surprisingly, Jürgen Bömmels opened up a can of worms when he redid IMCC's macro support to use a hash lookup instead of a linear search through an array. This sparked a good deal of discussion about the right scoping, which hash implementation to use, and whether IMCC should be closely entwined with the interpreter.

http://groups.google.com/groups

Fixed opcode numbering infrastructure

Dan checked in a patch to fix opcode numbers for the core ops, deliberately breaking the JIT in the process. Leo wasn't happy. After a bit of back and forth, we now have fixed opcodes, but the implementation isn't quite what Dan originally did.

There was also some discussion of how many opcodes really needed to be fixed; after all, in the presence of dynamically loaded oplibs, you can't nail every opcode down. Leo worried that dynamically loaded oplibs don't play well with JIT, and making it work would probably need a total rewrite of the JIT core, but Jürgen didn't think it was all that bad.

http://groups.google.com/groups

Applying the Pumpking Patch

    $ patch RESPONSIBLE_PARTIES 
    6c6
    < Release pumpking              Steve Fink
    ---
    > Release pumpking              Leopold Toetsch

In other words, Steve Fink has stepped down from the role of Parrot Pumpking and handed his mantle on to Leo "Pumpking Patchmonster" Tötsch. I'm sure I'm not alone in wishing to thank Steve for his sterling work as our release manager. Nor, I'm sure, am I alone in wondering where Leo finds the time.

http://groups.google.com/groups

Bounds checking in extension API

Simon Glover wanted to know what sort of bounds checking was/will be done by the extension API's register access functions. At the time of his writing they didn't do any, which meant you could trivially cause a buffer overflow in your extension by accessing an illegal register number, like 42. Dan said that what needed to happen is for the access functions to do bounds checking and to throw an exception on an illegal register number. Which wasn't quite as straightforward as he thought.

http://groups.google.com/groups

Meanwhile, in perl6-language-subs

What's that you say? You've never heard of perl6-language-subs? Shame on you. Actually, I'd forgotten about perl6-language-subs but, judging by the messages I've seen on the list since I started writing the summaries, it would appear that it's a mailing list where students can ask the Perl 6 community to do their computer science homework for them. I don't expect to be covering this list any further.

Meanwhile, in perl6-language

The Wall Returns

Larry returned from what sounds like a horrible couple of months in hospital and answered Luke Palmer's awkward question about block returns from a couple of weeks ago.

http://groups.google.com/groups

Acknowledgements, Announcements, Apologies

Tuesday is no longer the new Monday. Wednesday is the new new Monday.

If you found this summary valuable, you might like to consider one or more of:

A Chromosome at a Time with Perl, Part 2

James D. Tisdall is the author of the recently released Mastering Perl for Bioinformatics.

In my previous article, A Chromosome at a Time with Perl, Part I, I showed you some programming "tricks" that help you avoid the trap of using up all your main memory when coding for very long strings, such as chromosomes and entire genomes.

The basic approach involved improving your code's running time by limiting the amount of memory space the program uses. The tricks discussed were calling subroutines with references as arguments, and searching for a pattern in a very large file by keeping only a small "window" of the file in memory at any one time, in a buffer.

This article will continue that discussion. I'll show you more about how references can greatly speed up a subroutine call by avoiding making copies of very large strings. I'll show you how you can bypass the overhead of subroutine calls entirely. I'll extend the previous example of a buffer window into a large file, making it suited to any situation where you know the minimum and maximum length of a pattern for which you're searching. And I'll show you how to quantify the behavior of your code by measuring its speed and space usage.

Why Space Puts a Lower Bound on Time

In Perl, as in any programming system, the size of the data that the program uses is an absolute lower bound on how fast the program can perform.

In fact, algorithms are typically classified by how fast they perform on inputs of varying sizes, by giving their speed as a function of the size of the input. So a program that has to do 2^n computations on an input of size n is a hell of a lot slower than a program that has to do n^2 computations on an input of size n. The first is called intractable and exponential, or "bad"; the second is called tractable and polynomial, or "good." For instance, if n, the size of the input, is 100, then n^2 is 10,000, while 2^n is bigger than the number of stars in the universe. But who's counting? And is the universe really finite? Oh well ... back to your regularly scheduled program.

This way of measuring an algorithm is called time complexity. It's usually written in a shorthand called big Oh notation. (See the Suggested Reading at the end of this article, if you get that far.)

In particular, if an algorithm gets an input of size n, and then just to write the answer it must write an output of size 2^n, then the algorithm is taking 2^n time, at least. So the space that an algorithm uses is intimately connected to the time it uses. Of course, a program could use just a small, constant amount of space and still use 2^n time, for instance if it added and subtracted the number one over and over, 2^n times, for some perverse reason. Still, the amount of space that an algorithm uses establishes a lower bound for how much time the algorithm takes to complete.

What's all this got to do with Perl programming in bioinformatics? Quite a lot, if you're writing code that manipulates, say, the 3 gigabases of human DNA.

If you're new to the field, a base is one of the letters A, C, G, or T that represents one of the four molecules that are the principal building blocks of DNA. Each base is typically represented in a computer language as one ASCII character taking one 8-bit byte, so 3 gigabases equals 3 gigabytes. Of course, you could represent each of the four bases using only 2 bits, so considerable compression is possible; but such space efficiency is not commonly employed. Hmmm ... makes an interesting idea for another article, however! "Perl and a Chromosome, Two Bits." Watch this space.

Just to read in the 3 gigabytes of DNA is going to take you some time. If you're also making copies of the 3 gigabytes in your variables, you're going to need more main memory than most computers have available. So the crafty Perl programmer needs to think of programming solutions that minimize the amount of space used when computing with such large input. If she does, not only will she have a program that fits into her computer's memory (always a wise move); she may also have a program that runs pretty fast, if she does say so herself, with all due humility.

Subroutines Without Extra Space

In Part I, I briefly discussed how passing references to subroutines can save you considerable space. I'll just expand a little on that discussion in this section. There are three main ways that references can be used to save space and therefore time in the subroutines of your Perl program.

One: Collect Data in a Subroutine Argument

First, let's say you call a subroutine to get some data. Typically this takes a form such as this:


my $chromosome1 = get_chromosome( 1 );

Assuming that the data is about 100 megabases long, the Perl programmer can see that the subroutine "get_chromosome" is collecting 100 megabases of DNA and then "returning" it, which means that copies of those 100 megabases are being made. And that's a Bad Thing.

Instead, the wily hacker could pass a reference to the $chromosome1 string into the subroutine, which could be written to "fill" the string with the 100 megabases without the need for further copying. Then after the subroutine call, say like so:


get_chromosome(1, \$chromosome1);

the variable $chromosome1 would contain the same data as in the previous version, but it would have gotten there without being copied by means of being "returned" from a subroutine. And that's a Good Thing. The only gotcha here is that the subroutine call is definitely changing what's in the $chromosome1 variable. No problem as long as you remember that's what's happening.
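
For instance, a get_chromosome written in this style might look something like the following sketch (the chr1.dna filename and FASTA layout are just assumptions for the sake of illustration):

sub get_chromosome {
    my($number, $dna_ref) = @_;

    # Append each line of sequence directly onto the caller's string,
    # through the reference, so the chromosome is never copied.
    open(CHROM, "<chr$number.dna") or die "Cannot open chr$number.dna: $!\n";
    while (my $line = <CHROM>) {
        next if $line =~ /^>/;      # skip any FASTA header lines
        chomp $line;
        $$dna_ref .= $line;
    }
    close CHROM;
}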

Two: Pass the Subroutine a Reference to the Data

Here's a second way to use references as subroutine arguments to good advantage. Let's say you have a chromosome's worth of DNA in a variable $chromosome1, and you want to pass it to a subroutine that counts how many As, Cs, Gs, and Ts there are in it. (This might be important if you were looking for genes in the chromosome, for instance -- in humans, genes tend to be GC rich.)

If you write your code like this:


my($a, $c, $g, $t) = countacgt( $chromosome1 );

then the "countacgt" subroutine is going to make a copy of the data in the argument $chromosome1. That's a Regrettable Occurrence.

On the other hand, if you pass the subroutine a reference to the data in $chromosome1, the data will not be copied, and that's a Fortunate Happenstance:


my($a, $c, $g, $t) = countacgt( \$chromosome1 );

However, once again you'll have to be aware that the subroutine has the power to change what's in the $chromosome1 variable, and either avoid changing it or take note of the change.

As another alternative, you could use


my($a, $c, $g, $t) = countacgt( $chromosome1 );

but then, within the countacgt subroutine, don't copy the argument into a new variable like so:


my($chrom) = @_;

Instead, just use $_[0] to access the chromosome data without copying it. And that's a Perl Idiom. (For readability, you may want to add a comment explaining what kind of data $_[0] is, since you won't have an informative variable name.)
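
As a sketch, a countacgt written in that style could be as simple as this; tr/// with an empty replacement list just counts, and never modifies $_[0]:

sub countacgt {
    # Work on $_[0] directly: the chromosome is never copied.
    my $a = ($_[0] =~ tr/Aa//);
    my $c = ($_[0] =~ tr/Cc//);
    my $g = ($_[0] =~ tr/Gg//);
    my $t = ($_[0] =~ tr/Tt//);
    return ($a, $c, $g, $t);
}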

Three: Return a Reference to the Data

Now a third and final way to use references instead of space: if you have a subroutine that collects a large amount of data, you can have it return a reference to the data instead of the data itself; this will also avoid the large string copies, which Ain't Bad:


my $chromosome1ref = get_chromosome( 1 );
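
A sketch of a get_chromosome written this way (again, the filename is made up); note that the caller now has to dereference the result:

sub get_chromosome {
    my($number) = @_;
    my $dna = '';
    open(CHROM, "<chr$number.dna") or die "Cannot open chr$number.dna: $!\n";
    while (my $line = <CHROM>) {
        next if $line =~ /^>/;
        chomp $line;
        $dna .= $line;
    }
    close CHROM;
    return \$dna;    # only the reference is copied back to the caller
}

print "Chromosome 1 is ", length($$chromosome1ref), " bases long\n";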

Eliminating Subroutines Altogether

Organizing the code for a computation into a logical set of subroutines can make for clean, easy-to-understand, and elegant programming.

Unfortunately, it can also make your program perform much slower. Take this small example. (An exon is a stretch of a chromosome's DNA, transcribed into RNA, that contains part of the code for a gene. In organisms such as humans, most genes are composed of multiple exons separated by non-coding introns; the exons are spliced together to get the actual code for the gene):


while ((my $begin, my $end) =  each %exon_endpoints) {
    print get_exon($chromosome, $begin, $end), "\n\n";
}

sub get_exon {
    my($chromosome, $begin, $end) = @_;

    # The arguments to substr are: string, beginning, length
    return substr($chromosome, $begin - 1, $end - $begin + 1);
}

This code takes the information about exon endpoints stored in the hash %exon_endpoints (key = begin, value = end) to extract the exons from a chromosome and print them. (You may remember from Part I that I translated between the Perl idea of location, where the first location of a string is position 0, and the biologist's idea of location, where the first location is position 1.) The code is short, to the point, and gets the job done. Unfortunately, it also makes as many copies of the entire chromosome as there are exons to print out. Ouch.

In such circumstances, you can save a boatload of pain by eliminating the subroutine entirely, like so:


while ((my $begin, my $end) =  each %exon_endpoints) {
    print substr($chromosome, $begin - 1, $end - $begin + 1), "\n\n";
}

The bad news: now the details of how to extract the desired exon from the chromosome are right in the loop, instead of being nicely tucked away in the subroutine get_exon. The good news: the program will finish running before the weekend.

Sequence Motifs with Bounded Lengths

In Part I, I showed you how to search for a small pattern in a very large file of DNA data (in FASTA format) by only keeping at most two lines of the file in the program's memory at any one time.

Here is some code that generalizes that approach. It is more general because it allows you to declare the maximum and minimum pattern sizes. It uses no more than a certain maximum amount of main memory for the data at any one time. For instance, if you're looking for a pattern and you know that any match must be between 300 and 2,000 bases long, you can use this subroutine to search any amount of DNA while keeping the amount of main memory used for the DNA to within about 4,000 bytes, twice the maximum pattern size. Only matching patterns between 300 and 2,000 bases long will be reported.


#!/usr/bin/perl
#
# find_size_bounded_pattern : find a pattern known to be between a min and max length
#   Keep small size of memory but handle arbitrarily large input files
#
#  Copyright (c) 2003 James Tisdall
#

use warnings;
use strict;
use Carp;

my $pattern  = "GGTGGAC[GT].{50,1500}[AC][GT][CG]ATAT";
my $filename = "Fly.dna";
my $min      = 65;
my $max      = 1515;

my @locations = find_size_bounded_pattern($pattern, $filename, $min, $max);

print "@locations\n";

exit;

### End of main program
##################################################

##################################################
### Subroutines:

sub find_size_bounded_pattern {

    ################# Arguments:
    # $pattern   - the pattern to search for, as a regular expression
    # $filename  - the name of the DNA fasta file (may have multiple records)
    # $min       - the shortest length of a usable match to the pattern    
    # $max       - the longest length of a usable match to the pattern    
    #################
    my($pattern, $filename, $min, $max) = @_;

    ################# Other variables:
    # $buffer    - a buffer to store the DNA text, usually (2 * $max) in length
    # $position  - the position of the beginning of the buffer in the DNA
    # @locations - the locations where the pattern was found, to be returned
    #              @locations also includes headers for each fasta record
    # $header    - the one-line fasta header for each record in a fasta file
    my $buffer = '';
    my $position = 0;
    my @locations = ();
    my $header = '';
    #################
    
    # Open the DNA file
    open(DNA,"<$filename") or croak("Cannot open $filename:$!\n");

    # Get the input lines and compute
    while(my $newline = <DNA> ) {

        # If new fasta header, reinitialize buffer and location counter
        if($newline =~ /^>/) {
            # Complete previous search in buffer which contains end of fasta record
            while($buffer =~ /$pattern/gi) {
                if($-[0] <= length($buffer) - $min) {
                    unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
                        push(@locations, $position + $-[0] + 1);
                    }
                }
            }
            # Reset $header, $buffer, $position for new fasta record
            $header = $newline;
            push(@locations, "\n$header");
            $buffer = '';
            $position = 0;

            # Get new line from file
            next;
        }

        # If new line of DNA data

        # Add the new line to the buffer
        chomp $newline;
        $buffer .= $newline;

        if(length($buffer) < (2 * $max) ) {
            next;
        }
    
        # Search for the DNA pattern
        # (Report the character at position 0 as at position 1, as usual in biology)
        while($buffer =~ /$pattern/gi) {
            if($-[0] < $max) {
                unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
                    push(@locations, $position + $-[0] + 1);
                }
            }else{
                last;
            }
        }
    
        # Reset the position counter
        # (will be accurate after you reset the buffer, next)
        $position = $position + $max;
    
        # Reset the buffer
        # Discard the first $max worth of data in the buffer
        $buffer = substr($buffer, $max);
    }

    # Complete search in final buffer
    while($buffer =~ /$pattern/gi) {
        if($-[0] <= length($buffer) - $min) {
            unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
                push(@locations, $position + $-[0] + 1);
            }
        }
    }

    # Computation complete
    return @locations;
}

How the Code Works

This code gets its DNA from a FASTA-formatted file (FASTA is the most common format for a file of DNA sequence data). It would be fairly easy to rewrite this so that you could give multiple FASTA filenames on a command line and all the files would be processed by this code. As it is, it can handle a single FASTA file that contains multiple FASTA records.

The subroutine find_size_bounded_pattern returns an array of all the locations in the DNA that contain the pattern. Since the input may have several FASTA records, the one-line header of each record is also returned, to help identify the start of each new record. For instance, I tested this program on a file, Fly.dna, that contains all the chromosomes of the fruit fly, Drosophila melanogaster. In this file, each new chromosome begins with a new FASTA header, which is added to the returned array. The locations reported start afresh from 1 for each chromosome.

The pattern to be searched for is only reported if it's between a certain minimum and maximum length. Twice the maximum desired pattern length (plus the length of an input line) is the limit of the amount of DNA data that is read into the program's buffer. That way you can search a $max worth of DNA for the beginning locations of patterns that may be up to $max long.

The overall structure of this code is pretty simple, and the comments in the code should do most of the explaining. There are two situations dealt with in the loop as it reads input lines. First is when there is a new FASTA header. Then you have to finish searching in the buffer, and reset the variables to begin a search in a new sequence of DNA from a new FASTA record. Second is when there is a new line of DNA. And finally, after all the lines have been read and you exit the loop, there may still be some unsearched DNA in the buffer, so the subroutine ends by searching the DNA remaining in the last buffer.

In this code, the devil is in the details of how the specific locations and sizes are set. The intermediate level Perl programmer should be able to puzzle this out given the comments in the code. Note that after a successful pattern match the builtin variable $-[0] has the offset of the beginning of the match, and $+[0] has the offset of the end of the match. This avoids the use of the special variable $&, the use of which causes all manner of space to be used to hold this and several other special variables. But if your regular expression has any parentheses, that's enough to make the special variables and their considerable space get used too. Of course, regular expressions have their own rules of behavior, such as greedy matching and so forth, that are not addressed by this code. (Could you modify this program to find patterns that overlap each other? What happens if $max is less than the input line size? What other assumptions are made by this code?)

Profiling the Speed of your Perl Program

You can profile the speed of your Perl program fairly easily. Let's say I put the program in a file called sizebound.pl. Then I can get a report on the time the various parts of the program require by running the program like this:


[tisdall@coltrane]$ perl -d:DProf sizebound.pl 

And then getting a summary of the report (from the file tmon.out that DProf creates) like so:


[tisdall@coltrane]$ dprofpp
Total Elapsed Time =  95.1899 Seconds
  User+System Time =  94.9099 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 99.9   94.87 94.870      1   94.870 94.870  main::find_size_bounded_pattern
 0.02   0.020  0.020      3   0.0067 0.0067  main::BEGIN
 0.00   0.000 -0.000      1   0.0000      -  warnings::BEGIN
 0.00   0.000 -0.000      2   0.0000      -  Exporter::import
 0.00   0.000 -0.000      1   0.0000      -  warnings::import
 0.00   0.000 -0.000      1   0.0000      -  strict::import
 0.00   0.000 -0.000      1   0.0000      -  strict::bits

When you have lots of subroutines, this can really help you see where the most time is being spent. Here, I'm really just getting the information that the program took about a minute and a half to look for the pattern in the DNA of the fruit fly.

It's also possible to get information about the space usage of a program, but you have to use a version of Perl that was compiled with -DDEBUG, which is not usually the case. If you have such a version, then the following will give you some information:


[tisdall@coltrane]$ perl -DL sizebound.pl 

But that's enough for here and now; take a look at the Perl documentation section called perldebguts. And drive safely.

Suggested Reading

Here are some of the many books that you might find useful. I cut my teeth on the Bentley books, but the older ones are hard to find.

  • Introduction to Algorithms, Second Edition, by Cormen et al., MIT Press, 2001.
  • Writing Efficient Programs, by Jon Bentley, Prentice Hall, 1982.
  • Refactoring: Improving the Design of Existing Code, by Fowler et al., Addison-Wesley, 1999.
  • Programming Pearls, Second Edition, by Jon Bentley, Prentice Hall, 1999.
  • More Programming Pearls: Confessions of a Coder, by Jon Bentley, Pearson Education, 1988.

O'Reilly & Associates recently released (September 2003) Mastering Perl for Bioinformatics.

This week on Perl 6, week ending 2003-10-12

Good afternoon readers. You find me sitting comfortably and tired after a vaguely frantic week involving large amounts of new (and huge) equipment, the delivery of a new Mini Cooper, and four days offline at a large format photography workshop (photos will be going online soonish, if you're interested). All of which hopefully goes some way to explaining the fact that I've only just started writing the summary at ten to four on Tuesday.

We start (and finish) with the internals list.

New PMC compiler

Leo Tötsch isn't exactly happy with the current classes/pmc2c.pl PMC compiler. He outlined the drawbacks as he sees them, proposed a scheme for reimplementing it, and asked for comments. Dan said that so long as the resulting code is clearer than the current PMC compiler, Leo should go for it. So he did.

http://groups.google.com/groups

The Status of Language, Credit Where It's Due

Melvin Smith has added a LANGUAGES.STATUS file in the languages/ subdirectory and asked for language authors to add summaries of the various language subdirectories to the file.

Fresh from this success he added a CREDITS file for giving credit where it's due. Rumours of an entry reading


    N: Leon Brocard
    D: Running Joke

are (currently) false.

Later in the week, LANGUAGES.STATUS was extended to cover not only the languages found in languages/ but also any other languages that people were working on and making available outside the Parrot distribution for whatever reason.

http://groups.google.com/groups

http://groups.google.com/groups

Binary MMD vtable functions are go

Dan's solved an off-by-one error with multimethod dispatch and checked everything in, so now Parrot has working two-argument multimethod dispatch. Hurrah! There's still lots it doesn't do, mind, but getting started is always good.

http://groups.google.com/groups

Attacking the Unicode Monster

Dan's searching for a Unicode Monster Wrangler. ICU is now building on enough platforms that the time has come to build an encoding and chartype library for it. Volunteers gratefully received.

Jeff Clites and Michael Scott have both been poking at it a bit; hopefully they'll be able to put their heads together and emerge with something wonderful.

http://groups.google.com/groups

More NCI stuff

Dan's been a busy boy this week. Parrot now comes with library/postgres.pasm, an interface to libpq (the C library that talks to Postgres). These NCI interface files are currently built by hand. Tim Bunce wondered if it might be possible to use ExtUtils::XSBuilder to generate PASM from C header files instead of XS code. It was generally agreed that this would be a cool thing, but it's not been done yet.

http://groups.google.com/groups

http://www.postgresql.org/

The Parrot Stack and Z-Machine

Amir Karger is working on getting his head 'round the Z-Machine (the virtual machine that runs Zork and the other Infocom games, amongst others) by writing code to translate Z-code into a Perl executable. He ran into a problem with saving and restoring the stack in order to save the game state.

http://groups.google.com/groups

References

Leo's been giving the Reference PMC a hard look and, as a result, suggests adding a couple of handy ops, deref and ref (which finds the type of the thing the reference refers to) to use with them. Melvin Smith pointed out that, actually we only needed one of Leo's proposed ops. On the other hand, as Nicholas Clark pointed out, ref avoids the need to grab another PMC register, which may cause a register spill. On the gripping hand, Melvin argued that spillage is pretty rare with the number of registers Parrot has.

http://groups.google.com/groups

The program's ending! DESTROY everything!

Jürgen Bömmels found a subtle bug where objects that had active destruction weren't having their destroy functions called correctly (at all) when Parrot terminated. After a couple of goes round the "this would be so much easier if we were all in the same room" misunderstanding loop, the Right Thing was settled upon and implemented.

http://groups.google.com/groups

An interesting task for the evil

One of the things that Dan would really like to be able to promise with Parrot's GC is ordered destruction. Taking a leaf out of my lightning talk on complexity management, he noted that "we just need to order the PMCs before destroying them, to make sure we don't call the destructor for a PMC before we destroy the things that depend on it" and asked for volunteers to implement it.

Dave Mitchell spoilt the understatement party by pointing out that, in the presence of circular dependencies, things got a little more complicated (but not that much more complicated, Dan claims).

Nobody's actually volunteered yet.

http://groups.google.com/groups

Parrot gets pioctl

Melvin Smith added a pioctl op for general purpose IO manipulation in the style of the UNIX ioctl syscall. Dan reckoned it was almost certainly time to start thinking about a better API (though we're not sure if that's because he doesn't like ioctl as a name). Steve Fink suggested using keyed access to the IO PMC, so you could do

   set IO, P0[.BUFSIZE]
   set P0[.BUFSIZE], 8192

Steve later backed away from this suggestion, suggesting that maybe get/setprop fit the semantics better, but Leo thought that 'fundamentals' like buffer size belonged as keyed access, with properties being reserved for setting up things like encoding layers.

http://groups.google.com/groups

PCC and IMC

Gregor N. Purdy suggested that, instead of going through the hassle of explicitly setting up a function call under the Parrot calling conventions, one could call the function in the style of a macro call. Leo wasn't keen; IMCC is still an assembler after all (just a very sophisticated one). It looks like, unless Dan rules otherwise, we're going to stick with the current explicit method of function invocation setup, and human Parrot programmers will just have to set up convenience macros explicitly.

http://groups.google.com/groups

Meanwhile in perl6-language

Acknowledgements, Announcements, Apologies, Alliteration

I'm really sorry I used the '...' joke about perl6-language again. Write something, people!

Tuesday continues to be the new Monday, and will probably remain so for a while.

http://www.bofh.org.uk:8080/ will hopefully be hosting my photos from the weekend's workshop.

If you found this summary valuable, you might like to consider one or more of the following suggestions:

  • Get involved. There are at least two calls for volunteers related to Parrot out at the moment. Okay, they're not necessarily easy, but if you turn up on the lists (check http://dev.perl.org/perl6/ and http://www.parrotcode.org/ for subscription details) and offer to help, I'm sure we'll find something that needs doing.
  • Support Larry, Dan, Damian and company by donating to the Perl Foundation at http://donate.perl-foundation.org/. Fun as hacking Perl 6 no doubt is, it's very easy to get distracted from it by the pressing need to feed yourself.
  • Boost my ego. Praise my photos/this summary/my lackadaisical weblog. Suggest improvements, point out my grammatical howlers, hire me to do cool stuff, press gifts upon me. Contact me at p6summarizer@bofh.org.uk; unless you're a spamming scumbag, I can pretty much promise you a prompt response. Assuming I'm not out hooning 'round the countryside in my new car...

A Refactoring Example

About a year ago, a person asked the Fun With Perl mailing list about some code they had written to do database queries. It's important to note that this person was posting from an .it address; why that matters will become apparent later. The code was reading records in from a text file and then doing a series of queries based on that information. They wanted to make it faster.

Here's the code:

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
$nump++;
chop($riga);
$pagina[$nump] = $riga;

$sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
pageid='$pagina[$nump]' and data>='$startdate'");
$sth->execute;
$totalvisit[$nump] = $sth->fetchrow_array();

$sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
(pageid='$pagina[$nump]' and data='$dataoggi')");
$sth->execute;
$totalvisittoday[$nump] = $sth->fetchrow_array();

 $sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
(pageid='$pagina[$nump]' and data='$dataieri')");
$sth->execute;
$totalyvisit[$nump] = $sth->fetchrow_array();

 $sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
(pageid='$pagina[$nump]' and data<='$fine30gg' and data>='$inizio30gg')");
$sth->execute;
$totalmvisit[$nump] = $sth->fetchrow_array();

 }

I decided that rather than try to read through this code and figure out what it's doing and how to make it faster, I'd clean it up first. Clean it up before you figure out how it works? Yes, using a technique called Refactoring.

Refactoring?

In his book, Martin Fowler defines Refactoring as "the process of changing a software system in such a way that it does not alter the external behavior of the code yet improves its internal structure." In other words, you clean up your code but don't change what it does.

Refactoring can be as simple as changing this code:

$exclamation = 'I like '.$pastry.'!';

To this:

$exclamation = "I like $pastry!";

Still does the same thing, but it's easier to read.

It's important to note that I don't need to know anything about the contents of $pastry or how $exclamation is used. The change is completely self-contained and does not affect surrounding code or change what it does. This is Refactoring.

On the principle of "show me, don't tell me," rather than talk about it, we'll dive right into refactoring our bit of code.

Fix the Indentation

Your first impulse when faced with a hunk of code slammed against the left margin is to indent it. This is our first refactoring.

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
                         pageid='$pagina[$nump]' and data>='$startdate'");
    $sth->execute;
    $totalvisit[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
                          (pageid='$pagina[$nump]' and data='$dataoggi')");
    $sth->execute;
    $totalvisittoday[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
                          (pageid='$pagina[$nump]' and data='$dataieri')");
    $sth->execute;
    $totalyvisit[$nump] = $sth->fetchrow_array();

    $sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
                         (pageid='$pagina[$nump]' and data<='$fine30gg' 
						 and data>='$inizio30gg')");
    $sth->execute;
    $totalmvisit[$nump] = $sth->fetchrow_array();
}

close (INPUT);

Already it looks better. We can see that we're iterating over a file, performing some SELECTs on each line and shoving the results into a bunch of arrays.

A Single, Simple Change

One of the most important principles of Refactoring is that you work in small steps. This re-indentation is a single step. And part of this single step includes running the test suite, logging the change, and checking it into CVS.

Checking into CVS after something this simple? Yes. Many programmers ask the question, "When should I check in?" When you're refactoring it's simple: check in when you've done one refactoring and have tested that it works. Our re-indentation is one thing; we test that it works and check it in.

This may seem excessive, but it prevents us from entangling two unrelated changes. By doing one change at a time we know that any new bugs were introduced by that one change. Also, you will often decide in the middle of a refactoring that it's not such a good idea. When you've checked in after every one, you can simply roll back to the last version rather than having to undo it by hand. Convenient, and you're sure no stray bits of your aborted change are hanging around.

So our procedure for doing a proper refactoring is:

  • Make one logical change to the code.
  • Make sure it passes tests.
  • Log and check in.

Big Refactorings from Small

The goal of this refactoring is to make the code go faster. One of the simplest ways to achieve that is to pull work that doesn't need repeating out of the loop. Preparing four new statements in every iteration of the loop seems really unnecessary. We'd like to pull those prepare() statements out of the loop. This is a refactoring. To achieve this larger refactoring, a series of smaller refactorings must be done.

Use Bind Variables

Each time through the loop, a new set of SQL statements is created based on the line read in. But they're all basically the same, just the data is changing. If we could pull that data out of the statement we'd be closer to our goal of pulling the prepare()s out of the loop.

So my next refactoring pulls variables out of the SQL statements and replaces them with placeholders. Then the data is bound to the statement using bind variables. This means we're now prepare()ing the same statements every time through the loop.

Before refactoring:

$sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
                     pageid='$pagina[$nump]' and data>='$startdate'");
$sth->execute;

After refactoring:

$sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                     pageid=? and data>=?');
$sth->execute($pagina[$nump], $startdate);

Bind variables also prevent a naughty user from trying to slip some extra SQL into your program via the data you read in. As a side-effect of our code cleanup, we've closed a potential security hole.
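
To see what that hole looked like, here's a minimal sketch; the hostile "page ID" is made up, and the table and dates mirror the example above:

#!/usr/bin/perl -w
use strict;

# A made-up hostile "page ID", as it might appear on one line of the input file.
my $page_id    = "x' OR '1'='1";
my $start_date = '2003-01-01';

# Old style: the value is interpolated straight into the SQL text,
# so the WHERE clause now matches every page, not just the one on that line.
print "SELECT count(*) FROM lognew WHERE
       pageid='$page_id' and data>='$start_date'\n";

# With placeholders, the driver passes the whole line as a single literal
# value, so it can't change the shape of the query:
#   $sth->execute($page_id, $start_date);

Here's the whole loop with all four statements converted: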

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                         pageid=? and data>=?');
    $sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                          (pageid=? and data=?)');
    $sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                          (pageid=? and data=?)');
    $sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $sth->fetchrow_array();

    $sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE
                         (pageid=? and data<=? and data>=?)');
    $sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Split a Poorly Reused Variable

The next stumbling block to pulling the prepare() statements out of the loop is that they all use the same variable, $sth. We'll have to change it so they all use different variables. While we're at it, we'll name those statement handles something more descriptive of what the statement does. Since at this point we haven't figured out what the statements do, we can base the name on the array it gets assigned to.

While we're at it, throw in some my() declarations to limit the scope of these variables to just the loop.

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    my $totalvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                                        pageid=? and data>=?');
    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    my $totalvisittoday_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                             (pageid=? and data=?)');
    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    my $totalyvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                         (pageid=? and data=?)');
    $totalyvisit_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalyvisit_sth->fetchrow_array();

    my $totalmvisit_sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                        (pageid=? and data<=? and data>=?)');
    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Getting Better All the Time

The new names are better, but they're not great. This is OK. Naming is something people often get hung up on. You can spend hours racking your brains thinking of the perfect name for a variable or a function. If you can think of a better one than what's there right now, use it. The beauty of Refactoring is that you can always improve upon it later.

This is an important lesson of Refactoring. Voltaire said, "the best is the enemy of the good." We often get so wound up trying to make code great that we fail to improve it at all. In refactoring, it's not so important to make your code great in one leap, just a little better all the time (it's a little-known fact that John Lennon was into Refactoring). These small improvements will build up into a clean piece of code, with fewer bugs, more surely than a large-scale code cleanup would.

Pull Code Out of the Loop

Now it's a simple cut and paste to pull the four prepare() statements out of the loop.

my $totalvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                                    pageid=? and data>=?');

my $totalvisittoday_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                         (pageid=? and data=?)');

my $totalyvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                     (pageid=? and data=?)');

my $totalmvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                     (pageid=? and data<=? and data>=?)');

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalyvisit_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalyvisit_sth->fetchrow_array();

    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Already the code is looking better. With the SQL separated, the inner workings of the loop are much less daunting.

Test. Log. Check in.

A Place to Stop

Remember our goal, to make this code run faster. By pulling the prepare() statements outside the loop we've likely achieved this goal. Additionally, it still does exactly what it did before even though we still don't fully understand what that is. If this were a real project, you'd do some benchmarking to see if the code is fast enough and move on to another task.
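
If you wanted numbers to back that up, the standard Benchmark module is enough. Here's a rough sketch; run_old() and run_new() are hypothetical subs that wrap the two versions of the loop against the same input file and database:

use Benchmark qw(timethese);

sub run_old { }    # stand-in for the loop with prepare() inside it
sub run_new { }    # stand-in for the loop with prepare() hoisted out

timethese( 10, {
    'prepare inside loop'  => \&run_old,
    'prepare outside loop' => \&run_new,
});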

Since this is an example, I'll continue with more refactorings with the goal of clarifying the code further and figuring out what it does.

Keep in mind that after every refactoring the code still does exactly what it did before. This means we can choose to stop after any refactoring. If a more pressing task suddenly pops up, we can pause our refactoring work and attend to it, confident that we didn't leave any broken code lying around.

Reformat SQL for Better Readability

In order to make sense of the code, we have to make sense of the SQL. The simplest way to better understand the SQL is to put it into a clearer format.

The three major parts of an SQL SELECT statement are:

  • The columns (i.e., SELECT count(*))
  • The table (i.e., FROM lognew)
  • The predicate (i.e., WHERE pageid = ...)

I've chosen a new format that highlights these parts.

I've also removed some unnecessary parentheses because they just serve to clutter things up rather than disambiguate an expression.

I've also decided to change the quoting style from single quotes to a here-doc. It would have also been okay to use q{}.

my $totalvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $totalvisittoday_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalyvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalmvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalyvisit_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalyvisit_sth->fetchrow_array();

    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Remove Redundancy

With the SQL in a more readable format, some commonalities become clear.

  • All the statements are doing a count(*).
  • They're all using the lognew table.
  • They're all looking for a certain pageid.

In fact, $totalvisittoday_sth and $totalyvisit_sth are exactly the same! Let's eliminate one of them; it doesn't matter which, since we're going to rename them in a moment anyway. $totalyvisit_sth goes away, and we make sure to change all references to it to $totalvisittoday_sth.

my $totalvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $totalvisittoday_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalmvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Fix Conflicting Styles

Now the only difference between the statements is the choice of data ranges.

Using the variables that are passed into each statement, we can make some more deductions. Let's have a look...

  • $startdate
  • $dataoggi
  • $dataieri
  • $fine30gg, $inizio30gg

One of these things is not like the other. What's $startdate doing there? Everything else is talking about 'data'. What's 'ieri'? 'oggi'? Remember, the programmer who submitted this code is Italian. Maybe the names are in Italian. Grabbing an Italian-English dictionary, we find out that 'data' is Italian for 'date'! Now it makes sense: this code was probably originally written in English, then worked on by an Italian (or vice versa).

This code has committed a cardinal stylistic sin. It uses two different languages for naming variables. Not just different languages, languages which have different meanings for the same words. Taken out of context, we can't know if $data represents a hunk of facts or "Thursday."

Since the styles conflict, one of them has to go. Since I don't speak Italian, I'm going to translate it into English.

Pulling out our Italian-to-English dictionary...

  • "riga" is "line"
  • "pagina" is "page"
  • "nump" is probably short for "numero pagina" which is "page number"
  • "data" is "date"
  • "oggi" is "today"
  • "ieri" is "yesterday"
  • "inizio" is "start"
  • "fine" is "end"
  • "gg" is probably short for "giorni" which is "days"
    • "fine30gg" would then be "the end of 30 days"
    • "inizio30gg" would be "the beginning of 30 days"

It would be a straightforward matter of a bunch of search-and-replaces in any good editor but for one snag: the SQL column 'data'. We'd like to change this to the English 'date', but a database is about as global as data gets, with possibly lots of other programs using it, so we can't change the column name without breaking other code. While in a well-organized programming shop you might have the ability to find all the code that uses your database, we won't assume we have that luxury here. For the moment, then, we'll leave it be and deal with it in a separate refactoring.

my $totalvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $totalvisittoday_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalmvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($line=<INPUT>){
    $page_num++;
    chop($line);
    $pages[$page_num] = $line;

    $totalvisit_sth->execute($pages[$page_num], $start_date);
    $totalvisit[$page_num] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pages[$page_num], $today);
    $totalvisittoday[$page_num] = $totalvisittoday_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pages[$page_num], $yesterday);
    $totalyvisit[$page_num] = $totalvisittoday_sth->fetchrow_array();

    $totalmvisit_sth->execute($pages[$page_num], $end_of_30_days,
                                                 $start_of_30_days);
    $totalmvisit[$page_num] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Better Names

With decent variable names in place, the purpose of the program becomes much clearer. This is a program to calculate the number of visits to a page for various date ranges. Based on this new information we can give the statement handles and the arrays they put data into better names.

Looking at the SQL we see we've got:

  • One to get all the visits up to a single day.
  • One to get the visits for a certain date.
  • One to get the visits for a range of dates.

A good set of new names would be:

  • daily
  • up to
  • range

Also, Total Visits is too long. We could shorten that to just Visits, or even shorter to Hits.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($line=<INPUT>){
    $page_num++;
    chop($line);
    $pages[$page_num] = $line;

    $hits_upto_sth->execute($pages[$page_num], $start_date);
    $totalvisit[$page_num] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $today);
    $totalvisittoday[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $yesterday);
    $totalyvisit[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($pages[$page_num], $end_of_30_days,
                                                $start_of_30_days);
    $totalmvisit[$page_num] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Changing Global Variable Names

The array names need work, too. Currently, they're rather ambiguous. @totalyvisit, what does the y mean? Looking at each variable name and the variables that got passed to execute() to produce it...

  • @totalvisit is counted against $start_date, so that can be @hits_upto.
  • @totalvisittoday comes from $today and is pretty obvious: @hits_today.
  • @totalyvisit comes from $yesterday, so the 'y' must be for 'yesterday': @hits_yesterday.
  • @totalmvisit comes from the range between $start_of_30_days and $end_of_30_days, so the 'm' must be 'month': @hits_monthly.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($line=<INPUT>){
    $page_num++;
    chop($line);
    $pages[$page_num] = $line;

    $hits_upto_sth->execute($pages[$page_num], $start_date);
    $hits_upto[$page_num] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $today);
    $hits_today[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $yesterday);
    $hits_yesterday[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($pages[$page_num], $end_of_30_days,
                                                $start_of_30_days);
    $hits_monthly[$page_num] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test... uh-oh, test failed!

There's something very different about this change compared to the others. The variables we changed were not declared in our little code block. Likely they're used in other parts of the code, such as our test, which is why it broke.

In the Real World, we would be sure to replace all occurrences of the variable. The simplest way to do this is to have your editor perform a search and replace rather than doing it with your all-too-fallible hands. If the variable could be used across a set of files, you'd also need to grep through those files for all occurrences and change them as well.

# If you don't have rgrep, grep -r does the same thing.
rgrep '[@$]totalvisit' /path/to/your/project

I do this so often that I've taken to calling grep -r 'Refactoring Grep'. Other languages whose syntax is -- ummm -- not as inspired as Perl's, such as Java, C++ and Python, have tools for doing this sort of thing automatically. Because of the complexity of Perl's syntax, we still have to do it mostly by hand, though there are some efforts underway to rectify this.
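
For what it's worth, Perl itself makes a passable rename tool. Here's a sketch of the brute-force approach; the file names are hypothetical, and the .bak backups let you eyeball the damage before committing:

perl -pi.bak -e 's/\btotalvisit\b/hits_upto/g;
                 s/\btotalvisittoday\b/hits_today/g;
                 s/\btotalyvisit\b/hits_yesterday/g;
                 s/\btotalmvisit\b/hits_monthly/g' report.pl t/report.t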

Changing the array names in our test as well, we get it to pass.

Log. Check in.

Improve Overly Generic Names

Continuing with our variable name improvements, we're left with the last few unimproved names. Let's start with $line.

Since we can see clearly that $line = <INPUT>, calling the variable 'line' tells us nothing new. A better name might be what each line contains. Looking at how the line is used we see $pages[$page_num] = $line and how that is then used in the SQL. It's a page id.

But it doesn't make much sense to put a page ID into an array called @pages. It doesn't contain pages; it contains page IDs, so it becomes @page_ids.

What about $page_num? It doesn't contain a page number, it contains the line number of the file we're reading in. Or more conventionally, an $index or $idx.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chop($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits_today[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits_yesterday[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $end_of_30_days,
                                                   $start_of_30_days);
    $hits_monthly[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Fixing Odd Interfaces

What's wrong with this picture?

$hits_range_sth->execute($page_ids[$idx], $end_of_30_days,
                                               $start_of_30_days);

Isn't it a little odd to specify a date range with the end first? Sure is. It also guarantees someone is going to get it backwards. Reverse it. Don't forget the SQL, too.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chop($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits_today[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits_yesterday[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $start_of_30_days,
                                                   $end_of_30_days);
    $hits_monthly[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

s/chop/chomp/

Now that we've stared at the code for a while, you might have noticed the use of chop(). Using chop() to strip a newline is asking for portability problems, so let's fix it by using chomp().

Technically this isn't a refactoring since we altered the behavior of the code by fixing the bug. But using chop() where you meant chomp() is such a common mistake we'll make it an honorary refactoring.
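
The difference is easy to see in isolation; a minimal sketch, separate from our program:

#!/usr/bin/perl -w
use strict;

my $line1 = "page42\n";
my $line2 = "page42";        # e.g. the last line of a file with no newline

chop($line1);                # "page42"  - happens to be fine here...
chop($line2);                # "page4"   - ...but here it eats real data

($line1, $line2) = ("page42\n", "page42");

chomp($line1);               # "page42"
chomp($line2);               # "page42"  - left alone, which is what we want

print "[$line1] [$line2]\n";

And here's the loop using chomp():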

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits_today[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits_yesterday[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $start_of_30_days,
                                                   $end_of_30_days,);
    $hits_monthly[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Collect Related Variables into Hashes

The common prefix hits_ is a dead giveaway that much of the data in this code is related. Related variables should be grouped into a single structure, probably a hash, to make the relation obvious and to allow them to be passed around to subroutines more easily. It's easier to move around one hash than four arrays.

I've decided to collect all the @hits_ arrays into a single hash, %hits, since they'll probably be used together in other parts of the program. If this code snippet represents a function, it means I can return one hash reference rather than four array refs. It also makes future expansion easier: rather than returning an additional array, a new result simply becomes another key in the hash.

Before.

$hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

After.

$hits{upto}[$idx] = $hits_upto_sth->fetchrow_array();

It's interesting to note what a small, natural change this is. Circumstantial evidence that this is a good refactoring.
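
The pay-off mentioned above, handing back one hash reference instead of four array refs, looks something like this as a sketch (page_hits() is a hypothetical wrapper around the loop; the body is elided):

sub page_hits {
    my %hits;
    # ... the while loop from the listing below would fill %hits here ...
    return \%hits;
}

# Callers get one reference and pick results apart by name:
# my $hits = page_hits();
# print $hits->{today}[$idx], "\n";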

As before, since these arrays are global data, we must be sure to change them everywhere. This includes the tests.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits{upto}[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits{today}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits{yesterday}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $start_of_30_days,
                                                   $end_of_30_days,);
    $hits{monthly}[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

When Not to Refactor

The statement handles are also related, but I'm not going to collect them together into a hash. The statement handles are short-lived lexicals; they're never likely to be passed around. Their short scope and grouping within the code makes their relationship obvious. The design would not be improved by the refactoring.

Refactoring is not a set of rules to be slavishly followed, it's a collection of tools. And like any other tool you must carefully consider when and when not to use it. Since collecting the statement handles together doesn't improve the design, I won't do it.

Eliminate Unnecessary Longhand

Boy, we sure use $page_ids[$idx] a lot. It's the current page ID. But don't we have a variable for that?

Replace all the unnecessary array accesses and just use the more concise and descriptive $page_id.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $today);
    $hits{today}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $yesterday);
    $hits{yesterday}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_id, $start_of_30_days,
                                       $end_of_30_days,);
    $hits{monthly}[$idx] = $hits_range_sth->fetchrow_array();
}

Test. Log. Check in.

Rearrange Data Structures to Fit Their Use

Currently, %hits is indexed by the order in which the page IDs were read out of the file. Well, that doesn't seem very useful at all. Its only purpose seems to be listing the page counts in exactly the same order as you read them in. Even then you need to iterate through @page_ids simultaneously, because nowhere in %hits is the page ID stored.

Consider a common operation, looking up the hit counts for a given page ID. You have to iterate through the whole list of page IDs to do it.

foreach my $idx (0..$#page_ids) {
    if( $page_ids[$idx] eq $our_page_id ) {
        print "Hits for $our_page_id today: $hits{today}[$idx]\n";
        last;
    }
}

Cumbersome. A much better layout would be a hash keyed on the page ID.

$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

Now we can directly access the data for a given page ID. If necessary, we can still list the hits in the same order they were read in by iterating through @page_ids.
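
For example, as a quick sketch using the variables from the listing:

# Direct access for one page...
print "Hits for $our_page_id today: $hits{today}{$our_page_id}\n";

# ...and the original file order is still there in @page_ids.
foreach my $page_id (@page_ids) {
    print "$page_id: $hits{today}{$page_id}\n";
}

Here's the loop with the new layout: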

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $today);
    $hits{today}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $yesterday);
    $hits{yesterday}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_id, $start_of_30_days,
                                       $end_of_30_days,);
    $hits{monthly}{$page_id} = $hits_range_sth->fetchrow_array();
}

Test. Log. Check in.

Eliminate Unnecessary Variables

Now that %hits is no longer ordered by how it was read in, $idx isn't used much anymore. It's only used to stick $page_id onto the end of @page_ids, but we can do that with push.

This is minor but little things build up to cause big messes.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    chomp($page_id);
    push @page_ids, $page_id;

    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $today);
    $hits{today}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $yesterday);
    $hits{yesterday}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_id, $start_of_30_days,
                                       $end_of_30_days,);
    $hits{monthly}{$page_id} = $hits_range_sth->fetchrow_array();
}

Test. Log. Check in.

Pull Logical Chunks Out into Functions

Our final refactoring is one of the most common and most useful.

Let's assume that we need to generate page counts somewhere else in the code. Rather than repeat the code to do this, we want to put it in a subroutine so it can be reused. One subroutine for each statement.

In order to do this, start by identifying the code that would go into the routine.

$hits_upto_sth->execute($page_id, $start_date);
$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

Then wrap a subroutine around it.

sub hits_upto {
    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();
}

Now look at all the variables used.

$hits_upto_sth is a global (well, file-scoped lexical) and is defined entirely outside the function. We can keep using it in our subroutine in the same way we are now.

$hits{upto}{$page_id} is receiving the result of the calculation. It contains the return value. So it goes outside the function to receive the return value. Where its assignment was, we put a return.

sub hits_upto {
    $hits_upto_sth->execute($page_id, $start_date);
    return $hits_upto_sth->fetchrow_array();
}

$page_id and $start_date vary from call to call. These are our function arguments.

sub hits_upto {
    my($page_id, $start_date) = @_;
    $hits_upto_sth->execute($page_id, $start_date);
    return $hits_upto_sth->fetchrow_array();
}

Finally, rename things in a more generic manner. This is a subroutine for calculating the number of hits up to a certain date. Instead of $start_date which was specific to one calculation, we'd call it $date.

sub hits_upto {
    my($page_id, $date) = @_;
    $hits_upto_sth->execute($page_id, $date);
    return $hits_upto_sth->fetchrow_array();
}

And that's our new subroutine; it does the same thing as the original code. Then it's a simple matter to use it in the code.

    $hits{upto}{$page_id} = hits_upto($page_id, $start_date);


my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    chomp($page_id);
    push @page_ids, $page_id;

    $hits{upto}{$page_id}      = hits_upto($page_id, $start_date);
    $hits{today}{$page_id}     = hits_daily($page_id, $today);
    $hits{yesterday}{$page_id} = hits_daily($page_id, $yesterday);
    $hits{monthly}{$page_id}   = hits_range($page_id, $start_of_30_days,
                                                        $end_of_30_days,);
}

sub hits_upto {
    my($page_id, $date) = @_;
    $hits_upto_sth->execute($page_id, $date);
    return scalar $hits_upto_sth->fetchrow_array();
}

sub hits_daily {
    my($page_id, $date) = @_;
    $hits_daily_sth->execute($page_id, $date);
    return scalar $hits_daily_sth->fetchrow_array();
}

sub hits_range {
    my($page_id, $start, $end) = @_;
    $hits_range_sth->execute($page_id, $start, $end);
    return scalar $hits_range_sth->fetchrow_array();
}

Test. Log. Check in.

Undo.

Some may balk at putting so small a snippet of code into a subroutine like that. There are definite performance concerns about adding four subroutine calls to a loop. But I'm not worried about that at all.

One of the beauties of Refactoring is that it's reversible. Refactorings don't change how the program works. We can reverse any of these refactorings and the code will work exactly the same. If a refactoring turns out to be a bad idea, undo it. Logging each refactoring in version control makes the job even easier.

So if it turns out moving the executes into their own functions causes a performance problem the change can easily be undone.

Done?

At this point, things are looking pretty nice. The code is well structured, readable, and efficient. The variables are sensibly named. The data is organized in a fairly flexible manner.

It's good enough. This is not to say there's nothing more that could be done, but we don't need to do it. And Refactoring is about doing as much redesign as you need, not as much as you might need.

Refactoring and the Swiss Army Knife

As programmers we have a tendency towards over-design. We like to design our code to deal with any possible situation that might arise, since it's hard to change the design later. This is known as Big Design Up Front (BDUF). It's like one of those enormous Swiss Army Knives with 50 tools. Most of the time all you need is a knife, something to open your beer with, and maybe something to pick your teeth with afterwards, but you never know. So you over-engineer because it's hard to improve things later. If the extra flexibility never gets used, then a lot of effort has been wasted.

Refactoring turns design on its ear. Now you can continually evolve your design as needed. There's no longer a need to write for every possible situation up front so you can focus on just what you need right now. If you need more flexibility later, you can add that flexibility through refactoring. It's like having a Swiss Army knife that you can add tools to as you need them.

Further Reference

Identifying Audio Files with MusicBrainz

It's quite possible to end up with digital music files that don't have good information about what they are. Files that don't have ID3 information can rely on paths for album information, for example, and that is lost easily. M3U files describing track order can be deleted or ignored by naive archiving.

Wouldn't it be nice if, once you had a music file, you could use Perl to take what information you did have about a track, send it to the Internet, and get back the data you were missing? Well, you can.
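
The first step, gathering whatever the file already knows about itself, is easy enough. As a rough sketch using MP3::Info (a CPAN module this article doesn't otherwise use, so treat it as illustrative only):

  #!/usr/bin/perl -w
  use strict;

  use MP3::Info;

  # Pull whatever ID3 tag the file already carries; any of these
  # fields may well be empty, which is exactly the problem.
  my $file = shift or die "usage: $0 file.mp3\n";
  my $tag  = get_mp3tag($file);

  if ($tag) {
      print "artist: $tag->{ARTIST}\n";
      print "track:  $tag->{TITLE}\n";
      print "album:  $tag->{ALBUM}\n";
  }
  else {
      print "No ID3 tag at all; time to ask the Internet.\n";
  }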

A Step Through History

In the beginning (well, more or less), music was on CDs. People started listening to CDs on computers shortly after that, and they found that it would be nice to know the track's name, not just the number. Applications were developed that could store the CD metadata locally. Still, it was tedious to type in all those CD lists, so people shared the metadata in a single index file on the Internet.

As with most other single-file data stores in Internet history, soon it became sensible to turn this into a proper database. And so the CDDB was born. Clients could upload a description of the disc (the Table of Contents, which described how long each track is) and either download the information for that CD, or contribute it if it wasn't in the database.

During 1999 and 2000, however, the CDDB (after its acquisition by Gracenote) moved from an open position (with GPLed downloads of its data files) to a proprietary one. During this time it cut off access for clients speaking the first version of the CDDB protocol, moved to licensing CDDB2 clients -- at some cost -- and stopped offering downloads of its data.

However, a few projects started up, taking advantage of the data that had been freely available until this point. One of these was FreeDB, which quickly established an open replacement for the CDDB. Another is MusicBrainz, which is much more interesting.

FreeDB

FreeDB replicates the structure of the old CDDB very faithfully. This means that a number of Perl modules for handling CDDB data are applicable to the FreeDB as well.

However, despite the large number of FreeDB modules on CPAN, it's not really well suited to the task of finding or correcting digital music file metadata. FreeDB grew out of CDDB, which was designed around the task of identifying entire CDs, not merely single tracks, and that is still reflected in the way most of the modules work; they require you to either have or fake the CD's table of contents to get results.

FreeDB also has a search form on its web site, and there's a Perl module -- WebService::FreeDB -- that you can use to find out information on a per-track basis. However, wherever possible a web service is probably preferable to using a screen scraper, and thankfully such a service is available.

MusicBrainz

MusicBrainz has similar origins to FreeDB in the post-Gracenote era. Unlike FreeDB, MB was much more ambitious; as the description says, "MusicBrainz is a community music metadatabase that attempts to create a comprehensive music information site."

In addition to taking the FreeDB data and making it available (in fact, the FreeDB changes appear to be regularly merged into MusicBrainz), MB takes care to make sure that its data is moderated regularly. FreeDB's discid-based system didn't always make sure that different versions of CDs were recognized as duplicates, for example, whereas the MB volunteers attempt to consolidate such data. They also offer fairly powerful searches of the data from a web-based front end.

More importantly for our purposes, MusicBrainz has a web services API. Rather than using SOAP, it's a REST-based service based on RDF.

You can see an example of this by downloading the data at a URL like http://mm.musicbrainz.org/mm-2.1/album/1073abfc-768e-455b-9937-9b41b923c746/4. This returns RDF for the Underworld album Beaucoup Fish. The long hex string is the album's unique identifier within MusicBrainz, and the number at the end (4) tells MusicBrainz how deeply to go when building the RDF graph. This level of depth means that as well as merely getting a track listing as references to other RDF documents (like http://musicbrainz.org/track/55ef9194-bb58-4397-a8a2-e0d41d2e1435), you get the name of the track inlined in the document.
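
Fetching that document from Perl takes only a few lines. A sketch using LWP::Simple (the URL is the one above; the server's layout may of course have changed since this was written):

  #!/usr/bin/perl -w
  use strict;

  use LWP::Simple;

  # Fetch the raw RDF for the album; the trailing 4 is the depth,
  # which inlines the track names.
  my $url = 'http://mm.musicbrainz.org/mm-2.1/album/'
          . '1073abfc-768e-455b-9937-9b41b923c746/4';

  my $rdf = get($url);
  die "Couldn't fetch $url\n" unless defined $rdf;
  print $rdf;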

Using MusicBrainz::Client

However, requesting that URL directly requires you know the MusicBrainz ID for that album, track, or artist, and that you can parse RDF. Unsurprisingly, there's code out there that can do both from a given piece of information.

MusicBrainz::Client is a Perl interface to the C client library for MB, and is available as part of the Client SDK download, as well as on CPAN.

Here's a small example of using one of the more useful queries provided, the snappily-entitled MBQ_FileInfoLookup. This takes up to 10 parameters, as documented in the Query reference. However, you can provide as many or as few items as you wish, and in this example, merely two pieces of information are provided: an artist, and a track name.


  #!/usr/bin/perl -w
  use strict;
  
  use MusicBrainz::Client;
  use MusicBrainz::Queries qw(:all);

  my $mb = MusicBrainz::Client->new();
  my $query = [ '', 'Underworld', '', 'Air Towel' ];
  my $result;

Now that we've set up the script and initialized a client object, let's actually talk to the server.


  if (!$mb->query_with_args( MBQ_FileInfoLookup, $query )) {
    die "Query failed ".$mb->get_query_error();
  }
  
  if (!$mb->select1(MBS_SelectLookupResult, 1)) {
    die "No lookup result";
  }

This sends off a query to the MusicBrainz server, and does two checks to see if it's worth continuing. If there's no return value from query_with_args, the script dies with the error returned. If there's not at least one result in the returned data, it also dies.

The exact arguments that MBQ_FileInfoLookup takes are documented in the query reference above. Notably, the first argument is the TRM ID. This is a generated, unique identifier for the file, based on a number of weighted checks, including wavelet analysis. Generally I've found it's still possible to get good results without including it, though.


  my $type = $mb->get_result_data(MBE_LookupGetType);
  my $frag = $mb->get_fragment_from_url($type);
  
  if ($frag eq 'AlbumTrackResult') {
    $result = handle_album_track_list($mb);
  }
  else {
    die "Not an AlbumTrackResult; instead of type '$frag'";
  }

MBQ_FileInfoLookup can return different types of result. This code uses two more functions from MusicBrainz to find out the type of the result (the LookupGetType function) and then to parse out from the URL what type of result that is. We're only interested in AlbumTrackResult type, so we die if that's not what's found. If it is of that type, it's handled by a subroutine, which we'll look at now.


  sub handle_album_track_list {
    my $mb = shift;
    my $result;

First, we get the MusicBrainz client object and pre-declare our result variable.



    for (my $i = 1;; $i++) {
      $mb->select(MBS_Rewind);

      if (!$mb->select1(MBS_SelectLookupResult, $i)) {
        last;
      }

MusicBrainz results sets are a lot like database rows. You loop over them, and pull out the data you want.

However, the interface to the results is somewhat C-like. As you can see, we loop over the results one by one, stopping only when there isn't a result in the set.


      my $relevance = $mb->get_result_int(MBE_LookupGetRelevance);

However, once there is a result, we can pull out information from it, like the relevance of that data.



      # get track info
      $mb->select(MBS_SelectLookupResultTrack);
      my $track   = $mb->get_result_data(MBE_TrackGetTrackName);
      my $length  = $mb->get_result_data(MBE_TrackGetTrackDuration);     
      my $artist  = $mb->get_result_data(MBE_TrackGetArtistName);
      $mb->select(MBS_Back);

To get the information about the track, you select the track portion of that result, then issue get_result_data calls for each of the pieces of information you want (such as the artist name, track name and so on).


      # get album info
      $mb->select(MBS_SelectLookupResultAlbum);
      my $album   = $mb->get_result_data(MBE_AlbumGetAlbumName);
      my $trackct = $mb->get_result_int(MBE_AlbumGetNumTracks);
      $mb->select(MBS_Back);

Similarly, you select the album data, and then select the information about the album you want to return.



      $result->[$i-1] = { relevance => $relevance,
                          track     => $track,
                          album     => $album,
                          artist    => $artist,
                          total     => $trackct,
                          time      => $length,
                        };

This is stored in a hash reference, itself stored in the list of results. (Note we move from MusicBrainz offset of 1 to the Perl offset of 0 here.)


    }
    return $result;
  }

  use Data::Dumper;
  print Dumper($result);

Finally the result is returned and (crudely) inspected. Of course, you could instead take the result with the highest relevance and tag a file here, or offer the choice via some user interface of which result is more likely to be appropriate.
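
Taking the highest-relevance match is a one-line sort over the structure we just built. A sketch, assuming $result is the array reference returned by handle_album_track_list():

  # Sort the result hashes by relevance, highest first, and take the top one.
  my ($best) = sort { $b->{relevance} <=> $a->{relevance} } @$result;

  print "Best guess: $best->{artist} - $best->{track} ($best->{album})\n";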

Using AudioFile::Identify::MusicBrainz

As you can see, returning the data from MusicBrainz::Client is a fairly verbose procedure. In addition, it's not a pure Perl implementation, so installing the module isn't as easy as it could be, and in some places it's not possible at all.

Given that the REST interface is open, Tom Insam and I decided to play with getting the RDF results and parsing them, putting together Perl modules along the way to help. The result is AudioFile::Identify::MusicBrainz:


  #!/usr/bin/perl -w
  use strict;

  use AudioFile::Identify::MusicBrainz::Query;
  my $query = { artist => 'Underworld', 
                track  => 'Air Towel',
              };
  my $result;

Again, this is simple setup stuff. You'll note that instead of a positional list of arguments, AIM takes a hash reference with named fields, which is hopefully a little easier to use.


  my $aim = AudioFile::Identify::MusicBrainz::Query->new()
            or die "Can't make query";
  
  $aim->FileInfoLookup($query);

This block of code instantiates the AIM object and sends off the query.


  for my $record (@{ $aim->results }) {
    push @{ $result }, {  relevance => $record->relevance,
                          track     => $record->track->title,
                          album     => $record->album->title,
                          artist    => $record->track->artist->title,
                          tracknum  => $record->track->trackNum,
                          total     => scalar @{$record->album->tracks},
                          time      => $record->track->duration,
                       };
  }

This reshapes the results from AIM so that they match the result list we built with MusicBrainz::Client. Each piece of information is available as a method on the returned record; some, such as the artist name, live on objects referenced from other objects (the artist hangs off the track, for instance).


  use Data::Dumper;
  print Dumper($result);

Again, we crudely inspect the output, which is identical but for the addition of the track number.

Inside AudioFile::Identify::MusicBrainz

As you'd expect, Perl (and its retinue of modules) made writing this module fairly straightforward. Firstly, LWP makes requesting data from the MusicBrainz server pretty easy. (This code is in AudioFile::Identify::MusicBrainz::Query, for the curious.)


  use LWP;
  use LWP::UserAgent;

  # ...

  my $ua = LWP::UserAgent->new();

  my $req = HTTP::Request->new(POST => $self->url);
  $req->content($rdf);

  my $res = $ua->request($req);

This sets up an LWP user agent and sends the RDF query (more on that later) to a URL returned by another method in the module. That's all you need to get the response back as the HTTP::Response object $res. (The real module also sets a custom User-Agent string, which I've omitted to save space.)
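
If you want to do the same in your own code, LWP::UserAgent makes it a one-liner; the identifying string below is purely illustrative, not the one the module actually uses:


  # Prefix LWP's default agent string with something identifying us.
  # "MyTagger/0.01" is a made-up name, for illustration only.
  $ua->agent("MyTagger/0.01 " . $ua->agent);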


  # Check the outcome of the response
  if ($res->is_success) {
    $self->response($res->content);
    return $self->parse();
  }

As long as the request succeeded, the response body gets stored and then parsed. (Don't worry; the real module also handles the failure case.)
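
A minimal sketch of what that failure handling might look like (illustrative only, not the module's actual error path):


  unless ($res->is_success) {
    # Hypothetical handling of a failed HTTP request.
    die "MusicBrainz request failed: " . $res->status_line;
  }

So, what does the parser do?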


  my $parser = XML::DOM::Parser->new;
  my $doc = $parser->parse($self->response);

MusicBrainz returns its results as RDF, serialized as XML. Although it's not ideal to treat RDF as plain XML, it works well enough in this case.


  my $result_nodes = $doc->getElementsByTagName('mq:AlbumTrackResult');
  
  my $n = $result_nodes->getLength;
  for (my $i = 0; $i < $n; $i++) {
    my $node = $result_nodes->item($i);
    my $result =
      AudioFile::Identify::MusicBrainz::Result->new()
                                              ->store($self->store)
                                              ->type('Track')
                                              ->parse($node);
    push @$results, $result;
  }

This block of code is a good example of how the XML is parsed. First, all elements named mq:AlbumTrackResult are found. We then loop over them, wrapping each one in a new Result object (of type Track), parsing it, and pushing it onto the list of results. So, what happens within the parser?


  my $child = $node->getFirstChild();
  while($child) {
    if ($child->getNodeType == 1) {
      my $tag = $child->getTagName;
      $tag =~ s/.*://;
      if ($self->can($tag)) {
        $self->$tag($child);
      }
    }
    $child = $child->getNextSibling();
  }

Starting from the node passed in above, we take its first child. While there is a child to examine, the program checks that it's an element (node type 1), gets the tag name and strips off the namespace prefix, then, if a matching get/set method exists, calls it with that XML node before moving on to the next sibling. (This is a somewhat simplified version, with the error checking removed.)

What happens in an example get/set method? Here's part of the title method from the Track package.


  if (defined($set)) {
    if ($set->isa('XML::DOM::Element') and $set->getFirstChild) {
      $self->{title} = $set->getFirstChild->toString;
    } else {
      $self->{title} = $set;
    }
    return $self;
  } else {
    return $self->{title};
  }

If the method is called with some data, the program checks whether it's an XML::DOM element with content; if so, it stores the text inside that element, and if not, it stores the data as passed in. Called with no argument, the method simply returns whatever was previously stored.
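
In other words, the same method doubles as getter and setter. A quick usage sketch (the variable names are made up for illustration):


  # Setter: hand the accessor an XML::DOM::Element (or a plain string)...
  $track->title($title_node);

  # ...getter: call it with no arguments to read the stored value back.
  print $track->title, "\n";

Note also that the setter returns $self; that convention is what allows the chained store/type/parse calls on the Result object earlier.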

One point to note is that MusicBrainz doesn't return all the track information you might need in the initial FileInfoLookup query. Therefore the Result package calls another method, getData in the Track package, to download the RDF for the individual track from MusicBrainz. This is then parsed and stored in the same way as the RDF above.

Conclusion

In this article I've shown you how to connect to MusicBrainz and retrieve information from their web services API with both the MusicBrainz::Client and AudioFile::Identify::MusicBrainz modules, and a little of the internal workings of the latter. This should allow you to find out all those niggling missing pieces of information about the tracks at the bottom of your music collection.
