September 2003 Archives

Adding Search Functionality to Perl Applications

Introduction to Searching

Usually, when building an application, a lot of thought goes into how the data is entered and updated, and much less into how it will be found again. Searching is often an afterthought, especially when developing with a small dataset.

If you are building a small, simple database-backed web site with only a couple of hundred records, then relatively simple SQL should be all you need. It would be trivial to add a clause such as and Foo_Name like '%keyword%' to the queries already being used.
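
For example, a quick keyword filter along those lines might look like this (the Foo table and its columns are purely illustrative):

SELECT Foo_ID, Foo_Name, Foo_Description
FROM Foo
WHERE Foo_Name like '%keyword%'
   OR Foo_Description like '%keyword%'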

When your needs go beyond this, there are three ways you can proceed: you can use native database full-text searching, an external search engine, or you can roll your own. Most relational databases come with full-text search functionality, but it can have disadvantages:

  • Mutually incompatible syntax and behavior.
  • Poor and often unconfigurable ranking and scoring.
  • Difficult or impossible to query across multiple tables.
  • Require schema changes and rebuilding of indexes for any additional or removed fields.
  • Lack of advanced features such as application context, word stemming, synonyms, and handling misspellings.

An external search engine, such as htdig, offers independence from your database and application, as well as powerful indexing and searching functionality. However, this solution is limited in how it can interact with your application -- it depends both on its crawler and on your application's interface. An external search engine can also quickly become out of date, as changes to your data will not be reflected until it next crawls your site.

Integrating customized searching into your application can provide you with many benefits:

  • A cross-platform search component that:
    • Doesn't tie you to a specific database.
    • Is easy to abstract and re-use.
  • The ability to customize how searching works:
    • Search subsets of data or across tables.
    • Tune scoring, weighting, stemming, stop-words, etc.
    • Utilize relationships to provide additional data.
    • Specify when and how to update the index according to your needs.
    • Add extra columns or tables without modifying the schema or interface.
  • Leverage the metadata and word index:
    • Re-use the data to provide extra features, such as computing vectors to find similar records.
    • Categorize data according to vectors or important keywords, rather than using tables or by hand.

When you build searching into your application, it can make the difference between users finding the right information on your site or going somewhere else -- advanced search syntax and features are no substitute for finding the right results for the user in the first place!

Organizing and Indexing Your Data

Most search engines use a reverse (or inverted) index to store a list of words or phrases and locations where they are found. This means building up a list of locations, which could be web pages, files, objects or database records, and then adding an entry into the index for each word found, specifying the location, and possibly some metadata like a score or the type of location.

A reverse index can be just a simple table:

create table ReverseIndex (
 ReverseIndex_Word varchar(32) not null,
 ReverseIndex_Document varchar(255) not null,
 ReverseIndex_Score int,
 primary key (ReverseIndex_Word, ReverseIndex_Document)
)

This table assumes that each document has a score for each word, based on, perhaps, the number of occurrences. Some example data would look something like this:

ReverseIndex_Word  ReverseIndex_Document             ReverseIndex_Score
art                /samantha/stuff/ballet.doc        2
ballet             /samantha/stuff/ballet.doc        5
boy                /aaron/simpsons/homer/quotes.foo  2
boy                /samantha/stuff/ballet.doc        1
dance              /aaron/simpsons/homer/quotes.foo  2
dance              /samantha/stuff/ballet.doc        5
monkey             /aaron/simpsons/homer/quotes.foo  1

To search the index for Homer's "Dance monkey boy dance!" quote, you would split, lowercase, remove duplicates and punctuation, and build a query from the array:


SELECT ReverseIndex_Document as Name, 
       Sum(ReverseIndex_Score) as Total_Score, 
       Count(ReverseIndex_Score) as Matches
FROM ReverseIndex
WHERE ReverseIndex_Word IN ('dance','monkey','boy')
GROUP BY ReverseIndex_Document
ORDER BY Matches DESC , Total_Score DESC

This query returns:

Name                              Total_Score  Matches
/aaron/simpsons/homer/quotes.foo  5            3
/samantha/stuff/ballet.doc        6            2

This query has a few features that make it more effective. The query tests not only the sum of scores, but also the number of matches. Ordering by matches before scores is a crude measure to give slightly better results than just by score, where a single highly scored word in a document can skew the results.

Because your search results are only as good as your index, it is worth investing the time to polish your index and minimize skewed results. You can limit scores per record or normalize the whole index. This ensures that results are more even, and therefore more likely to be the results the user is looking for.

The key to integrating a customized search into your application is the ability to use additional contextual information to give more accurate results. This additional information could be records in other tables, filenames, or any other metadata you have available. In the example above, if we also indexed the words "homer" and "quote" from the file name and path, and the user had entered "homer" and/or "quote," the score and rank for the "quotes.foo" document would be much better and more accurately reflect what the user was looking for.
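
Pulling the pieces above together, a minimal Perl sketch of the query side might look like the following (it assumes a DBI handle in $dbh and the ReverseIndex table shown earlier):

# split, lowercase, strip punctuation, and remove duplicates
my $searchstring = "Dance monkey boy dance!";
my %seen;
my @words = grep { length && !$seen{$_}++ }
            split /[^a-z0-9]+/, lc $searchstring;

# build the IN (...) clause with one placeholder per word and run the query
my $placeholders = join ',', ('?') x @words;
my $results = $dbh->selectall_arrayref("
    SELECT ReverseIndex_Document as Name,
           Sum(ReverseIndex_Score) as Total_Score,
           Count(ReverseIndex_Score) as Matches
    FROM ReverseIndex
    WHERE ReverseIndex_Word IN ($placeholders)
    GROUP BY ReverseIndex_Document
    ORDER BY Matches DESC, Total_Score DESC",
    { Slice => {} }, @words);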

Adding Indexing and Searching to a Legacy Application

Although it can be easier to provide a search engine externally through Lucene or Swish, you don't have to tightly integrate searching at a code level to reap benefits from customized indexing. Adding your own search engine can be a high-level operation. When adding search functionality to a legacy application, you have the choice of reworking the code, annexing, or adding triggers.

Annexing Searching to an Application

Annexing searching to your application can be as simple as a couple of extra tables in a database and a Perl script called from cron. The database need only contain the reverse index and some summary information about the locations to be displayed in search results.

Your searching logic should do the minimum of work at query time to ensure fast results, with as much of the work as possible done up front. This is the flaw in just adding simple SQL queries to pages to provide searching: they're simply not efficient. Exact matches on indexed columns, as in a reverse index, are far faster than the LIKE/wildcard queries people tack onto a page to allow searching, and built-in database full-text indexes will be very slow, or may not even be usable, when searching columns not specified in the index. This leaves a lot of work to be done for each search, which is avoided when you roll your own.

This simple example uses two tables, a script for indexing, and a mod_perl handler for searching. The ReverseIndex table is going to be three columns: Word, Score, and Location.

create table ReverseIndex ( 
ReverseIndex_Word varchar(64), 
ReverseIndex_Score float, 
Location_ID int, 
 primary key( ReverseIndex_Word, Location_ID )
)

The Location table will have seven columns: Location_ID, Title, Type, Key, Identifier, URL, and Summary.

create table Location (
Location_ID integer primary key auto_increment,
Location_Title varchar(64),
Location_Type varchar(16),
Location_Key varchar(32),
Location_Identifier integer,
Location_URL varchar(255),
Location_Summary text
)

The indexing script just reads records from the application table and populates the reverse index and locations tables. This example uses a simple local weighting based on the fields and doesn't check if locations already exist. (If you store your application data in XML, it should still be relatively simple to change the code fetching data from the database to pull from XML files instead.)

# give fields weighting
my %weight = ( Pub_Name => 1, 
               Pub_Type => 0.5, 
               Pub_Description => 0.8, 
               Brewery_Name=> 0.6, 
               Town_Name => 0.4);

# fetch records from the Pub table, joined against Town and Brewery
my $sth = $dbh->prepare('select 
    Pub_ID, Pub_Name, Pub_Type, Pub_Description, 
    Brewery_Name, Town_Name 
    from Pubs, Brewerys, Towns 
    where Towns.Town_ID = Pubs.Town_ID and 
          Brewerys.Brewery_ID = Pubs.Brewery_ID');
my $rv = $sth->execute();

while (my $record = $sth->fetchrow_hashref() ) {
  my %words = ();
  # create the location
  # index the record
  my $location_sth = $dbh->prepare_cached("
    insert into Location 
        (Location_Title, Location_Type, Location_Key, 
        Location_Identifier, Location_URL, 
        Location_Summary ) 
    values (?,'Pub','Pub_ID',?,'/pub.cgi', ?) 
  ");
  (my $summary) = $record->{Pub_Description} =~ /((?:\w+\s+){0,20})/;
  my $rv = $location_sth->execute(
        $record->{Pub_Name}, $record->{Pub_ID}, "$summary .."
  );
  my $location = $dbh->{mysql_insertid};

  # short stop-word dictionary
  my @stopwords = qw/a i you he she they the it is as to 
                     not on or no/;
  my %stopwords; @stopwords{@stopwords} = @stopwords;

  # delete all old words for this location
  my $deleted = $dbh->do("
    delete from ReverseIndex 
    where Location_ID = $location");

  foreach my $field ( keys %$record ) {
    # split the words out of the field ( from Tim Kientzle's article )
    my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, 
                        lc $record->{$field});

                # Strip leading punct
    @words    = grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ } 
                # Must be longer than one character
                grep { length > 1 }	
                # must have an alphanumeric
                grep { /[a-zA-Z0-9\xc0-\xff]/ }	@words;

    # score the words
    foreach (@words) { 
      next if $stopwords{$_};
      $words{$_} += $weight{$field} 
        unless defined $words{$_} && $words{$_} > 5;
    }
  }

  # insert the new words into the index
  my $sth = $dbh->prepare('
 insert into ReverseIndex 
        (ReverseIndex_Word, ReverseIndex_Score, Location_ID) 
 values (?,?,? ) ');
  foreach my $word (keys %words) {
    my $rv = $sth->execute($word,$words{$word},$location);
  }
}

The search handler only really needs a simple SQL query with an autogenerated clause, and a short dictionary of stop words (see above).

package Search::Pub;

use strict;
use DBI;
use Template;

use Apache;
use Apache::Request;
use Apache::Constants;

my %stopwords;
@stopwords{qw(a i at be do to or is not no the 
              that they then these them who where 
              why can find on an of and it by)} = (1) x 27;

my $dbh = DBI->connect("DBI:mysql:pubs:host", 'username', 'password');

# initialise and create template
my $config = {
    INCLUDE_PATH => '/search/path',  # or list ref
    POST_CHOMP   => 1,               # cleanup whitespace
};
my $template_file = 'results.tt';
my $template = Template->new($config);

sub handler {
  my $r = Apache::Request->new(shift);

  # remove stoplist words
  my @wordlist  = split(/\s+/, lc $r->param('searchstring'));
  my @shortlist = ();
  my @arglist   = ();

  foreach my $word (@wordlist) {
    next if ($stopwords{$word});
    push(@shortlist, $word);
    push(@arglist,'?')
  }
  
  # build SQL
  my $where = 'WHERE ReverseIndex_Word IN (' 
              . join(',',@arglist) . ') ';
  my $sql = qq{
  SELECT Location_Title, 
         sum(ReverseIndex_Score) as score, 
         count(*) as count, 
         Location_Summary, Location_ID, Location_Key, 
         Location_Identifier, Location_URL
  FROM   ReverseIndex, Location
  $where
  AND ReverseIndex.Location_ID = Location.Location_ID
  Group By ReverseIndex.Location_ID 
  Order By score DESC, count DESC };

  my $sth = $dbh->prepare($sql) 
    or $r->warn("couldn't prepare statement!\n");
  $sth->execute(@shortlist);
  my $results = $sth->fetchall_arrayref();

  # populate template variables
  my $template_vars = {};
  $template_vars->{Results} = $results;
  $template_vars->{Words}   = \@shortlist;

  # output page
  $r->send_http_header('text/html');
  $template->process($template_file, $template_vars, $r) 
    || die $template->error();

  return OK;
}

Modifying Your Objects to Use a Trigger

An easy way to add indexing to objects is to use Class::Trigger. A good example of this can be seen in Kake Pugh's article, where she combines Search::InvertedIndex with Class::DBI's trigger support, or in Tony Bowden's Class::DBI::mysql::FullTextSearch.

Using triggers provides an alternative to hacking calls to reindex the object in each place it is updated. More importantly, it allows you to separate the indexing from the object to another piece of code -- possibly implementing the update outside of the object, in the application, or even in a mod_perl cleanup handler (tricky, but possible).
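
For instance, under mod_perl 1.x the trigger could simply register a cleanup callback, so the reindexing runs after the response has been sent (a sketch only; reindex() is the routine shown below):

use Apache ();

$object->add_trigger( after_update => sub {
    my $self = shift;
    my $r = Apache->request;
    # defer the (potentially slow) reindex until after the client
    # has received its response
    $r->register_cleanup( sub { reindex($self->{id}, $self) } );
} );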

If you already have objects to which you wish to add indexing, then using triggers means minimal changes to your objects and therefore less regression testing -- allowing you to get on with adding your indexing and searching code to your application. You can also build objects that index themselves with minimal code. For example:


 use Class::Trigger;

 . . .

 sub set_foo {
      my ($self,$new_value) = @_;

      . . .

      $self->call_trigger('after_update');

      . . .
 }

You could then add a trigger elsewhere in the application, or possibly in a subclass, that would call your indexing logic:


 $object->add_trigger( after_update => sub {
      my $self = shift;
      # logic to index object
      reindex($self->{id},$self); # pass identifier and object
      return;
  } );

The indexing logic would update the reverse index table based on the object's identifier and type. If you are keeping data about your objects/locations in a separate table, then you will need to get the foreign key that joins the tables before updating the reverse index table itself.


sub reindex {
  my ($id,$object) = @_;
  my $type = ref $object;   # assume the location type matches the class name
 
  # fetch the foreign key if required
  my ($location) = $dbh->selectrow_array("
    select Location_ID 
    from Location 
    where Location_Type = '$type' 
        and Location_Identifier = $id");

  # build new word/score data structure according to object type
  # (%objectfields is assumed to map each type to its indexed fields and weights)
  my %words = ();
  foreach my $field ( @{$objectfields{$type}} ) {
    my $text = $object->{$field->{name}};
    my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, lc $text);
    @words    = grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ }
	        grep { length > 1 }         
	        grep { /[a-zA-Z0-9\xc0-\xff]/ } 
	        @words;

    # score the words
    foreach (@words) { 
      next if $stopwords{$_};
      $words{$_} += $field->{weight} 
        unless defined $words{$_} && $words{$_} > 5;
    }
  }

  # mark old indexed values and fetch them
  $dbh->do("update ReverseIndex 
            set ReverseIndex_Obselete = 1 
            where Location_ID = $location");
  my @oldvalues = $dbh->selectcol_arrayref("
        select ReverseIndex_Word 
        from ReverseIndex 
        where Location_ID = $location" );

  # update values already present
  my $sth = $dbh->prepare('update ReverseIndex 
    set ReverseIndex_Score = ?, 
        ReverseIndex_Obsolete = 0 
    where Location_ID = ? and ReverseIndex_Word = ?');

  foreach my $word (@$oldvalues) {
    next unless ( defined $words{$word} );
    $sth->execute($words{$word},$location,$word);
    delete $words{$word};
  }

  # insert new values
  $sth = $dbh->prepare('
    insert into ReverseIndex 
    (ReverseIndex_Word, ReverseIndex_Score, 
        Location_ID, ReverseIndex_Obsolete) 
    values ( ?, ?, ?, 0) ');

  foreach my $word (keys %words) {
    $sth->execute($word,$words{$word},$location);
    delete $words{$word};
  }

  # delete remaining obsolete values
  $dbh->do("delete from ReverseIndex 
            where Location_ID = $location 
                and ReverseIndex_Obsolete = 1");
}

This process ensures the minimum of interruption to service -- rather than doubling entries or removing all entries for an object while it is being updated. It would be slower than a simple wipe and replace, but for frequently read and updated objects, it would ensure more consistent search results. It would be simple to add object-level locking by checking before the update that no entries for an object are marked obsolete; this would stop corruption of the index when multiple processes are updating a single object.
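
A minimal sketch of such a lock check, using the same $dbh and $location as in reindex() above, might be:

# if another process has already marked entries obsolete for this location,
# it is mid-update; skip (or queue a retry) rather than corrupt the index
my ($in_progress) = $dbh->selectrow_array("
    select count(*) from ReverseIndex
    where Location_ID = ? and ReverseIndex_Obsolete = 1",
    undef, $location);
if ($in_progress) {
    warn "location $location is already being reindexed, skipping\n";
    return;
}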

That takes care of indexing; for searching, you can either re-use the mod_perl search handler above, or add a new search method to your class. The following example assumes it is called as $objects = myapp::myclass->search($searchstring);:

sub search {
  my ($class,$searchstring) = @_;

  # remove stoplist words
  . . .
  
  # build SQL
  my $where = 'WHERE ReverseIndex_Word IN (' 
                    . join(',', ('?') x scalar @shortlist) . ') ';
  my $sql = qq{
  SELECT Location_Title as title, 
         sum(ReverseIndex_Score) as score, 
         count(*) as count, 
         Location_Summary as summary, 
         Location_Identifier, 
         Location_URL as url
  FROM   ReverseIndex, Location
  $where
      AND ReverseIndex.Location_ID = Location.Location_ID
  Group By ReverseIndex.Location_ID 
  Order By score DESC, count DESC };

  my $sth = $dbh->prepare($sql) or warn "couldn't prepare statement !\n";
  $sth->execute(@shortlist);

  # build a list of objects from the search results
  my $objects = [];
  while ( my $result = $sth->fetchrow_hashref ) {
    my $object = myapp::myclass->new(id => $result->{Location_Identifier});
    push (@$objects, { object  => $object, 
                       score   => $result->{score}, 
                       count   => $result->{count},
                       url     => $result->{url}, 
                       summary => $result->{summary}, 
                       title   => $result->{title} } );
  }

  return $objects;
}

Integrating Indexing Into a New Class or Application

Whether building a new application or refactoring an old one, it is worth factoring "searchability" into the design. Designing in searching at the start can save a lot of work later as database schemas and business logic change.

The foundation of your application's searching facilities will be the index. One way to proceed is to have a two-level index, with the normal reverse index table, plus a lookup table that provides additional information about the objects indexed. This lookup table can be used to hold all of the information you wish to display with the results in a single place.

create table ReverseIndex (
 ReverseIndex_Word varchar(64) not null,
 Location_ID int not null,
 ReverseIndex_Score int default 0,
 ReverseIndex_Fields text,
 primary key (ReverseIndex_Word,Location_ID)
)

create table Location (
 Location_ID int primary key auto_increment,
 Location_Type varchar(32),
 Location_Key varchar(32),
 Location_KeyValue int,
 Location_Title varchar(128),
 Location_Summary text,
 Location_URL varchar(255)
)

Any extra information you wish to show in the results -- such as paths, icons, or the object's status -- should be stored in the Location table. If a location has been replaced or should only be visible under certain conditions, you can keep relevant flags in this table and check them when processing the results.

Turning to the code side of searching, a good trick is to use Perl's multiple inheritance to make your application's objects inherit from a superclass that contains the search logic. This IndexedObject class provides fine-grained, incremental updates to the index, reindexing individual object attributes as they are changed. If your object only updates the database when it is synchronized explicitly, then you can either reindex the whole object or keep track of changes and index the fields that have changed as part of your synchronize method. Here's a sample base class that provides a few reverse-indexing methods:


package myapp::classes::IndexedObject;
use strict;

use myapp::libraries::Search;

################
# public methods

sub index_object {
  my $self = shift;
  foreach my $field (keys %{$self->{_RIND_fields}} ) {
    $self->index_field($field, $self->{$field});
  }
}

sub index_field {
  my ($self, $field, $value) = @_;
  return 0 unless $self->{_RIND_fields}{$field};
  my $location = $self->{_RIND_Location};
  my $query = "select * from $self->{table} 
               where Location_ID = ?";
  my $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv = $sth->execute($location);
  my %newwords = ();
  my @newwords = get_words($value);
  foreach my $word (@newwords) {
    next if ($stopwords{$word});
    $newwords{$word} += $self->{_RIND_fields}{$field}{weight};
  }

  while ( my $row = $sth->fetchrow_hashref() ) {
    next unless ($row->{ReverseIndex_Fields} =~ m/'$field'/);
    $self->{locationwords}{$row->{ReverseIndex_Word}} = $row;

    my %fields = ( $row->{ReverseIndex_Fields} =~ m/'(.*?)':([\d.]+)/g );

    if ( exists $newwords{$row->{ReverseIndex_Word}} ) {
      $self->_RIND_UpdateFieldEntry( $row, $field, 
                       $newwords{$row->{ReverseIndex_Word}});
      delete $newwords{$row->{ReverseIndex_Word}};
    } else {
      $self->_RIND_RemoveFieldEntry($row,$field,$location);
    }
  }

  foreach my $word ( keys %newwords ) {
    $self->_RIND_AddFieldEntry($location,$word,$newwords{$word},$field);
  }

}

sub delete_location {
  my $self = shift;

  my $query = "delete from ReverseIndex where Location_ID = ?";
  my $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv1 = $sth->execute($self->{_RIND_Location});

  $query = "delete from Location where Location_ID = ?";
  $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv2 = $sth->execute($self->{_RIND_Location});

  return $rv1 + $rv2;
}

sub indexed_fields {
  my ($self,%args) = @_;
  if (keys %args) {
    $self->{_RIND_DBH} = $args{dbh} if defined $args{dbh};
    if ( defined $args{key} ) {
      $self->{_RIND_Key} = $args{key};
      ($self->{_RIND_Location}) = 
        $self->{_RIND_DBH}->selectrow_array("
            select Location_ID 
            from Location 
            where Location_Key = '$args{key}' 
            and Location_KeyValue = $self->{$args{key}}");
    }
    if ( defined $args{fields} ) {
      foreach (@{$args{fields}}) { 
        $self->{_RIND_fields}{$_->{name}} = $_ ; 
      }
    }
  }
  return values %{$self->{_RIND_fields}} if wantarray; 
}

#################
# private methods 

sub _RIND_UpdateFieldEntry {
  my ($self,$row, $field,$score) = @_;
  my %fields = ($row->{ReverseIndex_Fields} =~ m/'(.*?)':([\d.]+)/g);

  # recalculate total score
  my $newscore = ($row->{ReverseIndex_Score} - $fields{$field}) 
                 + $score;
  if ($fields{$field} == $score) {
    # skip if the same
    return 1;
  }

  # update entry, removing field and score from end
  $fields{$field} = $score;
  my $newfields;
  foreach (keys %fields) {
    $newfields .= "'$_':$fields{$_}";
  }
  $self->_RIND_UpdateIndex(
        word      => $row->{ReverseIndex_Word},
        location  => $row->{Location_ID},
        newscore  => $newscore,
        newfields => $newfields);
}

sub _RIND_AddFieldEntry {
  my ($self,$location, $word, $score, $field) = @_;

  # check if record already exists for this location 
  # and update/insert entry
  if (exists $self->{locationwords}{$word}) {
    # recalculate total score
    my $newscore = $self->{locationwords}{$word}{ReverseIndex_Score} 
                  + $score;
    # update entry, appending field and score to end
    my $newfields = $self->{locationwords}{$word}{ReverseIndex_Fields} 
                    . "'$field':$score";
    $self->_RIND_UpdateIndex(
        word      => $word,
        location  => $location,
        newscore  => $newscore,
        newfields => $newfields);
  } else {
    # insert new entry
    $self->_RIND_UpdateIndex(
        insert   => 1,
        word     => $word,
        location => $location,
        score    => $score,
        fields   =>"'$field':$score");
  }
}

sub _RIND_RemoveFieldEntry {
  my ($self,$row,$field,$lid) = @_;

  # check if record contains other fields
  my %fields = ( $row->{ReverseIndex_Fields} =~ m/'(.*?)':([\d.]+)/g );
  if ( keys %fields > 1 ) {
    # recalculate total score
    my $newscore = $row->{ReverseIndex_Score} - $fields{$field};
    delete $fields{$field};
    # update entry, removing field and score from end
    my $newfields = '';
    foreach (keys %fields) {
      $newfields .= "'$_':$fields{$_}";
    }
    $self->_RIND_UpdateIndex(
        word      => $row->{ReverseIndex_Word},
        location  => $lid,
        newscore  => $newscore,
        newfields => $newfields
    );
  } else {
    # delete entry
    $self->_RIND_UpdateIndex(
        word      => $row->{ReverseIndex_Word},
        location  => $lid,
        delete    => 1
    );
  }
}

sub _RIND_UpdateIndex {
  my ($self,%args) = @_;
  my $query = "update $self->{table} 
               set ReverseIndex_Score = ?, 
                   Location_Fields = ? 
               where ReverseIndex_Word = ? 
                   and Location_ID = ?";

  my @args = ($args{newscore},$args{newfields},$args{word},$args{location});

  SWITCH : {
    if ($args{insert}) {
      $query = "insert into $self->{table} 
                (ReverseIndex_Score, ReverseIndex_Fields, 
                 ReverseIndex_Word, Location_ID) 
                values (?,?,?,?) "; 
      last;
    }
    if ($args{delete}) {
      $query = "delete from $self->{table} 
                where ReverseIndex_Word = ? 
                    and Location_ID = ?";
      @args = ($args{word},$args{location});
      last;
    }
  } # end of SWITCH
  my $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv = $sth->execute(@args);
  return;
}

The code that extracts words from objects and search queries has to be the same, so it is a good candidate for putting into a separate library; this also helps make the code more manageable.

package myapp::libraries::Search;
use strict;
require Exporter;

our @ISA = qw(Exporter);
our @EXPORT = qw(%stopwords &get_words);

# stop words
our %stopwords;
@stopwords{qw(a i at be do to or is not no the that they 
    then these them who where why can find on an of and it by)} = (1) x 27;

sub get_words {
  my $text = shift;
  # Split text into Array of words
  my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, lc $text);
           # Strip leading punct
  @words = grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ } 
           # Must be longer than one character
	       grep { length > 1 }         
           # must have an alphanumeric
	       grep { /[a-zA-Z0-9\xc0-\xff]/ } @words;
  return @words;
}

Your own objects can then inherit the index and search methods from the superclass and provide their own logic to manage how metadata is stored.

package myapp::classes::Pub;
use strict;

our @ISA = qw(myapp::classes::IndexedObject 
              myapp::classes::DatabaseObject);

sub new {

 . . .

 $self->indexed_fields(
                       dbh=>$self->get_dbh, key=>'Pub_ID',
                       fields=>[
                                { name=>'Pub_Name', weight=>1}, 
                                  . . .
                               ],
                      );
 return $self;
}

sub create {
  my ($class,%args) = @_;
  my $self = $class->_new();
  $self->_initialise_from_values(%args);
  $self->create_location(%args);
  $self->index_object();
  return $self;
}

sub load {
  my ($class,%args) = @_;
  my $self = $class->_new();
  $self->_initialise_from_db(%args);
  return $self;
}

sub update {
  my ($self, $field, $value) = @_;
  $self->{$field} = $value;
  $self->execute("update Pubs 
                  set $field = ? 
                  where Pub_ID = ?",
                  $value, $self->{Pub_ID});

  $self->index_field($field, $value);
  return 1;
}

sub delete {
  my $self = shift;
  $self->delete_location();
  $self->execute("delete from pubs 
                  where = Pub_ID = ?",$value);
}

Adding lookups and replacements to your object's indexing logic can be pretty painless. Here's the data that gets passed to indexed_fields for a Pub object.

      fields=>[
           { name=>'Pub_Name', weight=>1}, 
           { name   => 'Brewery_Name', 
             weight => '0.4', 
             lookup => 'Brewery_ID', 
             table  => 'Brewery'},
           { name   =>'Pub_IsCAMRA', 
             weight =>'0.6', 
             replace=>'CAMRA Real Ale'}
          ],
      table=>'Pub',

The hard work can be done in the superclass, updating the index_field method to do lookups and replacements.

sub index_field {
  my ($self, $field, $value) = @_;
  return 0 unless $self->{_RIND_fields}{$field};
  my $location = $self->{_RIND_Location};
  my $query = "select * from $self->{table} where Location_ID = ?";
  my $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv = $sth->execute($location);
  my %newwords = ();
  my @words    = ();

  if ( defined $self->{_RIND_fields}{$field}{replace} ) {
    @words = get_words($self->{_RIND_fields}{$field}{replace});
  } elsif ( defined $self->{_RIND_fields}{$field}{lookup} ) {
    my $column = $self->{_RIND_fields}{$field}{lookup};
    my $table = $self->{_RIND_fields}{$field}{table};
    my ($words) = $self->{_RIND_DBH}->selectrow_array("
        select $field 
        from $table 
        where $column = ?", undef, $self->{$column});
    @words = get_words($words);
  } else {
    # a plain field: just index the value that was passed in
    @words = get_words($value);
  }

  foreach my $word (@words) {
    next if ($stopwords{$word});
    $newwords{$word} += $self->{_RIND_fields}{$field}{weight};
  }

  while ( my $row = $sth->fetchrow_hashref() ) {
    $self->{locationwords}{$row->{ReverseIndex_Word}} = $row;
    next unless ($row->{ReverseIndex_Fields} =~ m/'$field'/);
    my %fields = ( $row->{ReverseIndex_Fields} =~ m/'(.*?)':([\d.]+)/g );
    if ( exists $newwords{$row->{ReverseIndex_Word}} ) {
      $self->_RIND_UpdateFieldEntry($row,$field,
                        $newwords{$row->{ReverseIndex_Word}});
      delete $newwords{$row->{ReverseIndex_Word}};
    } else {
      $self->_RIND_RemoveFieldEntry($row,$field,$location);
    }
  }

  foreach my $word ( keys %newwords ) {
    $self->_RIND_AddFieldEntry($location,$word,$newwords{$word},$field);
  }

}

The problem with doing lookups is that another object could update data that affects the index entries of other objects. To handle this, you have to make the changed object check which other objects are affected by its changes and reindex them.

If you store the indexed fields in the database, it's possible to only check those object types affected with two queries: the first query will get the object types that index the changed field, and the second will update the affected records, joining as per the original lookup. An alternative to keeping the indexed fields in the database would be to keep the indexing information in an XML file -- such a file could also contain configuration options that the search system could check, such as whether to use stemming, ranges for grades, and so on.
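
As a sketch, with a hypothetical IndexedField metadata table (IndexedField_Type, IndexedField_Name) recording which object types index which fields, those two steps might look like this:

# 1. which object types index the field that just changed?
my $types = $dbh->selectcol_arrayref("
    select IndexedField_Type from IndexedField
    where IndexedField_Name = ?", undef, $changed_field);

# 2. reindex the affected locations of each type, joining as per the
#    original lookup (reindex_location() is a hypothetical helper)
foreach my $type (@$types) {
    my $ids = $dbh->selectcol_arrayref("
        select Location_Identifier from Location
        where Location_Type = ?", undef, $type);
    reindex_location($type, $_) for @$ids;
}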

The two-level solution we discussed with the additional metadata table lets us store data about which object attributes are indexed and how, and it also allows for easy reporting. Additionally, we can control the indexing process purely by updating the database or XML, without having to modify the codebase at all.

Normalizing and Global Weighting

Normalizing scores within the reverse index ensures that all scores are within constrained limits, making them much easier to interpret and use in your application. How you normalize the scores depends on both the data you have indexed and how it will be searched. A common scenario is that the index breaks down into three groups of words.

  • A small number of high-scoring words, with relatively low frequency. These words are usually rare across the data set, but appear frequently in a small number of objects.
  • Some middle-scoring words with a high frequency across the index. These words are common across the whole data set.
  • A large number of low-scoring words with low frequency. These words occur rarely in the data set and rarely in any object.

A simple way to normalize scores, while at the same time narrowing the gap between high-scoring and low-scoring words, is to use the sine curve to reshape the distribution of scores.

This graph shows the area of the sine curve we are using -- the flat top reducing the impact of outlying high scores and translating scores into a value between 0 and 1 -- in this case, the maximum score is assumed to be 10. The normalise function shown here can be added to the myapp::libraries::Search module and called from IndexedObject's indexing methods.

 use constant PI => 3.14159265358979;
 my $max = 10;   # assumed maximum raw score
 sub normalise {
   my $score = shift;
   # map 0..$max onto the first quarter of the sine curve, giving 0..1
   return sin(($score/$max) * (PI/2));
 }

If your data (scores by frequency) follow more of a bell curve, with a small number of low-scoring words, many middle-scoring words and a few high-scoring words, you would want to normalize using mu-law or a-law functions. In this example, outliers at top and bottom are compressed to fit within the range of 0 to 1 -- see the chart below.


use Math::Trig;

. . .

sub normalise {
   my $score = shift;
   $score = ($score / $max) * 10;   # rescale 0..$max (as above) onto 0..10
   # tanh compresses outliers at both ends; shift and scale into 0..1
   return (1 + tanh($score - 5)) / 2;
} 

When indexing, you can weight scores both locally and globally. Local weighting was covered earlier in the article; global weighting reduces the scores of frequently found or particularly highly scoring words that can skew results, as well as increasing the scores of rare words.

For the best results when weighting scores globally, you should normalize in advance to ensure a limited range of scores; this also reins in outlying scores that could skew the weighting. Global weighting requires that scores still reflect the relevance of an object, and can be problematic if the solution doesn't consider that a rare word with a low mean score can still have a few high scores that end up weighted too highly, or vice versa, making results less, rather than more, useful.


sub global_weighted_score {
  my ($word,$score) = @_;
  my $word_avg   = get_average($word);     # average score for this word
  my $global_avg = get_average();          # average score across the whole index
  $score = normalise($score);              # normalise score before weighting
  if ($word_avg > $global_avg) {
    $score += (($global_avg - $word_avg) * 0.25) / 
                ($global_avg / $score ); 
  } else {
    $score += (($global_avg - $word_avg) * 0.25) / 
                ($score / $global_avg );
  }
  return $score;
}

sub get_average {
  my $word = shift;
  my $query = 'select avg(ReverseIndex_Score) 
                 from ReverseIndex';
  if ($word) {
    $word = $dbh->quote($word);
    $query .= " where ReverseIndex_Word = $word";
  }
  my ($avg) = $dbh->selectrow_array($query);
  return $avg;
}

There is nothing stopping you from keeping multiple scores in your index for each word; as long as you index them appropriately, there will not be a significant impact on performance. In particular, you could weight scores at the class level as well as at the global level. Keeping track of original, normalized, and weighted scores means that changes to code require only recalculating of scores rather than re-indexing everything. This additional information could also be included in data made available through web services.
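
One way to keep those scores side by side would simply be extra columns on the index table (a sketch; the column names are assumptions):

alter table ReverseIndex
  add column ReverseIndex_RawScore float,
  add column ReverseIndex_NormScore float,
  add column ReverseIndex_WeightedScore float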

Stopwords

Integrating indexing and searching into your classes allows you to have class-level or even object-level stopwords. This can be particularly handy when one word is so frequent in one class or set of objects as to become meaningless there, while remaining rare and useful in others; it also works around some of the limitations of global weighting.

Stopwords can be normalized by ignoring them while building the word list and adding them later, based on other object attributes. For instance, if you had a load of objects representing pubs, you could add the pub's town to the global or class stopword list and then give all pubs in "Watford" the same score for "Watford" in the index. Other options would be to score down words in the index based on object attributes such as location, or even having stopwords apply to specific fields -- for example, if you ignored the word "watford" in the description of a pub, you could still increase the pub's score for "watford" if it was in a name or address field -- so a pub called "The Watford Arms" would score higher than "The Kings Head."
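
A sketch of that idea, layered on the scoring loop from the indexing script earlier (the %class_stopwords hash and the fixed town score are assumptions):

# towns are meaningless within their own class, so treat them as stopwords
my %class_stopwords = map { $_ => 1 } ('watford');

foreach (@words) {
    next if $stopwords{$_} || $class_stopwords{$_};
    $words{$_} += $weight{$field};
}

# then give every pub the same, known score for its own town
$words{ lc $record->{Town_Name} } = 0.5 if $record->{Town_Name};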

Integrating the Search into Your Application

The results are the important part as far as the user is concerned -- this is where all the hard work should bear fruit with a responsive site and useful information.

Critical considerations are:

  • What information is needed.
  • How ranking and scoring will be determined.
  • How to normalize scores if they are shown.

Showing results for subsets or grouping can also be important. Your presentation of the results can also make a difference -- paging and balancing the trade-off between the amount of information you can show for each result and how many results you can fit on a page.

Assuming the database schema outlined earlier, it would be possible to denormalize some of the application's information into the Location table. Although this consumes more storage, it saves joining tables when searching, and the data is updated at the same time as the index, so it stays in sync with the application data itself.

When displaying results to your users, you are heavily constrained in how much data you can present at once. This means some form of paging is often required if you have more than a screenful of results. It also means that you must sacrifice the number of results shown per page if you want to show more than a trivial amount of information in that page.

Often it can be difficult to grade the quality of results returned to the user -- although index scores are limited and normalized, you also need to be able to display scores in a meaningful way to the user. This means both simplifying and explaining the scores. A numeric score without qualification is meaningless: is 5 out of 5, out of 10, or out of 100?

The simplest way to grade scores is to work out the maximum score and divide it into grades. For example, if you have a maximum of 5 points per word matched, then you could divide the score by the number of words searched for and grade it by rounding up to 1, 2, 3, 4 or 5 out of 5. This information can then be presented using text and/or graphics, the latter allowing for color-coding of results. A small colored bar for each result allows you to show both the score and the grade in the minimum of space. The bar can easily be replaced by stars or other symbols, to fit in with the look and feel of an application.

By normalizing the scores as you index the objects (see above), you make the results much easier to use. If you know that the maximum score per search term is 1, then scores can be easily graded with a simple piece of code into something users can understand.

my $wc = scalar @searchwords;
foreach my $result (@$results) {
  $result->{grade} = get_grade($result->{score}/$wc);
}

sub get_grade {
  my $score = shift;
  return 'poor' if ($score < 0.35);
  return 'good' if ($score < 0.65);
  return 'very good' if ($score < 0.85);
  return 'excellent';
}

There are many ways of getting extra value from your results. You can group results by object type, either by adding logic to the query or by using an Iterator class that differentiates between object types.
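
A simple version of the grouping, assuming the search query also selects Location_Type, might be:

# bucket results by object type, preserving the score order within each type
my %by_type;
foreach my $result (@$results) {
    push @{ $by_type{ $result->{Location_Type} } }, $result;
}

foreach my $type (sort keys %by_type) {
    print "== $type ==\n";
    print "  $_->{title} ($_->{grade})\n" for @{ $by_type{$type} };
}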

By checking the status or type of each item in the result list, you can present it in a different way or provide additional information. Items in a catalogue that have been recalled or replaced can include a link to the replacement or recall notice -- again, this is one of the benefits of keeping such information in a metadata table.

You can check for spelling mistakes and alternative words using CPAN modules such as Lingua::Spelling::Alternative. You can also provide related links for items or keyword-based advertising.

Tuning and Customizing

Once you have your search engine working and integrated with your application, you can work on tuning it for more accurate scoring and more intuitive results. You can also work on customizing it further to meet the needs of your application.

For instance, if part of your application were a catalogue, you could add a status field to your locations. This would allow you to mark old items as replaced and provide an alternative result in their place in search results, with a note saying which item it replaces. You could provide similar features for items that have been recalled or books no longer in print.

You can further tune results using two powerful modules on CPAN -- Lingua::Stem::En (replace En with whichever language suits your needs) and Lingua::EN::Tagger. For the sake of simplicity, I haven't used these in this article, but they are relatively simple to integrate into searching and indexing.

Lingua::Stem::En provides Porter stemming for Perl. Porter's algorithm is a well-known way of cutting down a word to its stem -- removing grammatical information from words to find their root. For instance, you'll want "training" and "trains" both to match the same results as "train," so Porter stemming can be used to reduce both words to "train." As well as increasing the accuracy of your search, this technique also drastically reduces the number of words in the index. If you are getting a high number of word misses on your index, this can improve results greatly; if you are already getting plenty of word hits, it can instead flatten your results, by losing grammatical information in the words that may have distinguished them. A simple rule of thumb is that if you get a low number of results for each word, you need it; if you have a high number of results for each word, you don't. To add stemming to your index with a module like Lingua::Stem::En, you would use the module's function to extract words rather than (or as well as) your own, when splitting search phrases and text to be indexed into keywords.
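
For example, one way to wire stemming into the get_words function from earlier might be (a sketch; Lingua::Stem::En's stem() takes a reference to a word list and returns a reference to the stemmed list):

use Lingua::Stem::En;

sub get_words {
  my $text = shift;
  # split and clean the text as before
  my @words = grep { length > 1 }
              split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, lc $text);
  # reduce each word to its stem before it reaches the index or the query
  my $stemmed = Lingua::Stem::En::stem({ -words => \@words, -locale => 'en' });
  return @$stemmed;
}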

Tagger is a clever module that can add something approaching phrase matching without having to muck about (too much) with your working index algorithm. Tagger will pull out groups of words from a string of text (optionally stemming words) by looking for nouns and "Parts Of Speech." By passing text to be indexed and text to be searched for through the tagger, you can extract groups of words. For example, instead of just indexing "Justice Department" as two separate words, a good tagger will return it as a single phrase.

Once you've used the tagger to segment your text, you can treat the individual words and the phrases alike for both indexing and searching purposes. This means you can avoid the slow and unpleasant task of doing phrase matching properly -- because the tagger is also applied to the search query, a phrase like "Justice Department" in a search term is automatically kept together. The phrase matching is transparent to the user, with no need for additional syntax such as putting quotes around phrases.

Both Tagger and Lingua::Stem can be integrated into the get_words function above, transparent to both index and search logic.

When you control the indexing, it is possible to recognize dates and convert them to an internal format for full-text searching, applying similar logic to stemming -- as long as your internal date format is consistent, it doesn't matter how dates are entered by the user or stored in the data. They can be converted to the internal date format when indexed or queried. This is useful if date information is particularly important to your objects.
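
As a sketch, any token that parses as a date could be rewritten to a fixed YYYYMMDD form before indexing and before searching (Date::Manip does the parsing here; the format choice is an assumption):

use Date::Manip;

sub normalise_dates {
  my @words = @_;
  foreach my $word (@words) {
    # rewrite anything ParseDate recognises into the internal format
    if (my $date = ParseDate($word)) {
      $word = UnixDate($date, "%Y%m%d");
    }
  }
  return @words;
}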

References and Further Reading

  • "Designing a Search Engine" by Pete Sergeant. Useful information on coping without an RDBMS and implementing phrase-matching/Boolean searches. (perl.com)
  • "The Windows 2000 Content Index" by Bartosz Milewski. The principles and design of the Windows 2000 content index. (Dr. Dobb's Journal -- registration required.)
  • "Web Site Searching and Indexing in Perl" by Neil Gunton. Indexing websites using MySQL and DBIx::FullTextSearch. (Dr. Dobb's Journal -- registration required.)
  • "Full-text indexing in perl" by Tim Kientzle. A concise introduction to full-text indexing in perl -- an essential read. (Dr. Dobb's Journal -- registration required.)
  • "Building a Vector Space Search Engine in Perl" by Maciej Ceglowski. Covers the Vector-based alternative to Reverse Index searching, as well as pointers on splitting text into words, etc. (perl.com)
  • "How to Avoid Writing Code" by Kake Pugh. Quick introduction to practical use of Class::DBI including how to add a full-text search to an object. (perl.com)

This week on Perl 6, week ending 2003-09-21

Deadlines, I love the sound they make as they fly past.

Those of you who receive this summary via mail may have noticed that this summary is a little late; with any luck it will make up for its tardiness by being inaccurate and badly written as well.

I'm going to reinstitute the tradition of starting with the internals list this week, so without further ado:

Pondering argument passing

Steve Fink pondered passing parameters properly, with a prototyped Perl 6 example. Perl 5 certainly, and Perl 6 possibly, allow you to ignore any function prototype by calling the function like &f(1,2). Steve had concerns about what to do in the presence of 'splatted' argument lists. (Consider

    sub foo(@a, $b) {...}

    @generated_arglist = ([1,2,3], 2);

Calling foo(@generated_arglist) will throw an error, because the function is expecting an array followed by a scalar. However, you can get around that by using the * (splat) operator, which tells Perl to wait until runtime and then treat @generated_arglist as if it were a list of arguments to the function, and check its contents against foo's parameter list. In Perl 5, if you had a prototyped function sub foo(@$) {...} you would work around the problem by calling it as &foo(@generated_arglist), but I have digressed a long way already.)

I'm not entirely sure I got what this thread was about. I've just had a chat with Dan about it on IRC, and I think I'm a little wiser, but the thread that arose from Steve's post gave me the impression of people talking past each other a little.

http://groups.google.com/groups

Feature Freeze for 0.0.11

Steve Fink announced that Parrot was feature frozen in the wee small hours of Monday morning (at least it was Monday morning if you live in GMT; it was still Sunday if you're in Steve's timezone). Everyone set about tidying things up, nailing bugs and generally getting Parrot's house in order for a public release.

http://groups.google.com/groups

Some Parrot people are disgustingly young

In a move calculated to annoy your summarizer, Dan Sugalski pointed out that Piers is now twice the age of Brent Dax. I thought it was bad enough back when I was working with Leon Brocard, who is also disgustingly young, but at least he was old enough to drink beer and vote. (It's very important that you do the former before the latter apparently)

Parrot musical chairs

Some time after the next Parrot release everything's going to get moved around in the parrot directory tree so things make a little more sense. Dan laid out his vision. Leo liked it. Looks like it'll be happening.

http://groups.google.com/groups

Sorting out dependencies

Andy Dougherty has been shaking the parrot tree with different versions of Perl and found some incompatibilities between the jako and perl6 languages and Perl 5.005. A few patches and skipped tests solved that.

http://groups.google.com/groups

Attaching Debuggers

Nicholas Clark passed on the concerns of a nameless interlocutor who had expressed a desire for a Perl debugger which could attach itself to a troubled running process. This is, after all, something that gdb can already do for C programs. Dan punted on details, but thinks it should be possible to implement. It's definitely post 0.0.11 though. Michal Wallace pointed out that Python is very 'hooky' and supports this.

http://groups.google.com/groups

Various fixes

Because most of the week was spent under a feature freeze, the vast majority of this week's threads have been the kind of short lived ``Patch! Applied!'' bugfix threads. Suffice to say, lots of bugs and niggles got stamped on by the usual heroes plus a few more heroes for good measure.

Parrot 0.0.11 ``Doubloon'' Released

Parrot 0.0.11 got released on Saturday, narrowly missing ``International Talk Like a Pirate Day''. It was almost immediately superseded by version 0.0.11.1, which fixed a slight oversight and introduced a new bug. Breaking my 'midnight GMT cutoff' rule, the latest version released is actually 0.0.11.2, which deals with the fact that Parrot isn't set up to handle four-part version numbers.

http://groups.google.com/groups

http://www.cpan.org/authors/id/S/SF/SFINK/parrot-0.0.11.2.tar.gz -- The ``We really can't handle these version numbers'' release

Meanwhile, in perl6-language

Disposable optimization

The increasingly poorly named ``Next Apocalypse'' thread went off into discussing the possibilities of disposable optimizations that could get thrown out when the invariants they depend on no longer hold. This rapidly developed into serious blue-sky stuff that I can't help but think is a little premature. Fascinating, certainly, but I would like a working language before we go off making it do all sorts of clever optimization stuff.

http://groups.google.com/groups

Acknowledgements, Announcements, Apologies

Sorry it's late.

I promised there would be new content at http://www.bofh.org.uk:8080/ last week and, well, I put some photos up. Maybe this week I'll actually write something.

As ever, if you've appreciated this summary, please consider one or more of the following options:

Cooking with Perl, Part 3

Editor's note: In this third and final batch of recipes excerpted from Perl Cookbook, you'll find solutions and code examples for extracting HTML table data, templating with HTML::Mason, and making simple changes to elements or text.

Sample Recipe: Extracting Table Data

Problem

You have data in an HTML table, and you would like to turn that into a Perl data structure. For example, you want to monitor changes to an author's CPAN module list.

Solution

Use the HTML::TableContentParser module from CPAN:

use HTML::TableContentParser;
 
$tcp = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);
 
foreach $table (@$tables) {
  @headers = map { $_->{data} } @{ $table->{headers} };
  # attributes of table tag available as keys in hash
  $table_width = $table->{width};
 
  foreach $row (@{ $table->{rows} }) {
    # attributes of tr tag available as keys in hash
    foreach $col (@{ $row->{cells} }) {
      # attributes of td tag available as keys in hash
      $data = $col->{data};
    }
  }
}

Discussion

The HTML::TableContentParser module converts all tables in the HTML document into a Perl data structure. As with HTML tables, there are three layers of nesting in the data structure: the table, the row, and the data in that row.

Each table, row, and data tag is represented as a hash reference. The hash keys correspond to attributes of the tag that defined that table, row, or cell. In addition, the value for a special key gives the contents of the table, row, or cell. In a table, the value for the rows key is a reference to an array of rows. In a row, the cells key points to an array of cells. In a cell, the data key holds the HTML contents of the data tag.

For example, take the following table:

<table width="100%" bgcolor="#ffffff">
  <tr>
    <td>Larry &amp; Gloria</td>
    <td>Mountain View</td>
    <td>California</td>
  </tr>
  <tr>
    <td><b>Tom</b></td>
    <td>Boulder</td>
    <td>Colorado</td>
  </tr>
  <tr>
    <td>Nathan &amp; Jenine</td>
    <td>Fort Collins</td>
    <td>Colorado</td>
  </tr>
</table>

The parse method returns this data structure:

[
  {
    'width' => '100%',
    'bgcolor' => '#ffffff',
    'rows' => [
               {
                'cells' => [
                            { 'data' => 'Larry &amp; Gloria' },
                            { 'data' => 'Mountain View' },
                            { 'data' => 'California' },
                           ],
                'data' => "\n      "
               },
               {
                'cells' => [
                            { 'data' => '<b>Tom</b>' },
                            { 'data' => 'Boulder' },
                            { 'data' => 'Colorado' },
                           ],
                'data' => "\n      "
               },
               {
                'cells' => [
                            { 'data' => 'Nathan &amp; Jenine' },
                            { 'data' => 'Fort Collins' },
                            { 'data' => 'Colorado' },
                           ],
                'data' => "\n      "
               }
              ]
  }
]

The data tags still contain tags and entities. If you don't want the tags and entities, remove them by hand using techniques from "Extracting or Removing HTML Tags."
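
A quick-and-dirty version of that cleanup, for when you just want plain text out of a cell, might be:

use HTML::Entities qw(decode_entities);

my $text = $col->{data};
$text =~ s/<[^>]*>//g;        # naive tag removal
decode_entities($text);       # turn &amp; back into &, etc.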

Example 20-11 fetches a particular CPAN author's page and displays in plain text the modules they own. You could use this as part of a system that notifies you when your favorite CPAN authors do something new.

Example 20-11: Dump modules for a particular CPAN author

  #!/usr/bin/perl -w
  # dump-cpan-modules-for-author - display modules a CPAN author owns
  use LWP::Simple;
  use URI;
  use HTML::TableContentParser;
  use HTML::Entities;
  use strict;
  our $URL = shift || 'http://search.cpan.org/author/TOMC/';
  my $tables = get_tables($URL);
  my $modules = $tables->[4];    # 5th table holds module data
  foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) = 
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
  } 
  sub get_tables {
    my $URL = shift;
    my $page = get($URL);
    my $tcp = new HTML::TableContentParser;
    return $tcp->parse($page);
  }
  sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);
    # extract cells
    $module_html = $row->{cells}[0]{data};  # link and name in HTML
    $status      = $row->{cells}[1]{data};  # status string and link
    $description = $row->{cells}[2]{data};  # description only
    $status =~ s{<.*?>}{  }g; # naive link removal, works on this simple HTML
    # separate module link and name from html
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL); # resolve relative links
    # clean up entities and tags
    decode_entities($module_name);
    decode_entities($description);
    return ($module_name, $module_link, $status, $description);
  }

See Also

The documentation for the CPAN module HTML::TableContentParser; http://search.cpan.org/

Sample Recipe: Templating with HTML::Mason

Problem

You want to separate presentation (HTML formatting) from logic (Perl code) in your program. Your web site has a lot of components with only slight variations between them. You'd like to abstract out common elements and build your pages from templates without having a lot of "if I'm in this page, then print this; else if I'm in some other page . . . " conditional statements in a single master template.

Solution

Use HTML::Mason components and inheritance.

Discussion

HTML::Mason (also simply called Mason) offers the power of Perl in templates. The basic unit of a web site built with Mason is the component--a file that produces output. The file can be HTML, Perl, or a mixture of both. Components can take arguments and execute arbitrary Perl code. Mason has many features, documented at http://masonhq.com/ and in Embedding Perl in HTML with Mason by Dave Rolsky and Ken Williams (O'Reilly; online at http://masonbook.com/).

Mason works equally well with CGI, mod_perl, and non-web programs. For the purposes of this recipe, however, we look at how to use it with mod_perl. The rest of this recipe contains a few demonstrations to give you a feel for what you can do with Mason and how your site will be constructed. There are more tricks, traps, and techniques for everything we discuss, though, so be sure to visit the web site and read the book for the full story.

Configuration

Install the HTML-Mason distribution from CPAN and add the following to your httpd.conf:

PerlModule HTML::Mason::ApacheHandler
<Location /mason>
  SetHandler perl-script
  PerlHandler HTML::Mason::ApacheHandler
  DefaultType text/html
</Location>

This tells mod_perl that every URL that starts with /mason is handled by Mason. So if you request /mason/hello.html, the file mason/hello.html in your document directory will be compiled and executed as a Mason component. The DefaultType directive lets you omit the .html from component names.

Next create a directory for Mason to cache the compiled components in. Mason does this to speed up execution.

cd $SERVER_ROOT
mkdir mason

Then make a mason directory for components to live in:

cd $DOCUMENT_ROOT
mkdir mason

Now you're ready for "Hello, World". Put this in mason/hello:

Hello, <% ("World", "Puny Human")[rand 2] %>

Restart Apache and load up the mason/hello page. If you reload it, you should see "Hello, World" and "Hello, Puny Human" randomly. If not, look at the Mason FAQ (http://www.masonhq.com/docs/faq/), which answers most commonly encountered problems.

Basic Mason syntax

There are four types of new markup in Mason components: substitutions, Perl code, component calls, and block tags. You saw a substitution in the "Hello World" example: <% ... %> evaluates the contents as Perl code and inserts the result into the surrounding text.

Perl code is marked with a % at the start of the line:

% $now = localtime;   # embedded Perl
This page was generated on <% $now %>.

Because substitutions can be almost any Perl code you like, this could have been written more simply as:

This page was generated on <% scalar localtime %>.

If either of these variations were saved in footer.mas, you could include it simply by saying:

<& footer.mas &>

This is an example of a component call--Mason runs the component and inserts its result into the document that made the call.

Block tags define different regions of your component. <%perl> ... </%perl> identifies Perl code. While % at the start of a line indicates that just that line is Perl code, you can have any number of lines in a <%perl> block.
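
For example, you might collect several lines of setup code in one <%perl> block and then use the results in the body. Here's a minimal sketch; the word list and loop are invented for the illustration:

<%perl>
  my @words = qw(alpha beta gamma);   # any amount of Perl can go here
  my $count = @words;
</%perl>
This component knows about <% $count %> words:
% foreach my $word (@words) {
<% $word %>
% }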

A <%init> ... </%init> block is like an INIT block in Perl. The code in the block is executed before the main body of code. It lets you store definitions, initialization, database connections, etc. at the bottom of your component, where they're out of the way of the main logic.
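
As a rough sketch of the idea, the display logic can sit at the top of the component while the setup lives in an <%init> block at the bottom; the database file and query here are invented for the example:

% foreach my $name (@names) {
<li><% $name %></li>
% }
<%init>
  use DBI;
  my $dbh = DBI->connect("dbi:SQLite:dbname=people.db", "", "",
                         { RaiseError => 1 });
  my @names = map { $_->[0] }
              @{ $dbh->selectall_arrayref("SELECT name FROM people") };
</%init>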

The <%args> ... </%args> block lets you define arguments to your component, optionally with default values. For example, here's greet.mas:

<%args>
   $name => "Larry"
   $town => "Mountain View"
</%args>
Hello, <% $name %>.  How's life in <% $town %>?

Calling it with:

<& greet.mas &>

emits:

Hello, Larry.  How's life in Mountain View?

You can provide options on the component call:

<& greet.mas, name => "Nat", town => "Fort Collins" &>

That emits:

Hello, Nat.  How's life in Fort Collins?

Because there are default values, you can supply only some of the arguments:

<& greet.mas, name => "Bob" &>

That emits:

Hello, Bob.  How's life in Mountain View?

Arguments are also how Mason components access form parameters. Take this form:

<form action="compliment">
  How old are you?  <input type="text" name="age"> <br />
  <input type="submit">
</form>

Here's a compliment component that could take that parameter:

<%args>
  $age
</%args>
Hi.  Are you really <% $age %>?  You don't look it!

Objects

All Mason components have access to a $m variable, which contains an HTML::Mason::Request object. Methods on this object give access to Mason features. For example, you can redirect with:

$m->redirect($URL);

The $r variable is the mod_perl request object, so you have access to the information and functions of Apache from your Mason handlers. For example, you can discover the client's IP address with:

$ip = $r->connection->remote_ip;

Autohandlers

When a page is requested through Mason, Mason can do more than simply execute the code in that page. Mason inspects each directory between the component root and the requested page, looking for components called autohandler. This forms a wrapping chain, with the top-level autohandler at the start of the chain and the requested page at the end. Mason then executes the code at the start of the chain. Each autohandler can say "insert the output of the next component in the chain here."

Imagine a newspaper site. Some parts don't change, regardless of which article you're looking at: the banner at the top, the random selection of ads, the list of sections down the lefthand side. However, the actual article text varies from article to article. Implement this in Mason with a directory structure like this:

/sports
/sports/autohandler
/sports/story1
/sports/story2
/sports/story3

The individual story files contain only the text of each story. The autohandler builds the page (the banner, the ads, the navigation bar), and when it wants to insert the content of the story, it says:

% $m->call_next;

This tells Mason to call the next component in the chain (the story) and insert its output here.

The technique of having a chain of components is called inheritance, and autohandlers aren't the only way to do it. In a component, you can designate a parent with:

<%flags>
  inherit => 'parent.mas'
</%flags>

This lets you have different types of content in the one directory, and each contained component gets to identify its surrounding page (its parent).

Dhandlers

Sometimes it's nice to provide the illusion of a directory full of pages, when in reality they are all dynamically generated. For example, stories kept in a database could be accessed through URLs like:

/sports/1
/sports/2
/sports/3

The Mason way to dynamically generate the pages at these URLs is with a component called dhandler in the sports directory. The dhandler component accesses the name of the missing page (1, 2, or 3 in this case) by calling:

$m->dhandler_arg

You could then use this to retrieve the story from the database and insert it into a page template.
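
A minimal dhandler for the sports directory might look something like this; get_story is a hypothetical database lookup, not part of Mason:

<%init>
  my $story_id = $m->dhandler_arg;       # e.g. "1" for /sports/1
  my $story    = get_story($story_id)    # hypothetical lookup by id
      or $m->decline;                    # let Mason try something else
</%init>
<h1><% $story->{headline} %></h1>
<p><% $story->{body} %></p>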

See Also

Recipe 15.11 in mod_perl Developer's Cookbook; Embedding Perl in HTML with Mason; http://www.masonhq.com/ and http://www.masonbook.com/

Sample Recipe: Making Simple Changes to Elements or Text

Problem

You want to filter some XML. For example, you want to make substitutions in the body of a document, or add a price to every book described in an XML document, or you want to change <book id="1"> to <book> <id>1</id>.

Solution

Use the XML::SAX::Machines module from CPAN:

#!/usr/bin/perl -w
 
use MySAXFilter1;
use MySAXFilter2;
use XML::SAX::ParserFactory;
use XML::SAX::Machines qw(Pipeline);
 
my $machine = Pipeline(MySAXFilter1 => MySAXFilter2); # or more
$machine->parse_uri($FILENAME);

Write a handler, inheriting from XML::SAX::Base as in "Parsing XML into SAX Events," then whenever you need a SAX event, call the appropriate handler in your superclass. For example:

$self->SUPER::start_element($tag_struct);

Discussion

A SAX filter accepts SAX events and triggers new ones. The XML::SAX::Base module detects whether your handler object is called as a filter. If so, the XML::SAX::Base methods pass the SAX events on to the next filter in the chain. If your handler object is not called as a filter, then the XML::SAX::Base methods consume events but do not emit them. This makes it almost as simple to write a filter as it is to write a handler.

The XML::SAX::Machines module chains the filters for you. Import its Pipeline function, then say:

my $machine = Pipeline(Filter1 => Filter2 => Filter3 => Filter4);
$machine->parse_uri($FILENAME);

SAX events triggered by parsing the XML file go to Filter1, which sends possibly different events to Filter2, which in turn sends events to Filter3, and so on to Filter4. The last filter should print or otherwise do something with the incoming SAX events. If you pass a reference to a typeglob, XML::SAX::Machines writes the XML to the filehandle in that typeglob.

Example 22-5 shows a filter that turns the id attribute in book elements from the XML document in Example 22-1 into a new id element. For example, <book id="1"> becomes <book><id>1</id>.

Example 22-5: filters-rewriteids

package RewriteIDs;
# RewriteIDs.pm -- turns "id" attributes into elements
 
use base qw(XML::SAX::Base);
 
my $ID_ATTRIB = "{  }id";   # the attribute hash entry we're interested in
 
sub start_element {
    my ($self, $data) = @_;
 
    if ($data->{Name} eq 'book') {
        my $id = $data->{Attributes}{$ID_ATTRIB}{Value};
        delete $data->{Attributes}{$ID_ATTRIB};
        $self->SUPER::start_element($data);
 
        # make new element parameter data structure for the <id> tag
        my $id_node = {};
        %$id_node = %$data;          # copy the element's data structure
        $id_node->{Name} = 'id';     # more complex if namespaces involved
        $id_node->{Attributes} = {};
 
        # build the <id>$id</id>
        $self->SUPER::start_element($id_node);
        $self->SUPER::characters({ Data => $id });
        $self->SUPER::end_element($id_node);
    } else {
        $self->SUPER::start_element($data);
    }
}
 
1;

Example 22-6 is the stub that uses XML::SAX::Machines to create the pipeline for processing books.xml and print the altered XML.

Example 22-6: filters-rewriteprog

#!/usr/bin/perl -w
# rewrite-ids -- call RewriteIDs SAX filter to turn id attrs into elements
 
use RewriteIDs;
use XML::SAX::Machines qw(:all);
 
my $machine = Pipeline(RewriteIDs => *STDOUT);
$machine->parse_uri("books.xml");

The output of Example 22-6 is as follows (truncated for brevity):

<book><id>1</id>
    <title>Programming Perl</title>
 ...
<book><id>2</id>
    <title>Perl &amp; LWP</title>
 ...

To save the XML to the file new-books.xml, use the XML::SAX::Writer module:

#!/usr/bin/perl -w
 
use RewriteIDs;
use XML::SAX::Machines qw(:all);
use XML::SAX::Writer;
 
my $writer = XML::SAX::Writer->new(Output => "new-books.xml");
my $machine = Pipeline(RewriteIDs => $writer);
$machine->parse_uri("books.xml");

You can also pass a scalar reference as the Output parameter to have the XML appended to the scalar; as an array reference to have the XML appended to the array, one array element per SAX event; or as a filehandle to have the XML printed to that filehandle.
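
For instance, to capture the rewritten XML in a Perl scalar rather than a file, you could pass a scalar reference as the Output parameter (a small sketch under the same assumptions as Example 22-6):

#!/usr/bin/perl -w
 
use RewriteIDs;
use XML::SAX::Machines qw(:all);
use XML::SAX::Writer;
 
my $xml     = '';
my $writer  = XML::SAX::Writer->new(Output => \$xml);  # append to the scalar
my $machine = Pipeline(RewriteIDs => $writer);
$machine->parse_uri("books.xml");
 
print length($xml), " bytes of rewritten XML\n";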

See Also

The documentation for the modules XML::SAX::Machines and XML::SAX::Writer

A Chromosome at a Time with Perl, Part 1

James D. Tisdall is the author of the soon-to-be-released Mastering Perl for Bioinformatics.

For some time now, the use of Perl in biology has been standard practice. Perl remains the most popular language among biologists for a multitude of programming tasks. The same qualities that made Perl a success story among system administrators, and one of the big success stories in the early days of the Web and CGI programming, have also made it the lingua franca of programming in biology, a field known as bioinformatics.

One of the reasons why Perl has been equally well suited to dealing with things like DNA and protein sequence data is that it's so easy to declare and use a string. You just use it, without worrying about allocating memory, or managing memory as the string shrinks or grows. DNA and proteins and other standard biological data are almost always represented in Perl as strings, so this facility with strings translates directly into a facility with DNA and proteins.

For example, say you have a subroutine get_chromosome that returns a string of all the DNA in a human chromosome. In humans, this might be a string about 100Mb in length. This snippet of code calls get_chromosome to initialize a scalar variable, $chromosome1, with the string of DNA sequence data that summarizes human chromosome 1:

$chromosome1 = get_chromosome( 1 );

This programming is as easy as cake. I mean, simple as pie. Well, you know what I mean.

But beneath this wonderful ease of programming lurks a problem. It's a problem that can make your wonderful, intuitive code for tossing around chromosomes and genomes--which looks so elegant in your printout, and which appears so neatly divided into intuitively satisfying, interacting subroutines--an inefficient mess that barely runs at all, when it's not completely clogging up your computer.

So, in this short article I'll show you a handful of tricks that enable you to write code for dealing with large amounts of biological sequence data--in this case, very long strings--while still getting satisfactory speed from the program.

Memory is the Bottleneck

What is the problem, exactly? It usually comes down to this: by dealing with very large strings, each one of which uses a significant portion of the main memory that your computer uses to hold a running program, you can easily overtax the amount of main memory available.

When a program on your computer (a process on your Linux, Unix, or Mac OS X computer) runs out of main memory, its performance starts to seriously degrade. It may try to overcome the lack of fast and efficient main memory by enlisting a portion of disk space to hold the part of the running program that it can no longer fit.

But when a program starts writing and reading to and from hard disk memory it can get awfully slow awfully fast. And depending on the nature of the computation, the program may start "thrashing," that is, repeatedly writing and reading large amounts of data between main memory and hard disk. Your elegant program has turned into a greedy, lazy misanthrope that grabs up all the resources available and then seems to just sit there. You've created a monster!

Take the snippet of code above that calls get_chromosome. Without knowing anything more about the subroutine, it's a pretty good bet that it is fetching the 100Mb of data from somewhere, perhaps a disk file, or a relational database, or a web site. To do so, it must be using at least 100Mb of memory. Then, when it returns the data to be stored in $chromosome1, the program uses another 100Mb of memory. Now, perhaps you want to do a regular expression search on the chromosome, saving the desired expression with parentheses that set the special variables $1, $&, and so on. These special variables can be quite large, and that means use of even more memory by your program.

And since this is elegant, simple code you've written, you may well make other copies of the chromosome data or portions of it, in your tenacious hunt for the elusive cure for a deadly disease. The resulting code may be clear, straightforward to understand, and correct--all good and proper things for code to be--but the amount of string copies will land you in the soup. Not only does copying a large string take up memory, but the actual copying can itself be slow, especially if there's a lot of it.

Space Efficiency

You may need to add a new constraint to your program design when you've got a large amount of data in a running program. The constraint is "Use minimal memory." Often, a program that barely runs at all and takes many hours while clogging up the computer can be rewritten to run in a few minutes by reworking the algorithm so that it uses only a small fraction of the memory.

It's a case of decreasing time by first decreasing space. (Astrophysicists, take note.)

References

There's one easy way to cut down on the number of big strings in a program.

If you need a subroutine to return a large string, as in the get_chromosome subroutine I've used as an example, you can use references to eliminate some of this memory usage.

The practice of passing references to a subroutine is familiar to experienced Perl programmers. In our example, we can rewrite the subroutine so that the return value is placed into a string that is passed in as an argument. But we don't pass a copy of the string--we pass a reference to the string, which takes almost no additional space, and which still enables the subroutine to provide the entire chromosome 1 DNA to the calling program. Here's an example:

load_chromosome( 1, \$chromosome1 );

This new subroutine has two arguments. The 1 presumably will tell the subroutine which human chromosome we want (we want the biggest human chromosome, chromosome 1).

The second argument is a reference to a scalar variable. Inside the subroutine, the reference is most likely used to initialize an argument like this:

my($chromnumber, $chromref) = @_;

And then the DNA data is put into the string by calling it $$chromref, for instance like so:

$$chromref = 'ACGTGTGAACGGA';

No return value is needed. After the subroutine call, the main program will find that the contents of $chromosome1 have changed, and now consist of "ACGTGTGAACGGA." (Of course, a chromosome is much longer than this little fragment.)
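
To make the idea concrete, here is one way load_chromosome might be written; it's only a sketch, and it assumes the DNA for chromosome 1 lives in a FASTA file named chr1.fa. The point is that the sequence is built up directly in the caller's variable through the reference, so no second 100Mb copy is ever made:

sub load_chromosome {
    my($chromnumber, $chromref) = @_;

    my $file = "chr$chromnumber.fa";        # assumed file naming scheme
    open my $fh, '<', $file or die "Can't open $file: $!";

    $$chromref = '';                        # reset the caller's string
    while (my $line = <$fh>) {
        next if $line =~ /^>/;              # skip the FASTA header line
        chomp $line;
        $$chromref .= $line;                # append through the reference
    }
    close $fh;
}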

Using references is also a great way to pass a large amount of data into a subroutine without making copies of it. In this case, however, the fact that the subroutine can change the contents of the referenced data is something to watch out for. Sometimes you just want a subroutine to get to use the data, but you expect the variable containing the data to still have the same data after the subroutine gets a look at it. So you have to watch what you do when you're passing references to data into a subroutine, and make sure you know what you want.

Managing Memory with Buffers

One of the most efficient ways to deal with very large strings is to deal with them a little at a time.

Here's an example of a program for searching an entire chromosome for a particular 12-base pattern, using very little memory. (A base is one of the four molecules that are the principal building blocks of DNA. The four bases are represented in Perl strings as the characters A, C, G, and T. You'll often hear biologists talking about "megabases" instead of "megabytes" in a string. If you hear that, you're probably talking to a bioinformatician.)

When writing a program that will search for any regular expression in a chromosome, it's hard to see how you could avoid putting the whole chromosome in a string. But very often there's a limit to the size of what you're searching for. In this program, I'm looking for the 12-base pattern "ACGTACGTACGT." And I'm going to get the chromosome data from a disk file.

My trick is going to be to just read in the chromosome data a line or two at a time, search for the pattern, and then reuse the memory to read in the next line or two of data.

The extra work I have to do in programming is, first, I need to keep track myself of how much of the data has been read in, so I can report the locations in the chromosome of successful searches. Second, I need to keep aware that my pattern might start at the end of one line and complete at the beginning of the next line, so I need to make sure I search across line breaks as well as within lines of data from the input file.

Here's a small program that reads in a FASTA file given as an argument on the command line and searches for my pattern in any amount of DNA--a whole chromosome, a whole genome, even all known genetic data, just assuming that the data is in a FASTA file named in the command line. I'll call my program find_fragment, and assuming the DNA is in a FASTA file called human.dna, I'll call it like so:

[tisdall@coltrane]$ perl find_fragment human.dna

For testing purposes I made a very short FASTA DNA file, human.dna, which contains:

> human dna--Not!  The fragment ACGTACGTACGT appears at positions 10, 40, and 98
AAAAAAAAAACGTACGTACGTCCGCGCGCGCGCGCGCGCACGTACGTACG
TGGGGGGGGGGGGGGGCCCCCCCCCCGGGGGGGGGGGGAAAAAAAAAACG
TACGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTT

Here's the code for the program find_fragment:

#!/usr/bin/perl

#
# find_fragment : find 'ACGTACGTACGT' in a very large DNA FASTA file 
# using minimal memory
#
#  N.B. This example program does no checking of the input to ensure 
#       that it is DNA data in FASTA format; it just assumes that 
#       it is. This program also assumes there is just one FASTA
#       record in the input file.
#
#  Copyright (c) 2003 James Tisdall
#

use warnings;
use strict;
use Carp;

# Make sure the program is called with one argument, presumably a 
# FASTA file of DNA
my $USAGE = "perl find_fragment file.FASTA";
unless(@ARGV == 1) { croak "$USAGE:$!\n" }

# $fragment: the pattern to search for
# $fraglen:  the length of $fragment
# $buffer:   a buffer to hold the DNA from the input file
# $position: the position of the buffer in the total DNA
my($fragment, $fraglen, $buffer, $position) = ('ACGTACGTACGT', 12, '', 0);

# The first line of a FASTA file is a header and begins with '>'
my $header = <>;

# Get the first line of DNA data, to start the ball rolling
$buffer = <>;
chomp $buffer;

# The remaining lines are DNA data ending with newlines
while(my $newline = <>) {

    # Add the new line to the buffer
    chomp $newline;
    $buffer .= $newline;

    # Search for the DNA fragment, which has a length of 12
    # (Report the character at string position 0 as being at position 1, 
    # as usual in biology)
    while($buffer =~ /$fragment/gi) {
        print "Found $fragment at position ", $position + $-[0] + 1, "\n";
    }

    # Reset the position counter (will be true after you reset the buffer, next)
    $position = $position + length($buffer) - $fraglen + 1;

    # Discard the data in the buffer, except for a portion at the end
    # so patterns that appear across line breaks are not missed
    $buffer = substr($buffer, length($buffer) - $fraglen + 1, $fraglen - 1);
}

Here's the output of running the command perl find_fragment human.dna:

Found ACGTACGTACGT at position 10
Found ACGTACGTACGT at position 40
Found ACGTACGTACGT at position 98

How the Code Works

After the highly recommended use strict and use warnings are turned on, and the Carp module is loaded so the program can "croak" when needed, the program variables are declared and initialized.

The first line of the FASTA file is a header and is not needed here, so it's read and not used. Then the first line of DNA data is read into the buffer and its newline character is removed. I start with this because I want to search for the fragment even if it is broken by new lines, so I'll have to look at least at the first two lines; here I get the first line, and in the while loop that follows I'll start by adding the second line to the buffer.

Then the while loop, which does the main work of the program, starts reading in the next line of the FASTA file named on the command line, in this case the FASTA file human.dna. The newline is removed with "chomp," and the new line is added to the buffer.

Then comes the short while loop that does the regular expression pattern match of the $fragment in the $buffer. It has modifiers "g" for global search (the fragment may appear more than once in the buffer); and "i" for case insensitive search, that is, either uppercase or lowercase DNA data (e.g. ACGT or acgt).

When the fragment is found, the program simply prints out the position. $position holds the position of the beginning of the buffer within the total DNA, and is something I have to keep track of myself. $-[0] is a special variable that gives the offset of the start of the last successful pattern match in the string. I also add 1, because biologists always say that the first base in a sequence of DNA is at position 1, whereas Perl says that the first character in a string is at position 0. So I add 1 to the Perl position to get the biologist's position.
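
As a small illustration of that arithmetic, separate from the program above, consider a string where the fragment starts three characters in:

my $dna = "TTTACGTACGTACGTAAA";
if ($dna =~ /ACGTACGTACGT/) {
    print "Perl offset:          $-[0]\n";            # prints 3
    print "Biologist's position: ", $-[0] + 1, "\n";  # prints 4
}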

The last two lines of code reset the buffer by eliminating the beginning part of it, and then adjust the position counter accordingly. The buffer is shortened so that it just keeps the part at the very end that might be part of a pattern match that crosses over the lines of the input file. This would be the tail part of the buffer that is just one base shorter than the length of the fragment.

In this way, the program keeps at most two lines' worth of DNA in $buffer, but still manages to search the entire genome (or chromosome or whatever is in the FASTA file) for the fragment. It performs very quickly, compared to a program that reads in a whole genome and blows out the memory in the process.

When You Should Bother

A space-inefficient program might well work fine on your better computers, but it won't work well at all when you need to run it on another computer with less main memory installed. Or, it might work fine on the fly genome, but slow to a crawl on the human genome.

The rule of thumb is that if you know you'll be dealing with large data sets, consider the amount of space your program uses as an important constraint when designing and coding. Then you won't have to go back and redo the entire program when a large amount of DNA gets thrown at you.

Editor's note: Stay tuned for part two in this two-part series later this month. In it, James will take a more in-depth look at space efficiency, and include a more general version of a program that uses a buffer. In particular, part two will cover running subroutines with minimal space, eliminating subroutines altogether, and sequence motifs with bounded lengths.


O'Reilly & Associates will soon release (September 2003) Mastering Perl for Bioinformatics.

This week on Perl 6, week ending 2003-09-07

Welcome to the last Perl 6 summary of my 35th year. Next week's summary will (in theory) be written on my 36th birthday (a year of being square, so no change there then). I'll give you fair warning that it might be late, though it probably won't be. Newcastle University has, in its infinite wisdom, decided to have its students enrolling on that day, so Gill will be off up to Newcastle to register, leaving me at home with nothing to do but keep the house tidy in case a buyer wants to come and look at it. Sitting in one place with a laptop writing a summary therefore seems like a good strategy.

As last week's 'world turned upside down' approach of starting with perl6-language was such a success we'll do the same again this week.

The language list gets some traffic. Shock!

Jonadab the Unsightly One replied to Abhijit A. Mahabal's message from the first of August concerning junctions and set theory.

http://groups.google.com/groups

Meanwhile, on perl6-internals

Serialization is Hard!

Last week's discussion of serialization sparked off by Leopold Tötsch's suggestion of a vtable->dump mode really got into its stride this week. It turns out that getting this right is a Hard Problem in the presence of threads.

Dan's plan for serialization involves using the GC's object graph walker to work out what to serialize when you tell Parrot to dump a PMC. Leo worried that this would essentially stop the garbage collector running during serialization which could be tricky if the serialization process tried to allocate any memory.

Dan and Leo ended up in a protracted, but polite, argument about details.

At about 45 entries into the thread, Leo produced a summary of the various options and issues associated with them.

http://groups.google.com/groups

http://groups.google.com/groups -- Leo's summary

File Spec

Leo Tötsch commented on Vladimir Lipskiy's implementation of a File::Spec-like tool for Parrot. (File::Spec is Perl's tool for dealing with filenames and paths in a platform-independent fashion.) Michael Schwern pointed at Ken Williams' "excellent Path::Class module, which gives you actual file and directory objects", which he reckons has a much better interface than File::Spec.

http://groups.google.com/groups

Notifications

Gordon Henriksen posted a great discussion of using notifications to implement weakrefs. Rather wonderfully he used the notification system itself as a good example of why dying object notifications were a good idea.

http://groups.google.com/groups

Parrot 100% GNU .NET

Danger. Here be Licensing Issues. I don't do Licensing issues.

The main thrust of the discussion was what kind of library would ship with Parrot. Dan's answer is worth reading, if only for the ``That's a swamp I don't have enough castles for'' line.

http://groups.google.com/groups

http://groups.google.com/groups -- Dan's take on the library

You are in a maze of keyed variants, all similar

This seems to have been a week in which Dan and Leo spent a good deal of their time politely disagreeing with each other. This time they were disagreeing about the need for all the keyed variants of Parrot's opcodes.

Dan outlined the reasoning behind demanding keyed variants of every operation in a PMC's vtable (Executive summary: A combination of speed and space reasons). Leo still doesn't seem convinced but, for now, Pumpking trumps Patch monster.

http://groups.google.com/groups

Parrot Z-machine

Amir Karger's post from last week about implementing the Z-machine (the VM that runs Infocom and other text adventures) got de-Warnocked this week. Nicholas Clark explained that doing the Z-machine 'properly' would require some bits of Parrot that weren't actually there yet, specifically dynamic opcode loading and dynamic bytecode conversion. This led to a discussion of how to get those things implemented.

http://groups.google.com/groups

PIO Questions

Benjamin Goldberg posted a long list of issues and suggestions about handling character type and encoding on Parrot IO objects. Jürgen Bömels said that there were indeed issues, that he'd be dealing with them as tuits allowed and that patches are welcome.

http://groups.google.com/groups

How to dynamically add a method to a class

Joseph Ryan had asked how to add a method to a class at runtime. Dan explained what was supposed to happen (each class has a 'backing namespace' associated with it which contained all the class's methods). Leo asked for a few details about how that would look in Parrot assembly.

A little later, Joseph reported what appeared to be a bug in the way IMCC handles .namespace. It appears that IMCC is working as designed, the question is whether the design is doing the Right Thing.

http://groups.google.com/groups

http://groups.google.com/groups

Proposed amendment to chartype structure

Peter Gibbs is working on adding support for additional chartypes to Parrot, along with support for dynamic loading of the same. He outlined how he planned to do it. Dan liked the idea and Peter set off to implement it.

http://groups.google.com/groups

Acknowledgements, Announcements, Apologies

First up, a combined apology and announcement. Mitchell Charity nudged me to remind me about Mike Scott's wonderful Getting Started with Parrot Guide/Wiki at http://www.vendian.org/parrot/wiki/bin/view.cgi/Main/GettingStartedWithParrotDevelopment which is wonderful and should be checked out immediately. Bravo Mike, sorry it's taken so long to get round to mentioning it in the summary.

Hopefully next week I'll have some info from the Perl Foundation about their Parrot related grants. Gav Estey gave me the details in an AIM conversation which I foolishly didn't log.

Apologies to everyone for spelling 'seven years and two days' as 'seven and 2 days' last week. I would fire my proofreader, but then there would be nobody to write the summary.

ObLeonBrocard: Leon didn't say anything this week. As per usual.

My weblog has a shiny new URL this week. No new content (yet), but you can admire the old stuff at http://www.bofh.org.uk:8080/.

As ever, if you've appreciated this summary, please consider one or more of the following options:

Cooking with Perl, Part 2

Editor's note: The new edition of Perl Cookbook has released, so this week we continue to highlight recipes--new to the second edition--for your sampling pleasure. This week's excerpts include recipes from Chapter 14 ("Database Access") and Chapter 18 ("Internet Services"). And be sure to check back here next week for more new recipes on extracting table data, making simple changes to elements or text, and templating with HTML::Mason.

Sample Recipe: Using SQL Without a Database Server

Problem

You want to make complex SQL queries but don't want to maintain a relational database server.

Solution

Use the DBD::SQLite module from CPAN:

use DBI;
 
$dbh = DBI->connect("dbi:SQLite:dbname=/Users/gnat/salaries.sqlt", "", "",
                    { RaiseError => 1, AutoCommit => 1 });
 
$dbh->do("UPDATE salaries SET salary = 2 * salary WHERE name = 'Nat'");
 
$sth = $dbh->prepare("SELECT id,deductions FROM salaries WHERE name = 'Nat'");
# ...

Discussion

An SQLite database lives in a single file, specified with the dbname parameter in the DBI constructor. Unlike most relational databases, there's no database server here--DBD::SQLite interacts directly with the file. Multiple processes can read from the same database file at the same time (with SELECTs), but only one process can make changes (and other processes are prevented from reading while those changes are being made).

SQLite supports transactions. That is, you can make a number of changes to different tables, but the updates won't be written to the file until you commit them:

use DBI;
$dbh = DBI->connect("dbi:SQLite:dbname=/Users/gnat/salaries.sqlt", "", "",
                    { RaiseError => 1, AutoCommit => 0 });
eval {
  $dbh->do("INSERT INTO people VALUES (29, 'Nat', 1973)");
  $dbh->do("INSERT INTO people VALUES (30, 'William', 1999)");
  $dbh->do("INSERT INTO father_of VALUES (29, 30)");
  $dbh->commit(  );
};
if ($@) {
      eval { $dbh->rollback(  ) };
      die "Couldn't roll back transaction" if $@;
}

SQLite is a typeless database system. Regardless of the types specified when you created a table, you can put any type (strings, numbers, dates, blobs) into any field. Indeed, you can even create a table without specifying any types:

CREATE TABLE people (id, name, birth_year);

The only time that data typing comes into play is when comparisons occur, either through WHERE clauses or when the database has to sort values. The database ignores the type of the column and looks only at the type of the specific value being compared. Like Perl, SQLite recognizes only strings and numbers. Two numbers are compared as floating-point values, two strings are compared as strings, and a number is always less than a string when values of two different types are compared.
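
A short sketch of what that means in practice, reusing the $dbh handle and the people table from above (the values are made up):

# any type can go into any column: birth_year gets a string here
$dbh->do("INSERT INTO people VALUES (31, 'Ada', 'unknown')");
$dbh->do("INSERT INTO people VALUES (32, 'Bob', 1999)");

# numbers sort before strings, so Bob's 1999 comes back before 'unknown'
my $rows = $dbh->selectall_arrayref(
    "SELECT name, birth_year FROM people ORDER BY birth_year");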

There is only one case when SQLite looks at the type you declare for a column. To get an automatically incrementing column, such as unique identifiers, specify a field of type "INTEGER PRIMARY KEY":

CREATE TABLE people (id INTEGER PRIMARY KEY, name, birth_year);

Example 14-6 shows how this is done.

Example 14-6: ipk

  #!/usr/bin/perl -w
  # ipk - demonstrate integer primary keys
  use DBI;
  use strict;
  my $dbh = DBI->connect("dbi:SQLite:ipk.dat", "", "",
  {RaiseError => 1, AutoCommit => 1});
  # quietly drop the table if it already existed
  eval {
    local $dbh->{PrintError} = 0;
    $dbh->do("DROP TABLE names");
  };
  # (re)create it
  $dbh->do("CREATE TABLE names (id INTEGER PRIMARY KEY, name)");
  # insert values
  foreach my $person (qw(Nat Tom Guido Larry Damian Jon)) {
    $dbh->do("INSERT INTO names VALUES (NULL, '$person')");
  }
  # remove a middle value
  $dbh->do("DELETE FROM names WHERE name='Guido'");
  # add a new value
  $dbh->do("INSERT INTO names VALUES (NULL, 'Dan')");
  # display contents of the table
  my $all = $dbh->selectall_arrayref("SELECT id,name FROM names");
  foreach my $row (@$all) {
    my ($id, $word) = @$row;
    print "$word has id $id\n";
  }

SQLite can hold 8-bit text data, but can't hold an ASCII NUL character (\0). The only workaround is to do your own encoding (for example, URL encoding or Base64) before you store and after you retrieve the data. This is true even of columns declared as BLOBs.
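
A minimal sketch of that workaround using the standard MIME::Base64 module; the blobs table and sample data are invented for the example, and $dbh is a handle like the one from the Solution:

use MIME::Base64 qw(encode_base64 decode_base64);

my $binary_data = "some\0binary\0data";           # contains NUL characters

# storing: encode before the INSERT so embedded NULs survive
my $encoded = encode_base64($binary_data, "");    # "" means no line breaks
$dbh->do("INSERT INTO blobs (name, data) VALUES (?, ?)", undef,
         "example", $encoded);

# retrieving: decode after the SELECT
my ($back) = $dbh->selectrow_array(
    "SELECT data FROM blobs WHERE name = ?", undef, "example");
$back = decode_base64($back);                     # same bytes as $binary_data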

See Also

"Executing an SQL Command Using DBI;" the documentation for the CPAN module DBD::SQLite; the SQLite home page at http://www.hwaci.com/sw/sqlite/

Sample Recipe: Sending Attachments in Mail

Problem

You want to send mail that includes attachments; for example, you want to mail a PDF document.

Solution

Use the MIME::Lite module from CPAN. First, create a MIME::Lite object representing the multipart message:

use MIME::Lite;
 
$msg = MIME::Lite->new(From    => 'sender@example.com',
                       To      => 'recipient@example.com',
                       Subject => 'My photo for the brochure',
                       Type    => 'multipart/mixed');

Then, add content through the attach method:

$msg->attach(Type        => 'image/jpeg',
             Path        => '/Users/gnat/Photoshopped/nat.jpg',
             Filename    => 'gnat-face.jpg');
 
$msg->attach(Type        => 'TEXT',
             Data        => 'I hope you can use this!');

Finally, send the message, optionally specifying how to send it:

$msg->send(  );            # default is to use sendmail(1)
# alternatively
$msg->send('smtp', 'mailserver.example.com');

Discussion

The MIME::Lite module creates and sends mail with MIME-encoded attachments. MIME stands for Multipurpose Internet Mail Extensions, and is the standard way of attaching files and documents. It can't, however, extract attachments from mail messages--for that you need to read Recipe "Extracting Attachments from Mail."

When creating and adding to a MIME::Lite object, pass parameters as a list of named parameter pairs. The pairs convey both mail headers (e.g., From, To, Subject) and options specific to MIME::Lite. In general, mail headers should be given with a trailing colon:

$msg = MIME::Lite->new('X-Song-Playing:' => 'Natchez Trace');

However, MIME::Lite accepts the headers in Table 18-2 without a trailing colon. * indicates a wildcard, so Content-* includes Content-Type and Content-ID but not Dis-Content.

Table 18-2: MIME::Lite headers
Approved, Bcc, Cc, Comments, Content-*, Date, Encrypted, From, Keywords, Message-ID, MIME-Version, Organization, Received, References, Reply-To, Resent-*, Return-Path, Sender, Subject, To, X-*

The full list of MIME::Lite options is given in Table 18-3.

Table 18-3: MIME::Lite options
Data, Datestamp, Disposition, Encoding, FH, Filename, Id, Length, Path, ReadNow, Top, Type

The MIME::Lite options and their values govern what is attached (the data) and how:

Path
The file containing the data to attach.
Filename
The default filename for the reader of the message to save the file as. By default this is the filename from the Path option (if Path was specified).
Data
The data to attach.
Type
The Content-Type of the data to attach.
Disposition
Either inline or attachment. The former indicates that the reader should display the data as part of the message, not as an attachment. The latter indicates that the reader should display an option to decode and save the data. This is, at best, a hint.
FH
An open filehandle from which to read the attachment data.
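
For instance, FH and Filename can be combined to attach data from an already-open filehandle; this is only a sketch, with report.pdf standing in for whatever file you have open, and $msg being a MIME::Lite object like the one in the Solution:

open my $fh, '<', 'report.pdf' or die "Can't open report.pdf: $!";
$msg->attach(Type        => 'application/pdf',
             FH          => $fh,
             Filename    => 'report.pdf',
             Disposition => 'attachment');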

There are several useful content types: TEXT means text/plain, which is the default; BINARY similarly is short for application/octet-stream; multipart/mixed is used for a message that has attachments; application/msword for Microsoft Word files; application/vnd.ms-excel for Microsoft Excel files; application/pdf for PDF files; image/gif, image/jpeg, and image/png for GIF, JPEG, and PNG files, respectively; audio/mpeg for MP3 files; video/mpeg for MPEG movies; video/quicktime for Quicktime (.mov) files.

The only two ways to send the message are using sendmail(1) or using Net::SMTP. Indicate Net::SMTP by calling send with a first argument of "smtp". Remaining arguments are parameters to the Net::SMTP constructor:

# timeout of 30 seconds
$msg->send("smtp", "mail.example.com", Timeout => 30);

If you plan to make more than one MIME::Lite object, be aware that invoking send as a class method changes the default way to send messages:

MIME::Lite->send("smtp", "mail.example.com");
$msg = MIME::Lite->new(%opts);
# ...
$msg->send(  );                   # sends using SMTP

If you're going to process multiple messages, also look into the ReadNow parameter. This specifies that the data for the attachment should be read from the file or filehandle immediately, rather than when the message is sent, written, or converted to a string.

Sending the message isn't the only thing you can do with it. You can get the final message as a string:

$text = $msg->as_string;

The print method writes the string form of the message to a filehandle:

$msg->print($SOME_FILEHANDLE);

Example 18-3 is a program that mails filenames given on the command line as attachments.

Example 18-3: mail-attachment

#!/usr/bin/perl -w
# mail-attachment - send files as attachments
 
use MIME::Lite;
use Getopt::Std;
 
my $SMTP_SERVER = 'smtp.example.com';           # CHANGE ME
my $DEFAULT_SENDER = 'sender@example.com';      # CHANGE ME
my $DEFAULT_RECIPIENT = 'recipient@example.com';# CHANGE ME  
 
MIME::Lite->send('smtp', $SMTP_SERVER, Timeout=>60);
 
my (%o, $msg);
 
# process options
 
getopts('hf:t:s:', \%o);
 
$o{f} ||= $DEFAULT_SENDER;
$o{t} ||= $DEFAULT_RECIPIENT;
$o{s} ||= 'Your binary file, sir';
 
if ($o{h} or !@ARGV) {
    die "usage:\n\t$0 [-h] [-f from] [-t to] [-s subject] file ...\n";
}
 
# construct and send email
 
$msg = new MIME::Lite(
    From => $o{f},
    To   => $o{t},
    Subject => $o{s},
    Data => "Hi",
    Type => "multipart/mixed",
);
 
while (@ARGV) {
  $msg->attach('Type' => 'application/octet-stream',
               'Encoding' => 'base64',
               'Path' => shift @ARGV);
}
 
$msg->send(  );

See Also

The documentation for MIME::Lite


O'Reilly & Associates recently released (August 2003) Perl Cookbook, 2nd Edition.
