Sign In/My Account | View Cart  
advertisement


Listen Print

Adding Search Functionality to Perl Applications
by Aaron Trevena | Pages: 1, 2, 3, 4, 5

The indexing script just reads records from the application table and populates the reverse index and locations tables. This example uses a simple local weighting based on the fields and doesn't check if locations already exist. (If you store your application data in XML, it should still be relatively simple to change the code fetching data from the database to pull from XML files instead.)

# give fields weighting
my %weight = ( Pub_Name => 1, 
               Pub_Type => 0.5, 
               Pub_Description => 0.8, 
               Brewery_Name=> 0.6, 
               Town_Name => 0.4);

# fetch records from the Pub table, joined against Town and Brewery
my $sth = $dbh->prepare('select 
    Pub_ID, Pub_Name, Pub_Type, Pub_Description, 
    Brewery_Name, Town_Name 
    from Pubs, Brewerys, Towns 
    where Towns.Town_ID = Pubs.Town_ID and 
          Brewerys.Brewery_ID = Pubs.Brewery_ID');
my $rv = $sth->execute();

while (my $record = $sth->fetchrow_hashref() ) {
  my %words = ();
  # create the location
  # index the record
  my $location_sth = $dbh->prepare_cached("
    insert into Location 
        (Location_Title, Location_Type, Location_Key, 
        Location_Identifier, Location_URL, 
        Location_Summary ) 
    values (?,'Pub','Pub_ID',?,'/pub.cgi', ?) 
  ");
  (my $summary) = $record->{Pub_Description} =~ /(\w+\s+){0,20}/;
  my $rv = $location_sth->execute(
        $record->{Pub_Name}, $record->{Pub_ID}, "$summary .."
  );
  my $location = $dbh->{mysql_insertid};

  # short stop-word dictionary
  my @stopwords = qw/a i you he she they the it is as to 
                     not on or no/;
  my %stopwords; @stopwords{@stopwords} = @stopwords;

  # delete all old words for this location
  my $deleted = $dbh->do("
    delete from ReverseIndex 
    where Location_ID = $location");

  foreach my $field ( keys %$record ) {
    # split the words out of the field ( from Tim Kientzle's article )
    my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, 
                        lc $record->{$field});

                # Strip leading punct
    @words    = grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ } 
                # Must be longer than one character
                grep { length > 1 }	
                # must have an alphanumeric
                grep { /[a-zA-Z0-9\xc0-\xff]/ }	@words;

    # score the words
    foreach (@words) { 
      next if $stopwords{$_};
      $words{$_} += $weight{$field} 
        unless defined $words{$_} && $words{$_} > 5;
    }
  }

  # insert the new words into the index
  my $sth = $dbh->prepare('
 insert into ReverseIndex 
        (ReverseIndex_Word, ReverseIndex_Score, Location_ID) 
 values (?,?,? ) ');
  foreach my $word (keys %words) {
    my $rv = $sth->execute($word,$words{$word},$location);
  }
}

The search handler only really needs a simple SQL query with an autogenerated clause, and a short dictionary of stop words (see above).

package Search::Pub;

use strict;
use DBI;
use Template;

use Apache;
use Apache::Constants;

my %stopwords;
@stopwords{(qw(a i at be do to or is not no the 
               that they then these them who where 
               why can find on an of and it by))} = 1 x 27;

my $dbh = DBI->connect("DBI:mysql:pubs:host", 'username', 'password');

# initialise and create template
my $config = {
    INCLUDE_PATH => '/search/path',  # or list ref
    POST_CHOMP   => 1,               # cleanup whitespace
};
my $template_file = 'results.tt';
my $template = Template->new($config);

sub handler {
  my $r = shift;

  # remove stoplist words
  my @wordlist  = split(/\s+/,$r->param('searchstring'));
  my @shortlist = ();
  my @arglist   = ();

  foreach my $word (@wordlist) {
    next if ($stopwords{$word});
    push(@shortlist, $word);
    push(@arglist,'?')
  }
  
  # build SQL
  my $where = 'WHERE ReverseIndex_Word IN (' 
              . join(',',@arglist) . ') ';
  my $sql = qq{
  SELECT Location_Title, 
         sum(ReverseIndex_Score) as score, 
         count(*) as count, 
         Location_Summary, Location_ID, Location_Key, 
         Location_Identifier, Location_URL
  FROM   ReverseIndex, Location
  $where
  AND ReverseIndex.Location_ID = Location.Location_ID
  Group By ReverseIndex.Location_ID 
  Order By score DESC, Count DESC };

  my $sth = $dbh->prepare($sql) 
    or $r->warn("couldn't prepare statement!\n");
  $sth->execute(@shortlist);
  my $results = $sth->fetchall_arrayref();

  # populate template variables
  my $template_vars = { foo=>'bar'};
  $template_vars->{Results} = $results;
  $template_vars->{Words} = @shortlist;

  # output page
  $r->send_http_header('TEXT/HTML');
  $template->process($template_file, $template_vars, $r) 
    || die $template->error();

  return OK;
}

Modifying Your Objects to Use a Trigger

An easy way to add indexing to objects is to use Class::Trigger. A good example of this can be seen in Kake Pugh's article, where she combines Search::InvertedIndex with Class::DBI's trigger support, or in Tony Bowden's Class::DBI::mysql::FullTextSearch.

Using triggers provides an alternative to hacking calls to reindex the object in each place it is updated. More importantly, it allows you to separate the indexing from the object to another piece of code -- possibly implementing the update outside of the object, in the application, or even in a mod_perl cleanup handler (tricky, but possible).

If you already have objects to which you wish to add indexing, then using triggers means minimal changes to your objects and therefore less regression testing -- allowing you to get on with adding your indexing and searching code to your application. You can also build objects that index themselves with minimal code. For example:


 use Class::Trigger

 . . .

 sub set_foo {
      my ($self,$new_value) = @_;

      . . .

      $self->call_trigger('after_update');

      . . .
 }

You could then add a trigger elsewhere in the application, or possibly in a subclass, that would call your indexing logic:


 $object->add_trigger( after_update => sub {
      my $self = shift;
      # logic to index object
      reindex($self->{id},$self); # pass identifier and object
      return;
  } );

Pages: 1, 2, 3, 4, 5

Next Pagearrow