Sign In/My Account | View Cart  
advertisement


Listen Print

Adding Search Functionality to Perl Applications
by Aaron Trevena | Pages: 1, 2, 3, 4, 5

The indexing logic would update the reverse index table based on the object's identifier and type. If you are keeping data about your objects/locations in a separate table, then you will need to get the foreign key that joins the tables before updating the reverse index table itself.


sub reindex {
  my ($id,$object) = @_;
 
  # fetch the foreign key if required
  my ($location) = $dbh->selectrow_array("
    select Location_ID 
    from Location 
    where Location_Type = '$type' 
        and Location_Identifier = $id");

  # build new word/score data structure according to object type
  my %words = ();
  foreach my $field ( @{$objectfields{$type}} ) {
    my $text = $object->{$field->{name}};
    my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, lc $text);
    @words    = grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ }
	        grep { length > 1 }         
	        grep { /[a-zA-Z0-9\xc0-\xff]/ } 
	        @words;

    # score the words
    foreach (@words) { 
      next if $stopwords{$_};
      $words{$_} += $field->{weight} 
        unless ($words{$_} > 5 );
    }
  }

  # mark old indexed values and fetch them
  $dbh->do("update ReverseIndex 
            set ReverseIndex_Obselete = 1 
            where Location_ID = $location");
  my @oldvalues = $dbh->selectcol_arrayref("
        select ReverseIndex_Word 
        from ReverseIndex 
        where Location_ID = $location" );

  # update values already present
  my $sth = $dbh->prepare('update ReverseIndex 
    set ReverseIndex_Score = ?, 
    set ReverseIndex_Obselete = 0 
    where Location_ID = ? and ReverseIndex_Word = ?');

  foreach my $word (@oldvalues) {
    next unless ( defined $words{$word} );
    $sth->execute($words{$word},$location,$word);
    delete $words{$word};
  }

  # insert new values
  $sth = $dbh->prepare('
    insert into ReverseIndex 
    (ReverseIndex_Word, ReverseIndex_Score, 
        Location_ID, ReverseIndex_Obselete) 
    values ( :word, :score, :location, 0) ');

  foreach my $word (keys %words) {
    $sth->execute($word,$words{$word},$location);
    delete $words{$word};
  }

  # delete remaining obselete values
  $dbh->do("delete from ReverseIndex 
            where Location_ID = $location 
                and ReverseIndex_Obselete = 1");
}

This process ensures the minimum of interruption to service -- rather than doubling entries or removing all entries for an object while it was being updated. It would be slower than a simple wipe and replace, but for frequently read and updated objects, it would ensure more consistent search results. It would be simple to add object-level locking by checking before the update that no entries for an object are marked obselete; this would stop corruption of the index when multiple processes are updating a single object.

That takes care of indexing; for searching, you can either re-use the mod_perl search handler above, or add a new search method to your class. The following example assumes it is called as $objects = myapp->myclass::search($searchstring);:

sub search {
  my ($class,$searchstring) = @_;

  # remove stoplist words
  . . .
  
  # build SQL
  my $where = 'WHERE ReverseIndex_Word IN (' 
                    . join(',',('?' x scalar @shortlist)) . ') ';
  my $sql = qq{
  SELECT Location_Title as title, 
         sum(ReverseIndex_Score) as score, 
         count(*) as count, 
         Location_Summary as summary, 
         Location_Identifier, 
         Location_URL as url
  FROM   ReverseIndex, Location
  $where
      AND ReverseIndex.Location_ID = Location.Location_ID
  Group By ReverseIndex.Location_ID 
  Order By score DESC, Count DESC };

  my $sth = $dbh->prepare($sql) or warn "couldn't prepare statement !\n";
  $sth->execute(@shortlist);

  # build a list of objects from the search results
  my $objects = [];
  while ( my $result = $sth->fetchrow_hashref ) {
    my $object = myapp::myclass->new(id => $result->{Location_Identifier}));
    push (@$objects, { object  => $object, 
                       score   => $result->{score}, 
                       count   => $result->{count},
                       url     => $result->{url}, 
                       summary => $result->{summary}, 
                       title   => $result->{title} );
  }

  return $objects;
}

Integrating Indexing Into a New Class or Application

Whether building a new application or refactoring an old one, it is worth factoring "searchability" into the design. Designing in searching at the start can save a lot of work later as database schemas and business logic change.

The foundation of your application's searching facilities will be the index. One way to proceed is to have a two-level index, with the normal reverse index table, plus a lookup table that provides additional information about the objects indexed. This lookup table can be used to hold all of the information you wish to display with the results in a single place.

create table ReverseIndex (
 ReverseIndex_Word varchar(64) not null,
 Location_ID int not null,
 ReverseIndex_Score int default 0,
 ReverseIndex_Fields text,
 primary key (ReverseIndex_Word,Location_ID)
)

create table Location (
 Location_ID int primary key auto_increment,
 Location_Type varchar(32),
 Location_Key varchar(32),
 Location_KeyValue int,
 Location_Title varchar(128),
 Location_Summary text,
 Location_URL varchar(255)
)

Any extra information you wish to show in the results -- such as paths, icons, or the objects status -- should be stored in the Location table. If a location has been replaced or should only be visible under certain conditions, you can keep relevant flags in this table and check them when processing the results.

Turning to the code side of searching, a good trick is to use perl's multiple inheritance to make your application's objects inherit from a superclass that contains the search logic. This IndexedObject class provides a fine-grained, incremental update to the index, reindexing individual object attributes as they are changed. If your object only updates the database when the object is synchronized explicitly, then you can either reindex the whole object or keep track of changes and index the fields that have been changed as part of your synchronize method. Here's a sample base class that provides a few reverse indexing methods:


package myapp::classes::IndexedObject;
use strict;

use myapp::libraries::Search;

################
# public methods

sub index_object {
  my $self = shift;
  foreach $field (keys %{$self->{_RIND_fields}} ) {
    $self->index_field($self->,$field,$self->{$field});
  }
}

sub index_field {
  my ($self, $field, $value) = @_;
  return 0 unless $self->{_RIND_fields}{$field};
  my $location = $self->{_RIND_Location};
  my $query = "select * from $self->{table} 
               where Location_ID = ?";
  my $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv = $sth->execute($location);
  my %newwords = ();
  my @newwords = get_words($string);
  foreach my $word (@newwords) {
    next if ($stopwords{$word});
    $newwords{$word} += $self->{_RIND_fields}{$field}{weight};
  }

  while ( my $row = $sth->fetchrow_hashref() ) {
    next unless ($row->{ReverseIndex_Fields} =~ m/'$field'/);
    $self->{locationwords}{$row->{ReverseIndex_Word}} = $row;

    my %fields = ( $row->{ReverseIndex_Fields} =~ m/'(.*?)':([\d.]+)/g );

    if ( exists $newwords{$row->{ReverseIndex_Word}} ) {
      $self->_RIND_UpdateFieldEntry( $row, $field, 
                       $newwords{$row->{ReverseIndex_Word}});
      delete $newwords{$row->{ReverseIndex_Word}};
    } else {
      $self->_RIND_RemoveFieldEntry($row,$field,$lid);
    }
  }

  foreach my $word ( keys %newwords ) {
    $self->_RIND_AddFieldEntry($lid,$word,$newwords{$word},$field);
  }

}

sub delete_location {
  my $self = shift;

  my $query = "delete from ReverseIndex where Location_ID = ?";
  my $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv1 = $sth->execute($self->{_RIND_Location});

  $query = "delete from Location where Location_ID = ?";
  $sth = $self->{_RIND_DBH}->prepare($query);
  my $rv2 = $sth->execute($self->{_RIND_Location});

  return $rv1 + $rv2;
}

sub indexed_fields {
  my ($self,%args) = @_;
  if (keys %args) {
    $self->{_RIND_DBH} = $args{dbh} if defined $args{dbh};
    if defined $args{key} {
      $self->{_RIND_Key} = $args{key};
      ($self->{_RIND_Location}) = 
        $self->{_RIND_DBH}->selectrow_array("
            select Location_ID 
            from Location 
            where Location_Key = '$args{key}' 
            and Location_KeyValue = $self->{$args{key}}");
    }
    if ( defined $args{fields} ) {
      foreach (@{$args{$fields}}) { 
        $self->{_RIND_fields}{$_->{name}} = $_ ; 
      }
    }
  }
  return @{$self->{_RIND_FIELDS}} if wantarray; 
}

#################
# private methods 

sub _RIND_UpdateFieldEntry {
  my ($self,$row, $field,$score) = @_;
  my %fields = ($row->{Location_Fields} =~ m/'(.*?)':([\d.]+)/g);

  # recalculate total score
  my $newscore = ($row->{ReverseIndex_Score} - $fields{$field}) 
                 + $score;
  if ($fields{$field} == $score) {
    # skip if the same
    return 1;
  }

  # update entry, removing field and score from end
  $fields{$field} = $score;
  my $newfields;
  foreach (keys %fields) {
    $newfields .= "'$_':$fields{$_}";
  }
  $self->_RIND_UpdateIndex(
        word      => $row->{ReverseIndex_Word},
        location  => $row->{Location_ID},
        newscore  => $newscore,
        newfields => $newfields);
}

sub _RIND_AddFieldEntry {
  my ($self,$location, $word, $score, $field) = @_;

  # check if record already exists for this location 
  # and update/insert entry
  if (exists $self->{locationwords}{$word}) {
    # recalculate total score
    my $newscore = $self->{locationwords}{$word}{ReverseIndex_Score} 
                  + $score;
    # update entry, appending field and score to end
    my $newfields = $self->{locationwords}{$word}{Location_Fields} 
                    . "'$field':$score";
    $self->_RIND_UpdateIndex(
        word      => $word,
        location  => $location,
        newscore  => $newscore,
        newfields => $newfields);
  } else {
    # insert new entry
    $self->_RIND_UpdateIndex(
        insert   => 1,
        word     => $word,
        location => $location,
        score    => $score,
        fields   =>"'$field':$score");
  }
}

sub _RIND_RemoveFieldEntry {
  my ($self,$row,$field,$lid) = @_;

  # check if record contains other fields
  my %fields = ( $row->{Location_Fields} =~ m/'(.*?)':([\d.]+)/g );
  if ( keys %fields > 1 ) {
    # recalculate total score
    my $newscore = $row->{ReverseIndex_Score} - $fields{$field};
    delete $fields{$field};
    # update entry, removing field and score from end
    my $newfields = '';
    foreach (keys %fields) {
      $newfields .= "'$_':$fields{$_}";
    }
    $self->_RIND_UpdateIndex(
        word      => $row->{ReverseIndex_Word},
        location  => $lid,
        newscore  => $newscore,
        newfields => $newfields
    );
  } else {
    # delete entry
    $self->_RIND_UpdateIndex(
        word      => $row->{ReverseIndex_Word},
        location  => $lid,
        newscore  => $newscore,
        newfields => $newfields,
        delete    => 1
    );
  }
}

sub _RIND_UpdateIndex {
  my ($self,%args) = @_;
  my $query = "update $self->{table} 
               set ReverseIndex_Score = ?, 
                   Location_Fields = ? 
               where ReverseIndex_Word = ? 
                   and Location_ID = ?";

  my @args = ($args{newscore},$args{newfields},$args{word},$args{location});

  SWITCH : {
    if ($args{insert} == 1) {
      $query = "insert into $self->{table} 
                (ReverseIndex_Score, Location_Fields, 
                 ReverseIndex_Word, Location_ID) 
                values (?,?,?,?) "; 
      last;
    }
    if ($args{delete} == 1) {
      $query = "delete from $self->{table} 
                where ReverseIndex_Word = ? 
                    and Location_ID = ?";
      @args = ($args{word},$args{location});
      last;
    }
  } # end of SWITCH
  my $sth = $self->{dbh}->prepare($query);
  my $rv = $sth->execute(@args);
  return;
}

Pages: 1, 2, 3, 4, 5

Next Pagearrow