Adding Search Functionality to Perl Applications
by Aaron Trevena
|
Pages: 1, 2, 3, 4, 5
The indexing logic would update the reverse index table based on the object's identifier and type. If you are keeping data about your objects/locations in a separate table, then you will need to get the foreign key that joins the tables before updating the reverse index table itself.
sub reindex {
my ($id,$object) = @_;
# fetch the foreign key if required
my ($location) = $dbh->selectrow_array("
select Location_ID
from Location
where Location_Type = '$type'
and Location_Identifier = $id");
# build new word/score data structure according to object type
my %words = ();
foreach my $field ( @{$objectfields{$type}} ) {
my $text = $object->{$field->{name}};
my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, lc $text);
@words = grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ }
grep { length > 1 }
grep { /[a-zA-Z0-9\xc0-\xff]/ }
@words;
# score the words
foreach (@words) {
next if $stopwords{$_};
$words{$_} += $field->{weight}
unless ($words{$_} > 5 );
}
}
# mark old indexed values and fetch them
$dbh->do("update ReverseIndex
set ReverseIndex_Obselete = 1
where Location_ID = $location");
my @oldvalues = $dbh->selectcol_arrayref("
select ReverseIndex_Word
from ReverseIndex
where Location_ID = $location" );
# update values already present
my $sth = $dbh->prepare('update ReverseIndex
set ReverseIndex_Score = ?,
set ReverseIndex_Obselete = 0
where Location_ID = ? and ReverseIndex_Word = ?');
foreach my $word (@oldvalues) {
next unless ( defined $words{$word} );
$sth->execute($words{$word},$location,$word);
delete $words{$word};
}
# insert new values
$sth = $dbh->prepare('
insert into ReverseIndex
(ReverseIndex_Word, ReverseIndex_Score,
Location_ID, ReverseIndex_Obselete)
values ( :word, :score, :location, 0) ');
foreach my $word (keys %words) {
$sth->execute($word,$words{$word},$location);
delete $words{$word};
}
# delete remaining obselete values
$dbh->do("delete from ReverseIndex
where Location_ID = $location
and ReverseIndex_Obselete = 1");
}
This process ensures the minimum of interruption to service -- rather than doubling entries or removing all entries for an object while it was being updated. It would be slower than a simple wipe and replace, but for frequently read and updated objects, it would ensure more consistent search results. It would be simple to add object-level locking by checking before the update that no entries for an object are marked obselete; this would stop corruption of the index when multiple processes are updating a single object.
That takes care of indexing; for searching, you can either re-use the
mod_perl search handler above, or add a new search method to your class.
The following example assumes it is called as $objects = myapp->myclass::search($searchstring);:
sub search {
my ($class,$searchstring) = @_;
# remove stoplist words
. . .
# build SQL
my $where = 'WHERE ReverseIndex_Word IN ('
. join(',',('?' x scalar @shortlist)) . ') ';
my $sql = qq{
SELECT Location_Title as title,
sum(ReverseIndex_Score) as score,
count(*) as count,
Location_Summary as summary,
Location_Identifier,
Location_URL as url
FROM ReverseIndex, Location
$where
AND ReverseIndex.Location_ID = Location.Location_ID
Group By ReverseIndex.Location_ID
Order By score DESC, Count DESC };
my $sth = $dbh->prepare($sql) or warn "couldn't prepare statement !\n";
$sth->execute(@shortlist);
# build a list of objects from the search results
my $objects = [];
while ( my $result = $sth->fetchrow_hashref ) {
my $object = myapp::myclass->new(id => $result->{Location_Identifier}));
push (@$objects, { object => $object,
score => $result->{score},
count => $result->{count},
url => $result->{url},
summary => $result->{summary},
title => $result->{title} );
}
return $objects;
}
Integrating Indexing Into a New Class or Application
Whether building a new application or refactoring an old one, it is worth factoring "searchability" into the design. Designing in searching at the start can save a lot of work later as database schemas and business logic change.
The foundation of your application's searching facilities will be the index. One way to proceed is to have a two-level index, with the normal reverse index table, plus a lookup table that provides additional information about the objects indexed. This lookup table can be used to hold all of the information you wish to display with the results in a single place.
create table ReverseIndex (
ReverseIndex_Word varchar(64) not null,
Location_ID int not null,
ReverseIndex_Score int default 0,
ReverseIndex_Fields text,
primary key (ReverseIndex_Word,Location_ID)
)
create table Location (
Location_ID int primary key auto_increment,
Location_Type varchar(32),
Location_Key varchar(32),
Location_KeyValue int,
Location_Title varchar(128),
Location_Summary text,
Location_URL varchar(255)
)
Any extra information you wish to show in the results -- such as paths,
icons, or the objects status -- should be stored in the Location table.
If a location has been replaced or should only be visible under certain
conditions, you can keep relevant flags in this table and check them
when processing the results.
Turning to the code side of searching, a good trick is to use perl's
multiple inheritance to make your application's objects inherit from
a superclass that contains the search logic. This IndexedObject class
provides a fine-grained, incremental update to the index, reindexing
individual object attributes as they are changed. If your object only
updates the database when the object is synchronized explicitly, then
you can either reindex the whole object or keep track of changes and
index the fields that have been changed as part of your synchronize
method. Here's a sample base class that provides a few reverse indexing
methods:
package myapp::classes::IndexedObject;
use strict;
use myapp::libraries::Search;
################
# public methods
sub index_object {
my $self = shift;
foreach $field (keys %{$self->{_RIND_fields}} ) {
$self->index_field($self->,$field,$self->{$field});
}
}
sub index_field {
my ($self, $field, $value) = @_;
return 0 unless $self->{_RIND_fields}{$field};
my $location = $self->{_RIND_Location};
my $query = "select * from $self->{table}
where Location_ID = ?";
my $sth = $self->{_RIND_DBH}->prepare($query);
my $rv = $sth->execute($location);
my %newwords = ();
my @newwords = get_words($string);
foreach my $word (@newwords) {
next if ($stopwords{$word});
$newwords{$word} += $self->{_RIND_fields}{$field}{weight};
}
while ( my $row = $sth->fetchrow_hashref() ) {
next unless ($row->{ReverseIndex_Fields} =~ m/'$field'/);
$self->{locationwords}{$row->{ReverseIndex_Word}} = $row;
my %fields = ( $row->{ReverseIndex_Fields} =~ m/'(.*?)':([\d.]+)/g );
if ( exists $newwords{$row->{ReverseIndex_Word}} ) {
$self->_RIND_UpdateFieldEntry( $row, $field,
$newwords{$row->{ReverseIndex_Word}});
delete $newwords{$row->{ReverseIndex_Word}};
} else {
$self->_RIND_RemoveFieldEntry($row,$field,$lid);
}
}
foreach my $word ( keys %newwords ) {
$self->_RIND_AddFieldEntry($lid,$word,$newwords{$word},$field);
}
}
sub delete_location {
my $self = shift;
my $query = "delete from ReverseIndex where Location_ID = ?";
my $sth = $self->{_RIND_DBH}->prepare($query);
my $rv1 = $sth->execute($self->{_RIND_Location});
$query = "delete from Location where Location_ID = ?";
$sth = $self->{_RIND_DBH}->prepare($query);
my $rv2 = $sth->execute($self->{_RIND_Location});
return $rv1 + $rv2;
}
sub indexed_fields {
my ($self,%args) = @_;
if (keys %args) {
$self->{_RIND_DBH} = $args{dbh} if defined $args{dbh};
if defined $args{key} {
$self->{_RIND_Key} = $args{key};
($self->{_RIND_Location}) =
$self->{_RIND_DBH}->selectrow_array("
select Location_ID
from Location
where Location_Key = '$args{key}'
and Location_KeyValue = $self->{$args{key}}");
}
if ( defined $args{fields} ) {
foreach (@{$args{$fields}}) {
$self->{_RIND_fields}{$_->{name}} = $_ ;
}
}
}
return @{$self->{_RIND_FIELDS}} if wantarray;
}
#################
# private methods
sub _RIND_UpdateFieldEntry {
my ($self,$row, $field,$score) = @_;
my %fields = ($row->{Location_Fields} =~ m/'(.*?)':([\d.]+)/g);
# recalculate total score
my $newscore = ($row->{ReverseIndex_Score} - $fields{$field})
+ $score;
if ($fields{$field} == $score) {
# skip if the same
return 1;
}
# update entry, removing field and score from end
$fields{$field} = $score;
my $newfields;
foreach (keys %fields) {
$newfields .= "'$_':$fields{$_}";
}
$self->_RIND_UpdateIndex(
word => $row->{ReverseIndex_Word},
location => $row->{Location_ID},
newscore => $newscore,
newfields => $newfields);
}
sub _RIND_AddFieldEntry {
my ($self,$location, $word, $score, $field) = @_;
# check if record already exists for this location
# and update/insert entry
if (exists $self->{locationwords}{$word}) {
# recalculate total score
my $newscore = $self->{locationwords}{$word}{ReverseIndex_Score}
+ $score;
# update entry, appending field and score to end
my $newfields = $self->{locationwords}{$word}{Location_Fields}
. "'$field':$score";
$self->_RIND_UpdateIndex(
word => $word,
location => $location,
newscore => $newscore,
newfields => $newfields);
} else {
# insert new entry
$self->_RIND_UpdateIndex(
insert => 1,
word => $word,
location => $location,
score => $score,
fields =>"'$field':$score");
}
}
sub _RIND_RemoveFieldEntry {
my ($self,$row,$field,$lid) = @_;
# check if record contains other fields
my %fields = ( $row->{Location_Fields} =~ m/'(.*?)':([\d.]+)/g );
if ( keys %fields > 1 ) {
# recalculate total score
my $newscore = $row->{ReverseIndex_Score} - $fields{$field};
delete $fields{$field};
# update entry, removing field and score from end
my $newfields = '';
foreach (keys %fields) {
$newfields .= "'$_':$fields{$_}";
}
$self->_RIND_UpdateIndex(
word => $row->{ReverseIndex_Word},
location => $lid,
newscore => $newscore,
newfields => $newfields
);
} else {
# delete entry
$self->_RIND_UpdateIndex(
word => $row->{ReverseIndex_Word},
location => $lid,
newscore => $newscore,
newfields => $newfields,
delete => 1
);
}
}
sub _RIND_UpdateIndex {
my ($self,%args) = @_;
my $query = "update $self->{table}
set ReverseIndex_Score = ?,
Location_Fields = ?
where ReverseIndex_Word = ?
and Location_ID = ?";
my @args = ($args{newscore},$args{newfields},$args{word},$args{location});
SWITCH : {
if ($args{insert} == 1) {
$query = "insert into $self->{table}
(ReverseIndex_Score, Location_Fields,
ReverseIndex_Word, Location_ID)
values (?,?,?,?) ";
last;
}
if ($args{delete} == 1) {
$query = "delete from $self->{table}
where ReverseIndex_Word = ?
and Location_ID = ?";
@args = ($args{word},$args{location});
last;
}
} # end of SWITCH
my $sth = $self->{dbh}->prepare($query);
my $rv = $sth->execute(@args);
return;
}

