Adding Search Functionality to Perl Applications
by Aaron Trevena
The code that extracts words from objects and search queries has to be the same, so it is a good candidate for putting into a separate library; this also helps make the code more manageable.
    package myapp::libraries::Search;
    use strict;
    require Exporter;
    our @ISA = qw(Exporter);
    our @EXPORT = qw(%stopwords &get_words);

    # Stop words: common words that are never indexed.
    # Declared with "our" so that Exporter can export the hash.
    our %stopwords;
    @stopwords{(qw(a i at be do to or is not no the that they
        then these them who where why can find on an of and it by))} = (1) x 27;

    sub get_words {
        my $text = shift;
        # Split text into an array of words
        my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, lc $text);
        @words =
            # Strip leading punctuation
            grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ }
            # Must be longer than one character
            grep { length($_) > 1 }
            # Must contain an alphanumeric character
            grep { /[a-zA-Z0-9\xc0-\xff]/ } @words;
        return @words;
    }

    1;
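As a rough illustration of why sharing the routine matters, the sketch below runs the same get_words over a piece of object text and over a search query, then drops stop words from both. get_words is repeated inline only so that the snippet runs on its own; the sample strings and the index_terms helper are made up for the example.

```perl
use strict;
use warnings;

# Same stop words as in the library above.
my %stopwords = map { $_ => 1 } qw(a i at be do to or is not no the that they
    then these them who where why can find on an of and it by);

# Inline copy of get_words, so the sketch is self-contained.
sub get_words {
    my $text = shift;
    my @words = split(/[^a-zA-Z0-9\xc0-\xff\+\/\_\-]+/, lc $text);
    return grep { s/^[^a-zA-Z0-9\xc0-\xff\_\-]+//; $_ }
           grep { length($_) > 1 }
           grep { /[a-zA-Z0-9\xc0-\xff]/ } @words;
}

# Hypothetical helper: the words that would actually reach the index.
sub index_terms {
    my $text = shift;
    return grep { !$stopwords{$_} } get_words($text);
}

my @from_object = index_terms('The Real-Ale Pub, at 21 High St.');
my @from_query  = index_terms('real-ale pub');

print "object: @from_object\n";
print "query:  @from_query\n";
```

Because both sides pass through the same routine, a hyphenated term such as "real-ale" comes out identically in the index and in the query, so the two can be matched by simple string equality.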
Your own objects can then inherit the index and search methods from the superclass and provide their own logic to manage how metadata is stored.
    package myapp::classes::Pub;
    use strict;
    our @ISA = qw(myapp::classes::IndexedObject
                  myapp::classes::DatabaseObject);

    sub new {
        . . .
        $self->indexed_fields(
            dbh=>$self->get_dbh, key=>'Pub_ID',
            fields=>[
                { name=>'Pub_Name', weight=>1},
                . . .
            ],
        );
        return $self;
    }

    sub create {
        my ($class,%args) = @_;
        my $self = $class->_new();
        $self->_initialise_from_values(%args);
        $self->create_location(%args);
        $self->index_object();
        return $self;
    }

    sub load {
        my ($class,%args) = @_;
        my $self = $class->_new();
        $self->_initialise_from_db(%args);
        return $self;
    }

    sub update {
        my ($self, $field, $value) = @_;
        $self->{$field} = $value;
        $self->execute("update Pubs
                        set $field = ?
                        where Pub_ID = ?",
                       $value, $self->{Pub_ID});
        $self->IndexField($self->{Pub_ID},$field,$value);
        return 1;
    }
    sub delete {
        my $self = shift;
        $self->delete_location();
        $self->execute("delete from Pubs
                        where Pub_ID = ?", $self->{Pub_ID});
    }
Adding lookups and replacements to your objects' indexing logic can be
pretty painless. Here's the data that gets passed to indexed_fields
for a Pub object.
    fields=>[
        { name   => 'Pub_Name', weight => 1 },
        { name   => 'Brewery_Name',
          weight => '0.4',
          lookup => 'Brewery_ID',
          table  => 'Brewery' },
        { name    => 'Pub_IsCAMRA',
          weight  => '0.6',
          replace => 'CAMRA Real Ale' }
    ],
    table=>'Pub',
The hard work can be done in the superclass, updating the index_fields
method to do lookups and replacements.
    sub index_fields {
        my ($self, $field, $value) = @_;
        return 0 unless $self->{_RIND_fields}{$field};
        my $location = $self->{_RIND_Location};
        # Double quotes, so that the table name is interpolated
        my $query = "select * from $self->{table} where Location_ID = ?";
        my $sth = $self->{_RIND_DBH}->prepare($query);
        my $rv = $sth->execute($location);
        my (@words, %newwords);
        if ( defined $self->{_RIND_fields}{$field}{replace} ) {
            # Index the replacement text rather than the raw value
            @words = get_words($self->{_RIND_fields}{$field}{replace});
        } elsif ( defined $self->{_RIND_fields}{$field}{lookup} ) {
            # Index the text that the foreign key refers to
            my $column = $self->{_RIND_fields}{$field}{lookup};
            my $table = $self->{_RIND_fields}{$field}{table};
            my $words = $self->{_RIND_DBH}->selectrow_array("
                select $field
                from $table
                where $table.$column = $self->{table}.$column ");
            @words = get_words($words);
        } else {
            warn "this is just a normal field\n";
            @words = get_words($value);
        }
        # Score each new word by the weight of the field it came from
        foreach my $word (@words) {
            next if ($stopwords{$word});
            $newwords{$word} += $self->{_RIND_fields}{$field}{weight};
        }
        # Update or remove the entries already held for this location
        while ( my $row = $sth->fetchrow_hashref() ) {
            $self->{locationwords}{$row->{ReverseIndex_Word}} = $row;
            next unless ($row->{ReverseIndex_Fields} =~ m/'\Q$field\E'/);
            my %fields = ( $row->{ReverseIndex_Fields} =~ m/'(.*?)':([\d.]+)/g );
            if ( exists $newwords{$row->{ReverseIndex_Word}} ) {
                $self->_RIND_UpdateFieldEntry($row,$field,
                    $newwords{$row->{ReverseIndex_Word}});
                delete $newwords{$row->{ReverseIndex_Word}};
            } else {
                $self->_RIND_RemoveFieldEntry($row,$field,$location);
            }
        }
        # Whatever is left has no entry yet, so add it
        foreach my $word ( keys %newwords ) {
            $self->_RIND_AddFieldEntry($location,$word,$newwords{$word},$field);
        }
    }
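To make the weighting step concrete, here is a minimal, database-free sketch of how the three kinds of field (plain, lookup, and replace) could turn into per-word scores. The %breweries hash stands in for the Brewery table, the sample record is invented, and words_for_field and the simplified tokenizer are hypothetical helpers rather than part of the superclass.

```perl
use strict;
use warnings;

my %stopwords = map { $_ => 1 } qw(a the and of);   # shortened list for the example

# Simplified stand-in for get_words.
sub get_words {
    my $text = shift;
    return grep { length($_) > 1 } split /[^a-z0-9]+/, lc $text;
}

my @fields = (
    { name => 'Pub_Name',     weight => 1 },
    { name => 'Brewery_Name', weight => 0.4, lookup  => 'Brewery_ID' },
    { name => 'Pub_IsCAMRA',  weight => 0.6, replace => 'CAMRA Real Ale' },
);

# Invented data: an in-memory "Brewery table" and one pub record.
my %breweries = ( 7 => 'Fullers Real Ale Brewery' );
my %pub = ( Pub_Name => 'The Ale House', Brewery_ID => 7, Pub_IsCAMRA => 1 );

# Pick the text to index for one field definition.
sub words_for_field {
    my ($def, $record) = @_;
    return get_words($def->{replace}) if defined $def->{replace};
    return get_words($breweries{ $record->{ $def->{lookup} } })
        if defined $def->{lookup};
    return get_words($record->{ $def->{name} });
}

# Accumulate weights: a word found in several fields scores higher.
my %score;
for my $def (@fields) {
    for my $word ( words_for_field($def, \%pub) ) {
        next if $stopwords{$word};
        $score{$word} += $def->{weight};
    }
}

printf "%-8s %.1f\n", $_, $score{$_} for sort keys %score;
```

In this sample, "ale" appears in the pub name, the looked-up brewery name, and the replacement text, so it collects all three weights, while "the" is dropped as a stop word.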
The problem with lookups is that a change to one object can leave the index entries of other objects out of date: if a brewery is renamed, every pub that indexes Brewery_Name through a lookup is affected. To avoid this, the object being changed has to check which other objects depend on it and have them re-indexed.
If you store the indexed fields in the database, it's possible to only check those object types affected with two queries: the first query will get the object types that index the changed field, and the second will update the affected records, joining as per the original lookup. An alternative to keeping the indexed fields in the database would be to keep the indexing information in an XML file -- such a file could also contain configuration options that the search system could check, such as whether to use stemming, ranges for grades, and so on.
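The first of those two queries can be pictured without a database. In the sketch below, the @index_metadata list is a hypothetical stand-in for the metadata table (one row per indexed field), and types_affected_by is an invented helper that returns the object types whose index depends on a changed column. The second step, re-indexing the matching records, is left out.

```perl
use strict;
use warnings;

# Hypothetical metadata: which object type indexes which field,
# and through which lookup (if any).
my @index_metadata = (
    { type => 'Pub',      field => 'Pub_Name',      weight => 1 },
    { type => 'Pub',      field => 'Brewery_Name',  weight => 0.4,
      lookup => 'Brewery_ID', table => 'Brewery' },
    { type => 'Beer',     field => 'Brewery_Name',  weight => 0.5,
      lookup => 'Brewery_ID', table => 'Brewery' },
    { type => 'Festival', field => 'Festival_Name', weight => 1 },
);

# Object types that look up $field in $table, and so need re-indexing
# when that column changes.
sub types_affected_by {
    my ($table, $field) = @_;
    my %types;
    for my $row (@index_metadata) {
        next unless defined $row->{lookup};
        next unless $row->{table} eq $table && $row->{field} eq $field;
        $types{ $row->{type} } = 1;
    }
    return sort keys %types;
}

my @affected = types_affected_by('Brewery', 'Brewery_Name');
print "re-index: @affected\n";
```

Only the object types returned here need the second, more expensive query, so a change to a brewery never touches the index entries of unrelated types such as Festival.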
The two-level solution we discussed with the additional metadata table lets us store data about which object attributes are indexed and how, and it also allows for easy reporting. Additionally, we can control the indexing process purely by updating the database or XML, without having to modify the codebase at all.
Normalizing and Global Weighting
Normalizing scores within the reverse index ensures that all scores are within constrained limits, making them much easier to interpret and use in your application. How you normalize the scores depends on both the data you have indexed and how it will be searched. A common scenario is that the index breaks down into three groups of words.
- A small number of high-scoring words, with relatively low frequency. These words are usually rare across the data set, but appear frequently in a small number of objects.
- Some middle-scoring words with a high frequency across the index. These words are common across the whole data set.
- A large number of low-scoring words with low frequency. These words occur rarely in the data set and rarely in any object.
A simple way to normalize scores, while at the same time narrowing the gap between high-scoring and low-scoring words, is to use the sine curve to reshape the distribution of scores.
This graph shows the part of the sine curve we are using: the flat top
reduces the impact of outlying high scores, and every score is translated
into a value between 0 and 1. In this case, the maximum is assumed to be
10. The normalise function shown here can be added to the myapp::libraries::Search
module and called from IndexedObject's indexing methods.
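    use Math::Trig;    # provides pi
    my $max = 10;      # assumed maximum score
    sub normalise {
        my $score = shift;
        $score = $max if $score > $max;    # clamp outliers to the top of the curve
        # Map 0..$max onto the first quarter of the sine wave (0 to pi/2)
        return sin(($score / $max) * (pi / 2));
    }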
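For a feel of what a quarter sine wave does to raw scores, the following sketch maps scores between 0 and an assumed maximum of 10 onto the range 0 to 1. It assumes the intended mapping is sin((score / max) * pi/2) and clamps anything outside the range; the name normalise_sine and the clamping are choices made for this example.

```perl
use strict;
use warnings;
use Math::Trig;    # provides pi

my $max = 10;      # assumed maximum score, as in the text

sub normalise_sine {
    my $score = shift;
    # Clamp to the range the curve is defined for.
    $score = 0    if $score < 0;
    $score = $max if $score > $max;
    # First quarter of the sine wave: steep at the bottom, flat at the top.
    return sin( ($score / $max) * (pi / 2) );
}

printf "%2d -> %.3f\n", $_, normalise_sine($_) for (0, 2, 5, 8, 10, 15);
```

A score of half the maximum already maps to about 0.71, which is how the curve narrows the gap between middling and very high scores.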
If your data (scores by frequency) follow more of a bell curve, with a small number of low-scoring words, many middle-scoring words and a few high-scoring words, you would want to normalize using a companding function such as mu-law or A-law. This example uses a similar S-shaped curve based on tanh: outliers at top and bottom are compressed to fit within the range of 0 to 1 -- see the chart below.
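    use Math::Trig;
    . . .
    sub normalise {
        my $score = shift;
        # Rescale to 0..10 so that the curve is centred on 5
        $score = ($score / $max) * 10;
        # tanh runs from -1 to 1; shift and halve it to get 0 to 1
        return (1 + tanh($score - 5)) / 2;
    }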
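The shape of this second curve is easiest to see from a few sample values. The sketch below assumes the intended function is (1 + tanh(x)) / 2 centred on the middle of a 0 to 10 scale; the name normalise_sigmoid and the value of $max are assumptions for the example.

```perl
use strict;
use warnings;
use Math::Trig;    # provides tanh

my $max = 10;      # assumed maximum raw score

sub normalise_sigmoid {
    my $score = shift;
    my $scaled = ($score / $max) * 10;     # rescale to 0..10
    return (1 + tanh($scaled - 5)) / 2;    # S-curve between 0 and 1
}

printf "%2d -> %.3f\n", $_, normalise_sigmoid($_) for (0, 3, 5, 7, 10);
```

Scores near the middle of the range are spread out, while scores near either end are squeezed towards 0 or 1, which suits data where most words sit in the middle band.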
When indexing, you can weight scores both locally and globally. Local weighting was covered earlier in the article. Global weighting reduces the scores of words that are found very frequently or score particularly highly, and so can skew results, and increases the scores of rare words.
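One common way to get this effect, offered here as an assumption rather than as the article's own method, is an inverse-document-frequency factor: each word's score is multiplied by log(N / n), where N is the number of indexed objects and n is the number of objects containing the word. The sample counts and the global_weight helper are invented for the sketch.

```perl
use strict;
use warnings;

my $total_objects = 1000;    # invented size of the indexed collection

# Invented figures: how many objects contain each word.
my %objects_with = ( pub => 900, ale => 400, skittles => 5 );

# Inverse document frequency: close to 0 for words found almost
# everywhere, larger for rare words.
sub global_weight {
    my $word = shift;
    my $n = $objects_with{$word};
    return 0 unless $n;      # unknown word: no weight
    return log( $total_objects / $n );
}

# Apply the global weight to locally weighted scores.
my %local_score = ( pub => 2.0, ale => 1.4, skittles => 1.0 );
for my $word ( sort keys %local_score ) {
    printf "%-9s local %.1f  global %.2f\n",
        $word, $local_score{$word},
        $local_score{$word} * global_weight($word);
}
```

With these figures, the rare word "skittles" ends up outscoring "pub" even though its local score is lower, which is the rebalancing that global weighting is meant to provide.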

