Sign In/My Account | View Cart  
advertisement


Listen Print

Adding Search Functionality to Perl Applications
by Aaron Trevena | Pages: 1, 2, 3, 4, 5

For the best results when weighting scores globally, you should normalize in advance to ensure a limited range of scores; this can also reign in outlying scores that could skew the weighting. Global weighting of scores requires that scores still reflect the relevance of an object, and can be problematic if the solution doesn't consider that a rare word with a low mean score can still have a few high scores that could end up being scored too highly, or vice versa, making results less, rather than more, useful.


sub global_weighted_score {
  my ($word,$score) = @_;
  my $word_avg = get_average($word); # get average score for word
  my $global_avg = get_average($word);       # get average score across index
  $score = normalise($score);                # normalise score before weighting
  if ($word_avg > $global_avg) {
    $score += (($global_avg - $word_avg) * 0.25) / 
                ($global_avg / $score ); 
  } else {
    $score += (($global_avg - $word_avg) * 0.25) / 
                ($score / $global_avg );
  }
  return $score;
}

sub get_average {
  my $word = shift;
  my $query = 'select avg(ReverseIndex_Score) 
                 from ReverseIndex';
  if ($word) {
    $word = $dbh->quote($word);
    $query .= " where ReverseIndex_Word = $word";
  }
  my ($avg) = $dbh->selectrow_array($query);
  return $avg;
}

There is nothing stopping you from keeping multiple scores in your index for each word; as long as you index them appropriately, there will not be a significant impact on performance. In particular, you could weight scores at the class level as well as at the global level. Keeping track of original, normalized, and weighted scores means that changes to code require only recalculating of scores rather than re-indexing everything. This additional information could also be included in data made available through web services.

Stopwords

Integrating indexing and searching into your classes allows you to have class-level or even object-level stopwords. This can be particularly handy when one word is so frequent in one class or set of objects as to become meaningless, while rare and useful in others, and resolving the limitations of global weighting.

Stopwords can be normalized by ignoring them while building the word list and adding them later, based on other object attributes. For instance, if you had a load of objects representing pubs, you could add the pub's town to the global or class stopword list and then give all pubs in "Watford" the same score for "Watford" in the index. Other options would be to score down words in the index based on object attributes such as location, or even having stopwords apply to specific fields -- for example, if you ignored the word "watford" in the description of a pub, you could still increase the pub's score for "watford" if it was in a name or address field -- so a pub called "The Watford Arms" would score higher than "The Kings Head."

Integrating the Search into Your Application

The results are the important part as far as the user is concerned -- this where all the hard work should bear fruit with a responsive site and useful information.

Critical considerations are:

  • What information is needed.
  • How ranking and scoring will be determined.
  • How to normalize scores if they are shown.

Showing results for subsets or grouping can also be important. Your presentation of the results can also make a difference -- paging and balancing the trade-off between the amount of information you can show for each result and how many results you can fit on a page.

Assuming the database schema outlined earlier, it would be possible to denormalize some of the information held into the Location table. Although this will consume some more memory, it saves on joining tables when searching, and would be updated at the same time as the index, so remaining in sync with the application data itself.

When displaying results to your users, you are heavily constrained in how much data you can present at once. This means some form of paging is often required if you have more than a screenful of results. It also means that you must sacrifice the number of results shown per page if you want to show more than a trivial amount of information in that page.

Often it can be difficult to grade the quality of results returned to the user -- although index scores are limited and normalized, you also need to be able to display scores in a meaningful way to the user. This means both simplifying and explaining the scores. A numeric score without qualification is meaningless: is 5 out of 5, or 10, or 100?

The simplest way to grade scores is to work out the maximum score and divide it into grades. For example, if you have a maximum of 5 points per word matched, then you could divide the score by the number of words searched for and grade it by rounding up to 1, 2, 3, 4 or 5 out of 5. This information can then be presented using text and/or graphics, the latter allowing for color-coding of results. A small colored bar for each result allows you to show both the score and the grade in the minimum of space. The bar can easily be replaced by stars or other symbols, to fit in with the look and feel of an application.

By normalizing the scores as you index the objects (see above), you make the results much easier to use. If you know that the maximum score per search term is 1, then scores can be easily graded with a simple piece of code into something users can understand.

my $wc = scalar @searchwords;
foreach my $result (@$results) {
  $result->{grade} = get_grade($result->{score}/$wc);
}

sub get_grade {
  my $score = shift;
  return 'poor' if ($score < 0.35);
  return 'good' if ($score < 0.65);
  return 'very good' if ($score < 0.85);
  return 'excellent';
}

There are many ways of getting extra value from your results. You can group results by object type, either by adding logic to the query or by using an Iterator class that differentiates between object types.

By checking the status or type of each item in the result list, you can present it in a different way or provide additional information. Items in a catalogue that have been recalled or replaced can include a link to the replacement or recall notice -- again, this is one of the benefits of keeping such information in a metadata table.

You can check for spelling mistakes and alternative words using CPAN modules such as Lingua::Spelling::Alternative. You can also provide related links for items or keyword-based advertising.

Tuning and Customizing

Once you have your search engine working and integrated with your application, you can work on tuning it for more accurate scoring and more intuitive results. You can also work on customizing it further to meet the needs of your application.

For instance, if part of your application was a catalogue, then you could add a status field to your locations. This would allow your to mark old items as replaced and provide an alternative result in its place in search results, with a note saying which item it replaced. You could also provide similar features for items that have been recalled or books no longer in print.

You can further tune results using two powerful modules on CPAN -- Lingua::Stem::En (replace En with whichever language suits your needs) and Lingua::EN::Tagger. For the sake of simplicity, I haven't used these in this article, but they are relatively simple to integrate into searching and indexing.

Linga::Stem::En provides Porter Stemming for perl. Porter's algorithm is a well-known way of cutting down a word to its stem -- removing grammatical information from words to find their root. For instance, you'll want "training" and "trains" both to match the same results as "train," so Porter Stemming can be used to reduce both words to "train." As well as increasing the accuracy of your search, this technique also drastically reduces the number of words in the index. If you are getting a high number of word misses on your index, this can improve results greatly -- if you are already getting plenty of word hits on your index, then this can normalize your results more by losing grammatical information in the words that may distinguish results better. A simple rule of thumb is that if you get a low number of results for each word, then you need it; if you have a high number of results for each word, you don't. To add stemming to your index with a module like Lingua::Stem::En, you would use the module's function to extract words rather than (or as well as) your own, when splitting search phrases and text to be indexed into keywords.

Tagger is a clever module that can add something approaching phrase matching without having to muck about (too much) with your working index algorithm. Tagger will pull out groups of words from a string of text (optionally stemming words) by looking for nouns and "Parts Of Speech." By passing text to be indexed and text to be searched for through the tagger, you can extract groups of words. For example, instead of just indexing "Justice Department" as two separate words, a good tagger will return it as a single phrase.

Once you're used the tagger to segment your text, you can treat the individual words and the phrases alike for both indexing and searching purposes. This means you can avoid the slow and unpleasant task of doing phrase matching properly -- because the tagger would also apply to the search query, a phrase like "Justice Department" in a search term would be automatically kept together. The phrase-matching process would be transparent to the user, meaning there's no need for additional syntax such as putting quotes around phrases.

Both Tagger and Lingua::Stem can be integrated into the get_words function above, transparent to both index and search logic.

When you control the indexing, it is possible to recognize dates and convert them to an internal format for full-text searching, applying similar logic to stemming -- as long as your internal date format is consistent, it doesn't matter how dates are entered by the user or stored in the data. They can be converted to the internal date format when indexed or queried. This is useful if date information is particularly important to your objects.

References and Further Reading

  • "Designing a Search Engine" by Pete Sergeant. Useful information on coping without an RDBMS and implementing phrase-matching/Boolean searches. (perl.com)
  • "The Windows 2000 Content Index" by Bartosz Milewski. The principles and design of the Windows 2000 content index. (Dr. Dobb's Journal -- registration required.)
  • "Web Site Searching and Indexing in Perl" by Neil Gunton. Indexing websites using MySQL and DBIx::FullTextSearch. (Dr. Dobb's Journal -- registration required.)
  • "Full-text indexing in perl" by Tim Kientzle. A concise introduction to full-text indexing in perl -- an essential read. (Dr. Dobb's Journal -- registration required.)
  • "Building a Vector Space Search Engine in Perl" by Maciej Ceglowski. Covers the Vector-based alternative to Reverse Index searching, as well as pointers on splitting text into words, etc. (perl.com)
  • "How to Avoid Writing Code" by Kake Pugh. Quick introduction to practical use of Class::DBI including how to add a full-text search to an object. (perl.com)