Adding Search Functionality to Perl Applications
by Aaron TrevenaSeptember 25, 2003
Introduction to Searching
Usually, when building an application, a lot of thought goes into how the data is entered and updated, rather than finding it again. Finding data is an afterthought, especially when developing with a small dataset.
If you are building a small, simple database-backed web site with only a couple of hundred records, then relatively simple SQL should be all you need. It would be trivial to add a simple and Foo_Name like '%keyword%' to the queries being used.
When your needs go beyond this, there are three ways you can proceed: you can use native database full-text searching, an external search engine, or you can roll your own. Most relational databases come with a full-text search functionality, but can have disadvantages:
- Mutually incompatible syntax and behavior.
- Poor and often unconfigurable ranking and scoring.
- Difficult or impossible to query across multiple tables.
- Require schema changes and rebuilding of indexes for any additional or removed fields.
- Lack of advanced features such as application context, word stemming, synonyms, and handling misspellings.
An external search engine, such as htdig, offers independence from your
database and application, as well as powerful indexing and
searching functionality. However, this solution is limited in how it
can interact with your application -- it depends both on its crawler
and own your applications interface. An external search engine can quickly
become out of date, as changes to your data will not be reflected until
it next reads through your site.
Integrating customized searching into your application can provide you with many benefits:
- A cross-platform search component that:
- Doesn't tie you to a specific database.
- Is easy to abstract and re-use.
- The ability to customize how searching works:
- Search subsets of data or across tables.
- Tune scoring, weighting, stemming, stop-words, etc.
- Utilize relationships to provide additional data.
- Specify when and how to update the index according to your needs.
- Add extra columns or tables without modifying the schema or interface.
- Leverage the metadata and word index:
- Re-use the data to provide extra features, such as computing vectors to find similar records.
- Categorize data according to vectors or important keywords, rather than using tables or by hand.
When you build searching into your application, it can make the difference between users finding the right information on your site or going somewhere else -- advanced search syntax and features are no substitute for finding the right results for the user in the first place!
Organizing and Indexing Your Data
Most search engines use a reverse (or inverted) index to store a list of words or phrases and locations where they are found. This means building up a list of locations, which could be web pages, files, objects or database records, and then adding an entry into the index for each word found, specifying the location, and possibly some metadata like a score or the type of location.
A reverse index can be just a simple table:
create table ReverseIndex (
ReverseIndex_Word varchar(32) not null,
ReverseIndex_Document varchar(255) not null,
ReverseIndex_Score int,
primary key (ReverseIndex_Word, ReverseIndex_Document)
)
This table assumes that each document has a score for each word, based on, perhaps, the number of occurrences. Some example data would look something like this:
| ReverseIndex_Word | ReverseIndex_Document | ReverseIndex_Score |
|---|---|---|
| art | /samantha/stuff/ballet.doc | 2 |
| ballet | /samantha/stuff/ballet.doc | 5 |
| boy | /aaron/simpsons/homer/quotes.foo | 2 |
| boy | /samantha/stuff/ballet.doc | 1 |
| dance | /aaron/simpsons/homer/quotes.foo | 2 |
| dance | /samantha/stuff/ballet.doc | 5 |
| monkey | /aaron/simpsons/homer/quotes.foo | 1 |
To search the index for Homer's "Dance monkey boy dance!" quote, you would split, lowercase, remove duplicates and punctuation, and build a query from the array:
SELECT ReverseIndex_Document as Name,
Sum(ReverseIndex_Score) as Total_Score,
Count(ReverseIndex_Score) as Matches
FROM ReverseIndex
WHERE ReverseIndex_Word IN ('dance','monkey','boy')
GROUP BY ReverseIndex_Document
ORDER BY Matches DESC , Total_Score DESC
This query returns:
| Name | Total_Score | Count |
| /aaron/simpsons/homer/quotes.foo | 5 | 3 |
| /samantha/stuff/ballet.doc | 6 | 2 |
This query has a few features that make it more effective. The query tests not only the sum of scores, but also the number of matches. Ordering by matches before scores is a crude measure to give slightly better results than just by score, where a single highly scored word in a document can skew the results.
Because your search results are only as good as your index, it is worth investing the time to polishing your index to minimize skewed results. You can limit your scores per record or normalize the whole index. This would ensure that results are more even, and therefore more likely to give the results the user is looking for.
The key to integrating a customized search into your application is the ability to use additional contextual information to give more accurate results. This additional information could be records in other tables, filenames, or any other metadata you have available. In the example above, if we also indexed the words "homer" and "quote" from the file name and path, and the user had entered "homer" and/or "quote," the score and rank for the "quotes.foo" document would be much better and more accurately reflect what the user was looking for.
Adding Indexing and Searching to a Legacy Application
Although it can be easier to provide a search engine externally through Lucene or Swish, you don't have to tightly integrate searching at a code level to reap benefits from customized indexing. Adding your own search engine can be a high-level operation. When adding search functionality to a legacy application, you have the choice of reworking the code, annexing, or adding triggers.
Annexing Searching to an Application
Annexing searching to your application can be as simple as couple of
extra tables in a database and a perl script called from cron. The database
need only contain the reverse index and some summary information about
the locations that would be displayed in search results.
Your searching logic should do the minimum of work to ensure fast results, with as much of the work as possible being done up front. This is the flaw in just adding simple SQL queries to pages to provide searching: they're simply not efficient. Exact matches on indexed columns as, in a reverse index, are far faster than the SQL-like/wildcard queries people tack onto a page to allow searching, and built-in database full-text indexes will be very slow or may not even be available when searching columns not specified in the index. This leaves a lot of work to be done for each search, which is avoided when you roll your own.
This simple example uses two tables, a script for indexing, and a mod_perl
handler for searching. The ReverseIndex table is going to be three columns: Word, Score, and Location.
create table ReverseIndex (
ReverseIndex_Word varchar(64),
ReverseIndex_Score float,
Location_ID int,
primary key( ReverseIndex_Word, ReverseIndex_Score )
)
The Location table will have seven columns: Location_ID, Title, Type, Key,
Identifier, URL, and Summary.
create table Location (
Location_ID integer primary key auto_increment,
Location_Title varchar(64),
Location_Type varchar(16),
Location_Key varchar(32),
Location_Identifier integer,
Location_URL varchar(255),
Location_Summary text
)

