Building a Vector Space Search Engine in Perl
by Maciej CeglowskiFebruary 19, 2003
- Building a Vector Space Search Engine in Perl
- A Few Words About Vectors
- Getting Down To Business
- Building the Search Engine
- Making it Better
- Further Reading
Why waste time reinventing the wheel, when you could be reinventing the engine? -Damian Conway
As a Perl programmer, sooner or later you'll get an opportunity to build a search engine. Like many programming tasks - parsing a date, validating an e-mail address, writing to a temporary file - this turns out to be easy to do, but hard to get right. Most people try end up with some kind of reverse index, a data structure that associates words with lists of documents. Onto this, they graft a scheme for ranking the results by relevance.
Nearly every search engine in use today - from Google on down - works on the basis of a reverse keyword index. You can write such a keyword engine in Perl, but as your project grows you will inevitably find yourself gravitating to some kind of relational database system. Since databases are customized for fast lookup and indexing, it's no surprise that most keyword search engines make heavy use of them. But writing code for them isn't much fun.
More to the point, companies like Google and Atomz already offer excellent, free search services for small Web sites. You can get an instant search engine with a customizable interface, and spend no time struggling with Boolean searches, text highlighting, or ranking algorithms. Why bother duplicating all that effort?
As Perl programmers, we know that laziness is a virtue. But we also know that there is more than one way to do things. Despite the ubiquity of reverse-index search, there are many other ways to build a search engine. Most of them originate in the field of information retrieval, where researchers are having all kinds of fun. Unfortunately, finding documentation about these alternatives isn't easy. Most of the material available online is either too technical or too impractical to be of use on real-world data sets. So programmers are left with the false impression that vanilla search is all there is.
In this article, I want to show you how to build and run a search engine using a vector-space model, an alternative to reverse index lookup that does not require a database, or indeed any file storage at all. Vector-space search engines eliminate many of the disadvantages of keyword search without introducing too many disadvantages of their own. Best of all, you can get one up and running in just a few dozen lines of Perl.
A Few Words About Vectors
Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.
To search our collection, we project a query into this term space and calculate the distance from the query vector to all the document vectors in turn. Those documents that are within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, then don't worry - it will become clearer when we write the code.
The vector-space data model gives us a search engine with several useful features:
- Searches take place in RAM, there is no disk or database access
- Queries can be arbitrarily long
- Users don't have to bother with Boolean logic or regular expressions
- It's trivially easy to do 'find similar' searches on returned documents
- You can set up a 'saved results' basket, and do similarity searches on the documents in it
- You get to talk about 'vector spaces' and impress your friends
Getting Down to Business
The easiest way to understand the vector model is with a concrete example.
Let's say you have a collection of 10 documents, and together they contain
50 unique words. You can represent each document as a vector by counting
how many times a word appears in the document, and moving that distance along
the appropriate axis. So if Document A contains the sentence "cats like
chicken", then you find the axis for cat, move along one unit, and then do
the same for like and chicken. Since the other 47 words
don't appear in your document, you don't move along the corresponding
axes at all.
Plot this point and draw a line to it from the origin, and you have your document vector. Like any vector, it has a magnitude (determined by how many times each word occurs), and a direction (determined by which words appeared, and their relative abundance).
There are two things to notice here:
The first is that we throw away all information word order, and there is no guarantee that the vector will be unique. If we had started with the sentence "chickens like cats" (ignoring plurals for the moment), then we would have ended up with an identical document vector, even though the documents are not the same.
This may seem to be a big limitation, but it turns out that word order in natural language contains little information about content - you can infer most of what a document is about by studying the word list. Bad news for English majors, good news for us.
The second thing to notice is that with three non-zero values out of a possible 50, our document vector is sparse. This will hold true for any natural language collection, where a given document will contain only a tiny proportion of all the possible words in the language. This makes it possible to build in-RAM search engines even for large collections, although the details are outside the scope of this article. The point is, you can scale this model up quite far before having to resort to disk access.
To run a vector-space search engine, we need to do the following:
- Assemble a document collection
- Create a term space and map the documents into it
- Map an incoming query into the same term space
- Compare the query vector to the stored document vectors
- Return a list of nearest documents, ranked by distance
Now let's see how to implement these steps in Perl.
Building the Search Engine
We'll make things easy by starting with a tiny collection of four documents, each just a few words long:
"The cat in the hat"
"A cat is a fine pet."
"Dogs and cats make good pets."
"I haven't got a hat."
Our first step is to find all the unique words in the document set. The easiest way to do this is to convert each document into a word list, and then combine the lists together into one. Here's one (awful) way do it:
sub get_words {
my ( $text ) = @_;
return map { lc $_ => 1 }
map { /([a-z\-']+)/i }
split /\s+/s, $text;
}
The subroutine splits a text string on whitespace, takes out all punctuation except hyphens and apostrophes, and converts everything to lower case before returning a hash of words.
The curious do statement is just a compact way of creating a lookup hash.
%doc_words will end up containing our word list as its keys, and its values
will be the number of times each word appeared.
If we run this 'parser' on all four documents in turn, then we get a master word list:
a
and
cat
cats
dogs
fine
good
got
hat
haven't
i
in
is
make
pet
pets
the
Notice that many of the words in this list are junk words - pronouns, articles, and other grammatical flotsam that's not useful in a search context. A common procedure in search engine design is to strip out words like these using a stop list. Here's the same subroutine with a rudimentary stop list added in, filtering out the most common words:
our %stop_hash;
our @stop_words = qw/i in a to the it have haven't was but is be from/;
foreach @stop_words {$stop_hash{$_}++ };
sub get_words {
# Now with stop list action!
my ( $text ) = @_;
return map { $_ => 1 }
grep { !( exists $stop_hash{$_} ) }
map lc,
map { /([a-z\-']+)/i }
split /\s+/s, $text;
}
A true stop list would be longer, and tailored to fit our document
collection. You can find a real stop list in the DATA section of
Listing 1, along with a complete implementation of the search engine
described here. Note that because of Perl's fast hash lookup algorithm,
we can have a copious stop list without paying a big price in program
speed. Because word frequencies in natural language obey a power-law
distribution, getting rid of the most common words removes a
disproportionate amount of bulk from our vectors.
Here is what our word list looks after we munge it with the stop list:
cat
cats
dogs
fine
good
got
hat
make
pet
pets
We've narrowed the list down considerably, which is good. But notice that our list contains a couple of variants ("cat" and "cats", "pet" and "pets"), that differ only in number. Also note that someone who searches on 'dog' in our collection won't get any matches, even though 'dogs' in the plural form is a valid hit. That's bad.
This is a common problem in search engine design, so of course there is a module on the CPAN to solve it. The bit of code we need is called a stemmer, a set of heuristics for removing suffixes from English words, leaving behind a common root. The stemmer we can get from CPAN uses the Porter stemming algorithm, which is an imperfect but excellent way of finding word stems in English.
use Lingua::Stem;
We'll wrap the stemmer in our own subroutine, to hide the clunky Lingua::Stem syntax:
sub stem {
my ( $word) = @_;
my $stemref = Lingua::Stem::stem( $word );
return $stemref->[0];
}
And here's how to fold it in to the get_words subroutine:
return map { stem($_) => 1 }
grep { !( exists $stop_hash{$_} ) }
map lc,
map { /([a-z\-']+)/i }
split /\s+/s, $text;
Notice that we apply our stop list before we stem the words. Otherwise,
a valid word like beings (which stems to be) would be caught by the
overzealous stop list. It's easy to make little slips like this in search algorithm design, so be extra careful.
With the stemmer added in, our word list now looks like this:
cat
dog
fine
good
got
hat
make
pet
Much better! We have halved the size of our original list, while preserving all of the important content.
Pages: 1, 2 |

