Sign In/My Account | View Cart  
advertisement


Listen Print

Building a Vector Space Search Engine in Perl
by Maciej Ceglowski | Pages: 1, 2

Now that we have a complete list of content words, we're ready for the second step - mapping our documents into the term space. Because our collection has a vocabulary of eight content words, each of our documents will map onto an eight-dimensional vector.

Here is one example:


        # A cat is a fine pet



        $vec = [ 1, 0, 1, 0, 0, 0, 1 ];

The sentence "A cat is a fine pet" contains three content words. Looking at our sorted list of words, we find cat, fine, and pet at positions one, three, and eight respectively, so we create an anonymous array and put ones at those positions, with zeroes everywhere else.

If we wanted to go in the opposite direction, then we could take the vector and look up the non-zero values at the appropriate position in a sorted word list, getting back the content words in the document (but no information about word order).

The problem with using Perl arrays here is that they won't scale. Perl arrays eat lots of memory, and there are no native functions for comparing arrays to one another. We would have to loop through our arrays, which is slow.

A better data way to do it is to use the PDL module, a set of compiled C extensions to Perl made especially for use with matrix algebra. You can find it on the CPAN. PDL stands for "Perl Data Language", and it is a powerful language indeed, optimized for doing math operations with enormous multidimensional matrices. All we'll be using is a tiny slice of its functionality, the equivalent of driving our Ferrari to the mailbox.

It turns out that a PDL vector (or "piddle") looks similar to our anonymous array:



        use PDL;
        my $vec = piddle [ 1, 0, 1, 0, 0, 0, 1 ];


        > print $vec

        [1 0 0 1 0 0 0 1]

We give the piddle constructor the same anonymous array as an argument, and it converts it to a smaller data structure, requiring less storage.

Since we already know that most of our values in each document vector will be zero (remember how sparse natural language is), passing full-length arrays to the piddle constructor might get a little cumbersome. So instead we'll use a shortcut to create a zero-filled piddle, and then set the non-zero values explicitly. For this we have the zeroes function, which takes the size of the vector as its argument:


        my $num_words = 8;
        my $vec = zeroes $num_words;



        > print $vec

        [0 0 0 0 0 0 0 0 0]

To set one of the zero values to something else, we'll have to use the obscure PDL syntax:


        my $value = 3;
        my $offset = 4;



        index( $vec, $offset ) .= $value;

        > print $vec;

        [0 0 0 3 0 0 0 0 0]

Here we've said "take this vector, and set the value at position 4 to 3".

This turns out to be all we need to create a document vector. Now we just have to loop through each document's word list, and set the appropriate values in the corresponding vector. Here's a subroutine that does the whole thing:


        # $word_count is the total number of words in collection
        # %index is a lookup hash of word to its position in the master list



        sub make_vector {
                my ( $doc ) = @_;
                my %words = get_words( $doc );  
                my $vector = zeroes $word_count;


                foreach my $w ( keys %words ) {
                        my $value = $words{$w};
                        my $offset = $index{$w};
                        index( $vector, $offset ) .= $value;
                }
                return $vector;
        }

Now that we can generate a vector for each document in our collection, as well as turn an incoming query into a query vector (by feeding the query into the make_vector subroutine), all we're missing is a way to calculate the distance between vectors.

There are many ways to do this. One of the simplest (and most intuitive) is the cosine measure. Our intuition is that document vectors with many words in common will point in roughly the same direction, so the angle between two document vectors is a good measure of their similarity. Taking the cosine of that angle gives us a value from zero to one, which is handy. Documents with no words in common will have a cosine of zero; documents that are identical will have a cosine of one. Partial matches will have an intermediate value - the closer that value is to one, the more relevant the document.

The formula for calculating the cosine is this:


        cos  = ( V1 * V2 ) / ||V1|| x ||V2||

Where V2 and V2 are our vectors, the vertical bars indicate the 2-norm, and the * indicates the inner product. You can take the math on faith, or look it up in any book on linear algebra.

With PDL, we can express that relation easily:


        sub cosine {
                my ($vec1, $vec2 ) = @_;
                my $n1 = norm $vec1;
                my $n2 = norm $vec2;
                my $cos = inner( $n1, $n2 );    # inner product
                return $cos->sclr();  # converts PDL object to Perl scalar
        }

We can normalize the vectors to unit length using the norm function, because we're not interested in their absolute magnitude, only the angle between them.

Now that we have a way of computing distance between vectors, we're almost ready to run our search engine. The last bit of the puzzle is a subroutine to take a query vector and compare it against all of the document vectors in turn, returning a ranked list of matches:


        sub get_cosines {
                my ( $query_vec ) = @_;
                my %cosines;
                while ( my ( $id, $vec ) = each  %document_vectors ) {
                        my $cosine = cosine( $vec, $query_vec );
                        next unless $cosine > $threshold;
                        $cosines{$id} = $cosine;
                }
                return %cosines;
        }

This gives us back a hash with document IDs as its keys, and cosines as its values. We'll call this subroutine from a search subroutine that will be our module's interface with the outside world:


        sub search {
                my ( $query ) = @_;
                my $query_vec = make_vector( $query );
                my %results = get_cosines( $query_vec );
                return %results;
        }

All that remains is to sort the results by the cosine (in descending order), format them, and display them to the user.

You can find an object-oriented implementation of this code in Listing 1, complete with built-in stop list and some small changes to make the search go faster (for the curious, we normalize the document vectors before storing them, to save having to do it every time we run the cosine subroutine).

Once we've actually written the code, using it is straightforward:


    use Search::VectorSpace;



    my @docs = get_documents_from_somewhere();


    my $engine = Search::VectorSpace->new( docs => \@docs );


    $engine->build_index();
    $engine->set_threshold( 0.8 );


    while ( my $query = <> ) {
        my %results = $engine->search( $query );
        foreach my $result ( sort { $results{$b} <=> $results{$a} }
                             keys %results ) {
                print "Relevance: ", $results{$result}, "\n";
                print $result, "\n\n";
        }


        print "Next query?\n";
    }

And there we have it, an instant search engine, all in Perl.

Making It Better

There are all kinds of ways to improve on this basic model. Here are a few ideas to consider:

Better parsing
Our get_words subroutine is rudimentary - the code equivalent of a sharpened stick. For one thing, it will completely fail on text containing hyperlinks, acronyms or XML. It also won't recognize proper names or terms that contain more than one word ( like "Commonwealth of Independent States"). You can make the parser smarter by stripping out HTML and other markup with a module like HTML::TokeParser, and building in a part-of-speech tagger to find proper names and noun phrases (look for our own Lingua::Tagger::En, coming soon on the CPAN).
Non-English Collections
Perl has great Unicode support, and the vector model doesn't care about language, so why limit ourselves to English? As long as you can write a parser, you can adapt the search to work with any language.

Most likely you will need a special stemming algorithm. This can be easy as pie for some languages (Chinese, Italian), and really hard for others (Arabic, Russian, Hungarian). It depends entirely on the morphology of the language. You can find published stemming algorithms online for several Western European languages, including German, Spanish and French.

Similarity Search
It's easy to add a "find similar" feature to your search engine. Just use an existing document vector as your query, and everything else falls into place. If you want to do a similarity search on multiple documents, then add the vectors together.
Term Weighting
Term weighting is a fancy way of saying "some words are more important than others". Done properly, it can greatly improve search results.

You calculate weights when building document vectors. Local weighting assigns values to words based on how many times they appear in a single document, while global weighting assigns values based on word frequency across the entire collection. The intuition is that rare words are more interesting than common words (global weighting), and that words that appear once in a document are not as relevant as words that occur multiple times (local weighting).

Incorporating Metadata
If your documents have metadata descriptors (dates, categories, author names), then you can build those in to the vector model. Just add a slot for each category, like you did for your keywords, and apply whatever kind of weighting you desire.
Exact Phrase Matching
You can add arbitrary constraints to your result set by adding a chain of filters to your result set. An easy way to do exact phrase matching is to loop through your result set with a regular expression. This kind of post-processing is also a convenient way to sort your results by something other than relevance.

Further Reading

There's a vast body of material available on search engine design, but little of it is targeted at the hobbyist or beginner. The following are good places to start:

http://hotwired.lycos.com/webmonkey/code/97/16/index2a.html?collection=perl
This Webmonkey article dates back to 1997, but it's still the best tutorial on writing a reverse index search engine in Perl.
http://moskalyuk.com/software/perl/search/kiss.htm
An example of a simple keyword search engine - no database, just a deep faith in Perl's ability to parse files quickly.
http://www.movabletype.org/download.shtml
Movable Type is a popular weblog application, written in Perl. The search engine lives in MT::App::Search, and supports several advanced features. It's not pretty, but it's real production code.
http://jakarta.apache.org/lucene/docs/index.html
Lucene is a Java-based keyword search engine, part of the Apache project. It's a well-designed, open-source search engine, intended for larger projects. The documentation discusses some of the challenges of implementing a large search engine; it's worth reading even if you don't know Java.
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=3391
Foundations of Statistical Natural Language Processing, MIT Press (hardcover textbook). Don't be put off by the title - this book is a fantastic introduction to all kinds of issues in text search, and includes a thorough discussion of vector space models.
http://www.nitle.org/lsi/intro/
An introduction to latent semantic indexing, the vector model on steroids. The document is aimed at nontechnical readers, but gives some more background information on using vector techniques to search and visualize data collections.

The adventurous can also download some Perl code for latent semantic indexing at http://www.nitle.org/lsi.php. Both the code and the article come from my own work for the National Institute for Technology and Liberal Education.