February 2003 Archives

Genomic Perl

This is a book I have been looking forward to for a long time. Back when James Tisdall had just finished his Beginning Perl for Bioinformatics, I asked him to write an article about how to get into bioinf from a Perl programmer's perspective. With bioinformatics being a recently booming sphere and many scientists finding themselves in need of computer programmers to help them process their data, I thought it would be good if there was more information for programmers about the genomic side of things.

Rex Dwyer has produced a book, Genomic Perl, which bridges the gap. As well as teaching basic Perl programming techniques to biologists, it introduces many useful genetic concepts to those already familiar with programming. Of course, as a programmer and not a biologist, I'm by no means qualified to assess the quality of that side of the book, but I certainly learned a lot from it.

The first chapter, for instance, taught the basics of DNA transcription and translation, the basics of Perl string handling, and tied the two concepts together with an example of transcription in Perl. This is typical of the format of the book - each chapter introduces a genetic principle and a related problem, a Perl principle which can be used to solve the problem, and a program illustrates both. It's a well thought-out approach which deftly caters for both sides of the audience.
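
To give a flavour of the kind of example that first chapter builds, here's a minimal sketch of transcription in Perl (my own illustration, not code from the book): the mRNA carries the same sequence as the coding strand of the DNA, with uracil in place of thymine, which in Perl is a one-line tr///.

    # Not the book's code - just a toy sketch of transcription.
    my $dna = 'ATGGCATTT';            # coding strand
    (my $rna = $dna) =~ tr/T/U/;      # swap thymine for uracil
    print "$rna\n";                   # prints AUGGCAUUU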

However, it should be stressed that the book is a substitute neither for a good introductory book on Perl nor a good textbook on genetics; and indeed, I think it will turn out to be better for programmers who need an over-arching idea of some of the problems involved with bioinformatics than for biologists who need to turn out working code. For instance, when it states that a hash is the most convenient data structure for looking up amino acids by their codons, it doesn't say why, or even what a hash is. On the other hand, amino acids and codons are both explained in detail.
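
To make that point concrete, here is roughly what the data structure in question looks like (my own toy illustration, not the book's code): the three-letter codons are hash keys, the amino acids are the values, and a lookup is a single expression.

    # Toy illustration only - a few entries of a codon-to-amino-acid hash.
    my %amino_acid = ( AUG => 'Met', UUU => 'Phe', UAA => 'STOP' );
    my $codon = 'AUG';
    print "$codon encodes $amino_acid{$codon}\n";   # AUG encodes Met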

The book covers a wide range of biological areas - from the structure of DNA to building predictive models of species, exploring the available databases of genetic sequences (including readers of the GenBank database and an implementation of the BLAST algorithm), phylogeny, protein databases, DNA sequence assembly and prediction, restriction mapping, and a lot more besides. In all, it's a good overview of the common areas in which biologists need computer programs.

There's a significant but non-threatening amount of math in there, particularly in dealing with probabilities of mutation and determining whether or not events are significant, but I was particularly encouraged to see discussion of algorithmic running time; as the author is primarily a computer science professor and secondarily a bioinformaticist, this should not be too surprising. However, a significant number of bioinformaticists tend to produce code which works... eventually. Stopping to say "well, this is order n-to-the-6 and we can probably do better than that" is most welcome.

On to the code itself. The first thing any reader will notice about the book is that the code isn't monospaced. Instead, the code is "ground", pretty-printed, as in days of old. This means you'll see code snippets like:

next unless $succReadSite; ## dummy sinks have no successor
my $succContigSite = $classNames->find($succReadSite);

Now, I have to admit I really like this, but others may find it difficult to read, and those who know slightly less Perl may find it confusing - the distinction between '' and " (that's two single quotes and a double quote) can be quite subtle, and if you're going to grind Perl code, regular expressions really, really ought to be monospaced. "$elem =~ /^[(^\[(]*)(\(.*\))?$/;" is just plain awkward.

The code is more idiomatic than much bioinformatic code that I've seen, but still feels a little unPerlish; good use is made of references, callbacks and object oriented programming, but three-argument for is used more widely than the fluent Perl programmer would have it, and things like


    main();
    sub main {
        ...
    }

worry me somewhat. But it works, it's efficient, and it's certainly enough to get the point across.

The appendices were enlightening and well thought-out: the first turns an earlier example, RNA structure folding, into a practical problem of drawing diagrams of folded RNA for publication; the other two tackle matters of how to make some of the algorithms in the text more efficient.

All in all, I came away from this book not just with more knowledge about genetics and biology - indeed, some of what I learned has been directly applicable to some work I have - but also with an understanding of some of the complexity of the problems geneticists face. It fully satisfies its goals, expressed in the preface: teaching computer scientists the biological underpinnings of bioinformatics, providing real, working code for biologists without boring the programmers, and providing an elementary handling of the statistical considerations of the subject matter. While it will end up being more used by programmers getting into the field, it's still useful for the biologists already there, particularly when combined with something like James Tisdall's book or Learning Perl. But for the programmer like me, interested in what biologists do and how we can help them do it, it's by far the clearest introduction available, and I would heartily recommend it.

This week on Perl 6, week ending 2003-02-23

Another week, another Perl 6 Summary, in which you'll find gratuitous mentions of Leon Brocard, awed descriptions of what Leopold Tötsch got up to and maybe even a summary of what's been happening in Perl 6 design and development.

Kicking off with perl6-internals as usual.

Strings and header reuse

Dan responded to prompting from Leo Tötsch about the use and reuse of string headers. The problem is that most of the string functions that produce modified strings do it in new string headers; there's no way of reusing existing string headers. This can end up generating loads of garbage. Dan's going through the various string handling ops and PMC interfaces working out what needs to do what, and documenting them, as well as adding in versions of the ops that take their destination string headers as an argument. Dan hopes that 'we can make the changes quickly and get this out of the way once and for all', leading Robert Spier to mutter something about 'famous last words'.

http://groups.google.com/groups

PXS help

Tupshin Harper has been trying to use an XML parser from within Parrot and started off by looking at the PXS example (in examples/pxs) but had problems following the instructions given there as his compiler spat out errors by the bucket load. Leo Tötsch thought that PXS was probably deprecated and the native call interface (NCI) was the thing to use. Being Leo, he provided a port of the PXS Qt example to NCI. Although PXS appears to be out of sync with the parrot core, nobody was entirely sure whether it should be removed.

http://groups.google.com/groups

Bit rot in parrot/language/*

Tupshin Harper had some problems with some of the language examples not working well with the most recent versions of Parrot. Leo Tötsch addressed most of the issues he raised, but there are definitely issues with the interpreter and the languages getting out of sync.

http://groups.google.com/groups

http://groups.google.com/groups

Macros in IMCC (part 2)

Jürgen Bömmels extended the macro support in IMCC, implementing .constant and adding some docs. The patch was promptly applied.

http://rt.perl.org/rt2/Ticket/Display.html

[RFD] IMCC calling conventions

Leo Tötsch posted an RFD covering his understanding of the various calling conventions that IMCC would have to deal with which sparked some discussion. I'm now confused as to whether function calls in Parrot will be caller saves, callee saves, or some unholy mixture of the two.

http://groups.google.com/groups

Parrot performance vs the good, the bad and the ugly

Tupshin Harper decided to port primes.pasm to C and Perl 5 to compare results. Parrot came out very quick indeed. (Close to C). For bonus points he then took a python primes algorithm that had been ported to C and ported that to Parrot as well. In full on, all stops pulled out, knobs turned to 11 mode, Parrot came in at about 50% slower than C and around 14 times faster than Python. There was some muttering about the demo being rigged. However, Jim Meyer redid the Perl and Python implementations to use a loop that duplicated the algorithm used in primes.pasm and, whilst it improved their performance somewhat, Parrot was still substantially faster.

This probably won't be the test that's run when Dan and Guido face the possibility of custard pies at 10 paces or whatever the performance challenge stake stands at now.

http://groups.google.com/groups

Mmm... spaceships...

Leon Brocard patched examples/assembly/life.pasm to use a small spaceship as its starting pattern. Apparently because it 'can provide hours of enjoyment at conferences if projected onto a big screen while answering parrot questions.' Nobody thought to ask him how a spaceship was going to answer Parrot questions, but Leo Tötsch applied the patch.

http://groups.google.com/groups

Using IMCC as JIT optimizer

Apparently, Leo Tötsch finds it unbearable that 'optimized compiled C is still faster than parrot -j' so he's been experimenting with adding smarts to IMCC, making it add hardware register allocation hints to its emitted bytecode. Sean O'Rourke liked the basic idea, but reckoned that the information generated by IMCC should really be platform-independent, suggesting that it'd be okay to pass a control flow graph to the JIT, but that hardware register allocation for a specific number of registers would be iffy. He suggested that another option would be for IMCC to 'just rank the Parrot registers in order of decreasing spill cost', then the JIT could just move the most important Parrot registers into architectural registers.

Dan thought the idea was interesting too, but worried that the JIT might spend more time optimizing code than it could possibly gain from the optimization. The discussion got more than a little technical after this. I'm afraid I'm a bear of little brain when it comes to this sort of magic, so you'll have to read the whole thread if you're interested in the details.

The response was pretty positive throughout the discussion, so Leo went ahead and implemented it. The new, improved version shaved slightly more than a tenth of a second from the primes.pasm runtime (not a great percentage win, but the total runtime includes the compilation time).

http://groups.google.com/groups

http://groups.google.com/groups

invoke

Steve Fink is bothered by the invoke op because it operates implicitly on P0. He wants to replace it with a new version that takes a PMC argument. Leo Tötsch is less bothered by the implicit behaviour of the op, but would like to see an additional invoke_p op, which would take a single PMC argument. So Steve implemented it, but hit a couple of snags. I'm not quite sure whether these have been overcome yet.

http://groups.google.com/groups

Problems with Configure.pl --cgoto=0

Nicholas Clark was unable to build Parrot with computed gotos turned off. Simon Glover offered a simple patch that fixed Nick's problem. Leo Tötsch talked about the different reasons for not building a computed goto core, which depends on both the compiler's capabilities and the user's choices. This led to a discussion of which cores were now obsolete. Leo believes that the simple Computed Goto and Prederef cores have been obsoleted by the CGP core (Computed Goto Prederefed). Nick thought we should continue to ship code for all the cores, and ensure that the config system is flexible enough to let anyone build any arbitrary combination of cores, which convinced Leo. Dan suggested that, once we'd done this we should revamp the Tinderbox system to run tests on all the core types.

http://groups.google.com/groups

Non-inline text in parrot assembly

Tupshin Harper wondered if there were any plans for Parrot to support a distinct .string asm section. Leo Tötsch pointed to .constant (in PASM) and .const (in IMCC) as ways of keeping strings visually together. Tupshin realised he'd missed a chunk of documentation (perldoc assemble.pl for anyone else who hasn't read it) which he thinks should probably be moved into docs/parrot_assembly.pod or somewhere similar.

http://groups.google.com/groups

Access to partial registers?

Tupshin Harper wondered if it was possible and/or meaningful to read and write from a part of a register (eg. a single word) in PASM. Answer: Not at the moment, what do you want? We can always add an intreg.ops set.

http://groups.google.com/groups

Meanwhile, over in perl6-language

Things were, once more, quiet (all of 16 messages). I think we're all awaiting the coming of the Prophet Zarquon (or possibly the next Apocalypse, whichever comes sooner).

Arrays, lists, referencing

David Storrs suggested some possible semantics for (4, 1, 2) + 7, noting that he doesn't like Perl 5's current behaviour (it evaluates to 9). Michael Lazzaro thinks it should evaluate to 10 (length of the list + 7) or possibly throw a syntax error. This then morphed into a discussion of pass by value, pass by reference and pass by constant reference, which Allison Randal told us would be addressed in the upcoming Apocalypse 6.

http://groups.google.com/groups

Err...

That's it

Acknowledgements, Announcements and Apologies

Mmm... the comfy chair... it's the only place to write a summary in. Especially when you're plied with lashings of Earl Grey tea and entertained by the antics of a couple of adolescent cats. Things could be lots worse.

I'd like to apologize to Danny O'Brien for last week's mention of 'a Brainf*ck compiler, in Brainf*ck, for the Brainf*ck interpreter supplied with Parrot, a virtual machine named after a joke, written for a language that doesn't yet exist.' Apparently this sprained Danny's head. And I extend similar apologies to anyone else whose head was sprained by that concept.

The American Odyssey web page is still more of a beautiful idea than an actual link you can click. Hopefully it'll spring into existence before we set off, but I wouldn't bet on it.

Aspell is, once more, the weapon of proofreading choice.

If you appreciated this summary, please consider one or more of the following options:

This week's summary was again sponsored by Darren Duncan. Thanks Darren. If you'd like to become a summary sponsor, drop me a line at p6summarizer@bofh.org.uk.

Building a Vector Space Search Engine in Perl

Why waste time reinventing the wheel, when you could be reinventing the engine? -Damian Conway

As a Perl programmer, sooner or later you'll get an opportunity to build a search engine. Like many programming tasks - parsing a date, validating an e-mail address, writing to a temporary file - this turns out to be easy to do, but hard to get right. Most people who try end up with some kind of reverse index, a data structure that associates words with lists of documents. Onto this, they graft a scheme for ranking the results by relevance.

Nearly every search engine in use today - from Google on down - works on the basis of a reverse keyword index. You can write such a keyword engine in Perl, but as your project grows you will inevitably find yourself gravitating to some kind of relational database system. Since databases are customized for fast lookup and indexing, it's no surprise that most keyword search engines make heavy use of them. But writing code for them isn't much fun.

More to the point, companies like Google and Atomz already offer excellent, free search services for small Web sites. You can get an instant search engine with a customizable interface, and spend no time struggling with Boolean searches, text highlighting, or ranking algorithms. Why bother duplicating all that effort?

As Perl programmers, we know that laziness is a virtue. But we also know that there is more than one way to do things. Despite the ubiquity of reverse-index search, there are many other ways to build a search engine. Most of them originate in the field of information retrieval, where researchers are having all kinds of fun. Unfortunately, finding documentation about these alternatives isn't easy. Most of the material available online is either too technical or too impractical to be of use on real-world data sets. So programmers are left with the false impression that vanilla search is all there is.

In this article, I want to show you how to build and run a search engine using a vector-space model, an alternative to reverse index lookup that does not require a database, or indeed any file storage at all. Vector-space search engines eliminate many of the disadvantages of keyword search without introducing too many disadvantages of their own. Best of all, you can get one up and running in just a few dozen lines of Perl.

A Few Words About Vectors

Vector-space search engines use the notion of a term space, where each document is represented as a vector in a high-dimensional space. There are as many dimensions as there are unique words in the entire collection. Because a document's position in the term space is determined by the words it contains, documents with many words in common end up close together, while documents with few shared words end up far apart.

To search our collection, we project a query into this term space and calculate the distance from the query vector to all the document vectors in turn. Those documents that are within a certain threshold distance get added to our result set. If all this sounds like gobbledygook to you, then don't worry - it will become clearer when we write the code.

The vector-space data model gives us a search engine with several useful features:

  • Searches take place in RAM; there is no disk or database access
  • Queries can be arbitrarily long
  • Users don't have to bother with Boolean logic or regular expressions
  • It's trivially easy to do 'find similar' searches on returned documents
  • You can set up a 'saved results' basket, and do similarity searches on the documents in it
  • You get to talk about 'vector spaces' and impress your friends

Getting Down to Business

The easiest way to understand the vector model is with a concrete example.

Let's say you have a collection of 10 documents, and together they contain 50 unique words. You can represent each document as a vector by counting how many times a word appears in the document, and moving that distance along the appropriate axis. So if Document A contains the sentence "cats like chicken", then you find the axis for cat, move along one unit, and then do the same for like and chicken. Since the other 47 words don't appear in your document, you don't move along the corresponding axes at all.

Plot this point and draw a line to it from the origin, and you have your document vector. Like any vector, it has a magnitude (determined by how many times each word occurs), and a direction (determined by which words appeared, and their relative abundance).

There are two things to notice here:

The first is that we throw away all information about word order, and there is no guarantee that the vector will be unique. If we had started with the sentence "chickens like cats" (ignoring plurals for the moment), then we would have ended up with an identical document vector, even though the documents are not the same.

This may seem to be a big limitation, but it turns out that word order in natural language contains little information about content - you can infer most of what a document is about by studying the word list. Bad news for English majors, good news for us.

The second thing to notice is that with three non-zero values out of a possible 50, our document vector is sparse. This will hold true for any natural language collection, where a given document will contain only a tiny proportion of all the possible words in the language. This makes it possible to build in-RAM search engines even for large collections, although the details are outside the scope of this article. The point is, you can scale this model up quite far before having to resort to disk access.
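
A rough sketch of what that sparseness buys you (this is an aside, not part of the engine we build below): rather than storing 50 mostly-zero slots per document, you could keep only the non-zero positions in a hash.

        # Sketch only: a sparse representation of a 50-dimensional vector.
        # Keys are (arbitrary, made-up) word positions, values are counts.
        my %sparse_vec = ( 2 => 1, 12 => 1, 30 => 1 );   # 3 entries instead of 50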

To run a vector-space search engine, we need to do the following:

  1. Assemble a document collection
  2. Create a term space and map the documents into it
  3. Map an incoming query into the same term space
  4. Compare the query vector to the stored document vectors
  5. Return a list of nearest documents, ranked by distance

Now let's see how to implement these steps in Perl.

Building the Search Engine

We'll make things easy by starting with a tiny collection of four documents, each just a few words long:


        "The cat in the hat"

        "A cat is a fine pet."

        "Dogs and cats make good pets."

        "I haven't got a hat."

Our first step is to find all the unique words in the document set. The easiest way to do this is to convert each document into a word list, and then combine the lists together into one. Here's one (awful) way to do it:



        sub get_words { 
                my ( $text ) = @_;
                my %doc_words;
                my @words = map { lc($_) }
                            map { /([a-z\-']+)/i } 
                            split /\s+/s, $text;
                do { $doc_words{$_}++ } for @words;
                return %doc_words;
        }

The subroutine splits a text string on whitespace, takes out all punctuation except hyphens and apostrophes, and converts everything to lower case before returning a hash of words.

The curious do statement is just a compact way of creating a lookup hash. %doc_words will end up containing our word list as its keys, and its values will be the number of times each word appeared.

If we run this 'parser' on all four documents in turn, then we get a master word list:


        a
        and
        cat
        cats
        dogs
        fine
        good
        got
        hat
        haven't
        i
        in
        is
        make
        pet
        pets
        the

Notice that many of the words in this list are junk words - pronouns, articles, and other grammatical flotsam that's not useful in a search context. A common procedure in search engine design is to strip out words like these using a stop list. Here's the same subroutine with a rudimentary stop list added in, filtering out the most common words:



        our %stop_hash;
        our @stop_words = qw/i in a to the it have haven't was but is be from/;
        foreach (@stop_words) { $stop_hash{$_}++ }


        sub get_words {

                # Now with stop list action!

                my ( $text ) = @_;
                my %doc_words;
                my @words = grep { !( exists $stop_hash{$_} ) }
                            map lc,
                            map { /([a-z\-']+)/i } 
                            split /\s+/s, $text;
                do { $doc_words{$_}++ } for @words;
                return %doc_words;
        }

A true stop list would be longer, and tailored to fit our document collection. You can find a real stop list in the DATA section of Listing 1, along with a complete implementation of the search engine described here. Note that because of Perl's fast hash lookup algorithm, we can have a copious stop list without paying a big price in program speed. Because word frequencies in natural language obey a power-law distribution, getting rid of the most common words removes a disproportionate amount of bulk from our vectors.

Here is what our word list looks like after we munge it with the stop list:


        cat
        cats
        dogs
        fine
        good
        got
        hat
        make
        pet
        pets

We've narrowed the list down considerably, which is good. But notice that our list contains a couple of variants ("cat" and "cats", "pet" and "pets"), that differ only in number. Also note that someone who searches on 'dog' in our collection won't get any matches, even though 'dogs' in the plural form is a valid hit. That's bad.

This is a common problem in search engine design, so of course there is a module on the CPAN to solve it. The bit of code we need is called a stemmer, a set of heuristics for removing suffixes from English words, leaving behind a common root. The stemmer we can get from CPAN uses the Porter stemming algorithm, which is an imperfect but excellent way of finding word stems in English.


        use Lingua::Stem;

We'll wrap the stemmer in our own subroutine, to hide the clunky Lingua::Stem syntax:


        sub stem {
                my ( $word) = @_;
                my $stemref = Lingua::Stem::stem( $word );
                return $stemref->[0];
        }

And here's how to fold it into the get_words subroutine:


        my @words = map { stem($_) }
                    grep { !( exists $stop_hash{$_} ) }
                    map lc,
                    map { /([a-z\-']+)/i } 
                    split /\s+/s, $text;

Notice that we apply our stop list before we stem the words. Otherwise, a valid word like beings (which stems to be) would be caught by the overzealous stop list. It's easy to make little slips like this in search algorithm design, so be extra careful.

With the stemmer added in, our word list now looks like this:


        cat
        dog
        fine
        good
        got
        hat
        make
        pet

Much better! We have halved the size of our original list, while preserving all of the important content.

Now that we have a complete list of content words, we're ready for the second step - mapping our documents into the term space. Because our collection has a vocabulary of eight content words, each of our documents will map onto an eight-dimensional vector.

Here is one example:


        # A cat is a fine pet



        $vec = [ 1, 0, 1, 0, 0, 0, 0, 1 ];

The sentence "A cat is a fine pet" contains three content words. Looking at our sorted list of words, we find cat, fine, and pet at positions one, three, and eight respectively, so we create an anonymous array and put ones at those positions, with zeroes everywhere else.

If we wanted to go in the opposite direction, then we could take the vector and look up the non-zero values at the appropriate position in a sorted word list, getting back the content words in the document (but no information about word order).

The problem with using Perl arrays here is that they won't scale. Perl arrays eat lots of memory, and there are no native functions for comparing arrays to one another. We would have to loop through our arrays, which is slow.

A better way to do it is to use the PDL module, a set of compiled C extensions to Perl made especially for use with matrix algebra. You can find it on the CPAN. PDL stands for "Perl Data Language", and it is a powerful language indeed, optimized for doing math operations with enormous multidimensional matrices. All we'll be using is a tiny slice of its functionality, the equivalent of driving our Ferrari to the mailbox.

It turns out that a PDL vector (or "piddle") looks similar to our anonymous array:



        use PDL;
        my $vec = pdl [ 1, 0, 1, 0, 0, 0, 0, 1 ];


        > print $vec

        [1 0 1 0 0 0 0 1]

We give the pdl constructor the same anonymous array as an argument, and it converts it to a smaller data structure, requiring less storage.

Since we already know that most of our values in each document vector will be zero (remember how sparse natural language is), passing full-length arrays to the pdl constructor might get a little cumbersome. So instead we'll use a shortcut to create a zero-filled piddle, and then set the non-zero values explicitly. For this we have the zeroes function, which takes the size of the vector as its argument:


        my $num_words = 8;
        my $vec = zeroes $num_words;



        > print $vec

        [0 0 0 0 0 0 0 0]

To set one of the zero values to something else, we'll have to use the obscure PDL syntax:


        my $value = 3;
        my $offset = 4;



        index( $vec, $offset ) .= $value;

        > print $vec;

        [0 0 0 0 3 0 0 0]

Here we've said "take this vector, and set the value at position 4 to 3".

This turns out to be all we need to create a document vector. Now we just have to loop through each document's word list, and set the appropriate values in the corresponding vector. Here's a subroutine that does the whole thing:


        # $word_count is the total number of unique words in the collection
        # %index is a lookup hash mapping each word to its position in the master list



        sub make_vector {
                my ( $doc ) = @_;
                my %words = get_words( $doc );  
                my $vector = zeroes $word_count;


                foreach my $w ( keys %words ) {
                        my $value = $words{$w};
                        my $offset = $index{$w};
                        index( $vector, $offset ) .= $value;
                }
                return $vector;
        }

Now that we can generate a vector for each document in our collection, as well as turn an incoming query into a query vector (by feeding the query into the make_vector subroutine), all we're missing is a way to calculate the distance between vectors.

There are many ways to do this. One of the simplest (and most intuitive) is the cosine measure. Our intuition is that document vectors with many words in common will point in roughly the same direction, so the angle between two document vectors is a good measure of their similarity. Taking the cosine of that angle gives us a value from zero to one, which is handy. Documents with no words in common will have a cosine of zero; documents that are identical will have a cosine of one. Partial matches will have an intermediate value - the closer that value is to one, the more relevant the document.

The formula for calculating the cosine is this:


        cos θ = ( V1 * V2 ) / ( ||V1|| x ||V2|| )

Where V1 and V2 are our vectors, the vertical bars indicate the 2-norm, and the * indicates the inner product. You can take the math on faith, or look it up in any book on linear algebra.

With PDL, we can express that relation easily:


        sub cosine {
                my ($vec1, $vec2 ) = @_;
                my $n1 = norm $vec1;
                my $n2 = norm $vec2;
                my $cos = inner( $n1, $n2 );    # inner product
                return $cos->sclr();  # converts PDL object to Perl scalar
        }

We can normalize the vectors to unit length using the norm function, because we're not interested in their absolute magnitude, only the angle between them.

Now that we have a way of computing distance between vectors, we're almost ready to run our search engine. The last bit of the puzzle is a subroutine to take a query vector and compare it against all of the document vectors in turn, returning a ranked list of matches:


        sub get_cosines {
                my ( $query_vec ) = @_;
                my %cosines;
                while ( my ( $id, $vec ) = each  %document_vectors ) {
                        my $cosine = cosine( $vec, $query_vec );
                        next unless $cosine > $threshold;
                        $cosines{$id} = $cosine;
                }
                return %cosines;
        }

This gives us back a hash with document IDs as its keys, and cosines as its values. We'll call this subroutine from a search subroutine that will be our module's interface with the outside world:


        sub search {
                my ( $query ) = @_;
                my $query_vec = make_vector( $query );
                my %results = get_cosines( $query_vec );
                return %results;
        }

All that remains is to sort the results by the cosine (in descending order), format them, and display them to the user.

You can find an object-oriented implementation of this code in Listing 1, complete with built-in stop list and some small changes to make the search go faster (for the curious, we normalize the document vectors before storing them, to save having to do it every time we run the cosine subroutine).
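
That optimization is worth spelling out, since it is only mentioned in passing (this is a sketch of the idea, not necessarily the exact code in Listing 1): if every stored vector, and the query vector, is already unit length, then cosine() collapses to a single inner product per comparison.

        # Sketch: normalize once, at indexing time...
        $document_vectors{$doc} = norm make_vector( $doc );

        # ...so the per-query comparison needs no norm() calls at all.
        sub cosine {
                my ( $vec1, $vec2 ) = @_;      # both assumed to be unit length
                return inner( $vec1, $vec2 )->sclr();
        }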

Once we've actually written the code, using it is straightforward:


    use Search::VectorSpace;



    my @docs = get_documents_from_somewhere();


    my $engine = Search::VectorSpace->new( docs => \@docs );


    $engine->build_index();
    $engine->set_threshold( 0.8 );


    while ( my $query = <> ) {
        my %results = $engine->search( $query );
        foreach my $result ( sort { $results{$b} <=> $results{$a} }
                             keys %results ) {
                print "Relevance: ", $results{$result}, "\n";
                print $result, "\n\n";
        }


        print "Next query?\n";
    }

And there we have it, an instant search engine, all in Perl.

Making It Better

There are all kinds of ways to improve on this basic model. Here are a few ideas to consider:

Better parsing
Our get_words subroutine is rudimentary - the code equivalent of a sharpened stick. For one thing, it will completely fail on text containing hyperlinks, acronyms or XML. It also won't recognize proper names or terms that contain more than one word ( like "Commonwealth of Independent States"). You can make the parser smarter by stripping out HTML and other markup with a module like HTML::TokeParser, and building in a part-of-speech tagger to find proper names and noun phrases (look for our own Lingua::Tagger::En, coming soon on the CPAN).
Non-English Collections
Perl has great Unicode support, and the vector model doesn't care about language, so why limit ourselves to English? As long as you can write a parser, you can adapt the search to work with any language.

Most likely you will need a special stemming algorithm. This can be as easy as pie for some languages (Chinese, Italian), and really hard for others (Arabic, Russian, Hungarian). It depends entirely on the morphology of the language. You can find published stemming algorithms online for several Western European languages, including German, Spanish and French.

Similarity Search
It's easy to add a "find similar" feature to your search engine. Just use an existing document vector as your query, and everything else falls into place. If you want to do a similarity search on multiple documents, then add the vectors together.
Term Weighting
Term weighting is a fancy way of saying "some words are more important than others". Done properly, it can greatly improve search results.

You calculate weights when building document vectors. Local weighting assigns values to words based on how many times they appear in a single document, while global weighting assigns values based on word frequency across the entire collection. The intuition is that rare words are more interesting than common words (global weighting), and that words that appear once in a document are not as relevant as words that occur multiple times (local weighting).
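
As a concrete sketch of how this might be bolted onto the make_vector subroutine (the %doc_freq hash and $num_docs count here are hypothetical - you would gather them while indexing the collection), the classic tf-idf scheme multiplies the local count by the log of the inverse document frequency:

        # Hypothetical tf-idf weighting inside make_vector's loop.
        # %doc_freq holds, for each word, how many documents contain it;
        # $num_docs is the number of documents in the collection.
        foreach my $w ( keys %words ) {
                my $tf     = $words{$w};                        # local weight
                my $idf    = log( $num_docs / $doc_freq{$w} );  # global weight
                my $offset = $index{$w};
                index( $vector, $offset ) .= $tf * $idf;
        }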

Incorporating Metadata
If your documents have metadata descriptors (dates, categories, author names), then you can build those in to the vector model. Just add a slot for each category, like you did for your keywords, and apply whatever kind of weighting you desire.
Exact Phrase Matching
You can add arbitrary constraints to your result set by adding a chain of filters to your result set. An easy way to do exact phrase matching is to loop through your result set with a regular expression. This kind of post-processing is also a convenient way to sort your results by something other than relevance.
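
For instance, here's a sketch of such a post-filter (assuming, as in the usage example above, that the document text itself serves as the result key):

        # Keep only results whose text contains the exact phrase.
        my $phrase = 'fine pet';
        my %exact  = map  { $_ => $results{$_} }
                     grep { /\Q$phrase\E/i }
                     keys %results;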

Further Reading

There's a vast body of material available on search engine design, but little of it is targeted at the hobbyist or beginner. The following are good places to start:

http://hotwired.lycos.com/webmonkey/code/97/16/index2a.html?collection=perl
This Webmonkey article dates back to 1997, but it's still the best tutorial on writing a reverse index search engine in Perl.
http://moskalyuk.com/software/perl/search/kiss.htm
An example of a simple keyword search engine - no database, just a deep faith in Perl's ability to parse files quickly.
http://www.movabletype.org/download.shtml
Movable Type is a popular weblog application, written in Perl. The search engine lives in MT::App::Search, and supports several advanced features. It's not pretty, but it's real production code.
http://jakarta.apache.org/lucene/docs/index.html
Lucene is a Java-based keyword search engine, part of the Apache project. It's a well-designed, open-source search engine, intended for larger projects. The documentation discusses some of the challenges of implementing a large search engine; it's worth reading even if you don't know Java.
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=3391
Foundations of Statistical Natural Language Processing, MIT Press (hardcover textbook). Don't be put off by the title - this book is a fantastic introduction to all kinds of issues in text search, and includes a thorough discussion of vector space models.
http://www.nitle.org/lsi/intro/
An introduction to latent semantic indexing, the vector model on steroids. The document is aimed at nontechnical readers, but gives some more background information on using vector techniques to search and visualize data collections.

The adventurous can also download some Perl code for latent semantic indexing at http://www.nitle.org/lsi.php. Both the code and the article come from my own work for the National Institute for Technology and Liberal Education.

This week on Perl 6, week ending 2003-02-16

Welcome to the all new, entirely unaltered, all singing, all dancing Perl 6 summary. Your beacon of reliability in the crazy world that is Perl 6 design and development.

Another quiet week. Even quieter than last week in fact, unless my mail spent some of the time up the spout, but I don't think so.

So, as is traditional, we kick off with perl6-internals

CGP - The Computed Goto Prederefed Runloop

Last week I mentioned that nobody seemed to have commented on Dan's bet with Guido van Rossum that Parrot would outperform Python by OSCON 2004. (I also missed the fact that the penalty for losing the bet now involves cream pies as well as $10 and a round of drinks...). After I posted the summary to the mailing lists, Leopold Tötsch informed me that he had commented indirectly by announcing his new, improved, ludicrously quick runloop that combines computed GOTOs and predereferencing. Whatever that means.

This week, Leo and Nicholas Clark worked out how to combine the blistering pace of the JIT core (for operations that had been translated into hand hacked machine code) with the blistering pace of the CGP runloop (for the other ops). As far as I can tell, this involved turning the idea 'inside out', the VM actually starts up running JIT compiled code and calls out to the CGP core to execute non-JITable sequences of ops. The numbers for this approach look fantastic (quite stunningly quick...) So Leo checked it in.

http://groups.google.com/groups

http://groups.google.com/groups -- Some numbers

http://groups.google.com/groups -- Check in notice

Optimized runloops and threading issues

Last week we were reminded that JIT and Predereferenced runloops don't work in a threaded environment. This week Jerome Vouillon pointed out an approach that looks like it could fix that (it seemed to convince Leo). Dan (possibly playing mail catchup) said he was okay with having to fall back to the old fast core (as opposed to the current, stupidly fast core) in the presence of threads, but Leo seemed to think that, using Jerome's scheme, we'll be able to have our cake, eat it and still throw the cream pie at the Python team. Huzzah!

http://groups.google.com/groups

keyed_int issues

Leo Tötsch had wondered about why {get,set}_<type>_keyed_int vtable methods needed to take an INTVAL* value instead of a plain INTVAL as it introduced some possibly unneeded conditional code, a stack variable for the key and made life hard for JIT code. It looks like the pointer is a holdover from an early approach to doing multidimensional keyed structures.

http://groups.google.com/groups

Changes to the calling convention and other fallout

Dan returned from the Sebastopol Perl 6 meeting with a few announcements and one change in the Parrot calling convention (how many calling conventions have we had now?).

http://groups.google.com/groups

Macro support in IMCC

Jürgen Bömmels announced that he'd implemented macro expansion in IMCC (Yay Jürgen!). Leo liked it, but requested a few changes before he'd check it in, so hopefully, some time soon the mainline IMCC will have macros, which is very nice.

http://groups.google.com/groups

Meanwhile, in perl6-language

Almost all the discussion was about the difference between arrays and lists. Deborah Ariel Pickett came up with a good list of questions about arrays and array references in scalar and list contexts, which Michael Lazzaro answered (very neatly I thought) with a list of their different contextual behaviours. Deborah extended Michael's list to hashes and hashrefs in a reasonably obvious way. Smylers came up with a possible new sigquote (after paraphrasing): ``We should limit new features to those that arise from a desire for the functionality rather than a desire to use up 'spare' syntax''. (Okay, it's not exactly snappy, but I think it's important).

http://groups.google.com/groups -- Deborah's questions

http://groups.google.com/groups -- Michael's answers

And that wraps it up for the language list. I'm sure it'll pick up again soon though. There are rumours of a draft apocalypse in the next couple of weeks, and presumably that implies a real apocalypse soon after. Assuming we haven't had another kind of Apocalypse in the mean time.

Announcements, Acknowledgements and AnotherWordBeginningWithA

This week's summary was again prepared in the comfy chair with surprisingly few distractions apart from the late arrival of mail from Leon Brocard announcing that he'd implemented a brainf*ck compiler in brainf*ck, but that didn't really happen this week so I've got no excuse for mentioning Leon's name in this summary.

Thanks to everyone who dropped us a line about our American Odyssey; I'm hoping I'll have one of those web page things up at some point with a rough itinerary in case anyone wants to come and throw rotten tomatoes at me (or, preferably, buy me sushi).

Proofreading services were again down to Aspell and me.

If you appreciated this summary, please consider one or more of the following options:

This week's summary was, once more, sponsored by Darren Duncan. Thanks Darren. If you'd like to become a summary sponsor, drop me a line at p6summarizer@bofh.org.uk.

Module::Build


Dave Rolsky is a co-author of the recently released Embedding Perl in HTML with Mason.

This article was originally published in February of 2003, and was updated by Dave Rolsky and Michael Schwern in January, 2008.

If you've ever created a Perl module for distribution on CPAN, you've used the ExtUtils::MakeMaker module. This venerable module goes back to the dawn of modern Perl, which began with the release of Perl 5.000.

Recently, Ken Williams has created a potential replacement for ExtUtils::MakeMaker called Module::Build, which was first released in August of 2002. Hugo van der Sanden, the pumpking for the current development version of Perl, has expressed interest in replacing ExtUtils::MakeMaker with Module::Build for the 5.10 release of Perl, and Michael Schwern, the current ExtUtils::MakeMaker maintainer, agrees with him. ExtUtils::MakeMaker won't go away any time soon, but we can hope for a gradual transition to a new and improved build system.

Why ExtUtils::MakeMaker is Important

The ExtUtils::MakeMaker module, along with the h2xs script, has been a huge boon to the Perl community, as it makes it possible to have a standard way to distribute and install Perl modules. It automates many tasks that module authors would otherwise have to implement by hand, such as turning XS into C, compiling that C code, generating man pages from POD, running a module's test suite, and of course, installing the module.

ExtUtils::MakeMaker is a huge part of what makes PAUSE and CPAN possible, and it is quite a programming feat. Python did not have a similar utility until the September, 1999 release of distutils, PHP's PEAR didn't begin until mid-2000, and Ruby is just beginning to work on a standard library distribution mechanism.

The Scary Guts of ExtUtils::MakeMaker

ExtUtils::MakeMaker works by generating a makefile that contains targets for the various tasks needed when maintaining or installing a module. Here's an example target:

  install :: all pure_install doc_install

If you're familiar with makefile syntax, you'll realize that all this target does is call three other targets, which are named "all", "pure_install", and "doc_install". These targets in turn may call other targets, or use system commands to do whatever action is needed, in this case installing the module and its associated documentation.

The makefiles generated by ExtUtils::MakeMaker are fairly complex. For example, using version 6.05 of ExtUtils::MakeMaker to install my Exception::Class module, a pure Perl distribution containing just one module, creates a makefile with about 390 lines of makefile code. Figuring out what this makefile actually does is no simple feat, because it consists of a maze of twisty macros, all alike, many of which simply call Perl one-liners from the command line to perform some task.

The ExtUtils::MakeMaker code itself is extremely complicated, as it works on many operating systems (almost as many as Perl itself), and needs to accommodate their file systems, command line shells, different versions of make, and so on. And this is all done with an extra layer of indirection in place, because it is generating a makefile which does all the work.

If you want to customize the module build or installation process, good luck. To do this, you must subclass ExtUtils::MakeMaker, override methods that generate specific makefile targets, and then tweak the generated text to include your custom behavior, all the while preserving the basic behavior of the target. Considering that there is no documentation describing what to expect from these targets, and that the actual target text may change between releases of ExtUtils::MakeMaker or between different OS's, this can be quite painful to implement and maintain.

And by the way, you can't really subclass ExtUtils::MakeMaker, instead you are subclassing the MY package. This is a deeply strange hack, but the end result is that you can only override certain pre-defined methods in ExtUtils::MakeMaker.

For example, the HTML::Mason module includes this snippet:

  package MY;
  sub test {
      my $self = shift;
      my $test = $self->SUPER::test(@_);
      # %MY::APACHE is set in makeconfig.pl.
      # If we are not going to test with Apache there is no harm in
      # setting this anyway.
      # The PORT env var is used by Apache::test.  Don't delete it!
      my $port = $MY::APACHE{port} || 8228;
      $MY::APACHE{apache_dir} ||= '';
      my $is_maintainer = main::is_maintainer();
      # This works for older MakeMakers
      $test =~ s/(runtests \@ARGV;)/\$\$ENV{MASON_VERBOSE} == 
	  \$(TEST_VERBOSE) ? \$(TEST_VERBOSE) : \$\$ENV{MASON_VERBOSE}; 
	  \$\$ENV{PORT}=$port; 
	  \$\$ENV{APACHE_DIR}=q^$MY::APACHE{apache_dir}^; 
	  \$\$ENV{MASON_MAINTAINER}=$is_maintainer; $1/;
      my $bs = $^O =~ /Win32/i ? '' : '\\';
      # This works for newer MakeMakers (5.48_01 +)
	  $test =~ s/("-MExtUtils::Command::MM" "-e" ")
	  (test_harness\(\$\(TEST_VERBOSE\).*?\)"
	  \$\(TEST_FILES\))/$1 $bs\$\$ENV{MASON_VERBOSE} == \$(TEST_VERBOSE) ?
	  \$(TEST_VERBOSE) : $bs\$\$ENV{MASON_VERBOSE}; $bs\$\$ENV{PORT}=$port;
	  $bs\$\$ENV{APACHE_DIR}=q^$MY::APACHE{apache_dir}^;
	  $bs\$\$ENV{MASON_MAINTAINER}=$is_maintainer; $2/;
      return $test;
  }

The goal of all this code is to pass some additional environment information to the test scripts when they are run, so we can do tests with a live Apache server. It accommodates several versions of ExtUtils::MakeMaker, and attempts to work properly on multiple operating systems (at least Win32 and *nix), and it has to be careful about escaping things properly so that it executes properly from the shell.

Why not Perl?

All of this prompts the question of "why not just use Perl itself for all of this?" That's exactly the question that Ken Williams answered with Module::Build. The goal of Module::Build is to do everything useful that ExtUtils::MakeMaker does, but to do this all with pure Perl wherever possible.

This greatly simplifies the build system code, and Module::Build works on systems which don't normally include make, like Win32 and Mac OS. Of course, if a module installation requires the compilation of C code, you'll still need an external C compiler.

Additionally, customizing Module::Build's behavior is often quite trivial, and only requires that you know Perl, as opposed to knowing make syntax and possibly having to learn about multiple command line environments.

Module::Build also aims to improve on some of the features provided by ExtUtils::MakeMaker. One example is its prerequisite-checking system, which provides much more flexibility than what ExtUtils::MakeMaker allows. While these features could be added to ExtUtils::MakeMaker, it's risky to make major changes to such an important module, especially one with such complex internals.

Using Module::Build

From an end-user perspective, a module that uses Module::Build looks quite a bit like one using ExtUtils::MakeMaker, and intentionally so. So to install a module using Module::Build you'd type the following lines from the command line:

  perl Build.PL
  ./Build
  ./Build test
  ./Build install

The Build.PL script tells Module::Build to create a Build script. During this process, Module::Build also writes some files to a _build/ directory. These files are used to store the build system's state between invocations of the Build script. This script, when invoked, simply loads up Module::Build again, and tells it to perform the specified action. An action is the Module::Build version of a makefile target, and actions are implemented in pure Perl whenever possible.

A bare bones Build.PL script might look like this:

  use Module::Build;
  Module::Build->new
      ( module_name => 'My::Module',
        license => 'perl',
      )->create_build_script;

The "module_name" parameter is like the ExtUtils::MakeMaker "NAME" parameter.

The "license" parameter is new with Module::Build, and its intended use it allow automated tools to determine under which license your module is distributed.

To determine your module's version, Module::Build looks in the module specified by the "module_name" parameter, though this can be overridden either by specifying a different module to look in, or by providing the version number directly.
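
For example, a Build.PL that spells out where the version lives might look like this (a sketch; see the Module::Build documentation for the full set of parameters):

  use Module::Build;
  Module::Build->new
      ( module_name       => 'My::Module',
        license           => 'perl',
        dist_version_from => 'lib/My/Module.pm',   # or: dist_version => '0.01'
      )->create_build_script;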

Of course, there are more options than those. For example, Module::Build implements a prerequisite feature similar to that implemented by ExtUtils::MakeMaker, so we can write:

  Module::Build->new
      ( module_name => 'My::Module',
        license => 'perl',
        requires => { 'CGI' => 0,
                      'DBD::mysql' => 2.1013,
                    },
      )->create_build_script;

If you have any experience with ExtUtils::MakeMaker, you can probably figure out that this says that our module requires any version of CGI, and version 2.1013 or greater of DBD::mysql. So far, this looks just like what ExtUtils::MakeMaker provides, but Module::Build goes further.

Consider what happens if we know that we need some piece of functionality first present in DBD::mysql 2.1013. But perhaps a release after this, 2.1014, introduced a new bug that breaks our application. If this bug is fixed in version 2.1015, we could simply require version 2.1015, but this is not ideal. There's no reason to force someone who already has 2.1013 to upgrade because of a bug in a version they don't have.

Fortunately, Module::Build provides a more flexible version specification option that handles exactly this situation, so we can write:

  Module::Build->new
      ( module_name => 'My::Module',
        license => 'perl',
        requires => { 'CGI' => 0,
                      'DBD::mysql' => '>= 2.1013, != 2.1014',
                    },
      )->create_build_script;

This says that we need a version greater than 2.1013 that is not version 2.1014. Users who have version 2.1013 or version 2.1015 or greater are not forced to upgrade, but anyone with 2.1014 will be.

If we knew that version 3.0 didn't work with our module, we could change our specification:

  Module::Build->new
      ( module_name => 'My::Module',
        license => 'perl',
        requires => { 'CGI' => 0,
                      'DBD::mysql' => '>= 2.1013, != 2.1014, < 3.0',
                    },
      )->create_build_script;

If the user does have version 3.0 or greater installed, it will at least let them know that it won't work with our module. Unfortunately, the only possible way to use our module at this point is for the end user to manually downgrade their installation of DBD::mysql, since Perl does not allow multiple versions of a module to co-exist peacefully. Still, this is better than letting the module be installed, only to fail at runtime when it tries to use an outdated API for DBD::mysql.

There are also other options related to prerequisites, such as "recommends" and "build_requires", which can be helpful for prerequisites that are required to build the module but don't need to be present after installation. There is also a "conflicts" option which can be used to warn a user about potential conflicts between the module they are installing and one they already have.
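
A Build.PL that exercises these options might look something like the following sketch (the module names here are placeholders):

  Module::Build->new
      ( module_name    => 'My::Module',
        license        => 'perl',
        requires       => { 'CGI' => 0 },
        recommends     => { 'Time::HiRes' => 0 },
        build_requires => { 'Test::More' => 0 },
        conflicts      => { 'Some::Old::Module' => '< 1.0' },
      )->create_build_script;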

Action!

As of release 0.15, Module::Build implements the following actions, most of which are based on existing ExtUtils::MakeMaker functionality:

  • build

    This is the default action, and is what happens if you run ./Build without any additional arguments. It is responsible for creating the blib/ directory and copying files into it, as well as compiling XS and C files. If you have any scripts like lib/My/Module.pm.PL, these are also run during this action.

  • test, testdb

    Runs the module's tests using the Test::Harness module. The "testdb" action can be used to run the tests under Perl's debugger. Equivalently, a "debugger" parameter can be passed to the "test" action to get the same effect.

  • clean, realclean

    Both actions delete any files created by the "build" action. The "realclean" action also deletes the existing Build script.

  • diff

    This action is used to compare the files about to be installed with any corresponding files that already exist. This feature is unique to Module::Build.

  • install

    Installs the module files. As of version 0.15, this doesn't yet create or install any man pages.

  • fakeinstall

    Tells you what the "install" would do.

  • dist

    Creates a gzip'd tarball of your distribution.

  • manifest

    Creates a MANIFEST file for your distribution.

  • distcheck

    Tells you what files are in the build directory but not in the MANIFEST file, and vice versa.

  • skipcheck

    Tells you what files will not be added to the MANIFEST by the "manifest" action, based on the contents of your MANIFEST.SKIP file.

  • distclean

    This is a shortcut for "realclean" followed by "distcheck".

  • distdir

    Creates a directory based on your distribution name and version, and then copies all the files listed in MANIFEST to that directory. This directory is what people will see when they download and unpack your distribution.

    Module::Build also creates a file called META.yaml which contains meta-data about your distribution. In the future, it may be possible to use a command line tool (written in Perl, of course) to read this file and use its contents to install your distribution, without running the Build.PL script. It also makes the meta-data more readily available to tools like search.cpan.org or the CPAN shell.

  • disttest

    This performs the "distdir" action, switches to the newly created directory, and then runs perl Build.PL, ./Build, and ./Build test. This lets you make sure that your distribution is actually installable.

  • help

    Tells you what actions are available. If additional actions are implemented in a distribution, then these are listed here.

Any of these options can be overridden through straightforward subclassing, so our HTML::Mason example from earlier in this article might be written something like this:

  package MyFancyBuilder;
  use base 'Module::Build';
  sub ACTION_test {
      my $self = shift;
      # %MY::APACHE is set in makeconfig.pl.
      $ENV{PORT}             = $MY::APACHE{port}       || 8228;
      $ENV{APACHE_DIR}       = $MY::APACHE{apache_dir} || '';
      $ENV{MASON_VERBOSE}  ||= $self->{properties}{verbose};
      # _is_maintainer_mode would be another method of our subclass
      $ENV{MASON_MAINTAINER} = $self->_is_maintainer_mode();
      return $self->SUPER::ACTION_test(@_);
  }

This version is actually readable, and is unlikely to break regardless of changes in the Module::Build internals. This highlights just how difficult it was to accomplish a simple task using ExtUtils::MakeMaker, and how natural the pure-Perl solution can be.

The Larger Picture and Backwards Compatibility

One difficulty in getting Module::Build into widespread use is the fact that support for ExtUtils::MakeMaker is so tightly integrated into CPAN installers.

While the version of CPAN.pm which ships with Perl 5.8 does not know how to deal with Build.PL, the latest version available from CPAN does. It will even install Module::Build for you. As of January 2003, CPANPLUS, another CPAN shell, understands Build.PL but will not install Module::Build for you; this should be remedied in a future release.

However, old versions of CPAN.pm are still in extremely widespread use, and users won't necessarily upgrade CPAN.pm before attempting to install a distribution that relies on Module::Build.

There are a couple of workarounds for this problem. The simplest is to just include a Build.PL script and document this in the README or INSTALL files included with your distribution. This has the appeal of requiring very little work to implement, but the downside is that people who expect things to just work with a CPAN shell will give up when your distribution doesn't install properly.

Another possibility is to create functionally equivalent Build.PL and Makefile.PL scripts. If you're using Module::Build because you need to customize installation behavior in a way that is difficult to do with ExtUtils::MakeMaker, this pretty much defeats the purpose of using Module::Build at all, and in any case having two separate pieces of code that do the same thing is always unappealing.

Then there's the approach which involves using a Makefile.PL script that simply installs Module::Build if needed, and then generates a Makefile that passes everything through to the ./Build script. This is known as the "passthrough" method.

I think this approach gives the best result for the effort involved, and is the method I prefer. The Module::Build distribution includes a Module::Build::Compat module, which does the dirty work needed for this approach.

Simply add create_makefile_pl => 'passthrough' to the Build.PL parameters and a Makefile.PL will be created as part of the Build dist process.
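
For example, a minimal Build.PL requesting the passthrough Makefile.PL might look like this sketch (the module name is invented for the example):

  use Module::Build;

  Module::Build->new
    ( module_name        => 'Foo::Bar',      # hypothetical
      create_makefile_pl => 'passthrough',   # have "Build dist" emit a Makefile.PL
    )->create_build_script;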

Here's an example of such a Makefile.PL script:

    # Note: this file was auto-generated by Module::Build::Compat version 0.03

    unless (eval "use Module::Build::Compat 0.02; 1" ) {
      print "This module requires Module::Build to install itself.\n";
      
      require ExtUtils::MakeMaker;
      my $yn = ExtUtils::MakeMaker::prompt
        ('  Install Module::Build now from CPAN?', 'y');
      
      unless ($yn =~ /^y/i) {
        die " *** Cannot install without Module::Build.  Exiting ...\n";
      }
      
      require Cwd;
      require File::Spec;
      require CPAN;
      
      # Save this 'cause CPAN will chdir all over the place.
      my $cwd = Cwd::cwd();
      
      CPAN::Shell->install('Module::Build::Compat');
      CPAN::Shell->expand("Module", "Module::Build::Compat")->uptodate
        or die "Couldn't install Module::Build, giving up.\n";
      
      chdir $cwd or die "Cannot chdir() back to $cwd: $!";
    }
    eval "use Module::Build::Compat 0.02; 1" or die $@;
    
    Module::Build::Compat->run_build_pl(args => \@ARGV);
    require Module::Build;
    Module::Build::Compat->write_makefile(build_class => 'Module::Build');

So what exactly is going on here? A good question indeed. Let's walk through some of the code.

    unless (eval "use Module::Build::Compat 0.02; 1" ) {
      print "This module requires Module::Build to install itself.\n";
      
      require ExtUtils::MakeMaker;
      my $yn = ExtUtils::MakeMaker::prompt
        ('  Install Module::Build now from CPAN?', 'y');
      
      unless ($yn =~ /^y/i) {
        die " *** Cannot install without Module::Build.  Exiting ...\n";
      }

This first attempts to load version 0.02 or greater of the Module::Build::Compat module. If it isn't installed, we know we need to install Module::Build. Because we're polite, we ask the user if they would like to install Module::Build before going further. Some people dislike interactive installations, but fortunately the prompt() function is pretty smart about detecting whether there's a real user at the other end of the line.

Assuming that the user agrees to install Module::Build (if they don't, the installer has to give up), this is what comes next:

      # Save this 'cause CPAN will chdir all over the place.
      my $cwd = Cwd::cwd();
      
      CPAN::Shell->install('Module::Build::Compat');
      CPAN::Shell->expand("Module", "Module::Build::Compat")->uptodate
        or die "Couldn't install Module::Build, giving up.\n";
      
      chdir $cwd or die "Cannot chdir() back to $cwd: $!";

We want to use CPAN.pm to actually install Module::Build, but we need to first save our current directory, because CPAN.pm calls chdir() quite a bit, and we'll need to be in the same directory as we started in after installing Module::Build.

Then we load CPAN.pm and tell it to install Module::Build. After that, we chdir() back to our original directory.

    eval "use Module::Build::Compat 0.02; 1" or die $@;
    
    Module::Build::Compat->run_build_pl(args => \@ARGV);
    require Module::Build;
    Module::Build::Compat->write_makefile(build_class => 'Module::Build');

First it checks that the Module::Build install worked. Then it simply tells Module::Build::Compat to run the Build.PL script, and to write out a "passthrough" Makefile. Module::Build::Compat will attempt to convert ExtUtils::MakeMaker style arguments, like "PREFIX", to arguments that Module::Build can understand, like "--prefix".

The "passthrough" Makefile that Module::Build::Compat generates looks something like this:

  all :
          ./Build
  realclean :
          ./Build realclean
          rm -f \$(THISFILE)
  .DEFAULT :
          ./Build \$@
  .PHONY   : install manifest

The ".DEFAULT" target is called when there is no matching make target for the one given on the command line. It uses the "$@" make variable, which will contain the name of the target that was passed to make. So if "make install" is called, then "$@" contains "install", and it ends up running "./Build install".

The generated Makefile also contains a comment which specifies the module's prerequisites, because this is how CPAN.pm figures out what a module's prerequisites are (scary but true).

This approach is the most elegant of all, but the code that translates ExtUtils::MakeMaker arguments to something Module::Build understands is quite minimal and won't handle all possibilities.

I have used this approach for one CPAN module, Thesaurus.pm, and in my limited testing it did work. If you are inclined to try installing this module, please send bug reports to me or the Module::Build users list, module-build-general@lists.sf.net.

Recently, Autrijus Tang submitted a more complex Makefile.PL script which implements several pieces of additional functionality. First of all, it makes sure that the script is running as a user that can actually install Module::Build. Second, it prefers CPANPLUS.pm to CPAN.pm.

Autrijus' script looks promising, but since it hasn't yet been tested, I've chosen not to include it here. It's quite likely that some version of his script will be documented in future versions of Module::Build.

Custom Behavior

As was hinted at earlier, you can directly subclass Module::Build in order to implement custom behavior. This is a big topic unto itself, and will be the topic of a future article here on perl.com.

The Future

There is plenty of work left to be done on Module::Build. Off the top of my head, here are some things that still need to be done:

The installation phase does not yet create man pages based on POD included in the distribution.

Module::Build needs to implement a "local install" feature like the one provided by the ExtUtils::MakeMaker "PREFIX" argument. The logic that implements this in ExtUtils::MakeMaker is Byzantine, but only because doing this correctly is quite complicated. This logic needs to be implemented for Module::Build as well.

Module::Build needs better backwards compatibility with ExtUtils::MakeMaker. The argument translation in Module::Build::Compat is currently just a placeholder. Things like "PREFIX", "LIB", and "UNINST=1" all need to be translated by Module::Build::Compat, and the equivalent functionality needs to be implemented for Module::Build.

CPANPLUS.pm could take advantage of more Module::Build features. For example, it currently ignores the "conflicts" information that Module::Build makes available, and it doesn't attempt to distinguish between "build_requires", "requires", and "recommends".

Some of what Module::Build provides is intended for use by external tools, such as the meta-data provided by the META.yaml file. CPANPLUS.pm could use this to avoid having to run the Build.PL and Build scripts, thus avoiding the need to install any of the "build_requires" modules. Package managers like rpm or the Debian tools could also use it to construct installable packages for Perl modules more easily.

Adding at least basic support for Module::Build to CPAN.pm would be nice. If anyone is interested in pursuing this, I have an old patch that may provide a decent start on this. Contact me if you're interested.

More Information

If you want to learn more about Module::Build, the first thing you should do is install it (it will install itself just fine under CPAN.pm) and read the docs for the Module::Build and Module::Build::Compat modules. There is a project on SourceForge for Module::Build at http://www.sourceforge.net/projects/module-build. The source is in CVS on SourceForge, and is accessible via anonymous CVS and online.

Finally, if you're interested in using or helping to develop Module::Build, please sign up on the module-build-general@lists.sf.net email list. See http://lists.sourceforge.net/lists/listinfo/module-build-general for more details.

Thanks

Thanks to Ken Williams for reviewing this article before publication, and for writing Module::Build.

This week on Perl 6, week ending 2003-02-09

Welcome to the latest Perl 6 summary, your handy cut out and keep guide to the goings on in the crazy world of Perl 6 design and development.

It's been a rather quiet week this week; only 75 messages in perl6-internals and a mere 57 in perl6-language. So, at least it's palindromic.

We start off, as is traditional, with perl6-internals.

The 2004 Performance challenge

Dan announced that he'd made a bet with Guido van Rossum that Parrot would be faster at executing a pure python benchmark of some sort to be determined. The details of the challenge will be announced at OSCON 2003 and the results at OSCON 2004. Dan stands to lose $10 and a round of beer for the Zope/Pythonlabs folks. (Dunno how many of them there are though...). We don't know what he stands to win yet, but I'd hope 'beers from each of the Zope/Pythonlabs folks' are included.

For some reason nobody commented on this.

http://groups.google.com/groups

More Parrot Objects

Jerome Quelin wondered how Parrot objects would handle property inheritance. Dan pointed out that properties don't get inherited, and Jerome realised he meant to ask about attribute inheritance. Attributes are inherited, but are mostly invisible to all methods except those of the class that defines them (though they will presumably be accessible through reflection).

In another subthread, we got confused by multimethods.

http://groups.google.com/groups

http://groups.google.com/groups -- Multimethods

Bytecode Metadata

James Michael DuPont wanted to know what had been decided about Bytecode metadata and outlined the things that he'd like to know about a given bytecode. Leo Tötsch reckoned that what James wanted was either in the bytecode right now, or was handleable by the framework that was in place. He pointed James to docs/parrotbyte.pod in the Parrot distribution.

Further discussions centred on the amount of metadata and whether this would affect loading speed and efficiency, or get in the way of the desired 'mmap and go' principle. Gregor N. Purdy pointed out that we also had to be capable of passing metadata from a compiler 'through' IMCC and on to the final bytecode. There was also a touching reunion between James Michael DuPont and Gopal V. Ah...

http://groups.google.com/groups

Multi programming language questions

Phil Hassey has been lurking, fascinated, on the internals list for a couple of months. This week he broke the silence by asking a couple of questions about cross-language function dispatch, string compatibility and global scopes. For instance, PHP, Common Lisp and others are case insensitive about function names, which is okay when you're calling such a function from a case sensitive language, but can be tricky if you call out from a case insensitive to a case sensitive language. Dan thought that there wasn't going to be much that could be done about this problem (at least, not transparently) but seems to think that the other issues raised shouldn't be too problematic.

http://groups.google.com/groups

Random questions

David popped up and, after remarking on the amount of progress Parrot had made since he last looked at it, had a few questions about various bits and pieces. Leo and Dan provided a bunch of good answers.

http://groups.google.com/groups

A Scheme for extending core.ops

Leo Tötsch seems to have got bored of 'just' being the Patch Monster and has been producing some discussion documents proposing all sorts of useful tricks for improving the design/speed/extensibility of Parrot's internals. This week he turned his attention to core.ops. His plan involves a way of reducing the size of core_ops, improving cache locality and generally solving a raft of problems I didn't even know Parrot had. His new scheme allows for a small core.ops, easy extension and no execution speed penalty for non core ops. As is usual with Leo's stuff, the scheme came with code. Gregor had a bunch of observations and suggestions and he and Leo thrashed out a slightly modified scheme for going forward.

http://groups.google.com/groups

Week of the alternative runloops

Leo Tötsch offered a couple of different core runloops this week. First up was the Computed Goto Prederefed (CGP) runloop which essentially combined two runloop optimization techniques to provide what can only be described as massive speedups. The -O3 compiled version ran parrot_mops 6 times faster than the basic 'fast_core' and more than 3 times faster than the previous fastest core. People were impressed.

A few days later, Leo reached into his bag of tricks and pulled out the CSwitch runloop that did predereferenced dispatch via a big C switch statement. This loop is about 50% faster than the basic fast_core loop, but slower than all the computed goto dispatched loops. At this point Jason Gloudon pointed out that predereferencing can cause problems in a multi threaded environment.

Quote of the Parrot development process so far:

``I'm a really bad C programmer'' -- Leo Tötsch

http://groups.google.com/groups

And that about wraps it up for perl6-internals this week. Dan's been busy in Sebastopol with the Perl 6 Cabal thrashing out the design of Perl 6 functions and other bits with Larry, Damian, Allison and chromatic, which probably explains why he's not been driving the mailing list for the week.

Meanwhile in perl6-language

It's been even quieter on the language list. Mostly interesting though.

Shortcut ?=

Miko O'Sullivan proposed a new assignment operator, $var ?= $x : $y, equivalent to $var = $var ? $x : $y. It was pointed out that, for correctness, that should be ??=. Deborah Ariel Pickett wondered what the need for such an operator was, and couldn't imagine ever needing one.

http://groups.google.com/groups

Language Discussion Summaries

Miko O'Sullivan proposed that members of the language list provide summaries of the discussions in the group [What's this? Chopped liver? -- Ed]. The idea was that each summary would describe a proposed idea, summarize the list's feelings about it, and be posted on a website. The idea was basically well received, then the list fell to quibbling about format, whether summaries should be posted to the list where they could be picked up by the list summarizer, and whether we were just reinventing RFCs all over again.

As of now, the only summary that actually exists is Miko's summary of the proposed ??= operator. Which I hadn't even bothered to read until I came to write the summary.

http://groups.google.com/groups

http://www.idocs.com/perl6/ -- Summaries website

Newline as a statement terminator

Stéphane Payrard confessed that he would like newline to be a statement terminator everywhere it can be. Everybody hated it. Deborah Ariel Pickett came up with a neat example of why it was a really bad idea with a simple ambiguous piece of code:

  sub foo 
  {
  print "abcde"
  if $condition
  {
  print "fghij"
  }
  }

which could mean either:

  sub foo {
    print "abcdef" if $condition;
    
    return sub {
      print "fghij";
    }
  }

Or

  sub foo {
    print "abcde";
    if $condition {
      print "fghij";
    }
  }

And disambiguating with indentation is way too Pythonesque for many.

http://groups.google.com/groups

Arrays vs. Lists

In his quest to exhaustively document what is known about Perl so far, Michael Lazzaro asked for a definitive answer to the question ``What's the difference between an array and a list in Perl 6''?

Various answers were given, but I think Uri Guttman came up with the most succinct pair of definitions, which reduce to 'lists are read-only, arrays are read/write'. (That's the data structure itself, not the data structure's contents; ($a, $b) = (1, 2) doesn't modify either list, but it does modify the lvalues in the left-hand list.)

http://groups.google.com/groups

http://groups.google.com/groups -- Uri's definitions

Announcements, Acknowledgements and Trip Planning

This week's summary was prepared in the comfy chair with distractions from cats, irc, former employers, bugs and poring over maps of the Eastern Seaboard.

We're coming to America. This year, making a virtue out of a necessity, we're going to give ourselves a good long holiday in the States. We'll be flying out to Boca Raton for YAPC, then taking a road trip up through the Appalachians to New England, possibly swinging into Canada before coming down through Michigan to Chicago and taking a slow train to Portland, OR for The Perl Conference. If it looks like you might be on our route, drop me a line...

Proofreading services were provided by Aspell and me. (I could lie and say that they were provided by Leon Brocard, enabling me to work his name into the summary somehow, but I think I'll do it in a parenthetical comment instead.)

If you appreciated this summary, please consider one or more of the following options:

This week's summary was sponsored by Darren Duncan. Thanks Darren. If you'd like to become a summary sponsor, drop me a line at p6summarizer@bofh.org.uk.

Improving mod_perl Sites' Performance: Part 7

Correct configuration of the MinSpareServers, MaxSpareServers, StartServers, MaxClients, and MaxRequestsPerChild parameters is very important. There are no defaults. If they are too low, then you will underutilize the system's capabilities. If they are too high, then chances are that the server will bring the machine to its knees.

All the above parameters should be specified on the basis of the resources you have. With a plain Apache server, it's no big deal if you run many servers since the processes are about 1Mb and don't eat a lot of your RAM. Generally, the numbers are even smaller with memory sharing. The situation is different with mod_perl. I have seen mod_perl processes of 20Mb and more. Now, if you have MaxClients set to 50, then 50x20Mb = 1Gb. Maybe you don't have 1Gb of RAM - so how do you tune the parameters? Generally, by trying different combinations and benchmarking the server. Again, mod_perl processes can be made much smaller when memory is shared.

Before you start this task, you should be armed with the proper weapon. You need a crashme utility, which will load your server with the mod_perl scripts you possess. It needs to be able to emulate a multiuser environment, with multiple clients calling the mod_perl scripts on your server simultaneously. While there are commercial solutions, you can get away with free ones that do the same job. You can use the ApacheBench utility that comes with the Apache distribution; the crashme script, which uses LWP::Parallel::UserAgent; httperf; or http_load -- all discussed in one of the previous articles.
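
If you would rather roll your own quick-and-dirty load generator than install one of those tools, a handful of forked LWP::UserAgent clients will do for a rough test. This is only a sketch (the URL and the numbers are placeholders), and ApacheBench remains the better-instrumented choice:

  #!/usr/bin/perl -w
  use strict;
  use LWP::UserAgent;
  use HTTP::Request;

  my $url      = 'http://www.example.com/perl/access/access.cgi';
  my $clients  = 10;   # concurrent simulated users
  my $requests = 10;   # requests made by each user

  for (1 .. $clients) {
      next if fork;                            # parent keeps forking children
      my $ua  = LWP::UserAgent->new;
      my $req = HTTP::Request->new(GET => $url);
      $ua->request($req) for 1 .. $requests;   # each child hammers the server
      exit;
  }
  wait for 1 .. $clients;                      # reap all the children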

It is important to make sure that you run the load generator (the client which generates the test requests) on a system that is more powerful than the system being tested. After all, we are trying to simulate Internet users, where many users are trying to reach your service at once. Since the number of concurrent users can be quite large, your testing machine must be very powerful and capable of generating a heavy load. Of course, you should not run the clients and the server on the same machine. If you do, then your test results would be invalid. Clients will eat CPU and memory that should be dedicated to the server, and vice versa.

Configuration Tuning with ApacheBench

I'm going to use the ApacheBench (ab) utility to tune our server's configuration. We will simulate 10 users concurrently requesting a very light script at http://www.example.com/perl/access/access.cgi. Each simulated user makes 10 requests.


  % ./ab -n 100 -c 10 http://www.example.com/perl/access/access.cgi

The results are:


  Document Path:          /perl/access/access.cgi
  Document Length:        16 bytes
  
  Concurrency Level:      10
  Time taken for tests:   1.683 seconds
  Complete requests:      100
  Failed requests:        0
  Total transferred:      16100 bytes
  HTML transferred:       1600 bytes
  Requests per second:    59.42
  Transfer rate:          9.57 kb/s received
  
  Connnection Times (ms)
                min   avg   max
  Connect:        0    29   101
  Processing:    77   124  1259
  Total:         77   153  1360

The only numbers we really care about are:


  Complete requests:      100
  Failed requests:        0
  Requests per second:    59.42

Let's raise the request load to 100 x 10 (10 users, each making 100 requests):


  % ./ab -n 1000 -c 10  http://www.example.com/perl/access/access.cgi
  Concurrency Level:      10
  Complete requests:      1000
  Failed requests:        0
  Requests per second:    139.76

As expected, nothing changes -- we have the same 10 concurrent users. Now let's raise the number of concurrent users to 50:


  % ./ab -n 1000 -c 50  http://www.example.com/perl/access/access.cgi
  Complete requests:      1000
  Failed requests:        0
  Requests per second:    133.01

We see that the server is capable of serving 50 concurrent users at 133 requests per second! Let's find the upper limit. Using -n 10000 -c 1000 failed to get results (Broken Pipe?). Using -n 10000 -c 500 resulted in 94.82 requests per second. The server's performance went down with the high load.

The above tests were performed with the following configuration:


  MinSpareServers 8
  MaxSpareServers 6
  StartServers 10
  MaxClients 50
  MaxRequestsPerChild 1500

Now let's kill each child after it serves a single request. We will use the following configuration:


  MinSpareServers 8
  MaxSpareServers 6
  StartServers 10
  MaxClients 100
  MaxRequestsPerChild 1

Simulate 50 users each generating a total of 20 requests:


  % ./ab -n 1000 -c 50  http://www.example.com/perl/access/access.cgi

The benchmark timed out with the above configuration. I watched the output of ps as I ran it; the parent process just wasn't capable of respawning the killed children at that rate. When I raised MaxRequestsPerChild to 10, I got 8.34 requests per second. Very bad - 18 times slower! You can't benchmark the importance of the MinSpareServers, MaxSpareServers and StartServers with this type of test.

Now let's reset MaxRequestsPerChild to 1500, but reduce MaxClients to 10 and run the same test:


  MinSpareServers 8
  MaxSpareServers 6
  StartServers 10
  MaxClients 10
  MaxRequestsPerChild 1500

I got 27.12 requests per second, which is better but still four to five times slower. (I got 133 with MaxClients set to 50.)

Summary: I have tested a few combinations of the server configuration variables (MinSpareServers, MaxSpareServers, StartServers, MaxClients and MaxRequestsPerChild). The results I got are as follows:

MinSpareServers, MaxSpareServers and StartServers are only important for user response times. Sometimes users will have to wait a bit.

The important parameters are MaxClients and MaxRequestsPerChild. MaxClients should be not too big, so it will not abuse your machine's memory resources, and not too small, for if it is, your users will be forced to wait for the children to become free to serve them. MaxRequestsPerChild should be as large as possible, to get the full benefit of mod_perl, but watch your server at the beginning to make sure your scripts are not leaking memory, thereby causing your server (and your service) to die very fast.

Also, it is important to understand that we didn't test the response times in the tests above, but the ability of the server to respond under a heavy load of requests. If the test script was heavier, then the numbers would be different but the conclusions similar.

The benchmarks were run with:

  • HW: RS6000, 1Gb RAM
  • SW: AIX 4.1.5, mod_perl 1.16, apache 1.3.3
  • Machine running only mysql, httpd docs and mod_perl servers.
  • Machine was _completely_ unloaded during the benchmarking.

After each server restart when I changed the server's configuration, I made sure that the scripts were preloaded by fetching a script at least once for every child.

It is important to notice that none of the requests timed out, even if it was kept in the server's queue for more than a minute! That is the way ab works, which is OK for testing purposes but will be unacceptable in the real world - users will not wait for more than five to 10 seconds for a request to complete, and the client (i.e. the browser) will time out in a few minutes.

Now let's take a look at some real code whose execution time is more than a few milliseconds. We will do some real testing and collect the data into tables for easier viewing.

I will use the following abbreviations:


  NR    = Total Number of Requests
  NC    = Concurrency
  MC    = MaxClients
  MRPC  = MaxRequestsPerChild
  RPS   = Requests per second

Running a mod_perl script with lots of mysql queries (the script under test is mysqld limited) (http://www.example.com/perl/access/access.cgi?do_sub=query_form), with the configuration:


  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50
  MaxRequestsPerChild 5000

gives us:


     NR   NC    RPS     comment
  ------------------------------------------------
     10   10    3.33    # not a reliable figure
    100   10    3.94    
   1000   10    4.62    
   1000   50    4.09

Conclusions: Here I wanted to show that when the application is slow (not due to perl loading, code compilation and execution, but limited by some external operation) it almost does not matter what load we place on the server. The RPS (Requests per second) is almost the same. Given that all the requests have been served, you have the ability to queue the clients, but be aware that anything that goes into the queue means a waiting client and a client (browser) that might time out!

Now we will benchmark the same script without mysql (code limited by Perl only), at http://www.example.com/perl/access/access.cgi; it's the same script, but it just returns the HTML form without making any SQL queries.


  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50
  MaxRequestsPerChild 5000

     NR   NC      RPS   comment
  ------------------------------------------------
     10   10    26.95   # not a reliable figure
    100   10    30.88   
   1000   10    29.31
   1000   50    28.01
   1000  100    29.74
  10000  200    24.92
 100000  400    24.95

Conclusions: This time the script we executed was pure perl (not limited by I/O or mysql), so we see that the server serves the requests much faster. You can see the number of requests per second is almost the same for any load, but goes lower when the number of concurrent clients goes beyond MaxClients. With 25 RPS, the machine simulating a load of 400 concurrent clients will be served in 16 seconds. To be more realistic, assuming a maximum of 100 concurrent clients and 30 requests per second, the client will be served in 3.5 seconds. Pretty good for a highly loaded server.

Now we will use the server to its full capacity, by keeping all MaxClients clients alive all the time and having a big MaxRequestsPerChild, so that no child will be killed during the benchmarking.


  MinSpareServers       50
  MaxSpareServers       50
  StartServers          50
  MaxClients            50
  MaxRequestsPerChild 5000
  
     NR   NC      RPS   comment
  ------------------------------------------------
    100   10    32.05
   1000   10    33.14
   1000   50    33.17
   1000  100    31.72
  10000  200    31.60

Conclusion: In this scenario, there is no overhead involving the parent server loading new children, all the servers are available, and the only bottleneck is contention for the CPU.

Now we will change MaxClients and watch the results: Let's reduce MaxClients to 10.


  MinSpareServers        8
  MaxSpareServers       10
  StartServers          10
  MaxClients            10
  MaxRequestsPerChild 5000
  
     NR   NC      RPS   comment
  ------------------------------------------------
     10   10    23.87   # not a reliable figure
    100   10    32.64 
   1000   10    32.82
   1000   50    30.43
   1000  100    25.68
   1000  500    26.95
   2000  500    32.53

Conclusions: Very little difference! Ten servers were able to serve with almost the same throughput as 50. Why? My guess is because of CPU throttling. It seems that 10 servers were serving requests five times faster than when we worked with 50 servers; in that case, each child received its CPU time slice five times less frequently. So having a big value for MaxClients doesn't mean that the performance will be better. You have just seen the numbers!

Now we will start to reduce MaxRequestsPerChild drastically:


  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50

     NR   NC    MRPC     RPS    comment
  ------------------------------------------------
    100   10      10    5.77 
    100   10       5    3.32
   1000   50      20    8.92
   1000   50      10    5.47
   1000   50       5    2.83
   1000  100      10    6.51

Conclusions: When we drastically reduce MaxRequestsPerChild, the performance starts to become closer to plain mod_cgi.

Here are the numbers of this run with mod_cgi, for comparison:


  MinSpareServers        8
  MaxSpareServers       16
  StartServers          10
  MaxClients            50
  
     NR   NC    RPS     comment
  ------------------------------------------------
    100   10    1.12
   1000   50    1.14
   1000  100    1.13

Conclusion: mod_cgi is much slower. :) In the first test, when NR/NC was 100/10, mod_cgi was capable of 1.12 requests per second. In the same circumstances, mod_perl was capable of 32 requests per second, nearly 30 times faster! In the first test, each client waited about 100 seconds to be served. In the second and third tests, they waited 1,000 seconds!

Choosing MaxClients

The MaxClients directive sets the limit on the number of simultaneous requests that can be supported. No more than this number of child server processes will be created. To configure more than 256 clients, you must edit the HARD_SERVER_LIMIT entry in httpd.h and recompile. In our case, we want this variable to be as small as possible, so we can limit the resources used by the server children. Since we can restrict each child's process size with Apache::SizeLimit or Apache::GTopLimit, the calculation of MaxClients is pretty straightforward:


               Total RAM Dedicated to the Webserver
  MaxClients = ------------------------------------
                     MAX child's process size

So if I have 400Mb left for the Web server to run with, then I can set MaxClients to 40 if I know that each child is limited to 10Mb of memory (e.g. with Apache::SizeLimit).
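
For what it's worth, here is roughly how that per-child cap might be set from a mod_perl startup file. The package variable names and the handler setup are from memory of Apache::SizeLimit's mod_perl 1.x interface, so treat them as assumptions and check the module's own documentation before relying on them:

  # startup.pl (sketch only; variable names assumed, sizes in KB)
  use Apache::SizeLimit;
  $Apache::SizeLimit::MAX_PROCESS_SIZE       = 10240;  # kill children above ~10Mb
  $Apache::SizeLimit::CHECK_EVERY_N_REQUESTS = 2;      # check size every other request

  # httpd.conf would then install it as a handler, e.g.:
  #   PerlFixupHandler Apache::SizeLimit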

You will be wondering what will happen to your server if there are more concurrent users than MaxClients at any time. This situation is signified by the following warning message in the error_log:


  [Sun Jan 24 12:05:32 1999] [error] server reached MaxClients setting,
  consider raising the MaxClients setting

There is no problem -- any connection attempts over the MaxClients limit will normally be queued, up to a number based on the ListenBacklog directive. When a child process is freed at the end of a different request, the connection will be served.

It is an error because clients are being put in the queue rather than getting served immediately, despite the fact that they do not get an error response. The error can be allowed to persist to balance available system resources and response time, but sooner or later you will need to get more RAM so you can start more child processes. The best approach is to try not to have this condition reached at all, and if you reach it often you should start to worry about it.

It's important to understand how much real memory a child occupies. Your children can share memory between them when the OS supports that, but you must take action to allow the sharing to happen; we discussed this in one of the previous articles, whose main topic was shared memory. If you do this, then chances are that your MaxClients can be even higher, but it seems that it's not so simple to calculate the absolute number. If you come up with a solution, then please let us know! If the shared memory were the same size throughout the child's life, then we could derive a much better formula:


               Total_RAM + Shared_RAM_per_Child * (MaxClients - 1)
  MaxClients = ---------------------------------------------------
                              Max_Process_Size

which is:


                    Total_RAM - Shared_RAM_per_Child
  MaxClients = ---------------------------------------
               Max_Process_Size - Shared_RAM_per_Child

Let's roll some calculations:


  Total_RAM            = 500Mb
  Max_Process_Size     =  10Mb
  Shared_RAM_per_Child =   4Mb

              500 - 4
 MaxClients = --------- = 82
               10 - 4

With no sharing in place:


                 500
 MaxClients = --------- = 50
                 10

With sharing in place you can have 64 percent more servers without buying more RAM.

If you improve sharing further, say to the following level:


  Total_RAM            = 500Mb
  Max_Process_Size     =  10Mb
  Shared_RAM_per_Child =   8Mb

              500 - 8
 MaxClients = --------- = 246
               10 - 8

392 percent more servers! Now you can feel the importance of having as much shared memory as possible.
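
Since the arithmetic is easy to get wrong at 2 a.m., here is a throwaway Perl sketch of the formula above; the numbers are just the ones from the example, so substitute your own measurements:

  #!/usr/bin/perl -w
  use strict;

  my $total_ram = 500;   # Mb dedicated to the web server
  my $max_size  = 10;    # Mb, per-child process size limit
  my $shared    = 8;     # Mb of shared memory per child

  my $max_clients = int( ($total_ram - $shared) / ($max_size - $shared) );
  print "MaxClients ~ $max_clients\n";   # prints 246 for these numbers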

Choosing MaxRequestsPerChild

The MaxRequestsPerChild directive sets the limit on the number of requests that an individual child server process will handle. After MaxRequestsPerChild requests, the child process will die. If MaxRequestsPerChild is 0, then the process will live forever.

Setting MaxRequestsPerChild to a non-zero limit solves some memory leakage problems caused by sloppy programming practices, whereby a child process consumes more memory after each request.

If left unbounded, then after a certain number of requests the children will use up all the available memory and leave the server to die from memory starvation. Note that sometimes standard system libraries leak memory too, especially on OSes with bad memory management (e.g. Solaris 2.5 on x86 arch).

If this is your case, then you can set MaxRequestsPerChild to a small number. This will allow the system to reclaim the memory that a greedy child process consumed, when it exits after MaxRequestsPerChild requests.

But beware -- if you set this number too low, you will lose some of the speed bonus you get from mod_perl. Consider using Apache::PerlRun if this is the case.

Another approach is to use the Apache::SizeLimit or the Apache::GTopLimit modules. By using either of these modules you should be able to stop using MaxRequestsPerChild, although for some developers, using both in combination does the job. In addition, the latter module allows you to kill any servers whose shared memory size drops below a specified limit.

This week on Perl 6, week ending 2003-02-02

Parrot Objects (noun, not verb)

Piers Cawley worried about class-private attributes and asked if it was still going to be possible to write an object serializing tool that wouldn't require tons of Perl class overloads. Dan said that it should still be possible.

http://groups.google.com/groups

The packfile patches, an ongoing saga.

Leo Tötsch spent the week working on the packfile format and Parrot's tools for manipulating it. Various internals folks provided feedback, pointers to standards and other handy suggestions.

http://groups.google.com/groups

http://groups.google.com/groups

http://rt.perl.org/rt2/Ticket/Display.html

http://rt.perl.org/rt2/Ticket/Display.html

Securing Parrot ASM

Thomas Whateley has been thinking about how to allow Parrot to run untrusted code in a sandbox, and proposed a system where code would provide a list of required rights and Parrot would guarantee that only those privileges were provided. Matthew Byng-Maddick pointed everyone at the RFC he'd written along these lines and raised the issue of what happens when you call out to a module that's written natively, which he described as a 'brick wall'. Others were less sure that this was a showstopper.

Somewhat annoyingly, my mailer seems to have had a bad moment and lost some messages, including one from Dan in which he outlined his vision for 'safe mode' and gave some pointers to the VMS docs in this area.

http://groups.google.com/groups

http://dev.perl.org/rfc/353.pod

http://groups.google.com/groups -- Dan's outline

Parrot Developer World Map

Leon Brocard announced the Parrot Developer World Map, which plots the locations of all the Parrot developers who sent him their location details on a map of the world. A cursory glance at the map would seem to imply that a good place for a Parrot Developers' face to face meeting would be a Caribbean island. Gregor neatly one upped Leon with his interactive SVG version.

http://www.astray.com/parrot/worldmap/

http://www.focusresearch.com/gregor/map.html

Coroutine context patch

Jonathan Sillito found some problems with the coroutine code, so he wrote a test exposing the errors, patched things to fix the errors and added a documentation patch to be on the safe side. Go Jonathan. Steve Fink applied the patches. Jonathan also had a question about naming, and wondered if anyone would be interested in a howto for compiler writers. Nobody commented on this, but the howto sounds like a great idea to me.

http://groups.google.com/groups

Multiple code segments and the interpreter

Leo Tötsch noted some problems with interpreter->code and multiple code segments, suggested a way forward and asked for comments. Nicholas Clark commented, wondering about the potential for memory leaks, but Leo seemed to think that was covered.

http://groups.google.com/groups

Parrot run loop problems

Leo has also been thinking about Parrot run loops and reckons we're doing it wrong. So he proposed a different way of doing things, working through its impact on the various kinds of Parrot core. Steve Fink liked Leo's idea, but Jason Gloudon was less sure.

http://groups.google.com/groups

Meanwhile, in perl6-language

More Array Behaviours

Michael Lazzaro is working on documenting Perl Arrays and asked for confirmations and clarifications on various areas of Array behaviour. One subthread morphed into a discussion of when to use is properties and when to use but properties. It turns out that the 'is is compile time, but is runtime' rule of thumb is only a rule of thumb. Damian says that a better distinction is 'but is a property on a value, is is a property on a referent'. Variables, arrays, opaque objects, hashes etc. are referents and therefore get is properties. Values are the things that are held in variables, arrays, opaque objects etc., and they get but properties.

http://groups.google.com/groups

Spare Brackets

Martin wondered if it would be possible to use [] to do both hash and array lookups. Short answer: ``No''. Which isn't to say the thread didn't take off like a runaway train. After far too many messages we reached the same 'No' conclusion that Damian had stated way back at the start of the thread. The reasoning goes like this: given $container[$spec], you can't use the type of $container to decide whether to do a hash or an array lookup, because there are Perl object types which have both hash and array semantics. And you can't use $spec's type either, because Perl happily autoconverts between numbers and strings. So, you need some way to specify what sort of lookup you need, and you do that by using {} or [] as appropriate. But, frankly, that's beside the point. The reason we're not using [] to do both array and hash style lookups is Because Larry Says So.

http://groups.google.com/groups

http://groups.google.com/groups

Arrays: Default values

Argh! I've really been screwed by problems with my mail here. For some reason a batch of mail appears to have got dropped on the floor this week and my record of this thread is the one that's suffered the most.

The gist of the thread is that adding a default value to an array opens a whole scary can of worms to do with the distinction between 'undef' and 'exists' (made trickier still by the fact that 'exists' is a pretty vague concept when you're dealing with an array.) Damian's view is that an array with a default value should do the equivalent of

   $content = @foo[$index] // @foo.default;

ie, all undefs in the array should be 'replaced' with the array's default value. And that includes any 'deliberate' undefs (where the programmer does @foo[$index] = undef). As I understand it, doing anything else leads to implementation pain. And it leads to even more fun when your default is computed and the function doing the computation gets passed the index as an argument. What happens when someone does a shift?

For some reason this led into a discussion of whether hashes are ordered (they're not, but if you leave a particular hash alone any iterators (keys/values/each) will step through the hash in the same order, repeatably, but that order is not predictable).

http://groups.google.com/groups

http://groups.google.com/groups -- Current 'default' rules

Damian Takes a Sabbatical

Damian announced that, owing to time pressures, the need to find paying work, the need to help Larry get the next Apocalypse out the door and to get the associated Exegesis written he's unsubscribed from the language list for the time being, though he expects that he'll probably resubscribe once A6 and E6 are released. Folks wished him well.

http://groups.google.com/groups

In Brief

After asking for suggestions of a language to implement using Parrot, K Stol announced that he would be implementing a Lua->Parrot compiler as his final year project.

Who's Who in Perl 6?

Who are you?
Gopal V [gopalv82 at symonds dot net]
What do you do for/with Perl 6?
I'm more interested in Parrot than in Perl 6, since it is the only really free bytecode format out there. I am planning to build a C# to Parrot compiler. I did a bunch of opcodes in this direction, but mostly what I do is lurk around in the mailing list & IRC.
Where are you coming from?
I'm from the DotGNU project and have been playing around in that code for some time. I've still got my foot placed firmly there and am peering over the hedge into parrot. I'm still in college, so I still have to find time to study.
When do you think Perl 6 will be released?
On a Saturday morning after announcing Tuesday afternoon in the release plan.
Why are you doing this?
It's great to be involved with a community project; it's a very entertaining, stimulating and educating experience. In short, I'm in this for the warm fuzzy feeling!
You have 5 words. Describe yourself.
compulsive hacker, caffeine addict, RSI_sufferer.
Do you have anything to declare?
I promise, I'll try hard to get the C# compiler working with parrot. (starts singing ``Some day, Some way...'' off key)

Announcements and Acknowledgements

This week's summary was once again prepared in a comfortable chair with plenty of Earl Grey tea and distractions from two delightfully ginger kittens and my embarrassing ability to screw up my Gnus installation (now unscrewed).

Thanks to Gopal V for his answers to the questionnaire this week. If you've been involved in the Perl 6 development process, please consider answering the same set of questions and sending them to 5Ws@bofh.org.uk and you'll earn my undying gratitude.

Proofreading was handled by the ever lovely Aspell.

If you appreciated this summary, please consider one or more of the following options:

This week's summary was sponsored by Darren Duncan. Thanks Darren. If you'd like to become a summary sponsor, drop me a line at p6summarizer@bofh.org.uk.
