October 2002 Archives

On Topic

A few concepts in Perl 6 are strange at first sight. They seem hard to understand, but it's only because they're new and different. They aren't deep mystical concepts known only to Tibetan lamas. Anyone can understand them, but it helps to start with a common-sense explanation.

This article looks at the concepts of "topic" and "topicalizer". The words aren't quotes from a particularly nasty bit of Vogon poetry. They're actually common terms from the field of linguistics ... which some might say is even worse. Still, the best way to understand topic in Perl is to understand its source.

Topic in Linguistics

Every larger unit of human language has a topic -- whether it's a sentence, a paragraph, a conversation or some other sizable chunk. The topic is the central idea of the unit. It's the focus of what's communicated. Native speakers usually have no trouble figuring out what the current topic is, when they think about it. If two little old ladies were talking over a cup of tea:

"I saw Lister yesterday."

"Really? What's he up to these days?"

"Oh, you know, drunk again, and mooning over that awful Krissy Kochanski."

etc ...

and someone asked an observer what the conversation was about, they would instantly reply "Lister".

Topicalizers in Linguistics

A topicalizer is simply a word that marks some thing or some idea as the current topic. In English, we have topicalizers such as "for", "given" and "regarding":

"For our first trick tonight, ladies and gentlemen, my partner Kryten will attempt to eat a boiled egg."

"Given that God is infinite, and that the universe is also infinite, would you like a toasted tea-cake?"

"Regarding topicalizers, I should point out that this sentence starts with one."

Topic in Perl

Now we need to adapt the linguistic definition of topic to Perl. In Perl, the topic is the most important variable in a block of code. It can be any variable: a scalar, array, hash, object. To be more accurate, it's the underlying storage location that's the topic. This might sound a little too abstract, but it's an important distinction. Variables are really only names we use to get at stored values. A single storage location can have multiple names. In English, "Rimmer", "he" and "the hologram" could all appear in a text meaning the same person. In Perl $_, $name, %characters{'title'} and any number of other variables could appear in a section of code as different ways of accessing a single value. And if that value is the current topic then all the variables connected to it are too. This will be important later.

At this point, the average reader is thinking "That's very interesting, but why do I care what the topic is? I've gotten along just fine without it all these years. Why start now?"

And the answer is: it's not required. No one has to understand topic to use it, any more than they have to understand gravity to catch a ball.

Why? It's a really simple rule. We'll call it the first law of topics: Topic is $_. Any time a value becomes the current topic, $_ becomes another one of the names for that value. We say $_ is aliased to it. So, all it takes to use topic is to use $_ either explicitly or implicitly in all the old familiar places, like chomp, print and substitutions, and in a few new places, like when statements and with unary dot.

Even so, it's still a good idea to understand topic. Understanding gravity makes a number of things that seem unrelated suddenly fit. Things like apples falling, planes crashing, the way the moon and sun move, baseball, and rollercoasters. It's the same with topic. Any programmer can use $_ without understanding topic. But when they understand topic, it becomes a logical system instead of a random collection of "things $_ does". I like logical systems.

Topicalizers in Perl

A topicalizer in Perl is a keyword or construct that makes a value become the current topic. The current topicalizers are given, for, method, rule, ->, certain bare closures, and CATCH blocks, but the cast of characters keeps growing.

Coal and Switches

Perl 6's switch, the given construct, is the prime example of a topicalizer. Its sole purpose is to make the value passed to it the current topic within the block.


    given $name {
        when "Lister" { print "I'm a curryaholic." }
        when "Cat"    { print "Orange?! With this suit?!" }
        when "Rimmer" { print "4,691 irradiated haggis." }
    }

So, in this example, the given makes $name the current topic by aliasing it to $_. Then the when statements compare against $_. After the block, $_ regains the value in the outer scope. In Perl 6, $_ is just an ordinary lexical variable and every topicalizer creates its own $_ variable, lexically scoped to its associated block.

Fruit Loops and M&M's

The for loop is the classic topicalizer. It was topicalizing long before most of us had a word for the activity. for is similar to given, but instead of creating a single topic, it creates a series of topics, one for each iteration.


    for @orders {
        when /scone/ {
            print "Would you like some toast?";
        }
        when /croissant/ {
            print "Hot, buttered, scrummy toast?";
        }
        when /toast/ {
            print "Really? How about a muffin?";
        }
    }

Just like given, for takes a value, in this case the current element of the array, and makes it the topic.

In simple cases like these, both for and given create the $_ alias read-write. This is the same as Perl 5: any change to $_ inside the block modifies the original value.

Bow and Arrow

The new and improved arrow (->) is the most flexible topicalizer. It appears in a variety of different contexts. By itself -> creates an anonymous subroutine just like sub.


        -> $param { ... }

        # is the same as:

        sub ($param) { ... }

The only differences are that -> doesn't require parentheses around its parameter list, and that -> topicalizes its first parameter.

In the following example, the first expression creates an anonymous sub and stores it in $cleanup. When the sub stored in $cleanup executes, the $line parameter takes the string argument and becomes the current topic, so both $line and $_ are aliased to the value of $intro. The usual suspects, chomp, substitution and print then use the topic as default.


    $cleanup = -> $line is rw {
        s:w/Captain Rimmer!/the bloke/;
        $line _= " who cleans the soup machine!";
        print;
    }

    $intro = "Fear not, I'm Captain Rimmer!";
    $cleanup($intro);

Unlike the simple for and given, the arrow creates its aliases read-only by default. The is rw property marks both the named alias and the $_ alias as read-write. Without the property attached, any statements within the block that modify $line or $_ cause a compile-time error just as if they had been explicitly flagged is constant.

The arrow isn't limited to working alone. It can also combine with other topicalizers. When it does, it creates a named alias for the current topic.


    for @lines -> $line is rw {
        s:w/Captain Rimmer!/the bloke/;
        $line _= " who cleans the soup machine!";
        print;
    }

As the for iterates over the array it aliases every element in turn to $line and to $_. This takes the place of the Perl 5 way of aliasing a loop variable:


    # Perl 5
    for my $line (@lines) {
        $line =~ s/Captain Rimmer!/the bloke/;
        $line .= " who cleans the soup machine!";
        print $line;
    }

The Perl 6 way has some added benefits, though. Since the arrow aliases both $line and $_ to the current value, it works with the defaulting constructs, like print, but also provides a more meaningful name than $_ when an explicitly named variable is necessary.

The first example of for and the example of the anonymous sub reference are fascinatingly similar. The only difference is one is stored in a variable to be called later and one is tacked onto a for. Really, all the for example has done is replace the loop's block with an anonymous sub reference. This is the second advantage of the Perl 6 way. Because $line is now the parameter of a subroutine, it's automatically lexically scoped to the block. The my happens implicitly.

The arrow also combines with constructs that aren't topicalizers, like if and while, and allows them to topicalize.


    if %people{$name}{'details'}{'age'} -> $age {
        print "$age already?\n";
        if $age > 3000000 {
            print "How was your stasis?\n";
        } elsif $age < 10 {
            print "How 'bout a muffin?\n";
        }
    }

As the if tests the truth value of the element of the data structure, the arrow also aliases that value to $age and to $_. The example could have accessed the hash of hashes of hashes directly each time it needed the age value, but the short alias is much more convenient.

This feature is really only useful with simple truth tests. The truth value tested in the following example isn't 3 or $counter, it's the result of a complex conditional, $counter > 3.


    if $counter > 3 -> $value {
        # do something with $value
    }

The result will be true or false, but if it's false, the block will never execute. In fact, when the truth test is false, $value is never aliased at all. It simply doesn't exist. A lexically scoped variable with a true value would have the same effect.


    if $counter > 3 {
        my $value = 1;
        # do something with $value
    }

So the arrow isn't a Ronco plum-pitter, yoghurt-squirter, do-everything tool. When it's useful, it's very, very useful, but when it's not... well... don't use it. :)

Method in my Madness

Methods topicalize their invocant. An invocant is the object on which a method is called. After saying that 10 times fast, anyone can see why it needed a name. The design team chose "invocant". Methods topicalize the invocant when it's a named parameter like $self:



    method sub_ether ($self: $message) {
        .transmit( .encode($message) );
    }

and when it's left implicit:


    method sub_ether {
        .transmit( .encoded_message );
    }

    method sub_ether (: $message) {
        .transmit( .encode($message) );
    }

This is handy in short methods. The unary dot is just another defaulting construct. Any method call without an explicit object executes on the current topic. In the previous examples, .transmit is exactly the same as $self.transmit and $_.transmit.

The Sub of All Fears

Unlike methods and the arrow, ordinary subs don't topicalize, at least not by default. The following example will either print nothing, or else print a stray $_ that is in lexical scope wherever the sub is defined.


    sub eddy ($space, $time) {
        print;
    }

Subs can topicalize, though, with a little help from the is topic property. The property flags a parameter as the topic within the subroutine's block. It can be attached to any parameter in the list, but not to more than one parameter at a time.


    sub eddy ($space, $time is topic) {
        print;
    }

Built-in functions like print that default to the current topic when they're called are incredibly useful. Wouldn't it be great to have user-defined subroutines that behaved the same way? But, since $_ is now just an ordinary lexical variable, a subroutine generally can't access the topic from its caller. It can only access variables in the scope where it is defined.

The is given property gives subroutines access to their caller's topic. It accesses the topic within the caller's scope and binds it to the property's parameter.


    sub print_quotes is given($default) {
        print "Random Quote: ";
        print $default;
    }
    ...
    given $quote {
        print_quotes;
    }

The is given property can appear on subroutines with a full parameter list as well. The only restriction is that the property's parameter can't have the same name as a parameter in the full list.


    sub print_quotes (*@quotes) is given($default) {
        print "Random Quote: ";
        if ( @quotes.length > 0 ) {
            print @quotes;
        } else {
            print $default;
        }
    }

(A true simulation of the built-in print function would use multimethod dispatch, but that's outside the scope of this article.)

The property's parameter can have any name, but a parameter named $_ or one that has the is topic property attached will set the caller's topic as the topic within the subroutine as well.


    sub print_quotes is given($_) {  
                # alias $_ to caller's $_
        print;  # prints the value of caller's $_
    }
    # or
    sub print_quotes is given($default is topic) { 
                # alias $default to caller's $_
                # and to $_ within the subroutine
        print;  # prints $default, the value of caller's $_
    }

Perl Rules!

Grammar rules and closures within a rule topicalize their state object. This is convenient because it means methods on the state object can use the unary dot syntax.


    m:each/ aardvark { print .pos } /

The state object for a rule is similar to the $self object for a method. It's an instance of a grammar class. Named rules are really just named methods called on the state object and anonymous rules and closures within a rule are really just anonymous methods called on the state object. Unfortunately, that's just enough information to be tantalizing without actually being useful, but a full explanation could take up an entire article. Still, knowing that the state object is like a $self object is a step in the right direction.

The CATCH-er in the try

CATCH blocks always topicalize the error variable, $!. This streamlines the exception catching syntax because the CATCH block acts as its own switch statement.


    CATCH {
        when Err::WrongUniverse {
            try_new_universe();
        }
    }

This is much tidier than the equivalent:


    CATCH {
        if $!.isa(Err::WrongUniverse) {
            try_new_universe();
        }
    }

The Bare Truth

Bare closures topicalize their first argument. If the block uses placeholder variables, the topic is also aliased to the Unicode-abetically first variable. The topic is lexically scoped to the block, but it is a read-write parameter, so modifications to $_ within the block modify the original value.


    %commands = (
        add  => { $^a + $^b },
        incr => { $_  + 1 },
    );

Constructs like grep and map are no longer special because they use $_ within the block argument. They simply benefit from the normal behavior of bare blocks.


    @names = map { chomp; split /\s*/; } @input;

Nesting Instinct

Nested topicalizers are a little more complicated. The following example starts with $name as the topic and the first case matches against it. Within the case, the print also defaults to the current topic. The second case is a little more complicated; it contains a loop. Within the loop is another print statement. This one also defaults to the current topic which is... Hmmmm... is the topic $name or $quote?


    given $name {
        when /Rimmer/ {
            print;
            print rimmer_quote();
        }
        when /Kryten/ {
            for kryten_quotes() -> $quote {
                print;
            }
        }
    }

The answer falls out of a few simple rules:

  • There is only one topic at a time.

A series of nested topicalizers doesn't create a collection of topics. The interpreter doesn't have to sort through a complicated set of options to know what the current topic is. There will never be any ambiguity. A script or module may have a series of different topics, but only one at a time.

  • Topic obeys the lexical scope of topicalizers.

For the programmer, determining the current topic is never any more complex than tracing outward to the nearest topicalizer. The scope of the topic is always restricted to the scope of the topicalizer that created it. So the example above will print the inner topic, $quote.

An equivalent pair of nested topicalizers in Perl 5 would have printed the outer topic instead. That's because topicalizers never created a $_ alias at the same time as a named alias. It's a fairly common trick in Perl 5 to use a named alias with a for loop to avoid clobbering $_. That trick doesn't work anymore. Which brings us to the third rule:

  • To keep outer topics, use a named alias.

It's just the old trick inverted. Instead of using a named alias with the inner topicalizer to avoid clobbering the outer topic, it uses a named alias with the outer topicalizer to access it after the topic has changed.


        when /Kryten/ {
            for kryten_quotes() -> $quote {
                print $name;
                print;
            }

This really fits better with the way humans think. It makes more sense to give a name to the thing worth keeping than to give a name to the thing to be thrown away.

Methods have the same problem, but in their case it means that even though the unary dot is really handy in simple methods, in more complex methods with nested topicalizers it's better to use a named parameter for the invocant.


    method locate ($self: *@characters) {
        .cleanup_names(@characters);
        for @characters -> $name {
            .display_location($name);
        }
        .change_location('Holly');
    }

The .display_location method won't call a method on $self, it will try to call a method on $name, and fail (unless $name is an object with a .display_location method). The code will have to call the method as $self.display_location() instead. It would be clearer to add $self in front of the other method calls as well, but that's entirely a matter of style.

Multiple aliases

Another complication is that topicalizers aren't restricted to a single alias. A for loop might iterate through an array several parameters at a time:


    for @characters -> $role1, $role2, $role3 {
        ...
    }

iterate through several arrays, one after the other, taking a few parameters at a time:


    for @humans, @betelgeusians, @vogons -> $role1, $role2 {
        ...
    }

or iterate through several arrays at the same time:


    for @characters; @locations -> $name; $place {
        ...
    }

But no matter how complicated the code gets, the topic stays the same. The rules of the game are:

  • There is only one topic.

This rule should look familiar. It might even deserve to be called "the second law of topics". With nested topicalizers, the restriction means that two topics from different scopes will never be accessible at the same time. With multiple aliases, it means that while a topicalizer can create more than one alias, only one of the aliases can be the topic.

  • The topic is the first parameter.

This rule makes it easy to pick out the topic. In the first two examples, the topic is $role1, and in the third example it's $name.

There is one exception to the rule. The is topic property can select any parameter as topic in place of the default first parameter.


    for @characters -> $role1, $role2 is topic, $role3 {
        ...
    }

The Final Frontier

That's pretty much all there is to topic. Hopefully, this article has pushed one more thing into the "Gee, that's easy!" category. But, if not, carry away one idea: that first law, "Topic is $_". The next time conversation turns to Perl 6 and topic, that simple translation will make it understandable.

The Phrasebook Design Pattern

Have you ever written a Perl application that connects to a database? If so, then you probably faced a problem of having some code like:


 $statement = q(select isbn from book where title = 'Design Patterns');
 
 $sth = $dbh->prepare($statement) ; 
 $rc=$sth->execute();

Why is this "a problem"? It looks like nice code, doesn't it? Actually, if you look at the code above, you will see two languages: Perl and SQL. This makes the code not that readable. Besides, SQL statements that are only slightly different may appear in several places, and this will make it more difficult to maintain. In addition, suppose you have an SQL expert who should optimize your SQL calls. How do you let him work on the SQL; should he look through your Perl code for it? And what if that guy is so ignorant that he doesn't even know Perl?



We can solve these problems by writing:

 $statement = $sql->get("FIND_ISBN", { title => "Design Patterns" });
 $sth = $dbh->prepare($statement);
 $rc=$sth->execute();

We keep the actual SQL statements and their corresponding lookup keys in a separate place that we call a phrasebook. We can use XML to wrap the SQL and the lookup keys:


 ... 
 <phrase name="FIND_ISBN">
   select isbn from book where title = '$title' 
 </phrase> 
 ...

As we can see above, the SQL statement can hold placeholders, so we can use one generic SQL statement that can be used in different places.

Now, our code is more readable and maintainable. As we will see later, there are additional advantages for writing using the phrasebook design pattern - it helps when we port the code, or when we debug.

Nevertheless, is it a design pattern at all? We saw only one problem and one solution, right? So here are two more examples: When you need to generate an error code from your application, you might have in your code two languages: Perl and English. Moreover, suppose you want to have later the error code in other languages? The phrasebook design pattern suggests that you should separate the languages, and put the error messages in a phrasebook. It also guides you as to how to have the error messages in different languages.

Suppose you write a code generator that generates PostScript. In order to have readable code, it is better to separate the PostScript code from the rest of your code, and put it in a phrasebook.

If you are particularly interested, then you can find the original Phrasebook Design Pattern paper in http://jerry.cs.uiuc.edu/~plop/plop2k/proceedings/Pinchuk/Pinchuk.pdf.

Because our examples above were about writing applications that uses database, I would like to add a paragraph or two about persistency. Some readers might argue that instead of having the SQL code within our Perl code, or even within our phrasebook, we should create objects that take the responsibility for connecting to the database, and load or save themselves. This way, if we use those objects, we need no SQL in the rest of our code. There are two points to make in this context:

I would suggest the use of the phrasebook within the classes that implement those persistence objects. This way, the code of those classes will gain from the advantages that the phrasebook pattern offers.

Usually a persistence object represents one or more tables in the database. The idea is that we should not access those tables without using the object. Yet, in my experience, because of performance issues, you may need complex SQL statements that access tables belonging to more then one object. So the programmer might find himself writing an SQL statement in his main application anyway, instead of using the persistence objects.

Class::Phrasebook

You probably guessed already that the phrasebook class implements the phrasebook design pattern. Like the rest of the classes that are described in this paper, it is available for downloading from CPAN. Let us see how we use that class to generate some error codes in different languages (English and Dutch for example).

We begin with writing the phrasebook. The phrasebook is a simple XML file. It will look like:


 <?xml version="1.0" encoding="ISO-8859-1"?>
 <!DOCTYPE phrasebook [
 <!ELEMENT phrasebook (dictionary)*>
 <!ELEMENT dictionary (phrase)*>
 <!ATTLIST dictionary name CDATA #REQUIRED>
 <!ELEMENT phrase (#PCDATA)>
 <!ATTLIST phrase name CDATA #REQUIRED>
 ]>

 <phrasebook>
  <dictionary name="EN">
   <phrase name="MISUSE_OF_MANUAL_TEMPLATE_NAME">
     The name $name can be used only as manual template
   </phrase>
   ...
  </dictionary>
  <dictionary name="NL">
   <phrase name="MISUSE_OF_MANUAL_TEMPLATE_NAME">
    De naam $name mag enkle gebriuk worden als webboek template
   </phrase>
   ...
  </dictionary> 
 </phrasebook>

As we can see, the phrasebook file starts with the Document Type Definition (DTD). Don't panic - just copy it as is. It is used by the XML parser to validate the XML code. Then we open the definition of the phrasebook, and inside it one or more definitions of dictionaries. Each dictionary will hold the phrases. The first dictionary is taken as the default dictionary: if a phrase is missing in other dictionary, it will be taken from the first one. The phrases can hold placeholders. The placeholders look exactly like Perl scalar variables.

Now let's see how we get a phrase from the phrasebook:


 ...
 $msg = new Class::Phrasebook($log, "errors.xml");
 $msg->load($language);
 ...
 # check that the name of the document is not a manual_template name
 if (is_manual_template_name($template_name)) {
    my $message = $msg->get("MISUSE_OF_MANUAL_TEMPLATE_NAME",
                            { name => "$template_name"} ); 
    $log->write($message, 5);
    return 0;
 }

First, we create the Class::Phrasebook object. We supply to the constructor a Log::LogLite object and a name of the XML file that contains the phrasebook we want to use. We will discuss the Log::LogLite class later. For now, you should only know that this class provides with the ability to have log messages - so if, for example, our new Class::Phrasebook object fails to find the XML file, a log message will be written to a log file.

How the XML file will be found if we do not supply any path? The class automatically searches the following directories:

  • The current directory.
  • The directory ./lib in the current directory.
  • The directory ../lib in the current directory.
  • @INC.

This allows us to create an XML file that comes with a module in an easy way: We should place the XML file in a lib directory below the directory of our module. Therefore, if the module we write is Blah, then its XML file will be placed in the directory Blah/lib. Of course, it is a good idea to give the XML file a name that is similar to the module name - blah.xml, so it will be unique on the system (because "make install" will install the XML file next to the module file Blah.pm).

The next thing we do is to load a dictionary from the phrasebook file. The $language variable above will hold the dictionary name - in our example, it will be "EN" or "NL".

Finally, when we need to get the text for an error message, we call the get method of the Class::Phrasebook object with the key of the message we want to get, and a reference to a hash that holds the name-value pairs of the placeholders within the phrase.

Some careful readers might point out that if we use the Class::Phrasebook inside another class, then we might end up with a memory problem. Suppose many of the other class objects are constructed, and all of them load the same dictionary using the Class::Phrasebook objects. We will end up with many identical dictionaries loaded into the memory.

For example, assume we have object User that uses an object of Class::Phrasebook to generate error messages. When we construct the User object, we also construct a Class::Phrasebook object and load the dictionary for the error messages in the right language. Let's assume we have 100 User objects that are supposed to provide the error messages in English. It means that those 100 User objects will hold 100 Class::Phrasebook objects and each of them will hold its own copy of the English error dictionary. Terrible loss of memory!

Well, not exactly. The Class::Phrasebook keeps the loaded dictionaries in a cache that is actually class data. That means that all the objects of Class::Phrasebook have one cache like that. The cache is managed by the class, and knows to load only one dictionary of each sort. It will know also to delete that dictionary from the memory when there are no more objects that refer to it. Therefore, the careful readers should not worry about this issue any more.

However, continuing the example above, sometimes we might load the 100 User objects one after the other. Each time, that object will be destroyed. Other careful reader might point out that in this case the dictionary will go out of scope every time, and actually we will load the same dictionary 100 times!

Because of that, the class provides for that careful reader a class method that configures the class to keep the loaded dictionaries in the cache forever. This is not the default behavior, but when coding daemons, for example, it will be desirable.

Class::Phrasebook::SQL

The class Class::Phrasebook::SQL is a daughter class of Class::Phrasebook. It provides us with some extra features that helps us to deal with SQL phrases more easily. One example is the way it deals with the update SQL statement. Imagine that you have in your application a form that the user is supposed to fill in. If the user fills in only two fields, then you are supposed to update only those two fields in his record. Usually this problem is solved by having a simple code that builds our update SQL statement from its possible parts. However, this will not be that readable. The Class::Phrasebook::SQL will let you do it in the following way:

We will place as a phrase the update SQL statement with all the possible fields:


 <phrase name="UPDATE_T_ACCOUNT">
   update t_account set
         login = '$login',
         description = '$description',
         dates_id = $dates_id,
         groups = $groups,
         owners = $owners
      where id = $id
 </phrase>

The Class::Phrasebook::SQL will drop the "set" lines, where there is a placeholder that its value is undefined. As I wanted to avoid from real parsing of the update statements, the update statements must look as the example above - each "set" expression in different line. Now, if we call the get method as the following:


 $statement = $sql->get("UPDATE_T_ACCOUNT",
                        { description => "administrator of manuals",
                          id => 77 });

We will get the statement:


 update t_account set
       description = 'administrator of manuals'
    where id = 77

"Set" lines that contained undefined placeholders in the original update statement were removed.

Debugging With the Phrasebook Classes

Both the classes Class::Phrasebook and Class::Phrasebook::SQL provide us with debugging services. One example is the environment variable PHRASEBOOK_SQL_DEBUG_PRINTS. If this variable holds the value "TEXT" then it will print debug message each time the method C<get> is called:

 path = Oefeningen/logoklad.source
 [DBOrderedTreeUI.pm:322-->DBOrderedTreeUI::show_list > DBOrderedTreeUI.
 pm:4134-->Manual::fill_node_info_container_from_list > Manual.pm:2885--
 >Document::load > /htdocs/html/projects/webiso/code/classes/Document.pm
 :403-->Revisions::load > /htdocs/html/projects/webiso/code/classes/Revi
 sions.pm:114-->Revision::load: ][GET_LAST_REVISION]

        select path, major, minor, date, user_id,
               state_id, md5, data_md5, is_patch, is_changed 
            from revision 
            where path = 'Oefeningen/logoklad.source'
              and is_patch = 0

If the value of the environment is "COLOR", then the output would be with colors. The colors come from the Term::ANSIColor module. If the value is "HTML" then the output would be HTML code that generates a similar colorful representation.

 path = Oefeningen/logoklad.source
 [DBOrderedTreeUI.pm:322-->DBOrderedTreeUI::show_list > DBOrderedTreeUI 
 .pm:4134-->Manual::fill_node_info_container_from_list > Manual.pm:2885
 -->Document::load > /htdocs/html/projects/webiso/code/classes/Document
 .pm:403-->Revisions::load > /htdocs/html/projects/webiso/code/classes/
 Revisions.pm:114-->Revision::load: ][GET_LAST_REVISION]
        select path, major, minor, date, user_id,
               state_id, md5, data_md5, is_patch, is_changed 
            from revision 
            where path = 'Oefeningen/logoklad.source'
              and is_patch = 0
  

Imagine that you need to see what are the SQL statements that are generated from a certain piece of code. Setting the PHRASEBOOK_SQL_DEBUG_PRINTS environment variable within that code will do the trick in no time. This feature can help not only in debugging, but also in the optimization of SQL code.

A similar environment variable is the PHRASEBOOK_SQL_SAVE_STATEMENTS_FILE_PATH. When this environment is set to a certain file path, all the SQL statements that are generated by calling to the get method will be written to that file. That way, you can see later which SQL statements your application issued, and even to re-run them from that file.

Actually, this is the way I found a bug in one of the former versions of the database PostgreSQL. I noticed that under certain conditions, some select statements fail while they are not supposed to fail. As usual, the "certain conditions" were totally unknown. What I did was to run my application while having the environment PHRASEBOOK_SQL_SAVE_STATEMENTS_FILE_PATH set. When the bug happened, I took the file with all the SQL statements that I got and ran the SQL statements directly from it. Then, I started to clean it until I had only few SQL statements left, and still the bug happened when they were run. I sent those statements with my report of the bug and got within two hours a solution for the problem (Well - you know, when you use open source software, you get support - and the PostgreSQL team is very responsive).

Log::LogLite and Log::NullLogLite

The Log::LogLite and the Log::NullLogLite classes provide us with an excellent opportunity to introduce a beautiful design pattern called the Null Object Design Pattern.

The Log::LogLite is a simple class that let us create simple log files in our application. The synopsis from the manual page of the class gives a good overview of how to use the class:


 use Log::LogLite;
 my $LOG_DIRECTORY = "/where/ever/our/log/file/should/be";
 my $ERROR_LOG_LEVEL = 6;
 # create new Log::LogLite object
 my $log = new Log::LogLite($LOG_DIRECTORY."/error.log",
                            $ERROR_LOG_LEVEL); 
 ...
 # we had an error
 $log->write("Could not open the file ".$file_name.": $!", 4);

As we saw Class::Phrasebook demands the use of a Log::LogLite object. This allows Class::Phrasebook to generate log messages when errors occur - for example, when the parsing of the XML file fails. However, it might be that we do not want to have a log file for every use of the Class::Phrasebook. How can we avoid from having a log file without changing the code of Class::Phrasebook?

The solution for that problem comes from the beautiful Null Object Design Pattern by Bobby Woolf. The pattern guides us to inherit from our class a null class - a class that does nothing, but implement the same interface of the original class. In our example, Log::NullLogLite overrides some of the methods of Log::LogLite to do nothing. So when we call the method write, nothing is written. Because the class inherits from Log::LogLite, the classes that use Log::LogLite continue to run correctly also when we send to them Log::NullLogLite objects.

Conclusion

The phrasebook design pattern helps us to get more readable and maintainable code by separating expressions in one language from the main code that is written in other language. The Class::Phrasebook module helps us to implement this pattern in Perl.

The classes that are presented above have been used by my colleagues and me for few years. During that time, we could see in practice all the advantages that the pattern promises. For example, an application of 65,000 lines had to be ported to another database. When the SQL code is concentrated in XML files, we could achieve that kind of port very rapidly.

In my opinion, Perl gives us a nice platform to program in object-oriented techniques. However, I am not sure that this opinion is well-spread. Many programmers stop learning when things they write start running, and with Perl things start to run very soon. Nevertheless, with good design, big and sophisticated applications can be written with Perl, like with other OO languages.

Thanks

Many thanks to Ockham Technology N.V. for letting me release the above modules and others as open source on CPAN.

This week on Perl 6 (10/7-14, 2002)

This is yet another Perl 6 summary documenting what has happened over on the perl6-internals (where Parrot, the virtual machine that will run Perl 6 is discussed) and perl6-language (where Perl 6 language design is discussed) mailing lists. Piers is still on vacation (bungee jumping and motocrossing, no doubt), so I'm still your host this week. A fairly quiet week, so let's start with the perl6-internals list as usual.

The Pumpking Is Dead, Long Live the Pumpking!

I am happy to report that we have a new Parrot pumpking. Jeff Goff has done great work in the past, but taking over is Steve Fink, who has been active in Parrot since near the beginning. He's been extremely active this week, participating in almost all of the discussions and accepting patches left, right and center.

http://groups.google.com/groups

Variables Have Three Parts

Dan Sugalski decreed that vtables were about to get more complicated. Variables and values used to be simple, but now we need to have three parts for each ``thing:'' an optional name, a container and the contents of the container. There wasn't any discussion, but I expect Dan will rejig vtables some more.

http://groups.google.com/groups

Line Number Metadata

Juergen Boemmels explained that the line number information given by the setline opcode was quite verbose in the source and suggested adding line number metadata into Parrot bytecode. He proposed using the DWARF-2 debugging format (as used by the Mono project) so as not to reinvent the wheel. Dan promised some specs for moving this information out of band.

This thread quickly got out of hand, with Nicholas Clark noticing that having column number information magically built in full debugging support for Befunge (a two-dimensional language) and Sean O'Rouke wishing to make ``source position'' a vector, thus generalizing to scripting languages of any dimension.

http://groups.google.com/groups

New Array Base

Leopold Toetsch continued in his attempt to confuse the summarizer with thousands of patches. He had rewritten the base routines from the array PMC as a working engine for list operations. It should be fast and simple, being based on chunks with fast index get and set. He committed this as list.c, and commented that most of the other array-style PMCs will start to use it as a base, and that it may replace the intlist PMC (and other typed array PMCs).

http://groups.google.com/groups

Parrot_sprintf

Continuing last week's ``sprintf in Parrot'' mention, Brent Dax committed a huge patch completing the feature set of Parrot_sprintf, including width and precision for ints and strings, and modified many little bits of code to use it.

Inspiration then struck him, or rather, vtables did. He's rewritten it to use vtables, and split some of the code out of misc.c and into the new spf_render.c and spf_vtable.c, which managed to turn into another huge patch. Looks like there is a portability issue on PPC systems with va_copy however.

http://bugs6.perl.org/rt2/Ticket/Display.html

http://bugs6.perl.org/rt2/Ticket/Display.html

Nuke dem opcodes

Simon Glover proposed a patch to get rid of the 2-element ``ne'' opcode, which at a first glance should be optimizable at compiler time and hence should not be in Parrot -- barring complicated number precision issues. After a little discussion, Nicholas Clark pointed out that maybe we should do as C99 and state that constant folding will be done at compile time and at the precision of the compiling Parrot. Some of the opcodes where nuked, but it's important to keep some opcodes just in case of overloading.

http://groups.google.com/groups

Getting Started Guide

Cast your minds back, dear readers, if you will, to last week, when Erik Lechak proposed writing a getting-started guide. Well, he did just that, starting from the beginning with the configure system and then all the way out. There were many comments and suggestions, and it would be great to see this as POD and in the repository soon. Unfortunately, it is not in the archive.

Larry Explains All

Perl6-language had very few new threads this week. Instead, there were mostly little updates to previous threads, which makes it somewhat tricky to summarize. However, Larry Wall was everywhere this week, giving us detailed insights into the Perl 6 language.

Larry clarified that to remove ambiguity, variable properties will surrounded in brackets and have repetitions of ``is'':

  # instead of this:
  my $foo is rw cmp "";
  # we would have:
  my ($foo is foo is bar, $bar is baz) = (1,2);

Additionally, it looked like class attributes had changed name from attr to has.

There was also some talk on the module versioning system, which could be done with a slice-like notation:

  use Acme[1.0];                  # like so
  use Acme[ (1;17..) | (2;0..) ]; # or perhaps
  use Acme[1;17..] | Acme[2;0..]; # or even

I'm pretty sure he was joking, but Larry considered alternatives to the ..., which issues a warning (or maybe an exception) if you try to execute it. ??? would never complain and !!! would always throw an exception. Or the other way around. Or one is fatal. Or more likely, stick with ... and make its behaviour pragmatically controllable. ... is useful for abstract method declaration.

There definitely wasn't any talk about &&& and |||.

Perl6 OO Cookbook

Michael Lazzaro made good on his promise last week and produced a comprehensive Perl 6 OO cookbook describing ``stuff that hasn't yet been designed, for a language that doesn't yet exist.'' It is a great piece of work and tries to examine real-life Perl 6 examples.

There was some discussion of the recipes and the Michael announced that he wanted to work on an online system for adding new data and many other changes. Worried about Perl 6 OO? Then check this out:

http://cog.cognitivity.com/perl6/

In Brief

Some of the Parrot tests still don't work under Win32.

Some bugs in various bits of the Parrot JITs were found, and some fixed.

Dakkar found a bug in the Perl 6 compiler that basically boiled down to checking for truth instead of definedness. Hopefully, Perl 6 will remove this particular problem for us ;-)

We probably need more tinderboxes.

Brent Dax promises to fit a pony into his next patch.

Simon Glover added quite a few tests and pieces of documentation.

C structs need to be padded for the more exotic architectures and compilers.

There are still some DOD / GC bugs.

Who's Who in Perl6

Once more we get to meet people involved in the development of Perl 6.

Who are you?
Jérôme Quelin.
What do you do for/with Perl 6?
I wrote a Befunge interpreter in Parrot assembly. Now I'm waiting for Parrot to handle multi-arrays and objects in order to implement the Befunge-98 specs.
Where are you coming from?
Some Perl Golf rumors have said that I'm an alien coming from Mars.
When do you think Perl 6 will be released?
Some day.
Why are you doing this?
In order to have a chance to understand Perl 6 sources. And because Befunge is a fun language, that deserves to be supported by Parrot.
You have five words. Describe yourself.
Perl Golf and Befunge addicted.
Do you have anything to declare?
Befunge rocks.

Acknowledgements

This summary was brought to you with slightly less distraction from Super Mario Sunshine and more recognition of the sterling work that Piers does every week.

As Piers says: One more, if you think this summary has value send money to the Perl Foundation http://donate.perl-foundation.org and feed back and/or T?iBooks to me, mailto:pdcawley@bofh.org.uk. As usual, the fee paid for publication of this summary on perl.com has been donated directly to the Perl Foundation.

Radiator

Are you fed up with those who think that commercial applications need to be written in an "enterprise" language such as Java or C++? While we're big fans of open source at Perl.com, we're bigger fans of Perl, and we're frustrated when people claim there's something that Perl can't do; so we spoke to Mike McCauley at Open System Consultants.

Open Systems produces Radiator, a commercial RADIUS server implementation, first released four years ago and with new versions and enhancements continually developed. It has about 5,000 paying installations worldwide. Mike explains: "It's used to authenticate dialup and wireless access to an ISP or corporate network. In order to authenticate users, it can look up user details in a wide range of data sources, such as SQL, LDAP, flat files, Unix password files, OPIE, PAM, Windows NT SAM, Active Directory, third-party billing packages, and so on."

Mike was particularly impressed with Perl's write-once-run-anywhere nature, which has completely obviated the need for porting or platform-specific alteration. As he says, "The finished product runs without change on almost every platform known to humanity. That means we can appeal to more potential customers, with less effort spent on porting and maintenance."

But there were other reasons for choosing Perl as an implementation language: "The richness of Perl allows us to express complicated algorithms quickly and concisely, and know they will work wherever the customer wants to run the code; and the connect easily to lots of different data sources." Perl's reusability and the great library of code already available also played its part. "The modules from CPAN mean we can talk to servers like SQL, LDAP, NISPLUS, Active Directory, OpenSSL, and lots of other things. We can concentrate on writing the product, rather than coding and maintaining interfaces."

We asked about the performance of going with Perl instead of, for example, C, but this seemed not to be the problem that many people might expect. "Most authentications rely on some external server or system, such as an LDAP or SQL server, and so the speed determining step is usually that extenal system. Also, the interfaces to those systems are generally compiled Perl modules written in C. And where there is no external server, we can use clever hashing mechanisms to make lookups faster than some C based Radius servers. That means that Radiator can be up there with a server written in C, and with much more features and flexibility."

Mike also explained that good Perl programming practice can keep performance high: "In order to get the best performance, you have to code so that you use as much of Perl's internal lovingly hand-wrought C code as possible. That means using complex operators like map, grep, hashes etc, to do the maximum of work with as few Perl operators as possible. You have to balance that against readability though, otherwise you can end up with unmaintainable code that looks like line noise. Radiator makes heavy use of Perl's Object Oriented support, which costs in performance, but we think the benefits in maintainability and easy extensibility (for us and the customer) are worth it."

One common concern with businesses releasing Perl products is that they're worried about piracy; if the source code is visible by anyone, isn't it easy for people to run away with it? The evaluation version of Radiator is shipped with a small portion of the code encrypted, and only made available to bona fide equirers. However, the full product is shipped completely unencrypted. Mike flips over the concern and sees the advantages. "Most network operators really like the idea of a product with full source code: They can be sure that the product does what it claims, and they can change or enhance it if necessary. ... We like to offer full source code, but we also need to be paid for our hard work, and partial encryption of demos seems to be a good compromise that results in most of our demos turning into sales."

What, then, about people making customizations or passing on copies to their friends? "Actually, we don't mind if customers change their code to suit their own needs. But we don't like it if they give the code to someone else, so that we get to support them without being paid. There is a little bit of that goes on, but ours is not really a mass-market consumer product, and I think that most people realise they get more benefit from an honest relationship with us. In the end, a license is not very expensive, especially since the support is so good."

Finally, we asked Mike to sum up his thoughts on commercial development with Perl. "Technically, I think it is unsurpassed for almost any application. In Perl, I can be five to 10 times more productive, line-for-line than in C or C++ (and I'm no slouch at them, either). For software vendors, that means a more maintainable product delivered to market faster. For customers it means a better product for less money. And interoperability and portability is fantastic. As you can tell, we really like it!"

"Commercially, a qualified yes, provided you have control over licensing and distribution issues, which might be hard in anything other than a niche market. You can't keep writing software unless you get to pay the mortgage and feed the kids, too."

Mike McCauley is chief software developer at Open System Consultants in Melbourne, Australia. He has a bachelor of engineering from the University of Queensland, and has worked in the computer software industry for 20 years. When he is not writing software, he flies light planes and has fun with his family.

A Review of Komodo

Every time I get a new copy of ActiveState's Komodo IDE, I do a review that invariably ends "this would be the perfect IDE for me if I were the sort of person who used IDEs". And every time I get the next release, I get closer to being persuaded I should be using IDEs. With Komodo 2.0, ActiveState is getting very, very close to persuading me - but it's not there yet. Let's see what it got right and got wrong this time.

From a Perl point of view, the syntax highlighting and background syntax checking has been greatly improved. Now, every time someone claims to do decent Perl syntax highlighting, there'll be at least one annoying pedant who crawls out of the woodwork with some pathological case and the oft-quoted "only perl can parse Perl!". Well, so what? For something that's impossible, Komodo does a surprisingly fine job. Combining the background syntax checking with use strict catches a reassuring amount of coding errors almost instantly.

My old favourite, Rx, is still there, and is still a great way to visualize what's going on in complicated regular expressions. One problem I found was that the Rx window changed size awkwardly while stepping through the regex match - but maybe this is a good time to mention that I'm working from a beta version of Komodo 2.0, and many of my gripes may be fixed by the time 2.0 is released for real.

There are a bunch of little things too that one should come to expect from this caliber of IDE. Integration with CVS was touted as being a feature, but I couldn't find any mention of it, although the FTP "remote editing" feature works well; combined with the CGI features of Komodo, this could be a fantastic workstation for the Windows user deploying CGI files on a remote server.

The Web services integration is a nice touch, especially with its automated installer, but it would benefit from some kind of progress indicator. Integration with the ASPN Cookbooks is a nice idea, but I don't know how much use it would see in day-to-day coding. The "external debugger" feature sounds interesting, but I couldn't find any documentation on it.

If I had to point to something that really impressed me about Komodo 2.0, then it would be the toolbox; this is actually something that I've been looking for in my own development environment and could be the one feature that persuades me to use Komodo. The toolbox is, simply put, somewhere to collect stuff that might be useful during your development. That's to say, snippets of code that you can drop into place, commands that you can execute, URLs to start a browser at, and so on. For example, you can highlight a function in Perl, and have a button in the toolbox run perldoc -f on it. You can drag a piece of boilerplate code directly into your file. It's just a nice, fast way of working, making the repeated things easy - precisely what a GUI should be doing. This is combined with a macro recording facility to make it easy for you to create your own toolbox items.

It's worth remembering that Komodo is not just about Perl. Komodo's XML mode is absolutely wonderful; it offers per-tag folding, tag completion on insertion, and XSLT debugging. Since I write all of my books in XML these days, I'm seriously considering using Komodo instead of jEdit and even emacs's psgml mode for my writing work; the only slight snag is that Komodo's tag insertion works by divining the acceptable XML tags from the structure of the documentation, instead of loading a DTD if one is explicitly supplied. However, perhaps this is unfair - Komodo doesn't claim to be an XML editor, just an XSLT editor, and the fact that it makes quite a good XML editor is a surprising bonus.

So why won't I switch to using Komodo? Two reasons: First, it's not out on my favorite OS, Apple's OS X - although I'm told that a port is in progress. Unfortunately, when that happens, Komodo will be competing with Apple's own Project Builder.

Second, speed. If I'm going to be giving up vi and emacs, then I'm going to want something that can keep up with them in terms of speed and keyboard response times. Komodo, even running a fairly beefy box, was slowing down my coding. No amount of cool syntax highlighting is worth that.

As with the last time I reviewed Komodo, I'd heartily recommend it for those who enjoy programming with IDEs, but reluctantly say that it's not for me. But it's very, very close.

This week on Perl 6 (9/30 - 10/6, 2002)

This is yet another Perl 6 summary, documenting what has happened over on the perl6-internals (where Parrot, the virtual machine that will run Perl 6, is discussed) and perl6-language (where Perl 6 language design is discussed) mailing lists. Piers is off on holiday (snowboarding and parachuting, no doubt), so I will be your host for the next two weeks. A particularly average week, so let's start off with the perl6-internals list as usual.

Getting started guide

Erik Lechak offered to help write a getting started guide to Parrot - he thought (correctly) that there may be "a pressing need for a document helping newbie developers figure out how to get started".

Unfortunately, Erik isn't a big fan of POD and attempted to code up a better replacement. While POD may not be perfect, an awful lot of Parrot developers are also Perl developers and POD is simple and good enough for Parrot for now.

http://groups.google.com/groups

Debugger debugged

Juergen Boemmels noticed that hitting the Enter key in the Parrot debugger (pdb) caused a segfault and that this wasn't quite ideal behaviour. He provided a patch which ignored empty lines, and after a nudge from Aldo Calpini provided another patch which followed the suggested behaviour in docs/debugger.pod, instead repeating the last command entered.

http://groups.google.com/groups

Patch Master

The main patcher these recent weeks has been Leopold Toetsch, and he kept it up this week as well. He provided patches to stop Parrot permanently allocating increasing amounts of memory, fix a parser error in imcc, add stats to life.p6 and more. In fact, so many patches that I have a feeling he'll get CVS write access soon ;-)

New allocator

Leopold had a play with plugging in the Doug Lea memory allocator, providing a patch to configure in support for the current Parrot malloc, the Doug Lea malloc, or ptmalloc in libc. He also attempted to plug in Perl 5's malloc and found similar speed results. Later on, he created a document to explain memory internals, continuing the recent trend for more docs.

http://gee.cs.oswego.edu/dl/html/malloc.html

http://rt.perl.org/rt2/Ticket/Display.html

http://rt.perl.org/rt2/Ticket/Display.html

Patches, patches everywhere

There have been a lot of pending patches recently, and Robert Spier prodded us about them and hopes to have an automated weekly email nudge too.

http://www.parrotcode.org/openpatches

Library name collisions

Steve Fink reported a problem with library name collisions. For example, having intlist.c as well as classes/intlist.c and key.c as well as classes/key.c. Dan proposed a naming convention where classes start with a CL_prefix, encodings with an EN_ prefix and character set stuff with a CS_prefix, although it looks like a simpler solution may be taken.

http://groups.google.com/groups

core.ops ate my (miniscule) RAM

David Chan is using an itty bitty box (Cyrix with 32 MB of RAM) and found that compiling Parrot failed due to lack of memory. Nicholas Clark reminded us that you can disable the (faster, but needs more memory to compile) computed goto core at configure time using ./Configure.pl --cgoto=0.

http://groups.google.com/groups

Parrot file list

There are a great number of source files in the Parrot distribution and Mark Sparshatt gathered ideas for how to have a file list which explained what each actually does.

http://groups.google.com/groups

Interfaces

Over in perl6-language, Michael Schwern was asked at a recent Java conference whether Perl 6 will support interfaces. He tried to describe a likely interface and raised some problems with interfaces. This spawned a huge thread about OO, interfaces, and some of the weirder OO languages. Do we enforce strict interfaces? Are some methods allowed to be optional?

One of the more interesting ideas would be to move from just using prototypes to using Design-By-Contract features: pre- and post-conditions and invariants. Dave Whipp seemed a big fan of the Eiffel model, where it is possible to rename methods in a derived class. There was lots of scary Perl 6 pseudocode including cars, MP3 players, birds, and wings.

A little later, Schwern came back and worried about enforcing interfaces on subclasses. One interesting approach is to follow Eiffel and only allow subclasses to weaken the preconditions or strengthen the postconditions of the parent. Weakening involves adding ORs, strengthening ANDs. (More Foo and Bar pseudocode. Oh, and apparently Schwern only has 2 toes).

There was a short sidetrack on having multiple versions of modules running at the same time, and Allison Randal informed us that it looks like the full name of classes will include their version number, ie Acme::N-1.1.

In fact, a lot of the discussion is a bit in the air: as Dan Sugalski pointed out, things like object attributes aren't until Apocalypse 12 and it may be a little early to worry about such things.

http://groups.google.com/groups

http://groups.google.com/groups

Subject-Oriented Programming

Schwern also asked about subject-oriented programming, which looked interesting but which he couldn't quite understand. Andy Wardley explained that all these "advanced" programming techniques are all attempting a clear separation of concerns, and went on to describe and give pointers to more info.

http://groups.google.com/groups

Matching

Someone mysteriously known only as "Ed" [ Ed Pescho - Ed.] asked what the favoured syntax would be to match negative multi-byte strings in Perl 6. It wasn't entirely clear what the question was, but one thing is sure: the Perl 6 pattern matching engine will have a lot of scope for optimisation.

http://groups.google.com/groups

In brief

Tim Bunce asked if there were any good tools for static code analysis around. None, apparently.

Brent Dax reminded Andy Dougherty and others about /Parrot_v?sn?printf(_[sc])?/ in misc.c - a reimplementation of the printf functions for portability reasons.

Simon Glover added tests for the assign opcode.

It looks like Michael Lazzaro will be writing a list of issues with OO as well as a tutorial so that everyone is clear what exactly we are talking about.

Larry cringes every time someone says "Parens construct lists in Perl 6".

Who's Who in Perl6

Once more we get to meet people involved in the development of Perl 6.

Who are you?
Simon Cozens
What do you do for/with Perl 6?
I was the Parrot pumpking up until 0.0.4, and wrote much of the PMC and string infrastructure. I then escaped to get a life, play more Go and be a nicer person.

As perl.com editor and occasional author, I pop in from time to time just to check on the correctness of things I'm writing about and make sure Parrot compiles on at least one platform I possess so I can test code for articles. Rumours of my return to development have been greatly exaggerated. :)

Where are you coming from?
I'm not, I'm already here!
When do you think Perl 6 will be released?
I don't think Perl 6 - as we understand it - ever will be released. How's that for a Delphic pronouncement?
Why are you doing this?
Because I like food.
You have 5 words. Describe yourself.
Less vehemently obnoxious than before.
Do you have anything to declare?
Watch out for some surprises in the near future.

Acknowledgements

This summary was brought to you with much distraction indeed from Super Mario Sunshine. Thanks to Kate Pugh and #london.pm for proofreading.

As Piers says: One more, if you think this summary has value send money to the Perl Foundation http://donate.perl-foundation.org and feed back and/or T?iBooks to me, mailto:pdcawley@bofh.org.uk. As usual, the fee paid for publication of this summary on perl.com has been donated directly to the Perl Foundation.

How Hashes Really Work

It's easy to take hashes for granted in Perl. They are simple, fast, and they usually "just work," so people never need to know or care about how they are implemented. Sometimes, though, it's interesting and rewarding to look at familiar tools in a different light. This article follows the development of a simple hash class in Perl in an attempt to find out how hashes really work.

A hash is an unordered collection of values, each of which is identified by a unique key. A value can be retrieved by its key, and one can add to or delete from the collection. A data structure with these properties is called a dictionary, and some of the many ways to implement them are outlined below.

Many objects are naturally identified by unique keys (like login names), and it is convenient to use a dictionary to address them in this manner. Programs use their dictionaries in different ways. A compiler's symbol table (which records the names of functions and variables encountered during compilation) might hold a few hundred names that are looked up repeatedly (since names usually occur many times in a section of code). Another program might need to store 64-bit integers as keys, or search through several thousands of filenames.

How can we build a generally useful dictionary?

Implementing Dictionaries

One simple way to implement a dictionary is to use a linked list of keys and values (that is, a list where each element contains a key and the corresponding value). To find a particular value, one would need to scan the list sequentially, comparing the desired key with each key in turn until a match is found, or we reach the end of the list.

This approach becomes progressively slower as more values are added to the dictionary, because the average number of elements we need to scan to find a match keeps increasing. We would discover that a key was not in the dictionary only after scanning every element in it. We could make things faster by performing binary searches on a sorted array of keys instead of using a linked list, but performance would still degrade as the dictionary grew larger.

If we could transform every possible key into a unique array index (for example, by turning the string "red" into the index 14328.), then we could store each value in a corresponding array entry. All searches, insertions and deletions could then be performed with a single array lookup, irrespective of the number of keys. But although this strategy is simple and fast, it has many disadvantages and is not always useful.

For one thing, calculating an index must be fast, and independent of the size of the dictionary (or we would lose all that we gained by not using a linked list). Unless the keys are already unique integers, however, it isn't always easy to quickly convert them into array indexes (especially when the set of possible keys is not known in advance, which is common). Furthermore, the number of keys actually stored in the dictionary is usually minute in comparison to the total number of possible keys, so allocating an array that could hold everything is wasteful.

For example, although a typical symbol table could contain a few hundred entries, there are about 50 billion alphanumeric names with six or fewer characters. Memory may be cheap enough for an occasional million-element array, but 50 billion elements (of which most remain unused) is still definitely overkill.

(Of course, there are many different ways to implement dictionaries. For example, red-black trees provide different guarantees about expected and worst-case running times, that are most appropriate for certain kinds of applications. This article does not discuss these possibilities further, but future articles may explore them in more detail.)

What we need is a practical compromise between speed and memory usage; a dictionary whose memory usage is proportional to the number of values it contains, but whose performance doesn't become progressively worse as it grows larger.

Hashes represent just such a compromise.

Hashes

Hashes are arrays (entries in it are called slots or buckets), but they do not require that every possible key correspond directly to a unique entry. Instead, a function (called a hashing function) is used to calculate the index corresponding to a particular key. This index doesn't have to be unique, i.e., the function may return the same hash value for two or more keys. (We disregard this possibility for a while, but return to it later, since it is of great importance.)

We can now look up a value by computing the hash of its key, and looking at the corresponding bucket in the array. As long as the running time of our hashing function is independent of the number of keys, we can always perform dictionary operations in constant time. Since hashing functions make no uniqueness guarantees, however, we need some way to to resolve collisions (i.e., the hashed value of a key pointing to an occupied bucket).

The simple way to resolve collisions is to avoid storing keys and values directly in buckets, and to use per-bucket linked lists instead. To find a particular value, its key is hashed to find the index of a bucket, and the linked list is scanned to find the exact key. The lists are known as chains, and this technique is called chaining.

(There are other ways to handle collisions, e.g. via open addressing, in which colliding keys are stored in the first unoccupied slot whose index can be recursively derived from that of an occupied one. One consequence is that the hash can contain only as many values as it has buckets. This technique is not discussed here, but references to relevant material are included below.)

Hashing Functions

Since chaining repeatedly performs linear searches through linked lists, it is important that the chains always remain short (that is, the number of collisions remains low). A good hashing function would ensure that it distributed keys uniformly into the available buckets, thus reducing the probability of collisions.

In principle, a hashing function returns an array index directly; in practice, it is common to use its (arbitrary) return value modulo the number of buckets as the actual index. (Using a prime number of buckets that is not too close to a power of two tends to produce a sufficiently uniform key distribution.)

Another way to keep chains remain short is to use a technique known as dynamic hashing: adding more buckets when the existing buckets are all used (i.e., when collisions become inevitable), and using a new hashing function that distributes keys uniformly into all of the buckets (it is usually possible to use the same hashing function, but compute indexes modulo the new number of buckets). We also need to re-distribute keys, since the corresponding indices will be different with the new hashing function.

Here's the hashing function used in Perl 5.005:


  # Return the hashed value of a string: $hash = perlhash("key")
  # (Defined by the PERL_HASH macro in hv.h)
  sub perlhash
  {
      $hash = 0;
      foreach (split //, shift) {
          $hash = $hash*33 + ord($_);
      }
      return $hash;
  }

More recent versions use a function designed by Bob Jenkins, and his Web page (listed below) does an excellent job of explaining how it and other hashing functions work.

Representing Hashes in Perl

We can represent a hash as an array of buckets, where each bucket is an array of [$key, $value] pairs (there's no particular need for chains to be linked lists; arrays are more convenient). As an exercise, let us add each of the keys in %example below into three empty buckets.


  %example = (
      ab => "foo", cd => "bar",
      ef => "baz", gh => "quux"
  );
  
  @buckets = ( [],[],[] );
  
  while (($k, $v) = each(%example)) {
      $hash  = perlhash($k);
      $chain = $buckets[ $hash % @buckets ];
	  
      $entry = [ $k, $v ];
      push @$chain, $entry;
  }

We end up with the following structure (you may want to verify that the keys are correctly hashed and distributed), in which we can identify any key-value pair in the hash with one index into the array of buckets and a second index into the entries therein. Another index serves to access either the key or the value.


  @buckets = (
      [ [ "ef", "baz" ]                   ],    # Bucket 0: 1 entry
      [ [ "cd", "bar" ]                   ],    # Bucket 1: 1 entry
      [ [ "ab", "foo" ], [ "gh", "quux" ] ],    # Bucket 2: 2 entries
  );
  $key = $buckets[2][1][0];   # $key = "gh"
  $val = $buckets[2][1][1];   # $val = $hash{$key}
  $buckets[0][0][1] = "zab";  # $hash{ef} = "zab"

Building Toy Hashes

In this section, we'll use the representation discussed above to write a tied hash class that emulates the behavior of real Perl hashes. For the sake of brevity, the code doesn't check for erroneous input. My comments also gloss over details that aren't directly relevant to hashing, so you may want to have a copy of perltie handy to fill in blanks.

(All of the code in the class is available at the URL mentioned below.)

We begin by writing a tied hash constructor that creates an empty hash, and another function to empty an existing hash.


  package Hash;
  
  # We'll reuse the perlhash() function presented previously.
  
  # Create a tied hash. (Analogous to newHV in hv.c)
  sub TIEHASH
  {
      $h = {
          keys    => 0,                         # Number of keys
          buckets => [ [],[],[],[],[],[],[] ],  # Seven empty buckets
          current => [ undef, undef ]           # Current iterator entry
      };                                        # (Explained below)
      return bless $h, shift;
  }
  
  # Empty an existing hash. (See hv.c:hv_clear)
  sub CLEAR
  {
      ($h) = @_;
	  
         $h->{keys}      = 0;
      @{ $h->{buckets} } = ([],[],[],[],[],[],[]);
      @{ $h->{current} } = (undef, undef);
  }

For convenience, we also write a function that looks up a given key in a hash and returns the indices of its bucket and the correct entry within. Both indexes are undefined if the key is not found in the hash.


  # Look up a specified key in a hash.
  sub lookup
  {
      ($h, $key) = @_;
	  
      $buckets = $h->{buckets};
      $bucket  = perlhash($key) % @$buckets;
	  
      $entries = @{ $buckets->[$bucket] };
      if ($entries > 0) {
          # Look for the correct entry inside the bucket.
          $entry = 0;
          while ($buckets->[$bucket][$entry][0] ne $key) {
              if (++$entry == $entries) {
                  # None of the entries in the bucket matched.
                  $bucket = $entry = undef;
                  last;
              }
          }
      }
      else {
          # The relevant bucket was empty, so the key doesn't exist.
          $bucket = $entry = undef;
      }
	  
      return ($bucket, $entry);
  }

The lookup function makes it easy to write EXISTS, FETCH, and DELETE methods for our class:


  # Check whether a key exists in a hash. (See hv.c:hv_exists)
  sub EXISTS
  {
      ($h, $key) = @_;
      ($bucket, $entry) = lookup($h, $key);
	  
      # If $bucket is undefined, the key doesn't exist.
      return defined $bucket;
  }
  
  # Retrieve the value associated with a key. (See hv.c:hv_fetch)
  sub FETCH
  {
      ($h, $key) = @_;
	  
      $buckets = $h->{buckets};
      ($bucket, $entry) = lookup($h, $key);
	  
      if (defined $bucket) {
          return $buckets->[$bucket][$entry][1];
      }
      else {
          return undef;
      }
  }
  
  # Delete a key-value pair from a hash. (See hv.c:hv_delete)
  sub DELETE
  {
      ($h, $key) = @_;
	  
      $buckets = $h->{buckets};
      ($bucket, $entry) = lookup($h, $key);
	  
      if (defined $bucket) {
          # Remove the entry from the bucket, and return its value.
          $entry = splice(@{ $buckets->[$bucket] }, $entry, 1);
          return $entry->[1];
      }
      else {
          return undef;
      }
  }

STORE is a little more complex. It must either update the value of an existing key (which is just an assignment), or add an entirely new entry (by pushing an arrayref into a suitable bucket). In the latter case, if the number of keys exceeds the number of buckets, then we create more buckets and redistribute existing keys (under the assumption that the hash will grow further; this is how we implement dynamic hashing).


  # Store a key-value pair in a hash. (See hv.c:hv_store)
  sub STORE
  {
      ($h, $key, $val) = @_;
  
      $buckets = $h->{buckets};
      ($bucket, $entry) = lookup($h, $key);
  
      if (defined $bucket) {
          $buckets->[$bucket][$entry][1] = $val;
      }
      else {
          $h->{keys}++;
          $bucket = perlhash($key) % @$buckets;
          push @{ $buckets->[$bucket] }, [ $key, $val ];
  
          # Expand the hash if all the buckets are full. (See hv.c:S_hsplit)
          if ($h->{keys} > @$buckets) {
              # We just double the number of buckets, as Perl itself does
              # (and disregard the number becoming non-prime).
              $newbuckets = [];
              push(@$newbuckets, []) for 1..2*@$buckets;
  
              # Redistribute keys
              foreach $entry (map {@$_} @$buckets) {
                  $bucket = perlhash($entry->[0]) % @$newbuckets;
                  push @{$newbuckets->[$bucket]}, $entry;
              }
              $h->{buckets} = $newbuckets;
          }
      }
  }

For completeness, we implement an iteration mechanism for our class. The current element in each hash identifies a single entry (by its bucket and entry indices). FIRSTKEY sets it to an initial (undefined) state, and leaves all the hard work to NEXTKEY, which steps through each key in turn.


  # Return the first key in a hash. (See hv.c:hv_iterinit)
  sub FIRSTKEY
  {
      $h = shift;
      @{ $h->{current} } = (undef, undef);
      return $h->NEXTKEY(@_);
  }

If NEXTKEY is called with the hash iterator in its initial state (by FIRSTKEY), it returns the first key in the first occupied bucket. On subsequent calls, it returns either the next key in the current chain, or the first key in the next occupied bucket.


  # Return the next key in a hash. (See hv.c:hv_iterkeysv et al.)
  sub NEXTKEY
  {
      $h = shift;
      $buckets = $h->{buckets};
      $current = $h->{current};
	  
      ($bucket, $entry) = @{ $current };
	  
      if (!defined $bucket || $entry+1 == @{ $buckets->[$bucket] }) {
      FIND_NEXT_BUCKET:
          do {
              if (++$current->[0] == @$buckets) {
                  @{ $current } = (undef, undef);
                  return undef;
              }
          } while (@{ $buckets->[$current->[0]] } == 0);
          $current->[1] = 0;
      }
      else {
          $current->[1]++;
      }
	  
      return $buckets->[$current->[0]][$current->[1]][0];
  }

The do loop at FIND_NEXT_BUCKET finds the next occupied bucket if the iterator is in its initial undefined state, or if the current entry is at the end of a chain. When there are no more keys in the hash, it resets the iterator and returns undef.

We now have all the pieces required to use our Hash class exactly as we would a real Perl hash.


  tie %h, "Hash";
  
  %h = ( foo => "bar", bar => "foo" );
  while (($key, $val) = each(%h)) {
      print "$key => $val\n";
  }
  delete $h{foo};
  
  # ...

Perl Internals

If you want to learn more about the hashes inside Perl, then the FakeHash module by Mark-Jason Dominus and a copy of hash.c from Perl 1.0 are good places to start. The PerlGuts Illustrated Web site by Gisle Aas is also an invaluable resource in exploring the Perl internals. (References to all three are included below.)

Although our Hash class is based on Perl's hash implementation, it is not a faithful reproduction; and while a detailed discussion of the Perl source is beyond the scope of this article, parenthetical notes in the code above may serve as a starting point for further exploration.

History

Donald Knuth credits H. P. Luhn at IBM for the idea of hash tables and chaining in 1953. About the same time, the idea also occurred to another group at IBM, including Gene Amdahl, who suggested open addressing and linear probing to handle collisions. Although the term "hashing" was standard terminology in the 1960s, the term did not actually appear in print until 1967 or so.

Perl 1 and 2 had "two and a half data types", of which one half was an "associative array." With some squinting, associative arrays look very much like hashes. The major differences were the lack of the % symbol on hash names, and that one could only assign to them one key at a time. Thus, one would say $foo{'key'} = 1;, but only @keys = keys(foo);. Familiar functions like each, keys, and values worked as they do now (and delete was added in Perl 2).

Perl 3 had three whole data types: it had the % symbol on hash names, allowed an entire hash to be assigned to at once, and added dbmopen (now deprecated in favour of tie). Perl 4 used comma-separated hash keys to emulate multidimensional arrays (which are now better handled with array references).

Perl 5 took the giant leap of referring to associative arrays as hashes. (As far as I know, it is the first language to have referred to the data structure thus, rather than "hash table" or something similar.) Somewhat ironically, it also moved the relevant code from hash.c into hv.c.

Nomenclature

Dictionaries, as explained earlier, are unordered collections of values indexed by unique keys. They are sometimes called associative arrays or maps. They can be implemented in several ways, one of which is by using a data structure known as a hash table (and this is what Perl refers to as a hash).

Perl's use of the term "hash" is the source of some potential confusion, because the output of a hashing function is also sometimes called a hash (especially in cryptographic contexts), and because hash tables aren't usually called hashes anywhere else.

To be on the safe side, refer to the data structure as a hash table, and use the term "hash" only in obvious, Perl-specific contexts.

Further Resources

Introduction to Algorithms (Cormen, Leiserson and Rivest)
Chapter 12 of this excellent book discusses hash tables in detail.
The Art of Computer Programming (Donald E. Knuth)
Volume 3 ("Sorting and Searching") devotes a section (�6.4) to an exhaustive description, analysis, and a historical perspective on various hashing techniques.
http://perl.plover.com/badhash.pl
"When Hashes Go Wrong" by Mark-Jason Dominus demonstrates a pathological case of collisions, by creating a large number of keys that hash to the same value, and effectively turn the hash into a very long linked list.
http://burtleburtle.net/bob/hash/doobs.html
Current versions of Perl use a hashing function designed by Bob Jenkins. His web page explains how the function was constructed, and provides an excellent overview of how various hashing functions perform in practice.
http://perl.plover.com/FakeHash/
This module, by Mark-Jason Dominus, is a more faithful re-implementation of Perl's hashes in Perl, and is particularly useful because it can draw pictures of the data structures involved.
http://www.etla.org/retroperl/perl1/perl-1.0.tar.gz
It might be instructive to read hash.c from the much less cluttered (and much less capable) Perl 1.0 source code, before going through the newer hv.c.
http://gisle.aas.no/perl/illguts/hv.png
This image, from Gisle Aas's "PerlGuts Illustrated", depicts the layout of the various structures that comprise hashes in the core. The entire web site is a treasure trove for people exploring the internals.
http://ams.wiw.org/src/Hash.pm
The source code for the tied Hash class developed in this article.
Visit the home of the Perl programming language: Perl.org

Sponsored by

Monthly Archives

Powered by Movable Type 5.13-en