Sign In/My Account | View Cart  
advertisement


Listen Print Discuss

Hash Crash Course
by Simon Cozens | Pages: 1, 2

Caching

A special case of "have I seen this before?" comes when you want to create a cache: if I have seen this before, what was the answer I saw last time? Here's how to do that.

I mentioned that the underlying database abstraction layer in the tags example might cache look-ups and return the same object if you requested the same tag multiple times:

    my %cache;

    sub retrieve {
        my ($self, $id) = @_;
        return $cache{$id} if exists $cache{$id};
        return $cache{$id} = $self->_hard_retrieve($id);
    }

This checks to see if it has seen the $id before; if so have, return the value from the hash. If not, work out what the value should be, then store it in the hash for next time, then return the whole lot.

Of course, for heavy-duty applications like a database abstraction layer, you need to do a little more work, such as pruning the cache to make sure it doesn't get out of date or get so full of data that it eats all your memory. The Cache::Cache suite of modules from CPAN takes care of all of this for you, and the core Memoize module adds this kind of caching to a function without you specifically having to write the cache-getting and -setting code.

Searching

Finally, you can use hashes for searching.

If you've done any computer science, you'll probably know about a couple of common searching algorithms. (Even if you haven't, they're so common that you've probably reinvented them without knowing it.) One is linear search, where you start at the beginning of a list and work your way toward the end, looking for the target item:

    my $index;
    for $index (0..@chambers) {
        last if $chambers[$index] == $bullet;
    }
    print "Found at index $index" if $index < @chambers;

Another is binary search, where you start in the middle of a sorted list, and work out whether you should go higher or lower than the current element, like players in some demented version of Card Sharks. This is basically the way to look up names in a phone book--it's faster, but a bit more complicated to implement:

    my @names = qw(Able Baker Charlie Dog ...);
    my $index;

    my $target = "Roger";

    sub search {
        my ($lower, $upper) = (0, $#names);
        while ($lower <= $upper) {
            my $index = ($lower + $upper) / 2;
            if ($names[$index] lt $target) {
                $lower = $index + 1;
            } elsif ($names[$index] gt $target) {
                $upper = $index--1;
            } else { return $index }
        }
        # Not found!
    }

This starts in the middle, at "Mike". "Go higher!" shouts the crowd, so it goes half-way between there and the end, to "Tare". Then it needs to go lower, so looks half-way between "Mike" and "Tare" and gets to "Peter". Now it needs to go higher again, between "Peter" and "Tare", and finds "Roger".

That takes four comparisons, which is not bad. Can you do it any better? How about... oh, none?

    my %search = map { $names[$_] => $_ } 0..$#names;
    print $names{"Roger"};

Now I'm cheating somewhat here because setting up the hash requires iterating over the whole array, but once you've done that, the searches are basically free.

Of course, if you need to look up an array index by its contents, then maybe you're doing something wrong in the first place, and should have used a hash to start with. Consider, for instance, in a configuration file:

    force-v3-sigs
    escape-from-lines
    lock-once
    load-extension rndlinux
    keyserver wwwkeys.eu.pgp.net
    keyserver the.earth.li

(Yes, that's from GnuPG.) Each line has a key and optionally a value, but you can have multiple values for each key. At the end of the day, you might want to look through and say "tell me all the keyservers"--but not just the keyservers, you want to read the whole config in and be able to say that about any key. That's a search problem, and a tricky one. Here's one solution:

    while (<>) {
        chomp;
        my ($key, $value) = split /\s+/, $_, 2;
        push @config, [$key, $value];
    }

Now you have to say:

    @keyservers = map { $_->[1] } grep { $_->[0] eq "keyserver" } @config;

This is actually a linear search in disguise. (grep does a linear search for you.)

With hashes, the issue is much simpler:

    while (<>) {
        chomp;
        my ($key, $value) = split /\s+/, $_, 2;
        push @{$config{$key}}, $value;
    }

This treats every key as a separate array reference, and pushes values into that array. Now to retrieve the list of keyservers, just look up the array inside the hash:

    @keyservers = @{$config{"keyservers"}};

This process is very remniscient of the final pattern--using a hash as a portable symbol table.

Dispatch Tables

Instead of having the configuration reader create a bunch of arrays--@keyservers, @load_extension and so on--I created a hash which held the arrays so as to look them up indirectly but more efficiently. In effect, instead of using the Perl symbol table, you can use a hash as a portable symbol table.

Suppose you have a script that does several related things: it manages your to-do list by adding, editing, listing, and deleting to-do items:

    % todo add "Email Samuel about photos"
    Todo item 129 created
    % todo done 129
    Item 129 marked as done

You might expect the script to look like:

    my $command = shift @ARGV;
    if    ($command eq "add")  { add(@ARGV)  }
    elsif ($command eq "list") { list(@ARGV) }
    elsif ($command eq "done") { done(@ARGV) }
    elsif ($command eq "edit") { edit(@ARGV) }
    ...
    else { die "Unknown command: $command" }

That is quite tedious; you need to edit the program in several places every time you add a new command. You could use symbolic references--that is, tell Perl to call a function named $command:

    my $command = shift @ARGV;
    sub AUTOLOAD { die "Unknown command: $AUTOLOAD" }
    no strict 'refs';
    &{$command}(@ARGV);

But that's somewhat crazy. It allows the user to get at any subroutine in the main package, which you may not want, and to keep any error checking you have to assume that any undefined subroutine call comes from the command line.

The middle way is to copy the commands into a hash, mapped to a function reference:

    %commands = (
        add  => \&add,
        list => \&list,
        edit => \&edit,
        done => \&done,
    );
    my $command = shift @ARGV;
    if (!exists $commands{$command}) { die "Unknown command: $command" }
    $commands{$command}->(@ARGV);

This keeps strict happy, it's safe in the way it restricts what subroutines users can call, and it allows for error checking that doesn't mess everything else up. Mark Jason Dominus' Higher Order Perl shows how you can define commands at runtime if you use dispatch tables, something you can't do if you hard-code your dispatch.

Conclusion

I've explored some of the most common hash-based patterns: using hashes for counting, uniqueness, searching, and dispatch--rather a lot more than just mapping from one thing to another. Of course, that is what a hash does at one level, but the uses of such a data structure are a lot more diverse than just that.

That's how you improve your Perl programming--you take elements of the language which ostensibly do one thing, and you find that they're great for more complicated uses as well. Maybe after these ideas you'll be able to find a few more hash idioms of your own!