How Hashes Really Work
by Abhijit Menon-Sen
|
Pages: 1, 2
Building Toy Hashes
In this section, we'll use the representation discussed above to write a tied hash class that emulates the behavior of real Perl hashes. For the sake of brevity, the code doesn't check for erroneous input. My comments also gloss over details that aren't directly relevant to hashing, so you may want to have a copy of perltie handy to fill in blanks.
(All of the code in the class is available at the URL mentioned below.)
We begin by writing a tied hash constructor that creates an empty hash, and another function to empty an existing hash.
package Hash;
# We'll reuse the perlhash() function presented previously.
# Create a tied hash. (Analogous to newHV in hv.c)
sub TIEHASH
{
$h = {
keys => 0, # Number of keys
buckets => [ [],[],[],[],[],[],[] ], # Seven empty buckets
current => [ undef, undef ] # Current iterator entry
}; # (Explained below)
return bless $h, shift;
}
# Empty an existing hash. (See hv.c:hv_clear)
sub CLEAR
{
($h) = @_;
$h->{keys} = 0;
@{ $h->{buckets} } = ([],[],[],[],[],[],[]);
@{ $h->{current} } = (undef, undef);
}
For convenience, we also write a function that looks up a given key in a hash and returns the indices of its bucket and the correct entry within. Both indexes are undefined if the key is not found in the hash.
# Look up a specified key in a hash.
sub lookup
{
($h, $key) = @_;
$buckets = $h->{buckets};
$bucket = perlhash($key) % @$buckets;
$entries = @{ $buckets->[$bucket] };
if ($entries > 0) {
# Look for the correct entry inside the bucket.
$entry = 0;
while ($buckets->[$bucket][$entry][0] ne $key) {
if (++$entry == $entries) {
# None of the entries in the bucket matched.
$bucket = $entry = undef;
last;
}
}
}
else {
# The relevant bucket was empty, so the key doesn't exist.
$bucket = $entry = undef;
}
return ($bucket, $entry);
}
The lookup function makes it easy to write EXISTS, FETCH, and
DELETE methods for our class:
# Check whether a key exists in a hash. (See hv.c:hv_exists)
sub EXISTS
{
($h, $key) = @_;
($bucket, $entry) = lookup($h, $key);
# If $bucket is undefined, the key doesn't exist.
return defined $bucket;
}
# Retrieve the value associated with a key. (See hv.c:hv_fetch)
sub FETCH
{
($h, $key) = @_;
$buckets = $h->{buckets};
($bucket, $entry) = lookup($h, $key);
if (defined $bucket) {
return $buckets->[$bucket][$entry][1];
}
else {
return undef;
}
}
# Delete a key-value pair from a hash. (See hv.c:hv_delete)
sub DELETE
{
($h, $key) = @_;
$buckets = $h->{buckets};
($bucket, $entry) = lookup($h, $key);
if (defined $bucket) {
# Remove the entry from the bucket, and return its value.
$entry = splice(@{ $buckets->[$bucket] }, $entry, 1);
return $entry->[1];
}
else {
return undef;
}
}
STORE is a little more complex. It must either update the value of an
existing key (which is just an assignment), or add an entirely new entry
(by pushing an arrayref into a suitable bucket). In the latter case, if
the number of keys exceeds the number of buckets, then we create more buckets
and redistribute existing keys (under the assumption that the hash will
grow further; this is how we implement dynamic hashing).
# Store a key-value pair in a hash. (See hv.c:hv_store)
sub STORE
{
($h, $key, $val) = @_;
$buckets = $h->{buckets};
($bucket, $entry) = lookup($h, $key);
if (defined $bucket) {
$buckets->[$bucket][$entry][1] = $val;
}
else {
$h->{keys}++;
$bucket = perlhash($key) % @$buckets;
push @{ $buckets->[$bucket] }, [ $key, $val ];
# Expand the hash if all the buckets are full. (See hv.c:S_hsplit)
if ($h->{keys} > @$buckets) {
# We just double the number of buckets, as Perl itself does
# (and disregard the number becoming non-prime).
$newbuckets = [];
push(@$newbuckets, []) for 1..2*@$buckets;
# Redistribute keys
foreach $entry (map {@$_} @$buckets) {
$bucket = perlhash($entry->[0]) % @$newbuckets;
push @{$newbuckets->[$bucket]}, $entry;
}
$h->{buckets} = $newbuckets;
}
}
}
For completeness, we implement an iteration mechanism for our class. The
current element in each hash identifies a single entry (by its bucket
and entry indices). FIRSTKEY sets it to an initial (undefined) state,
and leaves all the hard work to NEXTKEY, which steps through each key
in turn.
# Return the first key in a hash. (See hv.c:hv_iterinit)
sub FIRSTKEY
{
$h = shift;
@{ $h->{current} } = (undef, undef);
return $h->NEXTKEY(@_);
}
If NEXTKEY is called with the hash iterator in its initial state (by
FIRSTKEY), it returns the first key in the first occupied bucket. On
subsequent calls, it returns either the next key in the current chain,
or the first key in the next occupied bucket.
# Return the next key in a hash. (See hv.c:hv_iterkeysv et al.)
sub NEXTKEY
{
$h = shift;
$buckets = $h->{buckets};
$current = $h->{current};
($bucket, $entry) = @{ $current };
if (!defined $bucket || $entry+1 == @{ $buckets->[$bucket] }) {
FIND_NEXT_BUCKET:
do {
if (++$current->[0] == @$buckets) {
@{ $current } = (undef, undef);
return undef;
}
} while (@{ $buckets->[$current->[0]] } == 0);
$current->[1] = 0;
}
else {
$current->[1]++;
}
return $buckets->[$current->[0]][$current->[1]][0];
}
The do loop at FIND_NEXT_BUCKET finds the next occupied bucket if
the iterator is in its initial undefined state, or if the current entry
is at the end of a chain. When there are no more keys in the hash, it
resets the iterator and returns undef.
We now have all the pieces required to use our Hash class exactly as we would a real Perl hash.
tie %h, "Hash";
%h = ( foo => "bar", bar => "foo" );
while (($key, $val) = each(%h)) {
print "$key => $val\n";
}
delete $h{foo};
# ...
Perl Internals
If you want to learn more about the hashes inside Perl, then the FakeHash
module by Mark-Jason Dominus and a copy of hash.c from Perl 1.0 are
good places to start. The PerlGuts Illustrated Web site by Gisle Aas is
also an invaluable resource in exploring the Perl internals. (References
to all three are included below.)
Although our Hash class is based on Perl's hash implementation, it is
not a faithful reproduction; and while a detailed discussion of the Perl
source is beyond the scope of this article, parenthetical notes in the
code above may serve as a starting point for further exploration.
History
Donald Knuth credits H. P. Luhn at IBM for the idea of hash tables and chaining in 1953. About the same time, the idea also occurred to another group at IBM, including Gene Amdahl, who suggested open addressing and linear probing to handle collisions. Although the term "hashing" was standard terminology in the 1960s, the term did not actually appear in print until 1967 or so.
Perl 1 and 2 had "two and a half data types", of which one half was an
"associative array." With some squinting, associative arrays look very
much like hashes. The major differences were the lack of the % symbol
on hash names, and that one could only assign to them one key at a time.
Thus, one would say $foo{'key'} = 1;, but only @keys = keys(foo);.
Familiar functions like each, keys, and values worked as they
do now (and delete was added in Perl 2).
Perl 3 had three whole data types: it had the % symbol on hash names,
allowed an entire hash to be assigned to at once, and added dbmopen
(now deprecated in favour of tie). Perl 4 used comma-separated hash
keys to emulate multidimensional arrays (which are now better handled
with array references).
Perl 5 took the giant leap of referring to associative arrays as hashes. (As far as I know, it is the first language to have referred to the data structure thus, rather than "hash table" or something similar.) Somewhat ironically, it also moved the relevant code from hash.c into hv.c.
Nomenclature
Dictionaries, as explained earlier, are unordered collections of values indexed by unique keys. They are sometimes called associative arrays or maps. They can be implemented in several ways, one of which is by using a data structure known as a hash table (and this is what Perl refers to as a hash).
Perl's use of the term "hash" is the source of some potential confusion, because the output of a hashing function is also sometimes called a hash (especially in cryptographic contexts), and because hash tables aren't usually called hashes anywhere else.
To be on the safe side, refer to the data structure as a hash table, and use the term "hash" only in obvious, Perl-specific contexts.
Further Resources
- Introduction to Algorithms (Cormen, Leiserson and Rivest)
- Chapter 12 of this excellent book discusses hash tables in detail.
- The Art of Computer Programming (Donald E. Knuth)
- Volume 3 ("Sorting and Searching") devotes a section (§6.4) to an exhaustive description, analysis, and a historical perspective on various hashing techniques.
- http://perl.plover.com/badhash.pl
- "When Hashes Go Wrong" by Mark-Jason Dominus demonstrates a pathological case of collisions, by creating a large number of keys that hash to the same value, and effectively turn the hash into a very long linked list.
- http://burtleburtle.net/bob/hash/doobs.html
- Current versions of Perl use a hashing function designed by Bob Jenkins. His web page explains how the function was constructed, and provides an excellent overview of how various hashing functions perform in practice.
- http://perl.plover.com/FakeHash/
- This module, by Mark-Jason Dominus, is a more faithful re-implementation of Perl's hashes in Perl, and is particularly useful because it can draw pictures of the data structures involved.
- http://www.etla.org/retroperl/perl1/perl-1.0.tar.gz
- It might be instructive to read hash.c from the much less cluttered (and much less capable) Perl 1.0 source code, before going through the newer hv.c.
- http://gisle.aas.no/perl/illguts/hv.png
- This image, from Gisle Aas's "PerlGuts Illustrated", depicts the layout of the various structures that comprise hashes in the core. The entire web site is a treasure trove for people exploring the internals.
- http://ams.wiw.org/src/Hash.pm
- The source code for the tied Hash class developed in this article.

