How Hashes Really Work
by Abhijit Menon-SenOctober 01, 2002
It's easy to take hashes for granted in Perl. They are simple, fast, and they usually "just work," so people never need to know or care about how they are implemented. Sometimes, though, it's interesting and rewarding to look at familiar tools in a different light. This article follows the development of a simple hash class in Perl in an attempt to find out how hashes really work.
A hash is an unordered collection of values, each of which is identified by a unique key. A value can be retrieved by its key, and one can add to or delete from the collection. A data structure with these properties is called a dictionary, and some of the many ways to implement them are outlined below.
|
Related Reading
Programming Perl |
Many objects are naturally identified by unique keys (like login names), and it is convenient to use a dictionary to address them in this manner. Programs use their dictionaries in different ways. A compiler's symbol table (which records the names of functions and variables encountered during compilation) might hold a few hundred names that are looked up repeatedly (since names usually occur many times in a section of code). Another program might need to store 64-bit integers as keys, or search through several thousands of filenames.
How can we build a generally useful dictionary?
Implementing Dictionaries
One simple way to implement a dictionary is to use a linked list of keys and values (that is, a list where each element contains a key and the corresponding value). To find a particular value, one would need to scan the list sequentially, comparing the desired key with each key in turn until a match is found, or we reach the end of the list.
This approach becomes progressively slower as more values are added to the dictionary, because the average number of elements we need to scan to find a match keeps increasing. We would discover that a key was not in the dictionary only after scanning every element in it. We could make things faster by performing binary searches on a sorted array of keys instead of using a linked list, but performance would still degrade as the dictionary grew larger.
If we could transform every possible key into a unique array index (for example, by turning the string "red" into the index 14328.), then we could store each value in a corresponding array entry. All searches, insertions and deletions could then be performed with a single array lookup, irrespective of the number of keys. But although this strategy is simple and fast, it has many disadvantages and is not always useful.
For one thing, calculating an index must be fast, and independent of the size of the dictionary (or we would lose all that we gained by not using a linked list). Unless the keys are already unique integers, however, it isn't always easy to quickly convert them into array indexes (especially when the set of possible keys is not known in advance, which is common). Furthermore, the number of keys actually stored in the dictionary is usually minute in comparison to the total number of possible keys, so allocating an array that could hold everything is wasteful.
For example, although a typical symbol table could contain a few hundred entries, there are about 50 billion alphanumeric names with six or fewer characters. Memory may be cheap enough for an occasional million-element array, but 50 billion elements (of which most remain unused) is still definitely overkill.
(Of course, there are many different ways to implement dictionaries. For example, red-black trees provide different guarantees about expected and worst-case running times, that are most appropriate for certain kinds of applications. This article does not discuss these possibilities further, but future articles may explore them in more detail.)
What we need is a practical compromise between speed and memory usage; a dictionary whose memory usage is proportional to the number of values it contains, but whose performance doesn't become progressively worse as it grows larger.
Hashes represent just such a compromise.
Hashes
Hashes are arrays (entries in it are called slots or buckets), but they do not require that every possible key correspond directly to a unique entry. Instead, a function (called a hashing function) is used to calculate the index corresponding to a particular key. This index doesn't have to be unique, i.e., the function may return the same hash value for two or more keys. (We disregard this possibility for a while, but return to it later, since it is of great importance.)
We can now look up a value by computing the hash of its key, and looking at the corresponding bucket in the array. As long as the running time of our hashing function is independent of the number of keys, we can always perform dictionary operations in constant time. Since hashing functions make no uniqueness guarantees, however, we need some way to to resolve collisions (i.e., the hashed value of a key pointing to an occupied bucket).
The simple way to resolve collisions is to avoid storing keys and values directly in buckets, and to use per-bucket linked lists instead. To find a particular value, its key is hashed to find the index of a bucket, and the linked list is scanned to find the exact key. The lists are known as chains, and this technique is called chaining.
(There are other ways to handle collisions, e.g. via open addressing, in which colliding keys are stored in the first unoccupied slot whose index can be recursively derived from that of an occupied one. One consequence is that the hash can contain only as many values as it has buckets. This technique is not discussed here, but references to relevant material are included below.)
Hashing Functions
Since chaining repeatedly performs linear searches through linked lists, it is important that the chains always remain short (that is, the number of collisions remains low). A good hashing function would ensure that it distributed keys uniformly into the available buckets, thus reducing the probability of collisions.
In principle, a hashing function returns an array index directly; in practice, it is common to use its (arbitrary) return value modulo the number of buckets as the actual index. (Using a prime number of buckets that is not too close to a power of two tends to produce a sufficiently uniform key distribution.)
Another way to keep chains remain short is to use a technique known as dynamic hashing: adding more buckets when the existing buckets are all used (i.e., when collisions become inevitable), and using a new hashing function that distributes keys uniformly into all of the buckets (it is usually possible to use the same hashing function, but compute indexes modulo the new number of buckets). We also need to re-distribute keys, since the corresponding indices will be different with the new hashing function.
Here's the hashing function used in Perl 5.005:
# Return the hashed value of a string: $hash = perlhash("key")
# (Defined by the PERL_HASH macro in hv.h)
sub perlhash
{
$hash = 0;
foreach (split //, shift) {
$hash = $hash*33 + ord($_);
}
return $hash;
}
More recent versions use a function designed by Bob Jenkins, and his Web page (listed below) does an excellent job of explaining how it and other hashing functions work.
Representing Hashes in Perl
We can represent a hash as an array of buckets, where each bucket is an
array of [$key, $value] pairs (there's no particular need for chains
to be linked lists; arrays are more convenient). As an exercise, let us
add each of the keys in %example below into three empty buckets.
%example = (
ab => "foo", cd => "bar",
ef => "baz", gh => "quux"
);
@buckets = ( [],[],[] );
while (($k, $v) = each(%example)) {
$hash = perlhash($k);
$chain = $buckets[ $hash % @buckets ];
$entry = [ $k, $v ];
push @$chain, $entry;
}
We end up with the following structure (you may want to verify that the keys are correctly hashed and distributed), in which we can identify any key-value pair in the hash with one index into the array of buckets and a second index into the entries therein. Another index serves to access either the key or the value.
@buckets = (
[ [ "ef", "baz" ] ], # Bucket 0: 1 entry
[ [ "cd", "bar" ] ], # Bucket 1: 1 entry
[ [ "ab", "foo" ], [ "gh", "quux" ] ], # Bucket 2: 2 entries
);
$key = $buckets[2][1][0]; # $key = "gh"
$val = $buckets[2][1][1]; # $val = $hash{$key}
$buckets[0][0][1] = "zab"; # $hash{ef} = "zab"
Pages: 1, 2 |





