Using Bloom Filters
by Maciej Ceglowski
|
Pages: 1, 2
Building a Bloom Filter in Perl
To make a working Bloom filter, we need a good set of hash functions These
are easy to come by -- there are several excellent hashing algorithms
available on CPAN. For our purposes, a good choice is
Digest::SHA1, a cryptographically strong hash with a fast C
implementation. We can use the module to create as many hash functions as we
like by salting the input with a list of distinct values. Here's a subroutine
that builds a list of unique hash functions:
use Digest::SHA1 qw/sha1/;
sub make_hashing_functions {
my ( $count ) = @_;
my @functions;
for my $salt (1..$count ) {
push @functions, sub { sha1( $salt, $_[0] ) };
}
return @functions;
}
To be able to use these hash functions, we have to find a way to control their
range. Digest::SHA1 returns an embarrassingly lavish 160 bits of
hashed output, useful only in the unlikely case that our vector is
2160 bits long. We'll use a combination of bit chopping and division
to scale the output down to a more usable size.
Here's a subroutine that takes a key, runs it through a list of hash
functions, and returns a bitmask of length $FILTER_LENGTH:
sub make_bitmask {
my ( $key ) = @_;
my $mask = pack( "b*", '0' x $FILTER_LENGTH);
foreach my $hash_function ( @functions ){
my $hash = $hash_function->($key);
my $chopped = unpack("N", $hash );
my $bit_offset = $result % $FILTER_LENGTH;
vec( $mask, $bit_offset, 1 ) = 1;
}
return $mask;
}
That's a dense stretch of code, so let's look at it line by line:
my $mask = pack( "b*", '0' x $FILTER_LENGTH);
We start by using Perl's pack operator to create a zeroed bit
vector that is $FILTER_LENGTH bits long. pack takes two
arguments, a template and a value. The b in our template tells
pack that we want it to interpret the value as bits, and the
* indicates "repeat as often as necessary," just like in a regular
expression. Perl will actually pad our bit vector to make its length a multiple
of eight, but we'll ignore those superfluous bits.
With a blank bit vector in hand, we're ready to start running our key through the hash functions.
my $hash = $hash_function->($key);
my $chopped = unpack("N", $hash );
We're keeping the first 32 bits of the output and discarding the rest. This
prevents us from having to require BigInt support further along. The second line does the actual bit chopping. The N in the
template tells unpack to extract a 32-bit integer in network byte
order. Because we don't provide any quantifier in the template,
unpack will extract just one integer and then stop.
If you are extra, super paranoid about bit chopping, you could split the hash into five 32-bit pieces and XOR them together, preserving all the information in the original hash:
my $chopped = pack( "N", 0 );
my @pieces = map { pack( "N", $_ ) } unpack("N*", $hash );
$chopped = $_ ^ $chopped foreach @pieces;
But this is probably overkill.
Now that we have a list of 32-bit integer outputs from our hash functions,
all we have to do is scale them down with the modulo operator so they fall in
the range (1..$FILTER_LENGTH).
my $bit_offset = $chopped % $FILTER_LENGTH;
Now we've turned our key into a list of bit offsets, which is exactly what we were after.
The only thing left to do is to set the bits using vec, which
takes three arguments: the vector itself, a starting position, and the number
of bits to set. We can assign a value to vec like we would to a
variable:
vec( $mask, $bit_offset, 1 ) = 1;
After we've set all the bits, we wind up with a bitmask that is the same length as our Bloom filter. We can use this mask to add the key into the filter:
sub add {
my ( $key, $filter ) = @_;
my $mask = make_bitmask( $key );
$filter = $filter | $mask;
}
Or we can use it to check whether the key is already present:
sub check {
my ( $key, $filter ) = @_;
my $mask = make_bitmask( $key );
my $found = ( ( $filter & $mask ) eq $mask );
return $found;
}
Note that those are the bitwise OR (|) and AND
(&) operators, not the more commonly used logical OR
(||) and AND ( && ) operators. Getting the
two mixed up can lead to hours of interesting debugging. The first example ORs
the mask against the bit vector, turning on any bits that aren't already set.
The second example compares the mask to the corresponding positions in the
filter -- if all of the on bits in the mask are also on in the filter, we
know we've found a match.
Once you get over the intimidation factor of using vec,
pack, and the bitwise operators, Bloom filters are actually quite
straightforward. Listing 1 shows a complete
object-oriented implementation called Bloom::Filter.
Bloom Filters in Distributed Social Networks
One drawback of existing social network schemes is that they require participants to either divulge their list of contacts to a central server (Orkut, Friendster) or publish it to the public Internet (FOAF), in both cases sacrificing a great deal of privacy. By exchanging Bloom filters instead of explicit lists of contacts, users can participate in social networking experiments without having to admit to the world who their friends are. A Bloom filter encoding someone's contact information can be checked to see whether it contains a given name or email address, but it can't be coerced into revealing the full list of keys that were used to build it. It's even possible to turn the false-positive rate, which may not sound like a feature, into a powerful tool.
Suppose that I am very concerned about people trying to reverse-engineer my social network by running a dictionary attack against my Bloom filter. I can build my filter with a prohibitively high false-positive rate (50%, for example) and then arrange to send multiple copies of my Bloom filter to friends, varying the hash functions I use to build each filter. The more filters my friends collect, the lower the false-positive rate they will see. For example, with five filters the false-positive rate will be (0.5)5, or 3% -- and I can reduce the rate further by sending out more filters.
If any one of the filters is intercepted, it will register the full 50% false-positive rate. So I am able to hedge my privacy risk across several interactions, and have some control over how accurately other people can see my network. My friends can be sure with a high degree of certainty whether someone is on my contact list, but someone who manages to snag just one or two of my filters will learn almost nothing about me.
Here's a Perl function that checks a key against a set of noisy filters:
use Bloom::Filter;
sub check_noisy_filters {
my ( $key, @filters ) = @_;
foreach my $filter ( @filters ) {
return 0 unless $filter->check( $key );
}
return 1;
}
If you and your friends agree to use the same filter length and set of hash functions, you can also use bitwise comparisons to estimate the degree of overlap between your social networks. The number of shared on bits in two Bloom filters will give a usable measure of the distance between them.
sub shared_on_bits {
my ( $filter_1, $filter_2 ) = @_;
return unpack( "%32b*", $filter_1 & $filter_2 )
}
Additionally, you can combine two Bloom filters that have the same length and hash functions with the bitwise OR operator to create a composite filter. For example, if you participate in a small mailing list and want to create a whitelist from the address books of everyone in the group, you can have each participant create a Bloom filter individually and then OR the filters together into a Voltron-like master list. None of the members of the group will know who the other members' contacts are, and yet the filter will exhibit the correct behavior.
There are sure to be other neat Bloom filter tricks with potential applications to social networking and distributed applications. The references below list a few good places to start mining.
References
- Bloom Filters -- the math. A good place to start for an overview of the math behind Bloom filters.
- Some Motley Bloom Tricks. Handy filter tricks and theory page.
- Bloom Filter Survey. A handy survey article on Bloom filter network applications.
- LOAF. Our own system for incorporating social networks onto email using Bloom filters.
- Compressed Bloom Filters. If you are passing filters around a network, you will want to optimize them for minimum size; this paper gives a good overview of compressed Bloom filters.
Bloom16. A CPAN module implementing a counting Bloom filter.Text::Bloom. CPAN module for using Bloom filters with text collections.- Privacy-Enhanced Searches Using Encryted Bloom Filters. This paper discusses how to use encryption and Bloom filters to set up a query system that prevents the search engine from knowing the query you are running.
- Bloom Filters as Summaries. Some performance data on actually using Bloom filters as cache summaries.
- Using Bloom Filters for Authenticated Yes/No Answers in the DNS. Internet draft for using Bloom filters to implement Secure DNS

