Recently in Data Structures Category

Managing Rich Data Structures

As a Perl beginner, I wrote many scripts that created text files to store snippets of data. Eventually I had hundreds of little text files in a data directory, each one holding one piece of information in the form of a text string. I decided to move from snippets to a single database. Here's how I did it.

I didn't go all the way to creating a relational database with SQL, in part because I had not yet learned to use Perl modules to work with such databases. Plus I wanted to try the apparently simpler technique described in Chapters 11 and 14 of Perl Cookbook, 2nd Edition, namely the use of one of the DBM libraries. I was intrigued by the idea of creating and using a database without needing to learn SQL or run a database server such as PostgreSQL or MySQL.

One of my scripts created HTML for a clickable banner ad; the HTML would be inserted into an HTML email newsletter I publish daily. For any particular issue, the script pulled data from three different text files, each of which held a different piece of information in the form of a text string on the file's first and only line:

  • The URL of the advertiser's web page, to which the reader would be taken when the banner is clicked,
  • The location of the .gif file (the banner), or
  • A one-line headline that would appear above the banner.

A different script created those files and named them according to (i) the type of data in the file and (ii) the date of the newsletter in which the ad would appear. For example, data for the December 9, 2005 ad was contained in three files named url_2005_12_09, gif_2005_12_09 and headline_2005_12_09.

A Hash of Hashes

First I considered the kind of data structure needed to hold the data. I looked at the relationship between the three text files. It became clear that I basically had a hash for each ad. The data for any particular ad consisted of three different keys and their values: URL, gif and headline. The name of any particular file would provide the key (the type of information contained in the file). The contents of the file would provide the key's value.

I thought about finding some way to store those hashes as an array of anonymous hashes (one hash per ad), but then I realized that an array wouldn't let me access a particular ad's data easily. The hashes would be in the order in which I saved them into the array, but that wouldn't translate easily to the ad for a particular date. For example, how would I know where to find the data for next Monday's newsletter? Is it in $array[8] or $array[17]?

Hmm. Each anonymous hash could be identified by a particular date--the key (!) to locating the ad for any particular date. What kind of data structure associates a unique key with a value? A hash, of course! My data would fit nicely into a hash of hashes.

The name I chose for the hash was %data_for_ad_on. Choosing a hash name that ends in a preposition provides a more natural-reading and meaningful name; the key for data for the December 8, 2005 banner ad would be 2005_12_08, for example, and the way to access the value associated with that key would be $data_for_ad_on{2005_12_08}.

In code, this is how the data for two days of newsletters could be represented as a hash of hashes:

%data_for_ad_on = (

  '2005_12_08' => { 'url'      => 'http://roadrunners-r-us.com/index.html',
                    'gif'      => 'http://myserver.com/banners/roadrunners_banner.gif',
                    'headline' => 'Use Roadrunners R Us for speedy, reliable deliveries!',
                  },

  '2005_12_09' => { 'url'      => 'http://acme.com/index.html',
                    'gif'      => 'http://myserver.com/banners/acme_banner.gif',
                    'headline' => 'Look to Acme for quality, inexpensive widgets!',
                  },
);

The keys of the named hash are 2005_12_08 and 2005_12_09. Each key's value is a reference to an anonymous hash that contains its own keys and values. When a hash is created using braces instead of parentheses, its value is a reference to that unnamed, "anonymous" hash. I need to use a reference because a hash is permitted to contain only scalar keys and scalar values; another hash can't be stored as a value. A reference to that hash works, because it acts like a scalar.

Storing the Data

Now I knew what the data structure would be; but how would I store it? Being Perl, of course there's more than one way to store data to disk. Some of those solutions involve the parsing of potentially large amounts of data and then the need to programmatically find a particular desired piece of data, however.

I came across a single-file database solution in Recipes 11.14 and 14.6 of Perl Cookbook, 2nd Edition. These describe the use of a DBM file on disk. A Perl script can read data from a DBM file as if it were contained in a hash in memory. For data having unique keys, it's a great fit. In addition to data persistence from one run of your script to the next, using a DBM database file means your script won't need to read all the data into memory. Instead, you pull up only data associated with a particular key. This applies especially to large data sets; a DBM database file on disk might hold data for hundreds of thousands of keys.

A straightforward but tedious way to get data into a DBM file is shown below. The code creates a database called ad_data.db and inserts data for two newsletters.

#!/usr/local/bin/perl
use strict;
use warnings;
use MLDBM qw( DB_File Storable );
use Fcntl qw( :flock O_CREAT O_RDWR );

my $db = '/www/cgi-bin/databases/ad_data.db';
my %data_for_ad_on;

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Trouble opening $db, stopped: $!";

%data_for_ad_on = (
  '2005_12_09' => { 'url'      => 'http://acme.com/index.html',
                    'gif'      => 'http://myserver.com/banners/acme_banner.gif',
                    'headline' => 'Look to Acme for quality, inexpensive widgets!',
                  },

  '2005_12_08' => { 'url'      => 'http://roadrunners-r-us.com/index.html',
                    'gif'      => 'http://myserver.com/banners/roadrunners_banner.gif',
                    'headline' => 'Use Roadrunners R Us for speedy, reliable deliveries!',
                  },
);

After the script is run, the data contained in %data_for_ad_on are saved automatically into the DBM database file named /www/cgi-bin/databases/ad_data.db.

The secret is Perl's tie function. It associates a particular hash (here named %data_for_ad_on ) with a "class" and a file. A class that works for "complex" data (data containing references) is the MLDBM module, available from CPAN.

Note: if the values being stored are scalars only, a tied hash using the DB_File module as the class, rather than the MLDBM module, will be much easier to work with. I'm using MLDBM because my values include references to data structures. I'll discuss some of MLDBM's limitations later in this article.

The script shown above uses the Fcntl module to make it easier to create or update the DBM database file. The imported O_CREAT|O_RDWR parameters tell the script to create the database file if it doesn't yet exist or to update (read/write) the file if it exists.

The DB_File and Storable parameters passed to MLDBM specify the particular kind of DBM database to use and the serialization module to use. They handle the behind-the-scenes conversion of references into strings. (A recent version of MLDBM is necessary to be able to specify such parameters.) DB_File won't work with every system, however, because it depends upon your system's configuration, like all DBM modules. In particular, the Berkeley DB library (a C library) must be present; see the DB_File documentation for details. You can get another DBM module from CPAN if DB_File won't work for you.

Manually entering hash keys and their values into a script is time-consuming and error-prone, of course. I prefer to programmatically add data created or copied from existing files, and used a script to copy data from my hundreds of text files into the DBM database file.

#!/usr/local/bin/perl -T
use strict;
use warnings;
use MLDBM qw( DB_File Storable );
use Fcntl qw( :flock O_CREAT O_RDWR );

my $db = '/www/cgi-bin/databases/ad_data.db';
my %data_for_ad_on;

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Can't open $db: $!";

my $data_dir = '/www/cgi-bin/newsletter_data_text_files';

opendir my $dir_handle, $data_dir
  or die "Can't open $data_dir, stopped: $!";

# All of the data files are in the same directory; each file's name
# is url_YYYY_MM_DD, gif_YYYY_MM_DD or headline_YYYY_MM_DD (where
# YYYY is the year, MM is the month, with a leading zero if needed,
# and DD is the day of the month, with a leading zero if needed).
# No other files in that directory end with YYYY_MM_DD.

my @data_files = grep { /\d\d\d\d_\d\d_\d\d$/ } readdir $dir_handle;

close $dir_handle;

my $entry;
foreach my $file ( @data_files ) {

  # Test the filename to see if the file contains the advertiser's
  # URL...

  if ($file =~ /^url_(\d\d\d\d_\d\d_\d\d)/ ) {

      $entry = $data_for_ad_on{$1};

      open (FILE, "$data_dir/$file") or die;
      my $url = do { local $/; <FILE> };
      close FILE;

      $entry->{url}       = $url;
      $data_for_ad_on{$1} = $entry;
      print "Advertiser's URL for $1 has been saved.\n";

  # Test the filename to see if the file contains the .gif location...

  } elsif ($file =~ /^gif_(\d\d\d\d_\d\d_\d\d)/ ) {

      $entry = $data_for_ad_on{$1};

      open (FILE, "$data_dir/$file") or die;
      my $gif = do { local $/; <FILE> };
      close FILE;

      $entry->{gif}       = $gif;
      $data_for_ad_on{$1} = $entry;
      print ".gif file location for $1 has been saved.\n";

  # Test the filename to see if the file contains the headline...

  } elsif ($file =~ /^headline_(\d\d\d\d_\d\d_\d\d)/ ) {

      $entry = $data_for_ad_on{$1};

      open (FILE, "$data_dir/$file") or die;
      my $headline = do { local $/; <FILE> };
      close FILE;

      $entry->{headline}  = $headline;
      $data_for_ad_on{$1} = $entry;
      print "Headline for $1 has been saved.\n";
  }

}

A limitation of MLDBM is that you can't work piecemeal with the data structures (the values for keys in the tied hash). You must retrieve a copy of the data structure in the form of a reference ($entry in the example); use that reference to add, update or delete parts of the data structure; and then save the reference back to the database in order for MLDBM to put the changed data structure back onto disk. This is because the data structure is no longer tied, once it has been retrieved from disk.

For example, the script above works piecemeal with the data structure: it builds one key at a time in the anonymous hash. A temporary variable called $entry holds a reference to the retrieved anonymous hash, which now exists in memory; then $entry is used to dereference the anonymous hash (and to select a named key in the hash, on the right side of the infix operator); then the script alters a part of that data structure (assigning a value to the named key); then the now-changed anonymous hash is saved back into the database file on disk by assigning $entry to the key.

As shown in earlier examples, an entire data structure can be stored as the value of a key in the tied hash without fetching the existing data structure (if any) associated with that key. In such a case, you're overwriting any existing data structure completely, so there is no need to fetch the existing data structure first or to work with it using the temporary variable technique.

Note: a recently added CPAN module called DBM_Deep allows tied hashes of complex data as does MLDBM, but appears to avoid the need for fetching data first in order to work with it through a temporary variable.

By the way, MLDBM allows stored data structures to be arrays, not just hashes. Hence MLDBM might be a good disk-based storage solution for you if a hash of arrays works best to hold your data. Plus, whether you're storing a hash of hashes or a hash of arrays, those stored hashes or arrays are permitted to contain values that consist of even more hashes or arrays!

Another thing to know about MLDBM is that file locking is not provided automatically. The solution presented in this article works well in a single-user environment, but data corruption is possible when more than one instance of the script (or more than one script) needs to access the disk-based data while another process is writing to it. See section 10.2.1 of CGI Programming with Perl for an example of file locking with the DB_File DBM module.

Reading HoH Data

Here is a portion of a script that prints a report for my stored newsletter data.

I reach into the %data_for_ad_on hash by specifying a key (the date of a particular newsletter); the key's associated value is a reference to the anonymous hash holding ad data for that date. Use infix (arrow) notation to dereference the anonymous hash and also to access a specified key.

tie %data_for_ad_on, 'MLDBM', $db, O_RDWR, 0644
  or die "Can't open $db: $!";

my @newsletter_dates = sort keys %data_for_ad_on;

foreach my $date ( @newsletter_dates ) {

   my $url      = $data_for_ad_on{$date}->{url};
   my $gif      = $data_for_ad_on{$date}->{gif};
   my $headline = $data_for_ad_on{$date}->{headline};

   print "\nHere are details for the banner ad in the $date newsletter:\n";
   print "          Advertiser's URL: $url\n";
   print "        Location of banner: $gif\n";
   print "     Headline above banner: $headline\n";
}

A temporary variable to work with the data was unnecessary because I was merely reading data. I had no need to save a changed data structure back to disk after working with it. A temporary variable would be needed as a reference to the data structure, as noted earlier, if I were adding, updating or deleting something in the data structure, in order to save the changed data structure back to disk.

Nevertheless, a temporary variable is useful for storing a fetched data structure even when a tied hash is being used only to read data from disk. The MLDBM documentation points out that disk fetch occurs each time a script names a key in the tied hash; a new copy of the entire data structure for that key gets created with each fetch. For efficiency, scripts that merely read data work better if the data structure for a particular key is stored first into a temporary variable and the script works with that single copy of data in memory. For example, the lines in the preceding code that get values for $url, $gif, and $headline should be rewritten to cut the number of fetches from three to one:

   my $entry    = $data_for_ad_on{$date};
   my $url      = $entry->{url};
   my $gif      = $entry->{gif};
   my $headline = $entry->{headline};

Updating HoH Data

My next task was to update existing data. An advertiser informed me that its web page had changed. The advertiser's banner ad in a particular upcoming newsletter needed to point to the new page rather than to the URL I had stored earlier for the ad. So I tried this:

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Can't open $db: $!";

$data_for_ad_on{'2005_12_14'}->{url}
  = 'http://roadrunners-r-us.com/new_home_page.html';   # WRONG

Despite the apparently correct syntax, the line marked "# WRONG" did not change the data in the tied hash. MLDBM can't add, update or delete just a part of a data structure stored on disk; it must retrieve the entire data structure first. Unfortunately that doesn't happen automatically when a part of the tied hash is used as an lvalue (the left side of an assignment).

Here is the correct syntax for updating part of a data structure:

my $entry = $data_for_ad_on{'2005_12_14'};                        # Get
$entry->{url} = 'http://roadrunners-r-us.com/new_home_page.html'; # Change
$data_for_ad_on{'2005_12_14'} = $entry;                           # Save

A different kind of update would be to add a new kind of information to an existing data structure. Until now I have been working with each ad's data as an anonymous hash of three keys because I only had three pieces of information per ad. Here's how I can modify the data structure for any particular ad--that is, the value of any particular key in the tied hash--to include a fourth piece of information. I'll record whether or not a particular ad has been billed.

tie %data_for_ad_on, 'MLDBM', $db, O_CREAT|O_RDWR, 0644
  or die "Can't open $db: $!";

my $entry = $data_for_ad_on{'2005_12_02'};  # Get
$entry->{has_been_billed} = 'yes';           # Change (by creating a key and value)
$data_for_ad_on{'2005_12_02'} = $entry;      # Save

Adding a fourth key as shown is not remarkable because it's done the same way I created the original three keys. But note how I've added a fourth key to just one of the ads (for the 2005_12_02 newsletter). I don't need to define such a fourth key in each of the other anonymous hashes because MLDBM doesn't require that the anonymous hashes have the same structure.

Now that I've added a fourth key to one of the ads, I want to see that key's information when I run my report, even though the information for each of the other newsletters is incomplete. Just as with a hash in memory, the exists operator can test for the existence of a key in a tied hash, or test for the existence of a key in one of the stored anonymous hashes. The earlier reporting script can be improved:

foreach my $date ( @newsletter_dates ) {

   my $entry    = $data_for_ad_on{$date};
   my $url      = $entry->{url};
   my $gif      = $entry->{gif};
   my $headline = $entry->{headline};

   my $has_been_billed = exists ( $entry->{has_been_billed} ) ?
                         $entry->{has_been_billed}
                         : 'Billing status is unknown';

   print "\nHere are details for the banner ad in the $date newsletter:\n";
   print "            Advertiser's URL: $url\n";
   print "          Location of banner: $gif\n";
   print "       Headline above banner: $headline\n";
   print "  Has advertiser been billed? $has_been_billed\n";
}

Note how a default value is provided for $has_been_billed (the string 'Billing status is unknown'). If $entry->{has_been_billed} had been assigned to $has_been_billed without providing a default value, the script (if it is using the warnings pragma) would report "Use of uninitialized value in concatenation (.) or string" every time it processes the "Has advertiser been billed?" statement for an anonymous hash that has no 'has_been_billed' key. The value of a nonexistent key in a hash is undefined, whether or not the hash is tied.

Because the value of a nonexistent key in a hash is undefined, remember when working with MLDBM not to assume your tied hash's keys will return either a reference to a data structure or a reference to an empty data structure. For example, this code produces a runtime error if the '2005_12_32' key does not exist in the tied hash (there is no December 32) and the script is using the strict pragma:

my $entry = $data_for_ad_on{2005_12_32};
my %named_hash = %{ $entry };

if ( $named_hash{url} ) {
    print "The advertiser's URL for the ad dated 2005_12_32 is $named_hash{url}.\n";

} else {
    print "We have no URL on file for the ad dated 2005_12_32.\n";
}

The runtime error would be 'Can't use string ("") as a HASH ref while "strict refs" in use' on the line number of my %named_hash = %{ $entry }. The code doesn't create a named hash that has no elements.

Conclusion

I hope this article helps you move from a swamp of weedy text files into the simplicity of a single disk-based database file. A DBM database file enables you to work with your data as if it were a single hash of keys and their values.

The MLDBM module extends the "hash on a disk" capability by letting you store a key's value as a hash, as illustrated in this article, or store a key's value as an array.

MLDBM requires a special technique for working piecemeal with stored hashes (or stored arrays). Also, the solution described works well in my single-user environment; file locking to protect the disk-based data would become important in a multi-user environment such as a script executed by a publicly accessible web server. Still, you get all this hash-like goodness for key-based data without running a relational database server or using SQL, without parsing large amounts of disk-based data, and without storing all the data in memory first.

Happy hashing!

Understanding and Using Iterators

The purpose of this tutorial is to give a general overview of what iterators are, why they are useful, how to build them, and things to consider to avoid common pitfalls. I intend to give the reader enough information to begin using iterators, though this article assumes some understanding of idiomatic Perl programming. Please consult the "See Also" section if you need supplemental information.

What Is an Iterator?

Iterators come in many forms, and you have probably used one without even knowing it. The readline and glob functions, as well as the flip-flop operator, are all iterators when used in scalar context. A user-defined iterator usually takes the form of a code reference that, when executed, calculates the next item in a list and returns it. When the iterator reaches the end of the list, it returns an agreed-upon value. While implementations vary, a subroutine that creates a closure around any necessary state variables and returns the code reference is common. This technique is called a factory and facilitates code reuse.

Why Are Iterators Useful?

The most straightforward way to use a list is to define an algorithm to generate the list and store the results in an array. There are several reasons why you might want to consider an iterator instead:

  • The list in its entirety would use too much memory.

    Iterators have tiny memory footprints, because they can store only the state information necessary to calculate the next item.

  • The list is infinite.

    Iterators return after each iteration, allowing the traversal of an infinite list to stop at any point.

  • The list should be circular.

    Iterators contain state information, as well as logic allowing a list to wrap around.

  • The list is large but you only need a few items.

    Iterators allow you to stop at any time, avoiding the need to calculate any more items than is necessary.

  • The list needs to be duplicated, split, or variated.

    Iterators are lightweight and have their own copies of state variables.

How to Build an Iterator

The basic structure of an iterator factory looks like this:

sub gen_iterator {
    my @initial_info = @_;

    my ($current_state, $done);

    return sub {
        # code to calculate $next_state or $done;
        return undef if $done;
        return $current_state = $next_state;   
    };
}

To make the factory more flexible, the factory may take arguments to decide how to create the iterator. The factory declares all necessary state variables and possibly initializes them. It then returns a code reference--in the same scope as the state variables--to the caller, completing the transaction. Upon each execution of the code reference, the state variables are updated and the next item is returned, until the iterator has exhausted the list.

The basic usage of an iterator looks like this:

my $next = gen_iterator( 42 );
while ( my $item = $next->() ) {
    print "$item\n";
}

Example: The List in Its Entirety Would Use Too Much Memory

You work in genetics and you need every possible sequence of DNA strands of lengths 1 to 14. Even if there were no memory overhead in using arrays, it would still take nearly five gigabytes of memory to accommodate the full list. Iterators come to the rescue:

my @DNA = qw/A C T G/;
my $seq = gen_permutate(14, @DNA);
while ( my $strand = $seq->() ) {
    print "$strand\n";
}

sub gen_permutate {
    my ($max, @list) = @_;
    my @curr;
    return sub {
        if ( (join '', map { $list[ $_ ] } @curr) eq $list[ -1 ] x @curr ) {
            @curr = (0) x (@curr + 1);
        }
        else {
            my $pos = @curr;
            while ( --$pos > -1 ) {
                ++$curr[ $pos ], last if $curr[ $pos ] < $#list;
                $curr[ $pos ] = 0;
            }
        }
        return undef if @curr > $max;
        return join '', map { $list[ $_ ] } @curr;
    };
}

Example: The List Is Infinite

You need to assign IDs to all current and future employees and ensure that it is possible to determine if an ID is valid with nothing more than the number itself. You have already taken care of persistence and number validation (using the LUHN formula). Iterators take care of the rest:

my $start = $ARGV[0] || 999999;
my $next_id = gen_id( $start );
print $next_id->(), "\n" for 1 .. 10;  # Next 10 IDs

sub gen_id {
    my $curr = shift;
    return sub {
        0 while ! is_valid( ++$curr );
        return $curr;
    };
}

sub is_valid {
    my ($num, $chk) = (shift, '');
    my $tot;
    for ( 0 .. length($num) - 1 ) {
        my $dig = substr($num, $_, 1);
        $_ % 2 ? ($chk .= $dig * 2) : ($tot += $dig);
    }

    $tot += $_ for split //, $chk;

    return $tot % 10 == 0 ? 1 : 0;
}

Example: The List Should Be Circular

You need to support legacy apps with hardcoded filenames, but want to keep logs for three days before overwriting them. You have everything you need except a way to keep track of which file to write to:

my $next_file = rotate( qw/FileA FileB FileC/ );
print $next_file->(), "\n" for 1 .. 10;

sub rotate {
    my @list  = @_;
    my $index = -1;

    return sub {
        $index++;
        $index = 0 if $index > $#list;
        return $list[ $index ];
    };
}

Adding one state variable and an additional check would provide the ability to loop a user-defined number of times.

Example: The List Is Large But Only a Few Items May Be Needed

You have forgotten the password to your DSL modem and the vendor charges more than the cost of a replacement to unlock it. Fortunately, you remember that it was only four lowercase characters:

while ( my $pass = $next_pw->() ) {
    if ( unlock( $pass ) ) {
        print "$pass\n";
        last;
    }
}

sub fix_size_perm {

    my ($size, @list) = @_;
    my @curr          = (0) x ($size - 1);

    push @curr, -1;

    return sub {
        if ( (join '', map { $list[ $_ ] } @curr) eq $list[ -1 ] x @curr ) {
            @curr = (0) x (@curr + 1);
        }
        else {
            my $pos = @curr;
            while ( --$pos > -1 ) {
                ++$curr[ $pos ], last if $curr[ $pos ] < $#list;
                $curr[ $pos ] = 0;
            }
        }

        return undef if @curr > $size;
        return join '', map { $list[ $_ ] } @curr;
    };
}

sub unlock { $_[0] eq 'john' }

Example: The List Needs To Be Duplicated, Split, or Modified into Multiple Variants

Duplicating the list is useful when each item of the list requires multiple functions applied to it, if you can apply them in parallel. If there is only one function, it may be advantageous to break the list up and run duplicate copies of the function. In some cases, multiple variations are necessary, which is why factories are so useful. For instance, multiple lists of different letters might come in handy when writing a crossword solver.

The following example uses the idea of breaking up the list to enhance the employee ID example. Assigning ranges to departments adds additional meaning to the ID.

my %lookup;

@lookup{ qw/sales support security management/ }
    = map { { start => $_ * 10_000 } } 1..4;

$lookup{$_}{iter} = gen_id( $lookup{$_}{start} ) for keys %lookup;

# ....

my $dept = $employee->dept;
my $id   = $lookup{$dept}{id}();
$employee->id( $id );

Things To Consider

The iterator's @_ is Different Than the Factory's

The following code doesn't work as you might expect:

sub gen_greeting {
    return sub { print "Hello ", $_[0] };
}

my $greeting = gen_greeting( 'world' );
$greeting->();

It may seem obvious, but closures need lexicals to close over, as each subroutine has its own @_. The fix is simple:

sub gen_greeting {
    my $msg = shift;
    return sub { print "Hello ", $msg };
}

The Return Value Indicating Exhaustion Is Important

Attempt to identify a value that will never occur in the list. Using undef is usually safe, but not always. Document your choice well, so calling code can behave correctly. Using while ( my $answer = $next->() ) { ... } would result in an infinite loop if 42 indicated exhaustion.

If it is not possible to know in advance valid values in the list, allow users to define their own values as an argument to the factory.

References to External Variables for State May Cause Problems

Problems can arise when factory arguments needed to maintain state are references. This is because the variable being referred to can have its value changed at any time during the course of iteration. A solution might be to de-reference and make a copy of the result. In the case of large hashes or arrays, this may be counterproductive to the ultimate goal. Document your solution and your assumptions so that the caller knows what to expect.

You May Need to Handle Edge Cases

Sometimes, the first or the last item in a list requires more logic than the others in the list. Consider the following iterator for the Fibonacci numbers:

sub gen_fib {
    my ($low, $high) = (1, 0);

    return sub {
        ($low, $high) = ($high, $low + $high);
        return $high;
    };
}

my $fib = gen_fib();
print $fib->(), "\n" for 1 .. 20;

Besides the funny initialization of $low being greater than $high, it also misses 0, which should be the first item returned. Here is one way to handle it:

sub gen_fib {

    my ($low, $high) = (1, 0);

    my $seen_edge;

    return sub {
        return 0 if ! $seen_edge++;
        ($low, $high) = ($high, $low + $high);
        return $high;
    };
}

State Variables Persist As Long As the Iterator

Reaching the end of the list does not necessarily free the iterator and state variables. Because Perl uses reference counting for its garbage collection, the state variables will exist as long as the iterator does.

Though most iterators have a small memory footprint, this is not always the case. Even if a single iterator doesn't consume a large amount of memory, it isn't always possible to forsee how many iterators a program will create. Be sure to document how the caller can destroy the iterator when necessary.

In addition to documentation, you may also want to undef the state variables at exhaustion, and perhaps warn the caller if the iterator is being called after exhaustion.

sub gen_countdown {
   my $curr = shift;

   return sub {
       return $curr++ || 'blast off';
   }
}

my $t = gen_countdown( -10 );
print $t->(), "\n" for 1..12; # off by 1 error

Becomes:

sub gen_countdown {
   my $curr = shift;

   return sub {
       if ( defined $curr && $curr == 0 ) {
           undef $curr, return 'blast off';
       }

       warn 'List already exhausted' and return if ! $curr;

       return $curr++;
   }
}

See Also

Symbol Table Manipulation

Having almost achieved the state of perfect laziness, one of my favorite modules is Class::DBI::mysql. It makes MySQL database tables seem like classes, and their rows like objects. This completely relieves me from using SQL in most cases. This article explains how Class::DBI::mysql carries out its magic. Instead of delving into the complexities of Class::DBI::mysql, I will use a simpler case study: Class::Colon.

Introduction

One of my favorite modules from CPAN is Class::DBI::mysql. With it, I can almost forget I'm working with a database. Here's an example:

#!/usr/bin/perl
use strict; use warnings;

package Users;

use base 'Class::DBI::mysql';

Users->set_db('Main', 'dbi:mysql:rt3', 'rt_user', 'rt_pass');
Users->set_up_table('Users');

package main;

my @column_names = qw( Name RealName );
print "@column_names\n";
print "-" x 30 . "\n";

my $user_iter = Users->retrieve_all();

while (my $row = $user_iter->next) {
    print $row->Name, " ", $row->RealName, "\n";
}

Except for the MySQL connection information, no trace of SQL or databases remains.

My purpose here is not really to introduce you to this beautiful module. Instead, I'll explain how to build façades like this. To do so, I'll work through another, simpler CPAN module called Class::Colon. It turns colon-delimited files into classes and their lines into objects. Here's an example from a checkbook application. This program computes the balance of an account on a user-supplied date or the end of time if the user doesn't supply one.

#!/usr/bin/perl
use strict; use warnings;

use Getopt::Std;
use Date;
use Class::Colon Trans => [ qw(
    status type date=Date amount desc category memo
) ];

our $opt_d;
getopt('d');
my $date       = Date->new($opt_d) if $opt_d;

my $account    = shift or die "usage: $0 [-d date] account_file\n";
my $trans_list = Trans->READ_FILE($account);
my $balance    = 0;

foreach my $trans (@$trans_list) {
    if (not defined $date or $date >= $trans->date) {
        $balance += $trans->amount;
    }
}

print "balance = $balance\n";

In the use statement for Class::Colon, I told it the name of the class to build (Trans), followed by a list of fields in the order they appear in the file. The date field is really an object itself, so I used =Date after the field name. This told Class::Colon that a class named Date will handle the date field. If the Date class constructor were not named new, I would have written date=Date=constructor_name. My Date class is primitive at best, it only provides comparisons like greater than. It only does that for dates in one format. I won't embarrass myself further by showing it.

After shifting in the name of the account file, the code calls READ_FILE through Trans, which Class::Colon defined. This returns a list of Trans objects. The fields in these objects are the ones given in the Class::Colon use statement. They are easy to access through their named subroutines.

The rest of the program loops through the transactions list checking dates. If the user didn't give a date, or the current transaction happened before the user's date, the program adds that amount to the total. Finally, it reports the balance.

Though the example shows only the lookup access, you can easily change values. All of the accessors retrieve and store. Calling WRITE_FILE puts the updated records back onto the disk.

Other methods help with colon-delimited records. Some let you work with handles instead of file names. Others help you parse and produce strings so that you can drive your own input and output. See the Class::Colon perldoc for details. (No, colon is not the only delimiter.)

Let the Games Begin

Both Class::DBI::mysql and Class::Colon build classes at run time which look like any other classes. How do they do this? They manipulate symbol tables directly. To see what this means, I want to start small. Suppose I have a variable name like:

my $extremely_long_variable_indicator_string;

That's not something I want to type often. I could make an alias in two steps like this:

our $elvis;

First, I declare an identifier with a better name. I must make it global. If strict is on, I should use our to do this (though there are other older ways that also work). Lexical variables (the ones declared with my) don't live in symbol tables, so the tricks below won't work with them.

*elvis = \$extremely_long_variable_indicator_string;

Now I can point $elvis to the longer version. The key is the * sigil. It refers to the symbol table entry for elvis (the name without any sigils). This line stores a reference to $extremely_long_variable_indicator_string in the symbol table under $elvis, but it doesn't affect other entries like @elvis or %elvis. Now, both scalars point to the same data, so $elvis is a genuine alias for the longer name. It is not just a copy.

Unless you work with mean-spirited colleagues, or are into self-destructive behavior, you probably don't need an alias just to gain a shorter name. However, the technique works in other situations you might actually encounter. In particular, it is the basis for the API simplification provided by Class::Colon.

To understand what Class::Colon does, remember that the subroutine is a primitive type in Perl. You can store subs just as you do variables. For instance, I could store a subroutine reference like this (the sigil for subs is &):

my $abbr;
$abbr = \&some_long_sub_name;

and use it to call the subroutine:

my @answer = $abbr->();

Here, I have made a new scalar variable, $abbr, which holds a reference to the subroutine. This is not quite the same as directly manipulating the symbol table, but you can do that too:

*alias     = \&some_long_sub_name;
my @retval = alias();

Instead of storing a reference to the subroutine in a variable, this code stores the subroutine in the symbol table itself. This means that subsequent code can access the subroutine as if it had declared the subroutine with its new name itself. Adjusting the symbol table is not really easier to read or write than storing a reference, but, in modules like Class::Colon, symbol table changes are the essential step to simplifying the caller's API.

Classes from Sheer Magic

The previous example demonstrated how to make symbol table entries whenever you want. These can save typing and/or make things more readable. The standard module English uses this technique to give meaningful English names to the standard punctuation variables (like $_). You want more, though. You want to build classes out of thin air during run time.

The key to fabricating classes is to realize that a class is just a package and a package is really just a symbol table (more or less). That, and the fact that symbol tables autovivify, is all you need to carry off hugely helpful deceptions like Class::DBI::mysql.

What use really does

This subsection explains how to pass data during a use statement. If you already understand the import subroutine, feel free to skip to the next section.

When you use a module in Perl, you can provide information for that module to use during loading. While Class::DBI::mysql waits for you to call routines before setting up classes, Class::Colon does it during loading by implementing an import method.

Whenever someone uses your module, Perl calls its import method (if it has one). import receives the name of the class the caller used, plus all of the arguments provided by the caller.

In the checkbook example above, the caller used Class::Colon with this statement:

use Class::Colon Trans => [ qw(
    status type date=Date amount desc category memo
) ];

The import method of the Class::Colon package receives the following as a result:

  1. The string Class::Colon.
  2. A list with two elements. First, the string Trans. Second, a reference to the array which lists the fields.

The top of the import routine stores these as shown below.

Inserting into non-existent symbol tables

The main magic of Class::Colon happens in the import routine. Here's how that looks:

sub import {
    my $class = shift;
    my %fakes = @_;

    foreach my $fake (keys %fakes) {
        no strict;
        *{"$fake\::NEW"}     = sub { return bless {}, shift; };

        foreach my $proxy_method qw(
                read_file  read_handle  objectify delim
                write_file write_handle stringify
        ) {
            my $proxy_name   = $fake  . "::" . uc $proxy_method;
            my $real_name    = $class . "::" .    $proxy_method;
            *{"$proxy_name"} = \&{"$real_name"};
        }

        my @attributes;
        foreach my $col (@{$fakes{$fake}}) {
            my ($name, $type, $constructor)  = split /=/, $col;
            *{"$fake\::$name"} = _make_accessor($name, $type, $constructor);
            push @attributes, $name;
        }
        $simulated_classes{$fake} = {ATTRS => \@attributes, DELIM => ':'};
    }
}

After shifting the arguments into meaningful variable names, the main loop walks through each requested class (the list of fakes). Inside the loop it disables strict, because the necessary uses of so many symbolic references would upset it.

There are four steps in the fabrication of each class:

  1. Make the constructor
  2. Make the class methods
  3. Make the accessor methods
  4. Store the attribute names in order

The constructor is about as simple as possible and the same for every fabricated class. It returns a hash reference blessed into the requested class. The cool thing is that you can insert code into a symbol table that doesn't exist in advance. This constructor will be NEW. (By convention, Class::Colon uses uppercase names for its methods to avoid name collisions with the user's fields).

This code requires a little bit of careful quoting. Saying *{"$fake\::NEW"} tells Perl to make an entry in the new package's symbol table under NEW. The backslash suppresses variable interpolation. While $fake needs interpolation, interpolating $fake::NEW would just yield undef, because this is its first definition here.

Perl has already done the hard part by the time it stores the constructor. It has brought the package into existence. Now it's just a matter of making some aliases.

For each provided method, the code makes an entry in the symbol table of the fabricated class. Those entries point to the methods of the Class::Colon package, which serve as permanent shared delegates for all fabricated classes.

Similarly, it builds an accessor for each attribute supplied by the caller in the use statement. These routines require a bit of customization to look up the proper attribute name and to deal with object construction. Hence, there is a small routine called _make_accessor which returns the proper closure for each accessor.

Finally, it makes an entry for the new class in the master list of simulated classes. This allows easy lookup by name when calling class methods through the fabricated names. Note that there is nothing in the import routine that limits the caller to one invocation. Further use statements can bring additional classes to life. Alternatively, the caller can request several new classes with a single use statement by including multiple hash keys.

In the standard case, _make_accessor works like this:

sub _make_accessor {
    my $attribute   = shift;
    return sub {
        my $self            = shift;
        my $new_val         = shift;
        $self->{$attribute} = $new_val if defined $new_val;
        return $self->{$attribute};
    };
}

The actual routine is a bit more complex, so it can handle construction of attributes which are objects. Note that the value of $attribute, which is in scope when the closure is created, will be kept with the sub and used whenever it is called. The actual code is a fairly standard Perl dual-use accessor. It assigns a new value to the attribute if the caller has passed it in. It always returns the value of the attribute.

What Class::Colon provides

Just for sake of completeness, here is how Class::Colon turns a string into a set of objects. Note the heavy use of methods through their previously-entered symbol table names.

sub objectify {
    my $class    = shift;
    my $string   = shift;
    my $config   = $simulated_classes{$class};
    my $col_list = $config->{ATTRS};

    my $new_object = $class->NEW();
    my @cols       = split /$config->{DELIM}/, $string;
    foreach my $i (0 .. @cols - 1) {
        my $method = $col_list->[$i];
        $new_object->$method($cols[$i]);
    }
    return $new_object;
}

All fabricated classes share this method (and the other class methods of Class::Colon).

Recall that NEW returns a blessed hash reference with nothing in it. In objectify, the loop fills in the attributes by calling their accessors. This ensures the proper construction of any object attributes. Callers access objectify indirectly when they call READ_FILE and its cousins. They can also use it directly through its OBJECTIFY alias.

Summary

By making entries into symbol tables, you can create aliases for data that is hard to name. Further, you can create new symbol tables simply by referring to them. This allows you to build classes on the fly. Modules like Class::DBI::mysql and Class::Colon do this to provide classes representing tabular data.

There are other uses of these techniques. For example, Memoize wraps an original function with a cache scheme, storing the wrapped version in place of the original in the caller's own symbol table. For functions which return the same result whenever the arguments are the same, this can save time. Exporter does even more sophisticated work to pollute the caller's symbol table with symbols from a used package. At heart, these schemes are similar to the one shown above. By carefully performing symbol table manipulations in modules, you can often greatly simplify an API, making client code easier to read, write, and maintain.

Visit the home of the Perl programming language: Perl.org

Sponsored by

Powered by Movable Type 5.02