December 2005 Archives

A Timely Start

This article is a follow-up to the lightning talk I delivered at the recent YAPC::Europe 2005 in Braga, Portugal: "Perl Vs. Korn Shell: 1-1." I presented two case studies taken from my experience as a Perl expert at the Agency (they have asked me to remain discreet about their real name, but rest assured, it's not the CIA; it's a European agency).

In the first case, I explained how I had rewritten a heavily used Korn shell script in Perl, in the process bringing its execution time from a half-hour down to ten seconds. The audience laughed and applauded loudly at that point of my talk.

Then I proceeded to the second case, where the situation was not so rosy for Perl. One of my colleagues had rewritten a simple ksh script that did a few computations and then transferred a file over FTP. The rewrite in Perl ran ten times slower.

Slower?

My job was to investigate. I found a couple of obvious ways of speeding up the Perl script. First, I removed hidden calls to subprocesses such as these (each function imported from Shell runs an external command in a subprocess):

use Shell qw( date which );
# ...
$now   = date();                 # forks an external `date` process
$where = which 'some_command';   # forks an external `which` process
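
Replacing these with pure Perl avoids forking a process for every call. Here is a minimal sketch of such a replacement; the strftime format and the find_in_path helper are only illustrative, not what the Agency's script actually used:

use POSIX qw( strftime );
use File::Spec;

# Format the current time in Perl instead of running `date`.
my $now = strftime( '%a %b %e %H:%M:%S %Y', localtime );

# Scan PATH ourselves instead of running `which`.
sub find_in_path {
    my $command = shift;
    for my $dir ( File::Spec->path ) {
        my $candidate = File::Spec->catfile( $dir, $command );
        return $candidate if -x $candidate;
    }
    return;
}

my $where = find_in_path( 'some_command' );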

Then I replaced a few use directives with calls to require. If you're going to use the features contained in a module only conditionally, require may be preferable, because perl always executes use directives at compile time.

Take this code:

use Pod::Usage;

# $help set from command-line option
if ($help) {
    pod2usage(-exitstatus => 0);
}

This loads Pod::Usage, even if $help is false. The following change won't help, though:

# $help set from command-line option
if ($help) {
    use Pod::Usage;
    pod2usage(-exitstatus => 0);
}

This just gives the illusion that the program loads Pod::Usage only when necessary. The right answer is:

# $help set from command-line option
if ($help) {
    require Pod::Usage;
    Pod::Usage::pod2usage(-exitstatus => 0);
}

After these changes the situation had improved: the Perl version was now only five times slower. I pulled out the profiler. Sadly, most of the time still went to loading modules, especially Net::FTP. Of course, that one we always needed.

If you dig in, you realize that Net::FTP is not the only culprit. It loads other modules that in turn load other modules, and so on. Here is the complete list:

/.../libnet.cfg
Carp.pm
Config.pm
Errno.pm
Exporter.pm
Exporter/Heavy.pm
IO.pm
IO/Handle.pm
IO/Socket.pm
IO/Socket/INET.pm
IO/Socket/UNIX.pm
Net/Cmd.pm
Net/Config.pm
Net/FTP.pm
SelectSaver.pm
Socket.pm
Symbol.pm
Time/Local.pm
XSLoader.pm
strict.pm
vars.pm
warnings.pm
warnings/register.pm

This is not by itself a bad thing: modules are there to promote code reuse, after all, so in theory, the more modules a module uses, the better. It usually shows that it's not reinventing wheels. Modules are good and account for much of Perl's success. You can solve so many problems in ten lines: use this, use that, use this too, grab this from CPAN, write three lines yourself--bang! problem solved.

Why Was Perl Slower?

In my case, though, the sad truth was that the Perl version of the FTP script spent 95 percent of its time getting started; the meat of the script, the code written by my colleague, accounted for only five percent of the execution time. The ksh version, on the other hand, started to work on the job nearly immediately.

My lightning talk ended with the conclusion that Perl needs a compiler--badly. In the final slides I addressed some popular objections to this statement, the most important one being that "it won't make your code run faster." I know: I want it to start faster.

Several hackers came for a discussion after my talk. Some of their objections were interesting but missed the point. ("The total execution time is good enough." Me: "It's not up to me to decide, there are specs; and the ksh version is five times faster. That's what I'm talking about.")

However, other remarks suggested that some stones remained unturned. One of the attendees timed perl -MNet::FTP -e 0 in front of me, and indeed it loaded quite fast on his laptop--faster than on the Agency's multi-CPU, 25,000-euro computers. How come?

The answer turned out to be the length of PERL5LIB. The Agency has collections of systems that build upon one another. Each system may contain one or several directories hosting Perl modules. The build system produces a value for PERL5LIB that includes the Perl module directories for the system, followed by the module directories of all of the systems it builds upon, recursively.

I wrote a small module, Devel::Dependencies, which uses BEGIN and CHECK blocks to find all of the modules that a Perl program loads. Optionally, it lists the path to each module, the position of that path within @INC, and the sum of all of the positions at the end. This gives a good idea of the number of directory searches Perl has to perform when loading modules.
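
The idea behind the module is easy to sketch. The following is not its actual implementation, just a minimal illustration of the technique: a CHECK block runs after compilation, when %INC already maps every loaded module to the file it came from, so walking @INC for each entry shows how many directory lookups each load implied. Compiling a script with perl -c -MDevel::DependencySketch would then print a listing similar to the one below.

package Devel::DependencySketch;   # hypothetical name, not the real module
use strict;
use warnings;

CHECK {
    my $total = 0;
    for my $module ( sort keys %INC ) {
        my $position;
        for my $i ( 0 .. $#INC ) {
            if ( -e "$INC[$i]/$module" ) {   # first directory that holds it
                $position = $i + 1;          # 1-based, like the listing below
                last;
            }
        }
        next unless defined $position;       # skip entries keyed by full path
        $total += $position;
        print "  $module $INC{$module} ($position)\n";
    }
    print "Total directory searches: $total\n";
}

1;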

I used it on a one-line script that just says use Net::FTP. Here's the result:

$ echo $PERL5LIB
XxB:
/build/MAP/CONFIG!11.213/idl/stubs:
/build/MAP/CONFIG!11.213/perl/lib:
/build/MAP/CONFIG!11.213/pm:
/build/LIB/UTILS!11.151/ftpstuff:
/build/LIB/UTILS!11.151/alien/PA-RISC2.0:
/build/LIB/UTILS!11.151/alien:
/build/LIB/UTILS!11.151/perl/lib:
/build/LIB/UTILS!11.151/pm:
/build/MAP/MAP_SERVER!11.197/pm:
/build/CONTROL/KLIBS!11.132/idl/stubs:
/build/CONTROL/KLIBS!11.132/perl/lib:
/build/CONTROL/KLIBS!11.132/pm:
/build/CONTROL/ACME!11.177/idl/stubs:
/build/CONTROL/ACME!11.177/pm:
/build/CONTROL/ADAPTER!11.189/pm:
/build/MAP/HMI!1.12.165/idl/stubs:
/build/CONTROL/KERNEL_INIT!11.130/pm:
/build/CONTROL/BUSINESS!11.176/pm:
/build/CONTROL/KERNEL!11.130/pm:
/build/BASIC/SSC!11.78/pm:
/build/BASIC/LOG!11.78/perl/lib:
/build/BASIC/LOG!11.78/pm:
/build/BASIC/BSC!11.77/pm:
/build/BASIC/POLYORB!11.39/idl/stubs:
/build/BASIC/TEST!11.58/perl/lib/ldap/interface:
/build/BASIC/TEST!11.58/perl/lib/ldap:
/build/BASIC/TEST!11.58/perl/lib:
/build/BASIC/TEST!11.58/pm:
/build/BASIC/TANGRAM!11.53/perl/lib:
/build/BASIC/TANGRAM!11.53/pm:
/build/LIB/PERL!5.6.1.1.39/lib/site_perl:
/build/LIB/PERL!5.6.1.1.39/lib:
/build/LIB/PERL!5.6.1.1.39/alien:
/build/LIB/PERL!5.6.1.1.39/pm:
/build/LIB/GNU!11.30/pm:
/build/CM/LIB!2.1.221/perl/lib/wle/interface:
/build/CM/LIB!2.1.221/perl/lib:
/build/CM/LIB!2.1.221/alien:
/build/CM/LIB!2.1.221/pm:
/build/CM/CM_LIB!2.1.96/perl/lib/MakeCfg:
/build/CM/CM_LIB!2.1.96/perl/lib:
/build/CM/CM_LIB!2.1.96/pm:
/build/CM/EXT_LIBS!2.1.6/perl/lib:
/build/CM/EXT_LIBS!2.1.6/alien:
/build/CM/EXT_LIBS!2.1.6/pm:
XxE

$ perl -c -MDevel::Dependencies=origin ftp.pl
Devel::Dependencies 23 dependencies:
  /build/LIB/UTILS!11.151/ftpstuff/Net/libnet.cfg /build/LIB/UTILS!11.151/ftpstuff/Net/libnet.cfg (6)
  Carp.pm /build/LIB/PERL!5.6.1.1.39/lib/Carp.pm (38)
  Config.pm /build/LIB/PERL!5.6.1.1.39/lib/PA-RISC2.0/Config.pm (37)
  Errno.pm /build/LIB/PERL!5.6.1.1.39/lib/PA-RISC2.0/Errno.pm (37)
  Exporter.pm /build/LIB/PERL!5.6.1.1.39/lib/Exporter.pm (38)
  Exporter/Heavy.pm /build/LIB/PERL!5.6.1.1.39/lib/Exporter/Heavy.pm (38)
  IO.pm /build/LIB/PERL!5.6.1.1.39/lib/PA-RISC2.0/IO.pm (37)
  IO/Handle.pm /build/LIB/PERL!5.6.1.1.39/lib/PA-RISC2.0/IO/Handle.pm (37)
  IO/Socket.pm /build/LIB/PERL!5.6.1.1.39/lib/PA-RISC2.0/IO/Socket.pm (37)
  IO/Socket/INET.pm /build/LIB/PERL!5.6.1.1.39/lib/IO/Socket/INET.pm (38)
  IO/Socket/UNIX.pm /build/LIB/PERL!5.6.1.1.39/lib/IO/Socket/UNIX.pm (38)
  Net/Cmd.pm /build/LIB/UTILS!11.151/ftpstuff/Net/Cmd.pm (6)
  Net/Config.pm /build/LIB/UTILS!11.151/ftpstuff/Net/Config.pm (6)
  Net/FTP.pm /build/LIB/UTILS!11.151/ftpstuff/Net/FTP.pm (6)
  SelectSaver.pm /build/LIB/PERL!5.6.1.1.39/lib/SelectSaver.pm (38)
  Socket.pm /build/LIB/PERL!5.6.1.1.39/lib/PA-RISC2.0/Socket.pm (37)
  Symbol.pm /build/LIB/PERL!5.6.1.1.39/lib/Symbol.pm (38)
  Time/Local.pm /build/LIB/PERL!5.6.1.1.39/lib/Time/Local.pm (38)
  XSLoader.pm /build/LIB/PERL!5.6.1.1.39/lib/PA-RISC2.0/XSLoader.pm (37)
  strict.pm /build/LIB/PERL!5.6.1.1.39/lib/strict.pm (38)
  vars.pm /build/LIB/PERL!5.6.1.1.39/lib/vars.pm (38)
  warnings.pm /build/LIB/PERL!5.6.1.1.39/lib/warnings.pm (38)
  warnings/register.pm /build/LIB/PERL!5.6.1.1.39/lib/warnings/register.pm (38)
Total directory searches: 739
ftp.pl syntax OK

Now it looked like I had found the wasted time.

Making Perl Faster

Still pushing, I wrote a script that consolidates @INC. It creates a directory tree containing the union of all of the directory trees found in @INC and then populates them with symlinks to the .pm files. I could replace the lengthy PERL5LIB with one that contained just one directory. Here's the resulting dependency listing:

ler@cougar: cm_perl -I lib/MAP/CONFIG -c -MDevel::Dependencies=origin ftp.pl
  Devel::Dependencies 23 dependencies:
  Carp.pm lib/MAP/CONFIG/Carp.pm (2)
  Config.pm lib/MAP/CONFIG/PA-RISC2.0/Config.pm (1)
  Errno.pm lib/MAP/CONFIG/PA-RISC2.0/Errno.pm (1)
  Exporter.pm lib/MAP/CONFIG/Exporter.pm (2)
  Exporter/Heavy.pm lib/MAP/CONFIG/Exporter/Heavy.pm (2)
  IO.pm lib/MAP/CONFIG/PA-RISC2.0/IO.pm (1)
  IO/Handle.pm lib/MAP/CONFIG/PA-RISC2.0/IO/Handle.pm (1)
  IO/Socket.pm lib/MAP/CONFIG/PA-RISC2.0/IO/Socket.pm (1)
  IO/Socket/INET.pm lib/MAP/CONFIG/IO/Socket/INET.pm (2)
  IO/Socket/UNIX.pm lib/MAP/CONFIG/IO/Socket/UNIX.pm (2)
  Net/Cmd.pm lib/MAP/CONFIG/Net/Cmd.pm (2)
  Net/Config.pm lib/MAP/CONFIG/Net/Config.pm (2)
  Net/FTP.pm lib/MAP/CONFIG/Net/FTP.pm (2)
  SelectSaver.pm lib/MAP/CONFIG/SelectSaver.pm (2)
  Socket.pm lib/MAP/CONFIG/PA-RISC2.0/Socket.pm (1)
  Symbol.pm lib/MAP/CONFIG/Symbol.pm (2)
  Time/Local.pm lib/MAP/CONFIG/Time/Local.pm (2)
  XSLoader.pm lib/MAP/CONFIG/PA-RISC2.0/XSLoader.pm (1)
  lib/MAP/CONFIG/Net/libnet.cfg lib/MAP/CONFIG/Net/libnet.cfg (2)
  strict.pm lib/MAP/CONFIG/strict.pm (2)
  vars.pm lib/MAP/CONFIG/vars.pm (2)
  warnings.pm lib/MAP/CONFIG/warnings.pm (2)
  warnings/register.pm lib/MAP/CONFIG/warnings/register.pm (2)
Total directory searches: 39
ftp.pl syntax OK

Why does Perl find Carp.pm in the second directory, considering Perl should search the directory passed via -I first? perl -V gives the answer:

    (extract)
  @INC:
    lib/MAP/CONFIG/PA-RISC2.0
    lib/MAP/CONFIG
    XxB
    /build/LIB/UTILS!11.162/ftpstuff/PA-RISC2.0
    /build/LIB/UTILS!11.162/ftpstuff
    /build/LIB/UTILS!11.162/alien/PA-RISC2.0

Under some circumstances, Perl adds architecture-specific paths to @INC; for more information on this, see the description of PERL5LIB in the perlrun manpage.

Finally I timed the ftp.pl program twice: with the normal PERL5LIB and with the consolidated PERL5LIB. Here are the results (u stands for "user time," s for "system time," and u+s is the sum; times are in seconds):

Running ft_ftp.pl..
47-element PERL5LIB  : u: 0.07 s: 0.20 u+s: 0.27
Consolidated PERL5LIB: u: 0.05 s: 0.04 u+s: 0.09

Therefore, I recommended incorporating the consolidation script as part of the process that builds the various systems.
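
For illustration, a minimal version of such a consolidation script might look like the following. It is only a sketch, not the Agency's actual tool (a real one would also handle configuration files and shared objects): it walks every directory in PERL5LIB, recreates the union of their trees under a single target directory, and symlinks each .pm file into place.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Path qw( mkpath );
use File::Spec;

my $target = shift @ARGV or die "usage: $0 target_dir\n";

for my $dir ( split /:/, $ENV{PERL5LIB} || '' ) {
    next unless -d $dir;
    find( { no_chdir => 1, wanted => sub {
        return unless -f $File::Find::name && $File::Find::name =~ /\.pm$/;
        my $rel  = File::Spec->abs2rel( $File::Find::name, $dir );
        my $dest = "$target/$rel";
        ( my $destdir = $dest ) =~ s{/[^/]+$}{};
        mkpath( $destdir ) unless -d $destdir;
        # First .pm found wins, mirroring Perl's left-to-right search of @INC.
        symlink( File::Spec->rel2abs( $File::Find::name ), $dest )
            unless -e $dest;
    } }, $dir );
}

Pointing a single -I switch (or a one-entry PERL5LIB) at the target directory then leaves Perl only one or two directories to search.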

Conclusion

It may seem silly to have a PERL5LIB that contains 47 directories. On the other hand, that kind of situation naturally arises once you try to use Perl in complex developments such as the Agency's. After all, Perl "is a real programming language," we like to say, so why can't it do what C++ or Ada can do?

I still think that we need a Perl compiler. The problem is not the length of PERL5LIB, it's the fact that Perl processes it each time it runs the script. My workaround, in effect, amounts to "compiling" a fast Perl lib.

Logic Programming with Perl and Prolog

Computing languages can be addictive; developers sometimes blame themselves for a language's perceived inadequacies and make apologies for them--at least when defending their language of choice against the criticism of another language's devotee. Regardless, many programmers prefer one language, and they typically ground that preference in a respect for that language's strengths.

Perl has many strengths, but the two most often cited are its adaptability and its propensity to work as "glue" between applications and/or data. However, Perl isn't the only language with strengths worth borrowing: programmers have used C or even assembly to gain speed for years, and intelligent use of SQL allows the keen coder to offload difficult data manipulations onto a database, for example. Prolog is an often-overlooked gem that, combined with the flexibility of Perl, affords the coder powerful ways to address logical relationships and rules. In this article, I hope to provide a glimpse of the benefits that embedded Prolog offers to Perl programmers. Moreover, I hope that my example implementation demonstrates the ease with which one can address complex logical relationships.

A Bit About Prolog

For the sake of demonstration, I would like to frame a simple problem and solution that illustrate the individual strengths of Perl and Prolog, respectively. However, while I anticipate that the average reader will be familiar with the former, he or she may not be as familiar with the latter. Prolog is a logic programming language often used in AI work, based upon predicate calculus and first developed in 1972. There are several excellent, free versions of Prolog available today, including GNU Prolog and the popular SWI Prolog. For the Prolog initiate, I recommend checking out some of the free Prolog tutorials, either those linked from Wikipedia or from OOPWeb.

Prolog and Perl aren't exactly strangers, however. There are several excellent Perl modules available to allow the coder to access the power of Prolog quite easily, including the SWI module developed by Robert Barta, the Interpreter module by Lee Goddard, the Yaswi modules developed by Salvador Fandino Garcia, and the AI::Prolog module written by Curtis "Ovid" Poe. Poe has also recently provided a rather nice introduction to Prolog-in-Perl in an online-accessible format.

The Problem

There are many advantages to using Prolog within Perl. In the general sense, each language has its own advantages, and can thus complement the other. Suppose that I am building a testing harness or a logic-based query engine for a web application, where neither language easily provides all of the features I need. In cases such as these, I could use Prolog to provide the logic "muscle," and Perl to "glue" things together with its flexibility and varied, readily available modules on CPAN.

In my simple demonstration, I am going to posit the requirement that I take genealogical information built by another application and test relationships based upon a set of rules. In this case, the rules are defined in a Prolog file (an interesting intersection here is that both Perl and Prolog typically use the suffix .pl), while the genealogical information is contained in a Dot file readable by Graphviz. As such, I am going to make certain assumptions about the format of the data. Next, I am going to assume that I will have a query (web-based, or from yet another application) that will allow users to identify relationships (such as brothers, cousins, etc.).

Here are my Prolog rules:

is_father(Person)        :- is_parent(Person, _),
                            is_male(Person).
is_father(Person, Child) :- is_parent(Person, Child),
                            is_male(Person).

is_mother(Person)        :- is_parent(Person, _),
                            is_female(Person).
is_mother(Person, Child) :- is_parent(Person, Child),
                            is_female(Person).

ancestor(Ancestor, Person) :- is_parent(Ancestor, Person).
ancestor(Ancestor, Person) :- is_parent(Ancestor, Child),
                              ancestor(Child, Person).

is_sibling(Person, Sibling) :- is_parent(X, Person),
                               is_parent(X, Sibling).

is_cousin(Person, Cousin) :- is_parent(X, Person),
                             is_parent(Y, Cousin),
                             is_sibling(X, Y).

One advantage to separating my logic is that I can troubleshoot it before I even write the Perl code, loading the rules into a Prolog interpreter or IDE such as XGP (for Macintosh users) and testing them. However, AI::Prolog conveniently provides its own solution: by typing aiprolog at the command line, I can access a Prolog shell, load in my file, and run some tests.

At this point, however, I am mostly interested in accessing these rules from Perl. While there are several options for accessing Prolog from within Perl, the AI::Prolog module is perhaps the easiest to start with. Moreover, it is quite simple to use: you feed the rules that build the Prolog database to the constructor when creating the AI::Prolog object. Handing the constructor a filehandle is not currently supported, though it would be a nice improvement. While there are other ways to accomplish the task of reading in the data, such as calling the Prolog command consult, I will read in the Prolog file (ancestry.pl) and provide a string representation of the contents.

use AI::Prolog;

open( PROLOGFILE, 'ancestry.pl' ) or die "$! \n";
local $/;                          # slurp mode: read the whole file at once
my $prologRules = <PROLOGFILE>;
close( PROLOGFILE );

my $prologDB = AI::Prolog->new( $prologRules );

Now that I have loaded my Prolog database, I need to feed it some more information. I need to take my data, in Dot format, and translate it into something that my Prolog interpreter will understand. There are some modules out there that may be helpful, such as DFA::Simple, but since I can assume that my data will look a certain way--having written it from my other application--I will build my own simple parser. First, I am going to take a look at the data.

The visualization program created the diagram in Figure 1 from the code:

digraph family_tree {
   { jill [ color = pink ]
     rob  [ color = blue ] } -> { ann [ color = pink ]
                                  joe [ color = blue ] } ;

   { sue [ color = pink ] 
     dan [ color = blue ] } -> { sara [ color = pink ]
                                 mike [ color = blue ] } ;

   { nan [ color = pink ]
     tom [ color = blue ] } -> sue ;

   { nan
     jim [ color = blue ] } -> rob ;

   { kate  [ color = pink ]
     steve [ color = blue ] } -> dan ;

   { lucy  [ color = pink ]
     chris [ color = blue ] } -> jill ;
}

Figure 1. A family tree from the sample data

There are a few peculiarities worth mentioning here. First, the all-lowercase names may seem a bit strange, but I am already preparing for the convention that data in Prolog is typically lowercase. Also, I inserted an extra space before the semicolons to make matching them easier. While both of these conventions are easy to code around, doing so would raise extra questions when illustrating a point. Therefore, assume that the above Dot snippet illustrates the range of possible formats in the example. While real-world data may provide a richer set of possibilities, the fact that applications with defined behavior generated this data limits the edge cases.

Returning to the data, the easiest approach is to parse the Dot data with a simple state machine. First, I defined some constants to represent the parser's states:

use constant { modInit   => 0,
               modTag    => 1,
               modValue  => 2 };

Basically, I assume that anything on the left-hand side of the -> is a parent and anything on the right is a child. Additionally, modifiers (in this case only color) begin with a left square bracket; males have the blue modifier, whereas females are pink. I know that I have completed a parent-child relationship "block" when I hit the semicolon. Beyond these stipulations, if a token isn't a character I know I can safely ignore, then it must be a noun.

sub parse_dotFile {
   ##----------------------------------------
   ##  Examine data a word at a time
   ##----------------------------------------
   my @dotData = split( /\s+/, shift() );

   my ( $familyBlock, $personName, @prologQry ) = ();
   my $personModPosition                        = modInit;
   my $relationship                             = 'parent';

   for ( my $idx = 3; $idx < @dotData; $idx++ ) {
      chomp( $dotData[$idx] );

      SWITCH: {

         ## ignore
         if ( $dotData[ $idx ] =~ /[{}=\]]/ ) {
            last SWITCH; }

         ## begin adding attributes
         if ( $dotData[ $idx ] eq '[' ) {
            $personModPosition = modTag;
            last SWITCH; }

         ## switch from parents to children
         if ( $dotData[ $idx ] eq '->' ) {
            $relationship = 'child';
            last SWITCH; }

         ## end of this block
         if ( $dotData[ $idx ] =~ /\;/ ) {
           ##-----------------------------------------
           ##  Generate is_parent rules for Prolog
           ##-----------------------------------------
            foreach my $parentInBlock ( @{ $familyBlock->{ parent } } ) {
               foreach my $childInBlock ( @{ $familyBlock->{ child } } ) {
                  push( @prologQry,
                      "is_parent(${parentInBlock}, ${childInBlock})" );
               }
            }
            $familyBlock = ();
            $relationship = 'parent';
            last SWITCH; }

         ## I have a noun, need to set something
         else {

            ## I have a modifier tag, next is the value
            if ( $personModPosition == modTag ) {
               $personModPosition = modValue;
               last SWITCH;

            } elsif ( $personModPosition == modValue ) {
                 ##--------------------------------------
                 ##  Set modifier value and reset
                 ##  We currently assume it is color
                 ##--------------------------------------
               if ( $dotData[ $idx ] eq 'blue' ) {

                  push( @prologQry, "is_male(${personName})" );
               } else {
                  push( @prologQry, "is_female(${personName})" );
               }
               $personModPosition = modInit;
               $personName        = ();
               last SWITCH;
            } else {
                 ##--------------------------------------
                 ##  Grab the name and id as parent or child
                 ##--------------------------------------
               $personName = $dotData[ $idx ];
               push( @{ $familyBlock->{ $relationship } }, $personName );
            }
         }
      }
   }

   return( \@prologQry );
}

Rather than simply pushing my new rules into the Prolog interpreter directly, I return an array that contains the full ruleset. I am doing this so that I can easily dump it to a file for troubleshooting purposes. I can simply write the rules to a file, and consult this file in a Prolog shell.
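
As a sketch, the dump itself needs only a few lines; the file name and the $dotFileContents variable here are placeholders, not part of the real script. Each fact gets a trailing period so that a Prolog shell can consult the file directly.

my $facts = parse_dotFile( $dotFileContents );   # arrayref, as returned above

open( DEBUGFILE, '>', 'generated_facts.pl' ) or die "$! \n";
print DEBUGFILE "$_.\n" for @$facts;             # one terminated fact per line
close( DEBUGFILE );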

With a subroutine to parse my Dot file into Prolog rules, I can now push those rules into the interpreter:

   ##-------------------------------------------
   ##  Read in Dot file containing relations
   ##  and feed it into the Prolog instance
   ##-------------------------------------------
   open( DOTFILE, 'family_tree.dot' ) or die "$! \n";
   my $parsedDigraph = parse_dotFile( <DOTFILE> );
   close( DOTFILE );

   foreach ( @$parsedDigraph ) {
      $prologDB->do("assert($_).");
   }

Now I can easily query my Prolog database using the query method in AI::Prolog:

   ##-------------------------------------------
   ##  Run the query
   ##-------------------------------------------
   $prologDB->query( "is_cousin(joe, sara)." );
   while (my $results = $prologDB->results) { print "@$results\n"; }
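
The same loop also works when the query contains a variable. As a sketch, using the database built above, this would enumerate every binding of Cousin that the rules can prove for joe:

   $prologDB->query( "is_cousin(joe, Cousin)." );
   while (my $results = $prologDB->results) { print "@$results\n"; }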

What Next?

Even though this is a trivial example, I think that it provides an idea of the powerful ways in which Perl can be supplemented with Prolog. Just within the context of evaluating genealogical data (a mainstay of Prolog tutorials and examples), it seems that a Perl/Prolog application that uses genealogical data from open source genealogical software or websites would be a killer application. The possibilities seem endless: rules based upon Google maps, mining information from online auctions or news services, or even harvesting information for that new test harness are all tremendous opportunities for the marriage of Perl and Prolog.

Testing Files and Test Modules

For the last several years, there has been more and more emphasis on automated testing. No self-respecting CPAN author can post a distribution without tests. Yet some things are hard to test. This article explains how writing Test::Files gave me a useful tool for validating one module's output and taught me a few things about the current state of Perl testing.

Introduction

My boss put me to work writing a moderately large suite in Perl. Among many other things, it needed to perform check out and commit operations on CVS repositories. In a quest to build quality tests for that module, I wrote Test::Files, which is now on CPAN. This article explains how to use that module and, perhaps more importantly, how it tests itself.

Using Test::Files

To use Test::Files, first use Test::More and tell it how many tests you want to run.

use strict;
use warnings;
use Test::More tests => 5;
use Test::Files;

After you use the module, there are four things it can help you do:

  • Compare one file to a string or to another file.
  • Make sure that directories have the files you expect them to have.
  • Compare all the files in one directory to all the files in another directory.
  • Exclude some things from consideration.

Single Files

In the simplest case, you have written a file. Now it is time to validate it. That could look like this:

file_ok($file_name, "This is the\ntext\n",
    "file one contents");

The file_ok function takes two (or optionally, and preferably, three) arguments. The first is the name of the file you want to validate. The second is a text string containing the text that should be in the file. The third is the name of the test. In the rush of writing, I'm likely to fail to mention the test names at some point, so let me say up front that all of the tests shown here take a name argument. Including a name makes finding the test easier.

If the file agrees with the string, the test passes with only an OK message. Otherwise, the test will fail and diagnostic messages will show where the two differed. The diagnostic output is really the reason to use Test::Files.

Some, including myself, prefer to check one file against another: I put one version in the distribution, and my tests write the other. To compare two files, use:

compare_ok($file1, $file2, $name);

As with file_ok, if the files are the same, Test::Files only reports an OK message. Failure shows where the files differ.

Directory Structures

Sometimes, you need to validate that certain files are present in a directory. Other times, you need to make that check exclusive, so that only known files are present. Finally, you might want to know not only that the directory structure is the same, but that the files contain the same data.

To look for some files in a directory by name, write:

dir_contains_ok($dir, [qw(list files here)], $name);

This will succeed, even if the directory has some other files you weren't looking for.

To ensure that your list is exclusive, add only to the function name:

dir_only_contains_ok($dir, [qw(list all files here)], $name);

Both of these report a list of the missing files if they fail because of them. The exclusive form also reports a list of unexpected files, if it sees any.

Directory Contents

If knowing that certain file names are present is not enough, use the compare_dirs_ok function to check the contents of all files in one directory against files in another directory. A typical module might build one directory during make test, with the other built ahead of time and shipped with the distribution.

compare_dirs_ok($test_built, $shipped, $name);

This will generate a separate diagnostic diff output for each pair of files that differs, in addition to listing files that are missing from either distribution. (If you need to know which files are missing from the built directory, either reverse the order of the directories or use dir_only_contains_ok in addition to compare_dirs_ok. This is a bug and might eventually be fixed.) Even though this could yield many diagnostic reports, all of those separate failures only count as one failed test.

There are many times when testing all files in the directories is just wrong. In these cases, it is best to use File::Find or an equivalent, putting an exclusion criterion at the top of your wanted function and a call to compare_ok at the bottom. This probably requires you to use no_plan with Test::More:

use Test::More qw(no_plan);

Test::More wants to know the exact number of tests you are about to run. If you tell it the wrong number, the test harness will think something is wrong with your test script, causing it to report failures. To avoid this confusion, use no_plan--but keep in mind that plans are there for a reason. If your test dies, the plan lets the harness know how many tests it missed. If you have no_plan, the harness doesn't always have enough information to keep score. Thus, you should put such tests in separate scripts, so that the harness can count your other tests properly.
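
A minimal sketch of that pattern might look like the following; the directory names and the CVS-directory exclusion rule are only illustrative:

use strict;
use warnings;
use File::Find;
use Test::More qw(no_plan);
use Test::Files;

my $built   = 't/built';      # hypothetical directory created by the tests
my $shipped = 't/shipped';    # hypothetical directory shipped with the distro

find( { no_chdir => 1, wanted => sub {
    return unless -f $File::Find::name;           # only compare plain files
    return if $File::Find::name =~ m{/CVS/};      # exclusion criterion
    ( my $rel = $File::Find::name ) =~ s{^\Q$built\E/?}{};
    compare_ok( $File::Find::name, "$shipped/$rel", "contents of $rel" );
} }, $built );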

Filtering

While the above list of functions seemed sufficient during planning, reality set in as soon as I tried it out on my CVS module. I wanted to compare two CVS repositories: one ready for shipment with the distribution, the other built during testing. As soon as I tried the test it failed, not because the operative parts of the module were not working, but because the CVS timestamps differed between the two versions.

To deal with cosmetic differences that should not count as failures, I added two functions to the above list: one for single files and the other for directories. These new functions accept a code reference that receives each line prior to comparison. It performs any needed alterations, and then returns a line suitable for comparison. My example function below redacts the offending timestamps. With the filtered versions in place, the tests pass and fail when they should.

My final tests for the CVS repository directories look like this:

compare_dirs_filter_ok(
    't/cvsroot/CVSROOT',
    't/sampleroot/CVSROOT',
    \&chop_dates,
    "make repo"
);

The code reference argument comes between the directory names and the test name. The chop_dates function is not particularly complex. It removes two kinds of dates favored by CVS, as shown in its comments.

sub chop_dates {
    my $line =  shift;

    #  2003.10.15.13.45.57 (year month day hour minute sec)
    $line    =~ s/\d{4}(\.\d\d){5}//;

    #  Thu Oct 16 18:00:28 2003
    $line    =~ s/\w{3} \w{3} \d\d? \d\d:\d\d:\d\d \d{4}//;

    return $line;
}

This shows the general behavior of filters. They receive a line of input which they must not directly change. Instead, they must return a new, corrected line.

In addition to compare_dirs_filter_ok for whole directory structures, there is also compare_filter_ok, which works similarly for single file comparisons. (There is no file_filter_ok, but maybe there should be.)
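
For example, using the same chop_dates filter shown above, a single-file check that ignores the timestamps could look like this (the file names are only illustrative):

compare_filter_ok(
    't/cvsroot/CVSROOT/history',
    't/sampleroot/CVSROOT/history',
    \&chop_dates,
    "history file, dates ignored"
);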

Testing a Test Module

The most interesting part of writing Test::Files was learning how to test it. Thanks to Schwern, I learned about Test::Builder::Tester, which eases the problems inherent in testing a Perl test module.

The difficulty with testing Perl tests has to do with how they normally run. The venerable test harness scheme expects test scripts to produce pass and fail data on standard out and diagnostic help on standard error. This is a great design. The simplicity is exactly what you would expect from a Unix-inspired tool. Yet, it poses a problem for testing test modules.

When users eventually run tests built with the test module, their harness expects those tests to write quite specific things to standard out and standard error. Among the things that must go to standard out is a sequence of lines such as ok 1. When you write a test of the test module, its harness also expects to see this sort of data on standard out and standard error. Having two different sources of ok 1 is highly confusing, not least to the harness, which chokes on such duplications.

Test module writers need a scheme to trap the output from the module being tested, check it for correct content, and report that result on the actual standard channels for the harness to see. This is tricky, requiring careful diversion of file handles at the right moments, without the knowledge of the module whose output is diverted. Doing this by hand is inelegant and prone to error. Further, multiple test scripts might have to recreate home-rolled solutions (introducing the oldest of known coding sins: duplication of code). Finally, when tests of the test module fail, the diagnostic output from homemade diverters is unlikely to be helpful.

Enter Test::Builder::Tester.

To help us test testers, Mark Fowler collected some code from Schwern, and used it to make Test::Builder::Tester. With it, tests of test modules are relatively painless and their failure diagnostics are highly informative. Here are two examples from the Test::Files test suite. The first shows a file comparison that should pass:

test_out("ok 1 - passing file");
compare_ok("t/ok_pass.dat", "t/ok_pass.same.dat",
    "passing file");
test_test("passing file");

This test should work, generating ok 1 - passing file on standard output. To tell Test::Builder::Tester what the standard output should be, I called test_out. After the test, I called test_test with only the name of my test. (To avoid confusion, I made the test names the same.)

Between the call to test_out and the one to test_test, Test::Builder::Tester diverted the regular output channels so the harness won't see them.

The second example shows a failed test and how to check both standard out and standard error. The latter contains the diagnostic data the module should generate.

test_out("not ok 1 - failing file");
$line = line_num(+9);
test_diag("    Failed test (t/03compare_ok.t at line $line)",
'+---+--------------------+-------------------+',
'|   |Got                 |Expected           |',
'| Ln|                    |                   |',
'+---+--------------------+-------------------+',
'|  1|This file           |This file          |',
'*  2|is for 03ok_pass.t  |is for many tests  *',
'+---+--------------------+-------------------+'  );
compare_ok("t/ok_pass.dat", "t/ok_pass.diff.dat",
    "failing file");
test_test("failing file");

Two new functions appear here. First, line_num returns the current line number plus or minus an offset. Because failing tests report the line number of the failure, checking standard error for an exact match requires matching that number. Yet, no one wants his tests to break because he inserted a new line at the top of the script. With line_num, you can obtain the line number of the test relative to where you are. Here, there are nine lines between the call to line_num and the actual test.

The other new function is test_diag. It allows you to check the standard error output, where diagnostic messages appear. The easiest way to use it is to provide each line of output as a separate parameter.

Summary

Now you know how to use Test::Files and how to test modules that implement tests. There is one final way I use Test::Files: outside of module testing, any time I want to know how the contents of text files in two directory hierarchies compare. With this, I can quickly locate differences in archives, for example, enabling me to debug the builders of those archives. In one case, I used it to compare more than 400 text files in two WebSphere .ear archives. My program had only about 30 operative lines (there were also comments and blank lines) and performed the comparison in under five seconds. This is a testament to the leverage of Perl and CPAN.

(Since doing that comparison, I have moved to a new company. In the process I exchanged WebSphere for mod_perl and am generally happier with the latter.)

Perl Success Story: Client-Side Collection and Reporting

Accurate software inventory management is critical to any organization. Without an accurate software inventory, organizations may either be out of compliance with their vendor licensing agreements or they may be paying extra for licenses that they do not need.

Hitachi Global Storage Technologies (Hitachi GST) formed in 2003 as a result of the strategic combination of Hitachi and IBM's storage technology businesses. The company offers customers worldwide a comprehensive range of hard disk drives for desktop computers, high-performance servers, and mobile devices. With offices and manufacturing facilities spanning the globe, deployment of the corporate business intelligence (BI) tool suite is extensive. The company uses BI tools throughout the business, from manufacturing to sales and warranty operations, for analysis of data that reside in operational data stores and data warehouses. With each new manufacturing site comes requests for additional licenses.

A Business Intelligence Problem

To avoid unnecessary spending on licenses, the company needs a detailed assessment of both the number and type of licenses actually deployed at the various sites. We operate in a highly competitive segment of the technology industry and can ill afford to keep purchasing unnecessary licenses. An effort was started to determine how we had deployed the current pool of licenses. The initial effort to perform a software inventory for the BI tool involved a manual inventory by each site owner or application owner. As expected, the results returned appeared to indicate significant under-licensing of one of the products from the BI suite. Looking at the raw information sent in by each owner, we suspected that we had double-counted some users and that some users may have misreported the client installed on their systems, thus driving up the license count for the respective product.

Based on the preliminary analysis of the information collected, it became evident that we needed to deploy a software inventory tool to assess each workstation and determine what users have actually installed on their machines. The corporate software asset management tool was not a viable option, because it had limited deployment, with some organizations maintaining software inventory using manual methods. The licensing issues of deploying a commercial tool, along with the customizations needed in order to accurately capture the different types of licenses, would be prohibitive from both a time and cost perspective.

In order to generate an accurate license count, the software inventory tool must be able to detect the different types of BI clients that may be installed on a given PC. The release information of each version is also important. We can use this information to determine which users need upgrades to the latest supported release. Our license entitlement limits the number of each type of client from the BI suite that we can deploy across the enterprise. To further complicate matters, there are two different types of client-server tools that use the same binary, with no easy way to distinguish between them by looking at the installed applications listing in Windows. The difference in cost between these two tools is approximately $3,400 per license, so it is imperative that the inventory count is correct for each specific type of client. In addition, a web-based version of the BI tool is available as a plugin. However, the plugin version information is inaccurate in the binary file and the client-server tool version is incomplete in the registry. In preliminary tests, a commercial tool was unable to distinguish between the two types of clients and missed the web plugin altogether.

The Design of a Solution

Given this challenge, we had to develop a custom solution to address the problem. I had previously used Perl to develop an automated software delivery and installation solution on the Windows platform, and some of what I had already developed was very similar to what we needed. Leveraging this prior experience reduced the amount of time and effort needed for development of the software inventory solution to address our immediate needs.

The solution criteria that we set forth required that the client inventory tool must be relatively compact and not require any installation. The tool also must be a self-contained executable that a user can download and run on their client. If a tool is difficult to use or install, this severely curtails its adoption and use. The user population is spread across multiple locales, including Japan, South Korea, Taiwan, Singapore, Thailand, the United Kingdom, and the United States, so the tool must be simple to understand and must avoid complex procedures.

Each user also needs to authenticate against the corporate LDAP directory to ensure that their results are accurately recorded and ensure the authenticity of the result. In addition, the inventory results must be collected and stored on a central server to ease management of the inventory process. The results must also be readily accessible to all users and owners so they can see where they stand in comparison to what they had originally reported.

The solution involves a client tool and a server-side application. The combination of the two must authenticate the user against the corporate enterprise directory, inventory the client system, log the results, and record the results in a DB2 back-end database. The data stored in the DB2 back-end database is useful for generating both summary and detail reports via the Web. The specific modules used in the development of the solution included Win32::OLE, Win32::GUI, Win32::File::VersionInfo, Win32::TieRegistry, IO::Socket, CGI, DBI, DBD::DB2, and Net::LDAP.

Building the Client

Development started with the client component first. Because most end users are averse to command-line tools, we had to develop a GUI of sorts to guide the user through the authentication and inventory steps. In addition, the GUI could display error messages to the user. We used Win32::GUI to create the prompts as well as data entry dialog boxes. The GUI captures the information entered and passes it to the central server using HTTP.

All communications between the client and server use HTTP because coding the client to work directly with the LDAP server and the database server would have made troubleshooting any problems experienced by clients in distant locations inordinately difficult. In a situation such as this, we wanted a single controlled point of failure, as depicted in Figure 1. By using HTTP and a central-server-based application to broker requests, we can isolate problems on the client side to only HTTP-related transport issues.

Figure 1. The authentication process

The user that is running the inventory tool must authenticate first. This is important, as we need to track who has or has not reported their results based on the initial manual inventory information. We use Net::LDAP to perform the authentication against our LDAP server. The application prompts users for their intranet IDs and passwords, and the client passes this information to the server application so that it can issue a bind against the LDAP server. If the information provided by the user does not yield a successful bind, then the client requests that the user reenter his or her login information. Figure 1 depicts this authentication process.

To authenticate a user via LDAP using Net::LDAP, the server uses the user's intranet ID to prepare the distinguished name (DN). The DN is the unique identifier for the user's record in the LDAP directory. To perform the initial lookup, the server issues an anonymous bind against the LDAP directory and performs a search using the intranet ID provided. The search returns only the uid, which is the unique record identifier in our implementation of the corporate directory. Once the server obtains this, it forms a DN string and issues another bind, this time using the password provided. If that bind succeeds, the user is authenticated and the client proceeds to inventory the system.
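
A minimal sketch of that two-step bind with Net::LDAP follows; the host name, base DN, and attribute names are hypothetical, not Hitachi GST's actual directory layout:

use Net::LDAP;

sub ldap_authenticate {
    my ( $intranet_id, $password ) = @_;

    my $ldap = Net::LDAP->new( 'ldap.example.com' ) or return 0;

    # Step 1: anonymous bind, then look up the user's unique record.
    $ldap->bind;
    my $search = $ldap->search(
        base   => 'ou=people,dc=example,dc=com',
        filter => "(mail=$intranet_id)",
        attrs  => [ 'uid' ],
    );
    return 0 unless $search->count == 1;

    # Step 2: build the DN from the uid and re-bind with the password.
    my $uid  = $search->entry(0)->get_value( 'uid' );
    my $dn   = "uid=$uid,ou=people,dc=example,dc=com";
    my $mesg = $ldap->bind( $dn, password => $password );

    return $mesg->code == 0;    # 0 (LDAP_SUCCESS) means the login is valid
}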

Most software applications leave some type of signature in the system registry on a client system, and the BI tool suite deployed at Hitachi GST is no exception. I knew that the software leaves specific signatures when installed on a client system. This information is in the system registry under both the HKEY_LOCAL_MACHINE and HKEY_CURRENT_USER registry entries. We use Win32::TieRegistry to retrieve registry values for determining the type of client that is installed, as well as the binary install location and install key used. Win32::TieRegistry makes it extremely easy to access and modify registry values.
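
As an illustration, reading such a signature with Win32::TieRegistry takes only a few lines; the vendor key and value names below are hypothetical:

use Win32::TieRegistry ( Delimiter => '/' );

# Look under HKEY_LOCAL_MACHINE for the BI client's install information.
my $key = $Registry->{ 'LMachine/SOFTWARE/ExampleVendor/BIClient/' };
if ( $key ) {
    my $install_dir = $key->{ '/InstallDir' };   # leading "/" marks a value name
    my $install_key = $key->{ '/InstallKey' };
    print "Installed in $install_dir (key $install_key)\n"
        if defined $install_dir;
}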

The exact version of the BI client is not recorded accurately in the system registry, and we also want to determine the specific point release deployed on each client system. To achieve this, we use Win32::File::VersionInfo to extract the ProductVersion information from the binary files. Looking at the file properties in Windows Explorer for the BI web client binary, we could clearly see that the FileVersion information in the executable is incorrect, so we settled on using ProductVersion instead.
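
The extraction itself is short once the binary's path is known from the registry. A sketch, with a hypothetical path:

use Win32::File::VersionInfo;

my $info = GetFileVersionInfo( 'C:\\Program Files\\BIClient\\biclient.exe' );
if ( $info ) {
    # ProductVersion is reliable here; FileVersion is not, as noted above.
    print "ProductVersion: $info->{ProductVersion}\n";
}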

In addition to checking the various client versions that may be installed, the tool must also log the client machine serial number. We achieved this by using Win32::OLE and querying information from the Win32_BIOS WMI class. Because some users may have multiple workstations or laptops, we use the machine serial to differentiate multiple entries logged by a single user. The WMI classes yield a wealth of information. Extending the features of this tool simply requires querying additional WMI classes to obtain additional hardware information, such as hard disk, CPU, and memory.
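
Retrieving the serial number through WMI is equally brief. This sketch queries the Win32_BIOS class via Win32::OLE:

use Win32::OLE qw( in );

my $wmi = Win32::OLE->GetObject( 'winmgmts:\\\\.\\root\\cimv2' )
    or die "Cannot connect to WMI\n";

for my $bios ( in $wmi->InstancesOf( 'Win32_BIOS' ) ) {
    print "Serial number: ", $bios->{SerialNumber}, "\n";
}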

Deploying the Client

Once we developed and tested the client code, we had to package the script itself. Not every client machine has Perl installed. Even if they did have Perl installed, there is no guarantee that the appropriate modules needed by the application are available on the target systems. To package the final script into a single executable file, we used PerlApp from ActiveState's PDK to compile the script and the associated modules. This allowed us to neatly package and deploy the client as a single executable that users can run from their client system without performing any type of installation or setup.

When all of the requisite information has been collected on the client machine, the client passes it back to the server, which records it on its local file system first. This initial logging allows us to capture the client inventory results even if the database server is down, and the logged results are easy to import into the database should there be an unexpected database outage. The CGI module handles the information passed from the client to the server, and DBI and DBD::DB2 handle the database connection. DB2 is the RDBMS of choice in our environment, and we were able to leverage an existing DB2 database environment, which further reduced the time to deployment.
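
A minimal sketch of that server-side recording step might look like this; the log path, table, and column names are hypothetical:

use strict;
use warnings;
use CGI;
use DBI;

my $q = CGI->new;
my ( $user, $serial, $client, $version ) =
    map { scalar $q->param($_) } qw( user serial client version );

# 1. Always append to a local log first, so results survive a database outage.
open my $log, '>>', '/var/log/bi_inventory.log' or die "log: $!";
print {$log} join( "\t", scalar localtime, $user, $serial, $client, $version ), "\n";
close $log;

# 2. Then record the result in the DB2 back end.
my $dbh = DBI->connect( 'dbi:DB2:INVENTORY', 'dbuser', 'dbpass',
                        { RaiseError => 1, AutoCommit => 1 } );
$dbh->do( 'INSERT INTO bi_clients (userid, serial, client, version) VALUES (?, ?, ?, ?)',
          undef, $user, $serial, $client, $version );
$dbh->disconnect;

print $q->header( 'text/plain' ), "OK\n";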

Learning from the Process

All of the information that we collect is of little use if it is not easily accessible for analysis. We also wrote a script to retrieve and format information from the database and present it via the Web. This allowed users and application owners to see the current inventory status of their users, and also helped them determine which users still needed to run the client inventory tool to complete their assessment. The script provides summary, detail, and exception views. With reports available through the Web, users can access them from any web browser, regardless of geographic location. This empowers everyone involved to see the results in near real time.

Upon deployment of the tool, we recorded and summarized user inventory information automatically, and some interesting results came to light. It became apparent that up to $163,000 worth of additional license purchases were unnecessary: contrary to what the manual inventory results had led us to believe, the existing pool of licenses had not been depleted. Also, because the client inventory tool was developed internally, there were no licensing costs for distributing it to the various sites.

The client- and server-side components took two weeks to develop. We formed test cases before writing the code in order to minimize functional issues during final testing in the field, as well as to ensure that we did not overlook critical features and integration issues during the development process. Ultimately, as illustrated in Figure 2, the client and the various server components must work together as an integrated system.

Figure 2. The client and server working together

The use of virtual machine software was also instrumental for testing and reduced overall development time. It allowed me to simulate client environments and test client code without having to use a physical machine for each version of the BI application. Using undo disks, I was able to return the VMs to a prior state very quickly without manually uninstalling the various BI tools from the VM. We initially tested the inventory tool in Singapore, Japan, and the U.S., and this testing confirmed that the tool was able to perform well even though users were spread across remote geographic locations.

Although we developed this tool for a one-time inventory of the number of BI clients deployed in the field, it also has a role going forward in helping to maintain an accurate picture of the licenses deployed. Possible enhancements include converting the tool so that we can install it as a service to run without user intervention for periodic updates of client inventory data from identified clients, auto-conversion of licensing from one client class to another, and automatic uninstall of clients, should the user decide that he or she no longer needs it. In staying with the near-zero client management paradigm, we can extend the client to contain auto-update features by checking the local version and verifying it against the latest published version on a central server.
