An Introduction to Test::MockDBI

Prelude

How do you test DBI programs:

  • Without having to modify your current program code or environment settings?
  • Without having to set up multiple test databases?
  • Without separating your test data from your test code?
  • With tests for every bizarre value your program will ever have to face?
  • With complete control over all database return values, along with all DBI method return values?
  • With an easy, regex-based rules interface?

You test with Test::MockDBI, that's how. Test::MockDBI provides all of this by using Test::MockObject::Extends to mock up the entire DBI API. Without a solution like Test::MockDBI--a solution that enables direct manipulation of the DBI--you'll have to trace DBI methods through a series of test databases.

You can make test databases work, but:

  • You'll need multiple (perhaps many) databases when you need multiple sets of mutually inconsistent values for complete test coverage.
  • Some DBI failure modes are impossible to generate through any test database.
  • Depending on the database toolset available, it may be difficult to insert all necessary test values--for example, Unicode values in ASCII applications, or bizarre file types in a document-manager application.
  • Test databases, by definition, are separate from their corresponding test code. This increases the chance that the test code and the test data will fall out of sync with each other.

Using Test::MockDBI avoids these problems. Read on to learn how Test::MockDBI eases the job of testing DBI applications.

A Mock Up of the Entire DBI

Test::MockDBI mocks up the entire DBI API by using Test::MockObject::Extends to substitute a Test::MockObject::Extends object in place of the DBI. A feature of this approach is that if the DBI API changes (and you use that change), you will notice during testing if you haven't upgraded Test::MockDBI, as your program will complain about missing DBI API method(s).

Mocking up the entire DBI means that you can add the DBI testing code into an existing application without changing the initial application code--using Test::MockDBI is entirely transparent to the rest of your application, as it neither knows nor cares that it's using Test::MockDBI in place of the DBI. This property of transparency is what drove me to develop Test::MockDBI, as it meant I could add the Test::MockDBI DBI testing code to existing client applications without modifying the existing code (handy, for us consultants).

Further enhancing Test::MockDBI's transparency is the DBI testing type class value. Testing is only enabled when the DBI testing type is non-zero, so you can just leave the DBI testing code additions in your production code--users will not even know about your DBI testing code unless you tell them.
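
As a minimal sketch of that idea (assuming the get_instance() interface described later in this article; the SQL pattern and return value are hypothetical), the only addition the code under test needs is obtaining the Test::MockDBI instance and declaring rules. With a zero DBI testing type, the rules never fire and the real DBI is used:

use Test::MockDBI;

# Hedged sketch: obtain the shared Test::MockDBI instance and declare a
# rule for DBI testing type 1. The SQL pattern and the returned arrayref
# are made-up examples, not part of any real application.
my $tmd = Test::MockDBI::get_instance();
$tmd->set_retval_scalar(1, "SELECT max_connections FROM config", [ 10 ]);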

Mocking up the entire DBI also means that you have complete control of the DBI's behavior during testing. Often, you can simulate a SELECT DBI transaction with a simple state machine that returns just a few rows from the (mocked up) database. Test::MockDBI lets you use a CODEREF to supply database return values, so you can easily put a simple state machine into the CODEREF to supply the necessary database values for testing. You could even put a delay loop into the CODEREF when you need to perform speed tests on your code.
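
For instance, a rough sketch of a delay-loop CODEREF for speed testing might look like this (reusing $tmd from the sketch above; the SQL pattern and one-second delay are illustrative assumptions):

# Hedged sketch: rows returned through a CODEREF, with an artificial
# delay to simulate a slow database during speed tests.
my @rows = ( [ 1, 'first' ], [ 2, 'second' ] );
my $slow_rows = sub {
    sleep 1;                        # pretend the database is slow
    my $row = shift @rows;
    return $row ? @$row : ();       # rows, then an empty list to finish
};
$tmd->set_retval_array(1, "SELECT.*FROM\\s+slow_table", $slow_rows);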

Rules-Based DBI Testing

You control the mocked-up DBI of Test::MockDBI with one or more rules that you insert as Test::MockDBI method calls into your program. The default DBI method values provided by Test::MockDBI make the database appear to have a hole in the bottom of it--all method calls return OK, but you can't get any data out of the database. Rules for DBI methods that return database values (the fetch*() and select*() methods) can use either a value that they return directly for matching method calls, or a CODEREF called to provide a value each time that rule fires. A rule matches when its DBI testing type is the current testing type and the current SQL matches the rule's regular expression. Rules fire in the order in which you declare them, so usually you want to order your rules from most-specific to least-specific.

The DBI testing type is an unsigned integer matching /^\d+$/. When the DBI testing type is zero, there will be no DBI testing (or at least, no mocked-up DBI testing) performed, and the program will use the DBI normally. A zero DBI testing type value in a rule means the rule could fire for any non-zero DBI testing type value--that is, zero is the wildcard DBI testing type value for rules. Set the DBI testing type either by a first command-line argument of the form:

--dbitest[=DTT]

where the optional DTT is the DBI testing type (defaulting to one), or through Test::MockDBI's set_dbi_test_type() method. Setting the DBI testing type through a first command-line argument has the advantage of requiring no modifications to the code under test, as this command-line processing is done so early (during BEGIN time for Test::MockDBI) that the code under test should be ignorant of whether this processing ever happened.
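
If the command-line route is not convenient--inside a test script, say--a sketch of the method route might look like this (assuming $tmd was obtained via get_instance(), as described below):

# Hedged sketch: switch on the mocked DBI from within a test script by
# setting a non-zero DBI testing type.
$tmd->set_dbi_test_type(2);    # rules declared with type 2 (or wildcard 0) now fire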

DBI Return Values

Test::MockDBI defaults to returning a success (true) value for all DBI method calls. This fits well with the usual techniques of DBI programming, where the first DBI error causes the program to stop what it is doing. Test::MockDBI's bad_method() method creates a rule that forces a failure return value on the specified DBI method when the current DBI testing type and SQL match those of the rule. Arbitrary DBI method return value failures like these are difficult (at best) to generate with a test database.

Test::MockDBI's set_retval_scalar() and set_retval_array() methods create rules for what database values to return. Set rules for scalar return values (arrayrefs and hashrefs) with set_retval_scalar(), and rules for array return values with set_retval_array(). You can supply a value to be returned every time the rule matches, which is good when extracting single rows out of the database, such as configuration parameters. Alternatively, pass a CODEREF that will be called each time the rule fires to return a new value. Commonly, with SELECT statements, the DBI returns one or more rows, then returns an empty row to signify the end of the data. A CODEREF can incorporate a state machine that implements this "return 1+ rows, then a terminator" behavior quite easily. Having individual state machines for each rule is much easier to develop with than having one master state machine embedded into Test::MockDBI's core. (An early alpha of Test::MockDBI used the master state machine approach, so I have empirical evidence of this result--I am not emptily theorizing here.)

Depending on what tools you have for creating your test databases, it may be difficult to populate the test database with all of the values you need to test against. Although it is probably not so much the case today, only a few years ago populating a database with Unicode was difficult, given the national-charset-based tools of the day. Even today, a document management system might be difficult to populate with weird file types. Test::MockDBI makes these kinds of tests much easier to carry out, as you directly specify the data for the mock database to return rather than using a separate test database.

This ease of database value testing also applies when you need to test against combinations of database values that are unlikely to occur in practice (the old "comparing apples to battleships" problem). If you need to handle database value corruption--as in network problems causing the return of partial values from a Chinese database when the program is in the U.S.--this ability to completely specify the database return values could be invaluable in testing. Test::MockDBI lets you take complete control of your database return values without separating test code and test data.

Simplicity: Test::MockDBI's Standard-Output-Based Interface

This modern incarnation of the age-old stubbed-functions technique also uses the old technique of "printf() and scratch head" as its output interface. This being Perl we are working with, and not FORTRAN IV (thank goodness), we have multiple options beyond the use of unvarnished standard output.

One option that I think integrates well with DBI-using module testing is to redirect standard output into a string using IO::String. You can then match the string against the regex you are looking for. As you have already guessed, use of pure standard output integrates well with command-line program testing.
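
A sketch of that approach, assuming a hypothetical run_report() subroutine as the code under test, might look like this:

use IO::String;
use Test::More tests => 1;

# Hedged sketch: capture standard output in a string, run the DBI-using
# code, then match the captured output against the expected pattern.
my $captured = '';
my $io       = IO::String->new($captured);
my $old_fh   = select($io);        # make the string the default output

run_report();                      # hypothetical code under test

select($old_fh);                   # restore the previous default output
like( $captured, qr/execute.*SELECT\s+zip5/s,
      'execute() saw the expected SELECT' );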

What you will look for, irrespective of where your code actually looks, is the output of each DBI method as it executes--the method name and arguments--along with anything else your code writes to standard output.

Bind Test Data to Test Code

Because DBI and database return values are bound to your test programs when using Test::MockDBI, there is less risk of test data getting out of sync with the test code. A separate test database introduces another point of failure in your testing process. Multiple test databases add yet another point of failure for each database. Whatever you use to generate the test databases also introduces another point of failure for each database. I can imagine cases where special-purpose programs for generating test databases might create multiple points of failure, especially if the programs have to integrate data from multiple sources to generate the test data (such as a VMS Bill of Materials database and a Solaris PCB CAD file for a test database generation program running on Linux).

One of the major advances in software engineering is the increasing ability to gather and control related information together--the 1990s advance of object-oriented programming in common languages is a testimony to this, from which we Perl programmers reap the benefits in our use of CPAN. For many testing purposes, there is no need for separate test databases. Without that need for a separate test database, separating test data from test code only complicates the testing process. Test::MockDBI lets you bind together your test code and test data into one nice, neat package. Binding is even closer than code and comments, as comments can get out of sync with their code, while the test code and test data for Test::MockDBI cannot get out of sync too far without causing their tests to fail unexpectedly.

When to Use Test::MockDBI

DBI's trace(), DBD::Mock, and Test::MockDBI are complementary solutions to the problem of testing DBI software. DBI's trace() is a pure tracing mechanism, as it does not change the data returned from the database or the DBI method return values. DBD::Mock works at the level of a database driver, so you have to look at your DBI testing from the driver's point of view, rather than the DBI caller's point of view. DBD::Mock also requires that your code supports configurable DBI DSNs, which may not be the case in all circumstances, especially when you must maintain or enhance legacy DBI software.

Test::MockDBI works at the DBI caller's level, which is (IMHO) more natural for testing DBI-using software (possibly a matter of taste: TMTOWTDI). Test::MockDBI's interface with your DBI software is a set of easy-to-program, regex-based rules, which incorporate a lot of power into one or a few lines of code, thereby using Perl's built-in regex support to best advantage. This binds test data and test code tightly together, reducing the chance of synchronization problems between the test data and the test code. Using Test::MockDBI does not require modifying the current code of the DBI software being tested, as you only need additional code to enable Test::MockDBI-driven DBI testing.

Test::MockDBI takes additional coding effort when you need to test DBI program performance. It may be that for performance testing, you want to use test databases rather than Test::MockDBI. If you were in any danger of your copy of DBI.pm becoming corrupted, I don't know whether you could adequately test that condition with Test::MockDBI, depending on the corruption. You would probably have to create a special mock DBI to test corrupted DBI code handling, though you could start building the special mock DBI by inheriting from Test::MockDBI without any problems from Test::MockDBI's design, as it should be inheritance-friendly.

Some Examples

To make:

$dbh = DBI->connect("dbi:AZ:universe", "mortal", "(none)");

fail, add the rule:

$tmd->bad_method("connect", 1,
    "CONNECT TO dbi:AZ:universe AS mortal WITH \\(none\\)");

(where $tmd is the only Test::MockDBI object, which you obtain through Test::MockDBI's get_instance() method).

To make a SQL SELECT fail when using DBI::execute(), use the rule:

$tmd->bad_method("execute", 1,
    "SELECT zip_plus_4 from zipcodes where state='IN'");

This rule implies that:

  • The DBI::connect() succeeded.
  • The DBI::prepare() succeeded().
  • But the DBI::execute() failed as it should.

A common use of direct scalar return values is returning configuration data, such as a U.S. zip code for an address:

$tmd->set_retval_scalar(1,
 "zip5.*'IN'.*'NOBLESVILLE'.*'170 WESTFIELD RD'",
 [ 46062 ]);

This demonstrates using a regular expression, as matching SQL could then look like this:

SELECT
  zip5
FROM
  zipcodes
WHERE
  state='IN' AND
  city='NOBLESVILLE' AND
  street_address='170 WESTFIELD RD'

and the rule would match.
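
To sketch what the code under test would then see (assuming a $dbh and $sth obtained through the usual DBI calls, which Test::MockDBI intercepts, and $sql holding the SELECT above), the fetch might look like this:

# Hedged sketch: with the rule above in place, the prepared and executed
# query hands back the arrayref supplied to set_retval_scalar().
my $sth = $dbh->prepare($sql);
$sth->execute();
my $row = $sth->fetchrow_arrayref();   # [ 46062 ] under DBI testing type 1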

SELECTs that return one or more rows from the database are the common case:

my $counter = 0;                    # name counter
sub possibly_evil_names {
    $counter++;
    if ($counter == 1) {
        return ('Adolf', 'Germany');
    } elsif ($counter == 2) {
        return ('Josef', 'U.S.S.R.');
    } else {
        return ();
    }
}
$tmd->set_retval_array(1,
   "SELECT\\s+name,\\s+country.*possibly_evil_names",
   \&possibly_evil_names);

Using a CODEREF (\&possibly_evil_names) lets you easily add the state machine for implementing a return of two names followed by an empty array (because the code uses fetchrow_array() to retrieve each row). SQL for this query could look like:

SELECT
  name,
  country
FROM
  possibly_evil_names
WHERE
  year < 2000
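
A sketch of the fetch loop this rule drives (again assuming the usual $dbh and $sth plumbing) could be:

# Hedged sketch: the loop ends when possibly_evil_names() returns its
# empty list, just as a real SELECT ends with an empty row.
$sth->execute();
while ( my ($name, $country) = $sth->fetchrow_array() ) {
    print "$name ($country)\n";
}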

Summary

Albert Einstein once said, "Everything should be made as simple as possible, but no simpler." This is what I have striven for while developing Test::MockDBI--the simplest possible useful module for testing DBI programs by mocking up the entire DBI.

Test::MockDBI gives you:

  • Complete control of DBI return values and database-returned data.
  • Returned database values from either direct value specifications or CODEREF-generated values.
  • Easy, regex-based rules that govern the DBI's behavior, along with intelligent defaults for the common cases.
  • Complete transparency to other code, so the code under test neither knows nor cares that you are testing it with Test::MockDBI.
  • Test data tightly bound to test code, which promotes cohesiveness in your testing environment, thereby reducing the chance that your tests might silently fail due to loss of synchronization between your test data and your test code.

Test::MockDBI is a valuable addition to the arsenal of DBI testing techniques.

Massive Data Aggregation with Perl

This article is a case study of the use of Perl and XML/RDF technologies to channel disparate sources of data into a semi-structured repository. This repository helped to build structured OLAP warehouses by mining an RDF repository with SAX machines. Channels of data included user-contributed datasets, data from FTP and HTTP remote-based repositories, and data from other intra-enterprise based assets. We called the system the 'Kitchen Sync', but one of the project's visionaries best described it as akin to a device that accepts piles of random coins and returns them sorted for analysis. This system collected voter data and was the primary data collection point in a national organization for the presidential campaign during the 2004 election.

Introduction

My initial question was why anyone would want to store data in XML/RDF formats. It's verbose, it lacks widely accepted query interfaces (such as SQL), and it generally requires more work than a database. XML, in particular, is a great messaging interface, but a poor persistence medium.

Eventually, I concluded that this particular implementation did benefit from the use of XML and RDF as messaging protocols. The messaging interface involved the use of SAX machines to parse a queue of XML and RDF files. The XML files contained the metadata for what we called polls, and the RDF files contained data from those polls. We had a very large buffer, from which cron-based processes frequently constructed data warehouses for analysis.

Hindsight and Realizations

The difficulty of this project was in the gathering of requirements and vendor interfacing. When implementing application workflow, it is critical to use a programming language that doesn't get in the way and allows you to do what you want--and that is where Perl really shined. A language that allows for quick development is an asset, especially in a rushed environment where projects are due "yesterday". The code samples here are not examples of how to write great object-oriented Perl code. They are real world examples of the code used to get things done in this project.

For example, when a voter-data vendor changed its poll format, our data collection spiders stopped returning data and alerted our staff immediately. Within minutes, we adapted our SAX machine to the vendor's new format and had our data streams back up and running. It would have taken hours or days to call the vendor about the change and engage in a technical discussion to get them to do things our way. Instead, Perl allowed us to adapt to their ways quickly and efficiently.

Project Goals

The architects of this project specified several goals and metrics for the application. The main goals--with the ultimate objective being to accumulate as much data as possible before election day--were to:

  • Develop a web-based application for defining metadata of polls, and uploading sets of poll data to the system.

    The application had to give the user the ability to define sets of questions and answers known as polls. Poll metadata could include related data from documents in standard business formats (.doc, .pdf). The users also needed an easy method, one that minimized possible errors, to upload data to the system.

  • Meet requirements of adding 50 million new records per day.

    That metric corresponds to approximately 578 records per second. Assuming a non-linear load distribution over time, peak transaction requirements were likely to be orders of magnitude higher than the average of 578 per second.

  • Develop a persistent store for RDF and XML data representing polls and poll data.

    The web application had to generate XML documents from poll definitions and RDF documents from uploaded poll data. We stored the poll data in RDF. We needed an API to manage these documents.

  • Develop a mechanized data collection system for the retrieval of data from FTP- and HTTP-based data repositories.

    The plan was to assimilate data sources into our organization from several commercial and other types of vendors. Most vendors had varying schemas and formats for their data. We wanted to acquire as much data as possible before the election to gauge voter support levels and other key metrics crucial to winning a political election.

Web Application

When I started this project, I had been using mod_perl2 extensively in prototyping applications and also as a means of exploring all of the cool new features. Mod_perl2 had proven itself stable enough to use in production, so I implemented a Model-View-Controller application design pattern using a native mod_perl2- and libapreq2-enabled Apache server. I adopted the controller design patterns from recipes in the mod_perl Cookbook. The model classes subclassed Berkeley DBXML and XML::LibXML for object methods and persistence. We used Template Toolkit to implement views. (I will present more about the specifics of the persistence layer later in this article.)

Of primary importance for the web application component of the system was ease of use. If the system was not easy to use, we would likely receive less data as a result of user frustration. The part of the web application with the longest transaction processing time was the poll data upload component.

If the user uploads a 10MB file over a 10KB/s upstream connection (common for residential DSL lines), the transaction takes approximately twenty minutes. Over a 100KB/s upstream connection (business-grade DSL), the transaction still takes about two minutes--certainly much longer than most unsuspecting users would wait before clicking on the browser refresh button.

To prevent the user from accidentally corrupting the lengthy upload process, I created a monitoring browser window which opened via the following JavaScript call when the user clicked the upload button.

<input type=submit name='submit' value='Upload'
    onClick="window.open('/ksync/dataset/monitor', 'Upload',
       'width=740,height=400')">

The server forked off a child process which read the upload status from a BerkeleyDB database. The parent process used a libapreq UPLOAD_HOOK-based approach to measure the amount of data uploaded, and to write that plus a few other metrics to the BerkeleyDB database. The following is a snippet of code from the upload handler:

<Location /ksync/poll/data/progress>
    PerlResponseHandler KSYNC::Apache::Data::Upload->progress
</Location>

sub progress : method {
    my ( $self, $r ) = @_;

    # We deal with commas and tabs as delimiters currently
    my $delimiter;

    # Create a BerkeleyDB to keep track of upload progress
    my $db = _init_status_db( DB_CREATE );

    # Get the specifics of the poll we're getting data for
    my $poll = $r->pnotes('SESSION')->{'poll'};

    # Generate a unique identifier for files based on the poll
    my $id = _file_id($poll);

    # Store any data which does not validate according to the poll schema
    my $invalid = IO::File->new();
    my $ivfn = join '', $config->get('data_root'), '/invalid/', $id, '.txt';
    $invalid->open("> $ivfn");

    # Set the rdf filename
    my $gfn = join '', $config->get('data_root'), '/valid/', $id, '.rdf';

    # Create an RDF document object to store the data
    my $rdf = KSYNC::Model::Poll::Data::RDF->new(
                $gfn, 
                $poll,
                $r->pnotes('SESSION')->{'creator'}, 
                DateTime->now->ymd, 
    );

    # Get the poll questions to make sure the answers are valid
    my $questions = $poll->questions;

    # Create a data structure to hold the answers to validate against.
    my @valid_answers = _valid_answers($questions);

    # And a data structure to hold the validation results
    my $question_data = KSYNC::Model::Poll::validation_results($questions);

    # Set progress store parameters
    my $length              = 0;
    my $good_lines_total    = 0;
    my $invalid_lines_total = 0;
    my $began;              # Boolean to determine if we've started parsing data
    my $li                  = 1;    # Starting line number

    # The subroutine to process uploaded data
    my $fragment;
    my $upload_hook = sub {
        my ( $upload, $data, $data_len, $hook_data ) = @_;

        if ( !$began ) {   # If this is the first chunk, check the field count

            # Chop up the stream
            my @lines = split "\n", $data;

            # Determine the delimiter for this line
            $delimiter = _delimiter(@lines);

            unless ( ( split( /$delimiter/, $lines[0] ) ) ==
                scalar( @{$question_data} ) + 1 )
            {
                $db->db_put( 'done', '1' );
                
                # The dataset isn't valid, so throw an exception
                KSYNC::Apache::Exception->throw('Invalid Dataset!');
            }
        }

        # Mark the start of the upload
        $began = 1;

        # Validate the data against the poll answers we've defined
        my ( $good_lines, $invalid_lines );

        ( $good_lines, $invalid_lines, $question_data, $li, $fragment ) =
          KSYNC::Model::Poll::Data::validate( \@valid_answers, 
                                              $data, 
                                              $question_data,
                                              $li, 
                                              $delimiter, 
                                              $fragment );

        # Keep up the running count of good and invalid lines
        $good_lines_total     += scalar( @{$good_lines} );
        $invalid_lines_total  += scalar( @{$invalid_lines} );

        # Increment the number of bytes processed
        $length += length($data);

        # Update the status for the monitor process
        $db->db_put(
                     valid     => $good_lines_total,
                     invalid   => $invalid_lines_total,
                     bytes     => $length,
                     filename  => $upload->filename,
                     filetype  => $upload->type,
                     questions => $question_data,
                   );

        # And store the data we've collected
        $rdf->write( $good_lines ) if scalar( @{$good_lines} );

        # Write out any invalid data points to a separate file
        _write_txt( $invalid, $invalid_lines ) if scalar( @{$invalid_lines} );
    };

    my $req = Apache::Request->new(
        $r,
        POST_MAX    => 1024 * 1024 * 1024,    # One Gigabyte
        HOOK_DATA   => 'Note',
        UPLOAD_HOOK => $upload_hook,
        TEMP_DIR    => $config->get('temp_dir'),
    );

    my $upload = eval { $req->upload( scalar +( $req->upload )[0] ) };
    if ( ref $@ and $@->isa("Apache::Request::Error") ) {

        # ... handle Apache::Request::Error object in $@
        $r->headers_out->set( Location => 'https://'
              . $r->construct_server
              . '/ksync/poll/data/upload/aborted' );
        return Apache::REDIRECT;
    }

    # Finish up
    $invalid->close;
    $rdf->save;

    # Set status so the progress window will close
    $db->db_put('done', '1');
    undef $db;
    
    # Send the user to the summary page
    $r->headers_out->set(
      Location => join('', 
                       'https://', 
                       $r->construct_server, 
                       '/poll/data/upload/summary',
                      )                   
    );
    return Apache::REDIRECT; 
}

During the upload process, the users saw a status window which refreshed every two seconds and had a pleasant animated GIF to enhance their experience, as well as several metrics on the status of the upload. One user uploaded a file that took 45 minutes because of a degraded network connection, but the uploaded file had no errors.

The system converted CSV files that users uploaded into RDF and saved them to the RDF store during the upload process. Because of the use of the UPLOAD_HOOK approach for processing uploaded data, the mod_perl-enabled Apache processes never grew in size or leaked memory as a result of handling the upload content.

Poll and Poll Data Stores

Several parties involved raised questions about the use of XML and RDF as persistence mediums. Why not use a relational database? Our primary reasons for deciding against a relational database were that we had several different schemas and formats of incoming data, and we needed to be able to absorb huge influxes of data in very short time periods.

Consider how a relational database could have handled the variation in schemas and formats. Creating vendor-specific drivers to handle each format would have been straightforward. To handle the variations in schema, we could have normalized each data stream and its attributes so that we could store all the data in source, object, attribute, and value tables. The problem with that approach is that you get one really big table with all the values, which becomes more difficult to manage as time goes on. Another possible approach, which I have used in the past, is to create separate tables for each data stream to fit the schema, and then use the power of left, right, and outer joins to extract the needed information. It scales much better than the first approach but it is not as well suited for data mining as warehouses are.

With regard to absorbing a lot of data very quickly, transactional relational databases have limitations when you insert or update data in a table with many rows. Additionally, the insert and update transactions are not asynchronous. When inserting or updating a record, the transaction does not complete until the indexes associated with the indexed fields of that record have been updated. This slows down as the database grows in size.

We wanted the transactions between users, machines, and the Kitchen Sync to be as asynchronous as possible. Our ability to take in data in RDF format would not degrade as the amount of data already taken in (but not yet warehoused for analysis) grew. Data exchange between vendors and us consisted of a few large transactions in RDF format per data set, and the length of each transaction depended solely on the speed of the network connection between the vendor and our data center.

With the decision to use XML for storing poll metadata and RDF for storing poll data in place, we turned our attention to the specifics of the persistence layer. We stored the poll objects in XML, as shown in this example:


<?xml version="1.0"?>
<poll>
    <creator>Fred Moyer</creator>
    <date>2005-03-01</date>
    <vendor>Voter Data Inc.</vendor>
    <location>https://www.voterdatainc.com/poll/1234</location>
    <questions>
        <question>
            <name>Who is buried in Grant's Tomb?</name>
            <answers>
                <answer>
                    <name>Ulysses Grant</name>
                    <value>0</value>
                </answer>
                <answer>
                    <name>John Kerry</name>
                    <value>1</value>
                </answer>
                <answer>
                    <name>George Bush</name>
                    <value>2</value>
                </answer>
                <answer>
                    <name>Alfred E. Neumann</name>
                    <value>3</value>
                </answer>
            </answers>
        </question>
    </questions>
    <media>
        <pdf>
            <name>Name of a PDF file describing this poll</name>
            <raw>The raw contents of the PDF file</raw>
            <text>The text of the PDF file, generated with XPDF libs</text>
        </pdf>
    </media>
</poll>

We also needed an API to manage those documents. We chose Berkeley DBXML because of its simple but effective API and its ability to scale to terabyte size if needed. We created a poll class which subclassed the Sleepycat and XML::LibXML modules and provided some Perlish methods for manipulating polls.

package KSYNC::Model::Poll;

use strict;
use warnings;

use base qw(KSYNC::Model);
use Sleepycat::DbXml qw(simple);
use XML::LibXML;
use KSYNC::Exception;

my $ACTIVITY_LOC = 'data/poll.dbxml';

# Container handle shared by the methods below
my $container;

BEGIN {
    # Initialize the DbXml database
    $container = XmlContainer->new($ACTIVITY_LOC);
}

# Call base class constructor KSYNC::Model->new
sub new {
    my ($class, %args) = @_;

    my $self = $class->SUPER::new(%args);
    return $self;
}

# Transform the poll object into an xml document
sub as_xml {
    my ($self, $id) = @_;
    
    my $dom = XML::LibXML::Document->new();
    my $pi = $dom->createPI( 'xml-stylesheet', 
                             'href="/css/poll.xsl" type="text/xsl"' );
    $dom->appendChild($pi);
    my $element = XML::LibXML::Element->new('Poll');

    $element->appendTextChild('Type',        $self->type);
    $element->appendTextChild('Creator',     $self->creator);
    $element->appendTextChild('Description', $self->description);
    $element->appendTextChild('Vendor',      $self->vendor);
    $element->appendTextChild('Began',       $self->began);
    $element->appendTextChild('Completed',   $self->completed);

    my $questions = XML::LibXML::Element->new('Questions');

    for my $question ( @{ $self->{question} } ) {
        $questions->appendChild($question->as_element);
    }

    $element->appendChild($questions);

    $dom->setDocumentElement($element);
    return $dom;
}

sub save {
    my $self = shift;

    # Connect to the DbXml database
    $container->open(Db::DB_CREATE);

    # Create a new document for storage from xml serialization of $self
    my $doc = XmlDocument->new();
    $doc->setContent($self->as_xml);
    
    # Save, throw an exception if problems happen
    eval { $container->putDocument($doc); };
    KSYNC::Exception->throw("Could not add document: $@") if $@;

    # Return the ID of the newly added document
    return $doc->getID();
}

1;

We chose RDF as the format for poll data because the format contains links to resources that describe the namespaces of the document, making the document self-describing. The availability of standardized namespaces such as Dublin Core gave us predefined tags such as dc:date and dc:creator. We added our own namespaces for representation of poll data. Depending on what verbosity of data the vendors kept, we could add dc:date tags to different portions of the document to provide historical references. We constructed our URLs in a REST format for all web-based resources.

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:ourparty="http://www.ourparty.org/xml/schema#">

    <rdf:Description rdf:about="http://www.ourparty.org/poll/1234">
        <dc:date>2004-10-14</dc:date>
        <dc:creator>fmoyer@plusthree.com</dc:creator>
    </rdf:Description>

    <rdf:Bag>
        <rdf:li ourparty:id="6372095736" ourparty:question="1"
                ourparty:answer="1" dc:date="2005-03-01" />
        <rdf:li ourparty:id="2420080069" ourparty:question="2"
                ourparty:answer="3" dc:date="2005-03-02" />
    </rdf:Bag>
</rdf:RDF>

We used SAX machines as drivers to generate summary models of RDF files and LibXML streaming parsers to traverse the RDF files. We stacked drivers by using pipelined SAX machines and constructed SAX drivers for the different vendor data schemas. Cron-driven processes scanned the RDF store, identified new poll data, and processed it into summary XML documents which we served to administrative users via XSLT transformations. Additionally, we used the SAX machines to create denormalized SQL warehouses for data mining.

An example SAX driver for Voter Data, Inc. RDF poll data:

package KSYNC::SAX::Voterdatainc;

use strict;
use warnings;

use base qw(KSYNC::SAX);

my %NS = (
    rdf      => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    dc       => 'http://purl.org/dc/elements/1.1/',
    ourparty => 'http://www.ourparty.org/xml/schema#',
);

my $VENDOR = 'Voter Data, Inc.';

sub new {
    my $class = shift;

    # Call the super constructor to create the driver
    my $self = $class->SUPER::new(@_, { vendor => $VENDOR });

    return $self;
}

sub start_element {
    my ($self, $data) = @_;

    # Process rdf:li elements
    if ( $data->{Name} eq 'rdf:li' ) {

        # Grab the data
        my $id      = $data->{Attributes}{ "{$NS{ourparty}}id" }{Value};
        my $answer  = $data->{Attributes}{ "{$NS{ourparty}}answer" }{Value};
        my $creator = $data->{Attributes}{ "{$NS{dc}}creator" }{Value};
        my $date    = $data->{Attributes}{ "{$NS{dc}}date" }{Value};

        # Map the data to a common response
        $self->add_response({ vendor        => $VENDOR,
                              voter_id      => $id,
                              support_level => $answer,
                              creator       => $creator,
                              date          => $date,
                           });
    }

    # Call the base class start_element method to do something with the data
    $self->SUPER::start_element($data);
}

1;

We stored RDF documents compressed in bzip2 format, because the bzip2 compression algorithm is especially efficient at compressing repeated element data. As shown below in the SAX machine example, using bzcat as the intake to a pipeline parser allowed decompression of the bzip2 documents for parsing and creating a summary of a poll data set.

#!/usr/bin/env perl

use strict;
use warnings;

use KSYNC::SAX::Voterdatainc;
use XML::SAX::Machines qw(Pipeline);

# The poll data
my $rdf = 'data/voterdatainc/1759265.rdf.bz2';

# Create a vendor specific driver
my $driver = KSYNC::SAX::Voterdatainc->new();

# Create a driver to add the data to a data warehouse handle
my $dbh = KSYNC::DBI->connect();
my $warehouser = KSYNC::SAX::DBI->new(
                    source => 'http://www.voterdatainc.com/ourparty/poll.xml',
                    dbh    => $dbh,
                );

# Create a parser which uncompresses the poll data set, summarizes it, and 
# outputs data to a filter which warehouses the denormalized data
my $parser = Pipeline(
                "bzcat $rdf |" =>
                $driver        =>
                $warehouser
             );

# Parse the poll data
$parser->parse();

# Summarize the poll data
print "Average support level:  ",   $driver->average_support_level, "\n";
print "Starting date:  ", 	    $driver->minimum_date, "\n";
print "Ending date:  ", 	    $driver->maximum_date, "\n";

Between the polls, the XML Schema dictionaries, and the RDF files, we knew who the polls contacted, what those people saw, and how they responded. A major benefit of keeping the collected information in RDF format is the preservation of historical information. We constructed SQL warehouses to analyze changes in voter support levels over time. This was critical for measuring the effect of events such as presidential debates on voter interest and support.

Using RDF also provided us with the flexibility to map new data sources as needed. If a vendor collected some information which we had not processed before, they would add an about tag such as <rdf:Description rdf:about="http://www.datavendor.com/ourparty/poll5.xml" />, which we would then map to features of our SAX machines.

We added some hooks to the SAX machines to match certain URIs and then process selected element data. Late in the campaign, when early voting started, we were able to quickly modify our existing SAX machines to collect early voting data from the data streams and produce SQL warehouses for analysis.
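
A hypothetical sketch of such a hook, in the style of the vendor driver above (the package name, the URI, and the handle_new_source() helper are assumptions for illustration):

package KSYNC::SAX::Newvendor;   # hypothetical driver for a new source

use strict;
use warnings;

use base qw(KSYNC::SAX);

my %NS = ( rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#' );

# Hedged sketch: dispatch on the rdf:about URI of a Description element
# so a new data source can be routed to its own handler.
sub start_element {
    my ($self, $data) = @_;

    if ( $data->{Name} eq 'rdf:Description' ) {
        my $about = $data->{Attributes}{ "{$NS{rdf}}about" }{Value} || '';
        if ( $about =~ m{^http://www\.datavendor\.com/ourparty/} ) {
            $self->handle_new_source($about);   # hypothetical helper
        }
    }

    $self->SUPER::start_element($data);
}

1;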

Mechanization of Data Collection

A major focus of the application was retrieving data from remote sources. Certain vendors used our secure FTP site to send us data, but most had web and FTP sites to which they posted the information. We needed a way to collect data from those servers. Some vendors were able to provide data to us in XML and RDF formats, but for the most part, we would receive data in CSV, TSV, or some form of XML. Each vendor generally had supplementary data beyond the normal voter data fields which we also wanted to capture. Using that additional data was not an immediate need, but by storing it in RDF format we could extract it and generate SQL warehouses whenever necessary.

We developed a part of the application known as the spider and created a database table containing information on each data source's authentication, protocol, and data structure details. A factory class, KSYNC::Model::Spider, read the data source entries and constructed spider objects for each data source. These spiders used Net::FTP and LWP to retrieve poll data, and processed the data using the appropriate KSYNC::SAX machine. To add a new data source to our automated collection system, we created a database entry to configure the spider; if the new data source used a format we did not yet support, we also added a SAX machine for it.

An example of spider usage:

package KSYNC::Model::Spider;

use strict;
use warnings;

use Carp 'croak';
use base 'KSYNC::Model';

sub new {
    my ($class, %args) = @_;
    
    # Create an FTP or HTTP spider based on the type specified in %args
    my $spider_pkg = $class->_factory($args{type});
    my $self = $spider_pkg->new(%args);

    return $self;
}

sub _factory {
    my ($class, $type) = @_;

    # Create the package name for the spider type
    my $pkg = join '::', $class, $type;
    
    # Load the package
    eval "use $pkg";
    croak("Error loading factory module: $@") if $@;

    return $pkg;
}

1;

package KSYNC::Model::Spider::FTP;

use strict;
use warnings;

use Carp 'croak';
use Net::FTP;
use KSYNC::Exception;

sub new {
    my ($class, %args) = @_;
    
    my $self = { %args };

    # Load the appropriate authentication package via Spider::Model::Auth 
    # factory class
    $self->{auth} = Spider::Model::Auth->new(%{$args{auth}});

    return bless $self, $class;
}

sub authenticate {
    my $self = shift;
    
    # Login
    eval { $self->ftp->login($self->auth->username, $self->auth->password); };
     
    # Throw an exception if problems occurred
    KSYNC::Exception->throw("Cannot login ", $self->ftp->message) if $@;
}

sub crawl {
    my $self = shift;
    
    # Set binary retrieval mode
    $self->ftp->binary;

    # Find new poll data
    my @datasets = $self->_find_new();

    # Process that poll data
    foreach my $dataset (@datasets) {
        eval { $self->_process($dataset); };
        $self->error("Could not process poll data " . $dataset->id, $@) if $@;
    }
}

sub ftp {
    my $self = shift;
    croak("Method Not Implemented!") if @_;
    $self->{ftp} ||= Net::FTP->new( $self->auth->host );
}

# Read-only accessor for the authentication object
sub auth { $_[0]->{auth} }

1;

#!/usr/bin/env perl

use strict;
use warnings;

use KSYNC::Model::Spider;
use KSYNC::Model::Vendor;

# Retrieve a vendor so we can grab their latest data
my $vendor = KSYNC::Model::Vendor->retrieve({ 
  name => 'Voter Data, Inc.',
});

# Construct a spider to crawl their site
my $spider = KSYNC::Model::Spider->new( type => $vendor->type );

# Login
$spider->login();

# Grab the data
$spider->crawl();

# Logout
$spider->logout();

1;

Conclusions

In this project, getting things done was of paramount importance. Perl allowed us to deal with the complexities of the business requirements and the technical details of data schemas and formats without presenting additional technical obstacles, as programming languages occasionally do. The CPAN, mod_perl, and libapreq provided the components that allowed us to quickly build an application to deal with complex, semi-structured data on an enterprise scale. From creating a user friendly web application to automating data collection and SQL warehouse generation, Perl was central to the success of this project.

Credits

Thanks to the following people who made this possible and contributed to this project: Thomas Burke, Charles Frank, Lyle Brooks, Lina Brunton, Aaron Ross, Alan Julson, Marc Schloss, and Robert Vadnais.

Thanks to Plus Three LP for sponsoring work on this project.

More Lightning Articles

Customizing Emacs with Perl

by Bob DuCharme

Over time, I've accumulated a list of Emacs customizations I wanted to implement when I got the chance. For example, I'd like macros to perform certain global replaces just within a marked block, and I'd like a macro to reformat an Outlook-formatted date as an ISO 8601 formatted date. I'm not overly intimidated by the elisp language used to customize Emacs behavior; I've copied elisp code and modified it to make some tweaks before, I had a healthy dose of Scheme and LISP programming in school, and I've done extensive work with XSLT, a descendant of these grand old languages. Still, as with a lot of postponed editor customization work, I knew I'd have to use these macros many, many times before they earned back the time invested in creating them, because I wasn't that familiar with string manipulation and other basic operations in a LISP-based language. I kept thinking to myself, "This would be so easy if I could just do the string manipulation in Perl!"

Then, I figured out how I could write Emacs functions that called Perl to operate on a marked block (or, in Emacs parlance, a "region"). Many Emacs users are familiar with the Escape+| keystroke, which invokes the shell-command-on-region function. It brings up a prompt in the minibuffer where you enter the command to run on the marked region, and after you press the Enter key Emacs puts the command's output in the minibuffer if it will fit, or into a new "*Shell Command Output*" buffer if not. For example, after you mark part of an HTML file you're editing as the region, pressing Escape+| and entering wc (for "word count") at the minibuffer's "Shell command on region:" prompt will feed the text to this command line utility if you have it in your path, and then display the number of lines, words, and characters in the region at the minibuffer. If you enter sort at the same prompt, Emacs will run that command instead of wc and display the result in a buffer.

Entering perl /some/path/foo.pl at the same prompt will run the named Perl script on the marked region and display the output appropriately. This may seem like a lot of keystrokes if you just want to do a global replace in a few paragraphs, but remember: Escape+| calls Emacs's built-in shell-command-on-region function, and you can call this same function from a new function that you define yourself. My recent great discovery was that along with parameters identifying the region boundaries and the command to run on the region, shell-command-on-region takes an optional parameter that tells it to replace the input region with the command's output. When you're editing a document with Emacs, this allows you to pass a marked region outside of Emacs to a Perl script, let the Perl script do whatever you like to the text, and then have Emacs replace the original text with the processed version. (If your Perl script mangled the text, Emacs' excellent undo command can come to the rescue.)

Consider an example. When I take notes about a project at work, I might write that Joe R. sent an e-mail telling me that a certain system won't need any revisions to handle the new data. I want to make a note of when he told me this, so I copy and paste the date from the e-mail he sent. We use Microsoft Outlook at work, and the dates have a format following the model "Tue 2/22/2005 6:05 PM". I already have an Emacs macro bound to alt+d to insert the current date and time (also handy when taking notes) and I wanted the date format that refers to e-mails to be the same format as the ones inserted with my alt+d macro: an ISO 8601 format of the form "2005-02-22T18:05".

The .emacs startup file holds customized functions that you want available during your Emacs session. The following shows a bit of code that I put in mine so that I could convert these dates:

(defun OLDate2ISO ()
  (interactive)
  (shell-command-on-region (point)
         (mark) "perl c:/util/OLDate2ISO.pl" nil t))

The (interactive) declaration tells Emacs that the function being defined can be invoked interactively as a command. For example, I can enter "OLDate2ISO" at the Emacs minibuffer command prompt, or I can press a keystroke or select a menu choice bound to this function. The point and mark functions are built into Emacs to identify the boundaries of the currently marked region, so they're handy for the first and second arguments to shell-command-on-region, which tell it which text is the region to act on. The third argument is the actual command to execute on the region; enter any command available on your operating system that can accept standard input. To define your own Emacs functions that call your own Perl scripts, change the function name from OLDate2ISO to anything you like and change this third argument to shell-command-on-region to invoke your own script.

Leave the last two arguments as nil and t. Don't worry about the fourth parameter, which controls the buffer where the shell output appears. (Setting it to nil means "don't bother.") The fifth parameter is the key to the whole trick: when non-nil, it tells Emacs to replace the marked text in the editing buffer with the output of the command described in the third argument instead of sending the output to a buffer.

If you're familiar with Perl, there's nothing particularly interesting about the OLDate2ISO.pl script. It does some regular expression matching to split up the string, converts the time to a 24 hour clock, and rearranges the pieces:

# Convert an Outlook format date to an ISO 8601 date
# (e.g. Wed 2/16/2005 5:27 PM to 2005-02-16T17:27)
while (<>) {
  if (/\w+ (\d+)\/(\d+)\/(\d{4}) (\d+):(\d+) ([AP])M/) {
     $AorP = $6;
     $minutes = $5;
     $hour = $4;
     $year = $3;
     $month = $1;
     $day = $2;
     $day = '0' . $day if ($day < 10);
     $month = '0' . $month if ($month < 10);
     $hour = 0 if ($hour == 12);            # 12 AM becomes 00, 12 PM stays 12
     $hour = $hour + 12 if ($AorP eq 'P');
     $hour = '0' . $hour if ($hour < 10);
     $_ = "$year-$month-$day" . "T$hour:$minutes";
  }
  print;
}

When you start up Emacs with a function definition like the defun OLDate2ISO one shown above in your .emacs file, the function is available to you like any other in Emacs. Press Escape+x to bring up the Emacs minibuffer command line and enter "OLDate2ISO" there to execute it on the currently marked region. Like any other interactive command, you can also assign it to a keystroke or a menu choice.

There might be a more efficient way to do the Perl coding shown above, but I didn't spend too much time on it. That's the beauty of it: with five minutes of Perl coding and one minute of elisp coding, I had a new menu choice to quickly do the transformation I had always wished for.

Another example of something I always wanted is the following txt2htmlp.pl script, which is useful after plugging a few paragraphs of plain text into an HTML document:

# Turn lines of plain text into HTML p elements.
while (<>) {
  chomp($_);
  # Turn ampersands and < into entity references.
  s/&/&amp;/g;
  s/</&lt;/g;
  # Wrap each non-blank line in a "p" element.
  print "<p>$_</p>\n\n" if (!(/^\s*$/));
}

Again, it's not a particularly innovative Perl script, but with the following bit of elisp in my .emacs file, I have something that greatly speeds up the addition of hastily written notes into a web page, especially when I create an Emacs menu choice to call this function:

(defun txt2htmlp ()
  (interactive)
  (shell-command-on-region (point) 
         (mark) "perl c:/util/txt2htmlp.pl" nil t))

Sometimes when I hear about hot new editors, I wonder whether they'll ever take the place of Emacs in my daily routine. Now that I can so easily add the power of Perl to my use of Emacs, it's going to be a lot more difficult for any other editor to compete with Emacs on my computer.

Debug Your Programs with Devel::LineTrace

by Shlomi Fish

Often, programmers find a need to use print statements to output information to the screen, in order to help them analyze what went wrong in running the script. However, including these statements verbatim in the script is not such a good idea. If not promptly removed, these statements can have all kinds of side-effects: slowing down the script, destroying the correct format of its output (possibly ruining test-cases), littering the code, and confusing the user. It would be a better idea not to place them within the code in the first place. How, though, can you debug without debugging?

Enter Devel::LineTrace, a Perl module that can assign portions of code to execute at arbitrary lines within the code. That way, the programmer can add print statements in relevant places in the code without harming the program's integrity.

Verifying That use lib Has Taken Effect

One example I recently encountered was that I wanted to use a module I wrote from the specialized directory where I had placed it, while it was already installed in Perl's global include path. I used a use lib "./MyPath" directive to make sure this was the case, but I now had a problem. What if there was a typo in the path of the use lib directive, and as a result, Perl loaded the module from the global path instead? I needed a way to verify it.

To demonstrate how Devel::LineTrace can do just that, consider a similar script that tries to use a module named CGI from the path ./MyModules instead of the global Perl path. (It is a bad idea to name your modules after names of modules from CPAN or from the Perl distribution, but this is just for the sake of the demonstration.)

#!/usr/bin/perl -w

use strict;
use lib "./MyModules";

use CGI;

my $q = CGI->new();

print $q->header();

Name this script good.pl. To test that Perl loaded the CGI module from the ./MyModules directory, direct Devel::LineTrace to print the relevant entry from the %INC internal variable, at the first line after the use CGI one.

To do so, prepare this file and call it test-good.txt:

good.pl:8
    print STDERR "\$INC{CGI.pm} == ", $INC{"CGI.pm"}, "\n";

Place the filename and the line number at which the trace should be inserted on the first line. Then comes the code to evaluate, indented from the start of the line. After the first trace, you can add other traces by starting a new line with a filename and line number and putting the code on the following (indented) lines. This example is simple enough not to need that, though.
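
For illustration, a hypothetical trace file with two traces (the second line number and its print statement are made up) would look like:

good.pl:8
    print STDERR "\$INC{CGI.pm} == ", $INC{"CGI.pm"}, "\n";
good.pl:10
    print STDERR "Query object is a ", ref($q), "\n";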

After you have prepared test-good.txt, run the script through Devel::LineTrace by executing the following command:

$ PERL5DB_LT="test-good.txt" perl -d:LineTrace good.pl

(This assumes a Bourne-shell derivative.) The PERL5DB_LT environment variable contains the path of the file to use for debugging, and the -d:LineTrace directive instructs Perl to run the script under the Devel::LineTrace package.

As a result, you should see one of two lines on standard error. Either you'll see:

$INC{CGI.pm} == MyModules/CGI.pm

meaning that Perl indeed loaded the module from the MyModules sub-directory of the current directory. Otherwise, you'll see something like:

$INC{CGI.pm} == /usr/lib/perl5/vendor_perl/5.8.4/CGI.pm

...which means that it came from the global path and something went wrong.

Limitations of Devel::LineTrace

Devel::LineTrace has two limitations:

  1. Because it uses the Perl debugger interface and stops at every line (to check whether it contains a trace), program execution is considerably slower when the program is being run under it.
  2. It assigns traces to line numbers, and therefore you must update it if the line numbering of the file changes.

Nevertheless, it is a good solution for keeping those pesky print statements out of your programs. Happy LineTracing!

Using Test::MockDBI

by Mark Leighton Fisher

What if you could test your program's use of the DBI just by creating a set of rules to guide the DBI's behavior—without touching a database (unless you want to)? That is the promise of Test::MockDBI, which by mocking up the entire DBI API gives you unprecedented control over every aspect of the DBI's interface with your program.

Test::MockDBI uses Test::MockObject::Extends to mock all of the DBI transparently. The rest of the program knows nothing about using Test::MockDBI, making Test::MockDBI ideal for testing programs that you are taking over, because you only need to add the Test::MockDBI invocation code—you do not have to modify any of the other program code. (I have found this very handy as a consultant, as I often work on other people's code.)

Rules are invoked when the current SQL matches the rule's SQL pattern. For finer control, there is an optional numeric DBI testing type for each rule, so that a rule only fires when the SQL matches and the current DBI testing type is the specified DBI testing type. You can specify this numeric DBI testing type (a simple integer matching /^\d+$/) from the command line or through Test::MockDBI::set_dbi_test_type(). You can also set up rules to fail a transaction if a specific DBI::bind_param() parameter is a specific value. This means there are three types of conditions for Test::MockDBI rules:

  • The current SQL
  • The current DBI testing type
  • The current bind_param() parameter values

Under Test::MockDBI, fetch*() and select*() methods default to returning nothing (the empty array, the empty hash, or undef for scalars). Test::MockDBI lets you take control of their returned data with the methods set_retval_scalar() and set_retval_array(). You can specify the returned data directly in the set_retval_*() call, or pass a CODEREF that generates a return value to use for each call to the matching fetch*() or select*() method. CODEREFs let you both simulate DBI's interaction with the database more accurately (as you can return a few rows, then stop), and add in any kind of state machine or other processing needed to precisely test your code.

When you need to test that your code handles database or DBI failures, bad_method() is your friend. It can fail any DBI method, with the failures dependent on the current SQL and (optionally) the current DBI testing type. This capability is necessary to test code that handles bad database UPDATEs, INSERTs, or DELETEs, along with being handy for testing failing SELECTs.
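
A sketch of such a rule, in the same style as the examples in the longer article above (the UPDATE is made up, and $tmd is the instance obtained via Test::MockDBI's get_instance()):

# Hedged sketch: force execute() to fail for a specific UPDATE under
# DBI testing type 1, so the error-handling path can be exercised.
$tmd->bad_method("execute", 1,
    "UPDATE zipcodes\\s+SET zip5");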

Test::MockDBI extends your testing capabilities to code that is difficult or impossible to test against a live, working database. Test::MockDBI's mock-up of the entire DBI API lets you add Test::MockDBI to your programs without having to modify their current DBI code. Although it is not finished (not all of the DBI is mocked up yet), Test::MockDBI is already a powerful tool for testing DBI programs.

Unnecessary Unbuffering

by chromatic

A great joy in a programmer's life is removing useless code, especially when its absence improves the program. Often this happens in old codebases or codebases thrown together hastily. Sometimes it happens in code written by novice programmers who try several different ideas all together and fail to undo their changes.

One such persistent idiom is wholesale, program-wide unbuffering, which can take the form of any of:

local $| = 1;
$|++;
$| = 1;

Sometimes this is valuable. Sometimes it's vital. It's not the default for very good reason, though, and at best, including one of these lines in your program is useless code.

What's Unbuffering?

By default, modern operating systems don't send information to output devices directly, one byte at a time, nor do they read information from input devices directly, one byte at a time. IO is so slow compared to processors and memory--especially over networks--that adding buffers, and trying to fill them before sending or receiving information, can improve performance.

Think of trying to fill a bathtub from a hand pump. You could pump a little water into a bucket and walk back and forth to the bathtub, or you could fill a trough at the pump and fill the bucket from the trough. If the trough is empty, pumping a little bit of water into the bucket will give you a faster start, but it'll take longer in between bucket loads than if you filled the trough at the start and carried water back and forth between the trough and the bathtub.

Information isn't exactly like water, though. Sometimes it's more important to deliver a message immediately even if it doesn't fill up a bucket. "Help, fire!" is a very short message, but waiting to send it when you have a full load of messages might be the wrong thing.

That's why modern operating systems also let you unbuffer specific filehandles. When you print to an unbuffered filehandle, the operating system will handle the message immediately. That doesn't guarantee that whoever's on the other side of the handle will respond immediately; there might be a pump and a trough there.
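
A minimal Perl sketch of unbuffering one specific filehandle (using the autoflush() method from IO::Handle, and a made-up log file), while leaving everything else buffered:

use IO::Handle;

# Hedged sketch: only this log filehandle is unbuffered; STDOUT keeps
# its normal buffering.
open my $log, '>', 'progress.log' or die "Can't open progress.log: $!";
$log->autoflush(1);                 # flush after every print to $log
print {$log} "help, fire!\n";       # reaches the file immediately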

What's the Damage?

According to Mark-Jason Dominus' Suffering from Buffering?, one sample showed that buffered reading was 40% faster than unbuffered reading, and buffered writing was 60% faster. The latter number may only improve when considering network communications, where the overhead of sending and receiving a single packet of information can overwhelm short messages.

In simple interactive applications though, there may be no benefit. When attached to a terminal, such as a command line, Perl operates in line-buffered mode. Run the following program and watch the output carefully:

#!/usr/bin/perl

use strict;
use warnings;

# buffer flushed at newline
loop_print( 5, "Line-buffered\n" );

# buffer not flushed until newline
loop_print( 5, "Buffered  " );
print "\n";

# buffer flushed with every print
{
    local $| = 1;
    loop_print( 5, "Unbuffered  " );
}

sub loop_print
{
    my ($times, $message) = @_;

    for (1 .. $times)
    {
        print $message;
        sleep 1;
    }
}

The first five messages appear individually and immediately. Perl flushes the buffer for STDOUT when it sees the newlines. The second set appears after five seconds, all at once, when Perl sees the newline printed after the loop. The third set appears individually and immediately because Perl flushes the buffer after every print statement.

Terminals are different from everything else, though. Consider the case of writing to a file. In one terminal window, create a file named buffer.log and run tail -f buffer.log or its equivalent to watch the growth of the file in real time. Then add the following lines to the previous program and run it again:

open( my $output, '>', 'buffer.log' ) or die "Can't open buffer.log: $!";
select( $output );
loop_print( 5, "Buffered\n" );
{
      local $| = 1;
      loop_print( 5, "Unbuffered\n" );
}

The first five messages appear in the log in a batch, all at once, even though they all have newlines. Five messages aren't enough to fill the buffer. Perl only flushes it when it unbuffers the filehandle on assignment to $|. The second set of messages appears individually, one second after another.

Finally, the STDERR filehandle is hot by default. Add the following lines to the previous program and run it yet again:

select( STDERR );
loop_print( 5, "Unbuffered STDERR " );

Though no code disables the buffer on STDERR, the five messages should print immediately, just as in the other unbuffered cases. (If they don't, your OS is weird.)

What's the Solution?

Buffering exists for a reason; it's almost always the right thing to do. When it's the wrong thing to do, you can disable it. Here are some rules of thumb:

  • Never disable buffering by default.
  • Disable buffering when and while you have multiple sources writing to the same output and their order matters.
  • Never disable buffering for network outputs by default.
  • Disable buffering for network outputs only when the expected time between full buffers exceeds the expected client timeout length.
  • Don't disable buffering on terminal outputs. For STDERR, it's useless, dead code. For STDOUT, you probably don't need it.
  • Disable buffering if it's more important to print messages regularly than efficiently.
  • Don't disable buffering until you know that the buffer is a problem.
  • Disable buffering in the smallest scope possible.