Sign In/My Account | View Cart  
advertisement


Listen Print Discuss

Massive Data Aggregation with Perl
by Fred Moyer | Pages: 1, 2, 3

Mechanization of Data Collection

A major focus of the application was retrieving data from remote sources. Certain vendors used our secure FTP site to send us data, but most had web and FTP sites to which they posted the information. We needed a way to collect data from those servers. Some vendors were able to provide data to us in XML and RDF formats, but for the most part, we would receive data in CSV, TSV, or some form of XML. Each vendor generally had supplementary data beyond the normal voter data fields which we also wanted to capture. Using that additional data was not an immediate need, but by storing it in RDF format we could extract it and generate SQL warehouses whenever necessary.

We developed a part of the application known as the spider and created a database table containing information on the data source authentication, protocol, and data structure details. A factory class KSYNC::Model::Spider read the data source entries and constructed spider objects for each data source. These spiders used Net::FTP and LWP to retrieve poll data, and processed the data using the appropriate KSYNC::SAX machine. To add a new data source to our automated collection system, an entry in the database configured the spider, and if the new data source had data in a format that we did not support, we added a SAX machine for that data source.

An example of spider usage:

package KSYNC::Model::Spider;

use strict;
use warnings;

use Carp 'croak';
use base 'KSYNC::Model';

sub new {
    my ($class, %args) = @_;
    
    # Create an FTP or HTTP spider based on the type specified in %args
    my $spider_pkg = $class->_factory($args{type});
    my $self = $spider_pkg->new(%args);

    return $self;
}

sub _factory {
    my ($class, $type) = @_;

    # Create the package name for the spider type
    my $pkg = join '::', $class, $type;
    
    # Load the package
    eval "use $pkg";
    croak("Error loading factory module: $@") if $@;

    return $pkg;
}

1;

package KSYNC::Model::Spider::FTP;

use Net::FTP;
use KSYNC::Exception;

sub new {
    my ($class, %args) = @_;
    
    my $self = { %args };

    # Load the appropriate authentication package via Spider::Model::Auth 
    # factory class
    $self->{auth} = Spider::Model::Auth->new(%{$args{auth}});

    return bless $self, $class;
}

sub authenticate {
    my $self = shift;
    
    # Login
    eval { $self->ftp->login($self->auth->username, $self->auth->password); };
     
    # Throw an exception if problems occurred
    KSYNC::Exception->throw("Cannot login ", $self->ftp->message) if $@;
}

sub crawl {
    my $self = shift;
    
    # Set binary retrieval mode
    $self->ftp->binary;

    # Find new poll data
    my @datasets = $self->_find_new();

    # Process that poll data
    foreach my $dataset (@datasets) {
        eval { $self->_process($dataset); };
        $self->error("Could not process poll data $dataset->id", $@) if $@;
    }
}

sub ftp { 
    croak("Method Not Implemented!") if @_ > 1; 
    $_[0]->{ftp} ||= Net::FTP->new($self->auth->host); 
}

1;

#!/usr/bin/env perl

use strict;
use warnings;

use KSYNC::Model::Spider;
use KSYNC::Model::Vendor;

# Retrieve a vendor so we can grab their latest data
my $vendor = KSYNC::Model::Vendor->retrieve({ 
  name => 'Voter Data, Inc.',
});

# Construct a spider to crawl their site
my $spider = KSYNC::Model::Spider->new({ type => $vendor->type });

# Login
$spider->login();

# Grab the data
$spider->crawl();

# Logout
$spider->logout();

1;

Conclusions

In this project, getting things done was of paramount importance. Perl allowed us to deal with the complexities of the business requirements and the technical details of data schemas and formats without presenting additional technical obstacles, as programming languages occasionally do. The CPAN, mod_perl, and libapreq provided the components that allowed us to quickly build an application to deal with complex, semi-structured data on an enterprise scale. From creating a user friendly web application to automating data collection and SQL warehouse generation, Perl was central to the success of this project.

Credits

Thanks to the following people who made this possible and contributed to this project: Thomas Burke, Charles Frank, Lyle Brooks, Lina Brunton, Aaron Ross, Alan Julson, Marc Schloss, and Robert Vadnais.

Thanks to Plus Three LP for sponsoring work on this project.