FEAR-less Site Scraping
by Yung-chung Lin

A Full Example

Finally, here is a complete example that submits a query to CPAN, extracts the result records, and stores the extracted data in a SQLite database.

First, create the database schema:

   % cat > schema.sql

   CREATE TABLE cpan (
      module      varchar(64),
      dist        varchar(64),
      link        varchar(256),
      description varchar(128),
      date        varchar(32),
      author      varchar(64),
      url         varchar(256),
      primary key (module)
   );

Then create the database:

   % sqlite3 cpan.db < schema.sql

Now create a class that maps to the database:

   % mkdir lib
   % mkdir lib/CPAN
   % cat > lib/CPAN/DBI.pm
   package CPAN::DBI;
   use base 'Class::DBI::SQLite';
   __PACKAGE__->set_db('Main', 'dbi:SQLite:dbname=cpan.db', '', '');
   __PACKAGE__->set_up_table('cpan');
   1;
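
The scraper that follows stores each record through the find_or_create method that CPAN::DBI inherits from Class::DBI. As a rough illustration of the idea (look for a row, insert it only if it is missing), here is a minimal sketch in plain DBI. It is not how Class::DBI is implemented. The helper name find_or_create_row is hypothetical, the table is a cut-down version of the schema above, the lookup matches on the module name only, and the database is held in memory so that cpan.db stays untouched.

```perl
use strict;
use warnings;
use DBI;

# In-memory SQLite database (assumption: DBD::SQLite is installed).
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, AutoCommit => 1 });

# Cut-down version of the article's schema.
$dbh->do(q{
    CREATE TABLE cpan (
        module varchar(64),
        dist   varchar(64),
        author varchar(64),
        primary key (module)
    )
});

# Hypothetical helper: return the stored row for a module,
# inserting the record first if no such row exists yet.
sub find_or_create_row {
    my ($dbh, $rec) = @_;
    my $select = 'SELECT module, dist, author FROM cpan WHERE module = ?';
    my $row = $dbh->selectrow_hashref($select, undef, $rec->{module});
    return $row if $row;
    $dbh->do('INSERT INTO cpan (module, dist, author) VALUES (?, ?, ?)',
             undef, @{$rec}{qw(module dist author)});
    return $dbh->selectrow_hashref($select, undef, $rec->{module});
}

my $rec = { module => 'Perl::Critic',
            dist   => 'Perl-Critic-0.14',
            author => 'Jeffrey Ryan Thalhammer' };

# The second call finds the existing row instead of inserting again.
find_or_create_row($dbh, $rec) for 1 .. 2;

my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM cpan');
print "rows: $count\n";
```

Because the second call finds the row that the first one inserted, the table ends up with a single row.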

The next part is the CPAN scraper:

   % cat > cpan-scraper.pl
   use lib 'lib';
   use FEAR::API -base;
   use CPAN::DBI;

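   # Fetch the search.cpan.org front page.
   url("http://search.cpan.org/")->();

   # Fill in and submit the search form named 'f'.
   submit_form(form_name => 'f',
               fields => {
                  query => 'perl',
                  mode => 'module',
               });

   # Keep only the markup between the results markers.
   preproc('s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s');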

   template('<!--item-->
  <p><a href="[% link %]"><b>[% module %]</b></a>
<br /><small>[% description %]</small>
<br /><small>   <a href="[% ... %]">[% dist %]</a> -
   <span class=date>[% date %]</span> -
   <a href="/~[% ... %]">[% author %]</a>
</small>
<!--end item-->');

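   # Apply the template to the page; each match becomes one record.
   extract;

   # Store each extracted record unless it is already in the table.
   invoke_handler(sub {
      print "-- Inserting $_->{module}\n";
      CPAN::DBI->find_or_create($_);
   });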
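
To see what the preprocessing and extraction steps do to a page, it can help to try them on a small string outside FEAR::API. The following is a minimal sketch in plain Perl. It reuses the substitution from the script verbatim, but it replaces the template with ordinary regular expressions, which is a stand-in for illustration and not the mechanism FEAR::API uses. The sample markup, the /example/ links, and the names item, keep_results, and extract_records are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical page: two result records between the results markers.
my $page = join '',
    "<html><body>navigation\n",
    "<!--results-->",
    item('/example/Perl-Critic', 'Perl::Critic', 'Perl-Critic-0.14'),
    item('/example/Perl-SAX',    'Perl::SAX',    'Perl-SAX-0.06'),
    "<!--end results-->",
    "\nfooter</body></html>\n";

# Build one simplified result record (made-up markup).
sub item {
    my ($link, $module, $dist) = @_;
    return qq{<!--item--><p><a href="$link"><b>$module</b></a>}
         . qq{<small><a href="/x">$dist</a></small><!--end item-->};
}

# Step 1: the same substitution the script passes to preproc().
sub keep_results {
    my ($html) = @_;
    $html =~ s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s;
    return $html;
}

# Step 2: regex stand-in for the template; one hash per item.
sub extract_records {
    my ($fragment) = @_;
    my @records;
    while ($fragment =~ m{<!--item-->(.*?)<!--end item-->}gs) {
        my $item = $1;
        my ($link, $module) =
            $item =~ m{<a href="([^"]*)"><b>([^<]*)</b></a>} or next;
        my ($dist) = $item =~ m{<small><a href="[^"]*">([^<]*)</a>};
        push @records, { link => $link, module => $module, dist => $dist };
    }
    return @records;
}

my @records = extract_records(keep_results($page));
print "$_->{module} ($_->{dist}) -> $_->{link}\n" for @records;
```

Each hash produced here has the same shape as the records the handler receives in $_, with keys named after the template fields.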

Run it, and then check the database:

   % sqlite3 cpan.db
   sqlite> .mode csv
   sqlite> select module, dist, author from cpan;

If everything goes well, your results will resemble:

   "Perl","PerlInterp-0.03","Ben Morrow"
   "Perl::AfterFork","Perl-AfterFork-0.01","Torsten F&#246;rtsch"
   "Perl::AtEndOfScope","Perl-AtEndOfScope-0.01","Torsten F&#246;rtsch"
   "Perl::BestPractice","Perl-BestPractice-0.01","Adam Kennedy"
   "Perl::Compare","Perl-Compare-0.10","Adam Kennedy"
   "Perl::Critic","Perl-Critic-0.14","Jeffrey Ryan Thalhammer"
   "Perl::Editor","Perl-Editor-0.02","Adam Kennedy"
   "Perl::Metrics","Perl-Metrics-0.05","Adam Kennedy"
   "Perl::MinimumVersion","Perl-MinimumVersion-0.11","Adam Kennedy"
   "Perl::SAX","Perl-SAX-0.06","Adam Kennedy"

Isn't that easy?

Conclusion

FEAR::API is an innovation for site scraping. It combines strong features and powerful methods from various modules, and it also employs operator overloading to build something like a domain-specific language without forbidding the use of Perl's full power. FEAR::API is well suited to the fast creation of scraping scripts. A central dogma of FEAR::API is "Code the least and perform the most."

However, FEAR::API still needs lots of improvement. Currently, it does not handle errors very well, lacks automatic template generation, performs no logging, and has no direct connection to a database mapper such as DBIx::Class or Class::DBI. Even the documentation needs work.

Patches or suggestions are welcome!