FEAR-less Site Scraping
by Yung-chung Lin
A Full Example
Finally, here is an example that submits a query to the CPAN search site, extracts the result records, and stores the extracted data in a SQLite database.
First, create the database schema:
% cat > schema.sql
CREATE TABLE cpan (
module varchar(64),
dist varchar(64),
link varchar(256),
description varchar(128),
date varchar(32),
author varchar(64),
url varchar(256),
primary key (module)
);
Then create the database:
% sqlite3 cpan.db < schema.sql
Now create a class that maps to the database:
% mkdir lib
% mkdir lib/CPAN
% cat > lib/CPAN/DBI.pm
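package CPAN::DBI;
use strict;
use warnings;
use base 'Class::DBI::SQLite';
# Connect to the SQLite file created above (no username or password needed).
__PACKAGE__->set_db('Main', 'dbi:SQLite:dbname=cpan.db', '', '');
# Read the columns and primary key of the cpan table from the database itself.
__PACKAGE__->set_up_table('cpan');
1;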
The next part is the CPAN scraper:
% cat > cpan-scraper.pl
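use lib 'lib';
use FEAR::API -base;
use CPAN::DBI;
# Fetch the CPAN search front page.
url("http://search.cpan.org/")->();
# Fill in and submit the search form named 'f'.
submit_form(form_name => 'f',
            fields => {
                query => 'perl',
                mode  => 'module',
            });
# Keep only the part of the page between the result markers.
preproc('s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s');
# Describe one result item: [% name %] captures a field, [% ... %] skips text.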
template('<!--item-->
<p><a href="[% link %]"><b>[% module %]</b></a>
<br /><small>[% description %]</small>
<br /><small> <a href="[% ... %]">[% dist %]</a> -
<span class=date>[% date %]</span> -
<a href="/~[% ... %]">[% author %]</a>
</small>
<!--end item-->');
# Apply the template to the page, then store each extracted record.
extract;
invoke_handler(sub {
    print "-- Inserting $_->{module}\n";
    CPAN::DBI->find_or_create($_);
});
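To make the preproc and template steps less magical, here is a minimal sketch in plain Perl of roughly what they accomplish together: trim the page down to the results region, then pull named fields out of each item. This is not FEAR::API's implementation. The sample page, the Example::* records, and the $item_re pattern are all made up for illustration, and the named captures need Perl 5.10 or later.

```perl
use strict;
use warnings;

# Made-up page fragment shaped like the template in the scraper.
my $page = <<'HTML';
<html><body>navigation
<!--results-->
<!--item-->
<p><a href="/perldoc?Example::Module"><b>Example::Module</b></a>
<br /><small>An example module</small>
<br /><small> <a href="/dist/Example-Module/">Example-Module-0.01</a> -
<span class=date>01 Jan 2006</span> -
<a href="/~aauthor/">A. Author</a>
</small>
<!--end item-->
<!--item-->
<p><a href="/perldoc?Example::Other"><b>Example::Other</b></a>
<br /><small>Another example module</small>
<br /><small> <a href="/dist/Example-Other/">Example-Other-0.02</a> -
<span class=date>02 Jan 2006</span> -
<a href="/~bauthor/">B. Author</a>
</small>
<!--end item-->
<!--end results-->
footer</body></html>
HTML

# Step 1: the same substitution the scraper hands to preproc,
# applied here to a copy of the page.
(my $results = $page)
    =~ s/\A.+<!--results-->(.+)<!--end results-->.+\Z/$1/s;

# Step 2: a hand-written regex standing in for the template.
# Each named capture plays the role of a [% name %] placeholder;
# the unnamed [^"]* parts play the role of [% ... %].
my $item_re = qr{
    <!--item-->\s*
    <p><a\ href="(?<link>[^"]*)"><b>(?<module>[^<]*)</b></a>\s*
    <br\ /><small>(?<description>[^<]*)</small>\s*
    <br\ /><small>\s*<a\ href="[^"]*">(?<dist>[^<]*)</a>\s*-\s*
    <span\ class=date>(?<date>[^<]*)</span>\s*-\s*
    <a\ href="/~[^"]*">(?<author>[^<]*)</a>\s*
    </small>\s*
    <!--end\ item-->
}x;

# Collect one hash per item, keyed by field name.
my @records;
while ($results =~ /$item_re/g) {
    push @records, +{ %+ };
}

print "$_->{module} ($_->{dist}) by $_->{author}\n" for @records;
```

Each element of @records is a hash with the keys link, module, description, dist, date, and author, which is the shape of record the handler in the scraper receives in $_. The template version is far shorter and easier to keep in sync with the page than the regex.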
Run the scraper, and then check the database:
% sqlite3 cpan.db
sqlite> .mode csv
sqlite> select module, dist, author from cpan;
If everything goes well, your results will resemble:
"Perl","PerlInterp-0.03","Ben Morrow"
"Perl::AfterFork","Perl-AfterFork-0.01","Torsten Förtsch"
"Perl::AtEndOfScope","Perl-AtEndOfScope-0.01","Torsten Förtsch"
"Perl::BestPractice","Perl-BestPractice-0.01","Adam Kennedy"
"Perl::Compare","Perl-Compare-0.10","Adam Kennedy"
"Perl::Critic","Perl-Critic-0.14","Jeffrey Ryan Thalhammer"
"Perl::Editor","Perl-Editor-0.02","Adam Kennedy"
"Perl::Metrics","Perl-Metrics-0.05","Adam Kennedy"
"Perl::MinimumVersion","Perl-MinimumVersion-0.11","Adam Kennedy"
"Perl::SAX","Perl-SAX-0.06","Adam Kennedy"
Isn't that easy?
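One detail worth understanding is the find_or_create call in the handler: it is what lets you run the scraper repeatedly without piling up duplicate rows. The sketch below imitates that behavior with plain DBI, assuming DBI and DBD::SQLite are installed. It uses a throwaway in-memory database and a reduced three-column table, and find_or_create_row is a hypothetical helper written for this illustration, not part of Class::DBI or FEAR::API.

```perl
use strict;
use warnings;
use DBI;

# Throwaway in-memory database; the scraper itself writes to cpan.db.
my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                       { RaiseError => 1, PrintError => 0 });

# Reduced version of the cpan table (three columns only).
$dbh->do(q{
    CREATE TABLE cpan (
        module varchar(64),
        dist   varchar(64),
        author varchar(64),
        primary key (module)
    )
});

# Hypothetical helper: returns 1 if a row was inserted,
# 0 if a row matching every supplied field already existed.
# Column names come from the hash keys, so they must be trusted.
sub find_or_create_row {
    my ($dbh, $rec) = @_;
    my @cols  = sort keys %$rec;
    my @vals  = @{$rec}{@cols};
    my $where = join ' AND ', map { "$_ = ?" } @cols;

    my ($hit) = $dbh->selectrow_array(
        "SELECT 1 FROM cpan WHERE $where", undef, @vals);
    return 0 if $hit;

    my $sql = sprintf 'INSERT INTO cpan (%s) VALUES (%s)',
        join(', ', @cols), join(', ', ('?') x @cols);
    $dbh->do($sql, undef, @vals);
    return 1;
}

my $rec = {
    module => 'Perl::Critic',
    dist   => 'Perl-Critic-0.14',
    author => 'Jeffrey Ryan Thalhammer',
};

my $first  = find_or_create_row($dbh, $rec);   # no match yet: inserts
my $second = find_or_create_row($dbh, $rec);   # exact match: no insert
my ($count) = $dbh->selectrow_array('SELECT COUNT(*) FROM cpan');

print "inserted=$first, then inserted=$second, rows=$count\n";
```

Because the lookup in this sketch matches on every supplied field, a record whose dist has changed since the last run would not be found, and the insert would then fail on the module primary key. If you plan to rerun the scraper over time, that case probably deserves explicit handling.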
Conclusion
FEAR::API takes a new approach to site scraping. It combines strong features and powerful methods from various modules, and it uses operator overloading to build something like a domain-specific language without forbidding the use of Perl's full power. This makes FEAR::API well suited to writing scraping scripts quickly. A central dogma of FEAR::API is "Code the least and perform the most."
However, FEAR::API still needs lots of improvement. Currently, it does not handle errors very well, lacks automatic template generation, performs no logging, and has no direct connection to a database mapper such as DBIx::Class or Class::DBI. Even the documentation needs work.
Patches or suggestions are welcome!
Showing messages 1 through 2 of 2.
- What's up with the name
2006-07-19 05:54:38 Mark_Thomas
I don't think I've seen a worse example of a module name. A module named FEAR::API, by existing convention, should be a programming interface to an application or hosted service named FEAR.
- Good idea, but...
2006-06-23 09:56:00 avetiktopchyan
Painful to set up. On Windows, the CPAN command-line installer attempts to install a lot of modules required by modules, required by modules, required by modules etc... you get the idea. Not smooth at all. Installing the required modules using PPM is a little easier, yet the final dependency, IPC::SysV, is not found anywhere and thus the whole thing does not work... Oh, well...



