FEAR-less Site Scraping
by Yung-chung Lin
More Features
FEAR::API incorporates many features from other successful modules, so you can use it as an alternative to them.
If you use LWP::Simple:
use LWP::Simple;
get("http://search.cpan.org");
getprint("http://search.cpan.org");
getstore("http://search.cpan.org", 'cpan.html');
With FEAR::API:
use FEAR::API;
get("http://search.cpan.org");
getprint("http://search.cpan.org");
getstore("http://search.cpan.org", 'cpan.html');
If you are familiar with curl, you may use:
$ curl http://site.{one,two,three}.com
# and
$ curl ftp://ftp.numericals.com/file[1-100].txt
In FEAR::API, use Template-Toolkit:
url("[% FOREACH number = ['one','two','three'] %]
http://site.[% number %].com
[% END %]");
fetch while has_more_links;
# and
url("[% FOREACH number = [1..100] %]
ftp://ftp.numericals.com/file[% number %].txt
[% END %]");
fetch while has_more_links;
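To see which URLs the two templates above stand for, here is a minimal sketch in plain Perl that builds the same lists without Template Toolkit or FEAR::API. The helper expand_urls() is a hypothetical name used only for this illustration; it is not part of either module.

```perl
use strict;
use warnings;

# Hypothetical helper: substitute each value into a URL pattern
# that contains a single %s placeholder.
sub expand_urls {
    my ($pattern, @values) = @_;
    return map { sprintf $pattern, $_ } @values;
}

# Counterpart of curl's {one,two,three} list.
my @sites = expand_urls('http://site.%s.com', qw(one two three));

# Counterpart of curl's [1-100] range.
my @files = expand_urls('ftp://ftp.numericals.com/file%s.txt', 1 .. 100);

print "$_\n" for @sites;
print scalar(@files), " file URLs, from $files[0] to $files[-1]\n";
```

The template version does the same expansion inside url(), and fetch while has_more_links then works through the resulting list.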
FEAR::API also supports WWW::Mechanize methods. You can use submit_form(), links(), and follow_link() in FEAR::API just as you would in WWW::Mechanize.
Submitting a query is easy:
fetch("http://search.cpan.org");
submit_form(
    form_name => 'f',
    fields    => {
        query => 'perl'
    });
template($template); # specify template
extract;
Dumping links is also easy:
print Dumper fetch("http://search.cpan.org/")->links;
So is following links:
fetch("http://search.cpan.org/")->follow_link(n => 3);
Cleaning Up Content
You may use HTML::Strip or basic regular expressions to strip HTML markup from fetched content or from extracted results, but FEAR::API provides two simple methods for this: preproc() and postproc(). (They also have aliases: doc_filter() and result_filter().)
Without FEAR::API, you might strip the markup from a document with code like this:
use LWP::Simple;
use HTML::Strip;
my $content = get("http://search.cpan.org");
my $hs = HTML::Strip->new();
print $hs->parse( $content );
Things are easier in FEAR::API:
fetch("search.cpan.org");
preproc(use => 'html_to_null');
print document->as_string;
Without FEAR::API, postprocessing the extracted results might look like this:
use Data::Dumper;
use LWP::Simple;
use Template::Extract;
my $extor = Template::Extract->new;
my $content = get("http://search.cpan.org");
my $result = $extor->extract($template, $content);
foreach my $r (@$result){
    foreach (values %$r){
        s/(?:<[^>]*>)+/ /g;
    }
}
print Dumper $result;
FEAR::API is simpler:
fetch("search.cpan.org");
extract($template);
postproc('s/(?:<[^>]*>)+/ /g;');
print Dumper extresult;
You can apply preproc() and postproc() to the data as many times as you need until the results are satisfactory.
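To make the effect of that substitution concrete, here is a small standalone sketch in plain Perl. The record and its field names are made up for illustration, and the final trimming step is an addition that the examples above do not perform.

```perl
use strict;
use warnings;

# Hypothetical extracted record; the field names are invented.
my %record = (
    title  => '<p><b>Example</b></p>',
    author => '<a href="#">Some Author</a>',
);

for my $value (values %record) {
    # Replace each run of consecutive tags with a single space,
    # exactly as the substitution passed to postproc() does.
    $value =~ s/(?:<[^>]*>)+/ /g;

    # Extra step, not in the article: trim the spaces left at the edges.
    $value =~ s/^\s+|\s+$//g;
}

print "$_: $record{$_}\n" for sort keys %record;
```

Because values() returns aliases to the hash values, the loop edits the record in place, which is the same trick the longer Template::Extract example relies on.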
More Overloaded Operators
The previous examples have used the dispatch_links operator (>>). There are more overloaded operators that you can use to reduce your code size further.
print document->as_string;
is equivalent to:
print $$_;

print Dumper extresult;
is equivalent to:
print Dumper \@$_;

url("search.cpan.org")->();
is equivalent to:
url("search.cpan.org");
fetch;

my $cont = fetch("search.cpan.org")->document->as_string;
is equivalent to:
fetch("search.cpan.org") > $cont;

push my @cont, fetch("search.cpan.org")->document->as_string;
is equivalent to:
fetch("search.cpan.org") > \my @cont;
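If you wonder how an object can respond to $$_, @$_, and ->(), Perl's overload pragma allows a class to define its own dereferencing behavior. The following is a rough sketch of the idea under that assumption. Sketch::Scraper is a hypothetical class written for this illustration, its fetch() only fakes a download, and none of it is taken from FEAR::API's source.

```perl
use strict;
use warnings;

# Hypothetical class for illustration only; not FEAR::API's implementation.
package Sketch::Scraper;

use overload
    '${}' => sub { \ $_[0]->{document} },   # $$obj  gives the document text
    '@{}' => sub { $_[0]->{results} },      # @$obj  gives the result list
    '&{}' => sub {                          # $obj->() triggers a fetch
        my $self = shift;
        return sub { $self->fetch };
    };

sub new {
    my ($class, $url) = @_;
    return bless { url => $url, document => '', results => [] }, $class;
}

# Stand-in for a real network fetch: it only records a fake document.
sub fetch {
    my $self = shift;
    $self->{document} = "<html>content of $self->{url}</html>";
    $self->{results}  = [ { url => $self->{url} } ];
    return $self;
}

package main;

my $scraper = Sketch::Scraper->new('search.cpan.org');
$scraper->();                              # same as $scraper->fetch
print $$scraper, "\n";                     # same as $scraper->{document}
print scalar(@$scraper), " result(s)\n";   # size of $scraper->{results}
```

The object is a blessed hash and hash dereferencing is left alone, so the methods can still reach their own fields with the usual ->{...} syntax.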
Filtering Syntax
FEAR::API also offers a syntax that resembles shell pipes. You can pass data through a series of filters, one after another, until you get what you need.
url("search.cpan.org")->()
| _preproc(use => 'html_to_null')
| _template($template)
| _postproc('tr/a-z/A-Z/')
| _foreach_result({ print Dumper $_ });
This is equivalent to:
url("search.cpan.org")->();
preproc(use => 'html_to_null');
template($template);
extract;
postproc('tr/a-z/A-Z/');
foreach (@{extresult()}){
print Dumper $_;
}
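A pipe-like syntax of this kind can be built in Perl by overloading the | operator so that each stage runs a filter and hands the same object to the next stage. Here is a minimal sketch of that technique. Sketch::Pipe, strip_tags(), and upcase() are hypothetical names invented for this example, and the code makes no claim about how FEAR::API implements its own filters.

```perl
use strict;
use warnings;

# Hypothetical class for illustration only; not FEAR::API's implementation.
package Sketch::Pipe;

# "$pipe | $filter" runs the filter (a code ref) on the pipe object and
# returns the pipe, so that more "| $filter" stages can follow.
use overload '|' => sub {
    my ($self, $filter) = @_;
    $filter->($self);
    return $self;
};

sub new {
    my ($class, $text) = @_;
    return bless { text => $text }, $class;
}

package main;

# Filter constructors: each returns a code ref that edits the pipe's text.
sub strip_tags { return sub { $_[0]{text} =~ s/(?:<[^>]*>)+/ /g } }
sub upcase     { return sub { $_[0]{text} =~ tr/a-z/A-Z/ } }

# The | operator is left-associative, so the filters run left to right.
my $pipe = Sketch::Pipe->new('<p>hello <b>world</b></p>')
    | strip_tags()
    | upcase();

print $pipe->{text}, "\n";
```

Each stage returns the object it received, which is what lets the chain read from left to right like a shell pipeline.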

