FEAR-less Site Scraping
by Yung-chung Lin

More Features

FEAR::API borrows many features from other successful modules, so you can often use it as an alternative to them without changing much code.

If you use LWP::Simple:

   use LWP::Simple;
   get("http://search.cpan.org");
   getprint("http://search.cpan.org");
   getstore("http://search.cpan.org", 'cpan.html');

With FEAR::API:

   use FEAR::API;
   get("http://search.cpan.org");
   getprint("http://search.cpan.org");
   getstore("http://search.cpan.org", 'cpan.html');

If you are familiar with curl, you may use:

   $ curl http://site.{one,two,three}.com
   # and
   $ curl ftp://ftp.numericals.com/file[1-100].txt

In FEAR::API, use Template-Toolkit:

   url("[% FOREACH number = ['one','two','three'] %]
        http://site.[% number %].com
        [% END %]");
   fetch while has_more_links;

   # and

   url("[% FOREACH number = [1..100] %]
        ftp://ftp.numericals.com/file[% number %].txt
        [% END %]");
   fetch while has_more_links;
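
To make the expansion concrete, here is a minimal plain-Perl sketch that builds the same URL lists the curl globs and the templates above describe. It uses neither FEAR::API nor Template-Toolkit, and the variable names are purely illustrative.

```perl
use strict;
use warnings;

# Same set as http://site.{one,two,three}.com
my @sites = map { "http://site.$_.com" } qw(one two three);

# Same set as ftp://ftp.numericals.com/file[1-100].txt
my @files = map { "ftp://ftp.numericals.com/file$_.txt" } 1 .. 100;

print "$_\n" for @sites;
print scalar(@files), " file URLs, from $files[0] to $files[-1]\n";
```

Each list holds one URL per template iteration, which is presumably what the fetch loop then works through one link at a time.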

FEAR::API also supports WWW::Mechanize methods. You can use submit_form(), links(), and follow_link() in FEAR::API just as you would in WWW::Mechanize.

Submitting a query is easy:

   fetch("http://search.cpan.org");
   submit_form(
               form_name => 'f',
               fields => {
                    query => 'perl'
               });
   template($template); # specify template
   extract;

Dumping links is also easy:

   print Dumper fetch("http://search.cpan.org/")->links;

So is following links:

   fetch("http://search.cpan.org/")->follow_link(n => 3);

Cleaning Up Content

You may use HTML::Strip or basic regular expressions to strip HTML code in fetched content or in extracted results, but FEAR::API provides two simple methods: preproc() and postproc(). (There are also aliases: doc_filter() and result_filter().)

Without FEAR::API, you might process documents with code resembling:

   use LWP::Simple;
   use HTML::Strip;

   my $content = get("http://search.cpan.org");
   my $hs = HTML::Strip->new();
   print $hs->parse( $content );

Things are easier in FEAR::API:

   fetch("search.cpan.org");
   preproc(use => 'html_to_null');
   print document->as_string;

If you don't use FEAR::API for postprocessing, your code might be:

   use Data::Dumper;
   use LWP::Simple;
   use Template::Extract;
   my $extor = Template::Extract->new;
   my $content = get("http://search.cpan.org");
   my $result = $extor->extract($template, $content);
   foreach my $r (@$result){
      foreach (values %$r){
        s/(?:<[^>]*>)+/ /g;
      }
   }
   print Dumper $result;

FEAR::API is simpler:

   fetch("search.cpan.org");
   extract($template);
   postproc('s/(?:<[^>]*>)+/ /g;');
   print Dumper extresult;

You can apply preproc() and postproc() to the data as many times as you need until the results are satisfactory.
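
The idea of applying several cleanup passes in sequence can be sketched in plain Perl. This is only an illustration of the concept, not FEAR::API's implementation: apply_filters() is a hypothetical helper, and the sample records are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: run each filter, in order, over every field
# of every extracted record. Each filter modifies $_ in place.
sub apply_filters {
    my ($results, @filters) = @_;
    for my $record (@$results) {
        for my $filter (@filters) {
            $filter->() for values %$record;
        }
    }
    return $results;
}

# Made-up data standing in for extracted results.
my $results = [
    { title => '<b>Hello</b>', body => '<p>scraped<br>text</p>' },
];

apply_filters(
    $results,
    sub { s/(?:<[^>]*>)+/ /g },    # replace runs of tags with one space
    sub { s/^\s+|\s+$//g },        # trim leading and trailing whitespace
);

print "$results->[0]{title}: $results->[0]{body}\n";
```

Because each filter works on the output of the one before it, you can keep adding passes until the fields look the way you want.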

More Overloaded Operators

The previous examples have used the dispatch_links operator (>>). There are more overloaded operators that you can use to reduce your code size further.

   print document->as_string;

is equivalent to:

   print $$_;

Likewise:

   print Dumper extresult;

is equivalent to:

   print Dumper \@$_;

Calling the object as a function fetches the URL:

   url("search.cpan.org")->();

is equivalent to:

   url("search.cpan.org");
   fetch;

To save the fetched content in a scalar:

   my $cont = fetch("search.cpan.org")->document->as_string;

is equivalent to:

   fetch("search.cpan.org") > $cont;

To append it to an array:

   push my @cont, fetch("search.cpan.org")->document->as_string;

is equivalent to:

   fetch("search.cpan.org") > \my @cont;
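
If you wonder how shortcuts such as $$_, @$_, and ->() can work on an object, Perl's overload pragma allows a class to define what dereferencing means. The following is a minimal sketch of that mechanism under stated assumptions: Demo::Page is a hypothetical class, its fetch() fakes the download so the code runs offline, and none of this is taken from FEAR::API's source.

```perl
package Demo::Page;    # hypothetical class, not part of FEAR::API
use strict;
use warnings;

use overload
    '${}' => sub { \$_[0]->{document} },    # $$obj   gives the document text
    '@{}' => sub { $_[0]->{results} },      # @$obj   gives the extracted results
    '&{}' => sub { my $self = shift; sub { $self->fetch } };    # $obj->() fetches

sub new {
    my ($class, $url) = @_;
    return bless { url => $url, document => '', results => [] }, $class;
}

# Stand-in for a real HTTP request so the sketch runs offline.
sub fetch {
    my $self = shift;
    $self->{document} = "<html>fake page for $self->{url}</html>";
    $self->{results}  = [ { url => $self->{url} } ];
    return $self;
}

package main;

my $page = Demo::Page->new('search.cpan.org');
$page->();                               # same as $page->fetch
print $$page, "\n";                      # same as $page->{document}
print scalar(@$page), " result(s)\n";    # same as @{ $page->{results} }
```

The object is a blessed hash, so the ordinary $self->{...} accesses inside the class are unaffected by the scalar, array, and code dereference overloads.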

Filtering Syntax

FEAR::API creates something like shell piping. You can continually pass data through a series of filters to get what you need.

   url("search.cpan.org")->()
     | _preproc(use => 'html_to_null')
     | _template($template)
     | _postproc('tr/a-z/A-Z/')
     | _foreach_result({ print Dumper $_ });

This is equivalent to:

   url("search.cpan.org")->();
   preproc(use => 'html_to_null');
   template($template);
   extract;
   postproc('tr/a-z/A-Z/');
   foreach (@{extresult()}){
     print Dumper $_;
   }
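
One way a pipe-like syntax can be built in Perl is by overloading the | operator so that each stage receives the output of the previous one. The sketch below shows that technique in isolation. Demo::Pipe and the three filters are hypothetical, and FEAR::API's real filters may well be implemented differently.

```perl
package Demo::Pipe;    # hypothetical class, not part of FEAR::API
use strict;
use warnings;

# "$pipe | $filter" runs the filter on the current data and returns
# a new pipe, so stages can be chained left to right.
use overload '|' => sub {
    my ($self, $filter) = @_;
    return Demo::Pipe->new( $filter->( $self->{data} ) );
};

sub new {
    my ($class, $data) = @_;
    return bless { data => $data }, $class;
}

sub data { $_[0]->{data} }

package main;

# Each stage is a plain code reference taking and returning a string.
my $strip_tags = sub { my $s = shift; $s =~ s/(?:<[^>]*>)+/ /g; $s };
my $upcase     = sub { my $s = shift; $s =~ tr/a-z/A-Z/;        $s };
my $trim       = sub { my $s = shift; $s =~ s/^\s+|\s+$//g;     $s };

my $out = Demo::Pipe->new('<p>hello<br>cpan</p>')
    | $strip_tags
    | $upcase
    | $trim;

print $out->data, "\n";
```

Returning a fresh object from each stage keeps the chain free of side effects, at the cost of one small allocation per filter.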
