FEAR-less Site Scraping
by Yung-chung Lin

What Do I Need?

One common approach is to gather the repeated code blocks in one place and build scripts by loading different components for different purposes. Instead, I worked on the interface: I wanted to simplify the problem through language. I re-examined the routine and the code structure to identify the distinct steps that appear in every site scraping script:

  • Manually load lots of modules.
  • Create a WWW agent.
  • Create an extractor object.
  • Process links using a control structure.
  • Perform extraction.
  • Process extracted results.

I searched CPAN for something related to my ideas. I found plenty of modules for site scraping and data extraction, but no module that could meet my needs.

Then I created FEAR::API.

Use FEAR::API

FEAR::API's documentation says:

FEAR::API is a tool that helps reduce your time creating site scraping scripts and helps you do it in a much more elegant way. FEAR::API combines many strong and powerful features from various CPAN modules, such as LWP::UserAgent, WWW::Mechanize, Template::Extract, Encode, HTML::Parser, etc., and digests them into a deeper Zen.

It might be best to introduce FEAR::API by rewriting the previous example:

   1    use FEAR::API -base;
   2    url("search.cpan.org");
   3    fetch >> [
   4      qr(foo) => _feedback,
   5      qr(bar) => \my @link,
   6      qr()    => sub { 'do something here' }
   7    ];
   8    fetch while has_more_links;
   9    extmethod('Template::Extract');
  10    extract($template);
  11    print Dumper extresult;
  12    print document->as_string;
  13    print Dumper \@link;
  14    invoke_handler('YAML');

Line 1 loads FEAR::API. The -base argument means the package is a subclass of FEAR::API. The module automatically instantiates $_ as a FEAR::API object.

Line 2 specifies the URL. The code will later fetch this URL by calling fetch(), but you can use fetch( $the_url ), too.

Line 3 fetches the home page of search.cpan.org, the URL set in line 2. >> is an overloaded operator for dispatching links. The array reference that follows it (lines 4 to 6) contains pairs of (regular expression => action). An action can be a code ref, an array ref, or a _feedback or _self constant.

FEAR::API maintains a queue of links. Using _feedback or _self means that FEAR::API should put the link in a queue for fetching later if the link matches a certain regular expression.

Line 8 calls has_more_links, so FEAR::API checks if the internal link queue has, well, more links. The program will continue fetching if there are queued links.
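To make the link dispatching behind lines 3 to 8 more concrete, here is a minimal sketch in plain Perl. It is not FEAR::API's implementation: the %site hash stands in for real pages, the 'feedback' string stands in for the _feedback constant, and letting every matching rule fire (rather than only the first) is an assumption.

```perl
use strict;
use warnings;

# Simulated site: each page maps to the links found on it (no network access).
my %site = (
    '/'      => [ '/foo/1', '/bar/a', '/about' ],
    '/foo/1' => [ '/foo/2', '/bar/b' ],
    '/foo/2' => [ '/bar/a' ],
);

my @queue = ('/');        # links waiting to be fetched
my %seen  = ('/' => 1);   # guards against queueing a link twice
my (@bar_links, @noted, @visited);

# Ordered (pattern => action) pairs, mirroring the shape used with >>.
# 'feedback' is a hypothetical marker, not FEAR::API's _feedback constant.
my @rules = (
    qr/foo/   => 'feedback',                    # queue for fetching later
    qr/bar/   => \@bar_links,                   # collect into an array
    qr/about/ => sub { push @noted, $_[0] },    # run arbitrary code
);

# Apply every matching rule to one link (assumption: all matches fire).
sub dispatch {
    my ($link) = @_;
    for (my $i = 0; $i < @rules; $i += 2) {
        my ($pattern, $action) = @rules[$i, $i + 1];
        next unless $link =~ $pattern;
        if    (ref $action eq 'CODE')  { $action->($link) }
        elsif (ref $action eq 'ARRAY') { push @$action, $link }
        elsif ($action eq 'feedback')  { push @queue, $link unless $seen{$link}++ }
    }
    return;
}

# Equivalent in spirit to "fetch while has_more_links".
while (@queue) {
    my $page = shift @queue;
    push @visited, $page;
    dispatch($_) for @{ $site{$page} || [] };
}

print "visited: @visited\n";
print "bar links: @bar_links\n";
```

The loop stops once the queue is empty, which is the condition has_more_links reports on in the real module.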

Line 9 specifies the extraction method. The default method is Template::Extract.

Line 10 extracts data according to $template.

Line 11 dumps the extracted results to STDOUT. FEAR::API even exports Dumper() for you. For YAML fans, there is also Dump().

Line 12 accesses the fetched content through the object returned from document. You need to invoke as_string() to stringify the data. By the way, each fetched document is automatically converted to UTF-8 for you, which is very useful when processing multilingual text.
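As a rough illustration of what converting a fetched document to UTF-8 involves, here is a sketch that uses the core Encode module directly. The ISO-8859-1 source charset and the sample bytes are assumptions made for the example; how FEAR::API itself detects a page's encoding is not shown here.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a server that declares ISO-8859-1
# (assumed charset; "\xE9" is e-acute in that encoding).
my $raw = "caf\xE9";

# Step 1: turn the raw bytes into a Perl character string.
my $text = decode('ISO-8859-1', $raw);

# Step 2: serialize the characters as UTF-8 bytes.
my $utf8 = encode('UTF-8', $text);

printf "characters: %d, UTF-8 bytes: %d\n", length($text), length($utf8);
```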

Line 14 invokes the result handler to do data processing. The argument can be a subref, a module's name, YAML, or Data::Dumper.

Comparison

I hope you can now see what FEAR::API improves, at least in code size. FEAR::API encapsulates many modules, so you don't need to wire them together yourself. All you need to do is tell FEAR::API which page to fetch, what to extract, and how to handle the links on the page and the results extracted from it. You don't need to initialize a WWW agent, convert the encoding of a fetched page, create an extractor object, pass content to the extractor, or write control structures for link processing. All of that happens inside FEAR::API or through its simple syntax.

At first sight, perhaps you don't even realize that the example script uses OO. If you don't like things to happen so automatically, you may choose to drop the -base option. Then you have to create FEAR::API objects manually using fear():

   use FEAR::API;
   my $f = fear();

One of the goals of FEAR::API is to weed out redundancies and minimize code size. Writing $_->blah_blah('blah') throughout a scraping script is cumbersome, especially when you have to create scripts in bulk. I decided to drop the $_-> prefix while keeping the OO design underneath.
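One way such a prefix-free syntax can sit on top of an object is to install plain functions that forward to whatever object is stored in $_. The sketch below shows that general technique only; Tiny::Scraper and its methods are hypothetical, and this is not a claim about how FEAR::API is implemented internally.

```perl
use strict;
use warnings;

package Tiny::Scraper;   # hypothetical class, for illustration only

sub new {
    my ($class) = @_;
    return bless { url => undef, log => [] }, $class;
}

# Set (or read) the current URL.
sub url {
    my ($self, $u) = @_;
    $self->{url} = $u if defined $u;
    return $self->{url};
}

# Pretend to fetch: just record what would have been requested.
sub fetch {
    my ($self) = @_;
    push @{ $self->{log} }, "fetch $self->{url}";
    return $self;
}

package main;

# Install url() and fetch() as plain functions that forward to the object in $_.
BEGIN {
    no strict 'refs';
    for my $name (qw(url fetch)) {
        *{"main::$name"} = sub { $_->$name(@_) };
    }
}

$_ = Tiny::Scraper->new;     # the implicit "current" object

url('search.cpan.org');      # same as $_->url('search.cpan.org')
fetch();                     # same as $_->fetch

print "$_->{log}[0]\n";
```

The calls read like procedural code, yet every one of them is still a method call on the object held in $_.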
