FEAR-less Site Scraping

by Yung-chung Lin
June 01, 2006

Imagine that your assignment is to fetch all of the web pages of a given website, scrape data from them, and transfer the data to another place, such as a database or plain files. This is a common scenario for data scraping tasks, and CPAN has plenty of modules for the job.

While I was developing site-scraping scripts, retrieving data from some sites of the same type, I realized that I had repeated many identical or very similar code structures, such as:

  fetch_the_homepage();

  while(there_are_some_more_unfetched_links){
     foreach $link (@{links_in_the_current_page}){
         follow_link()          if $link =~ /NEXT_PAGE_OR_SOMETHING/;
         extract_product_spec() if $link =~ /PRODUCT_SPEC_PAGE/;
     }
  }
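As a rough, runnable illustration of that loop, here is a minimal sketch. It is not taken from any of the scripts described here: the `crawl` function, the `%site` hash, and the `follow`/`extract` action names are all hypothetical, and an in-memory hash stands in for real HTTP fetching so that the example runs offline.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": URL => links found on that page.
# A real script would fetch each page over HTTP instead.
my %site = (
    '/'            => [ '/list?page=2', '/product/1', '/about' ],
    '/list?page=2' => [ '/product/2', '/product/1' ],
    '/product/1'   => [],
    '/product/2'   => [],
    '/about'       => [],
);

# crawl($start, $fetch_links, $rules) walks the site breadth-first.
# $rules is a list of [ regex, action ] pairs; the first match wins.
sub crawl {
    my ($start, $fetch_links, $rules) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);
    my @extracted;

    while (defined(my $url = shift @queue)) {
        for my $link (@{ $fetch_links->($url) }) {
            next if $seen{$link}++;          # handle each link only once
            for my $rule (@$rules) {
                my ($pattern, $action) = @$rule;
                next unless $link =~ $pattern;
                push @queue,     $link if $action eq 'follow';
                push @extracted, $link if $action eq 'extract';
                last;
            }
        }
    }
    return @extracted;
}

my @products = crawl(
    '/',
    sub { $site{ $_[0] } || [] },            # stand-in for a page fetcher
    [
        [ qr/page=\d+/,      'follow'  ],    # "next page" style links
        [ qr{^/product/\d+}, 'extract' ],    # product spec pages
    ],
);
print "$_\n" for @products;
```

Keeping the rules in a list of pattern/action pairs means the loop itself never changes from site to site; only the rules and the fetcher do.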

The Usual Tools

At the very beginning, I created scripts using LWP::Simple, LWP::UserAgent, and vanilla regular expressions to extract links and product details. As the number of scripts grew, I needed more powerful resources, so I started to use WWW::Mechanize for web page fetching and Regexp::Bind, Template::Extract, HTML::LinkExtractor, Regexp::Common, etc. for data scraping. Even then, I still found many redundancies.

A scraping script first needs to load the essential modules for the site scraping task. Second, it may need to instantiate objects. Third, site scraping involves many interactions among different modules, mostly by passing data between them. After you fetch a page, you may need to pass the page to HTML::LinkExtractor to extract links, pass it to Template::Extract to get detailed information, or save it to a file. You may then store the extracted data in a relational database. Given all of this, creating a site scraping script is very time-consuming, and it often produces a lot of duplicated code.

Thus, I tried to fuse some modules together, hoping to save some of my keystrokes and simplify the coding process.

An Example Using WWW::Mechanize and Template::Extract

Here's a typical site scraping script structure:

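     use strict;
     use warnings;
     use YAML;
     use Data::Dumper;
     use WWW::Mechanize;
     use Template::Extract;

     # Fetch the homepage.
     my $mech = WWW::Mechanize->new();
     $mech->get( "http://search.cpan.org" );

     # Placeholder template; replace it with one that matches the page.
     my $template = '<title>[% title %]</title>';

     # Extract data with the template. extract() returns a hash
     # reference, or undef if the template does not match.
     my $ext = Template::Extract->new;
     my @result = $ext->extract($template, $mech->content);
     print Dumper \@result;

     # Process links; each one is a WWW::Mechanize::Link object.
     my @link;
     foreach ($mech->links){
         if( $_->url =~ /foo/ ) {
            $mech->get($_->url);
         }
         elsif( $_->url =~ /bar/ ) {
            push @link, $_->url;
         }
         else {
            sub { 'do something here' }->($_->url);
         }
     }
     print $mech->content;
     print Dumper \@link;
     foreach (@result){
        print YAML::Dump $_;
     }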

This program does several things:

  • Fetch the search.cpan.org homepage.
  • Extract data with a template.
  • Process links using a control structure.
  • Print fetched content to STDOUT.
  • Dump the collected links.
  • Use YAML to print the extracted results.
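To make the template step more concrete, here is a small sketch of what Template::Extract does with a template and a document. The template, the sample markup, and the `item`, `url`, and `name` field names are illustrative assumptions, not taken from search.cpan.org; a hard-coded string stands in for `$mech->content` so the example runs without network access.

```perl
use strict;
use warnings;
use Template::Extract;

# Hypothetical template: literal markup anchors the match, and each
# [% name %] directive captures the text found at that position.
my $template = <<'END_TEMPLATE';
<ul>[% FOREACH item %]<li><a href="[% url %]">[% name %]</a></li>[% END %]</ul>
END_TEMPLATE

# Stand-in for $mech->content (made-up markup).
my $document = <<'END_DOCUMENT';
<ul><li><a href="/dist/Foo">Foo</a></li><li><a href="/dist/Bar">Bar</a></li></ul>
END_DOCUMENT

my $data = Template::Extract->new->extract($template, $document);

# extract() returns a hash reference, or undef when the template
# does not match the document.
if ($data) {
    printf "%s -> %s\n", $_->{name}, $_->{url} for @{ $data->{item} };
}
else {
    warn "template did not match the document\n";
}
```

Each FOREACH block in the template becomes an array reference of hash references in the result, which is why the output can go straight to Data::Dumper or YAML.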

If you need to create just one or two throwaway scripts, copying and pasting code is acceptable. Things become messy if the job is to create a hundred scripts and you are still copying and pasting.
