Sign In/My Account | View Cart  
advertisement


Listen Print

Cooking with Perl, Part 3

September 17, 2003

Editor's note: In this third and final batch of recipes excerpted from Perl Cookbook, you'll find solutions and code examples for extracting HTML table data, templating with HTML::Mason, and making simple changes to elements or text.

Sample Recipe: Extracting Table Data

Problem

You have data in an HTML table, and you would like to turn that into a Perl data structure. For example, you want to monitor changes to an author's CPAN module list.

Solution

Use the HTML::TableContentParser module from CPAN:

use HTML::TableContentParser;
 
$tcp = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);
 
foreach $table (@$tables) {
  @headers = map { $_->{data} } @{ $table->{headers} };
  # attributes of table tag available as keys in hash
  $table_width = $table->{width};
 
  foreach $row (@{ $tables->{rows} }) {
    # attributes of tr tag available as keys in hash
    foreach $col (@{ $row->{cols} }) {
      # attributes of td tag available as keys in hash
      $data = $col->{data};
    }
  }
}

Discussion

Related Reading

Perl Cookbook
By Tom Christiansen, Nat Torkington

Table of Contents
Index
Sample Chapter

Read Online--Safari Search this book on Safari:
 

Code Fragments only

The HTML::TableContentParser module converts all tables in the HTML document into a Perl data structure. As with HTML tables, there are three layers of nesting in the data structure: the table, the row, and the data in that row.

Each table, row, and data tag is represented as a hash reference. The hash keys correspond to attributes of the tag that defined that table, row, or cell. In addition, the value for a special key gives the contents of the table, row, or cell. In a table, the value for the rows key is a reference to an array of rows. In a row, the cols key points to an array of cells. In a cell, the data key holds the HTML contents of the data tag.

For example, take the following table:

<table width="100%" bgcolor="#ffffff">
  <tr>
    <td>Larry &amp; Gloria</td>
    <td>Mountain View</td>
    <td>California</td>
  </tr>
  <tr>
    <td><b>Tom</b></td>
    <td>Boulder</td>
    <td>Colorado</td>
  </tr>
  <tr>
    <td>Nathan &amp; Jenine</td>
    <td>Fort Collins</td>
    <td>Colorado</td>
  </tr>
</table>

The parse method returns this data structure:

[
  {
    'width' => '100%',
    'bgcolor' => '#ffffff',
    'rows' => [
               {
                'cells' => [
                            { 'data' => 'Larry &amp; Gloria' },
                            { 'data' => 'Mountain View' },
                            { 'data' => 'California' },
                           ],
                'data' => "\n      "
               },
               {
                'cells' => [
                            { 'data' => '<b>Tom</b>' },
                            { 'data' => 'Boulder' },
                            { 'data' => 'Colorado' },
                           ],
                'data' => "\n      "
               },
               {
                'cells' => [
                            { 'data' => 'Nathan &amp; Jenine' },
                            { 'data' => 'Fort Collins' },
                            { 'data' => 'Colorado' },
                           ],
                'data' => "\n      "
               }
              ]
  }
]

The data tags still contain tags and entities. If you don't want the tags and entities, remove them by hand using techniques from "Extracting or Removing HTML Tags."

Previous Articles in this Series

Cooking with Perl
Cooking with Perl, Part 2

Example 20-11 fetches a particular CPAN author's page and displays in plain text the modules they own. You could use this as part of a system that notifies you when your favorite CPAN authors do something new.

Example 20-11: Dump modules for a particular CPAN author

  #!/usr/bin/perl -w
  # dump-cpan-modules-for-author - display modules a CPAN author owns
  use LWP::Simple;
  use URI;
  use HTML::TableContentParser;
  use HTML::Entities;
  use strict;
  our $URL = shift || 'http://search.cpan.org/author/TOMC/';
  my $tables = get_tables($URL);
  my $modules = $tables->[4];    # 5th table holds module data
  foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) = 
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
  } 
  sub get_tables {
    my $URL = shift;
    my $page = get($URL);
    my $tcp = new HTML::TableContentParser;
    return $tcp->parse($page);
  }
  sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);
    # extract cells
    $module_html = $row->{cells}[0]{data};  # link and name in HTML
    $status      = $row->{cells}[1]{data};  # status string and link
    $description = $row->{cells}[2]{data};  # description only
    $status =~ s{<.*?>}{  }g; # naive link removal, works on this simple HTML
    # separate module link and name from html
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL); # resolve relative links
    # clean up entities and tags
    decode_entities($module_name);
    decode_entities($description);
    return ($module_name, $module_link, $status, $description);
  }

See Also

The documentation for the CPAN module HTML::TableContentParser; http://search.cpan.org/

Pages: 1, 2, 3

Next Pagearrow