Cooking with Perl, Part 3
September 17, 2003
Editor's note: In this third and final batch of recipes excerpted from Perl Cookbook, you'll find solutions and code examples for extracting HTML table data, templating with HTML::Mason, and making simple changes to elements or text.
Sample Recipe: Extracting Table Data
Problem
You have data in an HTML table, and you would like to turn that into a Perl data structure. For example, you want to monitor changes to an author's CPAN module list.
Solution
Use the HTML::TableContentParser module from CPAN:
use HTML::TableContentParser;
$tcp = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);
foreach $table (@$tables) {
    @headers = map { $_->{data} } @{ $table->{headers} };
    # attributes of table tag available as keys in hash
    $table_width = $table->{width};
    foreach $row (@{ $table->{rows} }) {
        # attributes of tr tag available as keys in hash
        foreach $cell (@{ $row->{cells} }) {
            # attributes of td tag available as keys in hash
            $data = $cell->{data};
        }
    }
}
Discussion
The HTML::TableContentParser module converts all tables in the HTML document into a Perl data structure. As with HTML tables, there are three layers of nesting in the data structure: the table, the row, and the data in that row.
Each table, row, and data tag is represented as a hash reference. The hash keys correspond to attributes of the tag that defined that table, row, or cell. In addition, the value for a special key gives the contents of the table, row, or cell. In a table, the value for the rows key is a reference to an array of rows. In a row, the cells key points to an array of cells. In a cell, the data key holds the HTML contents of the data tag.
For example, take the following table:
<table width="100%" bgcolor="#ffffff">
  <tr>
    <td>Larry &amp; Gloria</td>
    <td>Mountain View</td>
    <td>California</td>
  </tr>
  <tr>
    <td><b>Tom</b></td>
    <td>Boulder</td>
    <td>Colorado</td>
  </tr>
  <tr>
    <td>Nathan &amp; Jenine</td>
    <td>Fort Collins</td>
    <td>Colorado</td>
  </tr>
</table>
The parse method returns this data structure:
[
  {
    'width' => '100%',
    'bgcolor' => '#ffffff',
    'rows' => [
      {
        'cells' => [
          { 'data' => 'Larry &amp; Gloria' },
          { 'data' => 'Mountain View' },
          { 'data' => 'California' },
        ],
        'data' => "\n "
      },
      {
        'cells' => [
          { 'data' => '<b>Tom</b>' },
          { 'data' => 'Boulder' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n "
      },
      {
        'cells' => [
          { 'data' => 'Nathan &amp; Jenine' },
          { 'data' => 'Fort Collins' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n "
      }
    ]
  }
]
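Walking this structure is ordinary reference handling. As a minimal sketch, the snippet below flattens one table into an array of rows, each an array of raw cell strings. It assumes a structure of the shape shown above; an abbreviated copy is written out as a literal here so the snippet runs without any CPAN module. The helper name table_to_grid is illustrative and not part of HTML::TableContentParser.

```perl
use strict;
use warnings;

# Abbreviated literal of the parsed structure; in real use this would
# be the return value of HTML::TableContentParser's parse method.
my $tables = [
    {
        width => '100%',
        rows  => [
            { cells => [ { data => 'Larry &amp; Gloria' },
                         { data => 'Mountain View' },
                         { data => 'California' } ] },
            { cells => [ { data => '<b>Tom</b>' },
                         { data => 'Boulder' },
                         { data => 'Colorado' } ] },
        ],
    },
];

# table_to_grid is a hypothetical helper: it returns a reference to an
# array with one array ref per row, holding each cell's raw HTML.
sub table_to_grid {
    my ($table) = @_;
    my @grid;
    foreach my $row (@{ $table->{rows} }) {
        push @grid, [ map { $_->{data} } @{ $row->{cells} } ];
    }
    return \@grid;
}

my $grid = table_to_grid($tables->[0]);
print join(" | ", @$_), "\n" for @$grid;
```

The cell strings come back exactly as they appeared in the HTML, so markup and entities are still present at this stage.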
The data tags still contain tags and entities. If you don't want the tags and entities, remove them by hand using techniques from "Extracting or Removing HTML Tags."
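One rough way to do that by hand, offered as a sketch and not as the recipe's own method: a naive regex drops anything shaped like a tag, then a small hand-written table decodes a few entities. The helper name clean_cell and the %ENTITY table are illustrative. The table covers only four entities and stands in for decode_entities from HTML::Entities, which Example 20-11 uses. The tag regex will misfire on comments or on attribute values containing ">".

```perl
use strict;
use warnings;

# Tiny stand-in entity table (assumption: only these four occur).
my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"');

# clean_cell is a hypothetical helper, not part of any module.
sub clean_cell {
    my ($html) = @_;
    $html =~ s{<[^>]*>}{}g;                        # naive tag removal
    $html =~ s{&(amp|lt|gt|quot);}{$ENTITY{$1}}g;  # single-pass entity decode
    $html =~ s{^\s+|\s+$}{}g;                      # trim surrounding whitespace
    return $html;
}

print clean_cell('<b>Tom</b>'), "\n";          # prints "Tom"
print clean_cell('Larry &amp; Gloria'), "\n";  # prints "Larry & Gloria"
```

Tags are removed before entities are decoded so that an escaped "&lt;b&gt;" in the text is not mistaken for markup.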
Example 20-11 fetches a particular CPAN author's page and displays in plain text the modules they own. You could use this as part of a system that notifies you when your favorite CPAN authors do something new.
Example 20-11: Dump modules for a particular CPAN author
#!/usr/bin/perl -w
# dump-cpan-modules-for-author - display modules a CPAN author owns
use LWP::Simple;
use URI;
use HTML::TableContentParser;
use HTML::Entities;
use strict;

our $URL = shift || 'http://search.cpan.org/author/TOMC/';
my $tables = get_tables($URL);
my $modules = $tables->[4]; # 5th table holds module data
foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) =
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
}

# fetch the page and return a reference to its parsed tables
sub get_tables {
    my $URL = shift;
    my $page = get($URL);
    die "Couldn't fetch $URL\n" unless defined $page;
    my $tcp = HTML::TableContentParser->new;
    return $tcp->parse($page);
}

# pull the name, link, status, and description out of one table row
sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);
    # extract cells
    $module_html = $row->{cells}[0]{data}; # link and name in HTML
    $status = $row->{cells}[1]{data}; # status string and link
    $description = $row->{cells}[2]{data}; # description only
    $status =~ s{<.*?>}{ }g; # naive link removal, works on this simple HTML
    # separate module link and name from html
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL); # resolve relative links
    # decode entities in place
    decode_entities($module_name);
    decode_entities($description);
    return ($module_name, $module_link, $status, $description);
}
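To turn a dump like this into the notification system mentioned earlier, you need some way to compare one run's module list with the previous run's. A minimal sketch follows, assuming the module names from each run have been saved as plain lists. How they are stored, the helper name diff_modules, and the sample names are all assumptions made for illustration.

```perl
use strict;
use warnings;

# diff_modules is a hypothetical helper: given array refs of module
# names from the previous and current runs, it returns references to
# the list of names added and the list of names removed.
sub diff_modules {
    my ($old, $new) = @_;
    my %was = map { $_ => 1 } @$old;
    my %is  = map { $_ => 1 } @$new;
    my @added   = grep { !$was{$_} } @$new;
    my @removed = grep { !$is{$_} } @$old;
    return (\@added, \@removed);
}

# Made-up sample data, not real CPAN output.
my @previous = ('Foo::Bar', 'Foo::Baz');
my @current  = ('Foo::Bar', 'Foo::Quux');

my ($added, $removed) = diff_modules(\@previous, \@current);
print "New: @$added\n"    if @$added;
print "Gone: @$removed\n" if @$removed;
```

Hash lookups keep the comparison simple, and the order of each result follows the order of the list it was drawn from.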
See Also
The documentation for the CPAN module HTML::TableContentParser; http://search.cpan.org/

