Analyzing HTML with Perl
by Kendrew Lau
Processing Multiple Files
These methods provide enough HTML parsing capability to grade the web page assignments. The grading program first builds a tree structure from each HTML file and stores the trees in the array @trees:
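    my @trees;
    foreach my $file (@files) {
        print " building tree for $file ...\n" if $options{v};
        my $tree = HTML::TreeBuilder->new;
        # parse_file() returns undef when the file cannot be opened
        $tree->parse_file($file) or warn "cannot read $file: $!\n";
        push @trees, $tree;    # keep @trees index-aligned with @files
    }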
The subroutine doitem() iterates through the array of trees, applying a passed-in code block to look for particular HTML elements in each tree and accumulating the number of elements the code block returns. To provide detailed information and ease debugging during development, it calls the convenience subroutine printd() to display the HTML elements found, with their corresponding file name, when the verbose command-line switch (-v) is set. The program invokes doitem() once for each kind of element in the requirements.
    sub doitem {
        my $func = shift;
        my $num  = 0;
        foreach my $i ( 0 .. $#files ) {
            my @elements = $func->( $files[$i], $trees[$i] );
            printd $files[$i], @elements;
            $num += @elements;
        }
        return $num;
    }
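The accumulation pattern in doitem() can be tried without any HTML at all. The following is a minimal sketch, not the grading program itself: the "trees" are plain lists of tag names, and this printd() is a hypothetical stand-in for the article's helper, which is not shown in this section.

```perl
use strict;
use warnings;

my %options = ( v => 1 );                 # pretend -v was given
my @files   = ( "a.html", "b.html" );
# Stand-ins for parsed trees: each "tree" is just a list of tag names.
my @trees   = ( [qw(i p i)], [qw(p)] );

# Hypothetical printd(): report what was found per file in verbose mode.
sub printd {
    my ( $file, @found ) = @_;
    print "  $file: @found\n" if $options{v} and @found;
    return;
}

sub doitem {
    my $func = shift;
    my $num  = 0;
    foreach my $i ( 0 .. $#files ) {
        my @elements = $func->( $files[$i], $trees[$i] );
        printd( $files[$i], @elements );
        $num += @elements;    # an array in numeric context yields its size
    }
    return $num;
}

# The callback receives (file name, tree) and returns the selected items.
my $n = doitem(
    sub {
        my ( $file, $tree ) = @_;
        return grep { $_ eq "i" } @$tree;
    }
);
print "found $n\n";
```

Because doitem() only counts what the callback returns, each grading requirement reduces to writing one small selection subroutine.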
The code block passed into doitem() is a subroutine that takes two parameters, a file name and its corresponding HTML tree, and returns a list of the selected elements in the tree. The following code block retrieves all HTML elements in italic, including the <i> elements (for example, <i>text</i>) and elements with a font-style of italic (for example, <span STYLE="font-style: italic">text</span>).
    $n = doitem sub {
        my ( $file, $tree ) = @_;
        return ( $tree->find("i"),
            $tree->look_down( "style" => qr/font-style *: *italic/ ) );
    };
    marking "Italicized text (2 points): "
        . ( ( $n > 0 ) ? "good. 2" : "no italic text. 0" );
Any italic text in the pages earns two points. The marking() subroutine records each grading result in a string, which the program examines at the end to calculate the total points.
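The article does not show marking() itself, so the following is only a sketch of how such a subroutine might work. The $report buffer, the total_points() helper, and the convention that each grading line ends with the points awarded are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical report buffer; the article only says grading is kept in a string.
my $report = "";

# Append one grading line to the report.
sub marking {
    my ($line) = @_;
    $report .= "$line\n";
    return;
}

# Sum the trailing number on each report line (assumed convention).
sub total_points {
    my $total = 0;
    foreach my $line ( split /\n/, $report ) {
        $total += $1 if $line =~ /(\d+)\s*$/;
    }
    return $total;
}

marking("Italicized text (2 points): good. 2");
marking("Colored text (2 points): no colored text. 0");
print $report, "Total: ", total_points(), "\n";
```

Keeping the points at the end of each line means the same string serves as both the human-readable report and the input for the final tally.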
Other requirements are marked in the same manner, though some selection code is more involved. A regular expression helps to select elements with non-default colors.
    my $pattern = qr/(^|[^-])color *: *rgb\( *[0-9]*, *[0-9]*, *[0-9]*\)/;
    return $tree->look_down(
        "style" => $pattern,
        sub { $_[0]->as_trimmed_text ne "" }
    );
Nvu applies colors to text by the color style in the form of rgb(R,G,B) (for example, <span STYLE="color: rgb(0, 0, 255);">text</span>). The above code is slightly stricter than the italic code, as it also requires an element to contain some text. The method as_trimmed_text() of HTML::Element returns the textual content of an element with any leading and trailing spaces removed.
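The leading (^|[^-]) in the pattern is what keeps properties such as background-color from counting as text color. A quick check of the same pattern against a few sample style values (the sample strings are made up for illustration) shows the effect:

```perl
use strict;
use warnings;

my $pattern = qr/(^|[^-])color *: *rgb\( *[0-9]*, *[0-9]*, *[0-9]*\)/;

# Illustrative style attribute values, not taken from real assignments.
my @styles = (
    "color: rgb(0, 0, 255);",
    "font-weight: bold; color: rgb(255, 0, 0);",
    "background-color: rgb(255, 255, 0);",
);

foreach my $style (@styles) {
    my $verdict = $style =~ $pattern ? "match" : "no match";
    printf "%-45s %s\n", $style, $verdict;
}
```

The first two values match because "color" appears at the start of the string or after a character other than a hyphen; the third does not match because its only "color" follows a hyphen.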
Nested invocations of look_down() locate linked graphics with a border. This selects any link (an <a> element) that encloses an image (an <img> element) that has a border.
    return $tree->look_down(
        "_tag" => "a",
        sub {
            $_[0]->look_down( "_tag" => "img", sub { hasBorder( $_[0] ) } );
        }
    );
Finding non-linked graphics is more interesting, as it involves both look_down() and look_up(). The code should find only images (<img> elements) that have no enclosing link (an <a> element) up the tree.
    return $tree->look_down(
        "_tag" => "img",
        sub { !$_[0]->look_up( "_tag" => "a" ) and hasBorder( $_[0] ); }
    );
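The two selections can be tried together on a small document. In the sketch below, hasBorder() is a hypothetical stand-in, since the article's own version is not shown in this section: it assumes a border is declared either by a non-zero border attribute or by a style whose border width starts with a non-zero digit. The sample HTML is likewise invented.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical hasBorder(): true for a non-zero border attribute or a
# style such as "border: 2px solid" (width must come first in this sketch).
sub hasBorder {
    my ($img) = @_;
    my $border = $img->attr("border");
    return 1 if defined $border and $border =~ /^\s*[1-9]/;
    my $style = $img->attr("style");
    return 1 if defined $style and $style =~ /border(?:-width)?\s*:\s*[1-9]/i;
    return 0;
}

my $tree = HTML::TreeBuilder->new_from_content(<<'HTML');
<html><body>
<a href="big.html"><img src="linked.png" border="1"></a>
<img src="plain.png" style="border: 2px solid;">
<img src="bare.png" border="0">
</body></html>
HTML

# Links that enclose a bordered image.
my @linked = $tree->look_down(
    "_tag" => "a",
    sub { $_[0]->look_down( "_tag" => "img", sub { hasBorder( $_[0] ) } ) }
);

# Bordered images with no <a> ancestor.
my @unlinked = $tree->look_down(
    "_tag" => "img",
    sub { !$_[0]->look_up( "_tag" => "a" ) and hasBorder( $_[0] ) }
);

print "linked: ",   scalar(@linked),   "\n";
print "unlinked: ", scalar(@unlinked), "\n";
$tree->delete;    # free the tree explicitly when done
```

With this input, each selection should find one element: the link around linked.png and the standalone plain.png, while bare.png is skipped because its border is zero.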
Checking valid internal links requires passing look_down() a code block that excludes common external links by checking the href value against protocol names, and verifies the existence of the file linked in the web page.
    use File::Basename;
    $n = doitem sub {
        my ( $file, $tree ) = @_;
        return $tree->look_down(
            "_tag" => "a",
            "href" => qr//,    # the link must have an href attribute
            sub {
                # skip external links, then check that the target file exists
                !( $_[0]->attr("href") =~ /^ *(http:|https:|ftp:|mailto:)/ )
                    and -e dirname($file) . "/" . decodeURL( $_[0]->attr("href") );
            }
        );
    };
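The helper decodeURL() is not defined in this section. One plausible version, offered only as a sketch, turns an href into a relative file path by decoding %XX escapes. Dropping any #fragment or ?query part first is an additional assumption, made so that a link such as page.html#top still resolves to a file name.

```perl
use strict;
use warnings;

# Hypothetical decodeURL(): convert an href to a relative file path.
sub decodeURL {
    my ($url) = @_;
    $url =~ s/[#?].*$//s;                           # assumed: drop fragment and query
    $url =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # decode percent escapes
    return $url;
}

print decodeURL("my%20page.html#top"), "\n";    # prints "my page.html"
```

Decoding matters because an editor may write a file named "my page.html" into the href as "my%20page.html", and the -e test needs the name as it appears on disk.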
Nvu changes a page's text color by specifying the color components in the style of the body tag, like <body style="color: rgb(0, 0, 255);">. A regular expression matches the style pattern and retrieves the three color components. Any non-zero color component denotes a non-default text color in a page.
    my $pattern = qr/(?:^|[^-])color *: *rgb\(( *[0-9]*),( *[0-9]*),( *[0-9]*)\)/;
    return $tree->look_down(
        "_tag"  => "body",
        "style" => qr//,
        sub {
            $_[0]->attr("style") =~ $pattern and
                ( $1 != 0 or $2 != 0 or $3 != 0 );
        }
    );
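The capture groups can be checked in isolation. The following sketch wraps the same pattern in a hypothetical helper, non_default_color(), and runs it on two made-up style values. Perl ignores the leading space in a capture such as " 255" when comparing numerically, so the components can be tested directly.

```perl
use strict;
use warnings;

my $pattern = qr/(?:^|[^-])color *: *rgb\(( *[0-9]*),( *[0-9]*),( *[0-9]*)\)/;

# Hypothetical helper: true if the style sets a text color other than
# rgb(0, 0, 0), i.e. at least one component is non-zero.
sub non_default_color {
    my ($style) = @_;
    return 0 unless $style =~ $pattern;
    return ( $1 != 0 or $2 != 0 or $3 != 0 ) ? 1 : 0;
}

foreach my $style ( "color: rgb(0, 0, 255);", "color: rgb(0, 0, 0);" ) {
    print "$style => ", non_default_color($style), "\n";
}
```

The first style yields 1 because its blue component is 255; the second yields 0 because all three components are zero.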
With proper use of the methods look_down(), look_up(), and as_trimmed_text(), the code can locate and mark the existence of various required elements and any broken elements (images, internal links, or background images).

