Sign In/My Account | View Cart  
advertisement


Listen Print Discuss

More Advancements in Perl Programming
by Simon Cozens | Pages: 1, 2, 3, 4

Parsing

Perhaps the biggest change of heart I had between writing a chapter and its publication was in the parsing chapter. That chapter had very little about parsing HTML, and what it did have was not very friendly. Since then, Gisle Aas and Sean Burke's HTML::TreeBuilder and the corresponding XML::TreeBuilder have established themselves as much simpler and more flexible ways to navigate HTML and XML documents.

The basic concept in HTML::TreeBuilder is the HTML element, represented as an object of the HTML::Element class:

$a = HTML::Element->new('a', href => 'http://www.perl.com/');
$html = $a->as_HTML;

This creates a new element that is an anchor tag, with an href attribute. The HTML equivalent in $html would be <a href="http://www.perl.com"/>.

Now you can add some content to that tag:

$a->push_content("The Perl Homepage");

This time, the object represents <a href="http://www.perl.com"> The Perl Homepage </a>.

You can ask this element for its tag, its attributes, its content, and so on:

$tag = $a->tag;
$link = $a->attr("href");
@content = $a->content_list; # More HTML::Element nodes

Of course, when you are parsing HTML, you won't be creating those elements manually. Instead, you'll be navigating a tree of them, built out of your HTML document. The top-level module HTML::TreeBuilder does this for you:

use HTML::TreeBuilder;
my $tree = HTML::TreeBuilder->new();
$tree->parse_file("index.html");

Now $tree is a HTML::Element object representing the <html> tag and all its contents. You can extract all of the links with the extract_links() method:

for (@{ $tree->extract_links() || [] }) {
     my($link, $element, $attr, $tag) = @$_;
     print "Found link to $link in $tag\n";
}

Although the real workhorse of this module is the look_down() method, which helps you pull elements out of the tree by their tags or attributes. For instance, in a search engine indexer, indexing HTML files, I have the following code:

for my $tag ($tree->look_down("_tag","meta")) {
    next unless $tag->attr("name");
    $hash{$tag->attr("name")} .= $tag->attr("content"). " ";
}

$hashMore Advancements in Perl Programming .= $_->as_text." " for $tree->look_down("_tag","title");

This finds all <meta> tags and puts their attributes as name-value pairs in a hash; then it puts all the text inside of <title> tags together into another hash element. Similarly, you can look for tags by attribute value, spit out sub-trees as HTML or as text, and much more, besides. For reaching into HTML text and pulling out just the bits you need, I haven't found anything better.

On the XML side of things, XML::Twig has emerged as the usual "middle layer," when XML::Simple is too simple and XML::Parser is, well, too much like hard work.

Templating

There's not much to say about templating, although in retrospect, I would have spent more of the paper expended on HTML::Mason talking about the Template Toolkit instead. Not that there's anything wrong with HTML::Mason, but the world seems to be moving away from templates that include code in a specific language (say, Perl's) towards separate templating little languages, like TAL and Template Toolkit.

The only thing to report is that Template Toolkit finally received a bit of attention from its maintainer a couple of months ago, but the long-awaited Template Toolkit 3 is looking as far away as, well, Perl 6.

Natural Language Processing

Who would have thought that the big news of 2005 would be that Yahoo is relevant again? Not only are they coming up with interesting new search technologies such as Y!Q, but they're releasing a lot of the guts behind what they're doing as public APIs. One of those that is particularly relevant for NLP is the Term Extraction web service.

This takes a chunk of text and pulls out the distinctive terms and phrases. Think of this as a step beyond something like Lingua::EN::Keywords, with the firepower of Yahoo behind it. To access the API, simply send a HTTP POST request to a given URL:

use LWP::UserAgent;
use XML::Twig;
my $uri  = "http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction";
my $ua   = LWP::UserAgent->new();
my $resp = $ua->post($uri, {
    appid   => "PerlYahooExtractor",
    context => <<EOF
Two Scottish towns have seen the highest increase in house prices in the
UK this year, according to new figures. 
Alexandria in West Dunbartonshire and Coatbridge in North Lanarkshire
both saw an average 35% rise in 2005. 
EOF
});
if ($resp->is_success) { 
    my $xmlt = XML::Twig->new( index => [ "Result" ]);
    $xmlt->parse($resp->content);
    for my $result (@{ $xmlt->index("Result") || []}) {
        print $result->text;
    }
}

This produces:

north lanarkshire
scottish towns
west dunbartonshire
house prices
coatbridge
dunbartonshire
alexandria

Once I had informed the London Perl Mongers of this amazing discovery, Simon Wistow immediately bundled it up into a Perl module called Lingua::EN::Keywords::Yahoo, coming soon to a CPAN mirror near you.

Pages: 1, 2, 3, 4

Next Pagearrow