Web Basics with LWP
Sample Recipes for Common Tasks
by Sean M. Burke
August 20, 2002
Sean M. Burke is the author of Perl & LWP
Introduction
LWP (short for "Library for WWW in Perl") is a popular group of Perl modules for accessing data on the Web. Like most Perl module-distributions, each of LWP's component modules comes with documentation that is a complete reference to its interface. However, there are so many modules in LWP that it's hard to know where to look for information on doing even the simplest things.
Introducing you to using LWP would require a whole book--a book that just happens to exist, called Perl & LWP. This article offers a sampling of recipes that let you perform common tasks with LWP.
Getting Documents with LWP::Simple
If you just want to access what's at a particular URL, the simplest way
to do it is to use LWP::Simple's functions.
In a Perl program, you can call its get($url) function, which tries to
fetch that URL's content. If the request succeeds, get returns the content; if there's any error,
it returns undef.
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
# Just an example: the URL for the most recent /Fresh Air/ show
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
# Then go do things with $content, like this:
if($content =~ m/jazz/i) {
  print "They're talking about jazz today on Fresh Air!\n";
} else {
  print "Fresh Air is apparently jazzless today.\n";
}
The handiest variant on get is getprint, which is useful in Perl
one-liners. If it can get the page whose URL you provide, it sends it
to STDOUT; otherwise it complains to STDERR.
% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"
This is the URL of a plain-text file that lists the files added to CPAN in
the past two weeks. You can easily make it part of a tidy little
shell command, like this one that mails you the list of new
Acme:: modules:
% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" \
  | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER
There are other useful functions in LWP::Simple, including one function for running a HEAD request on a URL (useful for checking links, or getting the last-revised time of a URL), and two functions for
saving and mirroring a URL to a local file. See the LWP::Simple
documentation for the full details, or Chapter 2, "Web Basics" of Perl & LWP for more examples.
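As a rough sketch of those other functions, here is how head, getstore, and mirror might be wrapped up. The three helper subs (describe_url, save_url, refresh_copy) are hypothetical names invented for this illustration, not part of LWP, and the command-line handling at the end is just one assumed way to drive them:

```perl
use strict;
use warnings;
use LWP::Simple;   # exports head, getstore, mirror, is_success, and more

# Hypothetical helper: report what a HEAD request says about a URL.
sub describe_url {
    my ($url) = @_;
    # In list context, head() returns an empty list on failure.
    my ($type, $length, $modified) = head($url)
        or return "Couldn't HEAD $url";
    return join ', ',
        'type: '     . (defined $type     ? $type   : 'unknown'),
        'length: '   . (defined $length   ? $length : 'unknown'),
        'modified: ' . (defined $modified ? scalar localtime($modified)
                                          : 'unknown');
}

# Hypothetical helper: save a URL to a local file; true on success.
sub save_url {
    my ($url, $file) = @_;
    my $status = getstore($url, $file);   # returns the HTTP status code
    return is_success($status);
}

# Hypothetical helper: refresh a local copy only if the remote one changed.
sub refresh_copy {
    my ($url, $file) = @_;
    return mirror($url, $file);           # 304 means "not modified"
}

# Only touch the network when a URL and a filename are given, e.g.
#   perl fetch.pl http://cpan.org/RECENT recent.txt
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    print describe_url($url), "\n";
    print save_url($url, $file) ? "Saved $url to $file\n"
                                : "Couldn't save $url\n";
    print "mirror() returned status ", refresh_copy($url, $file), "\n";
}
```

Note that getstore and mirror hand back a bare status code rather than content, which is why the sketch passes that code to is_success instead of testing for undef as you would with get.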
The Basics of the LWP Class Model
LWP::Simple's functions are handy for simple cases, but they
don't support cookies or authorization; they don't support setting header
lines in the HTTP request; and generally, they don't support reading header lines
in the HTTP response (most notably the full HTTP error message, in case of an
error). To get at all those features, you'll have to use the full LWP
class model.
While LWP consists of dozens of classes, the two that you have to understand are
LWP::UserAgent and HTTP::Response. LWP::UserAgent is a class
for "virtual browsers," which you use for performing requests, and HTTP::Response is a class
for the responses (or error messages) that you get back from those requests.
The basic idiom is $response = $browser->get($url), or, illustrated
more fully:
# Early in your program:
use LWP 5.64; # Loads all important LWP classes, and makes
              #  sure your version is reasonably recent.
my $browser = LWP::UserAgent->new;
...
# Then later, whenever you need to make a get request:
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
my $response = $browser->get( $url );
die "Can't get $url -- ", $response->status_line
  unless $response->is_success;
die "Hey, I was expecting HTML, not ", $response->content_type
  unless $response->content_type eq 'text/html';
  # or whatever content-type you're equipped to deal with
# Otherwise, process the content somehow:
if($response->content =~ m/jazz/i) {
  print "They're talking about jazz today on Fresh Air!\n";
} else {
  print "Fresh Air is apparently jazzless today.\n";
}
There are two objects involved: $browser, which holds an object of the
class LWP::UserAgent, and then the $response object, which is of
the class HTTP::Response. You really need only one browser object per program;
but every time you make a request, you get back a new HTTP::Response object, which
will have some interesting attributes:
- A status code indicating success or failure (which you can test with $response->is_success).
- An HTTP status line, which I hope is informative if there is a failure (which you can see with $response->status_line, and which returns something like "404 Not Found").
- A MIME content-type like "text/html", "image/gif", "application/xml", and so on, which you can see with $response->content_type.
- The actual content of the response, in $response->content. If the response is HTML, that's where the HTML source will be; if it's a GIF, then $response->content will be the binary GIF data.
- And dozens of other convenient and more specific methods that are documented in the docs for HTTP::Response, and its superclasses, HTTP::Message and HTTP::Headers.
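To get a feel for these attributes without touching the network, you can build HTTP::Response objects by hand and poke at them. This is only a sketch: the summarize sub is a hypothetical helper made up for this example, and the two hand-built responses merely stand in for what $browser->get($url) would hand back.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: a one-line summary of any response object.
sub summarize {
    my ($response) = @_;
    return "Failed: " . $response->status_line
        unless $response->is_success;
    # content_type returns extra parameters (such as the charset) in
    # list context, so force scalar context to get just the MIME type.
    return sprintf "OK: %s, %d bytes",
        scalar $response->content_type, length $response->content;
}

# Hand-built stand-ins for real responses.
my $ok = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    '<html><body>All jazz, all day</body></html>',
);
my $missing = HTTP::Response->new(404, 'Not Found');

print summarize($ok), "\n";
print summarize($missing), "\n";
```

The same summarize sub should work unchanged on a response from a real request, since it relies only on is_success, status_line, content_type, and content.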


