Web Basics with LWP

Sample Recipes for Common Tasks

by Sean M. Burke
August 20, 2002

Sean M. Burke is the author of Perl & LWP

Introduction

LWP (short for "Library for WWW in Perl") is a popular group of Perl modules for accessing data on the Web. Like most Perl module-distributions, each of LWP's component modules comes with documentation that is a complete reference to its interface. However, there are so many modules in LWP that it's hard to know where to look for information on doing even the simplest things.

Introducing you to using LWP would require a whole book--a book that just happens to exist, called Perl & LWP. This article offers a sampling of recipes that let you perform common tasks with LWP.

Getting Documents with LWP::Simple

If you just want to access what's at a particular URL, the simplest way to do it is to use LWP::Simple's functions.


In a Perl program, you can call its get($url) function. It will try getting that URL's content. If it works, then it'll return the content; but if there's some error, it'll return undef.


  my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
    # Just an example: the URL for the most recent /Fresh Air/ show

  use LWP::Simple;
  my $content = get $url;
  die "Couldn't get $url" unless defined $content;

  # Then go do things with $content, like this:

  if($content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
  } else {
    print "Fresh Air is apparently jazzless today.\n";
  }

The handiest variant on get is getprint, which is useful in Perl one-liners. If it can get the page whose URL you provide, it sends it to STDOUT; otherwise it complains to STDERR.


  % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"

That URL points to a plain-text file listing the files added to CPAN in the past two weeks. You can easily make it part of a tidy little shell command, like this one that mails you the list of new Acme:: modules:


  % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"  \
     | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER

There are other useful functions in LWP::Simple, including one function for running a HEAD request on a URL (useful for checking links, or getting the last-revised time of a URL), and two functions for saving and mirroring a URL to a local file. See the LWP::Simple documentation for the full details, or Chapter 2, "Web Basics" of Perl & LWP for more examples.
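
As a rough sketch of those other functions: head, getstore, and mirror are the names LWP::Simple exports for them, along with is_success and the RC_* status constants. The local file names below are placeholders invented for this illustration, and the output depends on what the server returns.

```perl
use strict;
use warnings;
use LWP::Simple;   # exports get, head, getprint, getstore, mirror,
                   #  plus is_success and the RC_* status constants

my $url = 'http://cpan.org/RECENT';

# HEAD request: in list context, head() returns five values
#  (some possibly undef), or an empty list if the request fails.
if( my($type, $length, $modified, $expires, $server) = head($url) ) {
  print "Content type: ", (defined $type ? $type : 'unknown'), "\n";
  print "Last modified: ", scalar(localtime $modified), "\n"
   if defined $modified;
} else {
  print "HEAD request for $url failed\n";
}

# Save the document to a local file. getstore() returns the HTTP
#  status code, which is_success() can test.
#  ('recent.txt' is a made-up file name.)
my $status = getstore($url, 'recent.txt');
print "Saved $url to recent.txt\n" if is_success($status);

# mirror() is like getstore(), but it skips the download when the
#  local copy is already up to date.
#  ('recent-mirror.txt' is a made-up file name.)
$status = mirror($url, 'recent-mirror.txt');
print "Local copy is already current\n" if $status == RC_NOT_MODIFIED;
```

Because getstore and mirror return a status code instead of dying, a failed request here just skips the corresponding message.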

The Basics of the LWP Class Model

LWP::Simple's functions are handy for simple cases, but they don't support cookies or authorization, they don't let you set header lines in the HTTP request, and they generally don't let you read header lines in the HTTP response (most notably the full HTTP error message, in case of an error). To get at all those features, you'll have to use the full LWP class model.

While LWP consists of dozens of classes, the two that you have to understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a class for "virtual browsers," which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.

The basic idiom is $response = $browser->get($url), or fully illustrated:


  # Early in your program:
  
  use LWP 5.64; # Loads all important LWP classes, and makes
                #  sure your version is reasonably recent.

  my $browser = LWP::UserAgent->new;
  
  ...
  
  # Then later, whenever you need to make a GET request:
  my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
  
  my $response = $browser->get( $url );
  die "Can't get $url -- ", $response->status_line
   unless $response->is_success;

  die "Hey, I was expecting HTML, not ", $response->content_type
   unless $response->content_type eq 'text/html';
     # or whatever content-type you're equipped to deal with

  # Otherwise, process the content somehow:
  
  if($response->content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
  } else {
    print "Fresh Air is apparently jazzless today.\n";
  }

There are two objects involved: $browser, which holds an object of the class LWP::UserAgent, and then the $response object, which is of the class HTTP::Response. You really need only one browser object per program; but every time you make a request, you get back a new HTTP::Response object, which will have some interesting attributes:

  • A status code indicating success or failure (which you can test with $response->is_success).

  • An HTTP status line, which I hope is informative if there is a failure (which you can see with $response->status_line, and which returns something like "404 Not Found").

  • A MIME content-type like "text/html", "image/gif", "application/xml", and so on, which you can see with $response->content_type.

  • The actual content of the response, in $response->content. If the response is HTML, that's where the HTML source will be; if it's a GIF, then $response->content will be the binary GIF data.

  • And dozens of other convenient and more specific methods that are documented in the docs for HTTP::Response, and its superclasses, HTTP::Message and HTTP::Headers.
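
To make those attributes concrete without touching the network, here is a small sketch that builds an HTTP::Response by hand and reads it back with the methods listed above. In a real program, $browser->get would hand you this object. The header values and body are invented for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;

# Constructed directly here so the example runs offline.
#  The Server header and the body are made-up values.
my $response = HTTP::Response->new(
  404, 'Not Found',
  [ 'Content-Type' => 'text/html',
    'Server'       => 'ExampleServer/1.0' ],
  "<html><body>No such page.</body></html>\n",
);

# Assign to a scalar first: in list context, content_type can also
#  return any parameters (such as a charset) as extra values.
my $type = $response->content_type;

print "Success? ", ($response->is_success ? "yes" : "no"), "\n";
print "Status line: ", $response->status_line, "\n";
print "Content type: $type\n";

# header() comes from the HTTP::Headers side of the class hierarchy
#  and reads any header line by name.
print "Server: ", $response->header('Server'), "\n";

print "Content:\n", $response->content;
```

For this hand-built response, is_success is false and status_line returns "404 Not Found", the same kind of string you would put in a die message after a failed request.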
