Web Basics with LWP
by Sean M. Burke

Sending GET Form Data

Some HTML forms convey their form data not by sending the data in an HTTP POST request, but by making a normal GET request with the data stuck on the end of the URL. For example, if you went to imdb.com and ran a search on Blade Runner, the URL you'd see in your browser window would be:


  http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV

To run the same search with LWP, you'd use this idiom, which involves the URI class:


  use URI;
  my $url = URI->new( 'http://us.imdb.com/Tsearch' );
    # makes an object representing the URL
  
  $url->query_form(  # And here the form data pairs:
    'title'    => 'Blade Runner',
    'restrict' => 'Movies and TV',
  );
  
  my $response = $browser->get($url);
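
To see what query_form actually builds, here is a minimal sketch that only constructs the URL and prints it, without sending any request. It assumes nothing beyond the URI module; the variable %form is introduced here just for illustration.

```perl
use strict;
use warnings;
use URI;

# Build the search URL from its base and its form data pairs
my $url = URI->new('http://us.imdb.com/Tsearch');
$url->query_form(
    'title'    => 'Blade Runner',
    'restrict' => 'Movies and TV',
);

# A URI object stringifies to the full URL, query string included
print "$url\n";
# prints: http://us.imdb.com/Tsearch?title=Blade+Runner&restrict=Movies+and+TV

# Called with no arguments, query_form returns the decoded pairs
my %form = $url->query_form;
print "$_ = $form{$_}\n" for sort keys %form;
```

Note that query_form encodes the spaces as "+" rather than "%20"; both are accepted encodings of a space in form data.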

See Chapter 5, "Forms", of Perl & LWP for a longer discussion of HTML forms and form data, and Chapters 6 through 9 for a longer discussion of extracting data from HTML.

Absolutizing URLs

The URI class mentioned above provides all sorts of methods for accessing and modifying parts of URLs (such as asking what sort of URL it is with $url->scheme, and asking what host it refers to with $url->host, and so on, as described in the docs for the URI class). However, the methods of most immediate interest are the query_form method seen above, and now the new_abs method for taking a probably relative URL string (like "../foo.html") and getting back an absolute URL (like "http://www.perl.com/stuff/foo.html"), as shown here:


  use URI;
  $abs = URI->new_abs($maybe_relative, $base);
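
As a small self-contained sketch of that call, the following resolves a relative URL and one that is already absolute against the same base, then reads back parts of the result. The base URL here is a made-up example, chosen so that "../foo.html" resolves to the absolute URL quoted above.

```perl
use strict;
use warnings;
use URI;

# Hypothetical base URL, picked for illustration only
my $base = 'http://www.perl.com/stuff/docs/index.html';

# A relative URL is resolved against the base
my $abs = URI->new_abs('../foo.html', $base);
print "$abs\n";             # prints: http://www.perl.com/stuff/foo.html

# The result is a URI object, so its parts can be queried
print $abs->scheme, "\n";   # prints: http
print $abs->host,   "\n";   # prints: www.perl.com
print $abs->path,   "\n";   # prints: /stuff/foo.html

# An already-absolute URL comes back unchanged
my $same = URI->new_abs('http://www.cpan.org/', $base);
print "$same\n";            # prints: http://www.cpan.org/
```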

For example, consider this program that matches URLs in the HTML list of new modules in CPAN:


  use strict;
  use warnings;
  use LWP 5.64;
  my $browser = LWP::UserAgent->new;
  
  my $url = 'http://www.cpan.org/RECENT.html';
  my $response = $browser->get($url);
  die "Can't get $url -- ", $response->status_line
   unless $response->is_success;
  
  my $html = $response->content;
  while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print "$1\n";  
  }

When run, it emits output that starts out something like this:


  MIRRORING.FROM
  RECENT
  RECENT.html
  authors/00whois.html
  authors/01mailrc.txt.gz
  authors/id/A/AA/AASSAD/CHECKSUMS
  ...

However, if you actually want to have those be absolute URLs, you can use the URI module's new_abs method, by changing the while loop to this:


  while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print URI->new_abs( $1, $response->base ) ,"\n";
  }

(The $response->base method from HTTP::Response is for returning the URL that should be used for resolving relative URLs--it's usually just the same as the URL that you requested.)

That program then emits nicely absolute URLs:


  http://www.cpan.org/MIRRORING.FROM
  http://www.cpan.org/RECENT
  http://www.cpan.org/RECENT.html
  http://www.cpan.org/authors/00whois.html
  http://www.cpan.org/authors/01mailrc.txt.gz
  http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
  ...
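
The same loop can be tried without a network connection. The sketch below is not part of the program above: it builds an HTTP::Response by hand, attaches the request that would have produced it, and uses a short made-up HTML fragment in place of the real RECENT.html. With no base-related headers in the response, $response->base falls back to the URL of the attached request.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request;
use HTTP::Response;

# Made-up stand-in for the fetched page (an assumption, not real CPAN content)
my $html = join "\n",
    '<A HREF="MIRRORING.FROM">MIRRORING.FROM</A>',
    '<A HREF="authors/00whois.html">authors</A>',
    '<A HREF="http://www.perl.com/">elsewhere</A>';

# Hand-built response, with the request attached so that base() has a URL to use
my $response = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ], $html );
$response->request(
    HTTP::Request->new( GET => 'http://www.cpan.org/RECENT.html' ) );

my @absolute;
while ( $html =~ m/<A HREF=\"(.*?)\"/g ) {
    my $href = $1;   # copy the capture before calling other code
    push @absolute, URI->new_abs( $href, $response->base )->as_string;
}
print "$_\n" for @absolute;
```

The two relative links come out under http://www.cpan.org/, while the link that was already absolute is left alone.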

See Chapter 4, "URLs", of Perl & LWP for a longer discussion of URI objects.

Of course, using a regexp to match hrefs is a bit simplistic, and for more robust programs, you'll probably want to use an HTML-parsing module like HTML::LinkExtor, or HTML::TokeParser, or even maybe HTML::TreeBuilder.
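
As one possible illustration of the parser-based approach, here is a minimal sketch using HTML::LinkExtor on a made-up HTML fragment. Passing a base URL as the second constructor argument asks the parser to absolutize each link as it finds it; the first argument (a callback) is left undefined so that the links are simply collected. Unlike the regexp, this does not care about the case of tag and attribute names.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Made-up fragment with mixed-case markup (an assumption for illustration)
my $html = '<A HREF="authors/00whois.html">who</A> '
         . '<a href="RECENT.html">recent</a> '
         . '<img src="logo.png">';

# No callback; links are stored and absolutized against the base URL
my $parser = HTML::LinkExtor->new( undef, 'http://www.cpan.org/RECENT.html' );
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;       # e.g. ('a', href => URI object)
    next unless $tag eq 'a' and defined $attr{href};
    push @hrefs, "$attr{href}";        # stringify the URI object
}
print "$_\n" for @hrefs;
```

This prints the two anchor links as absolute URLs and skips the image, since only "a" elements are kept.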

Other Browser Attributes

LWP::UserAgent objects have many attributes for controlling how they work. Here are a few notable ones:

  • $browser->timeout(15): This sets this browser object to give up on requests that don't answer within 15 seconds.

  • $browser->protocols_allowed( [ 'http', 'gopher'] ): This sets this browser object to not speak any protocols other than HTTP and gopher. If it tries accessing any other kind of URL (like an "ftp:" or "mailto:" or "news:" URL), then it won't actually try connecting, but instead will immediately return an error code 500, with a message like "Access to ftp URIs has been disabled".

  • use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new()): This tells the browser object to try using the HTTP/1.1 "Keep-Alive" feature, which speeds up requests by reusing the same socket connection for multiple requests to the same server.

  • $browser->agent( 'SomeName/1.23 (more info here maybe)' ): This changes how the browser object will identify itself in the default "User-Agent" line in its HTTP requests. By default, it'll send "libwww-perl/versionnumber", like "libwww-perl/5.65". You can change that to something more descriptive like this:

    
      $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );
    

    Or if need be, you can go in disguise, like this:

    
      $browser->agent( 
         'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );
    
  • push @{ $browser->requests_redirectable }, 'POST': This tells this browser to obey redirection responses to POST requests (like most modern interactive browsers), even though the HTTP RFC says that this should not normally be done.
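
Put together, the settings above might look like the following sketch. It sets each attribute on one browser object, reads a few of them back (calling an attribute method with no argument returns the current value), and then requests a URL whose scheme is not allowed. The ftp URL is a placeholder; because "ftp" is not in the allowed list, no connection is attempted.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new;

$browser->timeout(15);                                  # give up after 15 seconds
$browser->protocols_allowed( [ 'http', 'gopher' ] );    # refuse other schemes
$browser->conn_cache( LWP::ConnCache->new() );          # try to reuse connections
$browser->agent('SomeName/3.14 (contact@robotplexus.int)');
push @{ $browser->requests_redirectable }, 'POST';      # follow redirects on POST

# Read the settings back
print "timeout:      ", $browser->timeout, "\n";
print "agent:        ", $browser->agent,   "\n";
print "redirectable: @{ $browser->requests_redirectable }\n";

# Placeholder URL with a disallowed scheme; this is refused locally
my $response = $browser->get('ftp://ftp.example.com/');
print "status:       ", $response->status_line, "\n";
```

The last line should show a 500 status with a message saying that access to ftp URIs has been disabled.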

For more options and information, see the full documentation for LWP::UserAgent.
