January 2001 Archives

This Week on p5p 2001/01/28



Notes

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month.

There were 317 messages this week, not including test results.

5.6.x delayed

People seemed to think that this was monumental and catastrophic; it's not, but I mention it anyway: Alan asked what was happening to 5.6.1 - we've seen one trial release, and another is eagerly expected. Sarathy replied that he probably won't have time to do much with it for the next few weeks, and offered to hand over maintenance of 5.6.x to anyone who wants it. People get busy; it's not the end of the world.

Still, it does look like 5.6.1 is going to take a while to get out there. Hold on tight, though; my crystal ball tells me that Jarkko's been doing some work merging bleadperl patches into the 5.6 maintenance branch.

Test::Harness again

Michael Schwern's valiant efforts with Test::Harness are beginning to bear fruit, as he produced another megapatch this week. This seemed to make a lot of people happier, but there were still some small problems with it; notably, it would choke if some output arrived before the first "ok" or "1..N", and it failed its own test suite. The irony! But it's looking a lot closer to going in, and the code is looking a lot cleaner. Nicholas Clark also noticed that you can override the runtests function to build several different perls and run the tests on each of them.

Lots of test results

But this process - building Perl with lots of different options and sending in the results - is one that a few people (notably Alan, Merijn, and Abigail) have already been doing, and P5P has been near inundated with OK and Not OK reports. Jarkko suggested that some kind of summarizing program should be run over the results, and Merijn provided one. (The link is actually to an updated version he posted later in the week.) It tests Perl in lots of different configurations with all of the different possible IO subsystems. If you've got any kind of non-obvious operating system, lots of spare cycles, and you're following the rsync Perl, it would be great if you could play with it!

The hashing function

When Perl stores an element in a hash, it creates a ``hash value'' for the key, and then does the moral equivalent of

    $hashvalue = hash_it($key);
    push @{$hash[$hashvalue % 8]}, {key => $key, value => $element};

distributing the keys over 8 ``buckets''. The key, if you'll pardon the pun, to good hashing is to get the elements evenly distributed in the buckets. If you get eight elements in eight different buckets, your element is guaranteed to be the first thing in the bucket, and you can fetch it back quickly - if you get them all in the same bucket, you may have to skip over seven other elements to find yours. Hence a good hashing function is essential for efficient hash access.
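To see why the distribution matters, here's a toy demonstration: a deliberately poor hash function (the length of the key) dumps every one of these three-letter keys into the same bucket. (The fixed eight-bucket layout is purely for illustration; a real perl resizes the bucket array as the hash grows.)

```perl
use strict;

# A deliberately poor hash function -- the length of the key --
# piles all of these three-letter keys into a single bucket.
my @keys = qw(cat dog cow hen fox owl bee ant);
my @buckets;
push @{ $buckets[ length($_) % 8 ] }, $_ for @keys;

for my $i (0 .. 7) {
    my $n = $buckets[$i] ? @{ $buckets[$i] } : 0;
    print "bucket $i: $n key(s)\n";
}
```

Every lookup now has to scan up to eight entries in bucket 3, instead of finding its key first time.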

The one Perl uses at the moment looks like this:

        while (i_PeRlHaSh--)
             hash_PeRlHaSh = hash_PeRlHaSh * 33 + *s_PeRlHaSh++;
        (hash) = hash_PeRlHaSh + (hash_PeRlHaSh>>5);

(In Perl, that would be something like

    my $hash = 0;
    $hash = $hash * 33 + ord($_) for split //, $key;
    $hash += $hash >> 5;

)

Nick Clark found that by applying Duff's device to the hashing function he could get a 2% speedup; this led to Yet Another Benchmarking Argument, and Nick Ing-Simmons rightly pointing out that you can usually get a 2% speedup in the test suite merely by running the test suite again with no changes. Still, it was fun while it lasted - Nick's implementation is here.

Next up was Mark Leighton Fisher, who provided a patch to make Perl use Bob Jenkins' ``One-at-a-time'' hash function. There was no noticeable speedup in the Perl benchmarking suite, but it did cause a lot of otherwise sensible programmers to write a lot of assembly code for no apparent reason.
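For the curious, Jenkins' one-at-a-time function translates into Perl roughly like this. The original is written for 32-bit C integers, so this sketch masks with 0xffffffff after each addition or shift to emulate the wraparound by hand:

```perl
use strict;

# A Perl sketch of Bob Jenkins' "one-at-a-time" hash function.
# The & 0xffffffff masks emulate 32-bit unsigned C arithmetic.
sub one_at_a_time {
    my ($key) = @_;
    my $hash = 0;
    for my $byte (unpack 'C*', $key) {
        $hash = ($hash + $byte) & 0xffffffff;
        $hash = ($hash + (($hash << 10) & 0xffffffff)) & 0xffffffff;
        $hash ^= $hash >> 6;
    }
    # Final avalanche steps
    $hash = ($hash + (($hash << 3) & 0xffffffff)) & 0xffffffff;
    $hash ^= $hash >> 11;
    $hash = ($hash + (($hash << 15) & 0xffffffff)) & 0xffffffff;
    return $hash;
}

printf "%08x\n", one_at_a_time("hello, world");
```

This is much slower than letting C do it natively, of course; the point is only to show the structure of the function.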

chop examples

Michael Schwern has been cleaning up the perlfunc documentation, and putting in useful examples; unfortunately, he's stuck on finding useful examples for chop. Some people have surmised that this is because chop is fundamentally useless, and there's a proposal that it should be outlawed in Perl 6. Well, who knows, but we all certainly had a hard time thinking of handy things to do with it. Can you come up with anything better?
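For what it's worth, the example people usually reach for is stripping a trailing delimiter after building a string in a loop - though even that is arguably join's job:

```perl
use strict;

# Build a comma-separated string in a loop, then chop off the
# trailing comma. (Yes, join ',' would avoid the problem entirely.)
my $list = '';
$list .= "$_," for qw(alpha beta gamma);
chop $list;         # drop the final comma
print "$list\n";    # alpha,beta,gamma
```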

PerlIO programming documentation

Thanks to Nick Ing-Simmons, PerlIO (remember that?) has now got some documentation about how to program it; the original perlapio.pod (which has actually been there for years, though nobody reads it) has been updated, and if you look in snapshots from now on, you'll find perliol.pod, which is all about how IO layers work.

By the way, the module authors amongst you out there are using the PerlIO abstraction layer instead of assuming FILE*, right? Good.

<<< and >>> ops

John Allen suggested that Perl should have two new bit-twiddling operators: >>> would be a right shift without sign extension under use integer, and <<< would be a left roll.

Several people pointed out that this was horribly asymmetrical, that it made evil assumptions about integer size, and that it probably wouldn't win that much anyway. Which is a shame, because we all do so much low-level bit-twiddling in Perl...

Feature testing and our()

MJD inquired as to why h2xs was producing code that wanted Perl 5.005_62. It transpired that this was when our variables were introduced, and now h2xs declares variables like $VERSION to be our variables.

Jarkko said it would be a lot cleaner to be able to say use feature 'our' to defend against people doing things like backporting our to earlier Perls. Tom Hughes provided a sample implementation of the pragma that wasn't quite right, and further discussion was needed as to what Jarkko intended. One way of doing it would be to have a module which has a hash of feature names and the version numbers in which they were introduced, and checks that the program requesting a given set of features is running on a Perl that can supply them. I don't know if that'd be a good idea, but if it's your idea of fun, by all means implement it and send it in...
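To make the idea concrete, here's a minimal sketch of such a module. It's entirely hypothetical - no such module existed - and the feature-to-version mapping shown is illustrative, not authoritative:

```perl
package Features;    # hypothetical name; no such module shipped
use strict;

# Map each feature name to the Perl version that introduced it.
# This mapping is illustrative only.
my %introduced = (
    our => 5.005_62,
);

sub import {
    my ($class, @wanted) = @_;
    for my $feature (@wanted) {
        die "Unknown feature '$feature'\n"
            unless exists $introduced{$feature};
        # $] holds the version of the running perl
        die "Feature '$feature' needs Perl $introduced{$feature}; this is $]\n"
            if $] < $introduced{$feature};
    }
}

1;
```

A program would then say something like `use Features 'our';` and die at compile time on any perl too old to supply the feature - including a perl where the feature had merely been backported.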

Win32 and ActiveState Perl Configuration

Indy Singh noted that the defaults you get when building your own perl on Win32 don't allow you to use pre-compiled binary modules from the various repositories such as ActiveState's, and sent in a patch to correct this by making the defaults the same as those ActiveState uses.

Sarathy objected, on the grounds that the current options are a lot faster and ask less of compilers, and that:

I think 99% of the people relying on module repositories don't build their own perl, and the ones that do are smart enough to enable the options they need. OTOH, I suspect most people building perl on their own on Windows need maximal efficiency and compatibility with Unix.

So I think the older defaults make more sense than the newer ones.

His compromise would be an `ActiveState compatibility flag' that people building their own perls could set, which would turn on the same options that ActiveState uses at a stroke. Indy objected to Sarathy's objection, but Sarathy pointed out that he did not actually want people to pessimize the default Win32 configuration:

Wearing my ActiveState hat, I'd be more than happy if the defaults set all the options to what ActiveState uses now or will be using in future, but I thought I should point out the downsides regardless.

Tim Jenness then raised another ActiveState config point: ActiveState's build of Perl on Solaris uses Perl's own implementation of malloc. People had reported problems using this in conjunction with PDL, and there have been other known `issues' with Perl's malloc, especially when dealing with 64-bit systems. Sarathy said that it should be considered for 5.7.x, but not in the 5.6.x builds to avoid breaking binary compatibility.

Various

Ilya's mad patch of the week was to allow overloading of int(); shame he didn't look at lvalue overloading while he was at it. Peter Prymmer provided MVS users with dynamic loading on OS/390 (as well as lots of other useful OS/390 fixes). There was another debate about the meaninglessness of benchmarking. Yes, floating point arithmetic is still imprecise. We know.

Unicode. There, I've mentioned it.

Until next week I remain, your humble and obedient servant,


Simon Cozens

Quick Start with SOAP


Table of Contents

Quick Start with SOAP
Writing a CGI-based Server
Client
Passing Values
Autodispatching
Object access
Error handling
Service dispatch (different services on one server)
Types and Names
Conclusion

Part 2 of this series

SOAP (Simple Object Access Protocol) is a way for you to remotely make method calls upon classes and objects that exist on a remote server. It's the latest in a long series of similar projects like CORBA, DCOM, and XML-RPC.

SOAP specifies a standard way to encode parameters and return values in XML, and standard ways to pass them over some common network protocols like HTTP (web) and SMTP (email). This article, however, is merely intended as a quick guide to writing SOAP servers and clients. We will hardly scratch the surface of what's possible.

We'll be using the SOAP::Lite module from CPAN. Don't be misled by the "Lite" suffix--this refers to the effort it takes to use the module, not its capabilities.

Writing a CGI-based Server

Download source files mentioned in this article here.

Here's a simple CGI-based SOAP server (hibye.cgi):

 #!perl -w

  use SOAP::Transport::HTTP;

  SOAP::Transport::HTTP::CGI   
    -> dispatch_to('Demo')     
    -> handle;

  package Demo;

  sub hi {                     
    return "hello, world";     
  }

  sub bye {                    
    return "goodbye, cruel world";
  }


There are basically two parts to this: the first four lines set up a SOAP wrapper around a class. Everything from 'package Demo' onward is the class being wrapped.

In the previous version of the specification (1.0), SOAP over HTTP was supposed to use a new HTTP method, M-POST. Now it's common to try a normal POST first, and then use M-POST if the server needs it. If you don't understand the difference between POST and M-POST, don't worry; you don't need to know all the specific details to be able to use the module.

Client

This client prints the results of the hi() method call (hibye.pl):

 #!perl -w
  
  use SOAP::Lite;

  print SOAP::Lite                                             
    -> uri('http://www.soaplite.com/Demo')                                             
    -> proxy('http://services.soaplite.com/hibye.cgi')
    -> hi()                                                    
    -> result;

The uri() identifies the class to the server, and the proxy() identifies the location of the server itself. Since both look like URLs, I'll take a minute to explain the difference, as it's quite important.

proxy()
proxy() is simply the address of the server to contact that provides the methods. You can use http:, mailto:, even ftp: URLs here.

uri()
Each server can offer many different services through the one proxy() URL. Each service has a unique URI-like identifier, which you specify to SOAP::Lite through the uri() method. If you get caught up in the gripping saga of the SOAP documentation, the "namespace" corresponds to the uri() method.

If you're connected to the Internet, you can run your client, and you should see:

 hello, world

That's it!

If your method returns multiple values (hibye.cgi):

 #!perl -w

  use SOAP::Transport::HTTP;

  SOAP::Transport::HTTP::CGI      
    -> dispatch_to('Demo')        
    -> handle;

  package Demo;

  sub hi {                        
    return "hello, world";        
  }

  sub bye {                       
    return "goodbye, cruel world";
  }

  sub languages {                 
    return ("Perl", "C", "sh");   
  }

Then the result() method will only return the first value. To access the rest, use the paramsout() method (hibyeout.pl):

 #!perl -w

  use SOAP::Lite;

  $soap_response = SOAP::Lite                                  
    -> uri('http://www.soaplite.com/Demo')                                             
    -> proxy('http://services.soaplite.com/hibye.cgi')
    -> languages();

  @res = $soap_response->paramsout;

  $res = $soap_response->result;                               
  print "Result is $res, outparams are @res\n";

This code will produce:

 Result is Perl, outparams are Perl C sh

Passing Values

Methods can take arguments. Here's a SOAP server that translates between Fahrenheit and Celsius (temper.cgi):

 #!perl -w

  use SOAP::Transport::HTTP;

  SOAP::Transport::HTTP::CGI
    -> dispatch_to('Temperatures')
    -> handle;

  package Temperatures;

  sub f2c {
      my ($class, $f) = @_;
      return 5 / 9 * ($f - 32);
  }

  sub c2f {
      my ($class, $c) = @_;
      return 32 + $c * 9 / 5;
  }

And here's a sample query (temp.pl):

 #!perl -w

  use SOAP::Lite;

  print SOAP::Lite                                            
    -> uri('http://www.soaplite.com/Temperatures')                                    
    -> proxy('http://services.soaplite.com/temper.cgi')
    -> c2f(37.5)                                              
    -> result;

You can also create an object representing the remote class, and then make method calls on it (tempmod.pl):

 #!perl -w

  use SOAP::Lite;

  my $soap = SOAP::Lite                                        
    -> uri('http://www.soaplite.com/Temperatures')                                     
    -> proxy('http://services.soaplite.com/temper.cgi');

  print $soap                                                  
    -> c2f(37.5)                                               
    -> result;

Autodispatching

This being Perl, there's more than one way to do it: SOAP::Lite provides an alternative client syntax (tempauto.pl).

 #!perl -w

  use SOAP::Lite +autodispatch =>

    uri => 'http://www.soaplite.com/Temperatures',
    proxy => 'http://services.soaplite.com/temper.cgi';

  print c2f(37.5);

After you specify the uri and proxy parameters, you are able to call remote functions with the same syntax as local ones (e.g., c2f). This is done with UNIVERSAL::AUTOLOAD, which catches all unknown method calls. Be warned that all calls to undefined methods will result in an attempt to use SOAP.
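In miniature, the AUTOLOAD trick looks like this. This is a toy illustration only, not SOAP::Lite's actual implementation - the real one builds and sends a SOAP request instead of returning a string:

```perl
use strict;

# Any call to an undefined sub in this package falls through to
# AUTOLOAD, which can inspect the sub's name and arguments and
# forward them anywhere it likes -- e.g. over SOAP.
our $AUTOLOAD;
sub AUTOLOAD {
    my ($name) = $AUTOLOAD =~ /([^:]+)$/;
    return if $name eq 'DESTROY';    # don't trap object destruction
    return "would dispatch '$name' with args (@_)";
}

print c2f(37.5), "\n";
```

Since c2f is never defined, the call lands in AUTOLOAD, which sees the name 'c2f' and the argument list - exactly the information needed to construct a remote call.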

Object access (it's the 'Simple Object Access Protocol', isn't it?)

Methods can also return real objects. Let's extend our Temperatures class with an object-oriented interface (temper.cgi):

 #!perl -w

  use SOAP::Transport::HTTP;

  SOAP::Transport::HTTP::CGI
    -> dispatch_to('Temperatures')
    -> handle;

  package Temperatures;

  sub f2c {
      my ($class, $f) = @_;
      return 5/9*($f-32);
  }

  sub c2f {
      my ($class, $c) = @_;
      return 32+$c*9/5;
  }

  sub new {
      my $self = shift;
      my $class = ref($self) || $self;
      bless {_temperature => shift} => $class;
  }

  sub as_fahrenheit {
      return shift->{_temperature};
  }

  sub as_celsius {
      my $self = shift;
      return $self->f2c( $self->{_temperature} );
  }

Here is a client that accesses this class (tempobj.pl):

 #!perl -w
  
  use SOAP::Lite;

  my $soap = SOAP::Lite
    -> uri('http://www.soaplite.com/Temperatures')
    -> proxy('http://services.soaplite.com/temper.cgi');

  my $temperatures = $soap
    -> call(new => 100) # accept Fahrenheit  
    -> result;

  print $soap
    -> as_celsius($temperatures)
    -> result;

Similar code with autodispatch is shorter and easier to read (tempobja.pl):

 #!perl -w

  use SOAP::Lite +autodispatch =>
    uri => 'http://www.soaplite.com/Temperatures',
    proxy => 'http://services.soaplite.com/temper.cgi';

  my $temperatures = Temperatures->new(100);
  print $temperatures->as_fahrenheit();

Error handling

A SOAP call may fail for numerous reasons, such as a transport error, incorrect parameters, or an error on the server. Transport errors (which may occur if, for example, there is a network break between the client and the server) are dealt with below. All other errors are indicated by the fault() method (temperr.pl):

 #!perl -w

  use SOAP::Lite;

  my $soap = SOAP::Lite
    -> uri('http://www.soaplite.com/Temperatures')
    -> proxy('http://services.soaplite.com/temper.cgi');
  my $result = $soap->c2f(37.5);

  unless ($result->fault) {
    print $result->result();
  } else {
    print join ', ', 
      $result->faultcode, 
      $result->faultstring, 
      $result->faultdetail;
  }

faultcode() gives you information about the main reason for the error. Possible values may be:

Client: you provided incorrect information in the request.
This error may occur when parameters for the remote call are incorrect. Parameters may be out of bounds, such as negative numbers where positive integers are expected, or of an incorrect type, such as a string where a number was expected.

Server: something is wrong on the server side.
This means that the provided information is correct, but the server couldn't handle the request because of temporary difficulties, for example, an unavailable database.

MustUnderstand: a header element has the mustUnderstand attribute, but wasn't understood by the server.
The server was able to parse the request, but the client is requesting functionality that can't be provided. For example, suppose that a request requires execution of an SQL statement, and the client wants to be sure that several requests will be executed in one database transaction. This could be implemented as three different calls with one common TransactionID.

In this case, the SOAP header may be extended with a new header element called, say, 'TransactionID', which carries a common identifier across the three separate invocations. However, if the server does not understand the provided TransactionID header, it probably won't be able to maintain transactional integrity across invocations. To guard against this, the client may indicate that the server 'mustUnderstand' the element 'TransactionID'. If the server sees this and does NOT understand the meaning of the element, it will not try to process the requests in the first place.

This functionality makes services more reliable and distributed systems more robust.

VersionMismatch: the server can't understand the version of SOAP used by the client.
This is provided for (possible) future extensions, when new versions of SOAP may have different functionality, and only clients that are knowledgeable about it will be able to properly use it.

Other errors
The server is allowed to create its own errors, like Client.Authentication.

faultstring() provides a readable explanation, whereas faultdetail() gives access to more detailed information, which may be a string, object, or more complex structure.

For example, if you change uri to something else (let's try with 'Test' instead of 'Temperatures'), this code will generate:

 Client, Bad Class Name, Failed to access class (Test)

By default, the client will die with a diagnostic on transport errors and do nothing for faulted calls, so you'll be able to get the fault information from the result. You can alter this behavior with an on_fault() handler, either per object, so that it dies on both transport errors and SOAP faults (temperrh.pl):

 #!perl -w

  use SOAP::Lite;

  my $soap = SOAP::Lite
    -> uri('http://www.soaplite.com/Temperatures')
    -> proxy('http://services.soaplite.com/temper.cgi')

    -> on_fault(sub { my($soap, $res) = @_; 
         die ref $res ? $res->faultdetail : $soap->transport->status, "\n";
       });

Or you can set it globally (temperrg.pl):

 #!perl -w

  use SOAP::Lite

    on_fault => sub { my($soap, $res) = @_; 
      die ref $res ? $res->faultdetail : $soap->transport->status, "\n";
    };

  my $soap = SOAP::Lite
    -> uri('http://www.soaplite.com/Temperatures')
    -> proxy('http://services.soaplite.com/temper.cgi');

Now, wrap your SOAP call into an eval {} block, and catch both transport errors and SOAP faults (temperrg.pl):

 #!perl -w

  use SOAP::Lite

    on_fault => sub { my($soap, $res) = @_; 
      die ref $res ? $res->faultdetail : $soap->transport->status, "\n";
    };

  my $soap = SOAP::Lite
    -> uri('http://www.soaplite.com/Temperatures')
    -> proxy('http://services.soaplite.com/temper.cgi');

  eval { 
    print $soap->c2f(37.5)->result; 
  1 } or die;

You may also consider this variant, which will return undef and set $! on failure, just as many Perl functions do (temperrv.pl):

 #!perl -w

  use SOAP::Lite
    on_fault => sub { my($soap, $res) = @_; 
      eval { die ref $res ? $res->faultdetail : $soap->transport->status };
      return ref $res ? $res : new SOAP::SOM;
    };

  my $soap = SOAP::Lite
    -> uri('http://www.soaplite.com/Temperatures')
    -> proxy('http://services.soaplite.com/temper.cgi');

  defined (my $temp = $soap->c2f(37.5)->result) or die;

  print $temp;

And finally, if you want to ignore errors (however, you can still check for them with the fault() method call):

 use SOAP::Lite
    on_fault => sub {};

or

 my $soap = SOAP::Lite
    -> on_fault(sub{})
    ..... other parameters

Service dispatch (different services on one server)

So far our CGI programs have had a single class to handle incoming SOAP calls. But we might have one CGI program that dispatches SOAP calls to many classes.

What exactly is SOAP dispatch? When a SOAP request is received by a server, it gets bound to the class specified in the request. The class could already be loaded on the server side (at server startup, or as a result of previous calls), or might be loaded on demand, according to the server configuration. Dispatching is the process of determining which class should handle a given request, and loading that class, if necessary. Static dispatch means that the name of the class is specified in the configuration, whereas dynamic dispatch means that only a pool of classes is specified (in, say, a particular directory), and any class from that pool can be accessed.

Imagine that you want to give access to two different classes on the server side, and want to provide the same 'proxy' address for both. What should you do? Several options are available:

Static internal
... Which you are already familiar with (hibye.cgi):
  use SOAP::Transport::HTTP;

  SOAP::Transport::HTTP::CGI   
    -> dispatch_to('Demo')     
    -> handle;

  package Demo;

  sub hi {                     
    return "hello, world";     
  }

  sub bye {                    
    return "goodbye, cruel world";
  }

  1;

Static external
Similar to Static internal, but the module is somewhere outside of server code (hibyeout.cgi):
  use SOAP::Transport::HTTP;

  use Demo;

  SOAP::Transport::HTTP::CGI   
    -> dispatch_to('Demo')     
    -> handle;

The following module should, of course, be somewhere in a directory listed in @INC (Demo.pm):

 package Demo;

  sub hi {                     
    return "hello, world";     
  }

  sub bye {                    
    return "goodbye, cruel world";
  }

  1;
Dynamic
As you can see in both Static internal and Static external modes, the module name is hardcoded in the server code. But what if you want to be able to add new modules dynamically without altering the server? Dynamic dispatch allows you to do it. Specify a directory, and any module in this directory becomes available for dispatching (hibyedyn.cgi):
 #!perl -w

  use SOAP::Transport::HTTP;

  SOAP::Transport::HTTP::CGI

    -> dispatch_to('/home/soaplite/modules')

    -> handle;

Then put Demo.pm in /home/soaplite/modules directory (Demo.pm):

 package Demo;

  sub hi {                     
    return "hello, world";     
  }

  sub bye {                    
    return "goodbye, cruel world";
  }

  1;

That's it. Any module you put in /home/soaplite/modules is available now, but don't forget that the URI specified on the client side should match the module/class name you want to dispatch your call to.

Mixed
What do we need this for? Unfortunately, dynamic dispatch has a significant disadvantage: access to @INC is disabled during dynamic dispatch, for security reasons. To work around this, you can combine the dynamic and static approaches. All you need to do is this (hibyemix.cgi):

 #!perl -w

  use SOAP::Transport::HTTP;

  SOAP::Transport::HTTP::CGI

    -> dispatch_to('/home/soaplite/modules', 'Demo', 'Demo1', 'Demo2')

    -> handle;

Now Demo, Demo1, and Demo2 are pre-loaded from anywhere in @INC, but dynamic access is enabled for any modules in /home/soaplite/modules, and they'll be loaded on demand.

Types and Names

Since Perl is a typeless language (in the sense that there is no difference between the integer 123 and the string '123'), the transformation from a SOAP message to Perl data is greatly simplified: for most simple data, we can just ignore typing at this stage. However, this approach also has drawbacks: we need to provide additional type information when generating our SOAP messages, because the server or client on the other end may expect it. SOAP::Lite doesn't force you to type every parameter explicitly; instead, it tries to guess each data type based on the actual value in question (in line with another of Perl's mottos, DWIM, or 'Do What I Mean').

For example, a variable that has the value 123 becomes an element of type int in a SOAP message, and a variable that has the value 'abc' becomes type string. However, there are more complex cases, such as variables that contain binary data, which must be Base64-encoded, or objects (blessed references), as another example, which are given type and name (unless specified) according to their Perl package.

The autotyping may not work in all cases, though. There is no default way to make an element of type string or type long from a value of 123, for example. You can alter this behavior in several ways. First, you may disable autotyping completely (by calling autotype() with a value of 0), or change the autotyping for particular types.

Alternately, you may use objects from the SOAP::Data class to explicitly specify a type for a particular variable:

 my $var = SOAP::Data->type( string => 123 );

$var becomes an element with type string and value 123. You may use this variable anywhere you would use an ordinary Perl variable in SOAP calls. This also allows you to provide not only a specific data type, but also a specific name and attributes.

Since many services rely on the names of parameters (instead of their positions), you may specify names for request parameters using the same syntax. To add a name to the $var variable, call $var->name('myvar'), or even chain calls with the type() method:

 my $var = SOAP::Data->type(string => 123)->name('myvar');

  # -- OR --
  my $var = SOAP::Data->type('string')->name(myvar => 123);

  # -- OR --
  my $var = SOAP::Data->type('string')->name('myvar')->value(123);

You can always get or set the value of a variable with the value() method:

 $var->value(321);            # set new value

  my $realvalue = $var->value; # store it in variable

Conclusion

This should be enough to get you started building SOAP applications. You can read the manpages (or even the source, if you're brave!) to learn more, and don't forget to keep checking www.soaplite.com for more documentation, examples, and SOAP-y fun.

Part 2 of this article can be found here


Major contributors:

Nathan Torkington
Basically started this work and pushed the whole process.

Tony Hong
Invaluable comments and input helped me keep this material fresh and simple.


This Week on p5p 2001/01/21




sigsetjmp wrangling continues

Last week, there was some discussion about whether Perl ought to use sigsetjmp to jump out of evals and to die. Part of the problem is that sigsetjmp is quite a lot slower than setjmp, so if we can get by without it, we ought to. Nick Ing-Simmons has removed sigsetjmp from the current sources, but now Nick Clark has found that this can sometimes cause a slowdown due to bizarre optimizing.

The discussion then veered onto the problems of using any sort of non-local jump with threads. Alan pointed out that neither setjmp nor sigsetjmp were thread-safe at all, and since Perl uses them, Perl's threading implementation is horrifically broken. There were no good suggestions about how to get around this, or how to get away without non-local jumps for trapping exceptions. Alan declared Perl 5 beyond hope, but said:

If perl6 has something akin to the perl5 stack, eval/die will have to be implemented so that it rolls back up the stack on die, rather than the current longjmp hack.

Alan also suggested that we would need to roll our own threading model in Perl 6 to have full control over exception handling; the discussion carried on about Perl 6 over on the perl6-internals mailing list.

The part where it gets interesting this week starts here.

Safe Signals

Nick came up with a program for people to try, to confirm his suspicions about signal handling. His plan was to have the C signal handler set a flag which is checked after each op is performed - which seems the most obvious way of doing it - but he was worried about systems where a SIGCHLD handler that didn't call wait would leave outstanding children when the handler returned, meaning another SIGCHLD would be delivered, meaning the handler would get called again; rinse and repeat.
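At the Perl level, the idiom Nick is describing looks something like this. This is a Unix-only toy under the assumption of a working fork; the real implementation lives in C inside the interpreter, checking the flag between ops:

```perl
use strict;
use POSIX ':sys_wait_h';    # for WNOHANG

my $got_child = 0;
$SIG{CHLD} = sub { $got_child = 1 };    # the handler only sets a flag

my $pid = fork;
die "fork failed: $!" unless defined $pid;
exit 0 if $pid == 0;                    # child exits straight away

# The "main loop": do work, and poll the flag between units of work.
until ($got_child) {
    select undef, undef, undef, 0.01;   # stand-in for running one op
}
1 while waitpid(-1, WNOHANG) > 0;       # reap every outstanding child
print "reaped child $pid safely\n";
```

Because the handler does nothing but set a flag, and the reaping loop calls waitpid until there are no children left, a second SIGCHLD arriving mid-reap is harmless.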

However, every platform that was tested worked sensibly, so it looks like Nick is going to go ahead and try and implement safe signals.

Large file support wrangling continues

The discussion last week about Linux's large file support continued this week. The problem is that we need to find the right preprocessor directive to get the most use out of the system; most of the ones which look useful ( _GNU_SOURCE, for instance) also expose other things that we don't necessarily want. It would also throw up problems in programs embedding Perl. Russ Allbery had been through all this with both INN and autoconf. His advice:

Eventually, the autoconf folks decided to give up on glibc 2.1 as broken for large files and just recommend people upgrade to glibc 2.2 or add -D_GNU_SOURCE manually if it works for their application.

Multiple Pre-Incrementing

I decided to throw a spaniel in the works by submitting a patch to make

    print (++$i, ++$i, ++$i)

work as John Vromans would like; currently, Perl reserves the right to do, well, pretty much anything it wants in that situation, but the "obvious" thing for it to print would be (assuming $i was undefined before hand) "123". There were some arguments as to why this would be a bad idea - firstly, defining behaviour that is currently undefined robs us of the right to make clever optimizations in the future, and also that the fix slows down the behaviour of pre-increment and pre-decrement for everyone, not just those doing multiple pre-increments in a single statement.

I also wondered whether the confusion at seeing Perl output "333" in the above code would be offset by the confusion required to try something like that in a serious program anyway.

Test::Harness Megapatch

Michael Schwern did his usual trick of popping up out of nowhere with a 40K patch - this time he rewrote Test::Harness to support a lot of sensible things, like the trick of having comments after your message, like this:

    ok 123 - Testing the frobnicator

so that when tests fail you can search for that string. He went back and forth with Andreas about some of the new features - Andreas felt that, for instance, allowing upper-case output creates additional noise and distraction. Jarkko agreed, and the patch got fried.

Not put off, Schwern then went on to unify the skip and todo interfaces. Unfortunately, that couldn't be done without breaking existing code, especially CPAN modules, so that patch died the death too. Oh, the embarrassment.

Tokeniser reporting and pretty-printing

I did something evil again. After hearing a talk by Knuth about Literate Programming, I went back to bemoaning the lack of a Perl pretty-printer, and the depressing words in the FAQ:

There is no program that will reformat Perl as much as indent(1) will do for C. The complex feedback between the scanner and the parser (this feedback is what confuses the vgrind and emacs programs) makes it challenging at best to write a stand-alone Perl parser.

So if I couldn't build a stand-alone parser, I'd use the one we've got - perl. By adding a call to a reporting function every time Perl makes a decision about what a token is, you can generate a listing of all the tokens in a program and their types. Implementation of a robust pretty-printer is left as an exercise for the reader; answers on a postcard, please.

(PS: I've since been alerted to the existence of Tim Maher's Perl beautifier, which is an equally cool hack.)

Unicode

How could I go a week without mentioning Unicode? Hiroto's qu operator is in, and someone's obviously using it, because Nick Clark found that it was turning up a bug in UTF8 hashes - $foo{something} and $foo{qu/something/} were being seen as two different keys. Hiroto said he was aware of it and meant to send a patch, but hadn't managed to yet.

UTF8 support on EBCDIC is starting to work, but it's being done in a bit of a bizarre way - we're actually using UTF8 to encode EBCDIC itself, rather than Unicode. This means that while EBCDIC and non-EBCDIC platforms now both "support" UTF8 and all the code (on the whole) works, Weird Things(TM) might happen if EBCDIC people start playing with character classes or other Unicode features.

Various

IV preservation is still buggy.

I'll leave you with the news that several people reported problems with the bug-reporting system; Perl is so great, even its bugs have bugs.

Until next week I remain, your humble and obedient servant,


Simon Cozens


Introducing the Template Toolkit

There are a number of Perl modules that are universally recognised as The Right Thing To Use for certain tasks. If you accessed a database without using DBI, pulled data from the WWW without using one of the LWP modules, or parsed XML without using XML::Parser or one of its subclasses, then you'd run the risk of being shunned by polite Perl society.

I believe that the year 2000 saw the emergence of another 'must have' Perl module - the Template Toolkit. I don't think I'm alone in this belief as the Template Toolkit won the 'Best New Module' award at the Perl Conference last summer. Version 2.0 of the Template Toolkit (known as TT2 to its friends) was recently released to the CPAN.

TT2 was designed and written by Andy Wardley <abw@cre.canon.co.uk>. It was born out of Andy's previous templating module, Text::Metatext, in best Fred Brooks 'plan to throw one away' manner; and aims to be the most useful (or, at least, the most used) Perl templating system.

TT2 provides a way to take a file of fixed boilerplate text (the template) and embed variable data within it. One obvious use of this is in the creation of dynamic web pages and this is where a lot of the attention that TT2 has received has been focussed. In this article, I hope to demonstrate that TT2 is just as useful in non-web applications.

Using the Template Toolkit

Let's look at how we'd use TT2 to process a simple data file. TT2 is an object oriented Perl module. Having downloaded it from CPAN and installed it in the usual manner, using it in your program is as easy as putting the lines

    use Template;
    my $tt = Template->new;

in your code. The constructor function, new, takes a number of optional parameters which are documented in the copious manual pages that come with the module, but for the purposes of this article we'll keep things as simple as possible.

To process the template, you would call the process method like this

    $tt->process('my_template', \%data)
      || die $tt->error;

We pass two parameters to process: the first is the name of the file containing the template to process (in this case, my_template), and the second is a reference to a hash which contains the data items that you want to use in the template. If processing the template gives any kind of error, the program will die with a (hopefully) useful error message.

So what kinds of things can go in %data? The answer is just about anything. Here's an example showing data about English Premier League football teams.

    my @teams = ({ name => 'Man Utd',
                   played => 16,
                   won => 12,
                   drawn => 3,
                   lost => 1 },
                 { name => 'Bradford',
                   played => 16,
                   won => 2,
                   drawn => 5,
                   lost => 9 });
    my %data = ( name => 'English Premier League',
                 season => '2000/01',
                 teams => \@teams );

This creates three data items which can be accessed within the template, called name, season and teams. Notice that teams is a complex data structure.

Here is a template that we might use to process this data.

    League Standings
    League Name: [% name %]
    Season     : [% season %]
    Teams:
    [% FOREACH team = teams -%]
    [% team.name %] [% team.played -%] 
     [% team.won %] [% team.drawn %] [% team.lost %]
    [% END %]

Running this template with this data gives us the following output

    League Standings
    League Name: English Premier League
    Season     : 2000/01
    Teams:
    Man Utd 16 12 3 1
    Bradford 16 2 5 9

Hopefully the syntax of the template is simple enough to follow. There are a few points to note.

  • Template processing directives are written using a simple language which is not Perl.
  • The keys of the %data have become the names of the data variables within the template.
  • Template processing directives are surrounded by [% and %] sequences.
  • If a directive's opening tag is written as [%- or its closing tag as -%], then the preceding or following linefeed (respectively) is suppressed.
  • In the FOREACH loop, each element of the teams list was assigned, in turn, to the temporary variable team.
  • Each item assigned to the team variable is a Perl hash. Individual values within the hash are accessed using a dot notation.

It's probably the first and last of these points which are the most important. The first point emphasises the separation of the data acquisition logic from the presentation logic. The person creating the presentation template doesn't need to know Perl; they only need to know the data items which will be passed into the template.

The last point demonstrates the way that TT2 protects the template designer from the implementation of the data structures. The data objects passed to the template processor can be scalars, arrays, hashes, objects or even subroutines. The template processor will just interpret your data correctly and Do The Right Thing to return the correct value to you. In this example each team was a hash, but in a larger system each team might be an object, in which case name, played, etc. would be accessor methods to the underlying object attributes. No changes would be required to the template as the template processor would realise that it needed to call methods rather than access hash values.
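You can get a feel for how this dispatch works with a toy resolver - this is purely illustrative (TT2's real stash code is considerably more sophisticated), but it shows how "team.name" can mean either a hash lookup or a method call:

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# A toy version of TT2-style dot resolution: "team.name" becomes a
# hash lookup if team is a hash reference, or a method call if it's
# an object. (Illustrative only - not Template Toolkit's actual code.)
sub dot {
    my ($thing, $key) = @_;
    return $thing->{$key} if ref $thing eq 'HASH';
    return $thing->$key() if blessed $thing;
    die "can't resolve '$key'\n";
}

# A hypothetical minimal Team class for comparison.
package Team;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub name { return $_[0]->{name} }

package main;

my $hash_team = { name => 'Man Utd' };
my $obj_team  = Team->new(name => 'Arsenal');

print dot($hash_team, 'name'), "\n";   # hash lookup: "Man Utd"
print dot($obj_team,  'name'), "\n";   # method call: "Arsenal"
```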

A more complex example

Stats about the English Football League are usually presented in a slightly more complex format than the one we used above. A full set of stats will show the number of games that a team has won, lost or drawn, the number of goals scored for and against the team, and the number of points that the team therefore has. Teams gain three points for a win and one point for a draw. When teams have the same number of points, they are separated by goal difference - that is, the number of goals the team has scored minus the number of goals scored against them. To complicate things even further, the games won, drawn and lost and the goals for and against are often split between home and away games.

Therefore, if you have a data source which lists the team name together with the games won, drawn and lost and the goals for and against, split into home and away (a total of eleven data items), you can calculate all of the other items (goal difference, points awarded and even position in the league). Let's take such a file, but we'll only look at the top three teams. It will look something like this:

  Man Utd,7,1,0,26,4,5,2,1,15,6
  Arsenal,7,1,0,17,4,2,3,3,7,9
  Leicester,4,3,1,10,8,4,2,2,7,4

A simple script to read this data into an array of hashes will look something like this (I've simplified the names of the data columns - w, d and l are games won, drawn and lost, and f and a are goals scored for and against; an h or a at the front of a data item name indicates whether it's a home or away statistic):

  my @cols = qw(name hw hd hl hf ha aw ad al af aa);
  my @teams;
  while (<>) {
    chomp;
    my %team;
    @team{@cols} = split /,/;
    push @teams, \%team;
  }

We can then go through the teams again and calculate all of the derived data items:

  foreach (@teams) {
    $_->{w} = $_->{hw} + $_->{aw};
    $_->{d} = $_->{hd} + $_->{ad};
    $_->{l} = $_->{hl} + $_->{al};
    $_->{pl} = $_->{w} + $_->{d} + $_->{l};

    $_->{f} = $_->{hf} + $_->{af};
    $_->{a} = $_->{ha} + $_->{aa};
    $_->{gd} = $_->{f} - $_->{a};
    $_->{pt} = (3 * $_->{w}) + $_->{d};
  }

And then produce a list sorted in descending order:

  @teams
    = sort { $b->{pt} <=> $a->{pt}
             || $b->{gd} <=> $a->{gd} } @teams;

And finally add the league position data item:

  $teams[$_]->{pos} = $_ + 1 
    foreach 0 .. $#teams;
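Putting those fragments together (with the sample data inlined via __DATA__ so it runs standalone, and a simple print at the end to show the final table) gives a complete script:

```perl
use strict;
use warnings;

# Read the raw home/away stats into an array of hashes.
my @cols = qw(name hw hd hl hf ha aw ad al af aa);
my @teams;
while (my $line = <DATA>) {
    chomp $line;
    my %team;
    @team{@cols} = split /,/, $line;
    push @teams, \%team;
}

# Calculate the derived data items.
for (@teams) {
    $_->{w}  = $_->{hw} + $_->{aw};
    $_->{d}  = $_->{hd} + $_->{ad};
    $_->{l}  = $_->{hl} + $_->{al};
    $_->{pl} = $_->{w} + $_->{d} + $_->{l};
    $_->{f}  = $_->{hf} + $_->{af};
    $_->{a}  = $_->{ha} + $_->{aa};
    $_->{gd} = $_->{f} - $_->{a};
    $_->{pt} = 3 * $_->{w} + $_->{d};
}

# Sort descending by points, then goal difference, and assign positions.
@teams = sort { $b->{pt} <=> $a->{pt}
             || $b->{gd} <=> $a->{gd} } @teams;
$teams[$_]{pos} = $_ + 1 for 0 .. $#teams;

print "$_->{pos} $_->{name} $_->{pt}\n" for @teams;

__DATA__
Man Utd,7,1,0,26,4,5,2,1,15,6
Arsenal,7,1,0,17,4,2,3,3,7,9
Leicester,4,3,1,10,8,4,2,2,7,4
```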

Having pulled all of our data into an internal data structure, we can start to produce output using our templates. A template to create a CSV file containing the data split between home and away stats would look like this:

  [% FOREACH team = teams -%]
  [% team.pos %],[% team.name %],[% team.pl %],[% team.hw %],
  [%- team.hd %],[% team.hl %],[% team.hf %],[% team.ha %],
  [%- team.aw %],[% team.ad %],[% team.al %],[% team.af %],
  [%- team.aa %],[% team.gd %],[% team.pt %]
  [%- END %]

And processing it like this:

  $tt->process('split.tt', { teams => \@teams }, 'split.csv')
    || die $tt->error;

produces the following output:

  1,Man Utd,16,7,1,0,26,4,5,2,1,15,6,31,39
  2,Arsenal,16,7,1,0,17,4,2,3,3,7,9,11,31
  3,Leicester,16,4,3,1,10,8,4,2,2,7,4,5,29

Notice that we've introduced the third parameter to process. If this parameter is missing then TT2 sends its output to STDOUT. If this parameter is a scalar then it is taken as the name of a file to write the output to. This parameter can also be (amongst other things) a filehandle or a reference to an object which is assumed to implement a print method.

If we weren't interested in the split between home and away games, then we could use a simpler template like this:

  [% FOREACH team = teams -%]
  [% team.pos %],[% team.name %],[% team.pl %],[% team.w %],
  [%- team.d %],[% team.l %],[% team.f %],[% team.a %],
  [%- team.gd %],[% team.pt %]
  [% END -%]

Which would produce output like this:

  1,Man Utd,16,12,3,1,41,10,31,39
  2,Arsenal,16,9,4,3,24,13,11,31
  3,Leicester,16,8,5,3,17,12,5,29

Producing XML

This is starting to show some of the power and flexibility of TT2, but you may be thinking that you could just as easily produce this output with a foreach loop and a couple of print statements in your code. This is, of course, true; but that's because I've chosen a deliberately simple example to explain the concepts. What if we wanted to produce an XML file containing the data? And what if (as I mentioned earlier) the league data was held in an object? The code would then be even simpler, as most of the code we've written earlier would be hidden away in FootballLeague.pm.

  use FootballLeague;
  use Template;
  my $league = FootballLeague->new(name => 'English Premier');
  my $tt = Template->new;
  $tt->process('league_xml.tt', { league => $league })
    || die $tt->error;

And the template in league_xml.tt would look something like this:

  <?xml version="1.0"?>
  <!DOCTYPE LEAGUE SYSTEM "league.dtd">
  <league name="[% league.name %]" season="[% league.season %]">
  [% FOREACH team = league.teams -%]
    <team name="[% team.name %]"
          pos="[% team.pos %]"
          played="[% team.pl %]"
          goal_diff="[% team.gd %]"
          points="[% team.pt %]">
       <stats type="home"
              win="[% team.hw %]"
              draw="[%- team.hd %]"
              lose="[% team.hl %]"
              for="[% team.hf %]"
              against="[% team.ha %]" />
       <stats type="away"
              win="[% team.aw %]"
              draw="[%- team.ad %]"
              lose="[% team.al %]"
              for="[% team.af %]"
              against="[% team.aa %]" />
    </team>
  [% END -%]
  </league>

Notice that as we've passed the whole object into process, we need to put an extra level of indirection on our template variables - everything is now a component of the league variable. Other than that, everything in the template is very similar to what we've used before. Presumably now team.name calls an accessor function rather than carrying out a hash lookup, but all of this is transparent to our template designer.

Multiple Formats

As a final example, let's suppose that we need to create our football league tables in a number of formats. Perhaps we are passing this data on to other people and they can't all use the same format. Some of our users need CSV files and others need XML. Some require data split between home and away matches and others just want the totals. In total, then, we'll need four different templates, but the good news is that they can all use the same data object. All the script needs to do is establish which template is required and process it.

  use FootballLeague;
  use Template;
  my ($name, $type, $stats) = @ARGV;
  my $league = FootballLeague->new(name => $name);
  my $tt = Template->new;
  $tt->process("league_${type}_$stats.tt",
               { league => $league },
               "league_$stats.$type")
    || die $tt->error;

For example, calling this script as

  league.pl 'English Premier' xml split

will process a template called league_xml_split.tt and put the results in a file called league_split.xml.

This starts to show the true strength of the Template Toolkit. If we later wanted to add another file format - perhaps we wanted to create a league table HTML page or even a LaTeX document - then we would just need to create the appropriate template and name it according to our existing naming convention. We would need to make no changes to the code.

I hope you can now see why the Template Toolkit is fast becoming an essential part of many people's Perl installation.

A Beginner's Introduction to POE

By Dennis Taylor, with Jeff Goff

What Is POE, And Why Should I Use It?


Most of the programs we write every day have the same basic blueprint: they start up, they perform a series of actions, and then they exit. This works fine for programs that don't need much interaction with their users or their data, but for more complicated tasks, you need a more expressive program structure.

That's where POE (Perl Object Environment) comes in. POE is a framework for building Perl programs that lends itself naturally to tasks which involve reacting to external data, such as network communications or user interfaces. Programs written in POE are completely non-linear; you set up a bunch of small subroutines and define how they all call each other, and POE will automatically switch between them while it's handling your program's input and output. It can be confusing at first, if you're used to procedural programming, but with a little practice it becomes second nature.

POE Design

It's not much of an exaggeration to say that POE is a small operating system written in Perl, with its own kernel, processes, interprocess communication (IPC), drivers, and so on. In practice, however, it just boils down to a simple system for assembling state machines. Here's a brief description of each of the pieces that make up the POE environment:

States
The basic building block of the POE program is the state, which is a piece of code that gets executed when some event occurs -- when incoming data arrives, for instance, or when a session runs out of things to do, or when one session sends a message to another. Everything in POE is based around receiving and handling these events.
The Kernel
POE's kernel is much like an operating system's kernel: it keeps track of all your processes and data behind the scenes, and schedules when each piece of your code gets to run. You can use the kernel to set alarms for your POE processes, queue up states that you want to run, and perform various other low-level services, but most of the time you don't interact with it directly.
Sessions
Sessions are the POE equivalent to processes in a real operating system. A session is just a POE program which switches from state to state as it runs. It can create ``child'' sessions, send POE events to other sessions, and so on. Each session can store session-specific data in a hash called the heap, which is accessible from every state in that session.

POE has a very simple cooperative multitasking model; every session executes in the same OS process without threads or forking. For this reason, you should beware of using blocking system calls in POE programs.

Those are the basic pieces of the Perl Object Environment, although there are a few slightly more advanced parts that we ought to explain before we go on to the actual code:

Drivers
Drivers are the lowest level of POE's I/O layer. Currently, there's only one driver included with the POE distribution -- POE::Driver::SysRW, which reads and writes data from a filehandle -- so there's not much to say about them. You'll never actually use a driver directly, anyhow.
Filters
Filters, on the other hand, are inordinately useful. A filter is a simple interface for converting chunks of formatted data into another format. For example, POE::Filter::HTTPD converts HTTP 1.0 requests into HTTP::Request objects and back, and POE::Filter::Line converts a raw stream of data into a series of lines (much like Perl's <> operator).
Wheels
Wheels contain reusable pieces of high-level logic for accomplishing everyday tasks. They're the POE way to encapsulate useful code. Common things you'll do with wheels in POE include handling event-driven input and output and easily creating network connections. Wheels often use Filters and Drivers to massage and send off data. I know this is a vague description, but the code below will provide some concrete examples.
Components
A Component is a session that's designed to be controlled by other sessions. Your sessions can issue commands to and receive events from them, much like processes communicating via IPC in a real operating system. Some examples of Components include POE::Component::IRC, an interface for creating POE-based IRC clients, or POE::Component::Client::HTTP, an event-driven HTTP user agent in Perl. We won't be using any Components in this article, but they're a very useful part of POE nevertheless.

A Simple Example

For this simple example, we're going to make a server daemon which accepts TCP connections and prints the answers to simple arithmetic problems posed by its clients. When someone connects to it on port 31008, it will print ``Hello, client!''. The client can then send it an arithmetic expression, terminated by a newline (such as ``6 + 3\n'' or ``50 / (7 - 2)\n''), and the server will send back the answer. Easy enough, right?

Writing such a program in POE isn't terribly different from the traditional method of writing daemons in Unix. We'll have a server session which listens for incoming TCP connections on port 31008. Each time a connection arrives, it'll create a new child session to handle the connection. Each child session will interact with the user, and then quietly die when the connection is closed. And best of all, it'll only take 74 lines of modular, simple Perl.

The program begins innocently enough:

   1  #!/usr/bin/perl -w
   2  use strict;
   3  use Socket qw(inet_ntoa);
   4  use POE qw( Wheel::SocketFactory  Wheel::ReadWrite
   5              Filter::Line          Driver::SysRW );
   6  use constant PORT => 31008;

Here, we import the modules and functions which the script will use, and define a constant value for the listening port. The odd-looking qw() statement after the ``use POE'' is just POE's shorthand way for pulling in a lot of POE:: modules at once. It's equivalent to the more verbose:

        use POE;
        use POE::Wheel::SocketFactory;
        use POE::Wheel::ReadWrite;
        use POE::Filter::Line;
        use POE::Driver::SysRW;

Now for a truly cool part:

   7  new POE::Session (
   8      _start => \&server_start,
   9      _stop  => \&server_stop,
  10  );
  11  $poe_kernel->run();
  12  exit;

That's the entire program! We set up the main server session, tell the POE kernel to start processing events, and then exit when it's done. (The kernel is considered ``done'' when it has no more sessions left to manage, but since we're going to put the server session in an infinite loop, it'll never actually exit that way in this script.) POE automatically exports the $poe_kernel variable into your namespace when you write ``use POE;''.

The new POE::Session call needs a word of explanation. When you create a session, you give the kernel a list of the events it will accept. In the code above, we're saying that the new session will handle the _start and _stop events by calling the &server_start and &server_stop functions. Any other events which this session receives will be ignored. _start and _stop are special events to a POE session: the _start state is the first thing the session executes when it's created, and the session is put into the _stop state by the kernel when it's about to be destroyed. Basically, they're a constructor and a destructor.

Now that we've written the entire program, we have to write the code for the states which our sessions will execute while they run. Let's start with (appropriately enough) &server_start, which is called when the main server session is created at the beginning of the program:

  13  sub server_start {
  14      $_[HEAP]->{listener} = new POE::Wheel::SocketFactory
  15        ( BindPort     => PORT,
  16          Reuse        => 'yes',
  17          SuccessState => \&accept_new_client,
  18          FailureState => \&accept_failed
  19        );
  20      print "SERVER: Started listening on port ", PORT, ".\n";
  21  }

This is a good example of a POE state. First things first: Note the variable called $_[HEAP]? POE has a special way of passing arguments around. The @_ array is packed with lots of extra arguments -- a reference to the current kernel and session, the state name, a reference to the heap, and other goodies. To access them, you index the @_ array with various special constants which POE exports, such as HEAP, SESSION, KERNEL, STATE, and ARG0 through ARG9 to access the state's user-supplied arguments. Like most design decisions in POE, the point of this scheme is to maximize backwards compatibility without sacrificing speed. The example above is storing a SocketFactory wheel in the heap under the key 'listener'.

The POE::Wheel::SocketFactory wheel is one of the coolest things about POE. You can use it to create any sort of stream socket (sorry, no UDP sockets yet) without worrying about the details. The statement above will create a SocketFactory that listens on the specified TCP port (with the SO_REUSEADDR option set) for new connections. When a connection is established, it will call the &accept_new_client state to pass on the new client socket; if something goes wrong, it'll call the &accept_failed state instead to let us handle the error. That's all there is to networking in POE!

We store the wheel in the heap to keep Perl from accidentally garbage-collecting it at the end of the state -- this way, it's persistent across all states in this session. Now, onto the &server_stop state:

  22  sub server_stop {
  23      print "SERVER: Stopped.\n";
  24  }

Not much to it. I just put this state here to illustrate the flow of the program when you run it. We could just as easily have had no _stop state for the session at all, but it's more instructive (and easier to debug) this way.

Here's where we create new sessions to handle each incoming connection:

  25  sub accept_new_client {
  26      my ($socket, $peeraddr, $peerport) = @_[ARG0 .. ARG2];
  27      $peeraddr = inet_ntoa($peeraddr);
  28      new POE::Session (
  29          _start => \&child_start,
  30          _stop  => \&child_stop,
  31          main   => [ 'child_input', 'child_done', 'child_error' ],
  32          [ $socket, $peeraddr, $peerport ]
  33      );
  34      print "SERVER: Got connection from $peeraddr:$peerport.\n";
  35  }

Our POE::Wheel::SocketFactory will call this subroutine whenever it successfully establishes a connection to a client. We convert the socket's address into a human-readable IP address (line 27) and then set up a new session which will talk to the client. It's somewhat similar to the previous POE::Session constructor we've seen, but a couple things bear explaining:

@_[ARG0 .. ARG2] is shorthand for ($_[ARG0], $_[ARG1], $_[ARG2]). You'll see array slices used like this a lot in POE programs.

What does line 31 mean? It's not like any other event_name => state pair that we've seen yet. Actually, it's another clever abbreviation. If we were to write it out the long way, it would be:

  new POE::Session (
      ...
      child_input => \&main::child_input,
      child_done  => \&main::child_done,
      child_error => \&main::child_error,
      ...
  );

It's a handy way to write out a lot of state names when the state name is the same as the event name -- you just pass a package name or object as the key, and an array reference full of subroutine or method names, and POE will just do the right thing. See the POE::Session docs for more useful tricks like that.

Finally, the array reference at the end of the POE::Session constructor's argument list (on line 32) is the list of arguments which we're going to manually supply to the session's _start state.

If the POE::Wheel::SocketFactory had problems creating the listening socket or accepting a connection, this happens:

  36  sub accept_failed {
  37      my ($function, $error) = @_[ARG0, ARG2];
  38      delete $_[HEAP]->{listener};
  39      print "SERVER: call to $function() failed: $error.\n";
  40  }

Printing the error message is normal enough, but why do we delete the SocketFactory wheel from the heap? The answer lies in the way POE manages session resources. Each session is considered ``alive'' so long as it has some way of generating or receiving events. If it has no wheels and no aliases (a nifty POE feature which we won't cover in this article), the POE kernel realizes that the session is dead and garbage-collects it. The only way the server session can get events is from its SocketFactory wheel -- if that's destroyed, the POE kernel will wait until all its child sessions have finished, and then garbage-collect the session. At this point, since there are no remaining sessions to execute, the POE kernel will run out of things to do and exit.

So, basically, this is just the normal way of getting rid of unwanted POE sessions: dispose of all the session's resources and let the kernel clean up. Now, onto the details of the child sessions:

  41  sub child_start {
  42      my ($heap, $socket) = @_[HEAP, ARG0];
  43      $heap->{readwrite} = new POE::Wheel::ReadWrite
  44        ( Handle => $socket,
  45          Driver => new POE::Driver::SysRW (),
  46          Filter => new POE::Filter::Line (),
  47          InputState => 'child_input',
  48          ErrorState => 'child_error',
  49        );
  50      $heap->{readwrite}->put( "Hello, client!" );
  51      $heap->{peername} = join ':', @_[ARG1, ARG2];
  52      print "CHILD: Connected to $heap->{peername}.\n";
  53  }

This gets called every time a new child session is created to handle a newly connected client. We'll introduce a new sort of POE wheel here: the ReadWrite wheel, which is an event-driven way to handle I/O tasks. We pass it a filehandle, a driver which it'll use for I/O calls, and a filter that it'll munge incoming and outgoing data with (in this case, turning a raw stream of socket data into separate lines and vice versa). In return, the wheel will send this session a child_input event whenever new data arrives on the filehandle, and a child_error event if any errors occur.

We immediately use the new wheel to output the string ``Hello, client!'' to the socket. (When you try out the code, note that the POE::Filter::Line filter takes care of adding a line terminator to the string for us.) Finally, we store the address and port of the client in the heap, and print a success message.

We will omit discussion of the child_stop state, since it's only one line long. Now for the real meat of the program: the child_input state!

  57  sub child_input {
  58      my $data = $_[ARG0];
  59      $data =~ tr{0-9+*/()-}{}cd;
  60      return unless length $data;
  61      my $result = eval $data;
  62      chomp $@;
  63      $_[HEAP]->{readwrite}->put( $@ || $result );
  64      print "CHILD: Got input from peer: \"$data\" = $result.\n";
  65  }

When the client sends us a line of data, we strip it down to a simple arithmetic expression and eval it, sending either the result or an error message back to the client. Normally, passing untrusted user data straight to eval() is a horribly dangerous thing to do, so we have to make sure we remove every non-arithmetic character from the string before it's evaled (line 59). The child session will happily keep accepting new data until the client closes the connection. Run the code yourself and give it a try!
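The sanitize-then-eval trick is worth seeing in isolation - tr/.../cd deletes every character not in the listed set, so nothing but digits, arithmetic operators and parentheses ever reaches eval:

```perl
use strict;
use warnings;

# The input-scrubbing from child_input, extracted into a standalone
# function (a sketch for illustration, not the article's exact code).
sub calc {
    my ($data) = @_;
    $data =~ tr{0-9+*/()-}{}cd;    # delete the complement of this set
    return "error" unless length $data;
    my $result = eval $data;
    return $@ ? "error" : $result;
}

print calc("6 + 3\n"), "\n";              # 9
print calc("50 / (7 - 2)\n"), "\n";       # 10
print calc("system('rm -rf /')"), "\n";   # stripped to "(-/)", which
                                          # fails to parse: "error"
```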

The child_done and child_error states should be fairly self-explanatory by now -- they each delete the child session's ReadWrite wheel, thus causing the session to be garbage-collected, and print an expository message explaining what happened. Easy enough.

That's All For Today

And that's all there is to it! The longest subroutine in the entire program is only 12 lines, and all the complicated parts of the server-writing process have been offloaded to POE. Now, you could make the argument that it could be done more easily as a procedural-style program, like the examples in man perlipc. For a simple example program like this, that would probably be true. But the beauty of POE is that, as your program scales, it stays easy to modify. It's easier to organize your program into discrete elements, and POE will provide all the features you would otherwise have had to hackishly reinvent yourself when the need arose.

So give POE a try on your next project. Anything that would ordinarily use an event loop would be a good place to start using POE. Have fun!

Source Listing....

Related Links

http://poe.perl.org/
The POE home page. All good things stem from here.

This Week on p5p 2001/01/14



Notes

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month.

Excising sigsetjmp

Alan Burlison found that by telling Perl that Solaris didn't really have sigsetjmp, he could get a noticeable improvement in speed - around 15%. He asked if sigsetjmp could go, or whether there was a reason for it being there. Andy said there was a reason, but he had forgotten what it was. Nick I-S asked if there was anything that sigsetjmp was absolutely required for; the answer, from Alan, was that sigsetjmp restores the signal mask after a jump. In Perl terms, this means that if you die from a signal handler into an eval (something you'd be doing with alarm, for instance) then you'd be sure to get your signal handler reinstalled; with ordinary setjmp your signal mask might be restored, but it might not. There was some discussion as to whether it would be possible to use sigsetjmp only for jumps into and out of a signal handler, but Nicholas Clark pointed out that since any Perl subroutine could be a signal handler, it's more or less impossible to make the distinction. The eventual consensus was that Perl's signal handling is currently so, uh, sub-optimal that removing sigsetjmp probably wouldn't make that much of a difference.
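The alarm case Alan mentions looks like this from the Perl side (a minimal sketch, assuming a Unix-ish platform that delivers SIGALRM; this is the usual timeout idiom, not code from the thread):

```perl
#!/usr/bin/perl -w
use strict;

# die()ing from the signal handler longjmp()s back into the enclosing
# eval; sigsetjmp is what guarantees the signal mask is restored sanely
# after that jump.
$SIG{ALRM} = sub { die "timed out\n" };

my $answer = eval {
    alarm 1;        # deliver SIGALRM in one second
    sleep 5;        # stand-in for a slow operation
    alarm 0;
    "finished";
};
alarm 0;            # belt and braces: cancel any pending alarm

print $@ ? "caught: $@" : "got: $answer\n";
```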

In the end, Nick Ing-Simmons came up with a patch which provided roughly sigsetjmp-like semantics with ordinary setjmp, so it looks like there might be a win there.

Benchmarking

Alan did some more fiddling with optimization and Solaris configuration, and managed to get what he claimed was a 30% overall speedup - 18% due to setjmp and 12% due to optimizer settings. Numbers like that immediately sparked a debate on how you can meaningfully benchmark a programming language at all; it's well known that the test suite exercises Perl in a number of non-standard ways, and really doesn't represent real-world use. Alan said that his tests had been done on a real XS module for dealing with Solaris accounting.

Nicholas Clark asked what a sensible benchmark would be; he suggested Gisle's perlbench, which was at least designed to try to be a fair test for Perl, but it seemed there was some confusion as to how it was supposed to work. Doug Bagley's programming language shootout was also mentioned.

Jarkko nailed the question in the end: "The problem with all artificial benchmarks is that they are artificial." Read about it.

UTF8 Heroism

INABA Hiroto's been at it again. With his latest patches, the Unicode torture test works fine, which is fantastic news - Unicode should now be considered stable and usable. In fact, one of his patches also fixes a couple of regular expression bugs as well. There was then some disagreement over Unicode semantics (as usual) and whether or not \x{XX} should produce Unicode output; Hiroto came up with an excellent suggestion: the qu// operator would work like a quoted string but would always produce UTF8. And, dammit, he implemented it as well. In the words of Pete Townshend, I've gotta hand my Unicode crown to him. Or something.

All that's really left to do now is to reconcile EBCDIC support and UTF8 support - the suggested way to do this was to put in some conversion tables between the two character encodings, so that anything that created UTF8 data would have its EBCDIC input sent through a filter to turn it into Latin 1, and anything which decoded UTF8 data would be sent through a filter to turn it back into EBCDIC. There was some progress on that this week, but a fundamental problem remains: some things, such as version strings, want the UTF8 codepoints qua codepoints. That's to say, the numbers in v5.7.0 should NOT be transformed into their EBCDIC equivalents. This was manifesting itself with weird errors like

    Perl lib version (v5.7.0) doesn't match executable version (v5.7.0)

But it's being worked on.

Cygwin versus Windows

Some issues surfaced while Reini Urban was looking at Berkeley DB support in Cygwin - not all of them were Perl related, but contained useful information for porters.

Some code in Berkeley DB relied on the maximum path length; Reini wanted to use an #ifdef _WIN32 block to get at MAX_PATH, but Charles Wilson pointed out that Cygwin should NOT define _WIN32, which is a compatibility crutch for bad ports. Cygwin already defines FILENAME_MAX and PATH_MAX as ISO C and POSIX demand, so those should be used instead of MAX_PATH, which is a strange beast from Windows-land.

The more general lesson here for Perl porters is that you should code for Cygwin as if it were a real, POSIX-compliant system, rather than as if it were Windows.
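From the Perl side, the same lesson means reaching for the POSIX spelling of these constants rather than anything Windows-flavored. A minimal sketch (assuming a platform where the POSIX module exposes PATH_MAX, as it does on Cygwin and most Unixes):

```perl
#!/usr/bin/perl -w
use strict;
use POSIX qw(PATH_MAX);   # the standard constant; no _WIN32/MAX_PATH needed

# $^O is 'cygwin' on Cygwin builds and 'MSWin32' on native Windows perls,
# but for path-length limits the POSIX constant is the portable choice.
printf "Running on %s; PATH_MAX is %d\n", $^O, PATH_MAX;
```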

Oxymoron of the thread award went to Ernie Boyd, who explained MAX_PATH as a "MS Windows standard".

Lvalue lists

Continuing the lvalue saga, Stephen McCamant produced a full and glorious lvalue subroutine patch, which Jarkko applied. Tim Bunce wondered what would happen if you said

        (sub_returning_lvalue_hash()) = 1;

Stephen explained that the rules for assigning things are exactly the same as you'd expect from scalars, and that, for instance, you should put brackets around the right-hand side if you're doing anything clever:

        sub_returning_lvalue_array() = (1, 2);

Radu Greab fixed a problem where lvalue subs weren't properly imposing list context on the assignment; this causes all sorts of problems when you have

    (lvalue1(), lvalue2()) = split ' ', '1 2 3 4';

as split doesn't see the right number of elements to populate. This led to a discussion of the curious and undocumented PL_modcount. This variable tells Perl how many things to fill up - it's actually only used in the case of split. However, it uses the number 10000 as a signifier for "this is going to be in list context, so just keep filling". Jarkko, after possibly one too many games of wumpus, objected to this undocumented, unmacroified, bizarre magic number. However, both the magic number and the lvalue split bug got tidied up.
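The effect PL_modcount implements is easy to observe from ordinary Perl (a small sketch; the variable names are mine): assigning split's result to a list of scalars supplies an implicit LIMIT of one more than the number of targets, while an array target means "just keep filling".

```perl
#!/usr/bin/perl -w
use strict;

# Three scalar targets, so split gets an implicit LIMIT of 4: it produces
# ('a', 'b', 'c', 'd e') and the surplus field is simply discarded.
my ($x, $y, $z) = split ' ', 'a b c d e';
print "$x $y $z\n";                 # prints "a b c"

# An array target is the "keep filling" case the summary describes:
my @all = split ' ', 'a b c d e';
print scalar(@all), " fields\n";    # prints "5 fields"
```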

Linux large file support

Richard Soderberg had a valiant crack at getting large file support to work under Linux, and concluded that he had to include the file features.h to make things work; after a little more messing around, he found that -D_GNU_SOURCE should also turn on the required 64-bit types. Russ Allbery piped up saying that -D_GNU_SOURCE ought to be more than enough - if it wasn't, there was a bug in glibc. (For a fun moment, it looked as though features.h was somewhat ironically named.)

Andreas said that his experience had been that upgrading his kernel, making the kernel headers available and then rebuilding glibc had magically given him large file support with no changes to Perl required - just a reconfigure and recompile. Linux users take note!

Calls for papers

Nat Torkington reminded us that the Perl Conference call for papers has been published, and gave a few ideas for papers that Perl porters could give. We're trying to press-gang someone into giving a paper on how the regular expression engine actually works, but the usual suspects have gone very quiet.

Rich Lafferty also remarked that the equally worthy Yet Another Perl Conference was also seeking papers.

Various

Thanks for all the work on the bugs reported last week!

Until next week I remain, your humble and obedient servant,


Simon Cozens

This Fortnight on p5p 2000/12/31



Notes

You can subscribe to an e-mail version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month.

Allow me to apologize for there being no service last week, as I was traveling hither and thither for New Year's festivities. This is another special bumper edition covering last week and the week before.

Fix for gv_handler segfault

Lupe Christoph found an ugly bug uncovered by Ilya's recent object destruction speedup patch: basically, we had called a function (mg_find) on a stash before actually making sure that the stash existed, causing a segfault. Lupe added a sanity check to the handler function and mg_find, just to be on the safe side.

rsync needs occasional delete

Lupe also got bitten by a problem that could affect some of you if you're following the rsync mainline: rsync won't, by default, delete files that have been deleted from the repository. This caused strange test failures, because the test that was failing doesn't exist any more. He passes on these words of wisdom:

Note that this will not delete any files that were in '.' before the rsync. Once you are sure that the rsync is running correctly, run it with the --delete and the --dry-run options like this:

# rsync -avz --delete --dry-run rsync://ftp.linux.activestate.com/perl-current/ .

This will simulate an rsync run that also deletes files not present in the bleadperl master copy. Observe the results from this run closely. If you are sure that the actual run would delete no files precious to you, you could remove the '--dry-run' option.

Outstanding Unicode fix

This sneaked in just as I was writing the previous summary: INABA Hiroto came up with a 48k patch that fixed a few Unicode bugs. He also added a pragma unicode::distinct, which makes a Unicode-encoded string never equal to a nonencoded string. That's to say, if we have:

    $a = chr(50).chr(400); # Needs to be UTF8-encoded, because of chr(400)
    chop($a);              # Now we have chr(50), but it's still encoded.
    $b = chr(50);          # This isn't UTF8-encoded

Under normal circumstances, $a eq $b. This is what we expect, because they represent the same character. But under unicode::distinct, they won't be equal because they are represented differently.

He also fixed a few other issues I should have been looking at, like Unicode tr///.

New Solaris hints file

Lupe Christoph (again; busy man) attempted to update the Solaris hints file; it looked pretty good, but produced weird errors (No, Solaris machines are not EBCDIC. Or at least, not normally.) that were traced to a problem with Configure calling another shell file to get more information (a "call back unit") but then getting confused as to where the source directory is. Robin Barker also found a similar problem when Configure sets $src using a relative path instead of an absolute one. The problem was eventually hunted down and shot, and the new hints file is now working properly.

Jarkko also re-wrote the DEC OSF hints file.

Lots of lvalue hackery

I got into a very strange mood and started looking at lvalue subroutines. These are things that allow you to return a variable or something modifiable from a subroutine, like this:

 $a = 20;
 sub foo :lvalue { $a }
 foo() = 30; # Now $a is 30

The first problem was that this didn't extend to AUTOLOAD, meaning you couldn't say

 sub AUTOLOAD : lvalue { ${$AUTOLOAD} }
 foo() = 20;

and have it set $foo. Perhaps that was meant to be a feature, but it was fixed anyway. Of course, the natural extension to that would be to let subs be called without the brackets, like this:

 foo = 20;

(Bare words in lvalue context are now interpreted as either subs or filehandles, depending on what you have declared or open.)

Unfortunately, you can't return arrays, hashes, or slices of arrays or hashes. This is where the trouble started. The problem is that you need some way to tell the operator that returns an array that you actually want the array itself, rather than a list of its elements. Look at this:

 @a = @b;

Here, @a is being modified, and so the actual AV is put onto the stack. But in the case of @b, the list of its elements is put on the stack. The difference is that the op for @a knows it's being modified, and the one for @b doesn't.

Back to lvalue subs - we've got an operator in a subroutine that is going to be modified, so it needs to return the AV (the container) instead of the values. But which one?

There's nothing about an op that signifies that it's going to be used as a return value, so we don't know which op we need to tell that it's being modified. I had a go at a cheap way of doing it, but Stephen McCamant eventually persuaded me that it didn't really work, and he's now working on his own way of doing it, which looks quite nice. There followed a debate about what you ought to be able to return from an lvalue subroutine (Does shift constitute a modifiable value?) that is still going on. Read about it.

Stephen McCamant and Doug MacEachern week

The second week of the report, the first week in January, was dominated by a lot of good stuff from Stephen and Doug. As well as picking up where I left off on lvalue subroutines, Stephen identified a problem with tests depending on modules that might not have been built (which must be a tricky area, because three people tried to patch it and only one succeeded ...) and then came out with some really solid patches. He fixed a couple of problems with B::Deparse and also Perl's handling of continue blocks, and then a problem with all global variables looking like they'd been declared with our. (This is quite a complex one, so if you're interested, read the patch description.) To cap it all, he produced a useful tool that he believes is a replacement for B::Terse; it's called B::Concise and it allows you to control the output format through a pattern language, and you can get it from http://csua.berkeley.edu/~smcc/Concise.pm.

Not to be outdone, Doug noticed a conflict between two system header files (which could well be Linux being sloppy) and a conflict between PerlIO and stdio. He then patched the default XS typemap to use Perl_croak instead of the older, deprecated croak; tried to fix XSUBs to be declared static (but Nick I-S pointed out that wouldn't fly) and plumped instead for giving them an additional prototype to help with compiler warnings; and added a nice shortcut to the XS build process so you can say

 make MyModule.i

and get a pre-processed version of the C file. He also fixed some prototypes, and finished the week with a stroke of pure genius, allowing AUTOLOAD to be an XSUB.

Things to investigate

Here are some things that out-of-work porters can investigate. (In fact, it would be really nice if everyone who submitted a bug report got some kind of feedback from a human being, so if you see something that hasn't been dealt with, why not look into it?)

This bug appears to require a strange set of conditions, but generates a segfault; perhaps someone could find out where it's segfaulting, try and narrow down the problem, and report back.

A bizarre one from torsten@sotlx2.sot.de: "When descending into a Joliet filesystem Find stops after the first level. Rockridge or ISO CDs are ok." (That's bug ID 20010103.001, if you're thinking of fixing it.)

For the scary and godlike, here's a B::C bug for you to chew on. Enjoy.

Various

There is little else to report this week; a couple of reports of bugs fixed in the latest snapshots - if you're reporting a bug, could you please try it out against at least 5.7.0, if not a more recent snapshot, because we could already have fixed it. In the second week, there were many little patches that didn't generate any discussion but were still good to see. Only four new IV-preservation bugs this time.

Some spam, many test results (Thanks, Alan!) and a couple of nonbugs.

Until next week I remain, your humble and obedient servant,


Simon Cozens

Beginners Intro to Perl - Part 6

Editor's note: this venerable series is undergoing updates. You might be interested in the newer versions, available at:

Table of Contents

Part 1 of this series
Part 2 of this series
Part 3 of this series
Part 4 of this series
Part 5 of this series

Doing It Right the First Time
Comments
Warnings
Taint
Stuff Taint Doesn't Catch
use strict
  •Strict vars
  •Strict subs
  •Want a Sub, Get a String
  •The One Exception
Is This Overkill?
Play Around!

Doing It Right the First Time


Perl is a useful tool, and many people use it to write good software. But like all programming languages, Perl can also be used to create bad software. Bad software contains bugs, has security holes, and is hard to fix or extend.

Fortunately, Perl offers you many ways to increase the quality of the programs you write. In this last installment in the Beginner's Intro series, we'll take a look at a few of them.

Comments

In the first part of this series, we looked at the lowly #, which indicates a comment. Comments are your first line of defense against bad software, because they help answer the two questions that people always have when they look at source code: What does this program do and how does it do it? Comments should always be part of any software you write. Complex code with no comments is not automatically evil, but bring some holy water just in case.

Good comments are short, but instructive. They tell you things that aren't clear from reading the code. For example, here's some obscure code that could use a comment or two:

        for $i (@q) {
            my ($j) = fix($i);
            transmit($j);
        }

Bad comments would look like this:

        for $i (@q) { # @q is list from last sub
            my ($j) = fix($i);  # Gotta fix $j...
            transmit($j);  # ...and then it goes over the wire
        }

Notice that you don't learn anything from these comments. my ($j) = fix($i); # Gotta fix $j... is meaningless, the equivalent of a dictionary that contains a definition like widget (n.): A widget. What is @q? Why do you have to fix its values? That may be clear from the larger context of the program, but you don't want to skip all around a program to find out what one little line does!

Here's something a little clearer. Notice that we actually have fewer comments, but they're more instructive:

       # Now that we've got prices from database, let's send them to the buyer
       for $i (@q) {
           my ($j) = fix($i);  # Add local taxes, perform currency exchange
           transmit($j);
       }

Now it's obvious where @q comes from, and what fix() does.

Warnings

Comments are good, but the most important tool for writing good Perl is the ``warnings'' flag, the -w command line switch. You can turn on warnings by placing -w on the first line of your programs like so:

         #!/usr/local/bin/perl -w

Or, if you're running a program from the command line, you can use -w there, as in perl -w myprogram.pl.

Turning on warnings will make Perl yelp and complain at a huge variety of things that are almost always sources of bugs in your programs. Perl normally takes a relaxed attitude toward things that may be problems; it assumes that you know what you're doing, even when you don't.

Here's an example of a program that Perl will be perfectly happy to run without blinking, even though it has an error on almost every line! (See how many you can spot.)

       #!/usr/local/bin/perl
       $filename = "./logfile.txt";
       open (LOG, $fn);
       print LOG "Test\n";
       close LOGFILE;

Now, add the -w switch to the first line, and run it again. You should see something like this:

        Name ``main::filename'' used only once: possible typo at ./a6-warn.pl line 3.
        Name ``main::LOGFILE'' used only once: possible typo at ./a6-warn.pl line 6.
        Name ``main::fn'' used only once: possible typo at ./a6-warn.pl line 4.
        Use of uninitialized value at ./a6-warn.pl line 4.
        print on closed filehandle main::LOG at ./a6-warn.pl line 5.

Here's what each of these errors means:

1. Name ``main::filename'' used only once: possible typo at ./a6-warn.pl line 3. and Name ``main::fn'' used only once: possible typo at ./a6-warn.pl line 4. Perl notices that $filename and $fn each get used only once, and guesses that you've misspelled or misnamed one or the other. This warning almost always points to a typo or bug in your code, like using $filenmae instead of $filename, or using $filename throughout your program except for one place where you use $fn (as in this program).

2. Name ``main::LOGFILE'' used only once: possible typo at ./a6-warn.pl line 6. In the same way that we made our $filename typo, we mixed up the names of our filehandles: We use LOG for the filehandle while we're writing the log entry, but we try to close LOGFILE instead.

3. Use of uninitialized value at ./a6-warn.pl line 4. This is one of Perl's more cryptic complaints, but it's not difficult to fix. This means that you're trying to use a variable before you've assigned a value to it, and that is almost always an error. When we first mentioned $fn in our program, it hadn't been given a value yet. You can avoid this type of warning by always setting a default value for a variable before you first use it.

4. print on closed filehandle main::LOG at ./a6-warn.pl line 5. We didn't successfully open LOG, because $fn was empty. When Perl sees that we are trying to print something to the LOG filehandle, it would normally just ignore it and assume that we know what we're doing. But when -w is enabled, Perl warns us that it suspects there's something afoot.
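The uninitialized-value case in particular is cheap to avoid. A small sketch of the initialize-before-use habit (my own example, not from the article):

```perl
#!/usr/bin/perl -w
use strict;

my $total;                    # declared, but still undef
# print "Total: $total\n";    # would warn: Use of uninitialized value

$total = 0;                   # give it a default before its first real use
$total += 5;
print "Total: $total\n";      # no warning; prints "Total: 5"
```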

So, how do we fix these warnings? The first step, obviously, is to fix these problems in our script. (And while we're at it, I deliberately violated our rule of always checking if open() succeeded! Let's fix that, too.) This turns it into:

        #!/usr/local/bin/perl -w
        $filename = "./logfile.txt";
        open (LOG, $filename) or die "Couldn't open $filename: $!";
        print LOG "Test\n";
        close LOG;

Now, we run our corrected program, and get this back from it:

Filehandle main::LOG opened only for input at ./a6-warn2.pl line 5.

Where did this error come from? Look at our open(). Since we're not preceding the filename with > or >>, Perl opens the file for reading, but in the next line we're trying to write to it with a print. Perl will normally let this pass, but when warnings are in place, it alerts you to possible problems. Change line 4 to this instead and everything will be great:

       open (LOG, ">>$filename") or die "Couldn't open $filename: $!";

The -w flag is your friend. Keep it on at all times. You may also want to read the perldiag man page, which contains a listing of all the various messages (including warnings) Perl will spit out when it encounters a problem. Each message is accompanied by a detailed description of what the message means and how to fix it.

Taint

Using -w will help make your Perl programs correct, but it won't help make them secure. It's possible to write a program that doesn't emit a single warning, but is totally insecure!

For example, let's say that you are writing a CGI program that needs to write a user's comment to a user-specified file. You might use something like this:

       #!/usr/local/bin/perl -w
       use CGI ':standard';
       $file = param('file');
       $comment = param('comment');
       unless ($file) { $file = 'file.txt'; }
       unless ($comment) { $comment = 'No comment'; }
       open (OUTPUT, ">>/etc/webstuff/storage/" . $file) or die "$!";
       print OUTPUT $comment . "\n";
       close OUTPUT;
       print header, start_html;
       print "<P>Thanks!</P>\n";       
       print end_html;

If you read the CGI programming installment, alarm bells are already ringing loud enough to deafen you. This program trusts the user to specify only a ``correct'' filename, and you know better than to trust the user. But nothing in this program will cause -w to bat an eye; as far as warnings are concerned, this program is completely correct.

Fortunately, there's a way to block these types of bugs before they become a problem. Perl offers a mechanism called taint that marks any variable that the user can possibly control as being insecure. This includes user input, file input and environment variables. Anything that you set within your own program is considered safe:

     $taint = <STDIN>;   # This came from user input, so it's tainted
     $taint2 = $ARGV[1]; # The @ARGV array is considered tainted too.
     $notaint = "Hi";    # But this is in your program... it's untainted

You enable taint checking with the -T flag, which you can combine with -w like so:

      #!/usr/local/bin/perl -Tw

-T will prevent Perl from running most code that may be insecure. If you try to do various dangerous things with tainted variables, like open a file for writing or use the system() or exec() functions to run external commands, Perl will stop right away and complain.

You untaint a variable by running it through a regex with matching subexpressions, and using the results from the subexpressions. Perl will consider $1, $2 and so forth to be safe for your program.

For example, our file-writing CGI program may expect that ``sane'' filenames contain only the alphanumeric characters that are matched by the \w metacharacter (this would prevent a malicious user from passing a filename like ~/.bashrc, or even ../test). We'd use a filter like so:

       $file = param('file');
       if ($file) {
           $file =~ /^(\w+)$/;
           $file = $1;
       }
       unless ($file) { $file = "file.txt"; }

Now, $file is guaranteed to be untainted. If the user passed us a filename, we don't use it until we've made sure it matches only \w+. If there was no filename, then we specify a default in our program. As for $comment, we never actually do anything that would cause Perl's taint checking to worry, so it doesn't need to be checked to pass -T.
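One subtlety of the filter above: $1 keeps its value from the last successful match, so it is safest to read it only when you know the match just succeeded. A hedged variant (the untaint_filename helper is my invention, not the article's):

```perl
#!/usr/bin/perl -w
use strict;

# Only touch $1 inside the if(), so a failed match can never hand back a
# stale capture; fall back to the same default the article uses.
sub untaint_filename {
    my ($name) = @_;
    if (defined $name and $name =~ /^(\w+)$/) {
        return $1;      # under -T, a capture-group result counts as untainted
    }
    return 'file.txt';
}

print untaint_filename('report2001'), "\n";   # report2001
print untaint_filename('../test'), "\n";      # file.txt
```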

Stuff Taint Doesn't Catch

Be careful! Even when you've turned on taint checking, you can still write an insecure program. Remember that taint only gets looked at when you try to modify the system, by opening a file or running a program. Reading from a file will not trigger taintedness! A very common breed of security hole exploits code that doesn't look very different from this small program:

        #!/usr/local/bin/perl -Tw
        use CGI ':standard';
        $file = param('filename');
        unless ($file) { $file = 'file.txt'; }
        open (FILE, "</etc/webstuff/storage/" . $file) or die "$!";
        print header();
        while ($line = <FILE>) {
            print $line;
        }
        close FILE;

Just imagine the joy when the ``filename'' parameter contains ../../../../../../etc/passwd. (If you don't see the problem: On a Unix system, the /etc/passwd file contains a list of all the usernames on the system, and may also contain an encrypted list of their passwords. This is great information for crackers who want to get into a machine for further mischief.) Since you are only reading the file, Perl's taint checking doesn't kick in. Similarly, print doesn't trigger taint checking, so you'll have to write your own value-checking code when you write any user input to a file!
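So the value-checking on the read path has to be yours. A minimal sketch (safe_read_path is a hypothetical helper, not from the article):

```perl
#!/usr/bin/perl -w
use strict;
use File::Spec;

# Accept only a simple name with an optional extension; \w can never
# produce '..' or a directory separator, so no escape from $base is possible.
sub safe_read_path {
    my ($base, $name) = @_;
    return undef unless defined $name and $name =~ /^(\w+(?:\.\w+)?)$/;
    return File::Spec->catfile($base, $1);
}

my $ok  = safe_read_path('/etc/webstuff/storage', 'file.txt');
my $bad = safe_read_path('/etc/webstuff/storage', '../../../../etc/passwd');
print defined $ok  ? "ok: $ok\n"   : "ok: rejected\n";
print defined $bad ? "bad: $bad\n" : "bad: rejected\n";
```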

Taint is a good first step in security, but it's not the last.

use strict

Warnings and taint are two excellent tools for preventing your programs from doing bad things. If you want to go further, Perl offers use strict. These two simple words can be put at the beginning of any program:

        #!/usr/local/bin/perl -wT
        use strict;

A command like use strict is called a pragma. Pragmas are instructions to the Perl interpreter to do something special when it runs your program. use strict does two things that make it harder to write bad software: It makes you declare all your variables (``strict vars''), and it makes it harder for Perl to mistake your intentions when you are using subs (``strict subs'').

If you only want to use one or two types of strictness in your program, you can list them in the use strict pragma, or you can use a special no strict pragma to turn off any or all of the strictness you enabled earlier.

        use strict 'vars';   # We want to require variables to be declared
        no strict 'vars';    # We'll go back to normal variable rules now
        use strict 'subs';   # We want Perl to distrust barewords (see below).
        no strict;           # Turn it off. Turn it all off. Go away, strict.

(There's actually a third type of strictness - strict refs - which prevents you from using symbolic references. Since we haven't really dealt with references, we'll concentrate on the other two types of strictness.)

Strict vars

Perl is generally trusting about variables. It will allow you to create them out of thin air, and that's what we've been doing in our programs so far. One way to make your programs more correct is to use strict vars, which means that you must always declare variables before you use them. You declare variables by using the my keyword, either when you assign values to them or before you first mention them:

        my ($i, $j, @locations);
        my $filename = "./logfile.txt";
        $i = 5;

This use of my doesn't interfere with using it elsewhere, like in subs, and remember that a my variable in a sub will be used instead of the one from the rest of your program:

        my ($i, $j, @locations);
        # ... stuff skipped ...
        sub fix {
            my ($q, $i) = @_;  # This doesn't interfere with the program $i!
        }

If you end up using a variable without declaring it, you'll see an error before your program runs:

        use strict;
        $i = 5;
        print "The value is $i.\n";

When you try to run this program, you see an error message similar to Global symbol ``$i'' requires explicit package name at a6-my.pl line 3. You fix this by declaring $i in your program:

        use strict;
        my $i = 5;   # Or "my ($i); $i = 5;", if you prefer...
        print "The value is $i.\n";

Keep in mind that some of what strict vars does will overlap with the -w flag, but not all of it. Using the two together makes it much more difficult, but not impossible, to use an incorrect variable name. For example, strict vars won't catch it if you accidentally use the wrong variable:

         my ($i, $ii) = (1, 2);
         print 'The value of $ii is ', $i, "\n";

This code has a bug, but neither strict vars nor the -w flag will catch it.

Strict subs

During the course of this series, I've deliberately avoided mentioning all sorts of tricks that allow you to write more compact Perl. This is because of a simple rule: readability always wins. Not only can compactness make it difficult to read code, it can sometimes have weird side effects! The way Perl looks up subs in your program is an example. Take a look at this pair of three-line programs:

       $a = test_value;
       print "First program: ", $a, "\n";
       sub test_value { return "test passed"; }

       sub test_value { return "test passed"; }
       $a = test_value;
       print "Second program: ", $a, "\n";

The same program with one little, insignificant line moved, right? In both cases we have a test_value() sub and we want to put its result into $a. And yet, when we run the two programs, we get two different results:

       First program: test_value
       Second program: test passed

The reason why we get two different results is a little convoluted.

In the first program, at the point we get to $a = test_value;, Perl doesn't know of any test_value() sub, because it hasn't gotten that far yet. This means that test_value is interpreted as if it were the string 'test_value'.

In the second program, the definition of test_value() comes before the $a = test_value; line. Since Perl has a test_value() sub to call, that's what it thinks test_value means.

The technical term for isolated words like test_value that might be subs and might be strings depending on context, by the way, is bareword. Perl's handling of barewords can be confusing, and it can cause two different types of bug.

Want a Sub, Get a String

The first type of bug is what we encountered in our first program, which I'll repeat here:

        $a = test_value;
        print "First program: ", $a, "\n";
        sub test_value { return "test passed"; }

Remember that Perl won't look forward to find test_value(), so since it hasn't already seen test_value(), it assumes that you want a string. Strict subs will cause this program to die with an error:

        use strict;
        my $a = test_value;
        print "Third program: ", $a, "\n";
        sub test_value { return "test passed"; }

(Notice the my put in to make sure that strict vars won't complain about $a.)

Now you get an error message like Bareword "test_value" not allowed while "strict subs" in use at ./a6-strictsubs.pl line 3. This is easy to fix, and there are two ways to do it:

1. Use parentheses to make it clear you're calling a sub. If Perl sees $a = test_value();, it will assume that even if it hasn't seen test_value() defined yet, it will sometime between now and the end of the program. (If there isn't any test_value() in your program, Perl will die while it's running.) This is the easiest thing to do, and often the most readable.
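A sketch of fix #1 in action (the program text is mine): with parentheses, the program compiles cleanly under strict subs even though test_value() is defined further down.

```perl
use strict;

my $a = test_value();  # parentheses tell Perl this is a sub call
print "Parenthesized call: ", $a, "\n";

sub test_value { return "test passed"; }
```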

2. Declare your sub before you first use it, like this:

        use strict;
        sub test_value;  # Declares that there's a test_value() coming later ...
        my $a = test_value;  # ...so Perl will know this line is okay.
        print "Fourth program: ", $a, "\n";
        sub test_value { return "test passed"; }

Declaring your subs has the advantage of allowing you to maintain the $a = test_value; syntax if that's what you find more readable, but it's also a little obscure. Other programmers may not see why you have sub test_value; in your code.

Of course, you could always move the definition of your sub before the line where you want to call it. This isn't quite as good as either of the other two methods, because now you are moving code around instead of making your existing code clearer. Also, it can cause other problems, which we'll discuss now ...
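For completeness, here's that third approach sketched out: once the definition comes first, the bareword unambiguously resolves to a sub call, and strict subs is satisfied.

```perl
use strict;

sub test_value { return "test passed"; }  # defined before it's used ...

my $a = test_value;  # ... so this bareword is clearly a sub call
print "Fifth program: ", $a, "\n";
```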

Want a String, Get a Sub

We've seen how use strict can help prevent an error where you intend to call a sub, but instead get a string value. It also helps prevent the opposite error: wanting a string value, but calling a sub instead. This is a more dangerous class of bug, because it can be very hard to trace, and it often pops up in the most unexpected places. Take a look at this excerpt from a long program:

        #!/usr/local/bin/perl -Tw
        use strict;
        use SomeModule;
        use SomeOtherModule;
        use YetAnotherModule;
        # ... (and then there's hundreds of lines of code) ...
        # Now we get to line 400 of the program, which tests if we got an "OK"
        # before we act on a request from the user.
        if ($response_code eq OK) {
            act_on($request);
        } else {
            throw_away($request);
        }

This program works without a hitch for a long time, because Perl sees the bareword OK and considers it to be a literal string. Then, two years later someone needs to add code to make this program understand HTTP status codes. They stick this in at line 2, or line 180, or line 399 (it doesn't matter exactly where, just that it comes before line 400):

        sub OK { return 200; } # HTTP "request ok, response follows" code
        sub NOT_FOUND { return 404; } # "URL not found" code
        sub SERVER_ERROR { return 500; } # "Server can't handle request"

Take a moment to guess what happens to our program now. Try to work the word "disaster" into it.

Thanks to this tiny change, our program now throws away every request that comes in. The if ($response_code eq OK) test now calls the OK() sub, which returns a value of 200. The if now fails every time! The programmer, if they still have a job after this fiasco, must hunt through the entire program to find out exactly when the behavior of if ($response_code eq OK) changed, and why.

By the way, if the programmer is really unlucky, that new OK() sub wouldn't even be in their code at all, but defined somewhere in a new version of SomeOtherModule.pm that just got installed!

Barewords are dangerous because of this unpredictable behavior. use strict (or use strict 'subs') makes them predictable, because barewords that might cause strange behavior in the future will make your program die before they can wreak havoc.
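One defensive habit follows directly from this story (the example below is mine): quote any string that might someday collide with a sub name, and the comparison keeps meaning what you wrote, no matter what subs get added later.

```perl
use strict;

sub OK { return 200; }  # the sub someone adds two years later

my $response_code = 'OK';

# The quoted 'OK' is still a string, so this test still passes
# even though an OK() sub now exists.
my $accepted = ($response_code eq 'OK');
print $accepted ? "request accepted\n" : "request thrown away\n";
```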

The One Exception

There's one place where it's OK to use barewords even when you've turned on strict subs: when you use them as hash keys.

        $hash{sample} = 6;   # Same as $hash{'sample'} = 6
        %other_hash = ( pie => 'apple' );

Barewords in hash keys are always interpreted as strings, so there is no ambiguity.
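A quick sketch (mine, not the article's) showing just how strong that rule is: a bareword hash key stays a string even when a sub of the same name exists.

```perl
use strict;

sub sample { return "a sub, not a key" }

my %hash;
$hash{sample} = 6;  # still the string key 'sample', not a call to sample()
print "sample => ", $hash{sample}, "\n";   # prints: sample => 6
```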

Is This Overkill?

There are times when using all of the quality enforcement functionality (or "correctness police," if you like to anthropomorphize) Perl offers seems like overkill. If you're just putting together a quick, three-line tool that you'll use once and then never touch again, you probably don't care whether it'll run properly under use strict. When you're the only person who will run a program, you generally don't care if the -T flag will show that you're trying to do something unsafe with a piece of user input.

Still, it's a good idea to use every tool at your disposal to write good software. Here are three reasons to be concerned about correctness when you write just about anything:

1. One-off programs aren't. There are few programs worth writing that only get run once. Software tools tend to accumulate, and get used. You'll find that the more you use a program, the more you want it to do.

2. Other people will read your code. Whenever programmers write something really good, they tend to keep it around, and give it to friends who have the same problem. More importantly, most projects aren't one-person jobs; there are teams of programmers who need to work together, reading, fixing and extending one another's code. Unless your plans for the future include always working alone and having no friends, you should expect that other people will someday read and modify your code.

3. You will read your code. Don't think you have a special advantage in understanding your code just because you wrote it! Often you'll need to go back to software you wrote months or even years earlier to fix it or extend it. During that time you'll have forgotten all those clever little tricks you came up with during that caffeine-fueled all-nighter and all the little gotchas that you noticed but thought you would fix later.

These three points all have one thing in common: Your programs will be rewritten and enhanced by people who will appreciate every effort you make to make their job easier. When you make sure your code is readable and correct, it tends to start out much more secure and bug-free, and it tends to stay that way, too!

Play Around!

During the course of this series, we've only scratched the surface of what Perl can do. Don't take these articles as being definitive - they're just an introduction! Read the perlfunc page to learn about all of Perl's built-in functions and see what ideas they inspire. My biography page tells you how to get in touch with me if you have any questions.
