April 2001 Archives

This Week on p5p 2001/04/29



Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

This week was particularly busy, seeing nearly 550 messages.

MacPerl 5.6.1 Alpha

Chris Nandor announced that MacPerl 5.6.1 is now in its alpha state. Mac users, wander over to the MacPerl Sourceforge page and grab it.

B::Deparse Hackery Continues

As usual, Robin has been providing huge numbers of patches for B::Deparse - over the past few years, we've been adding all sorts of neat optimizations to the interpreter, and now Robin's been putting support for them back into the deparser. I asked him about it the other day and it seems it's getting close to the time when it's sensible to use B::Deparse for code serialization.

This week saw additions of human-readable pragmas, honouring lexical scoping inside things like do { }, __END__ sections, better filetest handling, better variable interpolation support, correct context forcing, as well as many smaller nits.

The Deparser is particularly important, because it shows us just how much we can get out of Perl bytecode. What would happen, for instance, if someone rewrote the Deparser to output not Perl, but another language?

Underscores in constants

On a similar note, Mike Guy produced a patch which explained why 0 and 1 in void context don't cause warnings, but every other constant does. This caused heated but essentially pointless discussion.
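For anyone who wants to see the behaviour for themselves, here is a small sketch. The helper name warnings_for is made up for this illustration; it compiles a snippet with eval and collects whatever warnings the compiler emits. The exact wording of the warning differs between Perl versions, so the sketch only counts them.

```perl
use strict;
use warnings;

# Compile a snippet of code and collect any warnings it produces.
# (warnings_for is a hypothetical helper, not part of Perl.)
sub warnings_for {
    my ($code) = @_;
    my @caught;
    local $SIG{__WARN__} = sub { push @caught, @_ };
    eval $code;
    return @caught;
}

# The last statement of each snippet is its value, so only the
# earlier constants are evaluated in void context.
my @quiet = warnings_for('1; 0; 1;');
my @noisy = warnings_for('5; 1;');

print "0 and 1: ", scalar(@quiet), " warning(s)\n";
print "5:       ", scalar(@noisy), " warning(s)\n";
print @noisy;
```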

Licensing Perl modules

The vexed issue of module licensing turned up, after the GNU Automake project wanted to use a CPAN module in their work. However, the module has no license declaration and the author has disappeared, so they can't use the code; the FSF took this as a cue to remind us that everything ought to have an explicit license. Ask brought the discussion to P5P, asking us how best to encourage module authors to specify license information. One suggestion was to extend the DSLI classification for CPAN modules to have a "license" category and nag authors to state their license intentions. Jarkko and Elaine were concerned that we should not get so heavy on CPAN authors, and Jarkko noted that Elaine had recently added a default LICENSE section in h2xs which should act as encouragement in the future. Russ Allbery reminded us of the recommended license text:

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.

He also sparked a lot of armchair lawyering about "public domain" works. Amazingly, we managed to have a reasonably sensible discussion about licenses without anyone Cc'ing RMS.

M17N and POD

Sean Dague asked if there was a good way to have multiple languages in one's POD files; Jarkko suggested using

    =for lang en

and coaxing the podlators into spitting out the ones you want. Graham Barr said that he'd rather put different languages in different files, but Michael Stevens reminded us that keeping POD and code in the same place is a Good Thing.

I'd be interested to hear from anyone who's tried to use the POD utilities for things like this, and I'm sure Sean would be too. Raphael announced that he had produced a POD pre-processor, POD::PP, which may help solve this sort of thing.

Regex dumping again

Hugo rewrote MJD's patch to the regular expression dumper (remember this?) by ensuring that the value it depends on is always set, but Jarkko then noticed it was coredumping whenever he ran pod2man. Hugo explained what was going on:

Hmm, I start to understand - probably more than I wanted to.

We need to know two things: which node is logically next in the regexp, and which is physically next in memory. The patch above causes problems because NEXT_OFF is expected to point to the logical next node, not the physical next node.

The attached patch doesn't quite work either, and I'm not yet sure why not: it dumps the right thing for /(\G|x)([^x]*)x/, but not for /(\G|x)([[:alpha:]]*)x/. (I'm also a bit concerned about using up the last bit of ANYOF_FLAGS.)

(And later...)

This area could probably do with some cleanup: it doesn't help that there is already a 'ANYOF_CLASS' flag, but that it does not distinguish between a regnode_charclass and a regnode_charclass_class - luckily I didn't need to understand it for this patch, since it is something to do with locale.

Jarkko promised to document what was going on with ANYOF_CLASS and explained the difference between regnode_charclass and regnode_charclass_class: "ANYOF_CLASS has [[:blah:]] flags. The first one is ANYOF with only static character class characters marked in its 256-bit bitmap, the second one is an ANYOF that has (hopefully) the ANYOF_CLASS flag on and has locale-dependent (and therefore runtime-dynamic) [[:blah:]] classes."
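To get a feel for what a "256-bit bitmap" of static class characters means, here is a rough Perl-level sketch using vec(). It is only an illustration of the idea; the regex engine itself is written in C, and the names make_bitmap and in_class are invented for this example.

```perl
use strict;
use warnings;

# Build a 256-bit bitmap with one bit set per allowed byte value.
sub make_bitmap {
    my (@chars) = @_;
    my $bitmap = "\0" x 32;                   # 32 bytes = 256 bits
    vec($bitmap, ord($_), 1) = 1 for @chars;
    return $bitmap;
}

# Membership is a single bit lookup, with no locale involved.
sub in_class {
    my ($bitmap, $char) = @_;
    return vec($bitmap, ord($char), 1);
}

# Roughly what a static class like [^x] would mark.
my $not_x = make_bitmap(grep { $_ ne 'x' } map { chr($_) } 0 .. 255);

print in_class($not_x, 'a') ? "a matches\n" : "a does not match\n";
print in_class($not_x, 'x') ? "x matches\n" : "x does not match\n";
```

A locale-dependent class such as [[:alpha:]] can't be precomputed this way, which is why it needs the second kind of node.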

Jarkko also mentioned that someone ought to write t/pragma/re.t to test use re "debug" behaviour. (That's a hint.)

There was also an extremely confusing thread about word boundaries with Hugo and Ilya disagreeing with each other, and in unrelated regex wibblings, Leon Brocard found that use re 'debug' wasn't actually producing any output any more. Jarkko fixed the bug, and MJD pointed out Mr Maeda's wonderful regex grapher.

Various

Benjamin Sugars continued the XSification of the Cwd module, by implementing Cwd::abs_path in C. Phillip Newton smashed a few bugs in find2perl.

Matt Sergeant updated the FAQ to reflect the fact that we now have Time::Piece in the core, making Time::localtime and other modules a less than optimal solution.

Last week, we reported on the efforts to create a pure-Perl compression library; the discussions this week seem to have centered around shipping zlib with Perl, integrating it into the Perl build process and ensuring its portability to everywhere Perl can go. Paul Marquess' message about this is a good summary of what's going on.

Paul also put in the next version of DB_File. Michael Schwern asked why references decompose to integers in numeric context. Some people pointed at the documentation, and explained that it was to help compare references. He also asked if you can get the variable name of an SV; you can't.
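A quick sketch of why that numeric value is handy: two references to the same thing compare equal with ==, while a reference to a different array with the same contents does not. The variable names are just for illustration.

```perl
use strict;
use warnings;

my @list  = (1, 2, 3);
my $ref_a = \@list;
my $ref_b = \@list;       # second reference to the same array
my $ref_c = [1, 2, 3];    # a different array with equal contents

# In numeric context a reference yields an integer (its address).
printf "ref_a as a number: %d\n", $ref_a;

print "ref_a == ref_b\n" if $ref_a == $ref_b;
print "ref_a != ref_c\n" if $ref_a != $ref_c;
```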

Dave Mitchell asked if maintenance branches could be made more frequent. Sarathy said that he would like that to happen, but doesn't have the tuits to make it happen right now. His thoughts on handling the maintenance are definitely worth reading. Casey West announced his perl5-porters impressions night at TPC. Contact him at casey@geeknest.com for more information about that. He also cleaned up FindBin. Abigail asked why we can have underscores in fractional parts of numeric constants, like 5.___5. Well, we just can.
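For the record, here is the tamer, everyday form of that feature: single underscores used as digit separators, including in the fractional part. This is only a sketch; runs of several underscores, as in Abigail's example, may draw a warning on some Perl versions, so they are left out here.

```perl
use strict;
use warnings;

my $million = 1_000_000;    # same value as 1000000
my $pi      = 3.141_592;    # underscore in the fractional part

print "million: $million\n";
print "pi:      $pi\n";
print "both match their plain spellings\n"
    if $million == 1000000 && $pi == 3.141592;
```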

Until next week I remain, your humble and obedient servant,


Simon Cozens

Quick Start Guide with SOAP Part Two

Table of Contents

Quick Start with SOAP Part I

More Complex Server (daemon, mod_perl and mod_soap)

Access to Remote Services

Access With Service Description (WSDL)

Security (SSL, basic/digest authentication, cookie-based authentication, ticket-based authentication, access control)

Handling LoLs (List of Lists, Structs, Objects, or something else)

More Complex Server (daemon, mod_perl and mod_soap)

You shouldn't have many problems with the CGI-based SOAP server you created in the first part of this article; however, performance could be significantly better. The next logical step might be to implement SOAP services using accelerators (like PerlEx or VelociGen) or persistent technologies (like mod_perl). Another lightweight solution might be to implement the SOAP service as an HTTP daemon; in that case, you don't need to use a separate Web server. This might be useful in a situation where a client application accepts SOAP calls, or for internal usage.

HTTP daemon
The following code shows an example implementation for an HTTP daemon:

4.a. server (HTTP daemon)

 #!perl -w

  use SOAP::Transport::HTTP;

  use Demo;

  # don't want to die on 'Broken pipe' or Ctrl-C
  $SIG{PIPE} = $SIG{INT} = 'IGNORE';

  $daemon = SOAP::Transport::HTTP::Daemon
    -> new (LocalPort => 80)
    -> dispatch_to('/home/soaplite/modules')
  ;

  print "Contact to SOAP server at ", $daemon->url, "\n";
  $daemon->handle;

Not much difference from the CGI server (Dynamic), huh? And it makes the same interface accessible, only through a different endpoint. This code is all you need to run the SOAP server on your computer without anything else.

HTTP daemon in VBScript
Similar code in VBScript may look like:

4.b. server (HTTP daemon, VBScript)

 call CreateObject("SOAP.Lite") _
   .server("SOAP::Transport::HTTP::Daemon", _
     "LocalPort", 80) _
   .dispatch_to("/home/soaplite/modules") _
   .handle

This is all you need to run a SOAP server on a Microsoft platform (and it will run on Win9x/Me/NT/2K as soon as you register Lite.dll with regsvr32 Lite.dll).

ASP/VB
An ASP server could be created with VBScript or PerlScript code:

4.c. server (ASP server, VBScript)

 <%
    Response.ContentType = "text/xml"
    Response.Write(Server.CreateObject("SOAP.Lite") _
      .server("SOAP::Server") _ 
      .dispatch_to("/home/soaplite/modules") _
      .handle(Request.BinaryRead(Request.TotalBytes)) _
    )
  %>

Apache::Registry
One of the easiest ways to significantly speed up your CGI-based SOAP server is to wrap it with the mod_perl Apache::Registry module. You need to configure it in the httpd.conf file:

4.d. server (Apache::Registry, httpd.conf)

 Alias /mod_perl/ "/Apache/mod_perl/"
  <Location /mod_perl>
    SetHandler perl-script
    PerlHandler Apache::Registry
    PerlSendHeader On
    Options +ExecCGI
  </Location>

Put the CGI script soap.mod_cgi in the /Apache/mod_perl/ directory mentioned above:

4.d. server (Apache::Registry, soap.mod_cgi)

 #!perl -w

  use SOAP::Transport::HTTP;
  
  SOAP::Transport::HTTP::CGI
    -> dispatch_to('/home/soaplite/modules')
    -> handle
  ;

mod_perl
Let's consider a mod_perl-based server now. To run it you'll need to put the SOAP::Apache module (Apache.pm) in any directory in @INC:

4.e. server (mod_perl, Apache.pm)

 package SOAP::Apache;

  use SOAP::Transport::HTTP;
  
  my $server = SOAP::Transport::HTTP::Apache
    -> dispatch_to('/home/soaplite/modules');

  sub handler { $server->handler(@_) }

  1;

Then modify your httpd.conf file:

4.e. server (mod_perl, httpd.conf)

 <Location /soap>
    SetHandler perl-script
    PerlHandler SOAP::Apache
  </Location>

mod_soap
mod_soap allows you to create a SOAP server by simply configuring the httpd.conf or .htaccess file.

4.f. server (mod_soap, httpd.conf)

 # directory-based access
  <Location /mod_soap>
    SetHandler perl-script
    PerlHandler Apache::SOAP
    PerlSetVar dispatch_to "/home/soaplite/modules"
    PerlSetVar options "compress_threshold => 10000"
  </Location>

  # file-based access
  <FilesMatch "\.soap$">
    SetHandler perl-script
    PerlHandler Apache::SOAP
    PerlSetVar dispatch_to "/home/soaplite/modules"
    PerlSetVar options "compress_threshold => 10000"
  </FilesMatch>

Directory-based access turns a directory into a SOAP endpoint. For example, you may point your request to http://localhost/mod_soap (there is no need to create this directory).

File-based access turns a file with a specified name (or mask) into a SOAP endpoint. For example, http://localhost/somewhere/endpoint.soap.

Alternatively, you may turn an existing directory into a SOAP server if you put an .htaccess file inside it:

4.g. server (mod_soap, .htaccess)

 SetHandler perl-script
  PerlHandler Apache::SOAP
  PerlSetVar dispatch_to "/home/soaplite/modules"
  PerlSetVar options "compress_threshold => 10000"

Access to Remote Services

It's time now to re-use what has already been done and to try to call some services available on the Internet. After all, the most interesting part of SOAP is interoperability between systems where the communicating parts are created in different languages, running on different platforms or in different environments, and are providing interfaces with service descriptions or documentation. XMethods.net can be a perfect starting point.

Name of state based on state's number (in alphabetical order)
The Frontier implementation has a test server that returns the name of a state based on a number you provide. By default, SOAP::Lite generates a SOAPAction header with the structure of [URI]#[method]. Frontier, however, expects SOAPAction to be just the URI, so we have to use on_action to modify it. In our example, we specify on_action(sub { sprintf '"%s"', shift }), so the resulting SOAPAction will contain only the URI (and don't forget the double quotes there).

5.a. client

 #!perl -w
  
  use SOAP::Lite;

  # Frontier http://www.userland.com/
  $s = SOAP::Lite
    -> uri('/examples')
    -> on_action(sub { sprintf '"%s"', shift })
    -> proxy('http://superhonker.userland.com/')
  ;

  print $s->getStateName(SOAP::Data->name(statenum => 25))->result;

You should get the output:

5.a. result

Missouri
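To see what the on_action handler in example 5.a does to the header value, you can run the two formatting subs on their own, without SOAP::Lite or a network connection. This is a sketch: the handler receives the URI and the method name, and $default_style below is only an approximation of the [URI]#[method] structure described above, not code copied from the module.

```perl
use strict;
use warnings;

# Handlers are called with the URI and the method name, in that order.
my $default_style  = sub { sprintf '"%s#%s"', @_ };     # [URI]#[method] (approximation)
my $frontier_style = sub { sprintf '"%s"', shift };     # URI only, as in 5.a

my @args = ('/examples', 'getStateName');

print "default:  ", $default_style->(@args),  "\n";
print "Frontier: ", $frontier_style->(@args), "\n";
```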


Paul Kulchenko is a featured speaker at the upcoming O'Reilly Open Source Convention in San Diego, CA, July 23 - 27, 2001.



Whois
We will target services with different implementations. The following service is running on a Windows platform:

5.b. client

 #!perl -w

  use SOAP::Lite;

  # 4s4c (aka Simon's SOAP Server Services For COM) http://www.4s4c.com/
  print SOAP::Lite
    -> uri('http://www.pocketsoap.com/whois')
    -> proxy('http://soap.4s4c.com/whois/soap.asp')
    -> whois(SOAP::Data->name('name' => 'yahoo'))
    -> result;

Nothing fancy here; 'name' is the name of the field and 'yahoo' is the value. That should give you the output:

5.b. result

 The Data in Network Solutions' WHOIS database is provided by Network
  Solutions for information purposes, and to assist persons in obtaining
  information about or related to a domain-name registration record.
  Network Solutions does not guarantee its accuracy. By submitting a
  WHOIS query, you agree that you will use this Data only for lawful
  purposes and that, under no circumstances will you use this data to:
  (1) allow, enable or otherwise support the transmission of mass
  unsolicited, commercial advertising or solicitations via e-mail
  (spam); or  (2) enable high volume, automated, electronic processes
  that apply to Network Solutions (or its systems). Network Solutions
  reserves the right to modify these terms at any time. By submitting
  this query, you agree to abide by this policy.
  Yahoo (YAHOO-DOM)                                            YAHOO.COM
  Yahoo Inc. (YAHOO27-DOM)                                     YAHOO.ORG
  Yahoo! Inc. (YAHOO4-DOM)                                     YAHOO.NET

  To single out one record, look it up with "!xxx", where xxx is the
  handle, shown in parenthesis following the name, which comes first.

Book price based on ISBN
In many cases the SOAP interface is just a front end that requests information, parses the response, formats it and returns it according to your request. It may not be doing that much, but it saves you time on the client side and fixes this interface, so you don't need to update it each time your service provider changes format or content. In addition, the major players are moving quickly toward XML; for example, Google already has an XML-based interface for its search engine. Here is the service that returns the price of a book given its ISBN:

5.c. client

 #!perl -w

  use SOAP::Lite;

  # Apache SOAP http://xml.apache.org/soap/ (running on XMethods.net)

  $s = SOAP::Lite                             
    -> uri('urn:xmethods-BNPriceCheck')                
    -> proxy('http://services.xmethods.net/soap/servlet/rpcrouter');

  my $isbn = '0596000278'; # Programming Perl, 3rd Edition
  print $s->getPrice(SOAP::Data->type(string => $isbn))->result;

Here is the result for 'Programming Perl, 3rd Edition':

5.c. result

 39.96

Note that we explicitly specified the type to be 'string', because an ISBN looks like a number and will be serialized by default as an integer. However, the SOAP server we work with requires it to be a string.
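The following sketch shows why the distinction matters for this particular value. The regular expression is only a stand-in for whatever heuristic a serializer might use to guess a type; it is not taken from SOAP::Lite. Treated as a number, the ISBN loses its leading zero.

```perl
use strict;
use warnings;

my $isbn = '0596000278';    # Programming Perl, 3rd Edition

# A value made only of digits is easy to mistake for an integer.
print "looks like an integer\n" if $isbn =~ /^\d+$/;

# Once it is used as a number, the leading zero is gone.
my $as_number = $isbn + 0;
print "as string: $isbn\n";
print "as number: $as_number\n";
```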

Currency exchange rates
This service returns the value of one unit of country1's currency converted into country2's currency:

5.d. client

 #!perl -w

  use SOAP::Lite;

  # GLUE http://www.themindelectric.com/ (running on XMethods.net)

  my $s = SOAP::Lite                             
    -> uri('urn:xmethods-CurrencyExchange')                
    -> proxy('http://services.xmethods.net/soap');

  my $r = $s->getRate(SOAP::Data->name(country1 => 'England'), 
                      SOAP::Data->name(country2 => 'Japan'))
            ->result;
  print "Currency rate for England/Japan is $r\n";

Which gives you (as of 2001/03/11):

5.d. result

 Currency rate for England/Japan is 175.4608

NASDAQ quotes
This service returns a delayed stock quote based on a stock symbol:

5.e. client

 #!perl -w

  use SOAP::Lite;

  # GLUE http://www.themindelectric.com/ (running on XMethods.net)

  my $s = SOAP::Lite                             
    -> uri('urn:xmethods-delayed-quotes')                
    -> proxy('http://services.xmethods.net/soap');

  my $symbol = 'AMZN';
  my $r = $s->getQuote($symbol)->result;
  print "Quote for $symbol symbol is $r\n";

It may (or may not, depending on how Amazon is doing) give you:

5.e. result

 Quote for AMZN symbol is 12.25

Access with service description (WSDL)

Although support for WSDL 1.1 is limited in SOAP::Lite for now (the service description may work in some cases, but hasn't been extensively tested), you can access services that don't have complex types in their description:

6.a. client

 #!perl -w

  use SOAP::Lite;
  
  print SOAP::Lite
    -> service('http://www.xmethods.net/sd/StockQuoteService.wsdl')
    -> getQuote('MSFT');

If we take a look under the hood we'll find that SOAP::Lite requests a service description, parses it, builds the stub (a local object that has the same methods as the remote service) and returns it to you. As a result, you can run several requests using the same service description:

6.b. client

 #!perl -w

  use SOAP::Lite;
  
  my $service = SOAP::Lite
    -> service('http://www.xmethods.net/sd/StockQuoteService.wsdl');
  print 'MSFT + ORCL = ', $service->getQuote('MSFT') + $service->getQuote('ORCL');

The service description doesn't need to be on the Internet; you can access it from your local drive also:

6.c. client

 #!perl -w

  use SOAP::Lite
    service => 'http://www.xmethods.net/sd/StockQuoteService.wsdl',
    # service => 'file:/your/local/path/StockQuoteService.wsdl',
    # service => 'file:./StockQuoteService.wsdl',
  ;

  print getQuote('MSFT'), "\n";

This code works similarly to the previous example (in OO style), but loads the description and imports all the methods, so you can use the functional interface.

And finally, a couple of one-liners for those who like to do something short and simple (albeit useful and powerful):

6.d. client

 # The following command is split for readability
  perl "-MSOAP::Lite service=>'http://www.xmethods.net/sd/StockQuoteService.wsdl'" 
       -le "print getQuote('MSFT')"

  perl "-MSOAP::Lite service=>'file:./quote.wsdl'" -le "print getQuote('MSFT')"

The last command seems to be the shortest SOAP method invocation.

Security (SSL, basic/digest authentication, cookie-based authentication, ticket-based authentication, access control)

Though SOAP doesn't impose any security mechanisms (unless you count the SOAP Security Extensions: Digital Signature specification), the extensibility of the protocol allows you to leverage many security methods that are available for different protocols, like SSL over HTTP or S/MIME. We'll consider how SOAP can be used together with SSL, basic authentication, cookie-based and ticket-based authentication, and access control.

SSL
Let's start with SSL. Surprisingly there is nothing SOAP-specific you need to do on the server side, and there is only a minor modification on the client side: just specify https: instead of http: as the protocol for your endpoint and everything else will be done for you. Obviously, both endpoints should support this functionality and the server should be properly configured.

7.a. client

 #!perl -w
  
  use SOAP::Lite +autodispatch => 
    uri => 'http://www.soaplite.com/My/Examples',

    proxy => 'https://localhost/cgi-bin/soap.cgi',

    on_fault => sub { my($soap, $res) = @_; 
      die ref $res ? $res->faultdetail : $soap->transport->status, "\n";
    }
  ;

  print getStateName(21);

Basic authentication
The situation gets even more interesting with authentication. Consider this code that accesses an endpoint that requires authentication.

7.b. client

 #!perl -w

  use SOAP::Lite +autodispatch => 
    uri => 'http://www.soaplite.com/My/Examples', 
    proxy => 'http://services.soaplite.com/auth/examples.cgi', 
    on_fault => sub { my($soap, $res) = @_; 
      die ref $res ? $res->faultdetail : $soap->transport->status, "\n";
    }
  ;

  print getStateName(21);

Keep in mind that the password will be in clear text during the transfer (not exactly in clear text; it will be base64 encoded, but that's almost the same) unless the user uses https (i.e. authentication doesn't mean encryption).
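To make the "almost the same as clear text" point concrete, here is a small sketch using the MIME::Base64 module. The user name and password are hypothetical values chosen for this illustration; the header layout follows the usual form of HTTP basic authentication.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Hypothetical credentials, for illustration only.
my ($user, $password) = ('alice', 'secret');

# Basic authentication sends "user:password" encoded with base64.
my $token  = encode_base64("$user:$password", '');    # '' suppresses the trailing newline
my $header = "Authorization: Basic $token";
print "$header\n";

# Anyone who can see the header can reverse the encoding; no key is needed.
my ($seen_user, $seen_password) = split /:/, decode_base64($token), 2;
print "recovered: $seen_user / $seen_password\n";
```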

The server configuration for an Apache Web server with authentication can be specified in a .conf file or in an .htaccess file, and may look like this:

7.b. server (.htaccess)

 AuthUserFile /path/to/users/file/created/with/htpasswd
  AuthType Basic
  AuthName "SOAP::Lite authentication tests"
  require valid-user

If you run example 7.b against this endpoint, you'll probably get the following error:

7.b. result

 401 Authorization Required

You may provide the required credentials on the client side (user soaplite, and password authtest) by overriding the function get_basic_credentials() in the class SOAP::Transport::HTTP::Client:

7.c. client

 #!perl -w
  
  use SOAP::Lite +autodispatch => 
    uri => 'http://www.soaplite.com/My/Examples', 
    proxy => 'http://services.soaplite.com/auth/examples.cgi', 
    on_fault => sub { my($soap, $res) = @_; 
      die ref $res ? $res->faultdetail : $soap->transport->status, "\n";
    }
  ;

  sub SOAP::Transport::HTTP::Client::get_basic_credentials { 
    return 'soaplite' => 'authtest';
  }

  print getStateName(21);

That gives you the correct result:

7.c. result

Massachusetts

Alternatively, you may provide this information with the credentials() function, but you need to specify the host and realm also:

7.d. client

 #!perl -w

  use SOAP::Lite +autodispatch => 
    uri => 'http://www.soaplite.com/My/Examples',

    proxy => [
      'http://services.soaplite.com/auth/examples.cgi', 
      credentials => [
        'services.soaplite.com:80',        # host:port
        'SOAP::Lite authentication tests', # realm
        'soaplite' => 'authtest',          # user, password
      ]
    ],

    on_fault => sub { my($soap, $res) = @_; 
      die ref $res ? $res->faultdetail : $soap->transport->status, "\n";
    }
  ;

  print SOAP->getStateName(21);

Under modern Perl you may get a warning about "deprecated usage of inherited AUTOLOAD". To avoid it, use the full syntax: SOAP->getStateName(21) instead of getStateName(21).

The simplest and most convenient way would probably be to provide the user and password embedded in a URL. Surprisingly, this works:

7.e. client

 #!perl -w

  use SOAP::Lite;

  print SOAP::Lite
    -> uri('http://www.soaplite.com/My/Examples')

    -> proxy('http://soaplite:authtest@services.soaplite.com/auth/examples.cgi')

    -> getStateName(21)
    -> result;

Cookie-based authentication
Cookie-based authentication also doesn't require much work on the client side. Usually, it means that you need to provide credentials in some way; if they are accepted, the server returns a cookie and then checks it on all subsequent requests. Using the available functionality you may not only support this behavior on the client side in one session, but even store cookies in a file and use the same server session for several runs. All you need to do is:

7.f. client

 #!perl -w

  use SOAP::Lite; 
  use HTTP::Cookies;

  my $soap = SOAP::Lite
    -> uri('urn:xmethodsInterop')

    -> proxy('http://services.xmethods.net/soap/servlet/rpcrouter', 
             cookie_jar => HTTP::Cookies->new(ignore_discard => 1));

  print $soap->echoString('Hello')->result;

All the magic is in the cookie jar. You may even add or delete cookies between calls, but the underlying module does everything you need by default. Add the option file => 'filename' to the call to new() to save and restore cookies between sessions. Not much work, huh? Kudos to Gisle Aas on that!

Ticket-based authentication
Ticket-based authentication is a bit more complex. The logic is similar to cookie-based authentication, but it is executed at the application level, instead of at the transport level. The advantage is that it works for any SOAP transport (not only for HTTP) and gives you a bit more flexibility. The trade-off is that you won't get support from the Web server and you'll have to do everything manually. No big deal, right?

The first step is the ticket generation. We'll build a ticket that contains an e-mail address, a time and a signature.

7.g. server (TicketAuth)

 package TicketAuth;

  # we will need to manage Header information to get a ticket
  @TicketAuth::ISA = qw(SOAP::Server::Parameters);

  # ----------------------------------------------------------------------
  # private functions
  # ----------------------------------------------------------------------

  use Digest::MD5 qw(md5);

  my $calculateAuthInfo = sub {
    return md5(join '', 'something unique for your implementation', @_);
  };

  my $checkAuthInfo = sub {
    my $authInfo = shift;
    my $signature = $calculateAuthInfo->(@{$authInfo}{qw(email time)});
    die "Authentication information is not valid\n" 
      if $signature ne $authInfo->{signature};
    die "Authentication information is expired\n" 
      if time() > $authInfo->{time};
    return $authInfo->{email};
  };

  my $makeAuthInfo = sub {
    my $email = shift;
    my $time = time()+20*60; # signature will be valid for 20 minutes
    my $signature = $calculateAuthInfo->($email, $time);
    return +{time => $time, email => $email, signature => $signature};
  };

  # ----------------------------------------------------------------------
  # public functions
  # ----------------------------------------------------------------------

  sub login { 
    my $self = shift;

    pop; # last parameter is envelope, don't count it
    die "Wrong parameter(s): login(email, password)\n" unless @_ == 2;
    my($email, $password) = @_;

    # check credentials, write your own is_valid() function
    die "Credentials are wrong\n" unless is_valid($email, $password);

    # create and return ticket if everything is ok
    return $makeAuthInfo->($email);

  }

  sub protected { 
    my $self = shift;

    # authInfo is passed inside the header
    my $email = $checkAuthInfo->(pop->valueof('//authInfo'));

    # do something, user is already authenticated 
    return;
  }

It would be very careless (and insecure) to create calculateAuthInfo() as a normal, exposed function, because a client could invoke it directly and generate a valid ticket without providing valid credentials (unless you forbid it in the SOAP server configuration, but we'll show another way). Therefore, we create calculateAuthInfo(), checkAuthInfo() and makeAuthInfo() as 'private' functions, so only other functions inside the same file can access them. It effectively prevents clients from accessing them directly.

The login() function returns a hash that has an e-mail and time inside, as well as an MD5 signature that prevents the user from altering this information. Since the server used a secret string during signature generation, the user is not able to tamper with the resulting signature. To access protected methods, the client has to provide the obtained ticket in the header:

7.g. fragment

 # login
  my $authInfo = login(email => 'password');

  # convert it into the Header
  $authInfo = SOAP::Header->name(authInfo => $authInfo);

  # invoke protected method
  protected($authInfo, 'parameters');

This is just a fragment, but it should give you some ideas on how to implement ticket-based authentication at the application level. You could even get the ticket in one place (via HTTP for example) and access a SOAP server via SMTP providing this ticket (ideally you should use PKI [public key infrastructure] for that matter).
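If you want to experiment with the signing logic without setting up a SOAP server, the following standalone sketch follows the same scheme as the TicketAuth module: a secret string, the e-mail address and the expiry time are hashed together, and the ticket is rejected if the signature doesn't match or the time has passed. The function names sign_ticket, make_ticket and ticket_is_valid are invented for this sketch, and it uses md5_hex rather than md5 so that the signature is printable.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Known only to the server; pick your own value.
my $secret = 'something unique for your implementation';

sub sign_ticket {
    my ($email, $time) = @_;
    return md5_hex(join '', $secret, $email, $time);
}

sub make_ticket {
    my ($email) = @_;
    my $time = time() + 20 * 60;    # valid for 20 minutes
    return { email => $email, time => $time,
             signature => sign_ticket($email, $time) };
}

sub ticket_is_valid {
    my ($ticket) = @_;
    return 0 if sign_ticket(@{$ticket}{qw(email time)}) ne $ticket->{signature};
    return 0 if time() > $ticket->{time};
    return 1;
}

my $ticket = make_ticket('user@example.com');
print "fresh ticket:    ",
    (ticket_is_valid($ticket) ? "accepted" : "rejected"), "\n";

# A client that pushes the expiry time out by a day can't recompute
# the signature without the secret, so the ticket no longer verifies.
my %forged = (%$ticket, time => $ticket->{time} + 24 * 60 * 60);
print "extended expiry: ",
    (ticket_is_valid(\%forged) ? "accepted" : "rejected"), "\n";
```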

Access control
Why would you need access control? Imagine you have a class and want to give access to it selectively; for example, read access to one person and read/write access to another person or a list of people. At a low level, read and write access means access to specific functions/methods in class.

You could put this check in at the application level (for example with ticket-based authentication), or you could split your class into two different classes and give one person access only to one of them. Neither of these is an optimal solution. We'll consider a different approach, where you create two different endpoints that refer to the same class on the server side, but have different access options.

7.e. server (first endpoint)

 use SOAP::Transport::HTTP;
  
  use Protected;
  SOAP::Transport::HTTP::CGI
    -> dispatch_to('Protected::readonly')
    -> handle
  ;

This endpoint will have access only to the readonly() method in the Protected class.

7.e. server (second endpoint)

 use SOAP::Transport::HTTP;

  use Protected;
  SOAP::Transport::HTTP::CGI
    -> dispatch_to('Protected')
    -> handle
  ;

This endpoint will have unrestricted access to all methods/functions in the Protected class. Now you can put it under basic, digest or some other kind of authentication to prevent unauthorized access.

Thus, by combining the capabilities of a Web server with the SOAP server you can create an application that best suits your needs.
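The Protected class itself isn't shown above, so here is a minimal sketch of what it might look like. Only the class name and readonly() come from the endpoints above; readwrite() and the %store hash are hypothetical, and the sketch assumes the methods are invoked as class methods. The calls at the bottom simply exercise the class locally.

```perl
package Protected;

use strict;
use warnings;

my %store = (greeting => 'hello');    # stand-in for real data

# Read access: the only method the first endpoint dispatches to.
sub readonly {
    my ($class, $key) = @_;
    return $store{$key};
}

# Write access (hypothetical name): reachable only through the
# second, unrestricted endpoint.
sub readwrite {
    my ($class, $key, $value) = @_;
    $store{$key} = $value;
    return $store{$key};
}

package main;

print Protected->readonly('greeting'), "\n";           # read
print Protected->readwrite(greeting => 'hi'), "\n";    # write, then read back
```

In a real deployment the class would live in its own Protected.pm file ending in a true value, and the package main part would go away.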

Handling LoLs (List of Lists, Structs, Objects, or something else)

Processing complex data structures isn't different in any aspect from the usual processing in your programming language. The general rule is simple: 'Treat the result of a SOAP call as a variable of specified type'.

The next example shows a service that works with an array of structs:

8.a. client

 #!perl -w

  use SOAP::Lite;

  my $result = SOAP::Lite
        -> uri('urn:xmethodsServicesManager')
        -> proxy('http://www.xmethods.net/soap/servlet/rpcrouter')
        -> getAllSOAPServices();

  if ($result->fault) {
    print $result->faultcode, " ", $result->faultstring, "\n";
  } else {
    # reference to array of structs is returned
    my @listings = @{$result->result};

    # @listings is the array of structs
    foreach my $listing (@listings) {
      print "-----------------------------------------\n";
      # print description for every listing
      foreach my $key (keys %{$listing}) {
        print $key, ": ", $listing->{$key} || '', "\n";
      }        
    }
  }

The same is true about structs inside other structs, lists of objects, objects that have lists inside, etc. 'What you return on server side is what you get on client side, and let me know if you get something else.'

(OK, not always. You MAY get a blessed array even when you return a simple array on the other side and you MAY get a blessed hash when you return a simple one, but it won't change anything in your code, just access it as you usually do).
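A short sketch of that last point: a blessed hash is still a hash, so the same access code works for both. The class name SomeStruct and the field values are made up for the illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

my $plain   = { name => 'Whois', port => 80 };
my $blessed = bless { name => 'Whois', port => 80 }, 'SomeStruct';    # hypothetical class

# The same loop handles both; blessing doesn't change how you reach the fields.
for my $struct ($plain, $blessed) {
    my $label = blessed($struct) || 'plain hash';
    print "$label: ",
        join(', ', map { "$_=$struct->{$_}" } sort keys %$struct), "\n";
}
```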


Major contributors:

Nathan Torkington
Basically started this work and pushed the whole process.

Tony Hong
Invaluable comments, fixes and input help me keep this material correct, fresh and simple.

This Week on p5p 2001/04/22



Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

There were just under 500 messages this week.

Modules in the core

Jarkko faced the allegations of bloat after having added Scalar-List-Utils to the core head on, with a list of proposed module additions. This, inevitably, sparked a huge thread. The main argument was not about bloat per se, but about the lack of a clear sense of who is responsible for a module, especially if it has a separate life on CPAN, and how updates would get fed back.

Paul Marquess remarked that he wanted to put the zlib source into the Compress::Zlib distribution, which would make it suitable for core-ification so that CPAN.pm can deal with compressed files. Unfortunately, the source is big and not really as portable as Perl. Another suggestion was to add a configure probe for -lz when Perl was being built, and build Compress::Zlib. Nick Clark has been working on a compression/decompression PerlIO filter, but this too would use libz. The comparison between this and the -lgdbm dependency of GDBM_File led to talk of an AnyCompress library which had pure Perl decompression fallbacks. A good summary of that part of the discussion is given in this message by Nick Clark.

Anton Berezin remarked that FreeBSD was going to break up 5.6.1 into essential core components and additional "ports". This worried a couple of people until they remembered that Debian has been doing just this as well, and Joe Karthauser pointed out that breaking off core modules into separate ports would allow them to be updated independently of the Perl version.

Larry said that requiring expat wasn't too much of an impediment to XML::Parser being included, because XML is everywhere these days. Nat noted that Paul Kulchenko wrote a pure-Perl parser, XML::Parser::Lite which could also be contributed to core. Dan Brian astutely pointed out that if someone's got expat installed and they'll be doing XML things, it's likely they'll have XML::Parser anyway, so there isn't much point in having a pure-Perl fallback. Matt Sergeant reminded us that there isn't a consensus on the "best" XML API module, so we can't really include one of them either.

Larry wanted to talk with Paul about how slow XML::Parser::Lite turned out to be, presumably to help him decide about regular expression strategies for Perl 6.

Kwalitee Control

Perl's Quality Control department - Mr. Schwern - was in full flow this week; first he found that one of the test suites had some special-case @INC handling for Mac. If this was needed, he argued, why wasn't it needed for all of the tests? And if it was needed for all of the tests, why not abstract it out into a separate module? Since the TEST harness automatically loads TestInit.pm anyway, why not use that? Chris Nandor was in agreement, but pointed out that you'd still have to remove the @INC modifications from all the test scripts, because they would run after TestInit had done its modifications.

He also suggested that Test::Simple and Test::More be put into the core. Nobody had any comments on that.

Next, he removed the "compilation" tests from t/lib/1_compile.t for those modules which already have tests. Jarkko suggested that this would have to keep happening periodically, and wanted a cleverer solution, but Schwern thought of a better one - WRITE MORE TESTS!

He then produced a list of all the modules which were untested, and offered an incentive - once all the modules have sufficient tests, Schwern will donate $500 to Yet Another Society. Get to it and deprive this man of his hard-earned cash! There may even be prizes in it for you...

He also proposed two pretty uncontroversial (well, by most people's standards) standards for new modules coming into the core: that they should have some documentation and a reasonable amount of tests. Schwern said "I don't want to get any more elaborate for the moment to avoid lengthy debate." This didn't work.
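
For anyone tempted by the bounty, here is a rough sketch of what a small test file can look like with Test::More. The three checks against List::Util are purely illustrative and are not taken from the real test suite.

```perl
use strict;
use warnings;
use Test::More tests => 3;
use List::Util qw(first sum max);

# Each check compares what the module returns with what we expect.
is( sum(1 .. 4),  10, 'sum adds up a list' );
is( max(3, 9, 2), 9,  'max finds the largest value' );

my $found = first { $_ > 2 } 1 .. 5;
is( $found, 3, 'first returns the first matching element' );
```

Declaring the number of tests up front means that a script which dies halfway through is reported as a failure rather than passing quietly.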

Peter Prymmer had Perl avoid testing the new List::Util module on platforms which hadn't built the XS code for it. Graham Barr briefly objected, saying that it should fall back to pure Perl replacements; however, if they weren't built, these replacements wouldn't be moved to lib/. When he realised that the fallbacks were there for those people who didn't have compilers, and that you tend to have a compiler handy during the Perl build process, he withdrew his objection.

Regex Debugger and Reference Type

Mark-Jason Dominus has been working on the regular expression debugger for ActiveState's Komodo IDE - it allows you to, for example, set breakpoints in a regular expression. However, one of the problems he came across was relating the positions of the nodes of the compiled regular expression (ANYOF, EXACT, and so on) to character offsets in the string representation of it. For instance, if you have /([\d.]+)f/ your debugger will want to stop at the "f". To do this, the compiled form will need to know where the "f" is.

To provide this, he submitted a patch which generates an array of offsets every time a regular expression is compiled; he also patched perldebguts to explain how it all works.

As well as this, MJD noticed a problem with the debugging output for regular expressions: when you have a character class, Perl uses a bitmask to note which characters you're matching. It used to be 256 bits, one for each character. However, with UTF8 regular expressions, that bitmask now needs to be a lot wider than it used to be. However, the debugger didn't know about this new wide bitmask, and was still only skipping over 256 bits, landing somewhere in the middle of the Unicode bitmask. If the bits at this point were set to zero, which is likely, the debugger would interpret it as a null operation. MJD fixed this by having Perl skip to the next node in the list rather than trying to grovel over the bytecode.
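
To picture the old 256-bit layout, here is a toy model built with vec. It only illustrates the idea of one bit per character; it is not the regex engine's own code, and the helper name class_bitmap is made up for this sketch.

```perl
use strict;
use warnings;

# Build a 256-bit map with one bit per byte value, as a toy model of a
# character class such as [\d.]; this is not the regex engine's own code.
sub class_bitmap {
    my (@chars) = @_;
    my $map = "\0" x 32;                  # 32 bytes * 8 bits = 256 bits
    vec($map, ord($_), 1) = 1 for @chars;
    return $map;
}

my $map = class_bitmap('0' .. '9', '.');

printf "bitmap is %d bytes long\n", length $map;
printf "'7' in class: %d\n", vec($map, ord('7'), 1);
printf "'f' in class: %d\n", vec($map, ord('f'), 1);
```

A map of this size has no room for a code point above 255, which is why the UTF8 case needs a wider structure and why code that assumes a fixed 256 bits ends up reading from the wrong place.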

Meanwhile, Michael Schwern asked why the Regexp type you get when you do ref qr/foo/ wasn't documented, or why it wasn't REGEXP like all the other built-in types. Jarkko agreed it should be REGEXP and Larry (Look, he's back!) suggested making it REGEX instead for pronounceability. Sarathy said he wanted to change it, but there were a couple of points that never got resolved: the name (which we now have a diktat on) and what the class should do. So Schwern suggested a patch to change the name. Sarathy, however, was concerned at the fact that since the "thing" returned by qr/foo/ is actually blessed, one can treat it as an object and create a Regexp.pm to implement methods for this object. (This is exactly what Damian Conway does in his Object Oriented Perl book.) The point of the upper-cased types is that they are unblessed; there's nothing to stop someone writing SCALAR.pm but that would get confusing, because you could no longer tell whether something coming back from ref was blessed or unblessed. This convinced Jarkko not to change the capitalisation of Regexp.
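
The distinction is easy to see from Perl space. This short sketch uses Scalar::Util's blessed to compare what qr// returns with an ordinary scalar reference; the variable names are just for illustration.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

my $regex  = qr/foo/;
my $scalar = \my $value;

# ref() reports a package name for blessed things and an upper-cased
# built-in type name for unblessed references.
for my $item ( [ 'qr/foo/', $regex ], [ 'scalar ref', $scalar ] ) {
    my ($label, $thing) = @{$item};
    my $class = blessed($thing);
    printf "%-10s ref=%-6s blessed=%s\n",
        $label, ref($thing), defined $class ? $class : '(not blessed)';
}
```

The compiled regex reports Regexp from both ref and blessed, while the scalar reference reports SCALAR from ref and is not blessed at all.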

iThreads

Artur Bergman reported that he'd started work on a module which will hopefully one day replace Thread.pm. Instead of the old-style "5.005" threading, it uses the new interpreter threads. These are called iThreads, come in a range of exciting colours and are hideously overpriced. They're the trickery used to emulate fork on Windows - instead of forking, all you do is clone the interpreter to form a "pseudo-process".

However, until now there hasn't been a way to control iThreads from Perl space; it all has to be done from C. Artur's not finished yet, but I hear that he's got quite a lot of the fundamentals working.

Interested? Join the mailing list.

B Bumblings

Robin produced a rather amazing patch which adds support for pragmata in B::Deparse. He also added something to parenthesize arguments to currently-undefined subroutines; that is:

    foo 23
    sub foo { }

needs parentheses. Then he fixed UTF8 literal strings, and noticed a problem with regular expressions and large codepoints. David Dyck fixed the deparsing of split " " which was previously saying split /\s+/. Robin also got the deparser recognising special constants like $^W, and recognising the difference between lexical and global variables. Oh, and BEGIN/INIT/END blocks, and all sorts of other little features.
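
The split fix matters because the two forms are not interchangeable. A quick sketch, using a made-up input line, shows the difference when the string starts with whitespace:

```perl
use strict;
use warnings;

my $line = "  alpha beta  gamma";

my @awk_style   = split ' ',   $line;   # special case: leading whitespace is skipped
my @regex_style = split /\s+/, $line;   # leading whitespace yields an empty first field

print scalar(@awk_style),   " fields from split ' '\n";
print scalar(@regex_style), " fields from split /\\s+/\n";
```

The first form gives three fields and the second gives four, so code deparsed the old way could behave differently from the original.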

Michael Schwern has also been messing about with B, and found some mis-documentation in B::walksymtable, which he fixed up, as well as a bug in what Robin had been playing with. Robin replied with a truly wonderful explanation of how pragmata can be detected.

Various

Benjamin Franz announced Yet Another Mailing List, a working group to come up with a coherent strategy for coming up with a "named parameters" module. If that appeals to you, send mail with a body of subscribe argument-shop to majordomo@nihongo.org.

Elaine Ashton put in a couple of patches to the FAQ, as well as adding "mailing list" and "license" sections to the stub documentation produced by h2xs. These weren't huge, but I mention them because Elaine's one of those unsung heroes, and people tend to forget the work she does for us in terms of behind the scenes things, such as tidying up the FAQ, perl.org stuff, the CPAN search engine and the wonderful list of Perl mailing lists.

Tom Roche came up with a suggestion to change Perl's version searching behaviour to allow different versions of a module to be installed. There were various explanations given for why this wouldn't work (since Perl must load a module before recognising its version) and two neat alternative solutions: Richard Soderberg suggested a coderef in @INC and MJD suggested simply putting the version number in the module's file name. In fact, why not have a directory per module, so you have use Foo::Bar::1.10? But I digress...
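
For the curious, here is a minimal sketch of the coderef-in-@INC idea. It is not Richard's code: the module Hypothetical::Greeter, the %repository of in-memory sources and the %wanted table are all invented for illustration, and a real version would read files from versioned directories instead.

```perl
use strict;
use warnings;

# Hypothetical in-memory "repository": module file => version => source text.
my %repository = (
    'Hypothetical/Greeter.pm' => {
        '1.00' => "package Hypothetical::Greeter; our \$VERSION = '1.00'; sub greet { 'hello' } 1;",
        '1.10' => "package Hypothetical::Greeter; our \$VERSION = '1.10'; sub greet { 'hello, world' } 1;",
    },
);

# Which version of each module this program wants to load.
my %wanted = ( 'Hypothetical/Greeter.pm' => '1.10' );

# A code reference in @INC is called with itself and the requested file name;
# returning a filehandle makes require compile the source read from it.
unshift @INC, sub {
    my ($self, $file) = @_;
    my $version = $wanted{$file} or return;    # not ours: let other @INC entries try
    return unless exists $repository{$file}{$version};
    open my $fh, '<', \$repository{$file}{$version}
        or die "cannot open in-memory source: $!";
    return $fh;
};

require Hypothetical::Greeter;
print Hypothetical::Greeter->VERSION, ": ", Hypothetical::Greeter::greet(), "\n";
```

The hook decides which version to hand over before anything is compiled, which sidesteps the problem that Perl cannot know a module's version until it has loaded it.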

Calle Dybedahl asked why we have a file called patchlevel.h, since ImageMagick has one too, and that was screwing up Perl. Larry replied, saying that we had it first.

Until next week I remain, your humble and obedient servant,


Simon Cozens

MSXML, It's Not Just for VB Programmers Anymore


My co-workers cringe when I tell them the truth. What XML parser are you using? MSXML? With Perl? You've gotta be crazy.

Yes, it's true, but I couldn't help myself. After test driving MSXML in a Visual Basic application, it begged the question: "I wonder if Perl can use MSXML?"

I have been using MSXML to do my XML parsing in Perl and the truth is that Perl is excellent for working with Microsoft's MSXML Parser on the Win32 platform. If you use Perl on Win32, give MSXML a try from the comfort of your favorite text editor.

Grab the MSXML Parser

You Grab It

Go to Microsoft's MSDN site to download the latest version of MSXML, which is the 3.0 Release. Run the installation program and restart your machine. You have installed the latest version in side-by-side mode. None of your other Microsoft applications that use previous versions of MSXML will be affected.

Now Let Perl at It

Perl can control the MSXML parser using OLE. As with almost everything Perl, the difficult part has been done for us. The "kind people at Hip and ActiveWare(ActiveState)" have already provided us with the Win32::OLE module. The only thing that we need to know is the progID for the MSXML parser. A progID is a string used to uniquely identify an OLE automation class in the Windows registry. MSXML offers version dependent and version independent progIDs depending on the method of the installation. Since we have installed MSXML in side-by-side mode, we will need to use the version dependent progID.

Creating an OLE Instance of MSXML.DOMDocument

I begin by using the Win32::OLE module.

use Win32::OLE qw(in with);  # make sure you include(in & with)!!
                             # we will need them later.

Now I am ready to use OLE to create an instance of the MSXML parser or, more correctly, an OLE instance of MSXML2.DOMDocument.3.0, which I will simply call a DOMDocument.


# Version dependent method - this is what we want -
  my $DOM_document = Win32::OLE->new('MSXML2.DOMDocument.3.0') 
    or die "couldn't create";

# Version independent method - Assumes MSXML was installed in Replace Mode
# if you get errors with the above example - use this example instead.
# my $DOM_document = Win32::OLE->new('MSXML2.DOMDocument') 
#   or die "couldn't create";

Parsing the XML

Since I am a swim coach, I keep all kinds of records, times and scores on hand. One of my favorite things to track is records, so I maintain an XML document that contains the school's top 10 times for each event. Below is what toptimes.xml looks like:

<TOP_TEN_TIMES>
   <EVENT NAME="200 Freestyle">
      <SWIMMER NUMBER="1" TIME="1:51.49" DATE="2/21/98" NAME="Chris Miller"/>
      <SWIMMER NUMBER="2" TIME="1:54.19" DATE="2/17/01" NAME="Peter Myers"/>
 ...
      <SWIMMER NUMBER="10" TIME="2:19.31" DATE="12/8/00" NAME="Andrew Johnson"/>
   </EVENT>
   <EVENT NAME="200 IM">
 ... 
   </EVENT>
   ...
</TOP_TEN_TIMES>

My backstrokers and butterfliers don't like XML very much, so I want to parse the XML document and print out the top 10 times for the 100 backstroke and 100 butterfly. I begin by loading toptimes.xml using the DOMDocument object that I have already created. The Load method is where the XML document is actually parsed into its respective pieces such as Nodes and NodeLists. Validation also occurs at this point. The Load method returns a boolean that I can use to test whether my document loaded properly. I am going to validate my document, so I will set the validateOnParse property to 'True'.

 $DOM_document->{async} = "False";           # disable asynchronous loading
 $DOM_document->{validateOnParse} = "True";  # validate
 my $boolean_Load = $DOM_document->Load("toptimes.xml");
 if (!$boolean_Load) 
 {
   die "toptimes.xml did not load";
 }

Iterating Through the XML Document

Now that I have successfully loaded the document, I need a method of iterating through all of the document Nodes. In order to iterate through the document, I first need to find the root Node. In this example, the root Node is <TOP_TEN_TIMES>, so I will define $Top_Ten_Times to be the root Node of the XML document as such:

my $Top_Ten_Times = $DOM_document->DocumentElement();  # assign the root node

Next, I want to find all of the child Nodes of <TOP_TEN_TIMES>. $Events will be all of the root's child Nodes. In this example, $Events refers to every <EVENT> Node.

my $Events = $Top_Ten_Times->childNodes();      # all of the root's child nodes

$Events is now a NodeList (which is an OLE collection object) that I can use to iterate through each <EVENT> node in the XML document. Veteran Perl programmers will recognize the iteration code as being very similar to iterating through the elements of an array. The only difference is the little keyword 'in' that I mentioned earlier when we used Win32::OLE. The keyword 'in' is used to distinguish an OLE collection object from a standard Perl array.

I now iterate over each <EVENT> Node in the document checking each time to see whether I have one of the events that I need. When I arrive at one of the desired events, I will print the NAME Attribute of the current <EVENT> Node and create a new NodeList called $Swimmers. I will then iterate over each <SWIMMER> Node and print the TIME Attribute.

foreach my $Event (in $Events) # make sure you include the 'in'
{
   if ( ($Event->Attributes->getNamedItem("NAME")->{Text} eq "100 Backstroke") ||
        ($Event->Attributes->getNamedItem("NAME")->{Text} eq "100 Butterfly") )
   {
       # print the event name stored in the NAME attribute
        print $Event->Attributes->getNamedItem("NAME")->{Text}, "\n"; 
        my $Swimmers = $Event->childNodes();       # $Swimmers is now a NodeList collection
        foreach my $Swimmer (in $Swimmers )        # iterate through all swimmers
        {
           print $Swimmer->Attributes->getNamedItem("TIME")->{Text}, "\n";  # print the time
        }
   }
}

Transforming the XML

Now that I have satisfied the butterfliers and backstrokers on my team, I am beginning to realize that the design of my XML syntax is less than desirable. Most of my actual data is stored as attribute data, and I would really like it to be element data. I am going to perform a transformation that will place all of my actual data into element data. My goal is to make the XML document look like the following.

<TOP_TEN_TIMES>
   <EVENT>
      <EVENT_NAME>200 Freestyle</EVENT_NAME>
      <SWIM>
         <SWIMMER>Chris Miller</SWIMMER>
         <TIME>1:51.49</TIME>
         <DATE>2/21/98</DATE>
      </SWIM>
      <SWIM>
         <SWIMMER>Peter Myers</SWIMMER>
         <TIME>1:54.19</TIME>
         <DATE>2/17/01</DATE>
      </SWIM>
      ...
   </EVENT> 
   <EVENT>
      ...
   </EVENT>
   ...
</TOP_TEN_TIMES>

After a little work, I come up with the following stylesheet to do the transformation.


<?xml version="1.0" encoding="ISO-8859-1"?> 
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="xml" indent="yes" encoding="ISO-8859-1"/>
<xsl:template match="/">
<TOP_TEN_TIMES>
<xsl:for-each select="TOP_TEN_TIMES">
  <xsl:for-each select="EVENT">
  <EVENT>
     <EVENT_NAME><xsl:value-of select="@NAME"/></EVENT_NAME>
     <xsl:for-each select="SWIMMER">
     <SWIM>
        <SWIMMER><xsl:value-of select="@NAME"/></SWIMMER>
        <TIME><xsl:value-of select="@TIME"/></TIME>
        <DATE><xsl:value-of select="@DATE"/></DATE>
     </SWIM>
     </xsl:for-each>
  </EVENT>
  </xsl:for-each>
</xsl:for-each>
</TOP_TEN_TIMES>
</xsl:template>
</xsl:stylesheet>

To perform the transformation, I will need to create three OLE instances of DOMDocument. The first instance loads the top-times document and the second instance loads the stylesheet from above. The third instance will be used as the result of the transformation. I have created a subroutine that uses what we have already covered to create the three DOMDocument instances, and to load the top-times and stylesheet documents.

sub Transform {
 # Assign the File Names
  my $xml_doc_file     = shift;
  my $stylesheet_file  = shift;
  my $new_xml_doc_file = shift;
  my $boolean_Load;

 # Create the three OLE DOM instances
  my $doc_to_transform = Win32::OLE->new('MSXML2.DOMDocument.3.0');  
  my $style_sheet_doc  = Win32::OLE->new('MSXML2.DOMDocument.3.0');
  my $transformed_doc  = Win32::OLE->new('MSXML2.DOMDocument.3.0');

 # Load the Top Times document - just like above
  $doc_to_transform->{async} = "False";
  $doc_to_transform->{validateOnParse} = "True";
  $boolean_Load = $doc_to_transform->Load($xml_doc_file);
  if(!$boolean_Load)
  {
      die "The Top Times did not load\n";
  }

 # Load the Stylesheet - just like above
  $style_sheet_doc->{async} = "False";
  $style_sheet_doc->{validateOnParse} = "True";
  $boolean_Load = $style_sheet_doc->Load($stylesheet_file);
  if(!$boolean_Load)
  {
      die "The Stylesheet did not load\n";
  }

 #Perform the transformation and save the resulting DOM object
  $doc_to_transform->transformNodeToObject($style_sheet_doc, $transformed_doc);
  $transformed_doc->save($new_xml_doc_file);
}

The transformNodeToObject method is where the magic happens. I use the top-times DOMDocument instance to invoke the transformNodeToObject method and I pass the stylesheet and transformation-result instances as arguments. After the method returns, the result of the transformation is stored in $transformed_doc, which is strictly in memory. We simply call the save method and write the XML document to disk.

Now we can perform transformations using any stylesheet or XML document that we want (as long as the stylesheet relates to the XML document). We just need to pass the subroutine three file names: the name of the document to transform, the name of the stylesheet and the name of the new document. For our example, I will call the subroutine like the following.

Transform("toptimes.xml", "toptimes.xsl", "newtoptimes.xml");

After this code has executed, I have a brand new XML document newtoptimes.xml that conforms to my new XML syntax.

Updating the XML Document

After all this great work, one of my butterfliers informs me that I have been spelling his name wrong all season. I guess that there are two 'e's in Meyers, not one. No problem. I can fix this easily enough.

I can't use the exact code from above because the document structure has changed. Since the swimmers' names are pretty deep in the new structure, it will be too painful to find the root node and create a slew of nested loops (not to mention expensive to the processor). This is exactly what XPath is for. I will create a query to find all occurrences of "Peter Myers" and change them to "Peter Meyers". Once again, I create an instance of DOMDocument and load the XML document. However, this time I will call a new method, the selectNodes method, directly against the DOMDocument. As an argument, I supply an XPath query. The method returns a NodeList of all the Nodes that matched the XPath query. I can then iterate through the NodeList just like above and update the element data as I go.

  
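# create a DOMDocument and load the transformed document, as before
my $new_DOM_document = Win32::OLE->new('MSXML2.DOMDocument.3.0')
  or die "couldn't create";
$new_DOM_document->{async} = "False";
$new_DOM_document->Load("newtoptimes.xml") or die "newtoptimes.xml did not load";

my $Peter_Nodes = 
     $new_DOM_document->selectNodes("TOP_TEN_TIMES/EVENT/SWIM/SWIMMER[. = \"Peter Myers\"]");
foreach my $Peter (in $Peter_Nodes)
{
   $Peter->{nodeTypedValue} = "Peter Meyers";  # update the Value
}
$new_DOM_document->save("newertoptimes.xml"); # save the changes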

Conclusion

MSXML isn't just for Visual Basic and Visual C++ programmers. The Win32::OLE module allows Perl programmers to take advantage of Microsoft's XML parser from the comfort of their favorite text editor, and now that everyone on the team is happy with the top ten times, I can put away my XML Parser until next season ... .

This Week on p5p 2001/04/15



Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

There were just under 300 messages this week.

perl-beginners list

Casey West came up with the great idea of having a mailing list for Perl beginners. It's worth quoting extensively from his post:

I have been throwing around the idea of a central place where Perl newcomers can come and ask FAQ questions and the like. My dream is to funnel all the RTFM traffic from p5p, c.l.p.* and other such places. I would like to see a place where the newbies are accepted as people who just don't know yet. ...

There are a few crucial things that must take place in order to make this list as effective as possible:

* Moderation. Flames must be removed at all times. It should be easy to be the newbie.

* Direction. Archives are a valuable resource. We don't do homework.

* Publicity. List and News Group FAQ's can and should list perl-newbies as the primary source for simple, 'new commer' questions. Questions of this nature should be immediatly redirected to perl-newbies.

* Teachers. They can and will ask, we must answer.

At first I thought PerlMonks was this thing, however, PerlMonks is not an environment where newbies are alowed to ask simple questions answered in the documentation.

Abigail suggested that "learning how to uses available documentation isn't something unique for Perl - it should be standard practise for anyone out of primary school". While I personally agree with that, I think there's definitely a need for hand-holding as well, and I take my hat off to Casey for having done something about it.

Accordingly, the mailing list - called "beginners" to avoid people feeling demeaned by the "newbie" tag - has been set up. Mail beginners-subscribe@perl.org to get involved, whether you want to ask questions or to answer them.

Kevin Lenzo noted that he's set up a help IRC channel with Casey's goals in mind at #perl-help on irc.rhizomatic.net. Feel free to join in, but remember - no flames, no RTFMs, be nice.

Dynamically scoped last()

There was a very long and very pointless thread which started when Michael Schwern found the following bit of functionality:

	$ perl -wle 'sub foo { last; } while(1) { foo() }  print "Bar"'
	Exiting subroutine via last at -e line 1.
	Bar

He expected that to be an infinite loop, but the last actually ended the nearest loop, which is the while loop - sub isn't a loop, which is why you're not supposed to exit one with last. A million and one people pointed out that this was documented in perlsyn, but it seemed to be documented very confusingly.

But what about bare blocks, he cried? Well, they're actually loops, they're just loops that are executed once, replied Abigail, Dan and Graham in chorus. Schwern asked why Perl behaved like this; Dan answered:

I think it makes the parser a bit simpler, because otherwise it would need to track loop starts and subroutine declarations and such to match up loops with lasts/redos/nexts. Plus you'd still need to walk the call stack with string evals, which'd be a bit of a mess.

I can't really see any reason that last, redo, or next should propagate outside a subroutine, but it's probably a bit late to change this in perl 5
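
Both behaviours are easy to reproduce. In this sketch the sub name leave_loop is made up, and the 'exiting' warning category is switched off inside it only to keep the output quiet:

```perl
use strict;
use warnings;

# last inside a sub ends the loop that is running when the sub is called.
sub leave_loop {
    no warnings 'exiting';    # silence "Exiting subroutine via last"
    last;
}

my $passes = 0;
while (1) {
    $passes++;
    leave_loop();             # ends the while loop on the first pass
}

# A bare block is a loop that runs once, so last can leave it early.
my @steps;
{
    push @steps, 'entered';
    last;
    push @steps, 'not reached';
}

print "passes: $passes, steps: @steps\n";
```

The while loop stops after a single pass and the second push never runs, which is the dynamic scoping Schwern stumbled over.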

Abigail patched the documentation to make it explicit what's going on, and after a few iterations, came up with a patch that everyone broadly agreed on. After this brief burst of productivity, the discussion then got silly, and the referees started blowing whistles and giving free kicks.

In the end, Michael Schwern came up with some tests for the existing behaviour, plus some patches to rewrite the warnings to call the behaviour "deprecated".

Schwern also added tests for Exporter, which amazingly didn't have any to date - they even found a bug in require_version. He tried to open the B::walkoptree_slow container of annelids again, but Jarkko didn't take the bait.

perlbug Administration

Jarkko reminded the world that there are - or at least, were - 1777 open bug ids, and suggested that the bug administrators get 5.6.1 installed and attack the bug database with all their might. Merijn found that all the HP/UX bugs were his smoke reports - how convenient! Ask suggested that Chris Nandor advertise for bug administrators on Use Perl, and I shall do it here as well:

If you want to help out squashing Perl bugs, please get in touch with Richard Foley who will tell you how to get involved. You don't need to be able to fix the bugs, although if you can, it obviously helps - we just need people to be able to run through the database and say "This is still a problem as of 5.6.1" or "I don't see this in 5.6.1 any more", and report back to us at P5P. A knowledge of the way P5P and bug squashing works - which you can get through the P5P FAQ - will be useful, although Richard will tell you all you need to know about the actual bug database side of things. You can also subscribe to the Bug Mongers mailing list, but I have a feeling you're better off talking to Richard first.

Problems with tar

Mike Guy found that he was unable to untar Perl 5.6.1 on Solaris and SunOS - Sarathy explained that the Solaris tar is broken for archives which contain long path names, so he used GNU tar instead. Robert Spier documented this fact in the README for CPAN.

Calle Dybedahl explained that it was the usual case of GNU versus the world:

For paths longer than 100 characters, there are two incompatible standards for tar archives: GNU tar and the rest of the world. If you have such paths and built the archive with GNU tar, you must have GNU tar to unpack it.

Great, thought Sarathy - but we don't use any paths over 100 characters. It turned out that there were at least three different problems intersecting here: first, Mike's download was busted; second, Solaris' tar is busted, but a patch has been issued; and third, there are two different standards for tars - Alan explained it beautifully.

Net::Ping and the "stream" protocol

Scott Bronson came up with an interesting idea for Net::Ping: because dynamic IPs exist, he needed to be able to both ping a host and ensure it was the same host he's pung in the past. So, he added a protocol to Net::Ping which keeps the connection alive between pings, and duly submitted the patch to P5P.

Colin McMillen, the maintainer of Net::Ping, recommended it for inclusion into bleadperl, and came up with some changes of his own, particularly to replace the alarm timeout with non-blocking IO using select. This made Scott unhappy:

First of all, tcp ping is broken. Select won't return if you haven't written any data to be echoed. If you were to write data, however, that would disagree with the documentation: "No data bytes are actually sent since the successful establishment of a connection is proof enough of the reachability of the remote host." Which is it to be? And, how about warning me that it doesn't work before I start trying to figure out what I broke?

I fixed this in the patch by writing some bytes to be echoed and reading them back. But I think this is a very bad idea. Also from the docs, "tcp is expensive and doesn't need our help to add to the overhead." You're changing Net::Ping's personality.

Colin defended his changes, saying that the current behaviour is horribly broken anyway, as has been discussed several times in the past. He also defended forking a new process for ping on Win32, since expensive Win32 functionality is better than none. Graham Barr also chimed in, demonstrating how to do a non-blocking connect and select with timeout, and noting that alarm is best avoided.
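
This is not Graham's code, but a rough sketch of the general technique: start the connect on a non-blocking socket, then let select wait for it with a timeout instead of using alarm. The function name tcp_reachable is invented, and the loopback listener is there only so the example has something to connect to.

```perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

# Returns 1 if a TCP connection to $host:$port completes within $timeout seconds.
sub tcp_reachable {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Blocking => 0,      # connect returns at once, possibly still in progress
    ) or return 0;

    # The socket becomes writable once the connection attempt has finished.
    my @ready = IO::Select->new($sock)->can_write($timeout);
    return 0 unless @ready;

    # connected() wraps getpeername, which fails if the attempt did not succeed.
    my $ok = $sock->connected ? 1 : 0;
    close $sock;
    return $ok;
}

# Demonstration against a listener on the loopback interface.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,         # let the system pick a free port
    Listen    => 5,
    Proto     => 'tcp',
) or die "cannot listen: $!";

my $reachable = tcp_reachable('127.0.0.1', $listener->sockport, 2);
print $reachable ? "reachable\n" : "not reachable\n";
```

No data is written here, so this only shows that a connection can be established, which is the behaviour the Net::Ping documentation quoted above describes for tcp.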

Various

Rich Williams found a little bug with the regular expression optimizer in Perl 5.6.1 - apparently it only strikes when using .* together with /sg. However, he put in the time and the detective work to isolate it to a couple of patches, from which Sarathy and Hugo managed to squash the bug; Gisle fixed Digest::MD5 for UTF8 strings.

Peter Prymmer updated the perlebcdic documentation to document Nick Ing-Simmons's heroic UTF-EBCDIC effort. Tom Horsley complained about perlbug occasionally putting ~s ...the subject and ~c ...the sender's email address in the body of the mail instead of actually writing the headers properly - Kurt Starsinic explained that it was actually a bug in Mail::Mailer and a workaround is to set PERL_MAILERS=mail:/no/such/thing

Jonathan Stowe found some horrible bugs in SCO's NaN handling - NaN compares equal to 1.0 and less than 1.0. Urgh. Andy Dougherty suggested that we whip up a C program which tests for platforms with broken NaN and papers over the cracks in the test suite; Jarkko, on the other hand, was more sanguine and suggested that 5.7.2 will have to bite the NaN and inf bullets. Nick Clark pointed out that $_ & ~0 will do weird things for many values of $_.
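
Andy's probe would be written in C, but the checks it needs to make can be sketched in Perl too. This assumes that 9**9**9 overflows to infinity, as it does with IEEE 754 doubles; the labels in %check are just for this example.

```perl
use strict;
use warnings;

my $inf = 9**9**9;        # overflows to infinity on IEEE 754 platforms
my $nan = $inf - $inf;    # infinity minus infinity is not a number

# On a well-behaved platform NaN is unequal to everything, itself included,
# and every ordered comparison involving it is false.
my %check = (
    'NaN != NaN' => ( $nan != $nan ? 1 : 0 ),
    'NaN == 1.0' => ( $nan == 1.0  ? 1 : 0 ),
    'NaN < 1.0'  => ( $nan <  1.0  ? 1 : 0 ),
);

my $sane = $check{'NaN != NaN'} && !$check{'NaN == 1.0'} && !$check{'NaN < 1.0'};
print $sane ? "NaN comparisons look sane\n" : "NaN comparisons look broken\n";
```

On the platform Jonathan describes, the second and third checks would come out true and the script would report the breakage.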

Vadim Konovalov found a problem with the -f operator in Cygwin - it interprets -f "foo" as -f "foo" || -f "foo.exe". Urgh. Mike Giroux explained that it made some sense - programs on Windows want to execute foo.exe by saying just foo. Fair enough, but Vadim pointed out that there's code in win32/win32.c to work around this for Borland's C compiler, so this could be extended to Cygwin.

Robin Houston continued his conquest of B::*. His changes were numerous.

Jarkko has suggested that he wants to integrate Time::HiRes into 5.7.2 - after some time, the maintainer of the module was tracked down and subdued. There'll be a lot more about new module integration next week.

Aaron Mackey reported that building Perl on Mac OS X doesn't work properly because of problems with case-folding in the file system - if you're finding this, try adding -Dmakefile=GNUmakefile to the Configure command, and/or set the environment variable LC_ALL=C.

The FETCHSLICE/STORESLICE idea reared its ugly head again, with Rick Delaney implementing a FETCHSIZE method. While it's certainly a good idea, Dave Mitchell pointed out it may well have the same problems as *SLICE when it comes to unscrupulously inheriting modules.

Eric Kolve noted that Perl does Weird Things if you try doing a tie inside of a FETCH from a tied variable - this isn't surprising, but it would be a good thing for someone with a bit of spare time to look into.

Until next week I remain, your humble and obedient servant,


Simon Cozens

Designing a Search Engine


A couple of months ago, I was approached by a company that I had done some work for previously, and was asked to build them a search engine to index about 200MB of HTML. However, some fairly strict rules were laid down about what I could and couldn't do. First, the search engine had to be built from scratch, so that the rights to it were fully theirs. Second, I wasn't allowed to use any non-standard modules. Third, the data had to be held in plain-text files. Fourth, I was assured that all DB* modules were not working, and couldn't be made to work, which seemed somewhat surprising. And finally, one had to be able to perform Boolean searches, including the use of brackets.

As you can imagine, this presented quite a challenge, so I accepted and got to work. I want to discuss the two most interesting parts of the project: how to store and retrieve the data efficiently, and how to parse search terms.

Because of the ideas I had on parsing the search terms, which will be discussed later, I needed a system that could take a word and send back a list of all the files it appeared in and how many times it appeared in them. This had to be as fast as possible without using insane amounts of disk space. Second, I needed to be able to get information about the files indexed quickly without actually opening them - the files could be large, and could be held remotely.

The first thing to do was to assign each indexed file a unique number - this would allow me to save space when referring to it, and would allow me to easily look up information on the file. Each word would have a list of these numbers listed under it, so I could take the word 'camel' for example, and get a list of file numbers back.
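As a rough illustration of that structure, here is a small in-memory sketch. The sample texts and the name build_index are made up for the example, and the real program stored its index on disk rather than in a hash.

```perl
use strict;
use warnings;

# Hypothetical input: file number => text of that file (tags already removed).
my %text_of = (
    1 => 'The camel is a ship of the desert',
    2 => 'A camel, another camel and a dog',
);

# Build word => { file number => occurrence count }.
sub build_index {
    my (%docs) = @_;
    my %index;
    for my $file_no (keys %docs) {
        for my $word (lc($docs{$file_no}) =~ /\w+/g) {
            $index{$word}{$file_no}++;
        }
    }
    return \%index;
}

my $index = build_index(%text_of);
for my $file_no (sort { $a <=> $b } keys %{ $index->{camel} }) {
    print "camel: file $file_no, $index->{camel}{$file_no} time(s)\n";
}
```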

To me, the most obvious way of implementing this was to use a tied hash. However, my project rules stated I wasn't allowed to do that. I tried to think of other ways to implement a tied-hash system without using any of the DB or DBI modules. My first idea was to piggyback off a similar feature used in almost all operating systems; or, to be more precise, every file system. By creating each word as a file I could easily open (DATA, "camel") to get my data.

At this point, I had two types of indexes: one that listed summary information for each file by file number, and another that held information about each word.

There were two problems, however, with using the file system as my hash. While with a small number of words it was quite fast, most operating systems use a linear search to locate files in a directory, so opening "zulu" in a 10,000 file directory quickly became quite slow. The other problem was minimum file sizes, especially under Windows. This is a huge problem when your minimum file size is 16KB - as it is on FAT16 with a 1GB hard drive - as 100 files translates to 1.6MB. When the data you're indexing is 20MB and you get about four times this much worth of index files, you're doing something wrong.

The solution for the file numbers was quite easy: I would spit out the file offsets of where information about each file was stored in my file index. Then, I'd read these offsets in at the start of each search, so that if I wanted information about file 15, I could get the offset by saying $file[15] and then seeking to that point.
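A minimal sketch of the offset idea, assuming one tab-separated summary record per line. The record layout and the helper names are hypothetical, and the offsets are kept in memory here rather than written to a file of their own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical summary records, indexed by file number.
my @summary = (
    "0\tindex.html\tHome page",
    "1\tcamel.html\tAll about camels",
    "2\tdog.html\tDogs",
);

# Write the records, noting the byte offset at which each one starts.
sub write_file_index {
    my ($fh, @records) = @_;
    my @offset;
    for my $record (@records) {
        push @offset, tell($fh);
        print {$fh} "$record\n";
    }
    return @offset;
}

# Jump straight to one record instead of reading the whole file.
sub read_record {
    my ($fh, $offset) = @_;
    seek($fh, $offset, 0) or die "seek failed: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}

my ($fh, $filename) = tempfile(UNLINK => 1);
my @file = write_file_index($fh, @summary);
print read_record($fh, $file[1]), "\n";
```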

I was beginning to despair of an elegant solution to the word indexing until Simon Cozens pointed out the wonderful Search::Dict. Search::Dict is a standard module that searches for keys in dictionary files. Essentially, you give it a word and a file-handle, where the file-handle is open on a file whose lines are in alphabetical order, and it changes the next read position of the file-handle to the line that starts with the word. For example:

use Search::Dict;
look *INDEX, "camel";
$line = <INDEX>;

would set $line to the first line in the file beginning with camel (or, if no line does, to the first line that sorts after it). In addition, since it uses an efficient binary search to achieve this, retrieval of data was fast. Problem solved.
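Here is a slightly fuller sketch of a lookup built on look. The index line format ("word fileno:count ...") and the name postings_for are assumptions made for the example, not the format the project actually used.

```perl
use strict;
use warnings;
use Search::Dict;
use File::Temp qw(tempfile);

# Hypothetical index format: "word fileno:count fileno:count ...",
# one word per line, lines sorted as plain strings.
my @entries = ("zulu 2:5", "camel 3:2 7:1", "dog 1:1");
my @lines   = sort @entries;

my ($index_fh, $index_name) = tempfile(UNLINK => 1);
print {$index_fh} "$_\n" for @lines;
seek($index_fh, 0, 0) or die "seek failed: $!";   # flush and rewind

# Return { fileno => count } for $word, or an empty hash ref if absent.
sub postings_for {
    my ($fh, $word) = @_;
    look $fh, $word;              # binary search, then position the handle
    my $line = <$fh>;
    return {} unless defined $line;
    chomp $line;
    my ($found, @pairs) = split ' ', $line;
    # look() stops at the first line >= $word, so check for an exact hit.
    return {} unless $found eq $word;
    my %count = map { split /:/, $_ } @pairs;
    return \%count;
}

my $hits = postings_for($index_fh, 'camel');
print "camel: file $_ x $hits->{$_}\n" for sort { $a <=> $b } keys %$hits;
```

The exact-match check matters: a search for a word that is not in the index leaves the handle at the next word in sort order, not at end of file.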

Parsing of the Boolean search terms was the most difficult part of the program. Users had to be able to enter terms such as cat and (dog not sheep) and get sane results. The only way I could deal with this, I decided, was to go along the search terms from left to right and keep an array of file numbers that were still applicable.

To do this, I created three subroutines that I called &array_and, &array_not, and &array_both. &array_and would take our current results array and add the list of file numbers from a given word. &array_not would 'subtract' the file numbers from one word from our results array, and &array_both would return shared elements between the results array and the file numbers from the search word.

I ripped a lot of code from the Perl Cookbook to make these array functions. The code for these functions can be seen below:

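sub array_both {
 # Intersection: file numbers present in both lists.
 # Assumes neither input list contains duplicates of its own.
 my $prev = 'nonesuch';
 my @total_first = (@{$_[0]}, @{$_[1]});
 @total_first = sort(@total_first);
 my $current;
 my @return_array;
 foreach $current (@total_first) {
  # After sorting, a value found in both lists appears twice in a row.
  if ($prev eq $current) { push(@return_array, $prev);  }
  $prev = $current;
 }
 # De-duplicate the result; array_and (below) keeps one copy of each
 # element and is assumed to be the helper intended here.
 @return_array = @{&array_and(\@return_array, \@return_array)};
 return \@return_array;
}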

sub array_and {
 # Union without duplicates: everything in either list.
 my @total_first = (@{$_[0]}, @{$_[1]});
 my %seen = (); # These next two lines are from the PCB
 my @return_array = grep { ! $seen{$_} ++ } @total_first ;
 return \@return_array;
}

sub array_not {

 # Again this is ripped straight from PCB

 my @a = @{$_[0]};
 my @b = @{$_[1]};
 my %hash = map {$_ => 1} @a;
 my $current;
 foreach $current (@b) {
  delete $hash{$current};
 }
 @a = keys %hash;
 @a = sort(@a);
 return \@a;
}
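The left-to-right evaluation described above can be sketched with hashes instead of sorted arrays. This is not the article's code. The operator meanings are assumptions made for the example ("and" is intersection, "or" is union, "not" is difference) and do not follow the naming of the subroutines above. The names combine and evaluate, and the sample data, are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical postings: word => list of file numbers containing it.
my %files_for = (
    cat   => [1, 2, 3, 5],
    dog   => [2, 3, 4],
    sheep => [3],
);

# Combine two lists of file numbers according to a Boolean operator.
sub combine {
    my ($op, $left, $right) = @_;
    my %in_right = map { $_ => 1 } @$right;
    return [ grep {  $in_right{$_} } @$left ] if $op eq 'and';
    return [ grep { !$in_right{$_} } @$left ] if $op eq 'not';
    if ($op eq 'or') {
        my %seen;
        return [ grep { !$seen{$_}++ } @$left, @$right ];
    }
    die "unknown operator '$op'\n";
}

# Evaluate a flat (bracket-free) query from left to right, keeping a
# running list of the file numbers that still apply.
sub evaluate {
    my ($query, $index) = @_;
    my ($first, @rest) = split ' ', lc $query;
    my $result = $index->{$first} || [];
    while (my ($op, $word) = splice @rest, 0, 2) {
        die "operator '$op' needs a word\n" unless defined $word;
        $result = combine($op, $result, $index->{$word} || []);
    }
    return $result;
}

print "@{ evaluate('cat and dog not sheep', \%files_for) }\n";
```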

The only big problem left was dealing with brackets. The only solution I could come up with was to make the search term parsing code a subroutine that returned an array. This way when I came to some logic in brackets, I could send it back to the subroutine, from which I would get a list of file numbers, exactly as if the logic had been a single word. The main problem that I envisaged with this would be getting Perl to deal with expressions such as sheep and (dog not cat) and (camel not panther). How did I get Perl to match just the first set of brackets, or, if nested brackets were present, to match all the logic? Damian Conway has written an excellent module called Text::Balanced, which I was just about to start using, before the project specifications changed(!) and I was told we no longer needed to allow nested-bracket searching.
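For the curious, pulling out the first bracketed group with Text::Balanced might look something like the sketch below. It uses the module's extract_bracketed function; the wrapper first_group is a hypothetical name, and this is an illustration of the approach rather than code from the project.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Split a query at its first bracketed group.
# Returns (text before the group, group without its outer brackets, rest),
# or the whole query and two undefs if there is no balanced group.
sub first_group {
    my ($query) = @_;
    my $open = index($query, '(');
    return ($query, undef, undef) if $open < 0;

    # extract_bracketed expects the text to start at the opening bracket.
    my $tail = substr($query, $open);
    my ($group, $rest) = extract_bracketed($tail, '()');
    return ($query, undef, undef) unless defined $group && length $group;

    return (substr($query, 0, $open), substr($group, 1, -1), $rest);
}

my ($before, $inside, $after)
    = first_group('sheep and (dog not cat) and (camel not panther)');
print "before=[$before] inside=[$inside] after=[$after]\n";
```

The inside part can be handed back to the same parsing subroutine, and because extract_bracketed balances nested brackets, an inner group stays within its parent.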

Yet again, Perl came into its own when writing the search engine. The availability of solutions to my problems in the form of modules saved me a lot of time, and saved the task from being inundated with my own, rather bad, code. The ability to use Perl to quickly extract titles from HTML documents and strip HTML tags in very few lines of code also made my life far easier.

This Week on p5p 2001/04/08



Notes

Apologies for the delay in this week's summary; I was travelling this weekend, and didn't get back until Tuesday.

Perl 5.6.1 and Perl 5.7.1 Released!

The big news this week is that, as you probably already know by now, Perl 5.6.1 and Perl 5.7.1 are available. Get them from CPAN. Well done to Sarathy and Jarkko for getting these out of the door.

Robin Houston Left On ALL WEEK

Someone has just far too much free time on their hands. Robin Houston started the week by adding a warning to

    push @array;

This was slightly controversial, as it might introduce new warnings to old code (albeit old, broken code), and some people - myself included - wanted to see push default to $_ instead. That really would change the semantics of old code, however. Mark-Jason Dominus wondered whether anyone could write push @array by accident, or even have a program write it legitimately, but both Ronald and Jarkko confessed to having written it when they meant push @array, $_.

Then he moved over to hacking on B::Concise; he found an interesting optimisation where you have code like

@a = split(/pat/, $str);

Perl actually places the symbol table entry for @a inside the opcode for the regular expression, casting it into an OP as it goes. B::Concise wasn't aware of this peculiarity and thought that the op was, well, an op. Boom.

Next, he fixed up something with lexicals; the way lexicals are stored is quite complex, as he explains:

perl stores the names of lexical variables in SV structures. But of course it doesn't stop there. Oh no! Not only are the IVX and NVX slots used to hold scope information, but SvCUR is used for some nefarious purpose which I don't understand. (look in op.c/Perl_newASSIGNOP for (at least some of) this SvCUR chicanery. It seems to be connected with the mysterious variable PL_generation.)

B::Concise is making naive assumptions about perl's sanity again. Try this, for example:

perl -MO=Concise -e 'my @x; @x=2;' |less

The variable name ("@x") is followed by all kinds of weird binary data, because B::Concise believed perl when it gave the length of the string, not suspecting for a moment that the slot which ordinarily holds the length happens to be used for devil-worship in this situation.

This is, as Stephen McCamant explained, related to a clever optimization to allow both

    ($a, $b) = ($c, $d);

and

    ($a, $b) = ($b, $a);

to work - in the second case, Perl needs to use a temporary variable to swap the values over. It determines whether or not the left-hand and right-hand sides have elements in common by "marking" each element, and that's what the devil-worship was for.
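The difference is easy to see from the Perl side. In this small sketch (the sub names are made up for the example), the sequential version loses a value while the list assignment swaps correctly.

```perl
use strict;
use warnings;

# Sequential assignment clobbers the first value before it can be copied.
sub naive_swap {
    my ($x, $y) = @_;
    $x = $y;
    $y = $x;
    return ($x, $y);
}

# List assignment with shared elements: the right-hand side is copied
# before anything is stored, so the values really are exchanged.
sub list_swap {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x);
    return ($x, $y);
}

print join(' ', naive_swap(1, 2)), "\n";
print join(' ', list_swap(1, 2)), "\n";
```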

Robin realised that this meant that variables with very long names used in a list assignment would be truncated, so he fixed them up as well.

By now, however, Robin is unstoppable. He made B::Deparse give sensible output for variables which start with control characters such as $^WIDE_SYSTEM_CALLS. Realising this was a common problem, he made a B::GV::SAFENAME method to ensure the name was printable and converted all the B::* modules to use that. Good work!
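The idea behind such a method can be sketched in a few lines. The stand-alone helper below is hypothetical: it assumes an ASCII platform and only handles the control characters \cA to \cZ, and it is not the code that went into the B modules.

```perl
use strict;
use warnings;

# Hypothetical helper: make a variable name printable by turning a leading
# control character such as \cW into the two characters ^W.
sub printable_name {
    my ($name) = @_;
    # On ASCII, flipping bit 6 maps \cA..\cZ onto the letters A..Z.
    $name =~ s/^([\cA-\cZ])/'^' . chr(64 ^ ord $1)/e;
    return $name;
}

print printable_name("\cWIDE_SYSTEM_CALLS"), "\n";
```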

But no, it doesn't end here. He fixed up B::* to cope with IVs that were actually UVs, B::Deparse handling of regular expressions, "${foo}bar" and binmode, and made it warning-clean.

By the end of the week, he was pleased to report that you can now run t/base/lex.t through B::Deparse and back through Perl, and all tests pass. Wow.

More on glob

Benjamin Sugars tried to speed up glob. As you may or may not know, core's glob automagically loads up File::Glob and uses the glob function that it provides. Unfortunately, this also pulls in Exporter and all sorts of other modules. Benjamin rewrote File::Glob to avoid the compile-time dependencies on these modules, and fiddled the bit in the core (after a few false starts) which loads up the module to make it equivalent to use File::Glob (). He also documented load_module, which is the core function for magical module using.
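From user code, the empty import list looks like the sketch below: the module is loaded but nothing is imported, so its functions are called by their full names. The example assumes the bsd_glob function provided by current File::Glob releases; c_files_in is a hypothetical helper.

```perl
use strict;
use warnings;
use File::Glob ();              # load the module but import nothing
use File::Temp qw(tempdir);

# Hypothetical helper: list the .c files in a directory.
sub c_files_in {
    my ($dir) = @_;
    # Nothing was imported, so the function is called by its full name.
    return File::Glob::bsd_glob("$dir/*.c");
}

my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.c b.c notes.txt)) {
    open my $fh, '>', "$dir/$name" or die "cannot create $name: $!";
    close $fh;
}

my @c_files = c_files_in($dir);
print scalar(@c_files), " C files\n";
```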

Changes files in core

John Allen asked "does anyone else think it may be time to grant the voluminous ChangesX.XXX files in the standard perl distribution their own separately downloadable gzip file?" Well, as usual, some did and some didn't. The argument for keeping them is that nothing other than the Perl source tarball should be needed to understand the sources. Further, Jarkko pointed out that the Changes files describe patches that affect more than one file at once and wouldn't be sensibly documented in any single one of them.

Benjamin Sugars suggested moving them to a separate directory rather than getting rid of them altogether; Jarkko hinted that he wasn't going to go through the Perforce hell of moving files just for the sake of moving them.

John's eventual compromise was to keep them in the development sources for those people who want to hack on them, and remove them from the stable sources. This seems to suit everyone, but nobody said anything further.

How Magic and Ties Work

I provided a sample few sections from my forthcoming Perl Internals training course about how magic and ties work. Enjoy.

Various

Olaf Flebbe turned in EPOC fixes before 5.6.1 went out, and Craig Berry updated README.vms while Peter Prymmer updated all sorts of other VMS things; Gisle came up with a neat patch to let

    @foo = do { ... };

propagate array context to the do block.
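A quick way to see context reaching the inside of a do block is to put a context-sensitive function in it. The sketch below uses reverse, which reverses a list in list context and a string in scalar context; it shows the behaviour on a current perl, not the patch itself.

```perl
use strict;
use warnings;

# The context of the assignment is passed through to the last statement
# of the do block.
my @list   = do { reverse 'ab', 'cd' };   # list context: reverse the list
my $string = do { reverse 'ab', 'cd' };   # scalar context: reverse the string

print "@list\n";
print "$string\n";
```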

Paul Schinder came up with some OS X fixes for 5.6.1; apparently OS X's gcc isn't very gcc, and this caused the preprocessor to do weird things which broke Errno.

John Peacock fixed up some bugs in Math::BigFloat found by Tom Roche but the only thanks he got was a discussion of the indentation style of the Perl sources. 8-wide tabs, 4 spaces, people. (Russ says we should use spaces instead of tabs, but it's a bit late now.)

Jonathan Stowe tried to change something about $#; Sarathy suggested that this might break some ten year old code, even though the bug's been in there since forever...

The TIESLICE/STORESLICE issue came back. The discussion produced more light than heat.

And finally... Dan Brian explained what happens if you cross Black and White with Perl5-Porters.

Until next week I remain, your humble and obedient servant,


Simon Cozens

Apocalypse 1: The Ugly, the Bad, and the Good

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 01 for the latest information.


Table of Contents

RFC 141: This Is The Last Major Revision

RFC 28: Perl should stay Perl

RFC 16: Keep default Perl free of constraints such as warnings and strict

RFC 73: All Perl core functions should return objects

RFC 26: Named operators versus functions

People get scared when they hear the word Apocalypse, but here I mean it in the good sense: a Revealing. An Apocalypse is supposed to reveal good news to good people. (And if it also happens to reveal bad news to bad people, so be it. Just don't be bad.)

What I will be revealing in these columns will be the design of Perl 6. Or more accurately, the beginnings of that design, since the design process will certainly continue after I've had my initial say in the matter. I'm not omniscient, rumors to the contrary notwithstanding. This job of playing God is a little too big for me. Nevertheless, someone has to do it, so I'll try my best to fake it. And I'll expect all of you to help me out with the process of creating history. We all have to do our bit with free will.

If you look at the history of Perl 6 up to this point, you will see why this column is subtitled The Ugly, the Bad, and the Good. The RFC process of last year was ugly, in a good sense. It was a brainstorming process, and that means it was deliberately ugly: not in the sense of incivility, since the RFC process was in fact surprisingly civil, but in the sense that there was little coherent design to the suggestions in the RFCs. Frankly, the RFCs are all over the map, without actually covering the map. There are contradictory RFCs, and there are missing RFCs. Many of the RFCs propose real problems but go off at funny angles in trying to propose solutions. Many of them patch symptoms without curing the underlying ailments.

I also discovered Larry's First Law of Language Redesign: Everyone wants the colon.

That was the Ugly part. The Bad part was that I was supposed to take these RFCs and produce a coherent design in two weeks. I started out thinking I could just classify the RFCs into the good, bad, and ugly categories, but somehow most of them ended up in the ugly category, because the good ones typically had something wrong with them, and even the bad ones typically indicated a problem that could use some thought, even if the solution was totally bogus.

It is now five months later, and I've been mulling over coherence the whole time, for some definition of mulling. Many of you know what happens when the size of your Perl process exceeds the size of your physical memory—you start thrashing. Well, that's basically what happened to me. I couldn't get enough of the problem into my head at once to make good progress, and I'm not actually very good at subdividing problems. My forte is synthesis, not analysis. It didn't help that I had a number of distractions in my life, some of them self-inflicted, and some of them not. I won't go into all that. Save it for my unauthorized autobiography.

But now we come to the Good part. (I hope.) After thinking lots and lots about many of the individual RFCs, and not knowing how to start thinking about them as a whole, it occurred to me (finally!) that the proper order to think about things was, more or less, the order of the chapters in the Camel Book. That is, the Camel Book's order is designed to minimize forward references in the explanation of Perl, so considering Perl 6 in roughly the same order will tend to reduce the number of things that I have to decide before I've decided them.

So I've merrily classified all the RFCs by chapter number, and they look much more manageable now. (I also restructured my email so that I can look at a slice of all the messages that ever talked about a particular RFC, regardless of which mailing list the message was on. That's also a big help.) I intend to produce one Apocalypse for each Chapter, so Apocalypse 1 corresponds to Chapter 1: An Overview of Perl. (Of course, in the book, the Overview is more like a small tutorial, not really a complete analysis of the philosophical underpinnings of Perl. Nevertheless, it was a convenient place to classify those RFCs that talk about Perl 6 on that level.)

So today I'm talking about the following RFCs:

    RFC  PSA  Title
    ---  ---  -----
     16  bdb  Keep default Perl free of constraints such as warnings and strict.
     26  ccb  Named operators versus functions
     28  acc  Perl should stay Perl.
     73  adb  All Perl core functions should return objects
    141  abr  This Is The Last Major Revision

The PSA rating stands for ``Problem, Solution, Acceptance''. The problem and solution are graded on an a-f scale, and very often you'll find I grade the problem higher than the solution. The acceptance rating is one of

    a  Accepted wholeheartedly
    b  Accepted with a few "buts"
    c  Accepted with some major caveats
    r  Rejected

I might at some point add a ``d'' for Deferred, if I really think it's too soon to decide something.

RFC 141: This Is The Last Major Revision

I was initially inclined to accept this RFC, but decided to reject it on theological grounds. In apocalyptic literature, 7 is the number representing perfection, while 6 is the number representing imperfection. In fact, we probably wouldn't end up converging on a version number of 2*PI as the RFC suggests, but rather on 6.6.6, which would be rather unfortunate.

So Perl 7 will be the last major revision. In fact, Perl 7 will be so perfect, it will need no revision at all. Perl 6 is merely the prototype for Perl 7. :-)

Actually, I agree with the underlying sentiment of the RFC—I only rejected it for the entertainment value. I want Perl to be a language that can continue to evolve to better fit the problems people want to solve with it. To that end, I have several design goals that will tend to be obscured if you just peruse the RFCs.

First, Perl will support multiple syntaxes that map onto a single semantic model. Second, that single semantic model will in turn map to multiple platforms.

Multiple syntaxes sound like an evil thing, but they're really necessary for the evolution of the language. To some extent we already have a multi-syntax model in Perl 5; every time you use a pragma or module, you are warping the language you're using. As long as it's clear from the declarations at the top of the module which version of the language you're using, this causes little problem.

A particularly strong example of how support of multiple syntaxes will allow continued evolution is the migration from Perl 5 to Perl 6 itself. See the discussion of RFC 16 below.

Multiple backends are a necessity of the world we live in today. Perl 6 must not be limited to running only on platforms that can be programmed in C. It must be able to run in other kinds of virtual machines, such as those supported by Java and C#.

RFC 28: Perl should stay Perl.

It is my fond hope that those who are fond of Perl 5 will be fonder still of Perl 6. That being said, it's also my hope that Perl will continue trying to be all things to all people, because that's part of Perl too.

While I accept the RFC in principle (that is, I don't intend to go raving mad), I have some major caveats with it, because I think it is needlessly fearful that any of several programming paradigms will ``take over'' the design. This is not going to happen. Part of what makes Perl Perl is that it is intentionally multi-paradigmatic. You might say that Perl allows you to be paradigmatic without being ``paradogmatic''.

The essence of Perl is really context sensitivity, not just to syntactic context, but also to semantic, pragmatic, and cultural context. This overall philosophy is not going to change in Perl 6, although specific context sensitivities may come and go. Some of the current context sensitivities actually prevent us from doing a better job of it in other areas. By intentionally breaking a few things, we can make Perl understand what we mean even better than it does now.

As a specific example, there are various ways things could improve if we muster the courage to break the ``weird'' relationship between @foo and $foo[]. True, we'd lose the current slice notation (it can be replaced with something better, I expect). But by consistently treating @foo as an utterance that in scalar context returns an array reference, we can make subscripts always take an array reference, which among other things fixes the botch that in Perl 5 requires us to distinguish $foo[] from $foo->[]. There will be more discussion of this in Apocalypse 2, when we'll dissect ideas like RFC 9: Highlander Variable Types.

RFC 16: Keep default Perl free of constraints such as warnings and strict.

I am of two minds about this debate—there are good arguments for both sides. And if you read through the discussions, all those arguments were forcefully made, repeatedly. The specific discussion centered around the issue of strictness, of course, but the title of the RFC claims a more general philosophical position, and so it ended up in this Apocalypse.

I'll talk about strictness and warnings in a moment, and I'll also talk about constraints in general, but I'd like to take a detour through some more esoteric design issues first. To my mind, this RFC (and the ones it is reacting against), are examples of why some language designer like me has to be the one to judge them, because they're all right, and they're all wrong, simultaneously. Many of the RFCs stake out polar positions and defend them ably, but fail to point out possible areas of compromise. To be sure, it is right for an RFC to focus in on a particular area and not try to do everything. But because all these RFCs are written with (mostly) the design of Perl 5 in mind, they cannot synthesize compromise even where the design of Perl 6 will make it mandatory.

To me, one of the overriding issues is whether it's possible to translate Perl 5 code into Perl 6 code. One particular place of concern is in the many one-liners embedded in shell scripts here and there. There's no really good way to translate those invocations, so requiring a new command line switch to set ``no strict'' is not going to fly.

A closely related question is how Perl is going to recognize when it has accidentally been fed Perl 5 code rather than Perl 6 code. It would be rather bad to suddenly give working code a brand new set of semantics. The answer, I believe, is that it has to be impossible by definition to accidentally feed Perl 5 code to Perl 6. That is, Perl 6 must assume it is being fed Perl 5 code until it knows otherwise. And that implies that we must have some declaration that unambiguously declares the code to be Perl 6.

Now, there are right ways to do this, and wrong ways. I was peeved by the approach taken by DEC when they upgraded BASIC/PLUS to handle long variable names. Their solution was to require every program using long variable names to use the command EXTEND at the top. So henceforth and forevermore, every BASIC/PLUS program had EXTEND at the top of it. I don't know whether to call it Bad or Ugly, but it certainly wasn't Good.

A better approach is to modify something that would have to be there anyway. If you go out to CPAN and look at every single module out there, what do you see at the top? Answer: a ``package'' declaration. So we break that.

I hereby declare that a package declaration at the front of a file unambiguously indicates you are parsing Perl 5 code. If you want to write a Perl 6 module or class, it'll start with the keyword module or class. I don't know yet what the exact syntax of a module or a class declaration will be, but one thing I do know is that it'll set the current global namespace much like a package declaration does.

Now with one fell swoop, much of the problem of programming in the large can be dealt with simply by making modules and classes default to strict, with warnings. But note that the default in the main program (and in one liners) is Perl 5, which is non-strict by definition. We still have to figure out how Perl 6 main programs should distinguish themselves from Perl 5 (with a ``use 6.0'' maybe?), and whether Perl 6 main programs should default to strict or not (I think not), but you can already see that a course instructor could threaten to flunk anyone who doesn't put ``module Main'' at the front of each program, and never actually tell their pupils that they want that because it turns on strictures and warnings.

Other approaches are possible, but that leads us to a deeper issue, which is the issue of project policy and site policy. People are always hankering for various files to be automatically read in from various locations, and I've always steadfastly resisted that because it makes scripts implicitly non-portable. However, explicit non-portability is okay, so there's no reason our hypothetical class instructor could not insist that programs start with a ``use Policy;'' or some such.

But now again we see how this leads to an even deeper language design issue. The real problem is that it's difficult to write such a Policy module in Perl 5, because it's really not a module but a meta-module. It wants to do ``use strict'' and ``use warnings'' on behalf of the student, but it cannot do so. Therefore one thing we must implement in Perl 6 is the ability to write meta-use statements that look like ordinary use statements but turn around and declare other things on behalf of the user, for the good of the user, or of the project, or of the site. (Whatever. I'm not a policy wonk.)

So whether I agree with this RFC really depends on what it means by ``default''. And like Humpty Dumpty, I'll just make it mean whatever I think is most convenient. That's context sensitivity at work.

I also happen to agree with this RFC because it's my philosophical position that morality works best when chosen, not when mandated. Nevertheless, there are times when morality should be strongly suggested, and I think modules and classes are a good place for that.

RFC 73: All Perl core functions should return objects

I'm not sure this belongs in the overview, but here it is nonetheless. In principle, I agree with the RFC. Of course, if all Perl variables are really objects underneath, this RFC is trivially true. But the real question is how interesting of an object you can return for a given level of performance. Perl 5's objects are relatively heavyweight, and if all of Perl 6's objects are as heavy, things might bog down.

I'm thinking that the solution is better abstract type support for data values that happen to be represented internally by C structs. We get bogged down when we try to translate a C struct such as a struct tm into an actual hash value. On the other hand, it's rather efficient to translate a struct tm to a struct tm, since it's a no-op. We can make such a struct look like a Perl object, and access it efficiently with attribute methods as if it were a ``real'' object. And the typology will (hopefully) mostly only impose an abstract overhead. The biggest overhead will likely be memory management of a struct over an int (say), and that overhead could go away much of the time with some amount of contextually aware optimization.

In any event, I just want to point out that nobody should panic when we talk about making things return objects that didn't used to return them. Remember that any object can define its stringify and numify overloadings to do whatever the class likes, so old code that looks like

    print scalar localtime;

can continue to run unchanged, even though localtime might be returning an object in scalar context.
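Perl 5's overload pragma already allows a rough imitation of this. The sketch below uses a made-up class, My::Time, whose objects print like the old scalar localtime string but also offer a method. It illustrates the stringify and numify idea only and says nothing about how Perl 6 will actually do it.

```perl
use strict;
use warnings;

package My::Time;   # hypothetical class, not a real module

use overload
    '""' => sub { scalar localtime $_[0]{epoch} },  # stringify like Perl 5
    '0+' => sub { $_[0]{epoch} },                   # numify to epoch seconds
    fallback => 1;

sub new {
    my ($class, $epoch) = @_;
    return bless { epoch => defined $epoch ? $epoch : time }, $class;
}

sub year {
    my ($self) = @_;
    return (localtime $self->{epoch})[5] + 1900;
}

package main;

my $now = My::Time->new;
print $now, "\n";          # old-style code still prints a date string
print $now->year, "\n";    # newer code can ask the object questions
```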

RFC 26: Named operators versus functions

Here's another RFC that's here because I couldn't think of a better place for it.

I find this RFC somewhat confusing because the abstract seems to suggest something more radical than the description describes. If you ignore the abstract, I pretty much agree with it. It's already the case in Perl 5 that we distinguish operators from functions primarily by how they are called, not by how they are defined. One place where the RFC could be clarified is that Perl 5 distinguishes two classes of named operators: named unary operators vs list operators. They are distinguished because they have different precedence. We'll discuss precedence reform under Apocalypse 3, but I doubt we'll combine the two kinds of named operators. (As a teaser, I do see ways of simplifying Perl's precedence table from 24 levels down to 18 levels, albeit with some damage to C compatibility in the less frequently used ops. More on that later.)
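The precedence difference between the two kinds of named operators can be seen in Perl 5 with a few throwaway expressions. This is an illustration chosen for the example, with length standing in for the named unary operators and join for the list operators.

```perl
use strict;
use warnings;

# length is a named unary operator: it binds tighter than comparison
# operators but looser than arithmetic.
my $unary_cmp = (length 10 < 5) ? 'yes' : 'no';   # (length 10) < 5
my $unary_add = length 10 + 5;                    # length(10 + 5)

# join is a list operator: it takes everything to its right as arguments.
my $list_op = join '-', 1, 2 == 2;                # join('-', 1, (2 == 2))

print "$unary_cmp $unary_add $list_op\n";
```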


Do you begin to see why my self-appointed job here is much larger than just voting RFCs up or down? There are many big issues to face that simply aren't covered by the RFCs. We have to decide how much of our culture is just baggage to be thrown overboard, and how much of it is who we are. We have to smooth out the migration from Perl 5 to Perl 6 to prevent people from using that as an excuse not to adopt Perl 6. And we have to stare at all those deep issues until we see through them down to the underlying deeper issues, and the issues below that. And then in our depths of understanding, we have to keep Perl simple enough for anyone to pick up and start using to get their job done right now.

Stay tuned for Apocalypse 2, wherein we will attempt to vary our variables, question our quotes, recontextualize our contexts, and in general set the lexical stage for everything that follows.

This Week on p5p 2001/04/01



Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

There were 446 messages this week.

Perl and HTML::Parser

Gisle complained that a recent snapshot of Perl broke HTML::Parser. Apparently, his code did something like

    sv_catpvf(sv, "%c", 128);

and Perl upgraded the SV to UTF8, which caused confusion when his C-level code then looked at the real value of the string. Jarkko asked why Gisle's code cared about the representation of the string, but it seemed like Gisle expected it to be non-UTF8. (I could argue that this was Perl's fault, and I could argue that it was Gisle's.) Sarathy warned ominously: "We need to tread very carefully here, or 5.8.0 might break a lot of XS code out there." Nick I-S pointed out the handy-looking SvPVbyte macro which returned a non-UTF8 version of the string's contents, plus another way of doing things which was actually backwards compatible.
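
The distinction between a string's value and its internal representation can be sketched from the Perl side. This example assumes the utf8:: utility functions found in later Perl releases, which were not part of the discussion above; utf8::downgrade is only roughly analogous to what SvPVbyte does at the C level.

```perl
use strict;
use warnings;

my $bytes = chr(128);          # one character, stored as one byte
my $wide  = chr(128);
utf8::upgrade($wide);          # same character, now stored internally as UTF-8

print "same string\n"  if $bytes eq $wide;
print "flag differs\n" if utf8::is_utf8($wide) && !utf8::is_utf8($bytes);

# Encode a copy to look at the octets the UTF-8 form occupies.
my $octets = $wide;
utf8::encode($octets);
printf "UTF-8 form is %d octets\n", length($octets);

# Downgrading restores the single-byte representation.
utf8::downgrade($wide);
```

Perl-level code sees the two strings as equal; only code that peeks at the raw buffer, as C-level XS code does, notices the difference.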

Autoloading Errno

Last week, we covered the fact that using %! should autoload the Errno module but at the time, it failed to. Robin Houston fixed that, with another quotable analysis:

I must admit that I'm slightly dubious as to the wisdom of doing this. It's not needed for compatibility (it's never worked), and any code which uses %! could simply put " use Errno;" at the top.

The intention, presumably, is that code which doesn't make use of %! shouldn't have to incur the penalty of loading [Errno.pm].

Currently cleverness only takes place when a glob is created. So, if you use a hash called %^E then the magical scalar $^E is set up, even though you don't use it.

In this case though, we want Errno.pm to be loaded only if %! is used. Loading the damned thing for every script which uses $! would be Bad.

The upshot of this all is that an extra test has to be inserted into the code which deals with creating new stash variables (not just the first variable of the particular glob). Even a marginal slowdown like this doesn't seem worth the insignificant benefit of not having to load Errno yourself.

However, Sarathy commented that the intention was simplicity and transparency; the %! language feature should be implemented in a manner transparent to the end user, just like the loading of the Unicode swash tables. "Besides," he concluded, "there is probably no precedent for forcing people to load a non-pragma to enable a language feature."

Jarkko looked slightly guilty. "Ummm, well, in other news, I may have just created one", he said, referring to the new ability to export IO::Socket::atmark to sockatmark.

Robin also added some more reporting to B::Debug, and fixed up a parenthesis bug in B::Deparse.
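
To make the %! behaviour discussed above concrete, here is a small sketch that assumes a Perl in which the autoloading works. The script never says "use Errno;" itself, and the file name is a made-up one inside a fresh temporary directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# No explicit "use Errno;" anywhere in this script.
my $dir     = tempdir(CLEANUP => 1);
my $missing = "$dir/no-such-file";   # hypothetical name; never created

my $saw_enoent = 0;
if (!open my $fh, '<', $missing) {
    # $!{ENOENT} is true when $! currently holds ENOENT.
    $saw_enoent = $!{ENOENT} ? 1 : 0;
}

print "open failed with ENOENT\n"       if $saw_enoent;
print "Errno.pm was loaded implicitly\n" if exists $INC{'Errno.pm'};
```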

Math::Big*

Tels and John Peacock have been working together to rewrite Math::BigInt and Math::BigFloat. Their version is on CPAN. Jarkko seems understandably a little hesitant about replacing the in-core version with this one; we're assured that it will be backwards compatible (minus bugfixes, naturally), but obviously a complete rewrite isn't mature enough to be considered for core yet.

pack and unpack

Someone asked a (non-maintenance) question about pack and unpack which MJD dealt with; I took this as a cue to show my current work on a perlpacktut. A few people produced useful suggestions for that, which I'll get finished when the next consignment of tuits arrives. There was a short diversion about what an asciz string was (see the documentation for the "Z" pack format); it's actually a C-style null-terminated string.
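
For the curious, a quick sketch of null-terminated strings with pack and unpack, using the "Z" template letter. The sample strings are arbitrary.

```perl
use strict;
use warnings;

my $packed = pack("Z*", "hello");             # "hello" plus a trailing NUL
my $padded = pack("Z8", "hi");                # NUL-padded out to 8 bytes
my $name   = unpack("Z*", "perl\0garbage");   # stops at the first NUL

printf "%d %d %s\n", length($packed), length($padded), $name;   # 6 8 perl
```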

Taint testing

For some reason, the usual way to detect taintedness in the test suite seems to be

    eval { $possibly_tainted, kill 0; 1 }

Nick Clark thinks this sucks, but it's a bit too late to change it now.

Of course, MacPerl doesn't have kill so Chris found that his test suite was going horribly wrong. He had a number of violent suggestions to fix this up, including having kill be a no-op which died on tainted data. MJD suggested that kill should do what it does already but be a no-op if it's passed a 0. The eventual solution was to have it return 0 but check for tainted data. He also hinted that this may be the precursor to Win32-like pseudoprocesses.
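
For readers who haven't met the idiom, here it is wrapped in a helper. This is a sketch only: the name is_tainted is our own, and the behaviour under -T is what the idiom is expected to give on a platform with a working kill rather than something verified here. Run without -T, nothing is ever reported as tainted.

```perl
use strict;
use warnings;

# Wrap the test-suite idiom in a helper (the name is_tainted is ours).
sub is_tainted {
    no warnings 'void';     # the joined string is deliberately discarded
    local $@;
    # Under -T, kill refuses to run in a statement that touched tainted
    # data, so the eval dies and yields undef; otherwise it yields 1.
    return !eval { join('', @_), kill 0; 1 };
}

my $clean = "literal strings are never tainted";
printf "taint mode: %s\n", ${^TAINT} ? "on" : "off";
printf "clean value tainted? %s\n", is_tainted($clean) ? "yes" : "no";
printf "PATH tainted? %s\n", is_tainted($ENV{PATH} // '') ? "yes" : "no";
```

The trick relies on kill 0 with no process list doing nothing at all, which is exactly why a platform without kill needs a stand-in. Later Perls ship Scalar::Util's tainted function for the same job.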

Various

Benjamin Sugars was at it again. He fixed a bug in socket which leaked file descriptors, wrote a test suite for Cwd, joined the bug admin team, patched up B::Terse and File::* to be warnings-happy, and produced another version of his XS Cwd module. He didn't document references in @INC, though, so he doesn't get the gold star.

I zapped OPpRUNTIME, a flag that was set but never tested!

Stephen McCamant produced a couple of optimizations to peep(), the optimizer.

Thomas Wegner and Chris Nandor fixed up File::Glob for MacOS.

Jarkko floated the idea of a FETCHSLICE and STORESLICE for tied hashes and arrays to avoid multiple FETCH/STORE operations; there was a little discussion about the syntax:

    STORESLICE($thing, $nkeys, $nvalues, @keys, @values)

would be more efficient but less user-friendly than

    STORESLICE($thing, \@keys, \@values)

but no implementation as yet.
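
The cost that FETCHSLICE and STORESLICE would avoid is easy to see with a tied hash that counts its method calls. A rough sketch follows; CountingHash is a hypothetical class made up for this example, built on the standard Tie::StdHash.

```perl
use strict;
use warnings;

package CountingHash;   # hypothetical class name, for illustration only
require Tie::Hash;
our @ISA   = ('Tie::StdHash');
our %calls = (FETCH => 0, STORE => 0);

sub FETCH { my $self = shift; $calls{FETCH}++; return $self->SUPER::FETCH(@_) }
sub STORE { my $self = shift; $calls{STORE}++; return $self->SUPER::STORE(@_) }

package main;

tie my %h, 'CountingHash';
@h{qw(a b c)} = (1, 2, 3);       # slice assignment
my @values = @h{qw(a b c)};      # slice fetch

print "STORE calls: $CountingHash::calls{STORE}\n";
print "FETCH calls: $CountingHash::calls{FETCH}\n";
```

Every key in a slice costs its own method call, which hurts when each call is a round trip to a database or a remote server.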

Schwern asked if we were going to document the fact that ref qr/Foo/ returns "Regexp". Everyone went very quiet.
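
For anyone who hasn't tried it, the behaviour in question looks like this (a trivial sketch):

```perl
use strict;
use warnings;

my $re = qr/Foo/;
print ref($re), "\n";                   # Regexp
print "matched\n" if "FooBar" =~ $re;   # the object still works as a pattern
```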

Mark-Jason Dominus tried to introduce a new operator, epochtime, which returns the time of the system epoch; for instance, one could use

    localtime(epochtime())

to portably find out the date of the system epoch, allowing you to write epoch-independent code. Jarkko rejected the patch on the grounds that it was not sufficiently portable.

Until next week, then, squawk,


Simon Cozens