July 2002 Archives

Improving mod_perl Sites' Performance: Part 4

Introduction

If your OS supports sharing of memory (and most sane systems do), you might save a lot of RAM by sharing it between child processes. This will allow you to run more processes and hopefully better satisfy the client, without investing extra money into buying more memory.

This is only possible when you preload code at server startup. However, during a child process' life, its memory pages tend to become unshared. There is no way we can make Perl allocate memory so that (dynamic) variables land on different memory pages from constants, so the copy-on-write effect will hit you almost at random.

If you are pre-loading many modules, you might be able to trade off the memory that stays shared against the time for an occasional fork by tuning MaxRequestsPerChild. Each time a child reaches this upper limit and dies, it should release its unshared pages. The new child which replaces it will share its fresh pages until it scribbles on them.

The ideal is a point where your processes usually restart before too much memory becomes unshared. You should take some measurements to see if it makes a real difference, and to find the range of reasonable values. If you have success with this, tuning the value of MaxRequestsPerChild will probably be peculiar to your situation and may change with changing circumstances.

It is very important to understand that your goal is not to push MaxRequestsPerChild up to 10000. Having a child serve 300 requests on precompiled code is already a huge overall speedup, so whether it is 100 or 10000 probably does not matter much if you can save RAM by using a lower value.
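
For example, a hedged starting point in httpd.conf (300 is just the figure mentioned above as already giving a huge speedup; treat it as a placeholder and tune it against your own measurements):

  MaxRequestsPerChild 300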

Do not forget that if you preload most of your code at server startup, the newly forked child gets ready very quickly, because it inherits most of the preloaded code and the Perl interpreter from the parent process.

During the life of the child, its memory pages (which aren't really its own to start with, it uses the parent's pages) gradually get `dirty' - variables which were originally inherited and shared are updated or modified -- and the copy-on-write happens. This reduces the number of shared memory pages, thus increasing the memory requirement. Killing the child and spawning a new one allows the new child to get back to the pristine shared memory of the parent process.

The recommendation is that MaxRequestsPerChild should not be too large, otherwise you lose some of the benefit of sharing memory.

How Shared Is My Memory?

You've probably noticed that the word shared is repeated many times in relation to mod_perl. Indeed, shared memory might save you a lot of money, since with sharing in place you can run many more servers than without it.

How much shared memory do you have? You can see it by either using the memory utility that comes with your system or you can deploy the GTop module:


  use GTop ();
  print "Shared memory of the current process: ",
    GTop->new->proc_mem($$)->share,"\n";
  
  print "Total shared memory: ",
    GTop->new->mem->share,"\n";

When you watch the output of the top utility, don't confuse the RES (or RSS) columns with the SHARE column. RES is RESident memory, which is the size of pages currently swapped in.

Calculating Real Memory Usage

I have shown how to measure the size of the process' shared memory, but we still want to know what the real memory usage is. Obviously this cannot be calculated simply by adding up the memory size of each process because that wouldn't account for the shared memory.

On the other hand we cannot just subtract the shared memory size from the total size to get the real memory usage numbers, because in reality each process has a different history of processed requests, therefore the shared memory is not the same for all processes.

So how do we measure the real memory size used by the server we run? It's probably too difficult to give an exact number, but I've found a way to get a fair approximation, and I verified it in the following way: I calculated the real memory used by the technique you will see in a moment, then stopped the Apache server and saw that the reported total memory usage went down by almost the same number I had calculated. Note that some OSs do smart caching of memory pages, so you may not see the memory usage decrease as soon as it actually happens when you quit the application.

Here is the technique I've used:

  1. For each process, sum up the difference between its total (reported) and shared memory. To calculate the difference for a single process use:
    
      use GTop;
      my $proc_mem = GTop->new->proc_mem($$);
      my $diff     = $proc_mem->size - $proc_mem->share;
      print "Difference is $diff bytes\n";
  2. Now if we add to that the shared memory size of the process with the largest share, we will get all the memory that actually is being used by all httpd processes, except for the parent process.
  3. Finally, add the size of the parent process.

Please note that this might be incorrect for your system, so use this number at your own risk.

I've used this technique to display real memory usage in the module Apache::VMonitor (see the previous article), so instead of trying to calculate this number manually, you can use that module to do it automatically. In fact, in the calculations used in this module there is no separation between the parent and child processes; they are all counted alike using the following code:


  use GTop ();
  my $gtop = GTop->new;
  my $total_real = 0;
  my $max_shared = 0;
  # @mod_perl_pids is initialized by Apache::Scoreboard,
  # irrelevant here
  my @mod_perl_pids = some_code();
  for my $pid (@mod_perl_pids) {
    my $proc_mem = $gtop->proc_mem($pid);
    my $size     = $proc_mem->size;
    my $share    = $proc_mem->share;
    $total_real += $size - $share;
    $max_shared  = $share if $max_shared < $share;
  }
  $total_real += $max_shared;

As you can see, we accumulate the difference between the reported and shared memory:


    $total_real  += $size-$share;

and at the end add the largest shared memory size:


  $total_real += $max_shared;

So now $total_real holds a fair approximation of the memory really being used.

Are My Variables Shared?

How do you find out if the code you write is shared between the processes or not? The code should be shared, except where it is on a memory page with variables that change. Some variables are read-only in usage and never change: for example, variables that use a lot of memory and that you intend to treat as read-only. As you know, a variable becomes unshared as soon as the process modifies its value.

So imagine that you have a 10MB in-memory database that resides in a single variable; you perform various operations on it and want to make sure that the variable stays shared. For example, if you run some regular expression (regex) matching on this variable and want to use the pos() function, will that make the variable unshared or not?

The Apache::Peek module comes to the rescue. Let's write a module called MyShared.pm which we preload at server startup, so all the variables of this module are initially shared by all children.


  MyShared.pm
  ---------
  package MyShared;
  use Apache::Peek;
  
  my $readonly = "Chris";
  
  sub match    { $readonly =~ /\w/g;               }
  sub print_pos{ print "pos: ",pos($readonly),"\n";}
  sub dump     { Dump($readonly);                  }
  1;

This module declares the package MyShared, loads the Apache::Peek module and defines the lexically scoped $readonly variable which is supposed to be a variable of large size (think about a huge hash data structure), but we will use a small one to simplify this example.

The module also defines three subroutines: match() that does a simple character matching, print_pos() that prints the current position of the matching engine inside the string that was last matched and finally the dump() subroutine that calls the Apache::Peek module's Dump() function to dump a raw Perl data-type of the $readonly variable.

Here is the script that prints the process ID (PID) and calls all three functions. The goal is to check whether pos() makes the variable dirty and therefore unshared.


  share_test.pl
  -------------
  use MyShared;
  print "Content-type: text/plain\r\n\r\n";
  print "PID: $$\n";
  MyShared::match();
  MyShared::print_pos();
  MyShared::dump();

Before you restart the server, in httpd.conf set:


  MaxClients 2

for easier tracking. You need at least two servers to compare the printouts of the test program. Having more than two would make the comparison harder.

Now open two browser windows and issue requests for this script several times in both windows, so that you get different process PIDs reported in the two windows and each process has served a different number of requests to the share_test.pl script.

In the first window you will see something like this:


  PID: 27040
  pos: 1
  SV = PVMG(0x853db20) at 0x8250e8c
    REFCNT = 3
    FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
    IV = 0
    NV = 0
    PV = 0x8271af0 "Chris"\0
    CUR = 5
    LEN = 6
    MAGIC = 0x853dd80
      MG_VIRTUAL = &vtbl_mglob
      MG_TYPE = 'g'
      MG_LEN = 1

And in the second window:


  PID: 27041
  pos: 2
  SV = PVMG(0x853db20) at 0x8250e8c
    REFCNT = 3
    FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
    IV = 0
    NV = 0
    PV = 0x8271af0 "Chris"\0
    CUR = 5
    LEN = 6
    MAGIC = 0x853dd80
      MG_VIRTUAL = &vtbl_mglob
      MG_TYPE = 'g'
      MG_LEN = 2

We see that all the addresses of the supposedly big structure are the same (0x8250e8c and 0x8271af0); therefore the variable's data structure is almost completely shared. The only difference is in the SV.MAGIC.MG_LEN record, which is not shared.

So even if the $readonly variable were a big one, its value would still be shared between the processes, while only a small part of the variable's data structure is unshared. And that part is almost insignificant because it takes very little memory.

Now, if you need to compare more than one variable, doing it by hand can be quite time consuming and error-prone. Therefore it's better to modify the testing script to dump the Perl data-types into files (e.g., /tmp/dump.$$, where $$ is the PID of the process) and then use the diff(1) utility to see whether there is any difference.

Modifying the dump() function to write the info to a file will do the job. Notice that I use Devel::Peek and not Apache::Peek. The two are almost the same, but Apache::Peek prints its output directly to the opened socket, so I cannot intercept and redirect the result to a file. Since Devel::Peek dumps its results to the STDERR stream, I can use the old trick of saving away the default STDERR handle and opening a new filehandle in its place. In our example, when Devel::Peek now prints to STDERR, it actually prints to our file. When I'm done, I make sure to restore the original STDERR filehandle.

So this is the resulting code:


  MyShared2.pm
  ---------
  package MyShared2;
  use Devel::Peek;
  
  my $readonly = "Chris";
  
  sub match    { $readonly =~ /\w/g;               }
  sub print_pos{ print "pos: ",pos($readonly),"\n";}
  sub dump{
    my $dump_file = "/tmp/dump.$$";
    print "Dumping the data into $dump_file\n";
    open OLDERR, ">&STDERR";
    open STDERR, ">".$dump_file or die "Can't open $dump_file: $!";
    Dump($readonly);
    close STDERR;
    open STDERR, ">&OLDERR";
  }
  1;

Now I modify the script to use the modified module:


  share_test2.pl
  -------------
  use MyShared2;
  print "Content-type: text/plain\r\n\r\n";
  print "PID: $$\n";
  MyShared2::match();
  MyShared2::print_pos();
  MyShared2::dump();

Running it as before (with MaxClients 2) creates two dump files in the directory /tmp; in our test these were /tmp/dump.1224 and /tmp/dump.1225. Now I run diff(1):


  % diff /tmp/dump.1224 /tmp/dump.1225
  12c12
  <       MG_LEN = 1
  ---
  >       MG_LEN = 2

We see that the two dumps of the $readonly variable differ only in the MG_LEN record, just as we observed before in the manual comparison.

In fact, if we think about these results again, we come to the conclusion that there is no need for two processes to find out whether the variable gets modified (and therefore unshared). It's enough to check the data structure before and after the script's code is executed. You can modify the MyShared2 module to dump the data structure into a different file after each invocation and then run diff(1) on the two files, as in the sketch below.
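
Here is a hedged sketch of such a single-process test; MyShared3 and dump_to() are made-up names, and the file layout just follows the earlier examples:

  MyShared3.pm
  ------------
  package MyShared3;
  use Devel::Peek;

  my $readonly = "Chris";

  sub match { $readonly =~ /\w/g; }

  # dump the guts of $readonly into /tmp/dump.<pid>.<suffix>
  sub dump_to {
      my $suffix    = shift;
      my $dump_file = "/tmp/dump.$$.$suffix";
      open OLDERR, ">&STDERR";
      open STDERR, ">".$dump_file or die "Can't open $dump_file: $!";
      Dump($readonly);
      close STDERR;
      open STDERR, ">&OLDERR";
  }
  1;

The test script then dumps once before and once after the matching call:

  MyShared3::dump_to("before");
  MyShared3::match();
  MyShared3::dump_to("after");
  # now compare the two /tmp/dump.<pid>.* files with diff(1)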

If you want to watch whether some lexically scoped (with my()) variables in your Apache::Registry script get changed between invocations inside the same process, you can use the Apache::RegistryLexInfo module instead, since it does exactly this: it makes a snapshot of the padlist before and after the code execution and shows the difference between the two. This module was written to work with Apache::Registry scripts, so it won't work for loaded modules. Use the technique described above for any type of variable in modules and scripts.

Of course, another way of ensuring that a scalar is read-only and therefore sharable is to use either the constant pragma or the Readonly pragma. But then you won't be able to make calls that alter the variable even slightly, as in the example I've just shown, because it will be a true constant and you will get a compile-time error if you try this:


  MyConstant.pm
  -------------
  package MyConstant;
  use constant readonly => "Chris";
  
  sub match    { readonly =~ /\w/g;               }
  sub print_pos{ print "pos: ",pos(readonly),"\n";}
  1;
  
  % perl -c MyConstant.pm
  
  Can't modify constant item in match position at MyConstant.pm line
  5, near "readonly)"
  MyConstant.pm had compilation errors.

However this code is just right:


  MyConstant1.pm
  -------------
  package MyConstant1;
  use constant readonly => "Chris";
  
  sub match { readonly =~ /\w/g; }
  1;

Preloading Perl Modules at Server Startup

You can use the PerlRequire and PerlModule directives to load commonly used modules, such as CGI.pm and DBI, when the server is started. On most systems, server children will be able to share the code space used by these modules. Just add the following directives into httpd.conf:


  PerlModule CGI
  PerlModule DBI

But an even better approach is to create a separate startup file (where you write plain Perl code) and put things like these in it:


  use DBI ();
  use Carp ();

Don't forget to prevent the importing of the symbols exported by default by the modules you are going to preload, by placing empty parentheses () after each module's name, unless you need some of those symbols in the startup file itself, which is unlikely. This will save you a little more memory.

Then you require() this startup file in httpd.conf with the PerlRequire directive, placing it before the rest of the mod_perl configuration directives:


  PerlRequire /path/to/start-up.pl
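
For example, a minimal start-up.pl might look like this (the module list is illustrative; note that, like any require()'d file, it must end with a true value):

  start-up.pl
  -----------
  use strict;
  use DBI ();
  use Carp ();
  1;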

CGI.pm is a special case. Ordinarily CGI.pm autoloads most of its functions on an as-needed basis. This speeds up the loading time by deferring the compilation phase. When you use mod_perl, FastCGI or another system that uses a persistent Perl interpreter, you will want to precompile the functions at initialization time. To accomplish this, call the package function compile() like this:


  use CGI ();
  CGI->compile(':all');

The arguments to compile() are a list of method names or sets, and are identical to those accepted by the use() and import() operators. Note that in most cases you will want to replace ':all' with the tag names that you actually use in your code, since generally you only use a subset of them.
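
For example, assuming your code happened to call only these four functions (an illustrative subset), the startup call would shrink to:

  use CGI ();
  CGI->compile(qw(header param start_html end_html));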

Let's conduct a memory usage test to prove that preloading reduces memory requirements.

In order to make the measurement easy, I will use only one child process, with the following settings:


  MinSpareServers 1
  MaxSpareServers 1
  StartServers 1
  MaxClients 1
  MaxRequestsPerChild 100

I'm going to use the Apache::Registry script memuse.pl, which consists of two parts: the first loads a bunch of modules (most of which aren't going to be used), and the second reports the memory size and the shared memory size used by the single child process that I start, and of course prints the difference between the two sizes.


  memuse.pl
  ---------
  use strict;
  use CGI ();
  use DB_File ();
  use LWP::UserAgent ();
  use Storable ();
  use DBI ();
  use GTop ();

  my $r = shift;
  $r->send_http_header('text/plain');
  my $proc_mem = GTop->new->proc_mem($$);
  my $size  = $proc_mem->size;
  my $share = $proc_mem->share;
  my $diff  = $size - $share;
  printf "%10s %10s %10s\n", qw(Size Shared Difference);
  printf "%10d %10d %10d (bytes)\n",$size,$share,$diff;

First I restart the server and execute this CGI script with none of the above modules preloaded. Here is the result:


     Size   Shared     Diff
  4706304  2134016  2572288 (bytes)

Now I take all the modules:


  use strict;
  use CGI ();
  use DB_File ();
  use LWP::UserAgent ();
  use Storable ();
  use DBI ();
  use GTop ();

and copy them into the startup script, so they will get preloaded. The script itself remains unchanged. I restart the server and execute it again, and get the following:


     Size   Shared    Diff
  4710400  3997696  712704 (bytes)

Let's put the two results into one table:


  Preloading    Size   Shared     Diff
     Yes     4710400  3997696   712704 (bytes)
      No     4706304  2134016  2572288 (bytes)
  --------------------------------------------
  Difference    4096  1863680 -1859584

You can clearly see that when the modules weren't preloaded, the shared memory was about 1864KB smaller than in the case where the modules were preloaded.

Assuming that you have 256MB dedicated to the Web server, if you didn't preload the modules, you could have:


  268435456 = X * 2572288 + 2134016

  X = (268435456 - 2134016) / 2572288 = 103

103 servers.

Now let's calculate the same thing with modules preloaded:


  268435456 = X * 712704 + 3997696

  X = (268435456 - 3997696) / 712704 = 371

You can have almost four times as many servers!
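
To restate this arithmetic as code, here is a small helper; it is just the formula above wrapped in a subroutine, fed with the numbers measured in this test:

  sub max_clients {
      my ($total_mem, $child_unshared, $max_shared) = @_;
      return int( ($total_mem - $max_shared) / $child_unshared );
  }
  print max_clients(268435456, 2572288, 2134016), "\n";  # 103 (no preloading)
  print max_clients(268435456,  712704, 3997696), "\n";  # 371 (preloading)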

Remember that I mentioned before that memory pages get dirty and the size of the shared memory shrinks with time? I have presented the ideal case, where the shared memory stays intact. The real numbers will therefore be a little different, but not far from the numbers in our example.

It's also quite possible that in your case the process size will be bigger and the shared memory smaller, since you will use different modules and different code, so you won't get this fantastic ratio. But this example certainly helps you to feel the difference.


This week on Perl 6 (week ending 2002-07-21)

Another week, another Perl 6 summary. Cunningly this week I have taken over the summary from Piers in order to make it easier for me to namecheck myself. It's been a good week too, with more happening in perl6-internals than perl6-language. So that's where I'll start...

Parrot 0.0.7

The big news for this week is that DrForr has released Parrot 0.0.7 to the world (strange that lots of open source projects are releasing code just before the O'Reilly Open Source conference...). This release contains a Perl 6 grammar (with small, partial compiler!), functional subroutine, coroutine and continuation PMCs, global variables, an intermediate code compiler (imcc), a pure-Perl assembler and working garbage collection. "The name is Parrot. Percy Parrot."

http://archive.develooper.com/perl6-internals@perl.org/msg11090.html
http://www.cpan.org/modules/by-authors/id/J/JG/JGOFF/parrot-0_0_7.tgz

Note that the really cool Perl 6 compiler needs at least Perl 5.6. Oh, and check out imcc if you haven't looked at it yet.

Retro Perl

Nicholas Clark stated that "In October 2000 I believed that 5.005 maintenance *is* important for the acceptance of perl6, and I still do now". A first patch to the preliminary Perl 6 compiler was sent by Leopold Toetsch to make it work on 5.005_03 and seeing as Chip Salzenberg has restarted work on a new maintenance release of Perl 5.005 it's probably good for various parts of Parrot to run on retro perls. Shouldn't be a major problem.

Parrot docs

One of the big pushes last week was for more documentation inside Parrot. Writing documentation is always a problem for an open source project and it hit the wall last week. The good news is that lots of new documentation has been added to Parrot.

There was some discussion on the nature of documentation. The result is that inline C documentation should write up API details, and that longer discussions (say, the choice of algorithms, how to avoid overflows in unsigned arithmetic, the pros and cons of differing hash algorithms) would end up as .dev files inside the docs/dev/ directory, much as PDD07 "Conventions and Guidelines for Perl Source Code" says. A few more documentation patches followed.

Recently the mailing list and IRC channel have been quite busy and it seems like the new push for more documentation has attracted new people. Bonus!

http://archive.develooper.com/perl6-internals@perl.org/msg11080.html

MANIFESTations

The Parrot MANIFEST file tends not to be kept up-to-date with recent additions. Andy Dougherty produced a patch to do this. Nicholas Clark asked: "Is CVS flexible enough to let us run a manifest check on each commit and generate warnings that get sent somewhere useful if it fails?". Robert Spier answered that it could and with any luck he'll get it in soon...

RECALL

Tanton Gibbs posted a patch to clean up a problem with our Copy on Write strategy. He kindly explained it for me: "The basic problem is that in perlint.pmc we have something like:


  void set_string( PMC* value ) {
    CHANGE_TYPE( SELF, PerlString );
    SELF->data = value->data;
  }

In other words implement a COW strategy after being changed into a PerlString. However, in perlstring.pmc the following is performed:


  void set_string( PMC* value ) {
    SELF->data = string_copy( INTERP, value->data );
  }

The RECALL command automates that so that set_string now looks like:


  void set_string( PMC* value ) {
    CHANGE_TYPE( pmc, PerlString );
    RECALL;
  }

Thanks to Tanton for explaining.

Internals misc

There were also lots of other small patches and discussions. It looks like the push for this week is to make it easier to add new PMCs to Parrot.

Meanwhile, in perl6-language

It was a quiet week in the perl6-language list, which is probably a good thing as thinking too much about hyper operators makes my head hurt.

Hyper operators

There was some discussion on hyper operators this week. It didn't go anywhere in particular, but discussed lots of syntax. Objections such as "this code looks ugly!" came up regularly when talking about code such as:


  @solution =  (^-@b + sqrt(@b^**2 ^- 4^*@a^*@c) ) ^/ (2^*@a);

Luke Palmer pointed out that it might be better expressed as:


  for @a; @b; @c; @s is rw ->
    $a; $b; $c; $s {
      $s = (-$b + sqrt($b**2 - 4*$a*$c)) / (2*$a)
  }

Karl Glazebrook explained that PDL keeps everything as objects and does hyper operator magic without additional syntax. So Perl 6 "@y = $a ^* @x ^+ @b" happens in PDL with the clearer "$y = $a * $x + $b". Isn't PDL shiny?

Whitespace?

Brent Dax noticed that there might be a problem with the regular expression modifier ":w". The words modifier, according to Apocalypse 5, "causes an implicit match of whitespace wherever there's literal whitespace in a pattern". He asked what the following expand to:


  m:w/foo [~|bar]/
  m:w/[~|bar] foo/
  m:w/[~|bar] [^|baz]/
  m:w/@foo @bar/

Luke Palmer expanded that "In other words, it replaces every sequence of actual whitespace in the pattern with a \s+ (between two identifiers) or a \s* (between anything else)". Thus, the first would expand to:


  m/ foo \s* [~ | bar] /

However, it's not easy to represent, as the later cases point out. He continues: "Perhaps :w wouldn't transform the regex, but keep 'markers' on where there was whitespace in the regex". Nevertheless, it's a very useful feature.

Acknowledgements

This summary was brought to you from the O'Reilly Open Source conference and with the music from the intro to Buffy the Vampire Slayer.

As Piers says: Once again, if you liked this, then give money to YAS, if you didn't like it, well, then you can still give them money; maybe they'll use it to hire a better writer. Or maybe you could write a competing summary.

Graphics Programming with Perl

I recently received Martien Verbruggen's long-awaited "Graphics Programming with Perl," and I wasn't quite sure what to make of it. As he notes himself, "I didn't think there would be enough coherent material to write such a book, and I wasn't entirely certain there would be much room for one." Sure, you can write a chapter or so on business graphing -- something on GraphViz -- and a few chapters on GD, Imager and Image::Magick. But an entire book?

Like Martien, the more I look at this topic, the more there is to say, and the more comfortable I am with the way Martien says it. The book seems to concentrate primarily on Image::Magick, with some examples of GD.

All technical books seem to begin with a certain amount of introductory waffle; in "Graphics Programming with Perl," the waffle is at least to some degree relevant - there's a fundamental introduction to such things as color spaces, including some relatively fearsome equations converting between the various color systems. The introduction is carried on through chapter 2, a review of graphics file formats. I can't really categorize this as waffle, though, since a thorough understanding of these things is fundamental to graphics programming.

The real Perl meat starts around the middle of chapter 2, with sections on finding the size of an image and converting between images. Unfortunately, there's more introductory material again in chapter 3, with sections on the CPAN and descriptions of the modules that will be used in the rest of the book. Hence, I wouldn't really say this was the fastest-starting book around, and most people will be able to happily skip the first 30 or 35 pages without much loss of continuity.

Chapter 4 is where we actually start using Perl to draw things, the stated purpose of the book. We begin with drawing simple objects in GD, which is adequately explained, but unfortunately, there's no mention of how to save the images yet, so we can't check them or play with the examples and examine the results!

Next, the same examples are implemented using Image::Magick, a good comparison of the two modules; there's also another good comparison in the middle of an ill-fitting chapter on module interface design. The middle of the book contains precisely the sort of thing you'd expect for a book of this nature: font handling, business graphs, 3D rendering (although a little more detail on this topic would have been nice) and so on. The section on designing graphics for the Web is, if you'll allow a slight exaggeration, flawless.

I find the "bullet-point annotated code" style of explanation gets the important points across well, and Martien has achieved a nice balance of explanatory prose and demonstration code. The material occasionally seems to be let down by the odd bug or two in Image::Magick, but we can hardly blame the author for that.

What really disappointed me about this book was the glaring and complete omission of the Imager module; this is another module for programmatic graphics creation, and I personally favor it over Image::Magick and GD, which both require an intermediary external C library on top of the various libraries for handling graphics formats.

Similarly, much more could have been made of the interaction between Perl and the Gimp - there were a few pages on creating animated GIFs, but nothing about using Gimp plug-ins and the like.

Hence, in conclusion, I think if you take this book as being a complete reference to everything you can do with graphics and Perl, you're going to be disappointed. However, if you have certain tasks in mind and need to know how to do them, or you're particularly interested in what you can do with the Image::Magick module, then this book is for you.

Graphics Programming With Perl is available from Manning and all good computer bookshops.

Improving mod_perl Sites' Performance: Part 3

In this article we will continue the topic started in the previous article. This time we talk about tools that help us with code profiling and measuring memory usage.

Code Profiling Techniques

The profiling process helps you to determine which subroutines or just snippets of code take the longest time to execute and which subroutines are called most often. You will probably just want to optimize those.

When do you need to profile your code? You do that when you suspect that some part of your code is being called very often and so there may be a need to optimize it to significantly improve the overall performance.

For example, you might have used the diagnostics pragma, which extends the terse diagnostics normally emitted by both the Perl compiler and the Perl interpreter, augmenting them with the more verbose and endearing descriptions found in the perldiag manpage. If you've ever done so, then you know that it might slow your code down tremendously, so let's first see whether or not it actually does.

We will run a benchmark, once with diagnostics enabled and once disabled, on a subroutine called test_code.

The code inside the subroutine does a numeric comparison of two strings. It assigns one string to the other if the condition tests true, but the condition always tests false. To demonstrate the diagnostics overhead, the comparison operator is intentionally wrong: it should be a string comparison, not a numeric one.


  use Benchmark;
  use diagnostics;
  use strict;
  
  my $count = 50000;
  
  disable diagnostics;
  my $t1 = timeit($count,\&test_code);
  
  enable  diagnostics;
  my $t2 = timeit($count,\&test_code);
  
  print "Off: ",timestr($t1),"\n";
  print "On : ",timestr($t2),"\n";
  
  sub test_code{
    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a == $b) {
      $c = $a;
    }
  }

For only a few lines of code we get:


  Off:  1 wallclock secs ( 0.81 usr +  0.00 sys =  0.81 CPU)
  On : 13 wallclock secs (12.54 usr +  0.01 sys = 12.55 CPU)

With diagnostics enabled, the subroutine test_code() runs about 15 times slower than with diagnostics disabled!

Now let's fix the comparison the way it should be, by replacing == with eq, so we get:


    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a eq $b) {
      $c = $a;
    }

and run the same benchmark again:


  Off:  1 wallclock secs ( 0.57 usr +  0.00 sys =  0.57 CPU)
  On :  1 wallclock secs ( 0.56 usr +  0.00 sys =  0.56 CPU)

Now there is no overhead at all. The diagnostics pragma slows things down only when warnings are generated.

Now that we have verified that using the diagnostics pragma can add a big overhead to execution time, let's use code profiling to understand why this happens. We are going to use Devel::DProf to profile the code. Let's use this code:


  diagnostics.pl
  --------------
  use diagnostics;
  print "Content-type: text/html\n\n";
  test_code();
  sub test_code{
    my ($a,$b) = qw(foo bar);
    my $c;
    if ($a == $b) {
      $c = $a;
    }
  }

Run it with the profiler enabled, and then create the profiling statistics with the help of dprofpp:


  % perl -d:DProf diagnostics.pl
  % dprofpp
  
  Total Elapsed Time = 0.342236 Seconds
    User+System Time = 0.335420 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c  Name
   92.1   0.309  0.358      1   0.3089 0.3578  main::BEGIN
   14.9   0.050  0.039   3161   0.0000 0.0000  diagnostics::unescape
   2.98   0.010  0.010      2   0.0050 0.0050  diagnostics::BEGIN
   0.00   0.000 -0.000      2   0.0000      -  Exporter::import
   0.00   0.000 -0.000      2   0.0000      -  Exporter::export
   0.00   0.000 -0.000      1   0.0000      -  Config::BEGIN
   0.00   0.000 -0.000      1   0.0000      -  Config::TIEHASH
   0.00   0.000 -0.000      2   0.0000      -  Config::FETCH
   0.00   0.000 -0.000      1   0.0000      -  diagnostics::import
   0.00   0.000 -0.000      1   0.0000      -  main::test_code
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::warn_trap
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::splainthis
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::transmo
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::shorten
   0.00   0.000 -0.000      2   0.0000      -  diagnostics::autodescribe

It's not easy to see what is responsible for this enormous overhead, even if main::BEGIN seems to be running most of the time. To get the full picture we must see the call tree, which shows us who calls whom, so we run:


  % dprofpp -T

and the output is:


 main::BEGIN
   diagnostics::BEGIN
      Exporter::import
         Exporter::export
   diagnostics::BEGIN
      Config::BEGIN
      Config::TIEHASH
      Exporter::import
         Exporter::export
   Config::FETCH
   Config::FETCH
   diagnostics::unescape
   .....................
   3159 times [diagnostics::unescape] snipped
   .....................
   diagnostics::unescape
   diagnostics::import
 diagnostics::warn_trap
   diagnostics::splainthis
      diagnostics::transmo
      diagnostics::shorten
      diagnostics::autodescribe
 main::test_code
   diagnostics::warn_trap
      diagnostics::splainthis
         diagnostics::transmo
         diagnostics::shorten
         diagnostics::autodescribe
   diagnostics::warn_trap
      diagnostics::splainthis
         diagnostics::transmo
         diagnostics::shorten
        diagnostics::autodescribe

So we see that two executions of diagnostics::BEGIN and 3161 of diagnostics::unescape are responsible for most of the running overhead.

If we comment out the diagnostics module, we get:


  Total Elapsed Time = 0.079974 Seconds
    User+System Time = 0.059974 Seconds
  Exclusive Times
  %Time ExclSec CumulS #Calls sec/call Csec/c  Name
   0.00   0.000 -0.000      1   0.0000      -  main::test_code

It is possible to profile code running under mod_perl with the Devel::DProf module, available on CPAN. However, you must have apache version 1.3b3 or higher and the PerlChildExitHandler enabled during the httpd build process. When the server is started, Devel::DProf installs an END block to write the tmon.out file. This block will be called at server shutdown. Here is how to start and stop a server with the profiler enabled:


  % setenv PERL5OPT -d:DProf
  % httpd -X -d `pwd` &
  ... make some requests to the server here ...
  % kill `cat logs/httpd.pid`
  % unsetenv PERL5OPT
  % dprofpp

The Devel::DProf package is a Perl code profiler. It will collect information on the execution time of a Perl script and of the subs in that script (remember that print() and map() are just like any other subroutines you write, but they come bundled with Perl!)

Another approach is to use Apache::DProf, which hooks Devel::DProf into mod_perl. The Apache::DProf module will run a Devel::DProf profiler inside each child server and write the tmon.out file in the directory $ServerRoot/logs/dprof/$$ when the child is shut down (where $$ is the PID of the child process). All it takes is to add this to httpd.conf:


  PerlModule Apache::DProf

Remember that any PerlHandler that was pulled in before Apache::DProf in httpd.conf or startup.pl will not have its code debugging information inserted. To run dprofpp, chdir to $ServerRoot/logs/dprof/$$ and run:


  % dprofpp

(Lookup the ServerRoot directive's value in httpd.conf to figure out what your $ServerRoot is.)

Measuring the Memory of the Process

One very important aspect of performance tuning is to make sure that your applications don't use too much memory. If they do, you cannot run many servers, and in most cases the overall performance will degrade under heavy load.

In addition, the code may not be clean and may leak memory, which is even worse. In this case the same process serves many requests, and more memory is used after each request. After a while all your RAM will be used and the machine will start swapping (using the swap partition), which is a very undesirable event, since it may lead to a machine crash.
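
A hedged way to automate this watching under mod_perl is to log each child's size at the end of every request, then grep the error_log for growth. Here is a minimal sketch (Apache::LogSize is a made-up package name), assuming the GTop module described below is installed; you would install it in httpd.conf with PerlCleanupHandler Apache::LogSize:

  Apache/LogSize.pm
  -----------------
  package Apache::LogSize;
  use GTop ();
  use Apache::Constants qw(OK);
  my $gtop = GTop->new;
  sub handler {
      my $r = shift;
      # log pid, total size and shared size after each request
      my $proc_mem = $gtop->proc_mem($$);
      $r->log_error(sprintf "pid %d: size=%d share=%d",
                    $$, $proc_mem->size, $proc_mem->share);
      return OK;
  }
  1;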

The simplest way to figure out how big the processes are and see whether they grow is to watch the output of top(1) or ps(1) utilities.

For example the output of top(1):


    8:51am  up 66 days,  1:44,  1 user,  load average: 1.09, 2.27, 2.61
  95 processes: 92 sleeping, 3 running, 0 zombie, 0 stopped
  CPU states: 54.0% user,  9.4% system,  1.7% nice, 34.7% idle
  Mem:  387664K av, 309692K used,  77972K free, 111092K shrd,  70944K buff
  Swap: 128484K av,  11176K used, 117308K free                170824K cached

     PID USER PRI NI SIZE  RSS SHARE STAT LIB %CPU %MEM   TIME COMMAND
  29225 nobody 0  0  9760 9760  7132 S      0 12.5  2.5   0:00 httpd_perl
  29220 nobody 0  0  9540 9540  7136 S      0  9.0  2.4   0:00 httpd_perl
  29215 nobody 1  0  9672 9672  6884 S      0  4.6  2.4   0:01 httpd_perl
  29255 root   7  0  1036 1036   824 R      0  3.2  0.2   0:01 top
    376 squid  0  0 15920  14M   556 S      0  1.1  3.8 209:12 squid
  29227 mysql  5  5  1892 1892   956 S N    0  1.1  0.4   0:00 mysqld
  29223 mysql  5  5  1892 1892   956 S N    0  0.9  0.4   0:00 mysqld
  29234 mysql  5  5  1892 1892   956 S N    0  0.9  0.4   0:00 mysqld

The output starts with overall information about the system and then displays the most active processes at the given moment. So, for example, if we look at the httpd_perl processes, we can see the size of the resident (RSS) and shared (SHARE) memory segments. This sample was taken on a production server running Linux.

But of course we want to see all the apache/mod_perl processes, and that's where ps(1) comes in. The options of this utility vary from one Unix flavor to another, and some flavors provide their own tools. Let's check the information about the mod_perl processes:


  % ps -o pid,user,rss,vsize,%cpu,%mem,ucomm -C httpd_perl
    PID USER      RSS   VSZ %CPU %MEM COMMAND
  29213 root     8584 10264  0.0  2.2 httpd_perl
  29215 nobody   9740 11316  1.0  2.5 httpd_perl
  29216 nobody   9668 11252  0.7  2.4 httpd_perl
  29217 nobody   9824 11408  0.6  2.5 httpd_perl
  29218 nobody   9712 11292  0.6  2.5 httpd_perl
  29219 nobody   8860 10528  0.0  2.2 httpd_perl
  29220 nobody   9616 11200  0.5  2.4 httpd_perl
  29221 nobody   8860 10528  0.0  2.2 httpd_perl
  29222 nobody   8860 10528  0.0  2.2 httpd_perl
  29224 nobody   8860 10528  0.0  2.2 httpd_perl
  29225 nobody   9760 11340  0.7  2.5 httpd_perl
  29235 nobody   9524 11104  0.4  2.4 httpd_perl

Now you can see the resident (RSS) and virtual (VSZ) memory segments (and shared memory segment if you ask for it) of all mod_perl processes. Please refer to the top(1) and ps(1) man pages for more information.

You probably agree that using top(1) and ps(1) is cumbersome when we want to sample memory sizes during a benchmark test. We want a way to print memory sizes during the program's execution, at the places we choose. If you have the GTop module installed, a Perl glue to the libgtop library, it's exactly what we need.

Note: GTop requires the libgtop library, which is not available for all platforms. Visit http://www.home-of-linux.org/gnome/libgtop/ to check whether your platform/flavor is supported.

GTop provides an API for retrieving information about processes and the whole system. We are interested only in the memory sampling methods. To print all the process-related memory information, we can execute the following code:


  use GTop;
  my $gtop = GTop->new;
  my $proc_mem = $gtop->proc_mem($$);
  for (qw(size vsize share rss)) {
      printf "   %s => %d\n", $_, $proc_mem->$_();
  }

When executed we see the following output (in bytes):


      size => 1900544
     vsize => 3108864
     share => 1392640
       rss => 1900544

So if we want to print the process's resident memory segment before and after some event, we can just do that. For example, if we want to see how much extra memory was allocated after a variable creation, we can write the following code:


  use GTop;
  my $gtop = GTop->new;
  my $before = $gtop->proc_mem($$)->rss;
  my $x = 'a' x 10000;
  my $after  = $gtop->proc_mem($$)->rss;
  print "diff: ",$after-$before, " bytes\n";

and the output is:


  diff: 20480 bytes

So we can see that Perl has allocated an extra 20480 bytes to create $x. (Of course the creation of $after needed a few bytes as well, but that's insignificant compared to the size of $x.)

The Apache::VMonitor module, with the help of the GTop module, allows you to watch all your system information using your favorite browser from anywhere in the world, without needing to telnet to your machine. If you are wondering what information you can retrieve with GTop, you should examine Apache::VMonitor, as it deploys a large part of the API that GTop provides.

If you are running a true BSD system, you may use BSD::Resource::getrusage instead of GTop. For example:


  print "used memory = ".(BSD::Resource::getrusage)[2]."\n"

For more information refer to the BSD::Resource manpage.

Measuring the Memory Usage of Subroutines

With the help of Apache::Status you can find out the size of each and every subroutine.

  1. Build and install mod_perl as you always do, make sure it's version 1.22 or higher.
  2. Configure /perl-status if you haven't already:
    
      <Location /perl-status>
        SetHandler perl-script
        PerlHandler Apache::Status
        order deny,allow
        #deny from all
        #allow from ...
      </Location>
  3. Add to httpd.conf
    
      PerlSetVar StatusOptionsAll On
      PerlSetVar StatusTerse On
      PerlSetVar StatusTerseSize On
      PerlSetVar StatusTerseSizeMainSummary On
    
      PerlModule B::TerseSize
  4. Start the server (best in httpd -X mode)
  5. From your favorite browser fetch http://localhost/perl-status
  6. Click on 'Loaded Modules' or 'Compiled Registry Scripts'
  7. Click on the module or script of your choice (you might need to run some script/handler before you will see it here unless it was preloaded)
  8. Click on 'Memory Usage' at the bottom
  9. You should see all the subroutines and their respective sizes.

Now you can start to optimize your code, or test which of several implementations uses the least memory.

For example let's compare CGI.pm's OO vs. procedural interfaces:

As you will see below, the first (OO) script uses about 2KB, while the second (procedural interface) script uses about 5KB.

Here are the code examples and the numbers:

  1. 
      cgi_oo.pl
      ---------
      use CGI ();
      my $q = CGI->new;
      print $q->header;
      print $q->b("Hello");
  2. 
      cgi_mtd.pl
      ---------
      use CGI qw(header b);
      print header();
      print b("Hello");

After executing each script in single server mode (-X) the results are:

  1. 
      Totals: 1966 bytes | 27 OPs
      
      handler 1514 bytes | 27 OPs
      exit     116 bytes |  0 OPs
  2. 
      Totals: 4710 bytes | 19 OPs
      
      handler  1117 bytes | 19 OPs
      basefont  120 bytes |  0 OPs
      frameset  120 bytes |  0 OPs
      caption   119 bytes |  0 OPs
      applet    118 bytes |  0 OPs
      script    118 bytes |  0 OPs
      ilayer    118 bytes |  0 OPs
      header    118 bytes |  0 OPs
      strike    118 bytes |  0 OPs
      layer     117 bytes |  0 OPs
      table     117 bytes |  0 OPs
      frame     117 bytes |  0 OPs
      style     117 bytes |  0 OPs
      Param     117 bytes |  0 OPs
      small     117 bytes |  0 OPs
      embed     117 bytes |  0 OPs
      font      116 bytes |  0 OPs
      span      116 bytes |  0 OPs
      exit      116 bytes |  0 OPs
      big       115 bytes |  0 OPs
      div       115 bytes |  0 OPs
      sup       115 bytes |  0 OPs
      Sub       115 bytes |  0 OPs
      TR        114 bytes |  0 OPs
      td        114 bytes |  0 OPs
      Tr        114 bytes |  0 OPs
      th        114 bytes |  0 OPs
      b         113 bytes |  0 OPs

Note that the above is correct only if you didn't precompile all of CGI.pm's methods at server startup. If you did, the procedural interface in the second test will take up to 18KB, not the 5KB we saw. That's because the whole of CGI.pm's namespace is inherited and it already has all its methods compiled, so it doesn't really matter whether you attempt to import only the symbols that you need. So if you have:


  use CGI  qw(-compile :all);

in the server startup script. Having:


  use CGI qw(header);

or


  use CGI qw(:all);

is essentially the same: you will have all the symbols precompiled at startup imported, even if you ask for only one symbol. It seems like a bug to me, but that's probably just how CGI.pm works.

BTW, you can check the number of opcodes in a piece of code with a simple command-line run. For example, let's compare 'my %hash' vs. 'my %hash = ()':


  % perl -MO=Terse -e 'my %hash' | wc -l
  -e syntax OK
      4

  % perl -MO=Terse -e 'my %hash = ()' | wc -l
  -e syntax OK
     10

The first one has fewer opcodes.

Note that you shouldn't use the Apache::Status module on a production server, as it adds quite a bit of overhead to each request.



This Week on Perl 6 (8 - 14 Jul 2002)

Perl 6 summary for week ending 20020714

Well, what a week it's been, eh, sportsfans? Without further ado, here's a rundown of all the excitement in the Perl 6 development camps.

Still waiting for Exegesis 5?

The week before last saw a couple of fantastic postings on Perlmonks dealing with the fun stuff in Apocalypse 5. I'm sorry I missed them last week. Damian is still beavering away on the Exegesis, but these (shall I call them Apocrypha?) are worth reading.

http://www.perlmonks.org/index.pl

http://www.perlmonks.org/index.pl

Is Parrot a second system?

John Porter worried about the second-system effect, and about whether the movement to implement a bunch of 'foreign' VM ops on Parrot was just going to add bloat and inefficiency. Dan assured him that ``these 'add-on' bytecode interpreters don't get any special consideration in the core.'' John was reassured.

I think it was decided that Parrot is a second system, but that we're working to avoid the classic problems associated with one.

http://archive.develooper.com/perl6-internals@perl.org/msg10802.html

Don't mix labels and comments

Simon Glover had a problem with

    A:              # prints "a"
        print "a"
        end

which kills the assembler because of the presence of the comment. Tom Hughes posted a patch to fix it, and Brian Wheeler pointed out that the patch means you can't do print "#", which would be bad. Tom reckons he fixed that with his second patch.

http://archive.develooper.com/perl6-internals@perl.org/msg10915.html Tom's initial fix.

http://archive.develooper.com/perl6-internals@perl.org/msg10918.html And the second.

Perl 6 grammar, take 5

Sean O'Rourke is my hero. He's still beavering away on writing a Perl 6 grammar. The latest incarnation is apparently, ``Turing-complete, if you have a Parrot engine and a bit of spare time.'' The grammar is still incomplete (of course), and someone pointed out that it had a problem with code like { some_function_returning_a_hash() }. Should it give a closure? Or a hash ref. Larry hasn't commented so far.

Sean comments: ``The fact that I've been able to whip this up in a couple thousand lines of code is a remarkable testament to Parrot's maturity, and to the wealth of tools available in Perl 5. In particular, without The Damian's Parse::RecDescent, Melvin Smith's IMCC, and Sarathy's Data::Dumper, it never would have been possible.''

Quote of the thread: ``What, this actually runs? Oh, my.'' -- Dan Sugalski

http://archive.develooper.com/perl6-internals@perl.org/msg10866.html

So, What Is IMCC Then?

I asked, they answered. Apparently, reading TFM would have been a good place to start, though Melvin Smith didn't put it quite so bluntly when he told me. Essentially, the IMCC is the Parrot intermediate language compiler. It's a bit like an assembler but sits at a slightly higher level and worries about the painful things like ``register allocation, low level optimisation, and machine code generation.'' And everyone gets to share that wealth -- Perl6, Ruby, Python, whatever -- they all need the same facilities that IMCC provides.

The idea is that, instead of worrying about registers, you just provide a string of temporaries or named locals, so you can write:

    $I1 = 1
    $I2 = 2
    $I3 = $I1 + $I2
    $I5 = $I3 + 5

And IMCC will notice that it only needs to use two registers when it turns that into:

    set I0, 1
    set I1, 2
    add I0, I0, I1
    add I0, I0, 5

Melvin finishes by saying: `` If people don't get anything else, they should get this. Implementing a compiler will be twice as easy if they target the IR instead of raw Parrot. At a minimum, they implement their parser, generate an AST, and walk the tree, emitting intermediate expressions and directives.''

Leon Brocard, who I am constitutionally required to namecheck in every Perl 6 summary, tells me: ``IMCC is the coolest thing. ... Please don't quote me verbatim.'' Tee hee.

The fine manual is at languages/imcc/README in the Parrot source tree.

Vtables and Multimethod Dispatch

This continued from last week. For some reason this ended up with a discussion of Pythagoras' Theorem and Manhattan Distance (this was to do with the idea of dispatch based on distance in type space ... .)

John Porter worried about the cost of generating full MM dispatch tables, quoting some scary numbers. Dan reckoned that the numbers weren't that scary, and that the problem was limited quite neatly.

Quote of the thread: ``I'm not sure I want an algorithm that drives on the sidewalks, runs red lights, and chases pedestrians ... .'' -- Dan Sugalski (again)

http://archive.develooper.com/perl6-internals@perl.org/msg10814.html is a good 'root' to start at.

http://archive.develooper.com/perl6-internals@perl.org/msg10859.html Quote of the thread, in context.

Can I put out a plea for someone, once the dust has settled, to summarise the state of multidispatch?

Building Support for Non-Native Bytecode

Dan mapped out what would be needed to implement a non-native VM. I think he just wants to play Zork using parrot, but I'd never actually say that. He also said he'd have the specs for dynamic opcode and PMC loading out within 24 hours, but I think events may have intervened.

http://archive.develooper.com/perl6-internals@perl.org/msg10806.html

Mutable vs. Immutable Strings

Clark C. Evans reckoned that we'd need both strings and buffers and argued that all strings should start as mutable, but that it should be possible to 'freeze' them. He also pointed out that there should be no corresponding 'thaw' operation. He wondered too whether these semantics might be useful for more than just strings. Florian Haeglsperger wondered whether Copy on Write didn't solve the problem at a stroke, to which the answer appears to be 'not really.' Dan thinks that 'read-onlyness' should probably hang off PMCs rather than strings and buffers. Dan also commented that we need to nail down some semantics about when things can and can't be modified.

The discussion slowly morphed into a discussion of types in Perl and other languages. Melvin Smith noted that ``we've built this register based VM upon which Perl will probably be the most non-optimised language. Things like exposing your lexical pads, eval, etc., blow optimisation out of the water,'' but reckoned that Perl itself would probably still see a massive speed-up.

Ashley Winters got scary. A paragraph that begins, ``Whenever the compiler runs across an eval, take a continuation from within the compiler,'' is always going to be worrying. Actually, he proposed it as a topic for a CS master's thesis, and Dan pointed out that one Damian Conway is looking for students.

Quote of the thread: A tie between Ashley's paragraph opener above and ``[Parrot] will have reflection, introspection, and Deep Meditative Capabilities.'' -- Who else, Dan Sugalski

http://archive.develooper.com/perl6-internals@perl.org/msg10807.html (re)start here.

Adding the System Stack to the Rootset.

One of the weird things about a system that can do continuations is that stack frames need to come out of managed memory; you can't just use the C stack. And if you *don't* manage stack frames using garbage collection, then you end up with memory leaks because the stack frames don't get released, which is where we stood.

Dan is in the process of making sure that system stack frames get properly garbage collected. Mike Lambert also stepped up and did some/most of the implementation.

http://archive.develooper.com/perl6-internals@perl.org/msg10829.html

http://archive.develooper.com/perl6-internals@perl.org/msg10881.html

http://archive.develooper.com/perl6-internals@perl.org/msg10905.html Mike's patch.

Making Keyed Aggregates Really Work.

Melvin Smith put out a call for someone to do an audit of the keyed aggregate handling code and find out which methods are missing, which ones aren't handled by the assembler and generally to fix them. Sean (Her)O'Rourke has apparently done the deed.

Parrot: Copying Arrays

Alberto Sim&otilde;es wondered about copying and cloning arrays and other aggregates. How deeply should one go when making a copy as opposed to just taking a reference? This one is still awaiting an answer.

http://archive.develooper.com/perl6-internals@perl.org/msg10868.html

Coders: Add Doco

The internals list came close to a flame war over this one. John Porter opened with the somewhat incendiary, ``I have to say that I am extremely disappointed at the paucity of internal documentation,'' and went on to make some good points, but in a tone that rather annoyed several people.

Productive stuff that came out of this, and the subsequent 'Parrot contribution' thread include:

  • FAQ writing should be a collaborative effort. The questions that an experienced Parrot head has, or thinks are important, are probably not the questions that a newbie would ask.
    So, if you have a question that you think belongs in the FAQ, then send a message to the list with a subject line of 'PARROT QUESTION' and we'll try and produce some sensible answers and add them to the FAQ.
  • The Parrot IRC channel is a good place for some stuff but has no 'journal of record.' Something like London.pm's lovely 'scribot' bot could prove useful. (I'm deliberately not putting pointers to the IRC channel. If you need to know, then read the thread.)
  • Why questions, and their answers are often really important.
  • We really should be maintaining the .dev files associated with each source file, as mentioned in PDD7.
  • Dan tries to be on IRC regularly from 1300-1700EST Wednesday. ``While it's no substitute for e-mail, and not as good a realtime communication method as face to face communication, it's better than no realtime communication at all.''

In the end, I think we ended up generating more light than heat, but it was touch and go for a while ... .

http://archive.develooper.com/perl6-internals@perl.org/msg10870.html

http://archive.develooper.com/perl6-internals@perl.org/msg10882.html

The first PARROT QUESTIONs

Sadly, the first PARROT QUESTION post didn't contain any actual questions. Ashley Winters pointed out that 'test_main.c' is a rather weird place to find parrot's main loop and wondered why parrot.c is empty.

His follow-up contained the actual questions, most of which were answered in the following thread, which is still ongoing.

Tom Hughes told us he was trying to make sense of the current status of keyed access at all levels, from assembler through the ops to the vtables and is getting more confused the more he looks; which can't be good. Melvin Smith agreed that things were confusing, but thought that things might get a little less confusing when he'd committed Sean's patches. Discussion is ongoing.

http://archive.develooper.com/perl6-internals@perl.org/msg10926.html Ashley's background post.

http://archive.develooper.com/perl6-internals@perl.org/msg10927.html Ashley's questions

http://archive.develooper.com/perl6-internals@perl.org/msg10930.html Tom Hughes on keyed access

More Docs

Stephen Rawls submitted rx.dev, an overview of the regular expression code. Brent Dax added some clarification.

Alberto Sim&otilde;es unveiled a pile of documentation for the Array, PerlArray and PerlHash PMCs, earning himself a few Hero Points from me.

Type Morphing

I'm not entirely sure I understood this thread. Sean O'Rourke submitted some patches to fix Parrot's PMC type morphing. Mike Lambert pointed at some ambiguities and then Sean showed some code that seems rather counter intuitive to do with type morphing and comparisons. He also provided a test file that shows some places where Perl 5 and Parrot seem to disagree on semantics.

http://archive.develooper.com/perl6-internals@perl.org/msg10940.html

Glossary Requests

Mike Litherland made some suggestions about what should go in the glossary. Patches were welcomed, and Dan added some terms to the glossary, which is visible at http://www.parrotcode.org/glossary/

Meanwhile, in Perl6-language

The game of 'find a good description of continuations' rumbled on for a bit. I liked Mike Lambert's description involving a traveler in a maze (that's why Dan wants a Z-machine interpreter running on Parrot. Continuations are the 'maze of little twisty passages all similar').

Anyway, Dan also posted a splendidly clear and lucid explanation of continuations. (Oh, frabjous day! Calloo! Callay!) Peter Scott wondered about serializing continuations, which is a tough problem because some state really can't be serialized (filehandles, for instance), which led Ted Zlatonov to suggest FREEZE {...} and THAW {...} blocks.
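
For the curious, the flavor of the idea can be faked in Perl 5 with continuation-passing style. This is only a rough sketch -- Perl 5 has no first-class continuations, so unlike the real thing, nothing here can be captured and resumed later:

    # add() never 'returns' in the usual sense; it hands its result to
    # $k, a closure standing in for 'the rest of the program'.
    sub add {
        my ( $x, $y, $k ) = @_;
        $k->( $x + $y );
    }

    add( 1, 2, sub { print "got @_\n" } );   # prints "got 3"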

http://archive.develooper.com/perl6-language@perl.org/msg10284.html Mike's 'maze' analogy.

http://archive.develooper.com/perl6-language@perl.org/msg10275.html Dan's ``Continuations are just the stack, you know?'' version.

What's MY.line

Chip Salzenberg asked some hardish questions about MY, %MY and the difference between them. Piers (continuations everywhere) Cawley proposed $a_continuation.the_pad, which should probably be $a_continuation.MY on further reflection, which Dan seemed to think wasn't utterly insane.

It was also proposed that things like

    [localhost:~] $ perl
    my $foo = 12;
    print $foo;
    my $foo = 'ho';
    print $foo;
    12ho[localhost:~] $

which is legal in Perl 5 (though it draws a warning under -w) should be illegal in Perl 6. This was left as Larry's call.

Quote of the thread: 'And side effects like ``I call you, you modify me invisibly ... '' seems more like taking dangerous drugs than programming.' -- Melvin Smith

On seeing the quote of the thread, Richard (madman) Clamp popped up to point out that, with the aid of Devel::LexAlias, you could already do that in Perl 5. Which is nice.

http://archive.develooper.com/perl6-language@perl.org/msg10290.html Thread starts here. Pretty much all worthwhile.

http://archive.develooper.com/perl6-language@perl.org/msg10312.html Quote in context

http://archive.develooper.com/perl6-language@perl.org/msg10319.html Richard Clamp's bombshell

In Brief

http://archive.develooper.com/perl6-internals@perl.org/msg10803.html -- Mark M. Adkins announced a perl script that hunts down all the POD files in the Parrot directory tree and uses that to generate an HTML tree in a new subdirectory. It looks rather handy.

http://arstechnica.com/paedia/c/caching/caching-1.html -- Dan pointed us at an explanation of CPU caches

Robert Spier pointed everyone at http://www.parrotcode.org, specifically the Development Resources.

Sean O'Rourke proposed ripping a bunch of set_* ops out of core.ops now that we've got 'proper' keyed access in the assembler. Dan concurred.

http://archive.develooper.com/perl6-internals@perl.org/msg10920.html Tanton Gibbs sent a patch that adds documentation and a .dev file for byteorder.c

Nicholas Clark is trying to eliminate compiler warnings.

http://archive.develooper.com/perl6-internals@perl.org/msg10925.html Steve Purkis has a patch to add usleep(int) and sleep(int) to the Linux version of Parrot. Dan likes the idea, but the patch won't go in until it can be made to work on Win32 as well.

http://archive.develooper.com/perl6-language@perl.org/msg10270.html Luke Palmer has a vim highlighting file for Perl 6.

The return of ``Who's Who in Perl 6''

This week, Allison Randal answered the new standard ``5Ws questionnaire''

Who are you?
I'm on the IT staff at the University of Portland. In my spare time I enjoy working on Perl 6. In my spare-spare time I like to swing in a hammock and read and ponder the ineffability of 42.
What do you do for/with Perl 6?
I dream in Perl 6 ... ;)

On the Perl 6 design team I'm the other linguist, or Damian's clone, or the assistant cat-herder, or sometimes the Devil's Advocate. It depends on the day, really.

Where are you coming from?
Perl 6 will be an incredible jump in power and flexibility. But it's also a lot easier to use. I think that fact is often missed. People see a flurry of changes, but they can't see the forest for the trees. It's not just about making the hard things more possible, it's about making the easy things easier. That's the message I want to carry.
When do you think Perl 6 will be released?
February 17, 2004 at 13:42 GMT. ;) Okay, no. :) But the current estimates of 12-18 months sound pretty reasonable, especially with the progress we've seen in Parrot.
Why are you doing this?
Life's too short to settle for weak coffee.

No, really, for the most selfish reason imaginable: I want to use Perl 6. Anything I can do to make it a better language or to help it reach production faster is well worth the effort.

And it's incredibly fun.

You have five words. Describe yourself.
Extreme Geekiness in Unusual Packaging.
Do you have anything to declare?
Perl rules!

Acknowledgements

Thanks to the denizens of #perl and #parrot for their, ahem, 'mad proofreading skeelz.' To Melvin Smith and Leon Brocard for their explanations of IMCC.

This summary was brought to you with the assistance of GNER tea, and the music of Waterson:Carthy and Gillian Welch.

Once again, if you liked it, then give money to YAS, if you didn't like it, well, then you can still give them money; maybe they'll use it to hire a better writer. Or maybe you could write a competing summary.

A Test::MockObject Illustrated Example

People like to find excuses to avoid writing tests for their code. One of the most common goes something like, "It's not feasible to test this, because it relies on external objects" - CGI code, code using the Apache request object, TCP/IP servers, and so on.

The Test::MockObject module makes it much easier to isolate code that uses such objects. For example, if your code uses a CGI object, then you could fiddle with query strings and faked STDIN, trying to persuade CGI.pm to produce testable values. It's easier to use Test::MockObject to create an object that looks and behaves like a CGI object -- but that is completely under your control.

This comes in handy in large software projects, where objects encapsulate tremendous amounts of hidden behavior. If your application follows good design principles by hiding complexity behind well-defined interfaces, then you can replace nearly any component with its interface equivalent. (The internals, of course, are free to change. That's why this is possible.) Often, it's sufficient just to accept certain arguments and to return specific values.

Using a mock object, you can test whether the code in question uses a particular interface correctly. It's possible to do this by hand, but Test::MockObject has several utility methods to add fake methods and verify calls. It integrates with Test::Builder, so it works correctly with Test::Simple, Test::More, and their cousins.
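
As a taste of the interface, here is a minimal sketch; the faked param() behavior is invented purely for illustration:

        use strict;
        use warnings;
        use Test::More tests => 2;
        use Test::MockObject;

        # A stand-in for a CGI object; fake only the one method the code needs.
        my $cgi = Test::MockObject->new();
        $cgi->set_always( 'param', 'some value' );

        # Code under test would receive $cgi wherever it expects a CGI object.
        is( $cgi->param( 'zip' ), 'some value', 'mock answers like a CGI object' );
        ok( $cgi->called( 'param' ), '... and remembers that param() was called' );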

The Background You Need to Know

I assume that you are already familiar with Test::More. Perhaps you've read Test::Tutorial, which comes with the Test::Simple distribution. You may also have read my earlier introduction to the subject. If not, then you may wish to do so. (My roommate tried it out of order and hurt his head. If you fare any better, then the Perl QA group is interested in your natural talent!)

My example comes from the unit tests for the Everything Engine. I chose it for two reasons. First, it's a project near and dear to my heart. Second, it needs more users, testers and developers. More importantly, it's where I came up with the ideas that led to Test::MockObject.

My colleague on the Engine, Darrick Brown, devised a clever technique he dubbed Form Objects. These are used to bind menu choices to nodes. (Everything in Everything is a node. It's the base unit of data and behavior.) Form objects control the creation of HTML widgets, verify submitted data and ultimately update nodes. They all inherit strongly from Everything::FormObject and operate on node objects, so they're an ideal candidate for mock objects.

Mock Objects

This article focuses on white-box unit testing with mock objects. "White box" testing means that you're allowed and encouraged to look at the internals of the thing being tested. This is scarily possible with Perl. By contrast, "black box" testing happens when you cannot know the internal details: you just know the allowed inputs and the expected outputs. (If you don't know that, then you can't do much testing.)

Unit testing, of course, is testing individual components of the program in isolation, as far as possible. This is different from integration testing, which exercises the program as a whole, and acceptance testing, which explores the desired end-user behavior of the program. No type of testing is better or worse than any other. Done properly, they are complementary: Unit tests are capable of exploring internal behaviors that are difficult to prove with acceptance tests; integration tests demonstrate the interoperability between different components that unit tests usually cannot guarantee.

The point of mock objects is to isolate the units being tested from their dependencies, and to give testers more complete control over the testing environment. This follows from standard programming principles: If you can fake the interfaces your unit relies on, then you can control and monitor its behavior.

Perl makes this particularly easy.

The Example

I'm writing a test for Everything::HTML::FormObject::AuthorMenu. This class represents an HTML select box used to set the author of a node. It has two methods. genObject() produces the HTML necessary for the select widget. It is called when the Engine builds a page to display to the user. The other method, cgiVerify(), is called when receiving data from a user submission. It checks to see whether the requested author exists and has write access to the node.

Looking Inside

The module begins rather simply:


        use strict;
        use Everything;
        use Everything::HTML;
        use Everything::HTML::FormObject;

        use vars qw(@ISA);
        @ISA = ("Everything::HTML::FormObject");

Testing this is all very easy. I'd like to make sure that the module continues to load all of these modules, so I need some way to catch their use. (Don't laugh -- I've forgotten to load important modules before, causing really tricky errors. It's better to be precise. Now you can laugh.) Because use calls import() behind the scenes, if I can install my own import() before AuthorMenu is compiled, then I can test that these modules are actually used. As a side benefit, doing so prevents these other classes from loading, making it easier to mock them. The test starts:


        package Everything::HTML::FormObject::AuthorMenu;
        use vars qw( $DB );

        package main;

        use Test::More 'no_plan';

Because I'm faking the other packages, anything the true modules would normally import will not be imported. The only thing that really matters at this point is the $DB object, exported from the Everything package. (I'm cheating a little bit. I know I'll use it later, and I know where it's defined and how it's used. At this point, I should probably say, ``The module will fail to compile unless I fake it here,'' and leave it at that.)

Because I'm not ready to implement $DB, I just switch to the package to be tested and declare it as a global variable. When the package is compiled, it's already there, so it'll compile successfully. Then I return to the main package so I don't accidentally clobber anything important and use the testing module.


        my @imports;
        for ( 'Everything', 'Everything::HTML',
                'Everything::HTML::FormObject') {
                no strict 'refs';
                *{ $_ . '::import' } = sub {
                        push @imports, $_[0];
                };
                (my $modpath = $_) =~ s!::!/!g;
                $INC{ $modpath . '.pm' } = 1;
        }

Here's where it starts to get tricky. Because I want to ensure that these three modules are loaded correctly, I have to make Perl think they're already loaded. %INC comes to the rescue. When you use() or require() a module successfully, Perl adds an entry to %INC with the module pathname, relative to one of the directories in @INC. That way, if you use() or require() the module again, then Perl knows that it's already been loaded.
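
For instance, this minimal sketch (with a made-up module name) convinces Perl that a module is already loaded. The BEGIN block matters because use() happens at compile time:

        # A made-up module name, purely to illustrate the %INC check.
        BEGIN { $INC{'My/Imaginary/Module.pm'} = 1 }
        use My::Imaginary::Module;   # a no-op: Perl believes it is loaded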

As mentioned before, my preferred way to check that a module is loaded is to trap all calls to import(). That's why I install fake import subroutines. They simply save the name of the package by which they're identified. The tests are simple to write:


        use_ok( 'Everything::HTML::FormObject::AuthorMenu' );

        is( $imports[0], 'Everything',
                'Module should use Everything' );
        is( $imports[1], 'Everything::HTML',
                'Module should use Everything::HTML' );
        is( $imports[2], 'Everything::HTML::FormObject',
                'Module should use Everything::HTML::FormObject' );

Because use_ok() fires at runtime, it's safe not to wrap this whole section in a BEGIN block. (If you're curious about this, see what perlfunc has to say about use() and its equivalent.)
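
For reference, here is the equivalence perlfunc describes:

        use Everything;
        # ... is equivalent to:
        BEGIN { require Everything; Everything->import(); }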

That works, but it's a little messy and uses some tricks that might scare (or at least confuse) the average Perl hacker. One of the goals of the Test::* modules is to do away with the ``evil black magic'' you'd normally have to use. So, now I show you a more excellent way.

Making That Last Bit Easier

Having found myself writing that code way too often (at least twice), I added it to Test::MockObject. Using the module, the corresponding loop is now:


        my @imports;
        for ( 'Everything', 'Everything::HTML',
                'Everything::HTML::FormObject') {
                Test::MockObject->fake_module(
                        $_,
                        import => sub { push @imports, $_[0] }
                );
        }

Behind the scenes, the module does exactly what the loop did. The nice thing is that you don't have to remember how to fake that a module is loaded or how to test import(). It's already done and it's nicely encapsulated in a module. (I inadvertently drove this point home to myself when writing this section. It turns out that version 0.04 of Test::MockObject populated %ENV instead of %INC. I make that typo often. This time, it was in both the module and the test. Get at least version 0.08, as this article has led to bug fixes and new conveniences. :)

This isn't the most important feature of Test::MockObject, though. It's just a convenient addition. At some point, it should probably be spun off into Test::MockPackage or Test::MockModule. (Want to contribute?)

Testing an Actual Method

Once the module is loaded and ready, I like to test my methods in the order in which they appear. This helps to keep the test suite and the module somewhat synchronized. The first method is genObject(). It's fairly simple:


        my $this = shift @_;
        my ($query, $bindNode, $field, $name, $default) =
                getParamArray(
                "query, bindNode, field, name, default", @_);

        $name ||= $field;

        my $html = $this->SUPER::genObject(
                $query, $bindNode, $field, $name) . "\n";

        if(ref $bindNode)
        {
                my $author = $DB->getNode($$bindNode{$field});
                if($author && $author->isOfType('user'))
                {
                        $default ||= $$author{title};
                }
                elsif($author)
                {
                        $default ||= "";
                }
        }

        $html .= $query->textfield(-name => $name,
                -default => $default,
                -size => 15, -maxlength => 255,
                -override => 1) . "\n";

        return $html;

I can see several spots that need tests. First, I want to make sure that getParamArray() is called with the proper arguments. (This function makes it possible to pass parameters by position or in name => value pair style, similar to Sub::NamedParams.) Next, I'll test that SUPER::genObject() is called correctly, with the proper values. (This call looks odder than it is, due to the way the Engine handles node inheritance. For a good time, read Everything::Node::AUTOLOAD(), or take up bowling.) After that, there's the conditional statement. I'll have to call the function at least three times to test the branches effectively. The function ends with a textfield() call I want to test, and has a return value where I can check some other things. It's not as complex as it looks.

One of the side benefits of testing is that you'll start to write smaller and smaller functions. This is especially evident if you write tests before you write code to pass them. Besides being easier to read and debug, this tends to produce code that's more flexible and much more powerful.

Having identified several things to test, I next write test names. These are short descriptions of the intent of the tests. When confronted with a piece of existing code, I usually try to figure out what kinds of things can break, and what the important behavior really is. With experience, you can look at a piece of well-written code and figure it out intuitively. There's room to be explicit when you're just starting, though.


        # genObject() should call getParamArray()
        # ... passing a string of desired parameters
        # ... and its arguments, minus the object
        # ... should call SUPER::genObject()
        # ... passing the important parameters
        # ... and should call textfield()
        # ... with the important parameters
        # ... returning its and its parent's results
        # ... if node is bound, should call getNode()
        # ... on bound node field
        # ... checking for an author node
        # ... and setting the default to the author name
        # ... or nothing, if it is not an author node

That's a pretty good rough draft. Going through the list reveals the need to make at least two more passes through the code. Getting this in the right order takes a little work, but once you know how to set up the mocking conditions correctly, it's really fast and easy. The best way I've found to handle this is just to jump right in.


        can_ok( 'Everything::HTML::FormObject::AuthorMenu', 'genObject' );

First, I want to make sure this method exists. Why? It's part of the ``Do the simplest thing that could possibly work'' principle. Whenever I add a method, I first check to see whether it exists. It sounds too stupid to have any use, but this is a thing that could possibly break. First, I've been known to misspell method names occasionally. This'll catch that immediately. Second, it gives me a place to start and a test that passes with little work. That's a nice psychological boost that moves me on to the next test. I've kept this habit when writing tests for existing code.

Next, I want to test the call to getParamArray(). Since it's not a method, I can't use a mock object. I'll have to mock it sideways. Though the function comes from Everything.pm, it would normally be exported into this package. I'll use a variant of the import() mockery from earlier:


        my ($gpa, @gpa);
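        # (Assumes $mock was created earlier with Test::MockObject->new().)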
        $mock->fake_module( $package,
            getParamArray => sub { push @gpa, \@_; return $gpa }
        );

I can count the elements of @gpa to see whether it was called, and pull the arguments out of the array. $gpa allows me to control the output. The test itself is easy to write:


        my $result = genObject();
        is( @gpa, 1, 'genObject() should call getParamArray' );

OK, it's a little easier to write than it should have been. If you're paying attention, then you should wonder why the genObject() call works, as I'm still in the main package and the method is in the class. I've just added a variable with the tested package name, as well as an AUTOLOAD() function. I'm already tired of typing the big long package name:


        # near the start
        use vars qw( $AUTOLOAD );
        my $package = 'Everything::HTML::FormObject::AuthorMenu';
        
        ...
        
        # way down at the end
        sub AUTOLOAD {
                my ($subname) = $AUTOLOAD =~ /([^:]+)$/;
                if (my $sub = UNIVERSAL::can( $package,
                     $subname )) {
                        $sub->( @_ );
                } else {
                        warn "Cannot call <$subname> 
                             in ($package)\n";
                }
        }

With all of that infrastructure in place, it's a little disappointing to realize that the test dies. Since I'm calling the method as a function, there's no object on which to call SUPER::genObject(). This is easily solved. Remember that $mock (a mock object) was created earlier? Here's one bit of Perl's magic that drives some OO purists crazy, but makes it oh-so-easy to test. A method call is a function call with a special first argument. If $this, inside genObject(), can do everything that an Everything::HTML::FormObject::AuthorMenu object can do, then it'll just work. Hooray for polymorphism!
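
The general principle, demonstrated with a throwaway package (Greeter is invented for illustration):

        package Greeter;
        sub greet { my ( $self, $name ) = @_; print "hello, $name\n" }

        package main;
        my $obj = bless {}, 'Greeter';

        $obj->greet( 'world' );             # a normal method call ...
        Greeter::greet( $obj, 'world' );    # ... is just this, after dispatch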

To make the SUPER::genObject() call pass, that call will also have to be mocked. The method resolves to Everything::HTML::FormObject::genObject(), so I'll add a function of the appropriate name and package. (This test code is starting to look familiar. Again, Test::MockModule, anyone?)


        my @go;
        $mock->fake_module( 'Everything::HTML::FormObject',
            genObject => sub { push @go, \@_; return 'some html' }
        );

Now I modify the genObject() call, passing in my mock object:


        my $result = genObject( $mock );

I get further before things fail. Since there's nothing passed in for the $query variable to hold, the textfield() call fails. Now I can finally use my mock object to good effect. First, I'm going to change what getParamArray() returns, using $package again to save on typing:


        my (%gpa, @gpa);
        $mock->fake_module( $package,
            getParamArray => sub { push @gpa, \@_; return @gpa{qw( q bn f n d )} }
        );

Since AuthorMenu expects to receive its arguments in order, I'll create a hash where I can store them. I might use more descriptive key names, but they seem to make sense now. Next, I'll make sure that 'q' returns something controllable. In this case, that means that it supports the textfield() method and returns something sane:


        $mock->set_always( 'textfield', 'more html' );
        $gpa{q} = $mock;

I could create a new mock object for this, but since there's no name collision yet, it's not a big priority. Whether you do this is a matter of personal style.

For now, genObject() does not die, and all tests pass. Whew. Next up, I test to see whether the first argument to getParamArray() is correct.


        is( $gpa[0][0], 'query, bindNode, field, name, default',
            '... requesting the appropriate arguments' );

It is, so I'll make sure that it passes along the rest of the method arguments, minus the object itself:


        like( join(' ', @{ $gpa[0] }), qr/1 2 3$/,
                '... with the method arguments' );
        unlike( join(' ', @{ $gpa[0] }), qr/$mock/,
                '... but not the object itself' );

Only the first of these fails, and that's because I'm not passing any other arguments to the method call. I'll modify it:


        my $result = genObject( $mock, 1, 2, 3 );

This gives me nine tests that pass. I'm also following the test names fairly closely. (In between writing those and actually writing this code, a couple of days passed. Their similarities make me think I'm on the right track.)

Since the next piece of the code tries to load a bound node and I'm not passing one in, I'll test to see that getNode() is not called. Since the call is on the $DB object, I'll set it to the mock object. I'll also use the called() method to make sure that nothing happens. For that to work, I need to mock getNode(). The code to implement all of this is pretty simple. (Note that it must go in various places):


        $Everything::HTML::FormObject::AuthorMenu::DB = $mock;
        $mock->set_series( 'getNode', 0, 1, 2 );

        # genObject() calls skipped in this example...

        ok( ! $mock->called( 'getNode' ),
                '... should not fetch bound node without one' );

Two things bear more explanation. First, since I don't really know what I want getNode() to return, I'll give it a dummy series. (I'm pretty sure I'll be using set_series() on it, because I've done tests like this before. I can't explain it much beyond intuitive experience.) Second, I'm negating the return value of called() so it will fit with ok(). This can be a little hard to see, sometimes.
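
In isolation, set_series() behaves like this (next_node is a made-up method name):

        use Test::MockObject;

        my $m = Test::MockObject->new();
        $m->set_series( 'next_node', 0, 1, 2 );

        print $m->next_node(), "\n" for 1 .. 3;   # 0, then 1, then 2
        # A fourth call returns false; the series is exhausted.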

The last few tests of the first pass all revolve around the textfield() call. I've already mocked it, so now I'll see whether it was called with the correct arguments:


        my ($method, $args) = $mock->next_call();
        shift @$args;
        my %args = @$args;

        is( $method, 'textfield', '... and should create a text field' );
        is( join(' ', sort keys %args),
            join(' ', sort qw( -name -default -size -maxlength -override )),
            '... passing the essential arguments' );

Several bits here stick out. The next_call() method iterates through the call stack of mocked methods, in order. It doesn't track every method called, just the ones you've added through one of Test::MockObject's helper methods. next_call() returns the name of the method (in scalar context) or the name of the method and an anonymous array containing the arguments (in list context).
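
In isolation (hello is a made-up method name):

        use Test::MockObject;

        my $m = Test::MockObject->new();
        $m->set_always( 'hello', 'hi' );
        $m->hello( 'world' );

        my ( $method, $args ) = $m->next_call();
        # $method is 'hello'; $args is [ $m, 'world' ] -- the invocant comes first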

Since I want to check the arguments passed to textfield(), I call it in list context. Because the arguments are passed as key => value pairs, the most natural comparison seems to be as a hash. I use the join-sort idiom quite often, as I've never quite been comfortable with the list comparison functions of Test::More. This test would probably be much simpler if I used them.

I explicitly sort both arrays just so a hardcoded list order won't cause unnecessary test failures. (This has bitten me when writing code that ought to work on EBCDIC machines, not just ASCII ones. Of course, if you get Everything up and going on a mainframe, then this is probably the least of your concerns.)

Finally, I test to see whether the returned data is created properly:


        is( $result, "some html\nmore html\n",
            '... returning the parent object plus the 
             new textfield html' );

So far, all 13 of the tests succeed. At this point, I started my second pass through the method, but noticed that I hadn't yet tested that $name would get its default value from $field. I'll add 'field' to %gpa before the first pass:


        $gpa{f} = 'field';

This test ought to go before the final test in this pass, so I add it there, too. This finishes the first pass:


        is( $args{-name}, 'field',
                '... and widget name should default to field name' );

For the second pass, I will test what happens when I provide a node to which to bind the form object. In an unmocked Everything, this node comes as an argument to the function. In the tests, I must return them from the mocked getParamArray(), so that's where I will start. I also set the 'field' value in the mock object to a sentinel value I'll check for later. Since the value of $field will be 'field,' it works out nicely. (There's room to be much more creative on these names, especially if you're trying to sneak the name of a friend into your software.)


        $gpa{bn} = $mock;
        $mock->{field} = 'bound node';

Because getNode() has a series set on it and hasn't been called before, it will return 0. That means that isOfType() won't be called on the author object, and the default choice won't be modified from its undefined value. These tests are fairly easy:


        genObject( $mock );

        ($method, $args) = $mock->next_call();
        isnt( $method, 'isOfType',
            '... not checking bound node type if it is not found' );

        shift @$args;
        %args = @$args;
        is( $args{-default}, undef, '... and not modifying default selection' );

As before, next_call() comes in handy. Since I already know that textfield() will be the first (and last, for this pass) method called, I can make sure that isOfType() was not called.

Two tests follow. One ensures that the code checks the node's type. The other makes sure that the default value becomes a blank string if the type is incorrect. To make this work, I had to modify the existing getNode() series to return two instances of $mock.


        $mock->{title} = 'bound title';
        $mock->set_series( 'isOfType', 0, 1 );

        genObject( $mock );

        ($method, $args) = $mock->next_call( 2 );
        is( $method, 'isOfType',
            '... if bound node is found, should check type' );
        is( $args->[1], 'user', '... (the user type)' );

        ($method, $args) = $mock->next_call();
        shift @$args;
        %args = @$args;
        is( $args{-default}, '',
            '... setting default to blank string if it is not a user' );

The only new idea here is passing an argument to next_call(). I know getNode() is the first mocked method, so I can safely skip it. These tests all pass. The final testable condition is where the bound node exists and is a 'user' type node. The set_series() call in the last test block makes isOfType() return true for this pass:


        genObject( $mock );
        ($method, $args) = $mock->next_call( 3 );
        shift @$args;
        %args = @$args;
        is( $args{-default}, 'bound title',
            '... but using node title if it is' );

I now have 22 successful tests. My original test name plan had 14 tests, but that number generally grows as I see more logic branches. I could add more tests, making sure that default values are not overwritten, and that the essential (hardcoded) attributes of the textfield() are set, but I'm reasonably confident in the tests as they stand. If something breaks, then I'll add a test to catch the bug before I fix it, but what's left is simple enough; I doubt it will break. (Writing that is a good way to have to eat my words later.)

Testing Another Method (A Less Detailed Example)

One method remains for this Form Object: cgiVerify(). When the Engine processes input from submitted Form Object forms, it must rebuild the objects. Then, it checks the input against allowed values for the widgets. This method does just that. Its code is slightly longer, and reads:


        my ($this, $query, $name, $USER) = @_;

        my $bindNode = $this->getBindNode($query, $name);
        my $author = $query->param($name);
        my $result = {};

        if($author)
        {
                my $AUTHOR = $DB->getNode($author, 'user');

                if($AUTHOR)
                {
                        # We have an author!!  Set the CGI param so that the
                        # inherited cgiUpdate() will just do what it needs to!
                        $query->param($name, $$AUTHOR{node_id});
                }
                else
                {
                        $$result{failed} = "User '$author' does not exist!";
                }
        }

        if($bindNode)
        {
                $$result{node} = $bindNode->getId();
                $$result{failed} = "You do not have permission"
                        unless($bindNode->hasAccess($USER, 'w'));
        }

        return $result;

Rather than describing the writing of the tests (and my steps and missteps therein), I'll just comment on the tests themselves.


        my $qmock = Test::MockObject->new();
        $mock->set_series( 'getBindNode', 0, 0, 0, $mock, $mock );
        $qmock->set_series( 'param', 0, 'author', 'author' );

Because of the way this method handles things, it's clearer to create another mock object to pass in as $query. I'm also setting up the two main series used for the several passes through this method. While writing the tests, I kept adding new values to these series. This is what remained at the end. This approach makes more sense to me than setting each mock before each pass, but it's a matter of style, and either way will work.


        $result = cgiVerify( $mock, $qmock, 'boundname' );

        ($method, $args) = $mock->next_call();
        is( $method, 'getBindNode', 'cgiVerify() should get bound node' );
        is( join(' ', @$args), "$mock $qmock boundname",
                '... with query object and query parameter name' );

Here's the reason I used separate mock objects: to tell them apart as arguments in the getBindNode() call.


        ($method, $args) = $qmock->next_call();
        is( $method, 'param', '... fetching parameter' );
        is( $args->[1], 'boundname', '... by name' );

        isa_ok( $result, 'HASH',
                '... and should return a data structure which' );

The weird test name here is another of my little idioms. isa_ok() adds 'isa (reference type)' to the end of its test names, and I want them to be as clear as possible when they're displayed.


        $mock->set_series( 'getNode', 0, { node_id => 'node_id' } );
        $result = cgiVerify( $mock, $qmock, 'boundname' );

        ($method, $args) = $mock->next_call( 2 );
        is( $method, 'getNode',
                '... fetching the node, if an author is found' );
        is( join(' ', @$args), "$mock author user",
                '... with the author, for the user type' );

        is( $result->{failed}, "User 'author' does not exist!",
                '... setting a failure message on failure' );

I like the approach of joining the arguments in a string and doing an is() or a like() call on them. The benefit of like() is that you can ignore the $self passed as the first argument, because it's the mock object. I used is() here to make it more explicit what I'm expecting.


        $qmock->clear();
        $result = cgiVerify( $mock, $qmock, 'boundname' );
        ($method, $args) = $qmock->next_call( 2 );
        is( $method, 'param', '... setting parameters, on success' );
        is( join(' ', @$args), "$qmock boundname node_id",
                '... with the name and node id' );
        is( $result->{failed}, undef, '... and no failure message' );

This bit of code gave me trouble until I added the clear() call. It's worth remembering that a mock object's stack of mocked calls persists through passes. I had forgotten that, and was using the first param() call instead of the second. Oops.

Another thing worth noting is that I pass 'undef' as the expected result to is(). Conveniently, Test::More silently does the right thing.


        $mock->set_always( 'getId', 'id' );
        $mock->set_series( 'hasAccess', 0, 1 );
        $result = cgiVerify( $mock, $qmock, 'boundname', 'user' );

Here, I exercise the method's final clause. These tests are more complex than the code being tested! Sometimes, it works out that way.


        $mock->called_pos_ok( -2, 'getId',
                '... should get bound node id, if it exists' );
        is( $result->{node}, 'id',
                '... setting it in the resulting node field' );

        $mock->called_pos_ok( -1, 'hasAccess', '... checking node access' );
        is( $mock->call_args_string( -1, ' ' ), "$mock user w",
                '... for user with write permission' );

I've moved away from the next_call() approach to the older Test::MockObject behavior of using positions in the call stack. I'm still not quite pleased with the names of these methods, but the negative subscripts are handy. (Maybe I need to add prev_call().) All I have to do is remember that hasAccess() is called last, and that getId() should be called as the second-to-last method.

The other new method here is call_args_string(), which simply joins the arguments at the specified call position together. It saves a bit of typing, most of which is offset by the long method name.



        is( $result->{failed}, 'You do not have permission',
                '... setting a failure message if user lacks write permission' );

        $result = cgiVerify( $mock, $qmock, 'boundname', 'user' );
        is( $result->{failed}, undef, '... and none if the user has it' );

These final two tests demonstrate how, at the end of a long series of tests, the available options are whittled down. By the final couple of passes, I'm generally testing only one thing at a time. That always seems mathematically poetic, to me, as if I'm refining with Newton's method.

Conclusion

This whole exercise has produced 39 tests. My next step is to update the test plan with this information.


        # way back at the top
        use Test::More tests => 39;

This makes it easier to see whether too many or too few tests were run. I get better results about failures and successes this way, too. As it turns out, writing the tests for cgiVerify() took about 20 minutes, give or take some laundry-related distractions. That seems about right for 17 tests on code I haven't read in several months -- about one a minute, when you know what you're doing.

It's worth noting the features of this module that led me to consider mock objects. Mostly, the process of fetching and building nodes is complex enough that I didn't really want to hook up a fake database, just so I could go through all of the code paths required to test that an author is really an author. If the code had gone through some simple mathematical or textual manipulations, then I would probably have used black-box testing. Code that relies on a database abstraction layer (as does Everything) generally makes me reach for Test::MockObject, however.

For more information on what the module can do, please see its documentation. The current stable version is 0.08, though 0.09 will probably have been released by the time this article is stable.

chromatic is the author of Modern Perl. In his spare time, he has been working on helping novices understand stocks and investing.

This week on Perl 6 (24-30 June 2002)

Notes

Experimenting with a slightly different format this week (theft from NTKnow considered sensible ...), I'll also be looking at some of the things that've been posted on use.perl.org (and I'd really appreciate some reports on the YAPC Perl6 chats from those who were there, so I can summarize them for the next issue or as an appendix to this one).

System calls/spawning new processes

A newbie seemed somewhat confused about the purpose of the perl6-language list and asked about spawning subprocesses ... in Perl 5. Mark J Read demonstrated admirable tact and diplomacy in both pointing the newcomer at a better place to ask (perl-beginners@perl.org, a great mailing list) and providing some pointers toward an answer.

The only reason this particular episode made the summary, by the way, is its rarity. I've been impressed by the general 'on-topicness' of the perl6-* lists. I hope it stays that way for a long time to come.

Mailing list URLs omitted to protect the innocent.

Ruby iterators

Ruby iterators were the subject of Erik Steven Harrison's post, which also referred to 'pass by name' and 'the Jensen Machine,' and wanted to know 'the Perl 6 stance on the matter.' Nobody has yet stepped up to the plate on this one and, speaking personally, I'm not entirely sure I've understood the question.

http://archive.develooper.com/perl6-language@perl.org/msg10175.html has the whole question

Fun With the Perl 6 Grammar.

Sean O'Rourke asked us to ``disregard my past shoddy attempt at a Perl 6 grammar'' and presented ``a larger one that appears to capture much more of the syntax found in Apocalypses and Exegeses 1 - 4 (5 just scares me).'' He promises bugs and missing elements, but still earns hero points from me.

Later in the week, Dan asked how the Perl 6 grammar was going, and Ashley Winters responded that he isn't working on a grammar, but posted a list of variable forms (and a couple of other awkward constructs) with which he'd be testing each grammar. I particularly liked the last three entries in his list:

   ...
   # As a final sanity check, let's chain these things
   @.proc[$*PID].ps.{RSS};
   @proc.[$*PID].ps().{RSS};
   # And, just to make sure your parser doesn't pass this suite
   $foo{"Do you want to try $interpolated{"string"}?"};

Sean O'Rourke responded with some comments on the legality of some of the examples, and offered his own *@$*foo, a flattened, dereferenced, global $foo, he thinks (and I concur). (Ooh... I just had a bad thought. @foo^.(), AFAICT it's the same as map $_.(), @foo, but a good deal more evil).

There was also some talk of overriding [] on hashes and {} on arrays to do surprising things. Personally, I reckon if you're going to do perverse stuff like that, then you should add some rules to the grammar to make it legal, as well as writing the tie magic (or its perl 6 equivalent), but laziness meant I didn't post that to the list.

Oh, yes. John Porter suggested that 'maybe Damian should write [the grammar]?' Which leads me to postulate an analog to Godwin's Law, tailored to the Perl 6 process, stating that at some point in any thread about the Perl 6 language, someone will suggest that Damian do it. Or maybe Damian will do it. After all, Reports from YAPC seem to imply that he's come up with a cunning way of doing things in zero time (but I'm assuming he uses lots of space to compensate).

This thread kicks off at http://archive.develooper.com/perl6-internals@perl.org/msg10705.html.

Sean O'Rourke's parser is at http://archive.develooper.com/perl6-internals@perl.org/msg10692.html.

The Increasingly Misnamed 'Perl5 humor' Thread

Particularly, the branch that should be titled 'Porting XS modules to Parrot' rumbled on, mostly discussing how one would do it without implementing the entire Perl 5 virtual machine on top of Parrot, with a small digression about achieving flight if you waved your hands fast enough.

Dan, as usual, applied the scalpel very neatly. ``Parrot's not going to have XS as an interface -- that'd be insane,'' proposing instead an 80-20 type solution and suggested that, ``If someone wants to start a list of sensible entries in Perl 5's API that should be provided, we can work from there.'' Tim Bunce suggested starting ``with a perl script that reads extension source code ... and spits out details of what perlapi/guts have been used,'' feeding it a bunch of popular extensions, and then profiling the results.

The implementation of such a tool was left as an exercise for the interested reader.

Dave Goehrig appears to have been that interested reader, as he kicked off the 'Possibility of XS support' thread. He took a look at some sample XS-based modules and reckoned that getting a minimal core of XS up would need about 50 functions to be ported. Dan thought that would be cool (no surprise there then, it would be cool) and Dave (the star) set off implementing an extension API for Parrot. Dave also pointed at the Python and Ruby extension mechanisms as examples of good solutions to the general problem. Dan told us that he's ``trying to skate the line between exposing the semantics of the internals and the actual internals,'' and commented that the semantics of what XS does are fine, but that the syntax is horrible. He also made it clear that any XS compatibility layer would live outside the Parrot core.

See: http://archive.develooper.com/perl6-internals@perl.org/msg10672.html for the application of the scalpel.

http://archive.develooper.com/perl6-internals@perl.org/msg10671.html has the start of the 'Possibility of XS support' thread. It's all worth reading.

Stack performance

Tom Hughes posted a patch to the stack code. His test figures are impressive:

                                  No overflow     Overflow
    Integer stack, before patch    0.065505s    16.589480s
    Integer stack, after patch     0.062732s     0.068460s
    Generic stack, before patch    0.161202s     5.475367s
    Generic stack, after patch     0.166938s     0.168390s

The test programs ``push and pop 65536 times with the first column being when that loop doesn't cross a chunk boundary and the second being when it does cross a chunk boundary.''

Dan was impressed, and subject to a couple of caveats, it looks like the patch will be accepted.

http://archive.develooper.com/perl6-internals@perl.org/msg10704.html has the patch.

Small stuff

http://archive.develooper.com/perl6-internals@perl.org/msg10703.html Dan added a find_type op to make working with noncore PMC types much easier (no more modifying the assembler code every time ...) and a lookback op for inspecting the user stack (in a type-safe fashion).

http://archive.develooper.com/perl6-internals@perl.org/msg10682.html: Dan also took advantage of the new, mutable strings to write a string_append op, and doctored core.ops to use it for ``concat Sx, Sy'' where x is both source and destination. The phrase ``Order of magnitude or two'' was used, so that's probably good then.

http://archive.develooper.com/perl6-internals@perl.org/msg10702.html: Brian Wheeler gave us the ``typeof op, which returns the integer type or string type of a PMC.'' (Thanks. Applied. -- Dan)

http://archive.develooper.com/perl6-internals@perl.org/msg10701.html: Simon offers ``more extensive tests for the mul_p_p op.''

http://archive.develooper.com/perl6-internals@perl.org/msg10700.html: Eric Kidder provided a ``list of all the opcodes [...] that are not documented in docs/pdds/pdd06_pasm.pod,'' and a list of all the opcodes that are documented but not implemented.

Peter Cooper pointed to his review of ``Virtual Machine Implementation in C/C++'', available at http://makeashorterlink.com/

http://archive.develooper.com/perl6-internals@perl.org/msg10708.html Leon Brocard offers a patch to escape strings when tracing.

Leon also offers http://archive.develooper.com/perl6-internals@perl.org/msg10709.html, the latest version of his bravura mandelbrot set generator written in Parrot.

Meanwhile, Away From the Mailing Lists ...

Dan said in his use.perl.org journal that ``Damian's [YAS/Perl Foundation] grant is up now, and not coming back. Mine ends at the end of July, and barring a miracle (which means we gather more than $60,000 before the end of July), is also not being extended. A chunk of what's left is going to hire a professional grant writer, with everything after, up to $40,000, going to fund Larry.''

I'd like to extend my heartfelt thanks and best wishes to Damian for the work he's done for Perl 6. While the more visible parts of the work he's done have been jeux d'esprit like Lingua::Romana::Perligata, Acme::Bleach and the scary talks like Time::Space::Continuum or whatever it's called, it's also apparent that the work that's gone on behind the scenes to help Larry make Perl 6 be the best it can be has been unceasing, and his Exegeses have often been masterful examples of how to explain tricky concepts with clarity.

Dan's work for Perl 6 has been more visible more often. Parrot is looking good, in large part because of Dan's efforts as programmer and project manager. Personally, I hope the miracle happens; having Dan full time can only be a good thing.

http://use.perl.org/~Elian/journal/6101

Colophon

I'm not getting paid for this, and I don't particularly want to get paid for it. But if you find the summaries useful, please, please give some money to the Perl Foundation to support the ongoing design and development of Perl 6 and Parrot.

Taglib TMTOWTDI

As with many Perl systems, AxKit often provides multiple ways of doing things. Developers from other programming cultures may find these choices and freedom a bit bewildering at first, but this (hopefully) soon gives way to the realization that the options provide power and freedom. When a tool set limits your choices too much, you end up doing things like driving screws with a nailgun. Of course, too many choices isn't necessarily a good thing, but it's better than too few.

Last time, we saw how to build a weather reporting application by implementing a simple taglib module, My::WeatherTaglib, in Perl and deploying it in a pipeline with other XML filters. The pipeline approach allows one kind of flexibility: the freedom to decompose an application in the most appropriate manner for the requirements at hand and for the supporting organization.

Another kind of flexibility is the freedom to implement filters using different technologies. For instance, it is sometimes wise to build taglibs in different ways. In this article, we'll see how to build the same taglib using two other approaches. The first rebuild uses the technique implemented by the Cocoon project, LogicSheets. The second uses Jörg Walter's relatively new SimpleTaglib in place of the TaglibHelper used for My::WeatherTaglib in the previous article. SimpleTaglib is a somewhat more powerful and, oddly, more complex module than TaglibHelper (though the author intends to make it a bit simpler to use in the near future).

CHANGES

AxKit v1.6 is now out with some nice bug fixes and performance improvements, mostly by Matt Sergeant and Jörg Walter, along with several new advanced features from Kip Hampton which we'll be covering in future articles.

Matt has also updated his AxKit compatible AxPoint PowerPoint-like HTML/PDF/etc. presentation system. If you're going to attend any of the big Perl conferences this season, then you're likely to see presentations built with AxPoint. It's a nice system that's also covered in an XML.com article by Kip Hampton.

AxTraceIntermediate

The one spiffy new feature I used -- rather more often than I'd like to admit -- in writing this article is the debugging directive AxTraceIntermediate, added by Jörg Walter. This directive names a directory in which AxKit will place a copy of each intermediate document passed between filters in the pipeline. So a setting like:

    AxTraceIntermediate /home/barries/AxKit/www/axtrace

will place one file in the axtrace directory for each intermediate document. The full set of directives in httpd.conf used for this article is shown later.

Here is the axtrace directory after requesting the URIs / (from the first article), /02/weather1.xsp (from the second article), /03/weather1.xsp and /03/weather2.xsp (both from this article):

    |index.xsp.XSP         # Perl source code for /index.xsp
    |index.xsp.0           # Output of XSP filter

    |02|weather1.xsp.XSP   # Perl source code for /02/weather1.xsp
    |02|weather1.xsp.0     # Output of XSP
    |02|weather1.xsp.1     # Output of weather.xsl
    |02|weather1.xsp.2     # Output of as_html.xsl

    |03|weather1.xsp.XSP   # Perl source code for /03/weather1.xsp
    |03|weather1.xsp.0     # Output of XSP
    |03|weather1.xsp.1     # Output of weather.xsl
    |03|weather1.xsp.2     # Output of as_html.xsl

    |03|weather2.xsp.XSP   # Perl source code for /03/weather2.xsp
    |03|weather2.xsp.0     # output of my_weather_taglib.xsl
    |03|weather2.xsp.1     # Output of XSP
    |03|weather2.xsp.2     # Output of weather.xsl
    |03|weather2.xsp.3     # Output of as_html.xsl

Each filename is the path portion of the URI with the /s replaced with |s and a step number (or .XSP) appended. The numbered files are the intermediate documents and the .XSP files are the Perl source code for any XSP filters that happened to be compiled for this request. Compare the |03|weather2.xsp.* files to the pipeline diagram for the /03/weather2.xsp request.
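
The mangling is easy enough to reproduce (an illustration, not AxKit's actual code):

    my $uri = '/03/weather2.xsp';
    ( my $trace = $uri ) =~ tr{/}{|};
    print "$trace.0\n";   # |03|weather2.xsp.0 -- output of the first filter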

Watch those "|" characters: they force you to quote the filenames in most shells (and thus foil any use of wildcards):

    $ xmllint --format "www/axtrace/|03|weather2.xsp.3"
    <?xml version="1.0" standalone="yes"?>
    <html>
      <head>
        <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
        <title>My Weather Report</title>
      </head>
      <body>
        <h1><a name="title"/>My Weather Report</h1>
        <p>Hi! It's 12:43:52</p>
        <p>The weather in Pittsburgh is Sunny
        ....

NOTE: The .XSP files are only generated if the XSP sheet is recompiled, so you may need to touch the source document or restart the server to generate a new one. Another gotcha is that if an error occurs halfway down the processing pipeline, then you can end up with stale files. In this case, the lower-numbered files (those generated by successful filters) will be from this request, but the higher-numbered files will be stale, left over from the previous requests. A slightly different issue can occur when using dynamic pipeline configurations (which we'll cover in the future): you can end up with a shorter pipeline that only overwrites the lower-numbered files and leaves stale higher-numbered files around.

These are pretty minor gotchas when compared to the usefulness of this feature, you just need to be aware of them to avoid confusion. When debugging for this article, I used a Perl script that does something like:

    rm -f www/axtrace/*
    rm www/logs/*
    www/bin/apachectl stop
    sleep 1
    www/bin/apachectl start
    GET http://localhost:8080/03/weather1.xsp

to start each test run with a clean fileset.

Under the XSP Hood

Before we move on to the examples, let's take a quick peek at how XSP pages are handled by AxKit. This will help us understand the tradeoffs inherent in the different approaches.

AxKit implements XSP filters by compiling the source XSP page into a handler() function that is called to generate the output page. This is compiled into Perl bytecode, which is then run to generate the XSP output document:

XSP architecture

This means that the XSP page is not executed directly, but by running relatively efficient compiled Perl code. The bytecode is kept in memory, so the overhead of parsing and code generation is not incurred for each request.
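
Schematically, the generated code looks something like this hand-written approximation (the details are invented for illustration; real AxKit output is considerably more involved):

    use XML::LibXML;

    sub handler {
        my $doc  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
        my $root = $doc->createElement( 'data' );
        $doc->setDocumentElement( $root );

        # Static content becomes DOM-building code; <xsp:expr> snippets
        # are interpolated directly where their output belongs.
        $root->appendTextChild( 'time', scalar localtime );

        return $doc;
    }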

There are three types of Perl code used in building the output document: code to build the bits of static content, code that was present verbatim in the source document -- enclosed in tags like <xsp:logic> and <xsp:expr> -- and code that implements tags handled by registered taglib modules like My::WeatherTaglib from the last article.

Taglib modules hook into the XSP compiler by registering themselves as handlers for a namespace and then coughing up snippets of code to be compiled into the handler() routine:

XSP with Taglib Modules Hooking in

The snippets of code can call back into the taglib module or out to other modules as needed. Modules like TaglibHelper, which we used to build My::WeatherTaglib and SimpleTaglib, which we use later in this article for My::SimpleWeatherTaglib, automate the drudgery of building a taglib module so you don't need to parse XML or even (usually) generate XML.
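
To give a feel for the shape of such a module, here is a skeletal taglib patterned on the TaglibHelper conventions used for My::WeatherTaglib -- the package, namespace and tag are all invented, so treat the details as a sketch:

    package My::SketchTaglib;

    use Apache::AxKit::Language::XSP::TaglibHelper;
    sub parse_char  { Apache::AxKit::Language::XSP::TaglibHelper::parse_char(@_)  }
    sub parse_start { Apache::AxKit::Language::XSP::TaglibHelper::parse_start(@_) }
    sub parse_end   { Apache::AxKit::Language::XSP::TaglibHelper::parse_end(@_)   }

    use vars qw( $NS @EXPORT_TAGLIB );
    $NS            = 'http://example.com/NS/sketch/';  # namespace mapped to this module
    @EXPORT_TAGLIB = ( 'greeting($name)' );            # <sketch:greeting> calls greeting()

    use strict;

    # Called from the compiled handler(); the return value is serialized
    # into the output document where the tag appeared.
    sub greeting {
        my ( $name ) = @_;
        return "Hello, $name!";
    }

    1;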

You can view the source code that AxKit generates by cranking the AxDebugLevel up to 10 (which places the code in Apache's ErrorLog) or by using the AxTraceIntermediate directive mentioned above. Then you must persuade AxKit to recompile the XSP page by restarting the server and requesting a page. If either of the necessary directives is already present in a running server, then simply touching the file to update its modification time will suffice.

This can be useful for getting a really good feel for what's going on under the hood. I encourage new taglib authors to do this to see how the code for their taglibs is actually executed. You'll end up needing to do it to debug anyway (trust me :).

LogicSheets: Upstream Taglibs

AxKit uses a pipeline processing model, and XSP includes tags like <xsp:logic> and <xsp:expr> that allow you to embed Perl code in an XSP page. This allows taglibs to be implemented as XML filters that are placed upstream of the XSP processor. These usually use XSLT to convert taglib invocations into inline code built from XSP tags:

Upstream LogicSheets feeding the XSP processor

In fact, this is how XSP was originally designed to operate and Cocoon uses this approach exclusively to this day (but with inline Java instead of Perl). I did not show this approach in the first article because it is considerably more awkward and less flexible than the taglib module approach offered by AxKit.

The Cocoon project calls XSLT sheets that implement taglibs "LogicSheets", a convention I follow in this article (I refer to the all-Perl taglib implementation as "taglib modules").

weather2.xsp

Before we look at the logicsheet version of the weather report taglib, here is the XSP page from the last article updated to use it:


<?xml-stylesheet href="my_weather_taglib.xsl" type="text/xsl"?>
<?xml-stylesheet href="NULL"                  type="application/x-xsp"?>
<?xml-stylesheet href="weather.xsl"           type="text/xsl"?>
<?xml-stylesheet href="as_html.xsl"           type="text/xsl"?>

<xsp:page
    xmlns:xsp="http://apache.org/xsp/core/v1"
    xmlns:util="http://apache.org/xsp/util/v1"
    xmlns:param="http://axkit.org/NS/xsp/param/v1"
    xmlns:weather="http://slaysys.com/axkit_articles/weather/"
>
<data>
  <title><a name="title"/>My Weather Report</title>
  <time>
    <util:time format="%H:%M:%S" />
  </time>
  <weather>
    <weather:report>
      <!-- Get the ?zip=12345 from the URI and pass it
           to the weather:report tag as a parameter -->
      <weather:zip><param:zip/></weather:zip>
    </weather:report>
  </weather>
</data>
</xsp:page>

The <?xml-stylesheet href="my_weather_taglib.xsl" type="text/xsl"?> processing instruction causes my_weather_taglib.xsl (which we'll cover next) to be applied to the weather2.xsp page before the XSP processor sees it. The other three PIs are identical to the previous version: the XSP processor is invoked, followed by the same presentation and HTMLification XSLT stylesheets that we used last time.

The only other change from the previous version is that this one uses the correct URI for XSP tags. I accidentally used a deprecated URI for XSP tags in the previous article and ended up tripping over it when I used the up-to-date URI in the LogicSheet for this one. Such is the life of a pointy-brackets geek.

The ability to switch implementations without altering (much) code is one of XSP's advantages over things like inline Perl code: the implementation is nicely decoupled from the API (the tags). The only reason we had to alter weather1.xsp at all is that we're switching from a more advanced approach (a taglib module, My::WeatherTaglib) that is configured in the httpd.conf file to LogicSheets, which need per-document configuration when using <?xml-stylesheet?> processing instructions. AxKit has more flexible httpd.conf-, plugin-, and Perl-based stylesheet specification mechanisms, which we will cover in a future article; I'm using the processing instructions here because they are simple and obvious.

The pipeline built by the processing instructions looks like:

The pipeline for weather2.xsp

(The final compression stage is not shown.)

my_weather_taglib.xsl

Now that we've seen the source document and the overall pipeline, here is My::WeatherTaglib recast as a LogicSheet, my_weather_taglib.xsl:


<xsl:stylesheet 
  version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xsp="http://apache.org/xsp/core/v1"
  xmlns:weather="http://slaysys.com/axkit_articles/weather/"
>

<xsl:output indent="yes" />

<xsl:template match="xsp:page">
  <xsl:copy>
    <xsp:structure>
      use Geo::Weather;
    </xsp:structure>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="weather:report">
  <xsp:logic>
    my $zip = <xsl:apply-templates select="weather:zip/*" />;
    my $w = Geo::Weather->new->get_weather( $zip );
    die "Could not get weather for zipcode '$zip'\n" unless ref $w;
  </xsp:logic>
  <state><xsp:expr>$w->{state}</xsp:expr></state>
  <heat><xsp:expr>$w->{heat}</xsp:expr></heat>
  <page><xsp:expr>$w->{page}</xsp:expr></page>
  <wind><xsp:expr>$w->{wind}</xsp:expr></wind>
  <city><xsp:expr>$w->{city}</xsp:expr></city>
  <cond><xsp:expr>$w->{cond}</xsp:expr></cond>
  <temp><xsp:expr>$w->{temp}</xsp:expr></temp>
  <uv><xsp:expr>$w->{uv}</xsp:expr></uv>
  <visb><xsp:expr>$w->{visb}</xsp:expr></visb>
  <url><xsp:expr>$w->{url}</xsp:expr></url>
  <dewp><xsp:expr>$w->{dewp}</xsp:expr></dewp>
  <zip><xsp:expr>$w->{zip}</xsp:expr></zip>
  <baro><xsp:expr>$w->{baro}</xsp:expr></baro>
  <pic><xsp:expr>$w->{pic}</xsp:expr></pic>
  <humi><xsp:expr>$w->{humi}</xsp:expr></humi>
</xsl:template>

<xsl:template match="@*|node()">
  <!-- Copy the rest of the doc almost verbatim -->
  <xsl:copy>
    <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

The first <xsl:template> inserts an <xsp:structure> at the top of the page containing the Perl statement use Geo::Weather; so that the Perl code in the later <xsp:logic> element can refer to the module. You could also preload Geo::Weather in httpd.conf to share it amongst httpd processes and simplify this stylesheet, but that would introduce a bit of a maintenance hassle: keeping the server config and the LogicSheet in sync.

The second <xsl:template> replaces all occurrences of <weather:report> (assuming the weather: prefix happens to map to the taglib URI; see James Clark's introduction to namespaces for more details). In place of the <weather:report> tag(s) will be some Perl code surrounded by <xsp:logic> and <xsp:expr> tags. The <xsp:logic> tag is used around Perl code that is just logic: any value the code returns is ignored. The <xsp:expr> tags surround Perl code that returns a value to be emitted as text in the result document.
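
To make the logic/expr distinction concrete, here is a toy Perl contrast (hypothetical, and much simpler than AxKit's generated code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $w   = { temp => 77 };    # stand-in for the weather hash
    my $out = '';

    # <xsp:logic> ... </xsp:logic>: statements run for their side
    # effects; any value they produce is ignored.
    my $scale = 'F';

    # <xsp:expr> ... </xsp:expr>: the expression's value is captured
    # and emitted as text in the result document.
    $out .= '<temp>' . ( $w->{temp} . $scale ) . '</temp>';

    print "$out\n";    # prints: <temp>77F</temp>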

The get_weather() call returns a hash describing the most recent weather observations somewhere close to a given zip code:

    {
      'city'  => 'Pittsburgh',
      'state' => 'PA',
      'cond'  => 'Sunny',
      'temp'  => '77',
      ...
    };

All those <xsp:expr> tags extract the values from the hash one by one and build an XML data structure. The resulting XSP document looks like:

    <?xml version="1.0"?>
    <xsp:page
        xmlns:xsp="http://apache.org/xsp/core/v1"
        xmlns:util="http://apache.org/xsp/util/v1"
        xmlns:param="http://axkit.org/NS/xsp/param/v1"
        xmlns:weather="http://slaysys.com/axkit_articles/weather/"
    >
      <xsp:structure>
        use Geo::Weather;
      </xsp:structure>
      <data>
        <title><a name="title"/>My Weather Report</title>
        <time>
          <util:time format="%H:%M:%S"/>
        </time>
        <weather>
          <xsp:logic>
            my $zip = <param:zip/>;
            my $w = Geo::Weather->new->get_weather( $zip );
            die "Could not get weather for zipcode '$zip'\n" unless ref $w;
          </xsp:logic>
          <state><xsp:expr>$w->{state}</xsp:expr></state>
          <heat><xsp:expr>$w->{heat}</xsp:expr></heat>
          <page><xsp:expr>$w->{page}</xsp:expr></page>
          <wind><xsp:expr>$w->{wind}</xsp:expr></wind>
          <city><xsp:expr>$w->{city}</xsp:expr></city>
          <cond><xsp:expr>$w->{cond}</xsp:expr></cond>
          <temp><xsp:expr>$w->{temp}</xsp:expr></temp>
          <uv><xsp:expr>$w->{uv}</xsp:expr></uv>
          <visb><xsp:expr>$w->{visb}</xsp:expr></visb>
          <url><xsp:expr>$w->{url}</xsp:expr></url>
          <dewp><xsp:expr>$w->{dewp}</xsp:expr></dewp>
          <zip><xsp:expr>$w->{zip}</xsp:expr></zip>
          <baro><xsp:expr>$w->{baro}</xsp:expr></baro>
          <pic><xsp:expr>$w->{pic}</xsp:expr></pic>
          <humi><xsp:expr>$w->{humi}</xsp:expr></humi>
        </weather>
      </data>
    </xsp:page>

and the output document of that XSP page looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
      <title><a name="title"/>My Weather Report</title>
      <time>17:06:15</time>
      <weather>
        <state>PA</state>
        <heat>77</heat>
        <page>/search/search?what=WeatherLocalUndeclared
        &amp;where=15206</page>
        <wind>From the Northwest at 9 gusting to 16</wind>
        <city>Pittsburgh</city>
        <cond>Sunny</cond>
        <temp>77</temp>
        <uv>4</uv>
        <visb>Unlimited miles</visb>
        <url>http://www.weather.com/search/search?
        what=WeatherLocalUndeclared&amp;where=15206</url>
        <dewp>59</dewp>
        <zip>15206</zip>
        <baro>29.97 inches and steady</baro>
        <pic>http://image.weather.com/web/common/wxicons/52/30.gif</pic>
        <humi>54%</humi>
      </weather>
    </data>

LogicSheet Advantages

  • One taglib can generate XML that calls another taglib. Taglib modules may call each other at the Perl level, but they are XSP compiler plugins and do not cascade: the XSP compiler lives in a pipeline environment but does not use a pipeline internally.
  • No need to add an AxAddXSPTaglib directive and restart the Web server each time you write a tag lib.

Restarting a Web server just because a taglib has changed can be awkward in some environments, but those environments seem to be rare; restarting an Apache server is usually quick enough in development, and it should not be necessary very often in production.

In the Cocoon community, LogicSheets can be registered and shared somewhat like the Perl community uses CPAN to share modules. This is an additional benefit when Cocooning, but it does not carry much weight in the Perl world, which already has CPAN (there are many taglib modules on CPAN). There is no Java equivalent to CPAN in wide use, so Cocoon LogicSheets need their own mechanism.

LogicSheet Disadvantages

There are two fundamental drawbacks to LogicSheets, each with several symptoms. Many of the symptoms are minor, but they add up:

  1. Requires inline code, usually in an XSLT stylesheet.
    • Putting Perl code in XML is awkward: You can't easily syntax check the code (I happen to like to run perl -cw ThisFile.pm a lot while writing Perl code) or take advantage of language-oriented editor features such as autoindenting, tags and syntax highlighting.
    • The taglib author needs to work in four languages/APIs: XSLT (typically), XSP, Perl, and the taglib under development. XSLT and Perl are far from trivial, and though XSP is pretty simple, it's easy to trip yourself up when context switching between them.
    • LogicSheets are far less flexible than taglib modules. For instance, compare the rigidity of my_weather_taglib.xsl's output structure with that of My::WeatherTaglib or My::SimpleWeatherTaglib. The LogicSheet approach requires hardcoding the result values, while the two taglib modules simply convert whatever is in the weather report data structure to XML.
    • XSLT requires a fair amount of extra boilerplate to copy the non-taglib parts of XSP pages through. This can usually be set up once, but boilerplate in a program is just another thing to get in the way and require maintenance.
    • LogicSheets are inherently single-purpose. Taglib modules, on the other hand, can be used as regular Perl modules. An authentication module can be used both as a taglib and as a regular module, for instance.
    • LogicSheets need a working Web server for even the most basic functional testing, since they must run in an XSP environment and AxKit does not yet support XSP outside a Web server. Taglib modules, by contrast, allow simple test suites to vet the taglib's code without a running server.
    • Writing LogicSheets works best in an XML editor; otherwise, you'll need to escape all your < characters, at least, and reading and writing XML-escaped Perl and Java code can be irksome.
    • Embracing and extending a LogicSheet is difficult: the source XSP page needs to be aware that the taglib it's using builds on the base taglib, and it must declare both of their namespaces. With taglib modules, Perl's standard function import mechanism can be used to relieve XSP authors of this duty.
  2. Requires an additional stylesheet to process, usually XSLT. This means:
    • A more complex processing chain, which leads to XSP page complexity (and thus more likelihood of bugs) because each page must declare both the namespace for the taglib tags and a processing instruction to run the taglib. As an example of a gotcha in this area, I used an outdated version of the XSP namespace URI in weather2.xsp and the current URI in my_weather_taglib.xsl. This caused me a bit of confusion, but the AxTraceIntermediate directive helped shed some light on it.
    • More disk files to check for changes each time an XSP page is served. Since each LogicSheet affects the output, each LogicSheet must be stat()ed to see if it has changed since the last time the XSP page was compiled.

As you can probably tell, I feel that LogicSheets are a far more awkward and less flexible approach than writing taglibs as Perl modules using one of the helper libraries. Still, using upstream LogicSheets is a valid and perhaps occasionally useful technique for writing AxKit taglibs.

What Are Upstream Filters Good For?

So what is XSLT upstream of an XSP processor good for? You can do many things with it other than implementing LogicSheets. One use is to implement branding: altering things like logos, site names, and perhaps colors, or other customizations, such as the administrator's mail address on a login page that is shared by several sub-sites.

A key advantage of doing transformations upstream of the XSP processor is that the XSP processor caches the results of upstream transformations. XSP converts whatever document it receives into Perl bytecode in memory and then just runs that bytecode if none of the upstream documents have changed.
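
Here is a rough sketch of the freshness test this implies. The function and names are mine, not AxKit's, and the real cache logic is more involved:

    use strict;
    use warnings;

    # Recompile only if any input file changed since the cached compile.
    sub needs_recompile {
        my ( $compiled_at, @sources ) = @_;    # epoch seconds, file names
        for my $file (@sources) {
            my @st = stat $file
                or return 1;    # missing or unreadable: recompile
            return 1 if $st[9] > $compiled_at;    # element 9 is mtime
        }
        return 0;
    }

    # The XSP page plus every upstream LogicSheet must be checked:
    my $last_compile_time = time - 60;    # hypothetical cached timestamp
    print "recompile needed\n"
        if needs_recompile( $last_compile_time,
                            'weather2.xsp', 'my_weather_taglib.xsl' );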

Another use is to convert source documents that declare what should be on a page to XSP documents that implement the machinery of a page. For instance, a survey site might have the source documents declare what questions to ask:

    <survey>
      <question>
        <text>Have you ever eaten a Balut</text>
        <response>Yes</response>
        <response>No</response>
        <response>Eeeewww</response>
      </question>
      <question>
        <text>Ok, then, well how about a nice haggis</text>
        <response>Yes</response>
        <response>No</response>
        <response>Now that's more like it!</response>
      </question>
      ...
    </survey>

XSLT can be used to transform the survey definition into an XSP page that uses the PerForm taglib to automate form filling, etc. This approach allows pages to be defined in terms of what they are instead of how they should work.

You can also use XSLT upstream of the XSP processor to do other things, like translating from a limited or simpler domain-specific tagset to a more complex or general-purpose taglib written as a taglib module. This can allow you to define taglibs that are easier to use in terms of more powerful (but scary!) taglibs that are loaded into the XSP processor.

My::SimpleWeatherTaglib

A new-ish taglib helper module has been bundled with recent AxKit releases: Jörg Walter's SimpleTaglib (the full module name is Apache::AxKit::Language::XSP::SimpleTaglib). This module performs roughly the same function as Steve Willer's TaglibHelper, but it supports namespaces and uses a relatively new Perl feature, subroutine attributes, to specify the parameters and result formatting instead of a string.

Here is My::SimpleWeatherTaglib:

    package My::SimpleWeatherTaglib;

    use Apache::AxKit::Language::XSP::SimpleTaglib;

    $NS = "http://slaysys.com/axkit_articles/weather/";

    package My::SimpleWeatherTaglib::Handlers;

    use strict;
    require Geo::Weather;

    ## Return the whole report for fixup later in the processing pipeline
    sub report :  child(zip) struct({}) {
        return 'Geo::Weather->new->get_weather( $attr_zip );'
    }

    1;

The $NS variable defines the namespace for this taglib. This module uses the same namespace as my_weather_taglib.xsl and My::WeatherTaglib because all three implement the same taglib (this repetitiveness is to demonstrate the differences between the approaches). See the Mixing and Matching Taglibs section to see how My::WeatherTaglib and My::SimpleWeatherTaglib can both be used in the same server instance.

My::SimpleWeatherTaglib then shifts gears into a new package, My::SimpleWeatherTaglib::Handlers, to define the subroutines for the taglib tags. Using a virgin package like this provides a clean place in which to declare the tag handlers. SimpleTaglib looks for the handlers in the Foo::Handlers package when it is use()d in the Foo package (don't use require for this!).

My::SimpleWeatherTaglib requires Geo::Weather and declares a single tag, which handles the <weather:report> tag in weather1.xsp (which we'll show in a moment).

The require Geo::Weather; instead of use Geo::Weather; is to avoid importing subroutines into our ...::Handlers namespace, where they might look like tag handlers.

There's something new afoot in the declaration of sub report: subroutine attributes. Subroutine attributes are a newer Perl feature (as of perl5.6) that lets us hang additional little bits of information on a subroutine declaration to describe it further; see perldoc perlsub for the details of the syntax. Some attributes are predefined by Perl, but modules may define others for their own purposes. In this case, the SimpleTaglib module defines a handful of attributes, some of which describe what parameters the taglib tag can take and others which describe how to convert the result value from the taglib implementation into XML output.
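
For the curious, here is a minimal, self-contained sketch of the underlying Perl mechanism (this is not SimpleTaglib's actual code; the package and variable names are invented). A package can define MODIFY_CODE_ATTRIBUTES to be told, at compile time, which attributes were attached to its subs; see perldoc attributes:

    package My::Attrs;
    use strict;
    use warnings;

    my %attrs_for;    # stringified coderef => list of attribute strings

    # Perl calls this at compile time for each attributed sub in this
    # package; returning an empty list means "all attributes recognized".
    sub MODIFY_CODE_ATTRIBUTES {
        my ( $package, $coderef, @attrs ) = @_;
        $attrs_for{$coderef} = \@attrs;
        return ();
    }

    sub report : child(zip) struct({}) { 'dummy body' }

    # SimpleTaglib uses (roughly) this hook to learn that report()
    # expects a <zip> child and that its result is emitted as a struct.
    my $attrs = $attrs_for{ \&report };
    print "report() attributes: @$attrs\n";

    1;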

The child(zip) subroutine attribute tells the SimpleTaglib module that this handler expects a single child element named zip in the taglib's namespace. In weather1.xsp, this ends up looking like:


    <weather:report>
      <!-- Get the ?zip=12345 from the URI and pass it
           to the weather:report tag as a parameter -->
      <weather:zip><param:zip/></weather:zip>
    </weather:report>

The text from the <weather:zip> element (which will be filled in from the URI query string using the param: taglib) will be made available in a variable named $attr_zip at request time. The fact that the text from a child element shows up in a variable beginning with $attr_ is confusing, but it does actually work that way.

The struct({}) attribute specifies that the result of this tag will be returned as a Perl data structure to be converted into XML. Geo::Weather->new->get_weather( $zip ) returns a hash reference that looks like:

    {
      'city'  => 'Pittsburgh',
      'state' => 'PA',
      'cond'  => 'Sunny',
      'temp'  => '77',
      ...
    };

The struct attribute tells SimpleTaglib to turn this into XML like:


    <city>Pittsburgh</city>
    <state>PA</state>
    <cond>Sunny</cond>
    <temp>77</temp>
    ....

The {} in the struct({}) attribute specifies that the result nodes should not be in a namespace (and thus have no namespace prefix), just like the static portions of our weather1.xsp document. This is one of the advantages that SimpleTaglib has over other methods: it's easier to emit nodes in different namespaces. To emit nodes in a specific namespace, put that namespace's URI inside the curlies: struct({http://my.namespace.com/foo/bar}). The {...} notation is referred to as James Clark (or jclark) notation.
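
To visualize the conversion struct performs, here is a toy stand-in. It is purely illustrative: the names are mine, and SimpleTaglib's real implementation builds proper XML nodes and handles namespaces and nesting rather than concatenating strings:

    #!/usr/bin/perl
    use strict;
    use warnings;

    sub escape_text {
        my ($s) = @_;
        $s =~ s/&/&amp;/g;
        $s =~ s/</&lt;/g;
        $s =~ s/>/&gt;/g;
        return $s;
    }

    # Flat hash in, one element per key out -- the gist of struct({}).
    sub hash_to_elements {
        my ($h) = @_;
        return join '',
            map { "<$_>" . escape_text( $h->{$_} ) . "</$_>\n" }
            sort keys %$h;
    }

    print hash_to_elements(
        { city => 'Pittsburgh', state => 'PA', temp => 77 }
    );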

Now, the tricky bit. Harkening back to our discussion of how XSP is implemented, remember that the XSP processor compiles the XSP document into Perl code that is executed to build the output document. As XSP compiles the page, it keeps a lookout for tags in namespaces handled by taglib modules that have been configured with AxAddXSPTaglib. When XSP sees one of these tags, it calls into the taglib module--My::SimpleWeatherTaglib here--and requests a chunk of Perl source code to compile in place of the tag.

Taglibs implemented with the SimpleTaglib module covered here declare handlers for each taglib tag (sub report, for instance). That handler subroutine is called at parse time, not at request time. Its job is to return the chunk of code that will be compiled and then run later, at request time, to generate the output. So report() returns a string containing a snippet of Perl code that calls into Geo::Weather. This Perl code will be compiled once, then run for each request.

This is a key difference between the TaglibHelper module that My::WeatherTaglib used in the previous article and the SimpleTaglib module used here. SimpleTaglib calls My::SimpleWeatherTaglib's report() subroutine at compile time whereas TaglibHelper quietly, automatically arranges to call My::WeatherTaglib's report() subroutine at request time.

This difference makes SimpleTaglib not so simple unless you are used to writing code that generates code to be compiled and run later. On the other hand, "Programs that write programs are the happiest programs in the world" (Andrew Hume, according to a few places on the net). That is true here because we can return whatever code is appropriate for the task at hand. In this case, the code is so simple that we can return it directly. If the work to be done were more complicated, then we could instead return a call to a subroutine of our own devising, as sketched below. So, while a good deal less simple than the approach taken by TaglibHelper, this approach does offer a bit more flexibility.
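
For instance, a hypothetical variant of My::SimpleWeatherTaglib could keep the generated snippet trivial and push the real work into an ordinary, testable subroutine. fetch_report() is my invention, not part of the code above; it lives outside the ::Handlers package so it cannot be mistaken for a tag handler:

    package My::SimpleWeatherTaglib;

    use Apache::AxKit::Language::XSP::SimpleTaglib;

    $NS = "http://slaysys.com/axkit_articles/weather/";

    # Plain Perl helper: easy to unit test without a Web server.
    sub fetch_report {
        my ($zip) = @_;
        require Geo::Weather;
        my $w = Geo::Weather->new->get_weather($zip);
        die "Could not get weather for zipcode '$zip'\n" unless ref $w;
        return $w;
    }

    package My::SimpleWeatherTaglib::Handlers;

    use strict;

    # Parse-time handler: the returned string is compiled into the XSP
    # page's handler() and runs at request time.
    sub report : child(zip) struct({}) {
        return 'My::SimpleWeatherTaglib::fetch_report( $attr_zip );';
    }

    1;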

SimpleTaglib's author does promise that a future version will offer a "call this subroutine at request time" API, which I (and I suspect most others) would prefer most of the time.

I will warn you that the documentation for SimpleTaglib does not stand on its own, so you need to have the source code for an example module or two to put it all together. Beyond the overly simple example presented here, the documentation refers you to a couple of others. Mind you, I'm casting stones while in my glass house here, because nobody has ever accused me of fully documenting my own modules.

For reference, here is the weather1.xsp from the previous article, which we are reusing verbatim for this example:

    <?xml-stylesheet href="NULL"        type="application/x-xsp"?>
    <?xml-stylesheet href="weather.xsl" type="text/xsl"?>
    <?xml-stylesheet href="as_html.xsl" type="text/xsl"?>

    <xsp:page
        xmlns:xsp="http://www.apache.org/1999/XSP/Core"
        xmlns:util="http://apache.org/xsp/util/v1"
        xmlns:param="http://axkit.org/NS/xsp/param/v1"
        xmlns:weather="http://slaysys.com/axkit_articles/weather/"
    >
    <data>
      <title><a name="title"/>My Weather Report</title>
      <time>
        <util:time format="%H:%M:%S" />
      </time>
      <weather>
        <weather:report>
          <!-- Get the ?zip=12345 from the URI and pass it
               to the weather:report tag as a parameter -->
          <weather:zip><param:zip/></weather:zip>
        </weather:report>
      </weather>
    </data>
    </xsp:page>

The processing pipeline and intermediate files are also identical to those from the previous article, so we won't repeat them here.

Mixing and Matching Taglibs using httpd.conf

As detailed in the first article in this series, AxKit integrates tightly with Apache and Apache's configuration engine. Apache allows different files and directories to have different configurations applied, including which taglibs are used. In the real world, for instance, it is sometimes necessary to have part of a site use a new version of a taglib that might break an older portion.

In the server I used to build the examples for this article, for instance, the 02/ directory still uses My::WeatherTaglib from the last article, while the 03/ directory uses my_weather_taglib.xsl for one of this article's examples and My::SimpleWeatherTaglib for the other. This is done by combining Apache's <Directory> sections with the AxAddXSPTaglib directive:

    ##
    ## Init the httpd to use our "private install" libraries
    ##
    PerlRequire startup.pl

    ##
    ## AxKit Configuration
    ##
    PerlModule AxKit

    <Directory "/home/me/htdocs">
        Options -All +Indexes +FollowSymLinks

        # Tell mod_dir to translate / to /index.xml or /index.xsp
        DirectoryIndex index.xml index.xsp
        AddHandler axkit .xml .xsp

        AxDebugLevel 10

        AxTraceIntermediate /home/me/axtrace

        AxGzipOutput Off

        AxAddXSPTaglib AxKit::XSP::Util
        AxAddXSPTaglib AxKit::XSP::Param

        AxAddStyleMap application/x-xsp \
                      Apache::AxKit::Language::XSP

        AxAddStyleMap text/xsl \
                      Apache::AxKit::Language::LibXSLT
    </Directory>

    <Directory "/home/me/htdocs/02">
        AxAddXSPTaglib My::WeatherTaglib
    </Directory>

    <Directory "/home/me/htdocs/03">
        AxAddXSPTaglib My::SimpleWeatherTaglib
    </Directory>

See "How Directory, Location and Files sections work" in the Apache httpd documentation (v1.3 or 2.0) for the details of how to use <Directory> and other httpd.conf sections to do this sort of thing.

Help and thanks

Jörg Walter and Matt Sergeant were of great help in writing this article, especially since I don't do LogicSheets. Jörg also wrote the SimpleTaglib module and the AxTraceIntermediate feature, and fixed a bug in absolutely no time.

In case of trouble, have a look at some of the helpful resources we listed in the first article.

Copyright 2002, Robert Barrie Slaymaker, Jr. All Rights Reserved.
