Don't Be Afraid to Drop the SOAP

SOAP has great hype; portable, simple, efficient, flexible, and open, SOAP has it all. According to many intelligent people, writing a web service with SOAP should be a snap, and the results will speak for themselves. So they do, although what they have to say isn't pretty.

Two years ago I added a SOAP interface to the Bricolage open source content management system. I had high expectations. SOAP would give me a flexible and efficient control system, one that would be easy to develop and simple to debug. What's more, I'd be out on the leading edge of cool XML tech.

Unfortunately the results haven't lived up to my hopes. The end result is fragile and a real resource hog. In this article I'll explore what went wrong and why.

Last year, I led the development of a new content-management system called Krang, and I cut SOAP out of the mix. Instead, I created a custom XML file format based on TAR. Performance is up, development costs are down, and debugging is a breeze. I'll describe this system in detail at the end of the article.

What is SOAP?

In case you've been out to lunch, SOAP (Simple Object Access Protocol) is a relatively new RPC (Remote Procedure Call) system that works by exchanging XML messages over a network connection, usually over HTTP. In an RPC system, a server offers routines (procedures) that clients may call over a network connection. SOAP surpasses its direct predecessor, XML-RPC, with an enhanced type system and an improved error-handling system. Despite the name, SOAP is neither particularly simple nor object-oriented.
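
To make that concrete, here's roughly what a SOAP call looks like from Perl using SOAP::Lite, the standard Perl SOAP toolkit of the day. This is a minimal sketch, not code from Bricolage; the endpoint, namespace, and method name are hypothetical:

    use SOAP::Lite;

    # Hypothetical endpoint and namespace -- not a real server.
    my $soap = SOAP::Lite
        ->uri('http://example.com/Namespace')
        ->proxy('http://example.com/soap');

    # Call a remote procedure as though it were local; the arguments
    # and the result travel as XML envelopes over HTTP.
    my $som = $soap->list_ids( SOAP::Data->name( title => '%foo%' ) );
    die $som->faultstring if $som->fault;
    print join( ", ", @{ $som->result } ), "\n";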

Bricolage Gets SOAP

When I joined the Bricolage project, it lacked a good way to control the application aside from the browser-based GUI. In particular, we needed a way to import data and trigger publish runs. Bricolage is a network application, and some useful tasks require interaction with multiple Bricolage servers. SOAP seemed like an obvious choice. I read "Programming Web Services with Perl" and I was ready to go.

I implemented the Bricolage SOAP interface as a set of classes that map SOAP requests to method calls on the underlying objects, with some glue code to handle XML serialization and deserialization. I used XML Schema to describe an XML vocabulary for each object type, which we used to validate input and output for the SOAP methods during testing.

By far the most important use-case for this new system was data import. Many of our customers were already using content management systems (CMSs) and we needed to move their data into Bricolage. A typical migration involved processing a database dump from the client's old system and producing XML files to load in Bricolage via SOAP requests.

The SOAP interface could also move content from one system to another, most commonly when moving completed template changes into production. Finally, SOAP helped to automate publish runs and other system maintenance tasks.

To provide a user interface to the SOAP system, I wrote a command-line client called bric_soap. The bric_soap script is a sort of Swiss Army knife for the Bricolage SOAP interface; it can call any available method and pipe the results from command to command. For example, to find and export all the story objects with the word foo in their title:

$ bric_soap story list_ids --search "title=%foo%" |
	bric_soap story export - > stories.xml

Later we wrote several single-purpose SOAP clients, including bric_republish for republishing stories and bric_dev_sync for moving templates and elements between systems.

What Went Right

  • The well-documented XML format for Bricolage objects made developing data import systems straightforward. Compared to previous projects that attempted direct-to-SQL imports, the added layer of abstraction and validation was an advantage.
  • The interface offered by the Bricolage SOAP classes is simpler and more regular than the underlying Bricolage object APIs. This, coupled with the versatile bric_soap client, allowed developers to easily script complex automations.

What Went Wrong

  • SOAP is difficult to debug. The SOAP message format is verbose even by XML standards, and decoding it by hand is a great way to waste an afternoon. As a result, development took almost twice as long as anticipated.
  • The fact that all requests happened live over the network further hampered debugging. Unless the user was careful to log debugging output to a file it was difficult to determine what went wrong.
  • SOAP doesn't handle large amounts of data well. This became immediately apparent as we tried to load a large data import in a single request. Since SOAP requires the entire request to travel in one XML document, SOAP implementations usually load the entire request into memory. This required us to split large jobs into multiple requests, reducing performance and making it impossible to run a complete import inside a transaction.
  • SOAP, like all network services, requires authentication to be safe against remote attack. This means that each call to bric_soap required at least two SOAP requests: one to log in and receive a cookie, and a second to call the requested method. Since the overhead of a SOAP request is sizable, this further slowed things down. Later we added a way to save the cookie between requests, which helped considerably.
  • Network problems affected operations that needed to access multiple machines, such as bric_dev_sync, the program responsible for moving templates and elements. Requests would frequently time out in the middle, sometimes leaving the target system in an inconsistent state.
  • At the time, there was no good Perl solution for validating object XML against an XML Schema at runtime. For testing purposes I hacked together a harness around a command-line validator built on Xerces-C++. Although this isn't a deficiency in SOAP itself, the lack of runtime validation let bad data pass through the SOAP interface and end up in the database, where we often had to perform manual cleanup.

Round Two: Krang

When I started development on Krang, our new content management system, I wanted to find a better way to meet our data import and automation needs. After searching in vain for better SOAP techniques, I realized that the problems were largely inherent in SOAP itself. SOAP is a network system tuned for small messages, and it carries complexity that resists easy debugging.

On the other hand, when I considered the XML aspects of the Bricolage system, I found little to dislike. XML is easy to understand and is sufficiently flexible to represent all the data handled by the system. In particular, I wanted to reuse my hard-won XML Schema writing skills, although I knew that I'd need runtime validation.

In designing the new system I took a big step back from the leading edge. I based the new system on the TAR archive file format, which dates back to the mid-70s!


Figure 1.

I named the file format "Krang Data Set" (KDS). A KDS file is a TAR archive containing a set of XML files. A special file, index.xml, contains data about all the files contained in the KDS file, providing class names and IDs. To reduce their size, it's possible to compress KDS files using gzip.
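
Since a KDS file is just a (possibly gzipped) TAR archive, any TAR-aware tool can open one. As a minimal sketch, here's how you might peek inside a KDS file with Perl's Archive::Tar (the file name is hypothetical):

    use Archive::Tar;

    # Archive::Tar reads gzip-compressed archives transparently.
    my $tar = Archive::Tar->new('templates.kds')
        or die Archive::Tar->error;

    # index.xml describes the objects in the archive...
    print $tar->get_content('index.xml');

    # ...and each object lives in its own XML file.
    print "$_\n" for $tar->list_files;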

I wrote two scripts, krang_import and krang_export, to read and write KDS files. Each object type has its own XML Schema document describing its structure. Krang classes implement their own deserialize_xml() and serialize_xml() methods. For example, to export all templates into a file called templates.kds:

$ krang_export --templates --output templates.kds

To import those templates, possibly on a different machine:

$ krang_import templates.kds

If the object being exported has any dependencies, the KDS file will include them. In this way a KDS file generated by krang_export is guaranteed to import successfully.

By using a disk-based system for importing and exporting data I cut the network completely out of the picture. This alone accomplishes a major reduction in complexity and a sizable performance increase. Recently we completed a very large import into Krang comprising 12,000 stories and 160,000 images. This took around 4 hours to complete, which may seem like a long time, but it's a big improvement over the 28 hours the same import required using SOAP and Bricolage!

For system automation such as running publish jobs from cron, I decided to code utilities directly to Krang's Perl API. This means these tools must run on the target machine, but in practice this is usually how people used the Bricolage tools. When an operation must run across multiple machines, perhaps when moving templates from beta to production, the administrator simply uses scp to transfer the KDS files.

I also took the opportunity to write XML::Validator::Schema, a pure-Perl XML Schema validator. It's far from complete, but it supports all the schema constructs I needed for Krang. This allows Krang to perform runtime schema validation on KDS files.
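
XML::Validator::Schema plugs in as a SAX filter, so validating a document is just a matter of parsing it through the filter. Here's the basic pattern, close to the module's synopsis (the file names are hypothetical):

    use XML::SAX::ParserFactory;
    use XML::Validator::Schema;

    # Build a SAX filter that checks events against the schema...
    my $validator = XML::Validator::Schema->new( file => 'template.xsd' );

    # ...and a parser that feeds document events through it.
    my $parser = XML::SAX::ParserFactory->parser( Handler => $validator );

    # The filter dies with a diagnostic at the first violation.
    eval { $parser->parse_uri('template.xml') };
    die "template.xml failed validation: $@" if $@;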

What Went Right

  • The new system is fast. Operating on KDS files on disk is many times faster than SOAP network transfers.
  • Capacity is practically unlimited. Since KDS files separate objects into individual XML files, Krang never has to load them all into memory at once. This means that a KDS file containing 10,000 objects is just as easy to process as one containing 10.
  • Debugging is much easier. When an import fails the user simply sends me the KDS file and I can easily examine the XML files or attempt an import on my own system. I don't have to wade through SOAP XML noise or try to replicate network operations to reproduce a bug. Separating each object into a single XML file made working on the data much easier because each file is small enough to load into Emacs.
  • Runtime schema validation helps find bugs faster and prevents bad data from ending up in the database.
  • Because Krang's design accounted for the XML system from the start, it integrates much more closely with the overall system. This gives it greater coverage and stability.

What Went Wrong

  • Operations across multiple machines require the user to manually transfer KDS files across the network.
  • Users who have developed expertise in using the Bricolage SOAP clients must learn a new technology.

Conclusion

SOAP isn't a bad technology, but it does have limits. My experience developing a SOAP interface for Bricolage taught me some important lessons that I've tried to apply to Krang. So far the experiment is a success, but Krang is young and problems may take time to appear.

Does this mean you shouldn't use SOAP for your next project? Not necessarily. It does mean that you should take a close look at your requirements and consider whether an alternative implementation would help you avoid some of the pitfalls I've described.

The best candidates for SOAP applications are lightweight network applications without significant performance requirements. If your application doesn't absolutely require network interaction, or if it will deal with large amounts of data, then you should avoid SOAP. Maybe you can use TAR instead!

An AxKit Image Gallery

AxKit is not limited to working with pure XML data. Starting with this article, we'll work with and around non-XML data by developing an image browser that works with two types of non-XML data: a directory listing built from operating system calls (file names and statistics) and image files. Furthermore, it will be built from small modules that you can adapt to your needs or use elsewhere, like the thumbnail generator or the HTML table wrapper.

By the time we're done, several articles from now, we'd like an application that:

  • provides navigation around a tree of directories containing images,
  • displays image galleries with thumbnails,
  • ignores nonimage files,
  • allows you to define and present a custom set of information ("meta data") about each image,
  • allows you to view the complete images with and without metadata,
  • uses a non-AxKit mod_perl handler to generate thumbnail images on the fly, and
  • allows you to edit the metadata in-browser.

That feature list should allow us to build a "real world" application (rather than the weather examples we've discussed so far), and hopefully a useful one as well. Here's a screenshot of the page created by this article and the next:

Example page.

That page has four sections:

  1. Heading: Tells you where you are and offers navigation up the directory tree.
  2. Folders: Links to the parent directory and any subfolders (Jim and Mary).
  3. Images: Offers a thumbnail and caption area for each image. Clicking on an image or image title takes you to the full-size variant.
  4. Footer: A breadcrumb display for getting back up the directory tree after scrolling down through a large page of images.

We'll implement the (most challenging) third section in this article and the other sections in the next article.

If you want to review the basics of AxKit and Apache configuration, see the previous articles listed in the "In This Series" box below.

Working with non-XML data as XML

The easiest way to work with non-XML data in AxKit is to turn it into XML as soon as possible and feed that to AxKit. AxKit itself takes this approach in its new directory handling feature: thanks to Matt Sergeant and Jörg Walters, AxKit can now scan a directory and build an XML document with all of its data. This is a lot like what native Apache does when it serves up an HTML directory listing, but it allows you to filter the result. The main part of this article is about filtering this directory listing in order to create a gallery, or proofsheet, of thumbnail images.

In This Series

Introducing AxKit
The first in a series of articles by Barrie Slaymaker on setting up and running AxKit. AxKit is a mod_perl application for dynamically transforming XML. In this first article, we focus on getting started with AxKit.

XSP, Taglibs and Pipelines
Barrie explains what a "taglib" is, and how to use them to create dynamic pages inside of AxKit.

Taglib TMTOWTDI
Continuing our look at AxKit tag libraries, Barrie explains the use of SimpleTaglib and LogicSheets.

In this case, we'll be using a relatively recent addition to AxKit's standard toolkit, SAX Machines, integrated into AxKit thanks to Kip Hampton. (Disclaimer: XML::SAX::Machines is a module I wrote.) The SAX machine we'll create will be a straight pipeline with a few filters, a lot like the pipelines that AxKit uses. This pipeline will dissect directory listings and generate a list of images segmented into rows for easy display purposes. We won't get into the details of SAX or SAX machines except to bolt together three building blocks; all of the gory details are handled for us by other modules. If you are interested in the gory details, then see Part One and Part Two of Kip's article "Introducing XML::SAX::Machines" on XML.com.
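
If SAX machines are new to you, the short version is that the Pipeline() function takes a list of filters plus a sink and wires them together. A minimal sketch, with hypothetical filter class names:

    use XML::SAX::Machines qw( Pipeline );

    # Wire two SAX filters and a sink into one machine; events flow
    # left to right and end up as XML written to STDOUT.
    my $machine = Pipeline(
        "My::FilterOne",    # hypothetical filter classes
        "My::FilterTwo",
        \*STDOUT,
    );

    $machine->parse_uri("input.xml");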

After the SAX machine builds our list of images, XSLT will be used to merge in metadata (like image titles and comments) from independent XML files and format the result for the browser, producing pages like the screenshot shown earlier.

Managing non-XML data (the images)

On the other hand, it doesn't make sense to XMLify raw image data (though things like SVG--covered in XML.com's Sacre SVG articles--and dia files are a natural fit), so we'll take advantage of AxKit's integration with Apache and mod_perl to delegate image handling to those more suitable tools.

This is done by using a distinctive URL for thumbnail image files and a custom mod_perl handler, My::Thumbnailer, to convert full-size images to thumbnails. Neither AxKit nor mod_perl code will be used to serve the images; that will be left to Apache.

Thumbnails will be autogenerated in files with the same name as the main image file with a leading period (".") stuck on the front. In Unix land, this indicates a hidden file, and we don't want thumbnails (or other dotfiles) showing up in our gallery pages.

My::Thumbnailer uses the relatively new Imager module by Arnar M. Hrafnkelsson and Tony Cook. This is a best-of-breed module that competes with the likes of the venerable GD, the juggernaut Image::Magick, and Graphics::Libplot. Imager is gaining a reputation for speed, quality, and a full-featured API.
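
My::Thumbnailer's code appears later on; as a taste of the Imager API, here's a minimal thumbnailing sketch following the hidden-file naming convention described above (the file names and the 100-pixel bound are assumptions, not My::Thumbnailer's actual code):

    use Imager;

    my $img = Imager->new;
    $img->read( file => 'a-look.jpeg' )
        or die $img->errstr;

    # Scale so the larger dimension fits in 100 pixels,
    # preserving the aspect ratio.
    my $thumb = $img->scale(
        xpixels => 100,
        ypixels => 100,
        type    => 'min',
    ) or die $img->errstr;

    # Write the thumbnail under its hidden-file name.
    $thumb->write( file => '.a-look.jpeg' )
        or die $thumb->errstr;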

The .meta file

Before we delve into the implementation, let's look at one of the more subtle points of this design. Our previous examples have all been straight pipelines that successively process a source document into an HTML page. In this application, however, we'll be funneling data from the source document and a collection of related files we'll call meta files.

This subtlety is not apparent from the screenshot, but if you look closely you can see that the caption for the first image ("A baby picture") contains more information than the captions for the other eight. This is because the first image has a meta file that contains a title and a comment to be displayed while the others don't (though they could).

The first image ("A baby picture") is from a file named a-look.jpeg, for which there is a meta file named a-look.meta in the same directory that looks like this (the title and comment are the data that end up being sent to the browser):

    <meta>
      <title>A baby picture</title>
      <comment>
        <b>ME!</b>.  Well, not really.  Actually, it's some
        random image from the 'net.
      </comment>
    </meta>

An important feature of this file is that its contents and how they are presented within the caption area are completely unspecified by the core image gallery code. This makes our image gallery highly customizable: the site designer can determine what meta information needs to be associated with each image and how that information gets presented. Data can be presented in the thumbnail caption, in the expanded view, or used for nondisplay purposes.

Here's what's in each caption area:

  1. The title. If a .meta file is found for an image and it has a nonempty <title> element, then it is used as the name, otherwise the image's filename is stripped of extensions and used.
  2. The last modified time of the image file (in server-local time, unfortunately).
  3. A comment (optional): if a .meta file has a <comment> element, including XHTML markup, it is displayed.

Why a .meta file per image instead of one huge file? It allows admins to manage images and their meta files together and lets us access an image's meta information as a single file, a natural thing to do in AxKit. By having a pair of files for each image, you can use simple filesystem manipulations to move them around, or use filesystem links to make an image appear in multiple directories, perhaps with the same meta file, perhaps with different ones. This way we don't need to develop a lot of complex features to get a lot of mileage out of our image gallery (though we could if need be).

The Pipeline

No AxKit implementation documentation would be complete without detailing the pipeline. Here is the pipeline for the image proofsheet page shown above:

The AxKit pipeline for the image gallery application, take 1

The <filelist> document generated by AxKit feeds, in order:

  1. My::ProofSheetMachine, a SAX machine containing 3 SAX filters:
    • My::Filelist2Data converts the <filelist> to a Perl data structure.
    • My::ProofSheet takes the Perl data structure and generates a list of images with some metadata.
    • XML::Filter::TableWrapper segments the list of images into rows suitable for use in an HTML <table>.
  2. rowsplitter.xsl converts each row of thumbnail metadata into two rows, one for images, the other for captions.
  3. metamerger.xsl adds in the external .meta files, if they exist.
  4. captionstyler.xsl converts the captions to XHTML.
  5. pagestyler.xsl converts the main part of the page to XHTML, producing the final output.

The content documents here are the directory listing, the meta files, and the generated HTML. This does not show the image processing; see My::Thumbnailer for that.

In this case, unlike our previous pipelines, data does not flow in a purely linear fashion: The directory listing from AxKit (<filelist>) feeds the pipeline and is massaged by three SAX filters and then by four XSLT filters. There are so many filters because this application is built to be customizable by tweaking specific filters or by adding other filters to the pipeline. It also uses several SAX filters available on CPAN to make life much easier for us.

In actual use, you may want to add more filters for things like branding, distinguishing groups of images by giving directory hierarchies different backgrounds or titles, adding ad banners, etc.

Here's a brief description of what each filter does, and why each is an independent filter:

  • My::ProofSheetMachine is a short module that builds a SAX Machine Pipeline. SAX filters are used in this application to handle tasks that are more suited to Perl than to XSLT or XSP:
    • My::Filelist2Data is another short module that uses the XML::Simple module from CPAN to convert the <filelist> into a Perl data structure that is passed on. This is its own filter because we want to customize XML::Simple and the resulting data structure a bit before passing it on.
    • My::ProofSheet is the heart of the gallery page generation. It builds a list of images from the filelist data structure and adds information about the thumbnail images and meta files.
    • XML::Filter::TableWrapper is a module from CPAN that is used to wrap a possibly lengthy list of images into rows of no more than five images each.
  • rowsplitter.xsl takes each row of images and makes it into two table rows: one for the images and one for the captions. This is easier to do in XSLT than in SAX, so here is where we shift from SAX processing to XSLT processing.
  • metamerger.xsl examines each caption to see if My::ProofSheet put the URL for a meta file in it. If so, it opens the meta file and inserts it in the caption. This is a separate filter because the site admin may prefer to write a custom filter here to integrate meta information from some other source, like a single master file or a centralized database.
  • captionstyler.xsl looks at each caption and rewrites it to be XHTML. This is a separate filter for two reasons: it allows the look and feel of the captions to be altered without having to mess with the other filters and, because it is the only filter that cares about the contents of the meta file, the site admin can alter the schema of the meta files and then alter this filter to match.
  • pagestyler.xsl converts everything outside of the caption elements into HTML. It is separate so that the page look and feel can be altered per-site or per-directory without affecting the caption content, etc.

There are several key things to note about this design. The first is that the separation of the process into multiple filters offers the administrator the ability to modify the site's content and styling. Second, because AxKit is built on Apache's configuration engine, which filters are used for a particular directory request can be selected based on URL, directory path, query string parameters, browser types, etc. The third point to note is the use of SAX processors to handle tasks that are easier (far easier in some cases) to implement in Perl, while XSLT is used when it is more (programmer and/or processor) efficient.

The Configuration

Here's how we configure AxKit to do all of this:

    ##
    ## Init the httpd to use our "private install" libraries
    ##
    PerlRequire startup.pl

    ##
    ## AxKit Configuration
    ##
    PerlModule AxKit

    <Directory "/home/me/htdocs">
        Options -All +Indexes +FollowSymLinks

        # Tell mod_dir to translate / to /index.xml or /index.xsp
        DirectoryIndex index.xml index.xsp
        AddHandler axkit .xml .xsp

        AxDebugLevel 10

        AxTraceIntermediate /home/me/axtrace

        AxGzipOutput Off

        AxAddXSPTaglib AxKit::XSP::Util
        AxAddXSPTaglib AxKit::XSP::Param

        AxAddStyleMap text/xsl \
                      Apache::AxKit::Language::LibXSLT

        AxAddStyleMap application/x-saxmachines \
                      Apache::AxKit::Language::SAXMachines

    </Directory>

    
    <Directory "/home/me/htdocs/04">
        # Enable XML directory listings (see Generating File Lists)
        AxHandleDirs On

        #######################
        # Begin pipeline config
        AxAddRootProcessor application/x-saxmachines . \
            {http://axkit.org/2002/filelist}filelist
        PerlSetVar AxSAXMachineClass "My::ProofSheetMachine"

        # The absolute stylesheet URLs are because
        # I prefer to keep stylesheets out of the
        # htdocs for security reasons.
        AxAddRootProcessor text/xsl file:///home/me/04/rowsplitter.xsl \
            {http://axkit.org/2002/filelist}filelist

        AxAddRootProcessor text/xsl file:///home/me/04/metamerger.xsl \
            {http://axkit.org/2002/filelist}filelist

        AxAddRootProcessor text/xsl file:///home/me/04/captionstyler.xsl \
            {http://axkit.org/2002/filelist}filelist

        AxAddRootProcessor text/xsl file:///home/me/04/pagestyler.xsl \
            {http://axkit.org/2002/filelist}filelist
        # End pipeline config
        #####################

        # This is read by My::ProofSheetMachine
        PerlSetVar MyColumns 5

        # This is read by My::ProofSheet
        PerlSetVar MyMaxX 100

        # Send thumbnail image requests to our
        # thumbnail generator
        <FilesMatch "^\.">
            SetHandler  perl-script
            PerlHandler My::Thumbnailer
            PerlSetVar  MyMaxX 100
            PerlSetVar  MyMaxY 100
        </FilesMatch>
        
    </Directory>

The first <Directory> section contains the AxKit directives we introduced in article 1 and a new stylesheet mapping for application/x-saxmachines that allows us to use a SAX machine in the pipeline. Otherwise, all of the configuration directives key to this example are in the <Directory "/home/me/htdocs/04"> section.

We saw basic examples of how AxKit works with the Apache configuration engine in article 1 and article 2 in this series. We'll use this photo gallery application to demonstrate many of the more powerful mechanisms in a future article.

By setting AxHandleDirs On, we tell AxKit to generate the <filelist> document (described in the section Generating File Lists) in the 04 directory and below.

Then it's off to configure the pipeline for the 04 directory hierarchy. To do this, we take advantage of the fact that AxKit places all elements in the filelist document into the namespace http://axkit.org/2002/filelist. AxAddRootProcessor's third parameter causes AxKit to look at each document it serves from the 04 directory tree and check whether the root element matches the given namespace and element name.

This is specified in the notation used by James Clark in his introduction to XML namespaces.

If the document matches, and all AxKit-generated filelists will, then the MIME type and the stylesheet specified in the first two parameters are added to the pipeline. The four AxAddRootProcessor directives add the SAX machine and the four XSLT filters we described in the section "The Pipeline".

When loading a SAX machine into the pipeline, you can give it a simple list of SAX filters (there are many available on CPAN) and it will build a pipeline of them. This is done with a (not shown) PerlSetVar AxSAXMachineFilters "..." directive. The limitation of this directive is that you cannot pass any initialization values to the filters, and we need to.

So, instead, we use the PerlSetVar AxSAXMachineClass "My::ProofSheetMachine" to tell the Apache::AxKit::Language::SAXMachines module to load the class My::ProofSheetMachine and let that class construct the SAX machine.

The final part of the configuration uses a <FilesMatch> section to forward all requests for thumbnail images to the mod_perl handler in My::Thumbnailer.

Walking the Pipeline

Now that we have our filters in place, let's walk the pipeline and take a look at each filter and what it emits.

Generating File Lists

<filelist> document's position in the processing pipeline

First, here's a look at the <filelist> document that feeds the chain. This is created by AxKit when it serves a directory request, in much the same way that Apache creates HTML directory listings. AxKit only generates these pages when the AxHandleDirs On directive is in effect. This causes AxKit to scan the directory shown in the screenshot above and emit XML like the following (whitespace added, repetitive stuff elided):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE filelist PUBLIC
      "-//AXKIT/FileList XML V1.0//EN"
      "file:///dev/null"
    >
    <filelist xmlns="http://axkit.org/2002/filelist">
      <directory
        atime="1032276941"
        mtime="1032276939"
        ctime="1032276939"
        readable="1"
        writable="1"
        executable="1"
        size="4096" >.</directory>
      <directory ...>..</directory>
      <directory ...>Mary</directory>
      <directory ...>Jim</directory>
      <file mtime="1031160766" ...>a-look.jpeg</file>
      <file mtime="1031160787" ...>a-lotery.jpeg</file>
      <file mtime="1031160771" ...>a-lucky.jpeg</file>
      <file mtime="1032197214" ...>a-look.meta</file>
      <file mtime="1035239142" ...>foo.html</file>
      ...
    </filelist>

The pieces of data we want to display are some filenames and their modification times. Some things to notice:

  • All of the elements -- most importantly the root element as we'll see in a bit -- are in a special namespace, http://axkit.org/2002/filelist, using the xmlns= attribute (see James Clark's introduction for details).
  • The entries are in unsorted order. We might want to allow the user to sort by different attributes someday, but this means that we at least need to sort the results somehow.
  • They contain the complete output from the stat() system call as attributes, so we can use the mtime attribute to derive a modification time.
  • There are files in there (a-look.meta and foo.html) that clearly should not be displayed as images.
  • We won't display the filename for a-look.jpeg as-is: we'll use the <title> element from the a-look.meta file instead.

My::ProofSheetMachine

My::ProofSheetMachine's position in the processing pipeline.

The processing pipeline is kicked off with a set of three SAX filters built by the My::ProofSheetMachine module:

    package My::ProofSheetMachine;
    
    use strict;
    
    use XML::SAX::Machines qw( Pipeline );
    use My::ProofSheet;
    use XML::Filter::TableWrapper;
    
    sub new {
        my $proto = shift;
        return bless {}, ref $proto || $proto;
    }
    
    sub get_machine {
        my $self = shift;
        my ( $r ) = @_;
    
        my $m = Pipeline(
            "My::Filelist2Data",
            My::ProofSheet->new( Request => $r ),
            XML::Filter::TableWrapper->new(
                ListTags => "{}images",
                Columns  => $r->dir_config( "MyColumns" ) || 3,
            ),
        );
    
        return $m;
    }
    
    1;

This module provides a minimal constructor, new(), so it can be instantiated (an Apache::AxKit::Language::SAXMachines requirement; we don't need it for our own sake). AxKit calls the get_machine() method once per request to obtain the SAX machine to use. SAX machines are not reused from request to request.

$r is a reference to the Apache request object (well, actually, to an AxKit subclass of it). This is passed into My::ProofSheet, which uses it to query some httpd.conf settings, to control AxKit's cache, and to probe the filesystem through Apache.

$r is also queried in this module to see whether there is a MyColumns setting for this request, with a default in case it's not set. The ListTags setting tells XML::Filter::TableWrapper to segment the image list produced by the first two filters into rows of images (preparing it to be an HTML table, in other words).

The need to pass parameters like this to the SAX filters is the sole reason we're using a SAX machine factory class. This class is specified using PerlSetVar AxSAXMachineClass; if we didn't need to initialize the filters like this, then we could have listed them in a PerlSetVar AxSAXMachineFilters directive. For more details on how SAX machines are integrated with AxKit, see the Apache::AxKit::Language::SAXMachines documentation.

Currently, only one SAX machine is allowed in an AxKit pipeline at a time (though different pipelines can have different machines in them). This is a limitation of the configuration system more than anything and may well change if need be. However, if we need to add SAX processors to the end of the machine, then PerlSetVar AxSAXMachineFilters can be used to insert site-specific filters after the main machine (and before the XSLT processors).

My::Filelist2Data

My::Filelist2Data's position in the processing pipeline.

Converting the <filelist> into a proofsheet takes a bit of detailed data munging. This is quite easy in Perl, so the first step in our pipeline is to convert the XML file listing into data. XML::Simple provides this functionality for us, and we overload it so we can grab the resulting data structure and pass it on:

    package My::Filelist2Data;
    
    use XML::Simple;
    @ISA = qw( XML::Simple );
    
    use strict;
    
    sub new {
        my $proto = shift;
        my %opts = @_;
    
        # The Handler value is passed in by the Pipeline()
        # call in My::ProofSheetMachine.
        my $h = delete $opts{Handler};
    
        # Even if there's only one file element present,
        # make XML::Simple put it in an ARRAY so that
        # the downstream filter can depend on finding an
        # array of elements and not a single element.
        # This is an XML::Simple option that is almost
        # always set in practice.
        $opts{forcearray} = [qw( file )];
    
        # Each <file> and <directory> element contains
        # the file name as simple text content.  This
        # option tells XML::Simple to store it in the
        # data member "filename".
        $opts{contentkey} = "filename";
    
        # This subroutine gets called when XML::Simple
        # has converted the entire document with the
        # $data from the document.
        $opts{DataHandler} = sub {
            shift;
            my ( $data ) = @_;
    
            # If no files are found, place an empty array
            # reference in the right spot.  This is to
            # simplify downstream filter code.
            $data->{file}      ||= [];
    
            # Pass the data structure to the next filter.
            $h->generate( $data );
        } if $h;
    
        # Call XML::Simple's constructor.
        return $proto->SUPER::new( %opts );
    }
    
    1;

Sending a data structure like this between SAX filters using a non-SAX event is known as "cheating." But this is Perl, and allowing you to cheat responsibly and judiciously is one of Perl's great strengths. This works and should work for the foreseeable future. If you're planning on doing something like this for a general-purpose filter, then it behooves you to also provide set_handler and get_handler methods so your filter can be repositioned after instantiation (something XML::SAX::Machines does if need be), as sketched below; we don't need to clutter this single-purpose example with them.
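
Those two accessors are tiny; here's a sketch for a filter that keeps its downstream handler in a hash slot (our closure-based My::Filelist2Data would need a little rework to store its handler this way):

    # Minimal repositioning support for a general-purpose filter
    # that stores its downstream handler in $self->{Handler}.
    sub set_handler {
        my ( $self, $handler ) = @_;
        $self->{Handler} = $handler;
    }

    sub get_handler {
        my ( $self ) = @_;
        return $self->{Handler};
    }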

The <filelist> document gets converted to a Perl data structure where each element is a data member in a HASH or an array, like (data elided and rearranged to relate well to the source XML):

    {
      xmlns => 'http://axkit.org/2002/filelist',
      directory => [
        {
          atime      => '1032276941',
          mtime      => '1032276939',
          ctime      => '1032276939',
          readable   => '1',
          writable   => '1',
          executable => '1',
          size       => '4096',
          filename   => '.',
        },
        {
          ...
          filename   => '..',
        },
        {
          ...
          filename   => 'Mary',
        },
        {
          ...
          filename   => 'Jim',
        }
      ],
      file => [
        {
          mtime      => '1031160766',
          ...
          filename   => 'a-look.jpeg',
        },
        {
          mtime      => '1031160787',
          ...
          filename   => 'a-lotery.jpeg',
        },
        {
          mtime      => '1031160771',
          ...
          filename   => 'a-lucky.jpeg',
        },
        {
          mtime      => '1035239142',
          ...
          filename   => 'foo.html',
        },
        ...
      ],
    }

My::ProofSheet

My::ProofSheet's position in the processing pipeline.

Once the data is in a Perl data structure, it's easy to tweak it (making mtime fields into something readable, for instance) and extend it (adding information about thumbnail images and .meta files, for instance). This is what My::ProofSheet does:

    package My::ProofSheet;
    
    use XML::SAX::Base;
    @ISA = qw( XML::SAX::Base );
    
    # We need to access the Apache request object to
    # get the URI of the directory we're presenting,
    # its physical location on disk, and to probe
    # the files in it to see if they are images.
    use Apache;
    
    # My::Thumbnailer is an Apache/mod_perl module that
    # creates thumbnail images on the fly.  See below.
    use My::Thumbnailer qw( image_size thumb_limits );
    
    # XML::Generator::PerlData lets us take a Perl data
    # structure and emit it to the next filter serialized
    # as XML.
    use XML::Generator::PerlData;
    
    use strict;
    
    sub generate {
        my $self = shift;
        my ( $data ) = @_;
    
        # Get the AxKit request object so we can
        # ask it for the URI and use it to test
        # whether files are images or not.
        my $r = $self->{Request};
    
        my $dirname = $r->uri;      # "/04/Baby_Pictures/Other/"
        my $dirpath = $r->filename; # "/home/me/htdocs/...Other/"
    
    
        my @images = map $self->file2image( $_, $dirpath ),
            sort {
                $a->{filename} cmp $b->{filename}
            } @{$data->{file}};
    
        # Use a handy SAX module to generate XML from our Perl
        # data structures.  The XML will look basically like:
        #
        # <proofsheet>
        #   <images>
        #     <image>...</image>
        #     <image>...</image>
        #     ...
        #   </images>
        #   <title>/04/Baby_Pictures/Others</title>
        # </proofsheet>
        #
        XML::Generator::PerlData->new(
            rootname => "proofsheet",
            Handler => $self,
        )->parse( {
            title       => $dirname,
            images      => { image => \@images },
        } );
    }
    
    
    sub file2image {
        my $self = shift;
        my ( $file, $dirpath ) = @_;
    
        # Grab the filename.  (It stays in the fields, so it
        # will also show up in the <image> structure.)
        my $fn = $file->{filename};
    
        # Ignore hidden files (first char is a ".").
        # Thumbnail images are cached as hidden files.
        return () if 0 == index $fn, ".";
    
        # Ignore files Apache knows aren't images
        my $type = $self->{Request}->lookup_file( $fn )->content_type;
        return () unless
            defined $type
            && substr( $type, 0, 6 ) eq "image/";
    
        # Strip the extension(s) off.
        ( my $name = $fn ) =~ s/\..*//;
    
        # A meta filename is the image filename with a ".meta"
        # extension instead of whatever extension it has.
        my $meta_fn   = "$name.meta";
        my $meta_path = "$dirpath/$meta_fn";
    
        # The thumbnail file is stored as a hidden file
        # named after the image file, but with a leading
        # '.' to hide it.
        my $thumb_fn   = ".$fn";
        my $thumb_path = "$dirpath/$thumb_fn";
    
        my $last_modified = localtime $file->{mtime};
    
        my $image = {
            %$file,                  # Copy all fields
            type           => $type, # and add a few
            name           => $name,
            thumb_uri      => $thumb_fn,
            path           => "$dirpath/$fn",
            last_modified  => $last_modified,
        };
    
        if ( -e $meta_path ) {
            # Only add a URI to the meta info, metamerger.xsl will
            # slurp it up if and only if <meta_uri> is present.
            $image->{meta_filename} = $meta_fn;
            $image->{meta_uri}      = "file://$meta_path";
        }
    
        # If the thumbnail exists, grab its width and height
        # so later stages can populate the <img> tag with them.
        # The eval {} is in case the image doesn't exist or
        # the library can't cope with the image format.
        # Disable caching AxKit's output if a failure occurs.
        eval {
            ( $image->{thumb_width}, $image->{thumb_height} )
                = image_size $thumb_path;
        } or $self->{Request}->no_cache( 1 );
    
        return $image;
    }
    
    
    1;

When My::Filelist2Data calls generate(), it sorts and scans the list of files by filename, converts each to an image, and sends a page title and the resulting list of images to the next filter (XML::Filter::TableWrapper). Kip Hampton's XML::Generator::PerlData is a Perl data -> XML serialization module. It's not meant for generating generic XML; it focuses purely on building an XML representation of a Perl data structure. In this case, that's ideal, because we will be generating the output document with XSLT templates and we don't care about the exact order of the elements in each <image> element; each <image> element is just a hash of key/value pairs. We do control the order of the <image> elements themselves, however, by passing an ordered list of them into XML::Generator::PerlData as an array.

Sorting by filename may not be the preferred thing to do for all applications, because users may prefer to sort by the caption title for the image, but then again they may not, and this allows the site administrator to control sort order by naming the files appropriately. We can always add sorting later.

Another peculiarity of this code is that it doesn't guarantee that there will be thumb_width and thumb_height values available. If you just drop the source images in a directory, then the first time the server generates this page, there will be no thumbnails available. In this case, the call to no_cache(1) prevents AxKit from caching the output page so that suboptimal HTML does not get stuck in the cache. This will give the server another chance at generating it with proper tags, hoping of course that by the next time this page is requested, the requisite thumbnails will be available to measure.

This approach gets the HTML to the browser fast, so the user's browser window will clear quickly and start filling with the top of the page; the user will see some activity and be less likely to get impatient. The thumbnails will be generated as the browser requests the images for the <img> tags. The alternative approaches would be to thumbnail the images inline, which would result in a significant delay on large listings before the first HTML hits the browser, or to pregenerate all the thumbnails.

One thing to note about this approach is that many browsers will request images several at a time, which will cause several server processes to be thumbnailing several different images at once. This should result in lower lag on low-load servers because processes can interleave CPU time and disk I/O waits, and can take advantage of multiple processors, if present. On heavily loaded servers, of course, this might be a bad thing; pregenerating thumbnails there would be a good idea.

The output from this filter looks like:

    <?xml version="1.0"?>
    <proofsheet>
      <images>
        <image>
          <path>
            /home/barries/src/mball/AxKit/www/htdocs/04/Baby_Pictures/Others/a-look.jpeg
          </path>
          <writable>1</writable>
          <filename>a-look.jpeg</filename>
          <thumb_uri>.a-look.jpeg</thumb_uri>
          <meta_filename>a-look.meta</meta_filename>
          <name>a-look</name>
          <last_modified>Wed Sep  4 13:32:46 2002</last_modified>
          <ctime>1032552249</ctime>
          <meta_uri>
            file:///home/barries/src/mball/AxKit/www/htdocs/04/Baby_Pictures/Others/a-look.meta
          </meta_uri>
          <mtime>1031160766</mtime>
          <size>8522</size>
          <readable>1</readable>
          <type>image/jpeg</type>
          <atime>1032553327</atime>
        </image>
        <image>
          <path>
            /home/barries/src/mball/AxKit/www/htdocs/04/Baby_Pictures/Others/a-lotery.jpeg
          </path>
          <writable>1</writable>
          <filename>a-lotery.jpeg</filename>
          <thumb_uri>.a-lotery.jpeg</thumb_uri>
          <name>a-lotery</name>
          <last_modified>Wed Sep  4 13:33:07 2002</last_modified>
          <ctime>1032552249</ctime>
          <mtime>1031160787</mtime>
          <size>10113</size>
          <readable>1</readable>
          <type>image/jpeg</type>
          <atime>1032553327</atime>
        </image>
      </images>
      ...
      <title>/04/Baby_Pictures/Others</title>
    </proofsheet>

All the data from the original <file> elements are in each <image> element along with the new fields. Note that the first <image> contains the <meta_uri> (pointing to a-look.meta) while the second doesn't, because there is no a-lotery.meta. As expected, both have <thumb_uri> tags. Only some of these fields are bits our presentation happens to want; yours might want more or different bits.

While there is a lot of extra information in this structure, it's really just the output from one system call (stat()) and some possibly useful byproducts of the My::ProofSheet machinations, so it's very cheap information that some front end somewhere might want. It's also easier to leave it all in than to emit just what our example front end wants, and doing so lets any future upstream filters or extensions to AxKit's directory scanning shine through.

No <thumb_width> or <thumb_height> tags are present because I copied this file from the axtrace directory (see the AxTraceIntermediate directive in our httpd.conf file) after viewing a newly added directory. Here's what the first <image> element looks like after my browser had requested all the thumbnails:

    <?xml version="1.0"?>
    <proofsheet>
      <images>
        <image>
          <thumb_width>72</thumb_width>
          <path>
            /home/barries/src/mball/AxKit/www/htdocs/04/Baby_Pictures/Others/a-look.jpeg
          </path>
          <writable>1</writable>
          <filename>a-look.jpeg</filename>
          <thumb_height>100</thumb_height>
          <thumb_uri>.a-look.jpeg</thumb_uri>
          <meta_filename>a-look.meta</meta_filename>
          <name>a-look</name>
          <last_modified>Wed Sep  4 13:32:46 2002</last_modified>
          <ctime>1032552249</ctime>
          <meta_uri>
            file:///home/barries/src/mball/AxKit/www/htdocs/04/Baby_Pictures/Others/a-look.meta
          </meta_uri>
          <mtime>1031160766</mtime>
          <size>8522</size>
          <readable>1</readable>
          <type>image/jpeg</type>
          <atime>1032784360</atime>
        </image>
        ...
      </images>
      <title>/04/Baby_Pictures/Others</title>
    </proofsheet>

XML::Filter::TableWrapper

XML::Filter::TableWrapper's position in the processing pipeline

XML::Filter::TableWrapper is a CPAN module used to take the <images> list and segment it by inserting <tr>...</tr> tags around every five (the count is configurable) <image> elements. The configuration is done by the My::ProofSheetMachine module we showed earlier:

    XML::Filter::TableWrapper->new(
        ListTags => "{}images",
        Columns  => $r->dir_config( "MyColumns" ) || 3,
    ),

The output, for our list of 9 images, looks like:

    <?xml version="1.0"?>
    <proofsheet>
      <images>
        <tr>
          <image>
            ...
          </image>
          ... 4 more image elements...
        </tr>
        <tr>
          <image>
            ...
          </image>
          ... 3 more image elements...
        </tr>
      </images>
      <title>/04/Baby_Pictures/Others</title>
    </proofsheet>

Now the presentation stylesheet (pagestyler.xsl) can key off the <tr> tags to build an HTML <table>, or ignore them (and not pass them through) if it wants to display a list format.

While I'm sure this row wrapping is possible in pure XSLT, I have no idea how to do it easily.

rowsplitter.xsl

rowsplitter.xsl's position in the processing pipeline.

Experimentation with an early version of this application showed that presenting captions in the same table cell as thumbnails of differing heights caused the captions to be shown at varying heights. This made it hard to scan the captions and added a lot of visual clutter to the page.

One solution is to add an XSLT filter that splits each table row of image data into two rows, one for the thumbnail and another for the caption:

    <xsl:stylesheet 
      version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    >
    
    <xsl:template match="image" mode="caption">
      <caption>
        <xsl:copy-of select="@*|*|node()" />
      </caption>
    </xsl:template>
    
    <xsl:template match="images/tr">
      <xsl:copy-of select="." />
      <tr><xsl:apply-templates select="image" mode="caption" /></tr>
    </xsl:template>
    
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>
    
    </xsl:stylesheet>

The second template in this stylesheet matches each row (<tr> element) in the <images> element, copies it verbatim, and then emits a second <tr> element right after it containing a list of <caption> elements holding copies of the content of each of the <image> tags in the original row. The first template is applied only to the <image> tags when creating this second row, thanks to the mode="caption" attributes.

The third template is a standard piece of XSLT boilerplate that passes through all the XML that is not matched by the first two templates. This XML would otherwise be mangled (stripped of elements, to be specific) by the wacky default XSLT rules.

Now, I know several ways to do this in Perl in the AxKit environment and none are so easy for me as using XSLT. YMMV.

The output from that stage looks like:

    <?xml version="1.0"?>
    <proofsheet>
      <images>

        <tr><image>...  </image>   ...total of 5... </tr>
        <tr><caption>...</caption> ...total of 5... </tr>

        <tr><image>...  </image>   ...total of 4... </tr>
        <tr><caption>...</caption> ...total of 4... </tr>

      </images>
      <title>/04/Baby_Pictures/Others</title>
    </proofsheet>

The content of each <image> tag and each <caption> tag is identical. It's easier to do the transform this way, and it allows the frontend stylesheets the flexibility of doing things like putting the image filename or modification time in the same cell as the thumbnail.

metamerger.xsl

metamerger.xsl's position in the processing pipeline

As with the row splitter, expressing the metamerger in XSLT is an expedient way of merging in external XML documents, for several reasons. The first is efficiency's sake: we're already using XSLT before and after this filter, and AxKit optimizes XSLT->XSLT handoffs to avoid reparsing. Another is that the underlying implementation of AxKit's XSLT engine is the speedy C of libxslt. A third is that we're hardly altering the incoming document at all in this stage, so the XSLT does not get out of hand (I do not consider XSLT to be a very readable programming language; its XML syntax makes for very opaque source code).

Another approach would be to go back and tweak My::ProofSheet to inherit from XML::Filter::Merger and insert the meta documents using a SAX parser. That would be a bit slower, I suspect, because SAX parsing in general tends to be slower than XSLT's internal parsing. It would also rob the application of the configurability that having merging as a separate step engenders. By factoring this functionality into the metamerger.xsl stylesheet, we offer the site designer the ability to pull data from other sources, or even to fly without any metadata at all.

Here's what metamerger.xsl looks like:

    <xsl:stylesheet 
      version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    >
    
    <xsl:template match="caption">
      <caption>
        <xsl:copy-of select="*|@*|node()" />
        <xsl:copy-of select="document( meta_uri )" />
      </caption>
    </xsl:template>
    
    <xsl:template match="*|@*">
      <xsl:copy>
        <xsl:apply-templates select="*|@*|node()" />
      </xsl:copy>
    </xsl:template>
    
    </xsl:stylesheet>

The first template does all the work: it matches each <caption> element, copies its content, then parses and inserts the document indicated by the <meta_uri> element, if present. The document() function turns into a no-op if <meta_uri> is not present. The second template is the same sort of boilerplate we saw in rowsplitter.xsl, copying through everything we don't explicitly match.

And here's what the <caption> for a-look.jpeg now looks like (all the other <caption> elements were left untouched because there are no other .meta files in this directory):

    <caption>
      <thumb_width>72</thumb_width>
      <path>/home/barries/src/mball/AxKit/www/htdocs/04/Baby_Pictures/Others/a-look.jpeg</path>
      <writable>1</writable>
      <filename>a-look.jpeg</filename>
      <thumb_height>100</thumb_height>
      <thumb_uri>.a-look.jpeg</thumb_uri>
      <meta_filename>a-look.meta</meta_filename>
      <name>a-look</name>
      <last_modified>Wed Sep  4 13:32:46 2002</last_modified>
      <ctime>1032552249</ctime>
      <meta_uri>file:///home/barries/src/mball/AxKit/www/htdocs/04/Baby_Pictures/Others/a-look.meta</meta_uri>
      <mtime>1031160766</mtime>
      <size>8522</size>
      <readable>1</readable>
      <type>image/jpeg</type>
      <atime>1032784360</atime>
      <meta>
        <title>A baby picture</title>
        <comment><b>ME!</b>.  Well, not really.  Actually, it's some random image from the 'net.
</comment>
      </meta>
    </caption>

As mentioned before, this stylesheet does not care what you put in the meta file; it just inserts everything in that file from the root element on down. So you are free to put any meta information your application requires in the meta file and adjust the presentation filters to style it as you will.

The .meta information is not inserted into the <image> tags because we know our presentation won't need any of it there.

captionstyler.xsl

captionstyler.xsl's position in the processing pipeline

The last two stages of our pipeline turn the data assembled so far into HTML. This is done in two stages in order to separate general layout and presentation from the presentation of the caption, because these portions of the presentation might need to vary independently between one collection of images and another.

The caption stylesheet for this example is:

    <xsl:stylesheet 
      version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    >
    
    <xsl:template match="caption">
      <caption width="100" align="left" valign="top">
    
        <a href="{filename}">
          <xsl:choose>
            <xsl:when test="meta/title and string-length( meta/title )">
              <xsl:copy-of select="meta/title/node()" />
            </xsl:when>
            <xsl:otherwise>
              <xsl:value-of select="name" />
            </xsl:otherwise>
          </xsl:choose>
        </a><br />
    
        <font size="-1" color="#808080">
          <xsl:copy-of select="last_modified/node()" />
          <br />
        </font>
    
        <xsl:copy-of select="meta/comment/node()" />
    
      </caption>
    </xsl:template>
    
    <xsl:template match="*|@*|node()">
      <xsl:copy>
        <xsl:apply-templates />
      </xsl:copy>
    </xsl:template>
    
    </xsl:stylesheet>

The first template replaces each <caption> element with a new <caption> cell with a default width and alignment, and then fills it with the name of the image (which is also a link to the underlying image file), the <last_modified> time string formatted by My::ProofSheet, and any <comment> that might be present in the meta file.

The <xsl:choose> element selects the title to display for the image. The first <xsl:when> looks to see if there is a nonempty <title> element from the meta file and uses it if present. The <xsl:otherwise> falls back to the <name> set by My::ProofSheet.

The captions output by this stage look like:

    <caption width="100" align="left" valign="top">
      <a href="a-look.jpeg">A baby picture</a>
      <br/>
      <font size="-1" color="#808080">Wed Sep
        4 13:32:46 2002<br/>
      </font>
      <b>ME!</b>.  Well, not really.  Actually, it's
        some random image from the 'net.
    </caption>
    <caption width="100" align="left" valign="top">
      <a href="a-lotery.jpeg">a-lotery</a>
      <br/>
      <font size="-1" color="#808080">Wed Sep
        4 13:33:07 2002<br/></font>
    </caption>

The former is what comes out when a .meta file is found; the latter, when one is not.

pagestyler.xsl

And now, the final stage. If you've made it this far, congratulations; this is the start of a real application and not just a toy, so it's taken quite some time to get here.

pagestyler.xsl's position in the processing pipeline

The final stage of the processing pipeline generates an HTML page from the raw data, except for the attributes and content of <caption> tags, which it passes through as-is:

    <xsl:stylesheet 
      version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    >
    
    <xsl:template match="/*">
      <html>
        <head>
          <title>Images in <xsl:value-of select="title" /></title>
        </head>
        <body bgcolor="#ffffff">
          <xsl:apply-templates select="images" />
        </body>
      </html>
    </xsl:template>
    
    
    <xsl:template match="images">
      <table>
        <xsl:apply-templates />
      </table>
    </xsl:template>
    
    <xsl:template match="tr">
      <xsl:copy>
        <xsl:apply-templates select="*" />
      </xsl:copy>
    </xsl:template>
    
    <xsl:template match="image">
      <td align="left" valign="top">
        <a href="{filename}">
          <img border="0" src="{thumb_uri}">
            <xsl:if test="thumb_width">
              <xsl:attribute name="width">
                <xsl:value-of select="thumb_width" />
              </xsl:attribute>
            </xsl:if>
            <xsl:if test="thumb_height">
              <xsl:attribute name="height">
                <xsl:value-of select="thumb_height" />
              </xsl:attribute>
            </xsl:if>
          </img>
        </a>
      </td>
    </xsl:template>
    
    <xsl:template match="@*|node()" mode="caption">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()" mode="caption" />
      </xsl:copy>
    </xsl:template>
    
    <xsl:template match="caption">
      <td>
        <xsl:apply-templates select="@*|node()" mode="caption" />
      </td>
    </xsl:template>
    
    </xsl:stylesheet>

The first template generates the skeleton of the HTML page. The second grabs the <images> list from the source document and emits a <table>. The third copies the <tr> tags through, and the fourth replaces each <image> tag with a <td> tag containing the thumbnail image as a link to the underlying image (similar to what captionstyler.xsl did with the picture name). The only subtlety here is that the optional <thumb_width> and <thumb_height> elements are used, if present, to tell the browser the size of each thumbnail and so speed up the layout process (as mentioned before, pages that don't contain this information are not cached, so that once the thumbnails have been generated, new HTML will be generated with the sizes included).

The last two templates convert the <caption> elements to <td> elements and copy all their content through untouched, since captionstyler.xsl already handled their presentation.

Tweaking or replacing this stylesheet controls the entire page layout other than thumbnail sizing (which is set by the optional MyMaxX and MyMaxY PerlSetVar settings in httpd.conf). A different stylesheet at this point in the chain could choose to ignore the <tr> tags and present a list-style output, and a later stylesheet could be added to put branding or advertising on the site.

My::Thumbnailer

Here's the Apache module that generates thumbnails. The key thing to remember is that, unlike all the other code and XML shown in this article, it is called once per thumbnail image, not once per directory. When a browser requests a directory listing, it gets HTML from the pipeline above with lots of URIs for thumbnail images, and it will then usually request each of those in turn. The httpd.conf file directs all requests for dotfiles to this module (a sketch of that configuration follows the listing):

    package My::Thumbnailer;
    
    # Allow other modules like My::ProofSheet to use some
    # of our utility routines.
    use Exporter;
    @ISA = qw( Exporter );
    @EXPORT_OK = qw( image_size thumb_limits );
    
    use strict;
    
    use Apache::Constants qw( DECLINED );
    use Apache::Request;
    use File::Copy;
    use Imager;
    
    
    sub image_size {
        my $img = shift;
    
        if ( ! ref $img ) {
            my $fn = $img;
            $img = Imager->new;
            $img->open( file => $fn )
                or die $img->errstr(), ": $fn";
        }
    
        ( $img->getwidth, $img->getheight );
    }
    
    
    sub thumb_limits {
        my $r = shift;
    
        # See if the site admin has placed MyMaxX and/or
        # MyMaxY in the httpd.conf.
        my ( $max_w, $max_h ) = map
            $r->dir_config( $_ ),
            qw( MyMaxX MyMaxY );
    
        return ( $max_w, $max_h )
            if $max_w || $max_h;
    
        # Default to scaling down to fit in a 100 x 100
        # pixel area (aspect ratio will be maintained).
        return ( 100, 100 );
    }
    
    
    # Apache/mod_perl is configured to call
    # this handler for every dotfile
    # requested.  All thumbnail images are dotfiles,
    # but some dotfiles may not be thumbnails.
    sub handler {
        my $r = Apache::Request->new( shift );
    
        # We only want to handle images.
        # Let Apache handle non-images.
        goto EXIT
            unless substr( $r->content_type, 0, 6 ) eq "image/";
    
        # The actual image filename is the thumbnail
        # filename without the leading ".".
        ( my $orig_fn = $r->filename ) =~ s{/\.([^/]+)\z}{/$1}
            or die "Can't parse ", $r->filename;
    
        # Let Apache serve the thumbnail if it already
        # exists and is newer than the original file.
        {
            my $thumb_age = -M $r->finfo;
            my $orig_age  = -M $orig_fn;
            goto EXIT
                if $thumb_age && $thumb_age <= $orig_age;
        }
    
        # Read in the original file
        my $orig = Imager->new;
        unless ( $orig->open( file => $orig_fn ) ) {
            # Imager can't hack the format, fall back
            # to the original image.  This can happen
            # if you forget to install libgif
            # (as I have done).
            goto FALLBACK
                if $orig->errstr =~ /format not supported/;
    
            # Other errors are probably more serious.
            die $orig->errstr, ": $orig_fn\n";
        }
    
        my ( $w, $h ) = image_size( $orig );
    
        die "!\$w for ", $r->filename, "\n" unless $w;
        die "!\$h for ", $r->filename, "\n" unless $h;
    
        my ( $max_w, $max_h ) = thumb_limits( $r );
    
        # Scale down only.  If the image is smaller than
        # the thumbnail limits, let Apache serve it as-is.
        # thumb_limits() guarantees that either $max_w
        # or $max_h will be true.
        goto FALLBACK
            if ( ! $max_w || $w < $max_w )
            && ( ! $max_h || $h < $max_h );
    
        # Scale the larger dimension down to the
        # requested size.  This can mess up images
        # that are meant to be scaled on each axis
        # independently, like graphic bars for HTML
        # page separators, but that's a very small
        # demographic.
        my $thumb = $orig->scale(
            $w > $h
                ? ( xpixels => $max_w )
                : ( ypixels => $max_h )
        );
        $thumb->write( file => $r->filename )
            or die $thumb->errstr, ": ", $r->filename;
    
        goto BONK;
    
    FALLBACK:
        # If we can't or don't want to build the thumbnail,
        # just copy the original and let Apache figure it out.
        warn "Falling back to ", $orig_fn, "\n";
        copy( $orig_fn, $r->filename );
    
    BONK:
        # Bump apache on the head just hard enough to make it
        # forget the thumbnail file's old stat() and
        # mime type since we've most likely changed all
        # that now.  This is important for the headers
        # that control downstream caching, for instance,
        # or in case Imager changed mime types on us
        # (unlikely, but hey...)
        $r->filename( $r->filename );
    
    EXIT:
        # We never serve the image data, Apache is perfectly
        # good at doing this without our help.  Returning
        # DECLINED causes Apache to use the next handler in
        # its list of handlers.  Normally this is the default
        # Apache file handler.
        return DECLINED;
    }
    
    1;
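
The httpd.conf glue that routes dotfile requests to this handler isn't shown here; a hypothetical fragment along these lines would do it, using standard mod_perl directives (the exact match pattern is an assumption):

    # Hypothetical httpd.conf fragment: hand every dotfile
    # request to My::Thumbnailer.  The handler DECLINEs anything
    # it doesn't want, letting Apache serve the file normally.
    <FilesMatch "^\..+">
        SetHandler  perl-script
        PerlHandler My::Thumbnailer
    </FilesMatch>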

There should be enough inline commentary to explain that lot. The only thing I'll add, to head off the gotophobes, is that I think the use of goto makes this routine a lot clearer than the alternatives; the early versions did not use it and were less readable and maintainable. The three normal exit routes happen to stack neatly at the bottom of the routine, so each labeled chunk falls through cleanly to the next.

The most glaring mistake here is that there is no file locking. We'll add that in next time.
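
As a preview, the locking will likely amount to something like this minimal sketch (the sidecar lockfile is an assumption on my part, not the final design):

    use Fcntl qw( :flock );

    # Hypothetical sketch: take an exclusive lock on a sidecar
    # lockfile so two httpd children don't both try to write
    # the same thumbnail at once.
    my $lock_fn = $r->filename . ".lock";
    open my $lock_fh, ">", $lock_fn
        or die "$lock_fn: $!";
    flock $lock_fh, LOCK_EX
        or die "flock $lock_fn: $!";

    # ... generate and write the thumbnail here ...

    close $lock_fh;    # closing the file releases the lock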

Summary

The final result of the code in this article is the image proofsheet section of the page we showed at the beginning of the article. The next article will complete that page, and then we'll build the image presentation page and a metadata editor in future articles.

Help and thanks

In case of trouble, have a look at some of the helpful resources we listed in the first article.

Taglib TMTOWTDI

As with many Perl systems, AxKit often provides multiple ways of doing things. Developers from other programming cultures may find this freedom of choice a bit bewildering at first, but that (hopefully) soon gives way to the realization that the options provide power and flexibility. When a tool set limits your choices too much, you end up doing things like driving screws with a nailgun. Of course, too many choices isn't necessarily a good thing, but it's better than too few.

Last time, we saw how to build a weather reporting application by implementing a simple taglib module, My::WeatherTaglib, in Perl and deploying it in a pipeline with other XML filters. The pipeline approach allows one kind of flexibility: the freedom to decompose an application in the most appropriate manner for the requirements at hand and for the supporting organization.

Another kind of flexibility is the freedom to implement filters using different technologies; for instance, it is sometimes wise to build taglibs in different ways. In this article, we'll see how to build the same taglib using two other approaches. The first rebuild uses the technique favored by the Cocoon project, LogicSheets. The second uses Jörg Walter's relatively new SimpleTaglib in place of the TaglibHelper used for My::WeatherTaglib in the previous article. SimpleTaglib is a somewhat more powerful and, oddly, more complex module than TaglibHelper (though the author intends to make it a bit simpler to use in the near future).

CHANGES

AxKit v1.6 is now out with some nice bug fixes and performance improvements, mostly by Matt Sergeant and Jörg Walter, along with several new advanced features from Kip Hampton which we'll be covering in future articles.

Matt has also updated his AxKit compatible AxPoint PowerPoint-like HTML/PDF/etc. presentation system. If you're going to attend any of the big Perl conferences this season, then you're likely to see presentations built with AxPoint. It's a nice system that's also covered in an XML.com article by Kip Hampton.

AxTraceIntermediate

The one spiffy new feature I used -- rather more often than I'd like to admit -- in writing this article is the debugging directive AxTraceIntermediate, added by Jörg Walter. This directive names a directory in which AxKit will place a copy of each of the intermediate documents passed between filters in the pipeline. So a setting like:

    AxTraceIntermediate /home/barries/AxKit/www/axtrace

will place one file in the axtrace directory for each intermediate document. The full set of directives in httpd.conf used for this article is shown later.

Here is the axtrace directory after requesting the URIs / (from the first article), /02/weather1.xsp (from the second article), /03/weather1.xsp and /03/weather2.xsp (both from this article):

    |index.xsp.XSP         # Perl source code for /index.xsp
    |index.xsp.0           # Output of XSP filter

    |02|weather1.xsp.XSP   # Perl source code for /02/weather1.xsp
    |02|weather1.xsp.0     # Output of XSP
    |02|weather1.xsp.1     # Output of weather.xsl
    |02|weather1.xsp.2     # Output of as_html.xsl

    |03|weather1.xsp.XSP   # Perl source code for /03/weather1.xsp
    |03|weather1.xsp.0     # Output of XSP
    |03|weather1.xsp.1     # Output of weather.xsl
    |03|weather1.xsp.2     # Output of as_html.xsl

    |03|weather2.xsp.XSP   # Perl source code for /03/weather2.xsp
    |03|weather2.xsp.0     # Output of my_weather_taglib.xsl
    |03|weather2.xsp.1     # Output of XSP
    |03|weather2.xsp.2     # Output of weather.xsl
    |03|weather2.xsp.3     # Output of as_html.xsl

Each filename is the path portion of the URI with the /s replaced with |s and a step number (or .XSP) appended. The numbered files are the intermediate documents, and the .XSP files are the Perl source code for any XSP filters that happened to be compiled for this request. Compare the |03|weather2.xsp.* files to the pipeline diagram for the /03/weather2.xsp request.
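
In other words, the mapping is simple enough to reconstruct in a couple of lines of Perl (an illustration, not AxKit's actual code):

    ( my $trace_name = "/03/weather2.xsp" ) =~ tr{/}{|};
    $trace_name .= ".2";      # step number, or ".XSP" for the source
    print $trace_name, "\n";  # prints "|03|weather2.xsp.2"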

Watch those "|" characters: they force you to quote the filenames in most shells (and thus foil any use of wildcards):

    $ xmllint --format "www/axtrace/|03|weather2.xsp.3"
    <?xml version="1.0" standalone="yes"?>
    <html>
      <head>
        <meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/>
        <title>My Weather Report</title>
      </head>
      <body>
        <h1><a name="title"/>My Weather Report</h1>
        <p>Hi! It's 12:43:52</p>
        <p>The weather in Pittsburgh is Sunny
        ....

NOTE: The .XSP files are only generated when the XSP sheet is recompiled, so you may need to touch the source document or restart the server to generate a new one. Another gotcha is that if an error occurs halfway down the processing pipeline, you can end up with stale files: the lower-numbered files (those generated by successful filters) will be from this request, but the higher-numbered files will be stale, left over from a previous request. A slightly different issue can occur when using dynamic pipeline configurations (which we'll cover in the future): you can end up with a shorter pipeline that only overwrites the lower-numbered files and leaves stale higher-numbered files around.

These are pretty minor gotchas compared to the usefulness of this feature; you just need to be aware of them to avoid confusion. When debugging for this article, I used a Perl script that does something like:

    rm -f www/axtrace/*
    rm www/logs/*
    www/bin/apachectl stop
    sleep 1
    www/bin/apachectl start
    GET http://localhost:8080/03/weather1.xsp

to start each test run with a clean fileset.

Under the XSP Hood

Before we move on to the examples, let's take a quick peek at how XSP pages are handled by AxKit. This will help us understand the tradeoffs inherent in the different approaches.

AxKit implements XSP filters by compiling the source XSP page into a handler() function that is called to generate the output page. The function is compiled into Perl bytecode, which is then run to generate the XSP output document:

XSP architecture

This means that the XSP page is not executed directly, but by running relatively efficient compiled Perl code. The bytecode is kept in memory, so the overhead of parsing and code generation is not incurred for each request.
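
The principle is the same as in this self-contained sketch, where a "page" is turned into Perl source once, compiled with eval, and then run cheaply for each request (all names here are invented for illustration):

    #!/usr/bin/perl -w
    use strict;

    # "Compile time": build Perl source for a trivial page once.
    my $page_source = q{
        sub {
            my ( $name ) = @_;
            return "<p>Hi! It's " . localtime() . ", $name</p>";
        }
    };
    my $handler = eval $page_source or die $@;

    # "Request time": run the compiled code, many times, cheaply.
    print $handler->( "request $_" ), "\n" for 1 .. 3;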

There are three types of Perl code used in building the output document: code to build the bits of static content, code that was present verbatim in the source document -- enclosed in tags like <xsp:logic> and <xsp:expr> -- and code that implements tags handled by registered taglib modules like My::WeatherTaglib from the last article.

Taglib modules hook into the XSP compiler by registering themselves as handlers for a namespace and then coughing up snippets of code to be compiled into the handler() routine:

XSP with Taglib Modules Hooking in

The snippets of code can call back into the taglib module or out to other modules as needed. Modules like TaglibHelper (which we used to build My::WeatherTaglib) and SimpleTaglib (which we use later in this article for My::SimpleWeatherTaglib) automate the drudgery of building a taglib module, so you don't need to parse XML or even, usually, generate XML.

You can view the source code that AxKit generates by cranking AxDebugLevel up to 10 (which places the code in Apache's ErrorLog) or by using the AxTraceIntermediate directive mentioned above. Then you must persuade AxKit to recompile the XSP page by restarting the server and requesting a page. If either of the necessary directives is already present in a running server, then simply touching the file to update its modification time will suffice.

This can be useful for getting a really good feel for what's going on under the hood. I encourage new taglib authors to do this to see how the code for their taglib is actually executed. You'll end up needing to do it to debug anyway (trust me :).

LogicSheets: Upstream Taglibs

AxKit uses a pipeline processing model, and XSP includes tags like <xsp:logic> and <xsp:expr> that allow you to embed Perl code in an XSP page. Together these allow taglibs to be implemented as XML filters placed upstream of the XSP processor. Such filters usually use XSLT to convert taglib invocations into inline code built from XSP tags:

Upstream LogicSheets feeding the XSP processor

In fact, this is how XSP was originally designed to operate and Cocoon uses this approach exclusively to this day (but with inline Java instead of Perl). I did not show this approach in the first article because it is considerably more awkward and less flexible than the taglib module approach offered by AxKit.

The Cocoon project calls XSLT sheets that implement taglibs LogicSheets, a convention I follow in this article (I refer to the all-Perl taglib implementation as "taglib modules").

weather2.xsp

Before we look at the LogicSheet version of the weather report taglib, here is the XSP page from the last article updated to use it:


    <?xml-stylesheet href="my_weather_taglib.xsl" type="text/xsl"?>
    <?xml-stylesheet href="NULL"                  type="application/x-xsp"?>
    <?xml-stylesheet href="weather.xsl"           type="text/xsl"?>
    <?xml-stylesheet href="as_html.xsl"           type="text/xsl"?>

    <xsp:page
        xmlns:xsp="http://apache.org/xsp/core/v1"
        xmlns:util="http://apache.org/xsp/util/v1"
        xmlns:param="http://axkit.org/NS/xsp/param/v1"
        xmlns:weather="http://slaysys.com/axkit_articles/weather/"
    >
    <data>
      <title><a name="title"/>My Weather Report</title>
      <time>
        <util:time format="%H:%M:%S" />
      </time>
      <weather>
        <weather:report>
          <!-- Get the ?zip=12345 from the URI and pass it
               to the weather:report tag as a parameter -->
          <weather:zip><param:zip/></weather:zip>
        </weather:report>
      </weather>
    </data>
    </xsp:page>

The <?xml-stylesheet href="my_weather_taglib.xsl" type="text/xsl"?> processing instruction causes my_weather_taglib.xsl (which we'll cover next) to be applied to the weather2.xsp page before the XSP processor sees it. The other three PIs are identical to the previous version: the XSP processor is invoked, followed by the same presentation and HTMLification XSLT stylesheets that we used last time.

The only other change from the previous version is that this one uses the correct URI for XSP tags. I accidentally used a deprecated URI for XSP tags in the previous article and ended up tripping over it when I used the up-to-date URI in the LogicSheet for this one. Such is the life of a pointy-brackets geek.

The ability to switch implementations without altering (much) code is one of XSP's advantages over things like inline Perl code: the implementation is nicely decoupled from the API (the tags). The only reason we had to alter weather1.xsp at all is that we're switching from a more advanced approach (a taglib module, My::WeatherTaglib) that is configured in the httpd.conf file to LogicSheets, which need per-document configuration when using <?xml-stylesheet?> processing instructions. AxKit has more flexible httpd.conf-, plugin-, and Perl-based stylesheet specification mechanisms, which we will cover in a future article; I'm using the processing instructions here because they are simple and obvious.

The pipeline built by the processing instructions looks like:

The pipeline for weather2.xsp

(The diagram does not show the final compression stage.)

my_weather_taglib.xsl

Now that we've seen the source document and the overall pipeline, here is My::WeatherTaglib recast as a LogicSheet, my_weather_taglib.xsl:


    <xsl:stylesheet
      version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xsp="http://apache.org/xsp/core/v1"
      xmlns:weather="http://slaysys.com/axkit_articles/weather/"
    >

    <xsl:output indent="yes" />

    <xsl:template match="xsp:page">
      <xsl:copy>
        <xsp:structure>
          use Geo::Weather;
        </xsp:structure>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <xsl:template match="weather:report">
      <xsp:logic>
        my $zip = <xsl:apply-templates select="weather:zip/*" />;
        my $w = Geo::Weather->new->get_weather( $zip );
        die "Could not get weather for zipcode '$zip'\n" unless ref $w;
      </xsp:logic>
      <state><xsp:expr>$w->{state}</xsp:expr></state>
      <heat><xsp:expr>$w->{heat}</xsp:expr></heat>
      <page><xsp:expr>$w->{page}</xsp:expr></page>
      <wind><xsp:expr>$w->{wind}</xsp:expr></wind>
      <city><xsp:expr>$w->{city}</xsp:expr></city>
      <cond><xsp:expr>$w->{cond}</xsp:expr></cond>
      <temp><xsp:expr>$w->{temp}</xsp:expr></temp>
      <uv><xsp:expr>$w->{uv}</xsp:expr></uv>
      <visb><xsp:expr>$w->{visb}</xsp:expr></visb>
      <url><xsp:expr>$w->{url}</xsp:expr></url>
      <dewp><xsp:expr>$w->{dewp}</xsp:expr></dewp>
      <zip><xsp:expr>$w->{zip}</xsp:expr></zip>
      <baro><xsp:expr>$w->{baro}</xsp:expr></baro>
      <pic><xsp:expr>$w->{pic}</xsp:expr></pic>
      <humi><xsp:expr>$w->{humi}</xsp:expr></humi>
    </xsl:template>

    <xsl:template match="@*|node()">
      <!-- Copy the rest of the doc almost verbatim -->
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    </xsl:stylesheet>

The first <xsl:template> inserts an <xsp:structure> at the top of the page with some Perl code to use Geo::Weather; so that the Perl code in the later <xsp:logic> element can refer to it. You could also preload Geo::Weather in httpd.conf to share it amongst httpd processes and simplify this stylesheet, but that would introduce a bit of a maintenance hassle: keeping the server config and the LogicSheet in synchronization.
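
(For what it's worth, the preload would be a single hypothetical line in httpd.conf:)

    # Load Geo::Weather into the parent httpd so all the
    # children share the compiled copy.
    PerlModule Geo::Weather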

The second <xsl:template> replaces all occurrences of <weather:report> (assuming the weather: prefix happens to map to the taglib URI; see James Clark's introduction to namespaces for more details). In place of the <weather:report> tag(s) will be some Perl code surrounded by <xsp:logic> and <xsp:expr> tags. The <xsp:logic> tag is used around Perl code that is pure logic: any value the code returns is ignored. The <xsp:expr> tags surround Perl code that returns a value to be emitted as text in the result document.

The get_weather() call returns a reference to a hash describing the most recent weather observations somewhere close to a given zip code:

    {
      'city'  => 'Pittsburgh',
      'state' => 'PA',
      'cond'  => 'Sunny',
      'temp'  => '77',
      ...
    };

All those <xsp:expr> tags extract the values from the hash one by one and build an XML data structure. The resulting XSP document looks like:

    <?xml version="1.0"?>
    <xsp:page
        xmlns:xsp="http://apache.org/xsp/core/v1"
        xmlns:util="http://apache.org/xsp/util/v1"
        xmlns:param="http://axkit.org/NS/xsp/param/v1"
        xmlns:weather="http://slaysys.com/axkit_articles/weather/"
    >
      <xsp:structure>
        use Geo::Weather;
      </xsp:structure>
      <data>
        <title><a name="title"/>My Weather Report</title>
        <time>
          <util:time format="%H:%M:%S"/>
        </time>
        <weather>
          <xsp:logic>
            my $zip = <param:zip/>;
            my $w = Geo::Weather->new->get_weather( $zip );
            die "Could not get weather for zipcode '$zip'\n" unless ref $w;
          </xsp:logic>
          <state><xsp:expr>$w->{state}</xsp:expr></state>
          <heat><xsp:expr>$w->{heat}</xsp:expr></heat>
          <page><xsp:expr>$w->{page}</xsp:expr></page>
          <wind><xsp:expr>$w->{wind}</xsp:expr></wind>
          <city><xsp:expr>$w->{city}</xsp:expr></city>
          <cond><xsp:expr>$w->{cond}</xsp:expr></cond>
          <temp><xsp:expr>$w->{temp}</xsp:expr></temp>
          <uv><xsp:expr>$w->{uv}</xsp:expr></uv>
          <visb><xsp:expr>$w->{visb}</xsp:expr></visb>
          <url><xsp:expr>$w->{url}</xsp:expr></url>
          <dewp><xsp:expr>$w->{dewp}</xsp:expr></dewp>
          <zip><xsp:expr>$w->{zip}</xsp:expr></zip>
          <baro><xsp:expr>$w->{baro}</xsp:expr></baro>
          <pic><xsp:expr>$w->{pic}</xsp:expr></pic>
          <humi><xsp:expr>$w->{humi}</xsp:expr></humi>
        </weather>
      </data>
    </xsp:page>

and the output document of that XSP page looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <data>
      <title><a name="title"/>My Weather Report</title>
      <time>17:06:15</time>
      <weather>
        <state>PA</state>
        <heat>77</heat>
        <page>/search/search?what=WeatherLocalUndeclared
        &amp;where=15206</page>
        <wind>From the Northwest at 9 gusting to 16</wind>
        <city>Pittsburgh</city>
        <cond>Sunny</cond>
        <temp>77</temp>
        <uv>4</uv>
        <visb>Unlimited miles</visb>
        <url>http://www.weather.com/search/search?
        what=WeatherLocalUndeclared&amp;where=15206</url>
        <dewp>59</dewp>
        <zip>15206</zip>
        <baro>29.97 inches and steady</baro>
        <pic>http://image.weather.com/web/common/wxicons/52/30.gif</pic>
        <humi>54%</humi>
      </weather>
    </data>

LogicSheet Advantages

  • One taglib can generate XML that calls another taglib. Taglib modules may call each other at the Perl level, but taglib modules are XSP compiler plugins and do not cascade: The XSP compiler lives in a pipeline environment but does not use a pipeline internally.
  • No need to add an AxAddXSPTaglib directive and restart the Web server each time you write a taglib.

Restarting a Web server just because a taglib has changed can be awkward in some environments, but such environments seem to be rare; restarting an Apache server is usually quick enough in development, and it shouldn't be necessary too often in production.

In the Cocoon community, LogicSheets can be registered and shared somewhat like the Perl community shares modules on CPAN. This is an additional benefit when Cocooning, but it does not carry much weight in the Perl world, which already has CPAN (there are many taglib modules on CPAN). There is no Java equivalent to CPAN in wide use, so Cocoon LogicSheets need their own mechanism.

LogicSheet Disadvantages

There are two fundamental drawbacks to LogicSheets, each with several symptoms. Many of the symptoms are minor, but they add up:

  1. Requires inline code, usually in an XSLT stylesheet.
    • Putting Perl code in XML is awkward: You can't easily syntax check the code (I happen to like to run perl -cw ThisFile.pm a lot while writing Perl code) or take advantage of language-oriented editor features such as autoindenting, tags and syntax highlighting.
    • The taglib author needs to work in four languages/APIs: XSLT (typically), XSP, Perl, and the taglib under development. XSLT and Perl are far from trivial, and though XSP is pretty simple, it's easy to trip yourself up when context switching between them.
    • LogicSheets are far less flexible than taglib modules. For instance, compare the rigidity of my_weather_taglib.xsl's output structure with that of My::WeatherTaglib or My::SimpleWeatherTaglib. The LogicSheet approach requires hardcoding the result values, while the two taglib modules simply convert whatever is in the weather report data structure to XML.
    • XSLT requires a fair amount of extra boilerplate to copy the non-taglib bits of XSP pages through. This is usually the same few copy-through templates each time, but boilerplate in a program is just another thing to get in the way and require maintenance.
    • LogicSheets are inherently single-purpose. Taglib modules, on the other hand, can also be used as regular Perl modules. An authentication module can be used both as a taglib and as a regular module, for instance.
    • LogicSheets need a working Web server for even the most basic functional testing, since they must run in an XSP environment and AxKit does not yet support XSP outside a Web server. Writing taglib modules allows simple test suites to be written to vet the taglib's code without a working Web server.
    • Writing LogicSheets works best in an XML editor; otherwise you'll need to escape all your < characters, at least, and reading and writing XML-escaped Perl and Java code can be irksome.
    • Embracing and extending a LogicSheet is difficult: the source XSP page needs to be aware that the taglib it's using builds on a base taglib, and it must declare both of their namespaces. With taglib modules, Perl's standard function import mechanism can be used to relieve XSP authors of this duty.
  2. Requires an additional stylesheet to process, usually XSLT. This means:
    • A more complex processing chain, which leads to XSP page complexity (and thus more likelihood of bugs) because each page must declare both the namespace for the taglib tags and a processing instruction to run the taglib. As an example of a gotcha in this area, I used an outdated version of the XSP namespace URI in weather2.xsp and the current URI in my_weather_taglib.xsl. This caused me a bit of confusion, but the AxTraceIntermediate directive helped shed some light on it.
    • More disk files to check for changes each time an XSP page is served. Since each LogicSheet affects the output, each LogicSheet must be stat()ed to see if it has changed since the last time the XSP page was compiled.

As you can probably tell, I feel that LogicSheets are a far more awkward and less flexible approach than writing taglibs as Perl modules using one of the helper libraries. Still, using upstream LogicSheets is a valid and perhaps occasionally useful technique for writing AxKit taglibs.

What Are Upstream Filters Good For?

So what is XSLT upstream of an XSP processor good for? You can do many things with it other than implementing LogicSheets. One use is branding: altering things like logos, site names, and perhaps colors, or other customizations, like the administrator's mail address on a login page shared by several sub-sites.

A key advantage of doing transformations upstream of the XSP processor is that the XSP processor caches the results of upstream transformations: it converts whatever document it receives into Perl bytecode in memory and then just reruns that bytecode as long as none of the upstream documents have changed.

Another use is to convert source documents that declare what should be on a page to XSP documents that implement the machinery of a page. For instance, a survey site might have the source documents declare what questions to ask:

    <survey>
      <question>
        <text>Have you ever eaten a Balut</text>
        <response>Yes</response>
        <response>No</response>
        <response>Eeeewww</response>
      </question>
      <question>
        <text>Ok, then, well how about a nice haggis</text>
        <response>Yes</response>
        <response>No</response>
        <response>Now that's more like it!</response>
      </question>
      ...
    </survey>

XSLT can be used to transform the survey definition into an XSP page that uses the PerForm taglib to automate form filling and the like. This approach allows pages to be defined in terms of what they are instead of how they should work.

You can also use XSLT upstream of the XSP processor to do other things, like translating from a limited or simpler domain-specific tagset to a more complex or general-purpose taglib written as a taglib module. This lets you define taglibs that are easier to use in terms of more powerful (but scary!) taglibs that are loaded into the XSP processor.

My::SimpleWeatherTaglib

A new-ish taglib helper module has been bundled with recent AxKit releases: Jörg Walter's SimpleTaglib (the full module name is Apache::AxKit::Language::XSP::SimpleTaglib). This module performs roughly the same function as Steve Willer's TaglibHelper, but it supports namespaces and uses a relatively new Perl feature, subroutine attributes, to specify the parameters and result formatting instead of a specification string.

Here is My::SimpleWeatherTaglib:

    package My::SimpleWeatherTaglib;

    use Apache::AxKit::Language::XSP::SimpleTaglib;

    $NS = "http://slaysys.com/axkit_articles/weather/";

    package My::SimpleWeatherTaglib::Handlers;

    use strict;
    require Geo::Weather;

    ## Return the whole report for fixup later in the processing pipeline
    sub report : child(zip) struct({}) {
        return 'Geo::Weather->new->get_weather( $attr_zip );'
    }

    1;

The $NS variable defines the namespace for this taglib. This module uses the same namespace as my_weather_taglib.xsl and My::WeatherTaglib because all three implement the same taglib (this repetitiveness is to demonstrate the differences between the approaches). See the Mixing and Matching Taglibs section to see how My::WeatherTaglib and My::SimpleWeatherTaglib can both be used in the same server instance.

My::SimpleWeatherTaglib then shifts gears into a new package, My::SimpleWeatherTaglib::Handlers, to define the subroutines for the taglib tags. Using a virgin package like this provides a clean place in which to declare the tag handlers. SimpleTaglib looks for the handlers in the Foo::Handlers package if it's use()d in the Foo package (don't use require for this!).

My::SimpleWeatherTaglib requires Geo::Weather and declares a single handler, report(), which handles the <weather:report> tag in weather1.xsp (which we'll show in a moment).

The require Geo::Weather; instead of use Geo::Weather; is to avoid importing subroutines into our otherwise pristine ...::Handlers namespace, where they might look like tag handlers.

There's something new afoot in the declaration for sub report: subroutine attributes. Subroutine attributes are a newish feature of Perl (as of perl5.6) that let us hang additional little bits of information on a subroutine declaration to describe it further; see perldoc perlsub for the details of the syntax. Some attributes are predefined by Perl, but modules may define others for their own purposes. In this case, the SimpleTaglib module defines a handful of attributes, some of which describe what parameters the taglib tag can take and others of which describe how to convert the result value from the taglib implementation into XML output.
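
If you're curious how a module can see those attributes at all, here is a minimal, self-contained sketch of the hook Perl provides (see perldoc attributes; this illustrates the mechanism, not SimpleTaglib's actual code):

    package My::AttrDemo;

    use strict;

    # Perl calls this at compile time for each sub in this
    # package declared with attributes.
    sub MODIFY_CODE_ATTRIBUTES {
        my ( $package, $code, @attrs ) = @_;
        warn "$package sub declared with: @attrs\n";
        return ();    # empty list = all attributes recognized
    }

    sub report : child(zip) struct({}) {
        return 'some generated code';
    }

    1;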

The child(zip) subroutine attribute tells the SimpleTaglib module that this handler expects a single child element named zip in the taglib's namespace. In weather1.xsp, this ends up looking like:


    <weather:report>
      <!-- Get the ?zip=12345 from the URI and pass it
           to the weather:report tag as a parameter -->
      <weather:zip><param:zip/></weather:zip>
    </weather:report>

The text from the <weather:zip> element (which will be filled in from the URI query string using the param: taglib) will be made available in a variable named $attr_zip at request time. The fact that the text from an element shows up in a variable beginning with $attr_ is confusing, but it does actually work that way.

The struct({}) attribute specifies that the result of this tag will be returned as a Perl data structure that will be converted into XML. Geo::Weather->new->get_weather( $zip ) returns a HASH reference that looks like:

    {
      'city'  => 'Pittsburgh',
      'state' => 'PA',
      'cond'  => 'Sunny',
      'temp'  => '77',
      ...
    };

The struct attribute tells SimpleTaglib to turn this into XML like:


    <city>Pittsburgh</city>
    <state>PA</state>
    <cond>Sunny</cond>
    <temp>77</temp>
    ....

The {} in the struct({}) attribute specifies that the result nodes should not be in a namespace (and thus have no namespace prefix), just like the static portions of our weather1.xsp document. This is one of the advantages SimpleTaglib has over other methods: it's easier to emit nodes in different namespaces. To emit nodes in a specific namespace, put the namespace URI inside the curlies: struct({http://my.namespace.com/foo/bar}). The {...} notation is referred to as James Clark (or jclark) notation.

Now, the tricky bit. Harking back to our discussion of how XSP is implemented, remember that the XSP processor compiles the XSP document into Perl code that is executed to build the output document. As XSP compiles the page, it keeps a lookout for tags in namespaces handled by taglib modules that have been configured with AxAddXSPTaglib. When XSP sees one of these tags, it calls into the taglib module for that namespace -- My::SimpleWeatherTaglib here -- and requests a chunk of Perl source code to compile in place of the tag.

Taglibs implemented with the SimpleTaglib module covered here declare handlers for each taglib tag (sub report, for instance). That handler subroutine is called at parse time, not at request time. Its job is to return the chunk of code that will be compiled and then run later, at request time, to generate the output. So report() returns a string containing a snippet of Perl code that calls into Geo::Weather. This Perl code will be compiled once, then run for each request.

This is a key difference between the TaglibHelper module that My::WeatherTaglib used in the previous article and the SimpleTaglib module used here. SimpleTaglib calls My::SimpleWeatherTaglib's report() subroutine at compile time whereas TaglibHelper quietly, automatically arranges to call My::WeatherTaglib's report() subroutine at request time.

This difference makes SimpleTaglib not so simple unless you are used to writing code that generates code to be compiled and run later. On the other hand, "Programs that write programs are the happiest programs in the world" (Andrew Hume, according to a few places on the net). That holds here because we can return whatever code is appropriate for the task at hand. In this case the code is so simple that we can return it directly; if the work to be done were more complicated, we could instead return a call to a subroutine of our own devising. So, while a good deal less simple than the approach taken by TaglibHelper, this approach does offer a bit more flexibility.
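
To spell out the two phases side by side (an annotated sketch; the surrounding plumbing is omitted):

    # Parse time: while compiling the XSP page, SimpleTaglib calls
    # report() once and splices the *string* it returns into the
    # generated handler:
    my $snippet = My::SimpleWeatherTaglib::Handlers::report();
    # $snippet now holds:
    #   'Geo::Weather->new->get_weather( $attr_zip );'

    # Request time: the compiled page runs that code on every hit,
    # with $attr_zip already filled in from the <weather:zip>
    # child element.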

SimpleTaglib's author does promise that a new version of SimpleTaglib will offer the "call this subroutine at request time" API which I (and I suspect most others) would prefer most of the time.

I will warn you that the documentation for SimpleTaglib does not stand on its own, so you need to have the source code for an example module or two to put it all together. Beyond the overly simple example presented here, the documentation refers you to a couple of others. Mind you, I'm casting stones while in my glass house here, because nobody has ever accused me of fully documenting my own modules.

For reference, here is the weather1.xsp from the previous article, which we are reusing verbatim for this example:

    <?xml-stylesheet href="NULL"        type="application/x-xsp"?>
    <?xml-stylesheet href="weather.xsl" type="text/xsl"?>
    <?xml-stylesheet href="as_html.xsl" type="text/xsl"?>

    <xsp:page
        xmlns:xsp="http://www.apache.org/1999/XSP/Core"
        xmlns:util="http://apache.org/xsp/util/v1"
        xmlns:param="http://axkit.org/NS/xsp/param/v1"
        xmlns:weather="http://slaysys.com/axkit_articles/weather/"
    >
    <data>
      <title><a name="title"/>My Weather Report</title>
      <time>
        <util:time format="%H:%M:%S" />
      </time>
      <weather>
        <weather:report>
          <!-- Get the ?zip=12345 from the URI and pass it
               to the weather:report tag as a parameter -->
          <weather:zip><param:zip/></weather:zip>
        </weather:report>
      </weather>
    </data>
    </xsp:page>

The processing pipeline and intermediate files are also identical to those from the previous article, so we won't repeat them here.

Mixing and Matching Taglibs using httpd.conf

As detailed in the first article in this series, AxKit integrates tightly with Apache and Apache's configuration engine. Apache allows different files and directories to have different configurations applied, including which taglibs are used. In the real world, for instance, it is sometimes necessary to have part of a site use a new version of a taglib that might break an older portion.

In the server I used to build the examples for this article, for instance, the 02/ directory still uses My::WeatherTaglib from the last article, while the 03/ directory uses my_weather_taglib.xsl for one of this article's examples and My::SimpleWeatherTaglib for the other. This is done by combining Apache's <Directory> sections with the AxAddXSPTaglib directive:

    ##
    ## Init the httpd to use our "private install" libraries
    ##
    PerlRequire startup.pl

    ##
    ## AxKit Configuration
    ##
    PerlModule AxKit

    <Directory "/home/me/htdocs">
        Options -All +Indexes +FollowSymLinks

        # Tell mod_dir to translate / to /index.xml or /index.xsp
        DirectoryIndex index.xml index.xsp
        AddHandler axkit .xml .xsp

        AxDebugLevel 10

        AxTraceIntermediate /home/me/axtrace

        AxGzipOutput Off

        AxAddXSPTaglib AxKit::XSP::Util
        AxAddXSPTaglib AxKit::XSP::Param

        AxAddStyleMap application/x-xsp \
                      Apache::AxKit::Language::XSP

        AxAddStyleMap text/xsl \
                      Apache::AxKit::Language::LibXSLT
    </Directory>

    <Directory "/home/me/htdocs/02">
        AxAddXSPTaglib My::WeatherTaglib
    </Directory>

    <Directory "/home/me/htdocs/03">
        AxAddXSPTaglib My::SimpleWeatherTaglib
    </Directory>

See "How Directory, Location and Files sections work" in the Apache httpd documentation (v1.3 or 2.0) for the details of how to use <Directory> and other httpd.conf sections to do this sort of thing.

Help and thanks

Jörg Walter and Matt Sergeant were of great help in writing this article, especially since I don't do LogicSheets. Jörg also fixed a bug in absolutely no time, and he wrote both the SimpleTaglib module and the AxTraceIntermediate feature.

In case of trouble, have a look at some of the helpful resources we listed in the first article.

Copyright 2002, Robert Barrie Slaymaker, Jr. All Rights Reserved.
