March 2002 Archives

CPAN PLUS

Setting: A warm August day, somewhere in Amsterdam. In the bar of Yet Another Hotel. A large number of people are gathered. Judging by their attire, they're not here for business -- or at least not the business of selling vacuum cleaners. They notice the sun and appreciate it, but it is the shining light of their laptops that captivates them.

A man with long dark hair stands up -- he appears to be the leader of this unusual congregation. He's wearing overalls and is barefoot.

He begins to speak of something called "see pants," which has the potential to change the world of the laptop-people. It would alter the way free software is distributed. It would receive a mark of quality; it would be tested and reviewed. And gosh darnit, it would be good.

That man was Michael Schwern. His idea was CPANTS, the "CPAN Testing Service," and the occasion was YAPC::Europe 2001. But CPANTS required work. Each of us set off with our own little part, trying to make the world a better place.

Being a novice to the Perl community and eager for a challenging project to sink my teeth into, I offered to patch CPAN.pm so that CPANTS could automatically build and test modules. I imagined this would be a simple task.

I started looking through the sources of CPAN.pm. Although in my experience as a user, CPAN.pm had always worked as advertised, as a developer I was confronted with its limitations: It was not very modular, allowing few additions to its functionality; it had a limited programming interface, being basically designed for interactive use.

So there I stood, having made a commitment to improve the CPAN interface but lacking the code base to do so, with two choices: complain or fix. Since complaining wouldn't give the desired result, the only remaining option was to start anew.

Thus CPANPLUS was born. Its objective is simple: do what CPAN.pm does, but do it better. We'd start with a clean code base designed to accommodate different types of use. But at the same time, this code should be a starting point, not the end point.

So is CPANPLUS better? That's for you to decide. The project began in October 2001, and late March 2002 marks the first official release on CPAN, timed to accompany this article.

Setting Up CPANPLUS

Setting up CPANPLUS should be simple. It is installed like any other Perl module:


    perl Makefile.PL
    make
    make test
    make install

The setup of CPANPLUS happens during "perl Makefile.PL". CPANPLUS will attempt to do two things at this point:

  • write an appropriate configuration for your system; and
  • probe for modules CPANPLUS would like to have, but which are not required.

Currently, the configuration is not automatic, so you will be prompted to answer some questions about your system, although in many cases the default values are acceptable. CPANPLUS will then fetch the index files for the first time and ask you to choose your favorite CPAN mirrors. You can pick from a list, or specify your own.

One question remains: Should we probe for missing modules? It is recommended that you do so, because CPANPLUS is faster and better with those modules installed. It's up to you, however: CPANPLUS does not require any noncore modules to run.

Continue with make and make test. All tests should pass -- if they don't, something is wrong. A list of tested platforms is available at the CPANPLUS FAQ site. Finally, run "make install" and CPANPLUS will be installed on your system!

The Structure of CPANPLUS

As you may have noticed if you've looked at the sources, CPANPLUS is spread out over many modules. This is because of heavy subclassing: We believe that each specific task should have its own space in the CPANPLUS library. This modular build allows for extensions to the library and many plugins.

There are two modules that are of particular interest to users of CPANPLUS. One is the user interface, CPANPLUS::Shell, and the other is the programming interface, CPANPLUS::Backend. Both modules will be explained in more depth later. Two other modules allow you to alter the behavior of the library at runtime. These are CPANPLUS::Error, which allows you to manipulate error messages to and from CPANPLUS; and CPANPLUS::Configure, which allows you to change the configuration at runtime. They're definitely worth looking at if you are a developer using CPANPLUS.

The User's Interface: 'Shell'

In truth, it isn't fair to say that "Shell" is the user's interface; CPANPLUS is designed to work with any number of shells. In fact, if you want to write your own shell, then that's possible. At the moment, only the default shell exists, but Jouke Visser is at work on a wxPerl shell.

You can specify which shell you wish to run in your configuration file. The default is CPANPLUS::Shell::Default.

There are two ways to invoke the shell. One is the familiar way:

perl -MCPANPLUS -eshell

There is also an executable in your Perl bin directory:

cpanp

The -M syntax also accepts a few flags that allow you to install modules from the command line. See the perldoc for CPANPLUS for details.
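For example, a one-liner along these lines installs a module without entering the shell (a sketch -- check the CPANPLUS perldoc for the exact exported functions and flags):

    perl -MCPANPLUS -e 'install("Acme::Buffy")'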

Now let's look at using the default shell.

One of its features is that each command works with a single letter. This could be called the "compact shell" as it is designed to be small, but still provide all the basic commands you need.

Here is a short summary of the command options available, which can also be seen by typing "h" or "?" at the shell prompt:


    a AUTHOR [ AUTHOR]    Search by author or authors
    m MODULE [ MODULE]    Search by module or modules
    i MODULE | NUMBER     Install a module by name or previous result
    d MODULE | NUMBER     Download a module to the current directory
    l MODULE [ MODULE]    Display detailed information about a module
    e DIR    [ DIR]       Add directories to your @INC
    f AUTHOR [ AUTHOR]    List all distributions by an author
    s OPTION VALUE        Set configuration options for this session
    p [ FILE]             Print the error stack (optionally to a file)
    h | ?                 Display help
    q                     Exit the shell

Let's assume you want to see whether you can install a module in the Acme:: namespace.

First, you'd look for modules that match your criteria with a module search:

m ^acme::

As you can see, a search can take a regular expression. It's a feature of the default shell that all searches are case-insensitive.

This search will return a result like this:


    0001  Acme::Bleach          1.12    DCONWAY
    0002  Acme::Buffy           undef   LBROCARD
    0003  Acme::Colour          0.16    LBROCARD
    0004  Acme::ComeFrom        0.05    AUTRIJUS
    0005  Acme::DWIM            1.05    DCONWAY

The first number is simply the ID for this search, which can be used as a shortcut for subsequent commands. The next column is the name of the module, followed by a version number. The last is the CPAN id of the author.

Imagine you'd like to get more information about Acme::Buffy. Simply type:

l Acme::Buffy

Or, to save effort, the ID from the most recent search can be used:

l 2

This will give results like this:


  Details for Acme::Buffy:
  Description               An encoding scheme for Buffy fans
  Development Stage         Released
  Interface Style           hybrid, object and function interfaces available
  Language Used             Perl-only, no compiler needed, should be platform independent
  Package                   Acme-Buffy-1.2.tar.gz
  Support Level             Developer
  Version                   undef

If "Acme::Buffy" looks appealing, then you can install it:

i 2

That's all you need to install modules with CPANPLUS.

The Programmer's Interface: 'Backend'

CPANPLUS shells are built upon the CPANPLUS::Backend module. Backend provides generic functions for module management tasks. It is suitable for creating not only shells, but also autonomous programs.

Rather than describe in detail all the available methods, which are documented in the CPANPLUS::Backend pod, I will give a few bits of sample code to show what you can do with Backend.


    ### Install all modules in the POE:: namespace ###
    my $cb = new CPANPLUS::Backend;
    my $hr = $cb->search( type => 'module', list => [qw|^POE$ ^POE::.*|] );
    my $rv = $cb->install( modules => [ keys %$hr ] );

The variable $rv is a hash reference where the keys are the names of the modules and the values are exit states. This allows you to check how the installation went for each module. You can also get an error object from Backend with a complete history of what CPANPLUS did while installing these modules.
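For instance, a short loop like this reports the outcome per module (a sketch, assuming a true exit state means the install succeeded):

    ### Report the outcome of each installation ###
    for my $mod ( sort keys %$rv ) {
        print $rv->{$mod}
            ? "$mod installed successfully\n"
            : "$mod failed to install\n";
    }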


    ### Fetch a certain version of LWP ###
    my $cb = new CPANPLUS::Backend;
    my $rv = $cb->fetch( modules => ['/G/GA/GAAS/libwww-perl-5.62.tar.gz'] );

Once again, $rv is a hash reference, where the key is the module you tried to fetch and the value is the location on your disk where it was stored. Some people might not care for the way searches are handled and would rather roll their own. Backend allows you to take matters into your own hands:


    ### Do your own thing ###
    my $cb = new CPANPLUS::Backend;
    my $mt = $cb->module_tree();

$mt now holds the complete module tree, which is the same tree CPANPLUS uses internally. For this hash reference, the keys are the names of modules, and values are CPANPLUS::Internals::Module objects.


    for my $name ( keys %$mt ) {
        if ($name =~ /^Acme/) {
            my $href = $mt->{$name}->modules();
            
            while ( my ($mod,$obj) = each %$href ) {
                print $obj->install()
                    ? "$mod installed succesfully\n"
                    : "$mod installation failed!\n";
            }
        }
    }

This traverses the module tree, looking for module names that match the regular expression '/^Acme/' and installing all modules by the same author.

Why would you want to do this? We all have our reasons, and mine is that the Acme:: namespace is CPAN's bleeding edge codebase. Authors who have modules there must be trustworthy!

Merits of the Interfaces

In addition to the modules I've mentioned, there are plenty more: CPANPLUS currently contains 17 modules. These modules are part of a three-tiered approach. Underneath everything sits Internals, which performs the nitty-gritty work; Backend rests on top of it; and finally Shell provides a user interface.

The logic behind the layered structure is that everyone wants something different from CPANPLUS. Some people just want a working shell like CPAN.pm provided. Others need a way to write applications that manage Perl installations. Still others dream of more elaborate plugins, like CPANTS or automatic bug ticketing with RT -- something that is already planned.

The division allows us to stay flexible. There is something for everyone in CPANPLUS -- and if it's not there yet, it can probably be built upon the existing codebase.

Current and Future Developments

CPANPLUS was just released, but we're not resting. There's still a lot of functionality we want to provide.

It's high priority to create backward compatibility with the current CPAN.pm so CPANPLUS can eventually be released as CPAN.pm, possibly taking over its place in the core Perl distribution.

Another development that was already mentioned is automatic bug reporting, which would give authors of modules feedback on the performance of their modules on varying platforms, under various configurations. Of course, there's also CPANTS, the idea that sparked the entire CPANPLUS project. CPANTS is intended to provide automated testing of CPAN modules to make certain they meet minimal standards.

We have thoughts about integrating with known package managers like PPM, RPM and dpkg.

We also plan to develop more shells, both for the command-line and for the Windows and X environments.

Naturally, we don't want to stop there. There are a million possibilities with CPANPLUS, and hopefully they'll all be explored and developed.

If you have a good idea, then mail us your suggestion; or better yet, join as a developer and contribute!

Support and Contributing

If you have questions or suggestions, or want to join up as a developer, then send mail to: cpanplus-info@lists.sourceforge.net. This is the general mailing list.

Reports of bugs should be sent to: cpanplus-bugs@lists.sourceforge.net. Some of the developers are also regulars on the IRC channel #CP on magnet.

Where to Get CPANPLUS

There are two places where you can obtain CPANPLUS. The first is, of course, to check your local CPAN mirror (or look it up on search.cpan.org). The latest stable release will always be there.

If you are interested in development versions, then look at SourceForge.

More Information

In addition to the documents that come with CPANPLUS, information is available on our Web site. All good things stem from there.

On a side note, I will be giving talks and tutorials on CPANPLUS at both YAPC::America::North and YAPC::Europe, as well as TPC. Come by and share your ideas!

Credits

Of course, I couldn't end without giving credit to the other developers. Although I started CPANPLUS, it would never have become what it is now without Joshua Boschert and Autrijus Tang. Ann Barcomb wrote all the documentation and Michael Schwern provided tests and general good ideas. Thanks also go to everyone who contributed posts to the development and bug mailing lists.

Conclusion

So here we are, eight months after that first magical gathering. One step closer to the goal. And one step closer to Yet Another Venue. And perhaps there, I'll be standing up, talking about new things, or at least CPANPLUS.

Hopefully I'll be able to meet you there, in a circle of laptops, and we'll continue to make our world a better place!

mod_perl in 30 minutes

Introduction

In the previous article, I showed some quite amazing Web performance reports from companies that have deployed mod_perl heavily. You might be surprised, but you can get similarly amazing results quite easily if you move your service to mod_perl as well. In fact, getting started with mod_perl shouldn't take you more than 30 minutes -- the time it takes to compile and configure the server on a decent machine and get it running.

In this article I'll show step-by-step installation and configuration scenarios, and chances are you will be able to run the basic statically compiled mod_perl setup without reading any other documents. Of course, you will want and need to read the documentation later, but I think you will agree with me that it's ultimately cool to be able to get your feet wet without knowing much about the new technology up-front.

The mod_perl installation was tested on many mainstream Unix platforms, so unless you have a nonstandard system, you shouldn't have any problems building the basic mod_perl server.

If you are a Windows user, then the easiest way is to use the binary package available from http://perl.apache.org/distributions.html. From the same location, you can download the Linux RPM version and CVS snapshots. However, I always recommend building mod_perl from source, and as you will see in a moment, it's an easy thing to do.

Installing mod_perl Is Easy

So let's start with the installation process. If you are an experienced Unix user, then you need no explanation for the following commands. Just copy and paste them and you will get the server installed.

I'll use a % sign as the shell program's prompt.

  % cd /usr/src
  % lwp-download http://www.apache.org/dist/httpd/apache_1.3.20.tar.gz
  % lwp-download http://perl.apache.org/dist/mod_perl-1.26.tar.gz
  % tar -zvxf apache_1.3.20.tar.gz
  % tar -zvxf mod_perl-1.26.tar.gz
  % cd mod_perl-1.26
  % perl Makefile.PL APACHE_SRC=../apache_1.3.20/src \
    DO_HTTPD=1 USE_APACI=1 EVERYTHING=1
  % make && make test && make install
  % cd ../apache_1.3.20
  % make install

That's all!

What's left is to add a few configuration lines to httpd.conf, an Apache configuration file, start the server and enjoy mod_perl.

If you have stumbled upon a problem at any of the above steps, then don't despair -- the next section will explain each step in detail.

Installing mod_perl Detailed

If you didn't have the courage to try the steps in the previous section or you simply want to understand more before you try, then let's go through the fine details of the installation process. If you have successfully installed mod_perl following the short scenario in the previous section, then you can skip this section and move on to the next one.

Before we proceed, I should note that you have to become the root user in order to install the files in a protected area. If you don't have root access, then you can install all the files under your home directory. We will talk about the nuances of this approach in a future article. I'll also assume that you have perl and gcc or an equivalent C compiler installed.

I assume that all builds are being done in the /home/stas/src directory. So we go into this directory.

  % cd /home/stas/src

Now we download the latest source distributions of Apache and mod_perl. If you have the LWP module installed (also known as libwww and available from CPAN), then you should have the lwp-download utility that partly imitates your favorite browser by allowing you to download files from the Internet. You can use any other method to retrieve these files. Just make sure that you save both files in the /home/stas/src directory, as this will make it easier for you to follow the example installation process. Of course, you can install both packages anywhere on your file system.

  % lwp-download http://www.apache.org/dist/httpd/apache_1.3.20.tar.gz
  % lwp-download http://perl.apache.org/dist/mod_perl-1.26.tar.gz

You can make sure that you're downloading the latest stable versions by visiting the following distribution directories: http://www.apache.org/dist/httpd/ and http://perl.apache.org/dist/. As you have guessed already, the former URL is the main Apache distribution directory; the latter is the same thing for mod_perl.

Next, uncompress and untar both source archives. In addition to its main use for tarring and untarring files, the GNU tar utility can uncompress files compressed by the gzip utility when the -z option is used.

  % tar -zvxf apache_1.3.20.tar.gz
  % tar -zvxf mod_perl-1.26.tar.gz

If you have a non-GNU tar utility, then chances are that it will be unable to decompress, so you need to do it in two steps. First, uncompress the packages with:

  % gzip -d apache_1.3.20.tar.gz
  % gzip -d mod_perl-1.26.tar.gz

Then untar them with:

  % tar -xvf apache_1.3.20.tar
  % tar -xvf mod_perl-1.26.tar

If you don't have tar or gzip utilities available, then install them or use their equivalents.

Now go into the mod_perl source distribution directory.

  % cd mod_perl-1.26

The next step is to create the Makefile.

  % perl Makefile.PL APACHE_SRC=../apache_1.3.20/src \
    DO_HTTPD=1 USE_APACI=1 EVERYTHING=1

mod_perl accepts a variety of parameters; in this scenario, we are going to use those that will allow you to do almost everything with mod_perl. Once you learn more about mod_perl, you will be able to fine-tune the list of parameters passed to Makefile.PL. In future articles, I'll go through all the available options.

Running perl Makefile.PL ... will check for prerequisites and tell you which required software packages are missing from your system. If you don't have some of the Perl packages installed, then you will have to install them before you proceed. They are all available from CPAN and can be easily downloaded and installed.

If you choose to install mod_perl with help of the CPAN.pm module, then it will install all the missing modules for you. To do so, tell CPAN.pm to install the Bundle::Apache bundle.
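For example:

  % perl -MCPAN -e 'install "Bundle::Apache"'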

This step also executes the ./configure script from Apache's source distribution directory (absolutely transparently for you), which prepares the Apache build configuration files. If you need to pass parameters to Apache's ./configure script, then pass them as options to perl Makefile.PL .... In future articles we will talk about all the available options.
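For instance, mod_perl's Makefile.PL accepts an APACI_ARGS parameter whose value is handed through to Apache's ./configure script; the prefix below is just an example:

  % perl Makefile.PL APACHE_SRC=../apache_1.3.20/src \
    DO_HTTPD=1 USE_APACI=1 EVERYTHING=1 \
    APACI_ARGS='--prefix=/usr/local/apache'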

Now you should build the httpd executable by using the make utility.

  % make

This command prepares the mod_perl extension files, installs them in the Apache source tree and builds the httpd executable (the Web server itself) by compiling all the required files. When the make process completes, you are back in the mod_perl source distribution directory.

make test executes various mod_perl tests on the freshly built httpd executable.

  % make test

This command starts the server on a nonstandard port (8529) and tests whether all parts of the built server function correctly. If something goes wrong, then the process will report it to you.

make install completes the installation process of mod_perl by installing all the Perl files required for mod_perl to run and, of course, the server documentation (man pages).

  % make install

You can also chain the three commands:

  % make && make test && make install

This simplifies the installation, since you don't have to wait for each command to complete before starting the next one. When installing mod_perl for the first time, though, it's better to do it step by step.

If you choose the all-in-one approach, then you should know that if make fails, then neither make test nor make install will be executed. If make test fails, then make install will not be executed.

Finally, change to the Apache source distribution directory, run make install to create the Apache directory tree and install Apache header files (*.h), default configuration files (*.conf), the httpd executable and a few other programs.

  % cd ../apache_1.3.20
  % make install

Note that, as with a plain Apache installation, any configuration files left from a previous installation won't be overwritten by this process. You don't need to back up your previously working configuration files before the installation.

When the make install process completes, it will tell you how to start a freshly built Web server (the path to the apachectl utility that is being used to control the server) and where the installed configuration files are. Remember or, even better, write down both of them, since you will need this information. On my machine the two important paths are:

  /usr/local/apache/bin/apachectl
  /usr/local/apache/conf/httpd.conf

So far, we have completed the building and installation of the mod_perl enabled Apache. The next steps are to configure httpd.conf, write a little test script, start the server and check that the test script is working.

Configuring and Starting mod_perl Server

First things first; we want to make sure that our Apache was built correctly and that we can serve plain HTML files with it. Why? To minimize the number of suspects if we later find that mod_perl doesn't work. Once you know that Apache can serve HTML files, you don't have to worry about it anymore. If something then goes wrong with mod_perl, you have eliminated the possibility that the httpd binary or the basic configuration is broken; you know that you are allowed to bind to the port you have configured the server to listen on, and that the browser you're testing with is fine. Again, you should follow these guidelines when installing mod_perl for the first time.

Configure Apache as you always do. Set Port, User, Group, ErrorLog and other directives in the httpd.conf file (remember the location of this file, noted at the end of the previous section?). Use the defaults as suggested, and customize only when you have to. Values that you typically need to customize are ServerName, Port, User, Group, ServerAdmin, DocumentRoot and a few others. You will find helpful hints preceding each directive. Follow them if in doubt.

When you have edited the configuration file, it's time to start the server. One of the ways to start and stop the server is to use the apachectl utility. You start the server with:

  % /usr/local/apache/bin/apachectl start

And stop it with:

  % /usr/local/apache/bin/apachectl stop

Note that you have to be root when starting the server if the server is going to listen on port 80 or another privileged port (<1024).

After you start the server, check in the error_log file (/usr/local/apache/logs/error_log is the file's default location) that the server has indeed started. Don't rely on the status apachectl reports. You should see something like this:

  [Thu Jun 22 17:14:07 2000] [notice] Apache/1.3.20 (Unix) 
  mod_perl/1.26 configured -- resuming normal operations

Now point your browser to http://localhost/ or http://your.server.name/ as configured with the ServerName directive. If you have set a Port directive with a value different from 80, then append that port number to the server name. If you have used port 8080, then test the server with http://localhost:8080/ or http://your.server.name:8080/. You should see the infamous ``It worked'' page, which is an index.html file that make install in the Apache source tree installs for you. If you don't see this page, then something is wrong and you should check the contents of the error_log file. You will find the path of the error log file by looking up the ErrorLog directive in httpd.conf.

If everything works as expected, then shut down the server, open httpd.conf in your favorite editor, and scroll to the end of the file, where we will add the mod_perl configuration directives (of course you can place them anywhere in the file).

Assuming that you put all scripts that should be executed by the mod_perl enabled server in the /home/httpd/perl/ directory, add the following configuration directives:

  Alias /perl/ /home/httpd/perl/

  PerlModule Apache::Registry
  <Location /perl>
    SetHandler perl-script
    PerlHandler Apache::Registry
    Options ExecCGI
    PerlSendHeader On
    allow from all
  </Location>

Save the modified file.

This configuration causes each URI starting with /perl to be handled by the Apache mod_perl module. It will use the handler from the Perl module Apache::Registry.

Preparing the Scripts Directory

Now create the /home/httpd/perl/ directory if it doesn't yet exist. In order for you and Apache to be able to read, write and execute files, we have to set the correct permissions. You could get away with simply doing:

  % chmod 0777  /home/httpd/perl

This is very, very insecure, and you should not follow this approach on a production machine. It is good enough when you just want to try things out with as few obstacles as possible. Once you understand how things work, you should tighten the permissions of files served by Apache. In future articles, we will talk about setting proper file permissions.

The ``mod_perl rules'' Apache::Registry Script

As you probably know, mod_perl allows you to reuse CGI scripts written in Perl that were previously used under mod_cgi. Therefore, our first test script can be as simple as:

  mod_perl_rules1.pl
  ------------------
  print "Content-type: text/plain\r\n\r\n";
  print "mod_perl rules!\n";

Save this script in the /home/httpd/perl/mod_perl_rules1.pl file. Notice that the shebang line is not needed with mod_perl, but you can keep it if you want. So the following script can be used as well:

  mod_perl_rules1.pl
  ------------------
  #!/usr/bin/perl
  print "Content-type: text/plain\r\n\r\n";
  print "mod_perl rules!\n";

Of course you can write the same script using the Apache Perl API:

  mod_perl_rules2.pl
  ------------------
  my $r = shift;
  $r->send_http_header('text/plain');
  $r->print("mod_perl rules!\n");

Save this script in the /home/httpd/perl/mod_perl_rules2.pl file.

Now make both of the scripts executable and readable by the server. Remember that when you execute scripts from a shell, they run as the user you are logged in as. When they are instead run in response to browser requests, Apache needs to be able to read and execute them. So we make the scripts readable and executable by everybody:

  % chmod 0755   /home/httpd/perl/mod_perl_rules1.pl \
                 /home/httpd/perl/mod_perl_rules2.pl

If you don't want other users to be able to read your script, then you should add yourself to the group the Web server runs as (as defined by the Group directive), make the script owned by that group, and tighten the permissions. For example, on my machine I run the server under the group httpd and I'm the only one in that group, so I can do the following:

  % chown stas.httpd /home/httpd/perl/mod_perl_rules1.pl \
                 /home/httpd/perl/mod_perl_rules2.pl
  % chmod 0750   /home/httpd/perl/mod_perl_rules1.pl \
                 /home/httpd/perl/mod_perl_rules2.pl

The first command makes the files belong to group httpd; the second sets the proper execution and read permissions.

That's secure, assuming that you have a dedicated group for your server.

Also, remember that all the directories that lead to the script should be readable and executable by the server.

You can test mod_perl_rules1.pl from the command line, since it is essentially a regular Perl script.

  % perl /home/httpd/perl/mod_perl_rules1.pl

You should see the following output:

  mod_perl rules!

You cannot test the second script by executing it from the command line since it uses the mod_perl API that is available only when run from within the mod_perl server.
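That said, mod_perl ships with Apache::FakeRequest, which provides a mock request object for exercising handler logic outside the server. A rough sketch (hedged: most of its methods are simple stubs, so the emulation is only partial):

  use Apache::FakeRequest ();

  # build a mock request object to stand in for $r
  my $r = Apache::FakeRequest->new;

  # now hand $r to the code under test, as Apache::Registry would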

Make sure the server is running and issue these requests using your favorite browser:

  http://localhost/perl/mod_perl_rules1.pl
  http://localhost/perl/mod_perl_rules2.pl

In both cases, you should see the following response:

  mod_perl rules!

If you see it -- congratulations! You have a working mod_perl server.

If you're using port 8080 instead of 80, then you should use this number in the URL:

  http://localhost:8080/perl/mod_perl_rules1.pl
  http://localhost:8080/perl/mod_perl_rules2.pl

The localhost approach will work only if the browser is running on the same machine as the server. If not, then use the real server name for this test. For example:

  http://your.server.name/perl/mod_perl_rules1.pl

If there is any problem, then please refer to the error_log file for the error reports.

Now it's time to move your CGI scripts from the /somewhere/cgi-bin directory to /home/httpd/perl/ and see them run much, much faster when requested from the newly configured base URL (/perl/). If you used to access a script as /cgi-bin/test.pl, then it will now be accessed as /perl/test.pl.

Some of your scripts might not work immediately and will require some minor tweaking or even a partial rewrite to work properly with mod_perl. Chances are that if you are not practicing sloppy programming, then the scripts will work without any modifications.

If you have a problem with your scripts, then a good approach is to replace Apache::Registry with Apache::PerlRun in httpd.conf, as the latter can execute even badly written scripts. Put the following configuration directives in httpd.conf instead, and restart the server:

  PerlModule Apache::PerlRun
  <Location /perl>
    SetHandler perl-script
    PerlHandler Apache::PerlRun
    Options ExecCGI
    PerlSendHeader On
    allow from all
  </Location>

Now your scripts should work, unless there is something in them mod_perl doesn't accept. We will discuss these nuances in future articles.

The ``mod_perl rules'' Apache Perl Module

mod_perl is about running both scripts and handlers. Although I have started to present mod_perl using scripts, because it's easier if you have written CGI scripts before, the more advanced use of mod_perl is about writing handlers. But have no fear. As you will see in a moment, writing handlers is almost as easy as writing scripts.

To create a mod_perl handler module, all I have to do is wrap the code I used for the script in a handler subroutine, add a statement to return the status to the server when the subroutine has completed successfully, and add a package declaration at the top of the code.

Just as with scripts, you can use either the CGI API you are probably used to:

  ModPerl/Rules1.pm
  ----------------
  package ModPerl::Rules1;
  use Apache::Constants qw(:common);

  sub handler{
    print "Content-type: text/plain\r\n\r\n";
    print "mod_perl rules!\n";
    return OK;
  }
  1; # satisfy require()

or the Apache Perl API, which allows you to interact more intimately with the Apache core by providing an API unavailable under regular Perl. Of course, in the simple example shown here, either approach is fine, but when you need to use the API, this version of the code should be used.

  ModPerl/Rules2.pm
  ----------------
  package ModPerl::Rules2;
  use Apache::Constants qw(:common);

  sub handler{
    my $r = shift;
    $r->send_http_header('text/plain');
    print "mod_perl rules!\n";
    return OK;
  }
  1; # satisfy require()

Create a directory called ModPerl under one of the directories in @INC (e.g. /usr/lib/perl5/site_perl/5.005), and put Rules1.pm and Rules2.pm into it; the files should contain the code from the examples above.
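For example, on a system with the site_perl directory shown below (adjust the path to match your own @INC):

  % mkdir /usr/lib/perl5/site_perl/5.6.1/ModPerl
  % cp Rules1.pm Rules2.pm /usr/lib/perl5/site_perl/5.6.1/ModPerl/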

To find out what the @INC directories are, execute:

  % perl -le 'print join "\n", @INC'

On my machine it reports:

  /usr/lib/perl5/5.6.1/i386-linux
  /usr/lib/perl5/5.6.1
  /usr/lib/perl5/site_perl/5.6.1/i386-linux
  /usr/lib/perl5/site_perl/5.6.1
  /usr/lib/perl5/site_perl
  .

Now add the following snippet to httpd.conf to configure mod_perl to execute the ModPerl::Rules1::handler subroutine whenever a request to mod_perl_rules1 is made:

  PerlModule ModPerl::Rules1
  <Location /mod_perl_rules1>
    SetHandler perl-script
    PerlHandler ModPerl::Rules1
  </Location>

Now you can issue a request to:

  http://localhost/mod_perl_rules1

and just as with our mod_perl_rules1.pl and mod_perl_rules2.pl scripts, you will see:

  mod_perl rules!

as the response.

To test the second module, ModPerl::Rules2, add the same configuration, replacing all 1's with 2's:

  PerlModule ModPerl::Rules2
  <Location /mod_perl_rules2>
    SetHandler perl-script
    PerlHandler ModPerl::Rules2
  </Location>

And to test it, use the URI:

  http://localhost/mod_perl_rules2

Is This All I Need to Know About mod_perl?

Obviously, the next question you'll ask is: ``Is this all I need to know about mod_perl?''.

The answer is: ``yes and no''.

The yes part:

  • Just like with Perl, you need to know only a little about mod_perl to do really cool stuff. The presented setup allows you to run your visitor counters and guest book much faster, and amaze your friends, usually without changing a single line of code.

The no part:

  • A 50-fold improvement in guest book response times is great, but when you deploy a heavy service with thousands of concurrent users, given the intense competition between similar Web services, a delay of a few milliseconds might cost you a customer, and probably many of them.

    Of course, when you test a single script and you are the only user, you don't really care about squeezing yet another millisecond from the response time, but it becomes a real issue when these milliseconds add up at a production site, with hundreds of users concurrently generating requests to various scripts. Users aren't merciful nowadays -- if there is another, even less fancy site that provides the same service a little bit faster, then chances are they will go there.

    Testing your scripts on an unloaded machine can be misleading; everything might seem perfect. But when you move them to a production machine, things don't behave as well as they did on your development box. Many times you simply run out of memory on busy services. You need to learn how to optimize your code to use less memory and how to make memory shared.

    Debugging is something people prefer not to talk about, since the process can be tedious. Learning how to make debugging simpler and more efficient is a must if you consider yourself a Web programmer. This is not especially straightforward when debugging CGI scripts, and it is even more complicated with mod_perl -- unless you know how, and then it suddenly becomes easy.

    mod_perl has many features unavailable under mod_cgi when working with databases. Among others, the most important is persistent database connections.

    You have to know how to keep your service running nonstop and be able to recover quickly if there are any problems.

    Finally, the most important thing is the Apache-Perl API, which allows you to do anything with a received request, even intervene in every stage of request processing. This gives you great flexibility and allows you to create things you couldn't dream of with plain mod_cgi.

There are many more things to learn about mod_perl and Web programming in general. In future articles, I'll talk in detail about all of these issues.

References

The Apache site's URL: http://www.apache.org/

The mod_perl site's URL: http://perl.apache.org/

CPAN is the Comprehensive Perl Archive Network. The master site's URL is http://cpan.org/. CPAN is mirrored at more than 100 sites worldwide. (http://cpan.org/SITES.html)

Acknowledgements

Many thanks to Eric Cholet for reviewing this article.

A Perl Hacker's Foray into .NET

No, I haven't sold out; I haven't gone over to the dark side; I haven't been bought. I'm one of the last people to be using closed-source software by choice. But one of the traits of any self-respecting hacker is curiosity, and so when he hears about some cool new technology, he's almost obliged to check it out and see whether there's anything he can learn from it. So this particular Perl hacker took a look at Microsoft's .NET Framework, and, well, Mikey, I think he likes it.

What Is .NET?

When something's as incredibly hyped as Microsoft's .NET project, it's hard to convince people that there's a real working technology underneath it. Unfortunately, Microsoft doesn't do itself any favors by slapping the .NET moniker on anything they can. So let's clarify what we're talking about.

.NET is applied to anything with the broad notion of "Web services" -- from the Passport and Hailstorm automated privacy-deprivation services and the Web-service-enabled versions of operating systems and application products to the C# language and the Common Language Runtime. But there is an underlying theme and it goes like this: The .NET Framework is an environment based on the Common Language Runtime and (to some extent) the C# language, for creating portable Web services.

So for our exploration, the components of the .NET Framework that we care about are the Common Language Runtime and the C# language. And to nail it down beyond any doubt, these are things that you can download and use today. They're real, they exist and they work.

The .NET CLR

Let's begin with the CLR. The CLR is, in essence, a virtual machine for C#, much like the Java VM, but specifically designed to allow a wide variety of languages other than C# to run on it. Does this ring any bells with Perl programmers? Yes, it's not entirely dissimilar to the Parrot VM, the host VM for Perl 6, which is likewise designed to run other languages as well.

But that's more or less where the similarity ends. For starters, while Parrot is chiefly intended to be run as an interpreted VM with a "bolted-on" JIT, the CLR is expected to be JITted from the get-go. Microsoft seems to want to avoid the accusations of slowness leveled at Java by effectively requiring JIT compilation.

Another "surface" distinction between Parrot and CLR is that the languages supported by the CLR are primarily statically typed languages such as C#, J#, (a variant of Java) and Visual Basic .NET. The languages Parrot aims to support are primarily dynamically typed, allowing run-time compilation, symbolic variable access, (try doing ${"Package::$var"} in C#...) closures, and other relatively wacky operations.

To address these sorts of features, the Project 7 research project was set up to provide .NET ports for a variety of "academic" languages. Unfortunately, it transpires that this has highlighted some limitations of the CLR, and so almost all of the implementations have had to modify their target languages slightly or drop difficult features. For instance, the work on Mercury turned up some deficiencies in CLR's Common Type System that would also affect a Perl implementation. We'll discuss these deficiencies later when we examine how Perl and the .NET Framework can interact.

But on the other hand, let's not let this detract from what the CLR is good at - it can run a variety of different languages relatively efficiently, and it can share data between languages. Let's now take a look at C#, the native language of the CLR, and then see how we can run .NET executables on our favourite free operating systems.

C#

C# is Microsoft's new language for the .NET Framework. It shares some features with Java, and in fact looks extremely like Java at first glance. Here's a piece of C# code:


using System;

class App {
   public static void Main(string[] args) {
      Console.WriteLine("Hello World");
      foreach (String s in args) {
         Console.WriteLine("Command-line argument: " + s);
      }
   }
}

Naturally, the Java-like features are quite obvious to anyone who's seen much Java - everything's in a class, and there's an explicitly defined Main function. But what's this? A Perl-like foreach loop. And that using declaration seems strangely familiar.

Now, don't get me wrong. I'm not trying to claim that C# is some bastard offspring of Perl and Java, or even that C# really has that much in common with Perl; it doesn't. But it is a well-designed language that does have a bunch of "programmer-friendly" language features that traditionally made "scripting" languages like Perl or Python faster for rapid code prototyping.

Here's some more code, which forms part of a game-of-life benchmarking tool we used to benchmark the CLR against Parrot.


    static String generate(String input) {
        int cell, neighbours;
        int len = input.Length;
        String output = "";
        cell = 0; 
        do {
            neighbours = 0;
            foreach (int offset in new Int32[] {-16, -15, -14, -1, 1, 14, 15, 16}) {
                int pos = (offset + len + cell) % len;
                if (input.Substring(pos, 1) == "*")
                    neighbours++; 
            }
            if (input.Substring(cell, 1) == "*") {
                output += (neighbours < 2 || neighbours > 3) ? " " : "*";
            } else {
                output += (neighbours == 3) ? "*" : " ";
            } 
        } while (++cell < len); 
        return output;
    }

This runs one generation of the game of life, taking an input playing field and building an output string. What's remarkable about this is that I wrote it after a day of looking at C# code, with no prior exposure to Java. C# is certainly easy to pick up.

What can Perl learn from C#? That's an interesting question, especially as the Perl 6 design project is ongoing. Let's have a quick look at some of the innovations in C# and how we might apply them to Perl.

Strong Names

We'll start with an easy one, since Larry has already said that something like this will already be in Perl 6: To avoid versioning clashes and interface incompatibilities, .NET has the concept of "strong names." Assemblies -- the C# equivalent of Java's jar files -- have metadata containing their name, version number, md5sum and cryptographic signature, meaning you can be sure you're always going to get the definitions and behavior you'd expect from any third-party code you run. More generally, assemblies support arbitrary metadata that you can use to annotate their contents.
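In C#, this metadata is attached with assembly-level attributes, conventionally kept in an AssemblyInfo.cs file. A small sketch (the version and key file name are hypothetical):

using System.Reflection;

[assembly: AssemblyVersion("1.0.3.42")]     // name plus version form the strong name
[assembly: AssemblyKeyFile("keypair.snk")]  // key pair used to sign the assembly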

This approach to versioning and metadata in Perl 6 was highlighted in Larry's State of the Onion talk this year, and is also the solution used by JavaScript 2.0, as described by Waldemar Horwat at his LL1 presentation, so it seems to be the way the language world is going.

Properties

C# supports properties, which are class fields with explicit get/set methods. This is slightly akin to Perl's tying, but much, much slicker. Here's an example:


    private int MyInt;
    public int SomeInt {
        get {
            Console.WriteLine("I was got.\n");
            return MyInt;
        }
        set {
            Console.WriteLine("I was set.\n");
            MyInt = value;
        }
    }

Whenever we access SomeInt, the get accessor is executed, and returns the value of the underlying MyInt variable; when we write to it, the corresponding set accessor is called. Here's one suggested way we could do something similar in Perl 6:


      my $myint;
      our $SomeInt :get(sub{ print "I was got!\n"; $myint })
                   :set(sub{ print "I was set!\n"; $myint = $^a });
    

C# actually takes this idea slightly further, providing "indexers", which are essentially tied arrays:


    private String realString;
    public String this[int idx] {   // C# indexers are declared with "this"
        get {
            return realString.Substring(idx, 1);
        }
        set {
            realString = realString.Substring(0, idx) + value
                       + realString.Substring(idx + 1);
        }
    }

    myObj[12] = "*"; // given an instance myObj: substr($string, 12, 1) = "*";

Object-Value Duality

Within the CLR type system (CTS), there are two distinct kinds of types: reference types and value types. Value types are the simple, honest-to-God values: integers, floating point numbers, strings, and so on. Reference types, on the other hand, are objects, references, pointers and the like.

Now for the twist: Each value type has an associated reference type, and you can convert values between them. So, if you've got an int counter, then you can "box" it as an object like so: Object CounterObj = counter. (More specifically, int corresponds to Int32.) This gives us the flexibility of objects when we need to, for instance, call methods on them, but the speed of plain values when we're doing tight loops on the stack.
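In code, the round trip looks like this; the boxed object supports calls such as GetType() that the raw value alone couldn't satisfy:

    int counter = 42;
    Object CounterObj = counter;              // box: copy the int into a heap object
    int back = (int) CounterObj;              // unbox: an explicit cast copies it out
    Console.WriteLine(CounterObj.GetType());  // prints "System.Int32"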

While Perl is and needs to remain an essentially untyped language, optional explicit typing definitions combined with object-value duality could massively up Perl's flexibility as well as bringing some potential optimizations.

Chaining Delegates

Here's an extremely rare thing - a non-obvious use of operator overloading that actually makes some sense. In event-driven programming, you'll often want to assign callbacks to be run on a given event. Here's how C# does it (the following code is adapted from Events in C# by Sanju):


delegate void ButtonEventHandler(object source, int clickCount);

class Button {
    public event ButtonEventHandler ButtonClick;

    public void clicked(int count) { // Fire the handler
        if (ButtonClick != null) ButtonClick (this,count);
    }
}

public class Dialog {
    public Dialog() {
        Button b = new Button();

        b.ButtonClick += new ButtonEventHandler(onButtonAction);
        b.clicked(1);
    }

    // the handler lives inside Dialog, so the delegate can reach it
    public void onButtonAction(object source, int clickCount) {
        // Define the actions to be performed on button-click here.
    }
}

Can you see what's going on? The "delegate" type ButtonEventHandler is a function signature that we can use to handle button click events. Our Button class has one of these handlers, ButtonClick, which is defined as an event. In the Dialog class, we instantiate a new delegate, using the onButtonAction method to fulfill the role of a ButtonEventHandler.

But notice how we assign it to the Button's ButtonClick field - we use addition. We can add more handlers in the same way:


    b.ButtonClick += new ButtonEventHandler(myButtonHandler);
    b.ButtonClick += new ButtonEventHandler(otherButtonHandler);
    

And now when the button's clicked method fires off the delegates, all three of these functions will be called in turn. We might decide that we need to get rid of one of them:


    b.ButtonClick -= new ButtonEventHandler(myButtonHandler);
    

After that, only the two functions onButtonAction and otherButtonHandler are active. Chaining delegates like this is something I haven't seen in any other language, and it makes sense for event-based programming; it might be good for Perl 6 to support something similar.

Mono and Rotor - Running .NET

OK, enough talk about C#. Let's go run some.

Of course, the easiest way to do this at present is to do your development on a Windows box. Just grab a copy of the .NET Framework SDK (only 137M!), install it, and you have a C# compiler at your disposal that can produce .NET executables running on the Microsoft CLR. This is how I do my C# experimentation - I have a copy of Windows running on a virtual machine, sharing a filesystem with my OS X laptop. I do my editing in my favourite Unix editor, then pop over to the Windows session to run the CSC compiler.

I know that for some of us, however, that's not a great solution. Thankfully, the creative monkeys at Ximian have been feverishly working on bringing us an open-source .NET Framework implementation. The Mono project comprises an implementation of the Common Language Runtime plus a C# compiler and other goodies; a very easy way to get started with .NET is to pick up a release of Mono, and compile and install it.

After the usual ./configure;make;make install, you have three new commands at your disposal: mcs is the Mono C# compiler; mint is the CLR Interpreter; and mono is its JITted cousin.

And yes, Veronica, you can run .NET EXE files on Linux. Let's take the first C# example from the top of this article, and run it:


 % mcs -o hello.exe hello.cs
 % mono hello.exe A Test Program
Hello World
Command-line argument: A
Command-line argument: Test
Command-line argument: Program
RESULT: 0

And just to show you we're not messing you around:


 % file hello.exe
hello.exe: MS Windows PE 32-bit Intel 80386 console executable

Mono isn't a particularly quick runtime, nor is it particularly complete, but it has a large number of hackers improving its base classes every day. It runs a large percentage of the .NET executables I throw at it, and the mcs compiler can now compile itself, so you can do all your development using open source tools.

Another option, once it appears, is Microsoft's Rotor project, a shared source CLR and compiler suite. Rotor aims to be the ECMA standard implementation of the .NET Framework; Microsoft has submitted the Framework for standardization, but in typical style, its own implementations add extra functionality not part of the standard. Oh, and in case the words "shared source" haven't jumped out at you yet, do not even consider looking at Rotor if you may work on Mono at some point. However, for the casual user, its comprehensive implementation means it will be a better short-term choice for .NET experimentation - again, once it's released.

CLR Architecture

Before we finish considering how Perl and the .NET Framework relate to each other, let's take a more in-depth look at the internals of the Common Language Runtime compared to our own Parrot.

First, the CLR is a stack-based virtual machine, as opposed to Parrot's register approach. I don't know why this approach was taken, other than, I imagine, "because everyone else does it." The CLR runs a bytecode language, which Microsoft calls MS-IL when talking about its own implementation of the CLR, and CIL (Common Intermediate Language) when talking to ECMA. It's object-oriented assembler, a true horror to behold, but it works. Here's a fragment of the IL for our Hello example:


    .method public static 
           default void Main(string[] args)  cil managed 
    {
        // Method begins at RVA 0x2090
        .entrypoint
        // Code size 78 (0x4e)
        .maxstack 9
        .locals (
                string  V_0,
                string[]        V_1,
                int32   V_2)
        IL_0000: ldstr "Hello World"
        IL_0005: call void System.Console::WriteLine(string)
        IL_000a: ldarg.s 0
     ...

In order to optimize CLR for JITting, it imposes a number of restrictions on the IL. For instance, the stack may only be used to store parameters and return values from operations and calls; you can't access arbitrary points in the stack; more significantly, the types of values on the stack have to be statically determinable and invariant. That's to say, at a given call in the code, you know for sure what types of things are on the stack at the time.

The types themselves are part of the Common Type System, something every language compiling to .NET has to conform to. As we have mentioned, CTS types are either value types or reference types. There's a smaller subset of the CTS called the Common Language Specification, or CLS. Languages must implement CLS types, and may implement their own types as part of the CTS. The CLS ought to be used in all "outward-facing" APIs where two different languages might meet, the idea being that data passed between two languages is guaranteed to have known meaning and semantics. However, this API restriction is not enforced by the VM.

Types that can appear on the stack are restricted further; you're allowed int32, int64, int, float, a reference, a "managed" pointer or an unmanaged pointer. "Management" is determined by where the pointer comes from (trusted code is managed) and influences what it's allowed to see and how it gets GCed. Local arguments may live somewhere other than on the main stack - this is implementation-defined - in which case they have access to a richer set of types; but since you have a reference to an object, you should be OK.

Other value types include structures and enumerations. Since value types are passed around on the stack, you can't really have big structures, since you'd be passing loads of data. There's also the typed reference, which is a reference plus something storing what sort of reference it is. Reference types are kept in the heap, managed by garbage collection, and are referenced on the stack. This is not unlike what Parrot does with PMC and non-PMC registers.

Like Java, the CLR has a reasonably small number of operations. You can load/store constants, local variables, arguments, fields and array elements; you can create and dereference pointers; you can do arithmetic; you can do conversion, casting and truncating; there are branch ops (including a built-in lookup-table switch op) and method call ops; there's a special tail-recursion method-call op; you can throw and handle exceptions; you can box and unbox, converting value types to reference types and vice versa; you can create an array and find its length; you can handle typed references. And that's essentially it. Anything else is outside the realm of the CLR, and has to be implemented with external methods.
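To give a flavor of these, here is a small hand-written IL fragment (illustrative only, not compiler output) exercising the box and array ops:

    ldc.i4    42                      // push the constant int32 42
    box       [mscorlib]System.Int32  // box it into a heap object
    pop                               // discard the reference

    ldc.i4.8                          // push 8, the length for a new array
    newarr    [mscorlib]System.Int32  // allocate an int32[8] on the heap
    ldlen                             // push the array's length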

An excellent paper comparing the CLR and the JVM has been produced by the team working on Component Pascal; they've ported CP to both virtual machines, and so are very well-placed to run a comparison. See the GPCP project page.

Perl and .NET

How can we connect Perl and .NET? Well, let's look at the pieces of work that have already been done in this area. ActiveState have been leading research, with their experimental Perl for .NET Research and PerlNET projects.

Perl for .NET Research was a brave idea; Jan Dubois essentially wrote a Perl interpreter in C#, and used the standard Perl compilation technique of combining an embedded interpreter with a serialized representation of the Perl program. The compiler is a C# analog of the B::CC module; it then runs the CSC compiler to turn the C# representation of the Perl program, with the Perl interpreter linked in, into an executable. To be honest, I couldn't get Perl for .NET Research to produce executables, but I could study it enough to see what it was doing.

PerlNET, now included with ActiveState's Perl Dev Kit, takes a rather different approach. This time the Perl interpreter sits "outside" the .NET Framework, communicating with it through DLLs. This allows .NET Framework code to call into Perl, and also Perl to make calls into the .NET Framework library. For instance, one may write:

    use namespace "System";
    Console->WriteLine("Hello World!");

to call the System.Console.WriteLine method in the .NET Framework runtime library.

However, neither of these initiatives compiles Perl to MS-IL in the usual sense of the word. This is surprising, since it would be an interesting test of the flexibility of the Common Type System.

This is one of the possible avenues I'd like to see explored in terms of bringing .NET and Perl closer together. Other possibilities include crossover between the CLR and Parrot - I'd love to see .NET executables run on top of Parrot and Parrot bytecode files convertible to .NET; I'd like to see a Perl 6 interpreter emit MS-IL; I'd like to see Perl programs sharing data and objects with other languages on top of some virtual machine.

Like it or not, there's a good chance that the .NET Framework is going to be a big part of the technological scene in the future. I hope after this brief introduction, you're a little more prepared for it when it happens, and we have some direction as to how Perl fits into it.


For more on .NET, check O'Reilly Network's .NET DevCenter.

Introducing AxKit

Series Introduction

This article is the first in a series of articles introducing the AxKit web application platform. Starting from the basics, this series explains how to install a basic AxKit server and then explores AxKit's more powerful capabilities. Basic familiarity with (or ability to google for) XML, Apache, and mod_perl is assumed but expertise is not required. Some References and Helpful Resources are provided at the end of this article to help get you started as well.

AxKit: What is it?

If you already know about AxKit and the wonders of server side XML processing, you may wish to skip to the Basic Installation section.

AxKit is an application and document server that uses XML processing pipelines to generate and process content and deliver it to clients in a wide variety of formats. It's important to note that AxKit is not limited to XML source documents; non-XML documents and data sources can be converted to XML as needed. A basic AxKit pipeline looks like:

[Figure: AxKit processing overview]

The source document may be an article, a data set, data returned by a database query, the output of a Perl routine, a mod_perl handler, a CGI script, and so on. This document is fed into the first processor ("Transform 1"), which alters it according to a "stylesheet" that specifies a set of transforms to apply to the document. The output from the first processor is fed to the second, the second to the third, and so on, until the final document is passed to the browser.

Processing techniques available include conventional XML processing like XSLT, advanced processing more suited to dynamic content (such as Perl versions of XSP and tag libraries), and low-level processing in Perl for those occasions where high-level abstractions merely get in the way.

AxKit provides seamless caching (both of the code for generating dynamic content and of transformed documents), compression, and character set conversion. AxKit also allows other technologies (CGI scripts, session management tools, data converters, Perl modules, Inline::C, etc.) to be used to extend its capabilities.

The current version (v1.5) is tightly coupled with Apache/mod_perl and leverages the exceptional configurability and performance of that platform; work is underway to enable use in other environments like offline processing, cgi-bin, and standalone servers.

Why AxKit?

All the hype surrounding in-browser XML processing makes it seem like there should be little or no need for server-side XML processing. However, for a variety of reasons, XML in the browser is just not available everywhere and, when available, has limitations that server side processing addresses.

Server-side XML processing allows content to be queried, reorganized, translated, styled, and so on, before sending the "final" HTML, XML, text, SVG, or other output to the browser. The server can implement heavy duty data processing using the most effective and appropriate tools (as opposed to the rather more limited tools available client-side, even when you can control the client configuration). The server can then decide to deliver a formatted document ready for presentation and display or to delegate presentation formatting (and its attendant processing overhead) to the client. In some sense, that's an ideal approach: use the server to apply heavyweight or unusual transforms and offload the presentation formatting to the browser.

Some of the advantages of using XML on the server are:

  • XML can be transformed on the server into a wide variety of content delivery languages: XML, WML, XHTML, HTML, RTF, PDF, graphics formats, plain text, etc.
  • Presentation can be separated from content and logic so that transforming the presentation does not mean altering the XML documents or settings on the authoring tools. This article demonstrates separating logic, content and presentation.
  • Even using the same delivery language (HTML, say), documents can be formatted in different ways for differing display media (screen vs. printer, for instance).
  • XML offers support for specifying the character encodings. Source documents can be transcoded into different character sets depending on the browser.
  • XML documents can be expressed in a natural order for their "primary" display mode and reordered to provide different views of the data. Terms in a document's glossary or index are not in document order, for instance.
  • As the content and capabilities of a web site evolve, new XML tags can be introduced in new content without having to "upgrade" older articles. We'll touch on tag libraries (taglibs) in this article and examine them in more depth in later articles.
  • XML-related technologies are becoming well known; hiring XML-literate personnel is becoming easier. With in-house proprietary formats, you always have to train them yourself. With third-party proprietary formats, you hope you can get good enough training to do what you need.
  • XML processing tools (XSLT, editors, etc) are becoming commonly available, so adopting XML outside of the IT department is feasible. There's still a gap in support for WYSIWYG authoring tools, though some are now commercially available.
  • XML can optionally be transformed on the client for browsers that offer sufficient features, thus reducing server workload. Be prepared for slow client-side transforms on larger pages, though; you will probably still want to do reordering and subsetting operations on the server.
  • XML can provide descriptions of the structure of your web site and its content in multiple formats. The "Semantic Web" is coming to a browser near you (real soon now ;-). One of the biggest academic concerns about the web is the lack of semantic content. When looking at a page from a book in HTML format, who's to say whether a heading enclosed by <H1> is a book's author, title, publisher, or even a chapter or section heading? XML can be used to clarify such issues, and standards like the Resource Description Framework (RDF, of course) are gaining ground here.
  • XML-based technologies like the RDF Site Summary (RSS) and the more general Resource Description Framework can be leveraged to allow automated navigation, syndication, and summarization of a web site by other web sites. This is especially applicable to portions of a web site containing current news and press releases; these are often reproduced on other sites' news summaries and "headlines" pages.

AxKit enables all of this and far more. Unlike some more insular environments, AxKit happily bolts up to almost any other technology that you can load into an Apache web server or reach using external requests.

Basic Installation

This section will walk you through a manual installation, but there's an easier way if you just want to play with AxKit: grab the AxKit demo tarball for this article, untar it and run the install script therein. The tarball includes the prerequisites necessary for most modern Unix-like systems and will be updated with new versions and example code with each article. If this works for you (it's been tested on both Linux and FreeBSD), you can skip all of the manual install instructions and jump to Testing AxKit.

AxKit ties together a lot of different technologies and uses a lot of CPAN modules. Doing a basic manual install is not difficult if CPAN.pm (or an equivalent; CPANPLUS support is in the works as of this writing) is working on your system.

The first step in installing AxKit is big, but not usually difficult: installing Apache and mod_perl 1.x (2.x versions are in development but not released at the time of this writing). The mod_perl Developer's Guide covers this process in detail. Here's a quick recipe for a Unix system "private" (i.e., non-root) install in /home/me/axkit_articles-1.0 (vary the version numbers to suit):

$ mkdir axkit_articles-1.0
$ cd axkit_articles-1.0
$ lynx http://httpd.apache.org/dist/httpd/  # Get the latest 1.x version of apache
$ lynx http://perl.apache.org/dist/  # Get the latest 1.x version of mod_perl.
$ gunzip apache_1.3.23.tar.gz
$ tar xf apache_1.3.23.tar
$ gunzip mod_perl-1.26.tar.gz
$ tar xf mod_perl-1.26.tar
$ cd mod_perl-1.26
$ perl Makefile.PL \
>   APACHE_SRC=../apache_1.3.23/src/ \
>   DO_HTTPD=1 \
>   USE_APACI=1 \
>   APACHE_PREFIX=/home/me/axkit_articles-1.0/www \
>   PREFIX=/home/me/axkit_articles-1.0/www \
>   EVERYTHING=1
$ make
$ make test
$ make install

It is usually wise to write your AxKit/mod_perl/Apache build process into a script so that you can debug it, repeat it, and alter it as needed. Servers like these are immensely powerful and configurable; it's pretty likely that you'll want a reproducible, tweakable build environment for them.
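
For instance, a minimal Perl driver for the recipe above might look like this (a sketch only; adjust the paths and versions to suit):

#!/usr/bin/perl
# build_modperl.pl - rebuild Apache and mod_perl reproducibly.
# A sketch; the paths and versions match the recipe above.
use strict;

my $prefix = "/home/me/axkit_articles-1.0/www";

sub run {
    print "@_\n";
    system(@_) == 0 or die "failed: @_\n";
}

chdir "mod_perl-1.26" or die "can't chdir to mod_perl-1.26: $!";
run "perl", "Makefile.PL",
    "APACHE_SRC=../apache_1.3.23/src/",
    "DO_HTTPD=1",
    "USE_APACI=1",
    "APACHE_PREFIX=$prefix",
    "PREFIX=$prefix",
    "EVERYTHING=1";
run $_ for "make", "make test", "make install";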

On Windows, look for the most recent Apache Win32 binary in http://httpd.apache.org/dist/binaries/win32/ and then use PPM to install a mod_perl binary. NOTE: Windows is not recommended for production Apache/mod_perl servers; Apache/mod_perl 1.x does not scale well on this platform, and Apache/mod_perl 2.x is addressing the fundamental architectural issues that cause this.

If all went well, running

$ www/bin/apachectl start

should fire up a (non-AxKit) httpd on port 8080. In case of trouble (both packages are mature, so trouble is not frequent), see the References and Helpful Resources section at the end of the article for some places to seek help.

The next step is to install some AxKit prerequisites: the GNOME project's libxml2 and libxslt, which will be used by the examples in this series of articles. To see if they are already installed, try:

$ xml2-config --version
2.4.13                         # Need >= 2.4.13
$ xslt-config --version
1.0.10                         # Need >= 1.0.10

If not, grab them from the source tarball for the article or a random GNOME mirror and install them using ./configure && make && make install. The tarball above installs all of the prerequisites in the "private" install tree; here we're installing them in the system's shared locations to keep the manual install easy. See the commands generated by the install script if you want private copies.

Please note: libxslt 1.0.10 has a known (very minor) failure in its test suite which causes make to fail on some systems when testing tests/exslt/sets/has-same-node.1.xsl. The install script provided with this article's source tarball removes the offending test before running make.

Now that all of the major prerequisites are installed, let's let CPAN.pm install the final pieces:

$ su
Password:
# perl -MCPAN -e shell
...
cpan> install XML::LibXSLT
...
cpan> install AxKit::XSP::Util
...
cpan> quit
# exit

The AxKit::XSP::Util installation should install AxKit and a number of other prerequisites. If CPAN.pm does not work for you, you might just want to grab the axkit-demo tarball mentioned above and install the packages you find there by hand. There are a lot of them, though, so getting CPAN working is probably the easiest way to do a manual install.
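
If you'd rather not use the interactive shell (in a build script, say), CPAN.pm can also be driven straight from the command line; the following is equivalent to the session above:

# perl -MCPAN -e 'install("XML::LibXSLT"); install("AxKit::XSP::Util")'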

Testing AxKit

All of the relative directories mentioned from now on assume that you are in the axkit_articles-x.y directory unless otherwise indicated.

Once all of the required modules are installed, tweak the www/conf/httpd.conf file to load AxKit by adding:

##
## AxKit Configuration
##

PerlModule AxKit

<Directory "/home/me/axkit_articles-1.0/www/htdocs">
    Options -All +Indexes +FollowSymLinks

    # Tell mod_dir to translate / to /index.xsp
    DirectoryIndex index.xsp
    AddHandler axkit .xml .xsp

    AxDebugLevel 10

    AxGzipOutput On

    AxAddXSPTaglib AxKit::XSP::Util

    AxAddStyleMap application/x-xsp \
                  Apache::AxKit::Language::XSP
</Directory>

We'll walk through the configuration in a bit, but first let's add a www/htdocs/index.xsp test page that looks like:

<?xml-stylesheet href="NULL" type="application/x-xsp"?>
<xsp:page
    xmlns:xsp="http://www.apache.org/1999/XSP/Core"
    xmlns:util="http://apache.org/xsp/util/v1"
>
  <html>
    <body>
      <p>Hi! It's <util:time format="%H:%M:%S"/>.</p>
    </body>
  </html>
</xsp:page>

Now you should be able to restart the server and request the test page like so (whitespace added for readability and so it can be compared to index.xsp easily):

$ www/bin/apachectl restart
$ lynx -source http://127.0.0.1:8080/
<?xml version="1.0" encoding="UTF-8"?>
<html>
  <body>
    <p>Hi! It's Tue Feb  5 16:26:31 2002.</p>
  </body>
</html>

Each request should generate a new page with a different time stamp. You may need to tweak the Port directive in www/conf/httpd.conf if something's already running on port 8080.

How the example works

Later articles in this series will examine various features of AxKit in more depth; for now, let's take a look at how the example from the installation section works.

In fact, by the time of the next article, a new release of AxKit should hopefully have a simple demo facility for each of its major XML processing alternatives. If so, the tarball accompanying the next article will contain the new release.

The Configuration

AxKit integrates quite tightly with the Apache configuration engine. The Apache configuration engine is far more than a text file parser: it forms the core of Apache's request handling capabilities and is the key to Apache's flexibility and extensibility. The directives added to the server's configuration above are a mix of native Apache directives and AxKit directives. Let's walk through the first part of the request cycle and see how the Apache configuration directives affect the request.

The Apache httpd is primarily a configuration engine and a collection of special-purpose modules. This discussion glosses over the fact that several modules other than mod_perl and AxKit are used to process this request and refers to them all as "Apache".

When the HTTP request arrives, Apache parses it and maps the path portion of the URL ("/") to a location on the hard drive (/home/me/axkit_articles-1.0/www/htdocs). The URI maps to a directory, and the Apache directives "DocumentRoot" (not shown; it's part of the default install), "<Directory>", "Options +Indexes" and "DirectoryIndex" cause Apache to map this URI to the file index.xsp.

Now that the underlying resource has been identified, Apache uses the .xsp extension to figure out which module should deliver the resource to the browser. The AddHandler axkit .xml .xsp directive tells Apache to delegate the response handling to AxKit. This is very similar to establishing a mod_perl handler for a URI, except that it is implemented in C and is a bit faster than a standard mod_perl response handler.
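
For comparison, here's what a bare-bones mod_perl 1.x response handler looks like; this is a sketch (the package name is made up) meant only to show the kind of handler that the AddHandler axkit line stands in for:

package My::Hello;
# A minimal mod_perl 1.x response handler, shown for comparison only.
use strict;
use Apache::Constants qw(OK);

sub handler {
    my $r = shift;                       # the Apache request object
    $r->send_http_header("text/plain");
    $r->print("Hello from mod_perl\n");
    return OK;
}
1;

Such a handler would be wired up with SetHandler perl-script and PerlHandler My::Hello directives; the AddHandler axkit line routes .xml and .xsp requests to AxKit in much the same way.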

The Processing Chain

The test document, index.xsp, is an example of XSP (eXtensible Server Pages), one of the several languages that AxKit supports. We'll get to how the XSP is processed in a moment.

By the time AxKit begins the task of handling the response, it has already, through cooperation with Apache's configuration engine, processed its configuration directives. These have the following effects:

AxDebugLevel 10
Causes quite a lot of output in www/logs/error_log.
AxGzipOutput On
Enables automatic gzip compression (via Compress::Zlib). This is only used if the client can accept compressed documents. AxKit even goes the extra mile and compresses output for a few clients that can handle it but don't set the HTTP Accept-Encoding: header properly.
AxAddStyleMap application/x-xsp Apache::AxKit::Language::XSP
Establishes a mapping between the MIME type "application/x-xsp" and the Apache::AxKit::Language::XSP module. We'll see shortly how this mapping tells AxKit which module to use when applying a given type of transform.
AxAddXSPTaglib AxKit::XSP::Util
Notes that the XSP engine needs to load AxKit::XSP::Util (which supplies some of the Perl code called by index.xsp).

The first thing AxKit needs to do when handling a response is to configure the processing pipeline. The first place AxKit looks for directions is in the source document; it scans the source document for <?xml-stylesheet...?> processing instructions like:

<?xml-stylesheet href="NULL" type="application/x-xsp"?>

AxKit has two alternative mechanisms that provide far more power and flexibility; we'll look at these as we walk through more advanced configurations in later articles.

The xml-stylesheet PIs specify a list of transforms to apply to the source document; these are applied in the order in which they occur in the document. Each processing instruction specifies a stylesheet ("NULL" in this case: XSP doesn't use them, as we'll cover in a moment) and a processor type ("application/x-xsp"). The AxAddStyleMap directives specify which Perl modules handle which processor types, and the one in our example maps application/x-xsp to Apache::AxKit::Language::XSP.

That's all quite complex; here's a diagram that shows how the most important bits of this example affect the processing pipeline:

[Figure: index.xsp configuration data flow]

and the resulting pipeline looks like:

[Figure: index.xsp processing pipeline]

As the diagram shows, the source .xsp page is read from disk, then compiled into Perl source code (using the util: taglib as necessary) and cached on disk. The resulting source code is then run to generate the result document for this request, which is compressed (if the client supports compression) and sent to the client.

You can see the source code in www/logs/error_log when the AxDebugLevel is set to at least 10.

Note that AxKit is smart enough to not cache the output document (there's no cache between the XSP processor and the output); XSP is intended for dynamic pages and its output documents should not be cached.

When index.xsp is compiled, the resulting code builds the output document node by node. The <util:time .../> tag is converted into a subroutine call that calls Perl's localtime function. Check the error_log to see the generated Perl (our AxDebugLevel is set to 10, so XSP.pm emits it to the error log), and see the function get_date() in AxKit::XSP::Util for the localtime() call.
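
The generated code is too long to reproduce here, but conceptually (this is a heavily simplified, hypothetical sketch, not what XSP.pm actually emits) the compiled page behaves like:

# Hypothetical sketch of the compiled index.xsp. The real generated
# code (visible in the error_log at AxDebugLevel 10) builds a result
# DOM tree node by node rather than concatenating strings.
sub xsp_page {
    my $out = "<html><body><p>Hi! It's ";
    $out .= scalar localtime;   # stands in for <util:time .../>
    $out .= ".</p></body></html>";
    return $out;
}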

XSP and Taglibs

XSP is unlike most XML processing "languages" in that it does not actually use stylesheets; instead, XSP pages contain special tags that are executed each time the page is requested. In index.xsp, for instance, the <util:time format="%H:%M:%S"/> tag is converted into Perl code that calls localtime.

Most XML filters apply a transform to the source XML to generate the result XML. These transforms are called "stylesheets". As mentioned, XSP does not use stylesheets. We will cover stylesheet based transforms in future articles.

The util: portion of the tag is a prefix indicating that the util taglib will handle that tag. The util: prefix is an XML namespace prefix and is not hard-coded; the xmlns:util attribute in the root element of index.xsp:

<xsp:page
    xmlns:xsp="http://www.apache.org/1999/XSP/Core"
    xmlns:util="http://apache.org/xsp/util/v1"
>

binds the util: prefix to the taglib. The module that provides the code behind the util: taglib, AxKit::XSP::Util, has the same URI ("http://apache.org/xsp/util/v1") hardcoded in it. When the AxAddXSPTaglib directive is seen in the httpd.conf file, AxKit::XSP::Util registers with the XSP module to handle all tags in that namespace.

An XSP page may include as many taglib namespaces and tags as it needs. CPAN contains a large and growing collection of taglibs for use with AxKit's XSP implementation, and we'll look at two ways of writing taglibs for Apache in the next two articles.

The taglibs approach is superficially similar to many of the templating engines on CPAN; indeed, some of the templating systems have recently been extended to include taglibs. There are several important differences between these and XSP, however.

  • XSP input files must be well-formed XML, which makes it impossible to generate malformed XML. With templating systems, typos in the content markup can easily reach the browser with no warnings.

  • The source document may be transformed before it is handed to the XSP processor. This allows you to build simple taglibs as XSLT transforms deployed upstream of the XSP processor. Because AxKit's XSP translates XSP pages into code and caches the code, these transforms will not be run on each request; they are captured in the cached code. This can also be used to "capture" static transforms.

  • XSP encourages content, logic, and presentation to be separated; XSP pages add logic to content and generate well formed XML that can be (and usually is) massaged by "real" stylesheets to effect different presentations.

  • As with some of the more sophisticated templating systems, XSP is designed to be extensible; adding a taglib is as simple as adding an AxAddXSPTaglib statement to the httpd.conf file and then referring to it in the source document. In the test code, the <util:time> tag is provided by the AxKit::XSP::Util module, and you may load as many taglibs as necessary into the server.

Milepost 1 and the Road Ahead

This is the first article in this series, presenting AxKit's installation and introducing one of AxKit's processing technologies: XSP. In the next article, we'll see how to chain together filters to apply stylesheets to both static documents and XSP-generated documents, allowing the same documents to be delivered in different forms. Following that, we'll examine how to write taglibs both in Perl (the recommended approach, using some helper modules) and in XSLT.

References and Helpful Resources

There are several very helpful places to research problems, ask questions, and learn more. Try to find others who have had similar problems before posting a question, of course, but the user groups listed here are the place to ask:

the AxKit web site (which may have moved to an xml.apache.org site by the time you read this)
The "official" AxKit web site.
The AxKit Guide
An in-depth introduction to AxKit.
The axkit-users@axkit.org mailing list.
Browse the archives or subscribe. This is the place to discuss AxKit-specific problems and offer solutions, patches and success stories. The mod_perl resources listed here are perfect for general mod_perl build and support issues as well.
mod_perl Developer's Guide
The first place to check for Apache+mod_perl build advice and debugging tips.
modperl@perl.apache.org email archives
Look here to see if anyone else has had your problems and (usually) found a solution. This list is about to move to an @perl.org address at the time of this writing so I won't point you to a soon-to-be-stale subscription form.
perl-xml@listserv1.ActiveState.com
A mailing list for general Perl and XML questions, including AxKit support. (Subscribe at http://listserv.activestate.com/mailman/listinfo/perl-xml.)
The #axkit IRC channel at irc.rhizomatic.net
A friendly place where you can often get quick advice right from experienced AxKit users and contributors.

As with all online Open Source communities, please do try to pay forward any help you receive.

Thanks to Martin Oldfield, Kip Hampton and Robin Berjon for their thorough technical reviews, though I'm sure I managed to sneak some bugs by them. AxKit and many of the Perl modules it uses are primarily written by Matt Sergeant with extensive contributions from these good folks and others, so many thanks to all contributors as well.

Copyright 2002, Robert Barrie Slaymaker, Jr. All Rights Reserved.

This Week on Perl 6 (3-9 Mar 2002)

Notes

Both the email subscription and the web archive are temporarily offline. This should be remedied shortly. In the meantime, please send additions, submissions, corrections, kudos, and complaints to bwarnock@capita.com.

Perl 6 is the major redesign and rewrite of the Perl language. Parrot is the virtual machine that Perl 6 (and other languages) will be written for. For more information on the Perl 6 and Parrot development efforts, visit dev.perl.org and parrotcode.org.

Last week was extremely light, with just 70 messages across 34 threads, and 29 authors contributing.

printf

Uri Guttman cross-posted a thread discussing redesigning printf. Since % will now be used for all hash accesses, there's a potential ambiguity between interpolating a hash key, and a format specifier. Several solutions were presented, including requiring $() for interpolation, a new quote operator, and replacing the % with something else. The discussion continues.
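
A sketch of the clash (hypothetical Perl 6 syntax, shown only to illustrate the ambiguity under discussion):

    # With % as the hash sigil, is "%s" here the start of a printf
    # format specifier, or an interpolated access to the hash %stats?
    printf "mean: %stats{mean}\n";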

Parrot 0.0.4

The latest version of Parrot is being wrapped up. The big feature of this release is the foundation of the garbage collector. A formal release announcement will be made, well, when this version is formally released.

Multimethod Dispatch

Michael Lambert asked whether Parrot itself should support multimethod dispatch. Internals head Dan Sugalski affirmed that it would, but only for method and subroutine dispatch. (As with most design decisions, it's a speed thing.)

The Assembler PDD

Simon Cozens released version 1 of the proposed Assembler PDD.

The Parrot Spotlight

Alex Gough is a physics student at Oxford with interests ranging from quantum computing to DNA simulation. He uses Perl to do more in less time and hopes Parrot and Perl 6 will allow cheaper, shorter, one-line solutions to troublesome but otherwise irrelevant problems. He works on Parrot's big-number support and testing framework.

When not at his computer, Alex canoes on foamy rivers, teaches basic lifeguarding, and helps mentally disabled children exercise.


Bryan C. Warnock

These Weeks on Perl 6 (10 Feb - 2 Mar 2002)

Notes

Both the email subscription and the web archive are temporarily offline. This should be remedied shortly. In the meantime, please send additions, submissions, corrections, kudos, and complaints to bwarnock@capita.com.

Perl 6 is the major redesign and rewrite of the Perl language. Parrot is the virtual machine that Perl 6 (and other languages) will be written for. For more information on the Perl 6 and Parrot development efforts, visit dev.perl.org and parrotcode.org.

For the three-week period, there were 423 messages across 128 threads, with 61 authors contributing. About half the threads were patch-related, and most of the remaining threads have little meaning outside the active development circle, so there's little of interest to report on.

Topicalizers

There was a fair amount of discussion, however, on perl6-language about topicalizers in Perl 6. (Topicalizers are the lexically scoped aliases in foreach iterators and the new given block.)

Allison Randal asked:

What would be the cost (performance, design or dwim) of making all the defaulting constructs pay attention to the current topicalizer in preference to $_?

Larry Wall replied:

It's been thought about, but neither accepted nor rejected yet. It's one of those things that depends on future decisions. Certainly Hugo and Dan will vouch for the fact that I was ruminating about similar issues last Wednesday, though in this case I was thinking about how a topic could supply a default to identical parameters of different subroutine or method calls, and not just as the object of the call.

Much of the subsequent thread discussed whether when should refer to $_ or to the topicalizer bound by given.

Garbage Collecting

Dan Sugalski committed his garbage collector framework, including built-in statistical generation. (As inspired by some horrendous performance early on.) The good news is that the performance problems have been cleared up. The bad news is that the garbage collector still doesn't collect garbage.

.NET CLR and Parrot

Simon Cozens submitted a lot of information on .NET. Even for the non-Parroteers, this is a good read.

PDDs

Simon also reminded folks that there are Design Documents to write. He then submitted the Keys and Indices PDD. Brent Dax followed up with the Regular Expression PDD, and Dave Mitchell's Coding Standards PDD was finally committed. Alex Gough contributed a Big Number PDD, while Bryan Warnock fixed some gaping holes in the PDD PDD. There's also a PDD for the assembler and the bytecode format on the way.

Parrot Magic Cookie Assignments

Dan Sugalski clarified how PMC assignments should work. Most of the subsequent discussion was attempting to mesh Dan's answers with typing, both weak and strong.

The Parrot Spotlight

Brent Dax is a sixteen-year-old high school junior. He lives in Southern California with his parents, a brother and sister, and a pet cat.

Brent works on a lot of stuff within Parrot. He has worked on the Configure system, the regular expression engine, the embedding interface, warnings, and formatted printing. He has two modules on the CPAN, both related to Perl 6. When he's not hacking on Parrot, a Perl script, or some other little project, he's probably handling e-mail, reading a book, doing homework, or watching CNN. He's sometimes on PerlMonks, and can usually be found on the developer's IRC channel #parrot.


Bryan C. Warnock

Stopping Spam with SpamAssassin

I receive a lot of spam; an absolutely massive bucketload of spam. I received more than 100 pieces of spam in the first three days of this month. I receive so much spam that Hormel Foods sends trucks to take it away. And I'm convinced that things are getting worse. We're all being bombarded with junk mail more than ever these days.

Well, a couple of days ago, I reached my breaking point, and decided that the simple mail filtering I had in place up until now just wasn't up to the job. It was time to call in an assassin.

SpamAssassin

SpamAssassin is a rule-based spam identification tool. It's written in Perl, and there are several ways of using it: You can call a client program, spamassassin, and have it determine whether a given message is likely to be spam; you can do essentially the same thing but use a client/server approach so that your client isn't loading and parsing the rules each time mail arrives; or, finally, you can use a Perl module interface to filter spam from a Perl program.

SpamAssassin is extremely configurable; you can select which rules you want to use, change the way the rules contribute to a piece of mail's "spam score," and add your own rules. We'll look at some of these features later in the article. First, how do we get SpamAssassin installed and start using it?

If you're using Debian Linux or one of the BSDs, then this couldn't be easier: just install the appropriate package using apt or the ports tree respectively. (The BSD port is called p5-Mail-SpamAssassin)

Those less fortunate will have to download the latest version of SpamAssassin, and install it themselves.

Vipul's Razor

SpamAssassin uses a variety of methods to test whether an e-mail is spam, ranging from simple textual checks on the headers or body and detection of missing or misleading headers, to network-based checks such as relay blackhole lists and an interesting distributed system called Vipul's Razor.

Vipul's Razor takes advantage of the fact that spam is, by its nature, distributed in bulk. Hence, a lot of the spam that you see, I'm also going to see at some point. If there were a big clearing-house where you could report spam and I could see if my incoming mail matches what you've already reported, then I could have a guaranteed way of determining whether a given mail is spam. Vipul's Razor is that clearing-house.

Why is it a Razor? Because it's a collaborative system, its strength is directly derived from the quality of its database, which comes back to the way it's used by the likes of you and me. If end-users report lots of real spam, the Razor gets better; if the database gets "poisoned" by lots of false or misleading reports, then the efficiency of the whole system drops.

Just like any other spam detection mechanism, Razor isn't perfect. There are two points particularly worth noting. First, while it tries to completely avoid false positives (saying something's spam when it isn't) by requiring that spam be reported, it doesn't do anything about false negatives (saying something's not spam when it is) because it only knows about the mail in its database.

Second, spammers, like all other primitive organisms, are constantly evolving. Vipul's Razor only works for spam that is delivered in bulk without modification. Spam that is "personalized" by the addition of random spaces, letters or the name of the recipient, will produce a different signature that won't match similar spam messages in the Razor database.

Nevertheless, the Razor is an excellent addition to the spam fighter's arsenal, since when it marks something as spam, you can be almost positive it's correct. And just like SpamAssassin, it's all pure Perl. Mail::Audit has long supported a Razor plugin, but now we can move to calling Razor as part of a more comprehensive mail-filtering system based on SpamAssassin and Mail::Audit.

Installing Vipul's Razor is similar to installing SpamAssassin. Debian and BSD users have packages called "razor" and "razor-clients," respectively; and the rest of the world can download and install from the home page. SpamAssassin will detect whether Razor is available and, by default, use it if so.

Assassinating Spam With Mail::Audit : The Easy Way

So this is the part you've all been waiting for. How do we use these things to trap spam? For those of you who aren't familiar with Mail::Audit, the idea is simple: just like with procmail, you write recipes that determine what happens to your mail. However, in the case of Mail::Audit, you specify the recipe in Perl. For instance, here's a recipe to move all mail sent to perl5-porters@perl.org to another folder:


    use Mail::Audit;
    my $mail = Mail::Audit->new();
    if ($mail->from =~ /perl5-porters\@perl.org/) {
        $mail->accept("p5p");
    }
    $mail->accept();

For more details on how to construct mail filters with Mail::Audit, see my previous article.

Plugging SpamAssassin into your filters couldn't be simpler. First of all, you absolutely need the latest version of Mail::Audit, version 2.1 from CPAN. Nothing earlier will do! Now write a filter like this:


    use Mail::Audit;
    use Mail::SpamAssassin;
    my $mail = Mail::Audit->new();

    ... the rest of your rules here ...

    my $spamtest = Mail::SpamAssassin->new();
    my $status = $spamtest->check($mail);

    if ($status->is_spam()) {
        $status->rewrite_mail();
        $mail->accept("spam");
    }
    $mail->accept();

As you might be able to guess, the important thing here is the calls to check and is_spam. check produces a "status object" that we can query and use to manipulate the e-mail. is_spam tells us whether the mail has exceeded the number of "spam points" required to flag an e-mail as spam.

The rewrite_mail method adds some headers and rewrites the subject line to include the distinctive string "*****SPAM*****". The additional headers explain why the e-mail was flagged as spam. For instance:


X-Spam-Status: Yes, hits=6.1 required=5.0 
tests=SUBJ_HAS_Q_MARK,REPLY_TO_EMPTY,SUBJ_ENDS_IN_Q_MARK version=2.1

This message had a question mark in the subject, an empty reply-to, and the subject ended in a question mark. The mail wasn't actually spam, but this goes to prove that the technique isn't perfect. Nevertheless, since installing the spam filter, I've only seen about 10 false positives, and zero false negatives. I'm happy enough with this solution.

One important point to remember, however, is where in the course of your filtering you should call SpamAssassin's checks. For instance, you want to do so after your mailing list filtering, because mail sent to mailing lists may have munged headers that might confuse SpamAssassin. However, this means that spam sent to mailing lists might slip through the net. Experiment, and find the best solution for your own e-mail patterns.
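
Putting the pieces together, here is a sketch of that ordering (the folder names are just examples):

    use Mail::Audit;
    use Mail::SpamAssassin;

    my $mail = Mail::Audit->new();

    # Mailing lists first: their munged headers can confuse SpamAssassin,
    # and accept() delivers and stops, so list mail skips the spam check.
    $mail->accept("p5p") if $mail->from =~ /perl5-porters\@perl\.org/;

    # Everything that survives the list filters gets the spam check.
    my $status = Mail::SpamAssassin->new()->check($mail);
    if ($status->is_spam()) {
        $status->rewrite_mail();
        $mail->accept("spam");
    }
    $mail->accept();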

Assassinating Spam Without Mail::Audit

Of course, there are times when it might not be suitable to use Mail::Audit or you may not want to. Since SpamAssassin is provided as a command line tool as well as a set of Perl modules, it's easy enough to integrate it in whatever mail filtering solution you use.

For instance, here's a procmail recipe that calls out to spamassassin to filter out spam:


:0fw
| spamassassin -P

:0:
* ^X-Spam-Status: Yes
spambox

For the speed-conscious, you can run the spamd daemon and replace calls to spamassassin with spamc; be aware that spamd is a TCP/IP daemon that you may want to firewall from the rest of the world.

Another approach is to call spamassassin from your mail transport agent, meaning that spam is filtered out before it is even delivered to you. There's a Sendmail milter library available that allows you to use SpamAssassin, and similar tricks are available for Exim and other MTAs.

Assassinating Spam With Mail::Audit : More Complex Operations

The Mail::SpamAssassin module has many other methods you can use to manipulate e-mail. For instance, if you've identified something as definitely being spam, then you can use


    $spamtest->report_as_spam($mail);

to report it to Vipul's Razor. (Take note of this: As we've mentioned above, the efficiency of the Razor database comes from the fact that e-mails in it are confirmed as spam by a human. Adding false positives to the database would degrade its usefulness for everyone. Only submit mail that you've confirmed personally.)

If you're finding that mail checking is taking too long because SpamAssassin is having to contact the various network-based blacklists and databases, then you can instruct it to only perform "local" checking:


    $spamtest = Mail::SpamAssassin->new({local_tests_only => 1});

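In the same vein, you can point the checker at a custom per-user preferences file; the sketch below assumes the userprefs_filename constructor option, and the path and rule tweaks shown are only examples (SUBJ_HAS_Q_MARK is the rule from the header shown earlier):

    use Mail::SpamAssassin;

    # Point the checker at an alternative per-user prefs file.
    my $spamtest = Mail::SpamAssassin->new({
        userprefs_filename => "/home/me/.spamassassin/user_prefs",
    });

    # The prefs file itself can contain rule tweaks, for example:
    #   required_hits 7           # demand a higher score before flagging
    #   score SUBJ_HAS_Q_MARK 0   # disable a rule that misfires for you
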
There is a wealth of other options available. See the Mail::SpamAssassin documentation for more details, and happy assassinating!
