
Filters in Apache 2.0

by Geoffrey Young
April 17, 2003

Not too long ago, despite a relative dearth of free tuits, I decided that I had put off my investigation of mod_perl 2.0 for too long - it was time to really start kicking the tires and tinkering with all the new stuff I had been hearing about. What I found was that the new mod_perl API is full of interesting features, yet discovering and using them was tedious, frustrating, enlightening, and fun all at the same time. Hopefully, I can help ease some of the growing pains you are likely to encounter by experiencing the pain myself first, then sharing some of the lessons through a series of articles. Consider this and future articles to be our voyage together into the rocky but exciting new mod_perl frontier.

One of the more interesting and practical features to come out of the Apache 2.0 redesign effort is output filters. While in Apache 2.0 there are all kinds of filters, including input and connection filters, it's output filters that are most interesting to me - mostly because 2.0 discussions make a point of saying that it's impossible (well, really, really hard) to filter output content in Apache 1.3, despite the fact that mod_perl users have been able to filter content (to some degree) for years. Thus, when I began to play around with mod_perl 2.0 it seemed only logical that my first task would be to port the instructional yet useful Apache::Clean, a content filter for mod_perl 1.0, over to the new architecture.

What we will be examining here is a preliminary implementation of Apache::Clean using the mod_perl 2.0 API. Because mod_perl 2.0 is still being tweaked daily, if you want to follow along on your own box, you will need the current version of mod_perl from CVS, or a recent snapshot - the versions shipped with Linux distributions like Red Hat, or even the latest version on CPAN (1.99_08), are far too out of date for what we will be doing. The most current version of Apache 2.0, as well as Perl 5.8.0, will also be helpful. Keep in mind that many of the more interesting features in mod_perl 2.0 are not entirely stable yet, so do not be surprised if things work just a bit differently six months from now.


What Are Output Filters Anyway?

Go ahead, admit it. At some point, you wrote a CGI script that generated HTML with embedded Server Side Include tags. The impetus behind the idea was a simple one: You had hopes that the embedded SSI tags would save you from the extra work of, say, adding a canned footer to the bottom of your otherwise dynamic page. Sounds reasonable, right? Seeing those SSI tags left unprocessed in the resulting page must have been shocking.

As it turns out, whether you knew it or not, in Apache-speak you were trying to filter your content, or pass the output of one process (the CGI script) into another (Apache's SSI engine) for subsequent processing. Content filtering is a simple idea, and one that feels natural to us as programmers. After all, Apache is supposed to be modular, and piping modular components together - cat yachts.txt | wc -l - is something we do on the Unix command line all the time. Wanting the same functionality in our Web server of choice seems not only logical, but almost required in the interests of efficient application programming.

While the idea is certainly sound, the above experiment exposes a limitation of the Apache 1.3 server itself, namely that by design you cannot have more than one content handler for a given request - you can use either mod_cgi to process a CGI script, or mod_include to parse an SSI document, but not both.

Apache 2.0 introduced output filters, which provide an official way to intercept and manipulate data on its way from the content handler to the browser. In the case of our SSI example, mod_include has been implemented as an output filter in Apache 2.0, giving it the ability to post-process either static files (served by the default Apache content handler) or dynamically generated content (such as that produced by mod_cgi, mod_perl, or mod_php). True to its goal of exposing the entire Apache API to Perl, mod_perl allows you to plug into the Apache filter API and create your own output filters in Perl, which is what we will be doing with Apache::Clean.
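To make the idea of a filter chain concrete, here is a minimal sketch in plain Perl that runs outside Apache. Everything in it is a hypothetical stand-in: content_handler() plays the part of a CGI script, ssi_filter() imitates mod_include using a hard-coded table, and clean_filter() imitates a cleaning filter. Real Apache filters operate on streamed chunks of data rather than on one complete string, so treat this only as an illustration of the order in which the data flows.

```perl
use strict;
use warnings;

# Hypothetical table of include targets, standing in for files on disk.
my %includes = (footer => '<p>Copyright 2003</p>');

# Stand-in for mod_include: expand <!--#include virtual="NAME" --> tags.
sub ssi_filter {
    my ($content) = @_;
    $content =~ s{<!--#include virtual="(\w+)" -->}
                 {exists $includes{$1} ? $includes{$1} : ''}ge;
    return $content;
}

# Stand-in for a cleaning filter: collapse runs of whitespace.
sub clean_filter {
    my ($content) = @_;
    $content =~ s/\s+/ /g;
    return $content;
}

# Stand-in for the content handler (for example, a CGI script).
sub content_handler {
    return qq{<h1>Regatta   results</h1>\n<!--#include virtual="footer" -->\n};
}

# Pass the handler's output through each filter in configuration order.
sub run_chain {
    my ($handler, @filters) = @_;
    my $content = $handler->();
    $content = $_->($content) for @filters;
    return $content;
}

print run_chain(\&content_handler, \&ssi_filter, \&clean_filter), "\n";
```

The content handler never needs to know which filters follow it, which is the property that Apache 1.3 could not offer.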

HTML::Clean and Apache::Clean

Let's take a moment to look at HTML::Clean before delving into Apache::Clean, which is basically just a mod_perl wrapper that takes HTML::Clean and turns it into an output filter. HTML::Clean is a nifty little module that reduces the size of an HTML page using a number of different but simple techniques, such as removing unnecessary white space, replacing longer HTML tags with shorter equivalents, and so on. The end result is a page that, while still valid HTML and easily rendered by a browser, is relatively compact. If reducing bandwidth is important in your environment, then using HTML::Clean to tidy up static pages offline is a quick and easy way to save some bytes.

Here is a simple example of HTML::Clean in action.

  
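use strict;
use HTML::Clean ();

# Sample markup with a long tag and escaped quotes.
my $dirty = q!<strong>&quot;helm's alee&quot;</strong>!;

# The constructor takes a filename or a reference to a string of HTML.
my $h = HTML::Clean->new(\$dirty);

# Shorten tags and reduce entities.
$h->strip({ shortertags => 1, entities => 1 });

# data() returns a reference to the cleaned HTML.
print ${$h->data};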
  

As you can see, the interface for HTML::Clean is object-oriented and fairly straightforward. Things begin by calling the new() constructor to create an HTML::Clean object. new() accepts either a filename to clean or a reference to a string containing some HTML. You choose which aspects of the HTML to tidy in one of two ways: either by using the level() method to set an optimization level, or by passing the strip() method any number of options from a rich set. In either case, strip() is used to actually clean the HTML. After that, calling the data() method returns a reference to a string containing the HTML, polished to a Perly white. In our sample code, the original HTML has been changed to

  
<b>"helm's alee"</b>
  

which is half the size of our original string yet displayed the same way by browsers.
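If you do not have HTML::Clean installed, the effect on this particular string can be approximated with two substitutions. The sketch below is only an illustration: tiny_clean() is a hypothetical helper, and it is not how HTML::Clean is implemented. It does let you check the size claim for yourself.

```perl
use strict;
use warnings;

# Rough approximation of two HTML::Clean options using plain regular
# expressions. This is NOT HTML::Clean's implementation.
sub tiny_clean {
    my ($html) = @_;
    $html =~ s{<(/?)strong>}{<${1}b>}gi;   # like shortertags: <strong> becomes <b>
    $html =~ s{&quot;}{"}g;                # like entities: &quot; becomes "
    return $html;
}

my $dirty = q!<strong>&quot;helm's alee&quot;</strong>!;
my $clean = tiny_clean($dirty);

# Show the cleaned markup and the byte counts before and after.
printf "%s\n%d bytes before, %d bytes after\n",
    $clean, length($dirty), length($clean);
```

For the sample string this reports 40 bytes before and 20 bytes after.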

Depending on the size of your site, using HTML::Clean can lead to a significant reduction in the number of bytes sent over the wire - for instance, the front page of the current mod_perl project homepage becomes 70% of its original size when scrubbed with $h->level(9). However, while spending the time to tidy static HTML might make sense, the number of static pages on any given site seems to be diminishing daily. What about dynamically generated HTML?

One way to handle dynamic HTML would be to add HTML::Clean routines to each dynamic component of your application, a process that really is neither scalable nor maintainable. A better solution would be to have Apache inject HTML::Clean processing directly into the server response wherever we wanted it, to create a pluggable module that we could configure to post-process requests to any given URI. Enter Apache::Clean.

Apache::Clean provides a basic interface into HTML::Clean but it works as an output filter. As briefly mentioned, Apache::Clean already exists for mod_perl 1.0, but over in Apache 1.3 land it was limited in that it could only post-process responses generated by mod_perl, and that only after sufficient magic. We are not going to get into how that all worked in mod_perl 1.0 - for a detailed explanation see Recipe 15.4 in the mod_perl Developer's Cookbook or the original Apache::Clean manpage. With Apache 2.0 and the advent of output filters, we can now code Apache::Clean as a genuine part of Apache's request processing, allowing us to clean responses on their way to the browser entirely independent of who generates the content.

New Directives

Here is a look at a possible configuration for Apache 2.0, one that takes output of a CGI script, post-processes it for SSI tags, then cleans it with our Apache::Clean output filter.

  
Alias /cgi-bin /usr/local/apache2/cgi-bin
<Location /cgi-bin>
  SetHandler cgi-script

  SetOutputFilter INCLUDES
  PerlOutputFilterHandler Apache::Clean

  PerlSetVar CleanOption shortertags
  PerlAddVar CleanOption whitespace

  Options +ExecCGI +Includes
</Location>
  

As with Apache 1.3, mod_cgi is still enabled the same way - in our case via the SetHandler cgi-script directive, although this is not the only way and the familiar ScriptAlias directive is still supported. What is different in this httpd.conf snippet is the configuration of the SSI engine, mod_include. As already mentioned, mod_include was implemented as an output filter in Apache 2.0, and output filters bring with them a new directive. The SetOutputFilter directive activates the SSI engine - the INCLUDES filter - within our <Location> container. This means that responses to requests for /cgi-bin, no matter who handles the actual generation of content, will be parsed by mod_include. See the mod_include documentation for other possible SSI configurations and options.

With the generic Apache bits out of the way, we can move on to the mod_perl part, which isn't all that complex. While the PerlSetVar and PerlAddVar directives are exactly the same as they were in mod_perl 1.0, mod_perl 2.0 introduces a new directive - PerlOutputFilterHandler - which specifies the Perl output filter for the request. In our sample httpd.conf, the Apache::Clean output filter will be added after mod_include, which inserts SSI processing after mod_cgi. The really cool part about filters is that everything happens without any tricks or magic - getting all these independent modules to work in harmony in creating the server response is all perfectly normal, which is a huge improvement over Apache 1.3.
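One plausible way for a filter to consume the CleanOption values is to turn the list of names into the hash reference that strip() accepts. The sketch below is an assumption about how that could look, not the actual Apache::Clean source. The helper name strip_options() is hypothetical, and the list is hard-coded so that the code runs outside Apache. Under mod_perl the names would typically be fetched from the request object, as the comment suggests.

```perl
use strict;
use warnings;

# Under mod_perl the list would likely come from the request object:
#   my @names = $r->dir_config->get('CleanOption');
# It is hard-coded here so the sketch runs outside Apache.
my @names = qw(shortertags whitespace);

# Turn a list of option names into the hash reference that
# HTML::Clean's strip() method accepts: { shortertags => 1, ... }.
sub strip_options {
    my (@option_names) = @_;
    my %options = map { $_ => 1 } @option_names;
    return \%options;
}

my $options = strip_options(@names);
print "$_ => $options->{$_}\n" for sort keys %$options;
```

Because PerlSetVar sets the first value and each PerlAddVar appends another, adding an option to the filter is a one-line configuration change.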

In the interests of safety, one thing that you should note about our sample configuration is that it does not include the entities option. Because we're cleaning dynamic content, reducing entity tags (such as changing &quot; to ") would inadvertently remove any protection against Cross Site Scripting introduced by the generating script. For more information about Cross Site Scripting and how to protect against it, a good overview is provided in this perl.com article.
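A small sketch shows why this matters. Both helpers are hypothetical: escape_html() stands in for whatever escaping the generating script performs, and reduce_entities() is a naive substitution that imitates only the &quot; case described above. Whether HTML::Clean itself rewrites entities inside attribute values is not verified here, so read this as an illustration of the risk rather than a description of the module's behavior.

```perl
use strict;
use warnings;

# Hypothetical escaping helper used by the generating script.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Naive stand-in for entity reduction: handles only &quot;.
sub reduce_entities {
    my ($html) = @_;
    $html =~ s/&quot;/"/g;
    return $html;
}

# Hostile input that tries to break out of an attribute value.
my $user_input = q{" onmouseover="alert(1)};

my $safe   = sprintf '<input value="%s">', escape_html($user_input);
my $unsafe = reduce_entities($safe);

print "escaped: $safe\n";
print "reduced: $unsafe\n";
```

After reduction the attribute value is closed early and the injected onmouseover handler becomes live markup, which is exactly the protection the generating script had put in place.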
