February 2002 Archives

Why mod_perl?

In this article, I'll give an initial introduction to mod_perl, try to whet your appetite, and present a few examples of well-known sites that are powered by a mod_perl-enabled Apache.

What Is mod_perl?

mod_perl is at the heart of the Apache/Perl integration project, which brings together the full power of the Perl programming language and the Apache Web server.

From the outset, Apache was designed so that you can extend it by the addition of ``modules.'' Modules can do anything you need to do, such as rewrite HTTP requests, restrict access to certain pages and perform database lookups. Modules are normally written in C, which can be hard work. mod_perl is a module that allows you to do all of these things, and more, by using Perl -- making the development much quicker than C. Apache is the most popular Web server on the Internet and mod_perl is one of the most popular modules for extending it.

Why Is mod_perl So Popular?

If you love Perl and your favorite Web server is Apache, then you will love mod_perl at first sight. Once you try it in action, you will never look back -- you will find mod_perl has everything you need. But even if you do find that there is something missing, then just speak up. Before you can count to three, someone will have made it for you. Which, of course, will make you want to give something in return. Eventually you will contribute something on your own, and that will save time for a huge mod_perl community so that they can create even more things for others to use.

You get the picture -- mod_perl empowers its users, who in turn empower mod_perl, which in turn empowers its users, who in turn ... . It's as simple as the nuclear chain reaction you learned about at school, or will learn at some point :)

With mod_perl it is possible to write Apache modules entirely in Perl. This allows you to easily do things that are more difficult or impossible in regular CGI programs, such as running sub-requests or writing your own authentication and logging handlers.

The primary advantages of mod_perl are power and speed. You have full access to the inner workings of the Web server and you can intervene at any stage of HTTP request processing. This allows for customized processing of the various phases; for example, URI to filename translation, authorization, response generation and logging.
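To make the handler idea concrete, here is a minimal sketch of a mod_perl 1.x response-phase handler; the module name MyApache::Hello and the /hello location are made up for illustration:

```perl
# MyApache/Hello.pm -- a minimal mod_perl 1.x response handler (sketch)
package MyApache::Hello;
use strict;
use Apache::Constants qw(OK);

sub handler {
    my $r = shift;                     # the Apache request object
    $r->content_type('text/plain');
    $r->send_http_header;
    $r->print("Hello from mod_perl!\n");
    return OK;                         # tell Apache this phase succeeded
}
1;
```

It would be wired into the server with a couple of configuration lines:

```apache
<Location /hello>
    SetHandler  perl-script
    PerlHandler MyApache::Hello
</Location>
```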

There are big savings in startup and compilation times. Having the Perl interpreter embedded in the server saves the very considerable overhead of starting an external interpreter for any HTTP request that needs to run Perl code. At least as important is code caching: the modules and scripts are loaded and compiled only once, when the server is first started. Then for the rest of the server's life the scripts are served from the cache, so the server only has to run the pre-compiled code. In many cases, this is as fast as running compiled C programs.

There is little run-time overhead. In particular, under mod_perl, there is no need to start a separate process per request, as is often done with other Web-server extensions. The most widespread such extension mechanism, the Common Gateway Interface (CGI), is replaced entirely with Perl code that handles the response generation phase of request processing. Bundled with mod_perl are two general-purpose modules for this: Apache::Registry, which can transparently run existing unmodified Perl CGI scripts, and Apache::PerlRun, which does a similar job but allows you to run scripts that are to some extent ``dirtier.''
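A typical httpd.conf fragment for running existing CGI scripts under Apache::Registry might look like the following sketch (the directory paths are assumptions):

```apache
# Serve the scripts in /home/httpd/perl through Apache::Registry
Alias /perl/ /home/httpd/perl/
<Location /perl>
    SetHandler     perl-script
    PerlHandler    Apache::Registry
    Options        +ExecCGI
    PerlSendHeader On
</Location>
```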

mod_perl allows you to configure your Apache server and handlers in Perl (using the PerlSetVar directive and the <Perl> sections). This makes the administration of servers with many virtual hosts and complex configuration a piece of cake. Hey, you can even define your own configuration directives!
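For example, a <Perl> section can generate repetitive configuration -- such as a batch of virtual hosts -- from a loop. This is only a sketch; the host names and paths are made up:

```apache
<Perl>
    # Build one virtual host per customer instead of
    # writing out each <VirtualHost> block by hand.
    for my $name (qw(alpha beta gamma)) {
        $VirtualHost{"$name.example.com"} = {
            ServerName   => "$name.example.com",
            DocumentRoot => "/www/$name/htdocs",
        };
    }
</Perl>
```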

How Fast and Stable Is mod_perl?

Many people ask, ``How much of a performance improvement does mod_perl give?'' Well, it all depends on what you are doing with mod_perl -- and possibly whom you ask. Developers report speed boosts from 200 percent to 2,000 percent. The best way to measure is to try it and see for yourself! (see http://perl.apache.org/tidbits.html and http://perl.apache.org/stories/ for the facts).

Every second of every day, thousands of Web sites all over the world are using mod_perl to serve hundreds of thousands of Web pages. Apache and mod_perl are some of the best-tested programs ever written. Of course, they are continually being developed and improved, but you do not have to work on the ``bleeding edge'' of development -- you can use the stable products for your sites and let others do the testing of the new versions for you.

I want to show you just a few of the many busy and popular sites that are driven by mod_perl. A thousand words can't substitute for first-hand experience. Visit the sites and feel the difference. They will persuade you that mod_perl rules!

  • ValueClick -- http://www.valueclick.com/ serves more than 70 million requests per day from about 20 machines. Each response is dynamic, with all sorts of calculation, storing, logging, counting -- you name it. All of their ``application'' programming is done in Perl.
  • Singles Heaven -- http://singlesheaven.com is a Match Maker site with 35,000+ members and growing. The site is driven by mod_perl, DBI, Apache::DBI (which provides persistent database connections) and MySQL. The speed is enormous; chatting through mod_perl is a pleasure. Every page is generated by about 10 SQL queries, since it performs many dynamic checks on every page -- checking for new e-mails, for users watched by various watchdogs and more. You don't feel these queries actually happening; the site responds as fast as a ``Hello World'' script.
  • Internet Movie Database (Ltd) -- http://www.moviedatabase.com/ - serves about 2 million page views per day. All database lookups are handled inside Apache via mod_perl. Each request also goes through several mod_perl handlers and the output is then reformatted on the fly with mod_perl SSI to embed advertising banners and give different views of the site depending on the hostname used.
  • CMPnet -- http://www.cmpnet.com, a technology information network, serves about 600K page views per day.
  • CitySearch.com -- http://www.citysearch.com/ provides online city guides for more than 100 cities worldwide, helping people find and plan what they want to do and then take action, with local transactions such as buying event tickets and making hotel and restaurant reservations online. Its traffic exceeds 100 million page views a month.

How Many Sites Are Running a mod_perl Enabled Apache Web Server?

According to Netcraft ( http://netcraft.com ), as of August 2001 about 18 million hosts were running the free Apache Web server -- about 60 percent of all hosts checked in the survey!

Here is the graph of "Server Share in Internet Web Sites."

What about mod_perl? http://perl.apache.org/netcraft/ reports that sites running mod_perl account for 2,823,060 host names and 283,180 unique IP addresses. This is actually an underestimate, since when hosts are scanned for running Web servers only well-known ports are checked (80, 81, 8080 and a few others). If a server runs on an unusual port, then it does not enter the count unless the owner has manually added it to the Netcraft database. Here is a graph of the growth in mod_perl usage:

mod_perl growth graph

For the latest numbers see http://perl.apache.org/netcraft/ .

The Road Ahead

You are probably excited about the release of Apache 2.0, the next generation of the best Web server. The major new features of this generation are threaded processes, which should make the server more scalable, and, of course, the much-welcome filtering layer.

You are probably no less excited about the recent release of Perl 5.6, whose main new feature is (almost) stable support for threads -- something that existed in the previous Perl version but was quite shaky.

What has all this to do with mod_perl, you ask? mod_perl 2.0 is being developed at this very moment and will benefit enormously from the new Apache and Perl features. The most important improvement will be a reduced process size -- the parsed Perl opcode tree will be almost completely shared between threads of the same process.

Do you believe in coincidences? Both Perl 5.6 and Apache 2.0 were released in the same week in March 2000. Looks very suspicious to me. If you get the obvious conspiracy uncovered, then please let me know.

Of course, there are lots of bumps ahead of us. It will take time before all our applications can benefit from the threading features. The main reason lies in the fact that most of the Perl modules available from CPAN aren't thread-safe. But you shouldn't despair: you can turn off threads for Perl code that is not thread-safe or that uses modules that aren't.

I Want mod_perl Now, Where Do I Get It?

mod_perl's home is http://perl.apache.org. From the site you will be able to download the latest mod_perl software and various documentation; find commercial products and free third-party modules; read the success stories; and learn more about mod_perl.

It's quite important to get yourself subscribed to the mod_perl list. If you want to know what is happening with mod_perl, what new features are being developed, if you want to influence and contribute, or if you simply want to get help, then you don't want to skip this mailing list. To subscribe to the list, send an empty e-mail to modperl-subscribe@apache.org .

Are There Any mod_perl Books and Documentation?

Lincoln Stein and Doug MacEachern wrote Writing Apache Modules with Perl and C. It can be purchased online from O'Reilly and Amazon.com.

You will find a vast list of mod_perl documentation on the mod_perl home page: http://perl.apache.org/ .

I Love mod_perl, I Want to Know Who Wrote This Great Free Product!

Well, Doug MacEachern is the person to blame :). He is the guy who gave mod_perl to the mod_perl community. He is the Linus of the mod_perl project.

But as you know in a big community, there are always people who love to help, and there is a bunch of developers from all around the world who patch mod_perl, develop entire Perl modules for it, debug the server and advocate it. I'm afraid the list of contributing developers is too long to include here. But you are welcome to join the mod_perl mailing list and see all these folks in action. I promise you, you won't regret the show, since you are going to learn much more than just about mod_perl. See for yourself.

Getting Involved

If you are using mod_perl or planning to use it, then it's a good idea to subscribe to the users mod_perl mailing list. You should send an empty e-mail to modperl-subscribe@apache.org in order to do that.

If you are interested in helping out with development of mod_perl 2.0, then you are welcome to join us. There are many features that still need implementing, and a lot of testing has to be done. So there is plenty of work whether you are a knowledgeable developer or just a newbie. And the more help we get, the sooner we can bring mod_perl 2.0 into production shape. You can subscribe to the developers mailing list by sending an e-mail to dev-subscribe@perl.apache.org.

If you are familiar with mod_perl, then you probably know about the big fat mod_perl guide that I maintain with the help of many people (http://perl.apache.org/guide/). However, quite a few things have changed in mod_perl 2.0, so a new documentation project has been started. You are welcome to check for updates on the http://perl.apache.org/ site, and you can stay up to date through the documentation mailing list by sending e-mail to docs-dev-subscribe@perl.apache.org .


Preventing Cross-site Scripting Attacks

Introduction

The cross-site scripting attack is one of the most common, yet overlooked, security problems facing web developers today. A web site is vulnerable if it displays user-submitted content without checking for malicious script tags.

Luckily, Perl and mod_perl provide us with easy solutions to this problem. We highlight these built-in solutions and also introduce a new mod_perl module: Apache::TaintRequest. This module helps you secure mod_perl applications by applying Perl's powerful "tainting" rules to HTML output.

What is "Cross-Site Scripting"?

Lately the news has been full of reports on web site security lapses. Some recent headlines include the following grim items: Security problems open Microsoft's Wallet, Schwab financial site vulnerable to attack, or New hack poses threat to popular Web services. In all these cases the root problem was caused by a Cross-Site Scripting attack. Instead of targeting holes in your server's operating system or web server software, the attack works directly against the users of your site. It does this by tricking a user into submitting web scripting code (JavaScript, Jscript, etc.) to a dynamic form on the targeted web site. If the web site does not check for this scripting code it may pass it verbatim back to the user's browser where it can cause all kinds of damage.

Consider the following URL:

http://www.example.com/search.pl?text=<script>alert(document.cookie)</script>

If an attacker can get us to select a link like this, and the Web application does not validate input, then our browser will pop up an alert showing our current set of cookies. This particular example is harmless; an attacker could do much more damage, including stealing passwords, resetting your home page, or redirecting you to another Web site.

Even worse, you might not even need to select the link for this to happen. If the attacker can make your application display a chunk of HTML, you're in trouble. Both the IMG and IFRAME tags allow a new URL to load when the HTML is displayed. For example, the following HTML chunk is sent by the BadTrans worm. This worm uses the load-on-view feature provided by the IFRAME tag to infect systems running Outlook and Outlook Express.


  --====_ABC1234567890DEF_====
  Content-Type: multipart/alternative;
           boundary="====_ABC0987654321DEF_===="

  --====_ABC0987654321DEF_====
  Content-Type: text/html;
           charset="iso-8859-1"
  Content-Transfer-Encoding: quoted-printable


  <HTML><HEAD></HEAD><BODY bgColor=3D#ffffff>
  <iframe src=3Dcid:EA4DMGBP9p height=3D0 width=3D0>
  </iframe></BODY></HTML>
  --====_ABC0987654321DEF_====--

  --====_ABC1234567890DEF_====
  Content-Type: audio/x-wav;
           name="filename.ext.ext"
  Content-Transfer-Encoding: base64
  Content-ID: <EA4DMGBP9p>

This particular example results in executable code running on the target computer. The attacker could just as easily insert HTML using the URL format described earlier, like this:

<iframe src="http://www.example.com/search.pl?text=<script>alert(document.cookie)</script>">

The "cross-site" part of "cross-site scripting" comes into play when dealing with the web browser's internal restrictions on cookies. The JavaScript interpreter built into modern web browsers only allows the originating site to access it's own private cookies. By taking advantage of poorly coded scripts the attacker can bypass this restriction.

Any poorly coded script, written in Perl or otherwise, is a potential target. The key to solving cross-site scripting attacks is to never, ever trust data that comes from the web browser. Any input data should be considered guilty unless proven innocent.

Solutions

There are a number of ways of solving this problem for Perl and mod_perl systems. All are quite simple, and should be used everywhere there might be the potential for user submitted data to appear on the resulting web page.

Consider the following script search.pl. It is a simple CGI script that takes a given parameter named 'text' and prints it on the screen.


        #!/usr/bin/perl
        use CGI;

        my $cgi = CGI->new();
        my $text = $cgi->param('text');

        print $cgi->header();
        print "You entered $text";

This script is vulnerable to cross-site scripting attacks because it blindly prints out submitted form data. To rid ourselves of this vulnerability, we can either perform input validation or ensure that user-submitted data is always HTML escaped before displaying it.

We can add input validation to our script by inserting the following line of code before any output. This code eliminates everything but letters, numbers, and spaces from the submitted input.


        $text =~ s/[^A-Za-z0-9 ]//g;

This type of input validation can be quite a chore. Another solution involves escaping any HTML in the submitted data. We can do this by using the HTML::Entities module bundled in the libwww-perl CPAN distribution. The HTML::Entities module provides the function HTML::Entities::encode(). It encodes HTML characters as HTML entity references. For example, the character < is converted to &lt;, " is converted to &quot;, and so on. Here is a version of search.pl that uses this new feature.


        #!/usr/bin/perl
        use CGI;
        use HTML::Entities;

        my $cgi = CGI->new();
        my $text = $cgi->param('text');

        print $cgi->header();
        print "You entered ", HTML::Entities::encode($text);

Solutions for mod_perl

All of the previous solutions apply to the mod_perl programmer too. An Apache::Registry script or mod_perl handler can use the same techniques to eliminate cross-site scripting holes. For higher performance you may want to consider switching calls from HTML::Entities::encode() to mod_perl's much faster Apache::Util::escape_html() function. Here's an example of an Apache::Registry script equivalent to the preceding search.pl script.


        use Apache::Util qw(escape_html);
        use Apache::Request;

        my $apr = Apache::Request->new(Apache->request);

        my $text = $apr->param('text');

        $apr->content_type("text/html");
        $apr->send_http_header;
        $apr->print("You entered ", escape_html($text));

After a while you may find that typing Apache::Util::html_encode() over and over becomes quite tedious, especially if you use input validation in some places, but not others. To simplify this situation consider using the Apache::TaintRequest module. This module is available from CPAN or from the mod_perl Developer's Cookbook web site.

Apache::TaintRequest automates the tedious process of HTML escaping data. It overrides the print mechanism in the mod_perl Apache module. The new print method tests each chunk of text for taintedness. If it is tainted, the module assumes the worst and HTML-escapes it before printing.

Perl contains a set of built-in security checks known as taint mode. These checks protect you by ensuring that tainted data that comes from somewhere outside your program is not used directly or indirectly to alter files, processes, or directories. Apache::TaintRequest extends this list of dangerous operations to include printing HTML to a web client. To untaint your data, extract the safe portion with a regular expression capture. Tainting is the Perl web developer's most powerful defense against security problems. Consult the perlsec man page and use it for every web application you write.
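The untainting rule is worth spelling out: a substitution does not clear the taint flag; only text captured by a regex match is considered clean. Here is a minimal sketch (the function name is made up):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: under taint mode, only data extracted through a regex
# capture is untainted; s/// leaves the taint flag in place.
sub untaint_text {
    my $input = shift;
    return undef
        unless defined $input && $input =~ /^([A-Za-z0-9 ]+)$/;
    return $1;    # the captured portion is untainted
}

print untaint_text("hello world"), "\n";
```

Run a script like this under perl -T to have Perl enforce the checks described in perlsec.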

To activate Apache::TaintRequest simply add the following directive to your httpd.conf.

       PerlTaintCheck on    

This activates taint mode for the entire mod_perl server.

The next thing we need to do is modify our script or handler to use Apache::TaintRequest instead of Apache::Request. The preceding script might look like this:


        use Apache::TaintRequest;

        my $apr = Apache::TaintRequest->new(Apache->request);

        my $text = $apr->param('text');

        $apr->content_type("text/html");
        $apr->send_http_header;

        $apr->print("You entered ", $text);

        if ($text =~ /^([A-Za-z0-9 ]+)$/) {
            $text = $1;     # the regex capture untaints the data
        }
        $apr->print("You entered ", $text);

This script starts by storing the tainted form data 'text' in $text. If we print this data, we will find that it is automatically HTML escaped. Next we do some input validation, untainting the data; the following print statement then does not result in any HTML escaping.

Tainting + Apache::Request = Apache::TaintRequest

The implementation code for Apache::TaintRequest is quite simple. It's a subclass of the Apache::Request module, which provides the form field and output handling. We override the print method, because that is where we HTML escape the data. We also override the new method -- this is where we use Perl's TIEHANDLE interface to ensure that output to STDOUT is processed by our print() routine.

Once we have output data, we need to determine whether it is tainted. This is where the Taint module (also available from CPAN) becomes useful. We use it in the print method to determine whether a printable string is tainted and needs to be HTML escaped. If it is tainted, we use the mod_perl function Apache::Util::escape_html() to escape the HTML.


package Apache::TaintRequest;

use strict;
use warnings;

use Apache;
use Apache::Util qw(escape_html);
use Taint qw(tainted);

$Apache::TaintRequest::VERSION = '0.10';
@Apache::TaintRequest::ISA = qw(Apache);

sub new {
  my ($class, $r) = @_;

  $r ||= Apache->request;

  tie *STDOUT, $class, $r;

  return tied *STDOUT;
}


sub print {
  my ($self, @data) = @_;

  foreach my $value (@data) {
    # Dereference scalar references.
    $value = $$value if ref $value eq 'SCALAR';

    # Escape any HTML content if the data is tainted.
    $value = escape_html($value) if tainted($value);
  }

  $self->SUPER::print(@data);
}

To finish off this module we just need the TIEHANDLE interface we specified in our new() method. The following code implements a TIEHANDLE and PRINT method.


sub TIEHANDLE {
  my ($class, $r) = @_;

  return bless { r => $r }, $class;
}

sub PRINT {
  shift->print(@_);
}

The end result is that tainted data is escaped, and untainted data is passed unaltered to the web client.

Conclusions

Cross-site scripting is a serious problem. The solutions -- input validation and HTML escaping -- are simple, but they must be applied every single time. An application with a single overlooked form field is just as insecure as one that does no checking whatsoever.

Apache::TaintRequest was developed to ensure that we always check our data. It builds upon Perl's powerful data-tainting feature by automatically HTML escaping any data that has not been validated before it is printed.


This Fortnight on Perl 6 (27 Jan - 9 Feb 2002)

Notes

You can subscribe to an email version of this summary by sending an empty message to perl6-digest-subscribe@netthink.co.uk.

This summary, as with past summaries, can be found here. (Note that this is an @Home address, and will change sometime this month.) Please send additions, submissions, corrections, kudos, and complaints to bwarnock@capita.com.

Perl 6 is the major redesign and rewrite of the Perl language. Parrot is the virtual machine that Perl 6 (and other languages) will target. For more information on the Perl 6 and Parrot development efforts, visit dev.perl.org and parrotcode.org.

For the two week period, there were 410 messages across 110 threads, with 64 authors contributing. Quite a few patches.

Parrot Is Broken

Parrot is broken until the keyed access interface to arrays and hashes is fixed. This shouldn't take long, but be warned if you decide to play with Parrot.

Speed

Raptor was impressed with the speed at which Parrot runs through code, and asked if the numbers were true. Sadly, not really.

Unicode Strings

Larry Wall gave his input on how Unicode and Strings should interrelate in Perl 6.

But within Perl, character strings are simply sequences of integers. The internal representation must be optimized for this concept, not for any particular Unicode representation, whether UTF-8 or UTF-16 or UTF-32. Any of these could be used as underlying representations, but the abstraction of sequences of integers must be there explicitly in the internal high-level string API. To oversimplify, the high-level API must not have any parameters whose type contains the string "UTF".

The Regex Engine

Ashley Winters had a beef with the current regex engine, and how it's not being written in pure Parrot opcodes. These replies from Brent Dax, Bryan Warnock, and Angel Faus respond to Ashley's concerns.

Perl 6 On Mono

Aaron Sherman asked whether targeting Mono's CIL wasn't a better idea than writing our own VM. Brent Dax quoted the FAQ, but Paolo Molaro wasn't so sure.

Is Perl 6 Perl?

Aaron Sherman also asked whether Perl 6 should, in fact, be called Perl. Here's one particular response.

The Parrot Spotlight

Jeff Goff has been programming in Perl since roughly 1994, and designing Parrot since November 2001. He's the author of the PMC base types, the miniperl and Scheme compilers, and is currently redesigning the PMC classes and working on Unicode support.

Aside from Perl, Parrot, compilers, neural networks and genetic algorithms, his non-computer interests include origami, juggling, voracious reading, and playing guitar (seven notes and holding). He's owned by the unofficial Parrot mascot, a precocious baby African Grey named Percy, who he's been known to speak to in Japanese. Percy usually gives him blank stares, but was once heard to mutter 'Speak English, idiot.'


Bryan C. Warnock

Optimizing Your Perl

Is your Perl program taking too long to run? This might be because you've chosen a data structure or algorithm that takes a long time to run. By rethinking how you've implemented a function, you might be able to realize huge gains in speed.

Some Simple Complexity Theory

Before we can talk about speeding up something, we need a way to describe how long something takes. Because we're talking about algorithms that may have varying amounts of input, the actual "time" to do something isn't conclusive. Computer scientists and mathematicians use a system called big-O notation to describe the order of magnitude of how long something will take. Big-O notation represents a worst-case analysis. There are other notations to represent the magnitude of minimum and actual runtimes.

Don't let the talk of computer scientists and mathematicians scare you. The next few paragraphs talk about a way to describe the difference between orders of magnitude such as a second, minute, hour or day. Or 1, 10, 100 and 1,000. You don't need to know any fancy or scary math to understand it.

It's simple. If the runtime of a function is constant, then this is written as O(1). (Note the capital or "big" O.) Constant means that no matter how much data is processed (i.e., how much input there is), the time doesn't change.

If the runtime is directly related to the amount of input (a.k.a. linear), then this is O(n). If there is twice as much input, then the algorithm will take twice as long to run. Some functions are O(n^2) (quadratic) or O(n^3) (cubic) or even worse.

Because we are only looking at orders of magnitude (is the operation going to take one minute or one hour) we can ignore constant factors and smaller terms. O(2n) and O(n) are equivalent. As are O(n+1) and O(n) or O(n^2+n) and O(n^2). The smaller term is insignificant compared to the larger.

There are rules for analyzing code and determining its big-O, but the simplest way is to look at how many times you iterate over the data. If you don't iterate, then it's O(1). If you iterate over the data once, then it's O(n). If you have two nested loops, then it's probably O(n^2).
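The loop-counting rule can be sketched with three throwaway functions (the names are made up for illustration):

```perl
use strict;
use warnings;

# No iteration over the input: O(1).
sub first_item { return $_[0] }

# One pass over the input: O(n).
sub sum_list {
    my $total = 0;
    $total += $_ for @_;
    return $total;
}

# Two nested passes over the input: O(n^2).
sub has_duplicates {
    my @data = @_;
    for my $i (0 .. $#data) {
        for my $j ($i + 1 .. $#data) {
            return 1 if $data[$i] eq $data[$j];
        }
    }
    return 0;
}

print sum_list(1 .. 10), "\n";    # prints 55
```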


Example #1: Graphing

I developed a new graph module for a project at my day job, because the existing one was overkill, yet did not support a feature I needed. It was easy to implement and worked very well -- until the graph got very big. I needed flattened subgraphs to do dependency checks, but the computation was taking more than two minutes (on a 1GHz P3) for a 1,000-node graph. I needed it to be much faster, or my users would get mad.

My initial Graph module looked like this. (Simplified to only show relevant parts.)


 package Graph; # for Directed Graphs
 use strict;
 sub new {
  my $this = { edges => [],
               vertices => {} };
  bless $this, "Graph";
 }
 sub add_edge { # from x to y
  my ($this,$x,$y) = @_;
  $this->{vertices}{$x}++;
  $this->{vertices}{$y}++;
  push @{$this->{edges}},[$x=>$y];
 }
 sub in_edges {
  my ($this,$v) = @_;
  map  { $_->[0] }   # keep just the source vertex of each matching edge
  grep { $_->[1] eq $v } @{$this->{edges}};
 }

The add_edge method is O(1) because it will always take the same amount of time no matter how many nodes/edges are in the graph. The in_edges method is O(n) because it has to iterate over each edge in the graph. (The iteration is "inside" the grep.)

The problematic section of the code looked like this:


 sub flat_subgraph {
    my ($graph,$start) = @_;
    my %seen;
    my @worklist = ($start);
    while (@worklist) {
      my $x = shift @worklist;
      push @worklist, $graph->in_edges($x) 
         unless $seen{$x}++;
    }
    return keys %seen;
 }

And I actually needed to do that for multiple vertices:


  my %dependencies;
  for (keys %{$graph->{vertices}}) {
   $dependencies{$_} = [ flat_subgraph( $graph, $_ ) ];
  }

This would make the entire flattening operation O(n^3). (The dependency loop is O(n), flat_subgraph is effectively O(n) and in_edges is O(n).) That means it takes much longer to compute as the graph becomes bigger. (Imagine a curve through the values 1, 8, 27, 64 ...; that's the curve of relative times.)

Caching

With my users expecting instant feedback, something had to be done. The first thing I tried was to cache the in_edges computation.


 sub in_edges {
  my ($this,$v) = @_;
  if (exists $this->{cache}{in_edges}{$v}) {
      return @{$this->{cache}{in_edges}{$v}};
  } else {
    my @t = map  { $_->[0] }
            grep { $_->[1] eq $v } @{$this->{edges}};
    $this->{cache}{in_edges}{$v} = [@t];
    return @t;
   }
 }

This helped, but not as much as I had hoped. The in_edges calculation was still expensive when it wasn't cached. It's still O(n) in the worst case, but becomes O(1) when cached. On average, that means the entire flattening operation is somewhere between O(n^2) and O(n^3), which is still potentially slow.

In situations where the same functions are called with the same arguments, caching can be useful. Mark-Jason Dominus has written a module called Memoize, which makes it easy to write caching functions. See his Web site http://perl.plover.com/MiniMemoize/ for more information.
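As a sketch of the idea, Memoize can wrap an existing function in a cache with a single call. (The slow recursive Fibonacci function here is a stock example, not part of the Graph code.)

```perl
use strict;
use warnings;
use Memoize;

# A deliberately slow recursive function.
sub fib {
    my $n = shift;
    return $n if $n < 2;
    return fib($n - 1) + fib($n - 2);
}

memoize('fib');    # replace fib() with a transparent caching wrapper

print fib(30), "\n";    # prints 832040
```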

Change the Structure

The next thing I tried was to change the Graph data structure. When I originally designed it, my goal was simplicity, not necessarily speed. Now speed was becoming important and I had to make changes.

I modified Graph so that in_edges was computed as each new edge was added.


 package Graph; # for Directed Graphs
 use strict;
 sub new {
  my $this = { edges => [],
               vertices => {},
               in_edges => {} };
  bless $this, "Graph";
 }
 sub add_edge { # from x to y
  my ($this,$x,$y) = @_;
  $this->{vertices}{$x}++;
  $this->{vertices}{$y}++;
  push @{$this->{in_edges}{$y}}, $x;
  push @{$this->{edges}},[$x=>$y];
 }
 sub in_edges {
  my ($this,$v) = @_;
  return @{$this->{in_edges}{$v}};
 }

add_edge remains O(1) - it still executes in a constant time. in_edges is O(1) as well now; its runtime will not vary based on the number of edges.

This was the solution I needed, getting a full flattening down from more than four minutes to only six seconds.

The Importance of Good Interfaces

One of the things that made this fiddling possible was a good design for the Graph module's interface. The interface hid all the details of the internals from me, so that different implementations could be plugged in. (This is actually how I did the testing: I had a regular Graph, a Graph::Cached and a Graph::Fast. Once I decided that I liked Graph::Fast the best, I renamed it to just Graph.)

I'm glad I designed Graph the way I did. I put simplicity and design first. There is a phrase you might have heard: "Premature optimization is the root of all evil." If I had optimized the Graph module first, then I might have ended up with more complicated code that didn't work in all cases. My initial cut of Graph was slow, but worked consistently. Optimization happened later, after I had a baseline I could return to, to guarantee proper operation.

Example #2: Duplicate Message Filter

You don't always need to optimize code to the fullest extent. Optimization takes time and energy, and it isn't always worth the investment. If you spend two days to shave two seconds off a program, that's probably wasted effort (unless, of course, that program runs hundreds of times each day). If a program is "fast enough," leave it alone.

Our second example is a duplicate message filter for e-mail. This filter is useful when you receive cc's of messages also sent to mailing lists you are on, but you don't want to see both.

The general idea behind this filter is simple.


  skip $message if seen $message->id;

It's only the seen function that interests us. It searches some form of cache of message-ids. How we implement it can have a large effect on the speed of our mail filter. (For more information on filtering mail with Perl, take a look at Mail::Audit.)

The two most obvious ways to implement seen are a simple linear search, which is O(n), or a lookup in a persistent database/hash table, which is O(1). I'll ignore the cost of maintaining the database, eliminating old Message-Ids, and so on.

A linear approach writes out the message id's with some sort of separator. Some programs use null bytes, others use newlines.
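A newline-separated version might look like this sketch (the function and file names are made up for illustration):

```perl
# hypothetical linear-search seen(): scans the whole cache file, O(n)
sub seen_linear {
    my ($id, $cachefile) = @_;
    if (open my $in, '<', $cachefile) {
        while (my $line = <$in>) {
            chomp $line;
            return 1 if $line eq $id;   # found: we've seen this message
        }
        close $in;
    }
    open my $out, '>>', $cachefile or die "can't append to $cachefile: $!";
    print $out "$id\n";                 # not found: remember it
    close $out;
    return 0;
}
```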

The database/hash method stores the message id's in a binary database like a DBM or Berkeley DB file. At the expense of using more disk space, lookups are generally faster.
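The equivalent hash-based sketch ties the cache to an SDBM file (SDBM_File ships with Perl; a Berkeley DB file would work the same way and, unlike SDBM, imposes no small record-size limits):

```perl
use Fcntl;        # for the O_RDWR / O_CREAT flags
use SDBM_File;    # DBM implementation bundled with Perl

# hypothetical hash-based seen(): one O(1) lookup per message
sub seen_dbm {
    my ($id, $dbfile) = @_;
    tie my %cache, 'SDBM_File', $dbfile, O_RDWR | O_CREAT, 0644
        or die "can't tie $dbfile: $!";
    my $hit = exists $cache{$id};   # constant-time lookup
    $cache{$id} = time;             # record (or refresh) this id
    untie %cache;
    return $hit ? 1 : 0;
}
```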

But there's one more factor to take into account: overhead, meaning the time needed to open the file and initialize the appropriate data structures. The linear search has little overhead, while the DB file has a lot.


 TABLE: Comparison
 
  Records | Comment
  -----------------
    100   | linear is 415% faster
    250   | linear is 430% faster
    500   | hash is 350% faster
   1000   | hash is 690% faster

I performed a set of timings with a rough example implementation on my P2/400, and received the above results. The hash lookup speeds remained constant, while the linear lookup dropped off rapidly after 250 items in the cache.
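A rough way to reproduce that kind of comparison in-process is the standard Benchmark module. The numbers will of course vary by machine; this pits a worst-case linear scan against a hash lookup over 1,000 made-up ids:

```perl
use Benchmark qw(cmpthese);

my @ids    = map { "msg$_\@example.com" } 1 .. 1000;
my %seen   = map { $_ => 1 } @ids;
my $target = $ids[-1];   # last entry: worst case for the linear scan

cmpthese(10_000, {
    linear => sub { my $hit = grep { $_ eq $target } @ids },   # O(n)
    hash   => sub { my $hit = exists $seen{$target} },         # O(1)
});
```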

Returning to the big picture, we need to take our message-id cache size into account when determining the method to use. For this kind of cache, there's usually not a reason to keep a huge amount of entries around, only two or three days worth should catch most duplicate messages. (And probably even one day's worth in most cases.)

For this problem, an O(1) solution is possible, but its constant per-lookup cost may be higher than that of an O(n) solution at small cache sizes. Both are trivial to implement here, but in some situations an O(1) solution will be difficult to write, and not worth the trouble either.

Conclusion

Perl is a powerful language, but it doesn't prevent programmers from making poor decisions and shooting themselves in the foot. One common mistake is choosing the wrong data structure or algorithm for the job. By using a little bit of complexity theory, it's possible to optimize your code and get huge speed gains.

I've only touched the surface of big-O notation and complexity theory. It's a highly researched topic that delves into the deep dark corners of math and computers. But, hopefully there's enough here to give you a basic tool set.

As the second example showed, you don't necessarily want to spend too much time thinking about optimization. Over-optimization isn't always worth it.

For More Information

Google
 http://www.google.com/search?q=big-O+complexity
Introduction to the Theory of Computation
Sipser, Michael. Introduction to the Theory of Computation, Part Three. PWS Publishing Company, Boston, MA, 1997.

Algorithms and Complexity, Internet Edition
http://www.cis.upenn.edu/~wilf/AlgComp3.html

Special Thanks

Walt Mankowski

Visual Perl

Jumbo shrimp, military intelligence - Visual Perl? ActiveState built its reputation by bringing Perl to the Windows platform, thereby extending the reach of Perl and the size and nature of the Perl community. However, regardless of the Windows port, many Perl programmers don't hesitate to voice their preference for text-based editors (like vi and emacs) over Windows-based integrated development environments (IDEs). Some Perl gurus find text-based editors the most efficient way to code. But for programmers who are new to Perl, and programmers who primarily work in graphical environments such as Windows, the intuitive, visual nature of IDEs helps them learn languages and produce code more quickly.

ActiveState's combined expertise in Perl and the Windows platform, and our stated goal of bringing open-source programming languages to a wider developer audience, have led to our interest in developing IDEs. We have three particular goals:

  1. to create an intuitive editing environment
  2. to deliver much of the functionality that powerful text-based editors offer, but without the initial learning curve
  3. to add language-specific features, such as real-time syntax checking, debugging, colorizing and online help

Therefore, in addition to developing our own Komodo IDE, ActiveState has participated in Microsoft's Visual Studio Integration Program (VSIP) and developed three plug-ins (Visual Perl, Visual Python and Visual XSLT) for Microsoft's Visual Studio .NET.

Integrating Perl support into Visual Studio .NET required extending Visual Studio to provide Perl-specific functionality. For example, Visual Perl includes module-based code completion, pop-up hints for Perl keywords and module functions, a context-sensitive Perl language reference and emacs-like auto-indenting, brace matching, and block selection. Additionally, Visual Perl makes use of standard Visual Studio .NET functionality through the graphical debugger, the class view (for browsing functions), sophisticated project management, etc.

Visual Perl and .NET

In a separate initiative, ActiveState has developed PerlNET, a tool for building .NET-compliant components with Perl. PerlNET is part of ActiveState's Perl Dev Kit 4.0, and Visual Perl integrates with the Perl Dev Kit, allowing developers to create .NET-compliant components (and also Windows applications, services and controls) from within Visual Studio .NET.

PerlNET and Visual Perl make it easier to build multi-language applications. Previously, the easiest way to build Windows applications quickly was with Visual Basic. However, Visual Basic has drawbacks with regard to the speed at which applications run. Also, VB is not the most commodious language for developers accustomed to modern constructs such as associative arrays, dynamic functions, closures, run-time eval and regular expression matching with full backtracking.

Microsoft has addressed this by creating a modernized ".NET" version of Visual Basic, and with the creation of C#, which provides a near-functional equivalent to Visual Basic .NET (some would say "superior, but functionally close"). Additionally, any language that can target the .NET runtime engine can be used to build components, and multilanguage interoperability is transparent. In a nutshell, you can use a high-level language to script the UI, write back-end components in Perl, and tie the two together almost effortlessly. All you have to do is:

  • tell Visual Perl that a particular Perl project should be converted to a .NET DLL, and add the interface definition for the component in one or more "=for interface" POD comments
  • assign the code in the Perl-sourced DLL a namespace
  • point the client application to the Perl-sourced DLL, and use that namespace

Visual Perl and .NET: Making it Happen

As an example, the process of building a .NET-compliant application goes something like this:

  1. Quickly build a front-end in C# or Visual Basic .NET using Win Forms to populate a form with UI widgets. Double-click on the widgets to script common events ("..._Click" for buttons, "..._TextChanged" for changing text fields).
  2. When it's time to add more processing to the back end, add a Perl project to the current Visual Studio .NET solution, and treat it like a Perl package. Define the constructor, add methods and fields to the interface definition block, and then define the new methods and fields.
  3. Like other Perl component builders, you should test and debug your Perl code before converting it into a DLL. The simplest way is to put a
    
    	unless (caller) { ... }
    block at the end of your module, before the final "1;", and drive your package from that.
  4. Configure the Perl project as the "startup" project, then run the Perl component straight through, outside Visual Perl's debugger. (You can run programs in a DOS command shell, or in the "Run Window" within Visual Studio .NET. Both support standard IO, including stdin.)
  5. Change the project type from "Standard" to "Managed DLL," change the "startup" project to the C# project, then build and run.
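Put together, a minimal component driven this way might look like the following sketch. The `Calc` package is made up for illustration, and the `=for interface` block shows PerlNET's declaration style rather than real project output; since POD is ignored by plain Perl, the self-test runs anywhere:

```perl
package Calc;
use strict;

=for interface
    static Calc();
    int add(int a, int b);
=cut

sub new { return bless {}, shift }

sub add {
    my ($self, $a, $b) = @_;
    return $a + $b;
}

# drive the package when run directly, but not when loaded as a module
unless (caller) {
    my $c = Calc->new;
    print $c->add(2, 3), "\n";   # prints 5
}

1;
```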

Getting the Job Done with Visual Perl

.NET aside, we worked hard to make Visual Perl an attractive alternative to emacs, vim and other editors favored by Perl programmers. Visual Studio .NET provides an editing environment familiar to developers who have always worked in Windows. But many Perl programmers cut their hash variables on emacs and vi, and are wary of giving up the powerful functionality of those editors.

However, implementing "vi" and "emacs" keystroke bindings wasn't the way to go. "Ctrl-X" has a long-established meaning in the GUI world, and we weren't about to let developers accidentally delete a chunk of selected text when they meant to start a multi-keystroke command.

So, while we couldn't preserve the keystrokes, we could go some way toward preserving the functionality. Visual Perl features:

  • configurable auto-indenting that provides the ability to set the level of indent, specify whether to insert spaces or tabs, and whether to auto-indent based on nesting constructs
  • auto-indenting based on the Camel book style, so it's sensitive to the location of enclosing braces and parentheses
  • the ability to show matching bracketing characters
  • keyboard shortcuts for moving to matching brackets, or selecting a block within the matching brackets
  • keyboard shortcuts for commenting and un-commenting blocks of selected text
  • the ability to collapse and expand blocks of code with a single mouse click
  • incremental searching, both forward and backward (although we hope to extend this functionality with incremental regular expression searches)

Visual Perl's colorizer is aware of some of Perl's more arcane constructs, including here-documents (stacked or plain), regex substitutions with different delimiters on the pattern and replacement, and ?-delimited regular expressions. And Visual Studio .NET's Options dialog lets you assign whatever color combination you want to each style.

Other features include the class browser, which is used to quickly navigate to the functions in the files loaded in a project. Integrated context-sensitive help provides quick information on Perl keywords. You can right click on a "use" or "require" statement, and view help for the imported module within the Visual Studio .NET environment. The debugger supports debugging of remote Perl programs, not just the program loaded in the IDE.

Visual Perl uses the Visual Studio .NET code completion framework to help walk the user through "::"-separated names when importing modules. It recognizes when an instance of a package is bound to a variable, and presents available methods when "->" is typed after the variable. Visual Perl isn't doing anything fancy here; it's simply assuming that good developers try to maintain a many-to-one relationship between their variable names and the types the variables are labels for. So when Visual Perl sees that a variable called "$ua" is an instance of LWP::UserAgent, it assumes that other instances of that variable are as well. This definitely isn't thorough, and assumes that a value keeps its name on both sides of a function call, but it works the way that many people work.

Visual Perl and Web Services

Visual Perl's code completion is also available for Web services. Once a Web service has been associated with a variable, a list of available methods in the Web service, and their call signatures, will pop up when you type "->".

ActiveState has participated in the development of a scripting-language-agnostic framework for consuming Web services, called the Simple Web Services API. It works with Python and PHP, as well as Perl. Among other things, it allows you to bind to and call methods on a Web service with a minimal amount of code. For Perl, this line suffices to bind a variable to a Web service:

$var = WebService::ServiceProxy->new (wsdl string)

Calling methods off the service is as simple as invoking an object method:

$var->method(args...)

Once a variable has been bound to a Web service, typing '->' will drop down a list of the methods the Web service exports, and typing '(' after a method name will raise a call tip that walks through the arguments.

This is similar to SOAP::Lite, but has the advantage of working in several languages, as well as handling features such as overloaded methods, and supporting method names that contain Unicode characters that can't be used in Perl identifiers.

Conclusion

ActiveState's work with Visual Studio .NET and the .NET Framework furthers our commitment to expanding the use of Perl. Visual Perl expands the language set available to developers using Microsoft's ubiquitous Visual Studio IDE, and PerlNET extends Perl by providing compliance with the .NET Framework.
