CGI::Kwiki

This article is about a new Perl module called CGI::Kwiki. With this module you can create a Wiki Web site in less than a minute. Now that's quick. Or more appropriately, ``That's Kwik!''

If you've not heard of a Wiki, it's a Web site that allows you to add and edit pages directly from your browser. Generally, every page on the site has a link or button that will let you edit the page that you are reading. When you edit a page, the raw contents of that page come up in a text edit area in your browser. You can make any changes you want. When you hit the SAVE button, the changes become live.

To create a new page, you just create a link on the current page to a page that didn't exist before. Then when you follow the new link, you are allowed to edit the new page.

Knowledge of HTML is not a prerequisite for using a Wiki. It's not even a requisite, because the raw Wiki contents that you edit are not presented as HTML. Wikis use a much more natural markup that resembles the messages posted in Usenet newsgroups. An example can speak for itself:


    == A Page Header for a Sample Wiki Page ==

    Here's a list of some WikiFormattingCodes:
    * Lines that begin '* ' form a bulleted list
    * Asterisks might be used to mean *bold* text
    * Links like http://www.perl.com work automatically

The only markup that should require further explanation is the text WikiFormattingCodes. Capitalized words that are mushed together form a link to another page on the Wiki site.

A Wiki is simply a Web site that is easy for ordinary people to edit. So where did the Wiki idea come from and why is it important?

Ward's Wiki Wisdom

I've been dabbling in the world of Wiki for less than a year. Rather than answer that question myself, I decided to ask the inventor of the Wiki. Now by pure coincidence, Ward Cunningham lives but a few miles from my house and well within my telephone area code. I decided to drop him a line and find out his innermost feelings on his creation:

Brian: Yes, hello. May I speak to Mr. Ward Cunningham?

Ward: Who is this?

Brian: This is Brian Ingerson from Perl.com. I have a few questions.

Ward: Perl?! That's not me! Wall is to blame. Call him.

Brian: No. Wait. It's about the Wiki.

Ward: Ah, yes. The Wiki. Well let's get to business.

Brian: Why did you invent the Wiki?

Ward: Wiki had a predecessor that was a hyper-card stack. I wrote it to explore hypertext. I wanted to try recording something that was ragged, something that wouldn't fit into columns. I had this pet theory that programming ideas were spread by people working together. I set out to chart the flow of ideas through my company (then Tektronix). This turned out to be more fun than I ever would have imagined.

When we were really trying to capture a programmer's experience in software patterns, I remembered that stack and set out to do it over with the technology of the moment, the World Wide Web. This was 1994. I wrote Wiki to support and enlarge the community writing software patterns.

Brian: What do you see as Wiki's most-positive contribution to the world?

Ward: Back in 1994, the Web was a pretty wonderful place, with lots of people putting up stuff just because they thought someone else would find it interesting or useful. Wiki preserves that feeling in a place that has become too much of a shopping mall. It reminds people that sometimes to work together you have to trust each other more than you have any reason to.

Brian: Are you concerned that there are so many different Wiki implementations?

Ward: I was concerned once. I wish everyone used my markup instead of inventing their own. But that didn't happen. Now I realize that the implementations have done more to spread the idea than I ever could with my one version. That is the way it is with really simple things.

Brian: What programming language is your Wiki written in?

Ward: Um, ... Perl.

Brian: Tell me about that.

click

Wikis Wikis Everywhere

Just in case you didn't visualize the tongue entering and exiting my cheek, Ward does not have anything against Perl. To the contrary, he does almost all his open-source development with it, including his Wiki software. Try visiting his Wiki site, http://c2.com/cgi/wiki, for an excellent introduction to Wiki.

As was pointed out, there are many, many implementations that have sprung forth since the Wiki was invented, and many of those were written in Perl. That's because a Wiki is fairly easy to implement, and everyone seems to want to do it slightly differently.

Most of these implementations are just simple CGI scripts at heart. Even though they may have gathered dozens of special features over the years, they are really just ad hoc programs that are not particularly modularized or designed for extensibility.

One notable exception is the CGI::Wiki module by Kate ``Kake'' Pugh. This relatively new CPAN distribution is designed to be a Wiki framework. The various bits of functionality are encapsulated into class modules that can be extended by end users. As far as I know, this project is the first attempt in Perl to modularize the Wiki concept. It's about time!

The second attempt is a completely different module called CGI::Kwiki, the subject of this article. When I evaluated CGI::Wiki, I found it a little too heavy for my needs. It had about a dozen prerequisite modules and required an SQL database. CGI::Kwiki, by comparison, requires no extra modules besides those that come with Perl, and stores its Web pages as plain text files.

I find this preferable, because I can install a new Kwiki in seconds (literally) and I have the full arsenal of Unix commands at my disposal for manipulating the content. In fact, the default search facility for CGI::Kwiki is just a method call that invokes the Unix command grep.

Another compelling aspect of CGI::Kwiki is that every last bit of it is extensible, and extending it is trivial. About the only thing you can't easily change is the fact that it is written in Perl.

Because of this, I have probably set up more than a dozen Kwiki sites in the past month, and customized each one according to my needs. In this article, I'll show you how to do the same thing.

The Kwikest Way to Start

So just how easy is it to install a Kwiki? Well, that depends on how many of the basics you already have in place. You need a Web server and Perl, of course. You also need to have the CGI::Kwiki module installed from CPAN. That's about it.

For the sake of a specific example, let's say that you are running the Apache Web server (version 1.3.x) and that /home/johnny/public_html/cgi-bin/ is a CGI-enabled directory. With that setup in place, you can issue the following commands to create a new Kwiki:


    cd /home/johnny/public_html/cgi-bin/
    mkdir my-kwiki
    cd my-kwiki
    kwiki-install

Done! Your Kwiki is installed and ready for action. You should be able to point your Web browser at:


    http://your-domain/~johnny/cgi-bin/my-kwiki/index.cgi

and begin your wiki adventure.

At this point, if you do an ls command inside the my-kwiki directory, then you should see two files (index.cgi and config.yaml). index.cgi is just a point of execution for the CGI::Kwiki class modules, and config.yaml is little more than a list of which class modules are being used. You should also see a directory called database, where all your Kwiki pages are stored as individual plain text files.
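For reference, a stripped-down config.yaml might look something like this. The formatter_class and database_class keys are the ones we'll touch in this article; your generated file will contain more entries:

```yaml
# Which class modules this Kwiki uses. Overriding an entry here
# is how you plug in your own subclass, as shown later.
formatter_class: CGI::Kwiki::Formatter
database_class: CGI::Kwiki::Database
```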

These files will become important later as we explore how to customize Kwiki to your personal needs or whims.

If you are having trouble configuring Apache for CGI, then here is the basic httpd.conf section that I use for my personal Kwikis:


    Alias /kwiki/ /home/ingy/kwiki/
    <Directory /home/ingy/kwiki/>
        Order allow,deny
        Allow from all
        Options ExecCGI FollowSymLinks Indexes
        AddHandler cgi-script .cgi
        DirectoryIndex index.cgi
    </Directory>

This allows me to connect with this URL:


    http://localhost/kwiki/

Using Your Kwiki

When you first visit your newly installed Kwiki, you'll notice that there are a number of default pages already installed. Most notable is the one called HomePage, because that's the one you'll see first. This page requests that you change it as soon as possible. Go ahead and give it a try. Click the EDIT button.

You should see the text of HomePage laid out in Kwiki format inside an editable text area. Make some changes and click the SAVE button. The first thing you'll probably want to know is exactly how all the little Kwiki markup characters work.

KwikiFormattingRules

CGI::Kwiki has a set of default formatting rules that reflect my favorites from other Wikis. Some are from WardsWiki, some from MoinMoin, some from UseMod. All of them are customizable. More on that shortly. For now, let's go over the basics.

The first thing to learn is how to create a link. A link to another page on the site is made by squishing two or more words together in CamelCase. If the page doesn't exist yet, then that's OK. Clicking on it will allow you to create the new page from scratch. This is how Wikis grow.
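For the curious, the CamelCase convention is easy to recognize with a regular expression. Here's a tiny stand-alone sketch; this is my own illustration of the idea, not CGI::Kwiki's actual code:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A wiki word is two or more capitalized runs squished together,
# like HomePage or RecentChanges. One way to test for that:
sub is_wiki_word {
    my ($word) = @_;
    return $word =~ /^(?:[A-Z][a-z]+){2,}$/ ? 1 : 0;
}

print is_wiki_word('HomePage'), "\n";   # 1
print is_wiki_word('home'), "\n";       # 0
```

Note that an all-caps word like HTML doesn't match, which is exactly what you want: acronyms shouldn't turn into links.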

You can also create an external link by simply starting some text with http:. Like http://c2.com/cgi/wiki, the original Wiki Web site. Sometimes you want an internal link that isn't CamelCase. Just put the link text inside square brackets. If you want the link to be external, then add the http: component inside the brackets:


    [check_this_out]
    [check this out http://checked.out]

The second most-common formatting rule I use is preformatted text. This is used for things like Perl code examples. Text that is preformatted is automatically immune to further Wiki processing. To mark text as preformatted you just indent it. This is similar to the approach that POD takes:


        sub backwards_string {
            return join '', reverse split '', shift;
        }

One of the FormattingRules that I personally like is the ability to create HTML tables. You do it like this (if you're a bowler):


    | Player | 1   | 2   | 3   |
    | Marv   | 8-1 | X   | 9-/ |
    | Sally  | X   | X   | 8-1 |
    | Ingy   | 5-2 | 6-0 | 7-0 |
    | Big Al | 0-1 | 5-\ | X   |

(The people I bowl with usually get tired after three frames)

Tables are made by separating cells with vertical bar (or pipe) characters. Many times I need to put multiline text inside the cells. Kwiki accomplishes this by allowing a Here-Document style syntax:


    | yaml | perl | python |
    | <<end_yaml | <<end_perl | {'foo':'bar','bar':[42]} |
    ---
    foo: bar
    bar:
      - 42
    end_yaml
    {
      foo => 'bar',
      bar =>
        [ 42 ]
    }
    end_perl

Kwiki has a fairly rich set of default formatting rules. You'll find an exhaustive list of all the rules right inside your new Kwiki. The page is called KwikiFormattingRules. To find this page (and every other page on your Kwiki) click the RecentChanges link at the top of the current page.

KustomizingKwiki

To those of you familiar with the Wiki world, this has all been fairly pedestrian stuff so far. Here's where I think that things get interesting. As I stated before, every last part of the Kwiki software is changeable, customizable and extensible. Best of all, it's easy to do.

CGI::Kwiki is made up of more than a dozen class modules. Each class is responsible for a specific piece of the overall Kwiki behavior. To change something about a particular class, you just subclass it with a module of your own.

Two of the more important CGI::Kwiki classes, which we'll customize below, are CGI::Kwiki::Formatter and CGI::Kwiki::Database.

Kwiki knows which classes to use by looking in its config file. So if you want to subclass something, the first thing to do is change the config.yaml entry to point to your new class. Let's start with an easy one.

A Kwik and Dirty Tweak

Kwiki will turn a word or phrase inside *asterisks* to bold text. This is similar to the way you might do it in text e-mail. But WardsWiki uses '''triple quotes''' for bolding. Let's change your Kwiki to do it like Ward does.

First, create a file called MyFormatter.pm. You can put it right inside your Kwiki installation directory, and Kwiki will find it. The contents of the file should look like this:


    package MyFormatter;
    use base 'CGI::Kwiki::Formatter';

    sub bold {
        my ($self, $text) = @_;
        $text =~ s#'''(.*?)'''#<b>$1</b>#g;
        return $text;
    }

    1;

Now, change the config.yaml file to use this line:


    formatter_class: MyFormatter

The Kwiki formatting engine will now call your subroutine with pieces of text that are eligible to contain bold formatting. Sections of text that are already preformatted code will not be passed to your bold() method. And as you can see, MyFormatter is a subclass of CGI::Kwiki::Formatter so all the other formatting behaviors remain intact.

Kwiki's Formatting Engine

Let's look under the hood at CGI::Kwiki's hotrod formatting engine. You'll need to be familiar with it to do any serious formatting changes. Conceptually, it's rather simple. It works like this.

  • The text starts out as one big string.
  • There is a list of formatting routines that are applied in a certain order.
  • The string is passed to the first formatting routine. This routine may change the original text. It may also break the text into a number of substrings. It then returns the strings it has created and manipulated.
  • Each of the substrings is run through the next formatting routine in line.
  • Sometimes, a formatting routine will want to make sure that no further routines touch a particular substring. It can do this by returning a hard reference to that string.
  • After all the substrings have been passed through every routine, they are joined back together to form one long string.
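The loop described above can be sketched in a few lines of plain Perl. This is my own simplified illustration of the concept, not CGI::Kwiki's actual code; the code and bold routines below are toy stand-ins for the real formatting methods:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each routine gets a string and returns a list of strings and/or
# string references; a reference marks a chunk as finished, so
# later routines leave it alone.
sub format_text {
    my ($text, @routines) = @_;
    my @chunks = ($text);
    for my $routine (@routines) {
        @chunks = map { ref $_ ? $_ : $routine->($_) } @chunks;
    }
    # Join everything back up, dereferencing the protected chunks.
    return join '', map { ref $_ ? $$_ : $_ } @chunks;
}

# Toy 'code' routine: protect indented lines by returning refs.
my $code = sub {
    my @out;
    for my $part (split /(^ {4}.*$)/m, shift) {
        push @out, $part =~ /^ {4}/ ? \$part : $part;
    }
    return @out;
};

# Toy 'bold' routine: *words* become <b>words</b>.
my $bold = sub {
    my $s = shift;
    $s =~ s#\*(.*?)\*#<b>$1</b>#g;
    return $s;
};

print format_text("a *b*\n    *not bold*\n", $code, $bold);
```

Running this bolds the first line but leaves the indented line untouched, because the code routine handed it back as a reference before bold ever saw it.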

The specific routines and their order of execution are determined by another method called process_order(). The process_order method just returns a list of method names in the order they should be called. The default process_order method is:


    sub process_order {
        return qw(
            function
            table code header_1 header_2 header_3 
            escape_html
            lists comment horizontal_line
            paragraph 
            named_http_link no_http_link http_link
            no_wiki_link wiki_link force_wiki_link
            bold italic underscore
        );
    }

The best way to get a good feel for how to do things is to look over the CGI::Kwiki::Formatter module itself.
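If you wanted to add a brand-new rule rather than change an existing one, you could subclass the formatter, write the rule as a method, and add its name to the list. Here's a sketch; the strike rule is my own hypothetical example, not part of CGI::Kwiki, and the code assumes CGI::Kwiki is installed:

```perl
package MyFormatter;
use base 'CGI::Kwiki::Formatter';

# Hypothetical new rule: -struck- text becomes <del>struck</del>.
sub strike {
    my ($self, $text) = @_;
    $text =~ s#-(\S.*?\S)-#<del>$1</del>#g;
    return $text;
}

# Reuse the default order, then run our new rule last.
sub process_order {
    my $self = shift;
    return ($self->SUPER::process_order(@_), 'strike');
}

1;
```

As before, you'd point formatter_class at MyFormatter in config.yaml.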

KontentKontrol

The biggest fear that many people have of setting up a Wiki site is that someone will come along and destroy all their pages. This happens from time to time, but in general people just don't do it. It's really not that cool of a trick to pull off. Someone could even write a program to destroy a Wiki, but if they were that smart, hopefully they'd be mature enough not to do it.

As of this writing, CGI::Kwiki doesn't do anything to protect your data. But remember, it's just code. Let's now extend your code to do a simple backup, every time a page is written.

Possibly the simplest way to back up files on Unix is to use RCS. Let's make the Kwiki perform an RCS checkin every time it saves a page.

This time we need to extend the database class. Change the config file like so:


    database_class: MyDatabase

Then write a file called MyDatabase.pm that looks like:


    package MyDatabase;
    use base 'CGI::Kwiki::Database';

    sub store {
        my $self = shift;
        my ($file) = @_;
        $self->SUPER::store(@_);
        system(qq{ci -q -l -m"saved" database/$file backup/$file,v});
    }

    1;

Note: Be sure to add a backup directory that the CGI program can write to:


    mkdir backup
    chmod 777 backup

In this case, the store method calls its parent method to handle the actual database store, and then invokes an extra RCS command to back up the changes to the file.
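One refinement worth considering: the page name in $file comes from the wiki itself, so interpolating it into a shell command string is risky. The list form of Perl's system skips the shell entirely. Here's a small stand-alone sketch of the idea (the actual checkin line is commented out so the example runs anywhere):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the RCS checkin command as a list. With system(LIST), the
# arguments go straight to exec() with no shell in between, so
# shell metacharacters in a page name cannot inject commands.
my $file = 'HomePage';
my @cmd = ('ci', '-q', '-l', '-m', 'saved',
           "database/$file", "backup/$file,v");

# Inside MyDatabase::store you would run:
# system(@cmd) == 0 or warn "RCS checkin failed: $?";

print "@cmd\n";
```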

Hopefully these examples will give you an idea of how to go about making other types of modifications to CGI::Kwiki. If you make a whole set of cohesive and generally useful extensions, then please consider putting them on CPAN as a module distribution.

A Kwiki in Every Pot

The classic use for a Wiki site is to provide a multi-user forum for some topic of interest. In this context, Wiki is a great collaboration tool. People can add new ideas, and revise old ones. The Wiki serves as both an archive and a news site. Most Wikis provide a search mechanism and a RecentChanges facility.

But I think this only scratches the surface of Wiki usage possibilities. Since a Kwiki is so easy to create, I now find myself doing it all the time. It's almost like I'm creating a new wiki for every little thing I set out to do. Here are a few examples:

  • Personal Planning

    I have a personal wiki for keeping track of my projects. I keep it on my laptop.

  • Module Development

    Every Perl module I write these days has its own Kwiki in the directory. I use them mainly for creating Test::FIT testing tables. (See Test::FIT on CPAN). But I can also use it for project notes and documentation. Since I can extend the Kwiki, I can make it export the pages to POD if I want.

  • Autobiowiki

    I am seriously considering writing the stories of my life in a Wiki. If I can get others to do the same, then the Wikis could be linked using Ward's SisterSite mechanism. This would create one big story. (See http://c2.com/cgi/wiki)

  • Project Collaboration

    For my bigger projects I like to create a user community based around a Wiki. Using Test::FIT I can actually get my users to write failing tests for my projects. And they can help write documentation, report bugs, share recipes, etc. (See http://fit.freepan.org and http://yaml.freepan.org)

Conclusion

One final point of interest: this entire article was written in a Wiki format. I needed to submit it to my editor in POD format, which he in turn formatted into the HTML you are reading now. I accomplished this by simply using an extension of CGI::Kwiki::Formatter that produces POD instead of HTML!

NOTE: The raw content of this article along with the formatter program can be found at http://www.freepan.org/ingy/articles/kwiki/

Editor's note: http://www.kwiki.org has been created as the official kwiki home page.

About the Author

Brian Ingerson has been programming for more than 20 years, and hacking Perl for five of those. He is dedicated to improving the overall quality of scripting languages including Perl, Python and Ruby. He currently hails from Portland, Ore.; the very location of this year's O'Reilly Open Source Convention. How convenient!

Finding CGI Scripts

Introduction

No matter how much we try to convince people that Perl is a multi-purpose programming language, we'd be deluding ourselves if we didn't admit that the majority of programmers first come into contact with Perl through their experience with CGI programs. People have a small Web site and one day they decide that they need a guest book, a form mail script or a hit counter. Because these people aren't programmers, they go out onto the Web to see what pre-written scripts they can find.

And there are plenty to choose from. Searching for ``CGI scripts'' on Google, I received about 2 million hits. The first two were those well-known sites - Matt's Script Archive and the CGI Resource Index. Our Web site owner will visit one of these sites, find the required scripts and install them on his site. What could be simpler? See, the Web is as easy as people make it out to be.

In this article, I'll take a closer look at this scenario and show that all is not as rosy as I've portrayed it above.

CGI Script Quality

An important factor that Google takes into account when displaying search results is the number of links to a given site. Google assumes that if there are a large number of links to a given Web page, then it must be a well-known page and that Google's visitors will want to visit that site first.

Notice that I said ``well-known'' in that previous paragraph. Not ``useful'' or ``valuable.'' Think about this for a second. The types of people that I described in the introduction are not programmers. They certainly aren't Perl programmers. Therefore, they are in no position to make value judgments on the Perl code that they download from the Internet.

This means that the ``most popular'' site becomes a self-fulfilling prophecy. The best-known site is listed first on the search engines. More people download scripts from that site, assuming that the most popular site must have the highest-quality scripts, and so the popular sites end up becoming more popular.

At no point does any kind of quality control enter into the process.

OK, so that's not strictly true. If the scripts from a particular site just didn't work at all, then word would soon get out and that site's scripts would become unpopular. But what if the problems were more subtle and didn't manifest themselves on all sites? Here is a list of some potential problems:

  • Not checking the results of an open call. This will work fine if the expected file exists and has the right permissions. But what happens when the file doesn't exist? Or it exists but the CGI process doesn't have permissions to read from it or write to it?
  • Bad CGI parameter parsing code. CGI parameter parsing is one of those things that is easy to do badly and hard to do well. It's simple enough to write a parser function that handles most cases, but does it handle both GET and POST requests? What about keys with multiple associated values? And does it process file uploads correctly?
  • Lack of security. Installing a CGI program allows anyone with an Internet connection to run a program on your server. That's quite a scary thing to allow. You'd better be well aware of the security implications. Of course, if people only ever run the script from your HTML form, then everything will probably be fine, but a cracker won't do that. He'll fire ``interesting'' sets of parameters at your script in an attempt to find its weaknesses. Suddenly a form mail script is being used to send copies of vital system files to the cracker.

    It's also worth bearing in mind that because these scripts are available on the Web, crackers can easily get the source code. They can then work out any insecurities in the scripts and exploit them. Recently, a friend's Web site came under attack from crackers and amongst the traces left in the access log were a large number of calls to well-known CGI scripts.

    For this reason, it is even more important that you are careful about security when writing CGI scripts that are intended to be used by novice Webmasters.

The fact is, unfortunately, that these kinds of problems are commonplace in the scripts that you can download from many popular CGI script archives. That's not to say that the authors of these scripts are deliberately trying to give crackers access to your servers. It's simply evidence that Perl has moved on a great deal since the introduction of Perl 5 in 1994 and many of the CGI script authors haven't kept their scripts up to date with current practices. In other cases, the authors know only too well how out of date their scripts are and have produced newer, improved versions, but other people are still distributing the older versions.

Setting a Good Example

Although the people who are downloading these scripts aren't usually programmers, there often comes a time when they want to start changing the way a program works and perhaps even writing their own CGI programs. When this time comes, they will go to the scripts they already have for examples of how to write them. If the original script contained bad programming practices, then these will be copied in the new scripts. This is the way that many bad programming practices have become so common among Perl scripts. I, therefore, think that it's a good idea for any publicly distributed programs to follow best programming practices as much as possible.

Script Quality - A Checklist

So now we have an obvious problem. I said before that the people who are downloading and installing these scripts aren't qualified to make judgments on the quality of the code. Given that there are some problematic scripts out there, how are they supposed to know whether they should be using a particular script that they find on the Web?

It's a difficult question to answer, but there are some clues that you can look for that give an idea of how well-written a script is. Here's a brief checklist:

  • Does the script use -w and use strict? The vast majority of Perl experts recommend using these tools when writing Perl programs of any level of complexity. They make any Perl program more robust. Anyone distributing Perl programs without them probably doesn't know as much Perl as they think they do.
  • Does the script use Perl's taint mode? Accepting external data from a Web browser is a dangerous business. You can never be sure what you'll get. If you add -T to a program's shebang line, then Perl goes into taint mode. In this mode Perl distrusts any data that it gets from external sources. You need to explicitly check this data before using it. Using -T is a sign that the author is at least thinking about CGI security issues.
  • Does the script use CGI.pm? Since Perl 5.004, CGI.pm has been a part of the standard Perl distribution. This module contains a number of functions for handling various parts of the CGI protocol. The most important one is probably param, which deals with the parsing of the query string to extract the CGI parameters. Many CGI scripts write their own CGI parameter parsing routine that is missing features or has bugs. The one in CGI.pm has been well-tested over many years in thousands of scripts - why attempt to reinvent it?
  • How often is the script updated? One reason for a script not to use CGI.pm might be that it hasn't been updated since the module was added to the Perl distribution. This is generally a bad sign. You should look for scripts that are kept up to date. If there hasn't been a new version of the script for several years, then you should probably avoid it.
  • How good is the support? Any program is of limited use if it's unsupported. How do you get support for the program? Is there an e-mail address for the author? Or is there a support mailing list? Try dropping an e-mail to either the author or the mailing list and see how quickly you get a response.

Of course, these rules will have exceptions, but if a script scores badly on most of them, then you might have second thoughts on whether you should be using the script.

nms - A New CGI Program Archive

Having spent most of this article being quite negative about existing CGI program archives, let's now get a bit more positive. In the summer of 2001, a group of London Perl Mongers started to wonder what would be involved in writing a set of new CGI programs that could act as replacements for the ones in common use. After some discussion, the nms project was born. The name nms originally stood for a disparaging remark about one of the existing archives, but we decided that we didn't want that kind of negativity in the name. By that time, however, the abbreviated name was in common usage so we decided to keep it - but it no longer stands for anything.

The objectives for nms were quite simple. We wanted to provide a set of CGI programs which fulfilled the following:

  • As easy to use as (or easier than) existing CGI scripts
  • Use best programming practices
  • Secure
  • Bug-free (or, at least, well supported)

We decided that we would base our programs on the ones found in Matt's Script Archive. This wasn't because Matt Wright's scripts were the worst out there, but simply that they were the most commonly used. We made a rule that our scripts would be drop-in replacements for Matt's scripts. That meant that anyone who had existing data from using one of Matt's scripts would be able to take our replacement and simply put it in place of the old script. This, of course, meant that we had to become familiar with the inner workings of Matt's scripts. This actually turned out not to be as hard as I expected. The majority of Matt's scripts are simple. It's only really formmail, guestbook and wwwboard that are complex.

Sometimes our objectives contradicted one another. We decided early on that part of making the scripts as easy to use as possible meant not relying on any CPAN modules. We forced ourselves to use only modules that come as part of the standard Perl distribution. The reason for this is that our target audience probably doesn't know anything about CPAN modules and wouldn't find it easy to install them. A large part of our audience is probably operating a Web site on a hosted server, where they may not be able to install new modules and in many cases won't have telnet access to their server. We felt that asking them to install extra modules would make them far less likely to use our programs. This, of course, goes against our objective of using best programming practices, as in many cases there is a CPAN module that implements functionality that we use. The best example of this is in formmail, where we resort to sending e-mails by talking directly to sendmail rather than using one of the e-mail modules. In these cases, we decided that getting people to use the scripts (by not relying on CPAN) was more important to us than following best practices.

nms is a SourceForge project. You can get the latest released versions of the scripts from http://nms-cgi.sourceforge.net or, if you're feeling braver, then you can get the leading edge versions from CVS at the project page at http://sourceforge.net/projects/nms-cgi/. Both of those pages also have links to the nms mailing lists. We have two lists, one for developers and one for support questions. There is also a FAQ that will hopefully answer any further questions that you have about the project.

Here is a list of the scripts available from nms:

  • Countdown Count down the time to a certain date
  • Free For All Links A simple Web link database
  • Formmail Send e-mails from Web forms
  • Guestbook A simple guest book script
  • Random Image Display a random image
  • Random Links Display a link chosen randomly from a list
  • Random Text Display a randomly chosen piece of text
  • Simple Search Simple Web site search engine
  • SSI Random Image Display a random image using SSI
  • Text Clock Display the time
  • Text Counter Text counter

I should point out that this is very much a ``work in progress.'' While we're happy with the way that they work, we can always use more people looking at the code. The one advantage that Matt's scripts have over ours is that they've had many years of testing on a large number of Web sites.

A Plea for Help

So now we have a source of well-written CGI programs that we can point users to. What more needs to be done? Well, the whole point of writing this article was to ask more people to help. There's always more work to do :-)

  • Peer review. We think we've done a pretty good job on the scripts, but we're not interested in resting on our laurels. The more people that look at the scripts, the more likely we'll catch bugs and insecurities. Please download the scripts and take a look at them. Pass any bugs on to the developers mailing list.
  • Testing. We test the scripts on as many platforms with as many different configurations as we can, but we'll always miss one or two. Please try to install the scripts on your systems and let us know about any problems you have.
  • Documentation. Our documentation isn't any worse than the documentation for the existing archives, but we think it could be much better. If you'd like to help out with this, then please get in touch with us.
  • Advocacy. This is the most important one. Please tell everyone that you know about nms. Everywhere that you see people using other CGI scripts, please explain to them the potential problems and show them where to get the nms scripts. Having written these scripts, we feel it's important that they get as wide exposure as possible. If you have any ideas for promoting nms, then please let us know.

While I don't pretend for a minute that these are the only well-written and secure CGI programs available, I do think that the Perl community needs a well-known and trusted set of CGI programs that we can point people to. With your help, that's what I want nms to become.

CGI Scripting

This article is about scripting in Perl. Of course, scripting can take place anywhere, not just in the context of the Web, and virtually any language could be used to write CGI scripts. Here, though, I will concentrate on CGI scripts written in Perl.

Perl has been ported to about 70 operating systems. The most recent (February 2001) is Windows CE. In this article I will make a lot of simplifications and generalizations. Here's the first ...

There are two types of CGI scripts:

  • Those that output HTML pages
  • Those that process the input from CGI forms

Invariably, the second type -- having processed the data -- will output an HTML page or form to allow the user to continue, or at least to know what happened, or it will do a CGI redirect to another script that outputs something.

I am splitting scripts into two types to emphasize that there are differences between processing forms and processing in the absence of forms.
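As a taste of the first type, here is a complete script that does nothing but output an HTML page. It assumes only core Perl, and the page content is invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the whole response body as a string, so it can be
# emitted in one go (and inspected separately if need be).
sub page
{
    return <<'HTML';
<html>
<head><title>Hello</title></head>
<body><h1>Hello from a CGI script</h1></body>
</html>
HTML
}

# The blank line after the header is required: it separates
# the headers from the body, as HTTP demands.
print "Content-Type: text/html\n\n", page();
```

Drop something like this into cgi-bin, make it executable, and the Web server does the rest.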

Terminology

There are Web Servers and Web Clients. Some Web clients are browsers.

There are programs and scripts. Once upon a time, programs were compiled and scripts were interpreted. Hence the two names. But today, this is ``a distinction without a difference.'' My attitude is that the two words, program and script, are interchangeable.

Program and process, however, are different. Program means a program on disk. Process means a program that has been loaded by the operating system into memory and is being executed. This means a single program on disk can be loaded and run several times simultaneously, in which case it is one program and several processes.

Web servers have names such as Apache, Zeus, MS IIS and TinyHTTPd. Apache and TinyHTTPd (Tiny HyperText Transfer Protocol Daemon) are open source. Zeus and MS IIS (Internet Information Server) are commercial products. The feeble security of IIS makes it unusable in a commercial environment.

My examples will use Apache as the Web server. Web clients that are browsers have names such as Opera, Netscape and Explorer. Of course, you can roll your own non-browser Web client. We'll do this below.

URI = URL + URN

You'll notice the three letters I, L and N are in alphabetical order. That's the way to remember this formula.

U => Uniform
R => Resource
I => Identifier
L => Locator
N => Name

Web Server Start Up

When a Web server starts running, these are the basic steps taken:

  • Read configuration file. With Apache, you can use Perl to analyze certain parts of the config file. With Apache, you can even use Perl inside the config file
  • Start subprocesses (depending on platform)
  • Become quiescent, i.e. wait for requests from Web clients

It doesn't matter which Web server you are using, and it doesn't matter if the Web server is running under Unix or Windows or any other OS. These principles will apply.

Web Server Request Loop

The Web server request loop, simplified (as always), has several steps:

  • Accept request from Web client
  • Process request. This means one of the following:
    • Read a disk file containing an HTML page (the file = the page = the response)
    • Run a script (its output = the response)
    • Service the request using code within the Web server. If you submit http://127.0.0.1/server-info to Apache, for example, Apache fabricates the response itself
  • The script fabricates an HTML page and writes it to STDOUT. The Web server captures STDOUT. This output is the body of the response. The script exits
  • Send response body, wrapped in the appropriate headers, to the Web client

Pictorially, we have an infinity symbol, i.e. a figure eight on its side:

+------+  1 -Request--->  +------+  2 -Action-->  +------+
| Web  |  (URI or Submit) | Web  |  (script.pl)   | Perl |
|Client|                  |Server|                |Script|
+------+  <--Response- 4  +------+  <---HTML-- 3  +------+
         (Header and HTML)    (Plain page or CGI form)

Things to note:

  1. The interaction starts from the Web client
  2. The interaction is a round trip
  3. The Web client uses the HyperText Transfer Protocol to format Request 1:
    • The URI will be sent as text in the message
    • The data from a submitted form will be sent using the CGI protocol
    • In both cases, the HTTP will be used to generate an envelope wrapped around the message content
    In reality, of course, messages 1 and 4 will be wrapped in TCP/IP envelopes, and messages 2 and 3 will be mediated (handled) by the OS.
  4. Action 2 is a request from the Web server to the operating system to load and run a script. Many issues arise here. A brief summary:
    • Does the Web server have permission to run this script? After all, the Web server is itself a program, which means it was loaded and run by some user, often a special user called ``nobody.'' So does this ``nobody'' have permission to run script.pl?
    • Does this particular Web client have permission? The Web server will check directory-access permissions and may have to ask the Web client for a username and a password before proceeding
    • Does the script have permission to read/write whatever directories it needs in order to do its work? For instance, to put a Web front-end on CVS requires that ``nobody'' have read access to the source code repository, or that the script opens a socket to another script that can access the repository
  5. Action 3 is a stream of HTML output by the script and is captured by the Web server.
  6. Response 4 is the output from the script wrapped in an envelope of headers according to the HTTP.
  7. The Web client cannot see the source code of the script, only the output of the script. If the Web client, e.g. a browser, offers to download the script and pops up a dialog box asking for the name of a file to save the script in, then the Web server clearly did not execute the script. This means the Web server is misconfigured.
  8. If the first execute of the script outputs a CGI form, then when the Web client submits that form, the script is rerun to process the form's data. That's right, the script would normally be run twice. In other words, the first time the script runs it sees it has no input data, so it outputs an empty form. The second time it runs it sees it has input data, so it reads and processes that data. Yes, they could be two separate scripts. When the form is output, the ``action'' clause specifies the name of the script that the Web server will run to process the form's data.
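Point 8 can be sketched as a single script that plays both roles. This is only a sketch, using core Perl rather than the CGI module, and it handles only GET data arriving in the environment variable QUERY_STRING:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode 'key=value&key=value' CGI data into a hash.
sub parse_query
{
    my($query) = @_;
    my(%param);
    for my $pair (split /[&;]/, $query || '')
    {
        my($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        $value =~ tr/+/ /;                            # '+' encodes a space
        $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/eg;  # Decode %xx escapes
        $param{$key} = $value;
    }
    return %param;
}

sub respond
{
    my(%param) = @_;
    if (keys %param)   # Second run: we have data, so process it
    {
        return "<p>Hello, $param{name}</p>";
    }
    else               # First run: no data, so output the empty form
    {
        return "<form action='./script.pl'>" .
               "<input name='name'><input type='submit'></form>";
    }
}

print "Content-Type: text/html\n\n",
      respond(parse_query($ENV{QUERY_STRING}));
```

In real life you would let a module such as CGI do the parsing, since there are many encoding details to get right, but the two-pass logic is exactly this.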

Web Server Directory Structure

But how does the Web server know which page to return or which script to run? To answer this we next look at the directory structure on the Web server's machine.

Below, Monash and Rusden are the names of university campuses.

monash.edu and rusden.edu will be listed under the ``Virtual Hosts'' part of httpd.conf, or, if you are running MS Windows NT/2k, they can be named in the file C:\WinNT\System32\Drivers\Etc\Hosts. Under other versions of Windows, the hosts file will be C:\Windows\Hosts.

And a warning about the NT version of this file: Windows Explorer will lie to you about the attributes of this file. You will have to log off as an ordinary user and log on as the Administrator to be able to save edits into this file.

See http://savage.net.au/Perl/Html/configure-apache.html for details.

Assume this directory structure:

        - D:\
        -    www\
        -        cgi-bin\
        -            x.pl
        -        conf\
        -            httpd.conf
        -        public\
        -            index.html
        -            monash\
        -                index.html
        -                staff\
        -                    mug-shots.html
        -            rusden\
        -                index.html
        -                staff\
        -                    courses.html

Note:
• D:\www\cgi-bin   Contents can be executed by the Web server but not viewed by Web clients

• D:\www\conf   Contents invisible to Web clients

• D:\www\public   Contents can be viewed by Web clients

Web Server Configuration

Now, the Web server can be told, via its configuration file httpd.conf, that:

  • Web client requests using http://monash.edu/ are directed to D:\www\public\monash\.
    Hence, a request for http://monash.edu/staff/mug-shots.html returns the disk file D:\www\public\monash\staff\mug-shots.html
  • Web client requests using http://rusden.edu/ are directed to D:\www\public\rusden\.
    Hence a request for http://rusden.edu/staff/courses.html returns the disk file D:\www\public\rusden\staff\courses.html
  • Web client requests using http://monash.edu/cgi-bin/ are directed to D:\www\cgi-bin
  • Web client requests using http://rusden.edu/cgi-bin/ are directed to D:\www\cgi-bin
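In Apache's httpd.conf, that mapping might look roughly like this. This is a sketch only: directive details vary between Apache versions, and the paths simply follow the example directory structure above:

```apache
<VirtualHost *>
    ServerName   monash.edu
    DocumentRoot "D:/www/public/monash"
    ScriptAlias  /cgi-bin/ "D:/www/cgi-bin/"
</VirtualHost>

<VirtualHost *>
    ServerName   rusden.edu
    DocumentRoot "D:/www/public/rusden"
    ScriptAlias  /cgi-bin/ "D:/www/cgi-bin/"
</VirtualHost>
```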

Did you notice that both virtual hosts use D:\www\cgi-bin?

 ================================================================
 These two hosts have their own document trees, but share scripts
 ================================================================

We can service any number of virtual hosts with only one copy of each script. This is a huge maintenance savings.

This is the information available to the Web server when a request comes in from a Web client. So, now let's look at the client side of things.

A Perl Web Client

Here is a real, live, complete, Perl Web Client that is obviously not a browser:

#!/usr/bin/perl
use LWP::Simple;
print get('http://savage.net.au/index.html');

Yes, folks, that's it. The work is managed by the Perl module ``LWP::Simple,'' and is available thru the command ``get,'' which that module exports, i.e. makes public so it can be used in scripts like this one. LWP stands for Library for Web programming in Perl.

This code runs identically, from the command line, under Windows and Linux. The output is ``print''ed to the screen, but not formatted according to the HTML.
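To see what ``get'' does for you behind the scenes, here is the same request built by hand using the core IO::Socket::INET module. The request text shown is genuine HTTP/1.0; the host and path come from the command line, so running the script with no arguments does nothing but define the sub:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;   # Core module

# Build the raw HTTP/1.0 request a Web client sends.
sub build_request
{
    my($host, $path) = @_;
    return "GET $path HTTP/1.0\r\n" .
           "Host: $host\r\n" .
           "\r\n";                 # A blank line ends the headers
}

# Only touch the network if a host was given on the command line,
# e.g.: perl client.pl savage.net.au /index.html
if (@ARGV)
{
    my($host, $path) = (@ARGV, '/');
    my($socket) = IO::Socket::INET -> new(PeerAddr => $host, PeerPort => 80);
    die "Can't connect to $host: $!" unless $socket;
    print $socket build_request($host, $path);
    print while <$socket>;   # The response: headers, blank line, body
}
```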

It's now time to step thru the Web server-Web client interaction.

Web Client Requests

When you type something such as ``rusden.edu'' into the browser's address field, or pass that string to a Web client, and hit Go, here's an example of what could happen:

  • The Web client says ``You're lazy,'' and prepends the default protocol to the string, resulting in ``http://rusden.edu''
  • The Web client says ``You're lazy,'' and appends the default directory to the string, resulting in ``http://rusden.edu/''
  • The Web client sends this to the Web server with some headers. This is the all-important ``Request'' (see Web Server Request Loop)
  • The Web server parses it and, using its configuration data, determines which disk directory, if any, this maps to. I say ``if any'' because it may refer to a virtual, or non-existent, directory
  • If the client asks for a directory, this would normally be converted (by the Web server) into a request for a directory listing, or a default file, such as /index.html
  • If the client asks for a script to be run, the request is processed as described above. Of course, the client may not even know that a script is being run to service the request
  • The Web server determines whether you have enough permission to access files in this directory
  • If so, the Web server reads this disk file into memory or runs the script, and sends the result to the Web client with the appropriate headers. This is the all-important ``Response''

In reality, processing the request and manufacturing the response can be complex procedures.

Web Pages

There are two types of Web pages sent to Web clients:

  • Those that contain passive text, which the Web client (or human operating a browser) can do no more than look at
  • Those that contain active text, i.e. CGI forms, in that the Web client (or human) can fill in data entry fields and then submit the form's data back to the Web server for processing by a script. In such cases, the form must contain a submit button of some type: either a clickable image, or a standard submit button whose appearance has perhaps been transformed by a cascading style sheet.

Action = Script

If you view the source of such a form, you will always find text like:

<form method='POST' action='./script.pl'
 enctype='application/x-www-form-urlencoded'>

The ``action'' part tells the Web server which script to run to process the form's data when the form is submitted.

The Web server asks the operating system to load and run the script, and then it (the Web server) passes the data (from the form) to the script. The script processes the data and outputs a response (which would normally be another form).
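In Perl, the script that emits such a form can be as small as this. The field names are invented, and './script.pl' is a placeholder, subject to the warning that follows:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Emit a form whose 'action' names the script that the Web
# server will run to process the submitted data.
sub enrol_form
{
    return <<'HTML';
<form method='POST' action='./script.pl'
 enctype='application/x-www-form-urlencoded'>
<input type='text' name='student_name'>
<input type='submit' value='Enrol'>
</form>
HTML
}

print "Content-Type: text/html\n\n", enrol_form();
```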

Warning

I've used './script.pl' to indicate the script is in the ``current'' directory, but be warned, the CGI protocol does not specify what the current directory is at any time.

In fact, it does not even specify that any current directory exists. Your scripts must, at all times, know exactly where they are and what they are doing.

Remember, this ``action'' is taking place inside (i.e. from the point of view of) the Web server.
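One core-Perl way for a script to know exactly where it is, is the FindBin module, which locates the directory the running script lives in, so that all paths can be made absolute. The config file name here is invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use FindBin;   # Core module: finds the running script's directory

# Never rely on the current working directory under CGI.
# Build absolute paths from the script's own location instead.
my($config_file) = "$FindBin::Bin/config.txt";   # Hypothetical file name

print "I would read my configuration from: $config_file\n";
```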

Web Page Content

Web pages usually contain data in a combination of languages:

  • Text: Display this text
  • Image references: Display this image
  • HTML: Format the text and images. HTML is a ``rendering'' language
  • XML: Echo and describe the text, e.g. to ``data mining'' page crawlers
  • JavaScript, for these reasons:
    • Create special effects (trivial)
    • Validate form input (important)

Yes, scripts can output scripts! Specifically, scripts can output Web pages containing JavaScript, etc. There's even a Perl interface to Macromedia's Flash. Where I work, some salesmen are obsessed with Flash, because it's all they understand of the software we write :-(. In Flash's defense, you'd have to say it's too trivial to have pretensions.

JavaScript

As a Perl aficionado, you may be tempted to look down on JavaScript, but you shouldn't. It really does have its uses.

When a page contains JavaScript to validate form input, this means quite a savings for the Web client.

Without the JavaScript here's what would happen (call this ``overhead''):

  • The form's data would have to be sent to the Web server. This means one trip across the Internet
  • The Web server would have to run the script that will validate the data
  • The Web server would have to pass the data to the script
  • The script would have to read and parse the data
  • The script would have to validate the data
  • The script would have to send a response to the Web server
  • The Web server would have to send a response to the Web client. This means a second trip across the Internet

All of this takes time. When the JavaScript runs, it runs inside the Web client, e.g. browser, so the Web client receives a response much faster.

Of course, complex validation often requires access to databases and so on, so sometimes there is no escape from ``overhead.''
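Whether it happens in JavaScript at the client or in Perl at the server, the validation step itself looks much the same. Here is a server-side sketch, with invented field names and rules:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Server-side validation: the script must re-check everything,
# because client-side JavaScript can be disabled or bypassed.
sub validate
{
    my(%param) = @_;
    my(@error);
    push @error, 'Name is required'
        unless defined $param{name} && $param{name} =~ /\S/;
    push @error, 'Age must be a whole number'
        unless defined $param{age} && $param{age} =~ /^\d+$/;
    return @error;   # Empty list means the data passed
}

# Deliberately bad sample data, to show the error path.
my(@error) = validate(name => 'Fred', age => 'twenty');
print map "$_\n", @error;
```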

For example, where I work we noticed some pages were appearing quite slowly, and I tracked it down to 3.6Mb (yes!!!) of JavaScript in some pages, which was being used to prevent the input of duplicate data. Naturally, this JavaScript was being created by a Perl program :-).

Digression: HTML 'v' XML

As an aside, here's how HTML compares to XML. HTML is a rendering language. It indicates how the data is to be displayed. XML is a meta-language. It indicates the meaning of the data.

Examples:

HTML: '<h1>25</h1>' tells you how 25 should look, but not what it is. In other words, '<h1>' is a command, telling a Web client how to display what follows.

HTML: '<th>Temperature</th><td>25</td>' tells you how to align the 25, but not what it is.

XML: '<temperature>25</temperature>' tells you what 25 is. '<temperature>' is not a command.

XML: '<street_number>25</street_number>' tells you what 25 is. (Spaces aren't allowed in XML element names, hence the underscore.)

Hmmm. This would make a marvellous exam question.

Reaction: A Tale of 2 Scripts

So, what happens when a Web client requests that a Web server run a script?

To answer this, let's look at a Web client request for a script-generated form, and how that request is processed.

In fact, the Web client is saying to the Web server: ``Pretty please, run _your_ script on _my_ data.'' Let's go through the procedure:

  • The Web client sends the URI 'http://rusden.edu/cgi-bin/enrol.pl'. This is script # 1
  • The Web server executes the script (# 1), captures its output and sends the output -- the form -- back to the Web client. Script # 1 knows what to output, because it sees that it has no input data from a CGI form
  • The script (# 1) terminates. It is finished, completed, done, gone forever. Trust me: I'm a programmer ...
  • The Web client renders the Web page
  • The Web client fills in the form and submits it. Being a form, it must contain an ``action'' clause naming a script (# 2). Perhaps script # 1 is the same as script # 2
  • The Web server executes the script (# 2), which processes the data. This invocation of script # 2 is independent of the prior invocation of script # 1, even if they are the same script. The Web server executes two separate processes, scripts # 1 and # 2. Script # 2 knows what to do because it sees that it has input data from a CGI form
  • And so on ... Script # 2 may issue another form, in order to continue the interaction

You can see the problem. How does script # 2 know what ``state'' script # 1 got up to?

Maintaining State

The problem of maintaining state is a big problem. Chapter 5 in ``Writing Apache Modules in Perl and C'' is called ``Maintaining State,'' and is dedicated to this problem. See ``Resources,'' below.

A few alternatives, and a simple discussion of possible drawbacks:

  • Send data to the Web client as ``hidden fields'' to be returned with the form data
    Drawback: A person can simply use the browser's ``View Source'' command to see the values. Hidden simply means that these fields are not rendered on the screen. There is absolutely no security in hidden fields.
  • Save state in cookies

    Drawback: The Web client may have disabled cookies. Some banks do this under the false assumption that cookies can contain viruses.

    Drawback: If the cookie is written to disk by the Web client, the text in the cookie must be encrypted if you want to stop people looking at it or changing it.

  • Save state in Web server memory

    Drawback: The data is in the memory of one process, and when the Web client logs back in (i.e. submits the form data) it may be connected to a different process, i.e. a copy of the process that sent the first response, and this copy will not have access to the memory of the first process.

  • Save state in the URI itself, e.g. as a session ID. Here's how: Generate a random number. Write the data into a database using the random number as the key. Send the random number to the Web client to be returned with the form data.

    Drawback: You can't just use the operating system's random-number generator: anyone with the same OS and compiler could predict the numbers, because they aren't truly random.

    Drawback: Relative URIs no longer work correctly. However, help is at hand with Perl module Apache::StripSession.

    Drawback: Under some circumstances it is possible for the session ID to ``leak'' to other sites.

  • Write the data to a temporary file

    Drawback: How does script # 2 know the name of this file created by script # 1? It's simple if they are the same script, but they don't have to be.

    Drawback: What happens if two copies of script # 1 run at the same time?

In each case, you either abandon that alternative or add complexity to overcome the drawbacks.
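To make the session-ID alternative concrete, here is a sketch. The hash %session stands in for the database table, and in production you would want a stronger source of randomness than rand:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 'md5_hex';   # Core module since Perl 5.8

my(%session);   # Stand-in for the database table keyed by session ID

# Generate a session ID that is hard to predict by mixing several
# varying quantities, then hashing them.
sub new_session
{
    my(%data) = @_;
    my($id)   = md5_hex(join '|', $$, time, rand(), %data);
    $session{$id} = {%data};
    return $id;
}

my($id) = new_session(student => 'Fred Nurk', course => 'Perl 101');
print "Send this back with the form data: $id\n";
```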

There is no perfect solution that satisfies all cases. You must study the alternatives, study your situation and choose a course of action.

Combining Perl and HTML

There are three basic ways to do this:

  1. Put the code inside the HTML. Many Perl packages take this approach, e.g. Apache::ASP, Apache::EmbPerl (EmbeddedPerl), Apache::EP (another embedded Perl), Apache::ePerl (yet another embedded Perl), Template::Toolkit (embed a mini non-Perl language in the HTML).

    In each case, you need an interpreter to read the combined HTML/Perl/Other and to output pure HTML. In such cases, the interpreter will act as a filter.

  2. Put the HTML inside the code. This is just the reverse of (1). Thus (tested code):
    #!/usr/bin/perl
    use integer;
    use strict;
    use warnings;
    use CGI;
    my($q) = CGI -> new();
    print   $q -> header,
      $q -> start_html(),
      'Weather Report',
      $q -> br(),
      $q -> table
      (
       $q -> Tr
       ([
        $q -> th('Temperature') . $q -> td(25)
       ])
      ),
      $q -> end_html();
    
  3. Put the HTML, or XML, or whatever, in a file external to the script. In this case your script will act as a filter. Your script reads this file and looks for special strings in the file that it replaces with HTML generated by, say, reading a database and formatting the output. In other words, the external file contains a combination of:
    • HTML, which your script simply copies to its output stream
    • HTML comments, like <!-- Some command -->, which your script cuts out and replaces with the results of processing that command
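Here is a minimal sketch of the third way: everything is copied through untouched except comments naming a command the script knows. The command name, product_table, is invented for the example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy HTML through untouched, but replace comment placeholders
# with the result of running the matching handler. Comments whose
# command is unknown are passed through unchanged.
sub expand
{
    my($template, %handler) = @_;
    $template =~ s{<!--\s*(\w+)\s*-->}
                  {$handler{$1} ? $handler{$1} -> () : "<!-- $1 -->"}eg;
    return $template;
}

my($html) = expand(
    "<h1>Products</h1>\n<!-- product_table -->\n",
    product_table => sub { '<table><tr><td>CD Writers</td></tr></table>' },
);
print $html;
```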

A Detour - SDF

If you head over to SDF - Simple Document Format, you'll see an example of the third way. SDF is, of course, a Perl-based open-source answer to PDF.

SDF is also available from CPAN: http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html.

SDF converts text files into various specific formats. SDF can output, directly or via other software, into these formats: HTML, PostScript, PDF, man pages, POD, LaTeX, SGML, MIMS HTX and F6 help, MIF, RTF, Windows help and plain text.

Inside a Script: Who's Calling?

A script can ask the Web server which URI was used to fire it off.

The Web server puts this information into the environment of the script: the path part under the name SCRIPT_NAME and, with Apache, the whole request under REQUEST_URI. The URI of the page that linked to the script, if any, is there too, under the name HTTP_REFERER (yes, misspelling included for free).

So, as a script, I can say I was called by one of:

  • http://monash.edu/cgi-bin/enrol.pl
  • http://rusden.edu/cgi-bin/enrol.pl

Now, either ``monash.edu'' or ``rusden.edu'' is just the value of a string in the script, and so the script can use this string as a key into a database. In fact, this part of the URI is also in the environment, separately, under the name HTTP_HOST.

From a database table, or any number of tables, the script can retrieve data specific to the host. This, in turn, means the script can change its behavior depending on the URI used to run it.
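A sketch of the idea in Perl. A real script would do a database lookup keyed on the domain; here a hash stands in, with values along the lines of the ``design'' table shown in the next section:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-host settings, keyed by domain. In practice this would be
# a database lookup; the values here are invented.
my(%style) = (
    'monash.edu' => {template => 'dark blue',  links => 'down the left'},
    'rusden.edu' => {template => 'pale green', links => 'across the bottom'},
);

sub style_for
{
    my($host) = @_;
    $host =~ s/:\d+$//;                  # Strip any :port suffix
    return $style{lc $host} || $style{'monash.edu'};   # Fallback default
}

my($s) = style_for($ENV{HTTP_HOST} || 'monash.edu');
print "Template: $s->{template}. Links: $s->{links}.\n";
```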

Data Per URI - Page Design

The open-source database MySQL already uses a table named ``host'' for its privilege system, so I'll start using the word ``domain.'' Given a domain, I can turn that into a number that can be used as an index into a database table.

Here is a sample ``domain'' table:

 +=============+=======+
 |             |  URI  |
 | domain_name | index |
 +=============+=======+
 | monash.edu  |   4   |
 +=============+=======+
 | rusden.edu  |   6   |
 +=============+=======+

And here is a sample Web page ``design'' table:

 +=======+
 +  URI  +===============+===========+===================+
 | index | template_name | bkg_color | location_of_links |...
 +=======+===============+===========+===================+
 |   4   |   dark blue   |   cream   |   down the left   |...
 +=======+===============+===========+===================+
 |   6   |   pale green  |  an image | across the bottom |...
 +=======+===============+===========+===================+

Data per URI - Page Content

Here is a sample Web page ``content'' table:

 +=======+
 +  URI  +================+================+
 | index | News headlines |    Weather     |...
 +=======+================+================+
 |   4   |        -       | www.bom.gov.au |...
 +=======+================+================+
 |   6   | www.f2.com.au  | www.bom.gov.au |...
 +=======+================+================+

f2 => Fairfax, the publisher of ``The Age'' newspaper.

bom => Bureau of Meteorology.

Data Per URI - Page Content Revisited

Let me give a more commercial example. Here we chain tables:

 ProductMap table:
 +=======+
 +  URI  +==============+============+
 | index |   Products   | product_id |
 +=======+==============+============+
 |   4   | Motherboards |     1      |
 +=======+==============+============+
 |   4   | Printers     |     2      |
 +=======+==============+============+
 |   4   | CD Writers   |     3      |
 +=======+==============+============+
 |   6   | CD Writers   |     4      |
 +=======+==============+============+
 |   6   | Zip Drives   |     5      |
 +=======+==============+============+

 Product table:
 +============+=============+
 | product_id |    Brands   |
 +============+=============+
 |     1      | Gigabyte X1 |
 +============+=============+
 |     1      | Gigabyte X2 |
 +============+=============+
 |     1      |   Intel A   |
 +============+=============+
 |     1      |   Intel B   |
 +============+=============+
 |     :      |      :      |
 +============+=============+
 |     5      |    Sony     |
 +============+=============+

Hence a list of products for a given URI, i.e. a given shop, can be turned into an HTML table and inserted into the outgoing Web page.
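In plain Perl, with the two tables copied into data structures (in practice they would be DBI queries against the database), the chaining looks like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The ProductMap and Product tables from above, as Perl data.
my(@product_map) = (
    {uri_index => 4, product => 'Motherboards', product_id => 1},
    {uri_index => 4, product => 'Printers',     product_id => 2},
    {uri_index => 4, product => 'CD Writers',   product_id => 3},
    {uri_index => 6, product => 'CD Writers',   product_id => 4},
    {uri_index => 6, product => 'Zip Drives',   product_id => 5},
);
my(%brand) = (
    1 => ['Gigabyte X1', 'Gigabyte X2', 'Intel A', 'Intel B'],
    5 => ['Sony'],
);

# Build an HTML table of products (and brands) for one shop,
# i.e. one URI index.
sub product_table
{
    my($uri_index) = @_;
    my($html)      = "<table>\n";
    for my $row (grep {$_ -> {uri_index} == $uri_index} @product_map)
    {
        my($brands) = join ', ', @{$brand{$row -> {product_id}} || []};
        $html .= "<tr><td>$row->{product}</td><td>$brands</td></tr>\n";
    }
    return $html . "</table>\n";
}

print product_table(6);
```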

Resources

``Writing Apache Modules in Perl and C,'' by Lincoln Stein and Doug MacEachern, O'Reilly.

Visit the home of the Perl programming language: Perl.org
