June 2001 Archives

Why Not Translate Perl to C?

People often have the idea that automatically translating Perl to C and then compiling the C will make their Perl programs run faster, because "C is much faster than Perl." This article explains why this strategy is unlikely to work.

Short Summary

Your Perl program is being run by the Perl interpreter. You want a C program that does the same thing that your Perl program does. A C program to do what your Perl program does would have to do most of the same things that the Perl interpreter does when it runs your Perl program. There is no reason to think that the C program could do those things faster than the Perl interpreter does them, because the Perl interpreter itself is written in very fast C.

Some detailed case studies follow.

Built-In Functions

Suppose your program needs to split a line into fields, and uses the Perl split function to do so. You want to compile this to C so it will be faster.

This is obviously not going to work, because the split function is already implemented in C. If you have the Perl source code, you can see the implementation of split in the file pp.c; it is in the function named pp_split. When your Perl program uses split, Perl calls this pp_split function to do the splitting. pp_split is written in C, and it has already been compiled to native machine code.

Now, suppose you want to translate your Perl program to C. How will you translate your split call? The only thing you can do is translate it to a call to the C pp_split function, or some other equivalent function that splits. There is no reason to believe that any C implementation of split will be faster than the pp_split that Perl already has. Years of work have gone into making pp_split as fast as possible.

You can make the same argument for all of Perl's other built-in functions, such as join, printf, rand and readdir.

So much for built-in functions.

Data Structures

Why is Perl slow to begin with? One major reason is that its data structures are extremely flexible, and this flexibility imposes a speed penalty.

Let's look in detail at an important example: strings. Consider this Perl code:

        $x = 'foo';     
        $y = 'bar';
        $x .= $y;

That is, we want to append $y to the end of $x. In C, this is extremely tricky. In C, you would start by doing something like this:

        char *x = "foo";
        char *y = "bar";

Now you have a problem. You would like to insert bar at the end of the buffer pointed to by x. But you can't, because there is not enough room; x only points to enough space for four characters, and you need space for seven. (C strings always have an extra nul character on the end.) To append y to x, you must allocate a new buffer, and then arrange for x to point to the new buffer:

        char *tmp = malloc(strlen(x) + strlen(y) + 1);
        strcpy(tmp, x);
        strcat(tmp, y);
        x = tmp;

This works fine if x is the only pointer to that particular buffer. But if some other part of the program also had a pointer to the buffer, this code does not work. Why not? Here's the picture of what we did:

BEFORE: [figure omitted: x and z both point to the same buffer, which holds foo]

Here x and z are two variables that both contain pointers to the same buffer. We want to append bar to the end of the string. But the C code we used above doesn't quite work, because we allocated a new region of memory to hold the result, and then pointed x to it:

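AFTER x = tmp: [figure omitted: x points to the new buffer holding foobar; z still points to the old buffer holding foo]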

It's tempting to think that we should just point z to the new buffer also, but in practice this is impossible. The function that is doing the appending cannot know whether there is such a z, or where it may be. There might be 100 variables like z all pointing to the old buffer, and there is no good way to keep track of them so that they can all be changed when the buffer moves.

Perl does support a transparent string append operation. Let's see how this works. In Perl, a variable like $x does not point directly at the buffer. Instead, it points at a structure called an SV ("Scalar Value"). The SV has the pointer to the buffer, and also some other things that I do not show:

BEFORE $x .= $y: [figure omitted: $x and $z both point to one SV, and the SV points to the buffer holding foo]

When you ask Perl to append bar to $x, it follows the pointers and finds that there is not enough space in the buffer. So, just as in C, it allocates a new buffer and stores the result in the new buffer. Then it fixes the pointer in the SV to point to the new buffer, and it throws away the old buffer:

Now $x and $z have both changed. If there were any other variables sharing the SV, their values would have changed also. This technique is called "double indirection," and it is how Perl can support operations like .=. A similar principle applies for arrays; this is how Perl can support the push function.
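The sharing can be observed from Perl itself. The following is a small sketch, not taken from the original article: it relies on the fact that the elements of @_ are aliases for the caller's variables, so inside the (hypothetically named) append_in_place sub, $_[0] is a second name for the same SV as the caller's $x.

```perl
use strict;
use warnings;

# $_[0] is an alias for the caller's first argument, i.e. a second name
# for the same SV, so appending through it changes the caller's variable.
sub append_in_place {
    $_[0] .= $_[1];
    return;
}

my $x = 'foo';
append_in_place($x, 'bar');
print "$x\n";    # prints "foobar"
```

Whether or not Perl had to move the character buffer during the append, both names see the result, because both go through the same SV.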

The flexibility comes at a price: Whenever you want to use the value of $x, Perl must follow two pointers to get the value: The first to find the SV structure, and the second to get to the buffer with the character data. This means that using a string in Perl takes at least twice as long as in C. In C, you follow just one pointer.

If you want to compile Perl to C, you have a big problem. You would like to support operations like .= and push, but C does not support these very well. There are only three solutions:

  1. Don't support .=

    This is a bad solution, because after you disallow all the Perl operations like .= and push what you have left is not very much like Perl; it is much more like C, and then you might as well just write the program in C in the first place.

  2. Do something extremely clever

    Cleverness is in short supply this month. :)

  3. Use a double-indirection technique in the compiled C code

    This works, but the resulting C code will be slow, because you will have to traverse twice as many pointers each time you want to look up the value of a variable. But that is why Perl is slow! Perl is already doing the double-indirection lookup in C, and the code to do this has already been compiled to native machine code.

So again, it's not clear that you are going to get any benefit from translating Perl to C. The slowness of Perl comes from the flexibility of the data structures. The code to manipulate these structures is already written in C. If you translate a Perl program to C, you have the choice of throwing away the flexibility of the data structure, in which case you are now writing C programs with C structures, or keeping the flexibility with the same speed penalty. You probably cannot speed up the data structures, because if anyone knew how to make the structures faster and still keep them flexible, they would already have made those changes in the C code for Perl itself.

Possible Future Work

It should now be clear that although it might not be hard to translate Perl to C, programs probably will not be faster as a result.

However, it's possible that a sufficiently clever person could make a Perl-to-C translator that produced faster C code. The programmer would need to give hints to the translator to say how the variables were being used. For example, suppose you have an array @a. With such an array, Perl is ready for anything. You might do $a[1000000] = 'hello'; or $a[500] .= 'foo'; or $a[500] /= 17;. This flexibility is expensive. But suppose you know that this array will only hold integers and there will never be more than 1,000 integers. You might tell the translator that, and then instead of producing C code to manage a slow Perl array, the translator can produce

        int a[1000];

and use a fast C array of machine integers.

To do this, you have to be very clever and you have to think of a way of explaining to the translator that @a will never be bigger than 1,000 elements and will only contain integers, or a way for the translator to guess that just from looking at the Perl program.

People are planning these features for Perl 6 right now. For example, Larry Wall, the author of Perl, plans that you will be able to declare a Perl array as

        my int @a is dim(1000);

Then a Perl-to-C translator (or Perl itself) might be able to use a fast C array of machine integers rather than a slow Perl array of SVs. If you are interested, you may want to join the perl6-internals mailing list.
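For comparison, here is what an ordinary, undeclared Perl 5 array has to be ready for at any moment. This is only an illustration of the flexibility described above; the array name and the values are arbitrary.

```perl
use strict;
use warnings;

my @a;
$a[1000] = 'hello';     # the array grows to 1,001 elements on demand
$a[500] .= 'foo';       # an empty slot becomes a string
$a[2]    = 34;
$a[2]   /= 17;          # the same array also holds numbers

print scalar(@a), " $a[500] $a[2]\n";    # prints "1001 foo 2"
```

A fixed array of machine integers can do none of these things, which is exactly why it can be fast.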

CGI Scripting

This article is about scripting in Perl. Of course, scripting can take place anywhere, not just in the context of the Web. I will concentrate on CGI scripts written in Perl. In fact, virtually any language could be used to write CGI scripts.

Perl has been ported to about 70 operating systems. The most recent (February 2001) is Windows CE. In this article I will make a lot of simplifications and generalizations. Here's the first ...

There are two types of CGI scripts:

  • Those that output HTML pages
  • Those that process the input from CGI forms

Invariably, the second type -- having processed the data -- will output an HTML page or form to allow the user to continue, or at least know what happened, or they will do a CGI redirect to another script that outputs something.

I am splitting scripts into two types to emphasize that there are differences between processing forms and processing in the absence of forms.

Terminology

There are Web Servers and Web Clients. Some Web clients are browsers.

There are programs and scripts. Once upon a time, programs were compiled and scripts were interpreted. Hence the two names. But today, this is ``a distinction without a difference.'' My attitude is that the two words, program and script, are interchangeable.

Program and process, however, are different. Program means a program on disk. Process means a program that has been loaded by the operating system into memory and is being executed. This means a single program on disk can be loaded and run several times simultaneously, in which case it is one program and several processes.

Web servers have names such as Apache, Zeus, MS IIS and TinyHTTPd. Apache and TinyHTTPd (Tiny HyperText Transfer Protocol Daemon) are open source. Zeus and MS IIS (Internet Information Server) are commercial products. The feeble security of IIS makes it unusable in a commercial environment.

My examples will use Apache as the Web server. Web clients that are browsers have names such as Opera, Netscape and Explorer. Of course, you can roll your own non-browser Web client. We'll do this below.

URI = URL + URN

You'll notice the three letters I, L and N are in alphabetical order. That's the way to remember this formula.

U => Uniform
R => Resource
I => Identifier
L => Locator
N => Name

Web Server Start Up

When a Web server starts running, these are the basic steps taken:

  • Read configuration file. With Apache, you can use Perl to analyze certain parts of the config file. With Apache, you can even use Perl inside the config file
  • Start subprocesses (depending on platform)
  • Become quiescent, i.e. wait for requests from Web clients

It doesn't matter which Web server you are using, and it doesn't matter if the Web server is running under Unix or Windows or any other OS. These principles will apply.

Web Server Request Loop

The Web server request loop, simplified (as always), has several steps:

  • Accept request from Web client
  • Process request. This means one of the following:
    • Read a disk file containing an HTML page (the file = the page = the response)
    • Run a script (its output = the response)
    • Service the request using code within the Web server. For example, if you request http://127.0.0.1/server-info from Apache, Apache fabricates the response itself
  • The script fabricates an HTML page and writes it to STDOUT. The Web server captures STDOUT. This output is the body of the response. The script exits
  • Send response body, wrapped in the appropriate headers, to the Web client
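The "run a script" case can be sketched as follows. This is a minimal illustration rather than code from the original article, and the sub name build_response is made up. Under CGI the script writes at least a Content-Type header line and a blank line ahead of the page; the Web server adds the remaining headers before sending the response to the Web client.

```perl
use strict;
use warnings;

# Build a complete CGI response: a header line, a blank line, then the page.
# The arguments are assumed to be trusted literals, so nothing is escaped.
sub build_response {
    my ($title, $body) = @_;
    my $html = "<html><head><title>$title</title></head>"
             . "<body><p>$body</p></body></html>\n";
    return "Content-Type: text/html\r\n\r\n" . $html;
}

# The Web server captures whatever the script writes to STDOUT.
print build_response('Hello', 'Generated by a script');
```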

Pictorially, we have an infinity symbol, i.e. a figure eight on its side:

+------+  1 -Request--->  +------+  2 -Action-->  +------+
| Web  |  (URI or Submit) | Web  |  (script.pl)   | Perl |
|Client|                  |Server|                |Script|
+------+  <--Response- 4  +------+  <---HTML-- 3  +------+
         (Header and HTML)    (Plain page or CGI form)

Things to note:

  1. The interaction starts from the Web client
  2. The interaction is a round trip
  3. The Web client uses the HyperText Transfer Protocol to format Request 1:
    • The URI will be sent as text in the message
    • The data from a submitted form will be sent using the CGI protocol
    • In both cases, the HTTP will be used to generate an envelope wrapped around the message content
    In reality, of course, messages 1 and 4 will be wrapped in TCP/IP envelopes, and messages 2 and 3 will be mediated (handled) by the OS.
  4. Action 2 is a request from the Web server to the operating system to load and run a script. Many issues arise here. A brief summary:
    • Does the Web server have permission to run this script? After all, the Web server is a program, so it means it was loaded and run by some user, often a special user called ``nobody.'' So does this ``nobody'' have permission to run script.pl?
    • Does this particular Web client have permission? The Web server will check directory-access permissions and may have to ask the Web client for a username and a password before proceeding
    • Does the script have permission to read/write whatever directories it needs to do its work? For instance, to put a Web front-end on CVS requires that ``nobody'' have read access to the source code repository, or that the script opens a socket to another script that can access the repository
  5. Action 3 is a stream of HTML output by the script and is captured by the Web server.
  6. Response 4 is the output from the script wrapped in an envelope of headers according to the HTTP.
  7. The Web client cannot see the source code of the script, only the output of the script. If the Web client, e.g. a browser, offers to download the script and pops up a dialog box asking for the name of a file to save the script in, then the Web server clearly did not execute the script. This means the Web server is misconfigured.
  8. If the first execution of the script outputs a CGI form, then when the Web client submits that form, the script is rerun to process the form's data. That's right, the script would normally be run twice. In other words, the first time the script runs it sees it has no input data, so it outputs an empty form. The second time it runs it sees it has input data, so it reads and processes that data. Yes, they could be two separate scripts. When the form is output, the ``action'' clause specifies the name of the script that the Web server will run to process the form's data.
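The "run twice" behaviour in the last point can be sketched like this. It is an illustration under assumptions, not the author's code: the sub name respond and the single field called name are hypothetical, and in a real script the fields would be read from the request (for example with the CGI module) rather than passed in directly.

```perl
use strict;
use warnings;

# Decide what to send back, given the submitted form fields (possibly none).
sub respond {
    my (%field) = @_;

    # First run: no input data, so output an empty form. The action
    # names the script that will process the submission (here, itself).
    if (!%field) {
        return "<form method='POST' action='/cgi-bin/enrol.pl'>"
             . "Name: <input type='text' name='name'>"
             . "<input type='submit' value='Enrol'>"
             . "</form>\n";
    }

    # Second run: input data is present, so process it.
    my $name = defined $field{name} ? $field{name} : '';
    $name =~ s/[<>&"']//g;    # crude: drop HTML metacharacters
    return "<p>Thank you, $name.</p>\n";
}

print respond();                   # what the first request would see
print respond(name => 'Alice');    # what the form submission would see
```

The two calls stand for two separate processes; nothing is shared between them except what travels in the form.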

Web Server Directory Structure

But how does the Web server know which page to return or which script to run? To answer this we next look at the directory structure on the Web server's machine.

Below, Monash and Rusden are the names of university campuses.

monash.edu and rusden.edu will be listed under the ``Virtual Hosts'' part of httpd.conf, or, if you are running MS Windows NT/2k, they can be named in the file C:\WinNT\System32\Drivers\Etc\Hosts. Under other versions of Windows, the hosts file will be C:\Windows\Hosts.

And a warning about the NT version of this file: Windows Explorer will lie to you about the attributes of this file. You will have to log off as any user and log on as the administrator to be able to save edits into this file.

See http://savage.net.au/Perl/Html/configure-apache.html for details.

Assume this directory structure:

        - D:\
        -    www\
        -        cgi-bin\
        -            x.pl
        -        conf\
        -            httpd.conf
        -        public\
        -            index.html
        -            monash\
        -                index.html
        -            monash\staff
        -                mug-shots.html
        -            rusden\
        -                index.html
        -            rusden\staff
        -                courses.html

Note:
• D:\www\cgi-bin   Contents can be executed by the Web server but not viewed by Web clients

• D:\www\conf   Contents invisible to Web clients

• D:\www\public   Contents can be viewed by Web clients

Web Server Configuration

Now, the Web server can be told, via its configuration file httpd.conf, that:

  • Web client requests using http://monash.edu/ are directed to D:\www\public\monash\.
    Hence, a request for http://monash.edu/staff/mug-shots.html returns the disk file D:\www\public\monash\staff\mug-shots.html
  • Web client requests using http://rusden.edu/ are directed to D:\www\public\rusden\.
    Hence a request for http://rusden.edu/staff/courses.html returns the disk file D:\www\public\rusden\staff\courses.html
  • Web client requests using http://monash.edu/cgi-bin/ are directed to D:\www\cgi-bin
  • Web client requests using http://rusden.edu/cgi-bin/ are directed to D:\www\cgi-bin

Did you notice that both virtual hosts use D:\www\cgi-bin?

 ================================================================
 These two hosts have their own document trees, but share scripts
 ================================================================

We can service any number of virtual hosts with only one copy of each script. This is a huge maintenance savings.
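The mapping that httpd.conf expresses can be written out as a small Perl sub. This is only a sketch of the idea: Apache does this work itself, and the names %DOCUMENT_ROOT, $CGI_BIN and resolve are invented for the illustration.

```perl
use strict;
use warnings;

# Each virtual host has its own document tree ...
my %DOCUMENT_ROOT = (
    'monash.edu' => 'D:\www\public\monash',
    'rusden.edu' => 'D:\www\public\rusden',
);

# ... but every host shares one script directory.
my $CGI_BIN = 'D:\www\cgi-bin';

# Map a host name and a URI path to a disk file name.
# Returns nothing if the host is not one of ours.
sub resolve {
    my ($host, $path) = @_;
    return unless exists $DOCUMENT_ROOT{$host};

    my $base = $DOCUMENT_ROOT{$host};
    if ($path =~ s{^/cgi-bin/}{/}) {
        $base = $CGI_BIN;
    }
    (my $tail = $path) =~ tr{/}{\\};    # URI separators to Windows separators
    return $base . $tail;
}

print resolve('monash.edu', '/staff/mug-shots.html'), "\n";
print resolve('rusden.edu', '/cgi-bin/x.pl'), "\n";
```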

This is the information available to the Web server when a request comes in from a Web client. So, now let's look at the client side of things.

A Perl Web Client

Here is a real, live, complete, Perl Web Client that is obviously not a browser:

#!/usr/bin/perl
use LWP::Simple;
print get('http://savage.net.au/index.html');

Yes, folks, that's it. The work is managed by the Perl module ``LWP::Simple,'' and is available thru the command ``get,'' which that module exports, i.e. makes public so it can be used in scripts like this one. LWP stands for Library for WWW in Perl.

This code runs identically, from the command line, under Windows and Linux. The output is ``print''ed to the screen, but not formatted according to the HTML.

It's now time to step thru the Web server-Web client interaction.

Web Client Requests

When you type something such as ``rusden.edu'' into the browser's address field, or pass that string to a Web client, and hit Go, here's an example of what could happen:

  • The Web client says ``You're lazy,'' and prepends the default protocol to the string, resulting in ``http://rusden.edu''
  • The Web client says ``You're lazy,'' and appends the default directory to the string, resulting in ``http://rusden.edu/''
  • The Web client sends this to the Web server with some headers. This is the all-important ``Request'' (see Web Server Request Loop)
  • The Web server parses it and, using its configuration data, determines which disk directory, if any, this maps to. I say ``if any'' because it may refer to a virtual, or non-existent, directory
  • If the client asks for a directory, this would normally be converted (by the Web server) into a request for a directory listing, or a default file, such as /index.html
  • If the client asks for a script to be run, the request is processed as described above. Of course, the client may not even know that a script is being run to service the request
  • The Web server determines whether you have enough permission to access files in this directory
  • If so, the Web server reads this disk file into memory or runs the script, and sends the result to the Web client with the appropriate headers. This is the all-important ``Response''

In reality, processing the request and manufacturing the response can be complex procedures.
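The first two steps, where the Web client fills in what the lazy user left out, might look like the following. This is a simplified sketch with a made-up sub name; real browsers apply many more rules.

```perl
use strict;
use warnings;

# Turn what the user typed into a full URI:
# prepend the default protocol, then append the default directory.
sub normalise_address {
    my ($typed) = @_;
    my $uri = $typed;
    $uri = "http://$uri" unless $uri =~ m{^[a-z][a-z0-9+.-]*://}i;
    $uri .= '/'          unless $uri =~ m{^[a-z][a-z0-9+.-]*://[^/]+/}i;
    return $uri;
}

print normalise_address('rusden.edu'), "\n";    # prints "http://rusden.edu/"
```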

Web Pages

There are two types of Web pages sent to Web clients:

  • Those that contain passive text, which the Web client (or human operating a browser) can do no more than look at
  • Those that contain active text, i.e. CGI forms, in that the Web client (or human) can fill in data entry fields and then submit the form's data back to the Web server for processing by a script. In such cases, the form must contain a submit button of some type. You can use a clickable image as a submit button, or you may use a standard submit button, whose appearance has perhaps been transformed by a cascading style sheet, as the thing to click.

Action = Script

If you view the source of such a form, you will always find text like:

<form method='POST' action='./script.pl'
 enctype='application/x-www-form-urlencoded'>

The ``action'' part tells the Web server, when the form is submitted, which script to run to process the form's data.

The Web server asks the operating system to load and run the script, and then it (the Web server) passes the data (from the form) to the script. The script processes the data and outputs a response (which would normally be another form).

Warning

I've used './script.pl' to indicate the script is in the ``current'' directory, but be warned, the CGI protocol does not specify what the current directory is at any time.

In fact, it does not even specify that any current directory exists. Your scripts must, at all times, know exactly where they are and what they are doing.

Remember, this ``action'' is taking place inside (i.e. from the point of view of) the Web server.
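One way for a script to know where it is, without trusting the current directory, is to ask the core FindBin module for the directory that holds the script. The sketch below assumes a data file called enrol.dat living beside the script; that file name and the sub name are hypothetical.

```perl
use strict;
use warnings;
use FindBin qw($Bin);      # directory containing the running script
use File::Spec;

# Build an absolute path to a file that lives beside the script,
# instead of relying on whatever the current directory happens to be.
sub beside_script {
    my ($name) = @_;
    return File::Spec->catfile($Bin, $name);
}

print beside_script('enrol.dat'), "\n";
```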

Web Page Content

Web pages usually contain data in a combination of languages:

  • Text: Display this text
  • Image references: Display this image
  • HTML: Format the text and images. HTML is a ``rendering'' language
  • XML: Echo and describe the text, e.g. to ``data mining'' page crawlers
  • JavaScript, for these reasons:
    • Create special effects (trivial)
    • Validate form input (important)

Yes, scripts can output scripts! Specifically, scripts can output Web pages containing JavaScript, etc. There's even a Perl interface to Macromedia's Flash. Where I work, some salesmen are obsessed with Flash, because it's all they understand of the software we write :-(. In Flash's defense, you'd have to say it's too trivial to have pretensions.

JavaScript

As a Perl aficionado, you may be tempted to look down on JavaScript, but you shouldn't. It really does have its uses.

When a page contains JavaScript to validate form input, this means quite a savings for the Web client.

Without the JavaScript here's what would happen (call this ``overhead''):

  • The form's data would have to be sent to the Web server. This means one trip across the Internet
  • The Web server would have to run the script that will validate the data
  • The Web server would have to pass the data to the script
  • The script would have to read and parse the data
  • The script would have to validate the data
  • The script would have to send a response to the Web server
  • The Web server would have to send a response to the Web client. This means a second trip across the Internet

All of this takes time. When the JavaScript runs, it runs inside the Web client, e.g. browser, so the Web client receives a response much faster.

Of course, complex validation often requires access to databases and so on, so sometimes there is no escape from ``overhead.''
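The server-side step in that list, where the script validates the data, might look like the following sketch. The field names and the rules are invented for the example.

```perl
use strict;
use warnings;

# Return a list of error messages for the submitted fields; an empty
# list means the data passed these (purely illustrative) checks.
sub validate_enrolment {
    my (%field) = @_;
    my @errors;

    push @errors, 'Name is required'
        unless defined $field{name} && $field{name} =~ /\S/;
    push @errors, 'Student number must be 8 digits'
        unless defined $field{student_no} && $field{student_no} =~ /\A\d{8}\z/;

    return @errors;
}

my @errors = validate_enrolment(name => 'Alice', student_no => '1234');
print "$_\n" for @errors;    # prints "Student number must be 8 digits"
```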

For example, where I work we noticed some pages were appearing quite slowly, and I tracked it down to 3.6Mb (yes!!!) of JavaScript in some pages that was being used to stop inputting of duplicate data. Naturally, this JavaScript was being created by a Perl program :-).

Digression: HTML 'v' XML

As an aside, here's how HTML compares to XML. HTML is a rendering language. It indicates how the data is to be displayed. XML is a meta-language. It indicates the meaning of the data.

Examples:

HTML: '<h1>25</h1>' tells you how 25 should look, but not what it is. In other words, '<h1>' is a command, telling a Web client how to display what follows.

HTML: '<th>Temperature</th><td>25</td>' tells you how to align the 25, but not what it is.

XML: '<temperature>25</temperature>' tells you what 25 is. '<temperature>' is not a command.

XML: '<street_number>25</street_number>' tells you what 25 is.

Hmmm. This would make a marvellous exam question.

Reaction: A Tale of 2 Scripts

So, what happens when a Web client requests that a Web server run a script?

To answer this, let's look at a Web client request for a script-generated form, and how that request is processed.

In fact, the Web client is saying to the Web server: ``Pretty please, run _your_ script on _my_ data.'' Let's go through the procedure:

  • The Web client sends the URI 'http://rusden.edu/cgi-bin/enrol.pl'. This is script # 1
  • The Web server executes the script (# 1), captures its output and sends the output -- the form -- back to the Web client. Script # 1 knows what to output, because it sees that it has no input data from a CGI form
  • The script (# 1) terminates. It is finished, completed, done, gone forever. Trust me: I'm a programmer ...
  • The Web client renders the Web page
  • The Web client fills in the form and submits it. Being a form, it must contain an ``action'' clause naming a script (# 2). Perhaps script # 1 is the same as script # 2
  • The Web server executes the script (# 2), which processes the data. This invocation of script # 2 is independent of the prior invocation of script # 1, even if they are the same script. The Web server executes two separate processes, scripts # 1 and # 2. Script # 2 knows what to do because it sees that it has input data from a CGI form
  • And so on ... Script # 2 may issue another form, in order to continue the interaction

You can see the problem. How does script # 2 know what ``state'' script # 1 got up to?

Maintaining State

The problem of maintaining state is a big problem. Chapter 5 in ``Writing Apache Modules in Perl and C'' is called ``Maintaining State,'' and is dedicated to this problem. See ``Resources,'' below.

A few alternatives, and a simple discussion of possible drawbacks:

  • Send data to the Web client as ``hidden fields'' to be returned with the form data
    Drawback: A person can simply use the browser's ``View Source'' command to see the values. Hidden simply means that these fields are not rendered on the screen. There is absolutely no security in hidden fields.
  • Save state in cookies

    Drawback: The Web client may have disabled cookies. Some banks do this under the false assumption that cookies can contain viruses.

    Drawback: If the cookie is written to disk by the Web client, the text in the cookie must be encrypted if you want to stop people looking at it or changing it.

  • Save state in Web server memory

    Drawback: The data is in the memory of one process, and when the Web client logs back in (i.e. submits the form data) it may be connected to a different process, i.e. a copy of the process that sent the first response, and this copy will not have access to the memory of the first process.

  • Save state in the URI itself, e.g. as a session ID. Here's how: Generate a random number. Write the data into a database using the random number as the key. Send the random number to the Web client to be returned with the form data.

    Drawback: You can't just use the operating system's random-number generator, because anyone with the same OS and compiler could predict the numbers; they aren't truly random.

    Drawback: Relative URIs no longer work correctly. However, help is at hand with Perl module Apache::StripSession.

    Drawback: Under some circumstances it is possible for the session ID to ``leak'' to other sites.

  • Write the data to a temporary file

    Drawback: How does script # 2 know the name of this file created by script # 1? It's simple if they are the same script, but they don't have to be.

    Drawback: What happens if two copies of script # 1 run at the same time?

In each case, you either abandon that alternative or add complexity to overcome the drawbacks.

There is no perfect solution that satisfies all cases. You must study the alternatives, study your situation and choose a course of action.
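As one illustration, the session ID alternative could be sketched as below. Several things here are assumptions and not part of the original text: the script runs on a Unix-like system that provides /dev/urandom, the sub names are made up, and the %SESSION hash merely stands in for the database table (kept in memory like this, it would suffer from the ``Web server memory'' drawback listed above).

```perl
use strict;
use warnings;

my %SESSION;    # stands in for the database table keyed by session ID

# Read $bytes bytes from the kernel's random source and hex-encode them.
sub new_session_id {
    my ($bytes) = @_;
    open my $fh, '<:raw', '/dev/urandom'
        or die "Cannot open /dev/urandom: $!";
    my $got = read $fh, my $buffer, $bytes;
    close $fh;
    die "Short read from /dev/urandom" unless defined $got && $got == $bytes;
    return unpack 'H*', $buffer;
}

# Script # 1 saves its state and sends the ID to the Web client.
sub save_state {
    my ($state) = @_;
    my $id = new_session_id(16);
    $SESSION{$id} = $state;
    return $id;
}

# Script # 2 receives the ID back with the form data and looks it up.
sub load_state {
    my ($id) = @_;
    return $SESSION{$id};
}

my $id = save_state({ step => 2, name => 'Alice' });
print "session=$id step=", load_state($id)->{step}, "\n";
```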

Combining Perl and HTML

There are three basic ways to do this:

  1. Put the code inside the HTML. Many Perl packages take this approach, e.g.: Apache::ASP, Apache::EmbPerl (EmbeddedPerl), Apache::EP (another embedded Perl), Apache::ePerl (yet another embedded Perl), Template::Toolkit (embed a mini non-Perl language in the HTML).

    In each case, you need an interpreter to read the combined HTML/Perl/Other and to output pure HTML. In such cases, the interpreter will act as a filter.

  2. Put the HTML inside the code. This is just the reverse of (1). Thus (tested code):
    #!/usr/bin/perl
    use integer;
    use strict;
    use warnings;
    use CGI;
    my($q) = CGI -> new();
    print   $q -> header,
      $q -> start_html(),
      'Weather Report',
      $q -> br(),
      $q -> table
      (
       $q -> Tr
       ([
        $q -> th('Temperature') . $q -> td(25)
       ])
      ),
      $q -> end_html();
    
  3. Put the HTML, or XML, or whatever, in a file external to the script. In this case your script will act as a filter. Your script reads this file and looks for special strings in the file that it replaces with HTML generated by, say, reading a database and formatting the output. In other words, the external file contains a combination of:
    • HTML, which your script simply copies to its output stream
    • HTML comments, like <!-- Some command -->, which your script cuts out and replaces with the results of processing that command
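The third approach can be sketched in a few lines. This is an illustration only: the sub name fill_template and the temperature command are invented, and the template is given as a string instead of being read from an external file.

```perl
use strict;
use warnings;

# Copy the template to the output, replacing each <!-- command -->
# comment with the result of running that command. Comments that do
# not name a known command are copied through unchanged.
sub fill_template {
    my ($template, $command) = @_;
    $template =~ s{(<!--\s*(.*?)\s*-->)}{
        exists $command->{$2} ? $command->{$2}->() : $1
    }ges;
    return $template;
}

my $page = fill_template(
    "<h1>Weather Report</h1>\n<p><!-- temperature --></p>\n",
    { temperature => sub { return 25 } },
);
print $page;
```

In a real script the command handlers would typically read a database and format the rows as HTML.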

A Detour - SDF

If you head over to SDF - Simple Document Format, you'll see an example of the third way. SDF is, of course, a Perl-based open-source answer to PDF.

SDF is also available from CPAN: http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html.

SDF converts text files into various specific formats. SDF can output, directly or via other software, into these formats: HTML, PostScript, PDF, man pages, POD, LaTeX, SGML, MIMS HTX and F6 help, MIF, RTF, Windows help and plain text.

Inside a Script: Who's Calling?

A script can ask the Web server the URI used to fire off the script.

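The Web server puts this information into the environment of the script under the name HTTP_REFERER (yes, misspelling included for free). Strictly speaking, HTTP_REFERER holds the URI of the page from which the request was made; when a form posts back to the script that generated it, that is the script's own URI.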

So, as a script, I can say I was called by one of:

  • http://monash.edu/cgi-bin/enrol.pl
  • http://rusden.edu/cgi-bin/enrol.pl

Now, either ``monash.edu'' or ``rusden.edu'' is just the value of a string in the script, and so the script can use this string as a key into a database. In fact, this part of the URI is also in the environment, separately, under the name HTTP_HOST.

From a database table, or any number of tables, the script can retrieve data specific to the host. This, in turn, means the script can change its behavior depending on the URI used to run it.
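A script might pick up the host and change its behavior roughly as follows. This is a sketch under assumptions: the %GREETING hash stands in for a database table, and the sub names are hypothetical.

```perl
use strict;
use warnings;

# Per-host settings; in a real script these would come from a database.
my %GREETING = (
    'monash.edu' => 'Welcome to Monash',
    'rusden.edu' => 'Welcome to Rusden',
);

# Reduce an HTTP_HOST value such as "Monash.edu:80" to a lookup key.
sub host_key {
    my ($http_host) = @_;
    return '' unless defined $http_host;
    (my $key = lc $http_host) =~ s/:\d+\z//;
    return $key;
}

# Choose the greeting for the host used to run the script.
sub greeting_for {
    my ($http_host) = @_;
    my $key = host_key($http_host);
    return exists $GREETING{$key} ? $GREETING{$key} : 'Welcome';
}

print greeting_for($ENV{HTTP_HOST}), "\n";
```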

Data Per URI - Page Design

The open-source database MySQL has a reserved table called ``hosts,'' so I'll start using the word ``domain.'' Given a domain, I can turn that into a number that can be used as an index into a database table.

Here is a sample ``domain'' table:

 +=============+=======+
 |             |  URI  |
 | domain_name | index |
 +=============+=======+
 | monash.edu  |   4   |
 +=============+=======+
 | rusden.edu  |   6   |
 +=============+=======+

And here is a sample Web page ``design'' table:

 +=======+
 +  URI  +===============+===========+===================+
 | index | template_name | bkg_color | location_of_links |...
 +=======+===============+===========+===================+
 |   4   |   dark blue   |   cream   |   down the left   |...
 +=======+===============+===========+===================+
 |   6   |   pale green  |  an image | across the bottom |...
 +=======+===============+===========+===================+

Data per URI - Page Content

Here is a sample Web page ``content'' table:

 +=======+
 +  URI  +================+================+
 | index | News headlines |    Weather     |...
 +=======+================+================+
 |   4   |        -       | www.bom.gov.au |...
 +=======+================+================+
 |   6   | www.f2.com.au  | www.bom.gov.au |...
 +=======+================+================+

f2 => Fairfax, the publisher of ``The Age'' newspaper.

bom => Bureau of Meteorology.

Data Per URI - Page Content Revisited

Let me give a more commercial example. Here we chain tables:

 ProductMap table:
 +=======+
 +  URI  +==============+============+
 | index |   Products   | product_id |
 +=======+==============+============+
 |   4   | Motherboards |     1      |
 +=======+==============+============+
 |   4   | Printers     |     2      |
 +=======+==============+============+
 |   4   | CD Writers   |     3      |
 +=======+==============+============+
 |   6   | CD Writers   |     4      |
 +=======+==============+============+
 |   6   | Zip Drives   |     5      |
 +=======+==============+============+

 Product table:
 +============+=============+
 | product_id |    Brands   |
 +============+=============+
 |     1      | Gigabyte X1 |
 +============+=============+
 |     1      | Gigabyte X2 |
 +============+=============+
 |     1      |   Intel A   |
 +============+=============+
 |     1      |   Intel B   |
 +============+=============+
 |     :      |      :      |
 +============+=============+
 |     5      |    Sony     |
 +============+=============+

Hence a list of products for a given URI, i.e. a given shop, can be turned into an HTML table and inserted into the outgoing Web page.
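
As a rough sketch of that last step, the chained lookup and the HTML table might look like the following. The arrays @product_map and @product are hypothetical in-memory copies of the rows shown above (the elided rows are left out, so some products list no brands), and the sub names products_for, escape_html and html_table are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical in-memory copies of the rows shown in the sample tables.
my @product_map = (
    { uri_index => 4, product => 'Motherboards', product_id => 1 },
    { uri_index => 4, product => 'Printers',     product_id => 2 },
    { uri_index => 4, product => 'CD Writers',   product_id => 3 },
    { uri_index => 6, product => 'CD Writers',   product_id => 4 },
    { uri_index => 6, product => 'Zip Drives',   product_id => 5 },
);

my @product = (
    { product_id => 1, brand => 'Gigabyte X1' },
    { product_id => 1, brand => 'Gigabyte X2' },
    { product_id => 1, brand => 'Intel A' },
    { product_id => 1, brand => 'Intel B' },
    { product_id => 5, brand => 'Sony' },
);

# Chain the tables: URI index -> products -> brands.
# Returns a list of [ product name, comma-separated brands ] pairs.
sub products_for {
    my ($uri_index) = @_;
    my @rows;
    for my $map (grep { $_->{uri_index} == $uri_index } @product_map) {
        my @brands = map  { $_->{brand} }
                     grep { $_->{product_id} == $map->{product_id} } @product;
        push @rows, [ $map->{product}, join(', ', @brands) ];
    }
    return @rows;
}

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn the rows into an HTML table ready to insert into the outgoing page.
sub html_table {
    my @rows = @_;
    my $html = "<table>\n";
    for my $row (@rows) {
        my ($product, $brands) = map { escape_html($_) } @$row;
        $html .= "<tr><td>$product</td><td>$brands</td></tr>\n";
    }
    return $html . "</table>\n";
}

print html_table(products_for(4));
```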

Resources

This Week on p5p 2001/06/25

This week on perl5-porters (18 June--25 June 2001)

Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

This was a fairly busy week, seeing just under 500 messages.

Regular expressions

There were a couple of regular expression threads this week.

Heikki Lehvaslaiho found a bug with some end-anchored regular expressions when using study, which involved the regular expression engine looping. The problem was quickly found and patched by Hugo.

Jeff Pinyan almost presented a patch for regex negation, before being kindly asked not to forget Unicode.

Artur Bergman and Richard Soderberg started to move regular expressions from the optree to the pad so that they can be redefined under USEITHREADS. They also patched perl to allow /o to work under ithreads.

Jeffrey Friedl asked about applying regular expressions backwards (from right to left):

	$text =~ s/^\s+//;  # strip leading whitespace
	$text =~ s/^\s+//r; # strip trailing whitespace (proposed right-to-left flag)

... and Jeff Pinyan plugged his sexeger research.
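
For readers who have not met it, the sexeger idea, as I understand it, is to reverse the string, apply a forward pattern to the reversed text, and reverse the result back. A minimal sketch of stripping trailing whitespace that way follows; the sub name rtrim_sexeger is made up for this example and is not taken from the thread.

```perl
use strict;
use warnings;

# Strip trailing whitespace by working on the reversed string, so that the
# pattern can stay anchored at the start.
sub rtrim_sexeger {
    my ($text) = @_;
    my $reversed = reverse $text;    # scalar context: reverses the characters
    $reversed =~ s/^\s+//;           # "leading" whitespace of the reversed text
    return scalar reverse $reversed;
}

print '[', rtrim_sexeger("  some text \t\n"), "]\n";
```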

5.7.2 in sight

Jarkko informed us that the latest snapshot, perl@10825, would be the last snapshot before the next Perl development release of 5.7.2. Snapshots after that would be release candidates, leading on to the next Perl stable release of 5.8.0:

So now it's definitely time to remind me of any patches that I might have missed. If you have a bug fixing going on please inform your loved ones that you'll camp at the office / by the computer for coming few weeks.

rsync-able snapshots

Jarkko also announced a way to rsync his development snapshots. These stay away from the bleading-edge part of bleadperl, so that you don't report a bug which is in the midst of getting patched (or created). They are available via:

	mkdir -p perl;cd perl && rsync -av ftp.funet.fi::perlsnap .

overloading s?printf

Jarkko Hietaniemi asked for opinions on a way to use formatted printing for objects.

For example, it would be nice to be able to use, say, %Z to format both the real and imaginary parts of Math::Complex objects at the same time, say %7.3Z -- or even tamper with %f %g %e so that they would recognize that their arguments are Math::Complexes.

Some options are: using overload to pass formats via coderefs, or inventing a whole new mini language.
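
Neither option is implemented in the sketch below. It only shows the effect being asked for, one numeric format applied to both parts of a complex number, using an ordinary method plus string overloading. The class My::Complex and its as_string method are hypothetical and are not part of Math::Complex.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a complex-number class; not Math::Complex itself.
package My::Complex;

# Plain interpolation ("$z") falls back to the %g format.
use overload '""' => sub { $_[0]->as_string('%g') };

sub new {
    my ($class, $re, $im) = @_;
    return bless { re => $re, im => $im }, $class;
}

# Apply one sprintf-style numeric format to the real and imaginary parts.
sub as_string {
    my ($self, $fmt) = @_;
    my $sign = $self->{im} < 0 ? '-' : '+';
    return sprintf($fmt . '%s' . $fmt . 'i',
                   $self->{re}, $sign, abs($self->{im}));
}

package main;

my $z = My::Complex->new(1.5, -2.25);
print $z->as_string('%7.3f'), "\n";    # same width and precision for both parts
print "$z\n";                          # uses the overloaded stringification
```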

Various

Nicholas Clark reported a problem with 64 bit int support under x86 FreeBSD and with gcc's optimiser enabled. This was tracked down to a patch (10417), but no solution was offered other than patch retraction.

Mike Guy commented that not_a_number in sv.c didn't grok UTF8. This got added to the UTF8 TODO list, as a request for a function sv_printify for displaying PVs with control characters, embedded nulls, and Unicode.

Spider Boardman was awarded the knightship of the Wielder of the Holy Cast, for fixing casting warnings under HP-UX cc.

There was some talk about compiling Perl with the new, pickier -Wall that comes with the new gcc 3.0.

Artur Bergman raised an interesting question about what to do with the seed for rand when threads are involved. Do we want srand() and rand() to be performed on a single global seed variable or a "per-thread" seed variable? Jarkko suggested that maybe it wouldn't be such a bad idea to have our own PRNG implementation.

Nikola Knezevic asked why configure.com does not work under DOS; it is in fact the configure script for VMS. It now sports a comment reflecting this.

Abhijit Menon-Sen's macro cleanups last week were backed out for performance reasons, but with the macros left in as comments to aid grepping.

Doug MacEachern, amongst other sterling patches, offered a patch to make "make Foo.s" add -S to the cc flags, which outputs the assembly code to Foo.s.

Laszlo Molnar provided some patches to get bleadperl working under djgpp.

Abhijit Menon-Sen patched a problem spotted by Harmon S. Nine about => turning ANY bareword before it into a quoted string. The documentation is somewhat vague about what exactly is quoted by => but this is a change in behaviour since 5.6.0. The patch reverted the behaviour so that function calls that are used as hash key-values in the construction of a hash are still called.

Spider Boardman reported that PERLIO=unix broke many tests and offered some patches. Nick Ing-Simmons countered that it always has, as it makes the "stream" completely unbuffered and that isn't really very perlish. However, he fixed a reported bug in PerlIOBase_unread, which was messing with b->posn after the unread(), rather than setting it before the unread.

Peter Prymmer worried about floating point representations of numbers under OS/390 with exponents over +63.

Marcel Grunauer produced some patches for Darwin.

Ilya Zakharevich provided a slew of OS/2 patches.

There were also some minor documentation patches (some very pedantic).

Oh, and Jarkko had an off-by-one error.

Until next week I remain, your temporarily-replacing humble and obedient servant,

Leon Brocard, leon@iterative-software.com

Yet Another YAPC Report: Montreal

A year ago, at Yet Another Perl Conference North America 19100, both Perl-the-language and Perl-the-community seemed to be headed for trouble. Longtime Perl hackers spoke openly with concern at the apparent stagnation of Perl 5 development, and how the community seemed to be increasingly bogged down by acrimony and bickering. Now, a year later, the tide has already turned and the evidence is nowhere more apparent than at this year's YAPC::NA in Montreal.

The conference, produced by Yet Another Society, was a smashing success. More than 350 Perl Mongers converged from all across North America and Europe on McGill University for the three-day event. Rich Lafferty and Luc St. Louis, key organizers from the Montreal.pm group, did a brilliant job of lining everything up; and Kevin Lenzo, YAS president, once again did the crucial fund raising and promotional work to make the conference a reality.

Certain familiar faces were missing from this year's conference, including Larry himself, who was scheduled to deliver the keynote but was absent due to illness. (Get well soon, Larry!) The unenviable task of filling Larry Wall's shoes for the highly anticipated opening talk fell to the infamous Dr. Damian Conway, indentured servant to the Perl community and lecturer extraordinaire. Those of you who have seen Damian in action will have probably guessed that his presentation did not disappoint. The topic was, of course, a tour of the past and future of the Perl language -- where we have come (a long way from Perl 1 in 1987) and where we are going (a long way yet to Perl 6).

To hear Damian tell it, Perl 6 looks like it's going to be awesome. While many details are still sketchy, the intention from all quarters is to preserve all the things we like about Perl today, especially its tendency to be eclectic in its incorporation of ideas and features from other languages. Larry, Damian and others have carefully studied the lessons of such languages as Java, Python, Ruby, and even the infant C#, in the hopes of applying those lessons to Perl 6.

Damian's keynote also focused on how Perl 6 will attempt to correct some of the flaws and deficiencies of Perl 5, the details of which can be found elsewhere, so I won't reiterate them here. Additionally, he emphasized that, due to the unexpected quantity and scope of the Perl 6 RFCs, the final language design will take Larry far longer than anyone originally imagined. Damian went on to predict a usable alpha version of Perl 6 being ready by May 2002, with a full release perhaps available by October 2002. However, as pieces of the Perl 6 design stabilize, Damian and others (including our own Simon Cozens) will be implementing them in Perl 5, so that we can start playing with Perl 6 today, rather than next year.

Meanwhile, the continued enthusiasm and energy being devoted to Perl 6 has had a profound impact on the community at large that is hard to overstate. YAPC::NA 2001 was marked not merely by much discussion and speculation on Perl 6, but also by fascinating new developments in Perl. One of the downright niftiest of these new directions is Brian Ingerson's Inline.pm, which he presented in a 90-minute talk Wednesday. Inline.pm uses a form of plug-in architecture to allow seamless embedding of other languages like C, C++, Java, and, yes, Python, right into ordinary Perl scripts. Brian, who works at ActiveState, has already written a www.perl.com feature on Inline.pm, so I'll merely mention here that the module hides away the frighteningly ugly details of gluing disparate languages together, in the most intuitive way possible. This kind of development is really exciting for the ways in which it opens new doors and breeds new ideas on the many, many different kinds of intriguing things that can still be done with Perl 5. Incidentally, Brian's midlecture sing-a-longs about Perl internals and so forth were also quite well regarded.

The hubbub around Inline.pm was just one thread in the theme of "Perl as glue language for the 21st Century," a theme visited and revisited at many times and places throughout the conference. The notion was raised again by Perl 6 project manager Nathan Torkington in his presentation at Adam "Ziggy" Turoff's Perl Apprenticeship Workshop on Wednesday. Amidst the announcement of many interesting and valuable projects in need of Perl hackers, Gnat issued a call to "make a Python friend" and collaborate with them on a development project. "Show them we're not *all* evil!" he insisted, in marked contrast to his howlingly funny diatribe on Python at the previous year's Lightning Talks.

"I was surprised that Gnat took that approach, because I thought I would be left this year to argue the other side," ActiveState's Neil Kandalgaonkar observed, after giving his Friday morning talk on "Programming Parrot," so named for its case study in getting Perl and Python applications to work in concert. Part of Neil's tale of success lay in using Web services to get different processes running in different languages on different machines to exchange data reliably. "All it took was an extra four lines of scripting in each language, and I was done," he noted, driving home the importance of using and extending Perl's ability to talk to other languages and applications.

Meanwhile, YAPC North America 2001 also showed growth in the depth and scope of the conference's offerings. In contrast to previous years, where talks were largely aimed at beginner and intermediate Perl hackers, this year's presentations covered some more advanced topics, such as Nat Torkington's three-hour lesson on the Perl internals that he delivered to a packed house Thursday. Originally written by Simon Cozens (who was unable to attend), the Perl internals class presented a concise introduction to some of Perl's inner workings, furthering the Perl community's expressed goal of lowering the barrier of entry to internals hacking and encouraging wider participation in Perl core development. Later in the day, Michael Schwern addressed the ever-present tendency of Perl hackers to rely on Perl's forgiving nature in his rather well-attended talk on "Disciplined Programming, or, How to Be Lazy without Trying." "Always code Perl as if you were writing for CPAN," Schwern urged his audience. "Document and test as you go, and release working versions often."

Speaking of which, the Comprehensive Perl Archive Network was also a major topic of discussion at YAPC. "The CPAN is Perl's killer app," Gnat said at one point. "No other language has anything like it." Neil and Brian gave a short presentation on their experiences building and maintaining ActiveState's PPM repository, a collection of binary distributions of CPAN modules. The dynamic duo from Vancouver yielded some of their time to Schwern to allow him to discuss his proposed CPANTS project, intended to automate testing and quality verification of modules in the repository. Metadata, rating systems, trust metrics and peer-to-peer distribution models were all touched on. Based on the buzz this year, it seems reasonable to predict that many new and exciting things are likely to grow up around the CPAN, and around the possibilities inherent in the distribution of Perl modules, in the not-too-distant future.

The final talk Thursday was once more delivered by Damian Conway, and curiosity had spread far and wide on how he might top last year's now-legendary presentation on Quantum::Superpositions. This year's Thursday afternoon plenary lecture was merely titled, "Life, the Universe, and Everything," in homage to the late Mr. Adams; and, true to his word, Damian delivered just that. Swooping from Conway's Game of Life (no relation), to a source filter for programming Perl in Klingon (a la Perligata), to the paradox of Maxwell's Demon (conveniently dispelled with a little help from Quantum::Superpositions), Damian's talk was a masterful reflection of all the things we love about Perl: It was clever, complex, elegant, and, most of all, it was fun.

(Parenthetically, among the modules that Damian introduced at this talk was a little number called Sub::Junctive. As a linguist, I must confess it scares the living heck outta me. Look for it on the CPAN.)

Friday featured more of this year's theme of Perl as glue-language-for-the-21st-Century in two talks on Web services by Adam "Ziggy" Turoff and Nat Torkington, in which Nat issued an impassioned plea for a Perl implementation of Freenet. However, the morning's highlight was without a doubt the much-anticipated Lightning Talks. Hosted once again by the irrepressible Mark-Jason Dominus, the 90-minute series of five-minute short talks went over quite well, featuring topics ranging from how hacking Perl is like Japanese food and the graphing of IRC conversations to a call for more political action from within hackerdom and an overview of the Everything Engine. The showstopper, however, was once again Damian, who is generally reputed to be unable to hold forth on any topic for anything *less* than an hour and a half. To everyone's surprise, the Lightning Talk consisted of a hilarious argument in the grand Shakespearean style between Damian and Brian Ingerson, over the disputed authorship of Inline::Files, a nifty new module for extending the capabilities of the old DATA filehandle. Their invective-laden dialogue was the most brilliantly humorous five minutes of the entire conference, and, yes, Damian even managed to finish on time. :) If you had the misfortune not to be present, you might be lucky enough to see them have at each other again at this year's Perl Conference 5 in San Diego.

Later in the day, Nat and Damian chaired a Perl 6 status meeting, reviewing the major events in Perl 6 starting with the announcement at TPC4, and working forward to the present language design phase. "This is a fresh rebirth for Perl AND for the community," Gnat said at one point. "Everything changes." The sometimes fractious attitudes encountered on the various Perl mailing lists were discussed. "In some ways this is a meritocracy," Gnat confessed. "Write good patches and we will love you." Kirrily "Skud" Robert then spoke at length on the future of the core Perl modules, and on the need to develop guidelines to direct the process of porting them to Perl 6.

Finally, the plenary session Friday afternoon closed the conference with another presentation from, you guessed it, Damian Conway. Our Mr. Conway took the opportunity to thank Yet Another Society and its sponsors for all of the contributions that permitted him to take a year off from academia to work exclusively on Perl. He then reviewed some of the fruits of that labor to date, including NEXT.pm, Inline::Files and the brilliant Filter::Simple, all of which, it should be pointed out, are now freely available to the community.

It has been nearly a year since Jon Orwant's now-legendary coffee-mug-tossing tantrum at TPC4 touched off the decision to begin work on Perl 6. After the grueling RFC process, the endless mailing list discussions and the breathless wait to see what Larry would come through with, Perl 6 the language and Perl 6 the community finally appear to be taking shape right before our eyes. New innovations are coming along more and more often, including Larry's Apocalypses, Brian's Inline modules, all of the potential emerging from the Web services meme, the future of the CPAN, new projects like Reefknot -- including continuing projects such as POE and Mason -- and, last but not least, whatever the heck it is that Damian is working on this week.

The Yet Another Perl Conferences are evolving, as well. Although neither Larry nor Randal nor Orwant could make it this year, the turnout was nevertheless such that, no matter where you looked at the conference, there you might find someone you knew from IRC, from the mailing lists, from previous conferences or for the great work that person had done for Perl. Although I've only touched on some highlights, there were dozens of presenters at this year's conference, practically all of them had something fascinating to say, and I really wish I had more time and space to cover them all.

Finally, it's safe to say that YAPC::NA clearly defined its own existence as a growing concern of the community this year, having at last separated from its birthplace at Carnegie Mellon in Pittsburgh. Montreal, as it turns out, is a fantastic, vibrant place to hold a summer conference, with countless magnificent restaurants and bars suitable for hosting the heady after-hours carousings of the Perl community. From every report, a good time was had by nearly all, and I think we all eagerly await the next YAPC::America::North, wherever it may be held.

This Week in Perl 6 (10 - 16 June 2001)

Notes

You can subscribe to an email version of this summary by sending an empty message to perl6-digest-subscribe@netthink.co.uk.

Please send corrections and additions to bwarnock@capita.com.

The Perl 6 mailing lists saw 165 messages across 20 threads, with 31 authors contributing. Two threads (and two authors) dominated the lists this week. Other than that, traffic was noticeably lighter with YAPC going on.

Even More on Unicode and Regexes

It seems that both general relativity/quantum mechanics and linguistics/text processing are hoping superstrings can solve their respective enigmas:

We probably also ought to answer the question "How accommodating to non-latin writing systems are we going to be?" It's an uncomfortable question, but one that needs asking. Answering by Larry, probably, but definitely asking. Perl's not really language-neutral now (If you think so, go wave locales at Jarkko and see what happens... :) but all our biases are sort of implicit and un (or under) stated. I'd rather they be explicit, though I know that's got problems in and of itself.

                                        Dan

This was followed by a lengthy discussion on different languages, although Perl wasn't really one of them. It was the general conclusion that Perl needn't try to support the world's languages, but should at least leave sufficient hooks for others to do so. More will surely follow.

Multi-Dimensional Arrays and Relational Databases

Me started the other lengthy thread for this week, in a call for a multi-dimensional array syntax that would allow easy modeling of a relational database. Several folks pointed out that

  1. Me could always use Perl 6's planned replaceable syntax to define such capabilities.
  2. If Perl 5 arrays (even with only a pseudo-multi-dimensional nature to them) aren't fundamentally in sync with database access, then to make them so would require rendering them out of sync with everything else they've been used for.
  3. Array slicing, even with tying and overloading, isn't nearly as good as working with SQL (as through DBI) for this type of data manipulation.

David L. Nicol did provide an interesting digression on "tasty" variables, which are, more or less, the ultimate in lazy evaluation. Mostly, though, the discussion went completely sideways.

Core Architecture

Dan Sugalski provided a glossy on what the guts of the new interpreter will look like.

Perl 6 Emulator

Simon Cozens released a rough Perl 6 emulator.

Handler Function or VTable?

Dave Mitchell asked if a single handler function would be better than the vtable scheme currently planned. ("No," says Dan.)

Argument Decoding

Dan also asked for opinions on who should decode opcode function arguments - the functions or the dispatch loop? (Argument decoding, in this case, is the translation from a virtual register number to its memory address.) A number of trade-offs were discussed, with function decoding having a slight edge. That, in turn, led to a brief discussion on shadow functions - C functions that mirror Perl functions - and how useful they are or aren't.

Polymorphic Builtins

David Nicol lamented about not having multiple dispatch based on signatures.

YAPC

Although he didn't mention it on the mailing lists, Nathan Torkington did tell use.perl.org where to find an MP3 of Damian Conway's opening spiel on Perl 6. (Damian was speaking in place of an ill Larry Wall.) For the less imaginative, the slides can be found here in PDF format.


Bryan C. Warnock

This Week on p5p 2001/06/17

Notes

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

This was a reasonably busy week, seeing just over 400 messages.

Many thanks to Leon for taking over these summaries in my absence.

More -Wall fixes

Doug MacEachern has been furiously fixing up compiler warnings this week, tidying up ext/ so that it is less noisy under -Wall.

One major problem was that if an XSUB has no arguments and all the work is done in PP code, the helper variable ax is likely to be unused; Doug added a macro, PERL_UNUSED_VAR, to wrap around ax in such cases to prevent xsubpp creating code that produced warnings.

Nick Clark also produced a mega (141K) warning-satisfying patch.

Doug also fixed up some missing dTHXes in some of the extensions. This led to an associated change in ExtUtils/Constant.pm, Nick Clark's helper module for creating the constant subroutine in XS modules. It took a while for people to get their heads around the idea that this was a Perl module that spat out XS code - unfortunately, MakeMaker doesn't like the idea of having two .xs files in a module, so you can't have a separate file for constants; hence, you have to do evil things like modifying the XS code produced by h2xs in place. Urgh.

Miscellaneous Darwin Updates

Wilfredo Sanchez and Larry Schatzer have been our Mac OS X heroes this week. They uncovered a bunch of problems, some of which have even been fixed. Firstly, PerlIO was intercepting a warning from the locale system before it had been properly initialized, and was causing segfaults. There were also some numeric problems with INT32_MIN - Apple's stdint.h defines it as being -2147483648 instead of the more gcc-friendly -2147483647-1. As Jarkko said, the real long-term fix would be to fix Apple's header files; unfortunately, we can't do that very easily, so we have to work around it by redefining INT32_MIN on Darwin ourselves.

Hash accessor macros

Abhijit Menon-Sen asked why hv.c does things like

    register XPVHV* xhv;
    ...
    xhv = (XPVHV*)SvANY(hv);

    ... use of xhv->xhv_* ...

instead of simply

    SvFOO(hv);

which is considerably easier to grep for. He then sent seven large patches which revert to the macro behaviour. Jarkko applied the patches, but then found that this generated a bunch of new warnings, and some nasty errors on HPUX.

Doug MacEachern piped up, saying that the original code was there for reasons of performance; direct access saves a couple of indirections every time one of the xhv_ fields is accessed. He also said that the first

    xhv = (XPVHV*)SvANY(hv);

might be stored in a register, optimising further accesses. Both Mike Guy and Ilmari Karonen suggested reverting to the previous optimised format, but putting in the macro version as comments so that they can be grepped for. Spider was called in to fix up the HPUX indigestion.

Abhijit also threw in some updates to dump.c, and removed the tests in the core for "anonymous stashes" - a concept which didn't even appear to exist...

iThreads development

Artur's been busy this week with iThreads support; he's working on a number of different areas. Firstly, the continuing conversion of the Threads.pm module for iThreads. His new module will be called threads.pm and will soon appear on CPAN once he's finished writing test cases. Another tack was to make regular expressions safe in threaded environments, but that seems to have temporarily ground to a halt.

His big achievement this week was the implementation and documenting of the CLONE method, a callback like DESTROY called on all objects when a thread is cloned. This caused some brief problems on Win32 because of the way it was detecting stashes. Stashes are, of course, normal hashes, so you have to be careful to call the CLONE method from a stash, rather than do something horrible like attempt to call the CLONE entry from a hash. Nick added

     if (gv_stashpv(HvNAME(hv), 0))

as a guard, and it all worked again.

Cross-compilation

Jarkko has reported that the basics of cross-compilation support are working; he built miniperl for iPAQ by constructing an SSH tunnel, sending the binaries produced by Configure across and reading back the results. Read about it.

SV Documentation

God of the week award goes to Dave Mitchell. Over on the Perl 6 lists, I'd asked for people to look through the Perl 5 sources and see what sort of things happen to SVs, so that I could see what needed to happen for Perl 6. In doing so, he produced 800 lines of documentation and comments for sv.c. Wow.

On his travels, he asked some questions about SV macros; the answers may be interesting. Read about it.

Various

Ilya put in a couple of scary patches for OS/2, including one which made Configure self-modifying. This was considered to be a Bad Thing.

Jarkko integrated the NetWare modifications to Perl, and the Memoize module.

Sean Teague announced that the Perl Power Tools can be downloaded from CPAN as Bundle::PPT.

Chris Nandor updated File::Find and its tests to cope with MacOS; however, this consisted mainly of several iterations of

    if (macos) {
        A
    } else {
        B
    }

which Jarkko balked at. Thomas Wegner rewrote the test suite to be much more maintainable.

Mike Guy noticed that many of the internal functions don't support UTF8 strings properly; an important culprit is Perl_warner, which means every warning or error message containing UTF8 data gets mangled. He wants something which sanitizes a string for display. Any takers?

Peter Prymmer hacked h2ph to deal with C trigraphs - you remember them, the evil convention of using, say, "??=" instead of "#". Urgh. He also fixed up some of the extensions to build more happily under VMS. Craig Berry also chipped in some extension patches for VMS.
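
For the curious, ISO C defines nine trigraphs, each a "??" followed by one more character. The sketch below expands them in a single left-to-right pass; it is only an illustration of the convention, not the code that went into h2ph, and the names %trigraph and expand_trigraphs are invented here.

```perl
use strict;
use warnings;

# The nine ISO C trigraphs, keyed by the character that follows "??".
my %trigraph = (
    '=' => '#',
    '(' => '[',
    '/' => '\\',
    ')' => ']',
    "'" => '^',
    '<' => '{',
    '!' => '|',
    '>' => '}',
    '-' => '~',
);

# Replace every trigraph in a chunk of C source with the character it stands for.
sub expand_trigraphs {
    my ($source) = @_;
    $source =~ s{\?\?([=(/)'<!>-])}{$trigraph{$1}}g;
    return $source;
}

print expand_trigraphs("??=define FIRST(a) a??(0??)\n");
```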

Until next week I remain, your humble and obedient servant,


Mark-Jason Dominus

Parse::RecDescent Tutorial

The Basics

Parse::RecDescent is a combination compiler and interpreter. The language it uses can be thought of roughly as a macro language like CPP's, but the macros take no parameters. This may seem limiting, but the technique is very powerful nonetheless. Our macro language looks like this:

  macro_name : macro_body

A colon separates the macro's name and body, and the body can have any combination of explicit strings ("string, with optional spaces"), regular expressions (/typical (?=perl) expression/), or other macros defined somewhere in the source file. It can also have alternations. So, a sample source file could look like:

  startrule : day  month /\d+/ # Match strings of the form "Sat Jun 15"

  day : "Sat" | "Sun" | "Mon" | "Tue" | "Wed" | "Thu" | "Fri"

  month : "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun" |
          "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec"

Three macros make up this source file: startrule, day and month. The compiler will turn these rules into its internal representation and pass it along to the interpreter. The interpreter then takes a data file and attempts to expand the macros in startrule to match the contents of the data file.

The interpreter takes a string like "Sat Jun 15" and attempts to expand the startrule macro to match it. If it matches, the interpreter returns a true value. Otherwise, it returns undef. Some sample source may be welcome at this point:

  #!/usr/bin/perl

  use Parse::RecDescent;

  # Create and compile the source file
  my $parser = Parse::RecDescent->new(q(
    startrule : day  month /\d+/

    day : "Sat" | "Sun" | "Mon" | "Tue" | "Wed" | "Thu" | "Fri"

    month : "Jan" | "Feb" | "Mar" | "Apr" | "May" | "Jun" |
            "Jul" | "Aug" | "Sep" | "Oct" | "Nov" | "Dec"
  ));

  # Test it on sample data
  print "Valid date\n" if $parser->startrule("Thu Mar 31");
  print "Invalid date\n" unless $parser->startrule("Jun 31 2000");

Creating a new Parse::RecDescent instance is done just like any other OO module. The only parameter is a string containing the source file, or grammar. Once the compiler has done its work, the interpreter can run as many times as necessary. The sample source tests the interpreter on valid and invalid data.

By the way, just because the parser knows that the string "Sat Jun 15" is valid, it has no way of knowing if the 15th of June was indeed a Saturday. In fact, the sample grammar would also match "Sat Feb 135". The grammar describes form, not content.

Getting Data

Now, this is quite a bit of work to go to simply to match a string. However, much, much more can be done. One element missing from this picture is capturing data. So far the sample grammar can tell if a string matches a regular expression, but it can't tell us what the data it's parsed is. Well, these macros can be told to run perl code when encountered.

Perl code goes after the end of a rule, enclosed in braces. When the interpreter recognizes a macro such as startrule, the text matched is saved and passed to the perl code embedded in the grammar.

Each word or term of the macro ('day', 'month'...) is saved by the interpreter. The text matched by day is saved in the $item{day} hash entry, and the text matched by month in $item{month}. The /\d+/ term doesn't have a corresponding name, so its data comes from the @item array. $item[0] is always the rule name, so day is $item[1], month is $item[2], and /\d+/ gets saved into $item[3]. So, code to print the parsed output from our sample startrule rule looks like this:

  startrule : day month /\d+/
            { print "Day: $item{day} Month: $item{month} Date: $item[3]\n"; }

Everything in the parser is run as if it were in the Parse::RecDescent package, so when calling subroutines outside Parse::RecDescent, either qualify them with their package name, as in Package::Name::my_sub(), or subclass Parse::RecDescent.

A Mini-Language

All of the pieces are now in place to create a miniature language, compile, and run code in it. To make matters simple, the language will only have two types of instruction: Assign and Print. A sample 'Assign' instruction could look like foo = 3 + a. The 'Print' statement will look like print foo / 2. Add the fact that 3 + a can be arbitrarily long (temp = 3+a/2*4), and now you've got a non-trivial parsing problem.

The easiest instruction to implement is the 'Print' instruction. Assuming for the moment that the right-hand side of the statement (the foo / 2 part of print foo / 2) already has a rule associated with it (called 'expression'), the 'Print' instruction is very simple:

  print_instruction : /print/i expression
                    { print $item{expression}."\n" }

The 'Assign' instruction is a little harder to do, because we need to implement variables. We'll do this in a straightforward fashion, storing variable names in a hash. This will live in the main package, and for the sake of exposition we'll call it %VARIABLE. One caveat to remember is that the perl code runs inside the Parse::RecDescent package, so we'll explicitly specify the main package when writing the code.

More complex than the 'Print' instruction, the 'Assign' instruction has three parts: the variable to assign to, an "=" sign, and the expression that gets assigned to the variable. So, the instruction looks roughly like this:

  assign_instruction : VARIABLE "=" expression
                     { $main::VARIABLE{$item{VARIABLE}} = $item{expression} }

Much like we did with the dayrule rule in the last section, we'll combine the print_instruction and assign_instruction into one instruction rule. The syntax for this should be fairly simple to remember, as it uses the same | alternation character as a Perl regular expression.

  instruction : print_instruction
              | assign_instruction

In order to make the startrule expand to the instruction rule, we'd ordinarily use a rule like startrule : instruction. However, most languages let you enter more than one instruction in a source file. One way to do this would be to create a recursive rule that would look like this:

  instructions : instruction ";" instructions
               | instruction
  startrule : instructions

Input text like "print 32" expands as follows: startrule expands to instructions. instructions expands to instruction, which expands to print_instruction. Longer input text like "a = 5; b = a + 5; print a" expands like so: startrule expands to instructions. The interpreter looks ahead and chooses the alternative with the semicolon, and parses "a = 5" into its first instruction. "b = a + 5; print a" is left in instructions. This process gets repeated twice until each bit has been parsed into a separate instruction.

If the above seemed complex, Parse::RecDescent has a shortcut available. The above instructions rule can be collapsed into startrule : instruction(s). The (s) part can simply be interpreted as "one or more instructions". By itself this assumes only whitespace exists between the different instructions, but here again, Parse::RecDescent comes to the rescue, by allowing the user to specify a separator regular expression, like (s /;/). So, the startrule actually will use the (s /;/) syntax.

  startrule : instruction(s /;/)

The Expression Rule

Expressions can be anything from '0' all the way through 'a+bar*foo/300-75'. This range may seem intimidating, but we'll try to break it down into easy-to-digest pieces. Starting simply, an expression can be as simple as a single variable or integer. This would look like:

  expression : INTEGER
             | VARIABLE
             { return $main::VARIABLE{$item{VARIABLE}} }

The VARIABLE rule has one minor quirk. In order to compute the value of the expression, variables have to be given a value. In order to modify the text parsed, simply have the code return the modified text. In this case, the perl code looks up the variable in %main::VARIABLE and returns the value of the variable rather than the text.

Those two lines take care of the case of an expression with a single term. Multiple-term expressions (such as 7+5 and foo+bar/2) are a little harder to deal with. The rules for a single expression like a+7 would look roughly like:

  expression : INTEGER OP INTEGER
             | VARIABLE OP INTEGER
             | INTEGER OP VARIABLE
             | VARIABLE OP VARIABLE
  OP : m([-+*/%])   # m(...) delimiters, since "/" appears inside the character class

This introduces one new term, OP. This rule simply matches one of the binary operators: -, +, *, / or %. The above approach works for two terms, and can be extended to three terms or more, but is terribly unwieldy. If you'll remember, the expression rule is already defined as INTEGER | VARIABLE, so we can replace the right-hand term with expression. Doing so and getting rid of redundant lines results in this:

  expression : INTEGER OP expression
             | VARIABLE OP expression

We'll hand off the final evaluation to a function outside the Parse::RecDescent package. This function will simply take the @item list from the interpreter and evaluate the expression. Since the array will look like ('expression',3,'+',5), we can't simply say $item[1] $item[2] $item[3], since $item[2] is a scalar variable, not an operator. Instead we'll take the string "$item[1] $item[2] $item[3]" and evaluate that with eval, which returns the result. This then gets passed back, and becomes the value of the expression.

  expression : INTEGER OP expression
             { return main::expression(@item) }
             | VARIABLE OP expression
             { return main::expression(@item) }

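  sub expression {
    shift;                               # discard the rule name in $item[0]
    my ($lhs,$op,$rhs) = @_;
    # A left-hand side that isn't a plain integer is a variable name
    $lhs = $main::VARIABLE{$lhs} if $lhs =~ /[^-+0-9]/;
    return eval "$lhs $op $rhs";
  }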
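String eval is safe enough here because OP can only ever match one of five characters. Where that guarantee is weaker, a dispatch table does the same job without compiling code at run time. The following is a sketch of that alternative, not the article's method; the names %OPS and apply_op() are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical operator table: one anonymous sub per binary operator.
my %OPS = (
    '+' => sub { $_[0] + $_[1] },
    '-' => sub { $_[0] - $_[1] },
    '*' => sub { $_[0] * $_[1] },
    '/' => sub { $_[0] / $_[1] },
    '%' => sub { $_[0] % $_[1] },
);

# Look the operator up and apply it; unknown operators are an error.
sub apply_op {
    my ($lhs, $op, $rhs) = @_;
    my $code = $OPS{$op} or die "Unknown operator '$op'\n";
    return $code->($lhs, $rhs);
}

print apply_op(3, '+', 5), "\n";
print apply_op(2, '/', 4), "\n";
```

Inside the grammar, a function like this could be called in place of the eval line, at the cost of listing every operator twice (once in OP, once in the table).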

That completes our grammar. Testing is fairly simple. Write some code in the new language, like "a = 3 + 5; b = a + 2; print a; print b", and pass it to the $parser->startrule() method to interpret the string.

The file included with this article comes with several test samples. The grammar in the tutorial is very simple, so plenty of room to experiment remains. One simple modification is to change the INTEGER rule to account for floating point numbers. Unary operators (single-term such as sin()) can be added to the expression rule, and statements other than 'print' and 'assign' can be added easily.

Other modifications might include adding strings (some experimental extensions such as '<perl_quotelike>' may help). Changing the grammar to include parentheses and proper precedence are other possible projects.
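One reason precedence and grouping are worth the effort: because the expression rule recurses on its right-hand side, everything to the right of an operator is evaluated first. The sketch below imitates that order in plain Perl, without Parse::RecDescent, and compares it with ordinary left-to-right evaluation. The function names are invented for the illustration, and only + and - are handled.

```perl
use strict;
use warnings;

# Evaluate a flat token list the way the right-recursive rule does:
# the left operand is combined with the value of everything to its right.
sub eval_right {
    my @tok = @_;
    return $tok[0] if @tok == 1;
    my ($lhs, $op) = splice @tok, 0, 2;
    my $rhs = eval_right(@tok);
    return $op eq '-' ? $lhs - $rhs : $lhs + $rhs;
}

# Evaluate the same tokens left to right, as arithmetic normally works.
sub eval_left {
    my @tok = @_;
    my $acc = shift @tok;
    while (@tok) {
        my ($op, $rhs) = splice @tok, 0, 2;
        $acc = $op eq '-' ? $acc - $rhs : $acc + $rhs;
    }
    return $acc;
}

print eval_right(10, '-', 2, '-', 3), "\n";   # 10 - (2 - 3) = 11
print eval_left(10, '-', 2, '-', 3), "\n";    # (10 - 2) - 3 = 5
```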

Closing

Parse::RecDescent is a powerful but difficult-to-understand module. Most of this is because parsing a language can itself be difficult to understand. However, as long as the language has a fairly consistent grammar (or one can be written), it's generally possible to translate it into a grammar that Parse::RecDescent can handle.

Many languages have their grammars available on the Internet. Grammars can usually be found in search engines under the keyword 'BNF', standing for 'Backus-Naur Form'. These grammars aren't quite in the form Parse::RecDescent prefers, but can usually be modified to suit.

When writing your own grammars for Parse::RecDescent, one important rule to keep in mind is that a rule can never have itself as the first term. This makes rules such as statement : statement ";" statements illegal. This sort of grammar is called "left-recursive" because the rule appears again as the leftmost term of its own expansion, so the parser would call the rule again before consuming any input and never terminate.

Left-recursive grammars can usually be rewritten to right-recursive, which will parse cleanly under Parse::RecDescent, but there are classes of grammars that can't be rewritten to be right-recursive. If a grammar can't be done in Parse::RecDescent, then something like Parse::Yapp may be more appropriate. It's also possible to coerce yacc into generating a Perl skeleton, supposedly.
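To see what such a rewrite looks like in practice, here is a small hand-written parser in plain Perl. It is a sketch under simplifying assumptions (items are bare words, separated by semicolons) and does not use Parse::RecDescent; parse_list() is a made-up name.

```perl
use strict;
use warnings;

# Left-recursive form:  list : list ";" item | item
#   A descent parser for this would call list() again before reading
#   any input, and recurse forever.
# Rewritten form:       list : item (";" item)*
#   One item is consumed first, then a loop handles the rest.
sub parse_list {
    my ($text) = @_;
    my @items;
    $text =~ /\G\s*(\w+)/gc or return;       # the first item is required
    push @items, $1;
    while ($text =~ /\G\s*;\s*(\w+)/gc) {    # then zero or more "; item"
        push @items, $1;
    }
    # Succeed only if the whole string was consumed.
    return $text =~ /\G\s*\z/gc ? \@items : undef;
}

my $items = parse_list("a; b; c");
print "@$items\n";
```

The loop plays the same role as the (s /;/) repetition shown earlier.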

Hopefully some of the shroud of mystery over Parse::RecDescent has been lifted, and more people will use this incredibly powerful module.

 #!/usr/bin/perl -w

 use strict;
 use Parse::RecDescent;
 use Data::Dumper;

 use vars qw(%VARIABLE);

 # Enable warnings within the Parse::RecDescent module.

 $::RD_ERRORS = 1; # Make sure the parser dies when it encounters an error
 $::RD_WARN   = 1; # Enable warnings. This will warn on unused rules &c.
 $::RD_HINT   = 1; # Give out hints to help fix problems.

 my $grammar = <<'_EOGRAMMAR_';

   # Terminals (macros that can't expand further)
   #

   OP       : m([-+*/%])      # Mathematical operators
   INTEGER  : /[-+]?\d+/      # Signed integers
   VARIABLE : /[a-z_][a-z0-9_]*/i # Variable (must not start with a digit)

   expression : INTEGER OP expression
              { return main::expression(@item) }
              | VARIABLE OP expression
              { return main::expression(@item) }
              | INTEGER
              | VARIABLE
              { return $main::VARIABLE{$item{VARIABLE}} }

   print_instruction  : /print/i expression
                      { print $item{expression}."\n" }
   assign_instruction : VARIABLE "=" expression
                      { $main::VARIABLE{$item{VARIABLE}} = $item{expression} }

   instruction : print_instruction
               | assign_instruction

   startrule: instruction(s /;/)

 _EOGRAMMAR_

 sub expression {
   shift;
   my ($lhs,$op,$rhs) = @_;
   $lhs = $VARIABLE{$lhs} if $lhs=~/[^-+0-9]/;
   return eval "$lhs $op $rhs";
 }

 my $parser = Parse::RecDescent->new($grammar);

 print "a=2\n";             $parser->startrule("a=2");
 print "a=1+3\n";           $parser->startrule("a=1+3");
 print "print 5*7\n";       $parser->startrule("print 5*7");
 print "print 2/4\n";       $parser->startrule("print 2/4");
 print "print 2+2/4\n";     $parser->startrule("print 2+2/4");
 print "print 2+-2/4\n";    $parser->startrule("print 2+-2/4");
 print "a = 5 ; print a\n"; $parser->startrule("a = 5 ; print a");

This Week in Perl 6 (3 - 9 June 2001)

Notes

You can subscribe to an email version of this summary by sending an empty message to perl6-digest-subscribe@netthink.co.uk.

Please send corrections and additions to bwarnock@capita.com.

The Perl 6 mailing lists saw 226 messages across 19 threads, with 40 authors contributing. Although the traffic was moderate, and very little heat was generated, most of the light was of the mysterious, all-encompassing kind.

Unicode

Dan Sugalski dropped a link provided by Slashdot's recent article on why Unicode won't work. Russ Allbery and Simon Cozens both started with an anti-FUD discussion, pointing out some of the questionable veracity and the datedness of the paper's conclusions, and providing basic Unicode information.

There was a brief discussion on the lossiness of Unicode, and whether the tables were going to be embedded/included in Perl 6. (Dan's plan is to build external libraries (for the latter) so that improved or alternate encoding sets can be replaced independently (to handle the former).)

Hong Zhang then kicked off a thread that brought locales into the discussion, particularly with case determination and sorting.

The discussions were mostly academic - not much was actually decided on. Larry did drop a couple hints as to how much (or how little) Perl 6 may differ from Perl 5.6 with Unicode handling.

A Strict View Of Properties

A lengthy discussion on the interaction of properties with use strict was started by Me. (In truth, the actual discussion on properties and stricture was very short - because properties can be both static variable properties or dynamic value properties, Perl can't determine at compile-time whether a property being accessed actually exists.

A parallel discussion involving a 'super-strict' mode to, in essence, allow programmers to completely turn off any runtime constructs and beef up compile-time checking devolved into a Perl-Java debate. Albeit more civilized than usual.)

Regular Expressions

In a continuation of last week's discussion on registers and opcodes, Larry listed his reasons for wanting the regex engine integrated into the regular opcode space:

But there is precedent for turning second-class code into first-class code. After all, that's just what we did for ordinary quotes in the transition from Perl 4 to Perl 5. Perl 4 had a string interpolation engine, and it was a royal pain to deal with.

The fact that Perl 5's regex engine is a royal pain to deal with should be a warning to us.

Much of the pain of dealing with the regex engine in Perl 5 has to do with allocation of opcodes and temporary values in a non-standard fashion, and dealing with the resultant non-reentrancy on an ad hoc basis. We've already tried that experiment, and it sucks. I don't want to see the regex engine get swept back under the complexity carpet for Perl 6. It will come back to haunt us if we do:

Although everyone was in agreement that halfway was, on a scale from Good to Bad, Bad, there was some dissension on whether integration or separation was needed to solve the maintenance issues.

Lists, References, and Interpolation

Simon Cozens asked a couple of questions about the new syntax:

Should properties interpolate in regular expressions? (and/or strings) I don't suppose they should, because we don't expect subroutines to. (if $foo =~ /bar($baz,$quux)/;? Urgh, maybe we need m//e)

What should $foo = (1,2,3) do now? Should it be the same as what $foo = [1,2,3]; did in Perl 5? (This is assuming that $foo=@INC does what $foo = \@INC; does now.) Putting it another way: does a list in scalar context turn into a reference, or is it just arrays that do that? If so, how can we disambiguate hashes from lists?

(The provided answers were "yes" (Damian Conway) and "undecided" (Larry Wall), respectively.)

Miscellany

Follow-Ups

It's Alive

Because of (or to spite) last week's coverage of it, David Nicol produced a patch for 5.7.1 that does such a thing.

Perl Assembly Language: Clarification

Two weeks ago, when discussing A.C. Yardley's Assembly Language proposal, I mentioned the "very-low-level operations" of the Perl Virtual Machine itself.

The Perl Virtual Machine is, of course, far removed from any layers that are normally considered to be "very-low-level," and the Virtual Machine opcodes tend to be somewhat complex in what and how much they accomplish. "Very-low-level" wasn't intended as a description of the complexity on an absolute scale, but as a description of the atomicity of operations relative to the Virtual Machine itself. (Thanks go to John Porter for bringing it to my attention.)


Bryan C. Warnock

This Week on p5p 2001/06/09

Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

This was a fairly quiet week with 240 messages.

Removing dependence on strtol

Nicholas Clark provided a patch to replace a call to strtol, a C library function that converts a string to a long integer and which (as luck would have it) turns out to have bugs in certain implementations. As Nicholas puts it: "No falling over because of other people's libraries' bugs".

This has cropped up again recently, so it's worth explaining. Perl is a very portable language: it is expected to compile under many different operating systems and under even more libraries. Occasionally some of these platforms have buggy implementations of functions: it is often easier to re-implement the buggy function inside Perl (correctly, using Perl internals and optimisations) than to code around that particular bug.

In this case, the problem was with UTS Amdahl's strtol not always setting errno when appropriate in certain "out of bounds" cases.
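For readers unfamiliar with the errno convention at issue: a caller clears errno, calls strtol, and then inspects errno to learn whether the value was out of range. The same protocol is visible from Perl through the core POSIX module, as in the sketch below. This only illustrates the calling convention and is unrelated to Nicholas's patch; parse_long() is a made-up helper name.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: returns the parsed integer, or undef on any failure.
sub parse_long {
    my ($str) = @_;
    return undef unless defined $str && length $str;
    $! = 0;                                    # clear errno before the call
    my ($num, $unparsed) = POSIX::strtol($str, 10);
    return undef if $unparsed != 0;            # trailing junk such as "12abc"
    return undef if $!;                        # errno was set, e.g. out of range
    return $num;
}

for my $input ("42", "12abc") {
    my $n = parse_long($input);
    print "$input: ", (defined $n ? $n : "rejected"), "\n";
}
```

On a platform whose strtol fails to set errno for an out-of-range value, the second check never fires and a clamped result is silently returned, which is the class of bug described above.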

More committers

Jarkko Hietaniemi proposed that there be more Perl committers (people able to add patches directly to the main Perl repository, which is held under Perforce):

I think it's time we nudge the development model of Perl to be a bit more open by extending the group of people having commit rights to the Perl repository.

There are many active perl5-porters that submit a lot of patches, both code (both C and Perl) and documentation patches, and I feel somewhat silly being a bottleneck. Some people (including me) could argue that having a single point of quality control is a good thing, but I think opening up access to the code would outweigh the potential downsides.

The rest of the proposal is worth reading, as it nicely sums up the situation. While it is good to have one central master control point for quality control, hopefully this change will free Jarkko up and increase the speed of development on Perl.

So far Jarkko has taken the Configure subpumpkin, and Simon Cozens is the Unicode subpumpkin. In addition, a changes mailing list will be set up so that interested parties can read the patches without any discussions.

Note that at the moment a few people already have Perforce access, such as Gurusamy Sarathy (5.6.x pumpkin), Nick Ing-Simmons (Perl IO pumpkin), and Charles Bailey (VMS pumpkin).

Regex Negation

Jeff Pinyan noted that a new Java regex package contained support for the following regular expression negation: [\w'\-[^\d]], which matches any word character, apostrophe, or hyphen, EXCEPT digits, and asked whether support was planned for Perl. He was pointed to the Unicode Regular Expression Guidelines, which proposed a syntax: [AEIOU[^A]], but was rather unclear on many points. The backwards compatibility police made an appearance, but otherwise nothing was resolved.
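For what it's worth, the effect of that particular example can already be had with existing Perl 5 syntax by putting a negative lookahead in front of the class. This is a sketch of a workaround, not the proposed syntax, and the variable names are arbitrary.

```perl
use strict;
use warnings;

# Any word character, apostrophe or hyphen, except digits:
# the negative lookahead subtracts \d from the class that follows it.
my $no_digit = qr/(?!\d)[\w'\-]/;

my @kept = grep { /\A$no_digit\z/ } ('a', '_', "'", '-', '7', ' ');
print join(' ', @kept), "\n";    # prints: a _ ' -
```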

Various

Gurusamy Sarathy fixed an as-old-as-the-hills bug to do with lexical lookups within eval EXPR.

Some minor documentation patches.

Simon Cozens re-announced the Perl Repository Browser. He also reworked and added many comments to perly.y (the Perl parser, which is now much easier to understand), and posted a hypertext representation of the Perl grammar.

David Nicol proposed a (broken, buggy, overworked) patch to Perl containing a new operator it, which would allow the following code to print "5": $a{foo} = 5; defined $a{foo}; print it. It was not liked by the backwards compatibility police.

There was some IThreads discussion on the naming of modules, from IThreads to Thread and threads.

Chris Nandor submitted some Mac OS compatibility patches.

Until next week I remain, your temporarily-replacing humble and obedient servant,

Leon Brocard


The Beginner's Attitude of Perl: What Attitude?

A recent article here, "Turning the Tides of Perl's Attitude Toward Beginners," described places on the Internet where beginners could get help without fear of being flamed mercilessly by insensitive, elitist Perl experts. Experienced programmers, in short, must be more patient with newbies.

The article's undertone seemed to scold programmers for not being good role models for the young, not having their photographs on Corn Flakes boxes and not helping build strong bodies 12 different ways.

I think that Perl is a great programming language and that everybody should learn it. But I don't agree that experienced Perl programmers have to feel guilty about being more creative and efficient than other programmers and submit to sensitivity training.

Being critical toward beginners is part of the Internet culture. It's not Perl's fault. Try reading a Newsgroup for the X Window System, for example. Perl Newsgroups are positively civil, in comparison.

If Perl has an image problem, it's probably due to the nature of the language. Its label as a "scripting" language gives the impression that Perl is best suited for writing shell scripts and batch files, and that it's about as complicated as LOGO.

Perl is anything but a simple language. For example, here are two (completely fictitious) statements, approximately similar in function, the first in C, the second in Perl, which return a dynamic function call:

func *p = *functable[i * (sizeof func *)];

$func = ${*{"$pkg"}}{"$key"};
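For a version of the Perl statement that actually runs, here is a sketch of fetching a subroutine by package and name and then calling it. The Greeter package and its hello subroutine are invented for the example.

```perl
use strict;
use warnings;

package Greeter;
sub hello { return "hello, $_[0]" }

package main;

my ($pkg, $key) = ('Greeter', 'hello');

# UNIVERSAL::can returns a code reference if the package defines the sub.
my $func = $pkg->can($key) or die "No $key in $pkg\n";
print $func->('world'), "\n";

# The same lookup made directly through the package's symbol table.
my $via_stash = do { no strict 'refs'; *{"${pkg}::${key}"}{CODE} };
print $via_stash->('again'), "\n";
```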

Perl's use of data references is at least as sophisticated as so-called system languages - the difference being that languages such as C allow Type-T programmers to fragment memory at will, while Perl interprets code in its own memory space that makes it safer for use in networked environments.

But complex data references can cause side-effects, which can cause programming gaffes in Perl at least as quickly as in C, Perl being an "interpreted language" and all.

In addition, Perl is famous - or infamous - for its flexibility. So there are a half-dozen ways to do any one task, and that sort of freedom can be confusing to beginners, if not downright frightening. A Perl is a Perl is a Perl ....

Right. Anyway, with the advent of dynamic module loading in Perl, just about anyone can write a library module to perform his or her task in his or her own manner. That kind of freedom can be very liberating and empowering, but it can also lead to confusion and panic. Perl development efforts often have the character of shirts-and-skins basketball games rather than a cloistered garden of object orientedness.

You can use objects in Perl, but they're really just references to things called associative arrays, or hashes, which are composed of sets of other things.

All of these objects belong to a class hierarchy. Even if the programmer doesn't care, they still belong to a class in Perl, because the considerate language designers worked in a generic, syntactically consistent UNIVERSAL class for any piece of data that doesn't wear its heart on its sleeve.
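In code, that amounts to very little. The sketch below uses an invented Counter class: the object is a hash reference tagged with a package name, and it picks up methods such as isa and can from UNIVERSAL without declaring any parent class.

```perl
use strict;
use warnings;

package Counter;

# The "object" is an ordinary hash reference, blessed into the class.
sub new { my ($class) = @_; return bless { count => 0 }, $class }
sub inc { my ($self) = @_; return ++$self->{count} }

package main;

my $c = Counter->new;
$c->inc;
$c->inc;

print ref($c), " holds ", $c->{count}, "\n";
# Counter declares no parent class, yet it inherits isa() from UNIVERSAL.
print "inherits isa\n" if $c->can('isa');
```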

I think E.E. Cummings might have done well as a Perl programmer.

But back to object orientedness - instead of using objects, you can just tell the Perl interpreter what module some piece of data is being loaded from. You don't have to cope with high-sounding and bothersome object oriented terminology if you don't want to.

However, to say that Perl has polymorphism - the ability of data to assume different characteristics depending on its context - is like saying that getting hit by a Greyhound bus might be hazardous to your health.

Not that I've experienced that personally. In this instance, I'll take somebody else's word for it.

With all of Perl's flexibility, sophistication and hordes of contributed library modules, it's easy to understand how a beginner might feel lost, frustrated and downright intimidated by the volume of material that's available online. If it weren't for a handful of dedicated CPAN archivists, the entire body of the community's library source code would have succumbed to anti-matter and chaos long ago.

I'm not certain how a contribution of mine might work in the context of my Web site, which is mainly about Linux, except that knowing how to use Perl is an essential system administration skill, and contributes mightily to the understanding of other system administration topics.

If a beginning system administrator learns how to use Perl, then he or she will have a better understanding of how the operating system works and will be less likely to pull some bonehead newbie trick like, say, setting the umask to 0.

So the purpose of this proposal is to argue that any effort to provide beginners with answers to their Perl questions must have equal prominence as the efforts of fully fledged programmers. A mailing list reference ought to appear prominently on Web pages, right up there with module listings and search engine forms, where beginners can find it right away.

Besides, you wouldn't want to feel superior to them.

If you have any suggestions as to how best to help beginners learn Perl, visit http://www.mainmatter.com/, and if the idea still sounds good, then email me at rkiesling@mainmatter.com.

Using CGI::Application


Why CGI::Application?

Table of Contents

Why CGI::Application?

Understanding CGI::Application

Putting It All Together

Conclusions & Advanced Concepts: Where to Go From Here

Resources

The Common Gateway Interface (CGI) may be viewed by some as "less than glamorous", but it is the workhorse of Web-based application development. For what CGI lacks in buzzword compliance, it more than makes up for in reliability, flexibility, portability, and (perhaps most important of all) familiarity!

CGI::Application builds upon the bedrock of CGI, adding a structure for writing truly reusable Web-applications. CGI::Application takes what works about CGI and simply provides a structure to negate some of the more onerous programming techniques that have cast an unfavorable light upon it.

CGI::Application code is so universal and non-proprietary that it works exceedingly well on any operating system and Web server that supports Perl and CGI. As you shall see, the CGI::Application structure even makes it possible for authors to distribute, for the first time, fully functional and sophisticated Web-applications via CPAN.

Understanding CGI::Application

Run-Modes

The most significant contribution of CGI::Application is the formal structure of "run-modes." A run-mode generally refers to a single screen of an application. All sophisticated Web applications feature multiple screens (or "pages"). For instance, an application to search through a database might feature a search form, a list of results and a detail of a single record. Each one of these three screens is part of a whole application.

Different programmers have devised different systems for managing these run-modes. Too many Web applications still look like huge IF-THEN-ELSE blocks, with each run-mode enclosed in one branch of the conditional. Often, these conditionals try to divide the application state by looking for the presence of various form variables. For instance, if a search field is present, show the list of results - otherwise, show the search form:

     use CGI;
     my $query = CGI->new();
     print $query->header();
     if (my $search_term = $query->param("search_term")) {
          # ...30 lines of code to run a search
          # and print the results as HTML
     } else {
          # ...15 lines of code to display
          # the search form
     }

It is code such as this that has given CGI a bad name! It is barely structured and easily broken by even small changes in functionality.

The most savvy programmers quickly realized that run-modes are a specific thing that must be directly managed, and the most succinct way to determine the run-mode is to explicitly set it. Some systems, such as ASP, HTML::Mason, Cold Fusion and JSP attempt to manage these run-modes by having one physical document for each run-mode. This has the effect of spreading the code for a single application over at least as many files as there are run-modes! Taking the run-modes out of context by breaking them into separate files solves the state management problem at the cost of creating all sorts of new problems, not the least of which is the management of code assets. These run-modes are, after all, all part of the same application.

Application Modules

CGI::Application provides another solution to the run-mode management problem by providing two core facilities. First, CGI::Application designates a single specific HTML form input as a "Mode Parameter". This Mode Parameter is used to store (and retrieve) the current run-mode of your application. The value of a run-mode is a simple text scalar. CGI::Application reads the value of this Mode Parameter and acts as a traffic cop, directing the application operation accordingly.

Second, CGI::Application maps each run-mode to a specific Perl subroutine. Each subroutine, referred to as a "Run-Mode Method", implements the behavior of a single run-mode. All of your code, including all your run-mode methods and the mapping table between run-modes and subroutines, is stored in a single file. This file is a Perl module, referred to as your "Application Module".
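The idea behind these two facilities can be sketched in a few lines of plain Perl. This is only an illustration of mode-parameter dispatch, not CGI::Application's actual implementation; the dispatch() function, the mode names and the placeholder HTML are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical run-mode map: mode name => subroutine implementing it.
my %run_modes = (
    search  => \&show_search_form,
    results => \&show_results_list,
);

sub show_search_form  { return "<form>...</form>" }
sub show_results_list { return "<ul>...</ul>" }

# Pick the handler named by the mode parameter ("rm" here), falling
# back to a start mode when the parameter is absent or empty.
sub dispatch {
    my ($params, $start_mode) = @_;
    my $mode = $params->{rm};
    $mode = $start_mode unless defined $mode && length $mode;
    my $handler = $run_modes{$mode} or die "No such run-mode '$mode'\n";
    return $handler->();
}

print dispatch({}, 'search'), "\n";                    # first visit
print dispatch({ rm => 'results' }, 'search'), "\n";   # form submitted
```

Compared with the IF-THEN-ELSE approach, the list of things the application can do lives in one table, and an unknown mode is an explicit error rather than a silent fall-through.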

Your Application Module is a sub-class of CGI::Application. In fact, CGI::Application is never intended to be used directly. CGI::Application is referred to by object-oriented enthusiasts as an "abstract class", and is only used via inheritance. To implement inheritance from CGI::Application, put the following code at the top of your Application Module:

     package Your::Web::Application;
     use base 'CGI::Application';

This code gives a name to your application (in this case, "Your::Web::Application"), and causes CGI::Application to be designated as the parent class. This parent class implements a number of methods that will provide the necessary infrastructure for your application. Some of the methods are expected to be called by your code to perform functions or set properties. Other inherited methods are expected to be implemented in your code, to provide the functionality specific to your application.

Defining Your Run-Mode Map

The map between run-modes and run-mode methods is defined in the setup() method. The setup() method is a method that you are expected to override in your Application Module by implementing a setup() subroutine. It is in your setup() subroutine that you define the map between run-modes and run-mode methods. Think of this map as the definitive list of things your application can do. If you ever add a function to your Web application, then you will amend this map to include your new run mode.

This run mode map is defined in your setup() method by using the run_modes() method provided by CGI::Application. The run_modes() method is an instance method that takes, as arguments, an associative array of run-modes as keys and run-mode method names as values (Note: CGI::Application version 1.3 is used in all our examples). To set up our prototypical database search application with three run-modes, this is how our code might look:

     package WidgetView;
     use base 'CGI::Application';
     sub setup {
          my $self = shift;
          $self->run_modes(
               'mode_1' => 'show_search_form',
               'mode_2' => 'show_results_list',
               'mode_3' => 'show_widget_detail'
          );
          $self->start_mode('mode_1');
          $self->mode_param('rm');
     }

That's it! The setup method receives an instance of your application class ($self) as an argument. When you call run_modes() you are setting the run-modes for this instance, so you use the object-oriented arrow ("->") operator. The inherited start_mode() method tells CGI::Application which mode to default to, if no mode is specified (as is the case when the application is first called). The inherited mode_param() method specifies the name of the HTML form parameter that will hold the run-mode state of the application from request to request.

What we have done here is set up an application called "WidgetView" with three run-modes, creatively named "mode_1", "mode_2" and "mode_3". These run-modes map to three as-yet-unwritten subroutines, respectively show_search_form(), show_results_list() and show_widget_detail(). The mode parameter is set to "rm" (the default), and the first mode of operation will be "mode_1".

Creating Run-Mode Methods

The run-mode method subroutines will contain the bulk of your code. These run-mode methods each implement the functionality for a particular run-mode. As we mentioned earlier, run-modes loosely translate into screens. As such, your run-mode methods will be responsible for setting up the HTTP and HTML output to be sent back to the requesting Web browser.

The most critical thing to remember about run-mode methods is that they should never print() anything to STDOUT. The inherited CGI::Application run() method is solely responsible for actually sending all HTTP headers and HTML content to the Web browser. Your run-mode method is called by the run() method, and your code is expected to return a scalar containing all your HTML content. If you send anything to STDOUT, your application will malfunction. Typical symptoms of this mistake are content preceding the HTTP headers, or HTTP headers appearing more than once in the output. If you see either, then you have probably sent output to STDOUT.

Your run-mode method will invariably need to interact with the CGI query to retrieve (and set) form parameters. CGI::Application does not attempt to provide this basic functionality. Instead, CGI::Application utilizes Lincoln D. Stein's superb CGI.pm module for all interactions with the CGI query. Becoming expert in CGI.pm will greatly enhance your mastery of CGI::Application. CGI::Application gives you access to the CGI.pm query object by way of the inherited query() method. Once you retrieve the CGI.pm query object via the query() method, you may interact with it as required.

For our first run-mode ("mode_1") we have specified that the run-mode method show_search_form() should be called. The purpose of this run-mode is to display the search form when the user first enters the application. Our run-mode method might look something like this:

     sub show_search_form {
          my $self = shift;
          # Get the CGI.pm query object
          my $q = $self->query();
          my $output = "";
          $output .= $q->start_html(-title => "Search Form");
          $output .= $q->start_form();

          # Build up our HTML form
          $output .= "Search for Widgets: ";
          $output .= $q->textfield(-name => 'search_term');
          $output .= $q->submit();

          # Set the new run-mode, when the user hits "submit".
          # -override => 1 stops CGI.pm from reusing an "rm" value already in the query.
          $output .= $q->hidden(-name => 'rm', -value => 'mode_2', -override => 1);
          $output .= $q->end_form();
          $output .= $q->end_html();

          return $output;
     }

As you can see, this subroutine is straightforward. The specified run-mode method is called as an object method, so it receives the application object ($self) as its first argument. We retrieve the CGI.pm query object via our Application Module's query() method (inherited from CGI::Application). The HTML form we create should be familiar to anybody who has used CGI.pm. When we have completely built up our $output, we return it (as opposed to printing it to STDOUT).

There is only one bit of "magic" going on here, and that is our hidden form variable "rm". This is the method by which a CGI::Application gets from one run-mode to another. In the case of this run-mode (based on the desired functionality of our application), there is only one place we can go, and that is to "mode_2" - the list of matching results. If the "mode_1" run-mode allowed us to do more than one thing (for instance, to add a new Widget), then we would need two buttons on this screen, each setting a different value for the form variable "rm".

How you go about setting that variable is up to you. For instance, you could have multiple HTML forms, or you could use JavaScript. CGI::Application imposes no restrictions on how the run-mode parameter gets set. It only cares that it is set, and leaves the logistics up to the application developer. Once the run-mode parameter is set, CGI::Application provides all the run-mode state management necessary to direct your application to the proper subroutine.
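The "multiple HTML forms" option might be sketched as follows. This is only an illustration: the package name, the helper subroutines, the button labels and the "mode_4" run-mode (for adding a Widget) are all hypothetical and not part of the WidgetView example.

```perl
package WidgetView::TwoForms;   # hypothetical package name
use strict;
use warnings;
use CGI;

# Build one small form whose submit button leads to the given run-mode.
# -override => 1 makes CGI.pm emit the value given here rather than any
# "rm" value carried over from the current request.
sub mode_button_form {
    my ($q, $label, $run_mode) = @_;
    my $output = $q->start_form();
    $output .= $q->hidden(-name => 'rm', -value => $run_mode, -override => 1);
    $output .= $q->submit(-value => $label);
    $output .= $q->end_form();
    return $output;
}

# Two forms on one screen, each setting "rm" to a different run-mode.
# 'mode_4' is a made-up run-mode for adding a Widget.
sub two_choice_screen {
    my $q = shift;
    return mode_button_form($q, 'Search Widgets', 'mode_2')
         . mode_button_form($q, 'Add a Widget',   'mode_4');
}

1;
```

A run-mode method could append the string returned by two_choice_screen($self->query()) to its $output before returning it.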

HTTP Headers

CGI::Application, by default, will return all content as MIME type "text/html". This is set by the HTTP headers. If you wish to set a different MIME type, manipulate a cookie or perform an HTTP redirect, then you will need to change the default HTTP headers. This is done by using two inherited CGI::Application methods: header_type() and header_props(). Refer to CGI::Application's perldoc for details on their usage.
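As a rough sketch of how these two methods are typically used, the following hypothetical module returns plain text from one run-mode and a redirect from another. The package name, run-mode names and URL are invented for illustration, and the exact arguments accepted may vary between CGI::Application versions, so check the perldoc for the version you have installed.

```perl
package WidgetView::Headers;    # hypothetical package name
use strict;
use warnings;
use base 'CGI::Application';

sub setup {
    my $self = shift;
    $self->run_modes(
        'as_text' => 'show_plain_text',
        'go_away' => 'redirect_elsewhere',
    );
    $self->start_mode('as_text');
}

# Return plain text instead of the default text/html
sub show_plain_text {
    my $self = shift;
    $self->header_props(-type => 'text/plain');
    return "Widgets are in stock.\n";
}

# Send an HTTP redirect instead of a normal page
sub redirect_elsewhere {
    my $self = shift;
    $self->header_type('redirect');
    $self->header_props(-url => 'http://www.example.com/');
    return "Redirecting...";
}

1;
```

The properties given to header_props() are handed on to CGI.pm when run() builds the headers, so the run-mode methods still return their content rather than printing anything themselves.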

Instance Scripts

There is one final piece in the CGI::Application architecture, and that is the "Instance Script". So far, we have talked extensively about the Application Module, but we have not yet explained exactly how the Application Module gets used! This is where the Instance Script comes in.

In traditional CGI programming, we might have a file, myapp.cgi, which is requested by a Web browser. The Web server (based on its configuration) will treat this file as a program, and return the output of its execution (as opposed to its content). In traditional CGI, this file would contain all your application code, and it would be quite lengthy. Using CGI::Application, we have put all our code in our Application Module, instead. This means that the actual file executed by the Web server can be completely empty of application-specific code! As a matter of fact, for our prototypical "WidgetView" application, what follows is the entirety of widgetview.cgi:

     #!/usr/bin/perl -w
     use WidgetView;
     my $app = WidgetView->new();
     $app->run();

It is that simple! The file, widgetview.cgi, is referred to as an "Instance Script" because it manages a single "instance" of your Application Module. As long as WidgetView.pm is in Perl's search path (@INC), this Instance Script will run your entire Web application.

Putting It All Together

In our prototypical WidgetView application, we have all the essential components of a complete CGI::Application.

Our Application Module, WidgetView.pm, may reside anywhere in the server's file system, provided it is within Perl's search path. It is recommended that your Application Module be placed outside the Web server's public document space, so that its contents are not accessible directly via the Web server.

The Application Module in our example contains four subroutines:

setup()
Configures our run-mode map, and other application settings.
show_search_form()
Run-mode "mode_1". Returns the HTML search form.
show_results_list()
Run-mode "mode_2". Based on the contents of the search form, it finds matching items in the database. The results are formatted in HTML and returned. A button is provided for each matching item, allowing the Web user to select one item by clicking on it. Clicking on an item sets the value of form parameter "rm" to "mode_3" and sets the value of another form parameter (e.g.: "item_id") to the unique identifier for the selected item.
show_widget_detail()
Run-mode "mode_3". Based on the value of the form parameter "item_id", this method retrieves all the details about the specified item from the database. These details are formatted as HTML and returned by this run-mode method.
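To make the "mode_2" description concrete, here is a minimal sketch of what show_results_list() might look like. It is not the article's actual listing: the package name is hypothetical, and a small in-memory hash of made-up data stands in for the database. Note the use of -override => 1 on the hidden fields, so that CGI.pm emits the new "rm" value rather than the "mode_2" value already present in the current request.

```perl
package WidgetView::Results;    # hypothetical package name for this sketch
use strict;
use warnings;
use base 'CGI::Application';

# Stand-in for the database: widget IDs mapped to names (made-up data)
my %WIDGETS = (
    101 => 'Left-handed widget',
    102 => 'Right-handed widget',
    103 => 'Sprocket',
);

sub setup {
    my $self = shift;
    $self->run_modes('mode_2' => 'show_results_list');
    $self->start_mode('mode_2');
    $self->mode_param('rm');
}

sub show_results_list {
    my $self = shift;
    my $q    = $self->query();

    # Read the search term submitted by the search form
    my $term = $q->param('search_term');
    $term = '' unless defined $term;

    my $output = $q->start_html(-title => 'Search Results');
    for my $id (sort keys %WIDGETS) {
        # Case-insensitive substring match; an empty term matches everything
        next unless index(lc $WIDGETS{$id}, lc $term) >= 0;

        # One small form per matching item; its button leads to "mode_3"
        $output .= $q->start_form();
        $output .= $q->escapeHTML($WIDGETS{$id}) . ' ';
        $output .= $q->hidden(-name => 'rm',      -value => 'mode_3', -override => 1);
        $output .= $q->hidden(-name => 'item_id', -value => $id,      -override => 1);
        $output .= $q->submit(-value => 'Details');
        $output .= $q->end_form();
    }
    $output .= $q->end_html();

    return $output;
}

1;
```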

A more complete source listing of the WidgetView application can be found here.

Our Instance Script, widgetview.cgi, resides within the Web server's public document space. It is configured to be treated as a CGI application. As long as your Web server supports CGI and Perl, your Web application based on CGI::Application will operate as you expect. WidgetView.pm does not require an Apache Web server - in fact, it will run equally well on any CGI-compatible server, including Microsoft's "IIS" or Netscape's "iPlanet" server, regardless of operating system.

Naturally, WidgetView will run exceedingly well on Apache/mod_perl servers, as CGI::Application adheres to very clean Perl programming standards. CGI::Application was designed, from the ground up, to run in full strict mode without throwing any warnings.

Conclusions & Advanced Concepts: Where to Go From Here

The concepts presented in this article should provide you with a starting point for using CGI::Application as the foundation of your Web application development. There are many advanced concepts that complete the CGI::Application picture, a few of which I will endeavor to summarize here.

Code Reuse

A tremendous potential for reusability is created through the structure of Instance Scripts. A single Application Module can be used by multiple Instance Scripts. Consider the potential for writing a Perl module and using it multiple times within a single project, across projects or even across organizations! For the first time, high-level functionality for the Web can be encapsulated in a single Perl module and distributed. If you are a CPAN author (or interested in becoming one), you could create a Web application and distribute it via CPAN in the same way CGI::Application, itself, is distributed!

Instance Scripts can also set instance-specific properties. As a result, the Instance Script becomes a sort of "configuration file" for your Application Module. The new() method (inherited from CGI::Application) lets you set variables that your Application Module can then read via the inherited param() method. As a simple example, you could write a mail-form application that takes instance parameters such as the address to which the form contents should be e-mailed. Multiple Instance Scripts, all referring to the same Application Module, could each specify a different e-mail recipient or a different form. Refer to the CGI::Application perldoc for the usage details on the new() and param() methods.
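The mail-form idea might look roughly like this. The MailForm module, its run-mode and the "mail_to" parameter name are hypothetical, and the sketch only reports the recipient instead of sending any mail. For brevity the Application Module and the Instance Script portion are shown in a single file; normally they would live in MailForm.pm and a separate .cgi file.

```perl
package MailForm;               # hypothetical Application Module
use strict;
use warnings;
use base 'CGI::Application';

sub setup {
    my $self = shift;
    $self->run_modes('confirm' => 'show_confirmation');
    $self->start_mode('confirm');
}

sub show_confirmation {
    my $self = shift;
    my $q    = $self->query();

    # Instance-specific setting supplied by the Instance Script
    my $recipient = $self->param('mail_to');

    return $q->start_html(-title => 'Thank You')
         . '<p>Your message will be sent to '
         . $q->escapeHTML($recipient) . '</p>'
         . $q->end_html();
}

package main;

# The Instance Script portion: same module, instance-specific recipient
my $app = MailForm->new(
    PARAMS => { 'mail_to' => 'sales@example.com' },
);
$app->run();
```

A second Instance Script could use the same module unchanged and pass a different 'mail_to' value.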

CGI::Application is designed to support code reuse via inheritance. Application Modules could be devised to provide project-wide functionality. Specific applications could then inherit from your custom parent class instead of directly from CGI::Application. For example, consider the possibility of having each of your applications load configuration data from a database, or set specific run-time properties. Your parent class Application Module might implement a cgiapp_init() method, which would allow for these types of inherited behaviors. Refer to CGI::Application's perldoc for specific usage of the cgiapp_init() method.
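A minimal sketch of this inheritance pattern follows. The class names and the 'site_name' setting are hypothetical, and a hard-coded value stands in for configuration loaded from a database. Both packages share one file here, hence the -norequire; with each class in its own file, a plain use base 'MyProject::Base' would do.

```perl
package MyProject::Base;        # hypothetical project-wide parent class
use strict;
use warnings;
use base 'CGI::Application';

# Called by new() before setup(): a place for behavior that every
# application in the project should share.
sub cgiapp_init {
    my $self = shift;
    # Stand-in for loading configuration data from a database
    $self->param('site_name' => 'Widget World');
}

package WidgetCatalog;          # hypothetical application
use strict;
use warnings;
use parent -norequire, 'MyProject::Base';

sub setup {
    my $self = shift;
    $self->run_modes('home' => 'show_home');
    $self->start_mode('home');
}

sub show_home {
    my $self = shift;
    # This setting was established by MyProject::Base::cgiapp_init()
    return '<html><body><h1>' . $self->param('site_name') . '</h1></body></html>';
}

1;
```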

Separating the HTML GUI From Application Code Using HTML::Template

At my company, Vanguard Media, CGI::Application is one part of a larger development strategy. One of the principal guiding forces of our application strategy is the maximum separation of the HTML GUI (Graphical User Interface) from the underlying application code. We have found that the best Perl programmers are rarely the best HTML designers, and the best HTML designers are rarely the best Perl programmers. It is for this reason that the separation of these two elements is arguably the most beneficial design decision you can make when devising an application architecture.

To this end, Sam Tregar's excellent HTML::Template module is utilized. HTML::Template allows external "template files" to be created for each screen in our applications. These template files contain 99 percent pure HTML, with a very small additional syntax for including scalar variables, loops and conditional blocks of data, set by the calling Run-Mode Method.

HTML::Template is so fundamental to our development strategy that special hooks have been built into CGI::Application to support its use! Refer to the CGI::Application perldoc for usage details of the inherited tmpl_path() and load_tmpl() methods.
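As a hedged illustration, a run-mode method using load_tmpl() might look like the sketch below. The package name, the template file name widget_detail.tmpl and the 'widget_name' variable are hypothetical, and HTML::Template must be installed. The sketch assumes the Instance Script tells the application where its templates live, for example with WidgetView::Templated->new(TMPL_PATH => '/path/to/templates/'); consult the perldoc of your installed version for the exact options.

```perl
package WidgetView::Templated;  # hypothetical package name
use strict;
use warnings;
use base 'CGI::Application';

sub setup {
    my $self = shift;
    $self->run_modes('mode_3' => 'show_widget_detail');
    $self->start_mode('mode_3');
}

# Assumes a template file named widget_detail.tmpl (hypothetical) in the
# template path, containing something like:
#   <html><body><h1><TMPL_VAR NAME="widget_name"></h1></body></html>
sub show_widget_detail {
    my $self = shift;

    # load_tmpl() returns an HTML::Template object for the named file
    my $template = $self->load_tmpl('widget_detail.tmpl');

    # Fill in the template variable; a real application would look this up
    $template->param('widget_name' => 'Left-handed widget');

    # Return the finished HTML, as every run-mode method must
    return $template->output();
}

1;
```

The run-mode method now contains no HTML at all, so a designer can change the page layout without touching the Perl code.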

Thoughts on Sessions and Security

A question that frequently comes up on the CGI::Application mailing list is how best to implement login security and session management. Experience has taught me that these are elements that are best excluded from your application code, and pushed into a lower layer of your Web server.

If you are using the Apache Web server, and are interested in implementing login security and session management, I encourage you to check out the various Apache::Auth* modules on CPAN. These modules tie into the "Authentication" and "Authorization" phases of the request. This code runs long before your CGI applications are called.

There are two primary advantages in placing your sessions and security code in this layer. First, your security will work for all documents, not just Perl applications. Even static HTML documents will be protected by this system. Second, putting sessions and security in this layer will avoid an architecture where programmers have to include special code at the start of their applications to participate in the sessions and security system.

Resources

I hope you enjoyed reading this article. The following references should help you further explore the use of CGI::Application:

Download CGI::Application
http://www.cpan.org/authors/id/J/JE/JERLBAUM/
CGI Application Mailing List
Send email to: cgiapp-subscribe@lists.vm.com
CGI.pm
http://stein.cshl.org/WWW/software/CGI/cgi_docs.html
HTML::Template
http://html-template.sourceforge.net/
Apache/mod_perl
http://perl.apache.org/

This Week in Perl 6 (27 May - 2 June 2001)

3 June 2001

Notes

You can subscribe to an email version of this summary by sending an empty message to perl6-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl6-thisweek-YYYYMM@simon-cozens.org, where YYYYMM is the current year and month.

It was a quiet week, with a mere 92 messages across 3 of the mailing lists. There were 9 threads, with 27 authors contributing. 3 threads generated 71 of the messages.

Perl Virtual Registers (continued)

Dan Sugalski summed up his thoughts on the previous week's register discussion.

1) The paired register thing's silly. Forget I mentioned it.

2) The interpreter will have some int, float, and string registers. Some stuff will be faster because of it, and it'll make the generated TIL or C code (when we do a TILperl or perl2c version) faster since we won't need to call opcode functions to add 3 and 4 together...

3) Whether the registers are really stack-based or not's an implementation detail. They'll be based off of some per-interpreter thing, of course, so'll be thread-local

4) We will have some sort of register push/pop system independent of the register implementation. (Probably, like the 68K family, with the ability to move multiple registers in one go)

5) The bytecode should be really, really close to the final executable form. I'd really like to be able to read in the bytecode in one big chunk and start executing it without change. (We'll end up with some sections that'll need to be changed--that's inevitable. If we can mmap in the non-fixup section pieces, though, that'd be great)

6) We may formally split the registers used to pass parameters from the working registers. I'm not sure if that'll ultimately be a win or not. (I can forsee lots of pointless register->register moving, and I'm not keen on pointless anything)

This spawned various discussions on several issues:

  • 8-bit versus 16-bit opcodes, with 8-bit opcodes having an escape opcode to access extended opcode features. The 8-bit with escapes scheme appears to be the winner.
  • CISC-style (high-level) versus RISC-style (low-level) opcodes. Various tradeoffs were discussed, including byte-bloat, processing speed, and ease of translation to other backends. No consensus has been reached yet.
  • Pure register versus register/stack hybrid. (In reality, even the pure register scheme is a register/stack hybrid - the question is how much stack play should be involved.) No real consensus on this one, either.
  • Variable argument opcodes and how to handle them. No opcode was expected to be unaware of how many arguments it was being passed, but if the situation ever arose, Dan suggested that the varargness be buried a layer deeper, so that the opcode itself simply takes a single argument: the register containing a list of arguments.

Coding Conventions Revisited

Dave Mitchell posted his revised draft of the "Conventions and Guidelines for Perl Source Code" PDD. The revision was generally accepted, save for a brief (but relatively tame) foray into the standard discussions of tabs versus spaces and brace alignment, and the official PDD Proposal should be forthcoming shortly.

.NET

A.C. Yardley pointed out some technical documents on .NET as an FYI. (The links were to here and here.)

It Is Another Language Feature, It Is, Or Is It?

David L. Nicol mused about a new magical variable it that automatically refers to the last lexically used variable (or perhaps the last variable used as the target of defined or exists). Most folks found it (in both senses of the word) too troublesome and ambiguous.

Status of the Perl 6 Mailing Lists

There have been, to date, 28 different mailing lists associated with the Perl 6 development effort - a list that seems most daunting at first. That list has now been reduced to eight "open" lists that are currently in use. (The previous lists may be reopened at a later date, and new ones may be created. Announcements will be made in the usual fashion on perl6-announce.) Subscription instructions and links to the archives can be found here.

The currently active lists dedicated to Perl 6 are -all, -announce, -build, -internals, -language, -meta, and -stdlib.

The last list, perl-qa, is involved in quality assurance for Perl in general, so it is also included as a Perl 6 development list.


Bryan C. Warnock

This Week on p5p 2001/06/03



Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

This was a fairly active week with 700 messages.

Testing, testing

Michael Schwern was on a rampage this week attempting to improve the Perl test suite. The current test suite is quite extensive, but maintaining it (or even finding which test failed) is currently tricky because the tests are identified only by number. Hugo sums it up very nicely:

As someone who regularly tries to put in the effort to add test cases, I find there is little difference in the effort involved in adding a test case whether or not I have to encode the test number in the test case.

As someone who regularly tries to investigate test failures, the lack of test numbers makes life _much_ more difficult. It isn't just the time it takes to discover which test failed, but also the fact that it diverts my concentration from the code I want to be thinking about, so that the debugging process becomes that much more difficult.

The rest of his post is also interesting.

Schwern (in his role as Perl Quality Assurance pumpkin) has been slowly improving the available testing tools, such as the Test::Simple module on CPAN, "an extremely simple, extremely basic module for writing tests suitable for CPAN modules and other pursuits". Instead of simply numbering the tests, it allows tests to be named. From its documentation:

 # This produces "ok 1 - Hell not yet frozen over" (or not ok)
 ok( get_temperature($hell) > 0, 'Hell not yet frozen over' );
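For readers who have not seen the module, a complete test script in this style might look like the sketch below. The get_temperature() function and its data are invented here purely so the example can run; they are not part of Test::Simple.

```perl
#!/usr/bin/perl -w
use strict;
use Test::Simple tests => 2;    # declare how many tests we plan to run

# Hypothetical function under test: returns a temperature in degrees Celsius
sub get_temperature {
    my $place = shift;
    my %celsius = (hell => 444, helsinki => -5);   # made-up readings
    return $celsius{$place};
}

# Each ok() takes a true/false value and a human-readable test name
ok( get_temperature('hell') > 0,     'Hell not yet frozen over' );
ok( get_temperature('helsinki') < 0, 'Helsinki is below freezing' );
```

Each result line carries both the test number and its name, so a failure can be located by searching the test file for the name.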

Schwern is currently holding off integrating the module into the core until he gets the interface just right. Tony Bowden dreamt about a world of testing and psychology, with convincing arguments about the module.

Schwern also submitted quite a few patches to the test suite to sync the latest version of the Test and Test::Harness modules from CPAN into the core and to improve the test suite.

libnet in the core

Jarkko introduced us to his evil plan to integrate all of CPAN into the Perl core, assimilating libnet this week. libnet contains various client side networking modules, such as Net::FTP, Net::NNTP, Net::POP3, Net::Time and Net::SMTP, but unfortunately requires some initial configuration. The idea was that libnet could be told only once which POP3 server to use, which it would then use by default in future.

Jarkko asked whether configuration could be delayed. There followed some discussion about providing a separate configuration utility which could be run after configuration-time, some talk (and flames) about a .perlrc per-user configuration file, and testing the modules by shipping small fake servers. No consensus was reached.

Warnings crusade

It was very much a week of patches from Schwern, who continued on his crusade to make Perl compile cleanly under -Wall, sometimes jumping through hoops to get rid of warnings.

After a slew of patches, Schwern suggested making -Wall the default to stop new patches containing warnings. Jarkko made it so, with the slightly surprising problem that Perl no longer compiled on Solaris with gcc. The culprit turned out to be -ansi, which has been temporarily removed.

Various

Hugo posted a wonderful comparison of various benchmarks containing the experimental ?> regular expression feature, along with a small discussion of the regular expression optimiser.

Tye McQueen posted a small patch attempting to make pathological hash keys much more unlikely.

H.Merijn Brand posted some patches to get Perl running on AIX and gcc.

There was some more talk on documenting sort as stable, with perhaps having a pragma such as use sort qw( stable unique );.

Jarkko submitted some UTF bug reports and proceeded to fix some.

Ilya provided some more OS/2 patches.

Ilmari Karonen provided an interesting bug report which was produced by his Markov chain random input tester.

Hugo provided a patch to stop Atof numifying "0xa" to 10. Until now Perl had relied on the system's atof, which turns out to behave differently on different platforms, so Perl now has its own implementation.
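The behavior in question can be seen with a short script. This is only an illustration of string numification on a perl that does not treat a "0x" prefix as hexadecimal (the result the patch aims for); it is not taken from the patch itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "0xa";

# Plain numification reads the leading "0" and stops at the "x".
# The "numeric" warning is silenced because the string is deliberately
# not a well-formed decimal number.
my $numified = do { no warnings 'numeric'; $string + 0 };

# hex() is the explicit way to read the string as hexadecimal
my $as_hex = hex($string);

print "numified: $numified\n";   # numified: 0
print "hex():    $as_hex\n";     # hex():    10
```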

Jarkko attempted to make use utf8 the default, allowing us to write our scripts in UTF-8. It was shot down very rapidly by the backwards-compatibility police because it would no longer allow naked bytes with the eighth bit set, such as the pound character.

Doug MacEachern posted some patches to clean up and optimise Cwd.pm.

Until next week I remain, your temporarily-replaced humble and obedient servant,

Leon Brocard, leon@iterative-software.com

