This article is about scripting in Perl. Of course, scripting can take place anywhere, not just in the context of the Web. I will concentrate on CGI scripts written in Perl. In fact, virtually any language could be used to write CGI scripts.
Perl has been ported to about 70 operating systems. The most recent (February 2001) is Windows CE. In this article I will make a lot of simplifications and generalizations. Here's the first ...
There are two types of CGI scripts:
- Those that output HTML pages
- Those that process the input from CGI forms
Invariably, the second type -- having processed the data -- will output an HTML page or form to allow the user to continue, or at least know what happened, or they will do a CGI redirect to another script that outputs something.
I am splitting scripts into two types to emphasize that there are differences between processing forms and processing in the absence of forms.
There are Web Servers and Web Clients. Some Web clients are browsers.
There are programs and scripts. Once upon a time, programs were compiled and scripts were interpreted. Hence the two names. But today, this is ``a distinction without a difference.'' My attitude is that the two words, program and script, are interchangeable.
Program and process, however, are different. Program means a program on disk. Process means a program that has been loaded by the operating system into memory and is being executed. This means a single program on disk can be loaded and run several times simultaneously, in which case it is one program and several processes.
Web servers have names such as Apache, Zeus, MS IIS and TinyHTTPd. Apache and TinyHTTPd (Tiny HyperText Transfer Protocol Daemon) are open source. Zeus and MS IIS (Internet Information Server) are commercial products. The feeble security of IIS makes it unusable in a commercial environment.
My examples will use Apache as the Web server. Web clients that are browsers have names such as Opera, Netscape and Explorer. Of course, you can roll your own non-browser Web client. We'll do this below.
URI = URL + URN
You'll notice the three letters I, L and N are in alphabetical order. That's the way to remember this formula.
    U => Uniform
    R => Resource
    I => Identifier
    L => Locator
    N => Name
Web Server Start Up
When a Web server starts running, these are the basic steps taken:
- Read configuration file. With Apache, you can use Perl to analyze certain parts of the config file. With Apache, you can even use Perl inside the config file
- Start subprocesses (depending on platform)
- Become quiescent, i.e. wait for requests from Web clients
It doesn't matter which Web server you are using, and it doesn't matter if the Web server is running under Unix or Windows or any other OS. These principles will apply.
Web Server Request Loop
The Web server request loop, simplified (as always), has several steps:
- Accept request from Web client
- Process request. This means one of the following:
  - Read a disk file containing an HTML page (the file = the page = the response)
  - Run a script (its output = the response)
  - Service the request using code within the Web server. If you submit this to Apache, http://127.0.0.1/server-info, Apache fabricates the response
- The script fabricates an HTML page and writes it to STDOUT. The Web server captures STDOUT. This output is the body of the response. The script exits
- Send response body, wrapped in the appropriate headers, to the Web client
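The loop above can be sketched in a few lines of Perl. This is a toy dispatcher, not how Apache is implemented: the sub name handle_request, the %pages and %scripts tables and everything in them are invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for the disk files and scripts the server knows about.
my %pages   = ('/index.html'   => "<html><body>Welcome</body></html>\n");
my %scripts = ('/cgi-bin/x.pl' => sub { return "<html><body>Generated</body></html>\n" });

# Process one request path and return the full response (headers + body).
sub handle_request {
    my ($path) = @_;
    my ($status, $body);

    if (exists $pages{$path}) {           # the file = the page = the response
        ($status, $body) = ('200 OK', $pages{$path});
    }
    elsif (exists $scripts{$path}) {      # the script's output = the response
        ($status, $body) = ('200 OK', $scripts{$path}->());
    }
    elsif ($path eq '/server-info') {     # serviced by code within the server
        ($status, $body) = ('200 OK', "<html><body>Toy server</body></html>\n");
    }
    else {
        ($status, $body) = ('404 Not Found', "<html><body>Not found</body></html>\n");
    }

    # Wrap the body in headers before sending it to the Web client.
    return "HTTP/1.0 $status\r\n"
         . "Content-Type: text/html\r\n"
         . 'Content-Length: ' . length($body) . "\r\n\r\n"
         . $body;
}

print handle_request('/cgi-bin/x.pl');
```

A real server would wrap this in an accept loop on a socket; the sketch only shows the "process request, then wrap in headers" part.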
Pictorially, we have an infinity symbol, i.e. a figure eight on its side:
    +------+ 1 -Request---> +------+ 2 -Action--> +------+
    | Web  | (URI or Submit)| Web  | (script.pl)  | Perl |
    |Client|                |Server|              |Script|
    +------+ <--Response- 4 +------+ <---HTML-- 3 +------+
           (Header and HTML)        (Plain page or CGI form)
Things to note:
- The interaction starts from the Web client
- The interaction is a round trip
- The Web client uses the HyperText Transfer Protocol to format Request 1:
  - The URI will be sent as text in the message
  - The data from a submitted form will be sent using the CGI protocol
  - In both cases, the HTTP will be used to generate an envelope wrapped around the message content
- Action 2 is a request from the Web server to the operating system to load and run a script. Many issues arise here. A brief summary:
  - Does the Web server have permission to run this script? After all, the Web server is a program, so it means it was loaded and run by some user, often a special user called ``nobody.'' So does this ``nobody'' have permission to run script.pl?
  - Does this particular Web client have permission? The Web server will check directory-access permissions and may have to ask the Web client for a username and a password before proceeding
  - Does the script have permission to read/write whatever directories it needs to do its work? For instance, to put a Web front-end on CVS requires that ``nobody'' have read access to the source code repository, or that the script opens a socket to another script that can access the repository
- Action 3 is a stream of HTML output by the script and is captured by the Web server.
- Response 4 is the output from the script wrapped in an envelope of headers according to the HTTP.
- The Web client cannot see the source code of the script, only the output of the script. If the Web client, e.g. a browser, offers to download the script and pops up a dialog box asking for the name of a file to save the script in, then the Web server clearly did not execute the script. This means the Web server is misconfigured.
- If the first execution of the script outputs a CGI form, then when the Web client submits that form, the script is rerun to process the form's data. That's right, the script would normally be run twice. In other words, the first time the script runs it sees it has no input data, so it outputs an empty form. The second time it runs it sees it has input data, so it reads and processes that data. Yes, they could be two separate scripts. When the form is output, the ``action'' clause specifies the name of the script that the Web server will run to process the form's data.
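Here is a minimal sketch of that last point: one script that outputs an empty form on its first run and processes the data on its second. The field name ``student'' and the action path are assumptions made up for the example, and percent-decoding is deliberately left out to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide what to output from the form data (an empty string on the first run).
sub respond {
    my ($form_data) = @_;

    if ($form_data eq '') {
        # First run: no input, so output an empty form whose action
        # names the script that will process it (here, this same script).
        return "<form method='GET' action='/cgi-bin/enrol.pl'>\n"
             . "<input type='text' name='student'>\n"
             . "<input type='submit'>\n"
             . "</form>\n";
    }

    # Second run: input is present, so process it.
    my ($student) = $form_data =~ /(?:^|&)student=([^&]*)/;
    $student = '' unless defined $student;
    $student =~ s/[^A-Za-z0-9 +-]//g;    # crude whitelist, for the sketch only
    $student =~ tr/+/ /;                 # '+' encodes a space
    return "<p>Enrolled: $student</p>\n";
}

# With the GET method, the Web server passes the form data in QUERY_STRING.
my $form_data = defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : '';
print "Content-Type: text/html\n\n",
      "<html><body>\n", respond($form_data), "</body></html>\n";
```

The two runs are two separate processes. Nothing in memory survives from the first to the second, so the only thing the second run knows is what arrives in the request.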
Web Server Directory Structure
But how does the Web server know which page to return or which script to run? To answer this we next look at the directory structure on the Web server's machine.
Below, Monash and Rusden are the names of university campuses.
monash.edu and rusden.edu will be listed under the ``Virtual Hosts'' part of httpd.conf, or, if you are running MS Windows NT/2k, they can be named in the file C:\WinNT\System32\Drivers\Etc\Hosts. Under other versions of Windows, the hosts file will be C:\Windows\Hosts.
And a warning about the NT version of this file: Windows Explorer will lie to you about the attributes of this file. You will have to log off as any user and log on as the administrator to be able to save edits into this file.
See http://savage.net.au/Perl/Html/configure-apache.html for details.
Assume this directory structure:
- D:\
  - www\
    - cgi-bin\
      - x.pl
    - conf\
      - httpd.conf
    - public\
      - index.html
      - monash\
        - index.html
      - monash\staff
        - mug-shots.html
      - rusden\
        - index.html
      - rusden\staff
        - courses.html
D:\www\cgi-bin Contents can be executed by the Web server but not viewed by Web clients
D:\www\conf Contents invisible to Web clients
D:\www\public Contents can be viewed by Web clients
Web Server Configuration
Now, the Web server can be told, via its configuration file httpd.conf, that:
- Web client requests using http://monash.edu/ are directed to D:\www\public\monash\.
Hence, a request for http://monash.edu/staff/mug-shots.html returns the disk file D:\www\public\monash\staff\mug-shots.html
- Web client requests using http://rusden.edu/ are directed to D:\www\public\rusden\.
Hence a request for http://rusden.edu/staff/courses.html returns the disk file D:\www\public\rusden\staff\courses.html
- Web client requests using http://monash.edu/cgi-bin/ are directed to D:\www\cgi-bin
- Web client requests using http://rusden.edu/cgi-bin/ are directed to D:\www\cgi-bin
Did you notice that both virtual hosts use D:\www\cgi-bin?
    ================================================================
    These two hosts have their own document trees, but share scripts
    ================================================================
We can service any number of virtual hosts with only one copy of each script. This is a huge maintenance savings.
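The mapping the Web server performs can be sketched in Perl. This only mimics the effect of the configuration described above; it is not how Apache does it, and the sub name map_to_disk is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Document roots per virtual host, as in the directory layout above.
my %doc_root = (
    'monash.edu' => 'D:\www\public\monash',
    'rusden.edu' => 'D:\www\public\rusden',
);
my $cgi_bin = 'D:\www\cgi-bin';    # shared by every virtual host

# Map a host and URI path to a disk path; returns undef for unknown hosts.
sub map_to_disk {
    my ($host, $path) = @_;
    return unless exists $doc_root{$host};
    return if $path =~ /\.\./;               # refuse to climb out of the tree

    my $base = $doc_root{$host};
    if ($path =~ s{^/cgi-bin/}{/}) {         # scripts live outside the doc tree
        $base = $cgi_bin;
    }
    (my $tail = $path) =~ tr{/}{\\};         # URI separators to Windows ones
    return $base . $tail;
}

print map_to_disk('monash.edu', '/staff/mug-shots.html'), "\n";
print map_to_disk('rusden.edu', '/cgi-bin/x.pl'), "\n";
```

Both hosts resolve /cgi-bin/x.pl to the same file on disk, which is exactly the saving described above.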
This is the information available to the Web server when a request comes in from a Web client. So, now let's look at the client side of things.
A Perl Web Client
Here is a real, live, complete, Perl Web Client that is obviously not a browser:
    #!/usr/bin/perl
    use LWP::Simple;
    print get('http://savage.net.au/index.html');

Yes, folks, that's it. The work is managed by the Perl module ``LWP::Simple,'' and is available through the command ``get,'' which that module exports, i.e. makes public so it can be used in scripts like this one. LWP stands for Library for WWW in Perl.
This code runs identically, from the command line, under Windows and Linux. The output is ``print''ed to the screen, but not formatted according to the HTML.
It's now time to step thru the Web server-Web client interaction.
Web Client Requests
When you type something such as ``rusden.edu'' into the browser's address field, or pass that string to a Web client, and hit Go, here's an example of what could happen:
- The Web client says ``You're lazy,'' and prepends the default protocol to the string, resulting in ``http://rusden.edu''
- The Web client says ``You're lazy,'' and appends the default directory to the string, resulting in ``http://rusden.edu/''
- The Web client sends this to the Web server with some headers. This is the all-important ``Request'' (see Web Server Request Loop)
- The Web server parses it and, using its configuration data, determines which disk directory, if any, this maps to. I say ``if any'' because it may refer to a virtual, or nonexistent, directory
- If the client asks for a directory, this would normally be converted (by the Web server) into a request for a directory listing, or a default file, such as /index.html
- If the client asks for a script to be run, the request is processed as described above. Of course, the client may not even know that a script is being run to service the request
- The Web server determines whether you have enough permission to access files in this directory
- If so, the Web server reads this disk file into memory or runs the script, and sends the result to the Web client with the appropriate headers. This is the all-important ``Response''
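The first three steps, where the Web client completes what you typed and formats the Request, can be sketched as follows. The sub names and the User-Agent string are made up, and a real client sends more headers than this.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn what the user typed into a full URI.
sub complete_uri {
    my ($typed) = @_;
    # Prepend the default protocol if none was given.
    $typed = "http://$typed" unless $typed =~ m{^[a-z][a-z0-9+.-]*://}i;
    # Append the default directory if there is no path at all.
    $typed .= '/' unless $typed =~ m{^[^:]+://[^/]+/};
    return $typed;
}

# Format the request the Web client would send for that URI.
sub build_request {
    my ($uri) = @_;
    my ($host, $path) = $uri =~ m{^http://([^/]+)(/.*)$}
        or die "Not an http URI: $uri";
    return "GET $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "User-Agent: toy-client/0.1\r\n"
         . "\r\n";
}

print build_request(complete_uri('rusden.edu'));
```

The Host header is what lets one Web server tell requests for rusden.edu apart from requests for monash.edu.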
In reality, processing the request and manufacturing the response can be complex procedures.
There are two types of Web pages sent to Web clients:
- Those that contain passive text, which the Web client (or human operating a browser) can do no more than look at
- Those that contain active text, i.e. CGI forms, in that the Web client (or human) can fill in data entry fields and then submit the form's data back to the Web server for processing by a script. In such cases, the form must contain a submit button of some type. You can use a clickable image as a submit button, or you may use a standard submit button, whose appearance has perhaps been transformed by a cascading style sheet, as the thing to click.
Action = Script
If you view the source of such a form, you will always find text like:
<form method='POST' action='./script.pl' enctype='application/x-www-form-urlencoded'>
The ``action'' part tells the Web server which script to run to process the form's data when the form is submitted.
The Web server asks the operating system to load and run the script, and then it (the Web server) passes the data (from the form) to the script. The script processes the data and outputs a response (which would normally be another form).
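To show what ``passes the data to the script'' means in practice, here is a sketch of a script reading and decoding data sent with the enctype shown above. The sub names are invented, and in real code you would normally let a module such as CGI do this for you.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode one application/x-www-form-urlencoded string into a hash.
sub parse_form {
    my ($data) = @_;
    my %field;
    for my $pair (split /[&;]/, $data) {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX encodes one byte
        }
        $field{$name} = $value;                   # last value wins in this sketch
    }
    return %field;
}

# Fetch the raw data the way a CGI script receives it from the Web server.
sub read_form_data {
    my $method = $ENV{REQUEST_METHOD} || 'GET';
    if ($method eq 'POST') {
        my $data = '';
        read(STDIN, $data, $ENV{CONTENT_LENGTH} || 0);   # POST: data on STDIN
        return $data;
    }
    return defined $ENV{QUERY_STRING} ? $ENV{QUERY_STRING} : '';  # GET
}

my %field = parse_form(read_form_data());
print "$_ = $field{$_}\n" for sort keys %field;
```

Run from the command line with no CGI environment, this prints nothing, because there is no form data to decode.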
I've used './script.pl' to indicate the script is in the ``current'' directory, but be warned, the CGI protocol does not specify what the current directory is at any time.
In fact, it does not even specify that any current directory exists. Your scripts must, at all times, know exactly where they are and what they are doing.
Remember, this ``action'' is taking place inside (i.e. from the point of view of) the Web server.
Web Page Content
Web pages usually contain data in a combination of languages:
- Text: Display this text
- Image references: Display this image
- HTML: Format the text and images. HTML is a ``rendering'' language
- XML: Echo and describe the text, e.g. to ``data mining'' page crawlers
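- JavaScript, run by the Web client, to:
  - Create special effects (trivial)
  - Validate form input (important)
Validating on the client side matters because the alternative, validating on the server, carries this overhead: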
- The form's data would have to be sent to the Web server. This means one trip across the Internet
- The Web server would have to run the script that will validate the data
- The Web server would have to pass the data to the script
- The script would have to read and parse the data
- The script would have to validate the data
- The script would have to send a response to the Web server
- The Web server would have to send a response to the Web client. This means a second trip across the Internet
Of course, complex validation often requires access to databases and so on, so sometimes there is no escape from ``overhead.''
Digression: HTML 'v' XML
As an aside, here's how HTML compares to XML. HTML is a rendering language. It indicates how the data is to be displayed. XML is a meta-language. It indicates the meaning of the data.
'<h1>25</h1>' tells you how 25 should look, but not what it is. In other words, '<h1>' is a command, telling a Web client how to display what follows.
'<th>Temperature</th><td>25</td>' tells you how to align the 25, but not what it is.
'<temperature>25</temperature>' tells you what 25 is. '<temperature>' is not a command.
'<street_number>25</street_number>' tells you what 25 is.
Hmmm. This would make a marvellous exam question.
Reaction: A Tale of 2 Scripts
So, what happens when a Web client requests that a Web server run a script?
To answer this, let's look at a Web client request for a script-generated form, and how that request is processed.
In fact, the Web client is saying to the Web server: ``Pretty please, run _your_ script on _my_ data.'' Let's go through the procedure:
- The Web client sends the URI 'http://rusden.edu/cgi-bin/enrol.pl'. This is script # 1
- The Web server executes the script (# 1), captures its output and sends the output -- the form -- back to the Web client. Script # 1 knows what to output, because it sees that it has no input data from a CGI form
- The script (# 1) terminates. It is finished, completed, done, gone forever. Trust me: I'm a programmer ...
- The Web client renders the Web page
- The web client fills in the form and submits it. Being a form, it must contain an ``action'' clause naming a script (# 2). Perhaps script # 1 is the same as script # 2
- The Web server executes the script (# 2), which processes the data. This invocation of script # 2 is independent of the prior invocation of script # 1, even if they are the same script. The Web server executes two separate processes, scripts # 1 and # 2. Script # 2 knows what to do because it sees that it has input data from a CGI form
- And so on ... Script # 2 may issue another form, in order to continue the interaction
You can see the problem. How does script # 2 know what ``state'' script # 1 got up to?
The problem of maintaining state is a big problem. Chapter 5 in ``Writing Apache Modules in Perl and C'' is called ``Maintaining State,'' and is dedicated to this problem. See ``Resources,'' below.
A few alternatives, and a simple discussion of possible drawbacks:
- Send data to the Web client as ``hidden fields'' to be returned with the form data
Drawback: A person can simply use the browser's ``View Source'' command to see the values. Hidden simply means that these fields are not rendered on the screen. There is absolutely no security in hidden fields.
- Save state in cookies
Drawback: The Web client may have disabled cookies. Some banks do this under the false assumption that cookies can contain viruses.
Drawback: If the cookie is written to disk by the Web client, the text in the cookie must be encrypted if you want to stop people looking at it or changing it.
- Save state in Web server memory
Drawback: The data is in the memory of one process, and when the Web client logs back in (i.e. submits the form data) it may be connected to a different process, i.e. a copy of the process that sent the first response, and this copy will not have access to the memory of the first process.
- Save state in the URI itself, e.g. as a session ID
Here's how: Generate a random number. Write the data into a database using the random number as the key. Send the random number to the Web client to be returned with the form data.
Drawback: You can't just use the operating system's random-number generator, because anyone with the same OS and compiler could predict the numbers. They aren't truly random.
Drawback: Relative URIs no longer work correctly. However, help is at hand with Perl module Apache::StripSession.
Drawback: Under some circumstances it is possible for the session ID to ``leak'' to other sites.
- Write the data to a temporary file
Drawback: How does script # 2 know the name of this file created by script # 1? It's simple if they are the same script, but they don't have to be.
Drawback: What happens if two copies of script # 1 run at the same time?
In each case, you either abandon that alternative or add complexity to overcome the drawbacks.
There is no perfect solution that satisfies all cases. You must study the alternatives, study your situation and choose a course of action.
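As one example of ``adding complexity to overcome the drawbacks,'' here is a sketch that makes a hidden field tamper-evident by attaching a keyed hash to it. It uses hmac_sha256_hex from the Digest::SHA module. The secret, the sub names and the state string are all assumptions invented for the example. Note what it does not do: the person can still read the value with ``View Source.'' They just can't change it without script # 2 noticing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Hypothetical server-side secret; it is never sent to the Web client.
my $secret = 'change-me-and-keep-me-off-the-web';

# Script # 1: attach a signature to the state before sending it out.
sub sign_state {
    my ($state) = @_;
    return $state . '|' . hmac_sha256_hex($state, $secret);
}

# Script # 2: return the state if the signature matches, else undef.
sub verify_state {
    my ($signed) = @_;
    my ($state, $mac) = $signed =~ /^(.*)\|([0-9a-f]{64})$/s
        or return;
    return hmac_sha256_hex($state, $secret) eq $mac ? $state : undef;
}

my $field = sign_state('step=2;course=perl101');
print "<input type='hidden' name='state' value='$field'>\n";
```

Both scripts must share the secret, which is easy if they are the same script and a small configuration problem if they are not.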
Combining Perl and HTML
There are three basic ways to do this:
- Put the code inside the HTML. Many Perl packages take this approach, e.g. Apache::ASP, Apache::EmbPerl (EmbeddedPerl), Apache::EP (another embedded Perl), Apache::ePerl (yet another embedded Perl), Template::Toolkit (embed a mini non-Perl language in the HTML).
In each case, you need an interpreter to read the combined HTML/Perl/Other and to output pure HTML. In such cases, the interpreter will act as a filter.
- Put the HTML inside the code. This is just the reverse of (1). Thus (tested code):
    #!/usr/bin/perl
    use integer;
    use strict;
    use warnings;
    use CGI;

    my($q) = CGI -> new();

    print $q -> header,
          $q -> start_html(),
          'Weather Report',
          $q -> br(),
          $q -> table
          (
              $q -> Tr
              ([
                  $q -> th('Temperature') . $q -> td(25)
              ])
          ),
          $q -> end_html();
- Put the HTML, or XML, or whatever, in a file external to the script. In this case your script will act as a filter. Your script reads this file and looks for special strings in the file that it replaces with HTML generated by, say, reading a database and formatting the output. In other words, the external file contains a combination of:
  - HTML, which your script simply copies to its output stream
  - HTML comments, like <!-- Some command -->, which your script cuts out and replaces with the results of processing that command
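A minimal sketch of this third way follows. The command names ``temperature'' and ``today'' are invented, and the template is held in a string rather than read from an external file so that the example is self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical commands the template may contain, and the code behind each.
my %command = (
    'temperature' => sub { return '<td>25</td>' },
    'today'       => sub { return scalar localtime },
);

# Copy the template through, replacing each <!-- command --> it recognizes.
# Comments that name no known command are copied unchanged.
sub fill_template {
    my ($template) = @_;
    $template =~ s{<!--\s*(\w+)\s*-->}
                  { exists $command{$1} ? $command{$1}->() : "<!-- $1 -->" }ge;
    return $template;
}

my $template = "<table><tr><th>Temperature</th><!-- temperature --></tr></table>\n";
print fill_template($template);
```

Because the commands are ordinary HTML comments, the template remains a valid page that a designer can open in a browser without running any Perl.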
A Detour - SDF
If you head over to SDF - Simple Document Format, you'll see an example of the third way. SDF is, of course, a Perl-based open-source answer to PDF.
SDF is also available from CPAN: http://theoryx5.uwinnipeg.ca/CPAN/cpan-search.html.
SDF converts text files into various specific formats. SDF can output, directly or via other software, into these formats: HTML, PostScript, PDF, man pages, POD, LaTeX, SGML, MIMS HTX and F6 help, MIF, RTF, Windows help and plain text.
Inside a Script: Who's Calling?
A script can ask the Web server the URI used to fire off the script.
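The Web server puts this information into the environment of the script under the name HTTP_REFERER (yes, misspelling included for free). Strictly speaking, this is the URI of the page the Web client came from, such as the form whose ``action'' named the script, and it is present only if the Web client chooses to send it.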
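So, as a script, I can say I was called by way of one of the virtual hosts, for example via http://monash.edu/cgi-bin/x.pl or via http://rusden.edu/cgi-bin/x.pl.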
Now, either ``monash.edu'' or ``rusden.edu'' is just the value of a string in the script, and so the script can use this string as a key into a database. In fact, this part of the URI is also in the environment, separately, under the name HTTP_HOST.
From a database table, or any number of tables, the script can retrieve data specific to the host. This, in turn, means the script can change its behavior depending on the URI used to run it.
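Here is a sketch of a script changing its behavior according to HTTP_HOST. A hash stands in for the database, and the ``title'' setting and the sub name settings_for are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-host settings; a real script would read these from a database.
my %setting = (
    'monash.edu' => { title => 'Monash' },
    'rusden.edu' => { title => 'Rusden' },
);

# Pick the settings for the host named in the request, if we know it.
sub settings_for {
    my ($host) = @_;
    $host = lc(defined $host ? $host : '');
    $host =~ s/:\d+$//;                  # drop a port such as :8080
    return $setting{$host};
}

my $site = settings_for($ENV{HTTP_HOST}) || { title => 'Unknown host' };
print "Content-Type: text/html\n\n";
print "<html><body><h1>$site->{title}</h1></body></html>\n";
```

The value comes from the Web client, so treat it as untrusted input: use it as a lookup key, as here, and never echo it into the page without checking it.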
Data Per URI - Page Design
The open-source database MySQL has a reserved table called ``hosts,'' so I'll start using the word ``domain.'' Given a domain, I can turn that into a number that can be used as an index into a database table.
Here is a sample ``domain'' table:
    +=============+=======+
    |             |  URI  |
    | domain_name | index |
    +=============+=======+
    | monash.edu  |   4   |
    +=============+=======+
    | rusden.edu  |   6   |
    +=============+=======+
And here is a sample Web page ``design'' table:
    +=======+
    +  URI  +===============+===========+===================+
    | index | template_name | bkg_color | location_of_links |...
    +=======+===============+===========+===================+
    |   4   | dark blue     | cream     | down the left     |...
    +=======+===============+===========+===================+
    |   6   | pale green    | an image  | across the bottom |...
    +=======+===============+===========+===================+
Data per URI - Page Content
Here is a sample Web page ``content'' table:
    +=======+
    +  URI  +================+================+
    | index | News headlines | Weather        |...
    +=======+================+================+
    |   4   | -              | www.bom.gov.au |...
    +=======+================+================+
    |   6   | www.f2.com.au  | www.bom.gov.au |...
    +=======+================+================+

    f2  => Fairfax, the publisher of ``The Age'' newspaper.
    bom => Bureau of Meteorology.
Data Per URI - Page Content Revisited
Let me give a more commercial example. Here we chain tables:
ProductMap table:

    +=======+
    +  URI  +==============+============+
    | index | Products     | product_id |
    +=======+==============+============+
    |   4   | Motherboards |     1      |
    +=======+==============+============+
    |   4   | Printers     |     2      |
    +=======+==============+============+
    |   4   | CD Writers   |     3      |
    +=======+==============+============+
    |   6   | CD Writers   |     4      |
    +=======+==============+============+
    |   6   | Zip Drives   |     5      |
    +=======+==============+============+

Product table:

    +============+=============+
    | product_id | Brands      |
    +============+=============+
    |     1      | Gigabyte X1 |
    +============+=============+
    |     1      | Gigabyte X2 |
    +============+=============+
    |     1      | Intel A     |
    +============+=============+
    |     1      | Intel B     |
    +============+=============+
    |     :      |      :      |
    +============+=============+
    |     5      | Sony        |
    +============+=============+
Hence a list of products for a given URI, i.e. a given shop, can be turned into an HTML table and inserted into the outgoing Web page.
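A sketch of that last step follows. The two tables are copied into Perl data structures, holding only the rows shown above. A real script would fetch them from the database, for instance with a join on product_id, and would HTML-escape the text. The sub name product_table is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In-memory copies of the two tables above (only the rows shown there).
my @product_map = (
    [4, 'Motherboards', 1], [4, 'Printers', 2], [4, 'CD Writers', 3],
    [6, 'CD Writers', 4],   [6, 'Zip Drives', 5],
);
my %brands = (
    1 => ['Gigabyte X1', 'Gigabyte X2', 'Intel A', 'Intel B'],
    5 => ['Sony'],
);

# Build an HTML table of products and brands for one URI index.
sub product_table {
    my ($uri_index) = @_;
    my $html = "<table>\n";
    for my $row (grep { $_->[0] == $uri_index } @product_map) {
        my (undef, $product, $product_id) = @$row;
        my $list = join ', ', @{ $brands{$product_id} || [] };
        $html .= "<tr><th>$product</th><td>$list</td></tr>\n";
    }
    return $html . "</table>\n";
}

print product_table(6);
```

Combined with the domain table, the chain is: domain name, then URI index, then products, then brands, all driven by the host part of the request.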
Resources
- http://savage.net.au/Ron/Scripting/cgi-scripting.txt (this article)
- http://savage.net.au/Ron/Scripting/cgi-scripting.ppt (an older version, in PowerPoint format)
- http://savage.net.au/Ron/Scripting/resources.txt (the above, as text)