May 2002 Archives

Improving mod_perl Sites' Performance: Part 1

In this series of articles, we are going to talk about mod_perl performance issues. We will try to look at as many aspects of a mod_perl-driven service as possible: hardware, software, Perl coding and, finally, the mod_perl-specific aspects.

The Big Picture

To make the user's Web browsing experience as painless as possible, every effort must be made to wring the last drop of performance from the server. There are many factors that affect Web site usability, but speed is one of the most important. This applies to any Web server, not just Apache, so it is important that you understand it.

How do we measure the speed of a server? Since the user (and not the computer) is the one that interacts with the Web site, one good speed measurement is the time elapsed between the moment when one clicks on a link or presses a Submit button to the moment when the resulting page is fully rendered.

The requests and replies are broken into packets. A request may be made up of several packets; a reply may be many thousands. Each packet has to make its way from one machine to another, perhaps passing through many interconnection nodes. We must measure the time starting from when the first packet of the request leaves our user's machine to when the last packet of the reply arrives back there.

A Web server is only one of the entities the packets see along their way. If we follow them from browser to server and back again, they may travel by different routes through many different entities. Before they are processed by your server, the packets might have to go through proxy (accelerator) servers, and, if the request contains more than one packet, packets might arrive at the server by different routes with different arrival times. Therefore, it's possible that some packets that arrive earlier will have to wait for other packets before they can be reassembled into a chunk of the request message that will then be read by the server. Then the whole process is repeated in reverse.

You could work hard to fine-tune your Web server's performance, but a slow Network Interface Card (NIC) or a slow network connection from your server might defeat it all. That's why it's important to think about the big picture and to be aware of possible bottlenecks between the server and the Web.

Of course, there is little that you can do if the user has a slow connection. You might tune your scripts and Web server to process incoming requests quickly, so you will need only a small number of working servers, but you might find that the server processes are all busy waiting for slow clients to accept their responses.

But there are techniques to cope with this. For example, you can deliver the response compressed. If you are delivering a pure text response, then gzip compression will sometimes reduce the size of the response by a factor of 10.
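To get a feel for the numbers, here is a minimal sketch using the Compress::Zlib module (one option among several; mod_gzip and Apache::Compress are others). The sample text is made up for illustration:

    use Compress::Zlib ();

    # A large, repetitive text response compresses extremely well.
    my $body    = "All work and no play makes Jack a dull boy.\n" x 500;
    my $gzipped = Compress::Zlib::memGzip($body);

    printf "plain: %d bytes, gzipped: %d bytes\n",
           length $body, length $gzipped;

    # Before sending $gzipped, check that the client sent an
    # "Accept-Encoding: gzip" header, and set "Content-Encoding: gzip"
    # on the response.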

You should analyze all of the components involved when you try to create the best service for your users, not only the Web server or the code that the Web server executes. A Web service is like a car: if one of the parts is broken, then the car will not run smoothly, and it can even stop dead if pushed too far without fixing it.

Let me stress it again: If you want to be successful in the Web service business, then you should worry about the client's whole browsing experience, not only about how good your code benchmarks are.

Operating System and Hardware Analysis


Before you start to optimize server configuration and learn to write more-efficient code, you need to consider the demands that will be placed on the hardware and the operating system. There is no point in investing a lot of time and money in configuration tuning and code optimization, only to find that your server's performance is poor because you did not choose a suitable platform in the first place.

Because hardware platforms and operating systems are developing rapidly (even while you are reading this article), the following discussion must be in general terms, without mentioning specific vendors' names.

Choosing the Right Operating System

I will try to talk about what characteristics and features you should be looking for to support a mod_perl-enabled Apache server; then, when you know what you want from your OS, you can go out and find it. Visit the Web sites of the operating systems you are interested in. You can gauge users' opinions by searching the relevant discussions in newsgroup and mailing list archives. Deja - http://deja.com and eGroups - http://egroups.com are good examples. I will leave this research to you. But probably your best bet is to ask mod_perl users, as they know best.

Stability and Robustness Requirements

Probably the most important features in an OS are stability and robustness. You are in the Internet business. You do not keep normal 9 a.m. to 5 p.m. working hours like conventional businesses. You are open 24 hours a day. You cannot afford to be off-line, because your customers will shop at another service (unless you have a monopoly ...). If the OS of your choice crashes every day, then first conduct a little investigation. There might be a simple reason that you can fix. However, there are OSs that won't work unless you reboot them twice a day. You don't want to use that type of OS, no matter how good the OS vendor's sales department is. Do not follow flashy advertisements; follow developers' advice instead.

Generally, people who have used the OS for some time can tell you a lot about its stability. Ask them. Try to find people who are doing things similar to what you are planning to do; they may even be using the same software. There are often compatibility issues to resolve, and you may need to become familiar with patching and compiling your OS.

Good Memory-Management Importance

You want an OS with a good memory-management implementation. Some OSs are well-known as memory hogs. The same code can use twice as much memory on one OS compared to another. If the size of a mod_perl process is 10MB and you have tens of these running, then it definitely adds up!

Say No to Memory Leaks

Some OSs and/or their libraries (e.g. C runtime libraries) suffer from memory leaks. A leak is when some process requests a chunk of memory for temporary storage but does not subsequently release it. The chunk of memory is then not available for any purpose until the process that requested it dies. You cannot afford such leaks. A single mod_perl process sometimes serves thousands of requests before it terminates, so if a leak occurs on each request, then the memory demands could become huge. Of course, your code can be the cause of memory leaks as well, but those are easier to detect and solve. Certainly, we can reduce the number of requests served during the process' life, but that can degrade performance.
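As an aside on leaks in your own code, a classic Perl-level leak (a sketch of mine, not from the original text) is a circular reference, which Perl's reference-counting garbage collector can never reclaim:

    # Each call strands one hash: %node refers to itself, so its
    # reference count never drops to zero when the sub returns.
    sub leaky {
        my %node;
        $node{self} = \%node;    # the circular reference
    }
    leaky() for 1 .. 1000;       # memory grows with every call

Repeated on every request, a leak like this adds up quickly.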

Memory-Sharing Capabilities Is a Must

You want an OS with good memory-sharing capabilities. If you preload the Perl modules and scripts at server startup, then they are shared between the spawned children (at least for a part of a process' life - memory pages can become ``dirty'' and cease to be shared). This feature can reduce memory consumption a lot!
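For example, you can preload modules from a startup file pulled in by httpd.conf. A minimal sketch (the file path and module choices are assumptions for illustration):

    # In httpd.conf:
    #   PerlRequire /home/httpd/perl/startup.pl

    # startup.pl
    use strict;
    use CGI ();
    CGI->compile(':all');   # precompile CGI.pm's autoloaded methods
    use DBI ();             # preload the database interface
    1;

Everything loaded here is compiled once in the parent server, so all spawned children share those memory pages instead of each compiling its own copy.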

And, of course, you don't want an OS that doesn't have memory-sharing capabilities.

The Real Cost of Support

If you are in a big business, then you probably do not mind paying another $1,000 for some fancy OS with bundled support. But if your resources are low, then you will look for cheaper and free OSs. Free does not mean bad, it can be quite the opposite. Free OSs can have the best support you can find. Some do.

It is easy to understand - most people are not rich and will try a cheaper or free OS first to see if it does the work for them. Since it really fits their needs, many people keep using it and eventually know it well enough to be able to provide support for others in trouble. Why would they do this for free? One reason is the spirit of the early days of the Internet, when there was no commercial Internet and people helped each other, because someone had helped them in the first place. I was there, I was touched by that spirit and I'm keen to keep it alive.

But, let's get back to the real world. We are living in a material world, and our bosses pay us to keep the systems running. So if you feel that you cannot provide the support yourself and you do not trust the available free resources, then you must pay for an OS backed by a company, and blame them for any problem. Your boss wants to be able to sue someone if the project fails because of the external product being used in it. If you buy a product and the company selling it claims support, then you have someone to sue or at least to put the blame on.

So, if we go with open source and it fails, then we have no one to sue? Wrong -- in the past several years, many companies have realized how good open-source products are and have started to provide official support for them. So your boss cannot just dismiss your suggestion of using an open-source operating system: you can get paid support, just as from any commercial OS vendor.

Also remember that the less money you spend on OS and software, the more you will be able to spend on faster and stronger hardware. Of course, for some companies money is a nonissue, but there are many companies for which it is a big issue.

Ouch ... Discontinued Products

The OSs in this hazard group tend to be developed by a single company or organization.

You might find yourself in a position where you have invested a lot of time and money into developing some proprietary software that is bundled with the OS you chose (say writing a mod_perl handler that takes advantage of some proprietary features of the OS and that will not run on any other OS). Things are under control, the performance is great and you sing with happiness on your way to work. Then, one day, the company that supplies your beloved OS goes bankrupt (not unlikely nowadays), or they produce a newer incompatible version and they will not support the old one (happens all the time). You are stuck with their early masterpiece, no support and no source code! What are you going to do? Invest more money into porting the software to another OS ...

Free and open-source OSs are probably less susceptible to this kind of problem. Development is usually distributed among many companies and developers, so if the person who developed an important part of the kernel loses interest in continuing, then someone else will pick up the flag and carry on. Of course, if tomorrow some better project shows up, then developers might migrate there and eventually drop the development. But in practice, people are often given support on older versions and helped to migrate to current versions. Development tends to be more incremental than revolutionary, so upgrades are less traumatic, and there is usually plenty of notice of forthcoming changes so that you have time to plan for them.

Of course, with open-source OSs you have the source code! So you can always have a go yourself, but do not underestimate the amount of work involved. There are many, many man-years of work in an OS.

Keeping Up with OS Releases

Actively developed OSs generally try to keep pace with the latest technology developments, and continually optimize the kernel and other parts of the OS to become better and faster. Nowadays, Internet and networking in general are the hottest topics for system developers. Sometimes a simple OS upgrade to the latest stable version can save you an expensive hardware upgrade. Also, remember that when you buy new hardware, chances are that the latest software will make the most of it.

On the other hand, if a new product's main selling point is backward compatibility with previous products of the same family, then you might not reap all the benefits of the new product's features, and you might get almost the same functionality for much less money by buying an older model of the same product.

Choosing the Right Hardware

Sometimes the most expensive machine is not the one that provides the best performance. Your demands on the platform hardware are based on many aspects and affect many components. Let's discuss some of them.

In the discussion I use terms that may be unfamiliar to you:

  • Cluster: a group of machines connected together to perform one big or many small computational tasks in a reasonable time. Clustering can also be used to provide 'fail-over,' where if one machine fails, then its processes are transferred to another without interruption of service. And you may be able to take one of the machines down for maintenance (or an upgrade) and keep your service running -- the main server will simply not dispatch the requests to the machine that was taken down.
  • Load balancing: users are given the name of one of your machines but perhaps it cannot stand the heavy load. You can use a clustering approach to distribute the load over a number of machines. The central server, which users access initially when they type the name of your service, works as a dispatcher. It just redirects requests to other machines. Sometimes the central server also collects the results and returns them to the users. You can get the advantages of clustering, too.
  • Network Interface Card (NIC): a hardware component that connects your machine to the network. It sends and receives packets; newer cards can also encrypt and decrypt packets and perform digital signing and verification of them. NICs come in different speed categories, varying from 10Mbps to 10Gbps and faster. The most common type of NIC is one that implements the Ethernet networking protocol.
  • Random Access Memory (RAM): It's the memory that you have in your computer. (Comes in units of 8MB, 16MB, 64MB, 256MB, etc.)
  • Redundant Array of Inexpensive Disks (RAID): an array of physical disks, usually treated by the operating system as one single disk, and often forced to appear that way by the hardware. The reason for using RAID is often simply to achieve a high data transfer rate, but it may also be to get adequate disk capacity or high reliability. Redundancy means that the system is capable of continued operation even if a disk fails. There are various types of RAID array and several different approaches to implementing them. Some systems provide protection against failure of more than one drive and some (`hot-swappable') systems allow a drive to be replaced without even stopping the OS.

Machine Strength Demands According to Expected Site Traffic

If you are building a fan site and you want to amaze your friends with a mod_perl guest book, then any old 486 machine could do it. If you are in a serious business, then it is important to build a scalable server. If your service is successful and becomes popular, then the traffic could double every few days, and you should be ready to add more resources to meet demand. While Web server scalability can be defined more precisely, the important thing is to make sure that you can add more power to your Web server(s) without investing much additional money in software development (you will need a little software effort to connect your servers if you add more of them). This means that you should choose hardware and OSs that can talk to other machines and become part of a cluster.

On the other hand, if you prepare for a lot of traffic and buy a monster to do the work for you, then what happens if your service doesn't prove to be as successful as you thought? Then you've spent too much money, and meanwhile faster processors and other hardware components have been released; so you lose.

Wisdom and prophecy, that's all it takes :)

Single Strong Machine vs. Many Weaker Machines

Let's start with a claim that a 4-year-old processor is still powerful and can be put to a good use. Now let's say that for a given amount of money you can probably buy either one new very strong machine or about 10 older but very cheap machines. I claim that with 10 old machines connected into a cluster and by deploying load balancing you will be able to serve about five times more requests than with one single new machine.

Why is that? Because generally the performance improvement on a new machine is marginal while the price is much higher. Ten machines will do faster disk I/O than one single machine, even if the new disk is quite a bit faster. Yes, you have more administration overhead, but there is a chance you will have it anyway, for in a short time the new machine you have just bought might not stand the load. Then you will have to purchase more equipment and think about how to implement load balancing and Web server file system distribution anyway.

Why am I so convinced? Look at the busiest services on the Internet: search engines, Web/e-mail servers and the like -- most of them use a clustering approach. You may not always notice it, because they hide the real implementation details behind proxy servers.

Getting Fast Internet Connection

You have the best hardware you can get, but the service is still crawling. Make sure you have a fast Internet connection -- not as fast as your ISP claims it to be, but as fast as it should be. The ISP might have a good connection to the Internet but put many clients on the same line. If these are heavy clients, then your traffic will have to share the same line and your throughput will suffer. Think about a dedicated connection and make sure it is truly dedicated. Don't trust the ISP; check it!

The idea of having "a connection to the Internet" is a little misleading. Many Web hosting and co-location companies have large amounts of bandwidth but still have poor connectivity. The public exchanges, such as MAE-East and MAE-West, frequently become overloaded, yet many ISPs depend on these exchanges.

Private peering, in which providers exchange traffic with each other directly, lets that traffic move much more quickly.

Also, if your Web site is of global interest, check that the ISP has good global connectivity. If the Web site is going to be visited mostly by people in a certain country or region, then your server should probably be located there.

Bad connectivity can directly influence your machine's performance. Here is a story one of the developers told on the mod_perl mailing list:

What relationship has 10 percent packet loss on one upstream provider got to do with machine memory?

Yes ... a lot. For a nightmare week, the box was located downstream of a provider who was struggling with some serious bandwidth problems of his own ... people were connecting to the site via this link, and packet loss was such that retransmits and TCP stalls were keeping httpd heavies around for much longer than normal ... instead of blasting out the data at high or even modem speeds, they would be stuck at 1k/sec or stalled out ... people would press stop and refresh, httpds would take 300 seconds to timeout on writes to no-one ... it was a nightmare. Those problems didn't go away till I moved the box to a place closer to some decent backbones.

Note that with a proxy, this only keeps a lightweight httpd tied up, assuming the page is small enough to fit in the buffers. If you are a busy Internet site, then you always have some slow clients. This is a difficult thing to simulate in benchmark testing, though.

Tuning I/O Performance

If your service is I/O bound (does a lot of read/write operations to disk), then you need a very fast disk, especially if you need a relational database, as databases are the main creators of I/O streams. So you should not spend the money on a fancy video card and monitor! A cheap card and a 14-inch monochrome monitor are perfectly adequate for a Web server; you will probably access it by telnet or ssh most of the time anyway. Look for disks with the best price/performance ratio. Of course, ask around and avoid disks that have a reputation for head-crashes and other disasters.

You must think about RAID or similar systems if you have an enormous data set to serve (what is an enormous data set nowadays? Gigabytes? Terabytes?) or you expect really heavy Web traffic.

OK, you have a fast disk, what's next? You need a fast disk controller. There may be one embedded on your computer's motherboard. If the controller is not fast enough, then you should buy a faster one. Don't forget that it may be necessary to disable the original controller.

How Much Memory Is Enough?

How much RAM do you need? Nowadays, chances are that you will hear: ``Memory is cheap, the more you buy the better.'' But how much is enough? The answer is pretty straightforward: you do not want your machine to swap. When the CPU needs to write something into memory, but memory is already full, it takes the least frequently used memory pages and swaps them out to disk. This means you have to bear the time penalty of writing the data to disk. If another process then references some of the data that happens to be on one of the pages that has just been swapped out, then the CPU swaps it back in again, probably swapping out some other data that will be needed very shortly by some other process. Carried to the extreme, the CPU and disk start to thrash hopelessly in circles, without getting any real work done. The less RAM there is, the more often this scenario arises. Worse, you can exhaust swap space as well, and then your troubles really start.

How do you make a decision? You know the highest rate at which your server expects to serve pages and how long it takes on average to serve one. Now you can calculate how many server processes you need. If you know the maximum size your servers can grow to, then you know how much memory you need. If your OS supports memory sharing, then you can make best use of this feature by preloading the modules and scripts at server startup, and so you will need less memory than you have calculated.

Do not forget that other essential system processes need memory as well, so you should plan not only for the Web server, but also take into account the other players. Remember that requests can be queued, so you can afford to let your clients wait for a few moments until a server is available to serve them. Most of the time your server will not be under maximum load, but you should be ready to bear the peaks. You need to reserve at least 20 percent of free memory for peak situations. Many sites have crashed a few moments after a big scoop about them was posted and an unexpected number of requests suddenly came in (the Slashdot effect). If you are about to announce something cool, then be aware of the possible consequences.
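As a back-of-the-envelope illustration of the calculation above (all numbers here are hypothetical, not measurements):

    # Estimate RAM needs from expected peak traffic -- a rough sketch.
    my $requests_per_sec = 30;    # expected peak request rate
    my $secs_per_request = 0.5;   # average time to serve one request
    my $mb_per_process   = 10;    # maximum size of one httpd process
    my $mb_shared        = 4;     # part shared thanks to preloading

    my $servers = $requests_per_sec * $secs_per_request;   # 15 servers
    my $ram     = $mb_shared + $servers * ($mb_per_process - $mb_shared);

    printf "need ~%d servers and ~%dMB of RAM (with 20%% headroom)\n",
           $servers, $ram * 1.2;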

Getting a Fault-Tolerant CPU

Make sure that the CPU is operating within its specifications. Many boxes are shipped with incorrect settings for CPU clock speed, power supply voltage, etc. Sometimes a cooling fan is not fitted. It may be ineffective because a cable assembly fouls the fan blades. Like faulty RAM, an overheating processor can cause all kinds of strange and unpredictable things to happen. Some CPUs are known to have bugs that can be serious in certain circumstances. Try not to get one of them.


Detecting and Avoiding Bottlenecks

You might use the most expensive components, but still get bad performance. Why? Let me introduce an annoying word: bottleneck.

A machine is an aggregate of many components. Almost any one of them may become a bottleneck.

If you have a fast processor but a small amount of RAM, then the RAM will probably be the bottleneck. The processor will be under-utilized; usually it will be waiting for the kernel to swap memory pages in and out, because memory is too small to hold the busiest pages.

If you have a lot of memory, a fast processor, a fast disk, but a slow disk controller, then the disk controller will be the bottleneck. The performance will still be bad, and you will have wasted money.

A slow NIC can cause a bottleneck as well and make the whole service run slowly. This is one of the most important components, since Web servers are much more often network-bound than disk-bound (i.e. they handle more network traffic than disk utilization).

Solving Hardware Requirement Conflicts

It may happen that the combination of software components that you find yourself using gives rise to conflicting requirements for the optimization of tuning parameters. If you can separate the components onto different machines, then you may find that this approach (a kind of clustering) solves the problem, at much less cost than buying faster hardware, because you can tune the machines individually to suit the tasks they should perform.

For example, if you need to run a relational database engine and a mod_perl server, then it can be wise to put the two on different machines, since an RDBMS needs a very fast disk while mod_perl processes need lots of memory. By placing the two on different machines, it's easy to optimize each machine separately and satisfy each software component's requirements in the best way.



Achieving Closure

Maybe you've heard about closures; they're one of those aspects of Perl -- like object-oriented programming -- that everyone raves about and you can't really see the big deal until you play around with them and then they just click. In this article, we're going to play around with some closures, in the hope that they'll just click for you.

The nice thing about playing around with closures is that you often don't realize you're doing it. Don't believe me? OK, here's an ordinary piece of Perl:


    my $print_hello = sub { print "Hello, world!"; };

    $print_hello->();

We create a subroutine reference in $print_hello, and then we dereference it, calling the subroutine. I suppose we could put that into a subroutine:


    sub make_hello_printer {
        return sub { print "Hello, world!"; }
    }

    my $print_hello = make_hello_printer();
    $print_hello->();

Still nothing magical going on here. And it shouldn't be any surprise to you that we can move the "message" to a separate variable, like this:


    sub make_hello_printer {
        my $message = "Hello, world!";
        return sub { print $message; }
    }

    my $print_hello = make_hello_printer();
    $print_hello->();

As you'd expect, that prints out the Hello, world! message. Nothing special going on here, is there? Well, actually, there is. This is a closure. Did you notice?

What's special is that the subroutine reference we created refers to a lexical variable called $message. The lexical is defined in make_hello_printer, so by rights, it shouldn't be visible outside of make_hello_printer, right? We call make_hello_printer, $message gets created, we return the subroutine reference, and then $message goes away, out of scope.

Except it doesn't. When we call our subroutine reference outside of make_hello_printer, it can still see and retrieve the correct value of $message. The subroutine reference forms a closure, ``enclosing'' the lexical variables it refers to.

Here's the canonical example of closures, the one you'll find in practically every Perl book:


    sub make_counter {
        my $start = shift;
        return sub { $start++ }
    }

    my $from_ten = make_counter(10);
    my $from_three = make_counter(3);

    print $from_ten->();       # 10
    print $from_ten->();       # 11
    print $from_three->();     # 3
    print $from_ten->();       # 12
    print $from_three->();     # 4

We've created two "counter" subroutines, which have completely independent values. This happens because each time we call make_counter, Perl creates a new lexical for $start, which gets wrapped up in the closure we return. So $from_ten encloses one $start which is initialized to 10, and $from_three encloses a totally different $start, which starts at 3.

It's because of this property that Barrie Slaymaker calls closures "inside-out objects:" objects are data that have some subroutines attached to them, and closures are subroutines that have some data attached to them.

Now, I said that example is used in practically every Perl book because authors tend to put off discussing closures until there's little time left and they have run out of imagination. (Well, at least that's my excuse ...) However, it's not an entirely practical example, to say the least. So let's try to find a better one.

This example is a bit more complex, but it demonstrates more clearly one extremely useful feature of closures: They can be used to bridge the gap between event-driven programs, which use callbacks extensively, and ordinary procedural code. I recently had to convert a bunch of XML files into an SQL database. Each file constituted a training course, so I wanted to build a data structure that contained the filename plus some of the details I'd parsed from the XML. Here's what I ended up with:


    use XML::Twig;
    my %courses;
    for (<??.xml>) {
        my $name = $_; $name =~ s/\.xml$//;
        my $t= XML::Twig->new( 
            TwigHandlers => {
                need => sub { 
                    push @{$courses{$name}{prereqs}}, $_->{'att'}->{course};
                },
                # ...
            }
        );
        $t->parsefile($_);
    }

What's going on here? XML::Twig is a handy module that can be used to create an XML parser -- these parsers will call "TwigHandlers" when they meet various tags. We go through all the two-letter XML files in the current directory, and create a parser to parse the file. When we see something like this:


    <need course="AA"/>

our need handler is called to store the fact that the current course has a prerequisite of the course coded "AA." ($_->{'att'}->{...} is XML::Twig-speak for "retrieve the value of the attribute called ...")

And that need handler is a closure -- it wraps up the name of the current file we're parsing, $name, so that it can be referred to whenever XML::Twig decides to use it.

There are many other things you can do with closures -- Tom Christiansen once recommended using them for "data hiding" in object-oriented code, since they rely on lexical variables that nothing outside of the closure can see. In fact, some of the most esoteric and advanced applications of Perl make heavy use of closures.

But as we've seen, some of the most useful uses of closures can happen without you noticing them at all ...

Finding a mod_perl ISP... or Becoming One

Introduction

In this article we will talk about the nuances of providing mod_perl services and present a few ISPs that successfully provide them.

  • You installed mod_perl on your box at home, and you fell in love with it. So now you want to convert your CGI scripts (which are currently running on your favorite ISP's machine) to run under mod_perl. Then you discover that your ISP has never heard of mod_perl, or refuses to install it for you.
  • You are an old sailor in the ISP business, you have seen it all, you know how many ISPs are out there, and you know that the sales margins are too low to keep you happy. You are looking for some new service almost no one else provides, to attract more clients to become your users and, hopefully, to have a bigger slice of the action than your competitors.

If you are planning to become an ISP that provides mod_perl services or are just looking for such a provider, this article is for you.

Gory Details

An ISP has three choices:

  1. ISPs probably cannot let users run scripts under mod_perl on the main server. There are many reasons for this:

    Scripts might leak memory due to sloppy programming. There will not be enough memory to run as many servers as required, and clients will not be satisfied with the service, because it will be slow.

    The question of file permissions is a very important issue: any user who is allowed to write and run a CGI script can at least read (if not write) any other files that belong to the same user and/or group under which the Web server is running. Note that it's impossible to run suEXEC and cgiwrap extensions under mod_perl.

    Another issue is the security of database connections. If Apache::DBI is used, then by hacking the Apache::DBI code a user can pick a connection from the pool of cached connections, even if it was opened by someone else, as long as both users' scripts run on the same Web server.

    There are many more things to be aware of, so at this time you have to say no.

    Of course, as an ISP, you can run mod_perl internally, without allowing your users to map their scripts so that they run under mod_perl. If, as a part of your service, you provide scripts such as guest books, counters, etc. that are not available for user modification, then you can still have these scripts running very quickly.

  2. "But, hey, why can't I let my users run their own servers, so I can wash my hands of them and don't have to worry about how dirty and sloppy their code is? (Assuming that the users are running their servers under their own user names, to prevent them from stealing code and data from each other.)"

    This option is fine, as long as you are not concerned about your new system's resource requirements. If you have even very limited experience with mod_perl, you will know that mod_perl-enabled Apache servers -- while freeing up your CPU and allowing you to run scripts much faster -- have huge memory demands (5-20 times those of plain Apache).

    The size of these memory demands depends on the code length, the sloppiness of the programming, possible memory leaks the code might have, and all of that multiplied by the number of children each server spawns. A very simple example: a server serving an average number of scripts, demanding 10MB of memory and spawning 10 children, raises your memory requirements by 100MB (the real requirement is actually much smaller if your OS allows code sharing between processes and if programmers exploit these features in their code). Now multiply the average required size by the number of server users you intend to have, and you will get the total memory requirement.

    Since ISPs never say no, you'd better take the inverse approach -- think of the largest memory size you can afford, and then divide it by one user's requirements (as I have shown in this example), and you will know how many mod_perl users you can afford :)

    But what if you cannot tell how much memory your users may use? Their requirements from a single server can be very modest, but do you know how many servers they will run? After all, they have full control of httpd.conf - and it has to be this way, since this is essential for the user running mod_perl.

    All of this rambling about memory leads to a single question: is it possible to prevent users from using more than X memory? Or another variation of the question: assuming you have as much memory as you want, can you charge users for their average memory usage?

    If the answer to either of the above questions is yes, then you are all set and your clients will praise your name for letting them run mod_perl! There are tools to restrict resource usage (see, for example, the man pages for ulimit(3), getrlimit(2), setrlimit(2) and sysconf(3); the last three have the corresponding Perl modules BSD::Resource and Apache::Resource).
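    For instance, Apache::Resource can apply such limits to each child process via environment variables set in httpd.conf. A minimal sketch; the limit values are illustrative only (given as soft:hard):

      # in the user's httpd.conf -- limits are illustrative
      # data segment size, in MB (soft:hard):
      PerlSetEnv PERL_RLIMIT_DATA 48:64
      # CPU time, in seconds (soft:hard):
      PerlSetEnv PERL_RLIMIT_CPU 120:180
      PerlChildInitHandler Apache::Resource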

    If you have chosen this option, you have to provide your client with:

    • Shutdown and startup scripts installed together with the rest of your daemon startup scripts (e.g., the /etc/rc.d directory), so that when you reboot your machine, the user's server will be correctly shut down and will be back online the moment your system starts up. Also make sure to start each server under the user name the server belongs to, or you are going to be in big trouble!
    • Proxy services (in forward or httpd accelerator mode) for the user's virtual host. Since the user will have to run their server on an unprivileged port (>1024), you will have to forward all requests from user.given.virtual.hostname:80 (which is the same as plain user.given.virtual.hostname, since 80 is the default port) to your.machine.ip:port_assigned_to_user. You will also have to tell the users to code their scripts so that any self-referencing URLs are of the form user.given.virtual.hostname.

      Letting the user run a mod_perl server immediately adds the requirement that the user be able to restart and configure their own server. Only root can bind to port 80; this is why your users have to use port numbers greater than 1024.

      Another solution would be to use a setuid startup script, but think twice before you go with it, since if users can modify the scripts, then they may be able to gain root access.



    • Another problem you will have to solve is how to assign ports to users. Since users can pick any port above 1024 to run their server, you will have to lay down some rules here so that multiple servers do not conflict.

      A simple example will demonstrate the importance of this problem. Suppose I am a malicious user, or just a rival of some fellow who runs his server at your ISP. All I need to do is find out what port my rival's server is listening on (e.g. using netstat(8)) and configure my own server to listen on the same port. Although I am unable to bind to this port while his server holds it, imagine what will happen when you reboot your system and my startup script happens to be run before my rival's! I get the port first, and now all requests will be redirected to my server. I'll leave to your imagination what nasty things might happen then.

      Of course, the ugly things will quickly be revealed, but not before the damage has been done.

    Basically, you can preassign each user a port, without them having to worry about finding a free one, as well as enforce MaxClients and similar values, by implementing the following scenario:

    For each user, have two configuration files: the main file, httpd.conf (non-writable by user) and the user's file, username.httpd.conf, where they can specify their own configuration parameters and override the ones defined in httpd.conf. Here is what the main configuration file looks like:

    
      httpd.conf
      ----------
      # Global/default settings, the user may override some of these
      ...
      ...
      # Included so that user can set his own configuration
      Include username.httpd.conf
      
      # User-specific settings which will override any potentially
      # dangerous configuration directives in username.httpd.conf
      ...
      ...
      
      username.httpd.conf
      -------------------
      # Settings that your user would like to add/override, like
      # <Location> and PerlModule directives, etc.

    Apache reads the global/default settings first. It then reads the Included username.httpd.conf file with whatever settings the user has chosen, and finally it reads the user-specific settings that we don't want the user to override, such as the port number. Even if the user changes the port number in his username.httpd.conf file, Apache reads our settings last, so they take precedence. Note that you can use <Perl> sections to make the configuration much easier, as sketched below.
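    For example, the per-user port assignment could be computed inside a <Perl> section of the master configuration file. A sketch; the user-ID lookup is a hypothetical scheme of mine, not a mod_perl feature:

      <Perl>
        # Variables set here become Apache directives of the same name.
        my $base_port = 8000;
        my $user_id   = 42;                   # assumed: looked up per user
        $Port       = $base_port + $user_id;  # sets "Port 8042"
        $MaxClients = 5;                      # cap this user's server
      </Perl>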

  3. A much better, but costly, solution is co-location. Let the user hook his (or your) stand-alone machine into your network, and forget about this user. Of course, either the user or you will have to undertake all of the system administration chores and it will cost your client more money.

    Who are the people who seek mod_perl support? They are people who run serious projects/businesses. Money is not usually an obstacle. They can afford a standalone box, thus achieving their goal of autonomy while keeping their ISP happy.

ISPs Providing mod_perl Services

Let's present some of the ISPs that provide mod_perl services.

  • A Canadian company called Baremetal (http://BareMetal.com/) provides mod_perl services via front-end proxy and a shared mod_perl backend, which, as their technical support claims, works reasonably well for folks that write good code. They're willing to run a dedicated backend mod_perl server for customers that need it. Some of their clients mix mod_cgi and mod_perl as a simple acceleration technique.

    Basic service price is $30/month.

    For more information see http://modperl-space.com/.

  • BSB-Software GmbH, located in Frankfurt, Germany, provides its own mod_perl applications for clients with standard requirements, thus avoiding the security risks, and allows trusted users to run their own code, which is usually reviewed by the company's system administrator. In the latter case, httpd.conf is under the control of the ISP, so everything is monitored.

    Please contact the company for the updated price list.

    For more information see http://www.bsb-software.com/.

  • Digital Wire Consulting, an open-source-driven e-business consulting company located in Zurich, Switzerland, provides shared and standalone mod_perl systems. The company operates internationally.

    Here are the specifics of this company:


    1. No restrictions in terms of CPU, bandwidth, etc. (so heavy-duty operations are better off with dedicated machines!)
    2. The user has to understand the risk involved in choosing a shared machine. Every user has their own virtual server.
    3. They offer dedicated servers at approximately $400/month (depending on configuration) + $500 setup.
    4. They don't support any proxy setups. If someone is serious about running mod_perl for a mission-critical application, then that person should be willing to pay for dedicated servers!
    5. For a shared server and a mid-size mod_perl Web site, they charge roughly $100/month for hosting only! Installation and setup are extra and based on the time spent (one hour is $120). Please contact the company for the updated price list.

    For more information see http://www.dwc.ch/.

  • Even The Bunker (which claims to be the UK's safest site for secure computing) supports mod_perl! Their standard server can include mod_perl on request. All of their users are provided with a dedicated machine.

    For more information see http://www.thebunker.net/hosting.htm.

  • For more ISPs supporting mod_perl, see http://perl.apache.org/isp.html

    If you are an ISP that supports mod_perl and is not listed on the above page, please contact the person who maintains the list.


The Perl You Need To Know - Part 3

Introduction

This article is the third in our series talking about the essential Perl basics that you should know before starting to program for mod_perl.

Variables Globally, Lexically Scoped and Fully Qualified

You will hear a lot about namespaces, symbol tables and lexical scoping in Perl discussions, but little of it will make any sense without a few key facts:

Symbols, Symbol Tables and Packages; Typeglobs

There are two important types of symbols: package global and lexical. We will talk about lexical symbols later; for now, we will talk only about package global symbols, which we will refer to as global symbols.

The names of pieces of your code (subroutine names) and the names of your global variables are symbols. Global symbols reside in one symbol table or another. The code itself and the data do not; the symbols are the names of pointers that point (indirectly) to the memory areas that contain the code and data. (Note for C/C++ programmers: We use the term `pointer' in a general sense of one piece of data referring to another piece of data, and not in the specific sense as used in C or C++.)

There is one symbol table for each package (which is why global symbols are really package global symbols).

You are always working in one package or another.

Just as in C, where the first function you write must be called main(), the first statement of your first Perl script is in package main::, which is the default package. Unless you say otherwise by using the package statement, your symbols are all in package main::. You should be aware that files and packages are not related. You can have any number of packages in a single file; and a single package can be in one file or spread among many files. However, it is common to have a single package in a single file. To declare a package you write:


    package mypackagename;

From that line on, you are in package mypackagename, and any symbols you declare reside in that package. When you create a symbol (variable, subroutine, etc.), Perl uses the name of the package in which you are currently working as a prefix to create the fully qualified name of the symbol.

When you create a symbol, Perl creates a symbol table entry for that symbol in the current package's symbol table (by default main::). Each symbol table entry is called a typeglob. Each typeglob can hold information on a scalar, an array, a hash, a subroutine (code), a filehandle, a directory handle and a format, all of which have the same name. So you see now that there are two indirections for a global variable: the symbol (the thing's name) points to its typeglob, and the entry in the typeglob for the thing's type (scalar, array, etc.) points to the data. If we had a scalar and an array with the same name, then their name would point to the same typeglob, but for each type of data the typeglob points somewhere different. Hence, the scalar's data and the array's data are completely separate and independent; they just happen to have the same name.

Most of the time, only one part of a typeglob is used (yes, it's a bit wasteful). By now, you know that you distinguish between them by using what the authors of the Camel book call a funny character. So if we have a scalar called `line', then we would refer to it in code as $line, and if we had an array of the same name, that would be written @line. Both would point to the same typeglob (which would be called *line), but because of the funny character (also known as decoration) Perl won't confuse the two. Of course, we might confuse ourselves, so some programmers don't ever use the same name for more than one type of variable.
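You can actually see both slots of a typeglob from code. Here is a short sketch (mine, not from the original text) using the *name{TYPE} syntax:

    $line = "a scalar";
    @line = ("an", "array");

    # Both names hang off the single typeglob *line:
    print ${ *line{SCALAR} }, "\n";   # prints "a scalar"
    print "@{ *line{ARRAY} }\n";      # prints "an array"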

Every global symbol is in some package's symbol table. To refer to a global symbol, we could write the fully qualified name, e.g. $main::line. If we are in the same package as the symbol, then we can omit the package name, e.g. $line (unless you use the strict pragma, in which case you will have to predeclare the variable using the vars pragma). We can also omit the package name if we have imported the symbol into our current package's namespace. If we want to refer to a symbol that is in another package and that we haven't imported, then we must use the fully qualified name, e.g. $otherpkg::box.

Most of the time, you do not need to use the fully qualified symbol name, because most of the time you will refer to package variables from within the package. This is like C++ class variables. You can work entirely within package main:: and never even know you are using a package, nor that the symbols have package names. In a way, this is a pity, because you may fail to learn about packages and they are extremely useful.

The exception is when you import the variable from another package. This creates an alias for the variable in the current package, so that you can access it without using the fully qualified name.

While global variables are useful for sharing data and are necessary in some contexts, it is usually wiser to minimize their use and use lexical variables, discussed next, instead.

Note that when you create a variable, the low-level business of allocating memory to store the information is handled automatically by Perl. The interpreter keeps track of the chunks of memory to which the pointers are pointing and takes care of undefining variables. When all references to a variable have ceased to exist, the Perl garbage collector is free to take back the used memory, ready for recycling. However, Perl almost never returns memory it has already used to the operating system during the lifetime of the process.

Lexical Variables and Symbols

The symbols for lexical variables (i.e. those declared using the keyword my) are the only symbols that do not live in a symbol table. Because of this, they are not available from outside the block in which they are declared. There is no typeglob associated with a lexical variable and a lexical variable can refer only to a scalar, an array or a hash.

If you need access to the data from outside the package, then you can return it from a subroutine, or you can create a global variable (i.e. one that has a package prefix) that points or refers to it, and return that. The reference must be global so that you can refer to it by a fully qualified name. But just like in C, try to avoid having global variables. Using OO methods generally solves this problem by providing methods to get and set the desired value within the object that can be lexically scoped inside the package and passed by reference.

The phrase ``lexical variable'' is a bit of a misnomer, as we are really talking about ``lexical symbols.'' The data can be referenced by a global symbol, too, and in such cases when the lexical symbol goes out of scope the data will still be accessible through the global symbol. This is perfectly legitimate and cannot be compared to the terrible mistake of taking a pointer to an automatic C variable and returning it from a function -- when the pointer is dereferenced there will be a segmentation fault. (Note for C/C++ programmers: having a function return a pointer to an auto variable is a disaster in C or C++; the Perl equivalent, returning a reference to a lexical variable created in a function is normal and useful.)
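A minimal sketch (mine, not from the original text) of that perfectly legitimate idiom:

    sub new_counter {
        my $count = 0;      # a lexical, created afresh on each call
        return \$count;     # safe: the data outlives the enclosing scope
    }

    my $c = new_counter();
    $$c++;
    $$c++;
    print $$c, "\n";        # prints 2; the data lived on via the reference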



  • my() vs. use vars:

    With use vars(), you are making an entry in the symbol table, and you are telling the compiler that you are going to be referencing that entry without an explicit package name.

    With my(), NO ENTRY IS PUT IN THE SYMBOL TABLE. The compiler figures out at compile time which my() variables (i.e. lexical variables) are the same as each other, and once you hit execute time you cannot look up those variables in the symbol table.

  • my() vs. local():

    local() gives a temporary, dynamically scoped value to a package-based scalar, array, hash, or glob -- that is to say, when the scope of definition is exited at runtime, the previous value (if any) is restored. References to such a variable are also global ... only the value changes. (Aside: that is what causes variable suicide. :)

    my() creates a lexically limited, nonpackage-based scalar, array, or hash -- the scope is determined at compile time, and outside it the variable is not accessible. Any references to such a variable remaining at runtime turn into unique anonymous variables on each scope exit. The sketch below illustrates the visibility difference.
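    A short sketch (mine, not from the original text): local() values are seen by subroutines called from within the scope, my() values are not:

      $x = "global";
      sub show { print "$x\n" }

      sub with_local {
          local $x = "localized";   # dynamic scope: callees see it
          show();                   # prints "localized"
      }

      sub with_my {
          my $x = "lexical";        # lexical scope: callees don't see it
          show();                   # prints "global"
      }

      with_local();
      with_my();
      show();                       # prints "global" -- value was restored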

use(), require(), do(), %INC and @INC Explained

The @INC Array

@INC is a special Perl variable that is the equivalent to the shell's PATH variable. Whereas PATH contains a list of directories to search for executables, @INC contains a list of directories from which Perl modules and libraries can be loaded.

When you use(), require() or do() a filename or a module, Perl gets a list of directories from the @INC variable and searches them for the file it was requested to load. If the file that you want to load is not located in one of the listed directories, then you have to tell Perl where to find the file. You can either provide a path relative to one of the directories in @INC, or you can provide the full path to the file.

The %INC Hash

%INC is another special Perl variable that is used to cache the names of the files and the modules that were successfully loaded and compiled by use(), require() or do() statements. Before attempting to load a file or a module with use() or require(), Perl checks whether it's already in the %INC hash. If it's there, then the loading and therefore the compilation are not performed at all. Otherwise, the file is loaded into memory and an attempt is made to compile it. do() does unconditional loading -- no lookup in the %INC hash is made.

If the file is successfully loaded and compiled, then a new key-value pair is added to %INC. The key is the name of the file or module as it was passed to one of the three functions we have just mentioned. If it was found in any of the @INC directories except ".", then the value is the full path to it in the file system.

The following examples will make it easier to understand the logic.

First, let's look at the contents of @INC on my system:


  % perl -e 'print join "\n", @INC'
  /usr/lib/perl5/5.00503/i386-linux
  /usr/lib/perl5/5.00503
  /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005
  .

Notice that . (current directory) is the last directory in the list.

Now let's load the module strict.pm and see the contents of %INC:


  % perl -e 'use strict; print map {"$_ => $INC{$_}\n"} keys %INC'
  
  strict.pm => /usr/lib/perl5/5.00503/strict.pm

Since strict.pm was found in the /usr/lib/perl5/5.00503/ directory and /usr/lib/perl5/5.00503/ is a part of @INC, %INC includes the full path as the value for the key strict.pm.

Now let's create the simplest module in /tmp/test.pm:


  test.pm
  -------
  1;

It does nothing but return a true value when loaded. Now let's load it in different ways:


  % cd /tmp
  % perl -e 'use test; print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => test.pm

Since the file was found relative to . (the current directory), the relative path is inserted as the value. If we alter @INC by adding /tmp to the end:


  % cd /tmp
  % perl -e 'BEGIN{push @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => test.pm

Here we still get the relative path, since the module was found first relative to ".". The directory /tmp was placed after . in the list. If we execute the same code from a different directory, then the "." directory won't match,


  % cd /
  % perl -e 'BEGIN{push @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => /tmp/test.pm

so we get the full path. We can also prepend the path with unshift(), so it will be used for matching before "." and therefore we will get the full path as well:


  % cd /tmp
  % perl -e 'BEGIN{unshift @INC, "/tmp"} use test; \
  print map {"$_ => $INC{$_}\n"} keys %INC'
  
  test.pm => /tmp/test.pm

The code:


  BEGIN{unshift @INC, "/tmp"}

can be replaced with the more elegant:


  use lib "/tmp";

This is almost equivalent to our BEGIN block and is the recommended approach.

These approaches to modifying @INC can be labor intensive: if you want to move the script around in the file system, then you have to modify the path. This can be painful, for example, when you move your scripts from a development to a production server.

There is a module called FindBin that solves this problem in the plain Perl world but, unfortunately, it won't work under mod_perl, since it's a module and, like any module, it's loaded only once. So the first script to use it will have all the settings correct, but the rest of the scripts will not if they live in a different directory from the first.

For the sake of completeness, I'll present this module anyway.

If you use this module, then you don't need to write a hard-coded path. The following snippet does all the work for you (the file is /tmp/load.pl):


  load.pl
  -------
  #!/usr/bin/perl
  
  use FindBin ();
  use lib "$FindBin::Bin";
  use test;
  print "test.pm => $INC{'test.pm'}\n";

In the above example, $FindBin::Bin is equal to /tmp. If we move the script somewhere else, e.g. /home/x, then $FindBin::Bin equals /home/x.


  % /tmp/load.pl
  
  test.pm => /tmp/test.pm

This is just like use lib except that no hard-coded path is required.

You can use this workaround to make it work under mod_perl.


  do 'FindBin.pm';
  unshift @INC, "$FindBin::Bin";
  require test;
  #maybe test::import( ... ) here if need to import stuff

This has a slight overhead, because it will load from disk and recompile the FindBin module on each request. So it may not be worth it.

Modules, Libraries and Program Files

Before we proceed, let's define what we mean by module, library and program file.

  • Libraries

    These are files that contain Perl subroutines and other code.

    When these are used to break up a large program into manageable chunks, they don't generally include a package declaration; when they are used as subroutine libraries, they often do have a package declaration.

    Their last statement must return true; a simple 1; statement ensures that.

    They can be named in any way desired, but generally their extension is .pl.

    Examples:

    
      config.pl
      ----------
      # No package so defaults to main::
      $dir = "/home/httpd/cgi-bin";
      $cgi = "/cgi-bin";
      1;
    
      mysubs.pl
      ----------
      # No package so defaults to main::
      sub print_header{
        print "Content-type: text/plain\r\n\r\n";
      }
      1;
    
      web.pl
      ------------
      package web ;
      # Call like this: web::print_with_class('loud',"Don't shout!");
      sub print_with_class{
        my( $class, $text ) = @_ ;
        print qq{<span class="$class">$text</span>};
      }
      1;
  • Modules

    A file that contains Perl subroutines and other code.

    It generally declares a package name at the beginning.

    Modules are generally used either as function libraries (a role for which .pl files are still, though less commonly, used), or as object libraries, where a module is used to define a class and its methods.

    Its last statement returns true.

    The naming convention requires it to have a .pm extension.

    Example:

    
      MyModule.pm
      -----------
      package My::Module;
      $My::Module::VERSION = 0.01;
      
      sub new{ return bless {}, shift;}
      END { print "Quitting\n"}
      1;
  • Program Files

    Many Perl programs exist as a single file. Under Linux and other Unix-like operating systems, the file often has no suffix, since the operating system can determine from the first (shebang) line that it is a Perl script; when Apache executes the code, there is a variety of ways to tell how and when the file should be executed. Under Windows, a suffix is normally used, for example .pl or .plx.

    The program file will normally require() any libraries and use() any modules it requires for execution.

    It will contain Perl code but won't usually have any package names.

    Its last statement may return anything or nothing.

require()


require() reads a file containing Perl code and compiles it. Before attempting to load the file, it looks up the argument in %INC to see whether it has already been loaded. If it has, then require() just returns without doing a thing. Otherwise, an attempt will be made to load and compile the file.

require() has to locate the file it is about to load. If the argument is a full path to the file, then it just tries to read it. For example:


  require "/home/httpd/perl/mylibs.pl";

If the path is relative, then require() will attempt to search for the file in all the directories listed in @INC. For example:


  require "mylibs.pl";

If there is more than one occurrence of the file with the same name in the directories listed in @INC, then the first occurrence will be used.

The file must return TRUE as the last statement to indicate successful execution of any initialization code. Since you never know what changes the file will go through in the future, you cannot be sure that the last statement will always return TRUE. That's why the suggestion is to put ``1;'' at the end of the file.

Although you should use the real filename for most files, if the file is a module, then you may use the following convention instead:


  require My::Module;

This is equivalent to:


  require "My/Module.pm";

If require() fails to load the file, either because it couldn't find the file in question, the code failed to compile, or the file didn't return TRUE, then the program will die(). To prevent this, the require() statement can be enclosed in an eval() exception-handling block, as in this example:


  require.pl
  ----------
  #!/usr/bin/perl -w
  
  eval { require "/file/that/does/not/exists"};
  if ($@) {
    print "Failed to load, because : $@"
  }
  print "\nHello\n";

When we execute the program:


  % ./require.pl
  
  Failed to load, because : Can't locate /file/that/does/not/exists in
  @INC (@INC contains: /usr/lib/perl5/5.00503/i386-linux
  /usr/lib/perl5/5.00503 /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005 .) at require.pl line 3.
  
  Hello

We see that the program didn't die(), because Hello was printed. This trick is useful when you want to check whether a user has some module installed. If she hasn't, then it's not critical, because the program can run with reduced functionality without this module.
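
For example, here is a minimal sketch of loading an optional module; Compress::Zlib is used just for illustration, so substitute whatever module is optional in your application:


  # try to load an optional module and remember whether it worked
  my $have_zlib = eval { require Compress::Zlib; 1 };
  if ($have_zlib) {
    print "Compress::Zlib found, responses will be compressed\n";
  }
  else {
    print "Compress::Zlib missing, running without compression\n";
  }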

If we remove the eval() part and try again:


  require1.pl
  -----------
  #!/usr/bin/perl -w
  
  require "/file/that/does/not/exists";
  print "\nHello\n";

  % ./require1.pl
  
  Can't locate /file/that/does/not/exists in @INC (@INC contains:
  /usr/lib/perl5/5.00503/i386-linux /usr/lib/perl5/5.00503
  /usr/lib/perl5/site_perl/5.005/i386-linux
  /usr/lib/perl5/site_perl/5.005 .) at require1.pl line 3.

The program just die()s in the last example, which is what you want in most cases.

For more information, refer to the perlfunc manpage.

use()

use(), just like require(), loads and compiles files containing Perl code, but it works with modules only. The only way to pass a module to load is by its module name and not its filename. If the module is located in MyCode.pm, then the correct way to use() it is:


  use MyCode

and not:


  use "MyCode.pm"

use() translates the passed argument into a file name replacing :: with the operating system's path separator (normally /) and appending .pm at the end. So My::Module becomes My/Module.pm.
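
That translation is easy to emulate with require(). A minimal sketch (My::Module is a hypothetical module name):


  my $module = "My::Module";
  (my $file = $module) =~ s{::}{/}g;
  $file .= ".pm";        # now "My/Module.pm"
  require $file;         # the load use() performs at compile time,
                         # before calling My::Module->import()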

use() is equivalent to:


 BEGIN { require Module; Module->import(LIST); }

Internally, it calls require() to do the loading and compilation chores. When require() finishes its job, import() is called, unless an empty list () is given as the import list. The following pairs are equivalent:


  use MyModule;
  BEGIN {require MyModule; MyModule->import; }
  
  use MyModule qw(foo bar);
  BEGIN {require MyModule; MyModule->import("foo","bar"); }
  
  use MyModule ();
  BEGIN {require MyModule; }

The first pair exports the default tags. This happens if the module sets @EXPORT to a list of tags to be exported by default. The module's manpage normally describes which tags are exported by default.

The second pair exports only the tags passed as arguments.

The third pair describes the case where the caller does not want any symbols to be imported.

import() is not a built-in function; it's just an ordinary static method call into the ``MyModule'' package to tell the module to import the list of features back into the current package. See the Exporter manpage for more information.

When you write your own modules, always remember that it's better to use @EXPORT_OK instead of @EXPORT, since the former doesn't export symbols unless asked to. Exports pollute the namespace of the module user. Also, avoid short or common symbol names to reduce the risk of name clashes.
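
Here is a minimal sketch of a module written along these lines; the module and function names are invented for the example:


  My/Module.pm
  ------------
  package My::Module;
  use strict;
  use Exporter ();
  use vars qw(@ISA @EXPORT_OK);
  @ISA       = qw(Exporter);
  @EXPORT_OK = qw(foo);     # exported only when explicitly requested
  sub foo { return "foo" }
  sub bar { return "bar" }  # never exported
  1;

With this setup, use My::Module qw(foo); imports foo(), while a plain use My::Module; imports nothing at all.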

When functions and variables aren't exported, you can still access them using their fully qualified names, like $My::Module::bar or My::Module::foo(). By convention, you can use a leading underscore on names to informally indicate that they are internal and not for public use.

There's a corresponding ``no'' command that un-imports symbols imported by use, i.e., it calls Module->unimport(LIST) instead of import().

do()

While do() behaves almost identically to require(), it reloads the file unconditionally. It doesn't check %INC to see whether the file was already loaded.

If do() cannot read the file, then it returns undef and sets $! to report the error. If do() can read the file but cannot compile it, then it returns undef and puts an error message in $@. If the file is successfully compiled, then do() returns the value of the last expression evaluated.
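
Since do() skips the %INC check, it is handy when you deliberately want to re-execute a file. A roughly equivalent trick with require(), sketched below with a hypothetical module name, is to remove the module's entry from %INC first:


  # make Perl forget that My/Config.pm was ever loaded
  # (My::Config is a hypothetical module name)
  delete $INC{'My/Config.pm'};
  require My::Config;   # read and compiled again from disk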

References

  • An article by Mark-Jason Dominus about how Perl handles variables and namespaces, and the difference between use vars() and my() - http://www.plover.com/~mjd/perl/FAQs/Namespaces.html .
  • For an in-depth explanation of Perl data types, see chapters 3 and 6 in the book ``Advanced Perl Programming'' by Sriram Srinivasan.

    And, of course, ``Programming Perl'' by L. Wall, T. Christiansen and J. Orwant (also known as the ``Camel'' book, named after the camel picture on the cover). Look at chapters 10, 11 and 21.

  • The Exporter, perlvar, perlmod and perlmodlib man pages.

Where Wizards Fear To Tread

So you're a Perl master. You've got XS sorted. You know how the internals work. Hey, there's nothing we can teach you on perl.com that you don't already know. You think? Where Wizards Fear To Tread brings you the information you won't find anywhere else concerning the very top level of Perl hackery.

Putting Down Your Roots

This month, we look at the Perl op tree. Every Perl program is compiled into an internal representation before it is executed. Functions, subroutine calls, variable accesses, control structures, and all that makes up a Perl program, are converted into a series of different fundamental operations (ops) and these ops are strung together into a tree data structure.

For more on the different types of ops available, how they fit together, and how to manipulate them with the B compiler module, look at the Perl 5 internals tutorial. Right now, though, we're going to take things a step further.

B and Beyond With B::Utils

The B module allows us to get at a wealth of information about an op, but it can become incredibly frustrating to find the op you want to deal with, or to perform simple manipulations on a range of ops. It also offers limited functionality for navigating around the op tree, meaning that you need to hold onto a load of additional state about which op is where. This gets complicated quickly. Finally, it's not easy to get at the op trees for particular subroutines, or indeed for all subroutines, both named and anonymous.

B::Utils was created at the request of Michael Schwern to address these issues. It offers much more high-level functionality for navigating the tree, such as the ability to move ``upward'' or ``backward,'' to return the old name of an op that has been optimized away, to get a list of an op's children, and so on. It can return arrays of anonymous subroutines, and hashes of subroutine op roots and starts. It also contains functions for walking through the op tree from various starting points in various orders, optionally filtering out ops that don't match certain conditions while performing actions on the rest. Finally, B::Utils provides carp and croak routines which report errors from the point of view of the original source code.

But one of the most useful functions provided by B::Utils is the opgrep routine. This allows you to filter a series of ops based on a pattern that represents their attributes and their position in a tree. The major advantage over doing it yourself is that opgrep takes care of making sure that the attributes are present before testing them - the seasoned B user is likely to be accustomed to the carnage that results from accidentally trying to call name on a B::NULL object.

For instance, we can find all the subroutine calls in a program with


    walkallops_filtered (
        sub { opgrep( { name => "entersub" }, @_) },
        sub { print "Found one: $_[0]\n"; }
    );

opgrep supports alternation and negation of attribute queries. For instance, here are all the scalar variable accesses, whether to globals or lexicals:


    @svs = opgrep ( { name => ["padsv", "gvsv"] }, @ops)

And as for checking an op's position in the tree, here are all the exec ops followed by a nextstate and then followed by something other than exit, warn or die:


  walkallops_filtered(
      sub { opgrep( {
                        name => "exec",
                        next => {
                           name    => "nextstate",
                           sibling => { 
                                         name => [qw(! exit warn die)] 
                                      }
                                }
                    }, @_)},
      sub {
            carp("Statement unlikely to be reached");
            carp("\t(Maybe you meant system() when you said exec()?)\n");
      }
  )

Don't Do That, Do This

So, what can we do with all this? The answer is, of course, ``anything we want.'' If you can mess about with the op tree, then you have complete control over Perl's operation. Let's take an example.

Damian Conway recently released the Acme::Don't module, which doesn't do anything:


    don't { print "Something\n" }

doesn't print anything. Very clever. But not clever enough. You see, I like double negatives:


    my $x = 1;
    don't { print "Something\n" } unless $x;

doesn't print anything either, and if you like double negatives, then you might agree that it should print something. But how on earth are we going to get Perl to do something when a test proves false? By messing about with the op tree, of course.

The way to solve any problem like this is to think about the op tree that we've currently got, work out what we'd rather do instead, and work out the differences between the op trees. Then, we write something that looks for a given pattern in a program's op tree and modifies it to be what we want.

There are several ways of achieving what we want, but the simplest one is this: add a second parameter to don't which, if set, actually does do the code. This allows us to replace any occurrence of

    don't { print "Something\n" } if (condition);

with


    don't(sub { print "Something\n" }, 1) unless (condition);

Let's now look at this in terms of op trees. Here's the relevant part of the op tree for don't { ... } if $x, produced by running perl -MO=Terse and then using sed to trim out the unsightly hex addresses:


    UNOP  null
        LOGOP  and
            UNOP  null [15]
                SVOP  *x
            UNOP  entersub [2]
                UNOP  null [141]
                    OP  pushmark
                    UNOP  refgen
                        UNOP  null [141]
                            OP  pushmark
                            SVOP  anoncode  SPECIAL #0 Nullsv
                    UNOP  null [17]
                        SVOP  *don::t

As we can see, the if is represented as an and op internally, which makes sense if you think about it. The two ``legs'' of the and, called ``first'' and ``other,'' are a call to fetch the value of $x, and a subroutine call. Look at the subroutine call closely: the ops ``inside'' it set up a mark to say where the parameters start, push a reference to anonymous code (that's our { ... }) onto the stack, and then push the glob for *don::t on there.

So, we need to do two things: We need to insert another parameter between refgen and the null attached to *don::t, and we need to invert the sense of the test.

Now that we know what we've got to do, let's start doing it. Remember our plan: stage one, write code to find the pattern.

This is actually pretty simple: We're looking for either an and or an or op, where the ``other'' leg of the op is going to be a call to *don::t. However, we have to be a bit clever here, since Perl internally performs a few optimizations on the op tree that even the B::* reporting modules don't tell you about. When Perl threads the next pointers around an op tree, it does something special for a short-circuiting binary op like and or or - it sets the other pointer not to the first sibling in the tree, but to the first op in execution order. In this case, that's pushmark, as we can see from running perl -MO=Terse,exec:


    LOGOP (0x80fa008) and
    AND => {
        OP (0x80f9f88) pushmark
        OP (0x80f9f20) pushmark
        SVOP (0x80f9ec0) anoncode  SPECIAL #0 Nullsv
        ...

With this knowledge, we can create a pattern to pass to opgrep:


    {
        name => ["and", "or"],
        other => {
            name => "pushmark",
            sibling => { next => { name => "gv" }}
        }
    }

Unfortunately, this doesn't tell us the whole story, since we actually need to check that the subroutine call is to don't, rather than to any other given subroutine that might be called conditionally. Hence, our filter looks like this:


    sub {
        my $op = shift;
        opgrep(
            {
                name => ["and", "or"],
                other => {
                    name => "pushmark",
                    sibling => { next => { name => "gv" }}
                }
            }, $op) or return;
        my $gv = $op->other->sibling->next->gv;
        return unless $gv->STASH->NAME eq "don" and $gv->NAME eq "t";
        return 1;
    }

We grab the GV (we know exactly where it's going to be because of our pattern!) and test that it's in the don stash and is called t.

Part one done - we have located the ops that we want to change. Now how on earth do we change ops in an op tree?

Fixing It Up With B::Generate

B::Generate was written to allow users to create their own ops and insert them into the op tree. The original intent was to be able to create bytecode for other languages to be run on the Perl virtual machine, but it's found plenty of use manipulating existing Perl op trees.

It provides ``constructor'' methods in all of the B::*OP classes, and makes many of the accessor methods read-write instead of read-only. Let's see how we can apply it to this problem. Remember that we want to negate the sense of the test, and then to add another argument to the call to don't.

For the first of these tasks, B::Generate provides the handy mutate and convert methods on each B::OP-derived object to change one op's type into another. The decision as to which of them to use is slightly complex: mutate can only be used for ops of the same type - for instance, you cannot use it to mutate a binary op into a unary op. However, convert produces a completely new op, which needs to be threaded back into the op tree. So convert is much more powerful, but mutate is much more convenient. In this case, since we're just flipping between and and or, we can get away with using mutate:


    require B::Generate;
    my $op = shift;
    if ($op->name eq "and") {
        $op->mutate("or");
    } else {
        $op->mutate("and");
    }

Now to insert the additional parameter. For this, remember that entersub works by popping off the top entry in the stack and calling that as a subroutine, while the remaining stack entries become parameters to the subroutine. So we want to add a const op to put a constant on the stack. We use the B::SVOP->new constructor to create a new one, and then thread the next pointers so that Perl's main loop will call it between $op->other->sibling (the refgen op) and the op after it (the GV which represents *don::t).


    my $to_insert = $op->other->sibling;
    my $newop = B::SVOP->new("const", 0, 1);
    $newop->next($to_insert->next);
    $to_insert->next($newop);

All that's left is to replace the definition of don't so that, depending on the second parameter, it sometimes does execute the code:


    sub don't (&;$) { $_[0]->() if $_[1] }

And there we have it:


    package Acme::Don't;
    CHECK {
        use B::Utils qw(opgrep walkallops_filtered);
        walkallops_filtered(
            sub {
                my $op = shift;
                opgrep(
                {
                    name => ["and", "or"],
                    other => {
                        name => "pushmark",
                        sibling => { next => { name => "gv" }}
                    }
                }, $op) or return;
                my $gv = $op->other->sibling->next->gv;
                return unless $gv->STASH->NAME eq "don" and $gv->NAME eq "t";
                return 1;
            },
            sub {
                require B::Generate;
                my $op = shift;
                if ($op->name eq "and") {
                    $op->mutate("or");
                } else {
                    $op->mutate("and");
                }
                
                my $to_insert = $op->other->sibling;
                my $newop = B::SVOP->new("const", 0, 1);
                $newop->next($to_insert->next);
                $to_insert->next($newop);
            }
       );
    }
    
    sub don't (&;$) { $_[0]->() if $_[1] }

This will turn


    $false = 0; $true = 1;
    
    don't { print "Testing" } if $false;
    don't { print "Testing again" } unless $true;

into


    $false = 0; $true = 1;
    
    don't(sub { print "Testing" }, 1) unless $false;
    don't(sub { print "Testing again" }, 1) if $true;

inverting the conditions and making don't execute the code. A neat trick? We think so.

Where To From Here?

But that's not all. This doesn't cater for some of the more complex constructions people can create, such as


    if ($x) {
        do_something();
        don't { do_the_other_thing() };
        do_something_else();
    }

or even


    if ($x) {
        do_that();
        don't { do_this() }
    } else {
        do_the_other();
        don't { do_something_else() }
    }

But this can be solved in just the same way. For instance, you want to turn the first one into


    if ($x) {
        do_something();
        do_something_else();
    } else {
        don't(sub { do_the_other_thing() }, 1);
    }

and the second into


    if ($x) {
        do_that();
        don't(sub { do_something_else() }, 1);
    } else {
        do_the_other();
        don't(sub { do_this() }, 1);
    }

Both of these transformations can be done by applying the method above: compare the op trees, work out the difference, find the pattern you want to look for, then write some code to manipulate the op tree into the desired output. An easy task for the interested reader ...

And we really haven't scratched the surface of what can be done with B::Generate and B::Utils. The B::Generate test suite shows what sort of mayhem can be caused to existing Perl programs, and there have been experiments using B::Generate to generate op trees for other languages: a B::Generate port of Leon Brocard's shiny Ruby interpreter could produce Perl bytecode for simple Ruby programs, and chromatic is working on an idea to turn Perl programs into XML, manipulate them, and use B::Generate to turn them back into Perl op trees.

Later in our ``Where Wizards Fear To Tread'' series, we'll have articles about Perl and Java interaction, iThreads, and more.

The Perl You Need To Know - Part 2

Introduction

In this article, we continue to talk about the essential Perl basics that you should know before starting to program for mod_perl.

Tracing Warnings Reports

Sometimes it's hard to understand what a warning is complaining about. You see the source code, but you cannot understand why some specific snippet produces that warning. The mystery often results from the fact that the code can be called from different places if it's located inside a subroutine.

Here is an example:


  warnings.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  correct();
  incorrect();

  sub correct{
    print_value("Perl");
  }

  sub incorrect{
    print_value();
  }

  sub print_value{
    my $var = shift;
    print "My value is $var\n";
  }

In the code above, print_value() prints the passed value. Subroutine correct() passes the value to print, but in subroutine incorrect() we forgot to pass it. When we run the script:


  % ./warnings.pl

we get the warning:


  Use of uninitialized value at ./warnings.pl line 16.

Perl complains about an undefined variable $var at the line that attempts to print its value:


  print "My value is $var\n";

But how do we know why it is undefined? The reason here is obviously that the calling function didn't pass the argument. But how do we know who the caller was? In our example there are two possible callers; in the general case there can be many of them, perhaps located in other files.

We can use the caller() function, which tells us who called us, but even that might not be enough: It's possible to have a longer sequence of called subroutines, not just two. For example, here it is sub third() that is at fault -- the warning is triggered inside first(), so caller() there would name only second(), and you would have to add caller() to second() as well to discover third():


  sub third{
    second();
  }
  sub second{
    my $var = shift;
    first($var);
  }
  sub first{
    my $var = shift;
   print "Var = $var\n"
  }
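
For reference, caller() invoked with no arguments reports just one frame up the stack. Here is a sketch of using it inside first(); note that it would name second(), not the real culprit third():


  sub first{
    my $var = shift;
    unless (defined $var) {
      my ($package, $filename, $line) = caller;
      warn "first() called without a value from $filename line $line\n";
    }
    print "Var = $var\n";
  }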

The solution is quite simple. What we need is a full call stack trace leading to the call that triggered the warning.

The Carp module comes to our aid with its cluck() function. Let's modify the script by adding a couple of lines. The rest of the script is unchanged.


  warnings2.pl
  -----------
  #!/usr/bin/perl -w

  use strict;
  use Carp ();
  local $SIG{__WARN__} = \&Carp::cluck;

  correct();
  incorrect();

  sub correct{
    print_value("Perl");
  }

  sub incorrect{
    print_value();
  }

  sub print_value{
    my $var = shift;
    print "My value is $var\n";
  }

Now when we execute it, we see:


  Use of uninitialized value at ./warnings2.pl line 19.
    main::print_value() called at ./warnings2.pl line 14
    main::incorrect() called at ./warnings2.pl line 7

Take a moment to understand the call stack trace. The deepest calls are printed first. So the second line tells us that the warning was triggered in print_value(); the third, that print_value() was called by the subroutine incorrect().


  script => incorrect() => print_value()


We go into incorrect() and indeed see that we forgot to pass the variable. Of course, when you write a subroutine such as print_value, it would be a good idea to check the passed arguments before starting execution. We omitted that step to contrive an easily debugged example.

Sure, you say, I could find that problem by simple inspection of the code!

Well, you're right. But I promise you that your task would be quite complicated and time consuming if your code had some thousands of lines. In addition, under mod_perl, certain uses of the eval operator and ``here documents'' are known to throw off Perl's line numbering, so the messages reporting warnings and errors can have incorrect line numbers. This can easily be fixed by helping the compiler with the #line directive. If you put the following at the beginning of a line in your script:


 #line 125

then it will tell the compiler that the next line is number 125 for reporting purposes. The lines that follow will, of course, be numbered relative to it.
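
For example, this sketch tells Perl to report the line after the directive as line 100 of a file called generated.pl (both the number and the file name are made up for the example):


  #!/usr/bin/perl -w
  #line 100 "generated.pl"
  warn "where am I?";   # reported at generated.pl line 100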

Getting the trace helps a lot.

my() Scoped Variable in Nested Subroutines

Before we proceed, let's make the assumption that we want to develop the code under the strict pragma. We will use lexically scoped variables (with the help of the my() operator) whenever possible.

The Poison

Let's look at this code:


  nested.pl
  -----------
  #!/usr/bin/perl

  use strict;

  sub print_power_of_2 {
    my $x = shift;

    sub power_of_2 {
      return $x ** 2; 
    }

    my $result = power_of_2();
    print "$x^2 = $result\n";
  }

  print_power_of_2(5);
  print_power_of_2(6);

Don't let the weird subroutine names fool you: the print_power_of_2() subroutine should print the square of the number passed to it. Let's run the code and see whether it works:


  % ./nested.pl

  5^2 = 25
  6^2 = 25

Ouch, something is wrong. Maybe there is a bug in Perl and it doesn't work correctly with the number 6? Let's try again using 5 and 7:


  print_power_of_2(5);
  print_power_of_2(7);

And run it:


  % ./nested.pl

  5^2 = 25
  7^2 = 25

Wow, does it work only for 5? How about using 3 and 5:


  print_power_of_2(3);
  print_power_of_2(5);

and the result is:


  % ./nested.pl

  3^2 = 9
  5^2 = 9

Now we start to understand -- only the first call to the print_power_of_2() function works correctly. This makes us think that our code has some kind of memory for the results of the first execution, or it ignores the arguments in subsequent executions.

The Diagnosis

Let's follow the guidelines and use the -w flag:


  #!/usr/bin/perl -w

Under Perl version 5.6.0+ we use the warnings pragma:


  #!/usr/bin/perl
  use warnings;

Now execute the code:


  % ./nested.pl

  Variable "$x" will not stay shared at ./nested.pl line 9.
  5^2 = 25
  6^2 = 25

We have never seen such a warning message before, and we don't quite understand what it means. The diagnostics pragma will certainly help us. Let's load it before the strict pragma in our code:


  #!/usr/bin/perl -w

  use diagnostics;
  use strict;

And execute it:


  % ./nested.pl

Variable "$x" will not stay shared at ./nested.pl line 10 (#1)

(W) An inner (nested) named subroutine is referencing a lexical variable defined in an outer subroutine.

When the inner subroutine is called, it will probably see the value of the outer subroutine's variable as it was before and during the *first* call to the outer subroutine; in this case, after the first call to the outer subroutine is complete, the inner and outer subroutines will no longer share a common value for the variable. In other words, the variable will no longer be shared.

Furthermore, if the outer subroutine is anonymous and references a lexical variable outside itself, then the outer and inner subroutines will never share the given variable.

This problem can usually be solved by making the inner subroutine anonymous, using the sub {} syntax. When inner anonymous subs that reference variables in outer subroutines are called or referenced, they are automatically rebound to the current values of such variables.


  5^2 = 25
  6^2 = 25

Well, now everything is clear. We have the inner subroutine power_of_2() and the outer subroutine print_power_of_2() in our code.

When the inner power_of_2() subroutine is called for the first time, it sees the value of the outer print_power_of_2() subroutine's $x variable. On subsequent calls the inner subroutine's $x variable won't be updated, no matter what new values are given to $x in the outer subroutine. There are two copies of the $x variable, no longer a single one shared by the two routines.

The Remedy

The diagnostics pragma suggests that the problem can be solved by making the inner subroutine anonymous.

An anonymous subroutine can act as a closure with respect to lexically scoped variables. Basically, this means that if you define a subroutine in a particular lexical context at a particular moment, then it will run in that same context later, even if called from outside that context. The upshot of this is that when the subroutine runs, you get the same copies of the lexically scoped variables that were visible when the subroutine was defined. So you can pass arguments to a function when you define it, as well as when you invoke it.

Let's rewrite the code to use this technique:


  anonymous.pl
  --------------
  #!/usr/bin/perl

  use strict;

  sub print_power_of_2 {
    my $x = shift;

    my $func_ref = sub {
      return $x ** 2;
    };

    my $result = &$func_ref();
    print "$x^2 = $result\n";
  }

  print_power_of_2(5);
  print_power_of_2(6);

Now $func_ref contains a reference to an anonymous function, which we later use when we need to get the power of two. (In Perl, a function is the same thing as a subroutine.) Since it is anonymous, the function will automatically be rebound to the new value of the outer scoped variable $x, and the results will now be as expected.

Let's verify:


  % ./anonymous.pl

  5^2 = 25
  6^2 = 36

So we can see that the problem is solved.
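
Closures are useful well beyond this particular fix. A classic sketch is a counter ``factory,'' where each call to make_counter() captures its own fresh copy of $count:


  sub make_counter {
    my $count = shift || 0;
    return sub { return ++$count };
  }

  my $c1 = make_counter();
  my $c2 = make_counter(10);
  print $c1->(), " ", $c1->(), " ", $c2->(), "\n";  # prints: 1 2 11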

When You Cannot Get Rid of the Inner Subroutine

First, you might wonder, why in the world would someone need to define an inner subroutine? Well, for example, to reduce some of Perl's script startup overhead you might decide to write a daemon that compiles the scripts and modules only once and caches the pre-compiled code in memory. When some script is to be executed, you just tell the daemon the name of the script to run, and it will do the rest, much faster, since compilation has already taken place.

Seems like an easy task, and it is. The only problem is: once the script is compiled, how do you execute it? Or let's put it another way: after it has been executed for the first time and stays compiled in the daemon's memory, how do you call it again? If you could get all developers to code their scripts so that each has a subroutine called run() that actually executes the code in the script, then we'd have solved half the problem.

But how does the daemon know to refer to some specific script if they all run in the main:: namespace? One solution might be to ask the developers to declare a package in each and every script, with the package name derived from the script name. However, since there is a chance that there will be more than one script with the same name residing in different directories, the directory has to be part of the package name too, in order to prevent namespace collisions. And don't forget that a script may be moved from one directory to another, so you would have to make sure that the package name is corrected each time the script gets moved.

But why enforce these strange rules on developers, when we can arrange for our daemon to do this work? Every script that the daemon is about to execute for the first time should be wrapped inside a package whose name is constructed from the mangled path to the script, with a subroutine called run(). For example, if the daemon is about to execute the script /tmp/hello.pl:


  hello.pl
  --------
  #!/usr/bin/perl
  print "Hello\n";

then prior to running it, the daemon will change the code to be:


  wrapped_hello.pl
  ----------------
  package cache::tmp::hello_2epl;

  sub run{
    #!/usr/bin/perl 
    print "Hello\n";
  }

The package name is constructed from the prefix cache::, each directory separation slash is replaced with ::, and nonalphanumeric characters are encoded so that for example . (a dot) becomes _2e (an underscore followed by the ASCII code for a dot in hex representation).


 % perl -e 'printf "%x",ord(".")'

prints: 2e. The encoding is the same as URL escaping, except that URL escaping uses the % character (%2E); since % has a special meaning in Perl (it's the prefix of a hash variable), an underscore is used instead.
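
Here is a hedged sketch of the mangling just described; script2package() is a name invented for this example:


  sub script2package {
    my $path = shift;            # e.g. "/tmp/hello.pl"
    $path =~ s{^/+}{};           # drop the leading slash
    $path =~ s{([^\w/])}{sprintf "_%02x", ord $1}ge;  # "." => "_2e"
    $path =~ s{/}{::}g;          # directory slashes => "::"
    return "cache::$path";       # "cache::tmp::hello_2epl"
  }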

Now when the daemon is requested to execute the script /tmp/hello.pl, all it has to do is to build the package name as before based on the location of the script and call its run() subroutine:


  use cache::tmp::hello_2epl;
  cache::tmp::hello_2epl::run();

We have just written a partial prototype of the daemon we wanted. The only outstanding problem is how to pass the path to the script to the daemon. This detail is left as an exercise for the reader.

If you are familiar with the Apache::Registry module, then you know that it works in almost the same way. It uses a different package prefix and the generic function is called handler() and not run(). The scripts to run are passed through the HTTP protocol's headers.

Now you can see that there are cases where your normal subroutines can become inner ones. If your script was as simple as:


  simple.pl
  ---------
  #!/usr/bin/perl 
  sub hello { print "Hello" }
  hello();

then wrapped into a run() subroutine it becomes:


  wrapped_simple.pl
  -----------------
  package cache::simple_2epl;

  sub run{
    #!/usr/bin/perl 
    sub hello { print "Hello" }
    hello();
  }

Therefore, hello() is now an inner subroutine, and if it uses my() scoped variables that are defined and altered outside it, the code won't work as you expect starting from the second call, as was explained in the previous section.

Remedies for Inner Subroutines

First of all, there is nothing to worry about as long as you don't forget to turn warnings on. If you do happen to have the ``my() Scoped Variable in Nested Subroutines'' problem, then Perl will always alert you.

Given that you have a script with this problem, what are the ways to solve it? There are many, and we will discuss some of them here.

We will use the following code to show the different solutions.


  multirun.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run{
    my $counter = 0;

    increment_counter();
    increment_counter();

    sub increment_counter{
      $counter++;
      print "Counter is equal to $counter !\n";
    }

  } # end of sub run

This code executes the run() subroutine three times, which in turn initializes the $counter variable to 0 each time it is executed and then calls the inner subroutine increment_counter() twice. Sub increment_counter() prints $counter's value after incrementing it. One might expect to see the following output:


  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 3]
  Counter is equal to 1 !
  Counter is equal to 2 !

But as we have already learned from the previous sections, this is not what we are going to see. Indeed, when we run the script we see:


  % ./multirun.pl
  Variable "$counter" will not stay shared at ./nested.pl line 18.
  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 3 !
  Counter is equal to 4 !
  run: [time 3]
  Counter is equal to 5 !
  Counter is equal to 6 !

Obviously, the $counter variable is not reinitialized on each execution of run(). It retains its value from the previous execution, and sub increment_counter() increments that.

One of the workarounds is to use globally declared variables, with the vars pragma.


  multirun1.pl
  -----------
  #!/usr/bin/perl -w

  use strict;
  use vars qw($counter);

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    $counter = 0;

    increment_counter();
    increment_counter();

    sub increment_counter{
      $counter++;
      print "Counter is equal to $counter !\n";
    }

  } # end of sub run

If you run this and the other solutions offered below, then the expected output will be generated:


  % ./multirun1.pl

  run: [time 1]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 2]
  Counter is equal to 1 !
  Counter is equal to 2 !
  run: [time 3]
  Counter is equal to 1 !
  Counter is equal to 2 !

By the way, the warning we saw before has gone, and so has the problem, since there is no my() (lexically defined) variable used in the nested subroutine.

Another approach is to use fully qualified variables. This is better, since less memory will be used, but it adds a typing overhead:


  multirun2.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    $main::counter = 0;

    increment_counter();
    increment_counter();

    sub increment_counter{
      $main::counter++;
      print "Counter is equal to $main::counter !\n";
    }

  } # end of sub run

You can also pass the variable to the subroutine by value and make the subroutine return it after it has been updated. This adds time and memory overhead, so it may not be a good idea if the variable can be very large or if execution speed is an issue.

Don't rely on the variable being small during the development of the application; it can grow quite big in situations you don't expect. For example, a simple HTML form text entry field can return a few megabytes of data if one of your users is bored and wants to test how good your code is. It's not uncommon to see users copy-and-paste 10MB core dump files into a form's text fields and then submit them for your script to process.


  multirun3.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    my $counter = 0;

    $counter = increment_counter($counter);
    $counter = increment_counter($counter);

    sub increment_counter{
      my $counter = shift;

      $counter++;
      print "Counter is equal to $counter !\n";

      return $counter;
    }

  } # end of sub run

Finally, you can use references to do the job. The version of increment_counter() below accepts a reference to the $counter variable and increments its value after first dereferencing it. When you use a reference, the variable you use inside the function is physically the same bit of memory as the one outside the function. This technique is often used to enable a called function to modify variables in a calling function.


  multirun4.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    my $counter = 0;

    increment_counter(\$counter);
    increment_counter(\$counter);

    sub increment_counter{
      my $r_counter = shift;

      $$r_counter++;
      print "Counter is equal to $$r_counter !\n";
    }

  } # end of sub run

Here is yet another, more obscure, reference usage. We modify the value of $counter inside the subroutine by using the fact that variables in @_ are aliases for the actual scalar parameters. Thus if you call a function with two arguments, they are stored in $_[0] and $_[1]. In particular, if $_[0] is updated, then the corresponding argument is updated too (or an error occurs if it is not updatable, as would be the case when calling the function with a literal, e.g. increment_counter(5)).


  multirun5.pl
  -----------
  #!/usr/bin/perl -w

  use strict;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

  sub run {

    my $counter = 0;

    increment_counter($counter);
    increment_counter($counter);

    sub increment_counter{
      $_[0]++;
      print "Counter is equal to $_[0] !\n";
    }

  } # end of sub run

The approach given above is generally not recommended, because most Perl programmers will not expect $counter to be changed by the function; the previous example, where we passed \$counter, i.e. pass-by-reference, would be preferred.

Here is a solution that avoids the problem entirely by splitting the code into two files: the first is really just a wrapper and loader, while the second contains the heart of the code.


  multirun6.pl
  -----------
  #!/usr/bin/perl -w

  use strict;
  require 'multirun6-lib.pl' ;

  for (1..3){
    print "run: [time $_]\n";
    run();
  }

Separate file:


  multirun6-lib.pl
  ----------------
  use strict ;

  my $counter;
  sub run {
    $counter = 0;
    increment_counter();
    increment_counter();
  }

  sub increment_counter{
    $counter++;
    print "Counter is equal to $counter !\n";
  }

  1 ;

Now you have at least six workarounds to choose from.

For more information, please refer to the perlref and perlsub manpages.


perldoc's Rarely Known But Very Useful Options

It's a known fact that one cannot become a Perl hacker, and especially a mod_perl hacker, without knowing how to read the Perl documentation and search through it. Books are good, but an easily accessible and searchable Perl reference at your fingertips is a great time saver, and it always has up-to-date information for the version of perl you're using.

Of course, you can use the online Perl documentation on the Web. I prefer http://theoryx5.uwinnipeg.ca/CPAN/perl/ to the official URL, http://www.perl.com/pub/v/documentation/, which is very slow :(. The perldoc utility provides you with access to the documentation installed on your system. To find out what Perl manpages are available, execute:


  % perldoc perl

To find what functions perl has, execute:


  % perldoc perlfunc

To learn the syntax and to find examples of a specific function, you would execute (e.g. for open()):


  % perldoc -f open

Note: In perl5.005_03 and earlier, there is a bug in this option and in -q: perldoc won't call pod2man, but will display the section in raw POD format instead. Despite this bug, the output is still readable and very useful.

The Perl FAQ (perlfaq manpage) is in several sections. To search through the sections for open you would execute:


  % perldoc -q open

This will show you all the matching question-and-answer sections, still in POD format.

To read the perldoc manpage you would execute:


  % perldoc perldoc
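
Two more options are worth knowing about, assuming your perldoc is recent enough to support them: -l prints only the location of the file that would be displayed, and -m shows the module's raw source:


  % perldoc -l CGI
  % perldoc -m CGI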

References

  • Online documentation: http://theoryx5.uwinnipeg.ca/CPAN/perl/ http://www.perl.com/pub/v/documentation/

  • The book ``Programming Perl,'' 3rd edition, by L. Wall, T. Christiansen and J. Orwant (also known as the ``Camel'' book, named after the camel picture on the cover). You want to refer to Chapter 8, which talks about nested subroutines among other things.

  • The perlref and perlsub man pages.
