November 2000 Archives

Red Flags Return

Astute readers had a number of comments about last week's Program Repair Shop and Red Flags article.

Control Flow Puzzle

In the article, I had a section of code that looked like this:

       $_ = <INFO> until !defined($_) || /^(\* Menu:|\037)/;
       return @header if !defined($_) || /^\037/;

I disliked the structure and especially the repeated tests. I played with it, changing it to

        while (<INFO>) {
          last if /^\* Menu:/;
          return @header if /^\037/;
        }
        return @header unless defined $_;

and then used Simon Cozens' suggestion of

        do { 
          $_ = <INFO>; 
          return @header if /^\037/ || ! defined $_ 
        } until /^\* Menu:/ ;

This still bothered me, because do...until is unusual. But I was out of time, so that's what I used.

Readers came up with two interesting alternatives. Jeff Pinyan suggested:

        while (<INFO>) {
          last if /^\* Menu:/;
          return %header if /^\037/ or eof(INFO);
        }

This is perfectly straightforward, and the only reason I didn't think of it was my prejudice against eof(). In the article, I recommended avoiding eof(), and that's a good rule of thumb. But in this case, I think it was probably the wrong way to go.

After I saw Jeff's solution, I thought more about eof() and tried to remember what its real problems are. The conclusion I came to is that the big problem with eof() occurs when you use it on a filehandle that is involved in an interactive dialogue, such as a terminal.

Consider code like this:

        my ($name, $fav_color);
        print "Enter your name: ";
        chomp($name = <STDIN>);
        unless (eof(STDIN)) {
          print "Enter your favorite color: ";
          chomp($fav_color = <STDIN>);
        }

This seems straightforward, but it doesn't work. (Try it!) After the user enters their name, we call eof(). This tries to read another character from STDIN, which means that the program is waiting for user input before printing the second prompt! The program hangs forever at the eof() test, and the only way it can continue is if the user clairvoyantly guesses that they are supposed to enter their favorite color. If they do that, then the program will print the prompt and immediately continue. Not very useful behavior! And under some circumstances, this can cause deadlock.

However, in the example program I was discussing, no deadlock is possible because the information flows in only one direction - from a file into the program. So the use of eof() would have been safe.
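To make that concrete, here's a minimal sketch of my own (not code from the article) showing eof() behaving safely on a plain file, where the one-character read-ahead can never block:

```perl
# A minimal sketch, assuming a throwaway temporary file: eof() on a
# plain file is safe because its read-ahead can't wait on a dialogue.
use File::Temp qw(tempfile);

my ($out, $fname) = tempfile();
print $out "line 1\nline 2\n";
close $out;

open INFO, "<", $fname or die "can't open $fname: $!";
my @seen;
until (eof(INFO)) {          # read-ahead on a file returns immediately
    my $line = <INFO>;
    chomp $line;
    push @seen, $line;
}
close INFO;
# @seen now holds ("line 1", "line 2")
```

The same loop pointed at a terminal would exhibit exactly the hanging behavior described above.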

Ilya Zakharevich suggested a solution that I like even better:

      while (<INFO>) {
          return do_menu() if /^\* Menu:/;
          last if /^\037/;
      }
      return %header;

Here, instead of requiring the loop to fall through to process the menu, we simply put the menu-processing code into a subroutine and process it inside the loop.

Ilya also pointed out that the order of the tests in the original code is backward:

	return @header if /^\037/ || ! defined $_

It should have looked like this:

	return @header if ! defined $_  || /^\037/;

Otherwise, we're trying to do a pattern-match operation on a possibly undefined value.

Ilya also suggested another alternative:

    READ_A_LINE: {
      return %header if not defined ($_ = <INFO>) or /^\037/;
      redo READ_A_LINE unless /^\* Menu:/;
    }

Randal Schwartz suggested something similar. This points out a possible rule of thumb: When Perl's control-flow constructions don't seem to be what you want, try decorating a bare block.
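As a tiny illustration of that rule of thumb, here's a sketch of mine that decorates a bare block with a label, last, and redo; the data is a made-up array standing in for <INFO> so the example is self-contained:

```perl
# A bare block plus redo/last acts as a custom loop.
# Hypothetical input lines standing in for a filehandle.
my @lines = ("some text\n", "* Menu:\n", "more text\n");
my $i = 0;
my $found_menu = 0;
SCAN: {
    my $line = $lines[$i++];
    last SCAN unless defined $line;            # ran out of input
    $found_menu = 1, last SCAN if $line =~ /^\* Menu:/;
    redo SCAN;                                 # otherwise, go around again
}
# $found_menu is now 1, and the scan stopped at the menu line
```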

Oops!

I said:

now invoke the function like this:
	$object = Info_File->new('camel.info');

Unfortunately, the function in question was named open_info_file, not new. The call should have been

	$object = Info_File->open_info_file('camel.info');

I got the call right in my test program (of course I had a test program!) but then mixed up the name when I wrote the article. Thanks to Adam Turoff for spotting this.

Pattern Matching

In the article, I replaced this:

	($info_file) = /File:\s*([^,]*)/;
        ($info_node) = /Node:\s*([^,]*)/;
        ($info_prev) = /Prev:\s*([^,]*)/;
        ($info_next) = /Next:\s*([^,]*)/;
        ($info_up)   = /Up:\s*([^,]*)/;

With this:

	for my $label (qw(File Node Prev Next Up)) {
          ($header{$label}) = /$label:\s*([^,]*)/;
        }

Then I complained that Perl must recompile the regex each time through the loop, five times per node. Ilya pointed out the obvious solution:

	 $header{$1} = $2 
	     while /(File|Node|Prev|Next|Up):\s*([^,]*)/g;

I wish I had thought of this, because you can produce it almost mechanically. In fact, I think my original code betrays a red flag itself. Whenever you have something like this:

	for $item (LIST) {
          something involving m/$item/;
        }

this is a red flag, and you should consider trying to replace it with this:

	my $pat = join '|', LIST;
        something involving m/$pat/o;

As a simple example, consider this common construction:

	@states = ('Alabama', 'Alaska', ..., 
	           'West Virginia', 'Wyoming');
        $matched = 0;
        for $state (@states) {
          if ($input =~ /$state/) { 
            $matched = 1; last;
          }
        }

It's more efficient to use this instead:

	my $pat = join '|', @states;
        $matched = ($input =~ /$pat/o);

Applying this same transformation to the code in my original program yields Ilya's suggestion.
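For the record, here's a self-contained sketch of Ilya's one-liner at work; the sample header line is invented for illustration:

```perl
# A made-up Info-style header line for demonstration purposes.
my $line = "File: camel.info, Node: Top, Next: Intro, Up: (dir)";

# One alternation, compiled once, swept across the line with /g.
my %header;
$header{$1} = $2
    while $line =~ /(File|Node|Prev|Next|Up):\s*([^,]*)/g;

# %header now maps File => 'camel.info', Node => 'Top',
# Next => 'Intro', and Up => '(dir)'; Prev is simply absent.
```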

Synthetic Variables

My code looked like this:

	while (<INFO>) {
          return 1 if /^\037/;    # end of node, success.
          next unless /^\* \S/;   # skip non-menu-items
          if (/^\* ([^:]*)::/) {  # menu item ends with ::
              $key = $ref = $1;
          } elsif (/^\* ([^:]*):\s*([^.]*)[.]/) {
              ($key, $ref) = ($1, $2);
          } else {
              print STDERR "Couldn't parse menu item\n\t$_";
              next;
          }
          $info_menu{$key} = $ref;
        }

Ilya pointed out that in this code, $key and $ref may be synthetic variables. A synthetic variable isn't intrinsic to the problem you're trying to solve; rather, it's an artifact of the way the problem is expressed in a programming language. I think $key and $ref are at least somewhat natural, because the problem statement does include menu items with names that refer to nodes, and $key is the name of a menu item and $ref is the node it refers to. But some people might prefer Ilya's version:

       while (<INFO>) {
           return 1 if /^\037/;        # end of node, success.
           next unless s/^\* (?=\S)//; # skip non-menu-items
           $info_menu{$1} = $1, next if /^([^:]*)::/; 
           $info_menu{$1} = $2, next if /^([^:]*):\s*(.*?)\./;
           print STDERR "Couldn't parse menu item\n\t* $_";
       }

Whatever else you say about it, this reduces the code from eleven lines to six, which is good.

Old News

Finally, a belated correction. In the second Repair Shop and Red Flags Article way back in June, I got the notion that you shouldn't use string operations on numbers. While I still think this is good advice, I then tried to apply it outside of the domain in which it made sense.

I was trying to transform a number like 12345678 into an array like ('12', ',', '345', ',', '678'). After discussing several strategies, all of which worked, I ended with the following nonworking code:

	sub convert {
          my ($number) = shift;
          my @result;
          while ($number) {
            push @result, ($number % 1000) , ',';
            $number = int($number/1000);
          }
          pop @result;      # Remove trailing comma
          return reverse @result;
        }

If you ask this subroutine to convert the number 1009, you get ('1', ',', '9'), which is wrong; it should have been (1, ',', '009'). Many people wrote to point this out; I think Mark Lybrand was the first. Oops! Of course, you can fix this with sprintf, but really the solutions I showed earlier in the article are better.
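For completeness, here's one way the sprintf repair might look; this is my sketch, not code from the article:

```perl
# A sketch of the sprintf fix: every group of digits after the first
# must be zero-padded to three places, so 1009 becomes (1, ',', '009').
sub convert {
    my $number = shift;
    my @result = ();
    while ($number >= 1000) {
        unshift @result, ',', sprintf("%03d", $number % 1000);
        $number = int($number / 1000);
    }
    unshift @result, $number;   # the leading group keeps its natural width
    return @result;
}
```

With this version, convert(1009) yields (1, ',', '009') and convert(12345678) yields (12, ',', '345', ',', '678'). Building the list front-to-back with unshift also removes the need for the reverse and the trailing-comma pop.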

The problem here is that I became too excited about my new idea. I still think it's usually a red flag to treat a number like a string. But there's an exception: When you are formatting a number for output, you have to treat it like a string, because output is always a string. I think Charles Knell hit the nail on the head here:

By inserting commas into the returned value, you ultimately treat the number as a string. Why not just give in and admit you're working with a string?

Thanks, Charles.

People also complained that the subroutine returns a rather peculiar list instead of a single scalar, but that was the original author's decision and I didn't want to tamper with it without being sure why he had done it that way. People also took advantage of the opportunity to send in every bizarre, convoluted way they could think of to accomplish the same thing (or even a similar thing), often saying something like this:

You are doing way too much work! Why don't you simply use this, like everyone else does?
	sub commify {
          $_ = shift . '*';
          "nosehair" while s/(.{1,3})\*/*,$1/;
          substr($_,2);
        }

I think this just shows that all code is really simple if you already happen to understand it.

Send More Code

Finally, thanks to everyone who wrote in, especially the people I didn't mention. These articles have been quite popular, and I'd like to continue them. But that can't happen unless I have code to discuss. So if you'd like to see another ``Red Flags'' article, please consider sending me a 20- to 50-line section of your own code. If you do, I won't publish the article without showing it to you beforehand.

Programming GNOME Applications with Perl - Part 2


Table of Contents

The Cookbook Application
The Main Screen
Columned Lists
Displaying Recipes
Where We Are, And Where We're Going
Notes on the Last Article

Last month's article examined how to create a simple ``Hello World'' application using Gtk+ and GNOME. This month, we'll build a more sophisticated application - one to store and retrieve recipes.

The Cookbook Application

Before we write a single line of code, let's see how we're going to design this. First, we'll look at the user interface, and then see what that means for our program design.

When designing user interfaces, we need to consider what provides users with the most useful and intuitive view of their data, without overcrowding them. What do we need to be able to get at easily when we're using the application? There are two parts to this question: actions that we can perform, and data we can see.

In terms of the data, I decided that the best way to organize the available recipes was as a list, just like the table of contents in a recipe book; scroll up and down the list to see the recipe titles, and then click on one title to display the whole recipe. We could also display some useful information next to each title. I decided that the most useful things to know would be the cooking time and the date that the recipe was added.

Now we can look at the actions that will be performed - these will be turned into the toolbar buttons. One of the most useful features I wanted was the ability to give the program a list of ingredients that I have and have it tell me things I could cook with them. I also wanted to be able to maintain several different cookbooks, so ``Save'' and ``Open'' were natural choices. Of course, you need to be able to add new recipes, so an ``Add'' button would be useful, too. Note that I didn't want a ``Delete'' button - deleting a recipe is something that'll probably happen rarely, and even then, you don't want to make it too easy to do. Finally, you need to be able to exit.

That's the interface for the main screen. (A screenshot of it appeared at this point in the original article.)

Now we can think about the data we need to store. We'll need to store recipes with their titles, dates and cooking times. If we want to search by ingredient, we should also store what ingredients each recipe needs. It would also be handy to have a complete list of all the ingredients we know about, and we'll also have some user configuration settings.

Initially, I considered putting the recipes in an SQL database, but decided against it for two reasons: first, connecting recipes to ingredients was unnecessarily complicated, and the whole thing seemed a little overkill, and second, GNOME applications traditionally store all their data in XML files so that data can be easily passed between apps. In the end, I decided to store the configuration settings plus the list of ingredients we know about in a single XML file, and have the recipe book in a separate file.

The Main Screen

Now that we know what the interface is going to look like for the main screen, we can start coding it. We'll start with the menu items and the toolbar, just like before.

        #!/usr/bin/perl -w
        use strict;
        use Gnome;

        my $NAME    = 'gCookBook';
        my $VERSION = '0.1';

        init Gnome $NAME;

        my $app = new Gnome::App $NAME, $NAME;
		
        signal_connect $app 'delete_event', 
          sub { Gtk->main_quit; return 0 };

        $app->create_menus(
           {
          type => 'subtree',
          label => '_File',
          subtree => [
                { 
                 type => 'item',
                 label => '_New',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_New'
                },
                {
                 type => 'item',
                 label => '_Open...',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Open'
                },
                {
                 type => 'item',
                 label => '_Save',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Save'
                },
                {
                 type => 'item',
                 label => 'Save _As...',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Save As'
                },
                {
                 type => 'separator'
                },
                {
                 type => 'item',
                 label => 'E_xit',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Quit',
                 callback => sub { Gtk->main_quit; return 0 }
                }
                 ]
           },
           { 
          type => 'subtree',
          label => '_Edit',
          subtree => [
                {
                 type => 'item',
                 label => 'C_ut',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Cut',
                },
                {
                 type => 'item',
                 label => '_Copy',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Copy'
                },
                {
                 type => 'item',
                 label => '_Paste',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Paste'
                }
                 ]
           },
           {
          type => 'subtree',
          label => '_Settings',
          subtree => [
                {
                 type => 'item',
                 label => '_Preferences...',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_Preferences',
                 callback => \&show_prefs
                }
                 ]
           },
           {
          type   => 'subtree',
          label  => '_Help',
          subtree => [
                {type => 'item', 
                 label => '_About...',
                 pixmap_type => 'stock',
                 pixmap_info => 'Menu_About',
                 callback => \&about_box
                }
             ]
           }
          );

    $app->create_toolbar(
           {
            type     => 'item',
            label    => 'Cook',
            pixmap_type => 'stock',
            pixmap_info => 'Search',
            hint     => 'Find a recipe by ingredients'
           },
           {
            type     => 'item',
            label    => 'Add',
            pixmap_type => 'stock',
            pixmap_info => 'Add',
            hint     => 'Add a new recipe'
           },
           {
            type     => 'item',
            label    => 'Open...', 
            pixmap_type => 'stock',
            pixmap_info => 'Open',
            hint     => "Open a recipe book"
           },
           {
            type     => 'item',
            label    => 'Save', 
            pixmap_type => 'stock',
            pixmap_info => 'Save',
            hint     => "Save this recipe book"
           },
           { 
            type     => 'item',
            label    => 'Exit',
            pixmap_type => 'stock',
            pixmap_info => 'Quit',
            hint     => "Leave $NAME",
            callback  => sub { Gtk->main_quit;}
           }
          );

    $app->set_default_size(600,400);

    my $bar = new Gnome::AppBar 0,1,"user" ;
    $bar->set_status("");
    $app->set_statusbar( $bar );

    show_all $app;

    main Gtk;

    sub about_box {
      my $about = new Gnome::About $NAME, $VERSION,
      "(C) Simon Cozens, 2000", ["Simon Cozens"], 
      "This program is released under the 
          same terms as Perl itself";
      show $about;
      }

Columned Lists

Next, we have to show the list of recipes. This is usually done with a CList, or ``columned list,'' widget. However, the standard Gtk CList widget is a little unfriendly to deal with: You can only put data into it, and you can't find out what is in the list, so you have to maintain a separate array containing the data; columned lists usually re-sort themselves when a column title is clicked on, but the programmer has to handle this case himself; data has to be referenced by column number, not by column name; and so on.

Since I realized this was going to be unpleasant every time I wanted a columned list, I wrote a module called Gtk::HandyCList that encapsulates all these features. (You'll need to download that module from CPAN if you want to try this. Make sure you get version 0.02, since we use the hide method down below, which is new in that version.)

To add it to our program, we first need data to display! Let's create a dummy array of data, like this:

        my @cookbook = (
                [ "Frog soup", "29/08/99", "12"],
                [ "Chicken scratchings", "12/12/99", "40"],
                [ "Pork with beansprouts in a garlic
                    butter sauce and a really really long name
                    that we have to scroll to see",
                  "1/1/99", 30],
                [ "Eggy bread", "10/10/10", 3]
               );

Now we need to load the module itself, so:

    use Gtk::HandyCList;

Because we want this list to be scrollable, we put it inside a different widget that handles scroll bars - a Gtk::ScrolledWindow.

  my $scrolled_window = new Gtk::ScrolledWindow( undef, undef );
  $scrolled_window->set_policy( 'automatic', 'always' );

Now we create the HandyCList. First, we specify the column names that will be used, then we set up the sizes for each column.

  my $list = new Gtk::HandyCList qw(Name Date Time);
  $list->sizes(350,150,100);

As I mentioned, we want to be able to re-sort the data when we click on the column headings. To make this work we have to tell the module how to sort each column. It knows about alphabetical and numeric sorting, but we'll have to tell it about sorting by date by providing it with a subroutine reference. We also set the shadow so that it looks pretty.

  $list->sortfuncs("alpha", \&sort_date, "number");
  $list->set_shadow_type('out');

Now we give the data to the list:

  $list->data(@cookbook);

Next, we add the list to our scrolled window, and tell the application that its main contents are the scrolled window:

  $scrolled_window->add($list);
  $app->set_contents($scrolled_window);

Finally, we'll receive the signal sent when a recipe is clicked on, and use that to display the recipe.

  $list->signal_connect( "select_row", \&display_recipe);

Of course, we need to write those two subroutines, sort_date and display_recipe. Let's leave the latter one for now, and polish off the date sorting. Here's how I'd write it, because I'm British:

        sub sort_date {
          my ($ad, $am, $ay) = ($_[0] =~ m|(\d+)/(\d+)/(\d+)|);
          my ($bd, $bm, $by) = ($_[1] =~ m|(\d+)/(\d+)/(\d+)|);
          return $ay <=> $by || $am <=> $bm || $ad <=> $bd;
        }

Exercise for the reader: make this subroutine locale-aware.
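A quick sanity check of the sorter, using made-up dates (the subroutine is reproduced from above so the sketch stands alone):

```perl
# sort_date compares British day/month/year strings chronologically:
# year first, then month, then day.
sub sort_date {
  my ($ad, $am, $ay) = ($_[0] =~ m|(\d+)/(\d+)/(\d+)|);
  my ($bd, $bm, $by) = ($_[1] =~ m|(\d+)/(\d+)/(\d+)|);
  return $ay <=> $by || $am <=> $bm || $ad <=> $bd;
}

my @sorted = sort { sort_date($a, $b) } ("29/08/99", "1/1/99", "12/12/98");
# @sorted is now ("12/12/98", "1/1/99", "29/08/99")
```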

By now, you should have an application that displays a list of recipes along with their dates and cooking times. Play with it, click on the column headings and watch it re-sort, resize the windows and the columns, and see what happens.

Displaying Recipes

Now let's tackle displaying the recipes. This is where things get more complex. First, we have to store the text for the recipes. We want to store them, along with the titles, dates and cooking times, in the @cookbook array. So let's add another column to that array, like so:

    my @cookbook = (
        [ "Frog soup", "29/08/99", "12", 
          "Put frog in water. Slowly raise water temperature 
           until frog is cooked."],
        [ "Chicken scratchings", "12/12/99", "40", 
          "Remove fat from chicken, and fry 
	   under a medium grill"],
        [ "Pork with beansprouts in a garlic butter sauce 
           and a really really long name that we have to
           scroll to see",
          "1/1/99", 30, 
	  "Pour boiling water into packet and stir"],
        [ "Eggy bread", "10/10/10", 3, 
	  "Fry bread. Fry eggs. Combine."]
           );

We don't want to display this information on the main list, so we need to change the data that we're passing to the Gtk::HandyCList:

 - my $list = new Gtk::HandyCList qw(Name Date Time);
 + my $list = new Gtk::HandyCList qw(Name Date Time Recipe);
 + $list->hide("Recipe");

(If you don't remember what that syntax means, it's ``take out the line starting with the minus, and add in the lines starting with a plus.'')

Now that we have the recipes stored inside our data structure, we want to be able to see them. We'll use a widget called Gnome::Less, which is named after the Unix utility less. It's a file browser, but we can also give it strings to display.

Let's stop and think about what we're going to do. We need to catch the signal that tells us that the user has double-clicked on a recipe. Then, we want to pop up a window, create a Gnome::Less widget inside that window containing the recipe text and allow the user to dismiss the window. We've already connected the ``mouse click'' signal to a subroutine called display_recipe, so it's time to write that subroutine.

    sub display_recipe {
      my ($clist, $row, $column, $mouse_event) = @_;
      return unless $mouse_event->{type} eq "2button_press";

First, we receive the parameters passed by the signal. The first thing we get is the object that caused the signal - our HandyCList widget. That determines what other parameters get sent. In the case of a HandyCList, it's the row and column in the list that received the mouse click, and a Gtk::Gdk::MouseEvent object that tells us what sort of click it was. In our case, we only want to act on a double click, which is where the type is "2button_press". If this isn't the case, we return.

      my %recipe = %{($clist->data)[$row]};

Given that we know the row that received the signal, we can extract that row from the HandyCList via the data method, which is a get-set method: we can either store data into the list with it, or use it to retrieve the data from the list. Each row is stored as a hash reference, which we dereference to a real hash.

      my $recipe_str = $recipe{Name}."\n";
      $recipe_str .= "-" x length($recipe{Name})."\n\n";
      $recipe_str .= "Cooking time : $recipe{Time}\n";
      $recipe_str .= "Date created : $recipe{Date}\n\n";
      $recipe_str .= $recipe{Recipe};

Next, we build the string that we're going to display, using the hash values we've recovered.

      my $db = new Gnome::Dialog($recipe{Name});
      my $gl = new Gnome::Less;
      my $button = new Gtk::Button( "Close" );
      $button->signal_connect( "clicked", sub { $db->destroy } );

We now create three widgets: the pop-up dialog box window (we pass the recipe's name as a window title), the pager that will display the recipe and a close button. We also connect a signal so that when the button is clicked, the dialog box is destroyed.


      $db->action_area->pack_start( $button, 1, 1, 0 );
      $db->vbox->pack_start($gl, 1, 1, 0);

A dialog box consists of two areas: an ``action area'' at the bottom that should contain the available ``actions,'' or buttons, and a vbox at the top where we put our messages. Accordingly, we pack our button into the action area and our Less widget into the vbox.

      $gl->show_string($recipe_str);
      show_all $db;
    }

Finally, we tell the pager what string it should display, and then show the dialog box. We can now display recipes.

Where We Are, And Where We're Going

The full source of the application so far can be found here.

So far, we've only dealt with static data, hard-coded into the application, which isn't a very real-life scenario. Next time, we'll look at adding and deleting recipes, as well as saving and restoring cookbooks to disk using XML. Once that's done, we'll have the core of a basic cookbook application. In the final part of this tutorial, we'll add more features, such as searching by ingredients.

Notes on the Last Article

Several people wrote me after last month's article saying that they couldn't get the GNOME versions of the application working; if that's a problem, you need to be using the latest version of the Gnome.pm module. The one on CPAN is not the latest - instead, use the one from the Gnome.pm Web site, at http://projects.prosa.it/gtkperl.

I also got my knuckles rapped for saying that ``GNOME is the Unix desktop.'' Fair play - the other project that's providing the same sort of environment for Unix is KDE, but for a long time it was hampered by developers' suspicion of TrollTech and their QPL license. At the same time, big players like Sun and IBM were putting money into the GNOME Foundation to make GNOME the Unix desktop, so it seemed a fair thing to say.

Now most people are happy that the same big players have also set up the KDE League. (From http://www.kde.org/announcements/gfresponse.html: `Now we have been asked ``Will KDE ever create a KDE Foundation in the same sense as the GNOME Foundation?'' The answer to this is no, absolutely not.' You tell 'em, guys.) KDE looks to be a worthy alternative to GNOME. Obviously, I prefer GNOME, but as http://segfault.org puts it: ``KDE - GNOME War - Casualties so far: 0''.

This Week on p5p 2000/11/27



Notes

You can subscribe to an e-mail version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to simon@brecon.co.uk

This week was very busy, and saw nearly 400 posts. Unfortunately, I was also very busy, so this report is slightly late.

Regexp Engine

First, Jarkko has this to say about his progress with the polymorphic regular expression node problem:

To this I can add that if so far I had been happily bouncing around the strange lands of Reg-Ex and shouting back "Dragons? What dragons?" to people frantically waving their hands (safely beyond the borders, funny that)... now I can attest to nasty monsters being fully alive, and full of flame ... the match variables are now under control (I *think*) -- but the character classes are mean, mostly because the data structures implementing them are so different between the byte and character cases, merging the code using them is, errrm, fun? I'm currently dodging core dumps falling from the sky, but I think I'm running in generally right direction ...

Ilya also questioned the methodology of merging character and byte nodes, and Jarkko explained further what he was doing. Read about it.

SOCKS and Sockets

Jens Hamisch noticed a problem with the SOCKS support: Perl had aliased close to fclose without making a distinction between file and socket cases. SOCKS provides wrapper functions around a lot of the I/O library, but it expects people to call close rather than fclose on sockets.

Jens provided a patch, but it only seemed to scratch the surface, so Nick suggested that, since others had pointed out that playing stdio on sockets was not exactly recommended, we should work the SOCKS support into our stdio emulation as part of PerlIO.

The thread continued to discuss the finer intricacies of PerlIO, stdio and SOCKS support; if that's your thing, Read about it.

for, map and grep

There was a long discussion, prompted by Jarkko, about how it would be nice if for could be used more like map or grep, and vice versa, allowing you to say things like:

    map $a { $_ += $a } @array
    grep $a { ... grep $b { $a + $b } } @array

and also

    for (@a) { ... } if $thing
    $total += $_ for @a if $thing
    
This led to a general discussion of dream syntax for post-expression modifiers, including things such as:

    do_this if $that unless $the_other

There was no consensus or any patches, but it was fun anyway.

It also spawned an interesting sub-thread, which related to the fact that the implementation of qw// has changed and now the values it produces are read-only in a for loop, hence things like

    map { s/foo/bar/; $_ } qw(good food) 

now produce an error. Some people thought this was bad, some thought it was good, some thought it was a bug fix, others thought it was an unnecessary semantic change. A suggestion was to have some kind of copy-on-write method so that changing a value in an iterator creates a copy of the value that is no longer read-only.
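In the meantime, the usual workaround is to modify a named copy rather than the read-only alias in $_; a small sketch of mine using the thread's example:

```perl
# Under perls where qw() values are read-only, s/// on the aliased $_
# inside map dies; substituting on a lexical copy works everywhere.
my @fixed = map { (my $copy = $_) =~ s/foo/bar/; $copy } qw(good food);
# @fixed is now ("good", "bard")
```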

The whole thread eventually came down to the fact that everyone wants ``Perl to Do What They Mean,'' but ``What They Mean May Not Be What Other People Mean.'' Read about it.

Encode Licensing

Remember I told you that we used the Encode files from Tcl? Well, as Nick was preparing documentation on the file format for the conversion tables, Jarkko spotted the license. Oops!

While Tcl is open-source, the terms it's distributed under aren't the same terms as Perl, so there was some ooh-ing and aah-ing about whether it could be let in there. Sarathy piped up and said we should do the same as we did for File::Glob - include the data, and keep the licensing terms as part of that extension.

PERL5OPT

Dominus noted that the environment variable PERL5OPT, which claims to behave exactly like switches on the Perl command line, doesn't actually behave like that:

    PERL5OPT='-a -b' perl program.pl

actually turns out to be interpreted as

    perl '-a -b' program.pl

which meant that you couldn't have more than one -M clause. It also turned out (as reported to me by Rich Lafferty) that

    PERL5OPT='-Mstrict; print "Hello\n"'

has rather unpleasant results, and there was some discussion as to whether this was a security problem.

Your humble author produced a patch to have the variable interpreted properly, and Dominus came up with a neat set of tests; however, both patch and test appeared to be slightly buggy, so that's not quite resolved just yet.

Unicode on Big Iron

Peter Prymmer has been making OS/390 Perl better; it now passes a whopping 94.12 percent of its tests. However, I complained that the reason that it was passing some of those was that we were hiding the fact that Unicode didn't work. There was some, uhm, heated debate before we all ascertained that we really did want Unicode to work, and we looked into the problems that are stopping it.

The nice thing about Unicode for ASCII machines is that the bottom 128 characters are the same, so you don't even need to think about them. The nasty thing about Unicode for EBCDIC machines is that they're not ASCII machines, and so there has to be some kind of translation going on. The plan is to introduce an array that converts EBCDIC to ASCII, and we'll see where that gets us.

Carp

Ben Tilly has been thinking for a while about the Carp module; it has convoluted and messy internal semantics.

Here's how the problem comes about: Carp has to report errors on behalf of your module - let's call it module A - but from the point of view of code that uses module A. OK, so far?

However, what happens if the error messages are not generated by module A directly, but are lexical warnings produced by the warnings pragma? Obviously, you don't want Carp to be churning out warnings that claim to come from the guts of warnings.pm. So, Carp has to know to skip over certain modules that are internal to Perl, and go further up the stack. There's an undocumented variable that allows you to skip over stack frames, but Ben considers this messy, and with good reason.

Worse, it's possible to get infinite loops when package inheritance comes into play. Ben is working on ideas on how to get around it, and Hugo and others have been helping him think about this. Read about it.

SvTEMP

If you say

        sub foo { "a" } @foo=(foo())[0,0];

you might be surprised to find that your array only has one element. The problem is that when a subroutine returns a list, the SV members of the list are marked as temporary, on the assumption that something is going to scoop them up and use them. This saves us making copies of the SVs and then throwing them away later. Unfortunately, what happened here is that foo returned a single value, which something did indeed scoop up and use. When the second part of the slice tries to take another value, there's nothing on the list.

Benjamin Holzman had a look at this and produced a patch that turned off the SvTEMP marking of anything about to be used in an array assignment. Sarathy pointed out that this wasn't exactly right, because SvTEMP means several different things. Benjamin tried again, using another bit to indicate whether the value could be stolen without a copy. Sarathy was concerned by the use of a "whole bit" for this task, and suggested a simpler answer: checking for both SvTEMP and also participation in an array assign:

     SvTEMP(sv) && !(PL_op && PL_op->op_type == OP_AASSIGN)

Benjamin then revised his patch, which Jarkko applied.

Locales and Floats

There's a horrible problem with locales (Jarkko would argue that locales are horrible problems): printf "%e" should probably be locale-aware in the scope of use locale. This means that, theoretically, it should be tainted, because locale data can be corrupted.

So what about print 0.0+$x - that also does a floating-point conversion. Should that be locale-aware under use locale? Should it be automatically tainted? This was a tricky discussion, and it seems it's a problem that's been hanging around for a long time, and probably won't be solved soon. You can, however, take a look at the thread for yourself.

Low-Hanging Fruit

Here are a couple of jobs that people can look into if they have a spare moment:

Hugo found that make distclean was creating some dangerously long shell lines. Andreas found a scoping bug with %H, and Ilya replied explaining how to fix it. This report of a segfault could be worth waving a debugger over.

Miscellaneous

Jarkko said "Thanks, applied" 15 times this week.

Until next week, I remain your humble and obedient servant,


Simon Cozens

This Week on p5p 2000/11/20




Notes

You can subscribe to an e-mail version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to simon@brecon.co.uk

There were more than 250 messages this week.

Fixing the Regexp Engine

There's been a lot of work on the regular-expression engine this week, from Jarkko and Ilya. If you recall, there have been two major problems with the regular-expression engine. First, there's re-entrancy. What this means is that if you try to say

    m/something(?{s|foo|bar|})bad/

then by the time you get to bad, the engine will be confused, because it can't restore context properly after the s|foo|bar. It can now, thanks to Ilya, who noted:

Why such a trivial patch should wait for me to do it?!

Doing a similar edit of regexec() is left as an exercise to the reader.

Jarkko retaliated, "Most us mortals find the reg*.[hc] rather daunting." I note that the "trivial patch" was 89k.

The second problem is recursion (see coverage two weeks ago); what this means is that things such as the pathological example that Dan Brumleve produced this week cause stack overflows:

    '' =~ ($re = qr((??{$i++ < 10e4 ? $re : ''})));

Ilya also started work on "flattening" the regular expression engine to remove this recursion. While his patch doesn't solve the problem, it mitigates it considerably and also allows stack unwinding on things like alternation.

The famous "polymorphic regular expression" problem also saw some work, thanks to Jarkko. This problem occurs when matching inside UTF8 strings; it goes a little like this: Given a character string which contains character 300, which is represented in UTF8 as character 196 followed by character 172, should character 172 match? Of course, it shouldn't, but it currently does. What's needed is to turn each "node" of a regular expression into something that can make sense as a character and also as a series of bytes, and then it can behave appropriately when matched in byte mode and in character mode. (See previous discussion of this.)

Jarkko reports that most of it is now working, apart from the special match variables ( $&, $` and $') and the POSIX character classes.

UTF8 and Charnames

Andrew McNaughton found that Charnames didn't produce UTF8-encoded strings on code points less than 255. He produced some patches to make it do so, but Nick and I didn't believe that it should be UTF8-encoding if it doesn't need to be. Perl's UTF8 encoding is done lazily - strings are upgraded if Perl can't avoid upgrading them.

PerlIO (again)

Nick's sterling work on PerlIO continues, and this week he raised the question of where to store the defaults that use open sets; currently, there are four bits set in the CV, two for input and two for output. However, since this only gives you four states, it's not exactly extensible to user-defined disciplines. The suggestion from Sarathy was to use the same area used by the lexical warnings pragma. The whole thread contains a good discussion about how the semantics of PerlIO disciplines will pad out. Read about it.

=head3 (again)

Casey Tweten's unstinting drive to make POD support =head3 and further levels continues. He first patched the documentation, and then patched Pod::Checker to make sure it wouldn't complain about the new levels. Russ Allbery released a new version of the podlators to support them.

New subs.pm

Jeff Pinyan put forward a new version of subs.pm which is supposed to deal with pre-declaring prototypes and attributes. It needed a little smoothing out, but it's looking quite good now.

Congratulations

Two sets of congratulations are in order this week. First, to the dedicated team of bug squashers; there are now fewer than 1,000 open bugs in Perl, and none of them is considered fatal.

Congratulations are also due to Perl hacker Mark Fisher, who just became a daddy. Read about it.

Various

Quite a lot of useful minor fixes, and a few noninteresting bug reports. Lots of the usual test results. Only two flames this week, and one each from your illustrious perl5-porters digest authors. You'd think we'd know better.

Until next week I remain, your humble and obedient servant,


Simon Cozens

Beginner's Introduction to Perl - Part 3

Editor's note: this venerable series is undergoing updates. You might be interested in the newer versions, available at:



Table of Contents

Part 1 of this series
Part 2 of this series
Part 4 of this series
Part 5 of this series
Part 6 of this series

Simple matching
Metacharacters
Character classes
Flags
Subexpressions
Watch out!
Search and replace
Play around!

We've covered flow control, math and string operations, and files in the first two articles in this series. Now we'll look at Perl's most powerful and interesting way of playing with strings, regular expressions, or regexes for short. (The rule is this: after the 50th time you type ``regular expression,'' you find you type ``regexp'' the next 50 times.)

Regular expressions are complex enough that you could write a whole book on them (and, in fact, someone did - Mastering Regular Expressions by Jeffrey Friedl).

Simple matching

The simplest regular expressions are matching expressions. They perform tests using keywords like if, while and unless - or, if you want to be really clever, tests that you can use with and and or. A matching regexp returns a true value if whatever you try to match occurs inside a string. When you want to use a regular expression to match against a string, you use the special =~ operator:

    $user_location = "I see thirteen black cats under a ladder.";
    if ($user_location =~ /thirteen/) {
        print "Eek, bad luck!\n";
    }

Notice the syntax of a regular expression: a string within a pair of slashes. The code $user_location =~ /thirteen/ asks whether the literal string thirteen occurs anywhere inside $user_location. If it does, then the test evaluates true; otherwise, it evaluates false.

Metacharacters

A metacharacter is a character or sequence of characters that has special meaning. We've discussed metacharacters in the context of double-quoted strings, where the sequence \n means the newline character (not a backslash followed by the letter n), and \t means the tab character.
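A two-line illustration of the difference (the strings are my own): in double quotes the sequences are interpreted, while in single quotes the backslashes stay literal characters.

```perl
use strict;
use warnings;

my $interpolated = "name\tvalue\n";   # \t becomes a tab, \n a newline
my $literal      = 'name\tvalue\n';   # backslashes stay literal characters

print $interpolated;                  # name<TAB>value, then a newline
print length($interpolated), " vs ", length($literal), "\n";   # 11 vs 13
```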

Regular expressions have a rich vocabulary of metacharacters that let you ask interesting questions such as, ``Does this expression occur at the end of a string?'' or ``Does this string contain a series of numbers?''

The two simplest metacharacters are ^ and $. These indicate ``beginning of string'' and ``end of string,'' respectively. For example, the regexp /^Bob/ will match ``Bob was here,'' ``Bob'' and ``Bobby.'' It won't match ``It's Bob and David,'' because Bob doesn't show up at the beginning of the string. The $ character, on the other hand, means that you are matching the end of a string. The regexp /David$/ will match ``Bob and David,'' but not ``David and Bob.'' Here's a simple routine that will take lines from a file and only print URLs that seem to indicate HTML files:

    for $line (<URLLIST>) {
        # "If the line starts with http: and ends with html...."
        if (($line =~ /^http:/) and
            ($line =~ /html$/)) {
            print $line;
        }
    }

Another useful set of metacharacters is called wildcards. If you've ever used a Unix shell or the Windows DOS prompt, you're familiar with wildcard characters like * and ?. For example, when you type ls a*.txt, you see all filenames that begin with the letter a and end with .txt. Perl's wildcards are a bit more complex, but work on the same general principle.

In Perl, the generic wildcard character is .. A period inside a regular expression will match any character, except a newline. For example, the regexp /a.b/ will match anything that contains a, another character that's not a newline, followed by b - ``aab,'' ``a3b,'' ``a b,'' and so forth.

If you want to literally match a metacharacter, you must escape it with a backslash. The regex /Mr./ matches anything that contains ``Mr'' followed by another character. If you only want to match a string that actually contains ``Mr.,'' you must use /Mr\./.
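A quick demonstration of why the escape matters (the test strings are my own):

```perl
use strict;
use warnings;

# The unescaped dot matches ANY character, so "Mrs" satisfies /Mr./
print "dot matched\n"    if "Mrs Jones" =~ /Mr./;

# The escaped dot matches only a literal period
print "period matched\n" if "Mr. Jones" =~ /Mr\./;
print "no period\n"  unless "Mrs Jones" =~ /Mr\./;
```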

On its own, the . metacharacter isn't very useful, which is why Perl provides three wildcard quantifiers: +, ? and *. Each quantifier means something different.

The + quantifier is the easiest to understand: It means to match the immediately preceding character or metacharacter one or more times. The regular expression /ab+c/ will match ``abc,'' ``abbc,'' ``abbbc'' and so on.

The * quantifier matches the immediately preceding character or metacharacter zero or more times. This is different from the + quantifier! /ab*c/ will match ``abc,'' ``abbc,'' and so on, just like /ab+c/ did, but it'll also match ``ac,'' because there are zero occurrences of b in that string.

Finally, the ? quantifier will match the preceding character zero or one times. The regex /ab?c/ will match ``ac'' (zero occurrences of b) and ``abc'' (one occurrence of b). It won't match ``abbc,'' ``abbbc'' and so on.
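A small loop (my own test strings) contrasts the three quantifiers side by side:

```perl
use strict;
use warnings;

# Which of /ab+c/, /ab*c/ and /ab?c/ match each string?
for my $s ("ac", "abc", "abbc") {
    my @matched;
    push @matched, "+" if $s =~ /ab+c/;   # one or more b's
    push @matched, "*" if $s =~ /ab*c/;   # zero or more b's
    push @matched, "?" if $s =~ /ab?c/;   # zero or one b
    print "$s: @matched\n";
}
```

Note that "abbc" fails /ab?c/: there is no way to get from the a to the c while crossing at most one b.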

We can rewrite our URL-matching code to use these metacharacters. This'll make it more concise. Instead of using two separate regular expressions (/^http:/ and /html$/), we combine them into one regular expression: /^http:.+html$/. To understand what this does, read from left to right: This regex will match any string that starts with ``http:'' followed by one or more occurrences of any character, and ends with ``html''. Now, our routine is:

    for $line (<URLLIST>) {
        if ($line =~ /^http:.+html$/) {
            print $line;
        }
    }

Remember the /^something$/ construction - it's very useful!

Character classes

We've already discussed one special metacharacter, ., that matches any character except a newline. But you'll often want to match only specific types of characters. Perl provides several metacharacters for this. \d will match a single digit, \w will match any single ``word'' character (which, to Perl, means a letter, digit or underscore), and \s matches a whitespace character (space and tab, as well as the \n and \r characters).

These metacharacters work like any other character: You can match against them, or you can use quantifiers like + and *. The regex /^\s+/ will match any string that begins with whitespace, and /\w+/ will match a string that contains at least one word. (But remember that Perl's definition of ``word'' characters includes digits and the underscore, so whether or not you think _ or 25 are words, Perl does!)
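For instance (the example string is mine):

```perl
use strict;
use warnings;

my $str = "   perl_5 is fun";
print "starts with whitespace\n" if $str =~ /^\s+/;
if ($str =~ /(\w+)/) {
    # "perl_5" -- the underscore and the digit both count as word characters
    print "first word chunk: $1\n";
}
```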

One good use for \d is testing strings to see whether they contain numbers. For example, you might need to verify that a string contains an American-style phone number, which has the form 555-1212. You could use code like this:

    unless ($phone =~ /\d\d\d-\d\d\d\d/) {
        print "That's not a phone number!\n";
    }

All those \d metacharacters make the regex hard to read. Fortunately, Perl allows us to improve on that. You can use numbers inside curly braces to indicate a quantity you want to match, like this:

    unless ($phone =~ /\d{3}-\d{4}/) {
        print "That's not a phone number!\n";
    }

The string \d{3} means to match exactly three digits, and \d{4} matches exactly four digits. If you want to use a range of numbers, you can separate them with a comma; leaving out the second number makes the range open-ended. \d{2,5} will match two to five digits, and \w{3,} will match a word that's at least three characters long.
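The anchored test below (the ^ and $ anchors are my addition, so the whole string must consist of digits) shows the range in action:

```perl
use strict;
use warnings;

# /^\d{2,5}$/ : the entire string must be two to five digits
for my $s ("1", "12", "12345", "123456") {
    my $verdict = ($s =~ /^\d{2,5}$/) ? "matches" : "fails";
    print "$s $verdict\n";
}
```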

You can also invert the \d, \s and \w metacharacters to refer to anything but that type of character. \D matches nondigits; \W matches any character that isn't a letter, digit or underscore; and \S matches anything that isn't whitespace.
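A few one-liners (the strings are mine) with the negated classes:

```perl
use strict;
use warnings;

my $s = "abc 123";
print "contains a nondigit\n"      if $s =~ /\D/;   # any of "abc " qualifies
print "contains non-whitespace\n"  if $s =~ /\S/;
print "contains a non-word char\n" if $s =~ /\W/;   # the space

my $digits = "12345";
print "all digits\n" unless $digits =~ /\D/;
```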

If these metacharacters won't do what you want, you can define your own. You define a character class by enclosing a list of the allowable characters in square brackets. For example, a class containing only the lowercase vowels is [aeiou]. /b[aeiou]g/ will match any string that contains ``bag,'' ``beg,'' ``big,'' ``bog'' or ``bug''. You use dashes to indicate a range of characters, like [a-f]. (If Perl didn't give us the \d metacharacter, we could do the same thing with [0-9].) You can combine character classes with quantifiers:

    if ($string =~ /[aeiou]{2}/) {
        print "This string contains at least two vowels in a row.\n";
    }

You can also invert character classes by beginning them with the ^ character. An inverted character class will match anything you don't list. [^aeiou] matches every character except the lowercase vowels. (Yes, ^ can also mean ``beginning of string,'' so be careful.)
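For example (the word choices are mine; note that y counts as a non-vowel here):

```perl
use strict;
use warnings;

# [^aeiou] matches any character that is NOT a lowercase vowel;
# anchored with ^ and $, it tests every character in the string
for my $word ("rhythm", "audio") {
    if ($word =~ /^[^aeiou]+$/) {
        print "$word: no lowercase vowels\n";
    } else {
        print "$word: has a vowel\n";
    }
}
```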

Flags

By default, regular expression matches are case-sensitive (that is, /bob/ doesn't match ``Bob''). You can place flags after a regexp to modify its behaviour. The most commonly used flag is i, which makes a match case-insensitive:

    $greet = "Hey everybody, it's Bob and David!";
    if ($greet =~ /bob/i) {
        print "Hi, Bob!\n";
    }

We'll talk about more flags later.

Subexpressions

You might want to check for more than one thing at a time. For example, suppose you're writing a ``mood meter'' that scans outgoing e-mail for potentially damaging phrases. You can use the pipe character | to separate the different things you are looking for:

   # In reality, @email_lines would come from your email text,
   # but here we'll just provide some convenient filler.
   @email_lines = ("Dear idiot:",
                   "I hate you, you twit.  You're a dope.",
                   "I bet you mistreat your llama.",
                   "Signed, Doug");

   for $check_line (@email_lines) {
       if ($check_line =~ /idiot|dope|twit|llama/) {
           print "Be careful!  This line might contain something offensive:\n",
                 $check_line, "\n";
       }
   }

The matching expression /idiot|dope|twit|llama/ will be true if ``idiot,'' ``dope,'' ``twit'' or ``llama'' show up anywhere in the string.

One of the more interesting things you can do with regular expressions is subexpression matching, or grouping. A subexpression is like another, smaller regex buried inside your larger regexp, and is placed inside parentheses. The string that caused the subexpression to match will be stored in the special variable $1. We can use this to make our mood meter more explicit about the problems with your e-mail:

   for $check_line (@email_lines) {
       if ($check_line =~ /(idiot|dope|twit|llama)/) {
           print "Be careful!  This line contains the offensive word $1:\n",
                 $check_line, "\n";
       }
   }

Of course, you can put matching expressions in your subexpression. Your mood watch program can be extended to prevent you from sending e-mail that contains more than three exclamation points in a row. We'll use the special {3,} quantifier to make sure we get all the exclamation points.

    for $check_line (@email_lines) {
        if ($check_line =~ /(!{3,})/) {
            print "Using punctuation like '$1' is the sign of a sick mind:\n",
                  $check_line, "\n";
        }
    }

If your regex contains more than one subexpression, the results will be stored in variables named $1, $2, $3 and so on. Here's some code that will change names in ``lastname, firstname'' format back to normal:

 $name = "Wall, Larry";
   $name =~ /(\w+), (\w+)/;
   # $1 contains last name, $2 contains first name

   $name = "$2 $1";
   # $name now contains "Larry Wall"

You can even nest subexpressions inside one another - they're ordered as they open, from left to right. Here's an example of how to retrieve the full time, hours, minutes and seconds separately from a string that contains a timestamp in hh:mm:ss format. (Notice that we're using the {1,2} quantifier so that a timestamp like ``9:30:50'' will be matched.)

 $string = "The time is 12:25:30 and I'm hungry.";
    $string =~ /((\d{1,2}):(\d{2}):(\d{2}))/;
    @time = ($1, $2, $3, $4);

Here's a hint that you might find useful: You can assign to a list of scalar values whenever you're assigning from a list. If you prefer to have readable variable names instead of an array, try using this line instead:

 ($time, $hours, $minutes, $seconds) = ($1, $2, $3, $4);

Assigning to a list of variables when you're using subexpressions happens often enough that Perl gives you a handy shortcut:

 ($time, $hours, $minutes, $seconds) =
         ($string =~ /((\d{1,2}):(\d{2}):(\d{2}))/);

Watch out!

Regular expressions have two traps that generate bugs in your Perl programs: They always start at the beginning of the string, and quantifiers always match as much of the string as possible.

Here's some simple code for counting all the numbers in a string and showing them to the user. We'll use while to loop over the string, matching over and over until we've counted all the numbers.

 $number = "Look, 200 5-sided, 4-colored pentagon maps.";
    while ($number =~ /(\d+)/) {
        print "I found the number $1.\n";
        $number_count++;
    }
    print "There are $number_count numbers here.\n";

This code is actually so simple it doesn't work! When you run it, Perl will print I found the number 200 over and over again. Perl always begins matching at the beginning of the string, so it will always find the 200, and never get to the following numbers.

You can avoid this by using the g flag with your regex. This flag will tell Perl to remember where it was in the string when it returns to it. When you insert the g flag, our code looks like this:

 $number = "Look, 200 5-sided, 4-colored pentagon maps.";
    while ($number =~ /(\d+)/g) {
        print "I found the number $1.\n";
        $number_count++;
    }
    print "There are $number_count numbers here.\n";

Now we get the results we expected:

 I found the number 200.
    I found the number 5.
    I found the number 4.
    There are 3 numbers here.

The second trap is that a quantifier will always match as many characters as it can. Look at this example code, but don't run it yet:

 $book_pref = "The cat in the hat is where it's at.\n";
    $book_pref =~ /(cat.*at)/;
    print $1, "\n";

Take a guess: What's in $1 right now? Now run the code. Does this seem counterintuitive?

The matching expression (cat.*at) is greedy: $1 contains cat in the hat is where it's at because that's the largest string that matches. Remember, read left to right: ``cat,'' followed by any number of characters, followed by ``at.'' If you want to match the string cat in the hat, you have to rewrite your regexp so it isn't as greedy. There are two ways to do this:

1. Make the match more precise (try /(cat.*hat)/ instead). Of course, this still might not work - try using this regexp against The cat in the hat is who I hate.

2. Use a ? character after a quantifier to specify nongreedy matching. .*? instead of .* means that Perl will try to match the smallest string possible instead of the largest:

 # Now we get "cat in the hat" in $1.
  $book_pref =~ /(cat.*?at)/;

Search and replace

Now that we've talked about matching, there's one other thing regular expressions can do for you: replacing.

If you've ever used a text editor or word processor, you're familiar with the search-and-replace function. Perl's regexp facilities include something similar, the s/// operator, which has the following syntax: s/regex/replacement string/. If the string you're testing matches regex, then whatever matched is replaced with the contents of replacement string. For instance, this code will change a cat into a dog:

 $pet = "I love my cat.\n";
    $pet =~ s/cat/dog/;
    print $pet;

You can also use subexpressions in your matching expression, and use the variables $1, $2 and so on, that they create. The replacement string will substitute these, or any other variables, as if it were a double-quoted string. Remember our code for changing Wall, Larry into Larry Wall? We can rewrite it as a single s/// statement!

 $name = "Wall, Larry";
    $name =~ s/(\w+), (\w+)/$2 $1/;  # "Larry Wall"

s/// can take flags, just like matching expressions. The two most important flags are g (global) and i (case-insensitive). Normally, a substitution will only happen once, but specifying the g flag will make it happen as long as the regex matches the string. Try this code, and then remove the g flag and try it again:

 $pet = "I love my cat Sylvester, and my other cat Bill.\n";
   $pet =~ s/cat/dog/g;
   print $pet;

Notice that without the g flag, Bill doesn't turn into a dog.

The i flag works just as it did when we were only using matching expressions: It forces your matching search to be case-insensitive.

Putting it all together

Regular expressions have many practical uses. We'll look at a httpd log analyzer for an example. In our last article, one of the play-around items was to write a simple log analyzer. Now, let's make it a bit more interesting: a log analyzer that will break down your log results by file type and give you a list of total requests by hour.

(Complete source code.)

First, let's look at a sample line from a httpd log:

 127.12.20.59 - - [01/Nov/2000:00:00:37 -0500] 
	"GET /gfx2/page/home.gif HTTP/1.1" 200 2285

The first thing we want to do is split this into fields. Remember that the split() function takes a regular expression as its first argument. We'll use /\s/ to split the line at each whitespace character:

 @fields = split(/\s/, $line);

This gives us 10 fields. The ones we're concerned with are the fourth field (time and date of request), the seventh (the URL), and the ninth and 10th (HTTP status code and size in bytes of the server response).

First, we'd like to make sure that we turn any request for a URL that ends in a slash (like /about/) into a request for the index page from that directory (/about/index.html). We'll need to escape out the slashes so that Perl doesn't mistake them for terminators in our s/// statement.

 $fields[6] =~ s/\/$/\/index.html/;

This line is difficult to read, because anytime we come across a literal slash character we need to escape it out. This problem is so common, it has acquired a name: leaning-toothpick syndrome. Here's a useful trick for avoiding the leaning-toothpick syndrome: You can replace the slashes that mark regular expressions and s/// statements with any other matching pair of characters, like { and }. This allows us to write a more legible regex where we don't need to escape out the slashes:

 $fields[6] =~ s{/$}{/index.html};

(If you want to use this syntax with a matching expression, you'll need to put a m in front of it. /foo/ would be rewritten as m{foo}.)
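Putting both forms together (the URL is a made-up example):

```perl
use strict;
use warnings;

my $url = "http://www.example.com/about/";

# m{...} lets us mention slashes without escaping each one
print "directory request\n" if $url =~ m{/about/$};

# the same trick works for s///
$url =~ s{/$}{/index.html};
print "$url\n";   # http://www.example.com/about/index.html
```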

Now, we'll assume that any URL request that returns a status code of 200 (request OK) is a request for the file type of the URL's extension (a request for /gfx/page/home.gif returns a GIF image). Any URL request without an extension returns a plain-text file. Remember that the period is a metacharacter, so we need to escape it out!

 if ($fields[8] eq '200') {
           if ($fields[6] =~ /\.([a-z]+)$/i) {
               $type_requests{$1}++;
           } else {
               $type_requests{'txt'}++;
           }
        }

Next, we want to retrieve the hour each request took place. The hour is the first string in $fields[3] that will be two digits surrounded by colons, so all we need to do is look for that. Remember that Perl will stop when it finds the first match in a string:

 # Log the hour of this request
        $fields[3] =~ /:(\d{2}):/;
        $hour_requests{$1}++;

Finally, let's rewrite our original report() sub. We're doing the same thing over and over (printing a section header and the contents of that section), so we'll break that out into a new sub. We'll call the new sub report_section():

 sub report {
    print "Total bytes requested: ", $bytes, "\n";
    print "\n";
    report_section("URL requests:", %url_requests);
    report_section("Status code results:", %status_requests);
    report_section("Requests by hour:", %hour_requests);
    report_section("Requests by file type:", %type_requests);
 }

The new report_section() sub is very simple:

 sub report_section {
    my ($header, %type) = @_;

    print $header, "\n";
    for $i (sort keys %type) {
        print $i, ": ", $type{$i}, "\n";
    }
    print "\n";
 }

We use the keys function to return a list of the keys in the %type hash, and the sort function to put it in alphabetic order. We'll play with sort a bit more in the next article.

Play around!

As usual, here are some sample exercises:

1. A rule of good writing is ``avoid the passive voice.'' Instead of The report was read by Carl, say Carl read the report. Write a program that reads a file of sentences (one per line), detects and eliminates the passive voice, and prints the result. (Don't worry about irregular verbs or capitalization, though.)

Sample solution. Sample test sentences.

2. You have a list of phone numbers. The list is messy, and the only thing you know is that there are either seven or 10 digits in each number (the area code is optional), and if there's an extension, it will show up after an ``x'' somewhere on the line. ``416 555-1212,'' ``5551300X40'' and ``(306) 555.5000 ext 40'' are all possible. Write a fix_phone() sub that will turn all of these numbers into the standard format ``(123) 555-1234'' or ``(123) 555-1234 Ext 100,'' if there is an extension. Assume that the default area code is ``123.''

Sample solution.

This Week on p5p 2000/11/14



Notes

You can subscribe to an e-mail version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to simon@brecon.co.uk

There were 348 messages this week.

Dominus mentioned this during his tenure, but now it's my turn: these reports are only made possible due to the generosity of O'Reilly and Associates, who keep me in caffeine.

stat vs. lstat

David Dyck was using find2perl and turned up a couple of problems with lstat. The problem is that lstat _ produces a warning because it thinks that _ is a filehandle, and you can't lstat a filehandle, only a filename.

David produced a patch to take care of this, but then discovered that lstat and the -l filetest were acting differently; you shouldn't be allowed to follow stat('something') with lstat _. Again, David produced another patch to cause a fatal error if you try.

He also noticed that -l FH dies, whereas lstat FH doesn't, but nobody has looked into that yet.

Also in the "file tests" area, Rich Morin quoted the camel book:

"Because Perl has to read a file to do the -T test, you don't want to use -T on special files that might hang or give you other kinds of grief."

and commented that maybe Perl should test to see whether someone is planning to do a -T or a -B test on a special file, and "keep itself out of trouble".

Kurt Starsinic pointed out that there were times when you do want to test, for example, a named pipe, and Perl shouldn't restrict the programmer from doing so. Nick said it'll all be OK when we have PerlIO working.

Threads and POSIX

Kurt was indulging in bug archaeology and turned up something spooky relating to signals and threads. He asked for an explanation of how the signal model changes between nonthreaded and threaded Perl, which Dan Sugalski duly provided. The rest of that thread (ho, ho) is worth reading, if you're interested in how threads work with Perl.

PerlIO

As usual, there's a lot of good work going on with PerlIO, and quite a lot of the bulk of this week's traffic was taken up with PerlIO-related test results, bug reports and discussion.

To remind you, PerlIO is going to be a complete IO library for Perl, which, among other things, allows us to insert filters at various stages of the input and output process. This means that, for instance, data can be transparently converted to UTF8 from other character sets.

Nick mentioned that he'd like to test PerlIO a lot, so it would be appreciated if those following the Perl development sources could do something like the following:

    ./Configure -Duseperlio -d
    make
    ...
    PERLIO=stdio  make test
    PERLIO=perlio make test
    PERLIO=mmap   make test

and report results. (mmap may not be available, depending on the platform.) Dominic Dunlop fixed the MachTen hints to stop it from claiming to support mmap, since any attempt to use the function just causes the program to abort due to an error.

Robin Barker found that PerlIO-over-stdio breaks large file support; Nick found that this was a problem with 64-bit support and that Perl was using fseek where it should have been using fseeko.

Nicholas Clark did some work on IO::Handle and some other IO calls, and found that the return values weren't particularly intuitive; Perl was reporting the raw return values from stdio rather than true or false. This becomes problematic, of course, when the IO model isn't stdio. He produced some fixes for ungetc and getpos, and he also noted that if we're using sfio then we shouldn't treat sftell as if it were ftell, as there's yet another return value inconsistency.

Nicholas also came up with a "dumb shell" to allow a shell with a per-process current directory on systems that don't have one, which should make dealing with subdirectories during building easier, and might also help with cross-compiling Perl.

Nick Ing-Simmons also asked

Now that PerlIO is in the mainline I _really_ need to know what to do next in terms of making it useful.

This means knowing what it "should" look like to perl5

There followed a useful discussion of the proposed API; read about it.

It was also determined that one should open a file containing Latin data as follows:

[ XL indementum tum biguttam tum latin-1 inquementum tum LatinFile inquementum evolute meo fho morive errorum. ]

and that Perl programs dealing with data in Japanese are implicitly permitted to seppuku.

README.Solaris

Andy Dougherty produced a README.Solaris, which Jarkko, Russ, Alan and many others looked over and improved; later on, he produced a final version to be integrated, along with some other Solaris fixes. This was applied to the tree, and then some people picked over it a little more. Read about it.

Locales

Jarkko's old nemesis was out in force this week. Robin Barker and Larry Virden found that the UTF8 locales didn't work properly - this was not really a surprise, since Jarkko had disabled the tests as they require the fabled polymorphic regular expression support. (This means that /(.)/ should be equally happy capturing a UTF8 character as a non-UTF8 character.) Lupe Christoph found that most of the tests work OK, as there are only two that require polymorphic regexes. Jarkko grudgingly re-enabled them - locales are Not His Friend.

Vadim Konovalov produced a little fix to make the locale tests work, and added a test for the MS-DOS Russian code page; seems like Perl is now happy to speak Russian even on MS-DOS systems.

Integer handling revisited

If you remember, a month ago there was a discussion of integer and floating-point handling, in which it was suggested that adding two integers together at run time should result in an integer rather than a floating-point number.

Jarkko and Nicholas Clark have been looking into this, and Jarkko produced a patch; it looked to me like the complexity of determining whether it's possible to sum two integers without overflowing is going to cause a slowdown, and Perl usually trades space for speed.

However, Nicholas mentioned that on the StrongARM architecture, floating points are all emulated, and thus keeping everything integer might actually speed it up.

=head3

Casey Tweten produced a patch that allowed Pod::Man to deal with =head3 and above. Russ Allbery disapproved, on the grounds that the man translator should not be singled out - if we're going to do this, we should change the documentation for POD, and all the other translators. Andy pointed out that it's OK to update the translators one at a time, since it's not really reasonable to blitz through them all in one go. However, perlpod.pod should be updated. (Nobody's done this yet.)

Tim Jenness pointed out that the LaTeX translator already deals with =head3 and =head4.

Little fixes

Last week I concentrated on things that nobody fixed; this time we'll have a whirlwind tour of things that people did get fixed.

The VMS and Cygwin flock fixes we mentioned last week were implemented, thanks to Craig Berry and Andy Dougherty.

Eric Fifer brought the Cygwin port up to date with Cygwin 1.1.5. (Great work as ever, Eric!)

Harold Morris, one of the Amdahl UTS people, produced some patches to help Perl run under UTS; Lupe Christoph made parts of the regular expression test suite explain why they fail, when they do. Lots of people fixed up documentation, so I won't mention them all. Casey Tweten did some good work, including adding an import method to Class::Struct. Nicholas Clark fixed some FreeBSD stdio declarations. Robin Barker picked up some dodgy casts between pointers and integers.

Various

As usual, plenty of small bug reports, patches, irrelevant questions, the complete absence of flames, and only one spam.

Until next week I remain, your humble and obedient servant,


Simon Cozens

Program Repair Shop and Red Flags




What's wrong with this picture?

Once again I'm going to have a look at a program written by a Perl beginner and see what I can do to improve it.

This month's program comes from a very old Usenet post. It was posted seven years ago - on Nov. 12, 1993, to be exact - on the comp.lang.perl newsgroup. (At that time comp.lang.perl.misc had not yet been created.)

The program is a library of code for reading GNU ``info'' files. Info files are a form of structured documentation used by the GNU project. If you use the emacs editor, you can browse info files by using the C-h i command, for example. An info file is made up of many nodes, each containing information about a certain topic. The nodes are arranged in a tree structure. Each node has a header with some meta-information; one item recorded in the header of each node is the name of that node's parent in the documentation tree. Most nodes also have a menu of their child nodes. Each node also has pointers to the following and preceding nodes so that you can read through all the nodes in order.


The Interface

The code we'll see has functions for opening info files and for reading in nodes and parsing the information in their headers and menus. But before I start discussing the code, I'll show the documentation. Here it is, copied directly from that 7-year-old Usenet posting, typos and all:

    To use the functions:  Call



            &open_info_file(INFO_FILENAME);


    to open the filehandle `INFO' to the named info file.
    Then call


            &get_next_node;


    repeatedly to read the next node in the info file; variables
            $info_file
            $info_node
            $info_prev
            $info_next
            $info_up


    are set if the corresponding fields appear in the node's
    header, and if the node has a menu, it is loaded into
    %info_menu.  When `get_next-node' returns false, you have
    reached end-of-file or there has been an error.

Right away, we can see a major problem. The code is supposed to be a library of utility functions. But the only communication between the library and the main program is through a series of global variables with names like $info_up. This, of course, is terrible style. The functions cannot safely be used in any program that happens to have a variable named $info_up, and if you do use them in such a program, you can introduce bizarre, hard-to-find bugs that result from the way the library smashes whatever value that variable had before. The library might even interfere with itself! If you had something like this:

        &get_next_node;
        foo();
        print $info_node;

then you might not get the results you expect. If foo() happens to also call get_next_node, it will discard the value of $info_node that the main code was planning to print.

These are the types of problems that functions and local variables were intended to solve. In this case, it's easy to solve the problems: Just have get_next_node return a list of the node information, instead of setting a bunch of hardwired global variables. If the caller of the function wants to set the variables itself, it is still free to do that:

        %next_node = &get_next_node;
        ($info_file, $info_node, $info_prev, $info_next, $info_up)
            = @next_node{qw(File Node Prev Next Up)};
        %info_menu = %{$next_node{Menu}}

Or not:

        my (%node) = &get_next_node;
        my ($next) = $node{Next};

If for some reason the caller of get_next_node likes the global variables, they can still have the original interface:

        sub get_next_node_orig {
          my %next_node = &get_next_node;
          ($info_file, $info_node, $info_prev, $info_next, $info_up)
              = @next_node{qw(File Node Prev Next Up)};
          %info_menu = %{$next_node{Menu}}
        }

This shows that no functionality has been lost; it is just as powerful to return a list of values as it is to set the global variables directly.


The Code

Now we'll see the code itself. The entire program is available here. We will be looking at one part at a time.

open_info_file

The first function that the user calls is the open_info_file function:

    83  sub open_info_file {
    84      ($info_filename) = @_;
    85      (open(INFO, "$info_filename")) 
	      || die "Couldn't open $info_filename: $!";
    86      return &start_info_file;
    87  }

Before I discuss the design problems here, there's a minor syntactic issue: The quotation marks around "$info_filename" are useless. Perl uses the "..." notation to say ``Construct a string.'' But $info_filename is already a string, so making it into a string is at best a waste of time. Moreover, the extra quotation marks can sometimes cause subtle bugs. Consider this innocuous-looking code:

        my ($x) = @_;
        do_something("$x");

If $x was a string, this still works. But if $x was a reference, it probably fails. Why? Because "$x" constructs a string that looks like a reference but isn't, and if do_something is expecting a reference, it will be disappointed. Such errors can be hard to debug, because the string that do_something gets looks like a reference when you print it out. The use strict 'refs' pragma was designed to catch exactly this error. With use strict 'refs' in scope, do_something will probably raise an error like

    Can't use string ("SCALAR(0x8149bbc)") as an ARRAY ref...

Without use strict 'refs', you get a subtle and silent bug.
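To see the failure concretely, here is a small sketch (the sum_of function is hypothetical, not from the library) showing how a stringified reference passes a print test but fails a dereference under use strict:

```perl
use strict;

# Hypothetical helper that expects an array reference.
sub sum_of {
    my ($aref) = @_;
    my $total = 0;
    $total += $_ for @$aref;    # dereference dies if $aref is a mere string
    return $total;
}

my $nums = [1, 2, 3];
print sum_of($nums), "\n";      # prints 6

# "$nums" is the string "ARRAY(0x...)": it *looks* like a reference
# when printed, but it can no longer be dereferenced.
my $result = eval { sum_of("$nums") };
print "failed: $@" unless defined $result;
```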

But back to the code. open_info_file calls die if it can't open the specified file for any reason. It would probably be more convenient and consistent to have it simply return a failure code in this case; this is what it does if the open succeeds, but then start_next_part fails. It's usually easier for the calling code to deal with a simple error return than with an exception, all the more so in 1993, when Perl didn't have exception handling. I would rewrite the function like this:

        sub open_info_file {
            ($info_filename) = @_;
            open(INFO, $info_filename) || return;
            return start_info_file();
        }

I also got rid of some superfluous parentheses and changed the 1993 &function syntax to a more modern function() syntax. It's tempting to try to make $info_filename into a private variable, but it turns out that other functions need to see it later, so the best we can do is make it a file-scoped lexical, private to the library, but shared among all the functions in the library.

Finally, a design issue: The filehandle name INFO is hard-wired into the function. Since filehandle names are global variables, this is best avoided for the same reason that we wanted to get rid of the $info_node variable earlier: If some other part of the program happens to have a filehandle named INFO, it's going to be very surprised to find it suddenly attached to a new file.

There are a number of ways to solve this. The best one available in Perl 4 is to have the caller pass in the filehandle it wants to use as an argument to open_info_file. Then the caller is effectively using the filehandle as an object. In this case, however, this doesn't work as well as we'd like, because, as we'll see later, the library needs to be able to associate the name of the file with the filehandle. In the original library, this was easy, because the filename was always stored in the global variable $info_filename and the filehandle was always INFO. The downside of this simple solution is that the library can't have two info files open at once. There are solutions to this in Perl 4, but they're only of interest to Perl 4 programmers, so I won't go into detail.

The solution in Perl 5 is to use an object to represent an open info file. Whenever the caller wants to operate on the file, it passes the object into the library as an argument. The object can carry around the open filehandle and the filename. Since the data inside the object is private, it doesn't interfere with any other data in the program. The caller can have several files open at once, and distinguish between them because each file is represented by its own object.

To make this library into an object-oriented class only requires a few small changes. We add

        package Info_File;

at the top, and rewrite open_info_file like this:

    sub open_info_file {
        my ($class, $info_filename) = @_;
        my $fh = new FileHandle;        
        open($fh, $info_filename) || return;
        my $object = { FH => $fh, NAME => $info_filename };
        bless $object => $class;
        return unless $object->start_info_file;            
        return $object;
    }

We now invoke the function like this:

     $object = Info_File->open_info_file('camel.info');

The new FileHandle line constructs a fresh new filehandle. The next line opens the filehandle, as usual. The line

     my $object = { FH => $fh, NAME => $info_filename };

constructs the object, which is simply a hash. The object contains all the information that the library will need to use in order to deal with the info file - in this case, the open filehandle and the original filename. The bless function converts the hash into a full-fledged object of the Info_File class. Finally, the

        $object->start_info_file;

invokes the start_info_file function with $object as its argument, just like calling start_info_file($object). The special ``arrow'' syntax for objects is enabled by the bless on the previous line. This notation indicates a method call on the object; start_info_file is the method. A method is just an ordinary subroutine. A method call on an object is like any other subroutine call, except that the object itself is passed as an argument to the subroutine.
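The equivalence between a method call and an ordinary subroutine call can be seen in a tiny sketch (the Greeter class here is invented for illustration):

```perl
use strict;

package Greeter;

sub new {
    my ($class, $name) = @_;
    return bless { NAME => $name }, $class;
}

sub greet {
    my ($self) = @_;    # the invocant arrives as the first argument
    return "Hello, $self->{NAME}";
}

package main;

my $g = Greeter->new('world');

# The arrow syntax and the plain subroutine call do the same thing;
# the method call just passes $g implicitly:
print $g->greet(), "\n";          # Hello, world
print Greeter::greet($g), "\n";   # Hello, world
```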

That was a lot of space to spend on one three-line function, but many of the same issues are going to pop up over and over, and it's good to see them in a simple context.

start_info_file


    47  # Discard commentary before first node of info file
    48  sub start_info_file {
    49      $_ = <INFO> until (/^\037/ || eof(INFO));
    50      return &start_next_part if (eof(INFO)) ;
    51      return 1;
    52  }

An info file typically has a preamble before the first node, usually containing a copyright notice and a license. When the user opens an info file, the library needs to skip this preamble to get to the nodes, which are the parts of interest. That is what start_info_file does. The preamble is separated from the first node by a line that begins with the obscure \037 character, which is control-underscore. The function will read through the file line by line, looking for the first line that begins with the obscure character. If it finds such a line, it immediately returns success. Otherwise, it moves on to the next ``part,'' which I'll explain later.

As I explained in earlier articles, a ``red flag'' is an immediate warning sign that you have done something wrong. Use of the eof() function is one of the clearest and brightest red flags in Perl. It is almost always a mistake to use eof().

The problem with eof() is that it tries to see into the future: whether the next read from the filehandle will return an end-of-file condition. It's impossible to actually see the future, so what it really does is try to read some data. If there isn't any, it reports that the next read will return end-of-file; if there is, it has to put back the data that it just read. This can cause weird problems, because eof() is reading extra data that you might not have meant to read.

eof() is one of those functions like goto that looks useful at first, but then it turns out that there is almost always a better way to accomplish the same thing. In this case, the code is more straightforward and idiomatic like this:

    sub start_info_file {
        while (<INFO>) {
          return 1  if /^\037/;
        }
        &start_next_part;
    }

Perl will automatically exit the while loop when it reaches the end of the file, and in that case we can unconditionally call start_next_part. Inside the loop, we examine the current line to see whether it is the separator, and return success if it is. The assignment to $_ and the check for end-of-file are now all implicit.

In the object-oriented style, start_info_file expects to get an object, originally constructed by open_info_file, as its argument. This object will contain the filehandle that the function will read from in place of INFO. The rewriting into OO style is straightforward:

    sub start_info_file {
        my ($object) = @_;
        my $fh = $object->{FH};
        while (<$fh>) {
          return 1 if /^\037/;
        }
        $object->start_next_part;
    }

Here we extract the filehandle from the object by asking for $object->{FH}, and then use the filehandle $fh in place of INFO. The call to start_next_part changes into a method call on the object, which means that the object is implicitly passed to the start_next_part function so that start_next_part also has access to the object, including the filehandle buried inside it.

start_next_part

I promised to explain what start_next_part does, and now we're there. An info file is not a single file; it might be split into several separate files, each containing some of the nodes. If the main info file is named camel.info, there might be additional nodes in the files camel.info-1, camel.info-2 and so on. This means that when we get to the end of an info file we are not finished; we have to check to see whether it continues in a different file. start_next_part does this.

    54  # Look for next part of multi-part info file.  Return 0
    55  # (normal failure) if it isn't there---that just means
    56  # we ran out of parts.  die on some other kind of failure.
    57  sub start_next_part {
    58      local($path, $basename, $ext);
    59      if ($info_filename =~ /\//) {
    60          ($path, $basename) 
		    = ( $info_filename =~ /^(.*)\/(.*)$/ );
    61      } else {
    62          $basename = $info_filename;
    63          $path = "";
    64      }
    65      if ($basename =~ /-\d*$/) {
    66          ($basename, $ext) 
		    = ($basename =~ /^([^-]*)-(\d*)$/);
    67      } else {
    68          $ext = 0;
    69      }
    70      $ext++;
    71      $info_filename = "$path/$basename-$ext";
    72      close(INFO);
    73      if (! (open(INFO, "$info_filename")) ) {
    74          if ($! eq "No such file or directory") {
    75              return 0;
    76          } else {
    77              die "Couldn't open $info_filename: $!";
    78          }
    79      }
    80      return &start_info_file;
    81  }

The main point of this code is to take a filename like /usr/info/camel.info-3 and change it into /usr/info/camel.info-4. It has to handle a special case: /usr/info/camel.info must become /usr/info/camel.info-1. After computing the new filename, it tries to open the next part of the info file. If successful, it calls start_info_file to skip the preamble in the new part.

The first thing to notice here is that the function is performing more work than it needs to. It carefully separates the filename into a directory name and a base name, typically /usr/info and camel.info-3. But this step is unnecessary, so let's eliminate it.

    sub start_next_part {
        local($name, $ext);
        if ($info_filename =~ /-\d*$/) {
            ($name, $ext) 
                = ($info_filename =~ /^([^-]*)-(\d*)$/);
        } else {
            $ext = 0;
        }
        $ext++;
        $info_filename = "$name-$ext";
        # ... no more changes ...
    }

This immediately reduces the size of the function by 25 percent. Now we notice that the two pattern matches that remain are almost the same. This is the red flag of all red flags: Any time a program does something twice, look to see whether you can get away with doing it only once. Sometimes you can't. This time, we can:

    sub start_next_part {
        local($name, $ext);
        if ($info_filename =~ /^([^-]*)-(\d*)$/) {
            ($name, $ext) = ($1, $2);
        } else {
            $name = $info_filename; $ext = 0;
        }
        $ext++;
        $info_filename = "$name-$ext";
        # ... no more changes ...
    }

This is somewhat simpler, and it paves the way for a big improvement: The $name variable is superfluous, because its only purpose is to hold an intermediate result. The real variable of interest is $info_filename. $name is what I call a synthetic variable: It's an artifact of the way we solve the problem, and is inessential to the problem itself. In this case, it's easy to eliminate:

    sub start_next_part {
        if ($info_filename =~ /^([^-]*)-(\d*)$/) {
            $info_filename = $1 . '-' . ($2 + 1);
        } else {
            $info_filename .= '-1';
        }
        # ... no more changes ...
    }

If the pattern matches, then $1 contains the base name, typically /usr/info/camel.info, and $2 contains the numeric suffix, typically 3. There is no need to copy these into named variables before using them; we can construct the new filename, /usr/info/camel.info-4 directly from $1 and $2. If the pattern doesn't match, we construct the new file name by appending -1 to the old file name; this turns /usr/info/camel.info into /usr/info/camel.info-1.
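Pulled out into a standalone function (next_part_name is a hypothetical name; the real code updates $info_filename in place), the transformation can be checked directly:

```perl
use strict;

# Advance an info filename to its next part:
#   /usr/info/camel.info    ->  /usr/info/camel.info-1
#   /usr/info/camel.info-3  ->  /usr/info/camel.info-4
sub next_part_name {
    my ($filename) = @_;
    if ($filename =~ /^([^-]*)-(\d*)$/) {
        return $1 . '-' . ($2 + 1);
    }
    return $filename . '-1';
}

print next_part_name('/usr/info/camel.info'),   "\n";  # /usr/info/camel.info-1
print next_part_name('/usr/info/camel.info-3'), "\n";  # /usr/info/camel.info-4
```

Note that, as in the simplified version above, a hyphen anywhere else in the path would defeat the ^([^-]*) match; the filenames this library was written for don't contain one.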

That takes care of the top half of the function; now let's look at the bottom half:

    sub start_next_part {
        if ($info_filename =~ /^([^-]*)-(\d*)$/) {
            $info_filename = $1 . '-' . ($2 + 1);
        } else {
            $info_filename .= '-1';
        }
        close(INFO);
        if (! (open(INFO, "$info_filename")) ) {
            if ($! eq "No such file or directory") {
                return 0;
            } else {
                die "Couldn't open $info_filename: $!";
            }
        }
        return &start_info_file;
    }

The close(INFO) is unnecessary, because the open on the following line will perform an implicit close. If the file can't be opened, the function looks to find out why. If the reason is that the next part doesn't exist, then we're really at the end, and it quietly returns failure; but if there was some other sort of error, it aborts. In keeping with our change to open_info_file, we will eliminate the die and let the caller die itself, if that is desirable:

    sub start_next_part {
        if ($info_filename =~ /^([^-]*)-(\d*)$/) {
            $info_filename = $1 . '-' . ($2 + 1);
        } else {
            $info_filename .= '-1';
        }
        return unless open(INFO, $info_filename);
        return &start_info_file;
    }

I made a few other minor changes here: Superfluous quotation marks around $info_filename are gone, and if ! has turned into unless. Also, I replaced return 0 with return. return 0 and return undef are red flags: They are attempts to make a function that returns a false value. But if the function is invoked in a list context, return values of 0 and undef are interpreted as true, not false, because they are one-element lists, and the only false lists are empty ones:

    sub false {
      return 0;
    }

    @a = false();
    if (@a) {          
      print "ooops!\n";
    }

The correct way for a function to return a boolean false value in Perl is almost always a simple return as we have here. In scalar context, this returns an undefined value; in list context, it returns an empty list.
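A short sketch (with invented function names) makes the difference visible:

```perl
use strict;

sub false_zero  { return 0 }   # one-element list (0) in list context
sub false_plain { return   }   # empty list in list context

my @a = false_zero();
my @b = false_plain();

print scalar(@a), "\n";        # 1 -- (0) is a one-element list
print scalar(@b), "\n";        # 0 -- () is empty

print "oops\n" if @a;          # fires: the "false" return tests true
print "ok\n" unless @b;        # bare return stays false in list context
```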

The function has gone from 20 lines to 7. Refitting it for object-oriented style does not make it much bigger:

    sub start_next_part {
        my ($object) = @_;
        my $info_filename = $object->{NAME};
        if ($info_filename =~ /^([^-]*)-(\d*)$/) {
            $info_filename = $1 . '-' . ($2 + 1);
        } else {
            $info_filename .= '-1';
        }
        my $fh = $object->{FH};
        return unless open($fh, $info_filename);
        $object->{NAME} = $info_filename;         # ***
        return $object->start_info_file;
    }

Here we extract the info file's filename from the object using $object->{NAME}, which we originally set up back in open_info_file. We also extract the filehandle from the object using $object->{FH} as we did in start_info_file. If we successfully open the new file, we store the changed filename back into the object, for next time; this occurs on the line marked ***.

read_next_node

Finally, we get to the heart of the library. read_next_node actually reads a nodeful of information and returns it to the caller. (The first thing to notice is that the documentation calls this function get_next_node, which is wrong. But that's an easy fix.)

As far as this function is concerned, the node has three parts. The first line is the header of the node, which contains the name of the node; pointers to the previous and next nodes; and other metainformation. Then there's a long stretch of text, which is the documentation that the node was intended to contain. Somewhere near the bottom of the text is a menu of pointers to other nodes. read_next_node is interested in the header line and the menu. It has three sections: One section to handle the header line, one section to skip the following text until it sees the menu and one section to parse the menu. We'll deal with these one at a time.

     1  # Read next node into global variables.  Assumes that file 
     2  # pointer is positioned at the header line that starts a 
     3  # node.  Leaves file pointer positioned at header line of 
     4  # next node. Programmer: note that nodes are separated by 
     5  # a "\n\037\n" sequence.  Reutrn true on success, false on failure
     6  sub read_next_node {
     7      undef %info_menu;
     8      $_ = <INFO>;                # Header line
     9      if (eof(INFO)) {
    10          return &start_next_part && &read_next_node;
    11      }
    12  
    13      ($info_file) = /File:\s*([^,]*)/;
    14      ($info_node) = /Node:\s*([^,]*)/;
    15      ($info_prev) = /Prev:\s*([^,]*)/;
    16      ($info_next) = /Next:\s*([^,]*)/;
    17      ($info_up)   = /Up:\s*([^,]*)/;

Not much needs to change here. The undef %info_menu was an appropriate initialization when %info_menu was a global variable, but our function isn't going to use global variables; it's going to return the menu information as part of its return list, so we replace this line with my %info_menu. The eof() test is a red flag again; it's probably more straightforward to simply check whether $_ is defined. If it's undefined, then the function has reached the end of the file, and needs to try to open the next part. If that succeeds, then it calls itself recursively to read the first node from the new part. The && used here to sequence those two operations is concise, if a little peculiar. Unfortunately, it won't work any more now that read_next_node returns a list of data: if start_next_part fails, the && yields its false scalar value, which the caller receives as a one-element list, and a one-element list is true. This section of the code needs to change to:

        $_ = <INFO>;                # Header line
        if (! defined $_) {
            return unless  &start_next_part;      
            return &read_next_node;
        }

The recursive call might be considered a little strange, because it's essentially performing a goto back up to the top of the function, and some people might express that with a simple while loop. But it's not really obvious that that would be clearer, so I decided to leave the recursive call in.
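The list-context pitfall that forced this change can be sketched with stand-in functions (fails and node_data are invented for illustration):

```perl
use strict;

sub fails     { return 0 }                 # like start_next_part failing
sub node_data { return (Node => 'Top') }   # like the new read_next_node

# Chaining with && short-circuits to fails()'s scalar 0, and the
# caller's list assignment turns that into a one-element list -- true:
my @bad = (fails() && node_data());
print scalar(@bad), "\n";      # 1 -- failure masquerades as data

# Sequencing with explicit returns keeps failure as an empty list:
sub chained {
    return unless fails();
    return node_data();
}
my @good = chained();
print scalar(@good), "\n";     # 0 -- genuinely false
```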

The subsequent lines extract parts of the header into the global variables $info_file, $info_node and so on. Since we need to make these items into a data structure to be returned from the function, rather than a set of global variables, it's natural to try this:

        ($header{File}) = /File:\s*([^,]*)/;
        ($header{Node}) = /Node:\s*([^,]*)/;
        ($header{Prev}) = /Prev:\s*([^,]*)/;
        ($header{Next}) = /Next:\s*([^,]*)/;
        ($header{Up})   =   /Up:\s*([^,]*)/;

This works, but as I mentioned before, repeated code is the biggest red flag of all. The similarity of these five lines suggests that we should try a loop instead:

        for my $label (qw(File Node Prev Next Up)) {
          ($header{$label}) = /$label:\s*([^,]*)/;
        }

Here five lines have become two. The downside, however, is that Perl has to recompile the pattern five times for each node, because the value of $label keeps changing. There are three things we can do to deal with this. We can ignore it, we can apply the qr// operator to precompile the patterns, or we can try to make the five variable patterns into a single constant pattern. My vote here, as for most questions of micro-optimization, is to ignore it unless it proves to be a real problem. The qr// solution will be an adequate fallback in that case.
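For the record, the qr// fallback might look like this sketch (the sample header line is invented):

```perl
use strict;

# Compile each header pattern once, outside the per-node loop.
my %pat = map { $_ => qr/$_:\s*([^,]*)/ } qw(File Node Prev Next Up);

my $line = "File: camel.info, Node: Top, Next: Intro, Up: (dir)";

my %header;
for my $label (qw(File Node Prev Next Up)) {
    ($header{$label}) = $line =~ $pat{$label};
}

print "$header{Node}\n";    # Top
```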

I did also consider combining them into one pattern, but that turns into a disaster:

    ($file, $node, $next, $prev, $up) = 
      /File:\s*([^,]*),\s*Node:\s*([^,]*),\s*
       Next:\s*([^,]*),\s*Prev:\s*([^,]*),\s*
       Up:\s*([^,]*)/x;

Actually, it's worse than that, because some of the five items might be missing from the header line, so we must make each part optional:

    ($file, $node, $next, $prev, $up) = 
      /(?:File:\s*([^,]*),)?\s*(?:Node:\s*([^,]*),)?\s*
       (?:Next:\s*([^,]*),)?\s*(?:Prev:\s*([^,]*),)?\s*
       (?:Up:\s*([^,]*))?/x;

Actually, it's even worse, because the original author was programming in Perl 4 and didn't have (?:...) or /x. So that tactic really didn't work out.

This brings up an important point that I don't always emphasize as much as I should: It's not always obvious what tactics are best until you have tried them. When I write these articles, I make false starts. I rewrite the code one way, and discover that there are unexpected problems and the gains aren't as big as I thought they were. Then, I try another way and see if it looks better. Sometimes it turns out I was wrong, and the original code wins, as it did in this case.

When you're writing your own code, it won't always be clear how best to proceed. Try it both ways and see which looks better, then throw away the one you don't like as much.

In this article, I had originally planned to rework the library into something that would still have functioned under Perl 4. I wrote a lot of text explaining how to do this. But it turned out that the only good solution was objects, so I did it over, and that's what you see.

The moral: Never be afraid to do it over.

Looking for the menu

OK, end of digression. The function has processed the header line; now it needs to skip the intervening text until it finds the menu part of the node:

    19      $_ = <INFO> until /^(\* Menu:|\037)/ || eof(INFO);
    20      if (eof(INFO)) {
    21          return &start_next_part;
    22      } elsif (/^\037/) { 
    23          return 1; # end of node, so return success.
    24      }

The menu follows a line labeled * Menu:. If the function sees the end of the node or the end of the file before it sees * Menu, then the node has no menu. There's a bug here: The function should return immediately at the end of the node, regardless of whether it is also the end of the file. As originally written, it calls start_next_part at the end of the file, which might fail (if the current node was the last one) and reports the failure back to the caller when it should have reported success. Fixing the bug and eliminating eof() yields this:

    $_ = <INFO> until !defined($_) || /^(\* Menu:|\037)/;
    return @header if !defined($_) || /^\037/;

The repeated tests bothered me there, but the best alternative formulation I could come up with was:

    while (<INFO>) {
      last if /^\* Menu:/;
      return %header if /^\037/;
    }
    return %header unless defined $_;

I asked around, and Simon Cozens suggested

    do { 
      $_ = <INFO>; 
      return %header if /^\037/ || ! defined $_ 
    } until /^\* Menu:/ ;

I think I like this best, because it makes the /^\* Menu:/ into the main termination condition, which is as it should be. On the other hand, do...until is unusual, and you don't get the implicit read into $_. But four versions of the same code is plenty, so let's move on.

Finally our function is ready to read the menu. A typical menu looks like this:

        * Menu:

        * Numerical types::
        * Exactness::
        * Implementation restrictions::
        * Syntax of numerical constants::
        * Numerical operations::
        * Numerical input and output::

Each item has a title (which is displayed to the user) and a node name (which is the node that the user visits next if they select that menu item). If the title and node name are different, the menu item looks like this:

        * The title:       The node name.

If they're the same (as is often the case) the menu item ends in :: as in the examples above. The menu-reading code has to handle both cases:

    27      local($key, $ref);
    28      while (<INFO>) {    
    29          return 1 if /^\037/;    # end of node, success.
    30          next unless /^\* \S/;   # skip non-menu-items
    31          if (/^\* ([^:]*)::/) {  # menu item ends with ::
    32              $key = $ref = $1;
    33          } elsif (/^\* ([^:]*):\s*([^.]*)[.]/) {
    34              ($key, $ref) = ($1, $2);
    35          } else {
    36              print STDERR "Couldn't parse menu item\n\t$_";
    37              next;
    38          }
    39          $info_menu{$key} = $ref;
    40      }

I think this code is lovely. I would do only two things differently. First, I would change the error message to include the filename and line number of the malformed menu entry. Perl's built-in $. variable makes this easy, and the current behavior makes it too difficult for the programmer to locate the source of the problem. And second, instead of returning directly out of the loop, I would use last, because the return value (%header, Menu => \%menu) is rather complicated and the code below the loop will have to return the same thing anyway.

In the original program, that return line calls start_info_file again if the function reads to the end of the current part while still reading the menu. This isn't correct; it should simply return success and let the next call to read_next_node worry about opening the new part.

The rewritten version of read_next_node looks like this:

    sub read_next_node {
        $_ = <INFO>;                # Header line
        if (! defined $_) {
            return unless  &start_next_part;      
            return &read_next_node;
        }

        my (%header, %menu);
        for my $label (qw(File Node Prev Next Up)) {
          ($header{$label}) = /$label:\s*([^,]*)/;
        }

        do { 
          $_ = <INFO>; 
          return %header if /^\037/ || ! defined $_ 
        } until /^\* Menu:/ ;



        while (<INFO>) {    
            my ($key, $ref);
            last if /^\037/;        # end of node
            next unless /^\* \S/;   # skip non-menu-items
            if (/^\* ([^:]*)::/) {  # menu item ends with ::
                $key = $ref = $1;
            } elsif (/^\* ([^:]*):\s*([^.]*)[.]/) {
                ($key, $ref) = ($1, $2);
            } else {
                warn "Couldn't parse menu item at line $. 
                      of file $info_file_name";
                next;
            }
            $menu{$key} = $ref;
        }

        return (%header, Menu => \%menu);
    }

The code didn't get shorter this time, but that's because it was pretty good to begin with. After making a few straightforward changes to convert it to object-oriented style, we get:

    sub read_next_node {
        my ($object) = @_;
        my ($fh) = $object->{FH};
        local $_ = <$fh>;           # Header line
        if (! defined $_) {
            return unless  $object->start_next_part;      
            return $object->read_next_node;
        }

        my (%header, %menu);
        for my $label (qw(File Node Prev Next Up)) {
          ($header{$label}) = /$label:\s*([^,]*)/;
        }

        do { 
          $_ = <$fh>; 
          return %header if /^\037/ || ! defined $_ 
        } until /^\* Menu:/ ;

        while (<$fh>) {    
            my ($key, $ref);
            last if /^\037/;        # end of node
            next unless /^\* \S/;   # skip non-menu-items
            if (/^\* ([^:]*)::/) {  # menu item ends with ::
                $key = $ref = $1;
            } elsif (/^\* ([^:]*):\s*([^.]*)[.]/) {
                ($key, $ref) = ($1, $2);
            } else {
                warn "Couldn't parse menu item at line $. 
                      of file $object->{NAME}";
                next;
            }
            $menu{$key} = $ref;
        }

        return (%header, Menu => \%menu);
    }

The entire object-oriented module is available here.

A simple example program that demonstrates the use of the library:

    use Info_File;
    my $file = shift;
    my $info = Info_File->open_info_file($file)
      or die "Couldn't open $file: $!; aborting";
    while (my %node = $info->read_next_node) {
      print $node{Node},"\n";  # print the node name
    }


Putting It All Together

This time the code hasn't gotten any smaller; it's the same size as it was before. Some parts got smaller, but there was some overhead associated with the conversion to object-oriented style that made the code bigger again.

But the OO style got us several big wins. The interface got better; the library no longer communicates through global variables and no longer smashes INFO. It also gained the capability to process two or more info files simultaneously, or the same info file more than once, which is essential if it's to be useful in any large project. Flexibility has increased also: It would require only a few extra lines to provide the ability to search for any node or to seek back to a node by name.


Red Flags

A summary of the red flags we saw this time:

The Cardinal Rule of Computer Programming is that if you wrote the same code twice, you probably did something wrong. At the very least, you may be setting yourself up for a maintenance problem later on when someone changes the code in one place and not in another.

Programming languages are chock-full of features designed to prevent code duplication, from the very lowest levels (features such as $a[3] += $b instead of $a[3] = $a[3] + $b) to the very highest (features such as DLLs and pipes). In between these levels are essential features such as subroutines and modules.

Each time you see you have written the same code more than once, give serious thought to how you might eliminate all but one instance.
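To make the rule concrete, here is a minimal, hypothetical sketch (the dates and the parse_date sub are invented for illustration): the same split expression written twice, then factored into a single sub so that any future change happens in one place.

```perl
use strict;
use warnings;

# Before: the same parsing expression written twice -- a maintenance trap.
#   my ($y1, $m1, $d1) = split /-/, "2000-11-07";
#   my ($y2, $m2, $d2) = split /-/, "2000-11-14";

# After: one sub holds the logic; change it once, and every caller benefits.
sub parse_date {
    my ($date) = @_;
    return split /-/, $date;
}

my ($y1, $m1, $d1) = parse_date("2000-11-07");
my ($y2, $m2, $d2) = parse_date("2000-11-14");
print "$y1 $m1 $d1\n";   # prints "2000 11 07"
```

If the date format later changes, only parse_date needs to be edited, and both call sites stay correct.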

eof()

The Perl eof() function is almost always a bad choice. It's typically overused by beginners and by people who have been programming in Pascal for too long.

Perl returns an unambiguous end-of-file condition by yielding an undefined value. Perl's I/O operators are designed to make it convenient to check for this. The while(<FH>) construction even does so automatically. Explicit checking of eof() is almost never required or desirable.
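As a concrete sketch (the file name and its contents are invented for the example), the while(<FH>) idiom stops on its own when the read returns undef at end of file, with no eof() in sight:

```perl
use strict;
use warnings;

# Create a small temporary file so the example is self-contained.
open(OUT, ">eof_demo.txt") or die "Can't write eof_demo.txt: $!";
print OUT "alpha\nbeta\ngamma\n";
close OUT;

# while(<IN>) reads one line per iteration and leaves the loop
# automatically when the read returns undef at end of file.
open(IN, "eof_demo.txt") or die "Can't read eof_demo.txt: $!";
my $count = 0;
while (<IN>) {
    $count++;
}
close IN;
unlink "eof_demo.txt";

print "read $count lines\n";   # prints "read 3 lines"
```

There is simply no place in this loop where an explicit eof() test would add anything.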

return 0 and return undef

This is often an attempt to return a value that will be perceived by the caller as a Boolean false. But in list context, it will test as true, not false. Unless the function always returns a single scalar, even in list context, it is usually a better choice to use plain return; to yield a false value.

Some programmers write wantarray() ? () : undef, which does the same thing but is more verbose and confusing.
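A tiny demonstration (the sub names are invented) of why a bare return is safer: in list context, return; yields the empty list, which is false, while return undef; yields a one-element list, which is true:

```perl
use strict;
use warnings;

sub find_good { return; }        # empty list in list context: false
sub find_bad  { return undef; }  # one-element list (undef): TRUE

my @good = find_good();
my @bad  = find_bad();

print "good: ", (@good ? "true" : "false"), "\n";  # prints "good: false"
print "bad:  ", (@bad  ? "true" : "false"), "\n";  # prints "bad:  true"
```

A caller who writes if (my @result = find_bad()) will take the "success" branch even though the function meant to report failure.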


Brief Confession

The program discussed in this article was indeed written by a Perl beginner. I wrote it in 1993 when I had only been programming in Perl for a few months. I must have been pleased with it, because it was the first Perl program that I posted in a public forum.

This Week on p5p 2000/11/07



Notes

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to simon@brecon.co.uk

Apologies for the slightly late appearance of this report; I aim to get them out every Monday, but I'm hoping you were all so distracted by the US election that you didn't notice yet. This week was pretty heavy, reaching either 300 or 550 messages depending on whether you count unique mails or duplicates. However, there wasn't much in the way of hugely interesting content.

Error number parsing

An innocent-looking patch to t/lib/io_multihomed.t managed to turn up some problems with Errno.pm. (Good work Casey Tweten for finally tracking that down, after what looked like a rather agonizing debugging session...)

If you didn't know, the Errno module is automatically generated by reading the system's errno.h file when Perl is being built. However, it looks like some people's systems and libc implementations use macros that are too difficult for Perl to parse automatically, which meant that some error constants (you know, like EINPROGRESS and ETOOSTUPID and so on) were not getting picked up. (Tom Hughes noticed this on RedHat 7.0 as well.) The solution is to use the -dM flag to gcc; there's not much we can do for non-gcc right now.

VMS hackery

There was a lot of work on VMS this week: firstly, in the area of file locking. Peter Prymmer reported that t/lib/st-lock.t failed since VMS doesn't really have locking. Peter Farley piped up and said that this was also the case on DOS and DJGPP, and the work-around there was to null-op the locking tests. Andy Dougherty mentioned that a permanent fix would be to define a d_fcntl_can_lock Configure variable to determine whether or not fcntl is any use. Craig Berry had suggested this idea before; Andy provided the patch.

Peter also noticed that one of Jarkko's patches had broken the POSIX module; there were also some problems with erroneous output confusing the test harnesses, so that the tests were passing but were still reporting failures.

After all that, Perl on VMS now passes all tests, and there was much rejoicing!

The (f)crypt of mystery

Richard Proctor is trying to compile Perl on RiscOS (and take over the RiscOS pumpkin); however, he is having problems because fcrypt isn't prototyped on that system. He tried to force the issue with a couple of casts, but Jarkko wasn't impressed; the Right Way to do it is to have a separate riscos/ directory and put the platform-specific things in there. Nicholas Clark remarked that the [fcrypt] support was "weird":

There are 4 references to FCRYPT but nothing anywhere (even in a port subdirectory) defines the macro. The macro isn't of the usual HAS_ or I_ forms. Does anyone know where it originates?

Nobody did, which is... well, weird. However, we found out where RiscOS's UnixLib got its fcrypt implementation from.

Andy Dougherty suggested that the other right way to do it would be to define (and test for) a configure symbol d_fcrypt_proto which then turns into HAS_FCRYPT_PROTO.

No patches were forthcoming.

Yet more self-ties

Randy Ray has a module ( Perl-RPM) which uses a self-tie; he found that the reference count of a self-tied object is two, where you'd expect it to be one. Alan replied that self-ties are badly broken (hey, don't we know it...) and Jarkko announced that we could really do with someone who's prepared to tackle this issue and get it sorted out. Randy promised to try and have a look.

Steffen Bayer suggested that you could just manually set the reference count back to one, which struck Alan as somewhat sick. Steffen responded:

IMO, this is a matter of opinion. :-)

There's nothing fundamentally evil about setting the refcount to a lower value, because this just triggers the DESTROY() event sooner than it would normally.

But Nick noted that you can't ensure that you're not going to create any more references to the object during DESTROYing, so this really is fundamentally evil after all.

Rsync vs. FTP'ing the patches

Merijn Brand complained of a discrepancy between the patch archive (rsync://ftp.linux.activestate.com/perl-current-diffs/ or ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/perl-current-diffs) and the latest source tree (rsync://ftp.linux.activestate.com/perl-current/ or ftp://ftp.linux.activestate.com/pub/staff/gsar/APC/perl-current).

Sarathy explained that

The rsync mirror is automatic and syncs with the repository every five minutes.

Updating the patch area still requires manual intervention (with all the goofiness that implies, which you've noted) and is typically on a daily cycle. Making this process automatic is on my tuit list, but don't ask me when.

The thread went a little off topic into the mechanics of setting up an rsync server, but led to a contribution to perlhack from Merijn explaining how to keep in sync with the Perl source.

Changes to README.aix

We've recently been telling AIX users that they need to upgrade their compilers in order to compile Perl, so Merijn sent in a patch to README.aix to tell them how to do so. Read about it.

Jarkko also decided it was time we had a README.Solaris; volunteers are being sought.

The Regex Stack Problem

Sam Smith found a regular expression which will crash the engine with a segfault - further investigation showed it was the engine recursing too deeply and exceeding the maximum stack depth. Dominic Dunlop explained that it was difficult to fix this one, as writing stack checking code portably is quite a challenge, and fixing the root of the problem - changing the regex engine from recursion to iteration - is somewhat Herculean.

Dominic further explained that he had a stack checking patch but it depended on the system having the BSD resource limit functions getrlimit and setrlimit. He also couldn't find a nice way of making this all play sensibly with threads. At the time, Ilya had countered that the problem should really be fixed by rewriting the engine to be iterative; Dominic had tried this, but it was very tricky and slowed the whole thing down by about 30% - this was due to the extra memory allocation of the structures that needed to be copied between levels. Jarkko came up with a few ideas, most promising being the use of a custom stack.

The work-around is to increase the process's stack size using ulimit; it was suggested that perl should be able to use setrlimit to extend its own stack size when necessary. This was frowned upon by Jarkko - limits are imposed for a reason.

Anyone who's got infinite patience and sanity may want to consider hacking at removing the recursion, but so far Dominic and Hugo look like the only people patient and sane enough.

Things nobody's fixed

A few bug reports nobody's investigated yet, for those of you with a spare minute: xsubpp truncates files, the AUTOLOAD behaviour changed in 5.6, and UTF |= is broken.

Enjoy!

Craig Berry

Craig Berry deserves a special mention for his huge contribution to p5p this week - we saw nearly three hundred messages from him. However, on closer examination, 284 of them turned out to be the same, thanks to a malicious copy of CC Mail belonging to somewhere completely different. (It wasn't Craig's fault at all.)

Various

I leave you with a veritable smorgasbord of misdirected queries, test reports, FAQs, off-topic posts, and people spamming the list about how much spam there was on the list, and until next week I remain, your humble and obedient servant,


Simon Cozens

Beginner's Introduction to Perl - Part 2

Editor's note: this venerable series is undergoing updates. You might be interested in the newer versions, available at:



Table of Contents

Part 1 of this series
Part 3 of this series
Part 4 of this series
Part 5 of this series
Part 6 of this series

Comparison operators
while and until
String comparisons
More fun with strings
Filehandles
Writing files
Live free or die!
Subs
Putting it all together
Play around!

In our last article, we talked about the core elements of Perl: variables (scalars, arrays, and hashes), math operators and some basic flow control (the for statement). Now it's time to interact with the world.

In this installment, we're going to discuss how to slice and dice strings, how to play with files and how to define your own functions. But first, we'll discuss one more core concept of the Perl language: conditions and comparisons.

Comparison operators

There's one important element of Perl that we skipped in the last article: comparison operators. Like all good programming languages, Perl allows you to ask questions such as ``Is this number greater than that number?'' or ``Are these two strings the same?'' and do different things depending on the answer.

When you're dealing with numbers, Perl has four important operators: <, >, == and !=. These are the ``less than,'' ``greater than,'' ``equal to'' and ``not equal to'' operators. (You can also use <=, ``less than or equal to,'' and >=, ``greater than or equal to.'')

You can use these operators along with one of Perl's conditional keywords, such as if and unless. Both of these keywords take a condition that Perl will test, and a block of code in curly brackets that Perl will run if the test works. These two words work just like their English equivalents - an if test succeeds if the condition turns out to be true, and an unless test succeeds if the condition turns out to be false:

    if ($year_according_to_computer == 1900) {
        print "Y2K has doomed us all!  Everyone to the compound.\n";
    }

    unless ($bank_account > 0) {
        print "I'm broke!\n";
    }

Be careful of the difference between = and ==! One equals sign means ``assignment'', two means ``comparison for equality''. This is a common, evil bug:

    if ($a = 5) {
        print "This works - but doesn't do what you want!\n";
    }

Instead of testing whether $a is equal to five, you've made $a equal to five and clobbered its old value. (In a later article, we'll discuss a way to make sure this bug won't occur in running code.)

Both if and unless can be followed by an else statement and code block, which executes if your test failed. You can also use elsif to chain together a bunch of if statements:

    if ($a == 5) {
        print "It's five!\n";
    } elsif ($a == 6) {
        print "It's six!\n";
    } else {
        print "It's something else.\n";
    }

    unless ($pie eq 'apple') {
        print "Ew, I don't like $pie flavored pie.\n";
    } else {
        print "Apple!  My favorite!\n";
    }

while and until

Two slightly more complex keywords are while and until. They both take a condition and a block of code, just like if and unless, but they act like loops similar to for. Perl tests the condition, runs the block of code and runs it over and over again for as long as the condition is true (for a while loop) or false (for an until loop).

Take a look at the following code and try to guess what it will do before reading further:

   $a = 0;

   while ($a != 3) {
       $a++;
       print "Counting up to $a...\n";
   }

   until ($a == 0) {
       $a--;
       print "Counting down to $a...\n";
   }

Here's what you see when you run this program:

    Counting up to 1...
    Counting up to 2...
    Counting up to 3...
    Counting down to 2...
    Counting down to 1...
    Counting down to 0...

String comparisons

So that's how you compare numbers. Now, what about strings? The most common string comparison operator is eq, which tests for string equality - that is, whether two strings have the same value.

Remember the pain that is caused when you mix up = and ==? Well, you can also mix up == and eq. This is one of the few cases where it does matter whether Perl is treating a value as a string or a number. Try this code:

    $yes_no = "no";
    if ($yes_no == "yes") {
        print "You said yes!\n";
    }

Why does this code think you said yes? Remember that Perl automatically converts strings to numbers whenever it's necessary; the == operator implies that you're using numbers, so Perl converts the value of $yes_no (``no'') to the number 0, and ``yes'' to the number 0 as well. Since this equality test works (0 is equal to 0), the if block gets run. Change the condition to $yes_no eq "yes", and it'll do what it should.

Things can work the other way, too. The number five is numerically equal to the string " 5 ", so comparing them with == works. But when you compare five and " 5 " with eq, Perl will convert the number to the string "5" first, and then ask whether the two strings have the same value. Since they don't, the eq comparison fails. This code fragment will print Numeric equality!, but not String equality!:

    $a = 5;
    if ($a == " 5 ") { print "Numeric equality!\n"; }
    if ($a eq " 5 ") { print "String equality!\n"; }

More fun with strings

You'll often want to manipulate strings: Break them into smaller pieces, put them together and change their contents. Perl offers three functions that make string manipulation easy and fun: substr(), split() and join().

If you want to retrieve part of a string (say, the first four characters or a 10-character chunk from the middle), use the substr() function. It takes either two or three parameters: the string you want to look at, the character position to start at (the first character is position 0) and the number of characters to retrieve. If you leave out the number of characters, you'll retrieve everything up to the end of the string.

    $a = "Welcome to Perl!\n";
    print substr($a, 0, 7);     # "Welcome"
    print substr($a, 7);        # " to Perl!\n"

A neat and often-overlooked thing about substr() is that you can use a negative character position. This will retrieve a substring that begins that many characters from the end of the string.

     $a = "Welcome to Perl!\n";
     print substr($a, -6, 4);      # "Perl"

(Remember that inside double quotes, \n represents the single new-line character.)

You can also manipulate the string by using substr() to assign a new value to part of it. One useful trick is using a length of zero to insert characters into a string:

    $a = "Welcome to Java!\n";
    substr($a, 11, 4) = "Perl";   # $a is now "Welcome to Perl!\n";
    substr($a, 7, 3) = "";        #       ... "Welcome Perl!\n";
    substr($a, 0, 0) = "Hello. "; #       ... "Hello. Welcome Perl!\n";

Next, let's look at split(). This function breaks apart a string and returns a list of the pieces. split() generally takes two parameters: a regular expression to split the string with and the string you want to split. (We'll discuss regular expressions in more detail in the next article; for the moment, we're only going to use a space. Note the special syntax for a regular expression: / /.) The characters you split won't show up in any of the list elements.

    $a = "Hello. Welcome Perl!\n";
    @a = split(/ /, $a);   # Three items: "Hello.", "Welcome", "Perl!\n"

You can also specify a third parameter: the maximum number of items to put in your list. The splitting will stop as soon as your list contains that many items:

    $a = "Hello. Welcome Perl!\n";
    @a = split(/ /, $a, 2);   # Two items: "Hello.", "Welcome Perl!\n";

Of course, what you can split, you can also join(). The join() function takes a list of strings and attaches them together with a specified string between each element, which may be an empty string:

    @a = ("Hello.", "Welcome", "Perl!\n");
    $a = join(' ', @a);       # "Hello. Welcome Perl!\n";
    $b = join(' and ', @a);   # "Hello. and Welcome and Perl!\n";
    $c = join('', @a);        # "Hello.WelcomePerl!\n";

Filehandles

Enough about strings. Let's look at files - after all, what good is string manipulation if you can't do it where it counts?

To read from or write to a file, you have to open it. When you open a file, Perl asks the operating system if the file can be accessed - does the file exist if you're trying to read it (or can it be created if you're trying to create a new file), and do you have the necessary file permissions to do what you want? If you're allowed to use the file, the operating system will prepare it for you, and Perl will give you a filehandle.

You ask Perl to create a filehandle for you by using the open() function, which takes two arguments: the filehandle you want to create and the file you want to work with. First, we'll concentrate on reading files. The following statement opens the file log.txt using the filehandle LOGFILE:

    open (LOGFILE, "log.txt");

Perl and the operating system handle all of these behind-the-scenes tasks together, so in general you don't need to worry about them.

Once you've opened a file to read, you can retrieve lines from it by using the <> construct. Inside the angle brackets, place the name of your filehandle. What is returned by this depends on what you want to get: in a scalar context (a more technical way of saying ``if you're assigning it to a scalar''), you retrieve the next line from the file, but if you're looking for a list, you get a list of all the remaining lines in the file. (One common trick is to use for $lines (<FH>) to retrieve all the lines from a file - the for means you're asking for a list.)

You can, of course, close a filehandle that you've opened. You don't always have to do this, because Perl is clever enough to close a filehandle when your program ends or when you try to reuse an existing filehandle. It's a good idea, though, to use the close statement. Not only will it make your code more readable, but your operating system has built-in limits on the number of files that can be open at once, and each open filehandle will take up valuable memory.

Here's a simple program that will display the contents of the file log.txt, and assumes that the first line of the file is its title:

    open (LOGFILE, "log.txt") or die "I couldn't get at log.txt";
    # We'll discuss the "or die" in a moment.

    $title = <LOGFILE>;
    print "Report Title: $title";
    for $line (<LOGFILE>) {
        print $line;
    }
    close LOGFILE;

Writing files

You also use open() when you are writing to a file. There are two ways to open a file for writing: overwrite and append. When you open a file in overwrite mode, you erase whatever it previously contained. In append mode, you attach your new data to the end of the existing file without erasing anything that was already there.

To indicate that you want a filehandle for writing, you put a single > character before the filename you want to use. This opens the file in overwrite mode. To open it in append mode, use two > characters.

     open (OVERWRITE, ">overwrite.txt") or die "$! error trying to overwrite";
     # The original contents are gone, wave goodbye.

     open (APPEND, ">>append.txt") or die "$! error trying to append";
     # Original contents still there, we're adding to the end of the file

Once our filehandle is open, we can use the humble print statement to write to it. Specify the filehandle you want to write to and a list of values you want to write:

    print OVERWRITE "This is the new content.\n";
    print APPEND "We're adding to the end here.\n", "And here too.\n";

Live free or die!

You noticed that most of our open() statements are followed by or die "some sort of message". This is because we live in an imperfect world, where programs don't always behave exactly the way we want them to. It's always possible for an open() call to fail; maybe you're trying to write to a file that you're not allowed to write to, or you're trying to read from a file that doesn't exist. In Perl, you can guard against these problems by using or and and.

A series of statements separated by or will continue until you hit one that works, that is, one that returns a true value. This line of code will either succeed at opening OUTPUT in overwrite mode, or cause Perl to quit:

    open (OUTPUT, ">$outfile") or die "Can't write to $outfile: $!";

The die statement ends your program with an error message. The special variable $! contains Perl's explanation of the error. In this case, you might see something like this if you're not allowed to write to the file. Note that you get both the actual error message (``Permission denied'') and the line where it happened:

    Can't write to a2-die.txt: Permission denied at ./a2-die.pl line 1.

Defensive programming like this is useful for making your programs more error-resistant - you don't want to write to a file that you haven't successfully opened!

Here's an example: As part of your job, you write a program that records its results in a file called vitalreport.txt. You use the following code:

    open VITAL, ">vitalreport.txt";

If this open() call fails (for instance, vitalreport.txt is owned by another user who hasn't given you write permission), you'll never know it until someone looks at the file afterward and wonders why the vital report wasn't written. (Just imagine the joy if that ``someone'' is your boss, the day before your annual performance review.) When you use or die, you avoid all this:

    open VITAL, ">vitalreport.txt" or die "Can't write vital report: $!";

Instead of wondering whether your program wrote your vital report, you'll immediately have an error message that both tells you what went wrong and on what line of your program the error occurred.

You can use or for more than just testing file operations:

    ($pie eq 'apple') or ($pie eq 'cherry') or ($pie eq 'blueberry')
        or print "But I wanted apple, cherry, or blueberry!\n";

In this sequence, if you have an appropriate pie, Perl skips the rest of the chain. Once one statement works, the rest are ignored. The and operator does the opposite: It evaluates your chain of statements, but stops when one of them doesn't work.

   open (LOG, "log.file") and print "Logfile is open!\n";

This statement will only show you the words Logfile is open! if the open() succeeds - do you see why?

Subs

So far, our Perl programs have been a bunch of statements in series. This is OK if you're writing very small programs, but as your needs grow, you'll find it's limiting. This is why most modern programming languages allow you to define your own functions; in Perl, we call them subs.

A sub is defined with the sub keyword, and adds a new function to your program's capabilities. When you want to use this new function, you call it by name. For instance, here's a short definition of a sub called boo:

    sub boo {
        print "Boo!\n";
    }

    boo();   # Eek!

(Older versions of Perl required that you precede the name of a sub with the & character when you call it. You no longer have to do this, but if you see code that looks like &boo in other people's Perl, that's why.)

Subs are useful because they allow you to break your program into small, reusable chunks. If you need to analyze a string in four different places in your program, it's much easier to write one &analyze_string sub and call it four times. This way, when you make an improvement to your string-analysis routine, you'll only need to do it in one place, instead of four.

In the same way that Perl's built-in functions can take parameters and can return values, your subs can, too. Whenever you call a sub, any parameters you pass to it are placed in the special array @_. You can also return a single value or a list by using the return keyword.

    sub multiply {
        my (@ops) = @_;
        return $ops[0] * $ops[1];
    }

    for $i (1 .. 10) {
         print "$i squared is ", multiply($i, $i), "\n";
    }

Why did we use the my keyword? That indicates that the variables are private to that sub, so that any existing value for the @ops array we're using elsewhere in our program won't get overwritten. This means that you'll evade a whole class of hard-to-trace bugs in your programs. You don't have to use my, but you also don't have to avoid smashing your thumb when you're hammering nails into a board. They're both just good ideas.
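To see what my is protecting you from, compare a global @ops with the private one inside the sub (a small sketch; the variable contents are just for illustration):

```perl
@ops = ("keep", "me");        # a global @ops used elsewhere in the program

sub multiply {
    my (@ops) = @_;           # this @ops is private to the sub
    return $ops[0] * $ops[1];
}

print multiply(6, 7), "\n";   # prints 42
print "@ops\n";               # prints "keep me" - the global survived
```

Without the my, the assignment inside multiply would have clobbered the global @ops.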

You can also use my to set up local variables in a sub without assigning them values right away. This can be useful for loop indexes or temporary variables:

    sub annoy {
        my ($i, $j);
        for $i (1 .. 100) {
            $j .= "Is this annoying yet?\n";
        }
        print $j;
    }

If you don't expressly use the return statement, the sub returns the result of the last statement. This implicit return value can sometimes be useful, but it does reduce your program's readability. Remember that you'll read your code many more times than you write it!
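For example, this little sub (a made-up example) returns its sum even though it never says return:

```perl
sub add {
    my ($x, $y) = @_;
    $x + $y;            # last statement evaluated, so its value is returned
}

print add(2, 3), "\n";  # prints 5
```

An explicit return $x + $y; does the same thing and tells the reader you meant it.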

Putting it all together

At the end of the first article we had a simple interest calculator. Now let's make it a bit more interesting by writing our interest table to a file instead of to the screen. We'll also break our code into subs to make it easier to read and maintain.


        #!/usr/local/bin/perl -w



        # compound_interest_file.pl - the miracle of compound interest, part 2


        # First, we'll set up the variables we want to use.
        $outfile = "interest.txt";  # This is the filename of our report.
        $nest_egg = 10000;          # $nest_egg is our starting amount
        $year = 2000;               # This is the starting year for our table.
        $duration = 10;             # How many years are we saving up?
        $apr = 9.5;                 # This is our annual percentage rate.


        &open_report;
        &print_headers;
        &interest_report($nest_egg, $year, $duration, $apr);
        &report_footer;


        sub open_report {
            open (REPORT, ">$outfile") or die "Can't open report: $!";
        }


        sub print_headers {
            # Print the headers for our report.
            print REPORT "Year", "\t", "Balance", "\t", "Interest", "\t",
                         "New balance", "\n";
        }


        sub calculate_interest {
            # Given a nest egg and an APR, how much interest do we collect?
            my ($nest_egg, $apr) = @_;


            return int (($apr / 100) * $nest_egg * 100) / 100;
        }


        sub interest_report {
            # Get our parameters.  Note that these variables won't clobber the
            # global variables with the same name.
            my ($nest_egg, $year, $duration, $apr) = @_;


            # We have three local variables, so we'll use my to declare them here.
            my ($i, $interest, $line);


            # Calculate interest for each year.
            for $i (1 .. $duration) {
                $year++;
                $interest = &calculate_interest($nest_egg, $apr);


                $line = join("\t", $year, $nest_egg, $interest,
                             $nest_egg + $interest) . "\n";


                print REPORT $line;

                $nest_egg += $interest;
            }
        }

        sub report_footer {
            print REPORT "\n Our original assumptions:\n";
            print REPORT "   Nest egg: $nest_egg\n";
            print REPORT "   Number of years: $duration\n";
            print REPORT "   Interest rate: $apr\n";

            close REPORT;
        }

Notice how much clearer the program logic becomes when you break it down into subs. One nice quality of a program written as small, well-named subs is that it almost becomes self-documenting. Take a look at these four lines from our program:

     &open_report;
     &print_headers;
     &interest_report($nest_egg, $year, $duration, $apr);
     &report_footer;

Code like this is invaluable when you come back to it six months later and need to figure out what it does - would you rather spend your time reading the entire program trying to figure it out or read four lines that tell you the program 1) opens a report file, 2) prints some headers, 3) generates an interest report, and 4) prints a report footer?

You'll also notice we use my to set up local variables in the interest_report and calculate_interest subs. The value of $nest_egg in the main program never changes. This is useful at the end of the report, when we output a footer containing our original assumptions. Since we never specified a local $nest_egg in report_footer, we use the global value.

Play around!

In this article, we've looked at files (filehandles, open(), close(), and <>), string manipulation (substr(), split() and join()) and subs. Here's a pair of exercises - again, one simple and one complex:

  • You have a file called dictionary.txt that contains dictionary definitions, one per line, in the format "word space definition". Write a program that will look up a word from the command line. (Hints: @ARGV is a special array that contains your command-line arguments, and you'll need to use the three-argument form of split().) Try to enhance it so that your dictionary can also contain words with multiple definitions in the format "word space definition:alternate definition:alternate definition, etc...".

  • Write an analyzer for your Apache logs. You can find a brief description of the common log format at http://www.w3.org/Daemon/User/Config/Logging.html. Your analyzer should count the total number of requests for each URL, the total number of results for each status code and the total number of bytes output.
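For the dictionary exercise, the key hint is that the three-argument form of split() takes a limit, so a definition containing spaces stays in one piece. A quick illustration with a made-up line:

```perl
my $line = "perl A language for getting your job done";

# Split into at most two pieces: everything after the first
# space stays together as the definition.
my ($word, $definition) = split / /, $line, 2;

print "$word: $definition\n";
# prints "perl: A language for getting your job done"
```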

Hold the Sackcloth and Ashes



(This is a slightly edited and expanded version of the reply I sent to perl6-meta@perl.org, spurred by Mark-Jason Dominus sending the URL of his critique of the Perl 6 RFC process.)


I partly agree with Mark-Jason Dominus' critique, but not fully.

He did point out several shortcomings that I agree with: we should have exercised tighter control by requiring authors to record opposing opinions and pointed-out deficiencies, by better defining the roles of the working-group chairs and moderators, and by giving those moderators the power to force changes or even drop controversial or simply bad RFCs.

But I certainly don't agree that the whole process was a fiasco.

Firstly, for the first time in Perl's history, we opened the floodgates, so to speak, and had at least some sort of (admittedly weakly) formalized protocol for submitting ideas for enhancement, instead of the shark tank known as the perl5-porters (p5p).

(On the other hand, we do need shark tanks. If an idea wasn't solid enough to survive the p5p ordeal by fire, it probably wasn't solid enough to begin with. In p5p you also ultimately had to have the code to prove your point.)

Secondly, what was -- and still is -- sorely missing from the p5p process is writing things down. The first round of Perl 6 RFCs certainly weren't shining examples of how RFCs should be written, but at least they were written down. Unless an idea is written down, it is close to impossible to discuss it in any serious terms in the email medium. Writing an idea down is also often a very good way to organize your thoughts, possibly even leading you to see why the idea wouldn't work anyway.

Thirdly, to continue the theme of the first point, we now have a record of the kinds of things people want, seven years after Perl 5 came out. Not necessarily a representative list, perhaps not a well-thought-out list, but a list. The ideas are quite often suggested in far too much detail, or they try to shoehorn un-Perlish ways into Perl, suggesting things that clearly do not belong in a language core, breaking backward compatibility, and other evil things. But now we have an idea of the kinds of things (both language-wise and application/data-wise) people want to do in Perl and with Perl, or don't like about Perl.

Based on that feedback, Larry can design Perl 6 to be more flexible and to accommodate as many of those requests as possible in some way. Not all of them, and most of them will probably be implemented in a more Perlish way than suggested, and, I guess, often at a much lower level than the RFC submitter thought they should be. After all, Perl is a general-purpose programming language, not a language designed for some specific application task, nor is Perl a language with theoretical axes to grind, as Larry points out in his Atlanta Linux Showcase talk.

Without the RFC process we wouldn't have had that feedback.

I vehemently disagree with the quip that we would have been better off if everybody had just sent Larry their suggestions. We did have a process: it was public, it was announced, it had rules, it had discussions, and it had a definite deadline.

The state of the IMPLEMENTATION sections is not a cause for great concern. At this stage of Perl 6, where no code exists, deep implementation details would be pointless. For many relatively small RFC ideas that would cause only surface changes to the language, "trivial" is a perfectly good implementation description.

We certainly expected (I certainly expected) RFCs of deeper technical level and detail, with more implementation plans or details, or with more background research on existing practices in other languages or application areas. But obviously our expectations were wrong, and we will have to work with what we got.

On the whole I think the process was a success. Of course it, like everything, could have been conducted better, with tighter rein, but we do have a good start into Perl 6 as a result.
