Testing Files and Test Modules

For the last several years, there has been more and more emphasis on automated testing. No self-respecting CPAN author can post a distribution without tests. Yet some things are hard to test. This article explains how writing Test::Files gave me a useful tool for validating one module's output and taught me a few things about the current state of Perl testing.

Introduction

My boss put me to work writing a moderately large suite in Perl. Among many other things, it needed to perform check out and commit operations on CVS repositories. In a quest to build quality tests for that module, I wrote Test::Files, which is now on CPAN. This article explains how to use that module and, perhaps more importantly, how it tests itself.

Using Test::Files

To use Test::Files, first use Test::More and tell it how many tests you want to run.

use strict;
use warnings;
use Test::More tests => 5;
use Test::Files;

After you use the module, there are four things it can help you do:

  • Compare one file to a string or to another file.
  • Make sure that directories have the files you expect them to have.
  • Compare all the files in one directory to all the files in another directory.
  • Exclude some things from consideration.

Single Files

In the simplest case, you have written a file. Now it is time to validate it. That could look like this:

file_ok($file_name, "This is the\ntext\n",
    "file one contents");

The file_ok function takes two (or optionally, and preferably, three) arguments. The first is the name of the file you want to validate. The second is a text string containing the text that should be in the file. The third is the name of the test. In the rush of writing, I'm likely to fail to mention the test names at some point, so let me say up front that all of the tests shown here take a name argument. Including a name makes finding the test easier.

If the file agrees with the string, the test passes with only an OK message. Otherwise, the test will fail and diagnostic messages will show where the two differed. The diagnostic output is really the reason to use Test::Files.

Some, including myself, prefer to check one file against another. I put one version in the distribution; my tests write the other. To compare two files, use:

compare_ok($file1, $file2, $name);

As with file_ok, if the files are the same, Test::Files only reports an OK message. Failure shows where the files differ.
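In practice, that write-then-compare pattern looks something like the following sketch. The module name and file paths here are hypothetical stand-ins, not part of Test::Files:

use Test::More tests => 1;
use Test::Files;

# Hypothetical: the module under test writes its output file...
My::Report->write_to('t/built_report.txt');

# ...which we then compare against the copy shipped with the distribution.
compare_ok('t/built_report.txt', 't/expected_report.txt',
    "report output matches shipped copy");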

Directory Structures

Sometimes, you need to validate that certain files are present in a directory. Other times, you need to make that check exclusive, so that only known files are present. Finally, you might want to know not only that the directory structure is the same, but that the files contain the same data.

To look for some files in a directory by name, write:

dir_contains_ok($dir, [qw(list files here)], $name);

This will succeed, even if the directory has some other files you weren't looking for.

To ensure that your list is exclusive, add only to the function name:

dir_only_contains_ok($dir, [qw(list all files here)], $name);

Both of these report a list of the missing files when they fail because of absences. The exclusive form also reports a list of unexpected files, if it sees any.

Directory Contents

If knowing that certain file names are present is not enough, use the compare_dirs_ok function to check the contents of all files in one directory against files in another directory. A typical module might build one directory during make test, with the other built ahead of time and shipped with the distribution.

compare_dirs_ok($test_built, $shipped, $name);

This will generate a separate diagnostic diff output for each pair of files that differs, in addition to listing files that are missing from either distribution. (If you need to know which files are missing from the built directory, either reverse the order of the directories or use dir_only_contains_ok in addition to compare_dirs_ok. This is a bug and might eventually be fixed.) Even though this could yield many diagnostic reports, all of those separate failures only count as one failed test.

There are many times when testing all files in the directories is just wrong. In these cases, it is best to use File::Find or an equivalent, putting an exclusion criterion at the top of your wanted function and a call to compare_ok at the bottom. This probably requires you to use no_plan with Test::More:

use Test::More qw(no_plan);

Test::More wants to know the exact number of tests you are about to run. If you tell it the wrong number, the test harness will think something is wrong with your test script, causing it to report failures. To avoid this confusion, use no_plan--but keep in mind that plans are there for a reason. If your test dies, the plan lets the harness know how many tests it missed. If you have no_plan, the harness doesn't always have enough information to keep score. Thus, you should put such tests in separate scripts, so that the harness can count your other tests properly.
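Here is a minimal sketch of that File::Find approach, assuming a tree built by the tests in t/built that should mirror a tree shipped in t/shipped (both directory names invented for this example):

use strict;
use warnings;
use Test::More qw(no_plan);
use Test::Files;
use File::Find;
use File::Spec;

my $built   = 't/built';      # hypothetical directories
my $shipped = 't/shipped';

find({ no_chdir => 1, wanted => sub {
    return if -d;                       # compare plain files only
    return if m{/CVS/} or /\.bak$/;     # exclusion criterion at the top
    my $rel = File::Spec->abs2rel($_, $built);
    compare_ok($_, File::Spec->catfile($shipped, $rel),
        "contents of $rel");            # comparison at the bottom
} }, $built);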

Filtering

While the above list of functions seemed sufficient during planning, reality set in as soon as I tried it out on my CVS module. I wanted to compare two CVS repositories: one ready for shipment with the distribution, the other built during testing. As soon as I tried the test it failed, not because the operative parts of the module were not working, but because the CVS timestamps differed between the two versions.

To deal with cosmetic differences that should not count as failures, I added two functions to the above list: one for single files and the other for directories. These new functions accept a code reference that receives each line prior to comparison. It performs any needed alterations, and then returns a line suitable for comparison. My example function below redacts the offending timestamps. With the filtered versions in place, the tests pass and fail when they should.

My final tests for the CVS repository directories look like this:

compare_dirs_filter_ok(
    't/cvsroot/CVSROOT',
    't/sampleroot/CVSROOT',
    \&chop_dates,
    "make repo"
);

The code reference argument comes between the directory names and the test name. The chop_dates function is not particularly complex. It removes two kinds of dates favored by CVS, as shown in its comments.

sub chop_dates {
    my $line =  shift;

    #  2003.10.15.13.45.57 (year month day hour minute sec)
    $line    =~ s/\d{4}(.\d\d){5}//;

    #  Thu Oct 16 18:00:28 2003
    $line    =~ s/\w{3} \w{3} \d\d? \d\d:\d\d:\d\d \d{4}//;

    return $line;
}

This shows the general behavior of filters. They receive a line of input which they must not directly change. Instead, they must return a new, corrected line.

In addition to compare_dirs_filter_ok for whole directory structures, there is also compare_filter_ok, which works similarly for single file comparisons. (There is no file_filter_ok, but maybe there should be.)
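Used on its own, compare_filter_ok might look like this sketch; the file names are hypothetical, and the argument order mirrors compare_dirs_filter_ok:

compare_filter_ok('t/cvsroot/config', 't/sampleroot/config',
    \&chop_dates, "config matches, dates ignored");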

Testing a Test Module

The most interesting part of writing Test::Files was learning how to test it. Thanks to Schwern, I learned about Test::Builder::Tester, which eases the problems inherent in testing a Perl test module.

The difficulty with testing Perl tests has to do with how they normally run. The venerable test harness scheme expects test scripts to produce pass and fail data on standard out and diagnostic help on standard error. This is a great design. The simplicity is exactly what you would expect from a Unix-inspired tool. Yet, it poses a problem for testing test modules.

When users eventually run tests built with the test module, their harness expects it to write quite specific things to standard out and standard error. Among the things that must go to standard out is a sequence of lines such as ok 1. When you write a test of the test module, its harness also expects to see this sort of data on standard out and standard error. Having two different sources of ok 1 is highly confusing, not least to the harness, which chokes on such duplications.

Test module writers need a scheme to trap the output of the module under test, check it for correct content, and report the result on the real standard channels for the harness to see. This is tricky: it requires diverting file handles at just the right moments, without the knowledge of the module whose output is being diverted. Doing this by hand is inelegant and error-prone. Further, multiple test scripts might each have to recreate home-rolled solutions (introducing the oldest of known coding sins: duplication of code). Finally, the diagnostic output from homemade diverters is unlikely to be helpful when tests of the test module fail.

Enter Test::Builder::Tester.

To help us test testers, Mark Fowler collected some code from Schwern, and used it to make Test::Builder::Tester. With it, tests of test modules are relatively painless and their failure diagnostics are highly informative. Here are two examples from the Test::Files test suite. The first shows a file comparison that should pass:

test_out("ok 1 - passing file");
compare_ok("t/ok_pass.dat", "t/ok_pass.same.dat",
    "passing file");
test_test("passing file");

This test should work, generating ok 1 - passing file on standard output. To tell Test::Builder::Tester what the standard output should be, I called test_out. After the test, I called test_test with only the name of my test. (To avoid confusion, I made the test names the same.)

Between the call to test_out and the one to test_test, Test::Builder::Tester diverts the regular output channels so that the harness doesn't see them.

The second example shows a failing test and how to check both standard out and standard error. The latter contains the diagnostic data the module should generate.

test_out("not ok 1 - failing file");
$line = line_num(+9);
test_diag("    Failed test (t/03compare_ok.t at line $line)",
'+---+--------------------+-------------------+',
'|   |Got                 |Expected           |',
'| Ln|                    |                   |',
'+---+--------------------+-------------------+',
'|  1|This file           |This file          |',
'*  2|is for 03ok_pass.t  |is for many tests  *',
'+---+--------------------+-------------------+'  );
compare_ok("t/ok_pass.dat", "t/ok_pass.diff.dat",
    "failing file");
test_test("failing file");

Two new functions appear here. First, line_num returns the current line number plus or minus an offset. Because failing tests report the line number of the failure, checking standard error for an exact match requires matching that number. Yet, no one wants his tests to break because he inserted a new line at the top of the script. With line_num, you can obtain the line number of the test relative to where you are. Here, there are nine lines between the call to line_num and the actual test.

The other new function is test_diag. It allows you to check the standard error output, where diagnostic messages appear. The easiest way to use it is to provide each line of output as a separate parameter.

Summary

Now you know how to use Test::Files and how to test modules that implement tests. There is one final way I use Test::Files: outside of module testing, any time I want to know how the contents of text files in two directory hierarchies compare. With this, I can quickly locate differences in archives, for example, enabling me to debug the builders of those archives. In one example, I used it to compare more than 400 text files in two WebSphere .ear archives. My program had only about 30 operative lines (there were also comments and blank lines) and performed the comparison in under five seconds. This is a testament to the leverage of Perl and CPAN.

(Since doing that comparison, I have moved to a new company. In the process I exchanged WebSphere for mod_perl and am generally happier with the latter.)
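Such a comparison program can be very small indeed. Here is a sketch of the idea (not the original program); it simply aims Test::Files at two trees named on the command line:

use strict;
use warnings;
use Test::More qw(no_plan);
use Test::Files;

# Usage: perl dircmp.pl <left_dir> <right_dir>  (hypothetical script name)
my ($left, $right) = @ARGV;
compare_dirs_ok($left, $right, "$left vs $right");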

FMTYEWTK About Mass Edits In Perl

For those not used to the terminology, FMTYEWTK stands for Far More Than You Ever Wanted To Know. This one is fairly light as FMTYEWTKs usually go. In any case, the question before us is, "How do you apply an edit against a list of files using Perl?" Well, that depends on what you want to do....

The Beginning

If you only want to read in one or more files, apply a regex to the contents, and spit out the altered text as one big stream, then the best approach is probably a one-liner such as the following:

perl -p -e "s/Foo/Bar/g" <FileList>

This command calls perl with the options -p and -e "s/Foo/Bar/g" against the files listed in FileList. The first option, -p, tells Perl to print each line it reads after applying the alteration. The second, -e, tells Perl to evaluate the provided substitution regex rather than reading a script from a file. The Perl interpreter then evaluates this regex against every line of all (space-separated) files listed on the command line and spits out one huge stream of the concatenated, fixed lines.
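Behind the scenes, -p simply wraps your code in an implicit read-print loop, roughly like this sketch:

while (<>) {
    s/Foo/Bar/g;    # the code supplied with -e runs here
    print;          # then -p prints $_, altered or not
}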

In standard fashion, Perl allows you to concatenate options without arguments with following options for brevity and convenience. Therefore, you'll more often see the previous example written as:

perl -pe "s/Foo/Bar/g" <FileList>

In-place Editing

If you want to edit the files in place, editing each file before going on to the next, that's pretty easy, too:

perl -pi.bak -e "s/Foo/Bar/g" <FileList>

The only change from the last command is the new option -i.bak, which tells Perl to operate on files in-place, rather than concatenating them together into one big output stream. Like the -e option, -i takes one argument, an extension to add to the original file names when making backup copies; for this example I chose .bak. Warning: If you execute the command twice, you've most likely just overwritten your backups with the changed versions from the first run. You probably didn't want to do that.

Because -i takes an argument, I had to separate out the -e option, which Perl otherwise would interpret as the argument to -i, leaving us with a backup extension of .bake, unlikely to be correct unless you happen to be a pastry chef. In addition, Perl would have thought that "s/Foo/Bar/" was the filename of the script to run, and would complain when it could not find a script by that name.
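For the curious, the in-place mechanism behind -pi.bak works roughly like the following sketch (this follows Perl's documentation of the -i switch, not the literal implementation):

my $oldargv = '';
while (<>) {
    if ($ARGV ne $oldargv) {          # just started a new input file
        rename $ARGV, "$ARGV.bak";    # keep the backup copy
        open ARGVOUT, '>', $ARGV;     # recreate the original name for output
        select ARGVOUT;               # print now targets the new file
        $oldargv = $ARGV;
    }
    s/Foo/Bar/g;
    print;                            # goes to the replacement file
}
select STDOUT;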

Running Multiple Regexes

Of course, you may want to make more extensive changes than just one regex. To make several changes all at once, add more code to the evaluated script. Remember to separate each additional statement with a semicolon (technically, a semicolon terminates each statement, but the very last one in any code block is optional). For example, you could make a series of changes:

perl -pi.bak -e "s/Bill Gates/Microsoft CEO/g;
 	s/CEO/Overlord/g" <FileList>

"Bill Gates" would then become "Microsoft Overlord" throughout the files. (Here, as in all examples, we ignore such finicky things as making sure we don't change "HERBACEOUS" to "HERBAOverlordUS"; for that kind of information, refer to a good treatise on regular expressions, such as Jeffrey Friedl's impressive book Mastering Regular Expressions, 2nd Edition. Also, I've wrapped the command to fit, but you should type it in as just one line.)

Doing Your Own Printing

You may wish to override the behavior created by -p, which prints every line read in, after any changes made by your script. In this case, change to the -n option. -p -e "s/Foo/Bar/" is roughly equivalent to -n -e "s/Foo/Bar/; print". This allows you to write interesting commands, such as removing lines beginning with hash marks (Perl comments, C-style preprocessor directives, etc.):

perl -ni.bak -e "print unless /^\s*#/;" <FileList>

Fields and Scripts

Of course, there are far more powerful things you can do with this. For example, imagine a flat-file database, with one row per line of the file, and fields separated by colons, like so:

Bill:Hennig:Male:43:62000
Mary:Myrtle:Female:28:56000
Jim:Smith:Male:24:50700
Mike:Jones:Male:29:35200
...

Suppose you want to find everyone who was over 25, but paid less than $40,000. At the same time, you'd like to document the number and percentage of women and men found. This time, instead of providing a mini-script on the command line, we'll create a file, glass.pl, which contains the script. Here's how to run the query:

perl -naF':' glass.pl <FileList>

glass.pl contains the following:

BEGIN { $men = $women = $lowmen = $lowwomen = 0; }

next unless /:/;
/Female/ ? $women++ : $men++;
if ($F[3] > 25 and $F[4] < 40000)
    { print; /Female/ ? $lowwomen++ : $lowmen++; }

END {
    print "\n\n$lowwomen of $women women (",
          int($lowwomen / $women * 100),
          "%) and $lowmen of $men men (",
          int($lowmen / $men * 100),
          "%) seem to be underpaid.\n";
}

Don't worry too much about the syntax, other than to note some of the awk and C similarities. The important thing here and in later sections is to see how Perl makes these problems easily solvable.

Several new features appear in this example; first, if there is no -e option to evaluate, Perl assumes the first filename listed, in this case glass.pl, refers to a Perl script for it to execute. Secondly, two new options make it easy to deal with field-based data. -a (autosplit mode) takes each line and splits its fields into the array @F, based on the field delimiter given by the -F (Field delimiter) option, which can be a string or a regex. If no -F option exists, the field delimiter defaults to ' ' (one single-quoted space). By default, arrays in Perl are zero-based, so $F[3] and $F[4] refer to the age and pay fields, respectively. Finally, the BEGIN and END blocks allow the programmer to perform actions before file reading begins and after it finishes, respectively.
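For instance, assuming the sample rows above live in a file called staff.txt (a name invented for this example), this one-liner prints each person as "last, first":

perl -naF':' -e "print qq($F[1], $F[0]\n)" staff.txt

The first line of output would be Hennig, Bill.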

File Handling

All of these little tidbits have made use only of data from within the files being operated on. What if you want to be able to read in data from elsewhere? For example, imagine that you had some sort of file that allows includes; in this case, we'll assume that you somehow specify these files by relative pathname, rather than looking them up in an include path. Perhaps the includes look like the following:

...
#include foo.bar, baz.bar, boo.bar
...

If you want to see what the file looks like with the includes placed into the master file, you might try something like this:

perl -ni.bak -e "if (s/#include\s+//) {foreach $file
 (split /,\s*/) {open FILE, '<', $file; print <FILE>}}
 else {print}" <FileList>

To make it easier to see what's going on here, this is what it looks like with a full set of line breaks added for clarity:

perl -ni.bak -e "
        if (s/#include\s+//) {
            foreach $file (split /,\s*/) {
                open FILE, '<', $file;
                print <FILE>
            }
        } else {
            print
        }
    " <FileList>

Of course, this only expands one level of include, but then we haven't provided any way for the script to know when to stop if there's an include loop. In this little example, we take advantage of the fact that the substitution operator returns the number of changes made, so if it manages to chop off the #include at the beginning of the line, it returns a non-zero (true) value. The rest of the code then chomps the trailing newline (so that it doesn't cling to the last filename), splits apart the list of includes, opens each one in turn, and prints its entire contents.

There are some handy shortcuts as well: if you open a new file using the name of an old file handle (FILE in this case), Perl automatically closes the old file first. In addition, if you read from a file using the <> operator into a list (which the print function expects), it happily reads in the entire file at once, one line per list entry. The print call then prints the entire list, inserting it into the current file, as expected. Finally, the else clause handles printing non-include lines from the source, because we are using -n rather than -p.

Better File Lists

The fact that it is relatively easy to handle filenames listed within other files indicates that it ought to be fairly easy to deal entirely with files read from some other source than a list on the end of the command line. The simplest case is to read all of the file contents from standard input as a single stream, which is common when building up pipes. As a matter of fact, this is so common that Perl automatically switches to this mode if there are no files listed on the command line:

<Source> | perl -pe "s/Foo/Bar/g" | <Sink>

Here Source and Sink are the commands that generate the raw data and handle the altered output from Perl, respectively. Incidentally, the filename consisting of a single hyphen (-) is an explicit alias for standard input; this allows the Perl programmer to merge input from files and pipes, like so:

<Source> | perl -pe "s/Foo/Bar/g" header.bar - footer.bar
 | <Sink>

This example first reads a header file, then the input from the pipe source, and then a footer file. The program modifies this whole mess and sends it through to the out pipe.

As I mentioned earlier, when dealing with multiple files it is usually better to keep the files separate, by using in-place editing or by explicitly handling each file separately. On the other hand, it can be a pain to list all of the files on the command line, especially if there are a lot of files, or when dealing with files generated programmatically.

The simplest method is to read the files from standard input, pushing them onto @ARGV in a BEGIN block; this has the effect of tricking Perl into thinking it received all of the filenames on the command line! Assuming the common case of one filename per input line, the following will do the trick:

<FilenamesSource> | perl -pi.bak -e "BEGIN {push @ARGV,
 <STDIN>; chomp @ARGV} s/Foo/Bar/g"

Here we once again use the shortcut that reading in a file in a list context (which push provides) will read in the entire file. This adds the entire contents, one filename per entry, to the @ARGV array, which normally contains the list of arguments to the script. To complete the trick, we chomp the line endings from the filenames, because Perl normally returns the line ending characters (a carriage return and/or a line feed) when reading lines from a file. We don't want to consider these to be part of the filenames. (On some platforms, you could actually have filenames containing line ending characters, but then you'd have to make the Perl code a little more complex, and you deserve to figure that out for yourself for trying it in the first place.)

Response Files

Another common design is to provide filenames on the command line as usual, treating filenames starting with an @ specially. The program should consider their contents to be lists of filenames to insert directly into the command line. For example, if the contents of the file names.baz (often called a response file) are:

two
three
four

then this command:

perl -pi.bak -e "s/Foo/Bar/g" one @names.baz five

should work equivalently to:

perl -pi.bak -e "s/Foo/Bar/g" one two three four five

To make this work, we once again need to do a little magic in a BEGIN block. Essentially, we want to parse through the @ARGV array, looking for filenames that begin with @. We pass through any unmarked filenames, but for each response file found, we read in the contents of the response file and insert the new list of filenames into @ARGV. Finally, we chomp the line endings, just as in the previous section. This produces a canonical file list in @ARGV, just as if we'd specified all of the files on the command line. Here's what it looks like in action:

perl -pi.bak -e "BEGIN {@ARGV = map {s/^@// ? @{open RESP,
 '<', $_; [<RESP>]} : $_} @ARGV; chomp @ARGV} s/Foo/Bar/g"
 <ResponseFileList>

Here's the same code with line breaks added so you can see what's going on:

perl -pi.bak -e "
        BEGIN {
            @ARGV = map {
                        s/^@// ? @{open RESP, '<', $_;
                                   [<RESP>]}
                               : $_
                    } @ARGV;
            chomp @ARGV
        }
        
        s/Foo/Bar/g
    " <ResponseFileList>

The only tricky part is the map block. map applies a piece of code to every element of a list, returning a list of the return values of the code; the current element is in the $_ special variable. The block here checks to see if it could remove a @ from the beginning of each filename. If so, it opens the file, reads the whole thing into an anonymous temporary array (that's what the square brackets are there for), and then inserts that array instead of the response file's name (that's the odd @{...} construct). If there is no @ at the beginning of the filename to remove, the filename goes directly into the map results. Once we've performed this expansion and chomped any line endings, we can then proceed with the main work, in this case our usual substitution, s/Foo/Bar/g.

Recursing Directories

For our final example, let's deal with a major weakness in the way we've been doing things so far — we're not recursing into directories, instead expecting all of the files we need to read to appear explicitly on the command line. To perform the recursion, we need to pull out the big guns: File::Find. This Perl module provides very powerful recursion methods. It also comes standard with any recent version of the Perl interpreter. The command line is deceptively simple, because all of the brains are in the script:

perl cleanup.pl <DirectoryList>

This script will perform some basic housecleaning, marking all files readable and writeable, removing those with the extensions .bak, .$$$, and .tmp, and cleaning up .log files. For the log files, we will create a master log file (for archiving or perusal) containing the contents of all of the other logs, and then delete the logs so that they remain short over time. Here's the script:

use File::Find;

die "All arguments must be directories!"
    if grep {!-d} @ARGV;
open MASTER, '>', 'master.lgm';
finddepth(\&filehandler, @ARGV);
close MASTER;
rename 'master.lgm', 'master.log';

sub filehandler
{
    # add read and write permission on top of the existing mode bits
    chmod((stat _)[2] | 0666, $_) unless (-r and -w);
    unlink if (/\.bak$/ or /\.tmp$/ or /\.\$\$\$$/);
    if (/\.log$/) {
        open LOG, '<', $_;
        print MASTER "\n\n****\n$File::Find::name\n****\n";
        print MASTER <LOG>;
        close LOG;
        unlink;
    }
}

This example shows just how powerful Perl and Perl modules can be, and at the same time just how obtuse Perl can appear to the inexperienced. In this case, the short explanation is that the finddepth() function iterates through all of the program arguments (@ARGV), recursing into each directory and calling the filehandler() subroutine for each file. That subroutine then can examine the file and decide what to do with it. The example checks for readability and writability with -r and -w, fixing the file's security settings if needed with chmod. It then unlinks (deletes) any file with a name ending in any of the three unwanted extensions. Finally, if the extension is .log, it opens the file, writes a few header lines to the master log, copies the file into the master log, closes it, and deletes it.

Instead of using finddepth(), which does a depth-first search of the directories and visits them from the bottom up, we could have used find(), which does the same depth-first search from the top down. As a side note, the program writes the master log file with the extension .lgm, then renames it at the end to have the extension .log, so as to avoid the possibility of writing the master log into itself if the program is searching the current directory.

Conclusion

That's it. Sure, there's a lot more that you could do with these examples, including adding error checking, generating additional statistics, producing help text, etc. To learn how to do this, find a copy of Programming Perl, 3rd Edition, by Larry Wall, Tom Christiansen, and Jon Orwant. This is the bible (or the Camel, rather) of the Perl community, and well worth the read. Good luck!

Perl's Special Variables

One of the best ways to make your Perl code look more like ... well, like Perl code -- and not like C or BASIC or whatever you used before you were introduced to Perl -- is to get to know the internal variables that Perl uses to control various aspects of your program's execution.

In this article we'll take a look at a number of variables that give you finer control over your file input and output.

Counting Lines

I decided to write this article because I am constantly amazed by the number of people who don't know about the existence of $.. I still see people producing code that looks like this:

  my $line_no = 0;

  while (<FILE>) {
    ++$line_no;
    unless (/some regex/) {
      warn "Error in line $line_no\n";
      next;
    }

    # process the record in some way
  }

For some reason, many people seem to completely miss the existence of $., which is Perl's internal variable that keeps track of your current record number. The code above can be rewritten as:

  while (<FILE>) {
    unless (/some regex/) {
      warn "Error in line $.\n";
      next;
    }

    # process the record in some way
  }

I know that it doesn't actually save you very much typing, but why create a new variable if you don't have to?

One other nice way to use $. is in conjunction with Perl's "flip-flop" operator (..). When used in list context, .. is the list construction operator. It builds a list of elements by calculating all of the items between given start and end values like this:

  my @numbers = (1 .. 1000);

But when you use this operator in a scalar context (like, for example, as the condition of an if statement), its behavior changes completely. The first operand (the left-hand expression) is evaluated to see if it is true or false. If it is false then the operator returns false and nothing happens. If it is true, however, the operator returns true and continues to return true on subsequent calls until the second operand (the right-hand expression) returns true.

An example will hopefully make this clearer. Suppose you have a file and you only want to process certain sections of it. The sections that you want to print are clearly marked with the string "!! START !!" at the start and "!! END !!" at the end. Using the flip-flop operator you can write code like this:

  while (<FILE>) {
    if (/!! START !!/ .. /!! END !!/) {
      # process line
    }
  }

Each time around the loop, the current line is checked by the flip-flop operator. If the line doesn't match /!! START !!/ then the operator returns false and the loop continues. When we reach the first line that matches /!! START !!/, the flip-flop operator returns true and the code in the if block is executed. On subsequent iterations of the while loop, the flip-flop operator checks each line against /!! END !!/ instead, and it keeps returning true until it finds a match. This means that all of the lines from the "!! START !!" marker through the "!! END !!" marker are processed. When a line matches /!! END !!/, the operator returns true one last time for that line, then resets and goes back to checking lines against the first regex.

So what does all this have to do with $.? Well, there's another piece of magic coded into the flip-flop operator. If either of its operands is a constant value, then it is converted to an integer and matched against $.. So to print out just the first 10 lines of a file, you can write code like this:

  while (<FILE>) {
    print if 1 .. 10;
  }

One final point on $.: there is only one $. variable. If you are reading from multiple filehandles, then $. contains the current record number from the most recently read filehandle. If you want anything more complex, then you can use something like IO::File objects for your filehandles. These objects each have an input_line_number method.
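A quick sketch of that, assuming two hypothetical data files; each IO::File object tracks its own line count:

  use IO::File;

  my $first  = IO::File->new('first.txt',  'r') or die $!;
  my $second = IO::File->new('second.txt', 'r') or die $!;

  $first->getline;                         # read one line
  $second->getline for 1 .. 3;             # read three lines

  print $first->input_line_number, "\n";   # prints 1
  print $second->input_line_number, "\n";  # prints 3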

The Record Separators

Next, we'll look at $/ and $\ which are the input and output record separators respectively. They control what defines a "record" when you are reading or writing data.

Let me explain that in a bit more detail. Remember when you were first learning Perl and you were introduced to the file input operator? Almost certainly, you were told that <FILE> reads data from the file up to and including the next newline character. Well, that's not true. Well, it is, but it's only a specialized case. Actually, it reads data up to and including the next occurrence of whatever is currently in $/ - the input record separator. Let's look at an example.

Imagine you have a text file which contains amusing quotes. Or lyrics from songs. Or whatever it is that you like to put in your randomly generated signature. The file might look something like this.

    This is the definition of my life
  %%
    We are far too young and clever
  %%
    Stab a sorry heart
    With your favorite finger

Here we have three quotes separated by a line containing just the string %%. How would you go about reading in that file a quote at a time?

One solution would be to read the file a line at a time, checking to see if the new line is just the string %%. You'd need to keep a variable that contains the current quote that you are building up and process a completed quote when you find the termination string. Oh, and you'd need to remember to process the last quote in the file as that doesn't have a termination string (although, it might!)

A simpler solution would be to change Perl's idea of what constitutes a record. We do that by changing the value of $/. The default value is a newline character - which is why <...> usually reads in a line at a time. But we can set it to any value we like. We can do something like this:

  $/ = "%%\n";

  while (<QUOTE>) {
    chomp;
    print;
  }

Now each time we call the file input operator, Perl reads data from the filehandle until it finds %%\n (or the end of file marker). A newline is no longer seen as a special character. Notice, however, that the file input operator always returns the next record with the input record separator still attached. When $/ has its default value of a newline character, you know that you can remove the newline character by calling chomp. Well, it works exactly the same way when $/ has other values. It turns out that chomp doesn't just remove a newline character (that's another "simplification" that you find in beginners' books); it actually removes whatever is the current value of $/. So in our sample code above, the call to chomp is removing the whole string %%\n.
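Because chomp strips whatever $/ currently holds, you can also slurp every quote into an array with one list-context read and clean them all up at once. A short sketch, reusing the QUOTE filehandle from above:

  $/ = "%%\n";
  my @quotes = <QUOTE>;   # list context: one quote per element
  chomp @quotes;          # strips the trailing "%%\n" from each quote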

Changing Perl's Special Variables

Before we go on, I just need to alert you to one possible repercussion of changing these variables whenever you want. The problem is that most of these variables are forced into the main package. This means that when you change one of these variables, you are altering the value everywhere in your program. This includes any modules that you use in your program. The reverse is also true. If you're writing a module that other people will use in their programs and you change the value of $/ inside it, then you have changed the value for all of the remaining program execution. I hope you can see why changing variables like $/ in one part of your program can potentially lead to hard-to-find bugs in another part.

So we need to do what we can to avoid this. Your first approach might be to reset the value of $/ after you have finished with it. So you'd write code like this.

  $/ = "%%\n";

  while (<QUOTE>) {
    chomp;
    print;
  }

  $/ = "\n";

The problem with this is you can't be sure that $/ contained \n before you started fiddling with it. Someone else might have changed it before your code was reached. So the next attempt might look like this.

  $old_input_rec_sep = $/;
  $/ = "%%\n";

  while (<QUOTE>) {
    chomp;
    print;
  }

  $/ = $old_input_rec_sep;

This code works and doesn't have the bug that we're trying to avoid but there's another way that looks cleaner. Remember the local function that you used to declare local variables until someone told you that you should use my instead? Well this is one of the few places where you can use local to great effect.

It's generally acknowledged that local is badly named. The name doesn't describe what the function does. In Perl 6 the function is likely to be renamed to temp as that's a far better description of what it does - it creates a temporary variable with the same name as an existing variable and restores the original variable when the program leaves the innermost enclosing block. This means that we can write our code like this.

  {
    local $/ = "%%\n";

    while (<QUOTE>) {
      chomp;
      print;
    }
  }

We've enclosed all of the code in another pair of braces to create a naked block. Code blocks are usually associated with loops, conditionals or subroutines, but in Perl they don't need to be. You can introduce a new block whenever you want. Here, we've introduced a block purely to delimit the area where we want $/ to have a new value. We then use local to store the old $/ variable somewhere where it can't be disturbed and set our new version of the variable to %%\n. We can then do whatever we want in the code block and when we exit from the block, Perl automatically restores the original copy of $/ and we never needed to know what it was set to.

For all of these reasons, it's good practice never to change one of Perl's internal variables unless it is localized in a block.

Other Values For $/

There are a few special values that you can give $/ which turn on interesting behaviours. The first of these is setting it to undef. This turns on "slurp mode" and the next time you read from a filehandle you will get all of the remaining data right up to the end of file marker. This means that you can read a whole file in using code like this.

  my $file = do { local $/; <FILE> };

A do block returns the value of the last expression evaluated within it, which in this case is the file input operator. And as $/ has been set to undef it returns the whole file. Notice that we don't even need to explicitly set $/ to undef as all Perl variables are initialized to undef when they are created.

There is a big difference between setting $/ to undef and setting it to an empty string. Setting it to an empty string turns on "paragraph" mode. In this mode, each record is a paragraph of text terminated by one or more empty lines. You might think that this effect can be mimicked by setting $/ to \n\n, but the subtle difference is that paragraph mode acts as though $/ had been set to \n\n+ (although you can't actually set $/ equal to a regular expression).
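A short sketch of paragraph mode in action, assuming FILE holds ordinary text with blank lines between the paragraphs:

  {
    local $/ = '';     # paragraph mode

    while (<FILE>) {
      chomp;           # strips the blank line(s) ending the paragraph
      print "--- paragraph $. ---\n$_\n";
    }
  }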

The final special value is to set $/ to a reference to an integer - either a reference to a scalar variable that holds an integer, or a reference to an integer constant. In these cases, the next read from a filehandle will read up to that number of bytes (I say "up to" because at the end of the file there might not be enough data left to give you). So to read a file 2KB at a time, you can do this.

  {
    local $/ = \2048;

    while (<FILE>) {
      # $_ contains the next 2048 bytes from FILE
    }
  }

$/ and $.

Note that changing $/ alters Perl's definition of a record, and therefore it alters the behavior of $.. $. doesn't actually contain the current line number; it contains the current record number. So in our quotes example above, $. will be incremented for each quote that you read from the filehandle.

What About $\?

Many paragraphs back I mentioned both $/ and $\ as being the input and output record separators. But since then I've just gone on about $/. What happened to $\?

Well, to be honest, $\ isn't anywhere near as useful as $/. It contains a string that is printed at the end of every call to print. Its default value is the empty string, so nothing gets added to data that you display with print. But if, for example, you longed for the days of Pascal, you could write a println function like this.

  sub println {
    local $\ = "\n";
    print @_;
  }

Then every time you called println, all of the arguments would be printed followed by a newline.

Other Print Variables

The next two variables that I want to discuss are very easily confused although they do completely different things. To illustrate them, consider the following code.

  my @arr = (1, 2, 3);

  print @arr;
  print "@arr";

Now, without looking it up do you know what the difference is between the output from the two calls to print?

The answer is that the first one prints the three elements of the array with nothing separating them (like this - 123) whereas the second one prints the elements separated by spaces (like this - 1 2 3). Why is there a difference?

The key to understanding it is to look at exactly what is being passed to print in each case. In the first case print is passed an array. Perl unrolls that array into a list and print actually sees the three elements of the array as separate arguments. In the second case, the array is interpolated into a double quoted string before print sees it. That interpolation has nothing at all to do with the call to print. Exactly the same process would take place if, for example, we did something like this.

  my $string = "@arr";
  print $string;

So in the second case, the print function only sees one argument. The fact that it is the result of interpolating an array in double quotes has no effect on how print treats the string.

We therefore have two cases. When print receives a number of arguments it prints them out with no spaces between them. And when an array is interpolated in double quotes it is expanded with spaces between the individual elements. These two cases are completely unrelated, but from our first example above it's easy to see how people can get them confused.

Of course, Perl allows us to change these behaviors if we want to. The string that is printed between the arguments passed to print is stored in a variable called $, (because you use a comma to separate arguments). As we've seen, the default value for that is an empty string but it can, of course, be changed.

  my @arr = (1, 2, 3);
  {
    local $, = ',';

    print @arr;
  }

This code prints the string 1,2,3.

The string that separates the elements of an array when expanded in a double quoted string is stored in $". Once again, it's simple to change it to a different value.

  my @arr = (1, 2, 3);
  {
    local $" = '+';

    print "@arr";
  }

This code prints 1+2+3".

Of course, $" doesn't necessarily have to used in conjunction with a print statement. You can use it anywhere that you have an array in a doubled quoted string. And it doesn't just work for arrays. Array and hash slices work just as well.

  my %hash = (one => 1, two => 2, three => 3);

  {
    local $" = ' < ';

    print "@hash{qw(one two three)}";
  }

This displays 1 < 2 < 3.

Conclusion

In this article we've just scratched the surface of what you can do by changing the values in Perl's internal variables. If this makes you want to look at this subject in more detail, then you should read the perlvar manual page.
