Perl Slurp-Eaze

by Uri Guttman
November 21, 2003

One of the common Perl idioms is processing text files line by line:


while( <FH> ) {
    # do something with $_
}

This idiom has several variants, but the key point is that it reads in only one line from the file in each loop iteration. This has several advantages: it limits memory use to one line, it can handle a file of any size (including data piped in via STDIN), and it is easily taught to and understood by Perl beginners. Unfortunately, it means they then go on to do things like this:


while( <FH> ) {
    push @lines, $_ ;
}

foreach ( @lines ) {
    # do something with $_
}

Line by line processing is fine, but it isn't the only way to deal with reading files. The other common style is reading the entire file into a scalar or array, and that is commonly known as slurping. Now, slurping has somewhat of a poor reputation, and this article is an attempt at rehabilitating it.

Slurping files has advantages and limitations, and is not something you should just do when line by line processing is fine. It is best when you need the entire file in memory for processing all at once. Slurping with in memory processing can be faster and lead to simpler code than line by line if done properly.

The biggest issue to watch for with slurping is file size. Slurping very large files or unknown amounts of data from STDIN can be disastrous to your memory usage and cause swap disk thrashing. You can slurp STDIN if you know that you can handle the maximum size input without detrimentally affecting your memory usage, but in general I advocate slurping only disk files, and only when you know their size is reasonable and you have a real reason to process the file as a whole.

Note that "reasonable" size these days is larger than it was in the bad old days of limited RAM. Slurping in a megabyte is not an issue on most systems. But most of the files I tend to slurp in are much smaller than that. Typical files that work well with slurping are configuration files, (mini-)language scripts, some data (especially binary) files, and other files of known sizes which need fast processing.
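A size check before slurping can be sketched as follows. This is only an illustration: the helper name slurp_if_small() and the one megabyte limit in $MAX_SLURP_SIZE are assumptions made for this example, not part of any module discussed here.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Hypothetical limit: choose whatever "reasonable" means on your system.
my $MAX_SLURP_SIZE = 1024 * 1024 ;

# Hypothetical helper: slurp a plain disk file only if it is small enough.
sub slurp_if_small {

    my( $file ) = @_ ;

    die "$file is not a plain file\n" unless -f $file ;

    # -s returns a false value for an empty file, so default to 0
    my $size = ( -s _ ) || 0 ;
    die "$file is too big to slurp ($size bytes)\n"
        if $size > $MAX_SLURP_SIZE ;

    open( my $fh, '<', $file ) or die "can't open $file: $!\n" ;
    local( $/ ) ;
    return scalar <$fh> ;
}

# Demo with a small temporary file.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} "small file\n" ;
close( $tmp_fh ) or die "close $file: $!\n" ;

my $text = slurp_if_small( $file ) ;
print $text ;
```

The check only makes sense for disk files; data arriving on STDIN has no size you can ask about in advance.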

Another major win for slurping over line by line is speed. Perl's IO system (like many others) is slow. Calling <> for each line requires a check for the end of line, checks for EOF, copying a line, munging the internal handle structure, etc. Plenty of work for each line read in. On the other hand, slurping, if done correctly, will usually involve only one I/O call and no extra data copying. The same is true for writing files to disk, and we will cover that as well.

Finally, when you have slurped the entire file into memory, you can do operations on the data that are not possible or easily done with line by line processing. These include global search/replace (without regard for newlines), grabbing all matches with one call of //g, complex parsing (which in many cases must ignore newlines), processing *ML (where line endings are just white space) and performing complex transformations such as template expansion.

Global Operations

Here are some simple global operations that can be done quickly and easily on an entire file that has been slurped in. They could also be done with line by line processing but that would be slower and require more code.

A common problem is reading in a file with key/value pairs. There are modules which do this but who needs them for simple formats? Just slurp in the file and do a single parse to grab all the key/value pairs.


my $text = read_file( $file ) ;
my %config = $text =~ /^(\w+)=(.+)$/mg ;

That matches a key which starts a line (anywhere inside the string because of the /m modifier), the '=' char and the text to the end of the line (again, /m makes that work). In fact the ending $ is not even needed since . will not normally match a newline. Since the key and value are grabbed and the m// is in list context with the /g modifier, it will grab all key/value pairs and return them. The %config hash will be assigned this list and now you have the file fully parsed into a hash.
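Here is a small self-contained run of that parse. The config text is made up for the example and sits in a here-document rather than coming from read_file(), so the sketch runs without any file on disk.

```perl
use strict ;
use warnings ;

# Stand-in for a slurped config file; in real use this would come
# from read_file( $file ).
my $text = <<'END' ;
name=slurp
mode=fast
# comment lines are skipped
retries=3
END

# List context plus /g returns every ( key, value ) capture pair in
# order, which is exactly the list a hash assignment wants.
my %config = $text =~ /^(\w+)=(.+)$/mg ;

print "$_ => $config{$_}\n" for sort keys %config ;
```

The comment line is ignored simply because it does not start with a word character, so the pattern never matches there.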

Various projects I have worked on needed some simple templating and I wasn't in the mood to use a full module (please, no flames about your favorite template module :-). So I rolled my own by slurping in the template file, setting up a template hash and doing this one line:


$text =~ s/<%(.+?)%>/$template{$1}/g ;

That only works if the entire file was slurped in. With a little extra work it can handle chunks of text to be expanded:


$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/ template($1, $2)/sge ;

Just supply a template sub to expand the text between the markers and you have yourself a simple system with minimal code. Note that this will work and grab over multiple lines due to the /s modifier. This is something that is much trickier with line by line processing.

Note that this is a very simple templating system, and it can't directly handle nested tags and other complex features. But even if you use one of the myriad of template modules on the CPAN, you will gain by having speedier ways to read and write files.
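To make the two substitutions concrete, here is a minimal sketch that runs both on an in-memory template. The %items hash, the ITEM placeholder and the body of the template() sub are invented for this example; the sketch also uses a non-greedy (.+?) between the markers so that two chunks with the same name are expanded separately.

```perl
use strict ;
use warnings ;

# Values for the simple <%name%> tags.
my %template = ( name => 'World', lang => 'Perl' ) ;

# Hypothetical data for marked chunks: the chunk body is repeated
# once per item, with ITEM replaced by the item.
my %items = ( ROW => [ 'one', 'two' ] ) ;

sub template {

    my( $name, $body ) = @_ ;

    my $out = '' ;
    for my $item ( @{ $items{$name} || [] } ) {
        ( my $copy = $body ) =~ s/ITEM/$item/g ;
        $out .= $copy ;
    }
    return $out ;
}

my $text = "Hello <%name%> from <%lang%>.\n" .
           "<%ROW_START%>- ITEM\n<%ROW_END%>" ;

# Expand the chunks first so the simple tag pass never sees the
# START/END markers.
$text =~ s/<%(\w+)_START%>(.+?)<%\1_END%>/template( $1, $2 )/sge ;
$text =~ s/<%(.+?)%>/$template{$1}/g ;

print $text ;
```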

Slurping a file into an array also offers some useful advantages. One simple example is reading in a flat database where each record has fields separated by a character such as a colon:


my @pw_fields = map [ split /:/ ], read_file( '/etc/passwd' ) ;

Random access to any line of the slurped file is another advantage. Also a line index could be built to speed up searching the array of lines.
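A rough sketch of such an index follows. It uses two made-up passwd-style records held in an array instead of reading /etc/passwd, and the %by_name hash is just one possible way to build the lookup.

```perl
use strict ;
use warnings ;

# Stand-in for the list of lines a slurp in list context would return.
my @lines = (
    "root:x:0:0:root:/root:/bin/sh\n",
    "uri:x:1000:1000:Uri:/home/uri:/bin/bash\n",
) ;

# One array reference of fields per record; chomp a copy so the last
# field does not keep its newline.
my @pw_fields = map {
    my $line = $_ ;
    chomp $line ;
    [ split /:/, $line ]
} @lines ;

# Index the records by login name for fast lookup.
my %by_name = map { ( $_->[0], $_ ) } @pw_fields ;

# Random access by line number and by key.
print "second record is for $pw_fields[1][0]\n" ;
print "shell for uri is $by_name{uri}[6]\n" ;
```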

Traditional Slurping

Perl has always supported slurping files with minimal code. Slurping a file into a list of lines is trivial. Just call the <> operator in a list context:


my @lines = <FH> ;

and slurping to a scalar isn't much more work. Just set the built in variable $/ (the input record separator) to the undefined value and read in the file with <>:


{
    local( $/, *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    $text = <FH> ;
}

Notice the use of local(). It sets $/ to undef for you and when the scope exits it will revert $/ back to its previous value (most likely "\n").

Here is a Perl idiom that allows the $text variable to be declared, and there is no need for a tightly nested block. The do block will execute <FH> in a scalar context and slurp in the file named by $file:


    local( *FH ) ;
    open( FH, $file ) or die "sudden flaming death\n" ;
    my $text = do { local( $/ ) ; <FH> } ;

Both of those slurps used localized filehandles to be compatible with 5.005. Here they are with 5.6.0 lexical autovivified handles:


{
    local( $/ ) ;
    open( my $fh, $file ) or die "sudden flaming death\n" ;
    $text = <$fh> ;
}

open( my $fh, $file ) or die "sudden flaming death\n" ;
my $text = do { local( $/ ) ; <$fh> } ;

And this is a variant of that idiom that removes the need for the open call:


my $text = do { local( @ARGV, $/ ) = $file ; <> } ;

The filename in $file is assigned to a localized @ARGV and the null filehandle is used which reads the data from the files in @ARGV.

Instead of assigning to a scalar, all the above slurps can assign to an array and it will get the file but split into lines (using $/ as the end of line marker).
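The following sketch shows the scalar and list forms side by side on a temporary file. It uses the three-argument form of open, which is my choice for the example and not what the snippets above use; everything else follows the idioms just described.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# Create a small file to slurp.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} "line one\nline two\n" ;
close( $tmp_fh ) or die "close $file: $!\n" ;

# Scalar slurp: $/ is undef only inside the do block.
open( my $fh, '<', $file ) or die "can't open $file: $!\n" ;
my $text = do { local( $/ ) ; <$fh> } ;
close( $fh ) ;

# List slurp: $/ is back to its normal value, so we get lines.
open( $fh, '<', $file ) or die "can't open $file: $!\n" ;
my @lines = <$fh> ;
close( $fh ) ;

printf "%d bytes, %d lines\n", length( $text ), scalar @lines ;
```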

There is one common variant of those slurps which is very slow and not good code. You see it around, and it is almost always cargo cult code:


my $text = join( '', <FH> ) ;

That needlessly splits the input file into lines (join provides a list context to <FH>) and then joins up those lines again. The original coder of this idiom obviously never read perlvar and learned how to use $/ to allow scalar slurping.

Write Slurping

While reading in entire files at one time is common, writing out entire files is also done. We call it "slurping" when we read in files, but there is no commonly accepted term for the write operation. I asked some Perl colleagues and got two interesting nominations: Peter Scott said to call it "burping" (rhymes with "slurping" and suggests movement in the opposite direction); others suggested "spewing" which has a stronger visual image :-) Tell me your favorite or suggest your own. I will use both in this section so you can see how they work for you.

Spewing a file is a much simpler operation than slurping. You don't have context issues to worry about and there is no efficiency problem with returning a buffer. Here is a simple burp subroutine:


sub burp {
    my( $file_name ) = shift ;
    open( my $fh, ">$file_name" ) || 
        die "can't create $file_name $!" ;
    print $fh @_ ;
}

Note that it doesn't copy the input text but passes @_ directly to print. We will look at faster variations of that later on.
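A quick round trip shows burp in use. The sub here is the one above except that it uses three-argument open (my variation, so odd file names cannot be misread as open modes); the temporary directory and the file name are made up for the example.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempdir ) ;

sub burp {
    my( $file_name ) = shift ;
    open( my $fh, '>', $file_name ) ||
        die "can't create $file_name $!" ;
    print $fh @_ ;
    # $fh goes out of scope here, which flushes and closes the file
}

my $dir  = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/burp.txt" ;

burp( $file, "first\n", "second\n" ) ;

# Slurp it back in to check what was written.
open( my $in, '<', $file ) or die "can't open $file: $!\n" ;
my $text = do { local( $/ ) ; <$in> } ;
close( $in ) ;

print $text ;
```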

Slurp on the CPAN

As you would expect there are modules in the CPAN that will slurp files for you. The two I found are called Slurp.pm (by Rob Casey - ROBAU on CPAN) and File::Slurp.pm (by David Muir Sharnoff - MUIR on CPAN).

Here is the code from Slurp.pm:


sub slurp { 
    local( $/, @ARGV ) = ( wantarray ? $/ : undef, @_ ); 
    return <ARGV>;
}

sub to_array {
    my @array = slurp( @_ );
    return wantarray ? @array : \@array;
}

sub to_scalar {
    my $scalar = slurp( @_ );
    return $scalar;
}

The subroutine slurp() uses the magic undefined value of $/ and the magic file handle ARGV to support slurping into a scalar or array. It also provides two wrapper subs that allow the caller to control the context of the slurp. And the to_array() subroutine will return the list of slurped lines or an anonymous array of them according to its caller's context by checking wantarray. It has 'slurp' in @EXPORT and all three subroutines in @EXPORT_OK.

File::Slurp.pm has this code:


sub read_file
{
    my ($file) = @_;

    local($/) = wantarray ? $/ : undef;
    local(*F);
    my $r;
    my (@r);

    open(F, "<$file") || croak "open $file: $!";
    @r = <F>;
    close(F) || croak "close $file: $!";

    return $r[0] unless wantarray;
    return @r;
}

This module provides several subroutines including read_file() (more on the others later). read_file() behaves similarly to Slurp::slurp() in that it will slurp a list of lines or a single scalar depending on the caller's context. It also uses the magic undefined value of $/ for scalar slurping but it uses an explicit open call rather than using a localized @ARGV as the other module did. Also it doesn't provide a way to get an anonymous array of the lines but that can easily be rectified by calling it inside an anonymous array constructor [].
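The calling contexts can be tried out with a short sketch. The read_file() below is a simplified stand-in written for this example, not the File::Slurp code: it has the same context-sensitive behavior but uses a lexical handle and three-argument open.

```perl
use strict ;
use warnings ;
use Carp ;
use File::Temp qw( tempfile ) ;

# Stand-in with the same calling behavior as read_file() above.
sub read_file {

    my( $file ) = @_ ;

    open( my $fh, '<', $file ) or croak "open $file: $!" ;

    # list context: return the lines
    return <$fh> if wantarray ;

    # scalar context: return the whole file
    local( $/ ) ;
    return scalar <$fh> ;
}

my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} "a\nb\n" ;
close( $tmp_fh ) or die "close $file: $!\n" ;

my $text      = read_file( $file ) ;      # scalar context
my @lines     = read_file( $file ) ;      # list context
my $lines_ref = [ read_file( $file ) ] ;  # anonymous array of lines

# Assigning to a single array element provides scalar context.
my @slots ;
$slots[0] = read_file( $file ) ;

printf "%d bytes, %d lines\n", length( $text ), scalar @lines ;
```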

Both of these modules make it easier for Perl coders to slurp in files. They both use the magic $/ to slurp in scalar mode and the natural behavior of <> in list context to slurp as lines. But neither is optimized for speed nor can they handle binmode() to support binary or unicode files. See below for more on slurp features and speedups.

Slurping API Design

The slurp modules on CPAN have a very simple API and don't support binmode(). This section will cover various API design issues such as efficient return by reference, binmode() and calling variations.

Let's start with the call variations. Slurped files can be returned in four formats: as a single scalar, as a reference to a scalar, as a list of lines or as an anonymous array of lines. But the caller can only provide two contexts: scalar or list. So we have to either provide an API with more than one subroutine (as Slurp.pm did) or just provide one subroutine which only returns a scalar or a list (not an anonymous array) as File::Slurp does.

I have used my own read_file() subroutine for years and it has the same API as File::Slurp: a single subroutine that returns a scalar or a list of lines depending on context. But I recognize the interest of those that want an anonymous array for line slurping. For one thing, it is easier to pass around to other subs and, for another, it eliminates the extra copying of the lines via return. So my module will support multiple subroutines with one that returns the file based on context, and the other returns only lines (either as a list or as an anonymous array). So this API is in between the two CPAN modules. There is no need for a specific slurp-in-as-a-scalar subroutine as the general slurp() will do that in scalar context. If you wanted to slurp a scalar into an array, just select the desired array element and that will provide scalar context to the read_file() subroutine.

The next area to cover is what to name these subs. I will go with read_file() and read_file_lines(). They are descriptive, simple and don't use the 'slurp' nickname (though that nickname is in the module name).

Another critical area when designing APIs is how to pass in arguments. The read_file*() subroutines take one required argument which is the file name. To support binmode() we need another optional argument. A third optional argument is needed to support returning a slurped scalar by reference. My first thought was to design the API with 3 positional arguments - file name, buffer reference and binmode. But if you want to set the binmode and not pass in a buffer reference, you have to fill the second argument with undef and that is ugly. So I decided to make the filename argument positional and the other two named. The subroutine starts off like this:


sub read_file {

    my( $file_name, %args ) = @_ ;

    my $buf ;
    my $buf_ref = $args{'buf'} || \$buf ;

The binmode argument will be handled later (see code below).

The other sub (read_file_lines()) will only take an optional binmode (so you can read files with binary delimiters). It doesn't need a buffer reference argument since it can return an anonymous array if it is called in a scalar context. So this subroutine could use positional arguments, but to keep its API similar to the API of read_file(), it will also use pass by name for the optional arguments. This also means that new optional arguments can be added later without breaking any legacy code. A bonus of keeping the API the same for both subs will be seen in how the two subs are optimized to work together.
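To show how the named arguments read at the call site, here is a minimal sketch of a read_file() built on the opening lines shown earlier. The body is an assumption made for this example (a plain <> slurp in scalar context only), not the finished, optimized subroutine; 'buf' is the argument name used in that opening fragment, and 'binmode' is assumed to take a layer string such as ':raw'.

```perl
use strict ;
use warnings ;
use Carp ;
use File::Temp qw( tempfile ) ;

sub read_file {

    my( $file_name, %args ) = @_ ;

    # Use the caller's buffer if one was passed in, else a private one.
    my $buf ;
    my $buf_ref = $args{'buf'} || \$buf ;

    open( my $fh, '<', $file_name ) or
        croak "can't open $file_name: $!" ;
    binmode( $fh, $args{'binmode'} ) if $args{'binmode'} ;

    local( $/ ) ;
    ${$buf_ref} = <$fh> ;

    return ${$buf_ref} ;
}

# A small file with some binary data in it.
my( $tmp_fh, $file ) = tempfile( UNLINK => 1 ) ;
binmode( $tmp_fh ) ;
print {$tmp_fh} "\x00\x01binary\n" ;
close( $tmp_fh ) or die "close $file: $!\n" ;

# Set binmode without having to pass a placeholder undef.
my $raw = read_file( $file, binmode => ':raw' ) ;

# Have the data placed directly into the caller's buffer.
my $buffer ;
read_file( $file, buf => \$buffer ) ;

printf "%d bytes slurped\n", length( $raw ) ;
```

Either optional argument can be given alone or both together in any order, which is the point of passing them by name.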

Write slurping (or spewing or burping :-)) needs to have its API designed as well. The biggest issue is that it must support not only optional arguments but also a list of data to be written. Perl 6 will be able to handle that with optional named arguments and a final slurp argument. Since this is Perl 5, we have to do it using some cleverness. The first argument is the file name and it will be positional as with the read_file subroutine. But how can we pass in the optional arguments and also a list of data? The solution lies in the fact that the data list should never contain a reference. Burping/spewing works only on plain data. So if the next argument is a hash reference, we can assume it contains the optional arguments and the rest of the arguments is the data list. So the write_file() subroutine will start off like this:


sub write_file {

    my $file_name = shift ;

    my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

Whether or not optional arguments are passed in, we leave the data list in @_ to minimize any more copying. You call write_file() like this:


write_file( 'foo', { binmode => ':raw' }, @data ) ;
write_file( 'junk', { append => 1 }, @more_junk ) ;
write_file( 'bar', @spew ) ;
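Here is a minimal sketch of a write_file() that accepts those calls. Only the first two statements come from the fragment above; the rest of the body (choosing the open mode from 'append', applying 'binmode', and the plain print) is an assumption made for this example.

```perl
use strict ;
use warnings ;
use Carp ;
use File::Temp qw( tempdir ) ;

sub write_file {

    my $file_name = shift ;

    # A leading hash reference holds the options; plain data never
    # contains references, so the test is unambiguous.
    my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;

    my $mode = $args->{'append'} ? '>>' : '>' ;
    open( my $fh, $mode, $file_name ) or
        croak "can't open $file_name: $!" ;
    binmode( $fh, $args->{'binmode'} ) if $args->{'binmode'} ;

    # The data list is still in @_ and is printed without copying.
    print {$fh} @_ or croak "can't write to $file_name: $!" ;
    close( $fh ) or croak "can't close $file_name: $!" ;
    return ;
}

my $dir  = tempdir( CLEANUP => 1 ) ;
my $file = "$dir/junk" ;

my @data = ( "one\n", "two\n" ) ;
write_file( $file, @data ) ;
write_file( $file, { append => 1 }, "three\n" ) ;

# Slurp it back in to check what was written.
open( my $in, '<', $file ) or die "can't open $file: $!\n" ;
my $text = do { local( $/ ) ; <$in> } ;
close( $in ) ;

print $text ;
```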
