Perl Slurp-Eaze
by Uri Guttman
Fast Slurping
Somewhere along the line, I learned about a way to slurp files faster than by setting $/ to undef. The method is very simple: you do a single read call with the size of the file (which the -s operator provides). This bypasses the I/O loop inside perl that checks for EOF and does all sorts of processing. I then decided to experiment and found that sysread is even faster, as you would expect. sysread bypasses all of Perl's stdio and reads the file from the kernel buffers directly into a Perl scalar. This is why the slurp code in File::Slurp uses sysopen/sysread/syswrite. All the rest of the code is just to support the various options and data passing techniques.
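As a rough illustration of the three approaches just described, the sketch below slurps the same file by undefining $/, by a single read sized with -s, and by a single sysread. The sub names and the temp file are made up for this example, and the code is not taken from File::Slurp. A single sysread can in principle return fewer bytes than requested, which is why a more careful version loops until the whole file has been read.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT ) ;
use File::Temp qw( tempfile ) ;

# a throwaway data file, created only for this sketch
my( $tmp_fh, $file_name ) = tempfile( UNLINK => 1 ) ;
print {$tmp_fh} "line $_\n" for 1 .. 1000 ;
close $tmp_fh or die "Can't close $file_name: $!" ;

# classic slurp: undefine the input record separator
sub slurp_undef_rs {
    my( $file ) = @_ ;
    open( my $fh, '<', $file ) or die "Can't open $file: $!" ;
    local $/ ;
    return scalar <$fh> ;
}

# one buffered read() call, sized by -s
sub slurp_read {
    my( $file ) = @_ ;
    open( my $fh, '<', $file ) or die "Can't open $file: $!" ;
    my $buf = '' ;
    read( $fh, $buf, -s $fh ) ;
    return $buf ;
}

# one unbuffered sysread() call, sized by -s
sub slurp_sysread {
    my( $file ) = @_ ;
    sysopen( my $fh, $file, O_RDONLY ) or die "Can't open $file: $!" ;
    my $buf = '' ;
    sysread( $fh, $buf, -s $fh ) ;
    return $buf ;
}

print length( slurp_sysread( $file_name ) ), " bytes slurped\n" ;
```

All three subs return the same string; they differ only in how much of Perl's I/O machinery sits between the file and the scalar.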
Benchmarks
Benchmarks can be enlightening, informative, frustrating and deceiving. It would make no sense to create a new and more complex slurp module unless it also gained significantly in speed. So I created a benchmark script which compares various slurp methods with differing file sizes and calling contexts. This script can be run from the main directory of the tarball like this:
    perl -Ilib extras/slurp_bench.pl
If you pass in an argument on the command line, it will be passed to
timethese() and it will control the duration. It defaults to -2 which
makes each benchmark run for at least 2 seconds of CPU time.
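For readers who have not used the Benchmark module, here is a minimal sketch of how a count such as -2 reaches timethese(). The two entries, the sub names and the temp file are assumptions made for this example; the real entries live in extras/slurp_bench.pl.

```perl
use strict ;
use warnings ;
use Benchmark qw( timethese cmpthese ) ;
use File::Temp qw( tempfile ) ;

# a throwaway data file for the entries to slurp (made up for this sketch)
my( $tmp_fh, $file_name ) = tempfile( UNLINK => 1 ) ;
my @lines = ( 'abc' x 30 . "\n" ) x 100 ;
print {$tmp_fh} @lines ;
close $tmp_fh or die "Can't close $file_name: $!" ;

# entry 1: slurp by undefining $/
sub slurp_undef_rs {
    my( $file ) = @_ ;
    open( my $fh, '<', $file ) or die "Can't open $file: $!" ;
    local $/ ;
    return scalar <$fh> ;
}

# entry 2: slurp with a single read() sized by -s
sub slurp_read {
    my( $file ) = @_ ;
    open( my $fh, '<', $file ) or die "Can't open $file: $!" ;
    my $buf = '' ;
    read( $fh, $buf, -s $fh ) ;
    return $buf ;
}

# a negative count asks timethese() to run each entry for at least that
# many CPU seconds; a positive count is a fixed number of iterations
sub run_bench {
    my( $count ) = @_ ;
    return timethese( $count, {
        undef_rs => sub { my $text = slurp_undef_rs( $file_name ) },
        read     => sub { my $text = slurp_read( $file_name ) },
    } ) ;
}

# take the count from the command line, defaulting to -2
my $results = run_bench( @ARGV ? $ARGV[0] : -2 ) ;

# print a table comparing the entries' rates
cmpthese( $results ) ;
```

Passing a larger negative number, such as -10, trades a longer run for steadier rates.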
The following numbers are from a run I did on my 300MHz SPARC. You will most likely get much faster counts on your boxes, but the relative speeds shouldn't change by much. If you see major differences on your benchmarks, please send me the results and your Perl and OS versions. You can also play with the benchmark script and add more slurp variations or data files.
The rest of this section will be discussing the results of the benchmarks. You can refer to extras/slurp_bench.pl to see the code for the individual benchmarks. If the benchmark name starts with cpan_, it is either from Slurp.pm or File::Slurp.pm. Those starting with new_ are from the new File::Slurp.pm. Those that start with file_contents_ are from a client's code base. The rest are variations I created to highlight certain aspects of the benchmarks.
The short and long file data is made like this:
    # short file: 100 lines
    my @lines = ( 'abc' x 30 . "\n") x 100 ;
    my $text = join( '', @lines ) ;

    # long file: 1000 lines
    @lines = ( 'abc' x 40 . "\n") x 1000 ;
    $text = join( '', @lines ) ;
So the short file is 9,100 bytes and the long file is 121,000 bytes.
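The arithmetic behind those sizes is easy to check. This small sketch rebuilds both data sets with the same expressions (the variable names are mine) and prints the line and file lengths:

```perl
use strict ;
use warnings ;

# each short line is 'abc' x 30 (90 bytes) plus a newline
my @short_lines = ( 'abc' x 30 . "\n" ) x 100 ;
my $short_text = join( '', @short_lines ) ;

# each long line is 'abc' x 40 (120 bytes) plus a newline
my @long_lines = ( 'abc' x 40 . "\n" ) x 1000 ;
my $long_text = join( '', @long_lines ) ;

printf "short: %d lines of %d bytes = %d bytes\n",
    scalar( @short_lines ), length( $short_lines[0] ), length( $short_text ) ;
printf "long:  %d lines of %d bytes = %d bytes\n",
    scalar( @long_lines ), length( $long_lines[0] ), length( $long_text ) ;

# short: 100 lines of 91 bytes = 9100 bytes
# long:  1000 lines of 121 bytes = 121000 bytes
```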
Scalar Slurp of Short File
file_contents 651/s
file_contents_no_OO 828/s
cpan_read_file 1866/s
cpan_slurp 1934/s
read_file 2079/s
new 2270/s
new_buf_ref 2403/s
new_scalar_ref 2415/s
sysread_file 2572/s
Scalar Slurp of Long File
file_contents_no_OO 82.9/s
file_contents 85.4/s
cpan_read_file 250/s
cpan_slurp 257/s
read_file 323/s
new 468/s
sysread_file 489/s
new_scalar_ref 766/s
new_buf_ref 767/s
The main lesson from these numbers is that when slurping a file into a scalar, the longer the file, the more time you save by returning the result via a scalar reference. The time for the extra buffer copy can add up. The new module came out on top overall, except for the very simple sysread_file entry, which was added to show that the overhead of the more flexible new module isn't that much. The file_contents entries are always the worst since they do a list slurp and then a join, a classic newbie and cargo-culted style that is extremely slow. The OO code in file_contents slows it down even more, most visibly on the short file (I added the file_contents_no_OO entry to show this). The two CPAN modules are decent with small files, but they are laggards compared to the new module when the file gets much larger.
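To make the three styles concrete, here is a sketch of a list slurp followed by join, a scalar slurp that returns a copy, and a scalar slurp that returns only a reference to its buffer. The sub names and the temp file are invented for this example; they are not the benchmark entries themselves.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT ) ;
use File::Temp qw( tempfile ) ;

# a throwaway file shaped like the long benchmark file
my( $tmp_fh, $file_name ) = tempfile( UNLINK => 1 ) ;
my @data = ( 'abc' x 40 . "\n" ) x 1000 ;
print {$tmp_fh} @data ;
close $tmp_fh or die "Can't close $file_name: $!" ;

# the slow style: slurp a list of lines, then join them back together
sub slurp_list_join {
    my( $file ) = @_ ;
    open( my $fh, '<', $file ) or die "Can't open $file: $!" ;
    return join '', <$fh> ;
}

# read straight into a scalar and return only a reference to it,
# so the caller gets the buffer without another copy
sub slurp_scalar_ref {
    my( $file ) = @_ ;
    sysopen( my $fh, $file, O_RDONLY ) or die "Can't open $file: $!" ;
    my $buf = '' ;
    my $size_left = -s $fh ;
    while( $size_left > 0 ) {
        my $read_cnt = sysread( $fh, $buf, $size_left, length( $buf ) ) ;
        die "read error in file $file: $!" unless defined $read_cnt ;
        last unless $read_cnt ;
        $size_left -= $read_cnt ;
    }
    return \$buf ;
}

# same read, but the return value is a copy of the whole buffer
sub slurp_scalar {
    my( $file ) = @_ ;
    return ${ slurp_scalar_ref( $file ) } ;
}

my $text_ref = slurp_scalar_ref( $file_name ) ;
print length( ${$text_ref} ), " bytes\n" ;    # prints "121000 bytes"
```

The copy made by slurp_scalar costs little for a 9,100 byte file, but it grows with every byte of a larger one.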
List Slurp of Short File
cpan_read_file 589/s
cpan_slurp_to_array 620/s
read_file 824/s
new_array_ref 824/s
sysread_file 828/s
new 829/s
new_in_anon_array 833/s
cpan_slurp_to_array_ref 836/s
List Slurp of Long File
cpan_read_file 62.4/s
cpan_slurp_to_array 62.7/s
read_file 92.9/s
sysread_file 94.8/s
new_array_ref 95.5/s
new 96.2/s
cpan_slurp_to_array_ref 96.3/s
new_in_anon_array 97.2/s
This is perhaps the most interesting result of this benchmark. Five different entries have effectively tied for the lead. The logical conclusion is that splitting the input into lines is the bounding operation, no matter how the file gets slurped. This is the only benchmark where the new module isn't the clear winner (in the long file entries - it is no worse than a close second in the short file entries).
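The splitting step that every list slurp has to pay for can be sketched as follows. This version assumes $/ holds a plain string such as the default "\n"; the sub name and the sample text are made up for the example.

```perl
use strict ;
use warnings ;

# split a slurped buffer into lines, keeping the line ending on each one
sub split_lines {
    my( $text_ref ) = @_ ;

    # quote the separator so it is matched literally inside the regex
    my $sep = quotemeta $/ ;

    # the lookbehind matches the empty spot right after each separator
    return split /(?<=$sep)/, ${$text_ref} ;
}

my $text = "first\nsecond\n\nlast line, no newline" ;
my @lines = split_lines( \$text ) ;
print scalar( @lines ), " lines\n" ;    # prints "4 lines"
```

However the buffer was filled, this pass over the data has to touch every byte and create one scalar per line.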
Note: In the benchmark information for all the spew entries, the extra number at the end of each line is how many wall-clock seconds the whole entry took. The benchmarks were run for at least 2 CPU seconds per entry. The unusually large wall-clock times will be discussed below.
Scalar Spew of Short File
cpan_write_file 1035/s 38
print_file 1055/s 41
syswrite_file 1135/s 44
new 1519/s 2
print_join_file 1766/s 2
new_ref 1900/s 2
syswrite_file2 2138/s 2
Scalar Spew of Long File
cpan_write_file 164/s 20
print_file 211/s 26
syswrite_file 236/s 25
print_join_file 277/s 2
new 295/s 2
syswrite_file2 428/s 2
new_ref 608/s 2
In the scalar spew entries, the new module API wins when it is passed a
reference to the scalar buffer. The syswrite_file2 entry beats it
with the shorter file due to its simpler code. The old CPAN module is
the slowest due to its extra copying of the data and its use of print.
List Spew of Short File
cpan_write_file 794/s 29
syswrite_file 1000/s 38
print_file 1013/s 42
new 1399/s 2
print_join_file 1557/s 2
List Spew of Long File
cpan_write_file 112/s 12
print_file 179/s 21
syswrite_file 181/s 19
print_join_file 205/s 2
new 228/s 2
Again, the simple print_join_file entry beats the new module when
spewing a short list of lines to a file. But it loses to the new module
when the file size gets longer. The old CPAN module lags behind the
others since it first makes an extra copy of the lines and then it calls
print on the output list and that is much slower than passing to
print a single scalar generated by join. The print_file entry
shows the advantage of directly printing @_ and the
print_join_file adds the join optimization.
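The difference between the two print styles is small in code. The subs below are named after the benchmark entries, but they are written from the description above and are not copied from extras/slurp_bench.pl; the temp file is also just for this sketch.

```perl
use strict ;
use warnings ;
use File::Temp qw( tempfile ) ;

# print the list as-is: print() is handed every line as a separate argument
sub print_file {
    my $file_name = shift ;
    open( my $fh, '>', $file_name ) or die "Can't open $file_name: $!" ;
    print {$fh} @_ ;
    close $fh or die "Can't close $file_name: $!" ;
    return 1 ;
}

# join first, so print() is handed a single scalar
sub print_join_file {
    my $file_name = shift ;
    open( my $fh, '>', $file_name ) or die "Can't open $file_name: $!" ;
    print {$fh} join( '', @_ ) ;
    close $fh or die "Can't close $file_name: $!" ;
    return 1 ;
}

my( undef, $file_name ) = tempfile( UNLINK => 1 ) ;
my @lines = ( 'abc' x 30 . "\n" ) x 100 ;

print_file( $file_name, @lines ) ;
print_join_file( $file_name, @lines ) ;
print -s $file_name, " bytes written\n" ;    # prints "9100 bytes written"
```

Both subs leave the same bytes in the file; only the number of items handed to print changes.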
Now about those long wall-clock times. If you look carefully at the
benchmark code of all the spew entries, you will find that some always
write to new files and some overwrite existing files. When I asked David
Muir why the old File::Slurp module had an overwrite subroutine, he
answered that by overwriting a file, you always guarantee something
readable is in the file. If you create a new file, there is a moment
when the new file is created but has no data in it. I feel this is not a
good enough answer. Even when overwriting, you can write a shorter file
than the existing file and then you have to truncate the file to the new
size. There is a small race window there where another process can slurp
in the file with the new data followed by leftover junk from the
previous version of the file. This reinforces the point that the only
way to ensure consistent file data is the proper use of file locks.
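As one possible shape for that locking, the sketch below uses flock with an exclusive lock for the writer and a shared lock for the reader. The sub names are hypothetical, and flock locks are advisory, so this only helps when every process touching the file follows the same protocol.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT :flock ) ;
use File::Temp qw( tempfile ) ;

# writer: hold an exclusive lock while the contents are replaced
sub locked_write_file {
    my( $file_name, $text ) = @_ ;

    # no O_TRUNC here: the file must not be emptied before the lock is held
    sysopen( my $fh, $file_name, O_WRONLY | O_CREAT )
        or die "Can't open $file_name: $!" ;
    flock( $fh, LOCK_EX ) or die "Can't lock $file_name: $!" ;
    truncate( $fh, 0 ) or die "Can't truncate $file_name: $!" ;
    print {$fh} $text or die "Can't write $file_name: $!" ;

    # close flushes the data and then releases the lock
    close $fh or die "Can't close $file_name: $!" ;
    return 1 ;
}

# reader: hold a shared lock so no writer can change the file mid-slurp
sub locked_read_file {
    my( $file_name ) = @_ ;
    open( my $fh, '<', $file_name ) or die "Can't open $file_name: $!" ;
    flock( $fh, LOCK_SH ) or die "Can't lock $file_name: $!" ;
    local $/ ;
    my $text = <$fh> ;
    close $fh or die "Can't close $file_name: $!" ;
    return $text ;
}

my( undef, $file_name ) = tempfile( UNLINK => 1 ) ;
locked_write_file( $file_name, "a longer first version\n" ) ;
locked_write_file( $file_name, "short\n" ) ;
print locked_read_file( $file_name ) ;    # prints "short"
```

Because the truncate happens only after the exclusive lock is granted, a reader holding the shared lock never sees a half-written or junk-padded file.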
But what about those long times? Well it is all about the difference
between creating files and overwriting existing ones. The former have to
allocate new inodes (or the equivalent on other file systems) and the
latter can reuse the existing inode. This means the overwrite will save on
disk seeks as well as on CPU time. In fact when running this benchmark,
I could hear my disk going crazy allocating inodes during the spew
operations. This speedup in both CPU and wall-clock time is why the new module
always does overwriting when spewing files. It also does the proper
truncate (and this is checked in the tests by spewing shorter files
after longer ones had previously been written). The overwrite
subroutine is just a typeglob alias to write_file and is there for
backwards compatibility with the old File::Slurp module.
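A minimal sketch of the overwrite-then-truncate idea, together with a typeglob alias, might look like this. It is an illustration under my own assumptions and not the module's actual code: it dies on errors for brevity, takes no options, and calls the alias overwrite_file.

```perl
use strict ;
use warnings ;
use Fcntl qw( :DEFAULT ) ;
use File::Temp qw( tempfile ) ;

sub write_file {
    my $file_name = shift ;
    my $buf = join '', @_ ;

    # O_CREAT without O_TRUNC: an existing file is reused, not recreated
    sysopen( my $fh, $file_name, O_WRONLY | O_CREAT )
        or die "Can't open $file_name: $!" ;

    my $size_left = length( $buf ) ;
    my $offset = 0 ;
    while( $size_left > 0 ) {
        my $write_cnt = syswrite( $fh, $buf, $size_left, $offset ) ;
        die "write error in file $file_name: $!" unless $write_cnt ;
        $size_left -= $write_cnt ;
        $offset += $write_cnt ;
    }

    # chop off whatever is left over from a longer previous version
    truncate( $fh, $offset ) or die "Can't truncate $file_name: $!" ;
    close $fh or die "Can't close $file_name: $!" ;
    return 1 ;
}

# typeglob alias: overwrite_file() is just another name for write_file()
{
    no warnings 'once' ;
    *overwrite_file = \&write_file ;
}

my( undef, $file_name ) = tempfile( UNLINK => 1 ) ;
write_file( $file_name, "a long first version of the file\n" ) ;
overwrite_file( $file_name, "short\n" ) ;
print -s $file_name, " bytes\n" ;    # prints "6 bytes"
```

Without the truncate call, the second write would leave the tail of the first version sitting after "short\n".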
Benchmark Conclusion
Other than a few cases where a simpler entry beat it out, the new
File::Slurp module is either the speed leader or among the leaders. Its
special APIs for passing buffers by reference prove to be very useful
speedups. Also it uses all the other optimizations including using
sysread/syswrite and joining output lines. I expect many projects
that extensively use slurping will notice the speed improvements,
especially if they rewrite their code to take advantage of the new API
features. Even if they don't touch their code and use the simple API
they will get a significant speedup.
Error Handling
Slurp subroutines are subject to conditions such as not being able to
open the file, or I/O errors. How these errors are handled, and what the
caller will see, are important aspects of the design of an API. The
classic error handling for slurping has been to call die() or even
better, croak(). But sometimes you want the slurp to either
warn()/carp() or allow your code to handle the error. Sure, this
can be done by wrapping the slurp in an eval block to catch a fatal
error, but not everyone wants all that extra code. So I have added
another option to all the subroutines which selects the error
handling. If the 'err_mode' option is 'croak' (which is also the
default), the called subroutine will croak. If set to 'carp' then carp
will be called. Set to any other string (use 'quiet' when you want to
be explicit) and no error handler is called. Then the caller can use the
error status from the call.
write_file() doesn't use the return value for data so it can return a
false status value in-band to mark an error. read_file() does use its
return value for data, but we can still make it pass back the error
status. A successful read in any scalar mode will return either a
defined data string or a reference to a scalar or array. So a bare
return would work here. But if you slurp in lines by calling it in a
list context, a bare return will return an empty list, which is the
same value it would get from an existing but empty file. So now,
read_file() will do something I normally strongly advocate against,
i.e., returning an explicit undef value. In the scalar context this
still returns an error, and in list context, the returned first value
will be undef, and that is not legal data for the first element. So
the list context also gets an error status it can detect:
    my @lines = read_file( $file_name, err_mode => 'quiet' ) ;
    your_handle_error( "$file_name can't be read\n" ) unless
        @lines && defined $lines[0] ;
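One way such an option could be wired up is sketched below. The handle_error and read_lines subs and the file path are hypothetical, written only to show the three modes and the explicit undef; the module's real code is organized differently.

```perl
use strict ;
use warnings ;
use Carp ;

# hypothetical helper: report $err_msg the way the err_mode option asks,
# then hand back an explicit undef as the error status
sub handle_error {
    my( $args, $err_msg ) = @_ ;
    my $err_mode = $args->{'err_mode'} || 'croak' ;
    croak $err_msg if $err_mode eq 'croak' ;
    carp $err_msg if $err_mode eq 'carp' ;

    # any other mode, such as 'quiet', reports nothing
    return undef ;
}

# hypothetical list-context slurp that uses the helper
sub read_lines {
    my( $file_name, %args ) = @_ ;
    open( my $fh, '<', $file_name )
        or return handle_error( \%args, "Can't open $file_name: $!" ) ;
    return <$fh> ;
}

my $file_name = '/no/such/dir/no_such_file' ;    # assumed not to exist

# quiet mode: the caller checks the status itself
my @lines = read_lines( $file_name, err_mode => 'quiet' ) ;
print "$file_name can't be read\n" unless @lines && defined $lines[0] ;

# the default mode croaks, so this call needs an eval to survive
my $ok = eval { read_lines( $file_name ) ; 1 } ;
print "croaked: $@" unless $ok ;
```

The explicit undef is what lets the list-context caller tell a failed read (one undef element) from an empty file (no elements).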
The implementation
Here's the whole code which implements my faster slurp:
    use Carp ;
    use Fcntl qw( :DEFAULT ) ;    # O_RDONLY, O_WRONLY, O_APPEND and friends

    sub read_file {

        my( $file_name, %args ) = @_ ;

        # read into the caller's buffer if one was passed in
        my $buf ;
        my $buf_ref = $args{'buf_ref'} || \$buf ;

        my $mode = O_RDONLY ;
        $mode |= O_BINARY if $args{'binmode'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            carp "Can't open $file_name: $!" ;

        # keep reading until the whole file is in the buffer
        my $size_left = -s FH ;

        while( $size_left > 0 ) {

            my $read_cnt = sysread( FH, ${$buf_ref},
                $size_left, length ${$buf_ref} ) ;

            unless( $read_cnt ) {

                carp "read error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $read_cnt ;
        }

        # handle void context (return scalar by buffer reference)
        return unless defined wantarray ;

        # handle list context (split after each line ending)
        return split m|(?<=$/)|g, ${$buf_ref} if wantarray ;

        # handle scalar context
        return ${$buf_ref} ;
    }

    sub read_file_lines {

        # handle list context
        return &read_file if wantarray;

        # otherwise handle scalar context
        return [ &read_file ] ;
    }

    sub write_file {

        my $file_name = shift ;

        # an optional hash ref of options comes before the data
        my $args = ( ref $_[0] eq 'HASH' ) ? shift : {} ;
        my $buf = join '', @_ ;

        # no O_CREAT here, so the file must already exist
        my $mode = O_WRONLY ;
        $mode |= O_BINARY if $args->{'binmode'} ;
        $mode |= O_APPEND if $args->{'append'} ;

        local( *FH ) ;
        sysopen( FH, $file_name, $mode ) or
            carp "Can't open $file_name: $!" ;

        # keep writing until the whole buffer is out
        my $size_left = length( $buf ) ;
        my $offset = 0 ;

        while( $size_left > 0 ) {

            my $write_cnt = syswrite( FH, $buf,
                $size_left, $offset ) ;

            unless( $write_cnt ) {

                carp "write error in file $file_name: $!" ;
                last ;
            }

            $size_left -= $write_cnt ;
            $offset += $write_cnt ;
        }

        return ;
    }
In Summary
We have compared classic line-by-line processing with munging a whole file in memory. Slurping files can speed up your programs and simplify your code, if done properly. You must still be careful not to slurp humongous files (logs, DNA sequences, and so forth), or STDIN, where you don't know how much data you will read in. But slurping megabyte-size files is not a major issue on today's systems with the typical amount of RAM installed. When Perl was first being used in-depth (Perl 4), slurping was limited by the smaller RAM size of ten years ago. Now, you should be able to slurp almost any reasonably sized file, whether it contains configuration, source code, or data.

