For those not used to the terminology, FMTYEWTK stands for Far More Than You Ever Wanted To Know. This one is fairly light as FMTYEWTKs usually go. In any case, the question before us is, "How do you apply an edit against a list of files using Perl?" Well, that depends on what you want to do....
The Beginning
If you only want to read in one or more files, apply a regex to the contents, and spit out the altered text as one big stream -- the best approach is probably a one-liner such as the following:
perl -p -e "s/Foo/Bar/g" <FileList>
This command calls perl with the options -p and
-e "s/Foo/Bar/g" against the files listed in
FileList. The first argument, -p, tells Perl
to print each line it reads after applying the alteration. The second
option, -e, tells Perl to evaluate the provided
substitution regex rather than reading a script from a file. The Perl interpreter
then evaluates this regex against every line of all (space separated) files
listed on the command line and spits out one huge stream of the concatenated
fixed lines.
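If it helps to see what's going on, -p effectively wraps the code from -e in an implicit read-print loop, roughly like this (per the perlrun documentation; a sketch, not the literal implementation):
while (<>) {        # read each line from the listed files (or standard input)
    s/Foo/Bar/g;    # the code from -e runs here, with the line in $_
} continue {
    print;          # -p prints $_ after every pass through the loop
}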
In standard fashion, Perl lets you bundle an option that takes no argument, such as -p, with the option that follows it, for brevity and convenience. Therefore, you'll more often see the previous example written as:
perl -pe "s/Foo/Bar/g" <FileList>
In-place Editing
If you want to edit the files in place, editing each file before going on to the next, that's pretty easy, too:
perl -pi.bak -e "s/Foo/Bar/g" <FileList>
The only change from the last command is the new option -i.bak,
which tells Perl to operate on files in-place, rather than
concatenating them together into one big output stream. Like the
-e option, -i takes one argument, an extension to add
to the original file names when making backup copies; for this example I chose
.bak. Warning: If you execute the command twice,
you've most likely just overwritten your backups with the changed versions from
the first run. You probably didn't want to do that.
Because -i takes an argument, I had to separate out the
-e option, which Perl otherwise would interpret as the argument to
-i, leaving us with a backup extension of .bake,
unlikely to be correct unless you happen to be a pastry chef. In addition, Perl
would have thought that "s/Foo/Bar/" was the filename of the
script to run, and would complain when it could not find a script by that
name.
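For the curious, here is a simplified sketch of what -i.bak amounts to for each file; the real mechanics differ in their details, but this captures the idea:
foreach my $file (@ARGV) {
    rename $file, "$file.bak" or die "can't rename $file: $!";
    open my $in,  '<', "$file.bak" or die "can't read $file.bak: $!";
    open my $out, '>', $file       or die "can't create $file: $!";
    while (<$in>) {
        s/Foo/Bar/g;        # the -e code runs here
        print $out $_;      # and -p prints each line to the new copy
    }
}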
Running Multiple Regexes
Of course, you may want to make more extensive changes than just one regex. To make several changes all at once, add more code to the evaluated script. Remember to separate each additional statement with a semicolon (technically, a semicolon belongs at the end of each statement, but the very last one in any code block is optional). For example, you could make a series of changes:
perl -pi.bak -e "s/Bill Gates/Microsoft CEO/g;
s/CEO/Overlord/g" <FileList>
"Bill Gates" would then become "Microsoft Overlord" throughout the files. (Here, as in all examples, we ignore such finicky things as making sure we don't change "HERBACEOUS" to "HERBAOverlordUS"; for that kind of information, refer to a good treatise on regular expressions, such as Jeffrey Friedl's impressive book Mastering Regular Expressions, 2nd Edition. Also, I've wrapped the command to fit, but you should type it in as just one line.)
Doing Your Own Printing
You may wish to override the behavior created by -p, which
prints every line read in, after any changes made by your script. In this case,
switch to the -n option, which leaves all printing to you. -p -e "s/Foo/Bar/" is
roughly equivalent to -n -e "s/Foo/Bar/; print". This allows you
to write interesting commands, such as removing lines beginning with hash marks
(Perl comments, C-style preprocessor directives, etc.):
perl -ni.bak -e "print unless /^\s*#/;" <FileList>
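For instance, given a file containing:
x = 1;     # set x
    # a whole-line comment
y = 2;
the command keeps the first and last lines: the indented comment matches /^\s*#/ and is dropped, while the trailing comment on the first line survives untouched.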
Fields and Scripts
Of course, there are far more powerful things you can do with this. For example, imagine a flat-file database, with one row per line of the file, and fields separated by colons, like so:
Bill:Hennig:Male:43:62000
Mary:Myrtle:Female:28:56000
Jim:Smith:Male:24:50700
Mike:Jones:Male:29:35200
...
Suppose you want to find everyone who was over 25, but paid less than
$40,000. At the same time, you'd like to document the number and percentage of
women and men found. This time, instead of providing a mini-script on the
command line, we'll create a file, glass.pl, which contains the
script. Here's how to run the query:
perl -naF':' glass.pl <FileList>
glass.pl contains the following:
BEGIN { $men = $women = $lowmen = $lowwomen = 0; }
next unless /:/;
/Female/ ? $women++ : $men++;
if ($F[3] > 25 and $F[4] < 40000)
    { print; /Female/ ? $lowwomen++ : $lowmen++; }
END {
    print "\n\n$lowwomen of $women women (",
          int($lowwomen / $women * 100),
          "%) and $lowmen of $men men (",
          int($lowmen / $men * 100),
          "%) seem to be underpaid.\n";
}
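Run against just the four sample rows shown earlier, the script would print Mike's row (he is the only person over 25 earning less than $40,000), followed by the summary:
Mike:Jones:Male:29:35200


0 of 1 women (0%) and 1 of 3 men (33%) seem to be underpaid.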
Don't worry too much about the syntax, other than to note some of the awk and C similarities. The important thing here and in later sections is to see how Perl makes these problems easily solvable.
Several new features appear in this example; first, if there is no
-e option to evaluate, Perl assumes the first filename listed, in
this case glass.pl, refers to a Perl script for it to execute. Second, two new options make it easy to deal with field-based data. -a (autosplit mode) takes each line and splits its fields
into the array @F, based on the field delimiter given by the
-F (Field delimiter) option, which can be a string or a
regex. If no -F option exists, the field delimiter defaults to
' ' (one single-quoted space), which splits on runs of whitespace,
awk-style. By default, arrays in Perl are
zero-based, so $F[3] and $F[4] refer to the age and
pay fields, respectively. Finally, the BEGIN and END
blocks allow the programmer to perform actions before file reading begins and
after it finishes, respectively.
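As a standalone illustration of autosplit against the same data, this one-liner prints each person's first name and salary:
perl -F':' -ane "print $F[0], ' earns ', $F[4]" <FileList>
producing lines like Bill earns 62000; the line ending comes along for free, because the last field keeps it.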
File Handling
All of these little tidbits have made use only of data from within the files being operated on. What if you want to be able to read in data from elsewhere? For example, imagine that you had some sort of file that allows includes; in this case, we'll assume that you somehow specify these files by relative pathname, rather than looking them up in an include path. Perhaps the includes look like the following:
...
#include foo.bar, baz.bar, boo.bar
...
If you want to see what the file looks like with the includes placed into the master file, you might try something like this:
perl -ni.bak -e "if (s/#include\s+//) {foreach $file
(split /,\s*/) {open FILE, '<', $file; print <FILE>}}
else {print}" <FileList>
To make it easier to see what's going on here, this is what it looks like with a full set of line breaks added for clarity:
perl -ni.bak -e "
    if (s/^#include\s+//) {
        chomp;
        foreach $file (split /,\s*/) {
            open FILE, '<', $file;
            print <FILE>
        }
    } else {
        print
    }
" <FileList>
Of course, this only expands one level of include, but then we haven't
provided any way for the script to know when to stop if there's an include
loop. In this little example, we take advantage of the fact that the
substitution operator returns the number of changes made, so if it manages to
chop off the #include at the beginning of the line, it returns a
non-zero (true) value. The rest of the code then chomps the trailing newline
(so that it doesn't cling to the last filename), splits apart the list of
includes, opens each one in turn, and prints its entire contents.
There are some handy shortcuts as well: if you open a new file using the
name of an old file handle (FILE in this case), Perl automatically
closes the old file first. In addition, if you read from a file using the
<> operator into a list (which the print
function expects), it happily reads in the entire file at once, one line per
list entry. The print call then prints the entire list, inserting
it into the current file, as expected. Finally, the else clause
handles printing non-include lines from the source, because we are using
-n rather than -p.
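If you really do need nested includes, the easiest fix is to move the logic into a small script built around a recursive subroutine. Here is a minimal sketch (the name expand.pl is arbitrary, and it still makes no attempt to detect include loops):
use strict;
use warnings;

# Recursively copy a file to standard output, expanding #include lines.
sub expand {
    my ($name) = @_;
    open my $fh, '<', $name or die "can't open $name: $!";
    while (my $line = <$fh>) {
        if ($line =~ s/^#include\s+//) {
            chomp $line;
            expand($_) for split /,\s*/, $line;
        } else {
            print $line;
        }
    }
}

expand($_) for @ARGV;
You would run it as something like perl expand.pl master.bar > expanded.bar.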
Better File Lists
The fact that it is relatively easy to handle filenames listed within other files indicates that it ought to be fairly easy to deal entirely with files read from some other source than a list on the end of the command line. The simplest case is to read all of the file contents from standard input as a single stream, which is common when building up pipes. As a matter of fact, this is so common that Perl automatically switches to this mode if there are no files listed on the command line:
<Source> | perl -pe "s/Foo/Bar/g" | <Sink>
Here Source and Sink are the commands that generate the
raw data and handle the altered output from Perl, respectively. Incidentally, the
filename consisting of a single hyphen (-) is an explicit alias
for standard input; this allows the Perl programmer to merge input from files
and pipes, like so:
<Source> | perl -pe "s/Foo/Bar/g" header.bar - footer.bar
| <Sink>
This example reads a header file first, then the input from the pipe source, and finally a footer file. The program modifies the whole stream and sends it on through the output pipe.
As I mentioned earlier, when dealing with multiple files it is usually better to keep the files separate, by using in-place editing or by explicitly handling each file separately. On the other hand, it can be a pain to list all of the files on the command line, especially if there are a lot of files, or when dealing with files generated programmatically.
The simplest method is to read the files from standard input, pushing them
onto @ARGV in a BEGIN block; this has the effect of
tricking Perl into thinking it received all of the filenames on the command
line! Assuming the common case of one filename per input line, the following
will do the trick:
<FilenamesSource> | perl -pi.bak -e "BEGIN {push @ARGV,
<STDIN>; chomp @ARGV} s/Foo/Bar/g"
Here we once again use the shortcut that reading in a file in a list context
(which push provides) will read in the entire file. This adds the
entire contents, one filename per entry, to the @ARGV array, which
normally contains the list of arguments to the script. To complete the trick,
we chomp the line endings from the filenames, because Perl
normally returns the line ending characters (a carriage return and/or a line
feed) when reading lines from a file. We don't want to consider these to be
part of the filenames. (On some platforms, you could actually have
filenames containing line ending characters, but then you'd have to make the
Perl code a little more complex, and you deserve to figure that out for
yourself for trying it in the first place.)
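For example, to feed in all matching filenames from the Unix find command (assuming your platform has it):
find . -name "*.foo" | perl -pi.bak -e "BEGIN {push @ARGV,
<STDIN>; chomp @ARGV} s/Foo/Bar/g"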
Response Files
Another common design is to provide filenames on the command line as usual,
treating filenames starting with an @ specially. The program
should consider their contents to be lists of filenames to insert directly into
the command line. For example, if the contents of the file
names.baz (often called a response file) are:
two
three
four
then this command:
perl -pi.bak -e "s/Foo/Bar/g" one @names.baz five
should work equivalently to:
perl -pi.bak -e "s/Foo/Bar/g" one two three four five
To make this work, we once again need to do a little magic in a
BEGIN block. Essentially, we want to parse through the
@ARGV array, looking for filenames that begin with @.
We pass through any unmarked filenames, but for each response file found, we
read in the contents of the response file and insert the new list of filenames
into @ARGV. Finally, we chomp the line endings, just as in the previous section. This produces a canonical file list in
@ARGV, just as if we'd specified all of the files on the command
line. Here's what it looks like in action:
perl -pi.bak -e "BEGIN {@ARGV = map {s/^@// ? @{open RESP,
'<', $_; [<RESP>]} : $_} @ARGV; chomp @ARGV} s/Foo/Bar/g"
<ResponseFileList>
Here's the same code with line breaks added so you can see what's going on:
perl -pi.bak -e "
    BEGIN {
        @ARGV = map {
            s/^@// ? @{open RESP, '<', $_; [<RESP>]}
                   : $_
        } @ARGV;
        chomp @ARGV
    }
    s/Foo/Bar/g
" <ResponseFileList>
The only tricky part is the map block. map
applies a piece of code to every element of a list, returning a list of the
return values of the code; the current element is in the $_
special variable. The block here checks to see if it could remove a
@ from the beginning of each filename. If so, it opens the file,
reads the whole thing into an anonymous temporary array (that's what the square
brackets are there for), and then inserts that array instead of the response
file's name (that's the odd @{...} construct). If there is no
@ at the beginning of the filename to remove, the filename goes
directly into the map results. Once we've performed this expansion and chomped
any line endings, we can then proceed with the main work, in this case our
usual substitution, s/Foo/Bar/g.
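If map itself is unfamiliar, a tiny example may help; this one just doubles each element of its list:
perl -e "print join ', ', map { $_ * 2 } 1, 2, 3"
which prints 2, 4, 6.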
Recursing Directories
For our final example, let's deal with a major weakness in the way we've
been doing things so far — we're not recursing into directories, instead
expecting all of the files we need to read to appear explicitly on the command
line. To perform the recursion, we need to pull out the big guns:
File::Find. This Perl module provides very powerful recursion
methods. It also comes standard with any recent version of the Perl
interpreter. The command line is deceptively simple, because all of the brains
are in the script:
perl cleanup.pl <DirectoryList>
This script will perform some basic housecleaning, marking all files
readable and writeable, removing those with the extensions .bak,
.$$$, and .tmp, and cleaning up .log
files. For the log files, we will create a master log file (for archiving or
perusal) containing the contents of all of the other logs, and then delete the
logs so that they remain short over time. Here's the script:
use File::Find;
die "All arguments must be directories!"
if grep {!-d} @ARGV;
open MASTER, '>', 'master.lgm';
finddepth(\&filehandler, @ARGV);
close MASTER;
rename 'master.lgm', 'master.log';
sub filehandler
{
    # add read/write permission if missing, keeping the other mode bits
    chmod((stat _)[2] & 07777 | 0666, $_) unless (-r and -w);
    unlink if (/\.bak$/ or /\.tmp$/ or /\.\$\$\$$/);
    if (/\.log$/) {
        open LOG, '<', $_;
        print MASTER "\n\n****\n$File::Find::name\n****\n";
        print MASTER <LOG>;
        close LOG;
        unlink;
    }
}
This example shows just how powerful Perl and Perl modules can be, and at
the same time just how obtuse Perl can appear to the inexperienced. In this
case, the short explanation is that the finddepth() function
iterates through all of the program arguments (@ARGV), recursing
into each directory and calling the filehandler() subroutine for
each file. That subroutine then can examine the file and decide what to do with
it. The example checks for readability and writability with -r and
-w, fixing the file's security settings if needed with
chmod. It then unlinks (deletes) any file with a name
ending in any of the three unwanted extensions. Finally, if the extension is
.log, it opens the file, writes a few header lines to the master
log, copies the file into the master log, closes it, and deletes it.
Instead of using finddepth(), which does a depth-first search
of the directories and visits them from the bottom up, we could have used
find(), which does the same depth-first search from the top down.
As a side note, the program writes the master log file with the extension
.lgm, then renames it at the end to have the extension
.log, so as to avoid the possibility of writing the master log
into itself if the program is searching the current directory.
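As a minimal illustration of the same machinery, this two-line script uses find() to print the name of every file (skipping the directories themselves) beneath the directories named on the command line:
use File::Find;
find(sub { print "$File::Find::name\n" unless -d }, @ARGV);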
Conclusion
That's it. Sure, there's a lot more that you could do with these examples, including adding error checking, generating additional statistics, producing help text, etc. To learn how to do this, find a copy of Programming Perl, 3rd Edition, by Larry Wall, Tom Christiansen, and Jon Orwant. This is the bible (or the Camel, rather) of the Perl community, and well worth the read. Good luck!