For those not used to the terminology, FMTYEWTK stands for Far More Than You Ever Wanted To Know. This one is fairly light as FMTYEWTKs usually go. In any case, the question before us is, "How do you apply an edit against a list of files using Perl?" Well, that depends on what you want to do . . . .
If you just want to read in one or more files, apply a regex to the contents, and spit out the altered text as one big stream, that is probably best done with a one-liner such as the following:
perl -p -e "s/Foo/Bar/g" <FileList>
This command calls perl with the options -p and -e "s/Foo/Bar/g" against the files listed in the FileList. The first option, -p, tells perl to print each line it reads after applying the alteration. The second option, -e, tells perl to evaluate the provided substitution regex rather than reading a script from a file. The perl interpreter then evaluates this regex against every line of all (space-separated) files listed on the command line, and spits out one huge stream of the concatenated fixed lines.
In standard fashion, perl allows options without arguments to be concatenated with following options for brevity and convenience. Therefore, the previous example is more often written:
perl -pe "s/Foo/Bar/g" <FileList>
If you want to edit the files in place, editing each file before going on to the next, that's pretty easy too:
perl -pi.bak -e "s/Foo/Bar/g" <FileList>
The only change from the last command is the new option -i.bak, which as you might expect tells perl to operate on files in place, rather than concatenating them together into one big output stream. Like the -e option, -i takes an argument, in this case an extension to add to the original file names when making backup copies; for this example I chose .bak.
Warning: If you execute the command twice,
you've most likely just overwritten your backups with the
changed versions from the first run. You probably didn't
want to do that.
Note that since -i takes an argument, I had
to separate out the -e option, which otherwise
would have been added to the argument to -i,
leaving us with a backup extension of .bake,
unlikely to be correct unless you happen to be a pastry chef.
In addition, perl would have thought that "s/Foo/Bar/g" was the filename of the script to run, and would complain when it could not find a script by that name.
Of course, you may want to make more extensive changes than just one regex. To make several changes at once, add more code to the evaluated script, separating the statements with semicolons (technically, every statement should end with a semicolon, but the semicolon after the last statement in a code block is optional). For example, you could make a series of changes:
perl -pi.bak -e "s/Bill Gates/Microsoft CEO/g; s/CEO/Overlord/g" <FileList>
"Bill Gates" would then become "Microsoft Overlord" throughout the files. (Here, as in all examples, we ignore such finicky things as making sure we don't change "HERBACEOUS" to "HERBAOverlordUS"; for that kind of information, refer to a good treatise on regular expressions, such as Jeffrey Friedl's impressive book Mastering Regular Expressions, 2nd Edition. Also, I've wrapped the command to fit, but you should type it in as just one line.)
You may wish to override the behavior created by -p,
which causes every line read in to be printed out, after any
changes made by your script. In this case, you should change
to the -n option. -p -e "s/Foo/Bar/"
is roughly equivalent to -n -e "s/Foo/Bar/; print".
This means we can do interesting stuff like the following,
which removes lines beginning with hash marks (Perl comments,
C-style preprocessor directives, etc.):
perl -ni.bak -e "print unless /^\s*#/;" <FileList>
Of course, there are far more powerful things you can do with this; for example, imagine a flatfile database, with one row per line of the file, and fields separated by colons, like so:
Bill:Hennig:Male:43:62000
Mary:Myrtle:Female:28:56000
Jim:Smith:Male:24:50700
Mike:Jones:Male:29:35200
...
Now let's say that you wanted to find everyone who was over 25,
but paid less than $40,000. At the same time, you'd like to
document the number and percentage of women and men found.
This time, instead of providing a mini-script on the command
line, we'll create a file, glass.pl, which contains
the script we'll run. To run the query, the following will
do the trick:
perl -naF':' glass.pl <FileList>
glass.pl contains the following:
# Start every counter at zero before the first line is read.
BEGIN { $men = $women = $lowmen = $lowwomen = 0; }

next unless /:/;                 # skip lines that are not records
/Female/ ? $women++ : $men++;
if ($F[3] > 25 and $F[4] < 40000) {
    print;
    /Female/ ? $lowwomen++ : $lowmen++;
}

# Report once all of the files have been read.
END {
    print "\n\n$lowwomen of $women women (",
          int($lowwomen / $women * 100),
          "%) and $lowmen of $men men (",
          int($lowmen / $men * 100),
          "%) seem to be underpaid.\n";
}
Don't worry too much about the syntax, other than to note
some of the AWK and C similarities; the important thing here
and in later sections is to see some of the capabilities
available to make these sorts of problems easily solvable in Perl.
Several new features are used for this example; first, if
there is no -e option to evaluate, perl assumes
the first filename listed, in this case glass.pl,
refers to a Perl script to be executed. Second, two
new options make it easy to deal with field-based data.
-a (autosplit mode) takes each line and splits its fields into the array @F, based on the field delimiter given by the -F (field delimiter) option, which can be a string or a regex. If no -F option exists, the field delimiter defaults to ' ' (one single-quoted space), which makes perl split on runs of whitespace.
By default, arrays in Perl are zero-based, so $F[3]
and $F[4] refer to the age and pay fields,
respectively. Finally, the BEGIN and END
blocks allow the programmer to perform actions before file
reading begins and after all files have been dealt with,
respectively.
All of these little tidbits have made use only of data from within the files being operated on. But what if you wanted to be able to read in data from elsewhere? For example, imagine that you had some sort of file that allows includes; in this case, we'll assume that include files are specified by relative pathname, rather than being looked up in some sort of include path. Perhaps the includes look like the following:
...
#include foo.bar, baz.bar, boo.bar
...
If you wanted to see what the file looked like with the includes placed into the master file, you might try something like this:
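perl -ni.bak -e "if (s/^#include\s+//) {chomp; foreach $file
(split /,\s*/) {open FILE, '<', $file; print <FILE>}}
else {print}" <FileList>
To make it easier to see what's going on here, this is what it looks like if we add in a full set of line breaks for clarity. The chomp is there because the line still carries its line ending, which would otherwise stick to the last filename and make that open fail. (As written, with double quotes around the script, the command suits shells that leave $file alone; a Unix shell expands variables inside double quotes, so there you would put the script in single quotes and use double quotes around the < inside it.)
perl -ni.bak -e "
    if (s/^#include\s+//) {
        chomp;    # drop the line ending so the last filename opens
        foreach $file (split /,\s*/) {
            open FILE, '<', $file;
            print <FILE>
        }
    } else {
        print
    }
" <FileList>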
Of course, this only expands one level of include, but then
we haven't provided any way for the script to know when to
stop if there's an include loop. In this little example,
we take advantage of the fact that the substitution operator
returns the number of changes made, so if it manages to chop
off the #include at the beginning of the line,
it returns a non-zero (true) value, and the rest of the code
splits apart the list of includes, opens each one in turn,
and prints its entire contents. Handy shortcuts are used as
well: if you open a new file using the name of an old file
handle (FILE in this case), perl automatically
closes the old file first; in addition, if you read from a file
using the <> operator in list context (which
the print function provides), it happily reads
in the entire file at once, one line per list entry. The
print call then prints the entire list, inserting
it into the current file, as expected. Finally, the else
clause handles printing non-include lines from the source,
since we are using -n rather than -p.
The fact that it is relatively easy to handle filenames listed within other files indicates that it ought to be fairly easy to deal entirely with files read from some other source than a list on the end of the command line. The simplest case is to simply read all of the file contents from standard input as a single stream, which is common when building up pipes. As a matter of fact, this is so common that perl automatically switches to this mode if there are no files listed on the command line:
<Source> | perl -pe "s/Foo/Bar/g" | <Sink>
Here Source and Sink are the commands
that generate the raw data and handle the altered output from
perl, respectively. Incidentally, the filename consisting of
a single hyphen (-) is an explicit alias for
standard input; this allows the Perl programmer to merge input
from files and pipes, like so:
<Source> | perl -pe "s/Foo/Bar/g" header.bar - footer.bar | <Sink>
In this example, a header file is read, followed by the input
from the pipe source, followed by a footer file; the whole mess
is read in, modified, and sent through to the out pipe. Still,
as was mentioned early on, when dealing with multiple files it
is usually desirable to keep the files separate, by using in-place
editing or by explicitly handling each file separately. On the
other hand, it can be a pain to list all of the files on the
command line, especially if there are a lot of files, or they
are generated programmatically. The simplest method is to simply
read the files from standard input, pushing them onto @ARGV
in a BEGIN block; this has the effect of tricking
perl into thinking it received all of the filenames on the
command line! Assuming the common case of one filename per input
line, the following will do the trick:
<FilenamesSource> | perl -pi.bak -e "BEGIN {push @ARGV,
<STDIN>; chomp @ARGV} s/Foo/Bar/g"
Here we once again use the shortcut that reading in a file in a
list context (which is provided by the push) will
read in the entire file; the entire contents are added, one
filename per entry, to the @ARGV array, which
normally contains the list of arguments to the script. To
complete the trick, we chomp the line endings from
the filenames, since Perl normally returns the line ending
characters (a carriage return and/or a line feed) when reading
lines from a file, and we don't want to consider these to be
part of the filenames. (On some platforms, you could
actually have filenames containing line ending characters, but
then you'd have to make the Perl code a little more complex,
and you deserve to figure that out for yourself for trying it
in the first place.)
Another common design is to provide filenames on the command
line as usual, but filenames starting with an @
are treated specially; their contents are considered to be a
list of filenames to insert directly into the command line.
For example, if the contents of the file names.baz
(often called a response file) are:
two
three
four
then this command:
perl -pi.bak -e "s/Foo/Bar/g" one @names.baz five
should be treated as exactly equivalent to:
perl -pi.bak -e "s/Foo/Bar/g" one two three four five
To make this work, we once again need to do a little magic
in a BEGIN block. Essentially, we want to
parse through the @ARGV array, looking for
filenames that begin with @. We pass through
any unmarked filenames, but for each response file found, we
read in the contents of the response file and insert the
new list of filenames into @ARGV. Finally, we
chomp the line endings, just as in the
previous section; we then have
a canonical file list in @ARGV, just as if all
of the files had been specified on the command line. Here's
what it looks like in action:
perl -pi.bak -e "BEGIN {@ARGV = map {s/^@// ? @{open RESP,
'<', $_; [<RESP>]} : $_} @ARGV; chomp @ARGV} s/Foo/Bar/g"
<ResponseFileList>
Here's the same code with line breaks added so you can see what's going on:
perl -pi.bak -e "
    BEGIN {
        @ARGV = map {
                    s/^@// ? @{open RESP, '<', $_;
                               [<RESP>]}
                           : $_
                } @ARGV;
        chomp @ARGV
    }
    s/Foo/Bar/g
" <ResponseFileList>
The only tricky part is the map block.
map applies a piece of code to every element of
a list, returning a list of the return values of the code; the
current element is represented as $_. The block
we're using here checks to see if it was able to remove a
@ from the beginning of each filename. If so,
it opens the file, reads the whole thing into an anonymous
temporary array (that's what the square brackets are there for),
and then inserts that array instead of the response file's name
(that's the odd @{...} construct).
If there was no @ at the beginning of the filename
to remove, the filename is copied directly into the map results.
Once we've performed this expansion, and chomped any line endings,
we can then get on with the main work, which in this case is
simply our usual substitution, s/Foo/Bar/g.
For our final example, let's deal with a major weakness in
the way we've been doing things so far -- we're not recursing
into directories, but merely expecting all of the files we
need to read to be listed explicitly on the command line.
To perform the recursion, we need to pull out the big guns:
File::Find, which is a Perl module that provides
very powerful recursion methods; it comes standard with any
recent version of the perl interpreter. The command line will
be deceptively simple, because all of the brains will be in
the script:
perl cleanup.pl <DirectoryList>
This script will perform some basic housecleaning, marking
all files readable and writeable, removing those with the extensions
.bak, .$$$, and .tmp,
and cleaning up .log files. For the log files,
we will create a master log file for archiving or perusal,
containing the contents of all of the other logs, and then
delete the logs so that they remain short over time. Here's
the script:
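use File::Find;

die "All arguments must be directories!"
    if grep {!-d} @ARGV;

# Collect into a .lgm file so the master log is never read into itself.
open MASTER, '>', 'master.lgm';
finddepth(\&filehandler, @ARGV);
close MASTER;
rename 'master.lgm', 'master.log';

# Called by finddepth() for each entry found, with its name in $_.
sub filehandler
{
    # (stat _)[2] is the mode from the file test just performed;
    # add read and write permission for everyone.
    chmod(((stat _)[2] & 07777) | 0666, $_) unless (-r and -w);
    unlink if (/\.bak$/ or /\.tmp$/ or /\.\$\$\$$/);
    if (/\.log$/) {
        open LOG, '<', $_;
        print MASTER "\n\n****\n$File::Find::name\n****\n";
        print MASTER <LOG>;
        close LOG;
        unlink;
    }
}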
This example shows just how powerful Perl and Perl modules
can be, and at the same time just how obtuse Perl can appear
without some experience with it. In this case, the short
explanation is that the finddepth() function
iterates through all of the program arguments (@ARGV),
recursing into each directory, and calling the
filehandler() subroutine for each file. That
subroutine then can examine the file and decide what to do
with it. In the example, we check for readability and
writability with -r and -w, fixing
the file's security settings if needed with chmod.
We then unlink (delete) any file with a name
ending in any of the three unwanted extensions. Finally,
if the extension is .log, we open the file,
write a few header lines to the master log, copy the file
into the master log, close it, and delete it.
Instead of using finddepth(), which does a
depth-first search of the directories and visits them from
the bottom up, we could have used find(), which
does the same depth-first search, but visits them from the
top down. As a side note, the master log file is written
with the extension .lgm, and then renamed at
the end to have the extension .log, so as to
avoid the possibility of writing the master log into itself
if the current directory is one of those searched.
And that's it. Sure, there's a lot more that could be done with these examples, including error checking, additional statistics, help text, etc. If you want to learn how to do this, get a copy of Programming Perl, 3rd Edition, by Larry Wall, Tom Christiansen, and Jon Orwant. This is the bible (or the camel, rather) of the Perl community, and well worth the read. Good luck!