Power Regexps, Part II
by Simon Cozens
July 01, 2003
In the previous article, we looked at some intermediate features of regular expressions, including multiline matching, quoting, and interpolation. This time, we're going to look at more advanced features. We'll also look at some modules that can help us handle regular expressions.
Look Forward, Look Back
Perhaps the most misunderstood facilities of regular expressions are the lookahead and lookbehind operators; let's begin with the simplest, the positive lookahead operator.
This operator, spelled (?= ), attempts to match a pattern, and if
successful, promptly forgets all about it. As its name implies, it peeks
forward into the string to see whether the next part of the string matches
the pattern. For instance:
$a = "13.15 Train to London";
$a =~ /(?=.*London)([\d\.]+)/;
This is perhaps an inefficient way of writing:
$a =~ /([\d\.]+).*London/;
and it can be read as "See if this string has 'London' in it somewhere, and if so, capture a series of digits or periods."
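To see the "promptly forgets" behaviour in action, here is a minimal sketch. The helper name time_if_london is my own invention for illustration, not something from the original code.

```perl
use strict;
use warnings;

# time_if_london is a hypothetical helper: it returns the leading run of
# digits and periods if "London" appears anywhere later in the line,
# and undef otherwise.
sub time_if_london {
    my ($line) = @_;
    # (?=.*London) peeks ahead without consuming anything, so the capture
    # group starts matching from the very position the lookahead began at.
    return $line =~ /(?=.*London)([\d.]+)/ ? $1 : undef;
}

my $london = time_if_london("13.15 Train to London");
my $paris  = time_if_london("13.15 Train to Paris");

print defined $london ? "London: $london\n" : "London: no match\n";
print defined $paris  ? "Paris: $paris\n"   : "Paris: no match\n";
```

The first call captures 13.15; the second finds no London ahead of any position, so the whole match fails and nothing is captured.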
Here's an example of it in real-life code; I want to turn some file
names into names of Perl modules. I'll have a name like
/Library/Perl/Mail/Miner/Recogniser/Phone.pm - this is part of my
Mail::Miner module, so I can guarantee that the name of the module
will start with Mail/Miner - and I want to get
Mail::Miner::Recogniser::Phone. Here's the code that does it:
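use File::Spec::Functions qw(splitdir);

our @modules = map {
    s/\.pm$//;                # strip the extension (note the escaped dot)
    s{.*(?=Mail/Miner)}{};    # drop everything up to, but not including, Mail/Miner
    join "::", splitdir($_)
} @files;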
We look at each of our files, and first take off the .pm from the
end. Now what we need to do is remove everything before the
Mail/Miner portion, stripping off /Library/Perl or whatever our
path happens to be. Now we could write this as:
s{.*Mail/Miner}{Mail/Miner};
removing everything which appears before Mail/Miner and then the text
Mail/Miner itself, and then replacing all that with Mail/Miner
again. This is obviously horribly long-winded, and it's much more
natural to think of this in terms of "get rid of everything but stop
when you see Mail/Miner". In most cases, you can think of (?= ) as
meaning "up to".
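Here is the same "up to" idea as a small self-contained sketch. The module_name wrapper is hypothetical; it works on a copy of the path so the caller's string is left alone, and it assumes the path contains Mail/Miner exactly once.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(splitdir);

# module_name is a hypothetical wrapper around the two substitutions.
sub module_name {
    my ($path) = @_;                    # copy, so the argument is untouched
    $path =~ s/\.pm$//;                 # remove the trailing .pm
    $path =~ s{.*(?=Mail/Miner)}{};     # delete everything up to Mail/Miner
    return join "::", splitdir($path);  # turn directory separators into ::
}

my $name = module_name("/Library/Perl/Mail/Miner/Recogniser/Phone.pm");
print "$name\n";
```

Because the lookahead consumes nothing, Mail/Miner is still in the string after the substitution, and there is no need to put it back by hand.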
Similar but subtly different is the negative counterpart (?! ). This
again peeks forward into the string, but ensures that it doesn't
match the pattern. A good way to think of this is "so long as you don't
see". Damian Conway's Text::Autoformat contains some code for
detecting quoted lines of text, such as may be found in an e-mail
message:
% Will all this regular expression scariness go away in
% Perl 6?
Yes, definitely; we're replacing it with a completely different set
of scariness.
Here the first two lines are quoted, and the expressions that check for this look like so:
my $quotechar = qq{[!#%=|:]};
my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};
$quotechar contains the characters that we take to signify a
quotation, and $quotechunk has two options for what a quotation looks
like. The second is most natural: a greater-than sign, possibly preceded
by some initials, such as produced by the popular Supercite emacs
package:
SC> You're talking nonsense, you odious little gnome!
The left-hand side of the alternation in $quotechunk is a little more
interesting. We look for one of our quotation characters, such as %
as in the example above, but then we make sure that the next character
we see is not alphabetic; this may be a quotation:
% I think that all right-thinking people...
but this almost certainly isn't:
%options = ( verbose => 1, debug => 0 );
The (?!) acts as a "make sure you don't see" directive.
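We can try those two patterns against the sample lines above. This is only a sketch: the looks_quoted helper, the ^\s* anchor, and the /i flag are my assumptions about how the chunks might be used, not a copy of what Text::Autoformat actually does.

```perl
use strict;
use warnings;

my $quotechar  = qq{[!#%=|:]};
my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};

# looks_quoted is a hypothetical helper: true if the line starts
# (after optional whitespace) with something matching $quotechunk.
# The /i flag is assumed so that upper-case initials like "SC" count.
sub looks_quoted {
    my ($line) = @_;
    return $line =~ /^\s*$quotechunk/i ? 1 : 0;
}

my $prose = looks_quoted("% I think that all right-thinking people...");
my $code  = looks_quoted("%options = ( verbose => 1, debug => 0 );");
my $cite  = looks_quoted("SC> You're talking nonsense, you odious little gnome!");

print "prose: $prose, code: $code, supercite: $cite\n";
```

The hash declaration is rejected because the % is immediately followed by a letter, which is exactly what (?![a-z]) forbids.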
The mistake everyone makes at least once with this is to assume you can say:
/(?!foo)bar/;
and wonder why it matches against foobar. After all, we've made sure
we didn't see a foo before the bar, right? Well, not exactly.
These are lookahead operators, and so can't be used to find things
"before" anything at all; they're only used to determine what we can or
can't see after the current position. To understand why this is wrong,
imagine what it would mean if it were a positive assertion:
/(?=foo)bar/;
This means "are the next three characters we see foo? If so, the next three characters we see are bar". This is obviously never
going to happen, since a string can't contain both foo and bar at
the same position and the same time. (Although I believe Damian has a
paper on that.) So the negative version means "are the next three
characters we see not foo? Then match bar". foo is not
bar, so this matches any bar. What was probably meant was a
lookbehind assertion, which we will look at imminently.
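If you'd rather see the mistake than take my word for it, this short sketch runs both patterns against foobar and records where the negative version matches.

```perl
use strict;
use warnings;

# Negative lookahead: "the next three characters are not foo, then bar".
my $neg_matches = "foobar" =~ /(?!foo)bar/ ? 1 : 0;

# $-[0] holds the start offset of the last successful match.
my $neg_offset  = "foobar" =~ /(?!foo)bar/ ? $-[0] : -1;

# Positive lookahead: the same three characters would have to be both
# foo and bar, so this can never match anything.
my $pos_matches = "foobar" =~ /(?=foo)bar/ ? 1 : 0;

print "negative: $neg_matches at offset $neg_offset\n";
print "positive: $pos_matches\n";
```

The negative version happily matches at offset 3, where the next three characters are bar and therefore not foo; it never looks at what came before.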
Now that we've seen the two forward-facing assertions, we can turn (ha, ha) to the backward-facing assertions, positive and negative lookbehind. There's one important difference between these and their forward-facing counterparts: while lookahead operators can contain more or less any kind of regular expression pattern, for reasons of implementation the lookbehind operators must have a fixed width computable at compile time. That is, you're not allowed to use indefinite quantifiers such as * or + in your subpatterns.
The positive lookbehind assertion is (?<=), and the only thing
you need to know about it is that it's so rare I can't remember the last
time I saw it in real code. I don't think I've ever used it, except
possibly in error. If you think you want to use one of these, then you almost
certainly need to rethink your strategy. Here's a quick example, though,
from IPC::Open3:
$@ =~ s/(?<=value attempted) at .*//s;
The context for this is that we've just done the equivalent of
eval { $_[0] = ... };
and if someone maliciously passes a constant value to the subroutine,
we want to throw the Modification of a read-only value attempted
error back in their face. We check we're seeing the error we expect,
then strip off the at .../IPC/Open3.pm, line 154 part of the message
so that it can be fed to croak. The less Tom-Christianseny way to
do this would be something like:
croak "You fed me bogus parameters" if $@ =~ /attempted/;
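For the curious, here is what that substitution does to an error string. The messages below are made up for illustration, including the file path; only the substitution itself comes from the code quoted above.

```perl
use strict;
use warnings;

# Hypothetical error text; the path is invented for this sketch.
my $err = "Modification of a read-only value attempted"
        . " at /hypothetical/path/Open3.pm line 154.\n";

# Strip " at ..." only if it directly follows "value attempted".
# The lookbehind checks the preceding text without making it part of
# the match, so it survives the substitution. /s lets .* eat the newline.
(my $msg = $err) =~ s/(?<=value attempted) at .*//s;

# An unrelated (also invented) error is left alone, because the
# lookbehind fails everywhere in it.
my $other = "Can't locate Foo.pm at script.pl line 1.\n";
(my $unchanged = $other) =~ s/(?<=value attempted) at .*//s;

print "$msg\n";
print $unchanged;
```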
The negative lookbehind assertion, on the other hand, is considerably
more common; this is the answer to our "bar not preceded by foo"
problem of the previous section.
/(?<!foo)bar/;
This will match bar, peeking backward into the string to make sure
it doesn't see foo first. To take another example, suppose we're
preparing some text for sending over the network, and we want to make
sure that all the line feeds (\n) have carriage returns (\r)
before them. Here's the truly lazy way to do it:
# Make sure there's an \r in there somewhere
s{\n}{\r\n}g;
# And then strip out duplicates
s{\r\r}{\r}g;
This is fine (if somewhat inefficient) unless the text may legitimately contain two carriage
returns in a row, in which case the second substitution wrongly collapses them. Here's the finesse:
s/(?<!\r)\n/\r\n/g;
If you see a line feed that is not preceded by a carriage return, then stick a carriage return in there -- much cleaner, and much more efficient.
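The following sketch puts the two approaches side by side, and also checks the "bar not preceded by foo" pattern from earlier. The helper names to_crlf and lazy_crlf are hypothetical, as are the test strings.

```perl
use strict;
use warnings;

# Negative lookbehind: only touch line feeds that lack a preceding \r.
sub to_crlf {
    my ($text) = @_;
    $text =~ s/(?<!\r)\n/\r\n/g;
    return $text;
}

# The lazy two-pass version, for comparison.
sub lazy_crlf {
    my ($text) = @_;
    $text =~ s{\n}{\r\n}g;   # add an \r before every \n
    $text =~ s{\r\r}{\r}g;   # then squash doubled \r
    return $text;
}

my $mixed   = "a\nb\r\nc\n";     # some lines already end in \r\n
my $doubled = "x\r\ry\n";        # two carriage returns that belong there

my $clean      = to_crlf($mixed);
my $lazy_clean = lazy_crlf($mixed);
my $kept       = to_crlf($doubled);
my $squashed   = lazy_crlf($doubled);

# The earlier problem: bar, but not when foo comes right before it.
my $after_foo = "foobar" =~ /(?<!foo)bar/ ? 1 : 0;
my $after_baz = "bazbar" =~ /(?<!foo)bar/ ? 1 : 0;

print "same on mixed input: ", ($clean eq $lazy_clean ? "yes" : "no"), "\n";
print "same on doubled \\r:  ", ($kept eq $squashed ? "yes" : "no"), "\n";
print "foobar: $after_foo, bazbar: $after_baz\n";
```

Both versions agree on ordinary text, but only the lookbehind version leaves the doubled carriage returns alone.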