Power Regexps, Part II

by Simon Cozens
July 01, 2003

In the previous article, we looked at some intermediate features of regular expressions, including multiline matching, quoting, and interpolation. This time, we're going to look at more advanced features. We'll also look at some modules that can help us handle regular expressions.

Look Forward, Look Back

Perhaps the most misunderstood facilities of regular expressions are the lookahead and lookbehind operators; let's begin with the simplest, the positive lookahead operator.

This operator, spelled (?= ), attempts to match a pattern, and if successful, promptly forgets all about it. As its name implies, it peeks forward into the string to see whether the next part of the string matches the pattern. For instance:


    $a = "13.15    Train to London";
    $a =~ /(?=.*London)([\d\.]+)/;

This is perhaps an inefficient way of writing:


    $a =~ /([\d\.]+).*London/;

and it can be read as "See if this string has 'London' in it somewhere, and if so, capture a series of digits or periods."
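To see what "promptly forgets all about it" means in practice, here is a small sketch comparing the two patterns. The variable names are mine, and the second sample line is made up for illustration. Both forms capture the same thing, but the lookahead version consumes nothing beyond the capture:

```perl
use strict;
use warnings;

my $line  = "13.15    Train to London";
my $other = "13.15    Train to Paris";    # made-up counter-example

# Lookahead form: check that "London" lies somewhere ahead, then capture.
my ($with_lookahead) = $line =~ /(?=.*London)([\d\.]+)/;
my $matched_lookahead = $&;               # text actually consumed by the match

# Plain form: capture, then consume everything up to and including "London".
my ($plain) = $line =~ /([\d\.]+).*London/;
my $matched_plain = $&;

# Neither form matches a line that never mentions London.
my $other_matches = $other =~ /(?=.*London)([\d\.]+)/ ? 1 : 0;

print "captured:  $with_lookahead / $plain\n";
print "consumed with lookahead: '$matched_lookahead'\n";
print "consumed without:        '$matched_plain'\n";
```

The capture is `13.15` either way; the difference is that the plain version's match runs all the way to the end of `London`, which matters as soon as you use the pattern in a substitution.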

Here's an example of it in real-life code; I want to turn some file names into names of Perl modules. I'll have a name like /Library/Perl/Mail/Miner/Recogniser/Phone.pm - this is part of my Mail::Miner module, so I can guarantee that the name of the module will start with Mail/Miner - and I want to get Mail::Miner::Recogniser::Phone. Here's the code that does it:


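    # splitdir is assumed to come from File::Spec::Functions
    use File::Spec::Functions qw(splitdir);

    our @modules = map {
        s/\.pm$//;                  # strip the extension (note the escaped dot)
        s{.*(?=Mail/Miner)}{};      # delete everything up to Mail/Miner
        join "::", splitdir($_)     # Mail/Miner/... becomes Mail::Miner::...
    } @files;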

We look at each of our files, and first take off the .pm from the end. Now what we need to do is remove everything before the Mail/Miner portion, stripping off /Library/Perl or whatever our path happens to be. Now we could write this as:


    s{.*Mail/Miner}{Mail/Miner};

removing everything that appears before Mail/Miner and then the text Mail/Miner itself, and then replacing all that with Mail/Miner again. This is obviously horribly long-winded, and it's much more natural to think of this in terms of "get rid of everything, but stop when you see Mail/Miner". In most cases, you can think of (?= ) as meaning "up to".
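Here is a self-contained sketch of the same conversion that you can run as is. The file list is hypothetical, and I've assumed splitdir comes from File::Spec::Functions; the sketch also works on a copy of each name, so the original list is left alone:

```perl
use strict;
use warnings;
use File::Spec::Functions qw(splitdir);

# Hypothetical input; in the real code @files comes from elsewhere.
my @files = (
    "/Library/Perl/Mail/Miner/Recogniser/Phone.pm",
    "/usr/lib/perl5/Mail/Miner.pm",
);

my @modules = map {
    my $name = $_;                      # copy, so @files is not modified
    $name =~ s/\.pm$//;                 # drop the extension
    $name =~ s{.*(?=Mail/Miner)}{};     # remove everything "up to" Mail/Miner
    join "::", splitdir($name);
} @files;

print "$_\n" for @modules;
```

Because the lookahead consumes nothing, Mail/Miner itself survives the substitution and there is no need to put it back by hand.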

Similar but subtly different is the negative counterpart (?! ). This again peeks forward into the string, but ensures that it doesn't match the pattern. A good way to think of this is "so long as you don't see". Damian Conway's Text::Autoformat contains some code for detecting quoted lines of text, such as may be found in an e-mail message:


    % Will all this regular expression scariness go away in 
    % Perl 6?

    Yes, definitely; we're replacing it with a completely different set
    of scariness.

Here the first two lines are quoted, and the expressions that check for this look like so:


    my $quotechar = qq{[!#%=|:]};
    my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};

$quotechar contains the characters that we consider to signify a quotation, and $quotechunk has two options for what a quotation looks like. The second is most natural: a greater-than sign, possibly preceded by some initials, such as those produced by the popular Supercite emacs package:


    SC> You're talking nonsense, you odious little gnome!

The left-hand side of the alternation in $quotechunk is a little more interesting. We look for one of our quotation characters, such as % as in the example above, but then we make sure that the next character we see is not alphabetic; this may be a quotation:


    % I think that all right-thinking people...

but this almost certainly isn't:


    %options = ( verbose => 1, debug => 0 );

The (?! ) operator acts as a "make sure you don't see" directive.
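To try the two patterns against the sample lines above, here is a rough sketch. How Text::Autoformat actually applies $quotechunk isn't shown here, so the looks_quoted helper, the anchoring at the start of the line, and the /i flag (needed for the upper-case initials in "SC>") are all my assumptions:

```perl
use strict;
use warnings;

my $quotechar  = qq{[!#%=|:]};
my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};

# Hypothetical helper: anchoring and /i are assumptions, not
# necessarily what Text::Autoformat itself does.
sub looks_quoted {
    my ($line) = @_;
    return $line =~ /^\s*$quotechunk/i ? 1 : 0;
}

my @samples = (
    q{% Will all this regular expression scariness go away in},
    q{Yes, definitely; we're replacing it with a completely different set},
    q{SC> You're talking nonsense, you odious little gnome!},
    q{%options = ( verbose => 1, debug => 0 );},
);

printf "%-6s %s\n", (looks_quoted($_) ? "quoted" : "plain"), $_ for @samples;
```

The first and third lines come out as quoted; the second has no quote marker at all, and the fourth is rejected by the negative lookahead because the % is immediately followed by a letter.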

The mistake everyone makes at least once with this is to assume you can say:


    /(?!foo)bar/;

and wonder why it matches against foobar. After all, we've made sure we didn't see a foo before the bar, right? Well, not exactly. These are lookahead operators, and so can't be used to find things "before" anything at all; they're only used to determine what we can or can't see after the current position. To understand why this is wrong, imagine what it would mean if it were a positive assertion:


    /(?=foo)bar/;

This means "are the next three characters we see foo? If so, the next three characters we see are bar". This is obviously never going to happen, since a string can't contain both foo and bar at the same position and the same time. (Although I believe Damian has a paper on that.) So the negative version means "are the next three characters we see not foo? Then match bar". foo is not bar, so this matches any bar. What was probably meant was a lookbehind assertion, which we will look at imminently.
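If you'd rather convince yourself by experiment, this short sketch runs both patterns over a few strings (the test strings are mine):

```perl
use strict;
use warnings;

my @results;
for my $str ("foobar", "bazbar", "foo") {
    # Negative lookahead: "the next three characters are not foo", then bar.
    my $negative = $str =~ /(?!foo)bar/ ? 1 : 0;
    # Positive lookahead: "the next three characters are foo", then bar.
    my $positive = $str =~ /(?=foo)bar/ ? 1 : 0;
    push @results, [ $str, $negative, $positive ];
    printf "%-7s negative=%d positive=%d\n", $str, $negative, $positive;
}

# Where does the negative version match in "foobar"?
"foobar" =~ /(?!foo)bar/;
my $start = $-[0];    # offset of the start of the match
print "match starts at offset $start\n";
```

The negative version matches both foobar and bazbar, and in foobar the match starts at offset 3: standing at the "b", the next three characters are indeed not foo. The positive version never matches anything.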

Now we've seen the two forward-facing assertions, we can turn (ha, ha) to the backward-facing assertions, positive and negative lookbehind. There's one important difference between these and their forward-facing counterparts; while lookahead operators can contain more or less any kind of regular expression pattern, for reasons of implementation the lookbehind operators must have a fixed width computable at compile time. That is, you're not allowed to use any indefinite quantifiers in your subpatterns.

The positive lookbehind assertion is (?<= ), and the only thing you need to know about it is that it's so rare I can't remember the last time I saw it in real code. I don't think I've ever used it, except possibly in error. If you think you want to use one of these, then you almost certainly need to rethink your strategy. Here's a quick example, though, from IPC::Open3:


    $@ =~ s/(?<=value attempted) at .*//s;

The context for this is that we've just done the equivalent of


    eval { $_[0] = ... };

and if someone maliciously passes a constant value to the subroutine, we want to throw the "Modification of a read-only value attempted" error back in their face. We check we're seeing the error we expect, then strip off the at .../IPC/Open3.pm, line 154 part of the message so that it can be fed to croak. The less Tom-Christianseny way to do this would be something like:


    croak "You fed me bogus parameters" if $@ =~ /attempted/;
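Here is the lookbehind substitution at work on a couple of sample messages. Both messages, including the paths and line numbers in them, are made up for illustration; only the substitution itself is taken from the code above:

```perl
use strict;
use warnings;

# Made-up messages in the general shape of Perl's run-time errors.
my $expected  = "Modification of a read-only value attempted at /some/path/Open3.pm line 154.\n";
my $unrelated = "Can't locate Foo.pm at script.pl line 3.\n";

# Strip " at ..." only when it directly follows "value attempted".
(my $trimmed   = $expected)  =~ s/(?<=value attempted) at .*//s;
(my $untouched = $unrelated) =~ s/(?<=value attempted) at .*//s;

print "[$trimmed]\n";
print "[$untouched]";
```

The lookbehind does double duty: it checks that this is the error we expected, and because it consumes nothing, the words "value attempted" stay in the message while everything after them (the trailing newline included, thanks to /s) goes away. The unrelated message is left exactly as it was.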

The negative lookbehind assertion, on the other hand, is considerably more common; this is the answer to our "bar not preceded by foo" problem of the previous section.


    /(?<!foo)bar/;

This will match bar, peeking backward into the string to make sure it doesn't see foo first. To take another example, suppose we're preparing some text for sending over the network, and we want to make sure that all the line feeds (\n) have carriage returns (\r) before them. Here's the truly lazy way to do it:


    # Make sure there's an \r in there somewhere
    s{\n}  {\r\n}g;
    # And then strip out duplicates
    s{\r\r}{\r}g;
 
This is fine (if somewhat inefficient) unless it's OK for two carriage returns to appear without a line feed in the way. Here's the finesse:

    s/(?<!\r)\n/\r\n/g;

If you see a line feed that is not preceded by a carriage return, then stick a carriage return in there -- much cleaner, and much more efficient.
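The following sketch puts negative lookbehind, spelled (?<! ), through its paces: first on the "bar not preceded by foo" problem, then on the line-ending fix, comparing the lazy version with the lookbehind version. The sample text and the visible helper are mine:

```perl
use strict;
use warnings;

# "bar" not preceded by "foo"
my $after_foo = "foobar" =~ /(?<!foo)bar/ ? 1 : 0;
my $after_baz = "bazbar" =~ /(?<!foo)bar/ ? 1 : 0;
print "foobar: $after_foo, bazbar: $after_baz\n";

# Sample text: a bare \n, an existing \r\n, and a deliberate \r\r.
my $text = "one\ntwo\r\nthree\r\rfour\n";

# The lazy way: add \r everywhere, then squash the duplicates.
(my $lazy = $text) =~ s{\n}{\r\n}g;
$lazy =~ s{\r\r}{\r}g;

# The lookbehind way: only touch a \n that has no \r before it.
(my $clean = $text) =~ s/(?<!\r)\n/\r\n/g;

# Hypothetical helper to make control characters printable.
sub visible {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

print "lazy:  ", visible($lazy),  "\n";
print "clean: ", visible($clean), "\n";
```

Both versions fix the line endings, but the lazy one also collapses the deliberate \r\r between "three" and "four", while the lookbehind version leaves it alone.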
