Power Regexps, Part II

Jul 1, 2003 by Simon Cozens

In the previous article, we looked at some of the more intermediate features of regular expressions, including multiline matching, quoting, and interpolation. This time, we’re going to look at more-advanced features. We’ll also look at some modules that can help us handle regular expressions.

Look Forward, Look Back

Perhaps the most misunderstood facilities of regular expressions are the lookahead and lookbehind operators; let’s begin with the simplest, the positive lookahead operator.

This operator, spelled (?= ), attempts to match a pattern, and if successful, promptly forgets all about it. As its name implies, it peeks forward into the string to see whether the next part of the string matches the pattern. For instance:

    $a = "13.15    Train to London";
    $a =~ /(?=.*London)([\d\.]+)/;

This is perhaps an inefficient way of writing:

    $a =~ /([\d\.]+).*London/;

and it can be read as “See if this string has ‘London’ in it somewhere, and if so, capture a series of digits or periods.”
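
To see that the lookahead really does forget what it matched, here is a minimal sketch; the second timetable line and the variable names are invented for illustration.

```perl
use strict;
use warnings;

my $line = "13.15    Train to London";

# The lookahead checks for "London" but consumes nothing, so the
# capture still starts matching at the beginning of the string.
my ($time) = $line =~ /(?=.*London)([\d.]+)/;

# A line without "London" fails the assertion, so nothing is captured.
my ($other) = "14.05    Train to Leeds" =~ /(?=.*London)([\d.]+)/;

print "London train leaves at $time\n" if defined $time;
print "No London train on the second line\n" unless defined $other;
```

The capture comes back as 13.15 even though the assertion had to scan all the way to the end of the string to find London.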

Here’s an example of it in real-life code; I want to turn some file names into names of Perl modules. I’ll have a name like /Library/Perl/Mail/Miner/Recogniser/Phone.pm - this is part of my Mail::Miner module, so I can guarantee that the name of the module will start with Mail/Miner - and I want to get Mail::Miner::Recogniser::Phone. Here’s the code that does it:

    use File::Spec::Functions qw(splitdir);

    our @modules = map {
        s/\.pm$//;
        s{.*(?=Mail/Miner)}{};
        join "::", splitdir($_)
    } @files;

We look at each of our files, and first take off the .pm from the end. Now what we need to do is remove everything before the Mail/Miner portion, stripping off /Library/Perl or whatever our path happens to be. Now we could write this as:

    s{.*Mail/Miner}{Mail/Miner};

removing everything which appears before Mail/Miner and then the text Mail/Miner itself, and then replacing all that with Mail/Miner again. This is obviously horribly long-winded, and it’s much more natural to think of this in terms of “get rid of everything but stop when you see Mail/Miner”. In most cases, you can think of (?= ) as meaning “up to”.
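
Here is a self-contained sketch of the “up to” idea. The second path and the module name it produces are made up for the example, and the sketch works on a copy of each name so that @files is left alone.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(splitdir);

# Example paths; only the first appears in the article.
my @files = (
    '/Library/Perl/Mail/Miner/Recogniser/Phone.pm',
    '/usr/local/lib/perl5/Mail/Miner/Message.pm',
);

my @modules = map {
    my $name = $_;                  # copy, so @files is left untouched
    $name =~ s/\.pm$//;             # drop the extension
    $name =~ s{.*(?=Mail/Miner)}{}; # delete everything up to Mail/Miner
    join "::", splitdir($name);
} @files;

print "$_\n" for @modules;
```

Because the lookahead consumes nothing, the replacement can be empty: Mail/Miner was never part of the matched text in the first place.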

Similar but subtly different is the negative counterpart (?! ). This again peeks forward into the string, but ensures that it doesn’t match the pattern. A good way to think of this is “so long as you don’t see”. Damian Conway’s Text::Autoformat contains some code for detecting quoted lines of text, such as may be found in an e-mail message:

    % Will all this regular expression scariness go away in 
    % Perl 6?

    Yes, definitely; we're replacing it with a completely different set
    of scariness.

Here the first two lines are quoted, and the expressions that check for this look like so:

    my $quotechar = qq{[!#%=|:]};
    my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};

$quotechar contains the characters that we consider to signify a quotation, and $quotechunk has two options for what a quotation looks like. The second is the most natural: a greater-than sign, possibly preceded by some initials, such as those produced by the popular Supercite emacs package:

    SC> You're talking nonsense, you odious little gnome!

The left-hand side of the alternation in $quotechunk is a little more interesting. We look for one of our quotation characters, such as % as in the example above, but then we make sure that the next character we see is not alphabetic; this may be a quotation:

    % I think that all right-thinking people...

but this almost certainly isn’t:

    %options = ( verbose => 1, debug => 0 );

The (?!) acts as a “make sure you don’t see” directive.
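
We can try the two patterns against the sample lines. This is only a sketch: anchoring the pattern at the start of the line and matching case-insensitively are my assumptions here, not necessarily what Text::Autoformat does internally.

```perl
use strict;
use warnings;

my $quotechar  = qq{[!#%=|:]};
my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};

# Assumption: anchor at the start of the line and ignore case.
my $is_quoted = qr/^\s*$quotechunk/i;

my @lines = (
    '% I think that all right-thinking people...',
    '%options = ( verbose => 1, debug => 0 );',
    q{SC> You're talking nonsense, you odious little gnome!},
);

# Classify each line as a quotation or as plain text.
my @verdicts = map { $_ =~ $is_quoted ? 'quote' : 'plain' } @lines;

printf "%-6s %s\n", $verdicts[$_], $lines[$_] for 0 .. $#lines;
```

The first and third lines are reported as quotations; the hash assignment is left alone because the % is immediately followed by a letter.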

The mistake everyone makes at least once with this is to assume you can say:

    /(?!foo)bar/;

and wonder why it matches against foobar. After all, we’ve made sure we didn’t see a foo before the bar, right? Well, not exactly. These are lookahead operators, and so can’t be used to find things “before” anything at all; they’re only used to determine what we can or can’t see after the current position. To understand why this is wrong, imagine what it would mean if it were a positive assertion:

    /(?=foo)bar/;

This means “are the next three characters we see foo? If so, the next three characters we see are bar”. This is obviously never going to happen, since a string can’t contain both foo and bar at the same position and the same time. (Although I believe Damian has a paper on that.) So the negative version means “are the next three characters we see not foo? Then match bar”. foo is not bar, so this matches any bar. What was probably meant was a lookbehind assertion, which we will look at imminently.
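
A quick sketch makes the point concrete; the list of test strings is arbitrary.

```perl
use strict;
use warnings;

# Negative lookahead: at the position where "bar" starts, the next
# three characters are "bar", not "foo", so the assertion passes.
my $negative = "foobar" =~ /(?!foo)bar/ ? 1 : 0;

# Positive lookahead: the same three characters would have to be
# both "foo" and "bar", so this cannot match any string.
my $positive = grep { /(?=foo)bar/ } qw(foobar barfoo foo bar);

print "(?!foo)bar matches 'foobar': $negative\n";
print "(?=foo)bar matched $positive of 4 strings\n";
```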

Now we’ve seen the two forward-facing assertions, we can turn (ha, ha) to the backward-facing assertions, positive and negative lookbehind. There’s one important difference between these and their forward-facing counterparts; while lookahead operators can contain more or less any kind of regular expression pattern, for reasons of implementation the lookbehind operators must have a fixed width computable at compile time. That is, you’re not allowed to use any indefinite quantifiers in your subpatterns.
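
The restriction is easy to observe. In this sketch the patterns are built from strings so that they are compiled at run time, where eval can catch the failure; compiles() is a helper invented for the example. More recent versions of Perl relax the rule somewhat for lookbehinds with a bounded length, but an unbounded quantifier such as + is still refused.

```perl
use strict;
use warnings;

# Build the patterns from strings so that they are compiled at run
# time, where eval can trap a compilation failure.
sub compiles {
    my ($pattern) = @_;
    return defined eval { qr/$pattern/ } ? 1 : 0;
}

my $fixed     = compiles('(?<=\d\d)px');  # two digits: fixed width
my $unbounded = compiles('(?<=\d+)px');   # \d+ has no upper limit

print "fixed-width lookbehind compiles: $fixed\n";
print "unbounded lookbehind compiles:   $unbounded\n";
```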

The positive lookbehind assertion is (?<=), and the only thing you need to know about it is that it’s so rare I can’t remember the last time I saw it in real code. I don’t think I’ve ever used it, except possibly in error. If you think you want to use one of these, then you almost certainly need to rethink your strategy. Here’s a quick example, though, from IPC::Open3:

    $@ =~ s/(?<=value attempted) at .*//s;

The context for this is that we’ve just done the equivalent of

    eval { $_[0] = ... };

and if someone maliciously passes a constant value to the subroutine, we want to throw the Modification of a read-only value attempted error back in their face. We check we’re seeing the error we expect, then strip off the at .../IPC/Open3.pm, line 154 part of the message so that it can be fed to croak. The less Tom-Christianseny way to do this would be something like:

    croak "You fed me bogus parameters" if $@ =~ /attempted/;
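
For the curious, here is what the lookbehind version does to a message of that shape. The error string is a stand-in written out by hand, with a placeholder path, rather than output captured from IPC::Open3.

```perl
use strict;
use warnings;

# A made-up error message in the shape the article describes; the
# path and line number are placeholders, not real output.
my $error = "Modification of a read-only value attempted"
          . " at /some/path/Open3.pm line 154.\n";

# Delete " at ..." only when it directly follows "value attempted".
# The lookbehind text is checked but is not part of what is replaced.
(my $trimmed = $error) =~ s/(?<=value attempted) at .*//s;

print "$trimmed\n";
```

The words before at survive untouched, because the lookbehind only inspected them.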

The negative lookbehind assertion, on the other hand, is considerably more common; this is the answer to our “bar not preceded by foo” problem of the previous section.

    /(?<!foo)bar/;

This will match bar, peeking backward into the string to make sure it doesn’t see foo first. To take another example, suppose we’re preparing some text for sending over the network, and we want to make sure that all the line feeds (\n) have carriage returns (\r) before them. Here’s the truly lazy way to do it:

    # Make sure there's an \r in there somewhere
    s{\n}  {\r\n}g;
    # And then strip out duplicates
    s{\r\r}{\r}g;
 
This is fine (if somewhat inefficient) unless it’s OK for two carriage returns to appear without a line feed in the way. Here’s the finesse:

    s/(?<!\r)\n/\r\n/g;

If you see a line feed that is not preceded by a carriage return, then stick a carriage return in there – much cleaner, and much more efficient.
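
The difference between the two approaches shows up as soon as the input contains a deliberate pair of carriage returns. This sketch uses an invented test string and prints the control characters in a visible form.

```perl
use strict;
use warnings;

# Mixed line endings: bare LF, existing CRLF, and a deliberate CR CR.
my $text = "one\ntwo\r\nthree\r\rfour\n";

# Lookbehind version: only line feeds lacking a preceding CR change.
(my $careful = $text) =~ s/(?<!\r)\n/\r\n/g;

# Lazy version: also collapses the deliberate CR CR pair.
my $lazy = $text;
$lazy =~ s{\n}{\r\n}g;
$lazy =~ s{\r\r}{\r}g;

# Show control characters visibly.
for ($careful, $lazy) {
    (my $shown = $_) =~ s/\r/<CR>/g;
    $shown =~ s/\n/<LF>/g;
    print "$shown\n";
}
```

The lookbehind version leaves the CR CR between three and four intact, while the lazy version squashes it to a single CR.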

`split`, `//g` and other shenanigans

In the previous article, we had a nice piece of multiline, formatted data, such as one might expect to parse with Perl:

    Name: Mark-Jason Dominus
    Occupation: Perl trainer
    Favourite thing: Octopodes

    Name: Simon Cozens
    Occupation: Hacker
    Favourite thing: Sleep

Now, there’s a boring way to parse this. If you’re coming from a C or Java background, then you might try:

    my $record = {};
    my @records;
    for (split /\n/, $text) {
        chomp;
        if (/([^:]+): (.*)/) {
            $record->{$1} = $2;
        } elsif ($_ =~ /^\s*$/) {
            # Blank line => end of current record
            push @records, $record;
            $record = {};
        } else {
            die "Wasn't expecting to see '$_' here";
        }
    }
    # The last record has no blank line after it
    push @records, $record if %$record;

And, of course, this will work. But there are several more Perl-ish solutions than this. When you know the fields provided by your data, it’s rather nice to have a regular expression that reflects the data structure:

    while ($text =~ /Name:\s(.*)\n
                     Occupation:\s(.*)\n
                     Favourite.*:\s(.*)/gx) {
        push @records, { name => $1, occupation => $2, favourite => $3 };
    }

Here we use the /g modifier, which allows us to resume the match from where it last left off.
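
Here is the loop as a complete program, as a sketch. The call to pos() is my addition, included only to show where each iteration leaves off.

```perl
use strict;
use warnings;

my $text = <<'END';
Name: Mark-Jason Dominus
Occupation: Perl trainer
Favourite thing: Octopodes

Name: Simon Cozens
Occupation: Hacker
Favourite thing: Sleep
END

my @records;
while ($text =~ /Name:\s(.*)\n
                 Occupation:\s(.*)\n
                 Favourite.*:\s(.*)/gx) {
    push @records, { name => $1, occupation => $2, favourite => $3 };
    # pos() reports where the next iteration will resume searching.
    printf "matched %s, resuming at offset %d\n", $1, pos($text);
}

print scalar(@records), " records found\n";
```

In scalar context, each pass through the while condition finds the next match after the previous one, and the loop ends when there are no more.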

If we don’t know the fields while we’re writing our program, then we’ll have to break the process up into two stages. First, we extract the individual records, which are delimited by a blank line:

    my @texts = split /\n\s*\n/, $text;

And then for each record, we can either use the /g trick again, or simply split each record into lines. I prefer the latter, for reasons you’ll see in a second:

    for (@texts) {
        my $record = {};
        for (split /\n/, $_) {
            /([^:]+): (.*)/;
            $record->{$1} = $2;
        }
        push @records, $record;
    }

This is not dissimilar from the initial solution, but it allows us to make some interesting improvements. For starters, when you see code that transforms data with a for loop, you should wonder whether it could be better written with a map statement. This goes double if you’re using push inside the for loop as we are here. So this version is a natural evolution:

    @records = map {
        my $record = {};
        for (split /\n/, $_) { 
            /([^:]+): (.*)/;
            $record->{$1} = $2;
        }
        $record;
    } split /\n\s*\n/, $text;

And we can actually do away with the inner for loop too:

    @records = map {
        # The leading + marks this as an anonymous hash, not a block
        +{
            map { /([^:]+): (.*)/ and ($1 => $2) } split /\n/, $_
        }
    } split /\n\s*\n/, $text;

But if we’re prepared to be a little lax about trailing whitespace, there’s actually an even nicer way to do it, using the one thing that everyone forgets about split: if your split pattern contains parentheses, then the captured text is inserted into the list returned by split. That is, the following code:

    split( /(\W+)/, "perl-5.8.0.tar.gz")

will produce the list

    ("perl", "-", "5", ".", "8", ".", "0", ".", "tar", ".", "gz")

So we can actually use the field name, colon and space at the start of each line as the split expression itself:

    split /^([^:]+):\s*/m

There is a slight problem with this idea - because the first thing in each record is the delimiter we’re looking for, the first thing returned by split will be an empty string. But we can easily get around this by adding another undef to provide a fake undef => '' hash element. This allows us to reduce the parser code to:

    @records = map { 
                     +{ undef, split /^([^:]+):\s*/m, $_ } 
                   } split /\n\s*\n/, $text;

It may not be pretty, but it’s quick and it works.
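
If the laxness bothers you, a slightly longer variant tidies up after itself. This is a sketch rather than a replacement for the one-liner: it pairs the empty leading field with an explicit '' (which is what undef becomes as a hash key), deletes that fake element, and trims the trailing whitespace from each value.

```perl
use strict;
use warnings;

my $text = <<'END';
Name: Mark-Jason Dominus
Occupation: Perl trainer
Favourite thing: Octopodes

Name: Simon Cozens
Occupation: Hacker
Favourite thing: Sleep
END

my @records = map {
    # The '' pairs up with the empty leading field from split.
    my %record = ('', split /^([^:]+):\s*/m, $_);
    delete $record{''};              # drop the fake element
    s/\s+\z// for values %record;    # trim trailing newlines
    \%record;
} split /\n\s*\n/, $text;

for my $record (@records) {
    print "$_ => $record->{$_}\n" for sort keys %$record;
    print "\n";
}
```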

Of course, you may also use lookahead and lookbehind assertions with split; I sometimes use the following code to break a string into tokens:

    split /(?<=\W)|(?=\W)/, $string;

This is almost the same as

    split /(\W)/, $string

but with a subtle difference. Again, as Perl wants to see a nonword character as a delimiter, it will return an empty string between two adjacent nonwords:

    split /(\W)/, '$foo := $bar';
    # '', '$', 'foo', ' ', '', ':', '', '=', '', ' ', '', '$', 'bar'

Splitting on a word boundary goes too much the other way:

    split /\b/, '$foo := $bar';
    # '$', 'foo', ' := $', 'bar'

And so it turns out that we want to cleave the string where we’ve just seen a nonword character, or if we’re about to see one:

    split /(?<=\W)|(?=\W)/, $string;
    # '$', 'foo', ' ', ':', '=', ' ', '$', 'bar'

And this gives us the sort of tokenisation we want.
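
In practice we rarely care about the whitespace tokens, so a small wrapper can throw them away. tokens() is a name made up for this sketch.

```perl
use strict;
use warnings;

# Split before and after every nonword character, then discard the
# tokens that are nothing but whitespace.
sub tokens {
    my ($string) = @_;
    return grep { /\S/ } split /(?<=\W)|(?=\W)/, $string;
}

my @tokens = tokens('$foo := $bar');
print join(' | ', @tokens), "\n";
```

For the example string, this leaves the six tokens $, foo, :, =, $ and bar.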

Regexp Modules

Now, though, we are getting into the sort of regular expressions that are not written lightly, and we may need some help constructing and debugging these expressions. Thankfully, there are plenty of modules which make regexp handling much easier for us.

re

The re module is as invaluable as it is obscure. It’s one of those hidden treasures of the Perl core that Casey was talking about last month. As well as turning on two features of the regular expression engine, tainting subexpressions and evaluated assertions, it provides a debugging facility that allows you to watch your expression being compiled and executed.

Here’s a relatively simple expression:

    $a =~ /([^:]+):\s*(.*)/;

When this code is run under -Mre=debug, then the following will be printed when the regexp is compiled:

    Compiling REx `([^:]+):\s*(.*)'
    size 25 first at 4
       1: OPEN1(3)
       3:   PLUS(13)
       4:     ANYOF[\0-9;-\377](0)
      13: CLOSE1(15)
      15: EXACT <:>(17)
      17: STAR(19)
      18:   SPACE(0)
      19: OPEN2(21)
      21:   STAR(23)
      22:     REG_ANY(0)
      23: CLOSE2(25)
      25: END(0)

This tells us the instructions for the little machine that the regular expression compiler creates: it should first open a bracket, then go into a loop (PLUS) finding characters that are ANYOF character zero through to the character 9, and ; through to character 255 - that is, everything apart from a :. Then we close the bracket, look for a specific character, and so on. The numbers in brackets after each instruction are the line number to jump to on completion; when the PLUS loop exits, it should go on to line 13, CLOSE1, and so on.

Next, when we try to run this match against some text:

    $a = "Name: Mark-Jason Dominus";

It will first tell us something about the optimizations it performs:

    Guessing start of match, REx `([^:]+):\s*(.*)' against `Name: ...'
    Found floating substr `:' at offset 4...
    Does not contradict STCLASS...
    Guessed: match at offset 0

What this means is that it has found the constant element : in the regular expression, tries to locate that in the string, and then works backward to find out where it should start the match. Since the : is at position four in our string, it will go on to deduce that the match should start at the beginning and…

    Matching REx `([^:]+):\s*(.*)' against `Name: Mark-Jason Dominus'
    Setting an EVAL scope, savestack=3
    0 <> <Name: Mark-J>    |  1:  OPEN1
    0 <> <Name: Mark-J>    |  3:  PLUS
    ANYOF[\0-9;-\377] can match 4 times out of 32767...

The [^:] can match four times, since it knows there are four things that are not colons there.

The re module is absolutely essential for heavy-duty study of how the regular expression engine works, and why it doesn’t do what you think it should.
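
You don’t have to debug the whole program at once. As a sketch, and assuming Perl 5.10 or later where the pragma is lexically scoped, you can wrap just the match you care about in a block. The trace goes to STDERR, and its exact format varies between Perl versions, so don’t expect it to look identical to the listing above.

```perl
use strict;
use warnings;

my ($field, $value);
{
    # Assumption: Perl 5.10 or later, where this pragma is lexically
    # scoped; the trace is written to STDERR.
    use re 'debug';
    ($field, $value) = "Name: Mark-Jason Dominus" =~ /([^:]+):\s*(.*)/;
}

# Regexps compiled outside the block are not traced.
print "$field => $value\n";
```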

YAPE::Regex::Explain

The description given by re is a little low-level for some people; well, most people. YAPE::Regex::Explain aims to put the explanation at a much higher level; for instance,

     % perl -MYAPE::Regex::Explain -e 'print 
       YAPE::Regex::Explain->new(qr/(?<=\W)|(?=\W)/)->explain'

will produce quite a verbose explanation of the regular expression like so:

    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (?<=                     look behind to see if there is:
    ----------------------------------------------------------------------
        \W                       non-word characters (all but a-z, A-Z,
                                 0-9, _)
    ----------------------------------------------------------------------
    ...

GraphViz::Regex

I find that one of the best ways to debug and understand a complex procedure is to draw a picture. GraphViz::Regex uses the graphviz visualization library to draw a state machine diagram for a given regular expression:

    use GraphViz::Regex;

    my $regex = '(([abcd0-9])|(foo))';

    my $graph = GraphViz::Regex->new($regex);
    print $graph->as_png;

Regexp::Common

So much for explaining complicated regular expressions; what about generating them? The Regexp::Common module aims to be a repository for all kinds of commonly needed regular expressions, such as URIs, balanced texts, domain names and IP addresses. The interface is a little freaky, but it can hugely help to clarify complex regexps:

    use Regexp::Common qw(net);

    my $ts = qr/\d+:\d+:\d+\.\d+/;
    $tcpdump =~ /$ts ($RE{net}{IPv4}) > ($RE{net}{IPv4}) : (tcp|udp) (\d+)/;

Text::Balanced

Finally, one particularly common family of things to match is quoted, parenthesised or tagged text. Damian’s Text::Balanced module helps produce both regular expressions and subroutines to match and extract balanced text sequences. For instance, we can create a regular expression for matching double-quoted strings like so:

    use Text::Balanced qw(gen_delimited_pat);
    $pat = gen_delimited_pat(q{"});
    # (?:\"(?:[^\\\"]*(?:\\.[^\\\"]*)*)\")

This pattern will match quoted text, but will also be aware of escape sequences like \" and \\, and hence not break off in the middle of

    "\"So\", he said, \"How about lunch?\""

Text::Balanced also contains routines for extracting tagged text, finding balanced pairs of parentheses, and much more.
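
To see why the escape handling matters, here is a sketch that compares the generated pattern with the naive one most of us would write first. The surrounding sentence is invented for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

my $pat  = gen_delimited_pat(q{"});
my $text = q{He wrote "\"So\", he said, \"How about lunch?\"" and left.};

# A naive pattern stops at the first escaped quote...
my ($naive) = $text =~ /("[^"]*")/;

# ...while the generated pattern skips over backslash escapes.
my ($quoted) = $text =~ /($pat)/;

print "naive:     $naive\n";
print "generated: $quoted\n";
```

The naive pattern gives up after three characters, while the generated one returns the whole quoted string.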

Summary

We’ve looked at some slightly more-complex features of regular expressions, and shown how we can use these to slice and dice text with Perl. As these regexes get more complicated, the need for tools to help us debug them increases; and so we’ve also looked at re, YAPE::Regex::Explain and GraphViz::Regex.

Finally, the Regexp::Common and Text::Balanced modules help us create complex regular expressions of our own.
