Lexing Your Data

s/(?<!SHOOTING YOURSELF IN THE )FOOT/HEAD/g

Most of us have tried at one time or another to use regular expressions to do things we shouldn't: parsing HTML, obfuscating code, washing dishes, etc. This is what the technical term "showing off" means. I've done it too:

$html =~ s{
             (<a\s(?:[^>](?!href))*href\s*)
             (&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
             ([^>]+>)
          }
          {$1 . decode_entities($2) .  $4}gsexi;

I was strutting like a peacock when I wrote that, followed quickly by eating crow when I ran it. I never did get that working right. I'm still not sure what I was trying to do. That regular expression forced me to learn how to use HTML::TokeParser. More importantly, that was the regular expression that taught me how difficult regular expressions can be.

The Problem with Regular Expressions

Look at that regex again:

 /(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/

Do you know what that matches? Exactly? Are you sure? Even if it works, how easily can you modify it? If you don't know what it was trying to do (and to be fair, remember that it's broken), how long did you spend trying to figure it out? When's the last time a single line of code gave you such fits?

The problem, of course, is that this regular expression is trying to do far more work than a single line of code reasonably can. When faced with a regular expression like that, there are a few things I like to do.

  • Document it carefully.
  • Use the /x switch so I can expand it over several lines.
  • Possibly, encapsulate it in a subroutine.

Sometimes, though, there's a fourth option: lexing.

Lexing

When developing code, we typically take a problem and break it down into a series of smaller problems that are easier to solve. Regular expressions are code, and you can likewise break them down into a series of smaller problems that are easier to solve. One technique that facilitates this is lexing.

Lexing is the act of breaking data down into discrete tokens and assigning meaning to those tokens. There's a bit of fudging in that statement, but it pretty much covers the basics.

Parsing typically follows lexing to convert the tokens into something more useful. Parsing is frequently the domain of some tool that applies a well-defined grammar to the lexed tokens.

Sometimes well-defined grammars are not practical for extracting and reporting information. There might not be a grammar available for a company's ad-hoc log file format. Other times you might find it easier to process the tokens manually than to spend the time writing a grammar. Still other times you might only care about part of the data you've lexed, not all of it. All three of these reasons apply to the problem below.

Parsing SQL

Recently, on Perlmonks (parse a query string), someone had some SQL to parse:

select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
  else 'd'
  end tough_one
from ...
where ...

The poster needed the alias for each column from that SQL. In this case, the aliases are date, months_old, product, year, and tough_one. Of course, this was only one example. There's actually plenty of generated SQL, all with subtle variations on the column aliases, so this is not a trivial task. What's interesting about this, though, is that we don't give a fig about anything except the column aliases. The rest of the text is merely there to help us find those aliases.

Your first thought might be to parse this with SQL::Statement. As it turns out, this module does not handle CASE statements. Thus, you must figure out how to patch SQL::Statement, submit said patch, and hope it gets accepted and released in a timely fashion. (Note that SQL::Statement uses SQL::Parser, so the latter is also not an option.)

Second, many of us have worked in environments where we have problems to solve in production now, but we still have to wait three weeks to get the necessary modules installed, if we can get them approved at all.

The most important reason, though, is even if SQL::Statement could handle this problem, this would be an awfully short article if you used it instead of a lexer.

Lexing Basics

As mentioned earlier, lexing is essentially the task of analyzing data and breaking it down into a series of easy-to-use tokens. While the data may be in other forms, usually this means analyzing strings. To give a trivial example, consider the expression:

x = (3 + 2) / y

When lexed, you might get a series of tokens, such as:

my @tokens = (
  [ VAR => 'x' ],
  [ OP  => '=' ],
  [ OP  => '(' ],
  [ INT => '3' ],
  [ OP  => '+' ],
  [ INT => '2' ],
  [ OP  => ')' ],
  [ OP  => '/' ],
  [ VAR => 'y' ],
);

With a proper grammar, you could then read this series of tokens and take actions based upon their values, perhaps to build a simple language interpreter or translate this code into another programming language. Even without a grammar, you can find these tokens useful.

Identifying Tokens

The first step in building a lexer is identifying the tokens you wish to parse. Look again at the SQL.

select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
  else 'd'
  end tough_one
from ...
where ...

There's nothing to care about after the from keyword. Looking more closely, everything you do care about is immediately prior to a comma or the from keyword. However, splitting on commas isn't enough, as some commas are embedded in function parentheses.

The first thing to do is to identify the various things you can match with simple regular expressions.

These "things" appear to be parentheses, commas, operators, keywords, and random text. A first pass at it might look something like this:

my $lparen  = qr/\(/;
my $rparen  = qr/\)/;
my $keyword = qr/(?i:select|from|as)/; # this is all this problem needs
my $comma   = qr/,/;
my $text    = qr/(?:\w+|'\w+'|"\w+")/;
my $op      = qr{[-=+*/<>]};

The text matching is somewhat naive and you might want Regexp::Common for some of the regular expressions, but keep this simple for now.
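
For instance, here's a sketch of how Regexp::Common might firm up the $text pattern; the delimited pattern and the alternation here are my own guess at what the data requires, not part of the original solution:

use Regexp::Common qw(delimited);

# a bareword, or anything properly delimited by ' or ",
# including embedded spaces and escaped quotes:
my $text = qr/(?:\w+|$RE{delimited}{-delim=>q{'"}})/;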

The operators are a bit more involved. Assume that some SQL might have math statements embedded in it.

Now create the actual lexer. One way to do this is to make your own. It might look something like this:

sub lexer {
    my $sql = shift;
    return sub {
        LEXER: {
            return ['KEYWORD', $1] if $sql =~ /\G ($keyword) /gcx;
            return ['COMMA',   ''] if $sql =~ /\G ($comma)   /gcx;
            return ['OP',      $1] if $sql =~ /\G ($op)      /gcx;
            return ['PAREN',    1] if $sql =~ /\G $lparen    /gcx;
            return ['PAREN',   -1] if $sql =~ /\G $rparen    /gcx;
            return ['TEXT',    $1] if $sql =~ /\G ($text)    /gcx;
            redo LEXER             if $sql =~ /\G \s+        /gcx;
        }
    };
}

my $lexer = lexer($sql);

while (defined (my $token = $lexer->())) {
    # do something with the token
}

Without going into the detail of how that works, it's fair to say that this is not the best solution. By looking at the original Perlmonks post, you should find that you need to make two passes through the data to extract what you want. I've left the explanation as an exercise for the reader.
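
Still, the machinery is worth a quick aside: \G anchors each match to wherever the previous /g match on that string stopped, and the /c modifier preserves that position when a match fails, so the next pattern can try again from the same spot. A minimal illustration, with made-up data:

my $data = 'abc123';

$data =~ /\G([a-z]+)/gc and print "letters: $1\n";  # letters: abc
$data =~ /\G([a-z]+)/gc or  print "no more letters; /c kept pos() at 3\n";
$data =~ /\G(\d+)/gc    and print "digits: $1\n";   # digits: 123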

To make this simpler, use the HOP::Lexer module from the CPAN. This module, based on the lexer Mark Jason Dominus describes in his book Higher Order Perl, makes creating lexers a rather trivial task and makes them a bit more powerful than the example above. Here's the new code:

use HOP::Lexer 'make_lexer';
my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },
    [ 'KEYWORD', qr/(?i:select|from|as)/          ],
    [ 'COMMA',   qr/,/                            ],
    [ 'OP',      qr{[-=+*/]}                      ],
    [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
    [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
    [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
    [ 'SPACE',   qr/\s*/,     sub {}              ],
);

sub text {
    my ($label, $value) = @_;
    $value =~ s/^["']//;
    $value =~ s/["']$//;
    return [ $label, $value ];
}

This certainly doesn't look any easier to read, but bear with me.

The make_lexer subroutine takes as its first argument an iterator, which returns the text to match on every call. In this case, you only have one snippet of text to match, so merely shift it off of an array. If you were reading lines from a log file, the iterator would be quite handy.
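
For example, if the SQL arrived in a file, the iterator might look like this (the filename and the minimal token set are hypothetical, just to make the sketch complete):

use HOP::Lexer 'make_lexer';

open my $fh, '<', 'queries.sql' or die "Can't open queries.sql: $!";
my $lexer = make_lexer(
    sub { scalar <$fh> },       # each call hands the lexer one more line
    [ 'WORD',  qr/\w+/         ],
    [ 'SPACE', qr/\s+/, sub {} ],
);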

After the first argument comes a series of array references. Each reference takes two mandatory arguments and one optional argument:

[ $label, $pattern, $optional_subroutine ]

The $label is the name of the token. The pattern should match whatever the label identifies. The third argument, a subroutine reference, takes as arguments the label and the text the label matched, and returns whatever you wish for a token.

Most token definitions need only the two mandatory arguments:

[ 'KEYWORD', qr/(?i:select|from|as)/ ],

Here's an example of how to transform the data before making the token:

[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text  ],

As mentioned previously, the regular expression might be naive, but leave that for now and focus on the &text subroutine.

sub text {
    my ($label, $value) = @_;
    $value =~ s/^["']//;
    $value =~ s/["']$//;
    return [ $label, $value ];
}

This says, "Take the label and the value, strip leading and trailing quotes from the value and return them in an array reference."

To strip the white space you don't care about, simply return nothing:

[ 'SPACE',   qr/\s*/,     sub {}              ],

Now that you have your lexer, put it to work. Remember that column aliases are the TEXT tokens not in parentheses, but immediately prior to commas or the from keyword. How do you know whether you're inside parentheses? Cheat a little bit:

[ 'PAREN', qr/\(/, sub { [shift,  1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],

With that, you can add one whenever you reach an opening parenthesis and subtract one when you reach a closing one. Whenever the sum is zero, you know that you're outside of parentheses.

To get the tokens, call the $lexer iterator repeatedly.

while ( defined( my $token = $lexer->() ) ) { ... }

The tokens look like this:

[  'KEYWORD',      'select' ]
[  'TEXT',       'the_date' ]
[  'KEYWORD',          'as' ]
[  'TEXT',           'date' ]
[  'COMMA',             ',' ]
[  'TEXT',          'round' ]
[  'PAREN',               1 ]
[  'TEXT', 'months_between' ]
[  'PAREN',               1 ]

And so on.

Here's how to process the tokens:

 1:  my $inside_parens = 0;
 2:  while ( defined (my $token = $lexer->()) ) {
 3:      my ($label, $value) = @$token;
 4:      $inside_parens += $value if 'PAREN' eq $label;
 5:      next if $inside_parens || 'TEXT' ne $label;
 6:      if (defined (my $next = $lexer->('peek'))) {
 7:          my ($next_label, $next_value) = @$next;
 8:          if ('COMMA' eq $next_label) {
 9:              print "$value\n";
10:          }
11:          elsif ('KEYWORD' eq $next_label && 'from' eq $next_value) {
12:              print "$value\n";
13:              last; # we're done
14:          }
15:      }
16:  }

This is pretty straightforward, but there are some tricky bits. Each token is a two-element array reference, so line 3 makes the label and value fairly explicit. Lines 4 and 5 use the "cheat" for handling parentheses. Line 5 also skips anything that isn't text and therefore cannot be a column alias.

Line 6 is a bit odd. In HOP::Lexer, passing the string peek to the lexer will return the next token without actually advancing the $lexer iterator. From there, it's straightforward logic to find out if the value is a column alias that matches the criteria.
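
In other words (assuming at least one token remains):

my $upcoming = $lexer->('peek');  # look ahead; the iterator doesn't advance
my $same     = $lexer->();        # the very same token, now consumed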

Putting all of this together gives the complete program:

#!/usr/bin/perl

use strict;
use warnings;
use HOP::Lexer 'make_lexer';

my $sql = <<END_SQL;
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
  else 'd'
  end tough_one
from XXX
END_SQL

my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },
    [ 'KEYWORD', qr/(?i:select|from|as)/          ],
    [ 'COMMA',   qr/,/                            ],
    [ 'OP',      qr{[-=+*/]}                      ],
    [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
    [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
    [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
    [ 'SPACE',   qr/\s*/,     sub {}              ],
);

sub text {
    my ( $label, $value ) = @_;
    $value =~ s/^["']//;
    $value =~ s/["']$//;
    return [ $label, $value ];
}

my $inside_parens = 0;
while ( defined ( my $token = $lexer->() ) ) {
    my ( $label, $value ) = @$token;
    $inside_parens += $value if 'PAREN' eq $label;
    next if $inside_parens || 'TEXT' ne $label;
    if ( defined ( my $next = $lexer->('peek') ) ) {
        my ( $next_label, $next_value ) = @$next;
        if ( 'COMMA' eq $next_label ) {
            print "$value\n";
        }
        elsif ( 'KEYWORD' eq $next_label && 'from' eq $next_value ) {
            print "$value\n";
            last; # we're done
        }
    }
}

That prints out the column aliases:

date
months_old
product
year
tough_one

So are you done? No, probably not. What you really need now are many other examples of the SQL generated in the first problem statement. Maybe the &text subroutine is naive. Maybe there are other operators you forgot. Maybe there are floating-point numbers embedded in the SQL. When you have to lex data by hand, fine-tuning the lexer to match your actual data can take a few tries.
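
If floating-point numbers do turn up, for instance, you might reach for Regexp::Common and add token definitions along these lines (a hypothetical extension; the NUMBER entry would need to go before TEXT so that numbers aren't swallowed as barewords):

use Regexp::Common qw(number);

[ 'NUMBER', qr/$RE{num}{real}/ ],   # 3.14, -2.5e10, and so on
[ 'OP',     qr{[-=+*/<>]}      ],   # the wider operator set from earlier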

It's also important to note that precedence matters here. &make_lexer evaluates the array references in the order it receives them. If you passed the TEXT array reference before the KEYWORD array reference, the TEXT regular expression would match keywords before KEYWORD could, generating spurious results.
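
Here's a small, self-contained demonstration of what goes wrong, using a stripped-down token set of my own:

use HOP::Lexer 'make_lexer';

my @in  = ('select foo');
my $bad = make_lexer(
    sub { shift @in },
    [ 'TEXT',    qr/\w+/                 ],  # too eager: matches keywords too
    [ 'KEYWORD', qr/(?i:select|from|as)/ ],  # never gets a chance
    [ 'SPACE',   qr/\s+/, sub {}         ],
);

while ( defined( my $token = $bad->() ) ) {
    print "@$token\n";   # prints "TEXT select", then "TEXT foo"
}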

Happy lexing!

Why Review Code?

Richard Gabriel of Sun Microsystems suggests that beginning programmers should study the source code for great works of software as part of a Master of Fine Arts course in software. Prominent Extreme Programmers talk about finding the Quality Without a Name in software, listening to your code, and reworking your code when it "smells."

With this in mind, this article contains a review of one of the "base" Perl modules, available from CPAN but also included as part of any distribution of Perl. It's a module that I particularly like (with algorithms that I can more or less understand); I aim to show why I like it and what you can learn from it.

Code and Communication

Reviewing (even reading) source code is not always easy. Most software doesn't have human readability as a primary goal; it only has to satisfy a compiler or interpreter. The exception is software written as an example or to help others learn, which is primarily of interest in itself and not for what it does. As far as a computer cares, software would behave the same with no comments and with every variable named a1, a2, and so on, as the obfuscated programming contests prove. They also prove that we humans then find it very difficult to read. Most literature (including poems, essays, short stories, novels), on the other hand, has the primary purpose of communicating with other people, and so is often easier to understand.

Nonetheless, code reviews are still useful. Projects that perform them usually do so to find bugs or to suggest improvements; they exist for the benefit of the code writer or tester. Here, we seek to learn from the code, to take away from it lessons that we can apply to our own code. These are lessons, or patterns, but not design patterns; the patterns discussed here appear at a lower level and are more of the order of tips and tricks.

As mentioned above, there are parts of software programs that do exist to communicate: comments as well as variable and function (or method) names are all about communicating (with other people). The challenge for the considerate [1] software developer (who thinks beyond the short term) is to use comments and names to tell the story of the software: what it is, what it should do, and why the developers made certain choices. Ideally, the form of software should match its function [2]: comments and names should explain and clarify the rest of the code, rather than disagreeing with it, repeating what it says, or being irrelevant. Source code reviews are part of this process. They further explain the code to a wider audience, telling its story in a broader and hopefully deeper way.

The Review Itself

Math::Complex is an open source Perl package that is part of the core Perl distribution. It provides methods for complex arithmetic (and overloading of operators).

Raphael Manfredi created the module in 1996. Since then, first Jarkko Hietaniemi and, currently, Daniel S. Lewart have maintained it (according to its own comments). My comments below relate to version 1.34.

I should explain that while I've been using Perl for maybe three years, it's mostly been for test automation and text processing; my day-to-day programming is primarily in C. I tend to notice the things about Perl that are difficult (at times verging on impossible) to do in C.

Early on in the Math::Complex package, a huge (and often reused) regular expression $gre appears:

# Regular expression for floating point numbers.
my $gre =
qr'\s*([\+\-]?(?:(?:(?:\d+(?:_\d+)*(?:\.\d*(?:_\d+)*)?|\.\d+(?:_\d+)*)(?:[eE][\+\-]?\d+(?:_\d+)*)?)))';

This regular expression captures a floating-point number (which may include isolated underscores) in a single capture. It thus provides flexibility and robustness (key motivations for using regular expressions) without repetition. It also uses ?: within the parentheses to cluster parts of the regular expression without creating back references; this means that the whole regular expression provides one back reference rather than five or six.
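
The difference is easy to see on a small scale, with a toy pattern of my own (not from the module):

'3.14' =~ /((?:\d+)(?:\.\d+)?)/;  # clustered: only $1 is set, to '3.14'
'3.14' =~ /((\d+)(\.\d+)?)/;      # capturing: $1 = '3.14', $2 = '3', $3 = '.14'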

This regular expression may be easier to understand when refactored into smaller chunks. The test I used when refactoring was quite simple:

my @testexpressions = ('100', '100.12', '100.12e-13', '100_000.545_123',
	'1e-3', '.1', '-100_000.545_123e+11_12');
foreach my $expr (@testexpressions)
{
   if ($expr =~ m/$gre/)
   {
      print "$expr matches gre\n";
   }
}

The same test code can run against the refactored regular expression to make sure that it still matches the same strings. Here, working from the inside out, I extracted some common expressions. For each expression extracted and captured by the quote-like operator qr, I also removed the ?: clustering. $gre above is equivalent to $newgre below:

my $underscoredigits = qr/_\d+/;
my $digitstring      = qr/\d+$underscoredigits*/;
my $fractional       = qr/\.\d*$underscoredigits*/;
my $mantissa         = qr/$digitstring$fractional?|\.$digitstring/;
my $exponent         = qr/[eE][\+\-]?$digitstring/;
my $newgre           = qr/\s*([\+\-]?$mantissa$exponent?)/;

For example, $underscoredigits matches _123; $digitstring matches 545_123; $fractional matches .545_123; $mantissa matches either 100_000.545_123 or .1; $exponent matches e+11_12; and finally $newgre matches the whole string -100_000.545_123e+11_12.

The construction method make is, again, flexible and robust. The author has extracted part of its complexity to different methods to avoid repetition. For example, the module uses the _cannot_make function internally to report errors, calling it from several places in make. It looks like:

# Die on bad *make() arguments.

sub _cannot_make {
    die "@{[(caller(1))[3]]}: Cannot take $_[0] of $_[1].\n";
}

In turn, it calls the built-in caller function to refer back to the original code (similar to an assert in C, or a lightweight version of the cluck or confess functions in the Carp package).
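
Here's a tiny standalone sketch of that idiom, with hypothetical subroutine names; (caller(1))[3] names the subroutine one frame up the call stack, so the message points at the caller rather than at the reporting function itself:

sub make { _cannot_make('root', 'a negative number') }

sub _cannot_make {
    die "@{[(caller(1))[3]]}: Cannot take $_[0] of $_[1].\n";
}

make();   # dies with: main::make: Cannot take root of a negative number.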

make also calls the _remake function if make receives only one argument (for example, a string such as "1 + 2i") and must deduce the real and/or imaginary parts:

sub _remake {
    my $arg = shift;
    my ($made, $p, $q);

    if ($arg =~ /^(?:$gre)?$gre\s*i\s*$/) {
        ($p, $q) = ($1 || 0, $2);
        $made = 'cart';
    } elsif ($arg =~ /^\s*\[\s*$gre\s*(?:,\s*$gre\s*)?\]\s*$/) {
        ($p, $q) = ($1, $2 || 0);
        $made = 'exp';
    }

    if ($made) {
        $p =~ s/^\+//;
        $q =~ s/^\+//;
    }

    return ($made, $p, $q);
}

The regular expression $gre appears here as part of a larger regular expression, to interpret not only "1 + 2i" but also "2i" by itself. The expression $1 || 0, used here and elsewhere, replaces Perl's undef value (which is false in a Boolean context) with 0, while leaving other captured values unchanged.

The plus function, one of the several binary operators provided in the package, also has some interesting features:

#
# (plus)
#
# Computes z1+z2.
#
sub plus {
        my ($z1, $z2, $regular) = @_;
        my ($re1, $im1) = @{$z1->cartesian};
        $z2 = cplx($z2) unless ref $z2;
        my ($re2, $im2) = ref $z2 ? @{$z2->cartesian} : ($z2, 0);
        unless (defined $regular) {
                $z1->set_cartesian([$re1 + $re2, $im1 + $im2]);
                return $z1;
        }
        return (ref $z1)->make($re1 + $re2, $im1 + $im2);
}

It uses the ternary conditional operator ?:, which is also present in C. In Perl, this operator can return not just scalars but also lists. Thus the code to calculate the values of $re2 and $im2 is much more compact than the equivalent code in C could be. This code uses the Cartesian coordinates of $z2 if it's already a complex number. Otherwise, it turns a real number into a complex number.

The plus function later uses ref $z1, the package name of $z1, to create the sum of $z1 and $z2; this allows subclasses of Math::Complex to reuse exactly the same function.

Finally, the cartesian method mentioned above can either return existing values for the real and imaginary part, if these are "clean" (valid), or recalculate them from the polar form. Each complex number object stores the "cleanliness" (validity) of its Cartesian values, as follows:

sub cartesian {$_[0]->{c_dirty} ?
        $_[0]->update_cartesian : $_[0]->{'cartesian'}}

This is a neat trick to avoid recalculating Cartesian coordinates when it's not necessary.
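
Reduced to its essentials, the pattern looks something like this (the field and method names here are mine, not the module's):

sub value {
    my $self = shift;
    # recompute only when the cached value has been invalidated:
    return $self->{dirty} ? $self->recompute : $self->{cached};
}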

Summary

The Math::Complex package is not only useful, but efficient, robust, and flexible. It does use brief variable names, but that is traditional in mathematics; given the complicated (or at least long) expressions the module builds, brief names keep the full expressions easy to understand (or even to read aloud). The functions themselves also tend to be brief and easy to understand.

We see here the benefits of reuse and refactoring. Over 8 years and 34 versions of this code, it has no doubt seen heavy rewriting, to the point of perhaps each line being different from the corresponding line in version 1.1 (or 0.1!). This extended refactoring has removed unnecessary and repeated code, clarified comments and usage, and led to clear and clean code. It provides an example not only of how to write a mathematical package in Perl, using regular expressions and references as described above, but also of what code can look like. That is perhaps its best lesson.

Endnotes

[1] "Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live." (Martin Golding)

[2] Though perhaps not always to this extent.

Maintaining Regular Expressions

For some, regular expressions provide the chainsaw functionality of the much-touted Perl "Swiss Army knife" metaphor. They are powerful, fast, and very sharp, but like real chainsaws, can be dangerous when used without appropriate safety measures.

In this article I'll discuss the issues associated with using heavy-duty, contractor-grade regular expressions, and demonstrate a few maintenance techniques to keep these chainsaws in proper condition for safe and effective long-term use.

Readability: Whitespace and Comments

Before getting into any deep issues, I want to cover the number one rule of shop safety: use whitespace to format your regular expressions. Most of us already honor this wisdom in our various coding styles (though perhaps not with the zeal of Python developers). But more of us could make better, judicious use of whitespace in our regular expressions, via the /x modifier. Not only does it improve readability, but it also allows us to add meaningful, explanatory comments. For example, this simple regular expression:

# matching "foobar" is critical here ...
  $_ =~ m/foobar/;

Could be rewritten, using a trailing /x modifier, as:

$_ =~ m/ foobar    # matching "foobar" is critical here ...
         /x;

Now, in this example you might argue that readability wasn't improved at all; I guess that's the problem with triviality. Here's another, slightly less trivial example that also illustrates the need to escape literal whitespace and comment characters when using the /x modifier:

$_ =~ m/^                         # anchor at beginning of line
          The\ quick\ (\w+)\ fox    # fox adjective
          \ (\w+)\ over             # fox action verb
          \ the\ (\w+)\ dog         # dog adjective
          (?:                       # whitespace-trimmed comment:
            \s* \# \s*              #   whitespace and comment token
            (.*?)                   #   captured comment text; non-greedy!
            \s*                     #   any trailing whitespace
          )?                        # this is all optional
          $                         # end of line anchor
         /x;                        # allow whitespace

This regular expression successfully matches the following lines of input:

The quick brown fox jumped over the lazy dog
The quick red fox bounded over the sleeping dog
The quick black fox slavered over the dead dog   # a bit macabre, no?

While embedding meaningful explanatory comments in your regular expressions can only help readability and maintenance, many of us don't like the plethora of backslashed spaces made necessary by the "global" /x modifier. Enter the "locally" acting (?#) and (?x:) embedded modifiers:

$_ =~ m/^(?#                      # anchor at beginning of line

          )The quick (\w+) fox (?#  # fox adjective
          )(\w+) over (?#           # fox action verb
          )the (\w+) dog(?x:        # dog adjective
                                    # optional, trimmed comment:
            \s*                     #   leading whitespace
            \# \s* (.*?)            #   comment text
            \s*                     #   trailing whitespace

          )?$(?#                    # end of line anchor
          )/;

In this case, the (?#) embedded modifier was used to introduce our commentary between each set of whitespace-sensitive textual components; the non-capturing parentheses construct (?:) used for the optional comment text was also altered to include a locally-acting x modifier. No backslashing was necessary, but it's a bit harder to quickly distinguish relevant whitespace. To each their own, YMMV, TIMTOWTDI, etc.; the fact is, both commented examples are probably easier to maintain than:

# match the fox adjective and action verb, then the dog adjective,
  # and any optional, whitespace-trimmed commentary:
  $_ =~ m/^The quick (\w+) fox (\w+) over the (\w+) dog(?:\s*#\s*(.*?)\s*)?$/;

This example, while well-commented and clear at first, quickly deteriorates into the nearly unreadable "line noise" that gives Perl programmers a bad name and makes later maintenance difficult.

So, as in other programming languages, use whitespace formatting and commenting as appropriate, or maybe even when it seems like overkill; it can't hurt. And like the choice between alternative code indentation and bracing styles, Perl regular expressions allow a few different options (global /x modifier, local (?#) and (?x:) embedded modifiers) to suit your particular aesthetics.

Capturing Parentheses: Taming the Jungle

Most of us use regular expressions to actually do something with the parsed text (although simply confirming that the input matches is important, too). Assigning the captured text from the previous example is relatively easy: the first three capturing parentheses are visually distinct and can be clearly numbered $1, $2, and $3. However, the non-capturing parentheses that provide the optional commentary themselves contain another set of embedded, capturing parentheses. Here's another rewriting of the example, with slightly less whitespace formatting:

my ($fox, $verb, $dog, $comment);
  if ( $_ =~ m/^                         # anchor at beginning of line
               The\ quick\ (\w+)\ fox    # fox adjective
               \ (\w+)\ over             # fox action verb
               \ the\ (\w+)\ dog         # dog adjective
               (?:\s* \# \s* (.*?) \s*)? # an optional, trimmed comment
               $                         # end of line anchor
              /x
     ) {
      ($fox, $verb, $dog, $comment) = ($1, $2, $3, $4);
  }

From a quick glance at this code, can you immediately tell whether the $comment variable will come from $4 or $5? Will it include the leading # comment character? If you are a practiced regular expression programmer, you probably can answer these questions without difficulty, at least for this fairly trivial example. But if we could make this example even clearer, you will hopefully agree that similarly clarifying some of your more gnarly regular expressions would be beneficial in the long run.

When regular expressions grow very large, or include more than three pairs of parentheses (capturing or otherwise), a useful clarifying technique is to embed the capturing assignments directly within the regular expression, via the code-executing pattern (?{}). In the embedded code, the special $^N variable, which holds the contents of the last parenthetical capture, is used to "inline" any variable assignments; our previous example turns into this:

my ($fox, $verb, $dog, $comment);
  $_ =~ m/^                               # anchor at beginning of line
          The\ quick\  (\w+)              # fox adjective
                       (?{ $fox  = $^N }) 
          \ fox\       (\w+)              # fox action verb
                       (?{ $verb = $^N })
          \ over\ the\ (\w+)              # dog adjective
                       (?{ $dog  = $^N })
          \ dog
                                          # optional trimmed comment
            (?:\s* \# \s*                 #   leading whitespace
            (.*?)                         #   comment text
            (?{ $comment = $^N })
            \s*)?                         #   trailing whitespace
          $                               # end of line anchor
         /x;                              # allow whitespace

Now it should be explicitly clear that the $comment variable will only contain the whitespace-trimmed commentary following (but not including) the # character. We also don't have to worry about numbered variables $1, $2, $3, etc. anymore, since we don't make use of them. This regular expression can be easily extended to capture other text without rearranging variable assignments.

Repeated Execution

There are a few caveats to using this technique, however; note that code within (?{}) constructs is executed immediately as the regular expression engine incorporates it into a match. That is, if the engine backtracks off a parenthetical capture to generate a successful match that does not include that capture, the associated (?{}) code will have already been executed. To illustrate, let's again look at just the capturing pattern for the comment text (.*?), and let's also add a debugging warn ">>$comment<<\n" statement:

# optional trimmed comment
            (?:\s* \# \s*               #   leading whitespace
            (.*?) (?{ $comment = $^N;   #   comment text
                      warn ">>$comment<<\n"
                        if $debug;
                    })
            \s*)?                       #   trailing whitespace
          $                             # end of line anchor

The capturing (.*?) pattern is non-greedy, so the regular expression matching engine constantly tries to finish the match (looking for any trailing whitespace and the end-of-line anchor, $) before extending the .*? pattern any further. The upshot of all this is that with debugging turned on, this input text:

The quick black fox slavered over the dead dog # a bit macabre, no?

Will lead to these debugging statements:

>><<
>>a<<
>>a <<
>>a b<<
>>a bi<<
>>a bit<<
>>a bit <<
>>a bit m<<
[ ... ]
>>a bit macabre, n<<
>>a bit macabre, no<<
>>a bit macabre, no?<<

In other words, the adjacent embedded (?{}) code gets executed every time the matching engine "uses" it while trying to complete the match; because the matching engine may "backtrack" to try many alternatives, the embedded code will also be executed as many times.
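
You can watch this in miniature with a one-liner of my own devising; the embedded code fires once for every attempt the engine makes:

"abc" =~ / (.*?) (?{ print ">>$1<<\n" }) c /x;

# prints:
# >><<
# >>a<<
# >>ab<<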

This multiple execution behavior does raise a few concerns. If the embedded code is only performing assignments, via $^N, there doesn't seem at first to be much of a problem, because each successive execution overrides any previous assignments, and only the final, successful execution matters, right? However, what if the input text had instead been:

The quick black fox slavered over the dead doggie # a bit macabre, no?

This text should fail to match the regular expression overall (since "doggie" won't match "dog"), and it does. But, because the embedded (?{}) code chunks are executed as the match is evaluated, the $fox, $verb and $dog variables are successfully assigned; the match doesn't fail until "doggie" is seen. Our program might now be more readable and maintainable, but we've also subtly altered the behavior of the program.

The second problem is one of performance; what if our assignment code hadn't simply copied $^N into a variable, but had instead executed a remote database update? Repeatedly hitting the database with meaningless updates may be crippling and inefficient. However, the behavioral aspects of the database example are even more frightening: what if the match failed overall, but our updates had already been executed? Imagine that instead of an update operation, our code triggered a new row insert for the comment, inserting multiple, incorrect comment rows!

Deferred Execution

Luckily, Perl's ability to introduce "locally scoped" variables provides a mechanism to "defer" code execution until an overall successful match is accomplished. As the regular expression matching engine tries alternative matches, it introduces a new, nested scope for each (?{}) block, and, more importantly, it exits a local scope if a particular match is abandoned for another. If we were to write out the code executed by the matching engine as it moved (and backtracked) through our input, it might look like this:

{ # introduce new scope
  $fox = $^N;
  { # introduce new scope
    $verb = $^N;
    { # introduce new scope
      $dog = $^N;
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match

      # ...

      { # introduce new scope
        $comment = $^N;
      } # close scope: successful overall match
    } # close scope: successful overall match
  } # close scope: successful overall match
} # close scope: successful overall match

We can use this block-scoping behavior to solve both our altered behavior and performance issues. Instead of executing code immediately within each block, we'll cleverly "bundle" the code up, save it away on a locally scoped "stack," and only process the code if and when we get to the end of a successful match:

my ($fox, $verb, $dog, $comment);
  $_ =~ m/(?{
              local @c = ();            # provide storage "stack"
          })
          ^                             # anchor at beginning of line
          The\ quick\  (\w+)            # fox adjective
                       (?{
                           local @c;
                           push @c, sub {
                               $fox = $^N;
                           };
                       })
          \ fox\       (\w+)            # fox action verb
                       (?{
                           local @c = @c;
                           push @c, sub {
                               $verb = $^N;
                           };
                       })
          \ over\ the\ (\w+)            # dog adjective
                       (?{
                           local @c = @c;
                           push @c, sub {
                               $dog = $^N;
                           };
                       })
          \ dog
                                        # optional trimmed comment
            (?:\s* \# \s*               #   leading whitespace
            (.*?)                       #   comment text
            (?{
                local @c = @c;
                push @c, sub {
                    $comment = $^N;
                    warn ">>$comment<<\n"
                      if $debug;
                };
            })
            \s*)?                       #   trailing whitespace
          $                             # end of line anchor
          (?{
              for (@c) { &$_; }         # execute the deferred code
          })
         /x;                            # allow whitespace

Using subroutine "closures" to package up our code and save them on a locally defined stack, @c, allows us to defer any processing until the very end of a successful match. Here's the matching engine code execution "path":

{ # introduce new scope

  local @c = (); # provide storage "stack"

  { # introduce new scope

    local @c;
    push @c, sub { $fox = $^N; };

    { # introduce new scope

      local @c = @c;
      push @c, sub { $verb = $^N; };

      { # introduce new scope

        local @c = @c;
        push @c, sub { $dog = $^N; };

        { # introduce new scope

          local @c = @c;
          push @c, sub { $comment = $^N; };

        } # close scope; lose changes to @c

        { # introduce new scope

          local @c = @c;
          push @c, sub { $comment = $^N; };

        } # close scope; lose changes to @c

        # ...

        { # introduce new scope

          local @c = @c;
          push @c, sub { $comment = $^N; };

          { # introduce new scope

            for (@c) { &$_; }

          } # close scope

        } # close scope; lose changes to @c
      } # close scope; lose changes to @c
    } # close scope; lose changes to @c
  } # close scope; lose changes to @c
} # close scope; no more @c at all

This last technique is especially wordy; however, given judicious use of whitespace and well-aligned formatting, this idiom could ease the maintenance of long, complicated regular expressions.

But, more importantly, it doesn't work as written. What!?! Why? Well, it turns out that Perl's support for code blocks inside (?{}) constructs doesn't support subroutine closures (even attempting to compile one causes a core dump). But don't worry, all is not lost! Since this is Perl, we can always take things a step further, and make the hard things easy ...

Making it Actually Work: use Regexp::DeferredExecution

Though we cannot (yet) compile subroutines within (?{}) constructs, we can manipulate all the other types of Perl variables: scalars, arrays, and hashes. So instead of using closures:

m/
    (?{ local @c = (); })
    # ...
    (?{ local @c = @c; push @c, sub { $comment = $^N; } })
    # ...
    (?{ for (@c) { &$_; } })
   /x

We can instead just package up our $comment = $^N code into a string, to be executed by an eval statement later:

m/
    (?{ local @c = (); })
    # ...
    (?{ local @c = @c; push @c, [ $^N, q{ $comment = $^N; } ] })
    # ...
    (?{ for (@c) { $^N = $$_[0]; eval $$_[1]; } })
   /x

Note that we also had to store away the version of $^N that was active at the time of the (?{}) pattern, because it very likely will have changed by the end of the match. We didn't need to do this previously, as we were storing closures that efficiently captured all the local context of the code to be executed.

Well, now this is getting really wordy, and downright ugly to be honest. However, through the magic of Perl's overloading mechanism, we can avoid having to see any of that ugliness, by simply using the Regexp::DeferredExecution module from CPAN:

use Regexp::DeferredExecution;

  my ($fox, $verb, $dog, $comment);
  $_ =~ m/^                               # anchor at beginning of line
          The\ quick\  (\w+)              # fox adjective
                       (?{ $fox  = $^N }) 
          \ fox\       (\w+)              # fox action verb
                       (?{ $verb = $^N })
          \ over\ the\ (\w+)              # dog adjective
                       (?{ $dog  = $^N })
          \ dog
                                          # optional trimmed comment
            (?:\s* \# \s*                 #   leading whitespace
            (.*?)
            (?{ $comment = $^N })         #   comment text
            \s*)?                         #   trailing whitespace
          $                               # end of line anchor
         /x;                              # allow whitespace

How does the Regexp::DeferredExecution module perform its magic? Carefully, of course, but also simply; it just makes the same alterations to regular expressions that we made manually:

  • An initiating embedded code pattern is prepended, declaring local "stack" storage.
  • Another embedded code pattern is appended at the end of the expression, executing any code found on the stack. (The stack itself is stored in @Regexp::DeferredExecution::c, so you shouldn't need to worry about variable name collisions with your own code.)
  • Finally, any (?{}) constructs seen in your regular expression are saved away onto a local copy of the stack for later execution.

It looks a little like this:

package Regexp::DeferredExecution;

use Text::Balanced qw(extract_multiple extract_codeblock);

use overload;

sub import { overload::constant 'qr' => \&convert; }
sub unimport { overload::remove_constant 'qr'; }

sub convert {

  my $re = shift; 

  # no need to convert regexp's without (?{ <code> }):
  return $re unless $re =~ m/\(\?\{/;

  my @chunks = extract_multiple($re,
                                [ qr/\(\?  # '(?' (escaped)
                                     (?={) # followed by '{' (lookahead)
                                    /x,
                                  \&extract_codeblock
                                ]
                               );

  for (my $i = 1 ; $i < @chunks ; $i++) {
    if ($chunks[$i-1] eq "(?") {
      # save the code as a string, with the current $^N, onto the stack:
      $chunks[$i] =~
        s/\A{ (.*) }\Z/{
          local \@Regexp::DeferredExecution::c = \@Regexp::DeferredExecution::c;
          push \@Regexp::DeferredExecution::c, [\$^N, q{$1}];
        }/msx;
    }
  }

  $re = join("", @chunks);

  # install the stack storage and execution code:
  $re = "(?{
            local \@Regexp::DeferredExecution::c = (); # the stack
         })$re(?{
            for (\@Regexp::DeferredExecution::c) {
              \$^N = \$\$_[0];  # reinstate \$^N
              eval \$\$_[1];    # execute the code
            }
         })";

  return $re;
}

1;

One caveat of Regexp::DeferredExecution use is that while execution will occur only once per compiled regular expression, the ability to embed regular expressions inside other regular expressions can circumvent this behavior:

use Regexp::DeferredExecution;

  # the quintessential foobar/foobaz parser:
  $re = qr/foo
           (?:
              bar (?{ warn "saw bar!\n"; })
              |
              baz (?{ warn "saw baz!\n"; })
           )?/x;

  # someone's getting silly now:
  $re2 = qr/ $re
             baroo!
             (?{ warn "saw foobarbaroo! (or, foobazbaroo!)\n"; })
           /x;

  "foobar" =~ /$re2/;

  __END__
  "saw bar!"
Even though the input text to $re2 failed to match, the deferred code from $re was executed because its pattern did match successfully. Therefore, Regexp::DeferredExecution should only be used with "constant" regular expressions; there is currently no way to overload dynamic, "interpolated" regular expressions.

See Also

The Regexp::Fields module provides a much more compact shorthand for embedded named variable assignments, (?<varname> pattern), such that our example becomes:

use Regexp::Fields qw(my);

  my $rx =
    qr/^                             # anchor at beginning of line
       The\ quick\ (?<fox> \w+)\ fox # fox adjective
       \ (?<verb> \w+)\ over         # fox action verb
       \ the\ (?<dog> \w+)\ dog      # dog adjective
       (?:\s* \# \s*
          (?<comment> .*?)
       \s*)? # an optional, trimmed comment
       $                             # end of line anchor
      /x;

Note that in this particular example, the my $rx compilation stanza actually implicitly declared $fox, $verb etc. If variable assignment is all you're ever doing, Regexp::Fields is all you'll need. If you want to embed more generic code fragments in your regular expressions, Regexp::DeferredExecution may be your ticket.
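
Assuming the implicitly declared variables fill in on a successful match, as that mechanism implies, usage would look something like this:

if ('The quick red fox bounded over the sleeping dog' =~ $rx) {
    print "$fox $verb $dog\n";   # red bounded sleeping
}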

And finally, because in Perl there is always One More Way To Do It, I'll also demonstrate Regexp::English, a module that allows you to use regular expressions without actually writing any regular expressions:

use Regexp::English;

  my ($fox, $verb, $dog, $comment);

  my $rx = Regexp::English->new
               -> start_of_line
               -> literal('The quick ')

               -> remember(\$fox)
                   -> word_chars
               -> end

               -> literal(' fox ')

               -> remember(\$verb)
                   -> word_chars
               -> end

               -> literal(' over the ')

               -> remember(\$dog)
                   -> word_chars
               -> end

               -> literal(' dog')

               -> optional
                   -> zero_or_more -> whitespace_char -> end
                   -> literal('#')
                   -> zero_or_more -> whitespace_char -> end

                   -> remember(\$comment)
                       -> minimal
                           -> multiple
                               -> word_char
                               -> or
                               -> whitespace_char
                           -> end
                       -> end
                   -> end
                   -> zero_or_more -> whitespace_char -> end
               -> end

               -> end_of_line;

  $rx->match($_);

I must admit that this last example appeals to my inner-Lispish self.

Hopefully you've gleaned a few tips and tricks from this little workshop of mine that you can take back to your own shop.
