Maintaining Regular Expressions
by Aaron Mackey, January 16, 2004
For some, regular expressions provide the chainsaw functionality of the much-touted Perl "Swiss Army knife" metaphor. They are powerful, fast, and very sharp, but like real chainsaws, can be dangerous when used without appropriate safety measures.
In this article I'll discuss the issues associated with using heavy-duty, contractor-grade regular expressions, and demonstrate a few maintenance techniques to keep these chainsaws in proper condition for safe and effective long-term use.
Readability: Whitespace and Comments
Before getting into any deep issues, I want to cover the number one
rule of shop safety: use whitespace to format your regular
expressions. Most of us already honor this wisdom in our various
coding styles (though perhaps not with the zeal of Python developers).
But more of us could make better, judicious use of whitespace in our
regular expressions, via the /x modifier. Not only does
it improve readability, but it also allows us to add meaningful, explanatory
comments. For example, this simple regular expression:
# matching "foobar" is critical here ...
$_ =~ m/foobar/;
Could be rewritten, using a trailing /x modifier,
as:
$_ =~ m/ foobar # matching "foobar" is critical here ...
/x;
Now, in this example you might argue that readability wasn't
improved at all; I guess that's the problem with triviality. Here's
another, slightly less trivial example that also illustrates the need
to escape literal whitespace and comment characters when using
the /x modifier:
$_ =~ m/^ # anchor at beginning of line
The\ quick\ (\w+)\ fox # fox adjective
\ (\w+)\ over # fox action verb
\ the\ (\w+)\ dog # dog adjective
(?: # whitespace-trimmed comment:
\s* \# \s* # whitespace and comment token
(.*?) # captured comment text; non-greedy!
\s* # any trailing whitespace
)? # this is all optional
$ # end of line anchor
/x; # allow whitespace
This regular expression successfully matches the following lines of input:
The quick brown fox jumped over the lazy dog
The quick red fox bounded over the sleeping dog
The quick black fox slavered over the dead dog # a bit macabre, no?
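If you'd like to check this yourself, here is a small, self-contained sketch that runs the three sample lines through the same pattern. The $re, @lines and @results names are mine, chosen only for illustration; note that every literal space is escaped, including the one before "dog", since /x would otherwise discard it.

```perl
use strict;
use warnings;

# Same pattern as above, stored with qr so it can be reused.
my $re = qr/^
    The\ quick\ (\w+)\ fox     # fox adjective
    \ (\w+)\ over              # fox action verb
    \ the\ (\w+)\ dog          # dog adjective
    (?:                        # whitespace-trimmed comment:
        \s* \# \s*             #   whitespace and comment token
        (.*?)                  #   captured comment text; non-greedy
        \s*                    #   any trailing whitespace
    )?                         # this is all optional
    $                          # end of line anchor
/x;

my @lines = (
    'The quick brown fox jumped over the lazy dog',
    'The quick red fox bounded over the sleeping dog',
    'The quick black fox slavered over the dead dog # a bit macabre, no?',
);

# One array ref of captures per line (empty if the line did not match).
my @results = map { [ $_ =~ $re ] } @lines;

for my $i (0 .. $#lines) {
    my ($fox, $verb, $dog, $comment) = @{ $results[$i] };
    printf "%-8s %-9s %-9s %s\n",
        $fox, $verb, $dog, defined $comment ? $comment : '(no comment)';
}
```

The fourth capture is undefined for the first two lines, because the optional comment group never takes part in those matches.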
While embedding meaningful explanatory comments in your regular
expressions can only help readability and maintenance, many of us
don't like the plethora of backslashed spaces made necessary by the
"global" /x modifier. Enter the "locally"
acting (?#) embedded comment and (?x:) embedded
modifier:
$_ =~ m/^(?# # anchor at beginning of line
)The quick (\w+) fox (?# # fox adjective
)(\w+) over (?# # fox action verb
)the (\w+) dog(?x: # dog adjective
# optional, trimmed comment:
\s* # leading whitespace
\# \s* (.*?) # comment text
\s* # trailing whitespace
)?$(?# # end of line anchor
)/;
In this case, the (?#) embedded comment construct was used to
introduce our commentary between each set of whitespace-sensitive
textual components; the non-capturing parentheses
construct (?:) used for the optional comment text was
also altered to include a locally-acting x modifier. No
backslashing was necessary, but it's a bit harder to quickly
distinguish relevant whitespace. To each their own, YMMV, TIMTOWTDI,
etc.; the fact is, both commented examples are probably easier to
maintain than:
# match the fox adjective and action verb, then the dog adjective,
# and any optional, whitespace-trimmed commentary:
$_ =~ m/^The quick (\w+) fox (\w+) over the (\w+) dog(?:\s*#\s*(.*?)\s*)?$/;
This example, while well-commented and clear at first, quickly deteriorates into the nearly unreadable "line noise" that gives Perl programmers a bad name and makes later maintenance difficult.
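As a sanity check, the sketch below compares the three styles on a few inputs and confirms that they capture the same things. The captures() helper and the variable names are hypothetical, introduced only for this comparison, and the in-pattern comments are abbreviated.

```perl
use strict;
use warnings;

# Style 1: one global x modifier, literal spaces escaped.
my $global_x = qr/^
    The\ quick\ (\w+)\ fox \ (\w+)\ over \ the\ (\w+)\ dog
    (?: \s* \# \s* (.*?) \s* )?
    $
/x;

# Style 2: embedded comments plus a locally acting x modifier.
my $local_x = qr/^(?# anchor at beginning of line
    )The quick (\w+) fox (?# fox adjective
    )(\w+) over (?# fox action verb
    )the (\w+) dog(?x:        # dog adjective
        \s* \# \s* (.*?) \s*  # optional, trimmed comment
    )?$/;

# Style 3: compact and uncommented.
my $compact = qr/^The quick (\w+) fox (\w+) over the (\w+) dog(?:\s*#\s*(.*?)\s*)?$/;

my @inputs = (
    'The quick brown fox jumped over the lazy dog',
    'The quick black fox slavered over the dead dog # a bit macabre, no?',
    'The quick black fox slavered over the dead doggie',
);

# Join the captures into one string so results are easy to compare;
# a failed match yields the empty string.
sub captures {
    my ($re, $text) = @_;
    my @got = $text =~ $re;
    return join '|', map { defined $_ ? $_ : '' } @got;
}

my $all_agree = 1;
for my $text (@inputs) {
    my %seen;
    $seen{ captures($_, $text) } = 1 for $global_x, $local_x, $compact;
    $all_agree = 0 if keys(%seen) != 1;
}
print $all_agree ? "all three styles agree\n" : "styles disagree\n";
```

Which style to use is therefore purely a question of what the next maintainer will find easiest to read.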
So, as in other programming languages, use whitespace formatting
and commenting as appropriate, or maybe even when it seems like
overkill; it can't hurt. And like the choice between alternative code
indentation and bracing styles, Perl regular expressions allow a few
different options (global /x modifier,
local (?#) comments and (?x:) embedded modifiers) to
suit your particular aesthetics.
Capturing Parentheses: Taming the Jungle
Most of us use regular expressions to actually do something with
the parsed text (although whether the input matches the
expression at all is also important). Assigning the captured text from the
previous example is relatively easy: the first three capturing
parentheses are visually distinct and can be clearly
numbered $1, $2 and $3.
However, the extra set of non-capturing parentheses, which provides the
optional commentary, itself contains another set of embedded,
capturing parentheses. Here's another rewriting of the example, with
slightly less whitespace formatting:
my ($fox, $verb, $dog, $comment);
if ( $_ =~ m/^ # anchor at beginning of line
The\ quick\ (\w+)\ fox # fox adjective
\ (\w+)\ over # fox action verb
\ the\ (\w+)\ dog # dog adjective
(?:\s* \# \s* (.*?) \s*)? # an optional, trimmed comment
$ # end of line anchor
/x
) {
($fox, $verb, $dog, $comment) = ($1, $2, $3, $4);
}
From a quick glance at this code, can you immediately tell whether
the $comment variable will come from $4
or $5? Will it include the leading #
comment character? If you are a practiced regular expression
programmer, you probably can answer these questions without
difficulty, at least for this fairly trivial example. But if we could
make this example even clearer, you will hopefully agree
that similarly clarifying some of your more gnarly regular expressions
would be beneficial in the long run.
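If you'd rather not trust your eyes, the question can be settled empirically. This is only a minimal sketch (the $line and @captures names are mine): it collects the captures in list context and counts them.

```perl
use strict;
use warnings;

my $line = 'The quick black fox slavered over the dead dog # a bit macabre, no?';

# In list context a successful match returns the numbered captures in order.
my @captures = $line =~ m/^
    The\ quick\ (\w+)\ fox       # group 1: fox adjective
    \ (\w+)\ over                # group 2: fox action verb
    \ the\ (\w+)\ dog            # group 3: dog adjective
    (?:\s* \# \s* (.*?) \s*)?    # the outer group is not numbered
    $
/x;

printf "%d capture groups\n", scalar @captures;
print  "comment: [$captures[3]]\n";
```

Non-capturing parentheses take no number, so the comment text lands in the fourth capture, and because the # sits outside the inner parentheses it is not part of the captured text.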
When regular expressions grow very large, or include more than three pairs of parentheses (capturing or otherwise), a useful clarifying
technique is to embed the capturing assignments directly within the
regular expression, via the code-executing pattern (?{}).
In the embedded code, the special $^N variable (available since Perl 5.8), which
holds the text matched by the most recently closed capture group, is used to
"inline" any variable assignments; our previous example turns into
this:
my ($fox, $verb, $dog, $comment);
$_ =~ m/^ # anchor at beginning of line
The\ quick\ (\w+) # fox adjective
(?{ $fox = $^N })
\ fox\ (\w+) # fox action verb
(?{ $verb = $^N })
\ over\ the\ (\w+) # dog adjective
(?{ $dog = $^N })
\ dog
# optional trimmed comment
(?:\s* \# \s* # leading whitespace
(.*?) # comment text
(?{ $comment = $^N })
\s*)? # trailing whitespace
$ # end of line anchor
/x; # allow whitespace
Now it should be explicitly clear that the $comment
variable will only contain the whitespace-trimmed commentary following
(but not including) the # character. We also don't have
to worry about numbered
variables $1, $2, $3,
etc. anymore, since we don't make use of them. This regular expression
can be easily extended to capture other text without rearranging
variable assignments.
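Here is one way the technique might be packaged for reuse. This is a sketch rather than a recommendation: the parse_line() function and its hash keys are hypothetical, and I'm assuming Perl 5.18 or later, where lexical variables behave predictably inside (?{}) blocks that live in a subroutine.

```perl
use strict;
use warnings;

# Returns a hash ref of named fields, or nothing if the line doesn't match.
sub parse_line {
    my ($line) = @_;
    my ($fox, $verb, $dog, $comment);

    my $matched = $line =~ m/^
        The\ quick\ (\w+)  (?{ $fox     = $^N })   # fox adjective
        \ fox\ (\w+)       (?{ $verb    = $^N })   # fox action verb
        \ over\ the\ (\w+) (?{ $dog     = $^N })   # dog adjective
        \ dog
        (?: \s* \# \s*
            (.*?)          (?{ $comment = $^N })   # comment text
            \s*
        )?
        $
    /x;

    return unless $matched;
    return { fox => $fox, verb => $verb, dog => $dog, comment => $comment };
}

my $result = parse_line(
    'The quick black fox slavered over the dead dog # a bit macabre, no?'
);
printf "fox=%s verb=%s dog=%s comment=%s\n",
    @{$result}{qw(fox verb dog comment)};
```

Because the function only hands back values after the overall match has succeeded, callers never see assignments left behind by a failed attempt.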
Repeated Execution
There are a few caveats to using this technique, however; note that
code within (?{}) constructs is executed immediately as
the regular expression engine incorporates it into a match. That is,
if the engine backtracks off a parenthetical capture to generate a
successful match that does not include that capture, the associated
(?{}) code will have already been executed. To illustrate, let's
again look at just the capturing pattern for the comment
text, (.*?), and add a debugging warn statement that prints
the current value of $comment:
# optional trimmed comment
(?:\s* \# \s* # leading whitespace
(.*?) (?{ $comment = $^N; # comment text
warn ">>$comment<<\n"
if $debug;
})
\s*)? # trailing whitespace
$ # end of line anchor
The capturing (.*?) pattern is a non-greedy extension
that will cause the regular expression matching engine to constantly
try to finish the match (looking for any trailing whitespace and the
end of string, $) without extending the .*?
pattern any further. The upshot of all this is that with debugging
turned on, this input text:
The quick black fox slavered over the dead dog # a bit macabre, no?
Will lead to these debugging statements:
>><<
>>a<<
>>a <<
>>a b<<
>>a bi<<
>>a bit<<
>>a bit <<
>>a bit m<<
[ ... ]
>>a bit macabre, n<<
>>a bit macabre, no<<
>>a bit macabre, no?<<
In other words, the adjacent embedded (?{}) code gets
executed every time the matching engine "uses" it while trying to
complete the match; because the matching engine may "backtrack" to try
many alternatives, the embedded code will also be executed as many
times.
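To see the repetition without reading through a wall of warnings, the sketch below records every assignment in an array instead (the @attempts name is mine, and the pattern is cut down to just the comment portion). The exact number of executions depends on how the engine walks the match, so treat the count as illustrative.

```perl
use strict;
use warnings;

my $line = 'The quick black fox slavered over the dead dog # a bit macabre, no?';

my $comment;
my @attempts;    # every value the embedded code assigned, in order

$line =~ m/
    \# \s*
    (.*?) (?{ $comment = $^N; push @attempts, $comment })
    \s* $
/x;

printf "embedded code ran %d times; final comment: %s\n",
    scalar @attempts, $comment;
```

Only the last entry in @attempts corresponds to the successful match; every earlier entry is a partial comment that the engine tried and then had to extend.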
This multiple execution behavior does raise a few concerns. If the
embedded code is only performing assignments, via $^N,
there doesn't seem at first to be much of a problem, because each
successive execution overrides any previous assignments, and only the
final, successful execution matters, right? However, what if the
input text had instead been:
The quick black fox slavered over the dead doggie # a bit macabre, no?
This text should fail to match the regular expression overall
(since "doggie" won't match "dog"), and it does. But, because the
embedded (?{}) code chunks are executed as the match is
evaluated, the $fox, $verb
and $dog variables are successfully assigned; the match
doesn't fail until "doggie" is seen. Our program might now be more
readable and maintainable, but we've also subtly altered the behavior
of the program.
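The following sketch makes the leftover assignments visible (variable names as in the earlier examples; $matched is my addition). Note that what the variables end up holding after a failed match is whatever the last attempt assigned, and as the engine backtracks through each (\w+) that may well be a truncated fragment rather than a whole word.

```perl
use strict;
use warnings;

my $line = 'The quick black fox slavered over the dead doggie # a bit macabre, no?';

my ($fox, $verb, $dog, $comment);
my $matched = $line =~ m/^
    The\ quick\ (\w+)  (?{ $fox  = $^N })
    \ fox\ (\w+)       (?{ $verb = $^N })
    \ over\ the\ (\w+) (?{ $dog  = $^N })
    \ dog
    (?: \s* \# \s* (.*?) (?{ $comment = $^N }) \s* )?
    $
/x ? 1 : 0;

print "matched: $matched\n";
for my $pair ([fox => $fox], [verb => $verb], [dog => $dog], [comment => $comment]) {
    my ($name, $value) = @$pair;
    printf "%-8s %s\n", $name, defined $value ? "'$value'" : '(undef)';
}
```

The match fails, yet three of the four variables are defined afterwards; only $comment is untouched, because the engine never gets as far as the comment group.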
The second problem is one of performance; what if our assignment
code hadn't simply copied $^N into a variable, but had
instead executed a remote database update? Repeatedly hitting the
database with meaningless updates may be crippling and inefficient.
However, the behavioral aspects of the database example are even more
frightening: what if the match failed overall, but our updates had
already been executed? Imagine that instead of an update operation,
our code triggered a new row insert for the comment, inserting
multiple, incorrect comment rows!


