Maintaining Regular Expressions

by Aaron Mackey
January 16, 2004

For some, regular expressions provide the chainsaw functionality of the much-touted Perl "Swiss Army knife" metaphor. They are powerful, fast, and very sharp, but like real chainsaws, can be dangerous when used without appropriate safety measures.

In this article I'll discuss the issues associated with using heavy-duty, contractor-grade regular expressions, and demonstrate a few maintenance techniques to keep these chainsaws in proper condition for safe and effective long-term use.

Readability: Whitespace and Comments

Before getting into any deep issues, I want to cover the number one rule of shop safety: use whitespace to format your regular expressions. Most of us already honor this wisdom in our various coding styles (though perhaps not with the zeal of Python developers). But more of us could make better, more judicious use of whitespace in our regular expressions, via the /x modifier. Not only does it improve readability, it also allows us to add meaningful, explanatory comments. For example, this simple regular expression:

  # matching "foobar" is critical here ...
  $_ =~ m/foobar/;

Could be rewritten, using a trailing /x modifier, as:

  $_ =~ m/ foobar    # matching "foobar" is critical here ...
         /x;

Now, in this example you might argue that readability wasn't improved at all; I guess that's the problem with triviality. Here's another, slightly less trivial example that also illustrates the need to escape literal whitespace and comment characters when using the /x modifier:

  $_ =~ m/^                         # anchor at beginning of line
          The\ quick\ (\w+)\ fox    # fox adjective
          \ (\w+)\ over             # fox action verb
          \ the\ (\w+)\ dog         # dog adjective
          (?:                       # whitespace-trimmed comment:
            \s* \# \s*              #   whitespace and comment token
            (.*?)                   #   captured comment text; non-greedy!
            \s*                     #   any trailing whitespace
          )?                        # this is all optional
          $                         # end of line anchor
         /x;                        # allow whitespace

This regular expression successfully matches the following lines of input:

The quick brown fox jumped over the lazy dog
The quick red fox bounded over the sleeping dog
The quick black fox slavered over the dead dog   # a bit macabre, no?
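
To check that claim, here is a minimal, self-contained sketch that runs the pattern against the three sample lines. The names $re and @lines are illustrative only. Note that the space before "dog" is escaped, because unescaped whitespace is ignored under /x.

```perl
use strict;
use warnings;

# The /x pattern, compiled once with qr// so it can be reused.
my $re = qr/^
            The\ quick\ (\w+)\ fox     # fox adjective
            \ (\w+)\ over              # fox action verb
            \ the\ (\w+)\ dog          # dog adjective
            (?:                        # whitespace-trimmed comment:
              \s* \# \s*               #   whitespace and comment token
              (.*?)                    #   captured comment text; non-greedy!
              \s*                      #   any trailing whitespace
            )?                         # this is all optional
            $                          # end of line anchor
           /x;

my @lines = (
    'The quick brown fox jumped over the lazy dog',
    'The quick red fox bounded over the sleeping dog',
    'The quick black fox slavered over the dead dog   # a bit macabre, no?',
);

# Keep only the lines the pattern accepts.
my @matched = grep { $_ =~ $re } @lines;
printf "%d of %d lines matched\n", scalar @matched, scalar @lines;
```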

While embedding meaningful explanatory comments in your regular expressions can only help readability and maintenance, many of us don't like the plethora of backslashed spaces made necessary by the "global" /x modifier. Enter two "locally" acting constructs, the (?#) embedded comment and the (?x:) embedded modifier:

  $_ =~ m/^(?#                      # anchor at beginning of line

          )The quick (\w+) fox (?#  # fox adjective
          )(\w+) over (?#           # fox action verb
          )the (\w+) dog(?x:        # dog adjective
                                    # optional, trimmed comment:
            \s*                     #   leading whitespace
            \# \s* (.*?)            #   comment text
            \s*                     #   trailing whitespace

          )?$(?#                    # end of line anchor
          )/;

In this case, the (?#) embedded comment construct introduces our commentary between each set of whitespace-sensitive textual components; the non-capturing parentheses construct (?:) used for the optional comment text was also altered to include a locally acting x modifier. No backslashing was necessary, but it's a bit harder to quickly distinguish relevant whitespace. To each their own, YMMV, TIMTOWTDI, etc.; the fact is, both commented examples are probably easier to maintain than:

  # match the fox adjective and action verb, then the dog adjective,
  # and any optional, whitespace-trimmed commentary:
  $_ =~ m/^The quick (\w+) fox (\w+) over the (\w+) dog(?:\s*#\s*(.*?)\s*)?$/;

This example, while well-commented and clear at first, quickly deteriorates into the nearly unreadable "line noise" that gives Perl programmers a bad name and makes later maintenance difficult.
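
As a sanity check that commenting changes nothing about what is matched, the following sketch compares the captures produced by the (?#) and (?x:) form with those of the compact one-liner. The variable names are made up for illustration, and the // operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Commented form, using (?#...) and a locally acting (?x:...) group.
my $commented = qr/^(?#                # anchor at beginning of line
          )The quick (\w+) fox (?#     # fox adjective
          )(\w+) over (?#              # fox action verb
          )the (\w+) dog(?x:           # dog adjective
            \s*                        #   leading whitespace
            \# \s* (.*?)               #   comment text
            \s*                        #   trailing whitespace
          )?$(?#                       # end of line anchor
          )/;

# Compact form: the same pattern with no commentary.
my $compact = qr/^The quick (\w+) fox (\w+) over the (\w+) dog(?:\s*#\s*(.*?)\s*)?$/;

my @lines = (
    'The quick brown fox jumped over the lazy dog',
    'The quick black fox slavered over the dead dog   # a bit macabre, no?',
);

my $agree = 1;
for my $line (@lines) {
    my @a = $line =~ $commented;    # list context returns the four captures
    my @b = $line =~ $compact;
    # A missing trailing comment leaves the fourth capture undefined,
    # so undefined values are compared as empty strings.
    my $diffs = grep { ($a[$_] // '') ne ($b[$_] // '') } 0 .. 3;
    my $same  = (@a == 4 && @b == 4 && $diffs == 0);
    $agree &&= $same;
    printf "%-5s %s\n", ($same ? 'same' : 'DIFF'), $line;
}
```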

So, as in other programming languages, use whitespace formatting and commenting as appropriate, or maybe even when it seems like overkill; it can't hurt. And like the choice between alternative code indentation and bracing styles, Perl regular expressions allow a few different options (global /x modifier, local (?#) and (?x:) embedded modifiers) to suit your particular aesthetics.

Capturing Parentheses: Taming the Jungle

Most of us use regular expressions to actually do something with the parsed text (although whether the input matches the expression at all is also important). Assigning the captured text from the previous example is relatively easy: the first three capturing parentheses are visually distinct and can be clearly numbered $1, $2 and $3. However, the extra set of non-capturing parentheses, which provides the optional commentary, itself contains another set of embedded, capturing parentheses. Here's another rewriting of the example, with slightly less whitespace formatting:

  my ($fox, $verb, $dog, $comment);
  if ( $_ =~ m/^                         # anchor at beginning of line
               The\ quick\ (\w+)\ fox    # fox adjective
               \ (\w+)\ over             # fox action verb
               \ the\ (\w+)\ dog         # dog adjective
               (?:\s* \# \s* (.*?) \s*)? # an optional, trimmed comment
               $                         # end of line anchor
              /x
     ) {
      ($fox, $verb, $dog, $comment) = ($1, $2, $3, $4);
  }

From a quick glance at this code, can you immediately tell whether the $comment variable will come from $4 or $5? Will it include the leading # comment character? If you are a practiced regular expression programmer, you probably can answer these questions without difficulty, at least for this fairly trivial example. But if we could make this example even clearer, you will hopefully agree that similarly clarifying some of your more gnarly regular expressions would be beneficial in the long run.
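
For readers who would rather run the code than count parentheses, here is a small sketch that answers both questions empirically. The sample line and the printed labels are my own, not part of the original program.

```perl
use strict;
use warnings;

my $line = 'The quick black fox slavered over the dead dog   # a bit macabre, no?';

my ($fox, $verb, $dog, $comment);
if ( $line =~ m/^
                The\ quick\ (\w+)\ fox     # 1st capture: fox adjective
                \ (\w+)\ over              # 2nd capture: fox action verb
                \ the\ (\w+)\ dog          # 3rd capture: dog adjective
                (?:\s* \# \s* (.*?) \s*)?  # (?: ) does not capture; (.*?) is 4th
                $
               /x
   ) {
    ($fox, $verb, $dog, $comment) = ($1, $2, $3, $4);
}

# Brackets make leading or trailing whitespace (or a stray '#') visible.
printf "fox=[%s] verb=[%s] dog=[%s] comment=[%s]\n",
    $fox, $verb, $dog, $comment;
```

The non-capturing group does not take up a number, so the comment text arrives in $4, and the # token sits outside the inner parentheses, so it is not part of the captured text.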

When regular expressions grow very large, or include more than three pairs of parentheses (capturing or otherwise), a useful clarifying technique is to embed the capturing assignments directly within the regular expression, via the code-executing pattern (?{}). In the embedded code, the special $^N variable, which holds the text matched by the most recently closed capturing group, is used to "inline" any variable assignments; our previous example turns into this:

  my ($fox, $verb, $dog, $comment);
  $_ =~ m/^                               # anchor at beginning of line
          The\ quick\  (\w+)              # fox adjective
                       (?{ $fox  = $^N }) 
          \ fox\       (\w+)              # fox action verb
                       (?{ $verb = $^N })
          \ over\ the\ (\w+)              # dog adjective
                       (?{ $dog  = $^N })
          \ dog
                                          # optional trimmed comment
            (?:\s* \# \s*                 #   leading whitespace
            (.*?)                         #   comment text
            (?{ $comment = $^N })
            \s*)?                         #   trailing whitespace
          $                               # end of line anchor
         /x;                              # allow whitespace

Now it should be explicitly clear that the $comment variable will only contain the whitespace-trimmed commentary following (but not including) the # character. We also don't have to worry about numbered variables $1, $2, $3, etc. anymore, since we don't make use of them. This regular expression can be easily extended to capture other text without rearranging variable assignments.
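
The inline-assignment technique can be tried as a standalone script. This is only a sketch: it assumes a reasonably modern Perl (5.18 or later, where lexical variables inside (?{}) blocks behave predictably), and the sample line and the $ok flag are my own additions.

```perl
use strict;
use warnings;

my $line = 'The quick black fox slavered over the dead dog # a bit macabre, no?';

my ($fox, $verb, $dog, $comment);
my $ok = $line =~ m/^
            The\ quick\  (\w+)  (?{ $fox  = $^N })     # fox adjective
            \ fox\       (\w+)  (?{ $verb = $^N })     # fox action verb
            \ over\ the\ (\w+)  (?{ $dog  = $^N })     # dog adjective
            \ dog
            (?: \s* \# \s*                             # optional trimmed comment
                (.*?)           (?{ $comment = $^N })
                \s*
            )?
            $
           /x;

# No numbered variables are consulted; each assignment happened in the pattern.
if ($ok) {
    printf "fox=[%s] verb=[%s] dog=[%s] comment=[%s]\n",
        $fox, $verb, $dog, $comment;
}
```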

Repeated Execution

There are a few caveats to using this technique, however. Note that code within (?{}) constructs is executed immediately as the regular expression engine incorporates it into a match. That is, if the engine backtracks off a parenthetical capture to generate a successful match that does not include that capture, the associated (?{}) code will have already been executed. To illustrate, let's again look at just the capturing pattern for the comment text, (.*?), and add a debugging warn statement that prints each value assigned to $comment:

                                        # optional trimmed comment
            (?:\s* \# \s*               #   leading whitespace
            (.*?) (?{ $comment = $^N;   #   comment text
                      warn ">>$comment<<\n"
                        if $debug;
                    })
            \s*)?                       #   trailing whitespace
          $                             # end of line anchor

The capturing (.*?) pattern is a non-greedy extension that will cause the regular expression matching engine to constantly try to finish the match (looking for any trailing whitespace and the end of string, $) without extending the .*? pattern any further. The upshot of all this is that with debugging turned on, this input text:

The quick black fox slavered over the dead dog # a bit macabre, no?

Will lead to these debugging statements:

>><<
>>a<<
>>a <<
>>a b<<
>>a bi<<
>>a bit<<
>>a bit <<
>>a bit m<<
[ ... ]
>>a bit macabre, n<<
>>a bit macabre, no<<
>>a bit macabre, no?<<

In other words, the adjacent embedded (?{}) code gets executed every time the matching engine "uses" it while trying to complete the match; because the matching engine may "backtrack" to try many alternatives, the embedded code will also be executed as many times.
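
One way to observe this without reading a stream of warnings is to record every assignment in an array and count them afterwards. The sketch below does that. The @attempts array is a hypothetical helper, the fox and dog captures are dropped for brevity, and Perl 5.18 or later is assumed.

```perl
use strict;
use warnings;

my $line = 'The quick black fox slavered over the dead dog # a bit macabre, no?';

my $comment;
my @attempts;    # every value the code block assigned, in order

$line =~ m/^
           The\ quick\ \w+\ fox\ \w+\ over\ the\ \w+\ dog
           (?: \s* \# \s*
               (.*?) (?{ $comment = $^N; push @attempts, $comment })
               \s*
           )?
           $
          /x;

printf "code block ran %d times; final value: '%s'\n",
    scalar @attempts, $comment;
```

Only the last entry in @attempts corresponds to the successful match; every earlier entry is a candidate the engine tried and then extended.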

This multiple execution behavior does raise a few concerns. If the embedded code is only performing assignments, via $^N, there doesn't seem at first to be much of a problem, because each successive execution overrides any previous assignments, and only the final, successful execution matters, right? However, what if the input text had instead been:

The quick black fox slavered over the dead doggie # a bit macabre, no?

This text should fail to match the regular expression overall (since "doggie" won't match "dog"), and it does. But, because the embedded (?{}) code chunks are executed as the match is evaluated, the $fox, $verb and $dog variables are successfully assigned; the match doesn't fail until "doggie" is seen. Our program might now be more readable and maintainable, but we've also subtly altered the behavior of the program.
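
The effect is easy to reproduce. In this sketch (again assuming Perl 5.18 or later, with $matched as an illustrative flag), the match fails, yet all three variables have been assigned by the time it does. The exact leftover values depend on how far the engine backtracked before giving up, so the script only reports whether each variable is set.

```perl
use strict;
use warnings;

my $line = 'The quick black fox slavered over the dead doggie # a bit macabre, no?';

my ($fox, $verb, $dog);
my $matched = $line =~ m/^
            The\ quick\  (\w+)  (?{ $fox  = $^N })
            \ fox\       (\w+)  (?{ $verb = $^N })
            \ over\ the\ (\w+)  (?{ $dog  = $^N })
            \ dog
            (?: \s* \# \s* (.*?) \s* )?
            $
           /x;

print 'match ', ($matched ? 'succeeded' : 'failed'), "\n";
for my $pair ( [ fox => $fox ], [ verb => $verb ], [ dog => $dog ] ) {
    my ($name, $value) = @$pair;
    printf "%-4s is %s\n", $name,
        defined $value ? "set to '$value'" : 'undefined';
}
```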

A second problem is one of performance: what if our assignment code hadn't simply copied $^N into a variable, but had instead executed a remote database update? Repeatedly hitting the database with meaningless updates could be cripplingly inefficient. However, the behavioral aspects of the database example are even more frightening: what if the match failed overall, but our updates had already been executed? Imagine that instead of an update operation, our code triggered a new row insert for the comment, inserting multiple, incorrect comment rows!
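
To make the row-insert scenario concrete without a real database, the sketch below uses two arrays as stand-in tables and two hypothetical helpers, insert_eager and insert_deferred. It contrasts a side effect fired from inside the pattern with one conservative alternative, assumed here purely for illustration: capture inside the pattern and perform the side effect only after the match has succeeded.

```perl
use strict;
use warnings;

my (@eager_rows, @deferred_rows);    # stand-ins for a database table

# Hypothetical "insert" helpers; a real program would talk to a database here.
sub insert_eager    { push @eager_rows,    $_[0] }
sub insert_deferred { push @deferred_rows, $_[0] }

my $line = 'The quick black fox slavered over the dead dog # a bit macabre, no?';

# Side effect inside the pattern: runs on every attempt the engine makes.
$line =~ m/^
           The\ quick\ \w+\ fox\ \w+\ over\ the\ \w+\ dog
           (?: \s* \# \s* (.*?) (?{ insert_eager($^N) }) \s* )?
           $
          /x;

# Side effect after the match: runs once, and only if the match succeeded.
if ( $line =~ m/^
                The\ quick\ \w+\ fox\ \w+\ over\ the\ \w+\ dog
                (?: \s* \# \s* (.*?) \s* )?
                $
               /x
   ) {
    insert_deferred($1) if defined $1;
}

printf "eager inserts: %d, deferred inserts: %d\n",
    scalar @eager_rows, scalar @deferred_rows;
```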
