Sign In/My Account | View Cart  
advertisement


Listen Print

Maintaining Regular Expressions
by Aaron Mackey | Pages: 1, 2

Deferred Execution

Luckily, Perl's ability to introduce "locally scoped" variables provides a mechanism to "defer" code execution until an overall successful match is accomplished. As the regular expression matching engine tries alternative matches, it introduces a new, nested scope for each (?{}) block, and, more importantly, it exits a local scope if a particular match is abandoned for another. If we were to write out the code executed by the matching engine as it moved (and backtracked) through our input, it might look like this:

{ # introduce new scope
  $fox = $^N;
  { # introduce new scope
    $verb = $^N;
    { # introduce new scope
      $dog = $^N;
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match
      { # introduce new scope
        $comment = $^N;
      } # close scope: failed overall match

      # ...

      { # introduce new scope
        $comment = $^N;
      } # close scope: successful overall match
    } # close scope: successful overall match
  } # close scope: successful overall match
} # close scope: successful overall match

We can use this block-scoping behavior to solve both our altered behavior and performance issues. Instead of executing code immediately within each block, we'll cleverly "bundle" the code up, save it away on a locally scoped "stack," and only process the code if and when we get to the end of a successful match:

my ($fox, $verb, $dog, $comment);
  $_ =~ m/(?{
              local @c = ();            # provide storage "stack"
          })
          ^                             # anchor at beginning of line
          The\ quick\  (\w+)            # fox adjective
                       (?{
                           local @c;
                           push @c, sub {
                               $fox = $^N;
                           };
                       })
          \ fox\       (\w+)            # fox action verb
                       (?{
                           local @c = @c;
                           push @c, sub {
                               $verb = $^N;
                           };
                       })
          \ over\ the\ (\w+)            # dog adjective
                       (?{
                           local @c = @c;
                           push @c, sub {
                               $dog = $^N;
                           };
                       })
          dog
                                        # optional trimmed comment
            (?:\s* \# \s*               #   leading whitespace
            (.*?)                       #   comment text
            (?{
                local @c = @c;
                push @c, sub {
                    $comment = $^N;
                    warn ">>$comment<<\n"
                      if $debug;
                };
            })
            \s*)?                       #   trailing whitespace
          $                             # end of line anchor
          (?{
              for (@c) { &$_; }         # execute the deferred code
          })
         /x;                            # allow whitespace

Using subroutine "closures" to package up our code and save them on a locally defined stack, @c, allows us to defer any processing until the very end of a successful match. Here's the matching engine code execution "path":

{ # introduce new scope

  local @c = (); # provide storage "stack"

  { # introduce new scope

    local @c;
    push @c, sub { $fox = $^N; };

    { # introduce new scope

      local @c = @c;
      push @c, sub { $verb = $^N; };

      { # introduce new scope

        local @c = @c;
        push @c, sub { $dog = $^N; };

        { # introduce new scope

          local @c = @c;
          push @c, sub { $comment = $^N; };

        } # close scope; lose changes to @c

        { # introduce new scope

          local @c = @c;
          push @c, sub { $comment = $^N; };

        } # close scope; lose changes to @c

        # ...

        { # introduce new scope

          local @c = @c;
          push @c, sub { $comment = $^N; };

          { # introduce new scope

            for (@c) { &$_; }

          } # close scope

        } # close scope; lose changes to @c
      } # close scope; lose changes to @c
    } # close scope; lose changes to @c
  } # close scope; lose changes to @c
} # close scope; no more @c at all

This last technique is especially wordy; however, given judicious use of whitespace and well-aligned formatting, this idiom could ease the maintenance of long, complicated regular expressions.

But, more importantly, it doesn't work as written. What!?! Why? Well, it turns out that Perl's support for code blocks inside (?{}) constructs doesn't support subroutine closures (even attempting to compile one causes a core dump). But don't worry, all is not lost! Since this is Perl, we can always take things a step further, and make the hard things easy ...

Making it Actually Work: use Regexp::DeferredExecution

Though we cannot (yet) compile subroutines within (?{}) constructs, we can manipulate all the other types of Perl variables: scalars, arrays, and hashes. So instead of using closures:

m/
    (?{ local @c = (); })
    # ...
    (?{ local @c; push @c, sub { $comment = ^$N; } })
    # ...
    (?{ for (@c) { &$_; } })
   /x

We can instead just package up our $comment = $^N code into a string, to be executed by an eval statement later:

m/
    (?{ local @c = (); })
    # ...
    (?{ local @c; push @c, [ $^N, q{ $comment = ^$N; } ] })
    # ...
    (?{ for (@c) { $^N = $$[0]; eval $$[1]; } })
   /x

Note that we also had to store away the version of $^N that was active at the time of the (?{}) pattern, because it very likely will have changed by the end of the match. We didn't need to do this previously, as we were storing closures that efficiently captured all the local context of the code to be executed.

Well, now this is getting really wordy, and downright ugly to be honest. However, through the magic of Perl's overloading mechanism, we can avoid having to see any of that ugliness, by simply using the Regexp::DeferredExecution module from CPAN:

use Regexp:DeferredExecution;

  my ($fox, $verb, $dog, $comment);
  $_ =~ m/^                               # anchor at beginning of line
          The\ quick\  (\w+)              # fox adjective
                       (?{ $fox  = $^N }) 
          \ fox\       (\w+)              # fox action verb
                       (?{ $verb = $^N })
          \ over\ the\ (\w+)              # dog adjective
                       (?{ $dog  = $^N })
          dog
                                          # optional trimmed comment
            (?:\s* \# \s*                 #   leading whitespace
            (.*?)
            (?{ $comment = $^N })         #   comment text
            \s*)?                         #   trailing whitespace
          $                               # end of line anchor
         /x;                              # allow whitespace

How does the Regexp::DeferredExecution module perform its magic? Carefully, of course, but also simply; it just makes the same alterations to regular expressions that we made manually. 1) An initiating embedded code pattern is prepended to declare local "stack" storage. 2) Another embedded code pattern is added at the end of the expression to execute any code found in the stack (the stack itself is stored in @Regexp::DeferredExecution::c, so you shouldn't need to worry about variable name collisions with your own code). 3) Finally, any (?{}) constructs seen in your regular expressions are saved away onto a local copy of the stack for later execution. It looks a little like this:

package Regexp::DeferredExecution;

use Text::Balanced qw(extract_multiple extract_codeblock);

use overload;

sub import { overload::constant 'qr' => \&convert; }
sub unimport { overload::remove_constant 'qr'; }

sub convert {

  my $re = shift; 

  # no need to convert regexp's without (?{ <code> }):
  return $re unless $re =~ m/\(\?\{/;

  my @chunks = extract_multiple($re,
                                [ qr/\(\?  # '(?' (escaped)
                                     (?={) # followed by '{' (lookahead)
                                    /x,
                                  \&extract_codeblock
                                ]
                               );

  for (my $i = 1 ; $i < @chunks ; $i++) {
    if ($chunks[$i-1] eq "(?") {
      # wrap all code into a closure and push onto the stack:
      $chunks[$i] =~
        s/\A{ (.*) }\Z/{
          local \@Regexp::DeferredExecution::c;
          push \@Regexp::DeferredExecution::c, [\$^N, q{$1}];
        }/msx;
  }

  $re = join("", @chunks);

  # install the stack storage and execution code:
  $re = "(?{
            local \@Regexp::DeferredExecution::c = (); # the stack
         })$re(?{
            for (\@Regexp::DeferredExecution::c) {
              \$^N = \$\$_[0];  # reinstate \$^N
              eval \$\$_[1];    # execute the code
            }
         })";

  return $re;
}

1;

One caveat of Regexp::DeferredExecution use is that while execution will occur only once per compiled regular expressions, the ability to embed regular expressions inside of other regular expressions will circumvent this behavior:

use Regexp::DeferredExecution;

  # the quintessential foobar/foobaz parser:
  $re = qr/foo
           (?:
              bar (?:{ warn "saw bar!\n"; })
              |
              baz (?:{ warn "saw baz!\n"; })
           )?/x;

  # someone's getting silly now:
  $re2 = qr/ $re
             baroo!
             (?:{ warn "saw foobarbaroo! (or, foobazbaroo!)\n"; })
           /x;

  "foobar" =~ /$re2/;

  __END__
  "saw bar!"
Even though the input text to $re2 failed to match, the deferred code from $re was executed because its pattern did match successfully. Therefore, Regexp::DeferredExecution should only be used with "constant" regular expressions; there is currently no way to overload dynamic, "interpolated" regular expressions.

See Also

The Regexp::Fields module provides a much more compact shorthand for embedded named variable assignments, (?<varname> pattern), such that our example becomes:

use Regexp::Fields qw(my);

  my $rx =
    qr/^                             # anchor at beginning of line
       The\ quick\ (?<fox> \w+)\ fox # fox adjective
       \ (?<verb> \w+)\ over         # fox action verb
       \ the\ (?<dog> \w+) dog       # dog adjective
       (?:\s* \# \s*
          (?<comment> .*?)
       \s*)? # an optional, trimmed comment
       $                             # end of line anchor
      /x;

Note that in this particular example, the my $rx compilation stanza actually implicitly declared $fox, $verb etc. If variable assignment is all you're ever doing, Regexp::Fields is all you'll need. If you want to embed more generic code fragments in your regular expressions, Regexp::DeferredExecution may be your ticket.

And finally, because in Perl there is always One More Way To Do It, I'll also demonstrate Regexp::English, a module that allows you to use regular expressions without actually writing any regular expressions:

use Regexp::English;

  my ($fox, $verb, $dog, $comment);

  my $rx = Regexp::English->new
               -> start_of_line
               -> literal('The quick ')

               -> remember(\$fox)
                   -> word_chars
               -> end

               -> literal(' fox ')

               -> remember(\$verb)
                   -> word_chars
               -> end

               -> literal(' over the ')

               -> remember(\$dog)
                   -> word_chars
               -> end

               -> literal(' dog')

               -> optional
                   -> zero_or_more -> whitespace_char -> end
                   -> literal('#')
                   -> zero_or_more -> whitespace_char -> end

                   -> remember(\$comment)
                       -> minimal
                           -> multiple
                               -> word_char
                               -> or
                               -> whitespace_char
                           -> end
                       -> end
                   -> end
                   -> zero_or_more -> whitespace_char -> end
               ->end

               -> end_of_line;

  $rx->match($_);

I must admit that this last example appeals to my inner-Lispish self.

Hopefully you've gleaned a few tips and tricks from this little workshop of mine that you can take back to your own shop.