Maintaining Regular Expressions
by Aaron Mackey
|
Pages: 1, 2
Deferred Execution
Luckily, Perl's ability to introduce "locally scoped" variables
provides a mechanism to "defer" code execution until an overall
successful match is accomplished. As the regular expression matching
engine tries alternative matches, it introduces a new, nested scope
for each (?{}) block, and, more importantly, it exits a
local scope if a particular match is abandoned for another. If we
were to write out the code executed by the matching engine as it moved
(and backtracked) through our input, it might look like this:
{ # introduce new scope
$fox = $^N;
{ # introduce new scope
$verb = $^N;
{ # introduce new scope
$dog = $^N;
{ # introduce new scope
$comment = $^N;
} # close scope: failed overall match
{ # introduce new scope
$comment = $^N;
} # close scope: failed overall match
{ # introduce new scope
$comment = $^N;
} # close scope: failed overall match
# ...
{ # introduce new scope
$comment = $^N;
} # close scope: successful overall match
} # close scope: successful overall match
} # close scope: successful overall match
} # close scope: successful overall match
We can use this block-scoping behavior to solve both our altered behavior and performance issues. Instead of executing code immediately within each block, we'll cleverly "bundle" the code up, save it away on a locally scoped "stack," and only process the code if and when we get to the end of a successful match:
my ($fox, $verb, $dog, $comment);
$_ =~ m/(?{
local @c = (); # provide storage "stack"
})
^ # anchor at beginning of line
The\ quick\ (\w+) # fox adjective
(?{
local @c;
push @c, sub {
$fox = $^N;
};
})
\ fox\ (\w+) # fox action verb
(?{
local @c = @c;
push @c, sub {
$verb = $^N;
};
})
\ over\ the\ (\w+) # dog adjective
(?{
local @c = @c;
push @c, sub {
$dog = $^N;
};
})
dog
# optional trimmed comment
(?:\s* \# \s* # leading whitespace
(.*?) # comment text
(?{
local @c = @c;
push @c, sub {
$comment = $^N;
warn ">>$comment<<\n"
if $debug;
};
})
\s*)? # trailing whitespace
$ # end of line anchor
(?{
for (@c) { &$_; } # execute the deferred code
})
/x; # allow whitespace
Using subroutine "closures" to package up our code and save them on
a locally defined stack, @c, allows us to defer any
processing until the very end of a successful match. Here's the
matching engine code execution "path":
{ # introduce new scope
local @c = (); # provide storage "stack"
{ # introduce new scope
local @c;
push @c, sub { $fox = $^N; };
{ # introduce new scope
local @c = @c;
push @c, sub { $verb = $^N; };
{ # introduce new scope
local @c = @c;
push @c, sub { $dog = $^N; };
{ # introduce new scope
local @c = @c;
push @c, sub { $comment = $^N; };
} # close scope; lose changes to @c
{ # introduce new scope
local @c = @c;
push @c, sub { $comment = $^N; };
} # close scope; lose changes to @c
# ...
{ # introduce new scope
local @c = @c;
push @c, sub { $comment = $^N; };
{ # introduce new scope
for (@c) { &$_; }
} # close scope
} # close scope; lose changes to @c
} # close scope; lose changes to @c
} # close scope; lose changes to @c
} # close scope; lose changes to @c
} # close scope; no more @c at all
This last technique is especially wordy; however, given judicious use of whitespace and well-aligned formatting, this idiom could ease the maintenance of long, complicated regular expressions.
But, more importantly, it doesn't work as written. What!?!
Why? Well, it turns out that Perl's support for code blocks
inside (?{}) constructs doesn't support subroutine
closures (even attempting to compile one causes a core dump). But
don't worry, all is not lost! Since this is Perl, we can always take
things a step further, and make the hard things easy ...
|
Related Reading Mastering Regular Expressions |
Making it Actually Work: use Regexp::DeferredExecution
Though we cannot (yet) compile subroutines within (?{})
constructs, we can manipulate all the other types of Perl variables:
scalars, arrays, and hashes. So instead of using closures:
m/
(?{ local @c = (); })
# ...
(?{ local @c; push @c, sub { $comment = ^$N; } })
# ...
(?{ for (@c) { &$_; } })
/x
We can instead just package up our $comment = $^N code into
a string, to be executed by an eval statement later:
m/
(?{ local @c = (); })
# ...
(?{ local @c; push @c, [ $^N, q{ $comment = ^$N; } ] })
# ...
(?{ for (@c) { $^N = $$[0]; eval $$[1]; } })
/x
Note that we also had to store away the version of $^N that was
active at the time of the (?{}) pattern, because it very
likely will have changed by the end of the match. We didn't need to do this previously, as we were storing closures that efficiently captured all the local context of the code to be executed.
Well, now this is getting really wordy, and downright ugly
to be honest. However, through the magic of Perl's overloading
mechanism, we can avoid having to see any of that ugliness, by simply
using the Regexp::DeferredExecution module from CPAN:
use Regexp:DeferredExecution;
my ($fox, $verb, $dog, $comment);
$_ =~ m/^ # anchor at beginning of line
The\ quick\ (\w+) # fox adjective
(?{ $fox = $^N })
\ fox\ (\w+) # fox action verb
(?{ $verb = $^N })
\ over\ the\ (\w+) # dog adjective
(?{ $dog = $^N })
dog
# optional trimmed comment
(?:\s* \# \s* # leading whitespace
(.*?)
(?{ $comment = $^N }) # comment text
\s*)? # trailing whitespace
$ # end of line anchor
/x; # allow whitespace
How does the Regexp::DeferredExecution module perform its magic? Carefully, of course, but also simply; it just makes the
same alterations to regular expressions that we made manually. 1) An
initiating embedded code pattern is prepended to declare local "stack"
storage. 2) Another embedded code pattern is added at the end of the
expression to execute any code found in the stack (the stack itself is
stored in
@Regexp::DeferredExecution::c, so you shouldn't need to
worry about variable name collisions with your own code). 3) Finally,
any (?{}) constructs seen in your regular expressions are
saved away onto a local copy of the stack for later execution. It
looks a little like this:
package Regexp::DeferredExecution;
use Text::Balanced qw(extract_multiple extract_codeblock);
use overload;
sub import { overload::constant 'qr' => \&convert; }
sub unimport { overload::remove_constant 'qr'; }
sub convert {
my $re = shift;
# no need to convert regexp's without (?{ <code> }):
return $re unless $re =~ m/\(\?\{/;
my @chunks = extract_multiple($re,
[ qr/\(\? # '(?' (escaped)
(?={) # followed by '{' (lookahead)
/x,
\&extract_codeblock
]
);
for (my $i = 1 ; $i < @chunks ; $i++) {
if ($chunks[$i-1] eq "(?") {
# wrap all code into a closure and push onto the stack:
$chunks[$i] =~
s/\A{ (.*) }\Z/{
local \@Regexp::DeferredExecution::c;
push \@Regexp::DeferredExecution::c, [\$^N, q{$1}];
}/msx;
}
$re = join("", @chunks);
# install the stack storage and execution code:
$re = "(?{
local \@Regexp::DeferredExecution::c = (); # the stack
})$re(?{
for (\@Regexp::DeferredExecution::c) {
\$^N = \$\$_[0]; # reinstate \$^N
eval \$\$_[1]; # execute the code
}
})";
return $re;
}
1;
One caveat of Regexp::DeferredExecution use is that
while execution will occur only once per compiled regular expressions,
the ability to embed regular expressions inside of other regular
expressions will circumvent this behavior:
use Regexp::DeferredExecution;
# the quintessential foobar/foobaz parser:
$re = qr/foo
(?:
bar (?:{ warn "saw bar!\n"; })
|
baz (?:{ warn "saw baz!\n"; })
)?/x;
# someone's getting silly now:
$re2 = qr/ $re
baroo!
(?:{ warn "saw foobarbaroo! (or, foobazbaroo!)\n"; })
/x;
"foobar" =~ /$re2/;
__END__
"saw bar!"
Even though the input text to $re2 failed to match, the
deferred code from $re was executed because its pattern
did match successfully.
Therefore, Regexp::DeferredExecution should only be used
with "constant" regular expressions; there is currently no way to
overload dynamic, "interpolated" regular expressions.
See Also
The Regexp::Fields module provides a much more compact
shorthand for embedded named variable
assignments, (?<varname> pattern), such that
our example becomes:
use Regexp::Fields qw(my);
my $rx =
qr/^ # anchor at beginning of line
The\ quick\ (?<fox> \w+)\ fox # fox adjective
\ (?<verb> \w+)\ over # fox action verb
\ the\ (?<dog> \w+) dog # dog adjective
(?:\s* \# \s*
(?<comment> .*?)
\s*)? # an optional, trimmed comment
$ # end of line anchor
/x;
Note that in this particular example, the my $rx
compilation stanza actually implicitly
declared $fox, $verb etc. If variable
assignment is all you're ever doing, Regexp::Fields is
all you'll need. If you want to embed more generic code fragments in
your regular expressions, Regexp::DeferredExecution may
be your ticket.
And finally, because in Perl there is always One More Way To Do It,
I'll also demonstrate Regexp::English, a module that
allows you to use regular expressions without actually writing any
regular expressions:
use Regexp::English;
my ($fox, $verb, $dog, $comment);
my $rx = Regexp::English->new
-> start_of_line
-> literal('The quick ')
-> remember(\$fox)
-> word_chars
-> end
-> literal(' fox ')
-> remember(\$verb)
-> word_chars
-> end
-> literal(' over the ')
-> remember(\$dog)
-> word_chars
-> end
-> literal(' dog')
-> optional
-> zero_or_more -> whitespace_char -> end
-> literal('#')
-> zero_or_more -> whitespace_char -> end
-> remember(\$comment)
-> minimal
-> multiple
-> word_char
-> or
-> whitespace_char
-> end
-> end
-> end
-> zero_or_more -> whitespace_char -> end
->end
-> end_of_line;
$rx->match($_);
I must admit that this last example appeals to my inner-Lispish self.
Hopefully you've gleaned a few tips and tricks from this little workshop of mine that you can take back to your own shop.


