Exegesis 5
by Damian Conway
|
Pages: 1, 2, 3, 4, 5
Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.
Rearranging the deck-chairs
It might have come as a surprise that we were allowed to bind
the pattern's $0 result object directly, but there's nothing magical
about it. $0 turns out to be just another hypothetical variable...the
one that happens to be returned when the match is complete.
Likewise, $1, $2, $3, etc. are all hypotheticals, and can also be
explicitly bound in a rule. That's very handy for ensuring that the
right substring always turns up in the right numbered variable. For
example, consider a Perl 6 rule to match simple Perl 5 method calls
(matching all Perl 5 method calls would, of course, require a much
more sophisticated rule):
rule method_call :w {
# Match direct syntax: $var->meth(...)
\$ (<ident>) -\> (<ident>) \( (<arglist>) \)
| # Match indirect syntax: meth $var (...)
(<ident>) \$ (<ident>) [ \( (<arglist>) \) | (<arglist>) ]
}
my ($varname, methodname, $arglist);
if ($source_code =~ / $0 := <method_call> /) {
$varname = $1 // $5;
$methodname = $2 // $4;
$arglist = $3 // $6 // $7;
}
By binding the match's $0 to the result of the <method_call>
subrule, we bind its $0[1], $0[2], $0[3], etc. to those array
elements in <method_call>'s result object. And thereby bind
$1, $2, $3, etc. as well. Then it's just a matter of sorting
out which numeric variable ended up with which bit of the method call.
That's okay, but it would be much better if we could guarantee that
the variable name was always in $1, the method name in $2,
and the argument list in $3. Then we could replace the last six lines
above with just:
my ($varname, methodname, $arglist) =
$source_code =~ / $0 := <method_call> /;
In Perl 5 there was no way to do that, but in Perl 6 it's relatively easy.
We just modify the method_call rule like so:
rule method_call :w {
\$ $1:=<ident> -\> $2:=<ident> \( $3:=<arglist> \)
| $2:=<ident> \$ $1:=<ident> [ \( $3:=<arglist> \) | $3:=<arglist> ]
}
Or, annotated:
rule method_call :w {
\$ # Match a literal $
$1:=<ident> # Match the varname, bind it to $1
-\> # Match a literal ->
$2:=<ident> # Match the method name, bind it to $2
\( # Match an opening paren
$3:=<arglist> # Match the arg list, bind it to $3
\) # Match a closing paren
| # Or
$2:=<ident> # Match the method name, bind it to $2
\$ # Match a literal $
$1:=<ident> # Match the varname, bind it to $1
[ # Either...
\( $3:=<arglist> \) # Match arg list in parens, bind it to $3
| # Or...
$3:=<arglist> # Just match arg list, bind it to $3
]
}
Now the rule's $1 is bound to the variable name, regardless of which
alternative matches. Likewise $2 is bound to the method name in
either branch of the |, and $3 is associated with the argument
list, no matter which of the three possible ways it was matched.
Of course, that's still rather ugly (especially if we have to write all those comments just so others can understand how clever we were).
So an even better solution is just to use proper named rules (with their handy auto-capturing behaviour) for everything. And then slice the required information out of the result object's hash attribute:
rule varname { <ident> }
rule methodname { <ident> }
rule method_call :w {
\$ <varname> -\> <methodname> \( <arglist> \)
| <methodname> \$ <varname> [ \( <arglist> \) | <arglist> ]
}
$source_code =~ / <method_call> /;
my ($varname, $methodname, $arglist) =
$0{method_call}{"varname","methodname","arglist"}
Deriving a benefit
As the above examples illustrate, using named rules in grammars provides a cleaner syntax and a reduction in the number of variables required in a parsing program. But, beyond those advantages, and the obvious benefits of moving rule construction from run-time to compile-time, there's yet another significant way to gain from placing named rules inside a grammar: we can inherit from them.
For example, the ReverseDiff grammar is almost the same as the
normal Diff grammar. The only difference is in the hunk rule.
So there's no reason why we shouldn't just have ReverseDiff inherit
all that sameness, and simply redefine its notion of hunk-iness.
That would look like this:
grammar ReverseDiff is Diff {
rule hunk :i {
[ <linenum> a :: <linerange> \n
<appendline>+
{ $appendline =~ s/ <in_marker> /</;
let $0 := "${linerange}d${linenum}\n"
_ join "", @$appendline;
}
|
<linerange> d :: <linenum> \n
<deleteline>+
{ $deleteline =~ s/ <out_marker> />/;
let $0 := "${linenum}a${linerange}\n"
_ join "", @$deleteline;
}
|
$from:=<linerange> c :: $to:=<linerange> \n
<deleteline>+
--- \n
<appendline>+
{ $appendline =~ s/ <in_marker> /</;
$deleteline =~ s/ <out_marker> />/;
let $0 := "${to}c${from}\n"
_ join("", @$appendline)
_ "---\n"
_ join("", @$deleteline);
}
]
|
<badline("Invalid diff hunk")>
}
}
The ReverseDiff is Diff syntax is the standard Perl 6 way of
inheriting behaviour. Classes will use the same notation:
class Hacker is Programmer {...}
class JAPH is Hacker {...}
# etc.
Likewise, in the above example Diff is specified as the base grammar
from which the new ReverseDiff grammar is derived. As a result of that
inheritance relationship, ReverseDiff immediately inherits all of the
Diff grammar's rules. We then simple redefine ReverseDiff's version of
the hunk rule, and the job's done.
Different diffs
Grammatical inheritance isn't only useful for tweaking the behaviour of a grammar's rules. It's also handy when two or more related grammars share some characteristics, but differ in some particulars. For example, suppose we wanted to support the “unified” diff format, as well as the “classic”.
A unified diff consists of two lines of header information, followed by a series of hunks. The header information indicates the name and modification date of the old file (prefixing the line with three minus signs), and then the name and modification date of the new file (prefixing that line with three plus signs). Each hunk consists of an offset line, followed by one or more lines representing either shared context, or a line to be inserted, or a line to be deleted. Offset lines start with two “at” signs, then consist of a minus sign followed by the old line offset and line-count, and then a plus sign followed by the nes line offset and line-count, and then two more “at” signs. Context lines are prefixed with two spaces. Insertion lines are prefixed with a plus sign and a space. Deletion lines are prefixed with a minus sign and a space.
But that's not important right now.
What is important is that we could write another complete grammar for that, like so:
grammar Diff::Unified {
rule file { ^ <fileinfo> <hunk>* $ }
rule fileinfo {
<out_marker><3> $oldfile:=(\S+) $olddate:=[\h* (\N+?) \h*?] \n
<in_marker><3> $newfile:=(\S+) $newdate:=[\h* (\N+?) \h*?] \n
}
rule hunk {
<header>
@spec := ( <contextline>
| <appendline>
| <deleteline>
| <badline("Invalid line for unified diff")>
)*
}
rule header {
\@\@ <out_marker> <linenum> , <linecount> \h+
<in_marker> <linenum> , <linecount> \h+
\@\@ \h* \n
}
rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
rule linenum { (\d+) }
rule linecount { (\d+) }
rule deleteline { ^^ <out_marker> (\N* \n) }
rule appendline { ^^ <in_marker> (\N* \n) }
rule contextline { ^^ <sp> <sp> (\N* \n) }
rule out_marker { \+ <sp> }
rule in_marker { - <sp> }
}
That represents (and can parse) the new diff format correctly, but it's a needless duplication of effort and code. Many the rules of this grammar are identical to those of the original diff parser. Which suggests we could just grab them straight from the original -- by inheriting them:
grammar Diff::Unified is Diff {
rule file { ^ <fileinfo> <hunk>* $ }
rule fileinfo {
<out_marker><3> $newfile:=(\S+) $olddate:=[\h* (\N+?) \h*?] \n
<in_marker><3> $newfile:=(\S+) $newdate:=[\h* (\N+?) \h*?] \n
}
rule hunk {
<header>
@spec := ( <contextline>
| <appendline>
| <deleteline>
| <badline("Invalid line for unified diff")>
)*
}
rule header {
\@\@ <out_marker> <linenum> , <linecount> \h+
<in_marker> <linenum> , <linecount> \h+
\@\@ \h* \n
}
rule linecount { (\d+) }
rule contextline { ^^ <sp> <sp> (\N* \n) }
rule out_marker { \+ <sp> }
rule in_marker { - <sp> }
}
Note that in this version we don't need to specify the rules for
appendline, deleteline, linenum, etc. They're provided
automagically by inheriting from the Diff grammar. So we only have to
specify the parts of the new grammar that differ from the original.
In particular, this is where we finally reap the reward for factoring
out the in_marker and out_marker rules. Because we did that
earlier, we can now just change the rules for matching those two markers
directly in the new grammar. As a result, the inherited appendline and
deleteline rules (which use in_marker and out_marker as
subrules) will now attempt to match the new versions of in_marker and
out_marker rules instead.
And if you're thinking that looks suspiciously like polymorphism, you're absolutely right. The parallels between pattern matching and OO run very deep in Perl 6.
Let's get cooking
To sum up: Perl 6 patterns and grammars extend Perl's text matching capacities enormously. But you don't have to start using all that extra power right away. You can ignore grammars and embedded closures and assertions and the other sophisticated bits until you actually need them.
The new rule syntax also cleans up much of the “line-noise” of Perl 5 regexes. But the fundamentals don't change that much. Many Perl 5 patterns will translate very simply and naturally to Perl 6.
To demonstrate that, and to round out this exploration of Perl 6 patterns, here are a few common Perl 5 regexes -- some borrowed from the Perl Cookbook, and others from the Regexp::Common module -- all ported to equivalent Perl 6 rules:
- Match a C comment:
-
# Perl 5 $str =~ m{ /\* .*? \*/ }xs;# Perl 6 $str =~ m{ /\* .*? \*/ }; - Remove leading qualifiers from a Perl identifier
-
# Perl 5 $ident =~ s/^(?:\w*::)*//;# Perl 6 $ident =~ s/^[\w*\:\:]*//; - Warn of text with lines greater than 80 characters
-
# Perl 5 warn "Thar she blows!: $&" if $str =~ m/.{81,}/;# Perl 6 warn "Thar she blows!: $0" if $str =~ m/\N<81,>/; - Match a Roman numeral
-
# Perl 5 $str =~ m/ ^ m* (?:d?c{0,3}|c[dm]) (?:l?x{0,3}|x[lc]) (?:v?i{0,3}|i[vx]) $ /ix;# Perl 6 $str =~ m:i/ ^ m* [d?c<0,3>|c<[dm]>] [l?x<0,3>|x<[lc]>] [v?i<0,3>|i<[vx]>] $ /; - Extract lines regardless of line terminator
-
# Perl 5 push @lines, $1 while $str =~ m/\G([^\012\015]*)(?:\012\015?|\015\012?)/gc;# Perl 6 push @lines, $1 while $str =~ m:c/ (\N*) \n /; - Match a quote-delimited string (Friedl-style), capturing contents:
-
# Perl 5 $str =~ m/ " ( [^\\"]* (?: \\. [^\\"]* )* ) " /x;# Perl 6 $str =~ m/ " ( <-[\\"]>* [ \\. <-[\\"]>* ]* ) " /; - Match a decimal IPv4 address:
-
# Perl 5 my $quad = qr/(?: 25[0-5] | 2[0-4]\d | [0-1]??\d{1,2} )/x;$str =~ m/ $quad \. $quad \. $quad \. $quad /x;# Perl 6 rule quad { (\d<1,3>) :: { fail unless $1 < 256 } }$str =~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /x;# Perl 6 (same great approach, now less syntax) rule quad { (\d<1,3>) :: <($1 < 256)> }$str =~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /x; - Match a floating-point number, returning components:
-
# Perl 5 ($sign, $mantissa, $exponent) = $str =~ m/([+-]?)([0-9]+\.?[0-9]*|\.[0-9]+)(?:e([+-]?[0-9]+))?/;# Perl 6 ($sign, $mantissa, $exponent) = $str =~ m/(<[+-]>?)(<[0-9]>+\.?<[0-9]>*|\.<[0-9]>+)[e(<[+-]>?<[0-9]>+)]?/; - Match a floating-point number maintainably, returning components:
-
# Perl 5 my $digit = qr/[0-9]/; my $sign_pat = qr/(?: [+-]? )/x; my $mant_pat = qr/(?: $digit+ \.? $digit* | \. digit+ )/x; my $expo_pat = qr/(?: $signpat $digit+ )? /x;($sign, $mantissa, $exponent) = $str =~ m/ ($sign_pat) ($mant_pat) (?: e ($expo_pat) )? /x;# Perl 6 rule sign { <[+-]>? } rule mantissa { <digit>+ [\. <digit>*] | \. <digit>+ } rule exponent { [ <sign> <digit>+ ]? }($sign, $mantissa, $exponent) = $str =~ m/ (<sign>) (<mantissa>) [e (<exponent>)]? /; - Match nested parentheses:
-
# Perl 5 our $parens = qr/ \( (?: (?>[^()]+) | (??{$parens}) )* \) /x; $str =~ m/$parens/;# Perl 6 $str =~ m/ \( [ <-[()]> + : | <self> ]* \) /; - Match nested parentheses maintainably:
-
# Perl 5 our $parens = qr/ \( # Match a literal '(' (?: # Start a non-capturing group (?> # Never backtrack through... [^()] + # Match a non-paren (repeatedly) ) # End of non-backtracking region | # Or (??{$parens}) # Recursively match entire pattern )* # Close group and match repeatedly \) # Match a literal ')' /x;$str =~ m/$parens/;# Perl 6 $str =~ m/ <'('> # Match a literal '(' [ # Start a non-capturing group <-[()]> + # Match a non-paren (repeatedly) : # ...and never backtrack that match | # Or <self> # Recursively match entire pattern ]* # Close group and match repeatedly <')'> # Match a literal ')' /;
Return to the Perl.com.





