Exegesis 5
by Damian Conway
Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.
Lay it out for me
In Perl 6, each rule implicitly has the equivalent of the Perl 5 /x modifier
turned on, so we could lay out (and annotate) that first pattern like this:
$file = rx/ ^         # Must be at start of string
            <$hunk>   # Match what the rule in $hunk would match...
            *         # ...zero-or-more times
            $         # Must be at end of string (no newline allowed)
          /;
Because /x is the default, the whitespace in the pattern is ignored,
which allows us to lay out the rule more readably. Comments are also honored,
which enables us to document the rule sensibly. You can even use the closing
delimiter in a comment safely:
$caveat = rx/ Make \s+ sure \s+ to \s+ ask
              \s+ (mum|mom)             # handle UK/US spelling
              \s+ (and|or)              # handle and/or
              \s+ dad \s+ first
            /;
Of course, the examples in this Exegesis don't represent good comments in general, since they document what is happening, rather than why.
The meanings of the ^ and * metacharacters are unchanged
from Perl 5. However, the meaning of the $ metacharacter has
changed slightly: it no longer allows an optional newline before the end
of the string. If you want that behavior, then you need to specify it
explicitly. For example, to match a line ending in digits:
/ \d+ \n? $/
The compensation is that, in Perl 6, a \n in a pattern matches a logical
newline (that is any of: "\015\012" or "\012" or "\015"
or "\x85" or "\x2028"), rather than just a
physical ASCII newline (i.e. just "\012"). And a \n will always
try to match any kind of physical newline marker (not just the current system's
favorite), so it correctly matches against strings that have been
aggregated from multiple systems.
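For readers who want to try these two changes today, here is a rough Perl 5 sketch. It assumes that Perl 5's \z (true end of string) is a fair stand-in for the Perl 6 $, and it builds a hypothetical $logical_nl pattern from the five sequences listed above. The variable names are illustrative and do not come from the original article.

```perl
use strict;
use warnings;

# Perl 5's $ tolerates one trailing newline; \z does not.
my $line     = "order 66\n";
my $lenient  = ($line =~ /\d+$/)     ? 1 : 0;   # Perl 5 $: matches before the final "\n"
my $strict   = ($line =~ /\d+\z/)    ? 1 : 0;   # Perl 6-style $: true end of string only
my $explicit = ($line =~ /\d+\n?\z/) ? 1 : 0;   # optional newline, spelled out

# A hypothetical "logical newline": CRLF comes first so it is consumed as one unit.
my $logical_nl = qr/(?:\015\012|\012|\015|\x85|\x{2028})/;
my @lines      = split $logical_nl, "unix\012dos\015\012oldmac\015last";
```

Putting the two-character CRLF alternative first matters: otherwise a lone "\015" would match first and the following "\012" would count as a second line break.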
Interpolate ye not ...
The really new bit in the $file rule is the <$hunk> element.
It's a directive to grab whatever's in the $hunk variable (presumably
another pattern) and attempt to match it at that point in the rule. The
important point is that the contents of $hunk are only grabbed when
the pattern matching mechanism actually needs to match against them,
not when the rule is being constructed. So it's like the mysterious
(??{...}) construct in Perl 5 regexes.
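To see that late binding in action, here is a minimal Perl 5 sketch using (??{...}). The package variable $hunk and the toy patterns assigned to it are stand-ins chosen for illustration; they are not the diff rules developed in this Exegesis.

```perl
use strict;
use warnings;

our $hunk;    # hypothetical stand-in for the article's $hunk

# The (??{ ... }) block runs during matching, so $hunk is read at match time,
# not when $file is compiled.
my $file = qr/ \A (?: (??{ $hunk }) )* \z /x;

$hunk = qr/ab/;
my $first  = ("ababab" =~ $file) ? 1 : 0;   # matches: three repetitions of "ab"

$hunk = qr/cd/;                             # change the subpattern after $file was built
my $second = ("ababab" =~ $file) ? 1 : 0;   # no longer matches
my $third  = ("cdcd"   =~ $file) ? 1 : 0;   # matches against the new subpattern
```

The same compiled $file gives different answers before and after $hunk is reassigned, which is the behavior the text describes for <$hunk>.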
The angle brackets themselves are a much more general mechanism in Perl 6 rules.
They are the “metasyntactic markers” and replace the Perl 5 (?...) syntax.
They are used to specify numerous other features of Perl 6 rules, many of which
we will explore below.
Note that if we hadn't put the variable in angle-brackets, and had just written:
rx/ ^ $hunk* $ /;
then the contents of $hunk would still not be interpolated when
the pattern was parsed. Once again, the pattern would grab the
contents of the variable when it reached that point in its match.
But, this time, without the angle brackets around $hunk, the
pattern would try to match the contents of the variable as an
atomic literal string (rather than as a subpattern). “Atomic”
means that the * repetition quantifier applies to everything
that's in $hunk, not just to the last character
(as it does in Perl 5).
In other words, a raw variable in a Perl 6 pattern is matched
as if it was a Perl 5 regex in which the interpolation had been
quotemeta'd and then placed in a pair of noncapturing parentheses.
That's really handy in something like:
# Perl 6
my $target = <>; # Get literal string to search for
$text =~ m/ $target* /; # Search for it as a literal
which in Perl 5 we'd have to write as:
# Perl 5
my $target = <>; # Get literal string to search for
chomp $target; # No autochomping in Perl 5
$text =~ m/ (?:\Q$target\E)* /x; # Search for it, quoting metas
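A small Perl 5 experiment may make the “atomic literal” point more concrete. The strings below are made-up test data, and the sketch only contrasts plain Perl 5 interpolation with the quoted-and-grouped form shown above.

```perl
use strict;
use warnings;

my $target = "ab";
my $text   = "ababab!";

# Plain interpolation: the pattern becomes /^(ab*)/, so * binds to "b" only.
my ($loose)  = $text =~ /^($target*)/;

# Quoted and grouped: * applies to the whole of $target, taken as a literal.
my ($atomic) = $text =~ /^((?:\Q$target\E)*)/;

# Quoting also stops metacharacters in the data from acting as pattern syntax.
my $meta      = "a.c";
my $as_regex  = ("abc" =~ /^$meta\z/)     ? 1 : 0;   # "." matches the "b"
my $as_string = ("abc" =~ /^\Q$meta\E\z/) ? 1 : 0;   # "." must be a literal dot
```

Here $loose ends up as "ab" while $atomic captures "ababab", which is the difference the Perl 6 default is meant to remove.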
Raw arrays and hashes interpolate as literals, too. For example, if we use an array in a Perl 6 pattern, then the matcher will attempt to match any of its elements (each as a literal). So:
# Perl 6
@cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
$str =~ / @cmd \( .*? \) /; # Match a cmd, followed by stuff in parens
is the same as:
# Perl 5
@cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
$cmd = join '|', map { quotemeta $_ } @cmd;
$str =~ / (?:$cmd) \( .*? \) /;
By the way, putting the array into angle brackets would cause the matcher to try to match each of the array elements as a pattern, rather than as a literal.
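The contrast between the two array forms can be approximated in Perl 5 as follows. This is only a sketch: it assumes the angle-bracketed form behaves like an alternation of the elements compiled as subpatterns, and the @cmd entries here are hypothetical ones chosen because they contain metacharacters.

```perl
use strict;
use warnings;

my @cmd = ('ge+t', 'p.t');    # hypothetical entries containing metacharacters

my $literal = join '|', map { quotemeta $_ } @cmd;   # elements as literal strings
my $pattern = join '|', map { qr/$_/ } @cmd;         # elements as subpatterns (assumed)

my $lit_exact = ('ge+t'  =~ /^(?:$literal)\z/) ? 1 : 0;   # literal "ge+t" matches itself
my $lit_geeet = ('geeet' =~ /^(?:$literal)\z/) ? 1 : 0;   # but not "geeet"
my $pat_exact = ('ge+t'  =~ /^(?:$pattern)\z/) ? 1 : 0;   # as a pattern, "+" is a quantifier
my $pat_geeet = ('geeet' =~ /^(?:$pattern)\z/) ? 1 : 0;   # so "geeet" matches instead
```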
The incredible $hunk
The rule that <$hunk> tries to match against is the next one defined
in the program. Here's the annotated version of it:
$hunk = rx :i {                             # Case-insensitively...
    [                                       # Start a non-capturing group
        <$linenum>                          # Match the subrule in $linenum
        a                                   # Match a literal 'a'
        ::                                  # Commit to this alternative
        <$linerange>                        # Match the subrule in $linerange
        \n                                  # Match a newline
        <$appendline>                       # Match the subrule in $appendline...
        +                                   # ...one-or-more times
      |                                     # Or...
        <$linerange> d :: <$linenum> \n     # Match $linerange, 'd', $linenum, newline
        <$deleteline>+                      # Then match $deleteline once-or-more
      |                                     # Or...
        <$linerange> c :: <$linerange> \n   # Match $linerange, 'c', $linerange, newline
        <$deleteline>+                      # Then match $deleteline once-or-more
        --- \n                              # Then match three '-' and a newline
        <$appendline>+                      # Then match $appendline once-or-more
    ]                                       # End of non-capturing group
  |                                         # Or...
    (                                       # Start a capturing group
        \N*                                 # Match zero-or-more non-newlines
    )                                       # End of capturing group
    :::                                     # Emphatically commit to this alternative
    { fail "Invalid diff hunk: $1" }        # Then fail with an error msg
};
The first thing to note is that, like a Perl 5 qr, a Perl 6 rx can take
(almost) any delimiters we choose. The $hunk pattern uses {...}, but
we could have used:
rx/pattern/    # Standard
rx[pattern]    # Alternative bracket-delimiter style
rx<pattern>    # Alternative bracket-delimiter style
rx«forme»      # Délimiteurs très chic
rx>pattern<    # Inverted bracketing is allowed too (!)
rx»Muster«     # Begrenzungen im korrekten Auftrag
rx!pattern!    # Excited
rx=pattern=    # Unusual
rx?pattern?    # No special meaning in Perl 6
rx#pattern#    # Careful with these: they disable internal comments
Modified modifiers
In fact, the only characters not permitted as rx delimiters are
':' and '('. That's because ':' is the character used to
introduce pattern modifiers in Perl 6, and '(' is the character used
to delimit any arguments that might be passed to those pattern modifiers.
In Perl 6, pattern modifiers are placed before the pattern, rather
than after it. That makes life easier for the parser, since it doesn't
have to go back and reinterpret the contents of a rule when it reaches
the end and discovers a /s or /m or /i or /x. And it makes life
easier for anyone reading the code -- for precisely the same reason.
The only modifier used in the $hunk rule is the :i (case-insensitivity)
modifier, which works exactly as it does in Perl 5.
The other rule modifiers available in Perl 6 are:
:e or :each
This is the replacement for Perl 5's /g modifier. It causes a match (or substitution) to be attempted as many times as possible. The name was changed because “each” is shorter and clearer in intent than “globally”. And because the :each modifier can be combined with other modifiers (see below) in such a way that it's no longer “global” in its effect.
:x($count)
This modifier is like :e, in that it causes the match or substitution to be attempted repeatedly. However, unlike :e, it specifies exactly how many times the match must succeed. For example:
"fee fi "       =~ m:x(3)/ (f\w+) /;  # fails
"fee fi fo"     =~ m:x(3)/ (f\w+) /;  # succeeds (matches "fee","fi","fo")
"fee fi fo fum" =~ m:x(3)/ (f\w+) /;  # succeeds (matches "fee","fi","fo")
Note that the repetition count doesn't have to be a constant:
m:x($repetitions)/ pattern /
There is also a series of tidy abbreviations for all the constant cases:
m:1x/ pattern /   # same as: m:x(1)/ pattern /
m:2x/ pattern /   # same as: m:x(2)/ pattern /
m:3x/ pattern /   # same as: m:x(3)/ pattern /
# etc.
:nth($count)
This modifier causes a match or substitution to be attempted repeatedly, but to ignore the first $count-1 successful matches. For example:
my $foo = "fee fi fo fum";
$foo =~ m:nth(1)/ (f\w+) /;         # succeeds (matches "fee")
$foo =~ m:nth(2)/ (f\w+) /;         # succeeds (matches "fi")
$foo =~ m:nth(3)/ (f\w+) /;         # succeeds (matches "fo")
$foo =~ m:nth(4)/ (f\w+) /;         # succeeds (matches "fum")
$foo =~ m:nth(5)/ (f\w+) /;         # fails
$foo =~ m:nth($n)/ (f\w+) /;        # depends on the numeric value of $n
$foo =~ s:nth(3)/ (f\w+) /bar/;     # $foo now contains: "fee fi bar fum"
Again, there is also a series of abbreviations:
$foo =~ m:1st/ (f\w+) /;            # succeeds (matches "fee")
$foo =~ m:2nd/ (f\w+) /;            # succeeds (matches "fi")
$foo =~ m:3rd/ (f\w+) /;            # succeeds (matches "fo")
$foo =~ m:4th/ (f\w+) /;            # succeeds (matches "fum")
$foo =~ m:5th/ (f\w+) /;            # fails
$foo =~ s:3rd/ (f\w+) /bar/;        # $foo now contains: "fee fi bar fum"
By the way, Perl isn't going to be pedantic about these “ordinal” versions of repetition specifiers. If you're not a native English speaker, and you find :1th, :2th, :3th, :4th, etc., easier to remember, then that's perfectly OK.
The various types of repetition modifiers can also be combined by separating them with additional colons:
my $foo = "fee fi fo feh far foo fum ";
$foo =~ m:2nd:2x/ (f\w+) /;         # succeeds (matches "fi", "feh")
$foo =~ m:each:2nd/ (f\w+) /;       # succeeds (matches "fi", "feh", "foo")
$foo =~ m:x(2):nth(3)/ (f\w+) /;    # succeeds (matches "fo", "foo")
$foo =~ m:each:3rd/ (f\w+) /;       # succeeds (matches "fo", "foo")
$foo =~ m:2x:4th/ (f\w+) /;         # fails (not enough matches to satisfy :2x)
$foo =~ m:4th:each/ (f\w+) /;       # succeeds (matches "feh")
$foo =~ s:each:2nd/ (f\w+) /bar/;   # $foo now "fee bar fo bar far bar fum ";
Note that the order in which the two modifiers are specified doesn't matter.
:p5 or :perl5
This modifier causes Perl 6 to interpret the contents of a rule as a regular expression in Perl 5 syntax. This is mainly provided as a transitional aid for porting Perl 5 code. And to mollify the curmudgeonly.
:w or :word
This modifier causes whitespace appearing in the pattern to match optional whitespace in the string being matched. For example, instead of having to cope with optional whitespace explicitly:
$cmd =~ m/ \s* <keyword> \s* \( [\s* <arg> \s* ,?]* \s* \)/;
we can just write:
$cmd =~ m:w/ <keyword> \( [ <arg> ,?]* \)/;
The :w modifier is also smart enough to detect those cases where the whitespace should actually be mandatory. For example:
$str =~ m:w/a symmetric ally/
is the same as:
$str =~ m/a \s+ symmetric \s+ ally/
rather than:
$str =~ m/a \s* symmetric \s* ally/
So it won't accidentally match strings like "asymmetric ally" or "asymmetrically".
:any
This modifier causes the rule to match a given string in every possible way, simultaneously, and then return all the possible matches. For example:
my $str = "ahhh";
@matches = $str =~ m/ah*/;          # returns "ahhh"
@matches = $str =~ m:any/ah*/;      # returns "ahhh", "ahh", "ah", "a"
:u0, :u1, :u2, :u3
These modifiers specify how the rule matches the dot (.) metacharacter against Unicode data. If :u0 is specified, then dot matches a single byte; if :u1 is specified, then dot matches a single codepoint (i.e. one or more bytes representing a single Unicode “character”). If :u2 is specified, then dot matches a single grapheme (i.e. a base codepoint followed by zero or more modifier codepoints, such as accents). If :u3 is specified, then dot matches an appropriate “something” in a language-dependent manner.
It's OK to ignore this modifier if you're not using Unicode (and maybe even if you are). As usual, Perl will try to do the right thing. To that end, the default behavior of rules is :u2, unless an overriding pragma (e.g. use bytes) is in effect.
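For comparison, the repetition modifiers described above can be roughly emulated in Perl 5 by collecting every match with /g and then selecting from the list. This is a sketch of the described results, not of how Perl 6 implements them; the variable names are invented for the example.

```perl
use strict;
use warnings;

my $foo = "fee fi fo fum";

# :each ~ collect every match with /g in list context
my @matches = $foo =~ /(f\w+)/g;              # "fee", "fi", "fo", "fum"

# :x(3) ~ succeed only if at least three matches exist, then keep three
my @x3 = @matches >= 3 ? @matches[0 .. 2] : ();

# :nth(3) ~ pick the third successful match (undef if there are fewer)
my $third = $matches[2];

# s:nth(3) ~ count matches inside an /e replacement and change only the third
my $count = 0;
(my $changed = $foo) =~ s/(f\w+)/ ++$count == 3 ? "bar" : $1 /ge;

# :each:2nd ~ keep every second match
my @all    = "fee fi fo feh far foo fum " =~ /(f\w+)/g;
my @every2 = @all[ grep { $_ % 2 == 1 } 0 .. $#all ];
```

The amount of bookkeeping needed even for these simple cases is a fair argument for having the modifiers built in.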
Note that the /s, /m, and /e modifiers are no longer available.
This is because they're no longer needed. The /s isn't needed because
the . (dot) metacharacter now matches newlines as well. When we want
to match “anything except a newline”, we now use the new \N metatoken
(i.e. “opposite of \n”).
The /m modifier isn't required, because ^ and $ always mean start and
end of string, respectively. To match the start and end of a line, we use
the new ^^ and $$ metatokens instead.
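The following Perl 5 sketch lines up the old modifiers with the new metatokens. It assumes Perl 5.12 or later, where \N is available, and uses made-up sample text. Each comment names the Perl 6 construct that the Perl 5 line approximates.

```perl
use strict;
use warnings;

my $text = "alpha\nbeta\n";

my ($whole)      = $text =~ /\A(.+)\z/s;   # Perl 6 "."  ~ Perl 5 "." under /s
my ($first_line) = $text =~ /\A(\N+)/;     # \N: anything except a newline
my @line_starts  = $text =~ /^(\w)/mg;     # Perl 6 ^^   ~ Perl 5 ^ under /m
my @line_ends    = $text =~ /(\w)$/mg;     # Perl 6 $$   ~ Perl 5 $ under /m
```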
The /e modifier is no longer needed, because Perl 6 provides the
$(...) string interpolator (as described in Apocalypse 2). So a
substitution such as:
# Perl 5
s/(\w+)/ get_val_for($1) /e;
becomes just:
# Perl 6
s/(\w+)/$( get_val_for($1) )/;
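Perl 5 already offers a clumsier way to do the same thing: the @{[ ... ]} idiom interpolates the result of an expression into a string, so it can stand in for /e. The sketch below compares the two. The body of get_val_for is a hypothetical stand-in, since the article doesn't define it.

```perl
use strict;
use warnings;

sub get_val_for { return uc $_[0] }   # hypothetical stand-in for the article's function

# Perl 5 with /e: the replacement is evaluated as code.
(my $with_e = "foo bar") =~ s/(\w+)/ get_val_for($1) /ge;

# Perl 5 without /e: interpolate the call's result into the replacement string.
(my $interp = "foo bar") =~ s/(\w+)/@{[ get_val_for($1) ]}/g;
```

Both substitutions produce the same string; the interpolating form is the one that corresponds to the Perl 6 $(...) version.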


