Exegesis 5
by Damian Conway

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.

Lay it out for me

In Perl 6, each rule implicitly has the equivalent of the Perl 5 /x modifier turned on, so we could lay out (and annotate) that first pattern like this:

    $file = rx/ ^               # Must be at start of string
                <$hunk>         # Match what the rule in $hunk would match...
                        *       #          ...zero-or-more times
                $               # Must be at end of string (no newline allowed)
              /;

Because /x is the default, the whitespace in the pattern is ignored, which allows us to lay out the rule more readably. Comments are also honored, which enables us to document the rule sensibly. You can even use the closing delimiter in a comment safely:

    $caveat = rx/ Make \s+ sure \s+ to \s+ ask
                  \s+ (mum|mom)                 # handle UK/US spelling
                  \s+ (and|or)                  # handle and/or
                  \s+ dad \s+ first
                /;

Of course, the examples in this Exegesis don't represent good comments in general, since they document what is happening, rather than why.

The meanings of the ^ and * metacharacters are unchanged from Perl 5. However, the meaning of the $ metacharacter has changed slightly: it no longer allows an optional newline before the end of the string. If you want that behavior, then you need to specify it explicitly. For example, to match a line ending in digits: / \d+ \n? $/

The compensation is that, in Perl 6, a \n in a pattern matches a logical newline (that is, any of: "\015\012" or "\012" or "\015" or "\x85" or "\x2028"), rather than just a physical ASCII newline (i.e. just "\012"). And a \n will always try to match any kind of physical newline marker (not just the current system's favorite), so it correctly matches against strings that have been aggregated from multiple systems.
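Since Perl 6 rules in this form can't be run today, here is a rough Python sketch of the two behaviors just described. It assumes that Python's \Z anchor (which never tolerates a trailing newline) is a fair stand-in for the Perl 6 $, and it spells out the logical newline as an explicit alternation. The names LOGICAL_NEWLINE, strict_digits and lenient_digits are made up for the illustration.

```python
import re

# The logical newline listed above: CR LF, LF, CR, NEL (U+0085), LINE SEPARATOR (U+2028).
# CR LF comes first so that it is consumed as a single newline.
LOGICAL_NEWLINE = r"(?:\r\n|\n|\r|\x85|\u2028)"

# \Z anchors at the very end of the string, with no optional newline allowed.
strict_digits = re.compile(r"\d+\Z")

# The optional newline has to be requested explicitly, as in / \d+ \n? $/.
lenient_digits = re.compile(r"\d+" + LOGICAL_NEWLINE + r"?\Z")

print(bool(strict_digits.search("line 42\n")))       # False
print(bool(lenient_digits.search("line 42\n")))      # True
print(bool(lenient_digits.search("line 42\r\n")))    # True
print(bool(lenient_digits.search("line 42\u2028")))  # True
```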


Interpolate ye not ...

The really new bit in the $file rule is the <$hunk> element. It's a directive to grab whatever's in the $hunk variable (presumably another pattern) and attempt to match it at that point in the rule. The important point is that the contents of $hunk are only grabbed when the pattern matching mechanism actually needs to match against them, not when the rule is being constructed. So it's like the mysterious (??{...}) construct in Perl 5 regexes.
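The "grabbed only when needed" behavior can be imitated in Python by looking the subpattern up inside the matching function rather than when the function is defined. This is only an analogy, not how Perl 6 implements it; the subrules dictionary and match_file function are hypothetical stand-ins for $hunk and $file.

```python
import re

subrules = {"hunk": r"[a-z]+\n"}   # hypothetical stand-in for the $hunk variable

def match_file(text):
    # The subpattern is fetched here, at match time, not when match_file is defined.
    return re.fullmatch("(?:" + subrules["hunk"] + ")*", text) is not None

before = match_file("12\n34\n")    # the current hunk pattern only accepts letters
subrules["hunk"] = r"\d+\n"        # change the subrule after the "rule" was built
after = match_file("12\n34\n")     # the same rule now sees the new subpattern

print(before, after)               # False True
```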

The angle brackets themselves are a much more general mechanism in Perl 6 rules. They are the “metasyntactic markers” and replace the Perl 5 (?...) syntax. They are used to specify numerous other features of Perl 6 rules, many of which we will explore below.

Note that if we hadn't put the variable in angle-brackets, and had just written:

    rx/ ^  $hunk*  $ /;

then the contents of $hunk would still not be interpolated when the pattern was parsed. Once again, the pattern would grab the contents of the variable when it reached that point in its match. But, this time, without the angle brackets around $hunk, the pattern would try to match the contents of the variable as an atomic literal string (rather than as a subpattern). “Atomic” means that the * repetition quantifier applies to everything that's in $hunk, not just to the last character (as it does in Perl 5).

In other words, a raw variable in a Perl 6 pattern is matched as if it was a Perl 5 regex in which the interpolation had been quotemeta'd and then placed in a pair of noncapturing parentheses. That's really handy in something like:

    # Perl 6
    my $target = <>;                  # Get literal string to search for
    $text =~ m/ $target* /;           # Search for it as a literal

which in Perl 5 we'd have to write as:

    # Perl 5
    my $target = <>;                  # Get literal string to search for
    chomp $target;                    # No autochomping in Perl 5 
    $text =~ m/ (?:\Q$target\E)* /x;  # Search for it, quoting metas
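The same quote-then-group idea can be written in Python, which makes the "atomic" part easy to see. This is a sketch of the equivalence described above, with re.escape playing the role of quotemeta; literal_star is a name invented for the example.

```python
import re

def literal_star(target):
    # Quote the metacharacters, then group, so that * applies to the whole string.
    return re.compile("(?:" + re.escape(target) + ")*")

grouped = literal_star("a.b")
print(grouped.match("a.ba.ba.b").group())   # a.ba.ba.b
print(grouped.match("axb").group() == "")   # True: the dot is a literal, so only the empty match is found

# Naive interpolation, by contrast: the dot is a metacharacter and * binds to the final "b" only.
naive = re.compile("a.b" + "*")
print(naive.match("axbbb").group())         # axbbb
```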

Raw arrays and hashes interpolate as literals, too. For example, if we use an array in a Perl 6 pattern, then the matcher will attempt to match any of its elements (each as a literal). So:

    # Perl 6
    @cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
    $str =~ / @cmd \( .*? \) /;     # Match a cmd, followed by stuff in parens

is the same as:

    # Perl 5 
    @cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
    $cmd = join '|', map { quotemeta $_ } @cmd;
    $str =~ / (?:$cmd) \( .*? \) /x;

By the way, putting the array into angle brackets would cause the matcher to try to match each of the array elements as a pattern, rather than as a literal.
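For readers who want something they can run, the array-as-alternation behavior looks like this in Python. It is a sketch of the Perl 5 translation shown above, not of the Perl 6 matcher itself; the variable names are invented for the example.

```python
import re

cmds = ["get", "put", "try", "find", "copy", "fold", "spindle", "mutilate"]

# Each element is quoted as a literal, then the elements are joined as alternatives.
alternation = "|".join(re.escape(c) for c in cmds)
call = re.compile(r"(?:" + alternation + r")\(.*?\)")

print(call.search("please copy(a, b) now").group())   # copy(a, b)
print(call.search("delete(a)"))                       # None
```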


The incredible $hunk

The rule that <$hunk> tries to match against is the next one defined in the program. Here's the annotated version of it:

    $hunk = rx :i {                             # Case-insensitively...
        [                                       #   Start a non-capturing group
            <$linenum>                          #     Match the subrule in $linenum
            a                                   #     Match a literal 'a'
            ::                                  #     Commit to this alternative
            <$linerange>                        #     Match the subrule in $linerange
            \n                                  #     Match a newline
            <$appendline>                       #     Match the subrule in $appendline...
                          +                     #         ...one-or-more times
        |                                       #   Or...
          <$linerange> d :: <$linenum> \n       #     Match $linerange, 'd', $linenum, newline
          <$deleteline>+                        #     Then match $deleteline once-or-more
        |                                       #   Or...
          <$linerange> c :: <$linerange> \n     #     Match $linerange, 'c', $linerange, newline
          <$deleteline>+                        #     Then match $deleteline once-or-more
          --- \n                                #     Then match three '-' and a newline
          <$appendline>+                        #     Then match $appendline once-or-more
        ]                                       #   End of non-capturing group
      |                                         # Or...
        (                                       #   Start a capturing group
            \N*                                 #     Match zero-or-more non-newlines
        )                                       #     End of capturing group
        :::                                     #     Emphatically commit to this alternative
        { fail "Invalid diff hunk: $1" }        #     Then fail with an error msg
    };

The first thing to note is that, like a Perl 5 qr, a Perl 6 rx can take (almost) any delimiters we choose. The $hunk pattern uses {...}, but we could have used:

    rx/pattern/     # Standard
    rx[pattern]     # Alternative bracket-delimiter style
    rx<pattern>     # Alternative bracket-delimiter style
    rx«forme»       # Délimiteurs très chic
    rx>pattern<     # Inverted bracketing is allowed too (!)
    rx»Muster«      # Begrenzungen im korrekten Auftrag
    rx!pattern!     # Excited
    rx=pattern=     # Unusual
    rx?pattern?     # No special meaning in Perl 6
    rx#pattern#     # Careful with these: they disable internal comments

Modified modifiers

In fact, the only characters not permitted as rx delimiters are ':' and '('. That's because ':' is the character used to introduce pattern modifiers in Perl 6, and '(' is the character used to delimit any arguments that might be passed to those pattern modifiers.

In Perl 6, pattern modifiers are placed before the pattern, rather than after it. That makes life easier for the parser, since it doesn't have to go back and reinterpret the contents of a rule when it reaches the end and discovers a /s or /m or /i or /x. And it makes life easier for anyone reading the code -- for precisely the same reason.

The only modifier used in the $hunk rule is the :i (case-insensitivity) modifier, which works exactly as it does in Perl 5.

The other rule modifiers available in Perl 6 are:

:e or :each

This is the replacement for Perl 5's /g modifier. It causes a match (or substitution) to be attempted as many times as possible. The name was changed because “each” is shorter and clearer in intent than “globally”. And because the :each modifier can be combined with other modifiers (see below) in such a way that it's no longer “global” in its effect.

:x($count)

This modifier is like :e, in that it causes the match or substitution to be attempted repeatedly. However, unlike :e, it specifies exactly how many times the match must succeed. For example:

    "fee fi "       =~ m:x(3)/ (f\w+) /;  # fails
    "fee fi fo"     =~ m:x(3)/ (f\w+) /;  # succeeds (matches "fee","fi","fo")
    "fee fi fo fum" =~ m:x(3)/ (f\w+) /;  # succeeds (matches "fee","fi","fo")

Note that the repetition count doesn't have to be a constant:

    m:x($repetitions)/ pattern /

There is also a series of tidy abbreviations for all the constant cases:

    m:1x/ pattern /         # same as: m:x(1)/ pattern /
    m:2x/ pattern /         # same as: m:x(2)/ pattern /
    m:3x/ pattern /         # same as: m:x(3)/ pattern /
    # etc.
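To make the "exactly how many times" semantics concrete, here is a small Python approximation. It assumes that :x($count) means "take the first $count matches, and fail if there are fewer", which is what the three examples above suggest; match_x is a hypothetical helper, not part of any library.

```python
import re
from itertools import islice

def match_x(count, pattern, text):
    # Collect at most `count` matches, scanning left to right.
    found = [m.group(1) for m in islice(re.finditer(pattern, text), count)]
    # The whole operation fails unless the full quota was reached.
    return found if len(found) == count else None

print(match_x(3, r"(f\w+)", "fee fi "))         # None
print(match_x(3, r"(f\w+)", "fee fi fo"))       # ['fee', 'fi', 'fo']
print(match_x(3, r"(f\w+)", "fee fi fo fum"))   # ['fee', 'fi', 'fo']
```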

:nth($count)

This modifier causes a match or substitution to be attempted repeatedly, but to ignore the first $count-1 successful matches. For example:

    my $foo = "fee fi fo fum";
    $foo =~ m:nth(1)/ (f\w+) /;        # succeeds (matches "fee")
    $foo =~ m:nth(2)/ (f\w+) /;        # succeeds (matches "fi")
    $foo =~ m:nth(3)/ (f\w+) /;        # succeeds (matches "fo")
    $foo =~ m:nth(4)/ (f\w+) /;        # succeeds (matches "fum")
    $foo =~ m:nth(5)/ (f\w+) /;        # fails
    $foo =~ m:nth($n)/ (f\w+) /;       # depends on the numeric value of $n
    $foo =~ s:nth(3)/ (f\w+) /bar/;    # $foo now contains: "fee fi bar fum"

Again, there is also a series of abbreviations:

    $foo =~ m:1st/ (f\w+) /;           # succeeds (matches "fee")
    $foo =~ m:2nd/ (f\w+) /;           # succeeds (matches "fi")
    $foo =~ m:3rd/ (f\w+) /;           # succeeds (matches "fo")
    $foo =~ m:4th/ (f\w+) /;           # succeeds (matches "fum")
    $foo =~ m:5th/ (f\w+) /;           # fails
    $foo =~ s:3rd/ (f\w+) /bar/;       # $foo now contains: "fee fi bar fum"

By the way, Perl isn't going to be pedantic about these “ordinal” versions of repetition specifiers. If you're not a native English speaker, and you find :1th, :2th, :3th, :4th, etc., easier to remember, then that's perfectly OK.
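The :nth behavior is easy to mimic in Python by skipping ahead in the stream of matches. The helpers match_nth and sub_nth below are hypothetical, and sub_nth treats its replacement as a plain string (no backreferences), which is a simplification.

```python
import re
from itertools import islice

def match_nth(n, pattern, text):
    # Skip the first n-1 successful matches and return the n-th, or None.
    return next(islice(re.finditer(pattern, text), n - 1, None), None)

def sub_nth(n, pattern, replacement, text):
    # Replace only the n-th match; leave the text untouched if there is none.
    m = match_nth(n, pattern, text)
    if m is None:
        return text
    return text[:m.start()] + replacement + text[m.end():]

foo = "fee fi fo fum"
print(match_nth(2, r"f\w+", foo).group())   # fi
print(match_nth(5, r"f\w+", foo))           # None
print(sub_nth(3, r"f\w+", "bar", foo))      # fee fi bar fum
```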

The various types of repetition modifiers can also be combined by separating them with additional colons:

    my $foo = "fee fi fo feh far foo fum ";
    $foo =~ m:2nd:2x/ (f\w+) /;        # succeeds (matches "fi", "feh")
    $foo =~ m:each:2nd/ (f\w+) /;      # succeeds (matches "fi", "feh", "foo")
    $foo =~ m:x(2):nth(3)/ (f\w+) /;   # succeeds (matches "fo", "foo")
    $foo =~ m:each:3rd/ (f\w+) /;      # succeeds (matches "fo", "foo")
    $foo =~ m:2x:4th/ (f\w+) /;        # fails (not enough matches to satisfy :2x)
    $foo =~ m:4th:each/ (f\w+) /;      # succeeds (matches "feh")
    $foo =~ s:each:2nd/ (f\w+) /bar/;  # $foo now "fee bar fo bar far bar fum ";

Note that the order in which the two modifiers are specified doesn't matter.
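Reading the examples above, the combined forms appear to mean "every n-th match", optionally capped at (and requiring) a given count. The Python sketch below implements that reading, which is inferred from the examples rather than taken from a specification; every_nth and sub_every_nth are invented names.

```python
import re

def every_nth(n, pattern, text, count=None):
    # Keep matches number n, 2n, 3n, ... (1-based).
    hits = [m.group() for i, m in enumerate(re.finditer(pattern, text), 1) if i % n == 0]
    if count is not None:
        # With a count, exactly that many are required, as with :x.
        return hits[:count] if len(hits) >= count else None
    return hits

def sub_every_nth(n, pattern, replacement, text):
    seen = 0
    def swap(m):
        nonlocal seen
        seen += 1
        # Replace every n-th match, keep the others unchanged.
        return replacement if seen % n == 0 else m.group()
    return re.sub(pattern, swap, text)

foo = "fee fi fo feh far foo fum "
print(every_nth(2, r"f\w+", foo, count=2))    # ['fi', 'feh']
print(every_nth(2, r"f\w+", foo))             # ['fi', 'feh', 'foo']
print(every_nth(3, r"f\w+", foo, count=2))    # ['fo', 'foo']
print(every_nth(4, r"f\w+", foo, count=2))    # None
print(every_nth(4, r"f\w+", foo))             # ['feh']
print(sub_every_nth(2, r"f\w+", "bar", foo))  # fee bar fo bar far bar fum
```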

:p5 or :perl5

This modifier causes Perl 6 to interpret the contents of a rule as a regular expression in Perl 5 syntax. This is mainly provided as a transitional aid for porting Perl 5 code. And to mollify the curmudgeonly.

:w or :word

This modifier causes whitespace appearing in the pattern to match optional whitespace in the string being matched. For example, instead of having to cope with optional whitespace explicitly:

    $cmd =~ m/ \s* <keyword> \s* \( [\s* <arg> \s* ,?]* \s* \)/;

we can just write:

    $cmd =~ m:w/ <keyword> \( [ <arg> ,?]* \)/;

The :w modifier is also smart enough to detect those cases where the whitespace should actually be mandatory. For example:

    $str =~ m:w/a symmetric ally/

is the same as:

    $str =~ m/a \s+ symmetric \s+ ally/

rather than:

    $str =~ m/a \s* symmetric \s* ally/

So it won't accidentally match strings like "asymmetric ally" or "asymmetrically".
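One way to picture what :w does is as a rewrite of the pattern's own whitespace. The Python sketch below uses a deliberately crude rule, assumed purely from the examples above: whitespace between two word characters becomes mandatory (\s+), and any other whitespace becomes optional (\s*). The word_mode function is hypothetical and ignores many cases a real implementation would have to handle.

```python
import re

def word_mode(pattern):
    # Rewrite each run of pattern whitespace as a whitespace-matching token.
    def gap(m):
        before = pattern[m.start() - 1] if m.start() > 0 else ""
        after = pattern[m.end()] if m.end() < len(pattern) else ""
        wordy = re.fullmatch(r"\w", before) and re.fullmatch(r"\w", after)
        # Whitespace is mandatory only between two word characters.
        return r"\s+" if wordy else r"\s*"
    return re.sub(r"\s+", gap, pattern)

strict = re.compile(word_mode("a symmetric ally"))
print(strict.pattern)                               # a\s+symmetric\s+ally
print(strict.search("asymmetrically"))              # None
print(bool(strict.search("a  symmetric\tally")))    # True

loose = re.compile(word_mode(r"\( x , y \)"))
print(loose.pattern)                                # \(\s*x\s*,\s*y\s*\)
print(bool(loose.search("f(x,y)")))                 # True
```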

:any

This modifier causes the rule to match a given string in every possible way, simultaneously, and then return all the possible matches. For example:

    my $str = "ahhh";
    @matches =  $str =~ m/ah*/;         # returns "ahhh"
    @matches =  $str =~ m:any/ah*/;     # returns "ahhh", "ahh", "ah", "a"
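A brute-force Python sketch gives the flavor of :any: try every start and end position, and keep each substring that the pattern matches in full. This is an assumption about the observable result only (it reports distinct substrings, longest first from each starting point) and says nothing about how Perl 6 would do it efficiently; match_any is a made-up name.

```python
import re

def match_any(pattern, text):
    compiled = re.compile(pattern)
    results = []
    for start in range(len(text) + 1):
        # Try the longest candidate first, then progressively shorter ones.
        for end in range(len(text), start - 1, -1):
            if compiled.fullmatch(text, start, end):
                results.append(text[start:end])
    return results

print(match_any(r"ah*", "ahhh"))   # ['ahhh', 'ahh', 'ah', 'a']
```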

:u0, :u1, :u2, :u3

These modifiers specify how the rule matches the dot (.) metacharacter against Unicode data. If :u0 is specified, then dot matches a single byte; if :u1 is specified, then dot matches a single codepoint (i.e. one or more bytes representing a single Unicode “character”). If :u2 is specified, then dot matches a single grapheme (i.e. a base codepoint followed by zero or more modifier codepoints, such as accents). If :u3 is specified, then dot matches an appropriate “something” in a language-dependent manner.

It's OK to ignore these modifiers if you're not using Unicode (and maybe even if you are). As usual, Perl will try to do the right thing. To that end, the default behavior of rules is :u2, unless an overriding pragma (e.g. use bytes) is in effect.
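The difference between the first three levels is easiest to see by counting the same text three ways. The Python sketch below assumes UTF-8 for the byte count and uses a crude grapheme rule (a codepoint starts a new grapheme unless it is a combining mark), which is a simplification of real grapheme segmentation.

```python
import unicodedata

text = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

byte_units = len(text.encode("utf-8"))   # what dot would step over under :u0
codepoints = len(text)                   # what dot would step over under :u1
# Crude grapheme count (:u2): count the codepoints that are not combining marks.
graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(byte_units, codepoints, graphemes)   # 3 2 1
```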

Note that the /s, /m, and /e modifiers are no longer available. This is because they're no longer needed. The /s isn't needed because the . (dot) metacharacter now matches newlines as well. When we want to match “anything except a newline”, we now use the new \N metatoken (i.e. “opposite of \n”).

The /m modifier isn't required, because ^ and $ always mean start and end of string, respectively. To match the start and end of a line, we use the new ^^ and $$ metatokens instead.
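For comparison, here is roughly how those Perl 6 defaults map onto Python's re module. The mapping is an approximation chosen for this illustration: re.DOTALL for the new dot, [^\n] for \N (physical newlines only), \A and \Z for ^ and $, and re.MULTILINE anchors for ^^ and $$.

```python
import re

text = "first line\nsecond line"

# Perl 6 dot matches newlines too, like a dot under re.DOTALL.
print(re.match(r".*", text, re.DOTALL).group() == text)   # True

# \N is "anything except a newline"; [^\n] is the closest plain equivalent here.
print(re.match(r"[^\n]*", text).group())                  # first line

# ^ and $ always mean start and end of string, like \A and \Z.
print(re.findall(r"\A\w+", text))                         # ['first']
print(re.findall(r"\w+\Z", text))                         # ['line']

# ^^ and $$ mean start and end of line, like ^ and $ under re.MULTILINE.
print(re.findall(r"^\w+", text, re.MULTILINE))            # ['first', 'second']
print(re.findall(r"\w+$", text, re.MULTILINE))            # ['line', 'line']
```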

The /e modifier is no longer needed, because Perl 6 provides the $(...) string interpolator (as described in Apocalypse 2). So a substitution such as:

    # Perl 5
    s/(\w+)/ get_val_for($1) /e;

becomes just:

    # Perl 6
    s/(\w+)/$( get_val_for($1) )/;
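The same "compute the replacement from the match" idea exists in Python, where the replacement can simply be a function. In the sketch below, get_val_for and the values table are hypothetical (the article never defines get_val_for); unknown words are returned unchanged.

```python
import re

values = {"name": "Larry", "lang": "Perl"}   # hypothetical lookup table

def get_val_for(key):
    # Hypothetical helper: look the word up, or leave it as it is.
    return values.get(key, key)

# The function is called once per match, and its result becomes the replacement.
result = re.sub(r"(\w+)", lambda m: get_val_for(m.group(1)), "name wrote lang")
print(result)   # Larry wrote Perl
```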
