Exegesis 5
by Damian Conway

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.

Take no prisoners

The first character of the $hunk rule is an opening square bracket. In Perl 5, that denoted the start of a character class, but not in Perl 6. In Perl 6, square brackets mark the boundaries of a noncapturing group. That is, a pair of square brackets in Perl 6 are the same as a (?:...) in Perl 5, but less line-noisy.

By the way, to get a character class in Perl 6, we need to put the square brackets inside a pair of metasyntactic angle brackets. So the Perl 5:

    # Perl 5
    / [A-Za-z] [0-9]+ /x          # An A-Z or a-z, followed by digits

would become in Perl 6:

    # Perl 6
    / <[A-Za-z]> <[0-9]>+ /       # An A-Z or a-z, followed by digits

The Perl 5 complemented character class:

    # Perl 5
    / [^A-Za-z]+ /x               # One-or-more chars-that-aren't-A-Z-or-a-z

becomes in Perl 6:

    # Perl 6
    / <-[A-Za-z]>+ /              #  One-or-more chars-that-aren't-A-Z-or-a-z

The external minus sign is used (instead of an internal caret), because Perl 6 allows proper set operations on character classes, and the minus sign is the “difference” operator. So we could also create:

    # Perl 6
    / < <alpha> - [A-Za-z] >+ /   # All alphabetics except A-Z or a-z
                                  # (i.e. the accented alphabetics)

Explicit character classes were deliberately made a little less convenient in Perl 6, because they're generally a bad idea in a Unicode world. For example, the [A-Za-z] character class in the above examples won't even match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham, Cherokee, or Klingon.
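The same limitation is easy to observe in other regex engines. The following is a minimal Python sketch, not Perl 6: it assumes that `[^\W\d_]` is an acceptable rough stand-in for `<alpha>`, and it uses a negative lookahead in place of the set-difference syntax, which Python's `re` module does not have.

```python
import re

text = "aéøZ1_"

# ASCII-only class: misses the accented letters entirely.
ascii_letters = re.findall(r"[A-Za-z]", text)

# Rough stand-in for <alpha>: a word character that is neither a digit nor '_'.
ALPHA = r"[^\W\d_]"
all_letters = re.findall(ALPHA, text)

# Rough stand-in for < <alpha> - [A-Za-z] >: a negative lookahead removes
# A-Z and a-z from the set of alphabetics.
non_ascii_letters = re.findall(rf"(?![A-Za-z]){ALPHA}", text)

print(ascii_letters)      # ['a', 'Z']
print(all_letters)        # ['a', 'é', 'ø', 'Z']
print(non_ascii_letters)  # ['é', 'ø']
```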


Meanwhile, back at the $hunk ...

The noncapturing group of the $hunk pattern groups together three alternatives, separated by | metacharacters (as in Perl 5). The first alternative:

    <$linenum> a :: <$linerange>
    \n                         
    <$appendline>+

grabs whatever is in the $linenum variable, treats it as a subpattern, and attempts to match against it. It then matches a literal letter 'a' (or an 'A', because of the :i modifier on the rule), then whatever the contents of the $linerange variable match, then a newline. Finally, it tries to match whatever the pattern in $appendline would match, one or more times.

But what about that double-colon after the a? Shouldn't the pattern have tried to match two colons at that point?


This or nothing

Actually, no. The double-colon is a new Perl 6 pattern-control structure. It has no effect (and is ignored) when the pattern is successfully matching, but if the pattern match should fail, and consequently backtrack over the double-colon -- for example, to try to rematch an earlier repetition one fewer times -- the double-colon causes the entire surrounding group (i.e. the surrounding [...] in this case) to fail as well.

That's a useful optimization in this case because, if we match a line number followed by an 'a' but subsequently fail, then there's no point even trying either of the other two alternatives in the same group. Because we found an 'a', there's no chance we could match a 'd' or a 'c' instead.

So, in general, a double-colon means: “At this point I'm committed to this alternative within the current group -- don't bother with the others if this one fails after this point”.
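To make the "committed to this alternative" idea concrete, here is a rough Python sketch of the three hunk headers only (the body lines are left out). It is a hand-rolled approximation, not how a Perl 6 engine would work: the names `GroupFail`, `alternative`, and `match_header` are invented for this illustration, and an exception stands in for backtracking over the double-colon.

```python
import re

class GroupFail(Exception):
    """An alternative failed after passing its '::' commit point."""

LINENUM = r"\d+"
LINERANGE = rf"{LINENUM}(?:,{LINENUM})?"

def alternative(prefix, op, suffix):
    """Build a matcher for '<prefix> op :: <suffix>' on one header line."""
    head = re.compile(rf"({prefix}){op}", re.IGNORECASE)
    tail = re.compile(rf"({suffix})$")

    def match(line):
        m = head.match(line)
        if m is None:
            return None              # failed before '::': try the next alternative
        t = tail.match(line, m.end())
        if t is None:
            raise GroupFail(op)      # failed after '::': the whole group fails
        return m.group(1), op, t.group(1)

    return match

ALTERNATIVES = [
    ("a", alternative(LINENUM, "a", LINERANGE)),
    ("d", alternative(LINERANGE, "d", LINENUM)),
    ("c", alternative(LINERANGE, "c", LINERANGE)),
]

def match_header(line):
    """Return (result, alternatives tried) for one hunk header line."""
    tried = []
    try:
        for op, alt in ALTERNATIVES:
            tried.append(op)
            result = alt(line)
            if result is not None:
                return result, tried
    except GroupFail:
        pass                         # skip every remaining alternative
    return None, tried

print(match_header("1,2d3"))   # (('1,2', 'd', '3'), ['a', 'd'])
print(match_header("5a,"))     # (None, ['a'])
```

In the second call the 'a' was found, so the 'd' and 'c' alternatives are never attempted, which is exactly the saving the double-colon buys.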

There are other control directives like this, too. A single colon means: “Don't bother backtracking into the previous element”. That's useful in a pattern like:

    rx:w/ $keyword [-full|-quick|-keep]+ : end /

Suppose we successfully match the keyword (as a literal, by the way) and one or more of the three options, but then fail to match 'end'. In that case, there's no point backtracking and trying to match one fewer option, and still failing to find an 'end'. And then backtracking another option, and failing again, etc. By using the colon after the repetition, we tell the matcher to give up after the first attempt.

However, the single colon isn't just a “Greed is Good” operator. It's much more like a “Resistance is Futile” operator. That is, if the preceding repetition had been non-greedy instead:

    rx:w/ $keyword [-full|-quick|-keep]+? : end /

then backtracking over the colon would prevent the +? from attempting to match more options. Note that this means that x+?: is just a baroque way of matching exactly one repetition of x, since the non-greedy repetition initially tries to match the minimal number of times (i.e. once) and the trailing colon then prevents it from backtracking and trying longer matches. Likewise, x*?: and x??: are arcane ways of matching exactly zero repetitions of x.

Generally, though, a single colon tells the pattern matcher that there's no point trying any other match on the preceding repetition, because retrying (whether more or fewer repetitions) would just waste time and would still fail.
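The effect of the single colon can be approximated in Python, which may help in seeing why it is not merely a greediness switch. This is a sketch under assumptions: the helper `atomic` and its group name are invented here, and it relies on the common lookahead-plus-backreference idiom (Python 3.11 and later also offer `(?>...)` and possessive quantifiers such as `x++` for the greedy case).

```python
import re

def atomic(pattern, name="once"):
    """Wrap pattern so that, once matched, it is never backtracked into.

    A lookahead is not re-entered after it succeeds, so capturing inside it
    and then consuming the captured text freezes the first match.
    """
    return rf"(?=(?P<{name}>{pattern}))(?P={name})"

# Greedy repetition: ordinary backtracking gives one digit back to match '5'.
print(bool(re.fullmatch(r"\d+5", "1235")))                # True
# With the "colon" the digits are never given back, so the match fails.
print(bool(re.fullmatch(atomic(r"\d+") + "5", "1235")))   # False

# Non-greedy repetition: x+? can normally grow to two x's to reach the 'y'.
print(bool(re.match(r"x+?y", "xxy")))                     # True
# Frozen after its first (minimal) attempt, it matches exactly one x.
print(bool(re.match(atomic(r"x+?") + "y", "xxy")))        # False
```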

There's also a three-colon directive. Three colons means: “If we have to backtrack past here, cause the entire rule to fail” (i.e. not just this group). If the double-colon in $hunk had been triple:

    <$linenum> a ::: <$linerange>
    \n                         
    <$appendline>+

then matching a line number and an 'a' and subsequently failing would cause the entire $hunk rule to fail immediately (though the $file rule that invoked it might still match successfully in some other way).

So, in general, a triple-colon specifies: “At this point I'm committed to this way of matching the current rule -- give up on the rule completely if the matching process fails at this point”.

Four colons ... would just be silly. So, instead, there's a special named directive: <commit>. Backtracking through a <commit> causes the entire match to immediately fail. And if the current rule is being matched as part of a larger rule, that larger rule will fail as well. In other words, it's the “Blow up this Entire Planet and Possibly One or Two Others We Noticed on our Way Out Here” operator.

If the double-colon in $hunk had been a <commit> instead:

    <$linenum> a <commit> <$linerange>
    \n                         
    <$appendline>+

then matching a line number and an 'a' and subsequently failing would cause the entire $hunk rule to fail immediately, and would also cause the $file rule that invoked it to fail immediately.

So, in general, a <commit> means: “At this point I'm committed to this way of completing the current match -- give up all attempts at matching anything if the matching process fails at this point”.
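The three directives differ only in how far the failure propagates. One way to picture that scoping is with exceptions caught at different levels, as in this toy Python sketch. Everything in it is hypothetical: the class and function names are invented, and the "rules" are trivial string checks rather than real patterns.

```python
class GroupFail(Exception):
    """Backtracked over '::': give up on the enclosing [...] group."""

class RuleFail(Exception):
    """Backtracked over ':::': give up on the enclosing rule."""

class MatchFail(Exception):
    """Backtracked over <commit>: give up on the entire match."""

def hunk(line, directive):
    """Toy hunk rule: [ <linenum> a DIRECTIVE <linerange> ] | fallback."""
    try:
        if line[:2] == "1a":                  # matched "<linenum> a"
            if line[2:].isdigit():            # matched "<linerange>"
                return "hunk matched"
            raise directive                   # failure backtracks over the directive
    except GroupFail:
        pass                                  # only the bracketed group is abandoned
    return "hunk fell back to its last alternative"

def file(line, directive):
    """Toy file rule that invokes hunk."""
    try:
        return hunk(line, directive)
    except RuleFail:
        return "hunk failed; file is free to try something else"

def match(line, directive):
    """Top-level match that invokes file."""
    try:
        return file(line, directive)
    except MatchFail:
        return "entire match failed"

for directive in (GroupFail, RuleFail, MatchFail):
    print(directive.__name__, "->", match("1a?", directive))
```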


Failing with style

The other two alternatives:

    | <$linerange> d :: <$linenum> \n
      <$deleteline>+                 
    | <$linerange> c :: <$linerange> \n
      <$deleteline>+  --- \n  <$appendline>+

are just variants on the first.

If none of the three alternatives in the square brackets matches, then the alternative outside the brackets is tried:

    |  (\N*) ::: { fail "Invalid diff hunk: $1" }

This captures a sequence of non-newline characters (\N means “not \n”, in the same way \S means “not \s” or \W means “not \w”). Then it invokes a block of Perl code inside the pattern. The call to fail causes the match to fail at that point, and sets an associated error message that would subsequently appear in the $! error variable (and which would also be accessible as part of $0).

Note the use of the triple colon after the repetition. It's needed because the fail in the block will cause the pattern match to backtrack, but there's no point backing up one character and trying again, since the original failure was precisely what we wanted. The presence of the triple-colon causes the entire rule to fail as soon as the backtracking reaches that point the first time.

The overall effect of the $hunk rule is therefore either to match one hunk of the diff, or else fail with a relevant error message.


Home, home on the (line)range

The third and fourth rules:

    $linerange = rx/ <$linenum> , <$linenum>
                   | <$linenum> 
                   /;
    $linenum = rx/ \d+ /;

specify that a line number consists of a series of digits, and that a line range consists of either two line numbers with a comma between them or a single line number. The $linerange rule could also have been written:

    $linerange = rx/ <$linenum> [ , <$linenum> ]? /;

which might be marginally more efficient, since it doesn't have to backtrack and rematch the first $linenum in the second alternative. It's likely, however, that the rule optimizer will detect such cases and automatically hoist the common prefix out anyway, so it's probably not worth the decrease in readability to do that manually.
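The equivalence of the two formulations is easy to check in any regex engine. Here is a small Python sketch (Python's `re` syntax, not Perl 6; the variable names are made up for the illustration) that runs both versions over a few sample strings:

```python
import re

LINENUM = r"\d+"

# Two alternatives sharing a common prefix.
two_alternatives = re.compile(rf"{LINENUM},{LINENUM}|{LINENUM}")
# The prefix hoisted out by hand, with an optional tail.
optional_tail = re.compile(rf"{LINENUM}(?:,{LINENUM})?")

samples = ["7", "7,12", "7,", ",12", "x"]
results = [
    (s, bool(two_alternatives.fullmatch(s)), bool(optional_tail.fullmatch(s)))
    for s in samples
]
for s, first, second in results:
    print(f"{s!r}: {first} {second}")
```

Both columns agree on every sample, so the choice between the two forms is about readability and speed, not about what gets matched.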


What's my line?

The final two rules specify the structure of individual context lines in the diff (i.e. the lines that say what text is being added or removed by the hunk):

    $deleteline = rx/^^ \< <sp> (\N* \n) /
    $appendline = rx/^^ \> <sp> (\N* \n) /

The ^^ markers ensure that each rule starts at the beginning of an entire line.

The first character on that line must be either a '<' or a '>'. Note that we have to escape these characters since angle brackets are metacharacters in Perl 6. An alternative would be to use the “literal string” metasyntax:

    $deleteline = rx/^^ <'<'> <sp> (\N* \n) /
    $appendline = rx/^^ <'>'> <sp> (\N* \n) /

That is, angle brackets with a single-quoted string inside them match the string's sequence of characters as literals (including whitespace and other metatokens).

Or we could have used the quotemeta metasyntax (\Q[...]):

    $deleteline = rx/^^ \Q[<] <sp> (\N* \n) /
    $appendline = rx/^^ \Q[>] <sp> (\N* \n) /

Note that Perl 5's \Q...\E construct is replaced in Perl 6 by just the \Q marker, which now takes a group after it.

We could also have used a single-letter character class:

    $deleteline = rx/^^ <[<]> <sp> (\N* \n) /
    $appendline = rx/^^ <[>]> <sp> (\N* \n) /

or even a named character (\c[CHAR NAME HERE]):

    $deleteline = rx/^^ \c[LESS-THAN SIGN] <sp> (\N* \n) /
    $appendline = rx/^^ \c[GREATER-THAN SIGN] <sp> (\N* \n) /

Whether any of those MTOWTDI is better than just escaping the angle bracket is, of course, a matter of personal taste.
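For comparison, the same embarrassment of riches exists in other languages. The Python sketch below is only an analogue: '<' is not a metacharacter in Python's `re`, `^` under `re.MULTILINE` stands in for `^^`, `[^\n]` stands in for `\N`, and the helper name `deleteline` is borrowed from the example purely for illustration.

```python
import re

# Four spellings of a literal '<' in Python.
LESS_THAN_VARIANTS = {
    "escaped":         r"\<",
    "re.escape":       re.escape("<"),
    "character class": r"[<]",
    "named character": "\N{LESS-THAN SIGN}",
}

def deleteline(lt):
    """Build the delete-line pattern around one spelling of '<'."""
    return re.compile(rf"^{lt} ([^\n]*\n)", re.MULTILINE)

captures = {
    label: deleteline(lt).search("1d0\n< old text\n").group(1)
    for label, lt in LESS_THAN_VARIANTS.items()
}
for label, captured in captures.items():
    print(f"{label}: {captured!r}")
```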


The final frontier

After the leading angle, a single literal space is expected. Again, we could have specified that by escapology (\ ) or literalness (<' '>) or quotemetaphysics (\Q[ ]) or character classification (<[ ]>), or deterministic nominalism (\c[SPACE]), but Perl 6 also gives us a simple name for the space character: <sp>. This is the preferred option, since it reduces line-noise and makes the significant space much harder to miss.

Perl 6 provides predefined names for other useful subpatterns as well, including:

<dot>

which matches a literal dot ('.') character (i.e. it's a more elegant synonym for \.);

<lt> and <gt>

which match a literal '<' and '>' respectively. These give us yet another way of writing:

    $deleteline = rx/^^ <lt> <sp> (\N* \n) /
    $appendline = rx/^^ <gt> <sp> (\N* \n) /

<ws>

which matches any sequence of whitespace (i.e. it's a more elegant synonym for \s+). Optional whitespace is, therefore, specified as <ws>? or <ws>* (Perl 6 will accept either);

<alpha>

which matches a single alphabetic character (i.e. it's like the character class <[A-Za-z]> but it handles accented characters and alphabetic characters from non-Roman scripts as well);

<ident>

which is a short-hand for [ [<alpha>|_] \w* ] (i.e. a standard identifier in many languages, including Perl).

Using named subpatterns like these makes rules clearer in intent, easier to read, and more self-documenting. And, as we'll see shortly, they're fully generalizable...we can create our own.
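The readability argument holds outside Perl 6 too. As a rough illustration, here is a Python sketch that fakes named subpatterns with a lookup table. The table `SUBPATTERNS`, the helper `rx`, and the fragment definitions are all assumptions made for this example, not features of Python's `re` module.

```python
import re

# Hypothetical table of named fragments, loosely modelled on the list above.
SUBPATTERNS = {
    "sp": " ",
    "dot": r"\.",
    "lt": "<",
    "gt": ">",
    "ws": r"\s+",
    "alpha": r"[^\W\d_]",
    "ident": r"[^\W\d]\w*",
}

def rx(template, flags=0):
    """Compile a pattern in which <name> refers to an entry in SUBPATTERNS.

    Limitation of this sketch: the template cannot also use Python's own
    (?P<name>...) syntax, because its <name> part would be expanded too.
    """
    expanded = re.sub(
        r"<(\w+)>",
        lambda m: f"(?:{SUBPATTERNS[m.group(1)]})",
        template,
    )
    return re.compile(expanded, flags)

appendline = rx(r"^<gt><sp>([^\n]*\n)", re.MULTILINE)
qualified_name = rx(r"<ident><dot><ident>")

print(repr(appendline.search("> new text\n").group(1)))   # 'new text\n'
print(bool(qualified_name.fullmatch("Diff.file")))         # True
```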


Match-maker, match-maker...

Finally, we're ready to actually read in and match a diff file. In Perl 5, we'd do that like so:

    # Perl 5
    local $/;          # Disable input record separator (enable slurp mode)
    my $text = <>;     # Slurp up input stream into $text
    print "Valid diff" 
        if $text =~ /$file/;

We could do the same thing in Perl 6 (though the syntax would differ slightly) and in this case that would be fine. But, in general, it's clunky to have to slurp up the entire input before we start matching. The input might be huge, and we might fail early. Or we might want to match input interactively (and issue an error message as soon as the input fails to match). Or we might be matching a series of different formats. Or we might want to be able to leave the input stream in its original state if the match fails.

The inability to do pattern matches immediately on an input stream is one of Perl 5's few weaknesses when it comes to text processing. Sure, we can read line-by-line and apply pattern matching to each line, but trying to match a construct that may be laid out across an unknown number of lines is just painful.

Not in Perl 6 though. In Perl 6, we can bind an input stream to a scalar variable (i.e. like a Perl 5 tied variable) and then just match on the characters in that stream as if they were already in memory:

    my $text is from($*ARGS);       # Bind scalar to input stream
    print "Valid diff" 
        if $text =~ /<$file>/;      # Match against input stream

The important point is that, after the match, only those characters that the pattern actually matched will have been removed from the input stream.

It may also be possible to skip the variable entirely and just write:

    print "Valid diff" 
        if $*ARGS =~ /<$file>/;     # Match against input stream

or:

    print "Valid diff" 
        if <> =~ /<$file>/;         # Match against input stream

but that's yet to be decided.
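The behaviour described above (consume only what matched, and leave the stream untouched on failure) can be imitated by hand in other languages, at the cost of some bookkeeping. The following Python sketch handles just the "append" hunk and assumes a seekable text stream. The function name `match_append_hunk` is invented, and this is an illustration of the idea rather than a model of how Perl 6 would implement it.

```python
import io
import re

HEADER = re.compile(r"\d+a\d+(?:,\d+)?\n")
APPEND = re.compile(r"> [^\n]*\n")

def match_append_hunk(stream):
    """Consume one 'append' hunk from a seekable text stream, or nothing."""
    start = stream.tell()
    consumed = []
    line = stream.readline()
    if not HEADER.fullmatch(line):
        stream.seek(start)           # no match: restore the stream
        return None
    consumed.append(line)
    while True:
        mark = stream.tell()
        line = stream.readline()
        if not APPEND.fullmatch(line):
            stream.seek(mark)        # un-read the line that is not ours
            break
        consumed.append(line)
    if len(consumed) < 2:            # a header with no '> ' lines after it
        stream.seek(start)
        return None
    return "".join(consumed)

stream = io.StringIO("1a2,3\n> foo\n> bar\nrest of input\n")
print(repr(match_append_hunk(stream)))   # '1a2,3\n> foo\n> bar\n'
print(repr(stream.read()))               # 'rest of input\n'
```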


A cleaner approach

The previous example solves the problem of recognizing a valid diff file quite nicely (and with only six rules!), but it does so by cluttering up the program with a series of variables storing those precompiled patterns.

It's as if we were to write a collection of subroutines like this:

    my $print_name = sub ($data) { print $data{name}, "\n"; };
    my $print_age  = sub ($data) { print $data{age}, "\n"; };
    my $print_addr = sub ($data) { print $data{addr}, "\n"; };
    my $print_info = sub ($data) {
        $print_name($data);
        $print_age($data);
        $print_addr($data);
    };
    # and later...
    $print_info($info);

You could do it that way, but it's not the right way to do it. The right way to do it is as a collection of named subroutines or methods, often collected together in the namespace of a class or module:

    module Info {
        sub print_name ($data) { print $data{name}, "\n"; }
        sub print_age ($data)  { print $data{age}, "\n"; }
        sub print_addr ($data) { print $data{addr}, "\n"; }
        sub print_info ($data) {
            print_name($data);
            print_age($data);
            print_addr($data);
        }
    }
    Info::print_info($info);

So it is with Perl 6 patterns. You can write them as a series of pattern objects created at run-time, but they're much better specified as a collection of named patterns, collected together at compile-time in the namespace of a grammar.

Here's the previous diff-parsing example rewritten that way (and with a few extra bells-and-whistles added in):

    grammar Diff {
        rule file { ^  <hunk>*  $ }
        rule hunk :i { 
            [ <linenum> a :: <linerange> \n
              <appendline>+ 
            |
              <linerange> d :: <linenum> \n
              <deleteline>+
            |
              <linerange> c :: <linerange> \n
              <deleteline>+
              --- \n
              <appendline>+
            ]
          |
            <badline("Invalid diff hunk")>
        }
        rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
        rule linerange { <linenum> , <linenum>
                       | <linenum>
                       }
        rule linenum { \d+ }
        rule deleteline { ^^ <out_marker> (\N* \n) }
        rule appendline { ^^ <in_marker>  (\N* \n) }
        rule out_marker { \<  <sp> }
        rule in_marker  { \>  <sp> }
    }
    # and later...
    my $text is from($*ARGS);
    print "Valid diff" 
        if $text =~ /<Diff.file>/;
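For readers who want something to run, the same organization (one named rule per unit, all gathered in one namespace) can be approximated in Python with a class. This is a line-oriented sketch under several assumptions: the names `Diff`, `DiffError`, `many`, and `body` are choices made for this example, the input is slurped rather than streamed, and failure is signalled with an exception instead of a failed match.

```python
import re

LINENUM = r"\d+"
LINERANGE = rf"{LINENUM}(?:,{LINENUM})?"

class DiffError(ValueError):
    """Raised where the grammar's badline rule would call fail."""

class Diff:
    """One attribute or method per rule; the class is the namespace."""

    header = re.compile(
        rf"(?:{LINENUM}(?P<a>a){LINERANGE}"
        rf"|{LINERANGE}(?P<d>d){LINENUM}"
        rf"|{LINERANGE}(?P<c>c){LINERANGE})\n",
        re.IGNORECASE,
    )
    deleteline = re.compile(r"< [^\n]*\n")
    appendline = re.compile(r"> [^\n]*\n")

    @staticmethod
    def badline(errmsg, line):
        raise DiffError(f"{errmsg}: {line.rstrip()}")

    @staticmethod
    def many(rule, lines, pos):
        """Return the index after one or more lines matching rule, else None."""
        end = pos
        while end < len(lines) and rule.fullmatch(lines[end]):
            end += 1
        return end if end > pos else None

    @classmethod
    def body(cls, kind, lines, pos):
        """Match the lines that follow an 'a', 'd' or 'c' header."""
        if kind in ("d", "c"):
            pos = cls.many(cls.deleteline, lines, pos)
        if kind == "c" and pos is not None:
            separator = pos < len(lines) and lines[pos] == "---\n"
            pos = pos + 1 if separator else None
        if kind in ("a", "c") and pos is not None:
            pos = cls.many(cls.appendline, lines, pos)
        return pos

    @classmethod
    def hunk(cls, lines, pos):
        m = cls.header.fullmatch(lines[pos])
        end = cls.body(m.lastgroup, lines, pos + 1) if m else None
        if end is None:
            cls.badline("Invalid diff hunk", lines[pos])
        return end

    @classmethod
    def file(cls, text):
        lines = text.splitlines(keepends=True)
        pos = 0
        while pos < len(lines):
            pos = cls.hunk(lines, pos)
        return True

print(Diff.file("1a2,3\n> foo\n> bar\n4d3\n< baz\n"))   # True
```

As in the grammar, a hunk that starts well but goes wrong is reported by its first line: `Diff.file("1a2\n< wrong\n")` raises a `DiffError` whose message is "Invalid diff hunk: 1a2".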


What's in a name?

The grammar declaration creates a new namespace for rules (in the same way a class or module declaration creates a new namespace for methods or subroutines). If a block is specified after the grammar's name:

    grammar HTML {
        rule file :iw { \Q[<HTML>]  <head>  <body>  \Q[</HTML>] }
        rule head :iw { \Q[<HEAD>]  <head_tag>+  \Q[</HEAD>] }
        # etc.
    } # Explicit end of HTML grammar

then that new namespace is confined to that block. Otherwise the namespace continues until the end of the source section of the current file:

    grammar HTML;
    rule file :iw { \Q[<HTML>]  <head>  <body>  \Q[</HTML>] }
    rule head :iw { \Q[<HEAD>]  <head_tag>+  \Q[</HEAD>] }
    # etc.
    # Implicit end of HTML grammar
    __END__

Note that, as with the blockless variants on class and module, this form of the syntax is designed to simplify one-namespace-per-file situations. It's a compile-time error to put two or more blockless grammars, classes or modules in a single file.

Within the namespace, named rules are defined using the rule declarator. It's analogous to the sub declarator within a module, or the method declarator within a class. Just like a class method, a named rule has to be invoked through its grammar if we refer to it outside its own namespace. That's why the actual match became:

    $text =~ /<Diff.file>/;         # Invoke through grammar

If we want to match a named rule, we put the name in angle brackets. Indeed, many of the constructs we've already seen -- <sp>, <ws>, <ident>, <alpha>, <commit> -- are really just predefined named rules that come standard with Perl 6.

Like subroutines and methods, within their own namespace, rules don't have to be qualified. Which is why we can write things like:

    rule linerange { <linenum> , <linenum>
                   | <linenum>
                   }

instead of:

    rule linerange { <Diff.linenum> , <Diff.linenum>
                   | <Diff.linenum>
                   }

Using named rules has several significant advantages, apart from making the patterns look cleaner. For one thing, the compiler may be able to optimize the embedded named rules better. For example, it could inline the attempts to match <linenum> within the linerange rule. In the rx version:

    $linerange = rx{ <$linenum> , <$linenum>
                   | <$linenum>
                   };

that's not possible, since the pattern matching mechanism won't know what's in $linenum until it actually tries to perform the match.

By the way, we can still use interpolated <$subrule>-ish subpatterns in a named rule, and we can use named subpatterns in an rx-ish rule. The difference between rule and rx is just that a rule can have a name and must use {...} as its delimiters, whereas an rx doesn't have a name and can use any allowed delimiters.


Bad line! No match!

This version of the diff parser has an additional rule, named badline. This rule illustrates another similarity between rules and subroutines/methods: rules can take arguments. The badline rule factors out the error message creation at the end of the hunk rule. Previously that rule ended with:

    |  (\N*) ::: { fail "Invalid diff hunk: $1" }

but in this version it ends with:

    |  <badline("Invalid diff hunk")>

That's a much better abstraction of the error condition. It's easier to understand and easier to maintain, but it does require us to be able to pass an argument (the error message) to the new badline subrule. To do that, we simply declare it to have a parameter list:

    rule badline($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }

Note the strong syntactic parallel with a subroutine definition:

    sub  subname($param)  { ... }

The argument is passed to a subrule by placing it in parentheses after the rule name within the angle brackets:

    |  <badline("Invalid diff hunk")>

The argument can also be passed without the parentheses, but then it is interpreted as if it were the body of a separate rule:

    rule list_of ($pattern) { 
            <$pattern> [ , <$pattern> ]*
    }
    # and later...
    $str =~ m:w/  \[                  # Literal opening square bracket
                  <list_of \w\d+>     # Call list_of subrule passing rule rx/\w\d+/
                  \]                  # Literal closing square bracket
               /;

A rule can take as many arguments as it needs to:

    rule seplist($elem, $sep) {
            <$elem>  [ <$sep> <$elem> ]*
    }

and those arguments can also be passed by name, using the standard Perl 6 pair-based mechanism (as described in Apocalypse 3).

    $str =~ m:w/
                \[                                      # literal left square bracket
                <seplist(sep=>":", elem=>rx/<ident>/)>  # colon-separated list of identifiers
                \]                                      # literal right square bracket
               /;

Note that the list's element specifier is itself an anonymous rule, which the seplist rule will subsequently interpolate as a pattern (because the $elem parameter appears in angle brackets within seplist).
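Parameterized rules have a natural counterpart in any language where patterns can be built by functions. The Python sketch below mirrors seplist and list_of as ordinary functions that return pattern strings. It is an approximation: both arguments are treated as pattern fragments, the whitespace-skipping behaviour of :w is not modelled, and `IDENT` is an assumed stand-in for `<ident>`.

```python
import re

IDENT = r"[^\W\d]\w*"    # assumed stand-in for <ident>

def seplist(elem, sep):
    """Return a pattern for one or more `elem` separated by `sep`."""
    return rf"(?:{elem})(?:(?:{sep})(?:{elem}))*"

def list_of(pattern):
    """Return a pattern for a comma-separated list of `pattern`."""
    return seplist(pattern, ",")

# Arguments passed by name, in either order, as in the example above.
idents = re.compile(r"\[" + seplist(sep=":", elem=IDENT) + r"\]")
codes = re.compile(r"\[" + list_of(r"\w\d+") + r"\]")

print(bool(idents.fullmatch("[foo:bar:_baz]")))   # True
print(bool(idents.fullmatch("[foo:]")))           # False
print(bool(codes.fullmatch("[a1,b22]")))          # True
```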

