Sign In/My Account | View Cart  
advertisement


Listen Print

Exegesis 5
by Damian Conway | Pages: 1, 2, 3, 4, 5

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.

Thinking ahead

The only other change in the grammar version of the diff parser is that the matching of the '<' and '>' at the start of the context lines has been factored out. Whereas before we had:

    $deleteline = rx/^^ \< <sp> (\N* \n) /
    $appendline = rx/^^ \> <sp> (\N* \n) /

now we have:

    rule deleteline { ^^ <out_marker> (\N* \n) }
    rule appendline { ^^ <in_marker>  (\N* \n) }
    rule out_marker { \<  <sp> }
    rule in_marker  { \>  <sp> }

That seems like a step backwards, since it complicated the grammar for no obvious benefit, but the benefit will be reaped later when we discover another type of diff file that uses different markers for incoming and outgoing lines.


What you match is what you get

Both the variable-based and grammatical versions of the code above do a great job of recognizing a diff, but that's all they do. If we only want syntax checking, that's fine. But, generally, if we're parsing data what we really want is to do something useful with it: transform it into some other syntax, make changes to its contents, or perhaps convert it to a Perl internal data structure for our program to manipulate.

Suppose we did want to build a hierarchical Perl data structure representing the diff that the above examples match. What extra code would we need?

None.

That's right. Whenever Perl 6 matches a pattern, it automatically builds a “result object” representing the various components of the match.

That result object is named $0 (the program's name is now $*PROG) and it's lexical to the scope in which the match occurs. The result object stores (amongst other things) the complete string matched by the pattern, and it evaluates to that string when used in a string context. For example:

    if ($text =~ /<Diff.file>/) {
        $difftext = $0;
    }

That's handy, but not really useful for extracting data structures. However, in addition, any components within a match that were captured using parentheses become elements of the object's array attribute, and are accessible through its array index operator. So, for example, when a pattern such as:

    rule linenum_plus_comma { (\d+) (,?) };

matches successfully, the array element 1 of the result object (i.e. $0[1]) is assigned the result of the first parenthesized capture (i.e. the digits), whilst the array element 2 ($0[2]) receives the comma. Note that array element zero of any result object is assigned the complete string that the pattern matched.

There are also abbreviations for each of the array elements of $0. $0[1] can also be referred to as...surprise, surprise...$1, $0[2] can also be referred to as $2, $0[3] as $3, etc. Like $0, each of these numeric variables is also lexical to the scope in which the pattern match occurred.

The parts of a matched string that were matched by a named subrule become entries in the result object's hash attribute, and are subsequently accessible through its hash lookup operator. So, for example, when the pattern:

    rule deleteline { ^^ <out_marker> (\N* \n) }

matches, the result object's hash entry for the key 'out_marker' (i.e. $0{out_marker}) will contain the result object returned by the successful nested match of the out_marker subrule.


A hypothetical solution to a very real problem

Named capturing into a hash is very convenient, but it doesn't work so well for a rule like:

    rule linerange {
          <linenum> , <linenum>
        | <linenum>
    }

The problem is that the hash attribute of the rule's $0 can only store one entry with the key 'linenum'. So if the <linenum> , <linenum> alternative matches, then the result object from the second match of <linenum> will overwrite the entry for the first <linenum> match.

The solution to this is a new Perl 6 pattern matching feature known as “hypothetical variables”. A hypothetical variable is a variable that is declared and bound within a pattern match (i.e. inside a closure within a rule). The variable is declared, not with a my, our, or temp, but with the new keyword let, which was chosen because it's what mathematicians and other philosophers use to indicate a hypothetical assumption.

Once declared, a hypothetical variable is then bound using the normal binding operator. For example:

    rule checked_integer {
            (\d+)                   # Match and capture one-or-more digits
            { let $digits := $1 }   # Bind to hypothetical var $digits
            -                       # Match a hyphen
            (\d)                    # Match and capture one digit
            { let $check := $2 }    # Bind to hypothetical var $check
    }

In this example, if a sequence of digits is found, then the $digits variable is bound to that substring. Then, if the dash and check-digit are matched, the digit is bound to $check. However, if the dash or digit is not matched, the match will fail and backtrack through the closure. This backtracking causes the $digits hypothetical variable to be automatically un-bound. Thus, if a rule fails to match, the hypothetical variables within it are not associated with any value.

Each hypothetical variable is really just another name for the corresponding entry in the result object's hash attribute. So binding a hypothetical variable like $digits within a rule actually sets the $0{digits} element of the rule's result object.

So, for example, to distinguish the two line numbers within a line range:

    rule linerange {
          <linenum> , <linenum>
        | <linenum>
    }

we could bind them to two separate hypothetical variables -- say, $from and $to -- like so:

    rule linerange {
          (<linenum>)               # Match linenum and capture result as $1
          { let $from := $1 }       # Save result as hypothetical variable
          ,                         # Match comma
          (<linenum>)               # Match linenum and capture result as $2
          { let $to := $2 }         # Save result as hypothetical variable
        |
          (<linenum>)               # Match linenum and capture result as $3
          { let $from := $3 }       # Save result as hypothetical variable
    }

Now our result object has a hash entry $0{from} and (maybe) one for $0{to} (if the first alternative was the one that matched). In fact, we could ensure that the result always has a $0{to}, by setting the corresponding hypothetical variable in the second alternative as well:

    rule linerange {
          (<linenum>)
          { let $from := $1 }
          ,         
          (<linenum>)
          { let $to := $2 }
        |
          (<linenum>)
          { let $from := $3; let $to := $from }
    }

Problem solved.

But only by introducing a new problem. All that hypothesizing made our rule ugly and complex. So Perl 6 provides a much prettier short-hand:

    rule linerange {
          $from := <linenum>          # Match linenum rule, bind result to $from
          ,                           # Match comma
          $to := <linenum>            # Match linenum rule, bind result to $to
        |                             # Or...
          $from := $to := <linenum>   # Match linenum rule,
    }                                 #   bind result to both $from and $to

or, more compactly:

    rule linerange {
          $from:=<linenum> , $to:=<linenum>
        | $from:=$to:=<linenum>
    }

If a Perl 6 rule contains a variable that is immediately followed by the binding operator (:=), that variable is never interpolated. Instead, it is treated as a hypothetical variable, and bound to the result of the next component of the rule (in the above examples, to the result of the <linenum> subrule match).

You can also use hypothetical arrays and hashes, binding them to a component that captures repeatedly. For example, we might choose to name our set of hunks:

    rule file { ^  @adonises := <hunk>*  $ }

collecting all the <hunk> matches into a single array (which would then be available after the match as $0{'@adonises'}. Note that the sigil is included in the key in this case).

Or we might choose to bind a hypothetical hash:

    rule config {
        %init :=            # Hypothetically, bind %init to...
            [               # Start of group
                (<ident>)   # Match and capture an identifier
                \h*=\h*     # Match an equals sign with optional horizontal whitespace
                (\N*)       # Match and capture the rest of the line
                \n          # Match the newline
            ]*
    }

where each repetition of the [...]* grouping captures two substrings on each repetition and converts them to a key/value pair, which is then added to the hash. The first captured substring in each repetition becomes the key, and the second captured substring becomes its associated value. The hypothetical %init hash is also available through the rule's result object, as $0{'%init'} (again, with the sigil as part of the key).


The nesting instinct

Of course, those line number submatches in:

    rule linerange {
          $from:=<linenum> , $to:=<linenum>
        | $from:=$to:=<linenum>
    }

will have returned their own result objects. And it's a reference to those nested result objects that actually gets stored in linerange's $0{from} and $0{to}.

Likewise, in the next higher rule:

    rule hunk :i { 
        [ <linenum> a :: <linerange> \n
          <appendline>+ 
        |
          <linerange> d :: <linenum> \n
          <deleteline>+
        |
          <linerange> c :: <linerange> \n
          <deleteline>+
          --- \n
          <appendline>+
        ]
    };

the match on <linerange> will return its $0 object. So, within the hunk rule, we could access the “from” digits of the line range of the hunk as: $0{linerange}{from}.

Likewise, at the highest level:

    rule file { ^  <hunk>*  $ }

we are matching a series of hunks, so the hypothetical $hunk variable (and hence $0{hunk}) will contain a result object whose array attribute contains the series of result objects returned by each individual <hunk> match.

So, for example, we could access the “from” digits of the line range of the third hunk as: $0{hunk}[2]{linerange}{from}.


Extracting the insertions

More usefully, we could locate and print every line in the diff that was being inserted, regardless of whether it was inserted by an “append” or a “change” hunk. Like so:

    my $text is from($*ARGS);
    if $text =~ /<Diff.file>/ {
        for @{ $0{file}{hunk} } -> $hunk
             print @{$hunk{appendline}}
                 if $hunk{appendline};
        }
    }

Here, the if statement attempts to match the text against the pattern for a diff file. If it succeeds, the for loop grabs the <hunk>* result object, treats it as an array, and then iterates each hunk match object in turn into $hunk. The array of append lines for each hunk match is then printed (if there is in fact a reference to that array in the hunk).


Don't just match there; do something!

Because Perl 6 patterns can have arbitrary code blocks inside them, it's easy to have a pattern actually perform syntax transformations whilst it's parsing. That's often a useful technique because it allows us to manipulate the various parts of a hierarchical representation locally (within the rules that recognize them).

For example, suppose we wanted to “reverse” the diff file. That is, suppose we had a diff that specified the changes required to transform file A to file B, but we needed the back-transformation instead: from file B to file A. That's relatively easy to create. We just turn every “append” into a “delete”, every “delete” into an “append”, and reverse every “change”.

The following code does exactly that:

    grammar ReverseDiff {
        rule file { ^  <hunk>*  $ }
        rule hunk :i { 
            [ <linenum> a :: <linerange> \n
              <appendline>+ 
              { @$appendline =~ s/<in_marker>/< /;
                let $0 := "${linerange}d${linenum}\n"
                        _ join "", @$appendline;
              }
            |
              <linerange> d :: <linenum> \n
              <deleteline>+
              { @$deleteline =~ s/<out_marker>/> /;
                let $0 := "${linenum}a${linerange}\n"
                        _ join "", @$deleteline;
              }
            |
              $from:=<linerange> c :: $to:=<linerange> \n
              <deleteline>+
              --- \n
              <appendline>+
              { @$appendline =~ s/<in_marker>/</;
                @$deleteline =~ s/<out_marker>/>/;
                let $0 := "${to}c${from}\n"
                        _ join("", @$appendline)
                        _ "---\n"
                        _ join("", @$deleteline);
              }
            ]
          |
            <badline("Invalid diff hunk")>
        }
    rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
    rule linerange { $from:=<linenum> , $to:=<linenum>
                       | $from:=$to:=<linenum>
                       }
    rule linenum { (\d+) }
    rule deleteline { ^^ <out_marker> (\N* \n) }
        rule appendline { ^^ <in_marker>  (\N* \n) }
    rule out_marker { \<  <sp> }
        rule in_marker  { \>  <sp> }
    }
    # and later...
    my $text is from($*ARGS);
    print @{ $0{file}{hunk} }
        if $text =~ /<Diff.file>/;

The rule definitions for file, badline, linerange, linenum, appendline, deleteline, in_marker and out_marker are exactly the same as before.

All the work of reversing the diff is performed in the hunk rule. To do that work, we have to extend each of the three main alternatives of that rule, adding to each a closure that changes the result object it returns.


Smarter alternatives

In the first alternative (which matches “append” hunks), we match as before:

    <linenum> a :: <linerange> \n
    <appendline>+

But then we execute an embedded closure:

    { @$appendline =~ s/<in_marker>/</;
      let $0 := "${linerange}d${linenum}\n"
              _ join "", @$appendline;
    }

The first line reverses the “marker” arrows on each line of data that was previously being appended, using the smart-match operator to apply the transformation to each line. Note too, that we reuse the in_marker rule within the substitution.

Then we bind the result object (i.e. the hypothetical variable $0) to a string representing the “reversed” append hunk. That is, we reverse the order of the line range and line number components, put a 'd' (for “delete”) between them, and then follow that with all the reversed data:

    let $0 := "${linerange}d${linenum}\n"
            _ join "", @$appendline;

The changes to the “delete” alternative are exactly symmetrical. Capture the components as before, reverse the marker arrows, reverse the $linerange and $linenum, change the 'd' to an 'a', and append the reversed data lines.

In the third alternative:

    $from:=<linerange> c :: $to:=<linerange> \n
    <deleteline>+   
    --- \n
    <appendline>+
    { @$appendline =~ s/<in_marker>/</;
      @$deleteline =~ s/<out_marker>/>/;
      let $0 := "${to}c${from}\n"
              _ join("", @$appendline)
              _ "---\n"
              _ join("", @$deleteline);
    }

there are line ranges on both sides of the 'c'. So we need to give them distinct names, by binding them to extra hypothetical variables: $from and $to. We then reverse the order of two line ranges, but leave the 'c' as it was (because we're simply changing something back to how it was previously). The markers on both the append and delete lines are reversed, and then the order of the two sets of lines is also reversed.

Once those transformations has been performed on each hunk (i.e. as it's being matched!), the result of successfully matching any <hunk> subrule will be a string in which the matched hunk has already been reversed.

All that remains is to match the text against the grammar, and print out the (modified) hunks:

    print @{ $0/2002/08/22/exegesis5.html{hunk} }
        if $text =~ /<ReverseDiff.file>/;

And, since the file rule is now in the ReverseDiff grammar's namespace, we need to call the rule through that grammar. Note the way the syntax for doing that continues the parallel with methods and classes.


Pages: 1, 2, 3, 4, 5

Next Pagearrow