
Apocalypse 5
by Larry Wall

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

Character Class Reform

As we mentioned earlier, character classes are becoming more like standard grammar rules, because the definition of "character" is getting fuzzier. This is part of the motivation for demoting enumerated character classes and stealing the square brackets for another purpose. Actually, for old times' sake you can still use square brackets on enumerated character classes, but you have to put an extra set of angles around them. This tends to save keystrokes when you want to use any named character classes or Unicode properties, particularly when you want to combine them:

    Old                  New
    ---                  ---
    [a-z]                <[a-z]>
    [[:alpha:]]          <alpha>
    [^[:alpha:]]         <-alpha>
    [[:alpha:][:digit:]] <<alpha><digit>>

The outer <...> also naturally serves as a container for any extra syntax we decide to come up with for character set manipulation:

    <[_]+<alpha>+<digit>-<Swedish>>
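One way to picture that kind of character set manipulation is as ordinary set arithmetic. The following is only a rough Python analogue, not Perl 6: it assumes the operators are read left to right as union and difference, it limits itself to code points below 0x250 to stay small, and the `swedish` set is a made-up stand-in for whatever `<Swedish>` would actually contain.

```python
# Rough analogue of <[_]+<alpha>+<digit>-<Swedish>> using Python sets.
# Assumption: "+" means union and "-" means difference, applied left to right.

# Keep the universe small: only code points below 0x250 are considered.
UNIVERSE = [chr(cp) for cp in range(0x250)]

alpha = {c for c in UNIVERSE if c.isalpha()}
digit = {c for c in UNIVERSE if c.isdigit()}
swedish = set("åäöÅÄÖ")  # hypothetical contents of <Swedish>

# Underscore, plus letters, plus digits, minus the "Swedish" letters.
ident_char = ({"_"} | alpha | digit) - swedish


def matches(ch):
    """Return True if the single character ch belongs to the combined class."""
    return ch in ident_char
```

With these definitions, `matches("a")` and `matches("_")` hold, while `matches("å")` does not, because the final subtraction removes it.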

State

[This section gets pretty abstruse. It's okay if your eyes glaze over.]

Every regex match maintains a state object, and any closure within the regex is actually an anonymous method of that object, which means in turn that the closure's topic is the current state object. Since a unary dot introduces a method call on the current topic, it follows that you can call any method in the state object that way:

    /(.*) { print .pos }/       # print current position

The state object may in fact be an instance of a grammar class. A grammar object has additional methods that know how to build a parse tree. Its rules also know how to refer to each other or to rules of related grammars.

Note that $_ within the closure refers to this state object, not the original search string. If you search on the state object, however, it pretends that you wanted to continue the search on the original string. If the internal search succeeds, the position of the external state is updated as well, just as if the internal search had been a rule invoked directly from the outer regex.
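The way an inner search moves the outer position can be sketched in Python. This is a toy model under stated assumptions, not the Perl 6 mechanism: `MatchState` and its `search` method are hypothetical names, and the inner search is taken to be anchored at the current position.

```python
import re


class MatchState:
    """Toy state object: remembers the original string and a position."""

    def __init__(self, text, pos=0):
        self.text = text
        self.pos = pos

    def search(self, pattern):
        """Continue matching on the original string at the current position.

        Assumption: the inner search is anchored at self.pos. On success the
        outer position advances; on failure it is left untouched.
        """
        match = re.compile(pattern).match(self.text, self.pos)
        if match is not None:
            self.pos = match.end()
        return match


state = MatchState("until done")
first = state.search(r"until")     # succeeds, position moves to 5
failed = state.search(r"\d+")      # fails, position stays at 5
second = state.search(r"\s+\w+")   # succeeds, position moves to 10
```

The caller never passes the string or the offset around explicitly; searching on the state object is enough, which is the point of the paragraph above.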

Because the state object is aware of how the tree is being built, when backtracking occurs the object can destroy parts of the parse tree that were conjectured in error. Because the grammar's action methods have control of the regex state, they can access named fields in the regex without having to explicitly pass them to the method call.

For instance, in our earlier example we passed $expr explicitly to build the parse tree, but the method can actually figure that out itself. So we could have just written:

    rule modifier { if     <ws> :: <expr> { .new_cond(0) }
                  | unless <ws> :: <expr> { .new_cond(1) }
                  | while  <ws> :: <expr> { .new_loop(0) }
                  | until  <ws> :: <expr> { .new_loop(1) }
                  | for    <ws> :: <expr> { .new_for }
                  | <@other_modifiers>  # user defined
                  | <null>              # no modifier
                  }

See Variable Scoping below for where @other_modifiers gets looked up.
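To make the idea of action methods reading captures from the state a little more concrete, here is a loose Python sketch. Everything in it is an assumption for illustration: `ModifierGrammar`, `RULE`, `captures`, and `parse` are invented names, `<expr>` is simplified to "the rest of the string", user-defined modifiers are left out, and plain tuples stand in for parse tree nodes.

```python
import re


class ModifierGrammar:
    """Toy grammar whose action methods read captures from the object itself."""

    # Simplification: <expr> is just the remainder of the input.
    RULE = re.compile(r"(?P<keyword>if|unless|while|until|for)\s+(?P<expr>.+)")

    def __init__(self, text):
        self.text = text
        self.captures = {}

    # The action methods take only the flag; "expr" is looked up on self.
    def new_cond(self, flag):
        return ("cond", flag, self.captures["expr"])

    def new_loop(self, flag):
        return ("loop", flag, self.captures["expr"])

    def new_for(self):
        return ("for", self.captures["expr"])

    def parse(self):
        match = self.RULE.fullmatch(self.text)
        if match is None:
            return None  # plays the role of <null>: no modifier
        self.captures = match.groupdict()
        keyword = self.captures["keyword"]
        if keyword == "if":
            return self.new_cond(0)
        if keyword == "unless":
            return self.new_cond(1)
        if keyword == "while":
            return self.new_loop(0)
        if keyword == "until":
            return self.new_loop(1)
        return self.new_for()
```

For example, `ModifierGrammar("unless $x > 3").parse()` yields `("cond", 1, "$x > 3")` without the expression ever being passed to `new_cond`.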

Within a closure, $_ represents the current state of the current regex, and by extension, the current state of all the regexes participating in the current match. (The type of the state object is the current grammar class, which may be an anonymous type if the current grammar has no name. If the regex is not a member of a grammar, it's of type RULE.) Part of the state of the current regex is the current node of the parse tree that is being built. When the current regex succeeds, the state object becomes a result object, and is returned to the calling regex. The calling regex can refer to the returned object as a "hypothetical" variable, the name of which is either implicitly generated from the name of the rule, or explicitly bound using :=. Through that variable you can get at anything captured by the subrule. (That is what $expr was doing earlier.)

When the entire match succeeds, the top-level node is returned as a result object that takes on different values in boolean, numeric, and string context. The name of the result object is $0. The result object contains all the other information, such as $1, $2, etc. Unlike $& in Perl 5, $0 is lexically scoped to the enclosing block. By extension, $1, etc. are also lexically scoped.
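A result object that behaves differently by context can be approximated in Python with the conversion hooks. This is only a sketch: the `Result` class is hypothetical, it wraps Python's own `re` match objects, and it assumes that numeric context means "the matched text read as a number", which this section does not actually specify.

```python
import re


class Result:
    """Toy result object wrapping an re.Match (or None for a failed match)."""

    def __init__(self, match):
        self._match = match

    def __bool__(self):
        # Boolean context: did the match succeed?
        return self._match is not None

    def __str__(self):
        # String context: the whole matched text, or "" on failure.
        return self._match.group(0) if self._match else ""

    def __float__(self):
        # Numeric context (assumption): the matched text read as a number.
        return float(str(self))

    def __getitem__(self, index):
        # Stand-in for $1, $2, ...: captures live inside the result object.
        if self._match is None:
            raise IndexError("no match, so no captures")
        return self._match.group(index)


result = Result(re.search(r"(\d+)\.(\d+)", "pi is 3.14"))
failed = Result(re.search(r"\d+", "no digits here"))
```

Here `bool(result)` is true, `str(result)` is `"3.14"`, and `result[1]` and `result[2]` give the two captures, while `failed` is false and stringifies to the empty string.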

As a kind of iterator, a regex stored in a variable doesn't expand in list context unless you put angles around it or use it with m//:

    $rx = /(xxx)/;
    print 1,2,<$rx($_)>;
    print 1,2,</(xxx)/>;
    my &rx := /(xxx)/;
    print 1,2,<rx($_)>;

$0, $1, etc. are not set in iterated cases like this. Each list item is a result object, though, and you can still get at the internal values that way.
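Python's `re` module happens to work in a similar spirit, so it makes a convenient analogy (it is not the Perl 6 behavior itself). `finditer` sets no global match variables; each item it yields is its own match object, and the captures are reached through that object. The pattern and input string below are made up for the example.

```python
import re

rx = re.compile(r"(x+)")

# Each item is a separate match object; nothing global is set.
results = list(rx.finditer("x xx xxx"))

# The internal values are still reachable through each item.
whole = [m.group(0) for m in results]
first_capture_spans = [m.span(1) for m in results]
```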
