Apocalypse 5
by Larry Wall
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Character Class Reform
As we mentioned earlier, character classes are becoming more like standard grammar rules, because the definition of "character" is getting fuzzier. This is part of the motivation for demoting enumerated character classes and stealing the square brackets for another purpose. Actually, for old times' sake you can still use square brackets on enumerated character classes, but you have to put an extra set of angles around them. But this actually tends to save keystrokes when you want to use any named character classes or Unicode properties, particularly when you want to combine them:
Old                    New
---                    ---
[a-z]                  <[a-z]>
[[:alpha:]]            <alpha>
[^[:alpha:]]           <-alpha>
[[:alpha:][:digit:]]   <<alpha><digit>>
The outer <...> also naturally serves as a container for any
extra syntax we decide to come up with for character set manipulation:
<[_]+<alpha>+<digit>-<Swedish>>
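The combined class above reads naturally as set arithmetic: start with the underscore, add the alphabetic and digit characters, then remove the Swedish ones. As a rough analogy only (this is Python, not Perl 6), the same idea can be sketched with ordinary sets. The contents of `alpha`, `digit`, and `swedish` are assumptions chosen for illustration; the text does not define what <Swedish> contains, and a real implementation would work on Unicode properties rather than small hand-built sets.

```python
import string

# Hypothetical stand-ins for the named classes, restricted to a tiny alphabet.
swedish = set("åäöÅÄÖ")                      # assumed meaning of <Swedish>
alpha = set(string.ascii_letters) | swedish  # assumed meaning of <alpha>
digit = set(string.digits)                   # assumed meaning of <digit>

# <[_]+<alpha>+<digit>-<Swedish>>, applied left to right.
# The parentheses matter: in Python, "-" binds more tightly than "|".
word_char = ({"_"} | alpha | digit) - swedish


def matches(ch):
    """Return True if the single character ch belongs to the combined class."""
    return ch in word_char
```

With these assumed sets, `matches("a")` and `matches("_")` hold, while `matches("å")` does not, because the final subtraction removes it.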
State
[This section gets pretty abstruse. It's okay if your eyes glaze over.]
Every regex match maintains a state object, and any closure within the regex is actually an anonymous method of that object, which means in turn that the closure's topic is the current state object. Since a unary dot introduces a method call on the current topic, it follows that you can call any method in the state object that way:
/(.*) { print .pos }/ # print current position
The state object may in fact be an instance of a grammar class. A grammar object has additional methods that know how to build a parse tree. Its rules also know how to refer to each other or to rules of related grammars.
Note that $_ within the closure refers to this state object, not the
original search string. If you search on the state object, however, it
pretends that you wanted to continue the search on the original string.
If the internal search succeeds, the position of the external state is
updated as well, just as if the internal search had been a rule invoked
directly from the outer regex.
Because the state object is aware of how the tree is being built, when backtracking occurs the object can destroy parts of the parse tree that were conjectured in error. Because the grammar's action methods have control of the regex state, they can access named fields in the regex without having to explicitly pass them to the method call.
For instance, in our earlier example we passed $expr explicitly
to build the parse tree, but the method can actually figure that
out itself. So we could have just written:
rule modifier { if     <ws> :: <expr> { .new_cond(0) }
              | unless <ws> :: <expr> { .new_cond(1) }
              | while  <ws> :: <expr> { .new_loop(0) }
              | until  <ws> :: <expr> { .new_loop(1) }
              | for    <ws> :: <expr> { .new_for }
              | <@other_modifiers>    # user defined
              | <null>                # no modifier
              },
See Variable Scoping below for where @other_modifiers gets looked up.
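To make the state-object idea a little more concrete, here is a minimal sketch in Python (an analogy, not Perl 6 and not how any real implementation works). It assumes a hypothetical `MatchState` class that holds the position, the named captures, and the parse tree under construction. The action methods read the captured `expr` from the state instead of taking it as an argument, and a failed alternative restores the position and discards any nodes it conjectured. The sketch simplifies heavily: `expr` just captures the rest of the string, the `for` and user-defined alternatives are omitted, and the `::` backtracking control is not modeled.

```python
class MatchState:
    """Hypothetical stand-in for the state object of one regex match."""

    def __init__(self, text):
        self.text = text    # original search string
        self.pos = 0        # current match position (compare .pos)
        self.fields = {}    # named captures such as "expr"
        self.tree = []      # parse-tree nodes conjectured so far

    # Primitive matchers: each advances self.pos on success.
    def literal(self, word):
        if self.text.startswith(word, self.pos):
            self.pos += len(word)
            return True
        return False

    def ws(self):
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        return self.pos > start

    def expr(self):
        # Toy expression rule: capture the non-empty rest of the string.
        rest = self.text[self.pos:]
        if not rest:
            return False
        self.fields["expr"] = rest
        self.pos = len(self.text)
        return True

    # Action methods: they read captures from the state, not from arguments.
    def new_cond(self, negated):
        self.tree.append(("cond", negated, self.fields["expr"]))

    def new_loop(self, negated):
        self.tree.append(("loop", negated, self.fields["expr"]))

    def attempt(self, keyword, action, flag):
        # Checkpoint everything a failed alternative must undo.
        saved_pos = self.pos
        saved_fields = dict(self.fields)
        saved_len = len(self.tree)
        if self.literal(keyword):
            self.tree.append(("keyword", keyword))  # conjectured node
            if self.ws() and self.expr():
                action(flag)
                return True
        # Backtrack: restore position and captures, destroy conjectured nodes.
        self.pos, self.fields = saved_pos, saved_fields
        del self.tree[saved_len:]
        return False

    def modifier(self):
        alternatives = [
            ("if", self.new_cond, 0),
            ("unless", self.new_cond, 1),
            ("while", self.new_loop, 0),
            ("until", self.new_loop, 1),
        ]
        for keyword, action, flag in alternatives:
            if self.attempt(keyword, action, flag):
                return True
        return True  # the <null> alternative: no modifier, nothing consumed


ok = MatchState("unless $x > 1")
ok.modifier()

bad = MatchState("iffy")
bad.modifier()
```

After this runs, `ok.tree` holds the keyword node and a `cond` node built from the captured expression. For `bad`, the `if` keyword matches but the whitespace does not, so the conjectured keyword node is removed and the position returns to 0 before the null alternative succeeds.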
Within a closure, $_ represents the current state of the current
regex, and by extension, the current state of all the regexes
participating in the current match. (The type of the state object
is the current grammar class, which may be an anonymous type if
the current grammar has no name. If the regex is not a member of a
grammar, it's of type RULE.) Part of the state of the current regex
is the current node of the parse tree that is being built. When the
current regex succeeds, the state object becomes a result object,
and is returned to the calling regex. The calling regex can refer to the
returned object as a "hypothetical" variable, the name of which is either
implicitly generated from the name of the rule, or explicitly bound
using :=. Through that variable you can get at anything captured
by the subrule. (That is what $expr was doing earlier.)
When the entire match succeeds, the top-level node is returned as a
result object that has various values in various contexts, whether
boolean, numeric, or string context. The name of the result object
is $0. The result object contains all the other information, such
as $1, $2, etc. Unlike $& in Perl 5, $0 is lexically scoped
to the enclosing block. By extension, $1, etc. are also lexically
scoped.
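Readers who know other regex libraries may find a loose parallel helpful. Python's `re` module also returns a single match object that answers differently depending on how it is used, and it lives in an ordinary local variable instead of a global. This is only an analogy: Python has no counterpart to the numeric-context value mentioned above, and the pattern and input string below are made up for illustration.

```python
import re

# The match object is held in an ordinary local variable.
m = re.search(r"(\d+)-(\d+)", "call 555-1234 now")

matched = bool(m)      # boolean use: did the match succeed?
whole = m.group(0)     # string value of the whole match (compare $0)
first = m.group(1)     # first capture (compare $1)
second = m.group(2)    # second capture (compare $2)
```

Everything about the match is reached through the one object `m`, which is roughly the role the text gives to $0.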
As a kind of iterator, a regex stored in a variable doesn't expand
in list context unless you put angles around it or use it with m//:
$rx = /(xxx)/;
print 1,2,<$rx($_)>;
print 1,2,</(xxx)/>;
my &rx := /(xxx)/;
print 1,2,<rx($_)>;
$0, $1, etc. are not set in iterated cases like this. Each list
item is a result object, though, and you can still get at the
internal values that way.
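A comparable pattern exists in Python, offered here purely as an analogy. A compiled pattern placed in a list stays a single item; it only yields matches when it is explicitly iterated, and each item produced is a match object whose captures remain accessible. The pattern and input string are invented for the example.

```python
import re

rx = re.compile(r"(x+)")

# The pattern itself does not expand; it is just one list element.
unexpanded = [1, 2, rx]

# Explicit iteration plays the role of the angles: one match object per hit.
items = [1, 2, *rx.finditer("xx-xxx-x")]

# Each iterated item is a result object, so its captures are still reachable.
captures = [m.group(1) for m in items[2:]]
```

Here `unexpanded` has three elements, while `items` has five: the two numbers followed by three match objects.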

