
Apocalypse 5
by Larry Wall

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

Hypothetical Variables, er Values

Values that are determined within a regular expression should usually be viewed as speculative, subject to cancellation if backtracking occurs. This applies not only to the values captured by (...) within the regex, but also to values determined within closures embedded in the regex.

The scope of these values is rather strange, compared to ordinary variables. They are dynamically scoped, but not like temp variables. A temporary variable is restored at the end of the current block. A hypothetical variable keeps its value after the current block exits, and in fact keeps that value clear to the end of its natural lifetime if the regex succeeds (where the natural lifetime depends on where it's declared). But if failure causes backtracking over where the variable was set, then it is restored to its previous state.

Perl 5 actually coerced the local operator into supporting this behavior, but that was a mistake. In Perl 6 temp will keep consistent semantics, and restore values on exit from the current block. A new word, let, will indicate the desire to set a variable to a hypothetical value. (I was tempted to use "suppose", but "let" is shorter, and tends to mean the same thing, at least to mathematicians.)

    my $x;
    / (\S*) { let $x = .pos } \s* foo /

After this pattern, $x will be set to the ending position of $1--but only if the pattern succeeds. If it fails, $x is restored to undef when the closure is backtracked. It's possible to do things in a closure that the regex engine doesn't know how to backtrack, of course, but a hypothetical value doesn't fall into that category. For things that do fall into that category, perhaps we need to define a BACK block that is like UNDO, but scoped to backtracking.
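
The restore-on-backtrack behavior can be modeled outside Perl. What follows is only a rough Python sketch under stated assumptions: the class Hypothetical, its let/mark/backtrack methods, and the function match_word_then_foo are hypothetical names invented for illustration, the matcher is anchored at position 0 for simplicity, and nothing here is meant to describe how a real Perl 6 engine is implemented.

```python
class Hypothetical:
    """Variable store whose assignments can be undone on backtracking."""

    def __init__(self):
        self.vars = {}
        self.trail = []  # (name, had_value, old_value) records, newest last

    def let(self, name, value):
        # Remember the previous state so a later backtrack can restore it.
        self.trail.append((name, name in self.vars, self.vars.get(name)))
        self.vars[name] = value

    def mark(self):
        # A choice point is simply the current length of the trail.
        return len(self.trail)

    def backtrack(self, mark):
        # Undo every let made since the choice point, newest first.
        while len(self.trail) > mark:
            name, had_value, old = self.trail.pop()
            if had_value:
                self.vars[name] = old
            else:
                del self.vars[name]


def match_word_then_foo(text, state):
    # Simplified analogue of the pattern above: a greedy run of non-space
    # characters, a hypothetical assignment, optional space, then "foo".
    longest = 0
    while longest < len(text) and not text[longest].isspace():
        longest += 1
    for end in range(longest, -1, -1):       # greedy: try the longest run first
        choice_point = state.mark()
        state.let("x", end)                  # hypothetical value for x
        pos = end
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if text.startswith("foo", pos):
            return True                      # success: x keeps its value
        state.backtrack(choice_point)        # failure: x is restored
    return False


state = Hypothetical()
state.vars["x"] = None                       # plays the role of "my $x;"
matched = match_word_then_foo("bar foo", state)
```

In this sketch, a successful match leaves x at the end position of the first capture, while a failed match leaves it at whatever it held before the match began.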

Sometimes we'll talk about declaring a hypothetical variable, but as with temp variables, we're not actually declaring the variable itself, but the dynamic scope of its new value. In Perl 6, you can in fact say:

    my $x = 0;
    ...
    {
        temp $x = 1;    # temporizes the lexical variable
        ...
    }
    # $x restored to 0

(This is primarily useful for dynamically scoping a file-scoped lexical, which is slightly safer than temporizing a package variable since nobody can see it outside the file.)
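
For contrast with hypothetical values, the block-scoped behavior of temp can be imitated in Python with a context manager. This is a loose analogy rather than a translation: the helper temp and the dictionary settings are hypothetical names, and a dictionary entry stands in for the lexical variable.

```python
from contextlib import contextmanager


@contextmanager
def temp(namespace, name, value):
    """Set namespace[name] for the duration of a with-block, then restore it."""
    saved = namespace[name]
    namespace[name] = value
    try:
        yield
    finally:
        namespace[name] = saved    # restored on every exit from the block


settings = {"x": 0}
with temp(settings, "x", 1):
    inside = settings["x"]         # the temporary value is visible here
after = settings["x"]              # the old value is back after the block
```

The restoration is tied to leaving the block, which is exactly what distinguishes temp from let: a hypothetical value survives block exit and is undone only by backtracking.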

You may declare a hypothetical variable only when the topic is a regex state. This is not as much of a hardship as it might seem. Suppose your closure calls out to some other routine, and passes the regex state as an argument, $rx_state. It suffices to say:

    given $rx_state { let $x = .pos }

As it happens, $1 and friends are all simply hypothetical variables. When we say "hypothetical variable", we aren't speaking of where the variable is stored, but rather how its contents are treated dynamically. If a regex sets a hypothetical variable that was declared with either my or our beforehand, then the regex modifies that lexical or package variable, and let is purely a run-time operation.

On the other hand, if the variable is not pre-declared, it's actually stored in the regex state object. In this case, the let also serves to declare the variable as lexically scoped to the rest of the regex, in addition to its run-time action. Such a variable is not directly visible outside the regex, but you can get at it through the $0 object (always presuming the match succeeded). For a regex variable named $maybe, its external name is $0._var_{'maybe'}. The $0 object can behave as a hash, so $0{maybe} is the short way to say that.

All other variable names are stored with their sigil, so the external name for @maybe is $0{'@maybe'}, and for %maybe is $0{'%maybe'}.

$1 is a special case--it's visible outside the regex, not because it's predeclared, but because Perl already knows that the numbered variable $1 is really stored as a subarray of the $0 object: $0[1]. The numbered variables are available only through the array, not the hash.

Since $0 represents the state of the currently executing regex, you can't use it within a rule to get at the result of a completed subrule. When you successfully call a subrule named <somerule>, the regex state is automatically placed in a hypothetical variable named $somerule. (Rules accessed indirectly must be captured explicitly, or they won't have a name by which you can get to them. More on that in the next section.)

As the current recursive regex executes, it automatically builds a tree of hashes corresponding to all captured hypothetical variables. So from outside the regex, you could get at the $1 of the subrule <somerule> by saying $0{somerule}[1].
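
One way to picture that tree is as nested objects that answer to both numeric and string subscripts. The Python sketch below is an illustration only: MatchState and its numbered and named attributes are hypothetical names, and how a real implementation lays out the $0 object is not specified here.

```python
class MatchState:
    """Toy stand-in for the $0 object: subscript by number or by name."""

    def __init__(self):
        self.numbered = {}   # 1 -> value of $1, 2 -> value of $2, ...
        self.named = {}      # "maybe", "@maybe", "%maybe", or a subrule name

    def __getitem__(self, key):
        # Integer subscripts reach the numbered captures (the array side);
        # string subscripts reach the named variables (the hash side).
        if isinstance(key, int):
            return self.numbered[key]
        return self.named[key]


# State left behind by a successful call to a subrule named "somerule".
sub = MatchState()
sub.numbered[1] = "42"

# State of the outer regex, which holds the subrule's state by name.
top = MatchState()
top.named["somerule"] = sub
top.named["maybe"] = "a scalar"        # $maybe is stored without its sigil
top.named["@maybe"] = ["a", "list"]    # other names keep their sigil

inner_capture = top["somerule"][1]     # analogue of $0{somerule}[1]
```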

Named Captures

Suppose you want to use a hypothetical variable to bind a name to a capture:

    / (\S+) { let $x := $1 } /

A shorthand for that is:

    / $x:=(\S+) /

The parens are numbered independently of any name, so $x is an alias for $1.
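
Readers who know other regex dialects may recognize the idea. Python's standard re module, for instance, has a comparable (though not identical) feature: a named group still takes its place in the numbering, so the name and the number refer to the same capture. The snippet is offered as an analogy only, and the variable names are arbitrary.

```python
import re

# (?P<x>...) names the group, but it is still counted as group 1.
m = re.search(r"(?P<x>\S+)", "hello world")

by_name = m.group("x")
by_number = m.group(1)
number_of_x = m.re.groupindex["x"]   # maps each group name to its number
```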

You may also use arrays to capture appropriately quantified patterns:

    / @x := (\S+ \s*)* /                # including space
    / @x := [ (\S+) \s* ]* /            # excluding space
    / @x := [ (\S+) (\s*) ]* /          # each element is [word, space]

Note that in general, naming square brackets doesn't cause the square brackets to capture, but rather provides a destination for the parens within the square brackets. Only parens and rules can capture. It's illegal to name square brackets that don't capture something inside.

You can also capture to a hash:

    / %x := [ (\S+)\: \s* (.*) ]* /     # key/value pairs

After that match, $1 returns a list of keys, and $2 returns a list of values. You can capture just the keys:

    / %x := [ (\S+) \s* ]* /            # just enter keys, values are undef
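
The effect of the two hash captures can be approximated in Python, again purely as a sketch. The sample input, the variable names, and the choice to match one pair per line are assumptions made for the example; Python's re has no direct equivalent of binding a quantified group to a hash.

```python
import re

text = "name: Larry\nlang: Perl\n"

# Each match yields one (key, value) tuple, like one pass through the
# bracketed group in the key/value pattern above.
pairs = re.findall(r"(\S+):[ \t]*(.*)", text)

x = dict(pairs)                      # analogue of %x
keys = [k for k, _ in pairs]         # analogue of the list returned by $1
values = [v for _, v in pairs]       # analogue of the list returned by $2

# Keys only: every key is entered with an undefined (None) value.
only_keys = dict.fromkeys(re.findall(r"\S+", "alpha beta gamma"))
```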

You can capture a closure's return value too:

    / $x := { "I'm in scalar context" } /
    / @x := { "I", "am", "in", "list", "context" } /
    / %x := { "I" => "am in hash context" } /

Note that these do not use parens. If you say:

    / $x := ({ code }) /

it would capture whatever text was traversed by the closure, but ignore the closure's actual return value.

You can reorder paren groups by naming them with numeric variables:

    / $2:=(.*?), \h* $1:=(.*) /

If you use a numeric variable, the numeric variables will start renumbering from that point, so subsequent captures can be of a known number (which clobbers any previous association with that number). So for instance you can reset the numbers for each alternative:

    / $1 := (.*?) (\:)  (.*) { process $1, $2, $3 }
    | $1 := (.*?) (=\>) (.*) { process $1, $2, $3 }
    | $1 := (.*?) (-\>) (.*) { process $1, $2, $3 }
    /
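
The payoff of resetting the numbers is that the code following each alternative can rely on the same three positions. Python's standard re module numbers groups across the whole pattern, so the sketch below gets a similar result by trying one compiled pattern per alternative, in order. ALTERNATIVES and split_pair are hypothetical names, and the function stands in for the process call in the example above.

```python
import re

# One pattern per alternative, so groups 1..3 mean the same thing in each.
ALTERNATIVES = [
    re.compile(r"(.*?)(:)(.*)"),
    re.compile(r"(.*?)(=>)(.*)"),
    re.compile(r"(.*?)(->)(.*)"),
]


def split_pair(text):
    """Return (left, separator, right) from the first alternative that matches."""
    for pattern in ALTERNATIVES:
        m = pattern.match(text)
        if m:
            return m.group(1), m.group(2), m.group(3)
    return None


colon = split_pair("key:value")
fat_arrow = split_pair("key=>value")
thin_arrow = split_pair("key->value")
```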
