Apocalypse 5
by Larry Wall
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Hypothetical Variables, er Values
Values that are determined within a regular expression should usually
be viewed as speculative, subject to cancellation if backtracking
occurs. This applies not only to the values captured by (...)
within the regex, but also to values determined within closures
embedded in the regex. The scope of these values is rather strange,
compared to ordinary variables. They are dynamically scoped, but
not like temp variables. A temporary variable is restored at
the end of the current block. A hypothetical variable keeps its
value after the current block exits, and in fact keeps that value
clear to the end of its natural lifetime if the regex succeeds
(where the natural lifetime depends on where it's declared).
But if failure causes backtracking over where the variable was set,
then it is restored to its previous state. Perl 5 actually coerced
the local operator into supporting this behavior, but that was
a mistake. In Perl 6 temp will keep consistent semantics, and
restore values on exit from the current block. A new word, let,
will indicate the desire to set a variable to a hypothetical value.
(I was tempted to use "suppose", but "let" is shorter, and tends to
mean the same thing, at least to mathematicians.)
my $x;
/ (\S*) { let $x = .pos } \s* foo /
After this pattern, $x will be set to the ending position of $1--but only
if the pattern succeeds. If it fails, $x is restored to undef when the closure
is backtracked. It's possible to do things in a closure that the regex
engine doesn't know how to backtrack, of course, but a hypothetical
value doesn't fall into that category. For things that do fall into
that category, perhaps we need to define a BACK block that is like
UNDO, but scoped to backtracking.
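To make the backtracking behavior concrete, here is a minimal sketch in Python rather than Perl 6. It hand-codes the one pattern above instead of using a regex engine, and the names match_with_hypothetical and trail are inventions for this illustration, not anything from the design. It assumes greedy \S* that gives back one character at a time.

```python
def match_with_hypothetical(text):
    # Models / (\S*) { let $x = .pos } \s* foo / on a plain string.
    # Returns (matched, x, trail); trail lists every value x was set to.
    x = None                                     # my $x;
    trail = []
    for start in range(len(text) + 1):           # unanchored: try each start
        longest = start
        while longest < len(text) and not text[longest].isspace():
            longest += 1
        # Greedy \S*: try the longest run first, then give characters back.
        for pos in range(longest, start - 1, -1):
            saved = x
            x = pos                              # let $x = .pos
            trail.append(pos)
            after_ws = pos
            while after_ws < len(text) and text[after_ws].isspace():
                after_ws += 1
            # \s* needs no retry loop here: "foo" cannot start on whitespace.
            if text.startswith("foo", after_ws):
                return True, x, trail            # success: x keeps its value
            x = saved                            # backtracked: undo the let
    return False, x, trail
```

On "abfoo" the sketch sets x to 5, 4, and 3 before settling on 2, and each of the failed settings is undone before the next is tried. On "abc" every attempt fails and x ends up back at its original undefined value.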
Sometimes we'll talk about declaring a hypothetical variable, but as with
temp variables, what we're actually declaring is not the variable itself but
the dynamic scope of its new value. In Perl 6, you can in fact say:
my $x = 0;
...
{
    temp $x = 1; # temporizes the lexical variable
    ...
}
# $x restored to 0
(This is primarily useful for dynamically scoping a file-scoped lexical, which is slightly safer than temporizing a package variable since nobody can see it outside the file.)
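For contrast with let, the block-scoped restoration that temp performs can be approximated in Python with a context manager. This is only an analogy: the temp helper and the variables dictionary below are hypothetical names for this sketch, and Python has no direct equivalent of temporizing a lexical.

```python
from contextlib import contextmanager


@contextmanager
def temp(namespace, name, value):
    # Give namespace[name] a new value for the duration of one block.
    saved = namespace[name]
    namespace[name] = value
    try:
        yield
    finally:
        namespace[name] = saved      # always restored at block exit


variables = {"x": 0}
seen = []
with temp(variables, "x", 1):
    seen.append(variables["x"])      # value inside the block
seen.append(variables["x"])          # value after the block exits
```

The restoration here depends only on leaving the block, never on whether anything succeeded or failed, which is exactly the property that distinguishes temp from let.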
You may declare a hypothetical variable only when the topic is a regex state.
This is not as much of a hardship as it might seem. Suppose your closure
calls out to some other routine, and passes the regex state as an argument, $rx_state.
It suffices to say:
given $rx_state { let $x = .pos }
As it happens, $1 and friends are all simply hypothetical variables.
When we say "hypothetical variable", we aren't speaking of where the
variable is stored, but rather how its contents are treated dynamically.
If a regex sets a hypothetical variable that was declared with either
my or our beforehand, then the regex modifies that lexical or
package variable, and let is purely a run-time operation.
On the other hand, if the variable is not pre-declared, it's actually
stored in the regex state object. In this case, the let also serves
to declare the variable as lexically scoped to the rest of the regex,
in addition to its run-time action. Such a variable is not directly
visible outside the regex, but you can get at it through the $0 object
(always presuming the match succeeded). For a regex variable named
$maybe, its external name is $0._var_{'maybe'}. The $0 object
can behave as a hash, so $0{maybe} is the short way to say that.
All other variable names are stored with their sigil, so the
external name for @maybe is $0{'@maybe'}, and for %maybe
is $0{'%maybe'}.
$1 is a special case--it's visible outside the regex, not because
it's predeclared, but because Perl already knows that the numbered variable
$1 is really stored as a subarray of the $0 object: $0[1]. The
numbered variables are available only through the array, not the hash.
Since $0 represents the state of the currently executing regex,
you can't use it within a rule to get at the result of a completed
subrule. When you successfully call a subrule named <somerule>,
the regex state is automatically placed in a hypothetical variable
named $somerule. (Rules accessed indirectly must be captured
explicitly, or they won't have a name by which you can get to them.
More on that in the next section.)
As the current recursive regex executes, it automatically builds a
tree of hashes corresponding to all captured hypothetical variables.
So from outside the regex, you could get at the $1 of the subrule
<somerule> by saying $0{somerule}[1].
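The shape of that tree can be modeled with a small Python class that answers both array-style and hash-style lookups. This is a sketch of the data structure only, under the assumption that a Python class is a fair stand-in for $0; MatchState and its attribute names are made up for the illustration.

```python
class MatchState:
    # Toy stand-in for the $0 object: array access for numbered captures,
    # hash access for named hypothetical variables (a modelling assumption).
    def __init__(self, captures=(), variables=None):
        # Slot 0 is a placeholder so that state[1] lines up with $1.
        self._captures = [None] + list(captures)
        self._variables = dict(variables or {})

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._captures[key]      # like $0[1]
        return self._variables[key]         # like $0{maybe} or $0{'@maybe'}


inner = MatchState(captures=["inner text"])
outer = MatchState(
    captures=["outer text"],
    variables={"maybe": 42, "@maybe": [1, 2], "somerule": inner},
)
```

With these objects, outer["somerule"][1] plays the role of $0{somerule}[1], and a string key such as "1" finds nothing, mirroring the rule that numbered variables are reachable only through the array.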
Named Captures
Suppose you want to use a hypothetical variable to bind a name to a capture:
/ (\S+) { let $x := $1 } /
A shorthand for that is:
/ $x:=(\S+) /
The parens are numbered independently of any name, so $x is an alias for $1.
You may also use arrays to capture appropriately quantified patterns:
/ @x := (\S+ \s*)* / # including space
/ @x := [ (\S+) \s* ]* / # excluding space
/ @x := [ (\S+) (\s*) ]* / # each element is [word, space]
Note that in general, naming square brackets doesn't cause the square brackets to capture, but rather provides a destination for the parens within the square brackets. Only parens and rules can capture. It's illegal to name square brackets that don't capture something inside.
You can also capture to a hash:
/ %x := [ (\S+)\: \s* (.*) ]* / # key/value pairs
After that match, $1 returns a list of keys, and $2 returns a list of values.
You can capture just the keys:
/ %x := [ (\S+) \s* ]* / # just enter keys, values are undef
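As a rough analogy for the array and hash captures above, Python's standard re module can collect repeated groups with re.findall. The correspondence is approximate and is an assumption of this sketch: findall scans for successive matches instead of binding a quantified group, and in Python the dot stops at a newline, so the key/value example treats each line as one pair.

```python
import re

words = "one two  three"

# Like / @x := (\S+ \s*)* /, including the trailing space.
with_space = re.findall(r"\S+\s*", words)

# Like / @x := [ (\S+) \s* ]* /, excluding the space.
without_space = re.findall(r"(\S+)\s*", words)

# Like / @x := [ (\S+) (\s*) ]* /, each element a (word, space) pair.
word_and_space = re.findall(r"(\S+)(\s*)", words)

# Like / %x := [ (\S+)\: \s* (.*) ]* /, key/value pairs (one per line here).
pairs = dict(re.findall(r"(\S+):\s*(.*)", "name: Larry\nlang: Perl"))

# Like / %x := [ (\S+) \s* ]* /, keys only, with undefined values.
keys_only = dict.fromkeys(re.findall(r"(\S+)\s*", "a b"))
```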
You can capture a closure's return value too:
/ $x := { "I'm in scalar context" } /
/ @x := { "I", "am", "in", "list", "context" } /
/ %x := { "I" => "am in hash context" } /
Note that these do not use parens. If you say:
/ $x := ({ code }) /
it would capture whatever text was traversed by the closure, but ignore the closure's actual return value.
You can reorder paren groups by naming them with numeric variables:
/ $2:=(.*?), \h* $1:=(.*) /
When you name a group with a numeric variable, numbering restarts from that point, so subsequent captures get known numbers (clobbering any previous association with those numbers). For instance, you can reset the numbers for each alternative:
/ $1 := (.*?) (\:) (.*) { process $1, $2, $3 }
| $1 := (.*?) (=\>) (.*) { process $1, $2, $3 }
| $1 := (.*?) (-\>) (.*) { process $1, $2, $3 }
/
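Python's standard re module numbers groups straight across alternatives, so a single alternation there would yield groups 1 through 9. One way to sketch the effect of resetting the numbers is to try each alternative as its own pattern, so that groups 1 to 3 always mean the same thing. The function name split_pair and the choice of re.fullmatch are assumptions of this sketch, not part of the design.

```python
import re

# One pattern per alternative; each numbers its own groups from 1.
ALTERNATIVES = [
    r"(.*?)(:)(.*)",
    r"(.*?)(=>)(.*)",
    r"(.*?)(->)(.*)",
]


def split_pair(text):
    # Return (left, separator, right) from the first alternative that matches.
    for pattern in ALTERNATIVES:
        m = re.fullmatch(pattern, text)
        if m:
            return m.group(1), m.group(2), m.group(3)
    return None
```

Whichever alternative matches, the caller sees the same three positions, which is the point of the renumbering in the Perl 6 pattern.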

