Apocalypse 5
by Larry Wall
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Accepted RFCs
RFC 072: Variable-length lookbehind.
This seems good to me. It's just a SMOP to reverse the ordering of nodes in the syntax tree, and I think we can pretty well determine when it's impossible to reverse the tree. The operation of a reversed syntax tree will not be totally transparent, however, so it will be necessary to document that quantifiers will actually be working right-to-left rather than left-to-right. (It's probably also a good idea to document that many syntactic constructs can't actually be reliably recognized in reverse. An attempt to do so probably means you needed to do a lookahead earlier, rather than a lookbehind later.)
The syntax of lookbehind uses the new assertion syntax:
<after ...> # positive lookbehind
<!after ...> # negative lookbehind
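The idea of running the pattern in reverse can be illustrated outside Perl. The following is only a rough Python sketch, not the proposed Perl 6 mechanism: it assumes the pattern's nodes have already been reversed by hand, and the helper name preceded_by is hypothetical. It checks a lookbehind at a given position by matching the reversed pattern against the reversed prefix, so the quantifier effectively consumes characters right-to-left.

```python
import re


def preceded_by(text: str, pos: int, reversed_pattern: str) -> bool:
    """Report whether the text ending at `pos` matches, read right to left.

    `reversed_pattern` is assumed to be the forward pattern with its nodes
    in reverse order; this sketch does not perform that reversal itself.
    """
    # Reverse the text before `pos`, so its last character comes first.
    reversed_prefix = text[:pos][::-1]
    return re.match(reversed_pattern, reversed_prefix) is not None


# Forward pattern: a b+   (an "a" followed by one or more "b")
# Reversed nodes:  b+ a   (reversed by hand; b+ now runs right to left)
REVERSED = r"b+a"

text = "xabbb!"
print(preceded_by(text, 5, REVERSED))  # position 5 is just before "!"
print(preceded_by(text, 1, REVERSED))  # position 1 is just after "x"
```

Reversing by hand works here because the pattern is a plain sequence; as noted above, many constructs cannot be reliably recognized in reverse.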
Yes, the pos() function could return multiple values in list context,
but I think it's more reasonable for the individual captured elements
to know where their positions are. The pos function is really
just a special case of a more general data structure contained in the
regex result object from the last successful match. In which case,
maybe it really needs to have a better name than pos. Maybe $0
or something. Then we get $0.beg and $0.end, $1.beg,
and $1.end, etc. Since @$0 returns a list of captures, you
can do @$0^.beg and @$0^.end if you want a list of beginnings
and endings. Did I mention that the magical @+ and @- arrays
are gonna be real dead? Never could remember which one was which
anyway...
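For comparison, Python's match objects already let each capture report its own position, which is roughly the shape suggested here for $1.beg and $1.end. This is an analogy only, not Perl 6 syntax; the sample string and the variable names are made up for illustration.

```python
import re

m = re.search(r"(\w+)@(\w+)", "mail bob@example now")
assert m is not None

# Group 0 is the whole match, loosely comparable to $0.beg and $0.end.
whole = (m.start(0), m.end(0))

# One beginning and one ending per capture, loosely comparable to
# @$0^.beg and @$0^.end.
begs = [m.start(i) for i in range(1, m.re.groups + 1)]
ends = [m.end(i) for i in range(1, m.re.groups + 1)]

print(whole)  # (5, 16)
print(begs)   # [5, 9]
print(ends)   # [8, 16]
```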
RFC 093: Regex: Support for incremental pattern matching
I don't think this proposal is powerful enough. "Infinite" strings are a more powerful concept. But I don't think infinite strings are powerful enough either!
We're certainly going to have "infinite" arrays for which missing
elements are defined by a generator (where the action could be as simple as
reading more data from some other source). We could do the same thing for strings
directly, or we could define strings that are implemented underneath
via arrays (of strings or of stringifiable objects), and achieve
infinitude that way. This latter approach has the benefit that the
array element boundaries could be meaningful as zero-width boundaries
between, say, tokens in a token stream. We're thinking that <,>
could match such a boundary.
But beyond that, such arrays-as-strings could allow us to associate hidden metadata with the tokens, if the abstract string is constructed from a list of objects, or a list of strings with properties. This is typically how a parser would receive data from a lexical analyzer. It's the parser's job to transform the linear stream of objects into a parse tree of objects.
Matching against such boundaries or metadata would not be possible
unless either the regex engine is aware that it is matching against an
array, or the string emulation provides visibility through the abstract
string into the underlying array. The latter may be preferable,
since (by the rules of the =~ matrix discussed in Apocalypse 4)
@array =~ /regex/ is currently interpreted as matching against
each element of the array individually rather than sequentially, and
there are other uses for a string that's really an array. In fact,
@array =~ /regex/ could conceivably be matching against a set of
infinite strings in parallel, though that seems a bit scary.
Even if we don't care about the boundaries between array elements, this approach gives us the ability to read a file in chunks and not worry that the pattern won't match because it happens to span a boundary.
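The chunk-boundary point can be sketched in Python. This is a loose illustration rather than the proposed design: the class name LazyString and its methods are hypothetical, and the stopping rule is a deliberate simplification that can give the wrong answer for patterns using alternation, lookahead, or end anchors.

```python
import re
from typing import Iterable, Iterator, Optional


class LazyString:
    """Hypothetical string that grows on demand from an iterator of chunks."""

    def __init__(self, chunks: Iterable[str]) -> None:
        self._chunks: Iterator[str] = iter(chunks)
        self.buffer = ""
        self.exhausted = False

    def _extend(self) -> bool:
        """Append one more chunk; return False once the source is drained."""
        try:
            self.buffer += next(self._chunks)
            return True
        except StopIteration:
            self.exhausted = True
            return False

    def search(self, pattern: str) -> Optional[re.Match]:
        regex = re.compile(pattern)
        while True:
            m = regex.search(self.buffer)
            # Simplification: trust a match only if it stops short of the
            # buffer end, since a match touching the end might still grow.
            if m is not None and m.end() < len(self.buffer):
                return m
            if not self._extend():
                # No more input: whatever matches now is final.
                return regex.search(self.buffer)


s = LazyString(["foo ba", "r baz"])
m = s.search(r"ba\w")
print(m.group(), m.span())  # bar (4, 7)
```

The match succeeds even though "bar" straddles the two chunks, because the pattern only ever sees the joined buffer.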
It might be objected that matching against a subroutine rather than
an infinite string or array has the benefit of not promising to keep
around the entire string or array in memory. But this is not really
a feature, since in general a regex can potentially backtrack all
the way to the beginning of the string. And there's nothing to say
that the front of the infinite string or array has to stay around
anyway. Whether to throw away the head of a string or array should
really depend on the programmer, and I don't think there's a more
intuitive way to manage that than to simply let the programmer whack
off the front of the string or array using operators like substr
or splice, or the new <cut> assertion.
Indeed, the very existence of the string/array precludes the caching problem that RFC 316 complains about.
The question remains how to declare such a string/array. If we decided to do a magical name identification, we could conceivably declare
my $@array;
and then both $array and @array would refer to the same object, treated
as a string when you say $array and as an array when you say @array.
One is tempted to set up the input routine by saying
my $@array is from { <$input> };
Additional lines (or chunks) would then come from the <$input> iterator.
But really, the infinite nature of the array is a feature of the underlying object, not the variable. After all, we want to be able to say
@array := 1..Inf;
even with an ordinary array.
So we could even make this work:
my $@array := <$input>;
But I think we need to break the aliasing down, which will give us more flexibility at the expense of more verbiage:
my @array := <$input>; # @array now bound to iterator
my $array is ArrayString(@array); # an ordinary tie
That would let us do cool and/or sick things like this:
my @lines := <$article>;
my $_ is ArrayString(@lines);
s/^ .*? \n<2,> //; # delete header from $_ AND @lines!
for @lines { ... } # process remaining lines
The for loop potentially runs forever, since @lines is implicitly
extended from an iterator. The array is automatically extended on
the end, but it's not automatically shifted on the front. So if you
really did want the loop to run forever without exhausting memory,
you'd need to say something like:
substr($_, 0, $_.pos, "");
The same effect can be achieved within a regex by asserting <cut>, which makes the
current position the new beginning of the string. (If you backtrack over <cut>, the entire
match will fail.)
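The memory argument can be sketched in Python as well. This is an assumption-laden analogue, not an implementation of <cut>: the generator name tokens is hypothetical, and it simply drops the consumed front of the buffer after each round of matching, much like the substr call above.

```python
import re
from itertools import count, islice
from typing import Iterable, Iterator


def tokens(chunks: Iterable[str], pattern: str) -> Iterator[str]:
    """Yield successive matches while keeping only the unconsumed tail."""
    regex = re.compile(pattern)
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        pos = 0
        while True:
            m = regex.search(buffer, pos)
            # Defer a match that touches the buffer end; the next chunk
            # might extend it.
            if m is None or m.end() == len(buffer):
                break
            yield m.group()
            pos = m.end()
        # Whack off the front: everything before `pos` is never revisited.
        buffer = buffer[pos:]
    # The source is drained, so any remaining matches are final.
    for m in regex.finditer(buffer):
        yield m.group()


# An endless source; only the first three tokens are ever requested.
endless = (f"{n}," for n in count())
print(list(islice(tokens(endless, r"\d+,"), 3)))  # ['0,', '1,', '2,']
```

Because the head is discarded as matching proceeds, the buffer stays small however long the source runs. The price is that nothing can backtrack past the discarded point.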

