
Apocalypse 5
by Larry Wall

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

Accepted RFCs

RFC 072: Variable-length lookbehind.

This seems good to me. It's just a SMOP to reverse the ordering of nodes in the syntax tree, and I think we can pretty well determine when it's impossible to reverse the tree. The operation of a reversed syntax tree will not be totally transparent, however, so it will be necessary to document that quantifiers will actually be working right-to-left rather than left-to-right. (It's probably also a good idea to document that many syntactic constructs can't actually be reliably recognized in reverse. An attempt to do so probably means you needed to do a lookahead earlier, rather than a lookbehind later.)

The syntax of lookbehind uses the new assertion syntax:

    <after ...>         # positive lookbehind
    <!after ...>        # negative lookbehind
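
For readers who want to poke at the idea today, here is a minimal Python sketch. Python's built-in `re` module accepts only fixed-width lookbehind, so the hypothetical helper `search_after` emulates `<after ...>` and `<!after ...>` by testing the text to the left of each candidate position. Note that this is a plain prefix check, not the reversed-syntax-tree technique described above, and the helper name and its parameters are assumptions made for illustration.

```python
import re

def search_after(behind, target, text, negative=False):
    """Find the first match of `target` that is (or, if negative, is not)
    immediately preceded by a match of `behind`.

    Hypothetical helper: it tests the text to the left of each candidate
    position rather than reversing the pattern.
    """
    # \Z forces the "behind" pattern to end exactly at the candidate position.
    behind_re = re.compile(r"(?:%s)\Z" % behind)
    target_re = re.compile(target)
    for pos in range(len(text) + 1):
        preceded = behind_re.search(text[:pos]) is not None
        if preceded != negative:
            m = target_re.match(text, pos)
            if m:
                return m
    return None

text = "EUR 10, USD   25, USD7"
hit = search_after(r"USD\s+", r"\d+", text)             # variable-width "behind"
miss = search_after(r"USD\s+", r"\d+", text, negative=True)
print(hit.group(), miss.group())
```

The prefix is rescanned at every position, so this costs far more than a real lookbehind would. That cost is a fair illustration of why running the pattern right-to-left from the current position is the more attractive implementation.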

Yes, the pos() function could return multiple values in list context, but I think it's more reasonable for the individual captured elements to know where their positions are. The pos function is really just a special case of a more general data structure contained in the regex result object from the last successful match. In which case, maybe it really needs to have a better name than pos. Maybe $0 or something. Then we get $0.beg and $0.end, $1.beg, and $1.end, etc. Since @$0 returns a list of captures, you can do @$0^.beg and @$0^.end if you want a list of beginnings and endings. Did I mention that the magical @+ and @- arrays are gonna be real dead? Never could remember which one was which anyway...
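
As a point of comparison, Python's match objects already work roughly this way: each capture knows its own offsets. The sketch below is only an analogy to the proposed `$0.beg` and `$1.end`, and `capture_spans` is a hypothetical helper name, not anything from the design above.

```python
import re

def capture_spans(match):
    """Return (beg, end) offsets for each numbered capture, in order."""
    return [match.span(i) for i in range(1, match.re.groups + 1)]

m = re.search(r"(\w+)@(\w+)\.com", "mail bob@example.com now")

whole = m.span()              # offsets of the entire match
spans = capture_spans(m)      # one (beg, end) pair per capture
begs = [beg for beg, _ in spans]   # list of beginnings
ends = [end for _, end in spans]   # list of endings
print(whole, begs, ends)
```

In Python a capture that did not participate in the match reports `(-1, -1)`, so a caller that wants only real positions would need to filter those out.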

RFC 093: Regex: Support for incremental pattern matching

I don't think this proposal is powerful enough. "Infinite" strings are a more powerful concept. But I don't think infinite strings are powerful enough either!

We're certainly going to have "infinite" arrays for which missing elements are defined by a generator (where the action could be as simple as reading more data from some other source). We could do the same thing for strings directly, or we could define strings that are implemented underneath via arrays (of strings or of stringifiable objects), and achieve infinitude that way. This latter approach has the benefit that the array element boundaries could be meaningful as zero-width boundaries between, say, tokens in a token stream. We're thinking that <,> could match such a boundary.

But beyond that, such arrays-as-strings could allow us to associate hidden metadata with the tokens, if the abstract string is constructed from a list of objects, or a list of strings with properties. This is typically how a parser would receive data from a lexical analyzer. It's the parser's job to transform the linear stream of objects into a parse tree of objects.

Matching against such boundaries or metadata would not be possible unless either the regex engine is aware that it is matching against an array, or the string emulation provides visibility through the abstract string into the underlying array. The latter may be preferable, since (by the rules of the =~ matrix discussed in Apocalypse 4) @array =~ /regex/ is currently interpreted as matching against each element of the array individually rather than sequentially, and there are other uses for a string that's really an array. In fact, @array =~ /regex/ could conceivably be matching against a set of infinite strings in parallel, though that seems a bit scary.

Even if we don't care about the boundaries between array elements, this approach gives us the ability to read a file in chunks and not worry that the pattern won't match because it happens to span a boundary.
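
A minimal Python sketch of that chunk-spanning behavior follows. The class `GrowingText` and its methods are hypothetical names invented for illustration. It relies on a heuristic that is an assumption of this sketch: a match that ends before the end of the buffer is treated as final, which patterns with optional tails or lookahead can defeat.

```python
import re

class GrowingText:
    """A string buffer that is extended on demand from an iterator of chunks."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self.text = ""

    def grow(self):
        """Append the next chunk; return False once the source is used up."""
        try:
            self.text += next(self._chunks)
            return True
        except StopIteration:
            return False

    def search(self, pattern, pos=0):
        """Search from `pos`, pulling more chunks while the result might change."""
        regex = re.compile(pattern)
        while True:
            m = regex.search(self.text, pos)
            # Heuristic: a match that stops short of the buffer end is final.
            if m and m.end() < len(self.text):
                return m
            # Otherwise more data could create or extend a match.
            if not self.grow():
                return m  # source exhausted, so this answer is final

buf = GrowingText(["alpha be", "ta gam", "ma"])
m = buf.search(r"beta\s+gamma")   # spans two chunk boundaries
print(m.group(), m.span())
```

Nothing is ever dropped from the front of the buffer here, so memory grows with the input. Trimming the head is a separate decision, left to the caller.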

It might be objected that matching against a subroutine rather than an infinite string or array has the benefit of not promising to keep around the entire string or array in memory. But this is not really a feature, since in general a regex can potentially backtrack all the way to the beginning of the string. And there's nothing to say that the front of the infinite string or array has to stay around anyway. Whether to throw away the head of a string or array should really depend on the programmer, and I don't think there's a more intuitive way to manage that than to simply let the programmer whack off the front of the string or array using operators like substr or splice, or the new <cut> assertion.

Indeed, the very existence of the string/array precludes the caching problem that RFC 316 complains about.

The question remains how to declare such a string/array. If we decided to do a magical name identification, we could conceivably declare

    my $@array;

and then both $array and @array refer to the same object (but treated as a string when you say $array and as an array when you say @array). One is tempted to set up the input routine by saying

    my $@array is from { <$input> };

Additional lines (or chunks) would then come from the <$input> iterator.

But really, the infinite nature of the array is a feature of the underlying object, not the variable. After all, we want to be able to say

    @array := 1..Inf;

even with an ordinary array.

So we could even make this work:

    my $@array := <$input>;

But I think we need to break the aliasing down, which will give us more flexibility at the expense of more verbiage:

    my @array := <$input>;              # @array now bound to iterator
    my $array is ArrayString(@array);   # an ordinary tie
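
To make the tie concrete, here is a rough Python analogue. It is a sketch under stated assumptions, not the Perl 6 design: the class name is borrowed from the snippet above, but its methods are hypothetical, and it re-derives element boundaries from line endings instead of preserving the original array boundaries.

```python
import re

class ArrayString:
    """Hypothetical string view over a list of lines (illustration only)."""

    def __init__(self, lines):
        self.lines = lines          # shared with the caller, not copied

    def __str__(self):
        return "".join(self.lines)

    def sub(self, pattern, repl, count=1):
        """Substitute on the joined text, then write the result back into the list."""
        new_text = re.sub(pattern, repl, str(self), count=count)
        # Simplification: element boundaries are re-derived from line endings.
        self.lines[:] = new_text.splitlines(keepends=True)
        return new_text

lines = ["one\n", "two\n", "three\n"]
view = ArrayString(lines)
view.sub(r"\n(?=two)", " ")   # an edit made through the string view...
print(lines)                  # ...is visible through the array as well
```

Because the list is mutated in place, any other code holding a reference to `lines` sees the change, which is the essential property of the tie.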

That would let us do cool and/or sick things like this:

    my @lines := <$article>;
    my $_ is ArrayString(@lines);
    s/^ .*? \n<2,> //;  # delete header from $_ AND @lines!
    for @lines { ... }  # process remaining lines

The for loop potentially runs forever, since @lines is implicitly extended from an iterator. The array is automatically extended on the end, but it's not automatically shifted on the front. So if you really did want the loop to run forever without exhausting memory, you'd need to say something like:

    substr($_, 0, $_.pos, "");

The same effect can be achieved within a regex by asserting <cut>, which makes the current position the new string beginning. (If you backtrack over <cut>, the entire match will fail.)
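
The "whack off the front" discipline can be sketched in Python as a generator that discards everything up to the end of each match, much as the `substr` call above does. The function `scan_stream` and its variable names are hypothetical, and the sketch assumes the pattern never matches the empty string. It does not model the backtracking behavior of `<cut>` itself.

```python
import re

def scan_stream(chunks, pattern):
    """Yield (absolute_offset, matched_text), keeping only unconsumed text."""
    regex = re.compile(pattern)
    source = iter(chunks)
    buf = ""
    base = 0        # absolute offset of buf[0] within the whole stream
    done = False    # True once the source has no more chunks
    while True:
        m = regex.search(buf)
        # Accept a match only if later data cannot extend it (heuristic),
        # or if there is no later data.
        if m and (m.end() < len(buf) or done):
            if m.end() == m.start():
                raise ValueError("pattern must not match the empty string")
            yield base + m.start(), m.group()
            # The "cut": the end of the match becomes the new string beginning.
            base += m.end()
            buf = buf[m.end():]
            continue
        if done:
            return
        try:
            buf += next(source)
        except StopIteration:
            done = True

records = list(scan_stream(["k1=a;k", "2=bb;k3", "=c;"], r"k\d=\w+;"))
print(records)
```

The buffer stays small only as long as matches keep arriving. If the pattern never matches, the buffer grows without bound, which is the trade-off described above: deciding when the head can go is up to the programmer.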
