Apocalypse 5
by Larry Wall
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Difficult to use nested patterns
For many of the reasons we've mentioned, it's difficult to make regexes refer to each other, and even if you do, it's almost impossible to get the nested information back out of them. And there are entire classes of parsing problems that are not solvable without recursive definitions.
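To make both complaints concrete, here is a minimal sketch in Python, whose `re` module behaves like a traditional Perl 5-style engine on these points. The pattern, the sample strings, and the helper `parse_nested` are invented for illustration. The first part shows that a capture group inside a repetition keeps only its last iteration, so the earlier items are lost. The second part shows that nested parentheses are most naturally handled by a definition that calls itself.

```python
import re

# A capture group inside a repetition reports only its final iteration:
# "a" and "b" were matched along the way, but cannot be recovered afterward.
m = re.fullmatch(r"(?:(\w+),?)+", "a,b,c")
last_item = m.group(1)


def parse_nested(text, pos=0):
    """Turn 'a(b(c)d)e' into nested lists by recursing on each '('.

    Returns (tree, next_position). This sketch does not validate that the
    parentheses are balanced; it only shows the recursive shape.
    """
    items = []
    while pos < len(text):
        ch = text[pos]
        if ch == "(":
            # Recursive definition: a group contains items and further groups.
            child, pos = parse_nested(text, pos + 1)
            items.append(child)
        elif ch == ")":
            return items, pos + 1
        else:
            items.append(ch)
            pos += 1
    return items, pos


tree, _ = parse_nested("a(b(c)d)e")
```

Here `last_item` ends up as `"c"`, and `tree` keeps the nesting that a flat list of captures would throw away.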
Little support for grammars
Even if it were easier for regexes to refer to other regexes, we'd still have the problem that those other regexes aren't organized in any meaningful way. They might be off in variables that come and go at the whim of the surrounding context.
When we have an organized system of parsing rules, we call it a grammar. One advantage of having a grammar is that you can optimize based on the assumption that the rules maintain their relationship to each other. For instance, if you think of grammar rules as a funny kind of subroutine, you can write an optimizer to inline some of the subrules--but only if you know the subrule is fixed in the grammar.
Without support for grammar classes, there's no decent way to think of deriving one grammar from another. And if you can't derive one grammar from another, you can't easily evolve your language to handle new kinds of problems.
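As a rough illustration of what deriving one grammar from another could look like, the following Python sketch treats a grammar as a class whose rules are class attributes. The names `Arithmetic`, `HexArithmetic`, `number`, `operator`, and `expression` are hypothetical, and this is not the design presented in this Apocalypse; it only shows that once rules live in a class, a derived grammar can override one rule and inherit the rest.

```python
import re


class Arithmetic:
    """A toy grammar: each rule is a regex fragment stored on the class."""

    number = r"[0-9]+"
    operator = r"[+\-]"

    @classmethod
    def expression(cls):
        # The top rule refers to the subrules by name, so a subclass that
        # overrides a subrule changes what the top rule accepts.
        return re.compile(
            rf"(?:{cls.number})(?:\s*(?:{cls.operator})\s*(?:{cls.number}))*"
        )


class HexArithmetic(Arithmetic):
    """A derived dialect: only the 'number' rule changes."""

    number = r"0x[0-9a-fA-F]+|[0-9]+"


base_result = Arithmetic.expression().fullmatch("1 + 0x1F")
derived_result = HexArithmetic.expression().fullmatch("1 + 0x1F")
```

The base grammar rejects the hexadecimal literal while the derived one accepts it, without the `expression` rule being rewritten.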
Inability to define variants
If we want to have variant grammars for Perl dialects, then what about regex dialects? Can regexes be extended either at compile time or at run time? Perl 5 has some rudimentary overloading magic for rewriting regex strings, but that's got the same problems as source filters for Perl code; namely that you just get the raw regex source text and have to parse it yourself. Once again the fundamental assumption is that a regex is a funny kind of string, existing only at the behest of the surrounding program.
Do we think of regexes as a real, living language?
Poor integration with rich languages
Let's face it, in the culture of computing, regex languages are mostly considered second-class citizens, or worse. "Real" languages like C and C++ will exploit regexes, but only through a strict policy of apartheid. Regular expressions are our servants or slaves; we tell them what to do, they go and do it, and then they come back to say whether they succeeded or not.
At the other extreme, we have languages like Prolog or Snobol where the pattern matching is built into the very control structure of the language. These languages don't succeed in the long run because thinking about that kind of control structure is rather difficult in actual fact, and one gets tired of doing it constantly. The path to freedom is not to make everyone a slave.
However, I would like to think that there is some happy medium between those two extremes. Coming from a C background, Perl has historically treated regexes as servants. True, Perl has treated them as trusted servants, letting them move about in Perl society better than any other C-like language to date. Nevertheless, if we emancipate regexes to serve as co-equal control structures, and if we can rid ourselves of the regexist attitudes that many of us secretly harbor, we'll have a much more productive society than we currently do. We need to empower regexes with a sense of control (structure). It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.
Missing backtracking controls
Perl 5 started to give regexes more control of their own destiny with the "grab" construct, (?>...), which tells the regex engine that when it fails to match the rest of the pattern, it should not backtrack into the innards of the grab, but skip back to before it. That's a useful notion, but there are problems. First, the notation sucks, but you knew that already. Second, it doesn't go far enough. There's no way to backtrack out of just the current grouping. There's no way to backtrack out of just the current rule. Both of these are crucial for giving first-class status to the control flow of regexes.
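The effect of a grab can be seen in a small Python sketch. Python 3.11 and later accept (?>...) directly; to stay portable, this example emulates it with the well-known lookahead-plus-backreference idiom, which is a stand-in and not Perl's own notation. The sample string is invented for illustration.

```python
import re

text = "aaab"

# Ordinary group: a+ first takes "aaa", then backtracks and gives one "a"
# back so that the trailing "ab" can match.
backtracking = re.fullmatch(r"(?:a+)ab", text)

# Emulated grab: the lookahead captures the longest run of a's, and \1 then
# consumes exactly that run. The engine cannot re-enter the lookahead to
# shorten the capture, so the trailing "ab" has nothing left to match.
atomic = re.fullmatch(r"(?=(a+))\1ab", text)
```

The first match succeeds and the second fails, which is exactly the "skip back to before it" behavior described above.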
Difficult to define assertions
Notionally, a regex is an organization of assertions that either succeed or fail. Some assertions are easily expressed in traditional regex language, while others are more easily expressed in a procedural language like Perl.
The natural (but wrong) solution is to try to reinvent Perl expressions within regex language. So, for instance, I'm rejecting those RFCs that propose special assertion syntax for numerics or booleans. The better solution is to make it easier to embed Perl assertions within regexes.
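A numeric range check shows why such assertions sit more comfortably in a procedural language. The sketch below, in Python with hypothetical helper names, tests whether a string is a number from 0 to 255 in two ways: once by spelling out the range in regex language, and once by letting the regex find the digits and the host language do the arithmetic. Note that in this sketch the host-language check runs after the match, not inside it, which is precisely the awkwardness being described.

```python
import re

# The range 0..255 expressed purely in regex language.
OCTET_AS_REGEX = re.compile(r"25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9]")


def is_octet_regex_only(s):
    """Numeric assertion reinvented inside the pattern."""
    return OCTET_AS_REGEX.fullmatch(s) is not None


def is_octet_with_host_check(s):
    """The regex finds the digits; ordinary code asserts the range."""
    return re.fullmatch(r"[0-9]{1,3}", s) is not None and int(s) <= 255
```

The two functions agree on inputs such as "0", "199", "255", and "256", but only the second one states the actual condition in a form a reader can check at a glance.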
Brave New World
I've just made a ton of negative assertions about the current state of regex culture. Now I'd like you to perform a cool mental trick and turn all those negative assertions into positive assertions about what I'm going to say, because I'm not intending to give the rationale again, but just present the design as it stands. Damian will discuss an extended example in his Exegesis 5, which will show the big picture of how these various features work together to produce a much more readable whole.
So anyway, here's what's new.

