Apocalypse 5
by Larry Wall
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Too much reliance on modifiers
The /s modifier in the previous example changes the meaning of
the . metacharacter. We could, in fact, do away with the /s modifier
entirely if we only had two different representations for "any character",
one of which matched a newline, and one of which didn't. A similar argument
applies to the /m modifier. The whole notion of something outside
the regex changing the meaning of the regex is just a bit bogus, not because
we're afraid of context sensitivity, but because we need to have better
control within the regex of what we mean, and in this case the context
supplied outside the regex is not precise enough. (Perl 5 has a way
to control the inner contexts, but it uses the self-obfuscating (?...)
notation.)
Modifiers that control how the regex is used as a whole do make some sense outside the regex. But they still have the end-weight problem.
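As a small illustration of the complaint about /s, here is a Perl 5 sketch in which the same dot means two different things depending on a flag that sits after the closing delimiter. It also shows the inline (?s:...) form referred to above. The sample string and variable names are made up for the example.

```perl
use strict;
use warnings;

# Illustrative input; not taken from the original text.
my $text = "first\nsecond";

# Without /s, dot refuses to match a newline, so this fails.
my $plain  = ($text =~ /first.second/)  ? 1 : 0;

# With /s, the very same dot now matches the newline.
my $dotall = ($text =~ /first.second/s) ? 1 : 0;

# Perl 5's inner-context control: (?s:...) limits the change to one group.
my $inline = ($text =~ /first(?s:.)second/) ? 1 : 0;

print "plain=$plain dotall=$dotall inline=$inline\n";
```

Nothing inside the first two patterns tells the reader which behavior is in force; that information lives entirely in the trailing modifier.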
Too many special rules and boobytraps
Without knowing the context, you cannot know what the pattern // will do.
It might match a null string, or it might reapply the last successfully matched pattern.
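A minimal sketch of the second case, assuming Perl 5's documented rule that an empty pattern reuses the last successfully matched regular expression. The strings are illustrative only.

```perl
use strict;
use warnings;

my @results;

# A successful match makes /o+/ the "last successful pattern".
"foo" =~ /o+/ or die "setup match unexpectedly failed";

# From here on, // is not a null pattern: it quietly reuses /o+/.
push @results, ("bar" =~ //) ? "match" : "no match";   # "bar" contains no "o"
push @results, ("zoo" =~ //) ? "match" : "no match";   # "zoo" does

print join(", ", @results), "\n";
```

A genuinely null pattern would have matched "bar" as well, so the first result is the one that exposes the hidden context.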
The local operator behaves differently inside regular expressions than
it does outside.
It's too easy to write a null pattern accidentally. For instance, the following will never match anything but the null string:
/
| foo
| bar
| baz
/x
Even when it's intentional, it may not look intentional:
(a|b|c|)
That's hard to read because it's difficult to make the absence of something visible.
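The following sketch shows what the accidental version does in practice, next to the alternation that was presumably intended, and then the intentional trailing empty alternative. The input strings are assumptions chosen for the example.

```perl
use strict;
use warnings;

my $text = "baz";   # illustrative input

# The leading | creates an empty first alternative, which always succeeds
# at the first position without consuming anything.
my $ok = ($text =~ /
      | foo
      | bar
      | baz
    /x) ? 1 : 0;
my $accidental_len = $+[0] - $-[0];   # length of what actually matched

# Dropping the leading | gives the alternation that was probably meant.
$text =~ /
        foo
      | bar
      | baz
    /x;
my $intended_len = $+[0] - $-[0];

# The intentional form: the empty last alternative makes the group optional.
my ($opt) = "d" =~ /\A(a|b|c|)d\z/;

print "ok=$ok accidental=$accidental_len intended=$intended_len\n";
```

The accidental pattern reports success, which is exactly why the mistake is easy to miss.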
It's too easy to confuse the multiple meanings of dot. Or the multiple
meanings of ^ and $. And the opposite of \A is frequently
not \Z, but \z. Tell me again, when do I say \1, and when
do I say $1? Why are they different?
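For readers who want the distinctions spelled out, here is a short Perl 5 sketch of how these anchors behave, followed by the \1 versus $1 split in a substitution. The sample strings and hash keys are invented for illustration.

```perl
use strict;
use warnings;

my $line  = "foo\n";        # one line with a trailing newline
my $lines = "foo\nbar\n";   # two lines

my %r = (
    dollar      => ($line  =~ /foo$/)  ? 1 : 0,  # $ tolerates a final newline
    big_Z       => ($line  =~ /foo\Z/) ? 1 : 0,  # so does \Z
    small_z     => ($line  =~ /foo\z/) ? 1 : 0,  # \z is the true end of string
    dollar_m    => ($lines =~ /foo$/m) ? 1 : 0,  # /m lets $ match before any newline
    dollar_no_m => ($lines =~ /foo$/)  ? 1 : 0,  # without /m, only near the end
    caret_m     => ($lines =~ /^bar/m) ? 1 : 0,  # /m lets ^ match after a newline
    caret       => ($lines =~ /^bar/)  ? 1 : 0,  # without /m, only at the start
);

# Inside the pattern the backreference is \1; in the replacement it is $1.
(my $squeezed = "aabbc") =~ s/(\w)\1/$1/g;

print "$_=$r{$_}\n" for sort keys %r;
print "squeezed=$squeezed\n";
```

So the strict counterpart of \A is \z, while \Z quietly shares the newline tolerance of $.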
Backreferences not useful enough
Speaking of \1, backreferences have a number of shortcomings.
The first is actually getting ahold of the right backreference.
Since captures are numbered from the beginning, you have to count,
and you can easily count wrong. For many purposes it would be better if you
could ask for the last capture, or the one before that. Or perhaps
if there were a way to restart the numbering part way through...
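The counting problem can be sketched as follows. The patterns and the input are hypothetical; the point is only that adding a group earlier in the pattern silently changes what every later number refers to.

```perl
use strict;
use warnings;

my $text = "say hello hello world";   # illustrative input

# Find a doubled word: the word is capture 1, so the backreference is \1.
my ($doubled) = $text =~ /\b(\w+) \1\b/;

# Add one more group in front and the doubled word becomes capture 2.
# The backreference has to be renumbered by hand from \1 to \2.
my ($verb, $doubled_again) = $text =~ /^(\w+) .*?\b(\w+) \2\b/;

print "doubled=$doubled verb=$verb doubled_again=$doubled_again\n";
```

Leaving the backreference as \1 in the second pattern would still compile; it would simply look for a repeat of the first word instead.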
Another major problem with backreferences is that you can't easily
modify one to search for a variant. Suppose you match an opening parenthesis,
bracket, or curly. You'd like to search for everything up to the corresponding
closing parenthesis, bracket, or curly, but there's no way to transmogrify
the opening version to the closing version, because the backref search
is hardwired independently of ordinary variable matching. And that's because
Perl doesn't instantiate $1 soon enough. And that's because Perl relies on
variable interpolation to get subexpressions into regexes. Which leads us
to...
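One common workaround in Perl 5, sketched here under stated assumptions, is to split the job into two matches with ordinary code in between. The %closer table and the bracketed_body helper are hypothetical names introduced for this example, and the sketch makes no attempt to handle nesting.

```perl
use strict;
use warnings;

# Hypothetical lookup table from opener to closer; not part of the original.
my %closer = ('(' => ')', '[' => ']', '{' => '}');

sub bracketed_body {
    my ($text) = @_;

    # Step 1: find an opener. A backreference could only ever match the
    # *same* character, so the closer cannot be derived inside the pattern.
    $text =~ /([(\[{])/g or return;
    my $close = $closer{$1};

    # Step 2: a second match that resumes where the first stopped (\G),
    # with the closer interpolated and quoted into the pattern.
    $text =~ /\G(.*?)\Q$close\E/s or return;
    return $1;
}

print bracketed_body("f[a(b)c] d"), "\n";
```

The transformation from opener to closer happens in Perl code between the two matches, which is precisely the step the regex itself has no way to express.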
Too hard to match a literal string
Since regexes undergo an interpolation pass before they're compiled,
anything you interpolate is forced to be treated as a regular
expression. Often that's not what you want, so we have the klunky
\Q$string\E mechanism to hide regex metacharacters. And that's
because...
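A brief sketch of the difference, using an invented search string:

```perl
use strict;
use warnings;

my $needle = "a.c";   # the user means these three literal characters

# Interpolated as-is, the dot is a metacharacter, so "abc" matches too.
my $as_regex   = ("abc" =~ /\A$needle\z/)     ? 1 : 0;

# \Q...\E backslashes the metacharacters, so only the literal string matches.
my $quoted_abc = ("abc" =~ /\A\Q$needle\E\z/) ? 1 : 0;
my $quoted_lit = ("a.c" =~ /\A\Q$needle\E\z/) ? 1 : 0;

# quotemeta is the function form of the same escaping.
my $escaped = quotemeta $needle;

print "as_regex=$as_regex quoted_abc=$quoted_abc quoted_lit=$quoted_lit\n";
```

The literal reading, arguably the one most people expect, is the one that needs the extra syntax.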
Two-level interpretation is problematic
The problem with \Q$string\E arises because of the fundamental
mistake of using interpolation to build regexes instead of letting
the regex control how it treats the variables it references. Regexes
aren't strings, they're programs. Or, rather, they're strings only in the sense
that any piece of program is a string. Just as you have to work to
eval a string as a program, you should have to work to eval a string
as a regular expression. Most people tend to expect a variable in
a regular expression to match its contents literally. Perl violates
that expectation. And because it violates that expectation, we can't
make $1 synonymous with \1. And interpolated parentheses throw
off the capture count, so you can't easily use interpolation to call
subrules, so we invented (??{$var}) to get around that. But then
you can't actually get at the parentheses captured by the subrule.
The ramifications go on and on.
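Both halves of that complaint can be seen in a few lines. This is a sketch with a made-up subrule string and input, not an example from the original.

```perl
use strict;
use warnings;

# A hypothetical "subrule" kept in a string, with its own capturing parens.
my $range = '(\d+)-(\d+)';
my $text  = "10-20 widgets";

# The outer pattern appears to have one capture, but interpolation splices
# in two more ahead of it, so the word lands in $3 rather than $1.
my @caps = $text =~ /^$range (\w+)/;

# (??{...}) keeps the outer numbering intact, but the subrule's own
# captures are discarded once it returns.
my @deferred = $text =~ /^(??{ $range }) (\w+)/;

print "interpolated: @caps\n";
print "deferred:     @deferred\n";
```

With interpolation you get the subrule's captures but lose control of the numbering; with (??{...}) you keep the numbering but lose the captures.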
Too little abstraction
Historically, regular expressions were considered a very low-level
language, a kind of glorified assembly language for the regex engine.
When you're only dealing with ASCII, there is little need for abstraction,
since the shortest way to say [a-z] is just that. With the advent
of the eighth bit, we started getting into a little bit of trouble,
and POSIX started thinking about names like [:alpha:] to deal with
locale difficulties. But as with the problem of conciseness, the
culture was still biased away from naming abstractly anything that could be
expressed concretely.
However, it's almost impossible to write a parser without naming things, because you have to be able to name the separate grammar rules so that the various rules can refer to each other.
It's difficult to deal with any subset of Unicode without naming it.
These days, if you see [a-z] in a program, it's probably an
outright bug. It's much better to use a named character property so
that your program will work right in areas that don't just use ASCII.
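To illustrate, here is a minimal sketch comparing the ASCII range with the Unicode property for lowercase letters. The sample word is an assumption chosen for the example, and it is written with an escape so the code does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# A lowercase word containing U+0142 (LATIN SMALL LETTER L WITH STROKE).
my $word = "z\x{142}oty";

my $ascii_range = ($word =~ /\A[a-z]+\z/)  ? 1 : 0;  # rejects the word
my $property    = ($word =~ /\A\p{Ll}+\z/) ? 1 : 0;  # Lowercase_Letter accepts it

print "ascii_range=$ascii_range property=$property\n";
```

Every character in the word is a lowercase letter, yet the hand-written range turns it away.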
Even where we do allow names, it tends to be awkward because of the cultural bias against it. To call a subrule by name in Perl 5 you have to say this:
(??{$rule})
That has 4 or 5 more characters than it ought to. Dearth of abstraction produces bad Huffman coding.
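For context, this is roughly what calling a rule by name looks like in Perl 5. The sketch follows the well-known balanced-parentheses idiom; the variable name $balanced and the input string are illustrative.

```perl
use strict;
use warnings;

# Declared first so the pattern can refer to itself by name.
my $balanced;
$balanced = qr{
    \(                        # an opening parenthesis
    (?:
        [^()]+                # a run of non-parentheses
      | (??{ $balanced })     # or a nested balanced group: the subrule call
    )*
    \)                        # the matching close
}x;

my $text  = "f(a(b)c) + g(d";
my $group = ($text =~ $balanced) ? $& : undef;

print "group=$group\n";
```

The rule works, but the call site is six characters of punctuation wrapped around the one name that matters.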
Little support for named captures
Make that "no support" in Perl, unless you include assignment to a list.
This is just a part of the bias against naming things. Instead we
are forced to number our capturing parens and count. That works okay
for the top-level regular expression, when we can do list assignment
or assign $1 to $foo. But it breaks down as soon as you start
trying to use nested regexes. It also breaks down when the capturing
parentheses match more than once. Perl handles this currently by
returning only the last match. This is slightly better than useless,
but not by much.
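A short sketch of both points: list assignment standing in for named captures, and a quantified group keeping only its final iteration. The inputs and variable names are invented for the example.

```perl
use strict;
use warnings;

# List assignment is the closest thing to naming a capture.
my ($year, $month) = "2002-06" =~ /\A(\d{4})-(\d\d)\z/;

# One pair of parentheses inside a quantifier matches three times here,
# yet only the final iteration survives in the capture.
my ($last_only) = "a1b2c3" =~ /\A(?:([a-z])\d)+\z/;

# Recovering every iteration means abandoning the single anchored match
# and collecting the pieces with a global match instead.
my @all = "a1b2c3" =~ /([a-z])\d/g;

print "year=$year month=$month last_only=$last_only all=@all\n";
```

The global-match workaround only helps at the top level; it does nothing for a repeated group buried inside a larger pattern.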


