Apocalypse 5
by Larry Wall
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Too much history
Most of the other problems stem from trying to deal with a rich history. Now there's nothing wrong with history per se, but those of us who are doomed to repeat it find that many parts of history are suboptimal and contradictory. Perl has always tried to err on the side of incorporating as much history as possible, and sometimes Perl has succeeded in that endeavor.
Cultural continuity has much to be said for it, but what can you do when the culture you're trying to be continuous with is itself discontinuous? As it says in Ecclesiastes, there's a time to build up, and a time to tear down. The first five versions of Perl mostly built up without tearing down, so now we're trying to redress that omission.
Too compact and "cute"
Regular expressions were invented by computational linguists who love to
write examples like /aa*b*(cd)*ee/. While these are conducive to reasoning
about pattern matching in the abstract, they aren't so good for pattern matching
in the concrete. In real life, most atoms are longer than "a" or "b". In real
life, tokens are more recognizable if they are separated by whitespace. In the
abstract, /a+/ is reducible to /aa*/. In real life, nobody wants to
repeat a 15 character token merely to satisfy somebody's idea of theoretical
purity. So we have shortcuts like the + quantifier to say "one or more".
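As a rough illustration of that reduction, here is a minimal sketch in Python, whose re module accepts the same quantifier syntax. The identifier-plus-comma token and the names TOKEN, one_or_more, spelled_out, and agree are invented for this example and do not come from the text above.

```python
import re

# Hypothetical "long" atom: an identifier followed by a comma.
TOKEN = r"(?:[A-Za-z_][A-Za-z0-9_]*,)"

one_or_more = re.compile(TOKEN + "+")          # the shortcut
spelled_out = re.compile(TOKEN + TOKEN + "*")  # the reduced form, token written twice


def agree(text):
    """Return True when both patterns accept or both reject the whole text."""
    return bool(one_or_more.fullmatch(text)) == bool(spelled_out.fullmatch(text))


for sample in ["", "a,", "foo,bar,", "foo,bar", "9x,"]:
    print(repr(sample), bool(one_or_more.fullmatch(sample)), agree(sample))
```

The two patterns agree on every sample, but the second one has to repeat the whole token to say so.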
Now, you may rightly point out that + is something we already have,
and we already introduced /x to allow whitespace, so why is this
bullet point here? Well, there's a lot of inertia in culture, and
the problem with /x is that it's not the default, so people don't think
to turn it on when it would probably do a lot of good. The culture
is biased in the wrong direction. Whitespace around tokens should
be the norm, not the exception. It should be acceptable to use
whitespace to separate tokens that could be confused. It should not be
considered acceptable to define new constructs that contain a plethora
of punctuation, but we've become accustomed to constructs like (?<=...)
and (??{...}) and [\r\n\ck\p{Zl}\p{Zp}],
so we don't complain. We're frogs who are getting boiled in a pot
full of single-character morphemes, and we don't notice.
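To make the whitespace point concrete, here is a small sketch using Python's re.VERBOSE flag, which plays the same role as /x. The date pattern and the sample text are made up for illustration. Both patterns match the same thing, and only one of them is laid out as separate tokens.

```python
import re

# The compact form, as most people write it.
compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# The same pattern with whitespace between tokens and a comment on each one.
# re.VERBOSE is Python's counterpart of the /x modifier.
spaced = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
    """,
    re.VERBOSE,
)

text = "date: 2001-02-03"  # invented sample input
print(compact.search(text).groups())
print(spaced.search(text).groups())
```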
Poor Huffman coding
Huffman invented a method of data compaction in which common characters are represented by a small number of bits, and rarer characters are represented by more bits. The principle is more general, however, and language designers would do well to pay attention to the "other" Perl slogan: Easy things should be easy, and hard things should be possible. However, we haven't always taken our own advice. Consider those two regex constructs we just saw:
(?<=...)
(??{...})
Which one do you think is likely to be more common in everyday use? Guess which one is longer...
There are many examples of poor Huffman coding in current regexes. Consider these:
(...)
(?:...)
Is it really the case that grouping is rarer than capturing? And by two gobbledygooky characters' worth? Likewise there are many constructs that are the same length that shouldn't be:
(?:...)
(?#...)
Grouping is much more important than the ability to embed a comment. Yet they're the same length currently.
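For readers who want to see the difference between the two forms, here is a hedged sketch in Python, which uses the same (...) and (?:...) syntax. The patterns and the sample string are invented for this example. The short form captures whether you wanted it to or not, and the longer form only groups.

```python
import re

# (...) groups AND captures, so the repeated group takes up a capture slot.
capturing = re.compile(r"(ab)+(cd)")

# (?:...) only groups, at the cost of two extra characters.
grouping = re.compile(r"(?:ab)+(cd)")

sample = "ababcd"  # invented sample input
print(capturing.fullmatch(sample).groups())
print(grouping.fullmatch(sample).groups())
```

Both patterns accept the same strings. They differ only in how many captures come back.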
Too much reliance on too few metacharacters
A lot of our Huffman troubles came about because we were trying to
shoehorn new capabilities into an old syntax without breaking anything.
The (?...) construct succeeded at that goal, but it was new
wine in old wineskins, as they say. More successful was the *?
minimal matching hack, but it's still symptomatic of the problem
that we only had three characters to choose from that would have
worked at that point in the grammar. We've pretty nearly exhausted
the available backslash sequences.
The waterbed theory of linguistic complexity says that if you push down one place, it goes up somewhere else. If you arbitrarily limit yourself to too few metacharacters, the complexity comes out somewhere else. So it seems obvious to me that the way out of this mess is to grab a few more metacharacters. And the metacharacters I want to grab are...well, we'll see in a moment.
Different things look too similar
Consider these constructs:
(??{...})
(?{...})
(?#...)
(?:...)
(?i:...)
(?=...)
(?!...)
(?<=...)
(?<!...)
(?>...)
(?(...)...|...)
These all look quite similar, but some of them do radically different things.
In particular, the (?<...) does not mean the opposite of the (?>...). The
underlying visual problem is the overuse of parentheses, as in Lisp.
Programs are more readable if different things look different.
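As a small illustration of how much one character can change, here is a sketch in Python, which accepts the same (?:...), (?=...) and (?!...) syntax. The patterns and sample strings are made up. The three patterns differ in a single punctuation character, yet one consumes text, one only peeks ahead, and one requires that the text be absent.

```python
import re

group = re.compile(r"foo(?:bar)")      # plain grouping: "bar" is consumed
lookahead = re.compile(r"foo(?=bar)")  # positive lookahead: "bar" must follow, not consumed
negative = re.compile(r"foo(?!bar)")   # negative lookahead: "bar" must NOT follow

print(group.match("foobar").group())
print(lookahead.match("foobar").group())
print(negative.match("foobar"))
print(negative.match("foobaz").group())
```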
Poor end-weight design
In linguistics, the notion of end-weight is the idea that people tend to prefer sentences where the short things come first and the long things come last. That minimizes the amount of stuff you have to remember while you're reading or listening. Perl violates this with regex modifiers. It's okay when you say something short like this:
s/foo/bar/g
But when you say something like what we find in RFC 360:
while ($text =~ /name:\s*(.*?)\n\s*
children:\s*(?:(?@\S+)[, ]*)*\n\s*
favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {...}
it's not until you read the /sigx at the end that you know how to
read the regex. This actually causes problems for the Perl 5 parser,
which has to defer parsing the regular expression till it sees
the /x, because that changes how whitespace and comments work.
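The dependence on a trailing flag can be sketched in Python, where re.VERBOSE stands in for /x and is likewise supplied after the pattern. The pattern text and the sample strings are invented for illustration. The same pattern matches entirely different input depending on a flag the reader has not yet seen.

```python
import re

PATTERN = r"foo bar # baz"  # invented pattern text

# Without the flag, the spaces and the '#' are literal characters.
literal = re.compile(PATTERN)

# With the flag, whitespace is skipped and '# baz' is a comment,
# so the pattern effectively reads as "foobar".
extended = re.compile(PATTERN, re.VERBOSE)

for sample in ["foo bar # baz", "foobar"]:
    print(repr(sample), bool(literal.fullmatch(sample)), bool(extended.fullmatch(sample)))
```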


