Apocalypse 5
by Larry Wall

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

Too much reliance on modifiers

The /s modifier in the previous example changes the meaning of the . metacharacter. We could, in fact, do away with the /s modifier entirely if we only had two different representations for "any character", one of which matched a newline, and one which didn't. A similar argument applies to the /m modifier. The whole notion of something outside the regex changing the meaning of the regex is just a bit bogus, not because we're afraid of context sensitivity, but because we need to have better control within the regex of what we mean, and in this case the context supplied outside the regex is not precise enough. (Perl 5 has a way to control the inner contexts, but it uses the self-obfuscating (?...) notation.)
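To make the complaint concrete, here is a minimal Perl 5 sketch. The sample string and variable names are invented for illustration. It runs the same pattern text with and without /s, then uses the (?s:...) notation to confine the change to a single dot.

```perl
use strict;
use warnings;

# Illustrative input: two words separated by a newline.
my $text = "first\nsecond";

# Without /s, dot refuses to match the newline, so this fails.
my $plain = $text =~ /first.second/ ? 1 : 0;

# With /s, the identical pattern text succeeds.
my $with_s = $text =~ /first.second/s ? 1 : 0;

# (?s:...) scopes the newline-matching dot to one part of the pattern.
my $inline = $text =~ /first(?s:.)second/ ? 1 : 0;

print "plain=$plain with_s=$with_s inline=$inline\n";
# prints "plain=0 with_s=1 inline=1"
```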

Modifiers that control how the regex is used as a whole do make some sense outside the regex. But they still have the end-weight problem.

Too many special rules and boobytraps

Without knowing the context, you cannot know what the pattern // will do. It might match a null string, or it might match the previously successful match.
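A small sketch of that trap in Perl 5 follows. The sub name `empty_pattern_demo` is made up for this example. After a successful match, the empty pattern quietly reuses the earlier pattern, while an explicit null group such as (?:) always matches.

```perl
use strict;
use warnings;

# Returns two flags for $target: whether // matched and whether /(?:)/ matched.
sub empty_pattern_demo {
    my ($target) = @_;

    # A successful match; the empty pattern below will reuse /b/.
    "abc" =~ /b/ or die "setup match failed";

    my $reused   = $target =~ //     ? 1 : 0;   # really means /b/ here
    my $explicit = $target =~ /(?:)/ ? 1 : 0;   # a genuine null match

    return ($reused, $explicit);
}

my @flags = empty_pattern_demo("xyz");
print "reused=$flags[0] explicit=$flags[1]\n";
```

With "xyz" as the target, the empty pattern fails because there is no "b" to find, even though a true null pattern would match anywhere.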

The local operator behaves differently inside regular expressions than it does outside.

It's too easy to write a null pattern accidentally. For instance, the following will never match anything but the null string:

    /
    | foo
    | bar
    | baz
    /x

Even when it's intentional, it may not look intentional:

    (a|b|c|)

That's hard to read because it's difficult to make the absence of something visible.
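The accidental case can be sketched as follows. The sub name `leading_bar_match` is hypothetical, and capturing parentheses have been added around the alternation only so the matched text can be inspected. Because the first alternative is empty, it wins at the very first position no matter what the string contains.

```perl
use strict;
use warnings;

# Returns the text matched by an alternation whose first branch is empty.
sub leading_bar_match {
    my ($text) = @_;
    return $text =~ /
        (
        | foo
        | bar
        | baz
        )
    /x ? $1 : undef;
}

# Both calls return the null string; "foo" is never reached.
my $from_foo  = leading_bar_match("foo");
my $from_quux = leading_bar_match("quux");
print "matched [", $from_foo, "] and [", $from_quux, "]\n";
```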

It's too easy to confuse the multiple meanings of dot. Or the multiple meanings of ^ and $. And the opposite of \A is frequently not \Z, but \z. Tell me again, when do I say \1, and when do I say $1? Why are they different?
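For readers who want to see these distinctions side by side, here is a short Perl 5 sketch with invented sample data. It compares $, \Z, and \z on a string ending in a newline, and then shows \1 inside the pattern next to $1 in the replacement.

```perl
use strict;
use warnings;

my $line = "foo\n";

# $ and \Z both tolerate one trailing newline; \z insists on the true end.
my %ends = (
    dollar  => ($line =~ /foo$/  ? 1 : 0),
    big_Z   => ($line =~ /foo\Z/ ? 1 : 0),
    small_z => ($line =~ /foo\z/ ? 1 : 0),
);

# \1 refers to the capture within the pattern; $1 refers to it afterwards.
(my $squeezed = "aabbc") =~ s/(\w)\1/$1/g;

print "dollar=$ends{dollar} big_Z=$ends{big_Z} small_z=$ends{small_z}\n";
print "squeezed=$squeezed\n";
```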

Backreferences not useful enough

Speaking of \1, backreferences have a number of shortcomings. The first is actually getting ahold of the right backreference. Since captures are numbered from the beginning, you have to count, and you can easily count wrong. For many purposes it would be better if you could ask for the last capture, or the one before that. Or perhaps if there were a way to restart the numbering part way through...
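Here is a small illustration of how easily the count goes wrong once parentheses nest. The two sub names are made up for this sketch. The patterns differ only in the backreference number, yet they accept different strings.

```perl
use strict;
use warnings;

# Groups are numbered by opening parenthesis: group 1 is the whole pair,
# group 2 is only the inner letter.
sub doubled_pair {
    my ($text) = @_;
    return $text =~ /^(([ab])c)\1$/ ? 1 : 0;   # repeats the whole pair
}

sub doubled_pair_miscounted {
    my ($text) = @_;
    return $text =~ /^(([ab])c)\2$/ ? 1 : 0;   # repeats just the letter
}

print doubled_pair("acac"), doubled_pair_miscounted("acac"),
      doubled_pair_miscounted("aca"), "\n";
# prints "101"
```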

Another major problem with backreferences is that you can't easily modify one to search for a variant. Suppose you match an opening parenthesis, bracket, or curly. You'd like to search for everything up to the corresponding closing parenthesis, bracket, or curly, but there's no way to transmogrify the opening version to the closing version, because the backref search is hardwired independently of ordinary variable matching. And that's because Perl doesn't instantiate $1 soon enough. And that's because Perl relies on variable interpolation to get subexpressions into regexes. Which leads us to...
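One way to work around this in Perl 5 is to split the job into two matches, as in the sketch below. This is an assumption about how one might cope, not a technique prescribed here. The `%closer` table and the sub name `bracketed_body` are invented for the example, and the code makes no attempt to handle nesting.

```perl
use strict;
use warnings;

# Hypothetical lookup table from each opener to its closer.
my %closer = ('(' => ')', '[' => ']', '{' => '}');

# Returns the text between the first opener and the next matching closer.
sub bracketed_body {
    my ($text) = @_;

    # First match: find an opener. Scalar /g records the position in pos().
    $text =~ /([(\[{])/g or return;
    my $close = $closer{$1};

    # Second match: resume at that position and stop at the literal closer.
    $text =~ /\G(.*?)\Q$close\E/s or return;
    return $1;
}

print bracketed_body("f(a[b]c) rest"), "\n";   # prints "a[b]c"
```

The transformation from opener to closer happens in ordinary Perl code between the two matches, which is exactly the step a single backreference cannot express.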

Too hard to match a literal string

Since regexes undergo an interpolation pass before they're compiled, anything you interpolate is forced to be treated as a regular expression. Often that's not what you want, so we have the klunky \Q$string\E mechanism to hide regex metacharacters. And that's because...
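A brief sketch of the difference, using an invented `$needle` value: the interpolated dot behaves as a metacharacter unless it is wrapped in \Q...\E.

```perl
use strict;
use warnings;

my $needle = "a.c";   # illustrative string containing a metacharacter

# Interpolated as a regex: the dot matches any character, so "abc" matches.
my $loose = "abc" =~ /\A$needle\z/ ? 1 : 0;

# Quoted with \Q...\E: the dot is literal, so "abc" no longer matches.
my $strict = "abc" =~ /\A\Q$needle\E\z/ ? 1 : 0;

# The literal string itself still matches when quoted.
my $exact = "a.c" =~ /\A\Q$needle\E\z/ ? 1 : 0;

print "loose=$loose strict=$strict exact=$exact\n";
# prints "loose=1 strict=0 exact=1"
```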

Two-level interpretation is problematic

The problem with \Q$string\E arises because of the fundamental mistake of using interpolation to build regexes instead of letting the regex control how it treats the variables it references. Regexes aren't strings, they're programs. Or, rather, they're strings only in the sense that any piece of program is a string. Just as you have to work to eval a string as a program, you should have to work to eval a string as a regular expression. Most people tend to expect a variable in a regular expression to match its contents literally. Perl violates that expectation. And because it violates that expectation, we can't make $1 synonymous with \1. And interpolated parentheses throw off the capture count, so you can't easily use interpolation to call subrules, so we invented (??{$var}) to get around that. But then you can't actually get at the parentheses captured by the subrule. The ramifications go on and on.
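The capture-count problem can be shown with a short sketch. The `$range` fragment and the sub name `parse_span` are hypothetical. The outer pattern contains only one visible pair of parentheses, yet its capture ends up as $3 because the interpolated fragment contributes two more.

```perl
use strict;
use warnings;

# Returns what landed in $1 and $3 after interpolating a fragment
# that carries its own parentheses.
sub parse_span {
    my ($text) = @_;
    my $range = '(\d+)-(\d+)';   # illustrative subpattern with two groups

    return unless $text =~ /\A$range (\w+)\z/;
    return { first_group => $1, third_group => $3 };
}

my $got = parse_span("10-20 km");
print "\$1 is $got->{first_group}, \$3 is $got->{third_group}\n";
# prints "$1 is 10, $3 is km"
```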

Too little abstraction

Historically, regular expressions were considered a very low-level language, a kind of glorified assembly language for the regex engine. When you're only dealing with ASCII, there is little need for abstraction, since the shortest way to say [a-z] is just that. With the advent of the eighth bit, we started getting into a little bit of trouble, and POSIX started thinking about names like [:alpha:] to deal with locale difficulties. But as with the problem of conciseness, the culture was still biased away from naming abstractly anything that could be expressed concretely.

However, it's almost impossible to write a parser without naming things, because you have to be able to name the separate grammar rules so that the various rules can refer to each other.

It's difficult to deal with any subset of Unicode without naming it. These days, if you see [a-z] in a program, it's probably an outright bug. It's much better to use a named character property so that your program will work right in areas that don't just use ASCII.
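As a rough illustration, the following sketch counts lowercase letters two ways. The sub name `count_lowercase` and the sample string are invented. \p{Ll} is the Unicode "lowercase letter" general category.

```perl
use strict;
use warnings;

# Counts lowercase letters with an ASCII range and with a named property.
sub count_lowercase {
    my ($text) = @_;
    my $ascii_only  = () = $text =~ /[a-z]/g;
    my $by_property = () = $text =~ /\p{Ll}/g;
    return ($ascii_only, $by_property);
}

# "\x{101}" is LATIN SMALL LETTER A WITH MACRON, which [a-z] misses.
my ($ascii, $named) = count_lowercase("n\x{101}me");
print "ascii=$ascii named=$named\n";   # prints "ascii=3 named=4"
```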

Even where we do allow names, it tends to be awkward because of the cultural bias against it. To call a subrule by name in Perl 5 you have to say this:

    (??{$rule})

That has 4 or 5 more characters than it ought to. Dearth of abstraction produces bad Huffman coding.
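In context, a call through that construct might look like the sketch below. The `$number` rule and the sub name `is_assignment` are made up for illustration, and the rule lives in a package variable purely to keep the example simple.

```perl
use strict;
use warnings;

# Hypothetical named subrule, stored as a compiled regex.
our $number = qr/\d+/;

# True if $text looks like "name=digits", calling the subrule by name.
sub is_assignment {
    my ($text) = @_;
    return $text =~ /\A\w+=(??{ $number })\z/ ? 1 : 0;
}

print is_assignment("x=42"), is_assignment("x=y"), "\n";   # prints "10"
```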

Little support for named captures

Make that "no support" in Perl, unless you include assignment to a list. This is just a part of the bias against naming things. Instead we are forced to number our capturing parens and count. That works okay for the top-level regular expression, when we can do list assignment or assign $1 to $foo. But it breaks down as soon as you start trying to use nested regexes. It also breaks down when the capturing parentheses match more than once. Perl handles this currently by returning only the last match. This is slightly better than useless, but not by much.
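Both behaviors can be sketched briefly. The sub names `parse_pair` and `last_letter_captured` are invented for this example. The first names its captures by list assignment; the second shows a repeated capture group keeping only its final iteration.

```perl
use strict;
use warnings;

# "Naming" captures by assigning the match result to a list.
sub parse_pair {
    my ($text) = @_;
    my ($key, $value) = $text =~ /\A(\w+)=(\w+)\z/;
    return ($key, $value);
}

# A capture group inside a repetition keeps only the last thing it matched.
sub last_letter_captured {
    my ($word) = @_;
    return $word =~ /\A(?:(\w))+\z/ ? $1 : undef;
}

my ($k, $v) = parse_pair("color=red");
print "$k => $v\n";                          # prints "color => red"
print last_letter_captured("abc"), "\n";     # prints "c"
```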
