
Apocalypse 5
by Larry Wall

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

Metacharacter Reform

Some things stay the same: (...) captures text just as it did before, and the quantifiers *, +, and ? are also unchanged. The vertical bar | still separates alternatives. The backslash \ still protects the following character from its ordinary interpretation. The ? suffix character still does minimal matching. (Note that these are by far the most commonly used metacharacters, so many ordinary regexes will look nearly identical in Perl 5 and Perl 6.)

Since /x extended syntax is now the default, # is now always a metacharacter indicating a comment, and whitespace is now always "meta". Whitespace is now the standard way to separate regex tokens that would otherwise be confused as a single token.
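For readers who want to try the extended-syntax behavior today, here is a minimal sketch in Python, whose `re` module follows Perl 5 conventions and still needs an explicit flag (`re.VERBOSE`, the counterpart of `/x`). The patterns and variable names are illustrative assumptions, not taken from the Perl 6 design.

```python
import re

# Perl 5-style dialect: whitespace and # comments are ignored only when
# the extended-syntax flag is given explicitly.
verbose = re.compile(
    r"""
    (\d+)   # major version
    \.      # literal dot
    (\d+)   # minor version
    """,
    re.VERBOSE,
)

# Under extended syntax an unescaped space only separates tokens...
unescaped = re.compile(r"foo bar", re.VERBOSE)

# ...so a space that is meant literally has to be backslashed.
escaped = re.compile(r"foo\ bar", re.VERBOSE)

print(verbose.fullmatch("5.8").groups())
print(unescaped.fullmatch("foobar") is not None)
print(escaped.fullmatch("foo bar") is not None)
```

The proposal described here makes this the only mode, so the flag disappears and the backslashed space becomes the normal way to say "I mean this space".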

Even in character classes, whitespace is not taken literally any more. Backwhack the space if you mean it literally, or use <sp>, or \040, or \x20, or \c[SPACE]. But speaking of character classes...

Perhaps the most radical change is that I've taken [...] away from character classes and made it the non-capturing grouping operator, because grouping is more fundamental than character classes, and explicit character classes are becoming less common than named character classes. (You can still do character classes, just not with bare square brackets.)

I've also stolen {...} from generalized quantifiers and made them into closure delimiters. (Use <n,m> for the generalized quantifier now.)

I've stolen three new metacharacters. The new extensible metasyntax for assertions uses angle brackets, <...>. And the colon : is now used for declaration and backtracking control. (Recall Larry's 2nd Law of Language Redesign: Larry gets the colon.) The colon always introduces a token that controls the meaning of what is around it. The nature of the token depends on what follows the colon. Both the colon syntax and angle syntax are extensible. (Backslash syntax is also extensible.)

This may sound like we're complexifying things, but we're really simplifying. We now have the following regex invariants:

    (...)       # always delimits a capturing group
    [...]       # always delimits a non-capturing group
    {...}       # always delimits a closure
    <...>       # always delimits an assertion
    :...        # always introduces a metasyntactic token

(Note that we're using "assertion" here in the broad sense of anything that either matches or fails, whether or not it has a width.)

The nature of the angle assertion is controlled by the first character inside it. If the first character is alphabetic, it's a grammatical assertion, and the entire first word controls the meaning. The word is first looked up in the current grammar, if any. If not found there, it is checked to see if it is one of the built-in grammar rules such as those defined by the Unicode property classes. If the first character is not alphabetic, there will be special rules in the current grammar or in the Perl grammar for looking up the parse rule. For instance, by default, any assertion that begins with ! is simply negated. Assertions that start with a digit are assumed to be a range assertion (<n,m>) regarding the previous atom. (Taking the last two together, you can say <!n,m> to exclude a range.) Assertions that start with $, @, %, or & are assumed to interpolate an indirect regex rule stored in a variable or returned by a subroutine. An assertion that starts with a parenthesis is a closure being used as an assertion. An assertion that starts with a square bracket or another angle bracket is a character class. An assertion that starts with a quote asserts the match of a literal string. And so on.
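The first-character dispatch just described can be sketched as a small classifier. This is only an illustration in Python of the rules listed in the paragraph above; the function name `classify_assertion` and the category labels are hypothetical, and the cases covered by "and so on" fall through to a catch-all.

```python
def classify_assertion(body: str) -> str:
    """Classify the text found between < and > by its first character."""
    if not body:
        raise ValueError("empty assertion")
    first = body[0]
    if first == "!":
        # A leading ! simply negates whatever follows.
        return "negated " + classify_assertion(body[1:])
    if first.isalpha():
        return "grammatical rule"
    if first.isdigit():
        return "range"
    if first in "$@%&":
        return "interpolated rule"
    if first == "(":
        return "closure assertion"
    if first in "[<":
        return "character class"
    if first in "'\"":
        return "literal string"
    return "other"  # further forms are left to the grammar


for sample in ["alpha", "3,5", "!3,5", "$var", "( code )", "[a-z]", "' '"]:
    print(f"<{sample}> is a {classify_assertion(sample)}")
```

A real parser would look the first word up in the current grammar rather than return a label, but the shape of the decision is the same.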

Some metacharacters are still used but have a slightly different meaning, in part to get rid of the /s and /m modifiers, and in part because most strings in Perl 6 will come from the filehandle pre-chomped. So anchors ^ and $ now always mean the real beginning and ending of the string. Use ^^ and $$ to match the beginnings and endings of lines within a string. (They're doubled because they're "fancier", because they can match in multiple places, and because they'll be rarer, so Huffman says they should be longer.) The ^^ and $$ also match where ^ and $ would.

The dot . now always matches any character including newline. (Use \N to match a non-newline. Or better, use an autochomping filehandle, if you're processing line-by-line.)
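To see what is being replaced, here is a short sketch of the old-style anchor and dot behavior, written in Python because its `re` module keeps the Perl 5 conventions. Note one spelling difference, which is an aside and not part of the proposal: Python writes the strict end-of-string anchor as `\Z`, where Perl 5 writes `\z`.

```python
import re

text = "one\ntwo\n"

# Old-style $ also matches just before a trailing newline.
old_dollar = re.search(r"^one\ntwo$", text) is not None

# The real end of the string needs a separate anchor in this dialect;
# the proposal gives that meaning to plain $.
strict_end = re.search(r"\Aone\ntwo\Z", text) is not None

# Matching at line boundaries needs a flag here; the proposal spells
# those anchors ^^ and $$ instead.
lines = re.findall(r"^\w+$", text, re.MULTILINE)

# Dot stops at a newline unless a flag is given; the proposal makes
# "any character" the default and uses \N for a non-newline.
dot_default = re.match(r".*", text).group(0)
dot_all = re.match(r".*", text, re.DOTALL).group(0)
non_newline = re.match(r"[^\n]*", text).group(0)  # [^\n] plays the role of \N

print(old_dollar, strict_end, lines)
print(repr(dot_default), repr(dot_all), repr(non_newline))
```

Three flags or special escapes in the old dialect correspond to three plain spellings in the new one.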

In a sense, the sigils $, @, %, and & are different metacharacters because they don't interpolate, but are now subject to the interpretation of the regex engine. This allows us to change the default behavior of ordinary "interpolation" to a literal match, and also lets us put in lvalue-ish constructs like:

    / $name := (\S+) /
    / @kids := [(\S+) \s+]* /
    / %pets := [(\S+) \: (\S+) \s+]* /

(Notice also the delicate interplay of quantified non-capturing brackets with capturing parens, particularly for gathering multiple values or even multiple key/value pairs.)
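In a Perl 5-style engine a quantified capturing group keeps only its last value, which is why gathering every value takes extra work there. The following Python sketch is a rough analogue of the `@kids` and `%pets` bindings above. The helper names `bind_kids` and `bind_pets` are hypothetical, and `re.findall` scans the whole string rather than matching one contiguous run, so this only approximates the intended semantics.

```python
import re


def bind_kids(text: str) -> list:
    """Rough analogue of / @kids := [(\\S+) \\s+]* /."""
    return re.findall(r"(\S+)\s+", text)


def bind_pets(text: str) -> dict:
    """Rough analogue of / %pets := [(\\S+) \\: (\\S+) \\s+]* /."""
    return dict(re.findall(r"(\S+):(\S+)\s+", text))


# The same quantified group inside a single match keeps only the final capture.
last_only = re.match(r"(?:(\S+)\s+)*", "Alice Bob Carol ").group(1)

print(bind_kids("Alice Bob Carol "))
print(bind_pets("cat:Tom dog:Rex "))
print(last_only)
```

The binding syntax in the proposal folds the collection step into the pattern itself, so no second pass over the string is needed.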

Here are some of the metacharacter differences in table form:

    Old                   New
    ---                   ---
    /pat pat #text        /pat pat #text
        pat/x                 pat/              # Look Ma, no /x!
    /patpat(?#text)/      /pat pat <('text')>/  # can always use whitespace
    /pat pat/             / pat\ pat /          # match whitespace literally
                          / pat \s* pat /       # or generically
                          / pat \h* pat /       # or horizontally
                          / pat <' '> pat /     # or as a literal string
                          / pat <sp> pat /      # or by explicit rule
                          /:w pat pat/          # or by implicit rule
    /^pat$/               /^pat\n?$/            # ^ and $ mean string
    /^pat$/m              /^^pat$$/             # no more /m
    /\A...(^pat$)*...\z/m /^...(^^pat$$)*...$/  # no more \A or \z
    /.*\n/                /\N*\n/               # \N is negated \n
                          /.*?\n/               # this still works
    /.*/s                 /.*/                  # . always matches "any"
    \Q$string\E           $string               # interpret literally
    (?{ code })           { code }              # call code, ignore return
                          { code or fail }      # use code as an assertion
    (??{$rule})           <$var>                # call $var as regex
                          <name>                # call rule from current grammar
                          <Other::rule>         # call rule from some Other grammar
                          <*rule>               # bypass local rule to call built-in
                          <@array>              # call array of alternate rules
                          <%hash>               # parse keyword as key to rule
                          <@array[1]>           # call a rule from an array
                          <%hash{"x"}>          # call a rule from a hash
                          <&sub(1,2,3)>         # call a rule returned by a sub
                          <{ code }>            # call return value as anonymous rule
                          <( code )>            # call code as boolean assertion
                          <name(expr)>          # call rule, passing Perl args
                          { .name(expr) }       # same thing.
                          <$var(expr)>          # call rule indirectly by name
                          { .$var(expr) }       # same thing.
                          <name pat>            # call rule, passing regex arg
                          { .name(/pat/) }      # same thing.
                          # maybe...
                          <name: text>          # call rule, passing string
                          { .name(q<text>) }    # same thing.
    [\040\t\p{Zs}]        \h                    # horizontal whitespace
    [\r\n\ck\p{Zl}\p{Zp}] \v                    # vertical whitespace
    [a-z]                 <[a-z]>               # equivalently non-international
                          <alpha>               # more international
    [[:alpha:][:digit:]]  <<alpha><digit>>      # POSIX classes are built-in rules
    {n,m}                 <n,m>                 # assert repeat count
    {$n,$m}               <$n,$m>               # indirect repeat counts
    (?>.*)                [.*]:                 # don't backtrack through [.*]
                          .*:                   # brackets not necessary on atom
                          (.*):                 # same, but capture
                          <xyz>:                # don't backtrack into subrule
                          :                     # skip previous atom when backtracking
                          ::                    # fail all |'s when backtracking
                          :::                   # fail current rule when backtracking
                          :=                    # bind a name to following atom
    my ($x) = /(.*)/      my $x; / $x:=(.*) /   # may now bind it inside regex
    (?i)                  :i                    # ignore case in the following
                          :ignorecase           # same thing, self-documenting form
    (?i:...)              [:i ...]              # can limit scope without capture
                          (:i ...)              # can limit scope with capture
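The `\Q$string\E` row in the table is worth a concrete look, since it reverses a default. In a Perl 5-style dialect a string dropped into a pattern is read as regex syntax unless it is quoted first. The sketch below uses Python's `re.escape` as the counterpart of `\Q...\E`; the helper name `find_literal` and the sample strings are illustrative assumptions.

```python
import re


def find_literal(needle: str, haystack: str):
    """Search for needle as plain text, ignoring any metacharacters in it."""
    return re.search(re.escape(needle), haystack)


price = "$9.99"
sentence = "only $9.99 today"

# Unquoted, the $ acts as an end-of-string anchor, so nothing matches.
unquoted = re.search(price, sentence)

# Quoted, the same string matches itself character for character.
quoted = find_literal(price, sentence)

print(unquoted)
print(quoted.group(0))
```

Under the new rules the quoted behavior is what a bare `$string` does, and treating the contents as a pattern is the case that needs extra syntax (`<$var>`).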

Declarations like :i are lexically scoped and do not pass to any subrules. Each rule maintains its own case sensitivity. There is no built-in operator to turn case ignorance back off--just call a different rule and it's automatically case sensitive again. (If you want a parameterized subrule, that can be arranged. It's just a method, after all. Proof of this assertion is left to future generations of hackers.)
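The old `(?i:...)` form from the table still exists in Perl 5-style engines, so the idea of limiting case insensitivity to a group can be tried directly. This is a small Python sketch with a made-up pattern. It shows only the scoping of the modifier, not the subrule behavior described above, which has no counterpart in this dialect.

```python
import re

# Case is ignored only inside the group; the rest stays case sensitive.
scoped = re.compile(r"(?i:larry) Wall")

inside_scope = scoped.fullmatch("LARRY Wall") is not None
outside_scope = scoped.fullmatch("LARRY WALL") is not None

print(inside_scope, outside_scope)
```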
