Apocalypse 5
by Larry Wall

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

Backslash Reform

There are some changes to backslash sequences. Character properties \p and \P are no longer needed--predefined character classes are just considered intrinsic grammar rules. (You can negate any <...> assertion by using <!...> instead.) As mentioned in a previous Apocalypse, the \L, \U, and \Q sequences no longer use \E to terminate--they now require bracketing characters of some sort. And \Q will rarely be needed due to regex policy changes. In fact, they may all go away since it's easy to say things like:

    $(lc $foo)

For any bracketing construct, square brackets are preferred, but others are allowed:

    \x[...]     # preferred, indicates simple bracketing
    \x(...)     # okay, but doesn't capture.
    \x{...}     # okay, but isn't a closure.
    \x<...>     # okay, but isn't an assertion

The \c sequence is now a bracketing construct, having been extended from representing control characters to any named character.
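For readers who want to experiment with named characters today, here is a rough analogy in Python rather than Perl 6. It is only a sketch: the `named` helper is a hypothetical name, and the standard `unicodedata.lookup` call stands in for what `\c[CENT SIGN]` is described as doing.

```python
import re
import unicodedata

def named(name):
    """Return the character with the given Unicode name (hypothetical helper)."""
    return unicodedata.lookup(name)

# Look the character up by name instead of spelling out its code point.
cent = named("CENT SIGN")

# A negated class built from the named character, comparable in spirit
# to "match any char but CENT SIGN".
not_cent = re.compile("[^" + re.escape(cent) + "]")
```

Looking characters up by name keeps the pattern readable and avoids hard-coding hex values.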

Backreferences such as \1 are gone in favor of the corresponding variable $1. \A, \Z, and \z are gone with the disappearance of /s and /m. The position assertion \G is gone in favor of a :c modifier that forces continuation from where the last match left off. That's because \G was almost never used except at the front of a regex. In the unlikely event that you want to assert that you're at the old final position elsewhere in your regex, you can always test the current position (via the .pos method) with an assertion:

    $oldpos = pos $string;
    $string =~ m/... <( .pos == $oldpos )> .../;

You may be thinking of .pos as the final position of the previous match, but that's not what it is. It's the current position of the current match. It's just that, between matches, the current position of the current match happens to be the same as the final position of the current match, which happens to be the last match, which happens to be done. But as soon as you start another match, the last match is no longer the current match.

Note that the :c continuation is needed only on constructs that ordinarily force the search to start from the beginning. Subrules automatically continue at the current location, since their initial position is controlled by some other rule.
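The idea of continuing from where the last match left off can be illustrated outside Perl. The following is a minimal sketch in Python, not Perl 6 syntax: a compiled pattern's `match(string, pos)` method succeeds only if the match begins exactly at `pos`, which is close in spirit to `m:c/pat/`. The `tokenize` function and the `TOKEN` pattern are hypothetical names chosen for illustration.

```python
import re

# Hypothetical token pattern: a run of digits, a run of letters, or one space.
TOKEN = re.compile(r"\d+|[A-Za-z]+| ")

def tokenize(text):
    """Split text into tokens, each match continuing where the last ended."""
    pos = 0
    tokens = []
    while pos < len(text):
        # match() with a pos argument must succeed exactly at pos,
        # so the scan cannot skip ahead over unrecognized input.
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group())
        pos = m.end()  # the final position of this match starts the next
    return tokens
```

With this sketch, `tokenize("abc 123")` returns `["abc", " ", "123"]`, and input containing a character the pattern does not cover raises an error instead of being silently skipped.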

There are two new backslash sequences, \h and \v, which match horizontal and vertical whitespace respectively, including Unicode spacing characters and control codes. Note that \r is considered vertical even though it theoretically moves the carriage sideways. Finally, \n matches a logical newline, which is not necessarily a linefeed character on all architectures. After all, that's why it's an "n", not an "l". Your program should not break just because you happened to run it on a file from a partition mounted from a Windows machine. (Within an interpolated string, \n still produces whatever is the normal newline for the current architecture.)
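To make the whitespace categories concrete, here is a small sketch in Python (an analogy, not Perl 6). It assumes the definitions given in the translation table: horizontal whitespace is tab plus the Unicode space separators (category Zs), vertical whitespace is CR, LF, VT, and the line and paragraph separators, and a logical newline is an optional CR followed by LF. The names `HSPACE`, `VSPACE`, `NEWLINE`, and `split_lines` are hypothetical.

```python
import re
import unicodedata

# Space separators (Unicode category Zs) in the Basic Multilingual Plane.
ZS = "".join(chr(c) for c in range(0x10000)
             if unicodedata.category(chr(c)) == "Zs")

# Horizontal whitespace: tab plus the space separators.
HSPACE = re.compile("[\t" + re.escape(ZS) + "]")

# Vertical whitespace: CR, LF, VT, LINE SEPARATOR, PARAGRAPH SEPARATOR.
VSPACE = re.compile("[\r\n\x0b\u2028\u2029]")

# Logical newline: an optional carriage return followed by a linefeed.
NEWLINE = re.compile(r"\r?\n")

def split_lines(text):
    """Split on logical newlines, so CRLF and LF files behave the same."""
    return NEWLINE.split(text)
```

Splitting on the logical newline means a file written with Windows line endings yields the same lines as one written with Unix line endings.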

    Old                 New
    ---                 ---
    \x0a                \x0a                    # same
    \x{263a}            \x263a                  # brackets required only if ambiguous
    \x{263a}abc         \x[263a]abc             # brackets required only if ambiguous
    \0123               \0123                   # same (no ambiguity with $123 now)
    \0123               \0[123]                 # can use brackets here too
    \p{prop}            <prop>                  # properties are just grammar rules
    \P{prop}            <!prop>
    [\040\t\p{Zs}]      \h                      # horizontal whitespace
    space               \h                      # not exact, but often more correct
    [\r\n\ck\p{Zl}\p{Zp}] \v                    # vertical whitespace
    \Qstring\E          \q[string]
                        <'string with spaces'>  # match literal string
                        <' '>                   # match literal space
    \E                  gone                    # use \Q[...] instead
    \A                  ^                       # ^ now invariant
    \a                  \c[BEL]                 # alarm (bell)
    \Z                  \n?$                    # clearer
    \z                  $                       # $ now invariant
    \G                  <( .pos == $oldpos )>   # match at particular position
                                                # typically just use m:c/pat/
    \N{CENT SIGN}       \c[CENT SIGN]           # named character
    \c[                 \e                      # escape
    \cX                 \c[^X]                  # control char
    \n                  \c[LF]                  # specifically a linefeed
    \x0d\x0a            \x[0d;0a]               # CRLF
    \x0d\x0a            \c[CR;LF]               # CRLF (conjectural)
    \C                  [:u0 .]                 # forced byte in utf8 (dangerous)
    [^\N{CENT SIGN}]    \C[CENT SIGN]           # match any char but CENT SIGN
    \Q$var\E            $var                    # always assumed literal,
    \1                  $1                      # so $1 is literal backref
    /$1/                my $old1 = $1; /$old1/  # must use temporary here
    \r?\n               \n                      # \n asserts logical newline
    [^\n]               \N                      # not a logical newline
                        \C[LF]                  # not a linefeed
    [^\t]               \T                      # not a tab (these are conjectural)
    [^\r]               \R                      # not a return
    [^\f]               \F                      # not a form feed
    [^\e]               \E                      # not an escape
    [^\x1B]             \X1B                    # not specified hex char
    [^\x{263a}]         \X[263a]                # not a Unicode SMILEY
    \X                  <.>                     # a grapheme (combining char seq)
                        [:u2 .]                 # At level 2+, dot means grapheme

Under level 2 Unicode support, a character is assumed to mean a grapheme, that is, a sequence consisting of a base character followed by 0 or more combining characters. That not only affects the meaning of the . character, but also any negated character, since a negated character is really a negative lookahead assertion followed by the traversal of a single character. For instance, \N really means:

    [<!before \n> . ]

So it doesn't really matter how many characters \n actually matches. \N always matches a single character--whatever that is...
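The "negative lookahead, then one character" reading of `\N` can be sketched in Python (again an analogy, not Perl 6). The sketch assumes a logical newline of the form `\r?\n`, and the names `NOT_NEWLINE` and `non_newline_chars` are hypothetical. Note that Python's `.` matches a code point rather than a grapheme, so this shows only the lookahead part of the idea.

```python
import re

# Negative lookahead for a logical newline, then exactly one character.
# DOTALL lets "." match any character, including "\r" and "\n".
NOT_NEWLINE = re.compile(r"(?!\r?\n).", re.DOTALL)

def non_newline_chars(text):
    """Return every single character that does not start a logical newline."""
    return NOT_NEWLINE.findall(text)
```

On the input `"a\r\nb"`, this returns only `a` and `b`, whereas the old-style class `[^\n]` would also accept the `\r`. The lookahead rejects the whole two-character newline even though the pattern itself consumes just one character at a time.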
