Apocalypse 5
by Larry Wall
Pages: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Backslash Reform
There are some changes to backslash sequences. Character properties
\p and \P are no longer needed--predefined character classes
are just considered intrinsic grammar rules. (You can negate any <...>
assertion by using <!...> instead.) As mentioned in
a previous Apocalypse, the \L, \U, and \Q sequences no longer
use \E to terminate--they now require bracketing characters of some
sort. And \Q will rarely be needed, since interpolated variables are now
matched literally by default. In fact, all three may go away, since it's easy to say things like:
$(lc $foo)
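The same idea, computing the string with an ordinary expression and interpolating the result, can be sketched in Python. This is an analogy only, not Perl 6 syntax, and the function name literal_lowercase_pattern is hypothetical. Unlike the new Perl rules, Python's re module does not treat interpolated text as literal, so re.escape does the job that \Q did in Perl 5.

```python
import re

def literal_lowercase_pattern(text):
    """Build a pattern that matches the lowercased text literally."""
    # text.lower() plays the role of \L...; re.escape plays the role of \Q...
    return re.compile(re.escape(text.lower()))

pattern = literal_lowercase_pattern("A.B")
# The dot is escaped, so it matches only a literal "." and not any character.
found = pattern.search("xa.by") is not None
missed = pattern.search("xaxby") is None
```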
For any bracketing construct, square brackets are preferred, but others are allowed:
\x[...]    # preferred, indicates simple bracketing
\x(...)    # okay, but doesn't capture
\x{...}    # okay, but isn't a closure
\x<...>    # okay, but isn't an assertion
The \c sequence is now a bracketing construct, having been
extended from representing control characters to any named character.
Backreferences such as \1 are gone in favor of the corresponding
variable $1. \A, \Z, and \z are gone with the
disappearance of /s and /m. The position assertion \G is
gone in favor of a :c modifier that forces continuation from where
the last match left off. That's because \G was almost never used
except at the front of a regex. In the unlikely event that you want to
assert that you're at the old final position elsewhere in your regex,
you can always test the current position (via the .pos method)
with an assertion:
$oldpos = pos $string;
$string =~ m/... <( .pos == $oldpos )> .../;
You may be thinking of .pos as the final position of the previous
match, but that's not what it is. It's the current position of the
current match. It's just that, between matches, the current position
of the current match happens to be the same as the final position of
the current match, which happens to be the last match, which happens
to be done. But as soon as you start another match, the last match is no
longer the current match.
Note that the :c continuation is needed only on constructs that
ordinarily force the search to start from the beginning. Subrules
automatically continue at the current location, since their initial
position is controlled by some other rule.
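Continuation matching is not specific to Perl. The following minimal sketch uses Python rather than Perl 6 syntax: a compiled Python pattern accepts a starting offset in match(), which anchors the attempt at that offset, roughly as \G or :c anchors a match where the last one left off. The names WORD, SPACE, and tokens are hypothetical and exist only for this illustration.

```python
import re

WORD = re.compile(r"\w+")
SPACE = re.compile(r"\s+")

def tokens(text):
    """Yield words, each match resuming exactly where the previous one ended."""
    pos = 0
    while pos < len(text):
        # match(text, pos) must succeed at pos itself; it never skips ahead.
        m = WORD.match(text, pos) or SPACE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.re is WORD:
            yield m.group()
        pos = m.end()  # the final position of this match starts the next one

words = list(tokens("foo  bar baz"))
```

Because each attempt is pinned to the current offset, unexpected input is reported at the point where it occurs instead of being silently skipped.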
There are two new backslash sequences, \h and \v, which match
horizontal and vertical whitespace respectively, including Unicode
spacing characters and control codes. Note that \r is considered
vertical even though it theoretically moves the carriage sideways.
Finally, \n matches a logical newline, which is not necessarily
a linefeed character on all architectures. After all, that's why it's an "n",
not an "l". Your program should not break just because you happened to
run it on a file from a partition mounted from a Windows machine.
(Within an interpolated string, \n still produces whatever
is the normal newline for the current architecture.)
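As a rough illustration, the Perl 5 character classes that this document lists as equivalents of \h, \v, and \n can be approximated in Python. This is a sketch under stated assumptions, not a definition of the new sequences: Python's re module has no \p{...} properties, so the Unicode categories are checked with unicodedata, and the names is_horizontal, is_vertical, and logical_lines are hypothetical.

```python
import re
import unicodedata

def is_horizontal(ch):
    # Approximates [\040\t\p{Zs}]: tab plus Unicode space separators.
    # U+0020 SPACE is itself in category Zs.
    return ch == "\t" or unicodedata.category(ch) == "Zs"

def is_vertical(ch):
    # Approximates [\r\n\ck\p{Zl}\p{Zp}]; \ck is Control-K, the vertical tab.
    return ch in {"\r", "\n", "\x0b"} or unicodedata.category(ch) in ("Zl", "Zp")

# A logical newline in the sense of the old idiom \r?\n.
LOGICAL_NEWLINE = re.compile(r"\r?\n")

def logical_lines(text):
    """Split on logical newlines, so CRLF and LF files behave alike."""
    return LOGICAL_NEWLINE.split(text)

lines = logical_lines("one\r\ntwo\nthree")
```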
Old                     New                       
---                     ---                       
\x0a                    \x0a                      # same
\x{263a}                \x263a                    # brackets required only if ambiguous
\x{263a}abc             \x[263a]abc               # brackets required only if ambiguous
\0123                   \0123                     # same (no ambiguity with $123 now)
\0123                   \0[123]                   # can use brackets here too
\p{prop}                <prop>                    # properties are just grammar rules
\P{prop}                <!prop>
[\040\t\p{Zs}]          \h                        # horizontal whitespace
space                   \h                        # not exact, but often more correct
[\r\n\ck\p{Zl}\p{Zp}]   \v                        # vertical whitespace
\Qstring\E              \q[string]
                        <'string with spaces'>    # match literal string
                        <' '>                     # match literal space
\E                      gone                      # use \Q[...] instead
\A                      ^                         # ^ now invariant
\a                      \c[BEL]                   # alarm (bell)
\Z                      \n?$                      # clearer
\z                      $                         # $ now invariant
\G                      <( .pos == $oldpos )>     # match at particular position
                                                  # typically just use m:c/pat/
\N{CENT SIGN}           \c[CENT SIGN]             # named character
\c[                     \e                        # escape
\cX                     \c[^X]                    # control char
\n                      \c[LF]                    # specifically a linefeed
\x0d\x0a                \x[0d;0a]                 # CRLF
\x0d\x0a                \c[CR;LF]                 # CRLF (conjectural)
\C                      [:u0 .]                   # forced byte in utf8 (dangerous)
[^\N{CENT SIGN}]        \C[CENT SIGN]             # match any char but CENT SIGN
\Q$var\E                $var                      # always assumed literal,
\1                      $1                        # so $1 is literal backref
/$1/                    my $old1 = $1; /$old1/    # must use temporary here
\r?\n                   \n                        # \n asserts logical newline
[^\n]                   \N                        # not a logical newline
                        \C[LF]                    # not a linefeed
[^\t]                   \T                        # not a tab (these are conjectural)
[^\r]                   \R                        # not a return
[^\f]                   \F                        # not a form feed
[^\e]                   \E                        # not an escape
[^\x1B]                 \X1B                      # not specified hex char
[^\x{263a}]             \X[263a]                  # not a Unicode SMILEY
\X                      <.>                       # a grapheme (combining char seq)
                        [:u2 .]                   # At level 2+, dot means grapheme
Under level 2 Unicode support, a character is assumed to mean a grapheme,
that is, a sequence consisting of a base character followed by 0 or more
combining characters. That not only affects the meaning of the . character, but
also any negated character, since a negated character is really a negative
lookahead assertion followed by the traversal of a single character. For
instance, \N really means:
[<!before \n> . ]
So it doesn't really matter how many characters \n actually matches. \N always
matches a single character--whatever that is...
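Both ideas in this paragraph can be sketched in Python, as an analogy and not as Perl 6 syntax. The pattern NOT_NEWLINE spells out "negative lookahead, then one character" using \r?\n as the logical newline. The function graphemes follows only the simplified definition given above (a base character plus zero or more combining characters), which is an assumption for illustration and not the full Unicode segmentation algorithm. Both names are hypothetical.

```python
import re
import unicodedata

# A negated character: assert that a logical newline does not start here,
# then traverse exactly one character (DOTALL lets "." match anything).
NOT_NEWLINE = re.compile(r"(?!\r?\n).", re.DOTALL)

def graphemes(text):
    """Group each base character with the combining characters that follow it."""
    clusters = []
    for ch in text:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

non_newlines = NOT_NEWLINE.findall("ab\r\ncd")
clusters = graphemes("e\u0301x")  # "e" + COMBINING ACUTE ACCENT, then "x"
```

However many characters the newline pattern spans, each successful match of NOT_NEWLINE consumes exactly one.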

