Apocalypse 5
by Larry Wall
Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.
Modifier Reform
You can't use colon for a regex delimiter any more. That's because regex modifiers may now be placed in front of a regex construct:
s:w:i:e /foo/bar/ # :words :ignorecase :each
That can also be written:
s/:w:i:e foo/bar/ # :words :ignorecase :each
Single character modifiers may be bundled like this:
s:wie /foo/bar/ # :words :ignorecase :each
...but only if the sequence as a whole is not already defined as a long modifier, since ambiguity will be resolved in favor of the long modifier. Long modifiers may not be bundled with any other modifier. So this is legal:
s:once:wie /foo/bar/
but not these (unless you've defined them):
s:wieonce /foo/bar/
s:oncewie /foo/bar/
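The bundling rule is easy to state as a lookup. Here is a minimal Python sketch of it, assuming a made-up pair of modifier tables (`LONG_MODIFIERS` and `SHORT_MODIFIERS` are hypothetical, as are the function names); it is not how any real parser does it:

```python
# Hypothetical modifier tables, for illustration only.
LONG_MODIFIERS = {"words", "ignorecase", "each", "once", "cont", "any"}
SHORT_MODIFIERS = {"w", "i", "e", "c"}


def expand_modifier(name):
    """Resolve one ':name' modifier into a list of modifier names."""
    # A defined long modifier always wins over a bundle reading.
    if name in LONG_MODIFIERS:
        return [name]
    # Otherwise every character must be a single-character modifier.
    if name and all(ch in SHORT_MODIFIERS for ch in name):
        return list(name)
    raise ValueError(f"unknown modifier :{name}")


def parse_modifiers(text):
    """Split a string such as ':once:wie' into its individual modifiers."""
    result = []
    for name in text.split(":")[1:]:
        result.extend(expand_modifier(name))
    return result
```

Under these assumptions `parse_modifiers(":once:wie")` yields `once`, `w`, `i`, `e`, while `":wieonce"` raises an error because it is neither a long modifier nor a pure bundle.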
Not only is colon disallowed as a regex delimiter, but you may no longer use parentheses as the delimiters either. This will allow us to parameterize modifiers:
s:myoption($x) /foo/bar/
This rule also allows us to differentiate s/// from an s() function,
tr/// from tr(), etc. If you want matching brackets for the
delimiters I'd suggest that you use square brackets, since they now
mean grouping without capturing.
Several modifiers, /x, /s, and /m, are no longer needed and have
been retired. It's unclear whether /o is necessary any more. We will
assume it's gone unless it's shown that caching can't handle the
problem. Note that the regex now has more control over when to cache subrules
because it is no longer subject to the vagaries of standard interpolation.
The old /c modifier is gone because regexes never reset the position
on failure any more. To do that, set $string.pos = 0 explicitly.
But note also that assigning to a string automatically resets its position
to 0, so any string in your typical loop is going to start with its current
search position already set to 0. Modifying a string in place causes the position
to move to the end of the replacement section by default, if the position
was within the span replaced. (This is consistent with s/// semantics.)
The /e modifier is also gone, since it did reverse parsing magic, and
:e will be short for :each--see below. It's still easy to substitute
the value of an expression though:
s/pat/$( code )/;
or even
s(/pat/, { code });
There's a new modifier, :once, that causes a match to succeed only
once (like the old ?...? construct). To reset it, use the .reset
method on the regex object. (If you haven't named the regex object,
too bad...)
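To make the "once, until reset" behavior concrete, here is a small Python emulation. The `OnceRegex` class is a hypothetical wrapper invented for this sketch; it only mimics the semantics described above:

```python
import re


class OnceRegex:
    """Wrap a compiled pattern so it can succeed at most once until reset."""

    def __init__(self, pattern):
        self._regex = re.compile(pattern)
        self._spent = False

    def search(self, text):
        # After one success, refuse to match again until reset() is called.
        if self._spent:
            return None
        match = self._regex.search(text)
        if match is not None:
            self._spent = True
        return match

    def reset(self):
        self._spent = False
```

Note that a failed search does not use up the one allowed success; only an actual match does.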
Another new modifier is :w, which causes an implicit match of
whitespace wherever there's literal whitespace in a pattern.
In other words, it replaces every sequence of actual whitespace
in the pattern with a \s+ (between two identifiers) or a \s*
(between anything else). So
m:w/ foo bar \: ( baz )*/
really means (expressed in Perl 5 form):
m:p5/\s*foo\s+bar\s*:\s*(\s*baz\s*)*/
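The whitespace rule can be tried out with a rough Python sketch. The `expand_words` function below is a hypothetical helper that applies only the basic rule (`\s+` between two identifier characters, `\s*` elsewhere). It assumes plain tokens like the ones in this example and knows nothing about explicit whitespace-matching tokens:

```python
import re


def is_identifier_char(ch):
    # True for a single letter, digit, or underscore; False for "" or anything else.
    return ch.isalnum() or ch == "_"


def expand_words(pattern):
    # Split into alternating non-whitespace and whitespace runs; the
    # capturing group keeps the whitespace runs in the result.
    pieces = re.split(r"(\s+)", pattern)
    out = []
    for index, piece in enumerate(pieces):
        if piece == "":
            continue
        if not piece.isspace():
            out.append(piece)
            continue
        before = pieces[index - 1][-1:]  # last character of the previous token
        after = pieces[index + 1][:1] if index + 1 < len(pieces) else ""
        if is_identifier_char(before) and is_identifier_char(after):
            out.append(r"\s+")
        else:
            out.append(r"\s*")
    return "".join(out)


expanded = expand_words(r" foo bar \: ( baz )*")
```

The expanded pattern then accepts a string such as `"foo  bar:baz baz"` but rejects `"foobar: baz"`, since the gap between the two identifiers must contain at least one whitespace character.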
You can still control the handling of whitespace under :w, since
we extend the rule to say that any explicit whitespace-matching token can't
match whitespace implicitly on either side. So:
m:w/ foo\ bar \h* \: (baz)*/
really means (expressed in Perl 5 form):
m:p5/\s*foo bar[\040\t\p{Zs}]*:\s*(baz)*/
The first space in
/[:w foo bar]/
matches \s* before "foo". That's usually what you want, but if
it's not what you want, you have a little problem. Unfortunately you
can't just say:
/[:wfoo bar]/
That won't work because it'll look for the :wfoo modifier.
However, there are several ways to get the effect you want:
/[:w()foo bar]/
/[:w[]foo bar]/
/[:w\bfoo bar]/
/[:w::foo bar]/
That last one is just our friend the :: operator in disguise.
If you backtrack into it, you're leaving the brackets anyway, so it's
essentially a no-op.
The new :c/:cont modifier forces the regex to continue
at the current "pos" of the string. It may only be used outside
the regex. (Well, it could be used inside but it'd be redundant.)
The modifier also forces the regex to match only the next available
thing. That's not quite the same as the ^ anchor, though,
because it not only disables the implicit scanning done by m//
and s///, but it also works on more than the first iteration.
It forces all matches to be contiguous, in other words. So :c
is short for both "continue" and "contiguous". If you say
$_ = "foofoofoo foofoofoo";
s:each:cont/foo/FOO/;
you get:
FOOFOOFOO foofoofoo
This may seem odd, but it's precisely the semantics of any embedded regex:
$_ = "foofoofoo foofoofoo";
$rx = rx/foo/;
m/<$rx>*/; # matches "foofoofoo"
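For comparison, here is how the contiguous behavior might be emulated in Python. This is a sketch under the assumption that "contiguous" means each match must begin exactly where the previous one ended; `substitute_contiguous` is a hypothetical name, and the replacement is treated as a literal string:

```python
import re


def substitute_contiguous(pattern, replacement, text, pos=0):
    """Replace matches that start exactly at pos and follow one another."""
    regex = re.compile(pattern)
    pieces = [text[:pos]]
    while True:
        # Pattern.match() is anchored at pos, so no scanning ahead happens.
        match = regex.match(text, pos)
        if match is None or match.end() == pos:
            break  # no match here, or an empty match that would loop forever
        pieces.append(replacement)
        pos = match.end()
    pieces.append(text[pos:])
    return "".join(pieces)


result = substitute_contiguous("foo", "FOO", "foofoofoo foofoofoo")
```

The loop stops at the space because nothing matches there, so the second run of "foo" is left alone, whereas an ordinary global substitution would scan past the space and change all six.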
A modifier that starts with a number causes the pattern to match that many times. It may only be used outside the regex. It may not be bundled, because ordinals are distinguished from cardinals. That is, how it treats those multiple matches depends on the next character. If you say
s:3x /foo/bar/
then it changes the first 3 instances. But if you say
s:3rd /foo/bar/
it changes only the 3rd instance. You can say
s:1st /foo/bar/
but that's just the default, and should not be construed
as equivalent to :once, which matches only once, ever.
(Unless you .reset it, of course.)
You can combine cardinals and ordinals:
s:3x:3rd /foo/bar/
That changes the 3rd, 6th, and 9th occurrences. To change every other quote character, say
s:each:2nd /"/\&rquot;/;
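The interaction of cardinals and ordinals can be sketched in Python as follows. The function name and its `times` and `nth` parameters are assumptions made for illustration: `:3x` maps to `times=3`, `:3rd` to `nth=3`, and `:each` to `times=None`.

```python
import re


def substitute_counted(pattern, replacement, text, times=1, nth=1):
    """Replace every nth match, at most `times` times (None means no limit)."""
    seen = 0     # matches encountered so far
    changed = 0  # replacements made so far

    def decide(match):
        nonlocal seen, changed
        seen += 1
        if seen % nth == 0 and (times is None or changed < times):
            changed += 1
            return replacement
        return match.group(0)  # leave this occurrence untouched

    return re.sub(pattern, decide, text)


text = " ".join(["foo"] * 10)
combined = substitute_counted("foo", "bar", text, times=3, nth=3)
```

With `times=3, nth=3` the 3rd, 6th, and 9th occurrences change; with the defaults only the first occurrence changes, matching the `:1st` default described above.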
:each is synonymous with :3x (for large values of 3). Note that
:each does not, in fact, generate every possible match, because it
disallows overlaps. To get every possible match, use the :any
modifier. Saying:
$_ = "abracadabra";
@all = m:any /a[^a]+a/;
produces:
abra aca ada abra
It can even match multiple times at the same spot as long as the rest of the regex progresses somehow. Saying:
@all = m:any /a.*?a/;
produces:
abra abraca abracada abracadabra aca acada acadabra ada adabra abra
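Both result lists can be reproduced with a brute-force Python sketch that tries every start and end position. `match_any` is a hypothetical helper; it is quadratic in the string length and only suitable for simple patterns like these, without anchors or lookarounds:

```python
import re


def match_any(pattern, text):
    """Return every substring of text that the whole pattern can match."""
    regex = re.compile(pattern)
    found = []
    for start in range(len(text) + 1):
        for end in range(start, len(text) + 1):
            # fullmatch() with pos/endpos checks exactly text[start:end].
            if regex.fullmatch(text, start, end):
                found.append(text[start:end])
    return found


no_overlap_in_middle = match_any(r"a[^a]+a", "abracadabra")
all_spans = match_any(r"a.*?a", "abracadabra")
```

The first call yields the four matches listed earlier and the second yields the ten listed just above, in the same order (by start position, then by length).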
If you say
$sentence.m:any /^ <english> $/
you'll get every possible parsing of the sentence according to the rules
of english (not to be confused with the rules of English, which are already
confusing enough, except when they aren't).
To indicate varying levels of Unicode support we have these modifiers, which may be used either inside or outside a regex:
:u0 # use bytes (. is byte)
:u1 # level 1 support (. is codepoint)
:u2 # level 2 support (. is grapheme)
:u3 # level 3 support (. is language dependent)
These modifiers say nothing about the state of the data, but in general internal
Perl data will already be in Normalization Form C, so even under :u1, the precomposed
characters will usually do the right thing. Note that these modifiers
are for overriding the default support level, which was probably set by pragma
at the top of the file.
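The difference between bytes and codepoints, and why Normalization Form C helps at the codepoint level, can be seen in a short Python sketch. Python's `re` treats `.` as one codepoint in a `str` pattern, which is used here as a stand-in for `:u1`; grapheme-level matching is not shown because the standard library does not provide it.

```python
import re
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single precomposed character

codepoints_decomposed = len(decomposed)
codepoints_composed = len(composed)
bytes_composed = len(composed.encode("utf-8"))

# A codepoint-level '.' covers the whole character only in its composed form.
dot_matches_composed = re.fullmatch(".", composed) is not None
dot_matches_decomposed = re.fullmatch(".", decomposed) is not None
```

Here the decomposed form is two codepoints and the composed form is one codepoint occupying two UTF-8 bytes, so a codepoint-level `.` matches the accented letter as a unit only after normalization.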
Finally, there's the :p5 modifier, which causes the rest of the regex (or group) to
be parsed as a Perl 5 regular expression, including any interpolated strings.
(But it still doesn't enable Perl 5's trailing modifiers.)
Old                   New                  Comment
---                   ---                  -------
?pat?                 m:once/pat/          # match once only
/pat/i                m:i/pat/             # ignorecase
                      /:i pat/             # ignorecase
/pat/x                /pat/                # always extended
/pat\s*pat/           /:w pat pat/         # match word sequence
/(?i)$p5pat/          m:p5/(?i)$p5pat/     # use Perl 5 syntax
$n = () = /.../g      $n = +/.../;         # count occurrences
for $i (1..3){s///}   s:3///;              # do 3 times
/^pat$/m              /^^pat$$/            # no more /m
/./s                  /./                  # no more /s
/./                   /\N/                 # . now works like /s
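For readers who think in other regex dialects, a few of these rows have rough Python analogues. This is only an analogy, assuming Python's `re` flags as stand-ins for the new defaults; the variable names are made up for the sketch:

```python
import re

text = "one\ntwo\nthree"

# Old /^pat$/m, new /^^pat$$/: anchors that match at line boundaries.
line_hits = re.findall(r"^t\w+$", text, flags=re.MULTILINE)

# New /./ matches any character, newline included (old /./s).
dot_spans_lines = re.fullmatch(r".+", text, flags=re.DOTALL) is not None

# New /\N/ matches anything except a newline (old /./).
runs_without_newline = re.findall(r"[^\n]+", text)

# Old $n = () = /.../g, new $n = +/.../: count the matches.
count = len(re.findall(r"o", text))
```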





