Apocalypse 5
by Larry Wall

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

Modifier Reform

You can't use colon for a regex delimiter any more. That's because regex modifiers may now be placed in front of a regex construct:

    s:w:i:e /foo/bar/           # :words :ignorecase :each

That can also be written:

    s/:w:i:e foo/bar/           # :words :ignorecase :each

Single character modifiers may be bundled like this:

    s:wie /foo/bar/             # :words :ignorecase :each

...but only if the sequence as a whole is not already defined as a long modifier, since ambiguity will be resolved in favor of the long modifier. Long modifiers may not be bundled with any other modifier. So this is legal:

    s:once:wie /foo/bar/

but not these (unless you've defined them):

    s:wieonce /foo/bar/
    s:oncewie /foo/bar/
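
The bundling rule can be sketched in a few lines of Python. This is only an illustration of the resolution order described above, not an actual parser: the `LONG` and `SHORT` tables and the `resolve` function are hypothetical names, populated with modifiers mentioned in this Apocalypse.

```python
# Hypothetical modifier tables, filled in from names used in this article.
LONG = {"once", "each", "words", "ignorecase", "any", "cont", "p5"}
SHORT = {"w", "i", "e", "c"}

def resolve(mod):
    """Resolve the text after one colon into a list of modifiers."""
    # A long modifier always wins, so check the whole sequence first.
    if mod in LONG:
        return [mod]
    # Otherwise every character must be a known single-character modifier.
    if mod and all(ch in SHORT for ch in mod):
        return list(mod)
    raise ValueError("unknown modifier :" + mod)

print(resolve("wie"))    # bundled short modifiers
print(resolve("once"))   # one long modifier
```

Under these assumptions, `resolve("wieonce")` raises an error, because a long modifier cannot be bundled with anything else.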

Not only is colon disallowed as a regex delimiter, but you may no longer use parentheses as the delimiters either. This will allow us to parameterize modifiers:

    s:myoption($x) /foo/bar/

This rule also allows us to differentiate s/// from an s() function, tr/// from tr(), etc. If you want matching brackets for the delimiters I'd suggest that you use square brackets, since they now mean grouping without capturing.

Several modifiers, /x, /s, and /m, are no longer needed and have been retired. It's unclear whether /o is necessary any more. We will assume it's gone unless it's shown that caching can't handle the problem. Note that the regex now has more control over when to cache subrules because it is no longer subject to the vagaries of standard interpolation.

The old /c modifier is gone because regexes never reset the position on failure any more. To reset it yourself, set $string.pos = 0 explicitly. But note also that assigning to a string automatically resets its position to 0, so any string in your typical loop is going to start with its current search position already set to 0. Modifying a string in place causes the position to move to the end of the replacement section by default, if the position was within the span replaced. (This is consistent with s/// semantics.)

The /e modifier is also gone, since it did reverse parsing magic, and :e will be short for :each--see below. It's still easy to substitute the value of an expression though:

    s/pat/$( code )/;

or even

    s(/pat/, { code });
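
For comparison only, here is how the same idea (a replacement computed by code rather than by a string) looks in Python's `re` module. The `double` helper and the sample string are made up for the illustration. Passing `count=1` roughly mirrors a substitution without :each.

```python
import re

def double(m):
    """Compute the replacement from the matched text."""
    return str(int(m.group(0)) * 2)

# Replace only the first match, like a plain s///.
first_only = re.sub(r"\d+", double, "a1 b22", count=1)

# Replace every match, like s:each///.
every = re.sub(r"\d+", double, "a1 b22")

print(first_only)
print(every)
```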

There's a new modifier, :once, that causes a match to succeed only once (like the old ?...? construct). To reset it, use the .reset method on the regex object. (If you haven't named the regex object, too bad...)
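
A rough way to picture :once is as a regex object that remembers it has already succeeded. The Python sketch below is an emulation under that assumption; the `OnceRegex` class is a hypothetical name, and only its `reset` method corresponds to something named in the text.

```python
import re

class OnceRegex:
    """A regex that succeeds at most once until it is reset."""

    def __init__(self, pattern):
        self._rx = re.compile(pattern)
        self._spent = False

    def search(self, text):
        if self._spent:
            return None            # already succeeded once
        m = self._rx.search(text)
        if m is not None:
            self._spent = True     # only a success uses it up
        return m

    def reset(self):
        self._spent = False

rx = OnceRegex(r"foo")
print(bool(rx.search("a foo")))   # first success
print(bool(rx.search("a foo")))   # refused until reset
rx.reset()
print(bool(rx.search("a foo")))
```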

Another new modifier is :w, which causes an implicit match of whitespace wherever there's literal whitespace in a pattern. In other words, it replaces every sequence of actual whitespace in the pattern with a \s+ (between two identifiers) or a \s* (between anything else). So


    m:w/ foo bar \: ( baz )*/

really means (expressed in Perl 5 form):


    m:p5/\s*foo\s+bar\s*:(\s*baz\s*)*/
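
To see what that expansion accepts, the Perl 5 form above can be tried in any Perl-5-style engine. The check below uses Python's `re` purely as a convenient stand-in, and the sample strings are invented.

```python
import re

# The Perl 5 form of m:w/ foo bar \: ( baz )*/ given above.
words = re.compile(r"\s*foo\s+bar\s*:(\s*baz\s*)*")

# Whitespace between the identifiers is required (\s+)...
print(bool(words.fullmatch(" foo   bar : baz baz")))
# ...but optional everywhere else (\s*).
print(bool(words.fullmatch("foo bar:baz")))
# Without whitespace between "foo" and "bar" there is no match.
print(bool(words.fullmatch("foobar:baz")))
```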

You can still control the handling of whitespace under :w, since we extend the rule to say that any explicit whitespace-matching token can't match whitespace implicitly on either side. So:

    m:w/ foo\ bar \h* \: (baz)*/

really means (expressed in Perl 5 form):


    m:p5/\s*foo bar[\040\t\p{Zs}]*:\s*(baz)*/

The first space in

    /[:w foo bar]/

matches \s* before "foo". That's usually what you want, but if it's not what you want, you have a little problem. Unfortunately you can't just say:

    /[:wfoo bar]/

That won't work because it'll look for the :wfoo modifier. However, there are several ways to get the effect you want:

    /[:w()foo bar]/ 
    /[:w[]foo bar]/ 
    /[:w\bfoo bar]/ 
    /[:w::foo bar]/

That last one is just our friend the :: operator in disguise. If you backtrack into it, you're leaving the brackets anyway, so it's essentially a no-op.

The new :c/:cont modifier forces the regex to continue at the current "pos" of the string. It may only be used outside the regex. (Well, it could be used inside but it'd be redundant.) The modifier also forces the regex to match only the next available thing. That's not quite the same as the ^ anchor, though, because it not only disables the implicit scanning done by m// and s///, but it also works on more than the first iteration. It forces all matches to be contiguous, in other words. So :c is short for both "continue" and "contiguous". If you say

    $_ = "foofoofoo foofoofoo";
    s:each:cont/foo/FOO/;

you get:

    FOOFOOFOO foofoofoo

This may seem odd, but it's precisely the semantics of any embedded regex:

    $_ = "foofoofoo foofoofoo";
    $rx = rx/foo/;
    m/<$rx>*/;          # matches "foofoofoo"
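
The contiguity rule can be emulated in Python by anchoring each attempt at the end of the previous match instead of scanning ahead. This is a sketch of the semantics described above, not of any real implementation; `sub_each_cont` is a hypothetical helper.

```python
import re

def sub_each_cont(pattern, repl, text, pos=0):
    """Replace matches that follow one another directly, starting at pos."""
    rx = re.compile(pattern)
    out = [text[:pos]]
    while True:
        # match() is anchored at pos, so no scanning takes place.
        m = rx.match(text, pos)
        if m is None or m.end() == m.start():
            break                  # first gap (or empty match) ends the run
        out.append(repl)
        pos = m.end()
    out.append(text[pos:])
    return "".join(out)

print(sub_each_cont("foo", "FOO", "foofoofoo foofoofoo"))
```

The space after the third "foo" breaks the run, so the second group is left alone.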

A modifier that starts with a number causes the pattern to match that many times. It may only be used outside the regex. It may not be bundled, because ordinals are distinguished from cardinals. That is, how it treats those multiple matches depends on the next character. If you say

    s:3x /foo/bar/

then it changes the first 3 instances. But if you say

    s:3rd /foo/bar/

it changes only the 3rd instance. You can say

    s:1st /foo/bar/

but that's just the default, and should not be construed as equivalent to :once, which matches only once, ever. (Unless you .reset it, of course.)

You can combine cardinals and ordinals:

    s:3x:3rd /foo/bar/

That changes the 3rd, 6th, and 9th occurrences. To change every other quote character, say

    s:each:2nd /"/\&rquot;/;
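
One way to read the cardinal and ordinal modifiers is as two numbers: which matches are eligible (every Nth) and how many replacements to make. The Python sketch below follows that reading; `sub_nth` and its `every` and `times` parameters are hypothetical names chosen for the illustration.

```python
import re
from itertools import count

def sub_nth(pattern, repl, text, every=1, times=None):
    """Replace every `every`-th match, making at most `times` replacements."""
    seen = count(1)
    done = 0

    def pick(m):
        nonlocal done
        n = next(seen)
        if n % every == 0 and (times is None or done < times):
            done += 1
            return repl
        return m.group(0)          # leave this match untouched

    return re.sub(pattern, pick, text)

text = " ".join(["foo"] * 10)
print(sub_nth("foo", "bar", text, times=3))            # like s:3x
print(sub_nth("foo", "bar", text, every=3, times=1))   # like s:3rd
print(sub_nth("foo", "bar", text, every=3, times=3))   # like s:3x:3rd
```

With no limit on `times`, `every=2` behaves like the s:each:2nd example.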

:each is synonymous with :3x (for large values of 3). Note that :each does not, in fact, generate every possible match, because it disallows overlaps. To get every possible match, use the :any modifier. Saying:

    $_ = "abracadabra";
    @all = m:any /a[^a]+a/;

produces:

    abra aca ada abra

It can even match multiple times at the same spot as long as the rest of the regex progresses somehow. Saying:

    @all = m:any /a.*?a/;

produces:

    abra abraca abracada abracadabra aca acada acadabra ada adabra abra
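
Both lists can be reproduced by brute force: try every start and end position and keep the substrings that the pattern matches in full. The Python sketch below does exactly that. It is an emulation of the result, not a claim about how :any would be implemented, and `match_any` is a hypothetical name.

```python
import re

def match_any(pattern, text):
    """Return every substring that the pattern matches completely."""
    rx = re.compile(pattern)
    return [text[i:j]
            for i in range(len(text))
            for j in range(i + 1, len(text) + 1)
            if rx.fullmatch(text, i, j)]

print(" ".join(match_any(r"a[^a]+a", "abracadabra")))
print(" ".join(match_any(r"a.*?a", "abracadabra")))
```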

If you say

    $sentence.m:any /^ <english> $/

you'll get every possible parsing of the sentence according to the rules of english (not to be confused with the rules of English, which are already confusing enough, except when they aren't).

To indicate varying levels of Unicode support we have these modifiers, which may be used either inside or outside a regex:

    :u0         # use bytes       (. is byte)
    :u1         # level 1 support (. is codepoint)
    :u2         # level 2 support (. is grapheme)
    :u3         # level 3 support (. is language dependent)

These modifiers say nothing about the state of the data, but in general internal Perl data will already be in Normalization Form C, so even under :u1, the precomposed characters will usually do the right thing. Note that these modifiers are for overriding the default support level, which was probably set by pragma at the top of the file.
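
The difference between the byte and codepoint levels, and the effect of Normalization Form C, can be seen with a decomposed "e" plus combining acute accent. The Python sketch below is only an analogy: Python's standard `re` module has no grapheme level, so only the first two levels are shown.

```python
import re
import unicodedata

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Byte level: "." matches one byte of the UTF-8 encoding.
byte_count = len(re.findall(rb".", decomposed.encode("utf-8")))

# Codepoint level: "." matches one codepoint.
codepoint_count = len(re.findall(r".", decomposed))

# After NFC the pair becomes one precomposed codepoint, so the
# codepoint level already "does the right thing".
composed = unicodedata.normalize("NFC", decomposed)
composed_count = len(re.findall(r".", composed))

print(byte_count, codepoint_count, composed_count)
```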

Finally, there's the :p5 modifier, which causes the rest of the regex (or group) to be parsed as a Perl 5 regular expression, including any interpolated strings. (But it still doesn't enable Perl 5's trailing modifiers.)

    Old                 New
    ---                 ---
    ?pat?               m:once/pat/             # match once only
    /pat/i              m:i/pat/                # ignorecase
                        /:i pat/                # ignorecase
    /pat/x              /pat/                   # always extended
    /pat\s*pat/         /:w pat pat/            # match word sequence
    /(?i)$p5pat/        m:p5/(?i)$p5pat/        # use Perl 5 syntax
    $n = () = /.../g    $n = +/.../;            # count occurrences
    for $i (1..3){s///} s:3x///;                # do 3 times
    /^pat$/m            /^^pat$$/               # no more /m
    /./s                /./                     # no more /s
    /./                 /\N/                    # . now works like /s
