June 2002 Archives

Synopsis 5

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.

A summary of the changes in Apocalypse 5:

Unchanged features

  • Capturing: (...)
  • Repetition quantifiers: *, +, and ?
  • Alternatives: |
  • Backslash escape: \
  • Minimal matching suffix: ??, *?, +?

Modifiers

  • The extended syntax (/x) is no longer required...it's the default.

  • There are no /s or /m modifiers (changes to the meta-characters replace them - see below).

  • There is no /e evaluation modifier on substitutions; use s/pattern/$( code() )/ instead.
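
  • For comparison, a minimal runnable sketch of the Perl 5 /e behavior
    being replaced here:

        my $str = "price: 21";
        $str =~ s/(\d+)/$1 * 2/e;    # evaluate the replacement as code
        print "$str\n";              # prints "price: 42"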

  • The /g modifier has been renamed to :e (for each).

  • Modifiers are now placed as adverbs at the start of a match/substitution:
        @matches = m:ei/\s* (\w*) \s* ,?/;

  • The single-character modifiers also have longer versions:
            :i        :ignorecase
            :e        :each

  • The :c (or :cont) modifier causes the match to continue from the string's current .pos:
        m:c/ pattern /        # start at end of
                              # previous match on $_

  • The new :o (:once) modifier replaces the Perl 5 ?...? syntax:
        m:once/ pattern /    # only matches first time

  • The new :w (:word) modifier causes whitespace sequences to be replaced by a \s* or \s+ subpattern:
        m:w/ next cmd =   <condition>/

  • Same as:
        m/ \s* next \s+ cmd \s* = \s* <condition>/

  • The new :uN modifier specifies Unicode level:
        m:u0/ .<2> /        # match two bytes
        m:u1/ .<2> /        # match two codepoints
        m:u2/ .<2> /        # match two graphemes
        m:u3/ .<2> /        # match language dependently

  • The new :p5 modifier allows Perl 5 regex syntax to be used instead:
        m:p5/(?mi)^[a-z]{1,2}(?=\s)/
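
  • For comparison, the same pattern run under plain Perl 5 (a small
    runnable sketch):

        my $text = "ab cdef\nx y\n";
        my @hits = $text =~ /(?mi)^[a-z]{1,2}(?=\s)/g;
        print "@hits\n";             # prints "ab x"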

  • Any integer modifier specifies a count. What kind of count is determined by the character that follows.

  • If followed by an x, it means repetition:
        s:4x{ (<ident>) = (\N+) $$}{$1 => $2};
        # same as:
        s{ (<ident>) = (\N+) $$}{$1 => $2} for 1..4;

  • If followed by an st, nd, rd, or th, it means find the Nth occurrence:
        s:3rd/(\d+)/@data[$1]/;
        # same as:
        m/(\d+)/ && m:c/(\d+)/ && s:c/(\d+)/@data[$1]/;
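
  • A runnable Perl 5 sketch of the usual workaround, a counter inside
    s///ge:

        my $str = "a1 b2 c3 d4";
        my $n   = 0;
        $str =~ s/(\d+)/++$n == 3 ? 'X' : $1/ge;   # change only the 3rd match
        print "$str\n";                            # prints "a1 b2 cX d4"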

  • With the new :any modifier, the regex will match every possible way (including overlapping) and return all matches.
        $str = "abracadabra";
        @substrings = $str =~ m:any/ a (.*) a /;
        # br brac bracad bracadabr c cad cadabr d dabr br
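
  • A Perl 5 sketch that reproduces those overlapping matches for this
    particular pattern by enumerating every pair of a's (Perl 6's :any
    would do this for any regex):

        my $str = "abracadabra";
        my @pos;
        push @pos, pos($str) - 1 while $str =~ /a/g;    # index of each 'a'
        my @substrings;
        for my $i (0 .. $#pos - 1) {
            for my $j ($i + 1 .. $#pos) {
                push @substrings,
                    substr $str, $pos[$i] + 1, $pos[$j] - $pos[$i] - 1;
            }
        }
        print "@substrings\n";    # br brac bracad bracadabr c cad ...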

  • The :i, :w, :c, :uN, and :p5 modifiers can be placed inside the regex (and are lexically scoped):
        m/:c alignment = [:i left|right|cent[er|re]] /

  • User-defined modifiers will be possible:
            m:fuzzy/pattern/;

  • Single letter flags can be ``chained'':
            s:ewi/cat/feline/

  • User-defined modifiers can also take arguments:
            m:fuzzy('bare')/pattern/;

  • Hence parentheses are no longer valid regex delimiters.


Changed metacharacters

  • A dot . now matches any character including newline. (The /s modifier is gone.)

  • ^ and $ now always match the start/end of a string, like the old \A and \z. (The /m modifier is gone.)

  • A $ no longer matches an optional preceding \n so it's necessary to say \n?$ if that's what you mean.

  • \n now matches a logical (platform independent) newline not just \012.

  • The \A, \Z, and \z metacharacters are gone.


New metacharacters

  • Because /x is default:
    • # now always introduces a comment.

    • Whitespace is now always metasyntactic, i.e. used only for layout and not matched literally (but see the :w modifier described above).

  • ^^ and $$ match line beginnings and endings. (The /m modifier is gone.)

  • . matches an ``anything'', while \N matches an ``anything except newline''. (The /s modifier is gone.)


Bracket rationalization

  • (...) still delimits a capturing group.

  • [...] is no longer a character class.

  • It now delimits a non-capturing group.

  • {...} is no longer a repetition quantifier.

  • It now delimits an embedded closure.

  • You can call Perl code as part of a regex match.

  • Embedded code does not usually affect the match - it is only used for side-effects:
        / (\S+) { print "string not blank\n"; $text = $1; }
           \s+  { print "but does contain whitespace\n" }
        /

  • It can affect the match if it calls fail:
        / (\d+) {$1<256 or fail} /

  • <...> are now extensible metasyntax delimiters or ``assertions'' (i.e. they replace (?...)).


Variable (non-)interpolation

  • In Perl 6 regexes, variables don't interpolate.

  • Instead they're passed ``raw'' to the regex engine, which can then decide how to handle them (more on that below).

  • The default way in which the engine handles a scalar is to match it as a \Q[...] literal (i.e. it does not treat the interpolated string as a subpattern).

  • In other words, a Perl 6: / $var /

    is like a Perl 5: / \Q$var\E /
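
  • That Perl 5 behavior, as a runnable reminder:

        my $var = "1+1";
        print "matched literally\n" if "1+1=2" =~ /\Q$var\E/;
        # without \Q...\E, the '+' would act as a quantifier and the
        # match would fail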


  • (To get regex interpolation use an assertion - see below)

  • An interpolated array:
        / @cmds /

    is matched as if it were an alternation of its elements:

        / [ @cmds[0] | @cmds[1] | @cmds[2] | ... ] /

  • And, of course, each one is matched as a literal.
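
  • A Perl 5 approximation of that array behavior uses quotemeta and join
    (a sketch):

        my @cmds = ('stop', 'start', 're-start');
        my $alt  = join '|', map quotemeta, @cmds;   # each element as a literal
        print "known command\n" if 're-start' =~ /\A(?:$alt)\z/;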

  • An interpolated hash matches a /\w+/ and then requires that sequence to be a valid key of the hash.

  • So:
        / %cmds /

  • is like:
        / (\w+) { fail unless exists %cmds{$1} } /
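
  • A runnable Perl 5 version of that equivalence (with a hypothetical %cmds):

        my %cmds  = (start => 1, stop => 1);
        my $input = "stop the server";
        if ($input =~ /(\w+)/ and exists $cmds{$1}) {
            print "recognized command: $1\n";
        }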


Extensible metasyntax (<...>)

  • The first character after < determines the behaviour of the assertion.

  • A leading alphabetic character means it's a grammatical assertion (i.e. a subpattern or a named character class - see below):
        / <sign>? <mantissa> <exponent>? /

  • The special named assertions include:
        / <before pattern> /    # was /(?=pattern)/
        / <after pattern> /     # was /(?<=pattern)/
                                # but now a real pattern!
        / <ws> /                # match any whitespace
        / <sp> /                # match a space char

  • A leading number, pair of numbers, or pair of scalars means it's a repetition specifier:
        / value was (\d<1,6>) with (\w<$m,$n>) /

  • A leading $, @, %, or & interpolates a variable or subroutine return value as a regex rather than as a literal:
        / <$key_pat> = <@value_alternatives> /

  • A leading ( indicates a code assertion:
        / (\d<1,3>) <( $1 < 256 )> /

  • Same as:
        / (\d<1,3>) {$1<256 or fail} /

  • A leading { indicates code that produces a regex to be interpolated into the pattern at that point:
        / (<ident>)  <{ cache{$1} //= get_body($1) }> /

  • A leading [ indicates an enumerated character class:
        / <[a-z_]>* /

  • A leading - indicates a complemented character class:
        / <-[a-z_]> <-<alpha>> /

  • A leading ' indicates an interpolated literal match (including whitespace):
        / <'match this exactly (whitespace matters)'> /

  • The special assertion <.> matches any logical grapheme (including Unicode combining character sequences):
        / seekto = <.> /  # Maybe a combined char

  • A leading ! indicates a negated meaning (a zero-width assertion except for repetition specifiers):
        / <!before _>    # We aren't before an _
          \w<!1,3>       # We match 0 or >3 word chars
        /


Backslash reform

  • The \p and \P properties become intrinsic grammar rules (<prop ...> and <!prop ...>).

  • The \L...\E, \U...\E, and \Q...\E sequences become \L[...], \U[...], and \Q[...] (\E is gone).

  • Note that \Q[...] will rarely be needed since raw variables interpolate as eq matches, rather than regexes.

  • Backreferences (e.g. \1) are gone; $1 can be used instead, because it's no longer interpolated.

  • New backslash sequences, \h and \v, match horizontal and vertical whitespace respectively, including Unicode.

  • \s now matches any Unicode whitespace character.

  • The new backslash sequence \N matches anything except a logical newline; it is the negation of \n.

  • A series of other new capital backslash sequences are also the negation of their lower-case counterparts:
    • \H matches anything but horizontal whitespace.

    • \V matches anything but vertical whitespace.

    • \T matches anything but a tab.

    • \R matches anything but a return.

    • \F matches anything but a formfeed.

    • \E matches anything but an escape.

    • \X... matches anything but the specified hex character.


Regexes are rules

  • The Perl 5 qr/pattern/ regex constructor is gone.

  • The Perl 6 equivalents are:
        rule { pattern }    # always takes {...} as delimiters
        rx/ pattern /       # can take (almost any) chars as delimiters

  • If either needs modifiers, they go before the opening delimiter:
        $regex = rule :ewi { my name is (.*) };
        $regex = rx:ewi/ my name is (.*) /;

  • The name of the constructor was changed from qr because it's no longer an interpolating quote-like operator.

  • As the syntax indicates, it is now more closely analogous to a sub {...} constructor.

  • In fact, that analogy will run very deep in Perl 6.

  • Just as a raw {...} is now always a closure (which may still execute immediately in certain contexts and be passed as a reference in others)...

  • ...so too a raw /.../ is now always a regex (which may still match immediately in certain contexts and be passed as a reference in others).

  • Specifically, a /.../ matches immediately in a void or Boolean context, or when it is an explicit argument of a =~.

  • Otherwise it's a regex constructor.

  • So this:
        $var = /pattern/;

    no longer does the match and sets $var to the result.


  • Instead it assigns a regex reference to $var.

  • The two cases can always be distinguished using m{...} or rx{...}:
        $var = m{pattern};    # Match regex, assign result
        $var = rx{pattern};   # Assign regex itself

  • Note that this means that former magically lazy usages like:
        @list = split /pattern/, $str;

    are now just consequences of the normal semantics.


  • It's now also possible to set up a user-defined subroutine that acts like grep:
        sub my_grep($selector, *@list) {
            given $selector {
                when RULE  { ... }
                when CODE  { ... }
                when HASH  { ... }
                # etc.
            }
        }

  • Using {...} or /.../ in the scalar context of the first argument causes it to produce a CODE or RULE reference, which the switch statement then selects upon.


Backtracking control

  • Backtracking over a single colon causes the regex engine not to retry the preceding atom:
        m:w/ \( <expr> [ , <expr> ]* : \) /

    (i.e. there's no point trying fewer <expr> matches, if there's no closing parenthesis on the horizon)
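
  • For comparison, Perl 5's atomic grouping (?>...) (available since
    5.005) gives roughly the same "don't retry" effect for a single atom:

        "aaab" =~ /a+ab/     and print "plain a+ gives back an 'a'\n";
        "aaab" =~ /(?>a+)ab/ or  print "atomic a+ refuses; no match\n";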


  • Backtracking over a double colon causes the surrounding group to immediately fail:
        m:w/ [ if :: <expr> <block>
             | for :: <list> <block>
             | loop :: <loop_controls>? <block>
             ]
        /

    (i.e. there's no point trying to match a different keyword if one was already found but failed)


  • Backtracking over a triple colon causes the current rule to fail outright (no matter where in the rule it occurs):
        rule ident {
              ( [<alpha>|_] \w* ) ::: { fail if %reserved{$1} }
            | " [<alpha>|_] \w* "
        }
        m:w/ get <ident>? /

    (i.e. using an unquoted reserved word as an identifier is not permitted)


  • Backtracking over a <commit> assertion causes the entire match to fail outright, no matter how many subrules down it happens:
        rule subname {
            ([<alpha>|_] \w*) <commit> { fail if %reserved{$1} }
        }
        m:w/ sub <subname>? <block> /

    (i.e. using a reserved word as a subroutine name is instantly fatal to the ``surrounding'' match as well)


  • A <cut> assertion always matches successfully, and has the side effect of deleting the parts of the string already matched.

  • Attempting to backtrack past a <cut> causes the complete match to fail (like backtracking past a <commit>). This is because there's now no preceding text to backtrack into.

  • This is useful for throwing away successfully processed input when matching from an input stream or an iterator of arbitrary length.


Named Regexes

  • The analogy between sub and rule extends much further.

  • Just as you can have anonymous subs and named subs...

  • ...so too you can have anonymous regexes and named regexes:
        rule ident { [<alpha>|_] \w* }
        # and later...
        @ids = grep /<ident>/, @strings;

  • As the above example indicates, it's possible to refer to named regexes, such as:
        rule serial_number { <[A-Z]> \d<8> }
        rule type { alpha | beta | production | deprecated | legacy }

    in other regexes as named assertions:

        rule identification { [soft|hard]ware <type> <serial_number> }
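
  • The closest Perl 5 idiom is composing precompiled qr// patterns (a
    sketch of the same grammar fragment):

        my $serial_number  = qr/[A-Z]\d{8}/;
        my $type           = qr/alpha|beta|production|deprecated|legacy/;
        my $identification =
            qr/(?:soft|hard)ware \s+ (?:$type) \s+ $serial_number/x;
        print "parsed\n" if "software beta A12345678" =~ $identification;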


Nothing is illegal

  • The null pattern is now illegal.

  • To match whatever the prior successful regex matched, use:
        /<prior>/

  • To match the zero-width string, use:
        /<null>/


Hypothetical variables

  • In embedded closures it's possible to bind a variable to a value that only ``sticks'' if the surrounding pattern successfully matches.

  • A variable is declared with the keyword let and then bound to the desired value:
        / (\d+) {let $num := $1} (<alpha>+)/

  • Now $num will only be bound if the digits are actually found.

  • If the match ever backtracks past the closure (i.e. if there are no alphabetics following), the binding is ``undone''.

  • This is even more interesting in alternations:
        / [ (\d+)      { let $num   := $1 }
          | (<alpha>+) { let $alpha := $2 }
          | (.)        { let $other := $3 }
          ]
        /

  • There is also a shorthand for assignment to hypothetical variables:
        / [ $num   := (\d+)
          | $alpha := (<alpha>+)
          | $other := (.)
          ]
        /

  • The numeric variables ($1, $2, etc.) are also ``hypothetical''.

  • Numeric variables can be assigned to, and even re-ordered:
        my ($key, $val) = m:w{ $1:=(\w+) =\> $2:=(.*?)
                             | $2:=(.*?) \<= $1:=(\w+)
                             };

  • Repeated captures can be bound to arrays:
        / @values:=[ (.*?) , ]* /

  • Pairs of repeated captures can be bound to hashes:
        / %options:=[ (<ident>) = (\N+) ]* /

  • Or just capture the keys (and leave the values undef):
        / %options:=[ (<ident>) = \N+ ]* /

  • Subrules (e.g. <rule>) also capture their result in a hypothetical variable of the same name as the rule:
        / <key> =\> <value> { %hash{$key} = $value } /


Return values from matches

  • A match always returns a ``match object'', which is also available as the (lexical) $0.

  • The match object evaluates differently in different contexts:
    • in boolean context it evaluates as true or false (i.e. did the match succeed?):
          if /pattern/ {...}
          # or:
          /pattern/; if $0 {...}

    • in numeric context it evaluates to the number of matches:
          $match_count += m:e/pattern/;

    • in string context it evaluates to the captured substring (if there was exactly one capture in the pattern) or to the entire text that was matched (if the pattern does not capture, or captures multiple elements):
          print %hash{$text =~ /,? (<ident>)/};
          # or: 
          $text =~ /,? (<ident>)/  &&  print %hash{$0};

  • Within a regex, $0 acts like a hypothetical variable.

  • It controls what a regex match returns (like $$ does in yacc).

  • Use $0:= to override the default return behaviour described above:
        rule string1 { (<["'`]>) ([ \\. | <-[\\]> ]*?) $1 }
        $match = m/<string1>/;  # default: $match includes 
                                # opening and closing quotes
        rule string2 { (<["'`]>) $0:=([ \\. | <-[\\]> ]*?) $1 }
        $match = m/<string2>/;  # $match now excludes quotes
                                # because $0 explicitly bound 
                                # to second capture only


Matching against non-strings

  • Anything that can be tied to a string can be matched against a regex. This feature is particularly useful with input streams:
        my @array := <$fh>;           # lazy when aliased
        my $array is from(\@array);   # tie scalar
        # and later...
        $array =~ m/pattern/;         # match from stream


Grammars

  • Potential ``collision'' problem with named regexes

  • Of course, a named ident regex shouldn't clobber someone else's ident regex.

  • So some mechanism is needed to confine regexes to a namespace.

  • If subs are the model for rules, then modules/classes are the obvious model for aggregating them.

  • Such collections of rules are generally known as ``grammars''.

  • Just as a class can collect named actions together:
        class Identity {
            method name { "Name = $.name" }
            method age  { "Age  = $.age"  }
            method addr { "Addr = $.addr" }
            method desc {
                print .name(), "\n",
                      .age(),  "\n",
                      .addr(), "\n";
            }
            # etc.
        }

  • So too a grammar can collect a set of named rules together:
        grammar Identity {
            rule name :w { Name = (\N+) }
            rule age  :w { Age  = (\d+) }
            rule addr :w { Addr = (\N+) }
            rule desc {
                <name> \n
                <age>  \n
                <addr> \n
            }
            # etc.
        }

  • Like classes, grammars can inherit:
        grammar Letter {
            rule text     { <greet> <body> <close> }
            rule greet :w { [Hi|Hey|Yo] $to:=(\S+?) , $$}
            rule body     { <line>+ }
            rule close :w { Later dude, $from:=(.+) }
            # etc.
        }
        grammar FormalLetter is Letter {
            rule greet :w { Dear $to:=(\S+?) , $$}
            rule close :w { Yours sincerely, $from:=(.+) }
        }

  • Derived grammars inherit rule definitions (polymorphically!).

  • So there's no need to respecify body, line, etc.

  • Perl 6 will come with at least one grammar predefined:
        grammar Perl {    # Perl's own grammar
            rule prog { <line>* }
            rule line { <decl>
                      | <loop>
                      | <label> [<cond>|<sideff>|;]
            }
            rule decl { <sub> | <class> | <use> }
            # etc. etc.
        }

  • Hence:
        given $source_code {
            $parsetree = m/<Perl::prog>/;
        }


Transliteration

  • The tr/// quote-like operator now also has a subroutine form.

  • It can be given either a single PAIR:
        $str =~ tr( 'A-C' => 'a-c' );

  • Or a hash (or hash ref):
        $str =~ tr( {'A'=>'a', 'B'=>'b', 'C'=>'c'} );
        $str =~ tr( {'A-Z'=>'a-z', '0-9'=>'A-F'} );
        $str =~ tr( %mapping );

  • Or two arrays (or array refs):
        $str =~ tr( ['A'..'C'], ['a'..'c'] );
        $str =~ tr( @UPPER, @lower );

  • Note that the array version can map one-or-more characters to one-or-more characters:
        $str =~ tr( [' ',      '<',    '>',    '&'    ],
                    ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);
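
  • In Perl 5, a multi-character mapping like this is usually done with a
    hash and s///g rather than tr (a runnable sketch):

        my $str = 'a < b & c';
        my %map = (' ' => '&nbsp;', '<' => '&lt;',
                   '>' => '&gt;',   '&' => '&amp;');
        $str =~ s/([ <>&])/$map{$1}/g;
        print "$str\n";   # a&nbsp;&lt;&nbsp;b&nbsp;&amp;&nbsp;c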



Improving mod_perl Sites' Performance: Part 2

In this article we will talk about the tools that we need before we can start working on the performance of our service.

Essential Tools

In order to improve performance, we need measurement tools. The main tool categories are benchmarking and code profiling.

It's important to understand that in most of the benchmarks we will run, we will look not at the absolute numbers but at the relation between two or more result sets, since the purpose of the benchmarks is to show which coding approach is preferable. You shouldn't compare the absolute results presented in these articles with those you get while running the same benchmarks on your machine, since you won't have the same hardware and software setup anyway; that kind of comparison would be misleading. Compare only the relative results from tests run on your own machine.
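
For example, the cmpthese() function from the Benchmark module reports exactly this kind of relative comparison between two coding approaches. A minimal sketch:


  use Benchmark qw(cmpthese);

  cmpthese(-2, {                    # run each sub for ~2 CPU seconds
      concat => sub { my $s = ''; $s .= 'x' for 1 .. 1_000 },
      join   => sub { my $s = join '', ('x') x 1_000 },
  });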

Benchmarking Applications

How much faster is mod_perl than mod_cgi (aka plain Perl/CGI)? There are many ways to benchmark the two. I'll present a few examples and numbers below. Check out the benchmark directory of the mod_perl distribution for more examples.

There is no need to write a special benchmark though. If you want to impress your boss or colleagues, then just take some heavy CGI script you have (e.g. a script that crunches some data and prints the results to STDOUT), open two xterms and call the same script in mod_perl mode in one xterm and in mod_cgi mode in the other. You can use lwp-get from the LWP package to emulate the browser. The benchmark directory of the mod_perl distribution includes such an example.

Benchmarking Perl Code

If you are going to write your own benchmarking utility, then use the Benchmark module and the Time::HiRes module where you need better time precision (<10msec).

An example of the Benchmark.pm module usage:


  benchmark.pl
  ------------
  use Benchmark;
  
  timethis (1_000,
   sub {
    my $x = 100;
    my $y;                               # declare first: 'my' must not take
    $y = log($x ** 100) for 0..10000;    # a 'for' statement modifier
  });

  % perl benchmark.pl
  timethis 1000: 25 wallclock secs (24.93 usr +  0.00 sys = 24.93 CPU)

If you want to get the benchmark results in microseconds, then you will have to use the Time::HiRes module. Its usage is similar to Benchmark's.


  use Time::HiRes qw(gettimeofday tv_interval);
  my $start_time = [ gettimeofday ];
  sub_that_takes_a_teeny_bit_of_time();
  my $end_time = [ gettimeofday ];
  my $elapsed = tv_interval($start_time,$end_time);
  print "The sub took $elapsed seconds.\n";

Benchmarking a Graphic Hits Counter with Persistent DB Connections

Here are the numbers from Michael Parker's mod_perl presentation at the Perl Conference (August 1998). The script is a standard hits counter, but it logs the counts into a MySQL relational database:


    Benchmark: timing 100 iterations of cgi, perl...  [rate 1:28]
    
    cgi: 56 secs ( 0.33 usr 0.28 sys = 0.61 cpu) 
    perl: 2 secs ( 0.31 usr 0.27 sys = 0.58 cpu) 
    
    Benchmark: timing 1000 iterations of cgi,perl...  [rate 1:21]
     
    cgi: 567 secs ( 3.27 usr 2.83 sys = 6.10 cpu) 
    perl: 26 secs ( 3.11 usr 2.53 sys = 5.64 cpu)      
    
    Benchmark: timing 10000 iterations of cgi, perl   [rate 1:21]
     
    cgi: 6494 secs (34.87 usr 26.68 sys = 61.55 cpu) 
    perl: 299 secs (32.51 usr 23.98 sys = 56.49 cpu)

We don't know what server configurations were used for these tests, but I guess the numbers speak for themselves.

The source code of the script was available online, but, sadly, isn't anymore. However, you can reproduce the same performance speedup with pretty much any CGI script written in Perl.

Benchmarking Response Times With ApacheBench

ApacheBench (ab) is a tool for benchmarking your Apache HTTP server. It is designed to give you an idea of the performance of your current Apache installation. In particular, it shows you how many requests per second your Apache server is capable of serving. The ab tool comes bundled with the Apache source distribution.

Let's try it. We will simulate 10 users concurrently requesting a light script at www.example.com/perl/test.pl. Each simulated user makes 10 requests.


  % ./ab -n 100 -c 10 www.example.com/perl/test.pl

The results are:


  Document Path:          /perl/test.pl
  Document Length:        319 bytes
  
  Concurrency Level:      10
  Time taken for tests:   0.715 seconds
  Complete requests:      100
  Failed requests:        0
  Total transferred:      60700 bytes
  HTML transferred:       31900 bytes
  Requests per second:    139.86
  Transfer rate:          84.90 kb/s received
  
  Connection Times (ms)
                min   avg   max
  Connect:        0     0     3
  Processing:    13    67    71
  Total:         13    67    74

We can see that under a load of 10 concurrent users our server is capable of processing 140 requests per second. Of course, this result is valid only for this particular test script. We also learn the average processing time, which in this case was 67 milliseconds. Other numbers reported by ab may or may not be of interest to you.

For example, if we believe that the script perl/test.pl is not efficient, then we will try to improve it and run the benchmark again to see whether we have any improvement in performance.

Benchmarking Response Times With httperf

httperf is a utility written by David Mosberger. Just like ApacheBench, it measures the performance of the Web server.

A sample command line is shown below:


  httperf --server hostname --port 80 --uri /test.html \
   --rate 150 --num-conn 27000 --num-call 1 --timeout 5

This command causes httperf to use the Web server on the host named hostname, running at port 80. The Web page being retrieved is /test.html and, in this simple test, the same page is retrieved repeatedly. The rate at which requests are issued is 150 per second. The test involves initiating a total of 27,000 TCP connections, and on each connection one HTTP call is performed. A call consists of sending a request and receiving a reply.

The timeout option defines the number of seconds that the client is willing to wait to hear back from the server. If this timeout expires, then the tool considers the corresponding call to have failed. Note that with a total of 27,000 connections and a rate of 150 per second, the total test duration will be approximately 180 seconds (27,000/150), independently of what load the server can actually sustain. Here is a result that one might get:


     Total: connections 27000 requests 26701 replies 26701 test-duration 179.996 s
    
     Connection rate: 150.0 conn/s (6.7 ms/conn, <=47 concurrent connections)
     Connection time [ms]: min 1.1 avg 5.0 max 315.0 median 2.5 stddev 13.0
     Connection time [ms]: connect 0.3
     
     Request rate: 148.3 req/s (6.7 ms/req)
     Request size [B]: 72.0
     
     Reply rate [replies/s]: min 139.8 avg 148.3 max 150.3 stddev 2.7 (36 samples)
     Reply time [ms]: response 4.6 transfer 0.0
     Reply size [B]: header 222.0 content 1024.0 footer 0.0 (total 1246.0)
     Reply status: 1xx=0 2xx=26701 3xx=0 4xx=0 5xx=0
     
     CPU time [s]: user 55.31 system 124.41 (user 30.7% system 69.1% total 99.8%)
     Net I/O: 190.9 KB/s (1.6*10^6 bps)
     
     Errors: total 299 client-timo 299 socket-timo 0 connrefused 0 connreset 0
     Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0

Benchmarking Response Times With http_load

http_load is yet another utility that does Web server load testing. It can simulate a 33.6 kbps modem connection (-throttle) and allows you to provide a file with a list of URLs, which will be fetched randomly. You can specify how many parallel connections to run with the -parallel N option, or you can specify the number of requests to generate per second with the -rate N option. Finally, you can tell the utility when to stop by specifying either the test time length (-seconds N) or the total number of fetches (-fetches N).

A sample run, with the file urls containing:


  http://www.example.com/foo/
  http://www.example.com/bar/

We ask it to generate three requests per second and to run for only two seconds. Here is the generated output:


  % ./http_load -rate 3 -seconds 2 urls
  http://www.example.com/foo/: check-connect SUCCEEDED, ignoring
  http://www.example.com/bar/: check-connect SUCCEEDED, ignoring
  http://www.example.com/bar/: check-connect SUCCEEDED, ignoring
  http://www.example.com/bar/: check-connect SUCCEEDED, ignoring
  http://www.example.com/foo/: check-connect SUCCEEDED, ignoring
  5 fetches, 3 max parallel, 96870 bytes, in 2.00258 seconds
  19374 mean bytes/connection
  2.49678 fetches/sec, 48372.7 bytes/sec
  msecs/connect: 1.805 mean, 5.24 max, 0.79 min
  msecs/first-response: 291.289 mean, 560.338 max, 34.349 min

So you can see that it has reported 2.5 requests per second. Of course, for the real test you will want to load the server heavily and run the test for a longer time to get more reliable results.

Note that when you provide a file with a list of URLs, make sure that you don't have empty lines in it. If you do, then the utility won't work, complaining:


  ./http_load: unknown protocol -

Benchmarking Response Times With crashme Script

This is another crashme suite, originally written by Michael Schilli (it was located at the http://www.linux-magazin.de site, but the link is now gone). I made a few modifications, mostly adding my() declarations. I also allowed it to accept more than one URL to test, since sometimes you want to test more than one script.

The tool provides the same results as ab above, but it also allows you to set the timeout value, so requests will fail if not served within the timeout period. You also get values for Latency (seconds per request) and Throughput (requests per second). It can do a complete simulation of your favorite Netscape browser :) and give you a better picture.

I have noticed while running these two benchmarking suites that ab gave me results between two-and-a-half and three times better. Both suites were run on the same machine, with the same load and the same parameters, but the implementations were different.

Sample output:


  URL(s):          http://www.example.com/perl/access/access.cgi
  Total Requests:  100
  Parallel Agents: 10
  Succeeded:       100 (100.00%)
  Errors:          NONE
  Total Time:      9.39 secs
  Throughput:      10.65 Requests/sec
  Latency:         0.85 secs/Request

And the code:


  #!/usr/bin/perl -w
  
  use LWP::Parallel::UserAgent;
  use Time::HiRes qw(gettimeofday tv_interval);
  use strict;
  
  ###
  # Configuration
  ###
  
  my $nof_parallel_connections = 10; 
  my $nof_requests_total = 100; 
  my $timeout = 10;
  my @urls = (
            'http://www.example.com/perl/faq_manager/faq_manager.pl',
            'http://www.example.com/perl/access/access.cgi',
           );
  
  
  ##################################################
  # Derived Class for latency timing
  ##################################################
  
  package MyParallelAgent;
  @MyParallelAgent::ISA = qw(LWP::Parallel::UserAgent);
  use strict;
  
  ###
  # Is called when connection is opened
  ###
  sub on_connect {
    my ($self, $request, $response, $entry) = @_;
    $self->{__start_times}->{$entry} = [Time::HiRes::gettimeofday];
  }
  
  ###
  # Called when a connection is closed
  ###
  sub on_return {
    my ($self, $request, $response, $entry) = @_;
    my $start = $self->{__start_times}->{$entry};
    $self->{__latency_total} += Time::HiRes::tv_interval($start);
  }
  
  sub on_failure {
    on_return(@_);  # Same procedure
  }
  
  ###
  # Access function for new instance var
  ###
  sub get_latency_total {
    return shift->{__latency_total};
  }
  
  ##################################################
  package main;
  ##################################################
  ###
  # Init parallel user agent
  ###
  my $ua = MyParallelAgent->new();
  $ua->agent("pounder/1.0");
  $ua->max_req($nof_parallel_connections);
  $ua->redirect(0);    # No redirects
  
  ###
  # Register all requests
  ###
  foreach (1..$nof_requests_total) {
    foreach my $url (@urls) {
      my $request = HTTP::Request->new('GET', $url);
      $ua->register($request);
    }
  }
  
  ###
  # Launch processes and check time
  ###
  my $start_time = [gettimeofday];
  my $results = $ua->wait($timeout);
  my $total_time = tv_interval($start_time);
  
  ###
  # Requests all done, check results
  ###
  
  my $succeeded     = 0;
  my %errors = ();
  
  foreach my $entry (values %$results) {
    my $response = $entry->response();
    if($response->is_success()) {
      $succeeded++; # Another satisfied customer
    } else {
      # Error, save the message
      $response->message("TIMEOUT") unless $response->code();
      $errors{$response->message}++;
    }
  }
  
  ###
  # Format errors if any from %errors 
  ###
  my $errors = join(',', map "$_ ($errors{$_})", keys %errors);
  $errors = "NONE" unless $errors;
  
  ###
  # Format results
  ###
  
  #@urls = map {($_,".")} @urls;
  my @P = (
        "URL(s)"          => join("\n\t\t ", @urls),
        "Total Requests"  => "$nof_requests_total",
        "Parallel Agents" => $nof_parallel_connections,
        "Succeeded"       => sprintf("$succeeded (%.2f%%)\n",
                                   $succeeded * 100 / $nof_requests_total),
        "Errors"          => $errors,
        "Total Time"      => sprintf("%.2f secs\n", $total_time),
        "Throughput"      => sprintf("%.2f Requests/sec\n", 
                                   $nof_requests_total / $total_time),
        "Latency"         => sprintf("%.2f secs/Request", 
                                   ($ua->get_latency_total() || 0) / 
                                   $nof_requests_total),
       );
  
  my ($left, $right);
  ###
  # Print out statistics
  ###
  format STDOUT =
  @<<<<<<<<<<<<<<< @*
  "$left:",        $right
  .
  
  while(($left, $right) = splice(@P, 0, 2)) {
    write;
  }

Benchmarking PerlHandlers

The Apache::Timeit module does PerlHandler benchmarking. With the help of this module you can log the time taken to process each request, just as you'd use the Benchmark module to benchmark a regular Perl script. Of course, you can extend this module to perform more advanced processing, like putting the results into a database for later analysis. All it takes is adding this configuration directive inside httpd.conf:


  PerlFixupHandler Apache::Timeit

Since scripts running under Apache::Registry run inside a PerlHandler, they are benchmarked as well.

An example of the lines which show up in the error_log file:


  timing request for /perl/setupenvoff.pl:
    0 wallclock secs ( 0.04 usr +  0.01 sys =  0.05 CPU)
  timing request for /perl/setupenvoff.pl:
    0 wallclock secs ( 0.03 usr +  0.00 sys =  0.03 CPU)

The Apache::Timeit package is a part of the Apache-Perl-contrib files collection available from CPAN.


Where Wizards Fear To Tread

One of the big new features in perl 5.8 is that we now have real working threads available to us through the threads pragma.

However, those of us who write modules, and who already have to support them on different versions of perl and different platforms, now have to deal with another case: threads! This article will show you how threads relate to modules, how we can take old modules and make them thread-safe, and round off with a new module that alters perl's behavior regarding the "current working directory".

To run the examples shown here, you need perl 5.8 RC1 or later compiled with threads. On Unix, you can use Configure -Duseithreads -Dusethreads; on Win32, the default build always has threading enabled.

How do threads relate to modules?

Threading in Perl is based on the notion of explicit shared data. That is, only data that is explicitly requested to be shared will be shared between threads. This is controlled by the threads::shared pragma and the ":shared" attribute. Witness how it works:


     use threads;
     my $var = 1;
     threads->create(sub { $var++ })->join();
     print $var;

If you are accustomed to threading in most other languages (Java/C), you would expect $var to contain 2 and the output of this script to be "2". However, since Perl does not share data between threads, $var is copied into the thread and incremented only there. The original value in the main thread is not changed, so the output is "1".

However, if we add in threads::shared and a :shared attribute, we get the desired result:


     use threads;
     use threads::shared;
     my $var : shared = 1;
     threads->create(sub { $var++ })->join();
     print $var;

Now the result will be "2", since we declared $var to be a shared variable. Perl will then act on the same variable and provide automatic locking to keep the variable out of trouble.

This makes it quite a bit simpler for us module developers to make sure our modules are thread-safe. Essentially, all pure Perl modules are thread-safe because any global state data, which is usually what gives you thread-safety problems, is by default local to each thread.

Definition of thread-safe levels

To define what we mean by thread-safety, here are some terms adapted from the Solaris thread-safety levels.

thread-safe
This module can safely be used from multiple threads. The effect of calling into a thread-safe module is that the results are valid even when called by multiple threads. However, thread-safe modules can still have global consequences; for example, sending or reading data from a socket affects all threads that are working with that socket. The application has the responsibility to act sanely with regard to threads. If one thread creates a file named file.tmp, then another thread that tries to create it will fail; this is not the fault of the module.
thread-friendly
Thread-friendly modules are thread-safe modules that know about and provide special functions for working with threads, or that use threads themselves. A typical example of this is the core Thread::Queue module. One could also imagine a thread-friendly module with a cache declaring that cache shared between threads, to make hits more likely and to save memory.
thread-unsafe
This module cannot safely be used from different threads; it is up to the application to synchronize access to the library and to use it the way it is specified. Typical examples here are XS modules that use external thread-unsafe libraries which might only allow one thread to execute them.

Since Perl only shares when asked to, most pure Perl code probably falls into the thread-safe category. That doesn't mean you should trust a module until you have reviewed its source code or the author has marked it thread-safe. Typical problems include using alarm(), mucking around with signals, working with relative paths, and depending on %ENV. However, remember that ALL XS modules that don't state anything fall into the definitively thread-unsafe category.

Why should I bother making my module thread-safe or thread-friendly?

Well, it usually isn't much work, and it will make those users of your module who want to use it in a threaded environment very happy. What? Threaded Perl environments aren't that common, you say? Wait until Apache 2.0 and mod_perl 2.0 become available. One big change is that Apache 2.0 can run in threaded mode, in which case mod_perl will have to run in threaded mode too; this can be a huge performance gain on some operating systems. So if you want your modules to work with mod_perl 2.0, taking a look at thread-safety levels is a good thing to do.

So what do I do to make my module thread-friendly?

A good example of a module that needed a little modification to work with threads is Michael Schwern's most excellent Test::Simple suite (Test::Simple, Test::More and Test::Builder). Surprisingly, we had to change very little to fix it.

The problem was simply that the test numbering was not shared between threads.

For example:


     use threads;
     use Test::Simple tests => 3;
     ok(1);
     threads->create(sub { ok(1) })->join();
     ok(1);

Now that will print:


     1..3
     ok 1
     ok 2
     ok 2

Does it look similar to the problem we had earlier? Indeed it does; it seems that somewhere there is a variable that needs to be shared.

Reading the documentation of Test::Simple, we find out that all the magic is really done inside Test::Builder. Opening up Builder.pm, we quickly find the following lines of code:


     my @Test_Results = ();
     my @Test_Details = ();
     my $Curr_Test = 0;

Now we might be tempted simply to add use threads::shared and a :shared attribute:


     use threads::shared;
     my @Test_Results : shared = ();
     my @Test_Details : shared = ();
     my $Curr_Test : shared = 0;

However, Test::Builder needs to work back to Perl 5.004_04! Attributes were only added in 5.6.0, and the above code would be a syntax error in earlier Perls. And even if someone were using 5.6.0, threads::shared would not be available to them.

The solution is to use the runtime function share() exported by threads::shared, but we only want to do it for 5.8.0 and when threads have been enabled. So, let's wrap it in a BEGIN block and an if.


     BEGIN{
         if($] >= 5.008 && exists($INC{'threads.pm'})) {
             require threads::shared;
             import threads::shared qw(share);
             share($Curr_Test);
             share(@Test_Details);
             share(@Test_Results);
         }
     }

So, if 5.8.0 or higher and threads has been loaded, we do the runtime equivalent of use threads::shared qw(share); and call share() on the variables we want to be shared.

Now let's find some examples of where $Curr_Test is used. We find sub ok {} in Test::Builder; I won't include it here, but only a smaller version, which contains:


     sub ok {
         my($self, $test, $name) = @_;
         $Curr_Test++;
         $Test_Results[$Curr_Test-1] = 1 unless($test);
     }

Now, this looks like it should work, right? We have shared $Curr_Test and @Test_Results. Of course, things aren't that easy; they never are. Even if the variables are shared, two threads could enter ok() at the same time. Remember that not even the statement $Curr_Test++ is an atomic operation; it is just a shortcut for writing $Curr_Test = $Curr_Test + 1. So let's say two threads do that at the same time:


     Thread 1: add 1 + $Curr_Test
     Thread 2: add 1 + $Curr_Test
     Thread 2: Assign result to $Curr_Test
     Thread 1: Assign result to $Curr_Test

The effect would be that $Curr_Test would only be increased by one, not two! Remember that a switch between two threads could happen at ANY time, and if you are on a multiple CPU machine they can run at exactly the same time! Never trust thread inertia.
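
Here is a minimal sketch of that lost-update race (it assumes a threads-enabled perl 5.8; the final count will vary from run to run):


     use threads;
     use threads::shared;
     my $count : shared = 0;
     my @t = map { threads->create(sub { $count++ for 1 .. 100_000 }) } 1 .. 2;
     $_->join for @t;
     print "$count\n";   # frequently less than 200000 -- increments were lost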

So how do we solve it? We use the lock() keyword. lock() takes a shared variable and locks it for the rest of the scope, but it is only an advisory lock, so we need to find every place where $Curr_Test is used or modified and is expected not to change. The ok() sub becomes:


     sub ok {
         my($self, $test, $name) = @_;
         lock($Curr_Test);
         $Curr_Test++;
         $Test_Results[$Curr_Test-1] = 1 unless($test);
     }

So are we ready? Well, lock() was only added in Perl 5.005, so we need to add an else branch to the BEGIN block that defines a no-op lock function if we aren't running with threads. The end result would be:


     my @Test_Results = ();
     my @Test_Details = ();
     my $Curr_Test = 0;
     BEGIN{
         if($] >= 5.008 && exists($INC{'threads.pm'})) {
             require threads::shared;
             import threads::shared qw(share);
             share($Curr_Test);
             share(@Test_Details);
             share(@Test_Results);
         } else {
             *lock = sub(*) {};
         }
     }
     sub ok {
         my($self, $test, $name) = @_;
         lock($Curr_Test);
         $Curr_Test++;
         $Test_Results[$Curr_Test-1] = 1 unless($test);
     }

In fact, this is very much like the code that was added to Test::Builder to make it work nicely with threads. The only part that isn't accurate is ok(), which I cut down to what was relevant. There were roughly 5 places where lock() had to be added. Now the test code would print:


     1..3
     ok 1
     ok 2
     ok 3

which is exactly what the end user would expect. All in all, this is a rather small change for this 1291-line module: we changed roughly 15 lines in a non-intrusive way, and the documentation and test code make up most of the patch. The full patch is at http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2002-06/msg00816.html

Altering Perl's behavior to be thread-safe: ex::threads::safecwd

Some things change when you use threads; things that you or a module might do no longer behave the way they used to. Most of the changes are due to the way your operating system treats processes that use threads. Each process typically has a set of attributes, including the current working directory, the environment table, the signal subsystem and the pid. Since threads are multiple paths of execution inside a single process, the operating system treats them as a single process, and you have a single set of these attributes.

Yep. That's right - if you change the current working directory in one thread, it will also change in all the other threads! Whoops, better start using absolute paths everywhere, and all the code that uses your module might use relative paths. Aaargh...

Don't worry, this is a solvable problem. In fact, it's solvable by a module.

Perl allows us to override built-in functions using the CORE::GLOBAL namespace. This lets us override the functions that deal with paths and set the cwd correctly before issuing the command. So let's start off:


     package ex::threads::safecwd;

     use 5.008;
     use strict;
     use warnings;
     use threads::shared;
     our $VERSION = '0.01';

Nothing weird here, right? Now, when changing and dealing with the current working directory one often uses the Cwd module, so let us make the Cwd module safe first. How do we do that?


  1) use Cwd;
  2) our $cwd = cwd;  #our per thread cwd, init on startup from cwd
  3) our $cwd_mutex : shared; # the variable we use to sync
  4) our $Cwd_cwd = \&Cwd::cwd;
  5) *Cwd::cwd = *my_cwd;     
     sub my_cwd {
  6)     lock($cwd_mutex);
  7)     CORE::chdir($cwd);
  8)     $Cwd_cwd->(@_);
     }

What's going on here? Let's analyze it line by line:

  1. We include Cwd.
  2. We declare a variable and assign to it the cwd we start in. This variable will not be shared between threads and will contain the cwd of this thread.
  3. We declare a variable we will be using to lock for synchronizing work.
  4. Here we take a reference to &Cwd::cwd and store it in $Cwd_cwd.
  5. Now we hijack Cwd::cwd and assign to it our own my_cwd so whenever someone calls Cwd::cwd, it will call my_cwd instead.
  6. my_cwd starts off by locking $cwd_mutex so no one else will muck around with the cwd.
  7. After that we call CORE::chdir() to actually set the cwd to what this thread is expecting it to be.
  8. And we round off by calling the original Cwd::cwd that we stored in step 4, with any parameters that were handed to us.

In effect we have hijacked Cwd::cwd and wrapped it around with a lock and a chdir so it will report the correct thing!

Now that cwd() is fixed, we need a way to actually change the directory. To do this, we install our own global chdir, like this:


     *CORE::GLOBAL::chdir = sub {
         lock($cwd_mutex);
         CORE::chdir($_[0]) || return undef;
         $cwd = $Cwd_cwd->();
     };

Now, whenever someone calls chdir(), our chdir will be called instead. In it, we start by locking the variable that controls access; then we try to chdir to the directory. If that fails, we do what the real chdir would do and return undef. If it succeeds, we assign the new value to our per-thread $cwd by calling the original Cwd::cwd().

The above code is actually enough to allow the following to work:


     use threads;
     use ex::threads::safecwd;
     use Cwd;
     chdir("/tmp");
     threads->create(sub { chdir("/usr") } )->join();
     print cwd() eq '/tmp' ? "ok" : "nok";

The chdir("/usr") inside the thread does not affect the other thread's $cwd variable, so when cwd() is called, we lock the mutex, chdir() to the location that thread's $cwd contains, and perform a cwd().

While this is useful, we should press on and override some more functions to extend the functionality of this module.


     *CORE::GLOBAL::mkdir = sub {
         lock($cwd_mutex);
         CORE::chdir($cwd);
         if(@_ > 1) {
             CORE::mkdir($_[0], $_[1]);
         } else {
             CORE::mkdir($_[0]);
         }
     };

     *CORE::GLOBAL::rmdir = sub {
         lock($cwd_mutex);
         CORE::chdir($cwd);
         CORE::rmdir($_[0]);
     };

The above snippet does essentially the same thing for both mkdir and rmdir: we lock $cwd_mutex to synchronize access, then chdir to $cwd, and finally perform the action. Worth noticing here is the check we need to do for mkdir to make sure both its one-argument and two-argument forms behave correctly.

Let's move on to opendir, open, readlink, readpipe, require, rmdir, stat, symlink, system and unlink. None of these is really any different from the above, with the big exception of open. open is a weird special case, since it can take both a HANDLE and an empty scalar for autovivification of an anonymous handle.


     *CORE::GLOBAL::open = sub (*;$@) {
         lock($cwd_mutex);
         CORE::chdir($cwd);
         if(defined($_[0])) {
             use Symbol qw();
             my $handle = Symbol::qualify($_[0],(caller)[0]);
             no strict 'refs';
             if(@_ == 1) {
                 return CORE::open($handle);
             } elsif(@_ == 2) {
                 return CORE::open($handle, $_[1]);
             } else {
                 return CORE::open($handle, $_[1], @_[2..$#_]);
             }
         }

Starting off with the usual lock and chdir(), we then need to check whether the first value is defined. If it is, we have to qualify it to the caller's namespace. This is what happens if a user does open FOO, "+>foo.txt". If the user instead does open main::FOO, "+>foo.txt", then Symbol::qualify notices that the handle is already qualified and returns it unmodified. Since $_[0] is a read-only alias, we cannot assign over it, so we need to create a temporary variable and then proceed as usual.
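
Here is a small runnable demonstration of what Symbol::qualify does, independent of threads:


     use Symbol qw(qualify);
     print qualify('FOO', 'main'), "\n";        # prints "main::FOO"
     print qualify('Bar::FOO', 'main'), "\n";   # prints "Bar::FOO" (untouched)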

Now, if the user used the new-style open my $foo, "+>foo.txt", we need to treat it differently. The following code does the trick and completes the function:


     else {
             if(@_ == 1) {
                 return CORE::open($_[0]);
             } elsif(@_ == 2) {
                 return CORE::open($_[0], $_[1]);
             } else {
                 return CORE::open($_[0], $_[1], @_[2..$#_]);
             }
         }
     };

Wonder why we couldn't just assign $_[0] to $handle and unify the code paths? You see, $_[0] is an alias to the $foo in open my $foo, "+>foo.txt", so CORE::open works correctly on it.

However, if we do $handle = $_[0], we take a copy of the undefined variable, and CORE::open won't do what I mean.

So now we have a module that allows you to safely use relative paths in most cases and vastly improves your ability to port code to a threaded environment. The price we pay for this is speed, since every time you do an operation involving a directory you serialize your program. Typically, you never do those kinds of operations in a hot path anyway. You might do work on your file in a hot path, but as soon as we have gotten the filehandle, no more locking is done.

A couple of problems remain. Performance-wise, there is one big problem with system(): we don't get control back until CORE::system() returns, so all path operations will hang waiting for it. To solve that, we would need to revert to XS and do some magic with regard to the system call. We also haven't been able to override the file test operators (-x and friends), nor can we do anything about qx{}. Solving that problem requires walking up and down the optree using B::Generate and B::Utils. Perhaps a future version of the module will attempt that, together with a custom op to do the locking.

Conclusion

Threads in Perl are simple and straightforward; as long as we stay in pure Perl land, everything behaves just about how we would expect it to. Converting your modules should be a simple matter of programming, without any big wizardly things to be done. The important thing to remember is to think about how your module could take advantage of threads to make it easier to use for the programmer.

Moving over to XS land is altogether different; stay tuned for the next article, which will take us through the pitfalls of converting various kinds of XS modules to thread-safe and thread-friendly levels.

Apocalypse 5

Editor's Note: this Apocalypse is out of date and remains here for historic reasons. See Synopsis 05 for the latest information.

This is the Apocalypse on Pattern Matching, generally having to do with what we call "regular expressions", which are only marginally related to real regular expressions. Nevertheless, the term has grown with the capabilities of our pattern matching engines, so I'm not going to try to fight linguistic necessity here. I will, however, generally call them "regexes" (or "regexen", when I'm in an Anglo-Saxon mood).

Here are the RFCs covered in this Apocalypse. PSA stands for "problem, solution, acceptance", my private rating of how this RFC will fit into Perl 6. Doubtless I have misclassified your RFC, though the other ratings are pretty accurate. :-)

    RFC   PSA   Title
    ---   ---   -----
    072   aaa   Variable-length lookbehind. 
    093   abb   Regex: Support for incremental pattern matching
    110   bbb   counting matches
    112   acc   Assignment within a regex
    135   acr   Require explicit m on matches, even with ?? and // as delimiters.
    144   aaa   Behavior of empty regex should be simple
    145   acr   Brace-matching for Perl Regular Expressions
    150   acc   Extend regex syntax to provide for return of a hash of matched subpatterns
    156   aaa   Replace first match function (C<?...?>) with a flag to the match command.
    164   ccr   Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade()
    165   acc   Allow Variables in tr///
    166   abc   Alternative lists and quoting of things
    191   bbc   smart container slicing
    197   cdr   Numeric Value Ranges In Regular Expressions
    198   adr   Boolean Regexes
    261   dbr   Pattern matching on perl values
    274   acc   Generalised Additions to Regexs
    276   aaa   Localising Paren Counts in qr()s.
    308   dar   Ban Perl hooks into regexes
    316   bcr   Regex modifier for support of chunk processing and prefix matching
    317   aaa   Access to optimisation information for regular expressions
    331   acc   Consolidate the $1 and \1 notations
    332   abc   Regex: Make /$/ equivalent to /\z/ under the '/s' modifier
    348   bcc   Regex assertions in plain Perl code
    360   acb   Allow multiply matched groups in regexes to return a listref of all matches
    361   abb   Simplifying split()

Interestingly, there were no withdrawn RFCs for pattern matching. That means either that there were no cork-brained ideas proposed, or that regex culture is so cork-brained already that the cork-brained ideas blend right in. I know where my money is... :-)

In fact, regular expression culture is a mess, and I share some of the blame for making it that way. Since my mother always told me to clean up my own messes, I suppose I'll have to do just that.

For prior Apocalypses, I've used the RFCs as a springboard for discussion of my thinking, but this one is special, because none of the RFCs were courageous enough (or foolhardy enough) to look at the big picture and propose radical change where it's needed. But Perl has often been tagged as a language in which it's easy to write programs that are difficult to read, and it's no secret that regular expression syntax has been the chief culprit. Funny that other languages have been borrowing Perl's regular expressions as fast as they can...

That's primarily because we took several large steps in Perl 5 to enhance regex capabilities. We took one large step forwards with the /x option, which allowed whitespace between regex tokens. But we also took several large steps sideways with the (?...) extension syntax. I call them steps sideways, but they were simultaneously steps forward in terms of functionality and steps backwards in terms of readability. At the time, I rationalized it all in the name of backward compatibility, and perhaps that approach was correct for that time and place. It's not correct now, since the Perl 6 approach is to break everything that needs breaking all at once.

And unfortunately, there's a lot of regex culture that needs breaking.

Regex culture has gone wrong in a variety of ways, but it's not my intent to assign blame--there's plenty of blame to go around, and plenty of things that have gone wrong that are nobody's fault in particular. For example, it's nobody's fault that you can't realistically complement a character set anymore. It's just an accident of the way Unicode defines combining characters. The whole notion of character classes is mutating, and that will have some bearing on the future of regular expression syntax.

Given all this, I need to warn you that this Apocalypse is going to be somewhat radical. We'll be proposing changes to certain "sacred" features of regex culture, and this is guaranteed to result in future shock for some of our more conservative citizens. Do not be alarmed. We will provide ways for you to continue programming in old-fashioned regular expressions if you desire. But I hope that once you've thought about it a little and worked through some examples, you'll like most of the changes we're proposing here.

So although the RFCs did contribute greatly to my thinking for this Apocalypse, I'm going to present my own vision first for where regex culture should go, and then analyze the RFCs with respect to that vision.

First, let me enumerate some of the things that are wrong with current regex culture.

  • Too much history

  • Too compact and "cute"

  • Poor Huffman coding

  • Too much reliance on too few metacharacters

  • Different things look too similar

  • Poor end-weight design

  • Too much reliance on modifiers

  • Too many special rules and boobytraps

  • Backreferences not useful enough

  • Too hard to match a literal string

  • Two-level interpretation is problematic

  • Too little abstraction

  • Little support for named captures

  • Difficult to use nested patterns

  • Little support for grammars

  • Inability to define variants

  • Poor integration with "real" language

  • Missing backtracking controls

  • Difficult to define assertions

I'm sure there are other problems, but that'll do for starters. Let's look at each of these in more detail.

Too much history

Most of the other problems stem from trying to deal with a rich history. Now there's nothing wrong with history per se, but those of us who are doomed to repeat it find that many parts of history are suboptimal and contradictory. Perl has always tried to err on the side of incorporating as much history as possible, and sometimes Perl has succeeded in that endeavor.

Cultural continuity has much to be said for it, but what can you do when the culture you're trying to be continuous with is itself discontinuous? As it says in Ecclesiastes, there's a time to build up, and a time to tear down. The first five versions of Perl mostly built up without tearing down, so now we're trying to redress that omission.

Too compact and "cute"

Regular expressions were invented by computational linguists who love to write examples like /aa*b*(cd)*ee/. While these are conducive to reasoning about pattern matching in the abstract, they aren't so good for pattern matching in the concrete. In real life, most atoms are longer than "a" or "b". In real life, tokens are more recognizable if they are separated by whitespace. In the abstract, /a+/ is reducible to /aa*/. In real life, nobody wants to repeat a 15 character token merely to satisfy somebody's idea of theoretical purity. So we have shortcuts like the + quantifier to say "one or more".

Now, you may rightly point out that + is something we already have, and we already introduced /x to allow whitespace, so why is this bullet point here? Well, there's a lot of inertia in culture, and the problem with /x is that it's not the default, so people don't think to turn it on when it would probably do a lot of good. The culture is biased in the wrong direction. Whitespace around tokens should be the norm, not the exception. It should be acceptable to use whitespace to separate tokens that could be confused. It should not be considered acceptable to define new constructs that contain a plethora of punctuation, but we've become accustomed to constructs like (?<=...) and (??{...}) and [\r\n\ck\p{Zl}\p{Zp}], so we don't complain. We're frogs who are getting boiled in a pot full of single-character morphemes, and we don't notice.
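
To see the difference in the concrete, here's a small Perl 5 sketch (the variable names are invented for the example) contrasting the same pattern with and without /x:

    # A Perl 5 sketch: one date pattern, cramped and spaced out.
    my $cramped = qr/\d{4}-\d{2}-\d{2}/;
    my $spaced  = qr/
        \d{4} -     # year
        \d{2} -     # month
        \d{2}       # day
    /x;
    print "both match\n"
        if "2002-06-04" =~ $cramped and "2002-06-04" =~ $spaced;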

Poor Huffman coding

Huffman invented a method of data compaction in which common characters are represented by a small number of bits, and rarer characters are represented by more bits. The principle is more general, however, and language designers would do well to pay attention to the "other" Perl slogan: Easy things should be easy, and hard things should be possible. However, we haven't always taken our own advice. Consider those two regex constructs we just saw:

    (?<=...)
    (??{...})

Which one do you think is likely to be the most common in everyday use? Guess which one is longer...

There are many examples of poor Huffman coding in current regexes. Consider these:

    (...)
    (?:...)

Is it really the case that grouping is rarer than capturing? And by two gobbledygooky characters' worth? Likewise there are many constructs that are the same length but shouldn't be:

    (?:...)
    (?#...)

Grouping is much more important than the ability to embed a comment. Yet they're the same length currently.
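
To make the imbalance concrete, here's plain Perl 5 (nothing hypothetical about these constructs):

    # Perl 5: capturing is shorter than mere grouping, and grouping
    # costs exactly as much as an embedded comment.
    "foobarbar" =~ /foo(bar)+/;       # capture
    "foobarbar" =~ /foo(?:bar)+/;     # group without capturing
    "foobarbar" =~ /foo(?#note)bar/;  # comment, same length as (?:...)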

Too much reliance on too few metacharacters

A lot of our Huffman troubles came about because we were trying to shoehorn new capabilities into an old syntax without breaking anything. The (?...) construct succeeded at that goal, but it was new wine in old wineskins, as they say. More successful was the *? minimal matching hack, but it's still symptomatic of the problem that we only had three characters to choose from that would have worked at that point in the grammar. We've pretty nearly exhausted the available backslash sequences.

The waterbed theory of linguistic complexity says that if you push down one place, it goes up somewhere else. If you arbitrarily limit yourself to too few metacharacters, the complexity comes out somewhere else. So it seems obvious to me that the way out of this mess is to grab a few more metacharacters. And the metacharacters I want to grab are...well, we'll see in a moment.

Different things look too similar

Consider these constructs:

    (??{...})
    (?{...})
    (?#...)
    (?:...)
    (?i:...)
    (?=...)
    (?!...)
    (?<=...)
    (?<!...)
    (?>...)
    (?(...)...|...)

These all look quite similar, but some of them do radically different things. In particular, the (?<...) does not mean the opposite of the (?>...). The underlying visual problem is the overuse of parentheses, as in Lisp. Programs are more readable if different things look different.

Poor end-weight design

In linguistics, the notion of end-weight is the idea that people tend to prefer sentences where the short things come first and the long things come last. That minimizes the amount of stuff you have to remember while you're reading or listening. Perl violates this with regex modifiers. It's okay when you say something short like this:

    s/foo/bar/g

But when you say something like we find in RFC 360:

    while ($text =~ /name:\s*(.*?)\n\s*
                    children:\s*(?:(?@\S+)[, ]*)*\n\s*
                    favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {...}

it's not until you read the /sigx at the end that you know how to read the regex. This actually causes problems for the Perl 5 parser, which has to defer parsing the regular expression till it sees the /x, because that changes how whitespace and comments work.

Too much reliance on modifiers

The /s modifier in the previous example changes the meaning of the . metacharacter. We could, in fact, do away with the /s modifier entirely if we only had two different representations for "any character", one of which matched a newline, and one which didn't. A similar argument applies to the /m modifier. The whole notion of something outside the regex changing the meaning of the regex is just a bit bogus, not because we're afraid of context sensitivity, but because we need to have better control within the regex of what we mean, and in this case the context supplied outside the regex is not precise enough. (Perl 5 has a way to control the inner contexts, but it uses the self-obfuscating (?...) notation.)

Modifiers that control how the regex is used as a whole do make some sense outside the regex. But they still have the end-weight problem.

Too many special rules and boobytraps

Without knowing the context, you cannot know what the pattern // will do. It might match a null string, or it might match the previously successful match.
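
A Perl 5 sketch of that boobytrap:

    # Perl 5: an empty pattern silently reuses the last successful pattern.
    $_ = "hello world";
    if (/world/) {
        print "matched again\n" if //;   # repeats /world/, not a null match
    }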

The local operator behaves differently inside regular expressions than it does outside.

It's too easy to write a null pattern accidentally. For instance, the following will never match anything but the null string:

    /
    | foo
    | bar
    | baz
    /x

Even when it's intentional, it may not look intentional:

    (a|b|c|)

That's hard to read because it's difficult to make the absence of something visible.
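
Here's the earlier trap in runnable Perl 5 form:

    # Perl 5: the leading | creates an invisible empty alternative,
    # so this matches the null string at the start of any input.
    print "oops, matched\n" if "quux" =~ /
        | foo
        | bar
        | baz
    /x;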

It's too easy to confuse the multiple meanings of dot. Or the multiple meanings of ^ and $. And the opposite of \A is frequently not \Z, but \z. Tell me again, when do I say \1, and when do I say $1? Why are they different?

Backreferences not useful enough

Speaking of \1, backreferences have a number of shortcomings. The first is actually getting ahold of the right backreference. Since captures are numbered from the beginning, you have to count, and you can easily count wrong. For many purposes it would be better if you could ask for the last capture, or the one before that. Or perhaps if there were a way to restart the numbering part way through...

Another major problem with backreferences is that you can't easily modify one to search for a variant. Suppose you match an opening parenthesis, bracket, or curly. You'd like to search for everything up to the corresponding closing parenthesis, bracket, or curly, but there's no way to transmogrify the opening version to the closing version, because the backref search is hardwired independently of ordinary variable matching. And that's because Perl doesn't instantiate $1 soon enough. And that's because Perl relies on variable interpolation to get subexpressions into regexes. Which leads us to...
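
Here's a Perl 5 sketch of the contortion required today (the %close table and the sample string are invented for the example):

    # Perl 5 workaround: $1 can't be used inside the same pattern,
    # so finding the corresponding closer takes a second match.
    my %close = ( '(' => ')', '[' => ']', '{' => '}' );
    my $str   = "{ some contents } and more";
    if ( $str =~ /([(\[{])/g ) {
        my $closer = quotemeta $close{$1};
        print "inside:$1\n" if $str =~ /\G(.*?)$closer/gs;
    }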

Too hard to match a literal string

Since regexes undergo an interpolation pass before they're compiled, anything you interpolate is forced to be treated as a regular expression. Often that's not what you want, so we have the klunky \Q$string\E mechanism to hide regex metacharacters. And that's because...
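
For instance, in Perl 5:

    # Perl 5: an interpolated variable is parsed as regex source.
    my $input = "1+1";
    print "as a regex\n" if "11"  =~ /$input/;      # + acts as a quantifier
    print "literally\n"  if "1+1" =~ /\Q$input\E/;  # metacharacters hidden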

Two-level interpretation is problematic

The problem with \Q$string\E arises because of the fundamental mistake of using interpolation to build regexes instead of letting the regex control how it treats the variables it references. Regexes aren't strings, they're programs. Or, rather, they're strings only in the sense that any piece of program is a string. Just as you have to work to eval a string as a program, you should have to work to eval a string as a regular expression. Most people tend to expect a variable in a regular expression to match its contents literally. Perl violates that expectation. And because it violates that expectation, we can't make $1 synonymous with \1. And interpolated parentheses throw off the capture count, so you can't easily use interpolation to call subrules, so we invented (??{$var}) to get around that. But then you can't actually get at the parentheses captured by the subrule. The ramifications go on and on.

Too little abstraction

Historically, regular expressions were considered a very low-level language, a kind of glorified assembly language for the regex engine. When you're only dealing with ASCII, there is little need for abstraction, since the shortest way to say [a-z] is just that. With the advent of the eighth bit, we started getting into a little bit of trouble, and POSIX started thinking about names like [:alpha:] to deal with locale difficulties. But as with the problem of conciseness, the culture was still biased away from naming abstractly anything that could be expressed concretely.

However, it's almost impossible to write a parser without naming things, because you have to be able to name the separate grammar rules so that the various rules can refer to each other.

It's difficult to deal with any subset of Unicode without naming it. These days, if you see [a-z] in a program, it's probably an outright bug. It's much better to use a named character property so that your program will work right in areas that don't just use ASCII.

Even where we do allow names, it tends to be awkward because of the cultural bias against it. To call a subrule by name in Perl 5 you have to say this:

    (??{$rule})

That has 4 or 5 more characters than it ought to. Dearth of abstraction produces bad Huffman coding.
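
For the record, here's what calling a subrule actually looks like today, as a minimal Perl 5 sketch (the rule name is invented):

    # Perl 5: a recursive "subrule" via the deferred-eval construct.
    use re 'eval';
    our $parens;
    $parens = qr/ \( (?: [^()]+ | (??{ $parens }) )* \) /x;
    print "balanced\n" if "(a(b)c)" =~ /^$parens$/;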

Little support for named captures

Make that "no support" in Perl, unless you include assignment to a list. This is just a part of the bias against naming things. Instead we are forced to number our capturing parens and count. That works okay for the top-level regular expression, when we can do list assignment or assign $1 to $foo. But it breaks down as soon as you start trying to use nested regexes. It also breaks down when the capturing parentheses match more than once. Perl handles this currently by returning only the last match. This is slightly better than useless, but not by much.

Difficult to use nested patterns

For many of the reasons we've mentioned, it's difficult to make regexes refer to each other, and even if you do, it's almost impossible to get the nested information back out of them. And there are entire classes of parsing problems that are not solvable without recursive definitions.

Little support for grammars

Even if it were easier for regexes to refer to other regexes, we'd still have the problem that those other regexes aren't organized in any meaningful way. They might be off in variables that come and go at the whim of the surrounding context.

When we have an organized system of parsing rules, we call it a grammar. One advantage of having a grammar is that you can optimize based on the assumption that the rules maintain their relationship to each other. For instance, if you think of grammar rules as a funny kind of subroutine, you can write an optimizer to inline some of the subrules--but only if you know the subrule is fixed in the grammar.

Without support for grammar classes, there's no decent way to think of deriving one grammar from another. And if you can't derive one grammar from another, you can't easily evolve your language to handle new kinds of problems.

Inability to define variants

If we want to have variant grammars for Perl dialects, then what about regex dialects? Can regexes be extended either at compile time or at run time? Perl 5 has some rudimentary overloading magic for rewriting regex strings, but that's got the same problems as source filters for Perl code; namely that you just get the raw regex source text and have to parse it yourself. Once again the fundamental assumption is that a regex is a funny kind of string, existing only at the behest of the surrounding program.

Do we think of regexes as a real, living language?

Poor integration with "real" language

Let's face it, in the culture of computing, regex languages are mostly considered second-class citizens, or worse. "Real" languages like C and C++ will exploit regexes, but only through a strict policy of apartheid. Regular expressions are our servants or slaves; we tell them what to do, they go and do it, and then they come back to say whether they succeeded or not.

At the other extreme, we have languages like Prolog or Snobol where the pattern matching is built into the very control structure of the language. These languages don't succeed in the long run because thinking about that kind of control structure is rather difficult in actual fact, and one gets tired of doing it constantly. The path to freedom is not to make everyone a slave.

However, I would like to think that there is some happy medium between those two extremes. Coming from a C background, Perl has historically treated regexes as servants. True, Perl has treated them as trusted servants, letting them move about in Perl society better than any other C-like language to date. Nevertheless, if we emancipate regexes to serve as co-equal control structures, and if we can rid ourselves of the regexist attitudes that many of us secretly harbor, we'll have a much more productive society than we currently do. We need to empower regexes with a sense of control (structure). It needs to be just as easy for a regex to call Perl code as it is for Perl code to call a regex.

Missing backtracking controls

Perl 5 started to give regexes more control of their own destiny with the "grab" construct, (?>...), which tells the regex engine that when it fails to match the rest of the pattern, it should not backtrack into the innards of the grab, but skip back to before it. That's a useful notion, but there are problems. First, the notation sucks, but you knew that already. Second, it doesn't go far enough. There's no way to backtrack out of just the current grouping. There's no way to backtrack out of just the current rule. Both of these are crucial for giving first-class status to the control flow of regexes.
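
Here's the grab construct doing its (notationally challenged) job in Perl 5:

    # Perl 5: (?>...) refuses to give characters back on backtracking.
    print "plain: match\n" if "aaa" =~ /^a*a/;      # a* backtracks one 'a'
    print "grab:  match\n" if "aaa" =~ /^(?>a*)a/;  # never prints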

Difficult to define assertions

Notionally, a regex is an organization of assertions that either succeed or fail. Some assertions are easily expressed in traditional regex language, while others are more easily expressed in a procedural language like Perl.

The natural (but wrong) solution is to try to reinvent Perl expressions within regex language. So, for instance, I'm rejecting those RFCs that propose special assertion syntax for numerics or booleans. The better solution is to make it easier to embed Perl assertions within regexes.
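
To show how awkward the current state of affairs is, here's the standard Perl 5 idiom for a procedural assertion, which fakes failure by interpolating the impossible pattern (?!):

    # Perl 5 sketch: a code assertion via deferred evaluation.
    use re 'eval';
    "666" =~ / (\d+) (??{ $1 < 582 ? '' : '(?!)' }) /x
        and print "matched $1\n";    # backtracks and matches "66"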


Brave New World

I've just made a ton of negative assertions about the current state of regex culture. Now I'd like you to perform a cool mental trick and turn all those negative assertions into positive assertions about what I'm going to say, because I'm not intending to give the rationale again, but just present the design as it stands. Damian will discuss an extended example in his Exegesis 5, which will show the big picture of how these various features work together to produce a much more readable whole.

So anyway, here's what's new.

Metacharacter Reform

Some things stay the same: (...) captures text just as it did before, and the quantifiers *, +, and ? are also unchanged. The vertical bar | still separates alternatives. The backslash \ still protects the following character from its ordinary interpretation. The ? suffix character still does minimal matching. (Note that these are by far the most commonly used metacharacters, so many ordinary regexes will look nearly identical in Perl 5 and Perl 6.)

Since /x extended syntax is now the default, # is now always a metacharacter indicating a comment, and whitespace is now always "meta". Whitespace is now the standard way to separate regex tokens that would otherwise be confused as a single token.

Even in character classes, whitespace is not taken literally any more. Backwhack the space if you mean it literally, or use <sp>, or \040, or \x20, or \c[SPACE]. But speaking of character classes...

Perhaps the most radical change is that I've taken [...] away from character classes and made it the non-capturing grouping operator, because grouping is more fundamental than character classes, and explicit character classes are becoming less common than named character classes. (You can still do character classes, just not with bare square brackets.)

I've also stolen {...} from generalized quantifiers and made them into closure delimiters. (Use <n,m> for the generalized quantifier now.)

I've stolen three new metacharacters. The new extensible metasyntax for assertions uses angle brackets, <...>. And the colon : is now used for declaration and backtracking control. (Recall Larry's 2nd Law of Language Redesign: Larry gets the colon.) The colon always introduces a token that controls the meaning of what is around it. The nature of the token depends on what follows the colon. Both the colon syntax and angle syntax are extensible. (Backslash syntax is also extensible.)

This may sound like we're complexifying things, but we're really simplifying. We now have the following regex invariants:

    (...)       # always delimits a capturing group
    [...]       # always delimits a non-capturing group
    {...}       # always delimits a closure
    <...>       # always delimits an assertion
    :...        # always introduces a metasyntactic token

(Note that we're using "assertion" here in the broad sense of anything that either matches or fails, whether or not it has a width.)

The nature of the angle assertion is controlled by the first character inside it. If the first character is alphabetic, it's a grammatical assertion, and the entire first word controls the meaning. The word is first looked up in the current grammar, if any. If not found there, it is checked to see if it is one of the built-in grammar rules such as those defined by the Unicode property classes. If the first character is not alphabetic, there will be special rules in the current grammar or in the Perl grammar for looking up the parse rule. For instance, by default, any assertion that begins with ! is simply negated. Assertions that start with a digit are assumed to be a range assertion (<n,m>) regarding the previous atom. (Taking the last two together, you can say <!n,m> to exclude a range.) Assertions that start with $, @, %, or & are assumed to interpolate an indirect regex rule stored in a variable or returned by a subroutine. An assertion that starts with a parenthesis is a closure being used as an assertion. An assertion that starts with a square bracket or another angle bracket is a character class. An assertion that starts with a quote asserts the match of a literal string. And so on.

Some metacharacters are still used but have a slightly different meaning, in part to get rid of the /s and /m modifiers, and in part because most strings in Perl 6 will come from the filehandle pre-chomped. So anchors ^ and $ now always mean the real beginning and ending of the string. Use ^^ and $$ to match the beginnings and endings of lines within a string. (They're doubled because they're "fancier", because they can match in multiple places, and because they'll be rarer, so Huffman says they should be longer.) The ^^ and $$ also match where ^ and $ would.

The dot . now always matches any character including newline. (Use \N to match a non-newline. Or better, use an autochomping filehandle, if you're processing line-by-line.)

In a sense, the sigils $, @, %, and & are different metacharacters because they don't interpolate, but are now subject to the interpretation of the regex engine. This allows us to change the default behavior of ordinary "interpolation" to a literal match, and also lets us put in lvalue-ish constructs like:

    / $name := (\S+) /
    / @kids := [(\S+) \s+]* /
    / %pets := [(\S+) \: (\S+) \s+]* /

(Notice also the delicate interplay of quantified non-capturing brackets with capturing parens, particularly for gathering multiple values or even multiple key/value pairs.)

Here are some of the metacharacter differences in table form:

    Old                   New
    ---                   ---
    /pat pat #text        /pat pat #text
        pat/x                 pat/              # Look Ma, no /x!
    /patpat(?#text)/       /pat pat <('text')>/ # can always use whitespace
    /pat pat/             / pat\ pat /          # match whitespace literally
                          / pat \s* pat /       # or generically
                          / pat \h* pat /       # or horizontally
                          / pat <' '> pat /     # or as a literal string
                          / pat <sp> pat /      # or by explicit rule
                          /:w pat pat/          # or by implicit rule
    /^pat$/               /^pat\n?$/            # ^ and $ mean string
    /^pat$/m              /^^pat$$/             # no more /m
    /\A...(^pat$)*...\z/m /^...(^^pat$$)*...$/  # no more \A or \z
    /.*\n/                /\N*\n/               # \N is negated \n
                          /.*?\n/               # this still works
    /.*/s                 /.*/                  # . always matches "any"
    \Q$string\E           $string               # interpret literally
    (?{ code })           { code }              # call code, ignore return
                          { code or fail }      # use code as an assertion
    (??{$rule})           <$var>                # call $var as regex
                          <name>                # call rule from current grammar
                          <Other::rule>         # call rule from some Other grammar
                          <*rule>               # bypass local rule to call built-in
                          <@array>              # call array of alternate rules
                          <%hash>               # parse keyword as key to rule
                          <@array[1]>           # call a rule from an array
                          <%hash{"x"}>          # call a rule from a hash
                          <&sub(1,2,3)>         # call a rule returned by a sub
                          <{ code }>            # call return value as anonymous rule
                          <( code )>            # call code as boolean assertion
                          <name(expr)>          # call rule, passing Perl args
                          { .name(expr) }       # same thing.
                          <$var(expr)>          # call rule indirectly by name
                          { .$var(expr) }       # same thing.
                          <name pat>            # call rule, passing regex arg
                          { .name(/pat/) }      # same thing.
                          # maybe...
                          <name: text>          # call rule, passing string
                          { .name(q<text>) }    # same thing.
    [\040\t\p{Zs}]        \h                    # horizontal whitespace
    [\r\n\ck\p{Zl}\p{Zp}] \v                    # vertical whitespace
    [a-z]                 <[a-z]>               # equivalently non-international
                          <alpha>               # more international
    [[:alpha:][:digit:]]  <<alpha><digit>>      # POSIX classes are built-in rules
    {n,m}                 <n,m>                 # assert repeat count
    {$n,$m}               <$n,$m>               # indirect repeat counts
    (?>.*)                [.*]:                 # don't backtrack through [.*]
                          .*:                   # brackets not necessary on atom
                          (.*):                 # same, but capture
                          <xyz>:                # don't backtrack into subrule
                          :                     # skip previous atom when backtracking
                          ::                    # fail all |'s when backtracking
                          :::                   # fail current rule when backtracking
                          :=                    # bind a name to following atom
    my ($x) = /(.*)/      my $x; / $x:=(.*) /   # may now bind it inside regex
    (?i)                  :i                    # ignore case in the following
                          :ignorecase           # same thing, self-documenting form
    (?i:...)              [:i ...]              # can limit scope without capture
                          (:i ...)              # can limit scope with capture

Declarations like :i are lexically scoped and do not pass to any subrules. Each rule maintains its own sensitivity. There is no built-in operator to turn case ignorance back off--just call a different rule and it's automatically case sensitive again. (If you want a parameterized subrule, that can be arranged. It's just a method, after all. Proof of this assertion is left to future generations of hackers.)

Backslash Reform

There are some changes to backslash sequences. Character properties \p and \P are no longer needed--predefined character classes are just considered intrinsic grammar rules. (You can negate any <...> assertion by using <!...> instead.) As mentioned in a previous Apocalypse, the \L, \U, and \Q sequences no longer use \E to terminate--they now require bracketing characters of some sort. And \Q will rarely be needed due to regex policy changes. In fact, they may all go away since it's easy to say things like:

    $(lc $foo)

For any bracketing construct, square brackets are preferred, but others are allowed:

    \x[...]     # preferred, indicates simple bracketing
    \x(...)     # okay, but doesn't capture.
    \x{...}     # okay, but isn't a closure.
    \x<...>     # okay, but isn't an assertion

The \c sequence is now a bracketing construct, having been extended from representing control characters to any named character.

Backreferences such as \1 are gone in favor of the corresponding variable $1. \A, \Z, and \z are gone with the disappearance of /s and /m. The position assertion \G is gone in favor of a :c modifier that forces continuation from where the last match left off. That's because \G was almost never used except at the front of a regex. In the unlikely event that you want to assert that you're at the old final position elsewhere in your regex, you can always test the current position (via the .pos method) with an assertion:

    $oldpos = pos $string;
    $string =~ m/... <( .pos == $oldpos )> .../;

You may be thinking of .pos as the final position of the previous match, but that's not what it is. It's the current position of the current match. It's just that, between matches, the current position of the current match happens to be the same as the final position of the current match, which happens to be the last match, which happens to be done. But as soon as you start another match, the last match is no longer the current match.

Note that the :c continuation is needed only on constructs that ordinarily force the search to start from the beginning. Subrules automatically continue at the current location, since their initial position is controlled by some other rule.
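
For comparison, here's how the same bookkeeping looks in Perl 5 (a minimal sketch):

    # Perl 5: pos() is where the last /g match ended, and \G anchors
    # the next match exactly there.
    my $s = "abc123xyz";
    $s =~ /abc/g;
    print pos($s), "\n";                       # prints 3
    print "digits: $1\n" if $s =~ /\G(\d+)/g;  # continues at offset 3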

There are two new backslash sequences, \h and \v, which match horizontal and vertical whitespace respectively, including Unicode spacing characters and control codes. Note that \r is considered vertical even though it theoretically moves the carriage sideways. Finally, \n matches a logical newline, which is not necessarily a linefeed character on all architectures. After all, that's why it's an "n", not an "l". Your program should not break just because you happened to run it on a file from a partition mounted from a Windows machine. (Within an interpolated string, \n still produces whatever is the normal newline for the current architecture.)

    Old                 New
    ---                 ---
    \x0a                \x0a                    # same
    \x{263a}            \x263a                  # brackets required only if ambiguous
    \x{263a}abc         \x[263a]abc             # brackets required only if ambiguous
    \0123               \0123                   # same (no ambiguity with $123 now)
    \0123               \0[123]                 # can use brackets here too
    \p{prop}            <prop>                  # properties are just grammar rules
    \P{prop}            <!prop>
    [\040\t\p{Zs}]      \h                      # horizontal whitespace
    space               \h                      # not exact, but often more correct
    [\r\n\ck\p{Zl}\p{Zp}] \v                    # vertical whitespace
    \Qstring\E          \q[string]
                        <'string with spaces'>  # match literal string
                        <' '>                   # match literal space
    \E                  gone                    # use \q[...] instead
    \A                  ^                       # ^ now invariant
    \a                  \c[BEL]                 # alarm (bell)
    \Z                  \n?$                    # clearer
    \z                  $                       # $ now invariant
    \G                  <( .pos == $oldpos )>   # match at particular position
                                                # typically just use m:c/pat/
    \N{CENT SIGN}       \c[CENT SIGN]           # named character
    \c[                 \e                      # escape
    \cX                 \c[^X]                  # control char
    \n                  \c[LF]                  # specifically a linefeed
    \x0a\x0d            \x[0a;0d]               # CRLF
    \x0a\x0d            \c[CR;LF]               # CRLF (conjectural)
    \C                  [:u0 .]                 # forced byte in utf8 (dangerous)
    [^\N{CENT SIGN}]    \C[CENT SIGN]           # match any char but CENT SIGN
    \Q$var\E            $var                    # always assumed literal,
    \1                  $1                      # so $1 is literal backref
    /$1/                my $old1 = $1; /$old1/  # must use temporary here
    \r?\n               \n                      # \n asserts logical newline
    [^\n]               \N                      # not a logical newline
                        \C[LF]                  # not a linefeed
    [^\t]               \T                      # not a tab (these are conjectural)
    [^\r]               \R                      # not a return
    [^\f]               \F                      # not a form feed
    [^\e]               \E                      # not an escape
    [^\x1B]             \X1B                    # not specified hex char
    [^\x{263a}]         \X[263a]                # not a Unicode SMILEY
    \X                  <.>                     # a grapheme (combining char seq)
                        [:u2 .]                 # At level 2+, dot means grapheme

Under level 2 Unicode support, a character is assumed to mean a grapheme, that is, a sequence consisting of a base character followed by 0 or more combining characters. That not only affects the meaning of the . character, but also any negated character, since a negated character is really a negative lookahead assertion followed by the traversal of a single character. For instance, \N really means:

    [<!before \n> . ]

So it doesn't really matter how many characters \n actually matches. \N always matches a single character--whatever that is...
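
The same equivalence can be spelled out in runnable Perl 5 terms:

    # Perl 5 parallel: a negated character class is just a negative
    # lookahead followed by a single "any" character.
    print "same\n" if "x" =~ /^(?:(?!\n).)$/s and "x" =~ /^[^\n]$/;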

Modifier Reform

You can't use colon for a regex delimiter any more. That's because regex modifiers may now be placed in front of a regex construct:

    s:w:i:e /foo/bar/           # :words :ignorecase :each

That can also be written:

    s/:w:i:e foo/bar/           # :words :ignorecase :each

Single character modifiers may be bundled like this:

    s:wie /foo/bar/             # :words :ignorecase :each

...but only if the sequence as a whole is not already defined as a long modifier, since ambiguity will be resolved in favor of the long modifier. Long modifiers may not be bundled with any other modifier. So this is legal:

    s:once:wie /foo/bar/

but not these (unless you've defined them):

    s:wieonce /foo/bar/
    s:oncewie /foo/bar/

Not only is colon disallowed as a regex delimiter, but you may no longer use parentheses as the delimiters either. This will allow us to parameterize modifiers:

    s:myoption($x) /foo/bar/

This rule also allows us to differentiate s/// from an s() function, tr/// from tr(), etc. If you want matching brackets for the delimiters I'd suggest that you use square brackets, since they now mean grouping without capturing.

Several modifiers, /x, /s, and /m, are no longer needed and have been retired. It's unclear whether /o is necessary any more. We will assume it's gone unless it's shown that caching can't handle the problem. Note that the regex now has more control over when to cache subrules because it is no longer subject to the vagaries of standard interpolation.

The old /c modifier is gone because regexes never reset the position on failure any more. To do that, set $string.pos = 0 explicitly. But note also that assigning to a string automatically resets its position to 0, so any string in your typical loop is going to start with its current search position already set to 0. Modifying a string in place causes the position to move to the end of the replacement section by default, if the position was within the span replaced. (This is consistent with s/// semantics.)

The /e modifier is also gone, since it did reverse parsing magic, and :e will be short for :each--see below. It's still easy to substitute the value of an expression though:

    s/pat/$( code )/;

or even

    s(/pat/, { code });
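
For comparison, the Perl 5 form being retired:

    # Perl 5: /e evaluates the replacement as an expression.
    my $s = "price: 10";
    $s =~ s/(\d+)/$1 * 2/e;
    print "$s\n";    # price: 20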

There's a new modifier, :once, that causes a match to succeed only once (like the old ?...? construct). To reset it, use the .reset method on the regex object. (If you haven't named the regex object, too bad...)

Another new modifier is :w, which causes an implicit match of whitespace wherever there's literal whitespace in a pattern. In other words, it replaces every sequence of actual whitespace in the pattern with a \s+ (between two identifiers) or a \s* (between anything else). So


    m:w/ foo bar \: ( baz )*/

really means (expressed in Perl 5 form):


    m:p5/\s*foo\s+bar\s*:(\s*baz\s*)*/

You can still control the handling of whitespace under :w, since we extend the rule to say that any explicit whitespace-matching token can't match whitespace implicitly on either side. So:

    m:w/ foo\ bar \h* \: (baz)*/

really means (expressed in Perl 5 form):


    m:p5/\s*foo bar[\040\t\p{Zs}]*:\s*(baz)*/

The first space in

    /[:w foo bar]/

matches \s* before "foo". That's usually what you want, but if it's not what you want, you have a little problem. Unfortunately you can't just say:

    /[:wfoo bar]/

That won't work because it'll look for the :wfoo modifier. However, there are several ways to get the effect you want:

    /[:w()foo bar]/ 
    /[:w[]foo bar]/ 
    /[:w\bfoo bar]/ 
    /[:w::foo bar]/

That last one is just our friend the :: operator in disguise. If you backtrack into it, you're leaving the brackets anyway, so it's essentially a no-op.

The new :c/:cont modifier forces the regex to continue at the current "pos" of the string. It may only be used outside the regex. (Well, it could be used inside but it'd be redundant.) The modifier also forces the regex to match only the next available thing. That's not quite the same as the ^ anchor, though, because it not only disables the implicit scanning done by m// and s///, but it also works on more than the first iteration. It forces all matches to be contiguous, in other words. So :c is short for both "continue" and "contiguous". If you say

    $_ = "foofoofoo foofoofoo";
    s:each:cont/foo/FOO/;

you get:

    FOOFOOFOO foofoofoo

This may seem odd, but it's precisely the semantics of any embedded regex:

    $_ = "foofoofoo foofoofoo";
    $rx = rx/foo/;
    m/<$rx>*/;          # matches "foofoofoo"

A modifier that starts with a number causes the pattern to match that many times. It may only be used outside the regex. It may not be bundled, because ordinals are distinguished from cardinals. That is, how it treats those multiple matches depends on the next character. If you say

    s:3x /foo/bar/

then it changes the first 3 instances. But if you say

    s:3rd /foo/bar/

it changes only the 3rd instance. You can say

    s:1st /foo/bar/

but that's just the default, and should not be construed as equivalent to :once, which matches only once, ever. (Unless you .reset it, of course.)

You can combine cardinals and ordinals:

    s:3x:3rd /foo/bar/

That changes the 3rd, 6th, and 9th occurrences. To change every other quote character, say

    s:each:2nd /"/\&rquot;/;

:each is synonymous with :3x (for large values of 3). Note that :each does not, in fact, generate every possible match, because it disallows overlaps. To get every possible match, use the :any modifier. Saying:

    $_ = "abracadabra";
    @all = m:any /a[^a]+a/;

produces:

    abra aca ada abra
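
You can approximate that particular result in Perl 5 with the capturing-lookahead trick (shown only for comparison; it doesn't generalize to the full :any semantics):

    # Perl 5: zero-width lookahead permits overlapping matches.
    my @all = "abracadabra" =~ /(?=(a[^a]+a))/g;
    print "@all\n";    # abra aca ada abra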

It can even match multiple times at the same spot as long as the rest of the regex progresses somehow. Saying:

    @all = m:any /a.*?a/;

produces:

    abra abraca abracada abracadabra aca acada acadabra ada adabra abra

If you say

    $sentence.m:any /^ <english> $/

you'll get every possible parsing of the sentence according to the rules of english (not to be confused with the rules of English, which are already confusing enough, except when they aren't).

To indicate varying levels of Unicode support we have these modifiers, which may be used either inside or outside a regex:

    :u0         # use bytes       (. is byte)
    :u1         # level 1 support (. is codepoint)
    :u2         # level 2 support (. is grapheme)
    :u3         # level 3 support (. is language dependent)

These modifiers say nothing about the state of the data, but in general internal Perl data will already be in Normalization Form C, so even under :u1, the precomposed characters will usually do the right thing. Note that these modifiers are for overriding the default support level, which was probably set by pragma at the top of the file.

Finally, there's the :p5 modifier, which causes the rest of the regex (or group) to be parsed as a Perl 5 regular expression, including any interpolated strings. (But it still doesn't enable Perl 5's trailing modifiers.)

    Old                 New
    ---                 ---
    ?pat?               m:once/pat/             # match once only
    /pat/i              m:i/pat/                # ignorecase
                        /:i pat/                # ignorecase
    /pat/x              /pat/                   # always extended
    /pat\s*pat/         /:w pat pat/            # match word sequence
    /(?i)$p5pat/        m:p5/(?i)$p5pat/        # use Perl 5 syntax
    $n = () = /.../g    $n = +/.../;            # count occurrences
    for $i (1..3){s///} s:3x///;                # do 3 times
    /^pat$/m            /^^pat$$/               # no more /m
    /./s                /./                     # no more /s
    /./                 /\N/                    # . now works like /s

Keyword and Context Reform

Deferred regex rules are now defined with rx// rather than qr//, because a regular expression is no longer a kind of quoted string.

Actually, just as you can define closures without an explicit sub, any // or rx// declares a deferred regex if it's not in a context that executes it immediately. A regex is executed automatically if it's in a boolean, numeric, or string context. But assignment to an untyped variable is not such a context, nor is passing the regex as an untyped parameter to a function. (Of course, an explicitly declared RULE parameter doesn't provide an evaluating context either.)

So these are equivalent:

    my $foo = /.../;            # create regex object
    my $foo = rx[...];          # create regex object
    my $foo = rule {...};       # create regex object

Likewise, these are equivalent:

    @x = split /.../;
    @x = split rx[...];
    @x = split rule {...};

The "rule" syntax is just a way of declaring a deferred regex as if it were a subroutine or method. More on that later.

To force an immediate evaluation of a deferred regex where it wouldn't ordinarily be, you can use the appropriate unary operator:

    my $foo = ?/.../;   # boolean context, return whether matched, 
    my $foo = +/.../;   # numeric context, return count of matches
    my $foo = _/.../;   # string context, return captured/matched string

The standard match and substitution forms also force immediate evaluation regardless of context:

    $result = m/.../;           # do match on topic string
    $result = s/.../.../;       # do substitution on topic string

These forms also force the regex to start matching at the beginning of the string in question and scan forward through the string for the match, as if there were an implicit .*? in front of every iteration. (Both of these behaviors are suppressed if you use the :c/:cont modifier). In contrast, the meaning of the deferred forms is dependent on context. In particular, a deferred regex naturally assumes :c when used as a subrule. That is, it continues where the last match left off, and the next thing has to match right there at the head of the string.

In any other context, including list context, a deferred regex is not immediately evaluated, but produces a reference to the regex object:

    my $rx = /.../;     # not evaluated
    my @foo = $rx;      # ERROR: type mismatch.
    my @foo = ($rx);    # One element, a regex object.
    my @foo = (/.../);  # Same thing.
    my @foo := $rx;     # Set autogrow rule for @foo.

To evaluate repeatedly in list context, treat the regex object as you would any other iterator:

    my @foo = <$rx>;

You can also use the more explicit form:

    my @foo = m/<$rx>/;

Those aren't identical, since the former assumes :c and starts up at the current position of the unmentioned topic, while the latter explicitly resets the position to the beginning before scanning. Also, since the deferred regex assumes a :c modifier, <$rx> won't scan through the string like m//. It can return multiple values to the list, but they have to be contiguous. You can get the scanning effect of m// by prepending the pattern with .*?.

But it's vitally important to understand this fundamental change, that // is no longer a short form of m//, but rather a short form of rx//. If you want to add modifiers to a //, you have to turn it into an rx//, not an m//. It's now wrong to call split like this:

    split m/.../

(That is, it's wrong unless you actually want the return value of the pattern match to be used as the literal split delimiter.)

The old ?...? syntax is gone. Indeed, it has to go for us to get the unary ? operator.

    Old                 New
    ---                 ---
    ?pat?               m:once/pat/
    qr//                rx//
                        rule { }

Null String Reform

The null pattern is now illegal. To match whatever you used to match with a null pattern, use one of these:

    Old                 New
    ---                 ---
    //                  /<prior>/       # match what prior match did
    //                  /<null>/        # match the null string between chars
    (a|b|)              (a|b|<null>)    # match a null alternative

Note that, as an assertion, <null> always succeeds. You never want to say:

    / <null> | single | double | triple | home run /

because you'll never get to first base.

Extension Syntax Reform

There are no longer any (?...) sequences, because parens now always capture. Some of the replacement sequences take their intrinsic scoping from <...>, while others are associated with other bracketing characters, or with any arbitrary atom that could be a bracketed construct. Looking at the metasyntax problem from the perspective of a Perl5-to-Perl6 translator, here's what the various Perl 5 extension constructs translate to:

    Old                 New
    ---                 ---
    (??{$rule})         <$rule>         # call regex in variable
    (?{ code })         { code }        # call Perl code, ignore result
    (?#...)             <('...')>       # in-line comment, rarely needed
    (?:...)             [...]           # non-capturing brackets
    (?=...)             <before ...>    # positive lookahead
    (?!...)             <!before ...>   # negative lookahead
    (?<=...)            <after ...>     # positive lookbehind
    (?<!...)            <!after ...>    # negative lookbehind
    (?>...)             [...]:          # grab (any atom)
    (?(cond)yes|no)     [ cond :: yes | no ]
    (?(1)yes|no)        [ <(defined $1)> :: yes | no ]

The <$rule> construct does a "delayed" call of another regular expression stored in the $rule variable. If it is a regex object, it's just called as if it were a subroutine, so there's no performance problem. If it's a string, it is compiled as a regex and executed. The compiled form is cached as a property of the string, so it doesn't have to be recompiled unless the string changes. (This implies that we can have properties that invalidate themselves when their base object is modified.) In either case, the evaluated regex is treated as a subrule, and any captures it does are invisible to the outer regex unless the outer regex takes steps to retrieve them. In any event, subrule parens never change the paren count of the outer rule.

The {code} form doesn't return anything meaningful--it is used for its side effects. Any such closure may behave as an assertion. It merely has to throw an exception in order to fail. To throw such an exception (on purpose) one may use fail:

    $_ = "666";
    / (\d+) { $1 < 582 or fail }/

As with any assertion, the failing closure starts backtracking at the location of the closure. In this case, it backtracks into the \d+ and ends up matching "66" rather than "666". If you didn't want that, use \d+: instead.

It's more succinct, however, to use the code assertion syntax. Just put angles around a parenthesized Perl expression:

    / (\d+) <( $1 < 582 )> /

I find the parens to be vaguely reminiscent of the parentheses you have to put around conditionals in C (but not Perl (anymore)). Also, the parentheses are meant to remind you that you only want to put an expression there, not a full statement.

Don't use a bare closure to try to interpolate a calculated regex, since the result will be ignored. Instead, use the <{expr}> form to do that. As with <&rule()>, the result will be interpreted as a subrule, not as if it were interpolated.

Since a string is usually true, you can just assert it to get the effect of an inline comment: <("this is a comment")>. But I've never used one except to show it as an example. Line ending comments are usually much clearer. (Just bear in mind you can't put the final regex delimiter on the same line, because it would be swallowed by the comment.) You could also use the {'...'} construct for comments, but then you risk warnings about "useless use of a string in void context".

The [...] is the new non-capturing bracket notation. It seems to work very well for this purpose--I tried the other brackets and they tend to "disappear" faster than square brackets. So we reserve (...) and <...> for constructs where the visual distance between opening and closing is typically shorter than for square brackets or curlies. The square brackets also work nicely when lined up vertically with vertical bars. Here's a declaration of a named rule from the class Perl6Grammar. It parses Perl 6 statements. (Think of it as a funny looking method declaration.)

    rule state  { <label>
                    [ <control>          {.control}
                    | <sideff> <eostate> {.sideff}
                    | <@other_statements>
                    ]
                };

Huffman coding says that rarer forms should be longer, and that's the case with the lookahead and lookbehind assertions, <before ...> and <after ...>. (The negations are formed via the general <!...> rule.) Note that these prepositions are interpreted as assertions, not operations. For example, <before X> is to be read "Assert that we are before X" rather than "Look before where we are for X".

The new : operator replaces the (?>...) construct. It modifies whatever comes before it, much like * does, so it's naturally scoped if the preceding atom (or quantified atom) is a bracketed construct. Parsers can use this every time they commit to the parsing of a token or phrase to tell the regex engine that there's no point in backtracking through the atom in question, so backtracking will skip backwards over the atom and continue with some earlier branch point. The following takes a long time to fail if it has to look at every sequence of "a" to see if there is a "b" after it:

    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac" =~ /^ a* b /

But we already know that the only possible match is the longest one. So if you put in the colon, it fails in one pass because the * grabs everything and gives nothing back on backtracking.

    "aaaaaaaaaaaaaaaaaaaaaaaaaaaaac" =~ /^ a*: b /

You can use colon on a longer sequence too. The following might match a list of expressions separated by comma:

    / <expr> [ , <expr> ]*: /

It is an error to use : on any atom that does no backtracking. This will help to catch errors where you've forgotten to backslash a literal colon in things like:

    /^From: (.*)/

Perl 6 has no need for a special conditional construct like Perl 5's (?(cond)yes|no). That's because with a slight tweak, ordinary alternation can do the same thing. That tweak is our next backtracking modifier, the :: operator. If you backtrack across it, it fails all the way out of the current list of alternatives. Consider an ordinary list of alternatives:

    [ <A> <X> | <B> <Y> | <C> <Z> ]

The way the rules of backtracking work, if either <A> or <X> fail, it backtracks to the next alternative. Likewise for <B> and <Y>. In the case of <C> or <Z>, there is no next alternative, so it naturally fails out of the entire construct. That's not how a conditional is supposed to work, because in the conditional, only the condition determines which case is executed. Once you've committed to a particular case, it has to stand or fall as if the conditional hadn't been there. So all we need for our purposes is to have is something that separates the assertions that matter from those that don't. That's what :: does, and it reads rather well as a "then", or as a "corresponds to". If you write

    [ <A> :: <X>
    | <B> :: <Y>
    | <C> :: <Z>
    ]

then the failure of <A>, <B>, or <C> proceeds to the next case (if any), while any failure in <X>, <Y>, or <Z> is guaranteed to backtrack out of the front of the alternative list and revise a former choice (just as the success of <X>, <Y>, or <Z> is guaranteed to "forward track" out of the end of the alternative list and try to match more). It's a natural mapping to existing regex semantics. Here's a more realistic example from the Perl 6 grammar. It parses statement modifiers. (The <ws> rule parses optional whitespace.)

    rule modifier { if     <ws> :: <expr> { .new_cond(0,$expr) }
                  | unless <ws> :: <expr> { .new_cond(1,$expr) }
                  | while  <ws> :: <expr> { .new_loop(0,$expr) }
                  | until  <ws> :: <expr> { .new_loop(1,$expr) }
                  | for    <ws> :: <expr> { .new_for($expr)   }
                  | <@other_modifiers>  # user defined
                  | <null>              # no modifier
                  }

In each case, once we recognize a keyword (and its following whitespace), we need to look for an expression, and then call a closure that builds the syntax tree. If either of those fails, the entire modifier rule fails. We only get to the last two alternatives on failure of assertions before the ::.

Note that the :: only says that we can't backtrack from the "then" into the "if". It says nothing about backtracking into the alternative list as a whole. The alternatives are still choice points, so the regex engine is allowed to backtrack into the alternative list and try another alternative. (To disable that, simply put a : after the closing bracket of the alternative list.)
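
For example, sticking a colon on an alternation commits the engine to whichever alternative it picked (a minimal sketch; <more> is just a stand-in for whatever follows):

    [ <A> :: <X>
    | <B> :: <Y>
    | <C> :: <Z>
    ]: <more>

If <more> later fails, the engine won't come back and try another alternative; the whole construct fails instead.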

There is nothing in Perl 5 corresponding to the ::: operator, but it works just like ::, only more so. If you backtrack across it, it fails all the way out of the current rule definition (though not out of any rule invoking this definition). That is, it fails all the way out of the innermost lexically enclosing /.../, m/.../, s/...//, rx/.../, or rule {...}, skipping out through any enclosing nestings of <...>, [...], or (...). (A pattern nested within a closure is classified as its own rule, however, so it never gets the chance to pass out of a {...} closure.)

Since the alternatives in our last example are at the top level of the regex, we could have used the ::: operator to get the same effect as ::, because terminating the rule and terminating the alternation amount to the same thing in that case. You can think of all of these as variants on Prolog's "cut" operator.

If you backtrack over the :::: operator, it will delete your program from the disk. ;-)

Actually, the real name of the real :::: operator is <commit>. It fails the entire match if you backtrack over it, not just the current rule. That is, it fails all the way out of the outermost dynamically enclosing /.../, m/.../, s/...//, rx/.../, or rule {...} that is executing on the current string.

There is one "cut" operator that is beyond <commit>; it is appropriately named <cut>, for two reasons. First of all, it's a real cut operator in that, if you backtrack over it, the current match fails completely, just like <commit>. But that's just a side effect of the other reason, which is that <cut> cuts off the front of the string that you're currently matching on, turning the current position into the new beginning of the string. When you're matching on a potentially infinite string, it's important that you have a way of discarding that part of the match that you've already committed to. In Perl 5, the only way to do that was with a coordinated system of s/^pat// operations. With the <cut> assertion, however, you can just match normally, and cut at one spot in your top-level rule when you reach an "accept" state.

In the realm of idle speculation, we could go as far as to define a variant of <cut> that would render s/// slightly redundant:

    s/foo/bar/;
    m/foo <replace("bar")> /

Note that we don't need any special forms for controlling the scope of a "fail" in a closure. Just prefix the closure with the appropriate backtracking operator:

    / pattern ::: { code() or fail } /  # fails entire rule

Character Class Reform

As we mentioned earlier, character classes are becoming more like standard grammar rules, because the definition of "character" is getting fuzzier. This is part of the motivation for demoting enumerated character classes and stealing the square brackets for another purpose. Actually, for old times' sake you still use square brackets on enumerated character classes, but you have to put an extra set of angles around it. But this actually tends to save keystrokes when you want to use any named character classes or Unicode properties, particularly when you want to combine them:

    Old                 New
    ---                 ---
    [a-z]               <[a-z]>
    [[:alpha:]]         <alpha>
    [^[:alpha:]]        <-alpha>
    [[:alpha:][:digit:]] <<alpha><digit>>

The outer <...> also naturally serves as a container for any extra syntax we decide to come up with for character set manipulation:

    <[_]+<alpha>+<digit>-<Swedish>>
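
For instance, presuming the usual <alpha> and <digit> classes, a rule matching identifiers might combine classes like this (a sketch using the syntax above):

    rule ident {
        <[_]+<alpha>>            # first, an underscore or a letter
        <[_]+<alpha>+<digit>>*   # then underscores, letters, or digits
    }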

State

[This section gets pretty abstruse. It's okay if your eyes glaze over.]

Every regex match maintains a state object, and any closure within the regex is actually an anonymous method of that object, which means in turn that the closure's topic is the current state object. Since a unary dot introduces a method call on the current topic, it follows that you can call any method in the state object that way:

    /(.*) { print .pos }/       # print current position

The state object may in fact be an instance of a grammar class. A grammar object has additional methods that know how to build a parse tree. Its rules also know how to refer to each other or to rules of related grammars.

Note that $_ within the closure refers to this state object, not the original search string. If you search on the state object, however, it pretends that you wanted to continue the search on the original string. If the internal search succeeds, the position of the external state is updated as well, just as if the internal search had been a rule invoked directly from the outer regex.

Because the state object is aware of how the tree is being built, when backtracking occurs the object can destroy parts of the parse tree that were conjectured in error. Because the grammar's action methods have control of the regex state, they can access named fields in the regex without having to explicitly pass them to the method call.

For instance, in our earlier example we passed $expr explicitly to build the parse tree, but the method can actually figure that out itself. So we could have just written:

    rule modifier { if     <ws> :: <expr> { .new_cond(0) }
                  | unless <ws> :: <expr> { .new_cond(1) }
                  | while  <ws> :: <expr> { .new_loop(0) }
                  | until  <ws> :: <expr> { .new_loop(1) }
                  | for    <ws> :: <expr> { .new_for }
                  | <@other_modifiers>  # user defined
                  | <null>              # no modifier
                  }

See Variable Scoping below for where @other_modifiers gets looked up.

Within a closure, $_ represents the current state of the current regex, and by extension, the current state of all the regexes participating in the current match. (The type of the state object is the current grammar class, which may be an anonymous type if the current grammar has no name. If the regex is not a member of a grammar, it's of type RULE.) Part of the state of the current regex is the current node of the parse tree that is being built. When the current regex succeeds, the state object becomes a result object, and is returned to the calling regex. The calling regex can refer to the returned object as a "hypothetical" variable, the name of which is either implicitly generated from the name of the rule, or explicitly bound using :=. Through that variable you can get at anything captured by the subrule. (That is what $expr was doing earlier.)

When the entire match succeeds, the top-level node is returned as a result object that has various values in various contexts, whether boolean, numeric, or string context. The name of the result object is $0. The result object contains all the other information, such as $1, $2, etc. Unlike $& in Perl 5, $0 is lexically scoped to the enclosing block. By extension, $1, etc. are also lexically scoped.

As a kind of iterator, a regex stored in a variable doesn't expand in list context unless you put angles around it or use it with m//:

    $rx = /(xxx)/;
    print 1,2,<$rx($_)>;
    print 1,2,</(xxx)/>;
    my &rx := /(xxx)/;
    print 1,2,<rx($_)>;

$0, $1, etc. are not set in iterated cases like this. Each list item is a result object, though, and you can still get at the internal values that way.

Hypothetical Variables, er Values

Values that are determined within a regular expression should usually be viewed as speculative, subject to cancellation if backtracking occurs. This applies not only to the values captured by (...) within the regex, but also to values determined within closures embedded in the regex. The scope of these values is rather strange, compared to ordinary variables. They are dynamically scoped, but not like temp variables. A temporary variable is restored at the end of the current block. A hypothetical variable keeps its value after the current block exits, and in fact keeps that value clear to the end of its natural lifetime if the regex succeeds (where the natural lifetime depends on where it's declared). But if failure causes backtracking over where the variable was set, then it is restored to its previous state. Perl 5 actually coerced the local operator into supporting this behavior, but that was a mistake. In Perl 6 temp will keep consistent semantics, and restore values on exit from the current block. A new word, let, will indicate the desire to set a variable to a hypothetical value. (I was tempted to use "suppose", but "let" is shorter, and tends to mean the same thing, at least to mathematicians.)

    my $x;
    / (\S*) { let $x = .pos } \s* foo /

After this pattern, $x will be set to the ending position of $1--but only if the pattern succeeds. If it fails, $x is restored to undef when the closure is backtracked. It's possible to do things in a closure that the regex engine doesn't know how to backtrack, of course, but a hypothetical value doesn't fall into that category. For things that do fall into that category, perhaps we need to define a BACK block that is like UNDO, but scoped to backtracking.

Sometimes we'll talk about declaring a hypothetical variable, but as with temp variables, we're not actually declaring the variable itself, but the dynamic scope of its new value. In Perl 6, you can in fact say:

    my $x = 0;
    ...
    {
        temp $x = 1;    # temporizes the lexical variable
        ...
    }
    # $x restored to 0

(This is primarily useful for dynamically scoping a file-scoped lexical, which is slightly safer than temporizing a package variable since nobody can see it outside the file.)

You may declare a hypothetical variable only when the topic is a regex state. This is not as much of a hardship as it might seem. Suppose your closure calls out to some other routine, and passes the regex state as an argument, $rx_state. It suffices to say:

    given $rx_state { let $x = .pos }

As it happens, $1 and friends are all simply hypothetical variables. When we say "hypothetical variable", we aren't speaking of where the variable is stored, but rather how its contents are treated dynamically. If a regex sets a hypothetical variable that was declared with either my or our beforehand, then the regex modifies that lexical or package variable, and let is purely a run-time operation.

On the other hand, if the variable is not pre-declared, it's actually stored in the regex state object. In this case, the let also serves to declare the variable as lexically scoped to the rest of the regex, in addition to its run-time action. Such a variable is not directly visible outside the regex, but you can get at it through the $0 object (always presuming the match succeeded). For a regex variable named $maybe, its external name is $0._var_{'maybe'}. The $0 object can behave as a hash, so $0{maybe} is the short way to say that.

All other variable names are stored with their sigil, so the external name for @maybe is $0{'@maybe'}, and for %maybe is $0{'%maybe'}.
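
Putting those pieces together with the earlier .pos example, a minimal sketch of getting at both kinds of results from outside:

    "bar foo" =~ / (\S+) { let $mark = .pos } \s* foo /;
    print $0[1];        # the capture, through the array interface
    print $0{mark};     # the hypothetical $mark, by its external name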

$1 is a special case--it's visible outside the regex, not because it's predeclared, but because Perl already knows that the numbered variable $1 is really stored as a subarray of the $0 object: $0[1]. The numbered variables are available only through the array, not the hash.

Since $0 represents the state of the currently executing regex, you can't use it within a rule to get at the result of a completed subrule. When you successfully call a subrule named <somerule>, the regex state is automatically placed in a hypothetical variable named $somerule. (Rules accessed indirectly must be captured explicitly, or they won't have a name by which you can get to them. More on that in the next section.)

As the current recursive regex executes, it automatically builds a tree of hashes corresponding to all captured hypothetical variables. So from outside the regex, you could get at the $1 of the subrule <somerule> by saying $0{somerule}[1].

Named Captures

Suppose you want to use a hypothetical variable to bind a name to a capture:

    / (\S+) { let $x := $1 } /

A shorthand for that is:

    / $x:=(\S+) /

The parens are numbered independently of any name, so $x is an alias for $1.

You may also use arrays to capture appropriately quantified patterns:

    / @x := (\S+ \s*)* /                # including space
    / @x := [ (\S+) \s* ]* /            # excluding space
    / @x := [ (\S+) (\s*) ]* /          # each element is [word, space]

Note that in general, naming square brackets doesn't cause the square brackets to capture, but rather provides a destination for the parens within the square brackets. Only parens and rules can capture. It's illegal to name square brackets that don't capture something inside.

You can also capture to a hash:

    / %x := [ (\S+)\: \s* (.*) ]* /     # key/value pairs

After that match, $1 returns a list of keys, and $2 returns a list of values. You can capture just the keys:

    / %x := [ (\S+) \s* ]* /            # just enter keys, values are undef

You can capture a closure's return value too:

    / $x := { "I'm in scalar context" } /
    / @x := { "I", "am", "in", "list", "context" } /
    / %x := { "I" => "am in hash context" } /

Note that these do not use parens. If you say:

    / $x := ({ code }) /

it would capture whatever text was traversed by the closure, but ignore the closure's actual return value.

You can reorder paren groups by naming them with numeric variables:

    / $2:=(.*?), \h* $1:=(.*) /

If you use a numeric variable, the numeric variables will start renumbering from that point, so subsequent captures can be of a known number (which clobbers any previous association with that number). So for instance you can reset the numbers for each alternative:

    / $1 := (.*?) (\:)  (.*) { process $1, $2, $3 }
    | $1 := (.*?) (=\>) (.*) { process $1, $2, $3 }
    | $1 := (.*?) (-\>) (.*) { process $1, $2, $3 }
    /

It's also possible to refer to captures relative to the current location. $-1 refers to the immediately preceding capture (what used to be known as $+). $-2 refers to the one before that. If you use anything above $-3 we'll come and take you away to the insane asylum.

Subrules called via <rule> also capture their result in hypothetical variables. It's possible to name the results of any <...>, but grammar rules already have a name by default, so you don't have to give them names unless you call the same rule more than once. So, presuming you have grammar rules defining "key" and "value", you can say:

    / <key> \: <value> { let %hash{$key} = $value } /

Of course, in a typical grammar the typical rule may not return a string, but a reference to an anonymous object representing a node of the parse tree. But that depends on what the subrule decides to capture. If the only thing captured in the subrule is a single string, that's what you get. (If nothing is captured, you get the entire match.)

Any capture that will capture multiple items will, if asked to put it into a scalar variable, produce an anonymous list automatically. This should rarely be a surprise, since it's obvious by inspection whether you've quantified the capture or not. So if you say any of:

    / $x := <word>*/
    / $x := <word>+/
    / $x := <word><1,3>/

then you've "pluralized" the naming, and you can expect to get some number of values in $x as an anonymous list. However, the ? quantifier specifically doesn't pluralize. If you say:

    / $x := <word>?/

then $x will either be the result of the subrule or undef.
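
A sketch of the difference, presuming a <word> subrule that also consumes any trailing whitespace, and predeclared variables so the results are visible outside the regex:

    my ($x, $y);
    $text =~ / $x := <word>* /;     # pluralized: $x is an anonymous list
    for @$x { print }               # one result object per word
    $text =~ / $y := <word>? /;     # not pluralized: one result or undef
    print $y // "<no word>";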

You can name the results of a zero-width assertion, but you'd typically only get the null string out of it. This can still be useful, since it contrasts with the undefined value you'd have if the assertion fails. (It is possible with an explicit capture to return a non-zero-width string from a zero-width assertion, however.)

Variable Scoping

When you refer to a variable @foo as an rvalue in a regex, it searches for an existing variable in the following places:

  1. We first look to see if the variable is already declared lexically with either my @foo or our @foo. If so declared, we're done.

  2. Next we look for @foo in the current regex's name table. The name of the variable is really $0{'@foo'}.

  3. If the regex belongs to a grammar, we next look for @foo in the grammar object. If there, its real name is @.foo, or some such. (It might be objected that the grammar object is not yet constructed when the regex is compiled. After all, the regex is probably being passed to the grammar object's constructor. But I think if such a variable is declared as an object attribute we know that there will be such a variable/accessor later when we have finished constructing, and that seems like enough info to know how to compile the regex.)

  4. Next we look for @foo as a declared core global variable @*foo.

  5. Finally, if "strict vars" is not in effect, we assume that @foo is stored in the current package. Otherwise it's a stricture error.
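
To illustrate rule 1, and where a binding ends up when nothing is predeclared (rule 2), a minimal sketch:

    my @names = ("fee", "fie", "foe");
    / @names /                      # rule 1: finds the lexical @names

    / @words := [ (\w+) \s* ]* /
    print $0{'@words'};             # rule 2: an undeclared @words lives
                                    # in the regex's own name table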

Variable Interpretation

As we mentioned earlier, bare scalars match their contents literally. (Use <$var> instead to match a regex defined in $var.) Subscripted arrays and hashes behave just like a scalar as long as the subscripts aren't slices.

If you use a bare array (unsubscripted), it will match if any element of the array matches literally at that point. (A slice of an array or hash also behaves this way.) If you say

    @array = ("^", "$", ".");
    / @array /

it's as if you said

    / \^ | \$ | \. /

But if you slice it like this:

    / @array[0..1] /

it won't match the dot.

If you want the array to be considered as a set of regex alternatives, enclose in angles:

    @array = ("^foo$", "^bar$", "^baz$");
    / <@array> /

Bare hashes in a regex provide a sophisticated match-via-lookup mechanism. Bare hashes are matched as follows:

  1. Match a key at the current point in the string.
    1a. If the hash has its keymatch property set to some regex, use that regex to match the key.

    1b. Otherwise, use /\w+:/ to match the key.

  2. If a key isn't found at the current position in the string, the match fails.

  3. Otherwise, get the value in the hash corresponding to the matched key.

  4. If there is no entry for that key, the match fails.

  5. If the hash doesn't have a valuematch property, the match succeeds immediately.

  6. Otherwise use the hash's valuematch property (typically itself a regex) to extract the value at the current point in the string.

  7. If no value can be extracted, matching of the hash fails.

  8. If the extracted value string is eq to the key's actual value, matching of the original hash immediately succeeds.

  9. Otherwise, matching of the original hash fails.

So matching a bare hash is equivalent to:

    rule {
        $key := <{ %hash.prop{keymatch} // /\w+:/ }>    # find key
        <( exists %hash{$key} )>                        # if exists
        [ <( not defined %hash.prop{valuematch} )> ::   # done?
            <null>                                      # succeed
        |                                               # else
            $val := <%hash.prop{valuematch}>            # find value
                <( $val eq %hash{$key} )>               # assert eq
        ]
    }

A typical valuematch might look like:

    rule {
        \s* =\> \s*             # match => 
        $q:=(<["']>)            # match initial quote 
        $0:=( [ \\. | . ]*? )   # return matched value
        $q                      # match trailing quote 
    }

In essence, the presence or absence of the valuematch property controls whether the hash tries to match only keys, or both keys and values.

A hash may be used inside angles as well. In that case, it finds the key by the same method (steps 1 and 2 above), but always treats the corresponding hash value as a regex (regardless of any properties the hash might have). The parse then continues according to the rule found in the hash. For example, we could parse a set of control structures with:

    rule { <%controls> }

The %controls hash can have keys like "if" and "while" in it. The corresponding entry says how to parse the rest of an if or a while statement. For example:

        %controls = ( 
            if     => / <condition>      <closure> /,
            unless => / <condition>      <closure> /,
            while  => / <condition>      <closure> /,
            until  => / <condition>      <closure> /,
            for    => / <list_expr>      <closure> /,
            loop   => / <loop_controls>? <closure> /,
        );

So saying:

    <%controls>

is really much as if we'd said:

    [ if     \b <%controls{if}>
    | unless \b <%controls{unless}> 
    | while  \b <%controls{while}>
    | until  \b <%controls{until}> 
    | for    \b <%controls{for}> 
    | loop   \b <%controls{loop}>
    ]

Only it actually works more like

    / $k:=<{ %controls.prop{keymatch} // /\w+:/ }> <%controls{$k}> /

Note that in Perl 6 it's perfectly valid to use // inside an expression embedded in a regex delimited by slashes. That's because a regex is no longer considered a string, so we don't have to find the end of it before we parse it. Since we can parse it in one pass, the expression parser can handle the // when it gets to it without worrying about the outer slash, and the final slash is recognized as the terminator by the regex parser without having to worry about anything the expression parser saw.

A bare subroutine call may be used in a regex, provided it starts with & and uses parentheses around the arguments. The return value of the subroutine is matched literally. The subroutine may have side effects, and may throw an exception to fail.
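
A small sketch (the marker routine is hypothetical):

    sub marker { return "--END--" }
    / ^ &marker() \n? $ /       # the returned string "--END--" is
                                # matched literally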

Defining Your Own Rules

Suppose your name is Hugo and you don't like to use ! to negate an assertion. You can define your own assertion like this:

    my rule not (str $rx) { <!<{"<$rx>"}>> }            # define Hugo not
    / <not [a-z]> /     # same as <![a-z]>

That rule would be lexically scoped because of the my. If you think it looks like a sub declaration, you're right. In fact, it's possible you could even declare it anonymously like a closure:

    my $not = rule (str $rx) { <!<{"<$rx>"}>> };
    / <$not tonight dear> /

But maybe you don't want it lexically scoped because you're writing a grammar for general use:

    grammar HugoGrammar {
         rule not ($rx) { <!$rx> }
         rule identifier { <not before \d> \w+ }
         rule \j { \c[LF] }
         rule parse { ^ <identifier> \j $ }
    }
    HugoGrammar.parse($line);

In this case a rule is simply a method in a grammar class, and a grammar class is any class derived implicitly or explicitly from the universal RULE grammar class. The built-in regex assertions like <before \w> are really just calls to methods in the RULE class. The namespace of a grammar is simply the method namespace of the current class, which is the class's methods plus all inherited methods.

In addition to normal subrules, we allow some funny looking method names like:

    rule :a { ... }
    rule \a { ... }

Modules that mutate Perl's grammar on the fly can do so by deriving an anonymous grammar class from the default Perl6Grammar, and installing extra rules on the fly. The current regex state then continues parsing the rest of the lexical scope using some rule from the new rule set. Subsequent grammatical mutations will be derived from the current anonymous grammar unless you switch explicitly to an entirely different grammar.
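
The exact spelling of such a mutation is still speculative, but its shape might be something like this sketch (the is Perl6Grammar derivation and the qualified rule call are guesses at the notation):

    grammar MyDialect is Perl6Grammar {
        rule statement {
            snark <ws> <expr>               # some new statement form
          | <Perl6Grammar::statement>       # fall back to standard Perl
        }
    }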

Since we're writing grammar rules as if they were methods, we have access to the full syntax of method declaration, including formal parameter lists and compile-time properties. So we can easily annotate rules with pragmatic information such as operator precedence levels when you don't want to write a strictly recursive-descent parser, for instance. (And we don't want to, for Perl.)

Accepted RFCs

RFC 072: Variable-length lookbehind.

This seems good to me. It's just a SMOP to reverse the ordering of nodes in the syntax tree, and I think we can pretty well determine when it's impossible to reverse the tree. The operation of a reversed syntax tree will not be totally transparent, however, so it will be necessary to document that quantifiers will actually be working right-to-left rather than left-to-right. (It's probably also a good idea to document that many syntactic constructs can't actually be reliably recognized in reverse. An attempt to do so probably means you needed to do a lookahead earlier, rather than a lookbehind later.)

The syntax of lookbehind uses the new assertion syntax:

    <after ...>         # positive lookbehind
    <!after ...>        # negative lookbehind

Yes, the pos() function could return multiple values in list context, but I think it's more reasonable for the individual captured elements to know where their positions are. The pos function is really just a special case of a more general data structure contained in the regex result object from the last successful match. In which case, maybe it really needs to have a better name than pos. Maybe $0 or something. Then we get $0.beg and $0.end, $1.beg, and $1.end, etc. Since @$0 returns a list of captures, you can do @$0^.beg and @$0^.end if you want a list of beginnings and endings. Did I mention that the magical @+ and @- arrays are gonna be real dead? Never could remember which one was which anyway...

RFC 093: Regex: Support for incremental pattern matching

I don't think this proposal is powerful enough. "Infinite" strings are a more powerful concept. But I don't think infinite strings are powerful enough either!

We're certainly going to have "infinite" arrays for which missing elements are defined by a generator (where the action could be as simple as reading more data from some other source). We could do the same thing for strings directly, or we could define strings that are implemented underneath via arrays (of strings or of stringifiable objects), and achieve infinitude that way. This latter approach has the benefit that the array element boundaries could be meaningful as zero-width boundaries between, say, tokens in a token stream. We're thinking that <,> could match such a boundary.

But beyond that, such arrays-as-strings could allow us to associate hidden metadata with the tokens, if the abstract string is constructed from a list of objects, or a list of strings with properties. This is typically how a parser would receive data from a lexical analyzer. It's the parser's job to transform the linear stream of objects into a parse tree of objects.

Matching against such boundaries or metadata would not be possible unless either the regex engine is aware that it is matching against an array, or the string emulation provides visibility through the abstract string into the underlying array. The latter may be preferable, since (by the rules of the =~ matrix discussed in Apocalypse 4) @array =~ /regex/ is currently interpreted as matching against each element of the array individually rather than sequentially, and there are other uses for a string that's really an array. In fact, @array =~ /regex/ could conceivably be matching against a set of infinite strings in parallel, though that seems a bit scary.

Even if we don't care about the boundaries between array elements, this approach gives us the ability to read a file in chunks and not worry that the pattern won't match because it happens to span a boundary.

It might be objected that matching against a subroutine rather than an infinite string or array has the benefit of not promising to keep around the entire string or array in memory. But this is not really a feature, since in general a regex can potentially backtrack all the way to the beginning of the string. And there's nothing to say that the front of the infinite string or array has to stay around anyway. Whether to throw away the head of a string or array should really depend on the programmer, and I don't think there's a more intuitive way to manage that than to simply let the programmer whack off the front of the string or array using operators like substr or splice, or the new <cut> assertion.

Indeed, the very existence of the string/array precludes the caching problem that RFC 316 complains about.

The question remains how to declare such a string/array. If we decided to do a magical name identification, we could conceivably declare

    my $@array;

and then both $array and @array would refer to the same object, treated as a string when you say $array and as an array when you say @array. One is tempted to set up the input routine by saying

    my $@array is from { <$input> };

Additional lines (or chunks) would then come from the <$input> iterator.

But really, the infinite nature of the array is a feature of the underlying object, not the variable. After all, we want to be able to say

    @array := 1..Inf;

even with an ordinary array.

So we could even make this work:

    my $@array := <$input>;

But I think we need to break the aliasing down, which will give us more flexibility at the expense of more verbiage:

    my @array := <$input>;              # @array now bound to iterator
    my $array is ArrayString(@array);   # an ordinary tie

That would let us do cool and/or sick things like this:

    my @lines := <$article>;
    my $_ is ArrayString(@lines);
    s/^ .*? \n<2,> //;  # delete header from $_ AND @lines!
    for @lines { ... }  # process remaining lines

The for loop potentially runs forever, since @lines is implicitly extended from an iterator. The array is automatically extended on the end, but it's not automatically shifted on the front. So if you really did want the loop to run forever without exhausting memory, you'd need to say something like:

    substr($_, 0, $_.pos, "");

The same effect can be effected within a regex by asserting <cut>, which makes the current position the new string beginning. (If you backtrack over <cut>, the entire match will fail.)

RFC 110: counting matches

I think we can avoid using any options if we make a pattern count matches when used in a numeric context. If in doubt, make it explicit:

    $count = +/foo/;

If it turns out we do need an option, it'll probably be :n.

RFC 112: Assignment within a regex

This RFC is basically covered by the $foo:=(...) notation, plus variations. The RFC claims that such assignments are not done till the end, except that they are done ahead of closures. I'd rather state it the other way around: it always appears that the current hypothetical binding is assigned if you check, but as long as the optimizer can determine that you aren't looking, it doesn't have to keep up appearances. Contrariwise, if $foo is just a fancy way of saying $1, there may in fact be no more overhead in maintaining $foo than $1. Either is really just pointing into a table of offsets into the string. That's assuming we get the scoping right on hypothetical variables.

Some excerpts from the RFC:

The camel and the docs include this example:

    if (/Time: (..):(..):(..)/) {
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

This then becomes:

      /Time: (?$hours=..):(?$minutes=..):(?$seconds=..)/

Now that looks like this:

      /Time\: $hours:=(..) \: $minutes:=(..) \: $seconds:=(..)/

It may be appropriate for any assignments made before a code callout to be localized so they can be unrolled should the expression finally fail.

Rather than localized (or temporized), they are hypothesized.

The first versions of this RFC did not allow for backrefs. I now think this was a shortcoming. It can be done with (??{quotemeta $foo}), but I find this clumsy, a better way of using a named back ref might be (?\$foo).

Backrefs are now unified with hypothetical variables, so the issue doesn't arise. Just use $foo.
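
So a doubled-word matcher now reads (a minimal sketch):

    / $word := (\w+) \s+ $word /    # the second $word matches the
                                    # captured contents literally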

Using this method for capturing wanted content, it might be desirable to stop ordinary brackets capturing, and needing to use (?:...). I therefore suggest that as an enhancement to regexes that /b (bracket?) ordinary brackets just group, without capture - in effect they all behave as (?:...).

There's no need for a /b now that we have [...] for non-capturing brackets.

RFC 144: Behavior of empty regex should be simple

I agree, the behavior should be simple. However, rather than always matching, I propose to make it an error to actually have a null pattern, or a null choice in a list of alternatives. Use an explicit <null> if that's what you mean. (It's not a problem if $foo is null in /$foo/, since variables are now managed by the regex engine and not by interpolation.)
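
Sketching the difference:

    / foo |        /        # error: null alternative
    / foo | <null> /        # explicit: match "foo" or the null string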

RFC 150: Extend regex syntax to provide for return of a hash of matched subpatterns

The $foo:=(...) notation essentially covers that case. One can say:

    / %hash{foo}:=(...) %hash{bar}:=(...) /

Fancier things can be done with closures.

RFC 156: Replace first match function (?...?) with a flag to the match command.

Having a :f modifier seems like a reasonable way to do it:

    m:f/.../

Though it's vaguely possible we should be having a set of verbs that parse like split:

    split /.../
    count /.../
    first /.../

It's not clear whether those are actually methods, and if so, on which object, the string or the regex. In any event, I don't think we have to nail that down quite yet. I'm accepting the basic premise of this RFC that the ?...? construct is going away, one way or another.

At the moment, it looks like this option is spelled :once.

RFC 165: Allow Variables in tr///

If interpolation of patterns by default is wrong, I think extending the tr/// interface via scalar interpolation is doubly wrong. Run-time generated transliterations should be based on mappings that aren't so position dependent. That is, rather than specifying it as two long lists:

    abc12xyz => ABC34XYZ

we specify something more like this:

    abc => ABC
    12  => 34
    xyz => XYZ

That looks more like a list of pairs of scalars than a pair of scalars. In fact, internally, it's done like a funny parallel substitution:

    s:e(/a/A/,
        /b/B/,
        /c/C/,
        /1/3/,
        /2/4/,
        /x/X/,
        /y/Y/,
        /z/Z/)

In any event, it's more like tr/@foo/@bar/ than tr/$foo/$bar/. But then, why stick with the fake string notation? Why not just say tr(@foo,@bar) if that's what we mean? Then we're not limited to character substitutions:

    $string.tr [ " "     , "<"   , ">"   , "\n"   ],
               [ "&nbsp;", "&lt;", "&gt;", "<br>" ];

Or how about tr(%trans)?

    %upper = {
        "a-z" => "A-Z",
    }
    $string.tr %upper;

or just pair lists of some sort:

    $string.tr("a-c" => "x-z",
               "1-2" => "3-4",
               "A-C" => "X-Z",
              );
    @trans = [
        "a-z" => "A-Z",
        @tr_danish,
    ];
    $string.tr(@trans)

RFC 166: Alternative lists and quoting of things

Alternative lists of literals are included simply by mentioning the array:

    /@names/

Alternative lists of subrules are included with:

    /<@names>/

There's no longer any need for quoting constructs because variables match as literals by default. You have to use angle brackets to get interpretation of a string as a subrule. (But it's still preferable to precompile your regexen.)

RFC 191: smart container slicing

As proposed, this might prevent us from using a regex object as a key to a hash. However, with some tweaking, it'll fit in with how slicing is done in Perl 6.

Perl 6 will DWIM subscripts based on their appearance. Obviously,

    %hash{"foo"}

has a single subscript. And just as obviously,

    %hash{"a" .. "z"}

has 26 subscripts or so. In the absence of any scalar guidance, a subscript will be interpreted in list context. So

    %hash{ @array }

will automatically slice on the list of keys in the array. Any function will be called in a list context by default, giving it the opportunity to return multiple values. Perl 6 subscripts are naturally biased toward slicing. To unbias it, here are some of the specifically recognized subscripts:

    %hash{"foo"}        # scalar literal
    %hash{bar}          # scalar literal
    %hash{1}            # scalar literal
    %hash{$x}           # scalar variable
    %hash{\$x}          # scalar reference
    %hash{["a", "b"]}   # array reference
    %hash{{"a" => "b"}} # hash reference
    %hash{ "a" => "b" } # pair reference
    %hash{ /pat/ }      # rule reference
    %hash{ _ expr }     # force expr to return a single string
    %hash{ + expr }     # force expr to return a single number

Boolean expressions and closures look like singular values but cause a match against all possible values of the subscript.

    %hash{ ?1 }         # select all subscripts
    %hash{ ?/pat/ }     # select subscripts for which pat matches
    %hash{ $_ =~ /pat/ }# select subscripts for which pat matches
    %hash{ $_ ge "a" }  # select lowercase keys (assuming ASCII)
    %hash{ .ge "a" }    # same thing, maybe
    %hash{ { expr } }   # select subscripts for which closure returns true

Multiple slice subscripts are separated by semicolons, so that you can use commas within each slice subscript for list building. This is more important for multi-dimensional arrays:

    my @array is dim(9,9,9) = cubic();
    @3d_slice = @array[ @x; @y; @z ];
    @3d_slice = @array[ 0,1,3,8 ; 0,1,3,8; ?1 ];
    @3d_slice = @array[ 0..9 ; 0..9:-1; ?test($_) ];
    @3d_slice = @array[ !($_ % 2) ; 0..9:3; ?test($_) ];

RFC 274: Generalized Additions to Regexs

This proposal has significant early/late binding issues. A definition that forces run-time overhead is not as useful as it might be. On the other hand, a pure compile-time mechanism is not as general as it might be--but a compile-time mechanism can always compile in a run-time mechanism if it chooses to defer evaluation.

So it seems like this is a good place for syntactic warpage of some sort or other. That would make it possible to do both compile-time and run-time bindings. We'll be using the <...> notation for our extensible syntax, and the grammar rules for parsing that particular part of Perl syntax will be just as easy to tweak as any other Perl grammar rule.

That being said, the very fact that we can associate a grammar with the regex means that it's easy to define any instance of <word> to mean whatever you want it to. (In a sense, these subrules are the very callbacks that the RFC proposes.) These subrules can be bound either at Perl compile time or at Perl run time. They can be defined to take a string, regex, or Perl expression as an argument. The latter two cases are efficient because they come in as a regex or code reference respectively.

Following on, if (?{...}) etc code is evaluated in forward match, it would be a good idea to likewise support some code block that is ignored on a forward match but is executed when the code is unwound due to backtracking.

Yes, though hypothetical values take some of the pressure off for this. But if a closure contained a BACK block, it could be automatically fired off on backtracking. As with LAST et al., I suppose there's a corresponding back property on variables. In a sense, saying

    let $var = $newval

is much like saying

    our $var is back { .set($oldval) } = $newval

except that $var may well be stored in the regex state object rather than in a global symbol table.

RFC 276: Localising Paren Counts in qr()s.

I agree totally. As for the problem of pulling captures out of a subrule, it's up to the subrule to determine what it "returns". We could make some intelligent defaults, though different kinds of rules might want different defaults. One approach might be to say that if there is a single capture, that is returned as the result. If there is no capture, it's as if the entire subpattern were captured. If there are multiple captures, they are returned as an anonymous list. So $1 from such a subrule might come through like this:

    / $sub:=<subrule> { print $sub[1] } /

or just:

    / <subrule> { print $subrule[1] } /

But named captures and named rules intrude on this idyllic picture. You'd also like a default anonymous hash value returned that is keyed by all the named captures or rules. The question is whether that forces numbered captures to come through the hash interface. Or maybe that's just always the case, so to get at $1 of a subrule, you'd say:

    / $sub:=<subrule> { print $sub{'1'} } /

But there are reasons for wanting to treat the result object as an array, so that

    / $sub:=<subrule> { process(@$sub) } /

processes all the numbered captures from the subrule. So I think the return object behaves either like a hash or an array as appropriate. (Note that such an array might be declared to have an origin at 1 rather than 0.)

RFC 317: Access to optimisation information for regular expressions

Seems like a no-brainer. All such information is likely to be available to Perl anyway, given that we'd like to do the parser, optimizers, and code generators in Perl if at all possible.

RFC 331: Consolidate the $1 and \1 notations

I like the title of this RFC. It fits in with the new policy that my variables are introduced immediately. However, there are certain difficulties with the proposed implementation. The statement-by-statement setting of the @/ array looks pretty ugly to me. I'd rather have a consistent view of hypothetical variables that can live on outside the regex in question without regard to statement boundaries. In the rare event that someone needs to refer to $1 (or anything else) from a prior regex, a temporary variable should be used.

RFC 332: Regex: Make /$/ equivalent to /\z/ under the '/s' modifier

Another RFC that is accepted in principle, but that doesn't go far enough. The /s modifier is going away, along with /m. A $ will always mean end-of-string, and $$ will match at the end of any line. (The current process id is now $*PID, by the way, so there's no conflict there. But how often do you want to write a pattern to match the current process id anyway?)
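
A quick sketch of the contrast:

    "one\ntwo" =~ / one $ /;     # fails: $ matches only at end of string
    "one\ntwo" =~ / one $$ /;    # succeeds: $$ matches at any line ending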

RFC 348: Regex assertions in plain Perl code

This RFC makes some good points, though the code assertion syntax will be:

    <( code )>

The RFC also makes a case for getting rid of the special behavior of local in Perl 5, which treated local differently within a regex. However, something very like the local behavior will still be needed for making hypotheses, though the RFC is correct that it's not needed in the typical code assertion. In Perl 6, localization is done with temp, and it will not do the hypothetical variable hack that Perl 5 did. Instead there will be an explicit lvalue modifier, let, which specifically requests that a variable's value be scoped to the success of the current point in the regex. These hypothetical variables actually have much broader use than this RFC suggests.

Perl 5's hardwired use of $^R just translates to an appropriately named hypothetical variable in Perl 6.

RFC 360: Allow multiply matched groups in regexes to return a listref of all matches

I think that parens that can potentially match multiple times will automatically produce a list rather than matching the final one. It's not as if we can't tell whether something's inside a quantifier...

Here's the RFC's proposed solution:

    while ($text =~ /name:\s*(.*?)\n\s*
                    children:\s*(?:(?@\S+)[, ]*)*\n\s*
                    favorite\ colors:\s*(?:(?@\S+)[, ]*)*\n/sigx) {
        # now we have:
        #  $1 = "John Abajace";
        #  $2 = ["Tom", "Dick", "Harry"]
        #  $3 = ["red", "green", "blue"]
    }

Apart from the change in behavior of (...) within a quantifier, I have the urge to rewrite this example for several reasons:

  • The /x and /s flags no longer exist.
  • The /i and /g flags must be pulled out to the front for visibility.
    (And the /g flag is renamed :e.)
  • There's now a \h for horizontal whitespace, and \H for the negation
    of that. (Not that the RFC is incorrect to use \s.)
  • The negation of \n is now \N.
  • The : character is now a metacharacter, and so must be backslashed.
  • Character classes are now represented with <[...]>.
  • Grouping is now represented with [...].

With these changes, and making better use of whitespace, the sample regex ends up looking like this:

    for ($text =~ m:ie[
                            name             \: \h*   (\N*?)            \n
                        \h* children         \: \h* [ (\S+) <[,\h]>* ]* \n
                        \h* favorite\ colors \: \h* [ (\S+) <[,\h]>* ]* \n
                      ]
          )
    {
             # now we have:
             #  $1 = "John Abajace";
             #  $2 = ["Tom", "Dick", "Harry"]
             #  $3 = ["red", "green", "blue"]
    }

I think in the long run people will find this more readable once they're used to it. Certainly tabularizing the parallelisms will make any typing errors stand out.

RFC 361: Simplifying split()

The RFC makes five suggestions. I'll consider them one by one.

The first argument to split is currently interpreted as a regexp, regardless of whether or not it actually is one. (Yes, split '.', $foo doesn't split on dot -- it's currently the same as split /./, $foo.) I suggest that split be changed to treat only regexps as regexps, and everything else as literals.

Fine, I think. If the first argument to split is untyped, it should parse correctly, either evaluating a quoted string immediately or deferring interpretation of a regex. One could even do something like split on the first delimiter matched by another pattern:

    split _/(,|;)/;

That would split on either all commas or all semicolons, depending on which it found first in the string. The _ forces the regex to return a string, which is whatever was captured by the parens in this case.

Empty trailing fields are currently suppressed (although a -1 as the third argument disables this). I suggest that empty trailing fields be retained by default.

Probably okay, though we need a way to translate old code. It was originally done this way because split on whitespace would typically return an extra field after the newline. But most newlines will be prechomped in Perl 6.

When not in list context, split currently splits into @_. I suggest that this side-effect be removed.

Fine. It's easy enough to translate to an explicit assignment.

split ?pat? in any context currently splits into @_. I suggest that this side-effect be removed.

Fine. I don't think anyone uses that.

split ' ' (but not split / /) currently splits on whitespace, but also removes leading empty fields. I suggest that this irregularity be removed.

The question is, what to replace it with, since it's a very handy construct. We could use a different conventional pattern:

    @array = split /<ws>/, $string;

Or we could say that it's now a split on whitespace only if the split argument is unspecified. That wouldn't work very well with the old syntax, where we often have to supply the second argument. But given that the =~ operator now serves as a topicalizer for any term, we could translate:

    @array = split ' ', $string;

to this:

    @array = $string =~ split;

Oddly, this probably also works:

    $string =~ (@array = split);

or maybe even this:

    @array = split given $string;

But I think I like the OO notation better here anyway:

    @array = $string.split;

In fact, split may not be a function at all. The default split might just be a string method and use unary dot:

    @array = .split;

We still have the third argument to deal with, but that's likely to be specified like this:

    @array = $string.split(limit => 3);

We could conceivably make a different method for word splitting, much like REXX does:

    @array = .words;

Then a limit could be the first argument:

    @array = .words(3);

But there almost doesn't need to be such a method, since

    @array = m/ [ (\S*) \s* ]* /;

will do the right thing. Admittedly, a .words method would be much more readable...

Fortunately, split is a function, so I can put off that decision till Apocalypse 29. :-)

Rejected RFCs

RFC 135: Require explicit m on matches, even with ?? and // as delimiters.

Squish that gnat... :-)

A decent Perl parser is still going to have to keep track of whether a term or an operator is expected. And while we're simplifying the grammar in many ways, it's also the case that we're letting users install their own grammar rules to perform syntactic warpage. Besides, people like to write patterns with /.../. So rather than impoverishing Perl's syntax artificially, let's make the standard parser more accessible by writing it all in Perl 6 regexes.

RFC 145: Brace-matching for Perl Regular Expressions

Good problem, not-so-good solution from a complexity point of view. I'd like to leverage existing character class and backref notations maybe. If there were simply some way to tell a backref to invert any match characters, that might do it. Or maybe reverse them when you remember them, and leave the backref ignorant? (Downside is nested brackets would probably need recursive patterns.)

Recursion might be advisable anyway--you can't really pick up the arguments to a function, for instance, without also handling things like quoted strings, which may have different bracketing rules than outside of strings. Certainly matching \" would be dependent on whether you're inside or outside of a string. Given that recursion is often necessary, I'm not sure making this construct recurse itself is all that useful.

Along the lines of how tr/// works (or ought to work), I think it'd be more generally useful to have a character remapping facility within a backref generator:

    (
     <[ \( \[ \{ \< ] =>
      [ \) \] \} \> ]> )

That might match a left bracket of some sort but return the corresponding right bracket as $1. But maybe we should just use an "existing" mechanism to translate strings:

    my %closing = {
        '[' => ']',
        '(' => ')',
        '{' => '}',
        '<' => '>',
    };
    rule balanced {
        <-[\[\(\{\<\]\)\}\>]>*  # any non-brackets
        [                       # followed by either
            $                   #   end of string
        |                       # or
            $b := <[[({<]>      #   an opening bracket
            <self>              #   containing a balanced expr
            %closing{$b}        #   followed by corresponding close bracket
            <self>              #   followed by a balanced expr
        ]
     }

RFC 164: Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade()

All operators will have a way to name them, which means it's possible to alias them to any other name. Rearranging the formal order of parameters would be a little harder, however. We need inlining to do that efficiently. Still, now that // doesn't evaluate in a typeless context, it's relatively straightforward to define a subroutine or method that does

    subst $string, /foo/, {"bar"}

in whatever order you like.

RFC 197: Numeric Value Ranges In Regular Expressions

If we go down this road, eventually we reinvent all of Perl syntax in regular expressions. Not that I'm against TMTOWTDI, but I'd rather have a better way to run Perl code from within a regex and have it "succeed" or "fail", and maybe better ways to test ranges from Perl code. Anything beyond that could be done with syntactic warpage.

In any event, overloading () and [] for this would be mentally treacherous, not to mention completely opaque to non-mathematicians. We'll stick with the standard boolean assertion:

    / (\d+) <( $1 =~ 1..10 )> /

Interestingly, that can also be written:

    / <( _/\d+/ =~ 1..10 )> /

RFC 198: Boolean Regexes

Again, I'm not much in favor of inventing new regex syntax that duplicates ordinary Perl syntax. I think we need richer ways of interconnecting related regexes via ordinary Perl syntax. Certainly it helps to have an easy way to specify a Perl assertion:

    / (\w+) <( %count{$1} > 3 )> /

But there's something to be said for forcing submatch assertions to be defined externally to the current regex, much like we discourage inline code where subroutine calls are in order.

So anyway, I think most of the submatches like onion rings should be handled simply by searching on captured strings within a closure. Booleans can be put into closures as well, but the new :: operator makes it pretty easy to AND and OR assertions together in a more regexly fashion without reinventing the wheel.

As proposed, there will be a "fail" token, but it's spelled <fail>, not \F. And the "true" token is spelled <null>. :-)

RFC 261: Pattern matching on perl values

This reminds me a bit of unification in Prolog. It's not explained very well here, and I'm wondering if it will be too hard to explain in general. I think this is probably too powerful a concept for the typical Perl programmer, who is lucky to understand simple lvalues that always do what they're told.

This sort of matching can probably be provided as syntactic warpage, though I'm not sure if that prevents useful optimizations. Anyway, this sort of thing is unlikely to make it into the Perl 6 core unless it generalizes usefully to function argument lists, and it may be too powerful for there too. For that purpose it would resemble a form of overloading, but with the "types" specified by keys. I suspect real types are more useful.

RFC 308: Ban Perl hooks into regexes

We must be able to call back into Perl code if we want to write parsers conveniently in Perl. Think of how yacc works. Certainly the way that Perl 5 does it is ugly, I'll admit. We can beautify that.

But the whole point of Perl is to have all the most useful "Krakken tentacles". And I don't really care if it makes it hard to put the Perl regex engine into some other language. :-)

RFC 316: Regex modifier for support of chunk processing and prefix matching

Infinite strings (via infinite arrays) seem like a more useful concept. It would be easy for the extension subroutine to fail and produce the results desired in this RFC, but without the necessity of the extra syntax specified by the RFC. A match naturally fails when it gets to the end of its string without finishing the pattern. Incremental matching can also easily be done via infinite strings, and the user interface can be as simple as we like, as long as the extension rule is somehow associated with the string in question.

I think pos() is rather too low-level a concept for general use. Certainly it needs to be there, but I think we need some way of implying that one regex is a continuation of a previous one, but within some higher-level syntactic construct, so that it's easy to write parsers without invoking pos() or \g or /c all over the place.


<cut>

Well, I could say a lot more, but that's it for this time. I hope you're excited by all this, in a positive sort of way. But if your jaw lost all of its bounce when it hit the table, I expect Damian's upcoming Exegesis 5 will do a better job of showing how this all fits together into a pretty picture.
