Perl.com 
 Published on Perl.com http://www.perl.com/pub/a/2002/06/26/synopsis5.html

 

Synopsis 5
By Allison Randal, Damian Conway
June 26, 2002

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.

A summary of the changes in Apocalypse 5:

Unchanged features

Modifiers


Changed metacharacters


New metacharacters


Bracket rationalization


Variable (non-)interpolation



Extensible metasyntax (<...>)


Backslash reform

  • The \p and \P properties become intrinsic grammar rules (<prop ...> and <!prop ...>).

  • The \L...\E, \U...\E, and \Q...\E sequences become \L[...], \U[...], and \Q[...] (\E is gone).

  • Note that \Q[...] will rarely be needed since raw variables interpolate as eq matches, rather than regexes.

  • Backreferences (e.g. \1) are gone; $1 can be used instead, because it's no longer interpolated.

  • New backslash sequences, \h and \v, match horizontal and vertical whitespace respectively, including Unicode.

  • \s now matches any Unicode whitespace character.

  • The new backslash sequence \N matches anything except a logical newline; it is the negation of \n.

  • A series of other new capital backslash sequences are also the negation of their lower-case counterparts:
    • \H matches anything but horizontal whitespace.

    • \V matches anything but vertical whitespace.

    • \T matches anything but a tab.

    • \R matches anything but a return.

    • \F matches anything but a formfeed.

    • \E matches anything but an escape.

    • \X... matches anything but the specified hex character.


Regexes are rules

  • The Perl 5 qr/pattern/ regex constructor is gone.

  • The Perl 6 equivalents are:
        rule { pattern }    # always takes {...} as delimiters
        rx/ pattern /       # can take (almost any) chars as delimiters


  • If either needs modifiers, they go before the opening delimiter:
        $regex = rule :ewi { my name is (.*) };
        $regex = rx:ewi/ my name is (.*) /;

  • The name of the constructor was changed from qr because it's no longer an interpolating quote-like operator.

  • As the syntax indicates, it is now more closely analogous to a sub {...} constructor.

  • In fact, that analogy will run very deep in Perl 6.

  • Just as a raw {...} is now always a closure (which may still execute immediately in certain contexts and be passed as a reference in others)...

  • ...so too a raw /.../ is now always a regex (which may still match immediately in certain contexts and be passed as a reference in others).

  • Specifically, a /.../ matches immediately in a void or Boolean context, or when it is an explicit argument of a =~.

  • Otherwise it's a regex constructor.

  • So this:
        $var = /pattern/;

    no longer does the match and sets $var to the result.


  • Instead it assigns a regex reference to $var.

  • The two cases can always be distinguished using m{...} or rx{...}:
        $var = m{pattern};    # Match regex, assign result
        $var = rx{pattern};   # Assign regex itself

  • Note that this means that former magically lazy usages like:
        @list = split /pattern/, $str;

    are now just consequences of the normal semantics.


  • It's now also possible to set up a user-defined subroutine that acts like grep:
        sub my_grep($selector, *@list) {
            given $selector {
                when RULE  { ... }
                when CODE  { ... }
                when HASH  { ... }
                # etc.
            }
        }

  • Using {...} or /.../ in the scalar context of the first argument causes it to produce a CODE or RULE reference, which the switch statement then selects upon.
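
The dispatch-on-selector idea behind my_grep can be tried out today in other languages. The following is only a rough Python analogy, not Perl 6: a compiled re.Pattern stands in for a RULE reference, any callable for a CODE reference, and a mapping for a HASH. The function name is taken from the example above; the sample data and the choice to raise TypeError for other selector types are assumptions made for illustration.

```python
import re
from typing import Any, Iterable, List, Mapping


def my_grep(selector: Any, items: Iterable[str]) -> List[str]:
    """Filter items, choosing the test according to the selector's type."""
    if isinstance(selector, re.Pattern):   # plays the role of "when RULE"
        return [s for s in items if selector.search(s)]
    if callable(selector):                 # plays the role of "when CODE"
        return [s for s in items if selector(s)]
    if isinstance(selector, Mapping):      # plays the role of "when HASH"
        return [s for s in items if s in selector]
    raise TypeError(f"unsupported selector: {type(selector).__name__}")


# Hypothetical sample data, not from the original article.
words = ["alpha", "beta", "gamma", "delta"]
by_rule = my_grep(re.compile(r"^[ab]"), words)       # pattern passed as a value
by_code = my_grep(lambda s: len(s) == 5, words)      # closure passed as a value
by_hash = my_grep({"gamma": 1, "omega": 1}, words)   # membership in a mapping
```

Note that re.compile() builds the pattern without matching anything, which loosely mirrors the rx{...} versus m{...} distinction drawn above.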


Backtracking control

  • Backtracking over a single colon causes the regex engine not to retry the preceding atom:
        m:w/ \( <expr> [ , <expr> ]* : \) /

    (i.e. there's no point trying fewer <expr> matches, if there's no closing parenthesis on the horizon)


  • Backtracking over a double colon causes the surrounding group to immediately fail:
        m:w/ [ if :: <expr> <block>
             | for :: <list> <block>
             | loop :: <loop_controls>? <block>
             ]
        /

    (i.e. there's no point trying to match a different keyword if one was already found but failed)


  • Backtracking over a triple colon causes the current rule to fail outright (no matter where in the rule it occurs):
        rule ident {
              ( [<alpha>|_] \w* ) ::: { fail if %reserved{$1} }
            | " [<alpha>|_] \w* "
        }
        m:w/ get <ident>? /

    (i.e. using an unquoted reserved word as an identifier is not permitted)


  • Backtracking over a <commit> assertion causes the entire match to fail outright, no matter how many subrules down it happens:
        rule subname {
            ([<alpha>|_] \w*) <commit> { fail if %reserved{$1} }
        }
        m:w/ sub <subname>? <block> /

    (i.e. using a reserved word as a subroutine name is instantly fatal to the "surrounding" match as well)


  • A <cut> assertion always matches successfully, and has the side effect of deleting the parts of the string already matched.

  • Attempting to backtrack past a <cut> causes the complete match to fail (like backtracking past a <commit>). This is because there's now no preceding text to backtrack into.

  • This is useful for throwing away successfully processed input when matching from an input stream or an iterator of arbitrary length.
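
The effect of the single colon (do not retry the preceding atom) can be observed in existing regex engines. The sketch below is a Python analogy only, and covers just the single-colon case. It uses the well-known trick of capturing inside a lookahead and then matching the backreference, which stops the engine from giving back characters once the atom has matched. The sample pattern and text are assumptions chosen to make the difference visible.

```python
import re

text = "aaab"

# Ordinary quantifier: a+ first grabs "aaa", then gives one "a" back
# so that the trailing "ab" can match.
backtracking = re.compile(r"a+ab")

# Committed version: the lookahead captures the longest run of "a",
# and \1 consumes exactly that run. The engine cannot re-enter the
# lookahead to try a shorter run, so nothing is left for the "a" in "ab".
committed = re.compile(r"(?=(a+))\1ab")

with_backtracking = backtracking.search(text)
without_backtracking = committed.search(text)
```

Here with_backtracking is a match for the whole string, while without_backtracking is None. Python 3.11 and later also accept atomic groups, (?>...), and possessive quantifiers such as a++, which express the same idea more directly.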


Named Regexes

  • The analogy between sub and rule extends much further.

  • Just as you can have anonymous subs and named subs...

  • ...so too you can have anonymous regexes and named regexes:
        rule ident { [<alpha>|_] \w* }
        # and later...
        @ids = grep /<ident>/, @strings;

  • As the above example indicates, it's possible to refer to named regexes, such as:
        rule serial_number { <[A-Z]> \d<8> }
        rule type { alpha | beta | production | deprecated | legacy }

    in other regexes as named assertions:

        rule identification { [soft|hard]ware <type> <serial_number> }
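
Python's re module has no named subrules, but the composition shown above can be approximated by building a larger pattern out of named fragments. This is a sketch under assumptions, not a Perl 6 equivalent: the constant names are made up for illustration, the fragments are spliced in as text rather than called as rules, and the parts are assumed to be separated by whitespace in the input.

```python
import re

# Each fragment captures into a group named after the rule it imitates.
SERIAL_NUMBER = r"(?P<serial_number>[A-Z]\d{8})"
TYPE = r"(?P<type>alpha|beta|production|deprecated|legacy)"

# Assumption: the three parts are separated by whitespace in the input.
IDENTIFICATION = re.compile(rf"(?:soft|hard)ware\s+{TYPE}\s+{SERIAL_NUMBER}")

good = IDENTIFICATION.fullmatch("software beta A12345678")
bad = IDENTIFICATION.fullmatch("hardware gamma Z00000001")  # unknown type
```

After a successful match, good.group("type") and good.group("serial_number") hold the text matched by each fragment, loosely like the per-rule captures described later in this document.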


Nothing is illegal

  • The null pattern is now illegal.

  • To match whatever the prior successful regex matched, use:
        /<prior>/

  • To match the zero-width string, use:
        /<null>/


Hypothetical variables

  • In embedded closures it's possible to bind a variable to a value that only "sticks" if the surrounding pattern successfully matches.

  • A variable is declared with the keyword let and then bound to the desired value:
        / (\d+) {let $num := $1} (<alpha>+)/

  • Now $num will only be bound if the digits are actually found.

  • If the match ever backtracks past the closure (i.e. if there are no alphabetics following), the binding is "undone".

  • This is even more interesting in alternations:
        / [ (\d+)      { let $num   := $1 }
          | (<alpha>+) { let $alpha := $2 }
          | (.)        { let $other := $3 }
          ]
        /

  • There is also a shorthand for assignment to hypothetical variables:
        / [ $num  := (\d+)
          | $alpha:= (<alpha>+)
          | $other:=(.)
          ]
        /

  • The numeric variables ($1, $2, etc.) are also "hypothetical".

  • Numeric variables can be assigned to, and even re-ordered:
        my ($key, $val) = m:w{ $1:=(\w+) =\> $2:=(.*?)
                             | $2:=(.*?) \<= $1:=(\w+)
                             };

  • Repeated captures can be bound to arrays:
        / @values:=[ (.*?) , ]* /

  • Pairs of repeated captures can be bound to hashes:
        / %options:=[ (<ident>) = (\N+) ]* /

  • Or just capture the keys (and leave the values undef):
        / %options:=[ (<ident>) = \N+ ]* /

  • Subrules (e.g. <rule>) also capture their result in a hypothetical variable of the same name as the rule:
        / <key> =\> <value> { %hash{$key} = $value } /
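
Named capture groups in today's regex engines behave a little like the hypothetical variables described above: a name is only bound if the branch containing it takes part in a successful match. The following Python sketch is an analogy, not the Perl 6 mechanism. The group names follow the $num, $alpha, and $other example; the classify helper is hypothetical.

```python
import re

# One named group per alternative, mirroring $num, $alpha and $other.
TOKEN = re.compile(r"(?P<num>\d+)|(?P<alpha>[A-Za-z]+)|(?P<other>.)")


def classify(text):
    """Return (name, value) for the alternative that matched, or None."""
    m = TOKEN.match(text)
    if m is None:
        return None
    return m.lastgroup, m.group(m.lastgroup)


bindings = TOKEN.match("42abc").groupdict()

# If the overall match fails, no binding survives: the digits were found,
# but no letters follow, so there is no match object to read "num" from.
failed = re.match(r"(?P<num>\d+)[A-Za-z]+", "123")
```

In bindings, only the group for the branch that matched has a value; the other two names are present but set to None.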


Return values from matches

  • A match always returns a "match object", which is also available as (lexical) $0.

  • The match object evaluates differently in different contexts:
    • in boolean context it evaluates as true or false (i.e. did the match succeed?):
          if /pattern/ {...}
          # or:
          /pattern/; if $0 {...}

    • in numeric context it evaluates to the number of matches:
          $match_count += m:e/pattern/;

    • in string context it evaluates to the captured substring (if there was exactly one capture in the pattern) or to the entire text that was matched (if the pattern does not capture, or captures multiple elements):
          print %hash{$text =~ /,? (<ident>)/};
          # or: 
          $text =~ /,? (<ident>)/  &&  print %hash{$0};

  • Within a regex, $0 acts like a hypothetical variable.

  • It controls what a regex match returns (like $$ does in yacc).

  • Use $0:= to override the default return behaviour described above:
        rule string1 { (<["'`]>) ([ \\. | <-[\\]> ]*?) $1 }
        $match = m/<string1>/;  # default: $match includes 
                                # opening and closing quotes
        rule string2 { (<["'`]>) $0:=([ \\. | <-[\\]> ]*?) $1 }
        $match = m/<string2>/;  # $match now excludes quotes
                                # because $0 explicitly bound 
                                # to second capture only
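
For comparison, here is how the string1/string2 distinction looks with a match object in Python. This is a loose analogy under assumptions: Python has no context-sensitive match object, so the "whole match" and "contents only" results are requested explicitly by group number, and counting is done with findall. The sample strings are made up for illustration.

```python
import re

# Opening quote, lazily matched body (escaped characters or anything
# except a backslash), then the same quote character again.
QUOTED = re.compile(r"""(["'`])((?:\\.|[^\\])*?)\1""")

text = r'say "a \"b\" c" end'
m = QUOTED.search(text)

succeeded = m is not None   # the "boolean context" question
whole = m.group(0)          # like string1: includes the quotes
inner = m.group(2)          # like string2: contents only

# The "numeric context" question: how many times does a pattern match?
count = len(re.findall(r"\d+", "3 apples, 12 pears, 7 plums"))
```

The escaped quotes inside the body do not end the string, because the \\. alternative consumes each backslash together with the character after it.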


Matching against non-strings

  • Anything that can be tied to a string can be matched against a regex. This feature is particularly useful with input streams:
        my @array := <$fh>;           # lazy when aliased
        my $array is from(\@array);   # tie scalar
        # and later...
        $array =~ m/pattern/;         # match from stream


Grammars

  • Named regexes create a potential "collision" problem.

  • Of course, a named ident regex shouldn't clobber someone else's ident regex.

  • So some mechanism is needed to confine regexes to a namespace.

  • If subs are the model for rules, then modules/classes are the obvious model for aggregating them.

  • Such collections of rules are generally known as "grammars".

  • Just as a class can collect named actions together:
        class Identity {
            method name { "Name = $.name" }
            method age  { "Age  = $.age"  }
            method addr { "Addr = $.addr" }
            method desc {
                print .name(), "\n",
                      .age(),  "\n",
                      .addr(), "\n";
            }
            # etc.
        }

  • So too a grammar can collect a set of named rules together:
        grammar Identity {
            rule name :w { Name = (\N+) }
            rule age  :w { Age  = (\d+) }
            rule addr :w { Addr = (\N+) }
            rule desc {
                <name> \n
                <age>  \n
                <addr> \n
            }
            # etc.
        }

  • Like classes, grammars can inherit:
        grammar Letter {
            rule text     { <greet> <body> <close> }
            rule greet :w { [Hi|Hey|Yo] $to:=(\S+?) , $$}
            rule body     { <line>+ }
            rule close :w { Later dude, $from:=(.+) }
            # etc.
        }
        grammar FormalLetter is Letter {
            rule greet :w { Dear $to:=(\S+?) , $$}
            rule close :w { Yours sincerely, $from:=(.+) }
        }

  • Derived grammars inherit rule definitions (polymorphically!)

  • So there's no need to respecify body, line, etc.

  • Perl 6 will come with at least one grammar predefined:
        grammar Perl {    # Perl's own grammar
            rule prog { <line>* }
            rule line { <decl>
                      | <loop>
                      | <label> [<cond>|<sideff>|;]
            }
            rule decl { <sub> | <class> | <use> }
            # etc. etc.
        }

  • Hence:
        given $source_code {
            $parsetree = m/<Perl::prog>/;
        }
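
The inheritance idea can be imitated with ordinary classes that hold pattern fragments. The sketch below is a Python analogy of the Letter and FormalLetter grammars, not a Perl 6 grammar. The attribute names, the definition of BODY as one or more non-empty lines, the pattern() helper, and the group name sender (used in place of $from) are all assumptions made for illustration.

```python
import re


class Letter:
    GREET = r"(?:Hi|Hey|Yo) (?P<to>\S+?),\n"
    BODY = r"(?P<body>(?:.+\n)+?)"           # assumed: one or more lines
    CLOSE = r"Later dude, (?P<sender>.+)"

    @classmethod
    def pattern(cls):
        # Looks the fragments up on cls, so a subclass's overrides are used.
        return re.compile(cls.GREET + cls.BODY + cls.CLOSE)


class FormalLetter(Letter):
    # Only greet and close are redefined; BODY and pattern() are inherited.
    GREET = r"Dear (?P<to>\S+?),\n"
    CLOSE = r"Yours sincerely, (?P<sender>.+)"


formal_text = "Dear Larry,\nThanks for the patch.\nYours sincerely, Damian"
casual_text = "Yo Larry,\nThanks for the patch.\nLater dude, Damian"

formal = FormalLetter.pattern().fullmatch(formal_text)
casual = Letter.pattern().fullmatch(casual_text)
mismatch = Letter.pattern().fullmatch(formal_text)
```

Because pattern() is defined once in the base class and resolves GREET and CLOSE through cls, the derived class picks up its own versions without respecifying the body, which is the polymorphic behaviour described above.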


Transliteration

  • The tr/// quote-like operator now also has a subroutine form.

  • It can be given either a single PAIR:
        $str =~ tr( 'A-C' => 'a-c' );

  • Or a hash (or hash ref):
        $str =~ tr( {'A'=>'a', 'B'=>'b', 'C'=>'c'} );
        $str =~ tr( {'A-Z'=>'a-z', '0-9'=>'A-F'} );
        $str =~ tr( %mapping );

  • Or two arrays (or array refs):
        $str =~ tr( ['A'..'C'], ['a'..'c'] );
        $str =~ tr( @UPPER, @lower );

  • Note that the array version can map one-or-more characters to one-or-more characters:
        $str =~ tr( [' ',      '<',    '>',    '&'    ],
                    ['&nbsp;', '&lt;', '&gt;', '&amp;' ]);
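
Python's built-in str.maketrans and str.translate offer a close analogue of the subroutine form of tr, and can serve as a quick way to experiment with the same ideas. This is a comparison only, not Perl 6 behaviour; in particular, Python does not expand ranges such as 'A-C', so the characters are spelled out. The variable names are assumptions.

```python
# Two equal-length strings: character-for-character mapping,
# comparable to the two-array form.
pair_table = str.maketrans("ABC", "abc")

# A dictionary: comparable to the hash form.
hash_table = str.maketrans({"A": "a", "B": "b", "C": "c"})

# Dictionary values may be longer than one character, comparable to the
# one-to-many array example above.
escape_table = str.maketrans(
    {" ": "&nbsp;", "<": "&lt;", ">": "&gt;", "&": "&amp;"}
)

lowered = "CAB".translate(pair_table)
same = "CAB".translate(hash_table)
escaped = "a < b & c".translate(escape_table)
```

Each character is translated exactly once, so the "&" introduced by a replacement such as "&lt;" is not translated again.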

Dr. Damian Conway is a Senior Lecturer in Computer Science and Software Engineering at Monash University (Melbourne, Australia), where he teaches object-oriented software engineering. He is an effective teacher, an accomplished writer, and the author of several popular Perl modules. He is also a semi-regular contributor to the Perl Journal. In 1998 he was the winner of the Larry Wall Award for Practical Utility for two modules (Getopt::Declare and Lingua::EN::Inflect) and in 1999 he won his second "Larry" for his Coy.pm haiku-generation module. He has just published "Object-Oriented Perl" (Manning, 1999).

Allison Randal's first geek career was as a research linguist in eastern Africa. Working with minority languages led to a series of academic papers delivered in obscure places like the Czech Republic. But eventually her love of coding seduced her away from natural languages to artificial ones. In particular, to Perl. After serving several tours of duty in the dot.com trenches, she has recently returned to Darkest Academia, at the University of Portland. In her spare time she enjoys extreme sports: teaching Perl to Java programmers, Perl Monger wrangling, and debating linguistics with Larry and Damian.



Perl.com Compilation Copyright © 1998-2006 O'Reilly Media, Inc.