August 2002 Archives

Mail Filtering

There are many ways to filter your e-mail with Perl. Two of the more popular and interesting ways are to use PerlMx or Mail::Audit. I took a long look at both, and this is what I thought of them.

PerlMx

PerlMx is a server product from ActiveState that uses the milter support in recent versions of sendmail to hook in at almost every stage of the mail-handling process.

PerlMx comes with its own copy of Perl, and all the supporting modules it needs - it can't run from a normal Perl, as it needs Perl to be built with various options such as ithreads support and multiplicity. This means you need to install any modules you want to use with PerlMx twice if you already have them installed somewhere else on your system.

PerlMx provides a persistent daemon that processes e-mail for an entire mail-server - it avoids the overhead of starting a Perl process to handle each e-mail by running forever, and by using threads to ensure it can service more than one e-mail at a time.

PerlMx ships with two main filters - the Spam and Virus filters. The Virus filtering looks interesting, but ultimately I don't receive that many viruses in e-mail, so I was unable to test it beyond establishing that it didn't mangle my e-mail.

The Spam filtering in PerlMx is much more interesting - it seems to be based on Mail::SpamAssassin, a popular spam-filtering module often used with Mail::Audit, procmail, or other ways of processing e-mail.

In two weeks of testing with PerlMx, using it to process a copy of all my personal e-mail, I found a lot of useful functionality, and a few minor problems.

The first hassle was setup - I don't normally use sendmail, but PerlMx requires it for the milter API, so I installed sendmail, set it up, and hooked it into PerlMx.

Once you have sendmail set up, and built with milter support (as the default Debian Linux build I used was), it's easy to add a connection to PerlMx with one line in your sendmail.mc file:


INPUT_MAIL_FILTER(`PerlMx', `S=inet:3366@localhost, F=T, 
     T=S:3m;R:3m;E:8m')

PerlMx essentially works out of the box - it asks a number of simple questions when you install and set it up, and assuming you get these right, no further configuration will be required.

The INPUT_MAIL_FILTER line also sets several key options, including the timeouts for communication between sendmail and PerlMx. I had to raise these significantly to deal with a problem where PerlMx was taking too long to process spam (it appeared to be doing DNS lookups), causing sendmail to time out the connection to PerlMx and refuse to accept mail.

PerlMx 2.1 even ships with its own sendmail install, pre-configured for use with PerlMx, but you can choose to ignore this and use an existing system sendmail.

Once you've done this, suddenly all the mail that goes through your mail-server is spam-filtered and virus-checked. Mail that looks likely to be spam, or that contains a virus, is stopped and held in a quarantine queue; the rest is delivered to the user, possibly with a spam header added to indicate a score representing how likely the message is to be spam. The quarantine queue is a systemwide collection of messages that, for one reason or another, weren't appropriate to deliver to the user - normally because they are suspected of containing viruses or spam.

If the filters supplied with PerlMx aren't to your taste, it also provides an extension API, with extensive documentation and samples, to allow you to write your own.

While testing PerlMx, I never managed to bounce or accidentally lose my e-mail. I made many configuration errors, which meant mail wasn't processed and a lot of valid mail was somewhat over-enthusiastically marked as spam. But as far as I can tell, nothing bounced or disappeared into the system - this is pretty impressive, as when configuring most new bits of e-mail software I usually manage to delete everything I send to it in the first few attempts, or, worse, make myself look stupid by sending errors back to random people unfortunate enough to be on the same mailing list as me.

Mail::Audit

Mail::Audit is very different from PerlMx. For starters, once you've installed it, by default it doesn't do anything. Mail::Audit is just a Perl module - it's a powerful tool for implementing mail filters, but mostly you have to write them yourself. Where PerlMx ships with spam filtering and virus checking configured by default, Mail::Audit provides duplicate killing, a mailing-list processing module (based on Mail::ListDetector), and a few simple spam-filtering options based on Realtime Blackhole Lists or Vipul's Razor.

Mail::Audit is not designed to be used with an entire mail-server in the same way as PerlMx. Instead, it allows you to easily write little e-mail filter programs that can be triggered from the .forward file of a particular user. Mail::Audit can be easily configured and used on a per-user basis, whereas PerlMx takes over an entire mail-server and is an all-or-nothing choice.

The default Mail::Audit configuration starts one Perl process for each mail handled - normally this won't be a problem, but if you're processing large volumes of mail, or have a system that is already at or near capacity, it may be enough to tip the balance and cause performance problems. (Translation: Long ago I installed Mail::Audit on an old, spare machine I was using as a mail-server, received 200 e-mails in less than a minute, and spent quite a while waiting for the system to stop gazing at its navel and start responding to the outside world again.) If your mail comes to you via POP3, or can be made to do so (possibly by installing a POP3 daemon if you do not have one already), then a simple script supplied with Mail::Audit called popread provides a base you can use to feed messages from a POP3 server into Mail::Audit in a single Perl process, improving performance. I didn't do this myself, as I wanted to use what appeared to be the 'recommended' approach to Mail::Audit setup - the one that is, if not actively promoted in the documentation, most strongly suggested by it: running a Mail::Audit script from a user's .forward file.
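That 'recommended' hookup is a one-line .forward file that pipes each message into the filter script (the path here is illustrative):

```
"|/home/user/bin/mail-filter.pl"
```

The quotes and leading pipe tell the mail system to feed each incoming message to the script on standard input - which is exactly where the one-process-per-message overhead comes from.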

A popular Mail::Audit addition is SpamAssassin (the codebase on which PerlMx's spam filtering appears to be loosely based) - this comes as a Mail::Audit plugin, among other forms.

Mail::Audit makes it easy to write mail filters that work on a per-user basis, whereas PerlMx by default applies to all mail processed on a given mailserver.

If you want to install Mail::Audit systemwide, many mail-servers (such as exim) provide a way to configure a custom local delivery agent based on flexible criteria. For example, this article provides some documentation on how to do this with exim.
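As a rough sketch of how such a hookup looks in exim 4 configuration (the router and transport names, and the script path, are invented for illustration - consult the exim documentation for the real details):

```
# router: send local users' mail to a custom pipe transport
audit_filter:
  driver = accept
  check_local_user
  transport = audit_pipe

# transport: pipe each message into a Mail::Audit script
audit_pipe:
  driver = pipe
  command = /usr/local/bin/mail-audit-filter
```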

Testing ... 1 ... 2 ... 3 ...

I decided to do an extended comparison of both PerlMx and Mail::Audit. As one of the most common applications of mail filtering tools is for spam filtering, I set up recent versions of both the tools on my personal e-mail, by various nefarious means, ran them for a week, and compared the results on two main criteria:

  • False positives (legitimate email recognized as spam)
  • False negatives (spam not recognized as spam)

Mail::Audit doesn't come with much spam filtering technology by default, so I decided to add SpamAssassin (http://www.spamassassin.org/) to the testing, as it can be used as a Mail::Audit extension.

I used procmail to copy all my incoming e-mail to two POP3 mailboxes set up for the purposes of testing - one would contain mail to be processed by Mail::Audit, the other mail to be processed by PerlMx's spam filtering. fetchmail was used to pull the mail down into the domain of Mail::Audit and PerlMx.
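The procmail side of that is just a pair of carbon-copy recipes (the addresses here are illustrative):

```
# ~/.procmailrc - forward a copy of every incoming message
# to each of the two test accounts
:0 c
! audit-test@example.com

:0 c
! perlmx-test@example.com
```

The `c` flag on each recipe means "carbon copy": the message is forwarded but also falls through to the next recipe, so the original delivery is unaffected.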

Once I had Mail::Audit and SpamAssassin set up, I started feeding mail into the test box with fetchmail, and was reminded that the Mail::Audit approach of setting up a Perl program to run from a .forward file has ... unpleasant effects if you receive more than a few e-mails in quick succession. As my test mail-server collapsed under the load, I checked the PerlMx machine, started at roughly the same time, and found that while it was working through the e-mail more slowly, it hadn't put any serious load on the machine.

Due to a PerlMx configuration error on my part, of the first 171 messages processed, 10 were quarantined as spam AND delivered to the inbox of my test user. PerlMx runs by default in 'training mode' when processing spam - in this mode, mail is spamchecked as normal, but even if it is found to be spam and quarantined, it is also delivered to the user.

I decided to keep track of any mail lost or mislaid during initial setup problems, so I could see what problems could arise from the tools being misconfigured. An important aspect of any software is not only how it behaves when configured right, but how much it punishes you when you get the configuration wrong.

Waking up the next morning, I found I'd bounced several hundred e-mails back to the account from which I was forwarding all the test e-mails, some of which appeared to have gone back and forth, or found their way into the PerlMx test mailbox. Most of the problems appeared to be internal errors from within SpamAssassin. My mail-server still hadn't recovered.

I later found this was because of a compatibility issue between SpamAssassin and Mail::Audit, and there was a recommended fix in the SpamAssassin FAQ involving the nomime option to Mail::Audit (but not, sadly, in the documentation for the Mail::SpamAssassin module itself).

The SpamAssassin / Mail::Audit script I ended up using was:


  #!/usr/local/bin/perl -w

  use strict;
  use Mail::Audit;
  use Mail::SpamAssassin;

  # create Mail::Audit object, log to /tmp, disable mime processing
  # for SpamAssassin compatibility, and store mail in ~/emergency_mbox
  # if processing fails
  my $mail = Mail::Audit->new(emergency => "~/emergency_mbox",
                              log => '/tmp/audit.log',
                              loglevel => 4, nomime => 1);

  my $spamtest = Mail::SpamAssassin->new;
  
  # check mail with SpamAssassin
  my $status = $spamtest->check($mail);
  
  # if it was spam, rewrite to indicate what the problem was, and 
  # store in the file ass-spam in our home directory
  if ($status->is_spam) {
          $status->rewrite_mail;
          $mail->accept("/home/spam1/ass-spam");
  # if it wasn't spam, accept it as normal mail
  } else {
          $mail->accept;
  }
  
  exit 0;

After clearing down all my mail, and losing two days of testing, I started again. It was only the nature of the testing setup that meant the bounce mail went to me and not the original sender. So, at 23:25 on Tuesday, I had another go. This time I knew enough to limit SpamAssassin to receiving messages in batches of five (using fetchmail) - something I could do in testing, but wouldn't be an easy option in most production setups. This meant my test machine could just about cope with delivering mail using SpamAssassin.
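For the curious, that batching is straightforward to express in a fetchmail configuration; the host and account names here are illustrative, and fetchlimit is the option I believe controls how many messages are fetched per poll:

```
# ~/.fetchmailrc - pull at most five messages per poll
poll pop.example.com protocol pop3
    user "audit-test" password "secret"
    fetchlimit 5
```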

At 10 p.m. Sunday, I declared the testing closed, and examined the accuracy or otherwise of each system.

During the testing between Aug. 6 and 11, Mail::Audit marked 16 pieces of e-mail as spam. Seven of these e-mails proved to be false positives - mail that I had actually solicited and would have liked to have received. Six spam emails were accepted into my Inbox. There were 874 e-mails received in all. Mail::Audit appeared to receive 15 pieces of spam mail in total.

PerlMx marked 14 e-mails as spam. Two of these e-mails proved to be false positives - mail that was not spam. Impressively, it received 886 e-mails in the same period that Mail::Audit received 874. I was unable to work out the exact cause of this discrepancy, although the power cut in the middle of the testing period will always be a major suspect. Eleven spam messages were incorrectly allowed through into my Inbox. PerlMx appeared to receive 23 pieces of spam mail in total.

The sample was small, as all I had was my own personal e-mail to work with, and I get what I'm told is surprisingly little spam, but it shows that Mail::Audit / SpamAssassin seems to decide more mail is spam than PerlMx does, but is also wrong more of the time. PerlMx marked slightly less e-mail as spam, and let more spam through, but when it did claim e-mail was spam it was right more of the time.

These tests would benefit significantly from being re-run over a longer period of time on a larger mail-server, but I had neither the time nor the mail-server available.

Both tools can be extensively configured in terms of what is considered spam, and are likely to need regular updating to ensure they keep up to date with new tricks of the spammers. Here I only considered the behavior with the default configuration of the latest release at the time I ran my tests.

Feature Comparison

To help you choose, I've summarized the basic characteristics of both systems below. Some of the points are quite subjective and are more my impressions of the tools rather than hard facts - these are marked separately.

Scalable
    PerlMx: Yes - persistent server
    Mail::Audit: Maybe - depends on configuration; the obvious default configurations scale poorly

Ships with a wide range of existing filtering functionality
    PerlMx: Yes
    Mail::Audit: Limited range; more available from third parties

Target use
    PerlMx: Systemwide mail filtering for mail-servers
    Mail::Audit: Per-user mail filtering as a replacement for programs like procmail

Extensible?
    PerlMx: Yes
    Mail::Audit: Yes

Licensing
    PerlMx: Commercial
    Mail::Audit: Open source

Mail server compatibility
    PerlMx: Sendmail
    Mail::Audit: Almost any mail server

Spam filtering
    PerlMx: Yes
    Mail::Audit: Via third-party extension

Virus filtering
    PerlMx: Yes
    Mail::Audit: No

Easy to set up
    PerlMx: Yes
    Mail::Audit: Not so easy; requires custom code

Efficient and scalable
    PerlMx: Very scalable - easily separated from the mail-server, and no noticeable performance impact during testing
    Mail::Audit: Performance problems during testing in the default configuration

Conclusions

During testing, PerlMx was significantly more reliable than Mail::Audit, both in terms of the amount of mail bounced due to configuration problems (none) and in terms of the load put on the mail-server (minimal). Although Mail::Audit can apparently be set up for good performance, the obvious suggested configuration showed extremely poor scalability during testing. And because Mail::Audit requires writing some filtering code, bugs in that code resulted in nontrivial quantities of mail being bounced during testing, a problem that simply didn't occur with PerlMx's pre-supplied, configuration-file-based system.

Both PerlMx and Mail::Audit provide good mail filtering solutions using Perl, but are targeted at entirely different markets. PerlMx is a systemwide solution providing drop-in functionality on mailservers, with Perl extensibility as well, whereas Mail::Audit is a more low-level tool, mostly focused on use by individuals, designed to let users build their own mail processing tools more easily.

Exegesis 5

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.



Come gather round Mongers, whatever you code
And admit that your forehead's about to explode
'Cos Perl patterns induce complete brain overload
If there's source code, you should be maintainin'
Then you better start learnin' Perl 6 patterns soon
For the regexes, they are a-changin'

Apocalypse 5 marks a significant departure in the ongoing design of Perl 6.

Previous Apocalypses took an evolutionary approach to changing Perl's general syntax, data structures, control mechanisms and operators. New features were added, old features removed, and existing features were enhanced, extended and simplified. But the changes described were remedial, not radical.

Larry could have taken the same approach with regular expressions. He could have tweaked some of the syntax, added new (?...) constructs, cleaned up the rougher edges, and moved on.

Fortunately, however, he's taking a much broader view of Perl's future than that. And he saw that the problem with regular expressions was not that they lacked a (?$var:...) extension to do named captures, or that they needed a \R metatoken to denote a recursive subpattern, or that there was a [:YourNamedCharClassHere:] mechanism missing.

He saw that those features, laudable as they were individually, would just compound the real problem, which was that Perl 5 regular expressions were already groaning under the accumulated weight of their own metasyntax. And that a decade of accretion had left the once-clean notation arcane, baroque, inconsistent and obscure.

It was time to throw away the prototype.

Even more importantly, as powerful as Perl 5 regexes are, they are not nearly powerful enough. Modern text manipulation is predominantly about processing structured, hierarchical text. And that's just plain painful with regular expressions. The advent of modules like Parse::Yapp and Parse::RecDescent reflects the community's widespread need for more sophisticated parsing mechanisms. Mechanisms that should be native to Perl.

As Piers Cawley has so eloquently misquoted: “It is a truth universally acknowledged that any language in possession of a rich syntax must be in want of a rewrite.” Perl regexes are such a language. And Apocalypse 5 is precisely that rewrite.


What's the diff?

So let's take a look at some of those new features. To do that, we'll consider a series of examples structured around a common theme: recognizing and manipulating data in the format produced by the Unix diff utility.

A classic diff consists of zero-or-more text transformations, each of which is known as a “hunk”. A hunk consists of a modification specifier, followed by one or more lines of context. Each hunk is either an append, a delete, or a change, and the type of hunk is specified by a single letter ('a', 'd', or 'c'). Each of these single-letter specifiers is prefixed by the line numbers of the lines in the original document it affects, and followed by the equivalent line numbers in the transformed file. The context information consists of the lines of the original file (each preceded by a '<' character), then the lines of the transformed file (each preceded by a '>'). Deletes omit the transformed context, appends omit the original context. If both contexts appear, then they are separated by a line consisting of three hyphens.
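The format is easier to see in a concrete example. Here (with illustrative file names) is the classic diff output for a one-line change:

```shell
# create a tiny "original" and "transformed" file
printf 'alpha\nold line\ngamma\n' > original.txt
printf 'alpha\nnew line\ngamma\n' > transformed.txt

# a classic diff of the two produces a single change ('c') hunk:
#   2c2
#   < old line
#   ---
#   > new line
diff original.txt transformed.txt || true   # diff exits 1 when files differ
```

The `2c2` line is the modification specifier (original line numbers, the hunk type, transformed line numbers), and the `<`/`---`/`>` lines are the two contexts described above.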

Phew! You can see why natural language isn't the preferred way of specifying data formats.

The preferred way is, of course, to specify such formats as patterns. And, indeed, we could easily throw together a few Perl 6 patterns that collectively would match any data conforming to that format:

    $file = rx/ ^  <$hunk>*  $ /;
    $hunk = rx :i { 
        [ <$linenum> a :: <$linerange> \n
          <$appendline>+ 
        |
          <$linerange> d :: <$linenum> \n
          <$deleteline>+
        |
          <$linerange> c :: <$linerange> \n
          <$deleteline>+
          --- \n
          <$appendline>+
        ]
      |
        (\N*) ::: { fail "Invalid diff hunk: $1" }
    };
    $linerange = rx/ <$linenum> , <$linenum>
                   | <$linenum>
                   /;
    $linenum = rx/ \d+ /;
    $deleteline = rx/^^ \< <sp> (\N* \n) /;
    $appendline = rx/^^ \> <sp> (\N* \n) /;
    # and later...
    my $text is from($*ARGS);
    print "Valid diff" 
        if $text =~ /<$file>/;

Starting gently

There's a lot of new syntax there, so let's step through it slowly, starting with:

    $file = rx/ ^  <$hunk>*  $ /;

This statement creates a pattern object. Or, as it's known in Perl 6, a “rule”. People will probably still call them “regular expressions” or “regexes” too (and the keyword rx reflects that), but Perl patterns long ago ceased being anything like “regular”, so we'll try and avoid those terms.

In any case, the rx constructor builds a new rule, which is then stored in the $file variable. The Perl 5 equivalent would be:

    # Perl 5
    my $file = qr/ ^  (??{$hunk})*  $ /x;

This illustrates quite nicely why the entire syntax needed to change.

The name of the rule constructor has changed from qr to rx, because in Perl 6 rule constructors aren't quotelike contexts. In particular, variables don't interpolate into rx constructors in the way they do for a qq or a qx. That's why we can embed the $hunk variable before it's actually initialized.

In Perl 6, an embedded variable becomes part of the rule's implementation rather than part of its “source code”. As we'll see shortly, the pattern itself can determine how the variable is treated (i.e., whether to interpolate it literally, treat it as a subpattern or use it as a container).


Lay it out for me

In Perl 6, each rule implicitly has the equivalent of the Perl 5 /x modifier turned on, so we could lay out (and annotate) that first pattern like this:

    $file = rx/ ^               # Must be at start of string
                <$hunk>         # Match what the rule in $hunk would match...
                        *       #          ...zero-or-more times
                $               # Must be at end of string (no newline allowed)
              /;

Because /x is the default, the whitespace in the pattern is ignored, which allows us to lay out the rule more readably. Comments are also honored, which enables us to document the rule sensibly. You can even use the closing delimiter in a comment safely:

    $caveat = rx/ Make \s+ sure \s+ to \s+ ask
                  \s+ (mum|mom)                 # handle UK/US spelling
                  \s+ (and|or)                  # handle and/or
                  \s+ dad \s+ first
                /;

Of course, the examples in this Exegesis don't represent good comments in general, since they document what is happening, rather than why.

The meanings of the ^ and * metacharacters are unchanged from Perl 5. However, the meaning of the $ metacharacter has changed slightly: it no longer allows an optional newline before the end of the string. If you want that behavior, then you need to specify it explicitly. For example, to match a line ending in digits: / \d+ \n? $/

The compensation is that, in Perl 6, a \n in a pattern matches a logical newline (that is any of: "\015\012" or "\012" or "\015" or "\x85" or "\x2028"), rather than just a physical ASCII newline (i.e. just "\012"). And a \n will always try to match any kind of physical newline marker (not just the current system's favorite), so it correctly matches against strings that have been aggregated from multiple systems.


Interpolate ye not ...

The really new bit in the $file rule is the <$hunk> element. It's a directive to grab whatever's in the $hunk variable (presumably another pattern) and attempt to match it at that point in the rule. The important point is that the contents of $hunk are only grabbed when the pattern matching mechanism actually needs to match against them, not when the rule is being constructed. So it's like the mysterious (??{...}) construct in Perl 5 regexes.

The angle brackets themselves are a much more general mechanism in Perl 6 rules. They are the “metasyntactic markers” and replace the Perl 5 (?...) syntax. They are used to specify numerous other features of Perl 6 rules, many of which we will explore below.

Note that if we hadn't put the variable in angle-brackets, and had just written:

    rx/ ^  $hunk*  $ /;

then the contents of $hunk would still not be interpolated when the pattern was parsed. Once again, the pattern would grab the contents of the variable when it reached that point in its match. But, this time, without the angle brackets around $hunk, the pattern would try to match the contents of the variable as an atomic literal string (rather than as a subpattern). “Atomic” means that the * repetition quantifier applies to everything that's in $hunk, not just to the last character (as it does in Perl 5).

In other words, a raw variable in a Perl 6 pattern is matched as if it was a Perl 5 regex in which the interpolation had been quotemeta'd and then placed in a pair of noncapturing parentheses. That's really handy in something like:

    # Perl 6
    my $target = <>;                  # Get literal string to search for
    $text =~ m/ $target* /;           # Search for them as literals

which in Perl 5 we'd have to write as:

    # Perl 5
    my $target = <>;                  # Get literal string to search for
    chomp $target;                    # No autochomping in Perl 5 
    $text =~ m/ (?:\Q$target\E)* /x;  # Search for it, quoting metas

Raw arrays and hashes interpolate as literals, too. For example, if we use an array in a Perl 6 pattern, then the matcher will attempt to match any of its elements (each as a literal). So:

    # Perl 6
    @cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
    $str =~ / @cmd \( .*? \) /;     # Match a cmd, followed by stuff in parens

is the same as:

    # Perl 5 
    @cmd = ('get','put','try','find','copy','fold','spindle','mutilate');
    $cmd = join '|', map { quotemeta $_ } @cmd;
    $str =~ / (?:$cmd) \( .*? \) /;

By the way, putting the array into angle brackets would cause the matcher to try and match each of the array elements as a pattern, rather than as a literal.


The incredible $hunk

The rule that <$hunk> tries to match against is the next one defined in the program. Here's the annotated version of it:

    $hunk = rx :i {                             # Case-insensitively...
        [                                       #   Start a non-capturing group
            <$linenum>                          #     Match the subrule in $linenum
            a                                   #     Match a literal 'a'
            ::                                  #     Commit to this alternative
            <$linerange>                        #     Match the subrule in $linerange
            \n                                  #     Match a newline
            <$appendline>                       #     Match the subrule in $appendline...
                          +                     #         ...one-or-more times
        |                                       #   Or...
          <$linerange> d :: <$linenum> \n       #     Match $linerange, 'd', $linenum, newline
          <$deleteline>+                        #     Then match $deleteline once-or-more
        |                                       #   Or...
          <$linerange> c :: <$linerange> \n     #     Match $linerange, 'c', $linerange, newline
          <$deleteline>+                        #     Then match $deleteline once-or-more
          --- \n                                #     Then match three '-' and a newline
          <$appendline>+                        #     Then match $appendline once-or-more
        ]                                       #   End of non-capturing group
      |                                         # Or...
        (                                       #   Start a capturing group
            \N*                                 #     Match zero-or-more non-newlines
        )                                       #     End of capturing group
        :::                                     #     Emphatically commit to this alternative
        { fail "Invalid diff hunk: $1" }        #     Then fail with an error msg
    };

The first thing to note is that, like a Perl 5 qr, a Perl 6 rx can take (almost) any delimiters we choose. The $hunk pattern uses {...}, but we could have used:

    rx/pattern/     # Standard
    rx[pattern]     # Alternative bracket-delimiter style
    rx<pattern>     # Alternative bracket-delimiter style
    rx«forme»       # Délimiteurs très chic
    rx>pattern<     # Inverted bracketing is allowed too (!)
    rx»Muster«      # Begrenzungen im korrekten Auftrag
    rx!pattern!     # Excited
    rx=pattern=     # Unusual
    rx?pattern?     # No special meaning in Perl 6
    rx#pattern#     # Careful with these: they disable internal comments

Modified modifiers

In fact, the only characters not permitted as rx delimiters are ':' and '('. That's because ':' is the character used to introduce pattern modifiers in Perl 6, and '(' is the character used to delimit any arguments that might be passed to those pattern modifiers.

In Perl 6, pattern modifiers are placed before the pattern, rather than after it. That makes life easier for the parser, since it doesn't have to go back and reinterpret the contents of a rule when it reaches the end and discovers a /s or /m or /i or /x. And it makes life easier for anyone reading the code -- for precisely the same reason.

The only modifier used in the $hunk rule is the :i (case-insensitivity) modifier, which works exactly as it does in Perl 5.

The other rule modifiers available in Perl 6 are:

:e or :each

This is the replacement for Perl 5's /g modifier. It causes a match (or substitution) to be attempted as many times as possible. The name was changed because “each” is shorter and clearer in intent than “globally”. And because the :each modifier can be combined with other modifiers (see below) in such a way that it's no longer “global” in its effect.
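A quick sketch of :each in the Perl 6 notation described in this article (Perl 6 was unimplemented when this was written, so treat it as illustrative):

```
my $text = "teh cat sat on teh mat";
$text =~ s:each/teh/the/;    # every occurrence, like Perl 5's s/teh/the/g
# $text is now "the cat sat on the mat"
```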

:x($count)

This modifier is like :e, in that it causes the match or substitution to be attempted repeatedly. However, unlike :e, it specifies exactly how many times the match must succeed. For example:

    "fee fi "       =~ m:x(3)/ (f\w+) /;  # fails
    "fee fi fo"     =~ m:x(3)/ (f\w+) /;  # succeeds (matches "fee","fi","fo")
    "fee fi fo fum" =~ m:x(3)/ (f\w+) /;  # succeeds (matches "fee","fi","fo")

Note that the repetition count doesn't have to be a constant:

    m:x($repetitions)/ pattern /

There is also a series of tidy abbreviations for all the constant cases:

    m:1x/ pattern /         # same as: m:x(1)/ pattern /
    m:2x/ pattern /         # same as: m:x(2)/ pattern /
    m:3x/ pattern /         # same as: m:x(3)/ pattern /
    # etc.

:nth($count)

This modifier causes a match or substitution to be attempted repeatedly, but to ignore the first $count-1 successful matches. For example:

    my $foo = "fee fi fo fum";
    $foo =~ m:nth(1)/ (f\w+) /;        # succeeds (matches "fee")
    $foo =~ m:nth(2)/ (f\w+) /;        # succeeds (matches "fi")
    $foo =~ m:nth(3)/ (f\w+) /;        # succeeds (matches "fo")
    $foo =~ m:nth(4)/ (f\w+) /;        # succeeds (matches "fum")
    $foo =~ m:nth(5)/ (f\w+) /;        # fails
    $foo =~ m:nth($n)/ (f\w+) /;       # depends on the numeric value of $n
    $foo =~ s:nth(3)/ (f\w+) /bar/;    # $foo now contains: "fee fi bar fum"

Again, there is also a series of abbreviations:

    $foo =~ m:1st/ (f\w+) /;           # succeeds (matches "fee")
    $foo =~ m:2nd/ (f\w+) /;           # succeeds (matches "fi")
    $foo =~ m:3rd/ (f\w+) /;           # succeeds (matches "fo")
    $foo =~ m:4th/ (f\w+) /;           # succeeds (matches "fum")
    $foo =~ m:5th/ (f\w+) /;           # fails
    $foo =~ s:3rd/ (f\w+) /bar/;       # $foo now contains: "fee fi bar fum"

By the way, Perl isn't going to be pedantic about these “ordinal” versions of repetition specifiers. If you're not a native English speaker, and you find :1th, :2th, :3th, :4th, etc., easier to remember, then that's perfectly OK.

The various types of repetition modifiers can also be combined by separating them with additional colons:

    my $foo = "fee fi fo feh far foo fum ";
    $foo =~ m:2nd:2x/ (f\w+) /;        # succeeds (matches "fi", "feh")
    $foo =~ m:each:2nd/ (f\w+) /;      # succeeds (matches "fi", "feh", "foo")
    $foo =~ m:x(2):nth(3)/ (f\w+) /;   # succeeds (matches "fo", "foo")
    $foo =~ m:each:3rd/ (f\w+) /;      # succeeds (matches "fo", "foo")
    $foo =~ m:2x:4th/ (f\w+) /;        # fails (not enough matches to satisfy :2x)
    $foo =~ m:4th:each/ (f\w+) /;      # succeeds (matches "feh")
    $foo =~ s:each:2nd/ (f\w+) /bar/;  # $foo now "fee bar fo bar far bar fum ";

Note that the order in which the two modifiers are specified doesn't matter.

:p5 or :perl5

This modifier causes Perl 6 to interpret the contents of a rule as a regular expression in Perl 5 syntax. This is mainly provided as a transitional aid for porting Perl 5 code. And to mollify the curmudgeonly.

:w or :word

This modifier causes whitespace appearing in the pattern to match optional whitespace in the string being matched. For example, instead of having to cope with optional whitespace explicitly:

    $cmd =~ m/ \s* <keyword> \s* \( [\s* <arg> \s* ,?]* \s* \)/;

we can just write:

    $cmd =~ m:w/ <keyword> \( [ <arg> ,?]* \)/;

The :w modifier is also smart enough to detect those cases where the whitespace should actually be mandatory. For example:

    $str =~ m:w/a symmetric ally/

is the same as:

    $str =~ m/a \s+ symmetric \s+ ally/

rather than:

    $str =~ m/a \s* symmetric \s* ally/

So it won't accidentally match strings like "asymmetric ally" or "asymmetrically".
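
Neither Perl 5 nor Python's re has a :w modifier, but the rule it applies is easy to model: whitespace between two "wordy" tokens becomes mandatory \s+, and optional \s* otherwise. A minimal Python sketch (word_aware_join is a hypothetical helper, not a real API):

```python
import re

def word_aware_join(tokens):
    """Join regex fragments the way :w treats pattern whitespace:
    mandatory \\s+ where both neighbours end/start with word characters,
    optional \\s* otherwise. (Illustrative helper only.)"""
    out = [tokens[0]]
    for prev, cur in zip(tokens, tokens[1:]):
        wordy = re.search(r"\w$", prev) and re.search(r"^\w", cur)
        out.append(r"\s+" if wordy else r"\s*")
        out.append(cur)
    return "".join(out)

pat = re.compile(word_aware_join(["a", "symmetric", "ally"]))
assert pat.fullmatch("a symmetric ally")    # the whitespace was mandatory...
assert not pat.fullmatch("asymmetrically")  # ...so this can't match
```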

:any

This modifier causes the rule to match a given string in every possible way, simultaneously, and then return all the possible matches. For example:

    my $str = "ahhh";
    @matches =  $str =~ m/ah*/;         # returns "ahhh"
    @matches =  $str =~ m:any/ah*/;     # returns "ahhh", "ahh", "ah", "a"
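
A conventional backtracking engine returns only one of those matches, so the closest Python analogue is brute-force enumeration. This sketch (the helper name is mine) only enumerates matches anchored at the start of the string, which is enough to reproduce the example above:

```python
import re

def any_matches(pattern, text):
    """Every way an anchored pattern can match a prefix of text, longest
    first -- a brute-force stand-in for the :any modifier."""
    rx = re.compile(pattern)
    return [text[:i] for i in range(len(text), -1, -1)
            if rx.fullmatch(text, 0, i)]

assert any_matches(r"ah*", "ahhh") == ["ahhh", "ahh", "ah", "a"]
```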

:u0, :u1, :u2, :u3

These modifiers specify how the rule matches the dot (.) metacharacter against Unicode data. If :u0 is specified, then dot matches a single byte; if :u1 is specified, then dot matches a single codepoint (i.e. one or more bytes representing a single Unicode “character”). If :u2 is specified, then dot matches a single grapheme (i.e. a base codepoint followed by zero or more modifier codepoints, such as accents). If :u3 is specified, then dot matches an appropriate “something” in a language-dependent manner.

It's OK to ignore this modifier if you're not using Unicode (and maybe even if you are). As usual, Perl will try to do the right thing. To that end, the default behavior of rules is :u2, unless an overriding pragma (e.g. use bytes) is in effect.

Note that the /s, /m, and /e modifiers are no longer available. This is because they're no longer needed. The /s isn't needed because the . (dot) metacharacter now matches newlines as well. When we want to match “anything except a newline”, we now use the new \N metatoken (i.e. “opposite of \n”).

The /m modifier isn't required, because ^ and $ always mean start and end of string, respectively. To match the start and end of a line, we use the new ^^ and $$ metatokens instead.

The /e modifier is no longer needed, because Perl 6 provides the $(...) string interpolator (as described in Apocalypse 2). So a substitution such as:

    # Perl 5
    s/(\w+)/ get_val_for($1) /e;

becomes just:

    # Perl 6
    s/(\w+)/$( get_val_for($1) )/;
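
Python's re achieves the same effect by passing a callable to re.sub; the get_val_for lookup here is a made-up stand-in for whatever computation you need:

```python
import re

def get_val_for(key):
    # Hypothetical lookup table, purely for illustration
    return {"host": "example.com", "port": "8080"}.get(key, "?")

# The callable plays the role of Perl 5's /e (or Perl 6's $(...)):
result = re.sub(r"(\w+)", lambda m: get_val_for(m.group(1)), "host:port")
assert result == "example.com:8080"
```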

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.

Take no prisoners

The first character of the $hunk rule is an opening square bracket. In Perl 5, that denoted the start of a character class, but not in Perl 6. In Perl 6, square brackets mark the boundaries of a noncapturing group. That is, a pair of square brackets in Perl 6 is the same as a (?:...) in Perl 5, but less line-noisy.

By the way, to get a character class in Perl 6, we need to put the square brackets inside a pair of metasyntactic angle brackets. So the Perl 5:

    # Perl 5
    / [A-Za-z] [0-9]+ /x          # An A-Z or a-z, followed by digits

would become in Perl 6:

    # Perl 6
    / <[A-Za-z]> <[0-9]>+ /       # An A-Z or a-z, followed by digits

The Perl 5 complemented character class:

    # Perl 5
    / [^A-Za-z]+ /x               # One-or-more chars-that-aren't-A-Z-or-a-z

becomes in Perl 6:

    # Perl 6
    / <-[A-Za-z]>+ /              #  One-or-more chars-that-aren't-A-Z-or-a-z

The external minus sign is used (instead of an internal caret), because Perl 6 allows proper set operations on character classes, and the minus sign is the “difference” operator. So we could also create:

    # Perl 6
    / < <alpha> - [A-Za-z] >+ /   # All alphabetics except A-Z or a-z
                                  # (i.e. the accented alphabetics)

Explicit character classes were deliberately made a little less convenient in Perl 6, because they're generally a bad idea in a Unicode world. For example, the [A-Za-z] character class in the above examples won't even match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham, Cherokee, or Klingon.
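
Python's Unicode-aware re exhibits exactly the same pitfall, which makes for an easy demonstration:

```python
import re

ascii_alpha = re.compile(r"[A-Za-z]+")

# The explicit class misses perfectly good accented letters...
assert ascii_alpha.fullmatch("cafe")
assert not ascii_alpha.fullmatch("café")

# ...which a Unicode-aware alphabetic test happily accepts
assert "café".isalpha()
```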


Meanwhile, back at the $hunk ...

The noncapturing group of the $hunk pattern groups together three alternatives, separated by | metacharacters (as in Perl 5). The first alternative:

    <$linenum> a :: <$linerange>
    \n                         
    <$appendline>+

grabs whatever is in the $linenum variable, treats it as a subpattern, and attempts to match against it. It then matches a literal letter 'a' (or an 'A', because of the :i modifier on the rule). Then whatever the contents of the $linerange variable match. Then a newline. Then it tries to match whatever the pattern in $appendline would match, one or more times.

But what about that double-colon after the a? Shouldn't the pattern have tried to match two colons at that point?


This or nothing

Actually, no. The double-colon is a new Perl 6 pattern-control structure. It has no effect (and is ignored) when the pattern is successfully matching, but if the pattern match should fail, and consequently back-track over the double-colon -- for example, to try and rematch an earlier repetition one fewer times -- the double-colon causes the entire surrounding group (i.e. the surrounding [...] in this case) to fail as well.

That's a useful optimization in this case because, if we match a line number followed by an 'a' but subsequently fail, then there's no point even trying either of the other two alternatives in the same group. Because we found an 'a', there's no chance we could match a 'd' or a 'c' instead.

So, in general, a double-colon means: “At this point I'm committed to this alternative within the current group -- don't bother with the others if this one fails after this point”.

There are other control directives like this, too. A single colon means: “Don't bother backtracking into the previous element”. That's useful in a pattern like:

    rx:w/ $keyword [-full|-quick|-keep]+ : end /

Suppose we successfully match the keyword (as a literal, by the way) and one or more of the three options, but then fail to match 'end'. In that case, there's no point backtracking and trying to match one fewer option, and still failing to find an 'end'. And then backtracking another option, and failing again, etc. By using the colon after the repetition, we tell the matcher to give up after the first attempt.

However, the single colon isn't just a “Greed is Good” operator. It's much more like a “Resistance is Futile” operator. That is, if the preceding repetition had been non-greedy instead:

    rx:w/ $keyword [-full|-quick|-keep]+? : end /

then backtracking over the colon would prevent the +? from attempting to match more options. Note that this means that x+?: is just a baroque way of matching exactly one repetition of x, since the non-greedy repetition initially tries to match the minimal number of times (i.e. once) and the trailing colon then prevents it from backtracking and trying longer matches. Likewise, x*?: and x??: are arcane ways of matching exactly zero repetitions of x.

Generally, though, a single colon tells the pattern matcher that there's no point trying any other match on the preceding repetition, because retrying (whether more or fewer repetitions) would just waste time and would still fail.
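
Perl 5-style engines offer a similar commitment through atomic grouping. Python's re gained native (?>...) only in 3.11, but the classic lookahead-plus-backreference emulation works everywhere: once the lookahead has captured, the engine never backtracks into it, so the repetition keeps its ground just as a trailing colon would make it do:

```python
import re

# Plain greedy repetition: backtracking gives back one 'a', so this succeeds
assert re.fullmatch(r"a+ab", "aaab")

# "Atomic" version: (?=(a+))\1 captures greedily inside a lookahead, then
# consumes the capture; Python never backtracks into a lookaround, so the
# repetition can't surrender characters and the overall match fails
assert re.fullmatch(r"(?=(a+))\1ab", "aaab") is None
```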

There's also a three-colon directive. Three colons means: “If we have to backtrack past here, cause the entire rule to fail” (i.e. not just this group). If the double-colon in $hunk had been triple:

    <$linenum> a ::: <$linerange>
    \n                         
    <$appendline>+

then matching a line number and an 'a' and subsequently failing would cause the entire $hunk rule to fail immediately (though the $file rule that invoked it might still match successfully in some other way).

So, in general, a triple-colon specifies: “At this point I'm committed to this way of matching the current rule -- give up on the rule completely if the matching process fails at this point”.

Four colons ... would just be silly. So, instead, there's a special named directive: <commit>. Backtracking through a <commit> causes the entire match to immediately fail. And if the current rule is being matched as part of a larger rule, that larger rule will fail as well. In other words, it's the “Blow up this Entire Planet and Possibly One or Two Others We Noticed on our Way Out Here” operator.

If the double-colon in $hunk had been a <commit> instead:

    <$linenum> a <commit> <$linerange>
    \n                         
    <$appendline>+

then matching a line number and an 'a' and subsequently failing would cause the entire $hunk rule to fail immediately, and would also cause the $file rule that invoked it to fail immediately.

So, in general, a <commit> means: “At this point I'm committed to this way of completing the current match -- give up all attempts at matching anything if the matching process fails at this point”.


Failing with style

The other two alternatives:

    | <$linerange> d :: <$linenum> \n
      <$deleteline>+                 
    | <$linerange> c :: <$linerange> \n
      <$deleteline>+  --- \n  <$appendline>+

are just variants on the first.

If none of the three alternatives in the square brackets matches, then the alternative outside the brackets is tried:

    |  (\N*) ::: { fail "Invalid diff hunk: $1" }

This captures a sequence of non-newline characters (\N means “not \n”, in the same way \S means “not \s” or \W means “not \w”). Then it invokes a block of Perl code inside the pattern. The call to fail causes the match to fail at that point, and sets an associated error message that would subsequently appear in the $! error variable (and which would also be accessible as part of $0).

Note the use of the triple colon after the repetition. It's needed because the fail in the block will cause the pattern match to backtrack, but there's no point backing up one character and trying again, since the original failure was precisely what we wanted. The presence of the triple-colon causes the entire rule to fail as soon as the backtracking reaches that point the first time.

The overall effect of the $hunk rule is therefore either to match one hunk of the diff, or else fail with a relevant error message.


Home, home on the (line)range

The third and fourth rules:

    $linerange = rx/ <$linenum> , <$linenum>
                   | <$linenum> 
                   /;
    $linenum = rx/ \d+ /;

specify that a line number consists of a series of digits, and that a line range consists of either two line numbers with a comma between them or a single line number. The $linerange rule could also have been written:

    $linerange = rx/ <$linenum> [ , <$linenum> ]? /;

which might be marginally more efficient, since it doesn't have to backtrack and rematch the first $linenum in the second alternative. It's likely, however, that the rule optimizer will detect such cases and automatically hoist the common prefix out anyway, so it's probably not worth the decrease in readability to do that manually.


What's my line?

The final two rules specify the structure of individual context lines in the diff (i.e. the lines that say what text is being added or removed by the hunk):

    $deleteline = rx/^^ \< <sp> (\N* \n) /
    $appendline = rx/^^ \> <sp> (\N* \n) /

The ^^ markers ensure that each rule starts at the beginning of an entire line.

The first character on that line must be either a '<' or a '>'. Note that we have to escape these characters since angle brackets are metacharacters in Perl 6. An alternative would be to use the “literal string” metasyntax:

    $deleteline = rx/^^ <'<'> <sp> (\N* \n) /
    $appendline = rx/^^ <'>'> <sp> (\N* \n) /

That is, angle brackets with a single-quoted string inside them match the string's sequence of characters as literals (including whitespace and other metatokens).

Or we could have used the quotemeta metasyntax (\Q[...]):

    $deleteline = rx/^^ \Q[<] <sp> (\N* \n) /
    $appendline = rx/^^ \Q[>] <sp> (\N* \n) /

Note that Perl 5's \Q...\E construct is replaced in Perl 6 by just the \Q marker, which now takes a group after it.
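
Python's counterpart to quotemeta is re.escape, which neutralises any metacharacters in a literal string before it's spliced into a pattern:

```python
import re

# re.escape plays the role of Perl's quotemeta/\Q: the marker string is
# matched literally, even if it contains regex metacharacters
deleteline = re.compile("^" + re.escape("<") + r" (.*\n)")
appendline = re.compile("^" + re.escape(">") + r" (.*\n)")

m = deleteline.match("< some old text\n")
assert m and m.group(1) == "some old text\n"
assert appendline.match("> some new text\n")
```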

We could also have used a single-letter character class:

    $deleteline = rx/^^ <[<]> <sp> (\N* \n) /
    $appendline = rx/^^ <[>]> <sp> (\N* \n) /

or even a named character (\c[CHAR NAME HERE]):

    $deleteline = rx/^^ \c[LEFT ANGLE BRACKET] <sp> (\N* \n) /
    $appendline = rx/^^ \c[RIGHT ANGLE BRACKET] <sp> (\N* \n) /

Whether any of those MTOWTDI is better than just escaping the angle bracket is, of course, a matter of personal taste.


The final frontier

After the leading angle, a single literal space is expected. Again, we could have specified that by escapology (\ ) or literalness (<' '>) or quotemetaphysics (\Q[ ]) or character classification (<[ ]>), or deterministic nominalism (\c[SPACE]), but Perl 6 also gives us a simple name for the space character: <sp>. This is the preferred option, since it reduces line-noise and makes the significant space much harder to miss.

Perl 6 provides predefined names for other useful subpatterns as well, including:

<dot>

which matches a literal dot ('.') character (i.e. it's a more elegant synonym for \.);

<lt> and <gt>

which match a literal '<' and '>' respectively. These give us yet another way of writing:

    $deleteline = rx/^^ <lt> <sp> (\N* \n) /
    $appendline = rx/^^ <gt> <sp> (\N* \n) /

<ws>

which matches any sequence of whitespace (i.e. it's a more elegant synonym for \s+). Optional whitespace is, therefore, specified as <ws>? or <ws>* (Perl 6 will accept either);

<alpha>

which matches a single alphabetic character (i.e. it's like the character class <[A-Za-z]> but it handles accented characters and alphabetic characters from non-Roman scripts as well);

<ident>

which is a short-hand for [ [<alpha>|_] \w* ] (i.e. a standard identifier in many languages, including Perl).

Using named subpatterns like these makes rules clearer in intent, easier to read, and more self-documenting. And, as we'll see shortly, they're fully generalizable...we can create our own.
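
The same idea translates to any regex dialect: keep a library of named fragments and interpolate them. A small Python sketch (the FRAGMENTS table and rx() expander are my own inventions; Python has no built-in <ident> rule):

```python
import re

# Reusable named fragments in the spirit of <alpha>, <ident>, <ws>
FRAGMENTS = {
    "alpha": r"[^\W\d_]",     # one alphabetic character, Unicode-aware
    "ident": r"[^\W\d]\w*",   # letter or underscore, then word characters
    "ws":    r"\s+",
}

def rx(template):
    """Expand <name> placeholders from FRAGMENTS and compile the result."""
    return re.compile(re.sub(r"<(\w+)>",
                             lambda m: FRAGMENTS[m.group(1)], template))

assert rx(r"<ident>").fullmatch("foo_bar42")
assert not rx(r"<ident>").fullmatch("42foo")
assert rx(r"<alpha>+").fullmatch("héllo")
```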


Match-maker, match-maker...

Finally, we're ready to actually read in and match a diff file. In Perl 5, we'd do that like so:

    # Perl 5
    local $/;          # Disable input record separator (enable slurp mode)
    my $text = <>;     # Slurp up input stream into $text
    print "Valid diff" 
        if $text =~ /$file/;

We could do the same thing in Perl 6 (though the syntax would differ slightly) and in this case that would be fine. But, in general, it's clunky to have to slurp up the entire input before we start matching. The input might be huge, and we might fail early. Or we might want to match input interactively (and issue an error message as soon as the input fails to match). Or we might be matching a series of different formats. Or we might want to be able to leave the input stream in its original state if the match fails.

The inability to do pattern matches immediately on an input stream is one of Perl 5's few weaknesses when it comes to text processing. Sure, we can read line-by-line and apply pattern matching to each line, but trying to match a construct that may be laid out across an unknown number of lines is just painful.

Not in Perl 6 though. In Perl 6, we can bind an input stream to a scalar variable (i.e. like a Perl 5 tied variable) and then just match on the characters in that stream as if they were already in memory:

    my $text is from($*ARGS);       # Bind scalar to input stream
    print "Valid diff" 
        if $text =~ /<$file>/;      # Match against input stream

The important point is that, after the match, only those characters that the pattern actually matched will have been removed from the input stream.

It may also be possible to skip the variable entirely and just write:

    print "Valid diff" 
        if $*ARGS =~ /<$file>/;     # Match against input stream

or:

    print "Valid diff" 
        if <> =~ /<$file>/;         # Match against input stream

but that's yet to be decided.


A cleaner approach

The previous example solves the problem of recognizing a valid diff file quite nicely (and with only six rules!), but it does so by cluttering up the program with a series of variables storing those precompiled patterns.

It's as if we were to write a collection of subroutines like this:

    my $print_name = sub ($data) { print $data{name}, "\n"; };
    my $print_age  = sub ($data) { print $data{age}, "\n"; };
    my $print_addr = sub ($data) { print $data{addr}, "\n"; };
    my $print_info = sub ($data) {
        $print_name($data);
        $print_age($data);
        $print_addr($data);
    };
    # and later...
    $print_info($info);

You could do it that way, but it's not the right way to do it. The right way to do it is as a collection of named subroutines or methods, often collected together in the namespace of a class or module:

    module Info {
        sub print_name ($data) { print $data{name}, "\n"; }
        sub print_age ($data)  { print $data{age}, "\n"; }
        sub print_addr ($data) { print $data{addr}, "\n"; }
        sub print_info ($data) {
            print_name($data);
            print_age($data);
            print_addr($data);
        }
    }
    Info::print_info($info);

So it is with Perl 6 patterns. You can write them as a series of pattern objects created at run-time, but they're much better specified as a collection of named patterns, collected together at compile-time in the namespace of a grammar.

Here's the previous diff-parsing example rewritten that way (and with a few extra bells-and-whistles added in):

    grammar Diff {
        rule file { ^  <hunk>*  $ }
        rule hunk :i { 
            [ <linenum> a :: <linerange> \n
              <appendline>+ 
            |
              <linerange> d :: <linenum> \n
              <deleteline>+
            |
              <linerange> c :: <linerange> \n
              <deleteline>+
              --- \n
              <appendline>+
            ]
          |
            <badline("Invalid diff hunk")>
        }
        rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
        rule linerange { <linenum> , <linenum>
                       | <linenum>
                       }
        rule linenum { \d+ }
        rule deleteline { ^^ <out_marker> (\N* \n) }
        rule appendline { ^^ <in_marker>  (\N* \n) }
        rule out_marker { \<  <sp> }
        rule in_marker  { \>  <sp> }
    }
    # and later...
    my $text is from($*ARGS);
    print "Valid diff" 
        if $text =~ /<Diff.file>/;
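
As a cross-check of the grammar's shape, here's a rough transliteration into composed Python regexes (the fragment names are mine, and this sketch models recognition only, not the badline error reporting):

```python
import re

LINENUM   = r"\d+"
LINERANGE = rf"{LINENUM}(?:,{LINENUM})?"
APPEND    = r"(?:> .*\n)+"     # one-or-more "> ..." lines
DELETE    = r"(?:< .*\n)+"     # one-or-more "< ..." lines

HUNK = re.compile(
    rf"{LINENUM}a{LINERANGE}\n{APPEND}"
    rf"|{LINERANGE}d{LINENUM}\n{DELETE}"
    rf"|{LINERANGE}c{LINERANGE}\n{DELETE}---\n{APPEND}",
    re.IGNORECASE,
)

assert HUNK.fullmatch("2a3\n> new line\n")
assert HUNK.fullmatch("4,5d3\n< gone\n< gone too\n")
assert HUNK.fullmatch("1,2c1\n< old a\n< old b\n---\n> new\n")
```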

What's in a name?

The grammar declaration creates a new namespace for rules (in the same way a class or module declaration creates a new namespace for methods or subroutines). If a block is specified after the grammar's name:

    grammar HTML {
        rule file :iw { \Q[<HTML>]  <head>  <body>  \Q[</HTML>] }
        rule head :iw { \Q[<HEAD>]  <head_tag>+  \Q[<HEAD>] }
        # etc.
    } # Explicit end of HTML grammar

then that new namespace is confined to that block. Otherwise the namespace continues until the end of the source section of the current file:

    grammar HTML;
    rule file :iw { \Q[<HTML>]  <head>  <body>  \Q[</HTML>] }
    rule head :iw { \Q[<HEAD>]  <head_tag>+  \Q[<HEAD>] }
    # etc.
    # Implicit end of HTML grammar
    __END__

Note that, as with the blockless variants on class and module, this form of the syntax is designed to simplify one-namespace-per-file situations. It's a compile-time error to put two or more blockless grammars, classes or modules in a single file.

Within the namespace, named rules are defined using the rule declarator. It's analogous to the sub declarator within a module, or the method declarator within a class. Just like a class method, a named rule has to be invoked through its grammar if we refer to it outside its own namespace. That's why the actual match became:

    $text =~ /<Diff.file>/;         # Invoke through grammar

If we want to match a named rule, we put the name in angle brackets. Indeed, many of the constructs we've already seen -- <sp>, <ws>, <ident>, <alpha>, <commit> -- are really just predefined named rules that come standard with Perl 6.

Like subroutines and methods, within their own namespace, rules don't have to be qualified. Which is why we can write things like:

    rule linerange { <linenum> , <linenum>
                   | <linenum>
                   }

instead of:

    rule linerange { <Diff.linenum> , <Diff.linenum>
                   | <Diff.linenum>
                   }

Using named rules has several significant advantages, apart from making the patterns look cleaner. For one thing, the compiler may be able to optimize the embedded named rules better. For example, it could inline the attempts to match <linenum> within the linerange rule. In the rx version:

    $linerange = rx{ <$linenum> , <$linenum>
                   | <$linenum>
                   };

that's not possible, since the pattern matching mechanism won't know what's in $linenum until it actually tries to perform the match.

By the way, we can still use interpolated <$subrule>-ish subpatterns in a named rule, and we can use named subpatterns in an rx-ish rule. The difference between rule and rx is just that a rule can have a name and must use {...} as its delimiters, whereas an rx doesn't have a name and can use any allowed delimiters.


Bad line! No match!

This version of the diff parser has an additional rule, named badline. This rule illustrates another similarity between rules and subroutines/methods: rules can take arguments. The badline rule factors out the error message creation at the end of the hunk rule. Previously that rule ended with:

    |  (\N*) ::: { fail "Invalid diff hunk: $1" }

but in this version it ends with:

    |  <badline("Invalid diff hunk")>

That's a much better abstraction of the error condition. It's easier to understand and easier to maintain, but it does require us to be able to pass an argument (the error message) to the new badline subrule. To do that, we simply declare it to have a parameter list:

    rule badline($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }

Note the strong syntactic parallel with a subroutine definition:

    sub  subname($param)  { ... }

The argument is passed to a subrule by placing it in parentheses after the rule name within the angle brackets:

    |  <badline("Invalid diff hunk")>

The argument can also be passed without the parentheses, but then it is interpreted as if it were the body of a separate rule:

    rule list_of ($pattern) { 
            <$pattern> [ , <$pattern> ]*
    }
    # and later...
    $str =~ m:w/  \[                  # Literal opening square bracket
                  <list_of \w\d+>     # Call list_of subrule passing rule rx/\w\d+/
                  \]                  # Literal closing square bracket
               /;

A rule can take as many arguments as it needs to:

    rule seplist($elem, $sep) {
            <$elem>  [ <$sep> <$elem> ]*
    }

and those arguments can also be passed by name, using the standard Perl 6 pair-based mechanism (as described in Apocalypse 3).

    $str =~ m:w/
                \[                                      # literal left square bracket
                <seplist(sep=>":", elem=>rx/<ident>/)>  # colon-separated list of identifiers
                \]                                      # literal right square bracket
               /;

Note that the list's element specifier is itself an anonymous rule, which the seplist rule will subsequently interpolate as a pattern (because the $elem parameter appears in angle brackets within seplist).
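
Parameterised rules correspond, in conventional-regex terms, to functions that build and compile a pattern from their arguments. A Python sketch of seplist (whitespace handling simplified, names my own):

```python
import re

def seplist(elem, sep):
    """A parameterised 'rule': elem, then any number of sep-elem pairs."""
    return re.compile(rf"{elem}(?:\s*{sep}\s*{elem})*")

idents = seplist(r"[A-Za-z_]\w*", r":")
assert idents.fullmatch("foo : bar : baz")
assert not idents.fullmatch("foo : : baz")
```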



Thinking ahead

The only other change in the grammar version of the diff parser is that the matching of the '<' and '>' at the start of the context lines has been factored out. Whereas before we had:

    $deleteline = rx/^^ \< <sp> (\N* \n) /
    $appendline = rx/^^ \> <sp> (\N* \n) /

now we have:

    rule deleteline { ^^ <out_marker> (\N* \n) }
    rule appendline { ^^ <in_marker>  (\N* \n) }
    rule out_marker { \<  <sp> }
    rule in_marker  { \>  <sp> }

That seems like a step backwards, since it complicates the grammar for no obvious benefit, but the benefit will be reaped later, when we discover another type of diff file that uses different markers for incoming and outgoing lines.


What you match is what you get

Both the variable-based and grammatical versions of the code above do a great job of recognizing a diff, but that's all they do. If we only want syntax checking, that's fine. But, generally, if we're parsing data what we really want is to do something useful with it: transform it into some other syntax, make changes to its contents, or perhaps convert it to a Perl internal data structure for our program to manipulate.

Suppose we did want to build a hierarchical Perl data structure representing the diff that the above examples match. What extra code would we need?

None.

That's right. Whenever Perl 6 matches a pattern, it automatically builds a “result object” representing the various components of the match.

That result object is named $0 (the program's name is now $*PROG) and it's lexical to the scope in which the match occurs. The result object stores (amongst other things) the complete string matched by the pattern, and it evaluates to that string when used in a string context. For example:

    if ($text =~ /<Diff.file>/) {
        $difftext = $0;
    }

That's handy, but not really useful for extracting data structures. However, in addition, any components within a match that were captured using parentheses become elements of the object's array attribute, and are accessible through its array index operator. So, for example, when a pattern such as:

    rule linenum_plus_comma { (\d+) (,?) };

matches successfully, the array element 1 of the result object (i.e. $0[1]) is assigned the result of the first parenthesized capture (i.e. the digits), whilst the array element 2 ($0[2]) receives the comma. Note that array element zero of any result object is assigned the complete string that the pattern matched.

There are also abbreviations for each of the array elements of $0. $0[1] can also be referred to as...surprise, surprise...$1, $0[2] can also be referred to as $2, $0[3] as $3, etc. Like $0, each of these numeric variables is also lexical to the scope in which the pattern match occurred.

The parts of a matched string that were matched by a named subrule become entries in the result object's hash attribute, and are subsequently accessible through its hash lookup operator. So, for example, when the pattern:

    rule deleteline { ^^ <out_marker> (\N* \n) }

matches, the result object's hash entry for the key 'out_marker' (i.e. $0{out_marker}) will contain the result object returned by the successful nested match of the out_marker subrule.


A hypothetical solution to a very real problem

Named capturing into a hash is very convenient, but it doesn't work so well for a rule like:

    rule linerange {
          <linenum> , <linenum>
        | <linenum>
    }

The problem is that the hash attribute of the rule's $0 can only store one entry with the key 'linenum'. So if the <linenum> , <linenum> alternative matches, then the result object from the second match of <linenum> will overwrite the entry for the first <linenum> match.

The solution to this is a new Perl 6 pattern matching feature known as “hypothetical variables”. A hypothetical variable is a variable that is declared and bound within a pattern match (i.e. inside a closure within a rule). The variable is declared, not with a my, our, or temp, but with the new keyword let, which was chosen because it's what mathematicians and other philosophers use to indicate a hypothetical assumption.

Once declared, a hypothetical variable is then bound using the normal binding operator. For example:

    rule checked_integer {
            (\d+)                   # Match and capture one-or-more digits
            { let $digits := $1 }   # Bind to hypothetical var $digits
            -                       # Match a hyphen
            (\d)                    # Match and capture one digit
            { let $check := $2 }    # Bind to hypothetical var $check
    }

In this example, if a sequence of digits is found, then the $digits variable is bound to that substring. Then, if the dash and check-digit are matched, the digit is bound to $check. However, if the dash or digit is not matched, the match will fail and backtrack through the closure. This backtracking causes the $digits hypothetical variable to be automatically un-bound. Thus, if a rule fails to match, the hypothetical variables within it are not associated with any value.

Each hypothetical variable is really just another name for the corresponding entry in the result object's hash attribute. So binding a hypothetical variable like $digits within a rule actually sets the $0{digits} element of the rule's result object.

So, for example, to distinguish the two line numbers within a line range:

    rule linerange {
          <linenum> , <linenum>
        | <linenum>
    }

we could bind them to two separate hypothetical variables -- say, $from and $to -- like so:

    rule linerange {
          (<linenum>)               # Match linenum and capture result as $1
          { let $from := $1 }       # Save result as hypothetical variable
          ,                         # Match comma
          (<linenum>)               # Match linenum and capture result as $2
          { let $to := $2 }         # Save result as hypothetical variable
        |
          (<linenum>)               # Match linenum and capture result as $3
          { let $from := $3 }       # Save result as hypothetical variable
    }

Now our result object has a hash entry $0{from} and (maybe) one for $0{to} (if the first alternative was the one that matched). In fact, we could ensure that the result always has a $0{to}, by setting the corresponding hypothetical variable in the second alternative as well:

    rule linerange {
          (<linenum>)
          { let $from := $1 }
          ,         
          (<linenum>)
          { let $to := $2 }
        |
          (<linenum>)
          { let $from := $3; let $to := $from }
    }

Problem solved.

But only by introducing a new problem. All that hypothesizing made our rule ugly and complex. So Perl 6 provides a much prettier short-hand:

    rule linerange {
          $from := <linenum>          # Match linenum rule, bind result to $from
          ,                           # Match comma
          $to := <linenum>            # Match linenum rule, bind result to $to
        |                             # Or...
          $from := $to := <linenum>   # Match linenum rule,
    }                                 #   bind result to both $from and $to

or, more compactly:

    rule linerange {
          $from:=<linenum> , $to:=<linenum>
        | $from:=$to:=<linenum>
    }

If a Perl 6 rule contains a variable that is immediately followed by the binding operator (:=), that variable is never interpolated. Instead, it is treated as a hypothetical variable, and bound to the result of the next component of the rule (in the above examples, to the result of the <linenum> subrule match).
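
Named capture groups in Perl 5-style engines give a rough analogue of this binding. In Python, groupdict() stands in for the result object's hash (I've used frm rather than from, which is a Python keyword):

```python
import re

# ?P<frm> and ?P<to> play the role of $from := and $to :=
linerange = re.compile(r"(?P<frm>\d+)(?:,(?P<to>\d+))?")

m = linerange.fullmatch("5,9")
assert m.groupdict() == {"frm": "5", "to": "9"}

m = linerange.fullmatch("7")
assert m.groupdict() == {"frm": "7", "to": None}
```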

You can also use hypothetical arrays and hashes, binding them to a component that captures repeatedly. For example, we might choose to name our set of hunks:

    rule file { ^  @adonises := <hunk>*  $ }

collecting all the <hunk> matches into a single array, which would then be available after the match as $0{'@adonises'}. (Note that the sigil is included in the key in this case.)
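Assuming the binding semantics just described, that named array can then be used like any other entry of the result object. For example (a purely hypothetical sketch, with $text standing in for the text being matched):

    if $text =~ /<file>/ {
        # @adonises holds one result object per <hunk> match,
        # so its length is the number of hunks in the diff
        print "Matched ", +@{ $0{'@adonises'} }, " hunks\n";
    }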

Or we might choose to bind a hypothetical hash:

    rule config {
        %init :=            # Hypothetically, bind %init to...
            [               # Start of group
                (<ident>)   # Match and capture an identifier
                \h*=\h*     # Match an equals sign with optional horizontal whitespace
                (\N*)       # Match and capture the rest of the line
                \n          # Match the newline
            ]*
    }

where the [...]* grouping captures two substrings on each repetition and converts them to a key/value pair, which is then added to the hash. The first captured substring in each repetition becomes the key, and the second becomes its associated value. The hypothetical %init hash is also available through the rule's result object, as $0{'%init'} (again, with the sigil as part of the key).
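As a concrete (and entirely invented) illustration, given input text like:

    name = Damian
    job  = Hacker

a successful match against the config rule described above would leave the hash available through the result object as:

    $0{'%init'}    # { name => 'Damian', job => 'Hacker' }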


The nesting instinct

Of course, those line number submatches in:

    rule linerange {
          $from:=<linenum> , $to:=<linenum>
        | $from:=$to:=<linenum>
    }

will have returned their own result objects. And it's a reference to those nested result objects that actually gets stored in linerange's $0{from} and $0{to}.

Likewise, in the next higher rule:

    rule hunk :i { 
        [ <linenum> a :: <linerange> \n
          <appendline>+ 
        |
          <linerange> d :: <linenum> \n
          <deleteline>+
        |
          <linerange> c :: <linerange> \n
          <deleteline>+
          --- \n
          <appendline>+
        ]
    };

the match on <linerange> will return its $0 object. So, within the hunk rule, we could access the “from” digits of the line range of the hunk as: $0{linerange}{from}.

Likewise, at the highest level:

    rule file { ^  <hunk>*  $ }

we are matching a series of hunks, so the hypothetical $hunk variable (and hence $0{hunk}) will contain a result object whose array attribute contains the series of result objects returned by each individual <hunk> match.

So, for example, we could access the “from” digits of the line range of the third hunk as: $0{hunk}[2]{linerange}{from}.


Extracting the insertions

More usefully, we could locate and print every line in the diff that was being inserted, regardless of whether it was inserted by an “append” or a “change” hunk. Like so:

    my $text is from($*ARGS);
    if $text =~ /<Diff.file>/ {
        for @{ $0{file}{hunk} } -> $hunk {
            print @{$hunk{appendline}}
                if $hunk{appendline};
        }
    }

Here, the if statement attempts to match the text against the pattern for a diff file. If it succeeds, the for loop grabs the <hunk>* result object, treats it as an array, and iterates through it, aliasing each hunk match object in turn to $hunk. The array of append lines for each hunk match is then printed (if there is in fact a reference to such an array in the hunk).


Don't just match there; do something!

Because Perl 6 patterns can have arbitrary code blocks inside them, it's easy to have a pattern actually perform syntax transformations whilst it's parsing. That's often a useful technique because it allows us to manipulate the various parts of a hierarchical representation locally (within the rules that recognize them).

For example, suppose we wanted to “reverse” the diff file. That is, suppose we had a diff that specified the changes required to transform file A to file B, but we needed the back-transformation instead: from file B to file A. That's relatively easy to create. We just turn every “append” into a “delete”, every “delete” into an “append”, and reverse every “change”.

The following code does exactly that:

    grammar ReverseDiff {
        rule file { ^  <hunk>*  $ }
        rule hunk :i { 
            [ <linenum> a :: <linerange> \n
              <appendline>+ 
              { @$appendline =~ s/<in_marker>/< /;
                let $0 := "${linerange}d${linenum}\n"
                        _ join "", @$appendline;
              }
            |
              <linerange> d :: <linenum> \n
              <deleteline>+
              { @$deleteline =~ s/<out_marker>/> /;
                let $0 := "${linenum}a${linerange}\n"
                        _ join "", @$deleteline;
              }
            |
              $from:=<linerange> c :: $to:=<linerange> \n
              <deleteline>+
              --- \n
              <appendline>+
              { @$appendline =~ s/<in_marker>/< /;
                @$deleteline =~ s/<out_marker>/> /;
                let $0 := "${to}c${from}\n"
                        _ join("", @$appendline)
                        _ "---\n"
                        _ join("", @$deleteline);
              }
            ]
          |
            <badline("Invalid diff hunk")>
        }
        rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
        rule linerange { $from:=<linenum> , $to:=<linenum>
                       | $from:=$to:=<linenum>
                       }
        rule linenum { (\d+) }
        rule deleteline { ^^ <out_marker> (\N* \n) }
        rule appendline { ^^ <in_marker>  (\N* \n) }
        rule out_marker { \<  <sp> }
        rule in_marker  { \>  <sp> }
    }
    # and later...
    my $text is from($*ARGS);
    print @{ $0{file}{hunk} }
        if $text =~ /<Diff.file>/;

The rule definitions for file, badline, linerange, linenum, appendline, deleteline, in_marker and out_marker are exactly the same as before.

All the work of reversing the diff is performed in the hunk rule. To do that work, we have to extend each of the three main alternatives of that rule, adding to each a closure that changes the result object it returns.


Smarter alternatives

In the first alternative (which matches “append” hunks), we match as before:

    <linenum> a :: <linerange> \n
    <appendline>+

But then we execute an embedded closure:

    { @$appendline =~ s/<in_marker>/< /;
      let $0 := "${linerange}d${linenum}\n"
              _ join "", @$appendline;
    }

The first line reverses the “marker” arrows on each line of data that was previously being appended, using the smart-match operator to apply the transformation to each line. Note, too, that we reuse the in_marker rule within the substitution.

Then we bind the result object (i.e. the hypothetical variable $0) to a string representing the “reversed” append hunk. That is, we reverse the order of the line range and line number components, put a 'd' (for “delete”) between them, and then follow that with all the reversed data:

    let $0 := "${linerange}d${linenum}\n"
            _ join "", @$appendline;

The changes to the “delete” alternative are exactly symmetrical. Capture the components as before, reverse the marker arrows, reverse the $linerange and $linenum, change the 'd' to an 'a', and append the reversed data lines.

In the third alternative:

    $from:=<linerange> c :: $to:=<linerange> \n
    <deleteline>+   
    --- \n
    <appendline>+
    { @$appendline =~ s/<in_marker>/< /;
      @$deleteline =~ s/<out_marker>/> /;
      let $0 := "${to}c${from}\n"
              _ join("", @$appendline)
              _ "---\n"
              _ join("", @$deleteline);
    }

there are line ranges on both sides of the 'c'. So we need to give them distinct names, by binding them to extra hypothetical variables: $from and $to. We then reverse the order of two line ranges, but leave the 'c' as it was (because we're simply changing something back to how it was previously). The markers on both the append and delete lines are reversed, and then the order of the two sets of lines is also reversed.

Once those transformations have been performed on each hunk (i.e. as it's being matched!), the result of successfully matching any <hunk> subrule will be a string in which the matched hunk has already been reversed.

All that remains is to match the text against the grammar, and print out the (modified) hunks:

    print @{ $0{file}{hunk} }
        if $text =~ /<ReverseDiff.file>/;

And, since the file rule is now in the ReverseDiff grammar's namespace, we need to call the rule through that grammar. Note the way the syntax for doing that continues the parallel with methods and classes.


Editor's note: this document is out of date and remains here for historic interest. See Synopsis 5 for the current design information.

Rearranging the deck-chairs

It might have come as a surprise that we were allowed to bind the pattern's $0 result object directly, but there's nothing magical about it. $0 turns out to be just another hypothetical variable...the one that happens to be returned when the match is complete.

Likewise, $1, $2, $3, etc. are all hypotheticals, and can also be explicitly bound in a rule. That's very handy for ensuring that the right substring always turns up in the right numbered variable. For example, consider a Perl 6 rule to match simple Perl 5 method calls (matching all Perl 5 method calls would, of course, require a much more sophisticated rule):

    rule method_call :w {
        # Match direct syntax:   $var->meth(...)
        \$  (<ident>)  -\>  (<ident>)  \(  (<arglist>)  \)
      | # Match indirect syntax: meth $var (...)
        (<ident>)  \$  (<ident>)  [ \( (<arglist>) \) | (<arglist>) ]
    }
    my ($varname, $methodname, $arglist);
    if ($source_code =~ / $0 := <method_call> /) {
        $varname    = $1 // $5;
        $methodname = $2 // $4;
        $arglist    = $3 // $6 // $7;
    }

By binding the match's $0 to the result of the <method_call> subrule, we bind its $0[1], $0[2], $0[3], etc. to those array elements in <method_call>'s result object. And thereby bind $1, $2, $3, etc. as well. Then it's just a matter of sorting out which numeric variable ended up with which bit of the method call.

That's okay, but it would be much better if we could guarantee that the variable name was always in $1, the method name in $2, and the argument list in $3. Then we could replace the last six lines above with just:

    my ($varname, $methodname, $arglist) =
            $source_code =~ / $0 := <method_call> /;

In Perl 5 there was no way to do that, but in Perl 6 it's relatively easy. We just modify the method_call rule like so:

    rule method_call :w {
        \$  $1:=<ident>  -\>  $2:=<ident>  \( $3:=<arglist> \)
      | $2:=<ident>  \$  $1:=<ident>  [ \( $3:=<arglist> \) | $3:=<arglist> ]
    }

Or, annotated:

    rule method_call :w {
        \$                          #   Match a literal $
        $1:=<ident>                 #   Match the varname, bind it to $1
        -\>                         #   Match a literal ->
        $2:=<ident>                 #   Match the method name, bind it to $2
        \(                          #   Match an opening paren
        $3:=<arglist>               #   Match the arg list, bind it to $3
        \)                          #   Match a closing paren
      |                             # Or
        $2:=<ident>                 #   Match the method name, bind it to $2
        \$                          #   Match a literal $
        $1:=<ident>                 #   Match the varname, bind it to $1
        [                           #   Either...
          \( $3:=<arglist> \)       #     Match arg list in parens, bind it to $3
        |                           #   Or...
             $3:=<arglist>          #     Just match arg list, bind it to $3
        ]
    }

Now the rule's $1 is bound to the variable name, regardless of which alternative matches. Likewise $2 is bound to the method name in either branch of the |, and $3 is associated with the argument list, no matter which of the three possible ways it was matched.

Of course, that's still rather ugly (especially if we have to write all those comments just so others can understand how clever we were).

So an even better solution is just to use proper named rules (with their handy auto-capturing behaviour) for everything. And then slice the required information out of the result object's hash attribute:

    rule varname    { <ident> }
    rule methodname { <ident> }
    rule method_call :w {
        \$  <varname>  -\>  <methodname>  \( <arglist> \)
      | <methodname>  \$  <varname>  [ \( <arglist> \) | <arglist> ]
    }
    $source_code =~ / <method_call> /;
    my ($varname, $methodname, $arglist) =
            $0{method_call}{"varname","methodname","arglist"};

Deriving a benefit

As the above examples illustrate, using named rules in grammars provides a cleaner syntax and a reduction in the number of variables required in a parsing program. But, beyond those advantages, and the obvious benefits of moving rule construction from run-time to compile-time, there's yet another significant way to gain from placing named rules inside a grammar: we can inherit from them.

For example, the ReverseDiff grammar is almost the same as the normal Diff grammar. The only difference is in the hunk rule. So there's no reason why we shouldn't just have ReverseDiff inherit all that sameness, and simply redefine its notion of hunk-iness. That would look like this:

    grammar ReverseDiff is Diff {
        rule hunk :i { 
            [ <linenum> a :: <linerange> \n
              <appendline>+ 
              { @$appendline =~ s/<in_marker>/< /;
                let $0 := "${linerange}d${linenum}\n"
                        _ join "", @$appendline;
              }
            |
              <linerange> d :: <linenum> \n
              <deleteline>+
              { @$deleteline =~ s/<out_marker>/> /;
                let $0 := "${linenum}a${linerange}\n"
                        _ join "", @$deleteline;
              }
            |
              $from:=<linerange> c :: $to:=<linerange> \n
              <deleteline>+
              --- \n
              <appendline>+
              { @$appendline =~ s/<in_marker>/< /;
                @$deleteline =~ s/<out_marker>/> /;
                let $0 := "${to}c${from}\n"
                        _ join("", @$appendline)
                        _ "---\n"
                        _ join("", @$deleteline);
              }
            ]
          |
            <badline("Invalid diff hunk")>
        }
    }

The ReverseDiff is Diff syntax is the standard Perl 6 way of inheriting behaviour. Classes will use the same notation:

    class Hacker is Programmer {...}
    class JAPH is Hacker {...}
    # etc.

Likewise, in the above example Diff is specified as the base grammar from which the new ReverseDiff grammar is derived. As a result of that inheritance relationship, ReverseDiff immediately inherits all of the Diff grammar's rules. We then simply redefine ReverseDiff's version of the hunk rule, and the job's done.


Different diffs

Grammatical inheritance isn't only useful for tweaking the behaviour of a grammar's rules. It's also handy when two or more related grammars share some characteristics, but differ in some particulars. For example, suppose we wanted to support the “unified” diff format, as well as the “classic”.

A unified diff consists of two lines of header information, followed by a series of hunks. The header information indicates the name and modification date of the old file (prefixing the line with three minus signs), and then the name and modification date of the new file (prefixing that line with three plus signs). Each hunk consists of an offset line, followed by one or more lines representing either shared context, a line to be inserted, or a line to be deleted. Offset lines start with two “at” signs, then consist of a minus sign followed by the old line offset and line-count, then a plus sign followed by the new line offset and line-count, and then two more “at” signs. Context lines are prefixed with two spaces. Insertion lines are prefixed with a plus sign and a space. Deletion lines are prefixed with a minus sign and a space.
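For concreteness, here is a small invented example in the format just described (the file names and dates are made up):

    --- old_file.txt    Fri Aug  9 12:34:56 2002
    +++ new_file.txt    Fri Aug  9 12:35:00 2002
    @@ -1,3 +1,3 @@
      first shared context line
    - a line that appears only in the old file
    + its replacement in the new file
      last shared context line

The offset line claims three lines from the old file (starting at line 1) and three lines in the new one: two shared context lines, plus the line being replaced.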

But that's not important right now.

What is important is that we could write another complete grammar for that, like so:

    grammar Diff::Unified {
        rule file { ^  <fileinfo>  <hunk>*  $ }
        rule fileinfo {
            <out_marker><3> $oldfile:=(\S+) $olddate:=[\h* (\N+?) \h*?] \n
            <in_marker><3>  $newfile:=(\S+) $newdate:=[\h* (\N+?) \h*?] \n
        }
        rule hunk { 
            <header>
            @spec := ( <contextline>
                     | <appendline>
                     | <deleteline>
                     | <badline("Invalid line for unified diff")>
                     )*
        }
        rule header {
            \@\@ <out_marker> <linenum> , <linecount> \h+
                 <in_marker>  <linenum> , <linecount> \h+
            \@\@ \h* \n
        }
        rule badline ($errmsg) { (\N*) ::: { fail "$errmsg: $1" } }
        rule linenum   { (\d+) }
        rule linecount { (\d+) }
        rule deleteline  { ^^ <out_marker> (\N* \n) }
        rule appendline  { ^^ <in_marker>  (\N* \n) }
        rule contextline { ^^ <sp> <sp>    (\N* \n) }
        rule out_marker {  - <sp> }
        rule in_marker  { \+ <sp> }
    }

That represents (and can parse) the new diff format correctly, but it's a needless duplication of effort and code. Many of the rules of this grammar are identical to those of the original diff parser. Which suggests we could just grab them straight from the original -- by inheriting them:

    grammar Diff::Unified is Diff  {
        rule file { ^  <fileinfo>  <hunk>*  $ }
        rule fileinfo {
            <out_marker><3> $oldfile:=(\S+) $olddate:=[\h* (\N+?) \h*?] \n
            <in_marker><3>  $newfile:=(\S+) $newdate:=[\h* (\N+?) \h*?] \n
        }
        rule hunk { 
            <header>
            @spec := ( <contextline>
                     | <appendline>
                     | <deleteline>
                     | <badline("Invalid line for unified diff")>
                     )*
        }
        rule header {
            \@\@ <out_marker> <linenum> , <linecount> \h+
                 <in_marker>  <linenum> , <linecount> \h+
            \@\@ \h* \n
        }
        rule linecount { (\d+) }
        rule contextline { ^^ <sp> <sp>  (\N* \n) }
        rule out_marker {  - <sp> }
        rule in_marker  { \+ <sp> }
    }

Note that in this version we don't need to specify the rules for appendline, deleteline, linenum, etc. They're provided automagically by inheriting from the Diff grammar. So we only have to specify the parts of the new grammar that differ from the original.

In particular, this is where we finally reap the reward for factoring out the in_marker and out_marker rules. Because we did that earlier, we can now just change the rules for matching those two markers directly in the new grammar. As a result, the inherited appendline and deleteline rules (which use in_marker and out_marker as subrules) will now attempt to match the new versions of in_marker and out_marker rules instead.

And if you're thinking that looks suspiciously like polymorphism, you're absolutely right. The parallels between pattern matching and OO run very deep in Perl 6.


Let's get cooking

To sum up: Perl 6 patterns and grammars extend Perl's text matching capacities enormously. But you don't have to start using all that extra power right away. You can ignore grammars and embedded closures and assertions and the other sophisticated bits until you actually need them.

The new rule syntax also cleans up much of the “line-noise” of Perl 5 regexes. But the fundamentals don't change that much. Many Perl 5 patterns will translate very simply and naturally to Perl 6.

To demonstrate that, and to round out this exploration of Perl 6 patterns, here are a few common Perl 5 regexes -- some borrowed from the Perl Cookbook, and others from the Regexp::Common module -- all ported to equivalent Perl 6 rules:

Match a C comment:

    # Perl 5
    $str =~ m{ /\* .*? \*/ }xs;

    # Perl 6
    $str =~ m{ /\* .*? \*/ };

Remove leading qualifiers from a Perl identifier:

    # Perl 5
    $ident =~ s/^(?:\w*::)*//;

    # Perl 6
    $ident =~ s/^[\w*\:\:]*//;

Warn of text with lines greater than 80 characters:

    # Perl 5
    warn "Thar she blows!: $&"
            if $str =~ m/.{81,}/;

    # Perl 6
    warn "Thar she blows!: $0"
            if $str =~ m/\N<81,>/;

Match a Roman numeral:

    # Perl 5
    $str =~ m/ ^ m* (?:d?c{0,3}|c[dm]) (?:l?x{0,3}|x[lc]) (?:v?i{0,3}|i[vx]) $ /ix;

    # Perl 6
    $str =~ m:i/ ^ m* [d?c<0,3>|c<[dm]>] [l?x<0,3>|x<[lc]>] [v?i<0,3>|i<[vx]>] $ /;

Extract lines regardless of line terminator:

    # Perl 5
    push @lines, $1
            while $str =~ m/\G([^\012\015]*)(?:\012\015?|\015\012?)/gc;

    # Perl 6
    push @lines, $1
            while $str =~ m:c/ (\N*) \n /;

Match a quote-delimited string (Friedl-style), capturing contents:

    # Perl 5
    $str =~ m/ " ( [^\\"]* (?: \\. [^\\"]* )* ) " /x;

    # Perl 6
    $str =~ m/ " ( <-[\\"]>* [ \\. <-[\\"]>* ]* ) " /;

Match a decimal IPv4 address:

    # Perl 5
    my $quad = qr/(?: 25[0-5] | 2[0-4]\d | [0-1]??\d{1,2} )/x;
    $str =~ m/ $quad \. $quad \. $quad \. $quad /x;

    # Perl 6
    rule quad {  (\d<1,3>) :: { fail unless $1 < 256 }  }
    $str =~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;

    # Perl 6 (same great approach, now less syntax)
    rule quad {  (\d<1,3>) :: <($1 < 256)>  }
    $str =~ m/ <quad> <dot> <quad> <dot> <quad> <dot> <quad> /;

Match a floating-point number, returning components:

    # Perl 5
    ($sign, $mantissa, $exponent) =
            $str =~ m/([+-]?)([0-9]+\.?[0-9]*|\.[0-9]+)(?:e([+-]?[0-9]+))?/;

    # Perl 6
    ($sign, $mantissa, $exponent) =
            $str =~ m/(<[+-]>?)(<[0-9]>+\.?<[0-9]>*|\.<[0-9]>+)[e(<[+-]>?<[0-9]>+)]?/;

Match a floating-point number maintainably, returning components:

    # Perl 5
    my $digit    = qr/[0-9]/;
    my $sign_pat = qr/(?: [+-]? )/x;
    my $mant_pat = qr/(?: $digit+ \.? $digit* | \. $digit+ )/x;
    my $expo_pat = qr/(?: $sign_pat $digit+ )? /x;
    ($sign, $mantissa, $exponent) =
            $str =~ m/ ($sign_pat) ($mant_pat) (?: e ($expo_pat) )? /x;

    # Perl 6
    rule sign     { <[+-]>? }
    rule mantissa { <digit>+ [\. <digit>*]? | \. <digit>+ }
    rule exponent { [ <sign> <digit>+ ]? }
    ($sign, $mantissa, $exponent) =
            $str =~ m/ (<sign>) (<mantissa>) [e (<exponent>)]? /;

Match nested parentheses:

    # Perl 5
    our $parens = qr/ \(  (?: (?>[^()]+) | (??{$parens}) )*  \) /x;
    $str =~ m/$parens/;

    # Perl 6
    $str =~ m/ \(  [ <-[()]> + : | <self> ]*  \) /;

Match nested parentheses maintainably:

    # Perl 5
    our $parens = qr/
               \(                   # Match a literal '('
               (?:                  # Start a non-capturing group
                   (?>              #     Never backtrack through...
                       [^()] +      #         Match a non-paren (repeatedly)
                   )                #     End of non-backtracking region
               |                    # Or
                   (??{$parens})    #    Recursively match entire pattern
               )*                   # Close group and match repeatedly
               \)                   # Match a literal ')'
             /x;
    $str =~ m/$parens/;

    # Perl 6
    $str =~ m/ <'('>                # Match a literal '('
               [                    # Start a non-capturing group
                    <-[()]> +       #    Match a non-paren (repeatedly)
                    :               #    ...and never backtrack that match
               |                    # Or
                    <self>          #    Recursively match entire pattern
               ]*                   # Close group and match repeatedly
               <')'>                # Match a literal ')'
             /;



Web Basics with LWP

Sean M. Burke is the author of Perl & LWP

Introduction

LWP (short for "Library for WWW in Perl") is a popular group of Perl modules for accessing data on the Web. Like most Perl module-distributions, each of LWP's component modules comes with documentation that is a complete reference to its interface. However, there are so many modules in LWP that it's hard to know where to look for information on doing even the simplest things.

Introducing you to using LWP would require a whole book--a book that just happens to exist, called Perl & LWP. This article offers a sampling of recipes that let you perform common tasks with LWP.

Getting Documents with LWP::Simple

If you just want to access what's at a particular URL, the simplest way to do it is to use LWP::Simple's functions.

In a Perl program, you can call its get($url) function. It will try getting that URL's content. If it works, then it'll return the content; but if there's some error, it'll return undef.


  my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
    # Just an example: the URL for the most recent /Fresh Air/ show

  use LWP::Simple;
  my $content = get $url;
  die "Couldn't get $url" unless defined $content;

  # Then go do things with $content, like this:

  if($content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
  } else {
    print "Fresh Air is apparently jazzless today.\n";
  }

The handiest variant on get is getprint, which is useful in Perl one-liners. If it can get the page whose URL you provide, it sends it to STDOUT; otherwise it complains to STDERR.


  % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"

This is the URL of a plain-text file. It lists new files in CPAN in the past two weeks. You can easily make it part of a tidy little shell command, like this one that mails you the list of new Acme:: modules:


  % perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"  \
     | grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER

There are other useful functions in LWP::Simple, including one function for running a HEAD request on a URL (useful for checking links, or getting the last-revised time of a URL), and two functions for saving and mirroring a URL to a local file. See the LWP::Simple documentation for the full details, or Chapter 2, "Web Basics" of Perl & LWP for more examples.
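For example, here's a quick sketch of those three functions in action (the CPAN URL is real; the local file name is just a placeholder):

  use LWP::Simple;

  # head() runs a HEAD request; in list context it returns
  # (content_type, document_length, modified_time, expires, server)
  # on success, or an empty list on failure.
  my ($type, $length, $mod_time) = head('http://cpan.org/RECENT');
  print "Last revised: ", scalar localtime($mod_time), "\n" if $mod_time;

  # getstore() saves the document at a URL to a local file and
  # returns the HTTP status code of the response.
  my $status = getstore('http://cpan.org/RECENT', 'RECENT.txt');
  warn "Couldn't save the file (status $status)" unless is_success($status);

  # mirror() is like getstore(), but transfers the document only
  # if it has changed since the local copy was last written.
  mirror('http://cpan.org/RECENT', 'RECENT.txt');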

The Basics of the LWP Class Model

LWP::Simple's functions are handy for simple cases, but its functions don't support cookies or authorization; they don't support setting header lines in the HTTP request; and generally, they don't support reading header lines in the HTTP response (most notably the full HTTP error message, in case of an error). To get at all those features, you'll have to use the full LWP class model.

While LWP consists of dozens of classes, the two that you have to understand are LWP::UserAgent and HTTP::Response. LWP::UserAgent is a class for "virtual browsers," which you use for performing requests, and HTTP::Response is a class for the responses (or error messages) that you get back from those requests.

The basic idiom is $response = $browser->get($url), or fully illustrated:


  # Early in your program:
  
  use LWP 5.64; # Loads all important LWP classes, and makes
                #  sure your version is reasonably recent.

  my $browser = LWP::UserAgent->new;
  
  ...
  
  # Then later, whenever you need to make a get request:
  my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
  
  my $response = $browser->get( $url );
  die "Can't get $url -- ", $response->status_line
   unless $response->is_success;

  die "Hey, I was expecting HTML, not ", $response->content_type
   unless $response->content_type eq 'text/html';
     # or whatever content-type you're equipped to deal with

  # Otherwise, process the content somehow:
  
  if($response->content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
  } else {
    print "Fresh Air is apparently jazzless today.\n";
  }

There are two objects involved: $browser, which holds an object of the class LWP::UserAgent, and then the $response object, which is of the class HTTP::Response. You really need only one browser object per program; but every time you make a request, you get back a new HTTP::Response object, which will have some interesting attributes:
  • A status code indicating success or failure (which you can test with $response->is_success).

  • An HTTP status line, which I hope is informative if there is a failure (which you can see with $response->status_line, and which returns something like "404 Not Found").

  • A MIME content-type like "text/html", "image/gif", "application/xml", and so on, which you can see with $response->content_type.

  • The actual content of the response, in $response->content. If the response is HTML, that's where the HTML source will be; if it's a GIF, then $response->content will be the binary GIF data.

  • And dozens of other convenient and more specific methods that are documented in the docs for HTTP::Response, and its superclasses, HTTP::Message and HTTP::Headers.

Adding Other HTTP Request Headers

The most commonly used syntax for requests is $response = $browser->get($url), but in truth, you can add extra HTTP header lines to the request by adding a list of key-value pairs after the URL, like so:


  $response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );

For example, here's how to send more Netscape-like headers, in case you're dealing with a site that would otherwise reject your request:


  my @ns_headers = (
   'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
   'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, 
        image/pjpeg, image/png, */*',
   'Accept-Charset' => 'iso-8859-1,*,utf-8',
   'Accept-Language' => 'en-US',
  );

  ...
  
  $response = $browser->get($url, @ns_headers);

If you weren't reusing that array, you could just go ahead and do this:



  $response = $browser->get($url,
   'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
   'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, 
        image/pjpeg, image/png, */*',
   'Accept-Charset' => 'iso-8859-1,*,utf-8',
   'Accept-Language' => 'en-US',
  );

If you were only going to change the 'User-Agent' line, you could just change the $browser object's default line from "libwww-perl/5.65" (or the like) to whatever you like, using LWP::UserAgent's agent method:


   $browser->agent('Mozilla/4.76 [en] (Win98; U)');

Enabling Cookies

A default LWP::UserAgent object acts like a browser with its cookies support turned off. There are various ways of turning it on, by setting its cookie_jar attribute. A "cookie jar" is an object representing a little database of all the HTTP cookies that a browser can know about. It can correspond to a file on disk (the way Netscape uses its cookies.txt file), or it can be just an in-memory object that starts out empty, and whose collection of cookies will disappear once the program is finished running.

To give a browser an in-memory empty cookie jar, you set its cookie_jar attribute like so:


  $browser->cookie_jar({});

To give it a copy that will be read from a file on disk, and will be saved to it when the program is finished running, set the cookie_jar attribute like this:


  use HTTP::Cookies;
  $browser->cookie_jar( HTTP::Cookies->new(
    'file' => '/some/where/cookies.lwp',
        # where to read/write cookies
    'autosave' => 1,
        # save it to disk when done
  ));

That file will be an LWP-specific format. If you want to access the cookies in your Netscape cookies file, you can use the HTTP::Cookies::Netscape class:


  use HTTP::Cookies;
    # yes, loads HTTP::Cookies::Netscape too
  
  $browser->cookie_jar( HTTP::Cookies::Netscape->new(
    'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
        # where to read cookies
  ));

You could add an 'autosave' => 1 line as we did earlier, but at time of writing, it's uncertain whether Netscape might discard some of the cookies you could be writing back to disk.

Posting Form Data

Many HTML forms send data to their server using an HTTP POST request, which you can send with this syntax:


 $response = $browser->post( $url,
   [
     formkey1 => value1, 
     formkey2 => value2, 
     ...
   ],
 );
Or if you need to send HTTP headers:

 $response = $browser->post( $url,
   [
     formkey1 => value1, 
     formkey2 => value2, 
     ...
   ],
   headerkey1 => value1, 
   headerkey2 => value2, 
 );

For example, the following program makes a search request to AltaVista (by sending some form data via an HTTP POST request), and extracts from the HTML the report of the number of matches:


  use strict;
  use warnings;
  use LWP 5.64;
  my $browser = LWP::UserAgent->new;
  
  my $word = 'tarragon';
  
  my $url = 'http://www.altavista.com/sites/search/web';
  my $response = $browser->post( $url,
    [ 'q' => $word,  # the Altavista query string
      'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
    ]
  );
  die "$url error: ", $response->status_line
   unless $response->is_success;
  die "Weird content type at $url -- ", $response->content_type
   unless $response->content_type eq 'text/html';

  if( $response->content =~ m{AltaVista found ([0-9,]+) results} ) {
    # The substring will be like "AltaVista found 2,345 results"
    print "$word: $1\n";
  } else {
    print "Couldn't find the match-string in the response\n";
  }

Sending GET Form Data

Some HTML forms convey their form data not by sending the data in an HTTP POST request, but by making a normal GET request with the data stuck on the end of the URL. For example, if you went to imdb.com and ran a search on Blade Runner, the URL you'd see in your browser window would be:


  http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV

To run the same search with LWP, you'd use this idiom, which involves the URI class:


  use URI;
  my $url = URI->new( 'http://us.imdb.com/Tsearch' );
    # makes an object representing the URL
  
  $url->query_form(  # And here the form data pairs:
    'title'    => 'Blade Runner',
    'restrict' => 'Movies and TV',
  );
  
  my $response = $browser->get($url);

See Chapter 5, "Forms" of Perl & LWP for a longer discussion of HTML forms and of form data, as well as Chapter 6 through Chapter 9 for a longer discussion of extracting data from HTML.

Absolutizing URLs

The URI class that we just mentioned provides all sorts of methods for accessing and modifying parts of URLs (such as asking what sort of URL it is with $url->scheme, and what host it refers to with $url->host, and so on, as described in the docs for the URI class). However, the methods of most immediate interest are the query_form method seen above, and the new_abs method for taking a probably-relative URL string (like "../foo.html") and getting back an absolute URL (like "http://www.perl.com/stuff/foo.html"), as shown here:


  use URI;
  $abs = URI->new_abs($maybe_relative, $base);

For example, consider this program that matches URLs in the HTML list of new modules in CPAN:


  use strict;
  use warnings;
  use LWP 5.64;
  my $browser = LWP::UserAgent->new;
  
  my $url = 'http://www.cpan.org/RECENT.html';
  my $response = $browser->get($url);
  die "Can't get $url -- ", $response->status_line
   unless $response->is_success;
  
  my $html = $response->content;
  while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print "$1\n";  
  }

When run, it emits output that starts out something like this:


  MIRRORING.FROM
  RECENT
  RECENT.html
  authors/00whois.html
  authors/01mailrc.txt.gz
  authors/id/A/AA/AASSAD/CHECKSUMS
  ...

However, if you actually want to have those be absolute URLs, you can use the URI module's new_abs method, by changing the while loop to this:


  while( $html =~ m/<A HREF=\"(.*?)\"/g ) {    
      print URI->new_abs( $1, $response->base ) ,"\n";
  }

(The $response->base method from HTTP::Message is for returning the URL that should be used for resolving relative URLs--it's usually just the same as the URL that you requested.)

That program then emits nicely absolute URLs:


  http://www.cpan.org/MIRRORING.FROM
  http://www.cpan.org/RECENT
  http://www.cpan.org/RECENT.html
  http://www.cpan.org/authors/00whois.html
  http://www.cpan.org/authors/01mailrc.txt.gz
  http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
  ...

See Chapter 4, "URLs", of Perl & LWP for a longer discussion of URI objects.

Of course, using a regexp to match hrefs is a bit simplistic, and for more robust programs, you'll probably want to use an HTML-parsing module like HTML::LinkExtor, or HTML::TokeParser, or even maybe HTML::TreeBuilder.

Other Browser Attributes

LWP::UserAgent objects have many attributes for controlling how they work. Here are a few notable ones:

  • $browser->timeout(15): This sets this browser object to give up on requests that don't answer within 15 seconds.

  • $browser->protocols_allowed( [ 'http', 'gopher'] ): This sets this browser object to not speak any protocols other than HTTP and gopher. If it tries accessing any other kind of URL (like an "ftp:" or "mailto:" or "news:" URL), then it won't actually try connecting, but instead will immediately return an error code 500, with a message like "Access to ftp URIs has been disabled".

  • use LWP::ConnCache;
    $browser->conn_cache(LWP::ConnCache->new())
    : This tells the browser object to try using the HTTP/1.1 "Keep-Alive" feature, which speeds up requests by reusing the same socket connection for multiple requests to the same server.

  • $browser->agent( 'SomeName/1.23 (more info here maybe)' ): This changes how the browser object will identify itself in the default "User-Agent" line in its HTTP requests. By default, it'll send "libwww-perl/versionnumber", like "libwww-perl/5.65". You can change that to something more descriptive like this:

    
      $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );
    

    Or if need be, you can go in disguise, like this:

    
      $browser->agent( 
         'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );
    
  • push @{ $browser->requests_redirectable }, 'POST': This tells this browser to obey redirection responses to POST requests (like most modern interactive browsers), even though the HTTP RFC says that should not normally be done.

For more options and information, see the full documentation for LWP::UserAgent.
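Putting those together: the attributes above can all be set right after constructing the browser object. Here is a minimal sketch (the agent string, timeout value, and contact address are arbitrary examples, not anything LWP requires):

```perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;

my $browser = LWP::UserAgent->new;

$browser->timeout(15);                            # give up after 15 seconds
$browser->protocols_allowed( ['http', 'https'] ); # refuse other URL schemes
$browser->conn_cache( LWP::ConnCache->new );      # reuse sockets (Keep-Alive)
$browser->agent('SomeName/1.23 (someone@example.int)');
push @{ $browser->requests_redirectable }, 'POST';

print $browser->agent, "\n";
```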

Writing Polite Robots

If you want to make sure that your LWP-based program respects robots.txt files and doesn't make too many requests too fast, you can use the LWP::RobotUA class instead of the LWP::UserAgent class.

The LWP::RobotUA class is just like LWP::UserAgent, and you can use it like so:


  use LWP::RobotUA;
  my $browser = LWP::RobotUA->new(
    'YourSuperBot/1.34', 'you@yoursite.com');
    # Your bot's name and your email address

  my $response = $browser->get($url);

But LWP::RobotUA adds these features:

  • If the robots.txt on $url's server forbids you from accessing $url, then the $browser object (assuming it's of the class LWP::RobotUA) won't actually request it, but instead will give you back (in $response) a 403 error with a message "Forbidden by robots.txt". That is, if you have this line:

  • 
      die "$url -- ", $response->status_line, "\nAborted"
       unless $response->is_success;
    

    then the program would die with an error message like this:

    
      http://whatever.site.int/pith/x.html -- 403 Forbidden 
      by robots.txt
      Aborted at whateverprogram.pl line 1234
    
  • If this $browser object sees that the last time it talked to $url's server was too recently, then it will pause (via sleep) to avoid making too many requests too often. By default it will pause for one minute, but you can control the interval with the $browser->delay( minutes ) attribute.

  • For example, this code:

    
      $browser->delay( 7/60 );
    

    means that this browser will pause when it needs to avoid talking to any given server more than once every 7 seconds.

For more options and information, see the full documentation for LWP::RobotUA.

Using Proxies

In some cases, you will want to (or will have to) use proxies for accessing certain sites or for using certain protocols. This is most commonly the case when your LWP program is running (or could be running) on a machine that is behind a firewall.

To make a browser object use proxies that are defined in the usual environment variables (such as HTTP_PROXY), just call the env_proxy method on a user-agent object before you go making any requests on it. Specifically:


  use LWP::UserAgent;
  my $browser = LWP::UserAgent->new;
  
  # And before you go making any requests:
  $browser->env_proxy;

For more information on proxy parameters, see the LWP::UserAgent documentation, specifically the proxy, env_proxy, and no_proxy methods.
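If instead you need to point the browser at one specific proxy, the proxy and no_proxy methods (also documented in LWP::UserAgent) let you set that explicitly. A sketch, with a made-up proxy host and domain:

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

# Route HTTP and FTP requests through one (hypothetical) local proxy:
$browser->proxy( ['http', 'ftp'], 'http://proxy.example.int:8080/' );

# ...except for hosts in these domains, which we talk to directly:
$browser->no_proxy( 'localhost', '.example.int' );

print $browser->proxy('http'), "\n";  # shows what we just set
```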

HTTP Authentication

Many Web sites restrict access to documents by using "HTTP Authentication". This isn't just any form of "enter your password" restriction, but is a specific mechanism where the HTTP server sends the browser an HTTP code that says "That document is part of a protected 'realm', and you can access it only if you re-request it and add some special authorization headers to your request".

For example, the Unicode.org administrators stop email-harvesting bots from harvesting the contents of their mailing list archives by protecting them with HTTP Authentication, and then publicly stating the username and password (at http://www.unicode.org/mail-arch/)--namely username "unicode-ml" and password "unicode".

For example, consider this URL, which is part of the protected area of the Web site:


  http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

If you access that with a browser, you'll get a prompt like "Enter username and password for 'Unicode-MailList-Archives' at server 'www.unicode.org'", or in a graphical browser, something like this:

Screenshot of site with Basic Auth required

In LWP, if you just request that URL, like this:


  use LWP 5.64;
  my $browser = LWP::UserAgent->new;

  my $url =
   'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
  my $response = $browser->get($url);

  die "Error: ", $response->header('WWW-Authenticate') || 
    'Error accessing',
    #  ('WWW-Authenticate' is the realm-name)
    "\n ", $response->status_line, "\n at $url\n Aborting"
   unless $response->is_success;

Then you'll get this error:


  Error: Basic realm="Unicode-MailList-Archives"
   401 Authorization Required
   at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
   Aborting at auth1.pl line 9.  [or wherever]

because the $browser doesn't know the username and password for that realm ("Unicode-MailList-Archives") at that host ("www.unicode.org"). The simplest way to let the browser know about this is to use the credentials method to let it know about a username and password that it can try using for that realm at that host. The syntax is:


  $browser->credentials(
    'servername:portnumber',
    'realm-name',
    'username' => 'password'
  );

In most cases, the port number is 80, the default TCP/IP port for HTTP; and you usually call the credentials method before you make any requests. For example:


  $browser->credentials(
    'reports.mybazouki.com:80',
    'web_server_usage_reports',
    'plinky' => 'banjo123'
  );

So if we add the following to the program above, right after the $browser = LWP::UserAgent->new; line:


  $browser->credentials(  # add this to our $browser 's "key ring"
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode'
  );

and then when we run it, the request succeeds, instead of causing the die to be called.

Accessing HTTPS URLs

When you access an HTTPS URL, it'll work for you just like an HTTP URL would--if your LWP installation has HTTPS support (via an appropriate Secure Sockets Layer library). For example:


  use LWP 5.64;
  my $url = 'https://www.paypal.com/';   # Yes, HTTPS!
  my $browser = LWP::UserAgent->new;
  my $response = $browser->get($url);
  die "Error at $url\n ", $response->status_line, "\n Aborting"
   unless $response->is_success;
  print "Whee, it worked!  I got that ",
   $response->content_type, " document!\n";

If your LWP installation doesn't have HTTPS support set up, then the response will be unsuccessful, and you'll get this error message:


  Error at https://www.paypal.com/
   501 Protocol scheme 'https' is not supported
   Aborting at paypal.pl line 7.   [or whatever program and line]

If your LWP installation does have HTTPS support installed, then the response should be successful, and you should be able to consult $response just like with any normal HTTP response.

For information about installing HTTPS support for your LWP installation, see the helpful README.SSL file that comes in the libwww-perl distribution.

Getting Large Documents

When you're requesting a large (or at least potentially large) document, a problem with the normal way of using the request methods (like $response = $browser->get($url)) is that the response object has to hold the whole document in memory. If the response is a 30-megabyte file, this is likely to be quite an imposition on the process's memory usage.

A notable alternative is to have LWP save the content to a file on disk, instead of saving it up in memory. This is the syntax to use:


  $response = $ua->get($url,
                         ':content_file' => $filespec,
                      );

For example,


  $response = $ua->get('http://search.cpan.org/',
                         ':content_file' => '/tmp/sco.html'
                      );

When you use this :content_file option, the $response will have all the normal header lines, but $response->content will be empty.
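Similarly, if you want to process a large document as it arrives rather than store it anywhere at all, recent LWP versions also accept a ':content_cb' option, whose value is a callback that gets handed each chunk of content as it is received. A sketch that merely counts bytes (the URL is just an example):

```perl
use strict;
use warnings;
use LWP 5.66;

my $browser = LWP::UserAgent->new;

my $bytes = 0;
my $response = $browser->get( 'http://www.cpan.org/RECENT.html',
  ':content_cb' => sub {
    my ($chunk, $response, $protocol) = @_;
    $bytes += length $chunk;   # or write the chunk to a file, parse it, etc.
  },
);
print "Received $bytes bytes\n" if $response->is_success;
```

As with :content_file, the $response you get back has the normal header lines, but $response->content is empty.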

Note that this ":content_file" option isn't supported under older versions of LWP, so you should consider adding use LWP 5.66; to check the LWP version, if you think your program might run on systems with older versions.

If you need to be compatible with older LWP versions, then use this syntax, which does the same thing:


  use HTTP::Request::Common;
  $response = $ua->request( GET($url), $filespec );

Resources

Remember, this article is just the most rudimentary introduction to LWP--to learn more about LWP and LWP-related tasks, you really must read from the following:

  • LWP::Simple: Simple functions for getting, heading, and mirroring URLs.

  • LWP: Overview of the libwww-perl modules.

  • LWP::UserAgent: The class for objects that represent "virtual browsers."

  • HTTP::Response: The class for objects that represent the response to an LWP request, as in $response = $browser->get(...).

  • HTTP::Message and HTTP::Headers: Classes that provide more methods to HTTP::Response.

  • URI: Class for objects that represent absolute or relative URLs.

  • URI::Escape: Functions for URL-escaping and URL-unescaping strings (like turning "this & that" to and from "this%20%26%20that").

  • HTML::Entities: Functions for HTML-escaping and HTML-unescaping strings (like turning "C. & E. Brontë" to and from "C. &amp; E. Bront&euml;").

  • HTML::TokeParser and HTML::TreeBuilder: Classes for parsing HTML.

  • HTML::LinkExtor: Class for finding links in HTML documents.

  • And last but not least, my book Perl & LWP.


Copyright ©2002, Sean M. Burke. You can redistribute this document and/or modify it, but only under the same terms as Perl itself.


This week on Perl 6 (week ending 2002-08-18)

The story so far... Larry, Tom, Randal, Damian, Jon, Chip, Gnat, Ziggy, Dick and the rest of the gang were chatting with all the other cool Perl kids about where Perl was and should be going. Then Jon threw a coffee cup and the rest is history...

So, as is now traditional, we'll kick off with the goings on in perl6-internals. Confused? You will be after this week's summary.

Scratchpad.pmc

Jonathan Sillito harassed Dan about subroutines, continuations and other things to do with function calls. He wondered if his scratchpad.pmc patch fitted what Dan had in mind, and added a few supplementary questions. Dan answered the supplementary questions and apologised for not having done the docs on this bit of the design, saying he'd try and get it out by the end of the week. Melvin Smith (whose code was involved in some of the questions) also gave some answers.

http://groups.google.com/groups -- Questions

http://groups.google.com/groups -- Answers.

http://groups.google.com/groups -- More answers

Perl 6 regexes...

Last week Dan threw down the 'full perl 6 regex engine' gauntlet, and this week Sean ``Hero'' O'Rourke admitted that he was working on it and that he already had a good chunk of it working. Apparently hypothetical variables and the various cut operators are looking a little tricky. Dan was impressed. Simon Cozens told us that, when he said he'd written a Perl 6 regex parser before breakfast at YAPC, he hadn't been joking, and that he'd put it up in his CVS if people want to poke at it. But the crowd was strangely silent.

Steve Fink also had some questions to ask about the regex engine.

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

[COMMIT] GC_DEBUG, Some GC Fixes, and Remaining GC Bugs

Mike Lambert has re-added the GC_DEBUG define. The idea behind this is to allow `various limits and settings and logic to be setup such that GC bugs occur relatively soon after the offending code.' and generally to let you write relatively simple code that will trip the kind of bugs that are normally only seen when running complex code. Mike also listed the areas where he knows there are still problems.

Various people have had problems using GC_DEBUG, including at least one instance where <end> threw a segfault, but it also drew a flurry of patches from Jason Gloudon and Steve Fink. Much of the discussion in this thread was rather more technical than I have the skills to summarize, so I'll just point you at the root node.

http://groups.google.com/groups

[PATCH] quotematch speedup

Joseph Ryan did some regex optimization in assemble.pl to speed up the matching of strings. For some reason, this kicked off the longest thread of the week, about the merits of optimizing a pure perl assembler when we should really do it in C if we really wanted speed. Or maybe we should implement the assembler in Parrot proper. And wouldn't it be cool if we could write self modifying code in Parrot (needed for eval, Dan's working on a design). Juergen Boemmels spotted that the patch didn't quite work as promised and provided a better version, but I don't think it's been applied.

http://rt.perl.org/rt2/Ticket/Display.html

Keyed access to PerlArray/PerlHash

Tom Hughes wondered about the semantics of indexing PerlArrays with strings and PerlHashes with integers. He and Dan discussed it back and forth for a while, and then Tom posted a patch, which Dan liked the look of, but held off on applying while everyone else had a good look. Mike Lambert had a look and raised some issues, which Tom addressed and issued another, enormous patch.

http://groups.google.com/groups

http://rt.perl.org/rt2/Ticket/Display.html

http://groups.google.com/groups

[PASM] problem opening / reading file

Jerome Quelin has been having some problems with file I/O under parrot 0.0.7. The consensus seems to be that the current Parrot I/O system is, ahem, not as good as it could be, mostly because people have had other priorities. Dan asked for a volunteer to help `get I/O off of Melvin's altogether too-full plate before his wife hunts [Dan] down and does nasty things'. Clinton Pierce seems to have stepped up to handle part of what Dan wants. Well done that man.

http://groups.google.com/groups

http://groups.google.com/groups

set Boolean to 2

Leopold Toetsch found something confusing in the perl 6 compiler which ends up generating some unexpected byte code. Peter Gibbs reckoned that the generated code was right, but Leopold still sounds unconvinced.

http://groups.google.com/groups

[INFO] The first pirate parrot takes to the air

Peter Gibbs chucked a cat in amongst the parrots when he posted some performance numbers for his own private `African Grey' version of Parrot which utilizes some GC speed up techniques that Dan had rejected. Peter's numbers are impressive, but the techniques used break in some places. Peter helpfully posted pointers to his original patches. Dan wondered which bits contributed what to the improved performance, and explained why he had problems with the approaches that Peter was using.

This thread also spawned the thread called `Stack Walk Speedups?' which was kicked off by Mike Lambert outlining the current workings of the GC's stack walking code and wondered if anyone could come up with a faster way of doing it (that worked in the face of the constraints on Parrot's GC.) Peter Gibbs offered one approach, and Jason Gloudon pointed out that his 'stack direction' config patch would help in this area too. Jason also offered a patch giving a 12% speedup on his machine, which was applied. Mike Lambert is also doing some cunning stuff to improve things using Copy On Write (COW) tricks.

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

http://rt.perl.org/rt2/Ticket/Display.html

[DRAFT PDD] External Data Interfaces

Brent Dax donned his Technical Author hat, and offered a draft PDD (Parrot Design Document) covering Parrot's external data interfaces. Nicholas Clark offered a pile of constructive criticism and discussion is ongoing.

http://groups.google.com/groups


Meanwhile, over in perl6-language

The strangely named `Autovivi' thread rolled on. Miko O'Sullivan asked Larry for some clarification about the default behaviour of pass-by-value; Larry wasn't sure, because he hasn't worked out the syntax yet. Luke Palmer suggested that copy-on-write tricks could be used to let us have our cake and eat it too. Deven Corzine suggested defaulting to pass-by-value, but Larry suspects that might make Perl 6 sub calls even slower than Perl 5's. David Whipp wondered about threading; nobody replied.

Deven managed to drag the thread back to its nominal subject of Autovivification when he wondered if <func($x{1}{2}{3})> would cause an implementation headache. Larry explained that it was cases like this that made him choose the 'pass by immutable reference', attempting to give the speed of pass by reference combined with the guarantees of pass by value. Deven pressed his point and wondered what would happen in the is rw case. Nicholas Clark explained how Perl 5 avoids autovivification in this case (but it's only one level deep). Uri Guttman pointed out that this is a 'Hard Problem', but Leopold Toetsch thought that the Parrot KEY operators may make it relatively easy to solve.

http://groups.google.com/groups

http://groups.google.com/groups

A Perl 6 class question

Last week, Chris Dutton had asked about using Perl 6 to create anonymous classes, and proposed a syntax. This week, Allison Randal pointed him at the section of Exegesis 4 which described how to do exactly that. Rather satisfyingly, the syntax that Chris had invented was exactly the syntax used in the Exegesis, so the choice appears to be natural. Trey Harris wondered about some of the other clumsinesses in Perl 5 when defining classes, and asked some questions about class methods. Damian confirmed that most of the clumsiness was going away, and clarified the behaviour of subs in a class definition.

http://groups.google.com/groups

Just reading up on Pike...

Chris Dutton has been reading up on Pike, and had some observations (whilst neglecting to explain what Pike is). Apparently Pike allows you to define combination types: <private string|int bar> would mean that <bar> could be either a string or an int, for instance. Damian wheeled out his superpositions hobby horse and suggested <my any(str,int) $bar> as a way of doing this. Luke Palmer trumped that with <my all(str,int) $foo> and wondered what such a construct would actually mean. Damian suggested that it meant Luke needed some serious therapy. Nicholas Clark got all boring and practical and wondered how one would implement the any case. Damian reckons it'd just fall out of the implementation of superpositions.

Somewhere in the thread, Andy Wardley asked a serious question about what builtin types Perl 6 would have, and what the results of ref $foo would look like.

http://groups.google.com/groups

http://groups.google.com/groups

Balanced Matches in Regexps?

Peter Behroozi asked about using Apocalypse 5 rules to capture balanced matches, and wondered if there shouldn't be a <balanced> directive. Brent Dax pointed Peter at the cunning recursion capabilities that come with Apocalypse 5:

  rule parenthesized { \( ( <-[()]>+ | <parenthesized> )* \) }

Larry commented that he was considering a builtin <self> rule that'd allow one to refer to the current rule without having to name it. Peter then offered his first attempt at a recursive rule to capture nested HTML tables.

http://groups.google.com/groups
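Incidentally, that parenthesized rule has a fairly direct (if less readable) Perl 5 analogue using the experimental (??{ ... }) construct documented in perlre; a sketch matching balanced parentheses:

```perl
use strict;
use warnings;

# A self-referential pattern: a parenthesized group contains runs of
# non-parenthesis characters, or further parenthesized groups.
our $parenthesized;
$parenthesized = qr/ \( (?: [^()]+ | (??{ $parenthesized }) )* \) /x;

print "(a(b)c)" =~ /\A$parenthesized\z/ ? "match\n" : "no match\n";
print "(a(b c)" =~ /\A$parenthesized\z/ ? "match\n" : "no match\n";
```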


In brief

Nicholas Clark patched assemble.pl so that it could complete its run before the heat death of the universe when run under perl 5.005_03.

Jeff Goff is back and doing cool things with Ruby in parrot.

Jerome Quelin asked what was wanted in a patch. Consensus seems to be either a unified or a context diff, as a mime attachment with a name for the patch and a short description of what it does. Tests are always good (essential) if you're adding new functionality. So are docs.

Jerome also offered a patch which implemented a rand operator but wondered about portability and when to call srand. Jeff Goff wondered if we shouldn't implement our own 'rand' using, say, the Mersenne Twister (http://www-personal.engin.umich.edu/~wagnerr/MersenneTwister.html) and Dan pointed out where srand should be called.

Josef Höök reworked his multidimensional arrays patch one more time and it got applied. Well done Josef, perseverance pays off in the end.

Jason Gloudon added a config patch which tests the direction of stack growth. Applied.

Leopold Toetsch wondered why assembler.pl uses its own PMC.pm instead of the generated lib/Parrot/PMC.pm. Warnock's Dilemma applies.

Pete Sergeant has written a `quick and dirty' PASM tokenizer and a syntax highlighter based on it.

Jarkko continues his sterling work of squashing Tru64 compiler warnings.

Steve Fink has had a crack at adding a --debugging flag to Configure.pl. Warnock applies.

Jerome Quelin won my 'making it easy to keep a running joke going' prize when he supplied a Befunge-93 interpreter in parrot, and commented: ``Note to Leon Brocard: There's a lot more to do in order to provide a Befunge-98 compliant interpreter, so don't worry! There's still a lot of fun waiting... :o)''. Which was very good of him. Thanks Jerome.

The rule/pattern/regex thing kept going. Damian explained that he'd try and use `pattern' for the contents and `rule' for the container. But we weren't to hold him to that.

Brent Dax has uploaded a new version of Perl6::Parameters to CPAN. It's his attempt at allowing one to declare Perl 5 subroutines in a Perl 6-ish style.


Who's who in Perl 6

Who are you?
Daniel Alejandro Grunblatt, a 21-year-old university student from Argentina.

What do you do for/with Perl 6?
I'm mostly doing the JIT for Parrot, but occasionally move on to other fields (like the Parrot Compiler in Parrot or the Parrot Debugger and others).

Where are you coming from?
My mother.

When do you think Perl 6 will be released?
Some day.

Why are you doing this?
I started working on the JIT to learn assembly, and I'm still learning it (yes, I'm slow), I want to see Parrot running fast, and because it's fun.

You have 5 words. Describe yourself.
I like playing basketball, what I haven't done for while now, and used to like playing role games.

Do you have anything to declare?
No.


Acknowledgements and the funding drive.

This summary was prepared with the aid of more GNER tea (on the down train today). Sadly the GNER ham and cheese toasted sandwich has declined; they've replaced the cheddar in the old version with Wensleydale, which really doesn't melt as well.

Because I'm a bit late this week I've not had time to get the estimable Pete Sergeant to proofread it for me, so if the spelling is worse and the language less coherent this week, just be grateful for Pete's sterling work on my earlier summaries.

Again, if your name appears in this, or any previous summary, and you've still not sent your answers, please consider answering the questions in the ``Perl 6 Who's who?'' section and sending your answers to 5Ws@bofh.org.uk.

Well, so far nobody dislikes my summaries enough to post an alternative. Which is nice. If you think my time writing this is worth anything don't give me money, give it to the Perl Foundation http://donate.perl-foundation.org and help support the ongoing development of Perl. Remember, suitably large donations will earn you a plug in a future summary, but only if you tell me about it.

Acme::Comment


/*
    This is a Perl comment
*/

Commenting in Perl

"But what?", I hear you think, "Perl doesn't have multi-line comments!"

That's true. Perl doesn't have multiline comments. But why is that? What is wrong with them? Mostly, it is because Larry doesn't like them: he reasons that we can get the same effect with POD, and that Perl doesn't need Yet Another Comment Marker.

To illustrate, here we are at YAPC::America::North 2002, held in beautiful St. Louis. The weather is warm, the sun is shining, the sights are pretty and the beer is cold. In short, it's all things we've come to love and expect of a Perl conference. It's Thursday, the conference is winding down, and Siv is having a barbeque at his house. So a few of us end up in a car, headed to the barbeque. Uri Guttman is driving -- you know, the Stem guy -- with myself riding shotgun, and Ann and Larry in the backseat.

There's friendly chatter, and from one topic it goes on to another. At one point, Perl 6 is being discussed. Ann is asking questions about the new operators, techniques and generally how shiny Perl 6 will be. And there, Larry explains to us new and wondrous things. Some are already mentioned in the apocalypses, some are still ideas waiting to become firm concepts. And granted, it does sound good, very good... even if they are taking away my beloved arrow operator. Then a question comes to mind, and I ask: "So Larry, tell me, does Perl 6 have multiline comments?"

All I hear from the backseat is some grumbling and the 2 words: "use POD!";

Needless to say, the tone was set, and I didn't see nor speak to Larry all evening.

Multi-line comments emulation in Perl

But I disagree. I think multiline comments are good.

I hate tinkering with the # sign and the 80 character-per-line limit; I write a comment over a few lines and prefix each with a #. Which of course means inserting a newline.

Then I need to add a few words in the comments. The line becomes longer than 80 characters. I need to add another newline. And add a new # sign. Remove the former # sign. And now nothing is aligned anymore and I need to redo it. *sigh*

And apparently I'm not the only one who has had a gripe with this. It's been a consistent request for change throughout the development of Perl 5, and here's a post on Perlmonks discussing exactly this.

The idea is to find a way of doing multi-line comments in Perl, without breaking things. These are the four solutions they came up with to do multi-line comments, and why I think they are bad:

1. Use POD.

For example, you could use:


=for comment
  Commented text
=cut

It's POD. POD is documentation for the users of your program; it is not meant to display things like 'here I change $var' or 'this part will only be executed if $foo is true, which is determined by some_method()'. Moreover, since the code you are commenting on is not in POD format, the comments would be displayed in the rendered POD, but the piece of code they refer to would not be. Granted, most POD parsers are smart enough (or dumb enough, depending on how you look at it) to see that the =for is not something valid and ignore it, whilst the Perl interpreter will say 'hey, it starts with =, so it must be POD'. In the end, if you have the 'proper' POD parser, you will get sort of what you want.

But you are really circumventing the problem here, since you are relying on the way any given POD parser parses POD; some might 'use warnings' and report an invalid =for tag on some lines. Others will just display them.

And what we wanted was a way that would allow multi-line comments without possibly breaking things.

2. Any decent editor should allow you to put a # in front of n lines easily...

That's not an answer. We wanted multi-line comments, not an editor trick that allows me to do multiple single-line comments. That and I'd rather not get into the 'vi is better than emacs' flamewar ;)

3. use HERE-docs


<<'#';
this is a
multiline
comment
#

<< '*/';
this is a
multiline
comment
*/

This works, provided you remembered to put the end marker on a line of its own, followed by a newline. Well, more accurately: it parses correctly. It's a here-doc, which means that variables WILL get interpolated if you use a double-quoted version.

Meaning if you do something like this:


use strict;
<<"#";
this is a
multiline
comment
about $foo
#

It will blow up right in your face with a compile time error. Also, when running under 'use warnings' --which you should-- this will generate a 'Useless use of a constant in void context at foo.pl line X' warning.

So not completely foolproof. Plus it looks ugly ;)

4. use quote operators


q{
    some
    comment
};

OK, a much more solid solution, and much more elegant. Does it work? Yes.

Well, almost. It runs under strict, but I didn't expect anything else, since the guy who posted this (Juerd) is an experienced Perl hacker and probably knows what he's doing. But it does emit a warning, just like the previous here-doc solution:


'Useless use of a constant in void context at foo.pl line X'

Now, your comment is supposed to be there to help other coders, not to be generating warnings. It's a NO OP! It shouldn't make them think things are going wrong!
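There is a partial workaround the thread didn't mention: that warning belongs to the 'void' category, so you can lexically disable just that category around the comment block. A sketch of the idea (not an endorsement) follows; the comment text is invented for illustration:

```perl
use strict;
use warnings;

{
    # 'void' is the warnings category that covers 'Useless use of a
    # constant in void context', so only that warning is silenced,
    # and only inside this block.
    no warnings 'void';
    q{
        This string is our multi-line comment: the constant is still
        discarded during compilation, but now it does so silently.
    };
}

print "done\n";
```

Of course, this only hides the symptom: the do-nothing constant is still sitting in your code, which is exactly the objection Acme::Comment sets out to address.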

True multi-line comments: Acme::Comment

So is there an answer to this?

Well, when Ann pointed this out, I began to think there must be some way to do multiline comments. I mean, many languages support it, why not Perl? We claim to make everything as easy as possible, yet the easy things aren't possible? That struck me as odd.

This is where the writing of Acme::Comment began. First to provide a more usable solution to multi-line commenting than the four mentioned above, and secondly to just prove Perl doesn't have to suffer from lack of multi-line comments.

And this is how you use it:


use Acme::Comment type => 'C++';

/*
    This is a comment ...

    ... C++ style!
*/

It's as simple as that. Now, to do just one language seemed a waste of this idea. Many languages have nice multi-line or even single-line comments. So, we decided to support a few more languages - in fact, 44 in total right now.

Below are five styles of doing multi- or single-line comments in languages that Acme::Comment supports. So let's play a game of 'Name That Language'! (Answers at the bottom.)

  • This language uses (* and *) as delimiters for its multiline comments
  • ! is used to denote a single line comment in this programming language
  • Simply the word 'comment' indicates a one line comment in this language
  • A single line comment is indicated by preceding it with: DO NOTE THAT
  • \/\/ is the way to do a one line comment in this language

Contestants who answered all questions correctly won a free subscription to the Perl beginners mailing list at http://learn.perl.org, where they can share their knowledge with others!

Also, did you know there are programming languages called Hugo, Joy, Elastic, Clean and Parrot?

You can of course also create your own commenting style if you'd so desire, by saying:


use Acme::Comment start => '[[', end => ']]';

[[
    This is a comment ...
    ... made by me!
]]

Putting the comment markers on their own line is always safest, since that reduces possible ambiguity, but this is left up to the user. By default, you must put the comment markers on their own line (only whitespace may share the line), and a comment may not end on the same line it begins.

But if you were so inclined, you could also do:


use Acme::Comment type => 'C++', one_line => 1, own_line => 0;

/* my comment */

/*  my
    other
    comment
*/

The technology behind this

So how does this all work anyway?

Basically, Acme::Comment is a source filter. This means that BEFORE the Perl interpreter gets to look at the source code, Acme::Comment is given the chance to modify it.

That means you can change any part of a source file into anything else. In Acme::Comment's case, it removes the comments from the source, so they'll never be there when trying to compile.

This is not something to be scared of, since comments are discarded at compile time anyway (the interpreter has no need for your comments, so why keep them?).

Now, source filtering is not terribly complicated, and it is one of the immensely powerful features of recent versions of Perl. It allows you to extend the language, simplify it, or even completely recast it.

As an example, here are two other (famous) uses of source filters in Perl:

Lingua::Romana::Perligata
Which allows you to program in latin
Switch
An extension to the Perl language, allowing you to use switch statements

Now, Acme::Comment has a spiffy import routine that determines what it needs to do with the options you passed it, and one big subroutine that parses out comments (the largest part of the code is spent on determining nested comments).

Acme::Comment uses, indirectly, the original source filter module called Filter::Util::Call. This module provides a Perl interface to source filtering. It is very powerful, but not as simple as it could be. It works roughly like this:

  1. Download, build, and install the Filter::Util::Call module. (It comes standard with Perl 5.8.0)
  2. Then, set up a module that does a use Filter::Util::Call.
  3. Within that module, create an import subroutine.
  4. Within the import subroutine do a call to filter_add, passing it a subroutine reference.
  5. Within the subroutine reference, call filter_read or filter_read_exact to "prime" $_ with source code data from the source file that will use your module.
  6. Check the status value returned to see if any source code was actually read in.
  7. Then, process the contents of $_ to change the source code in the desired manner.
  8. Return the status value.
  9. If the act of unimporting your module (via a no) should cause source code filtering to cease, create an unimport subroutine, and have it call filter_del.
  10. Make sure that the call to filter_read or filter_read_exact in step 5 will not accidentally read past the no. Effectively this limits source code filters to line-by-line operation, unless the import subroutine does some fancy pre-pre-parsing of the source code it's filtering.

As you can see, that's quite a few steps and things to think of when writing your source filter module. Of course, to make everyone's life easier when source filtering, Damian wrote a wrapper around the Filter::Util::Call module, called Filter::Simple. And although that limits the power you have somewhat, the interface is much nicer. Here's what you need to do:

  1. Download and install the Filter::Simple module. (It comes standard with Perl 5.8.0)
  2. Set up a module that does a use Filter::Simple and then calls FILTER { ... }.
  3. Within the anonymous subroutine or block that is passed to FILTER, process the contents of $_ to change the source code in the desired manner.

And that's it.

There is just one caveat to be mentioned: Due to the nature of source filters, they will not work if you eval the file the code is in.

How to make your own source filters

Finally, I'll discuss some examples on how to set up your own source filters.


package My::Filter;
use Filter::Simple;

### remove all that pesky 'use strict' and 'use warnings' ###
### ($_ holds the whole source, so we need the /m flag for ###
### ^ and $ to match on each line)                         ###
FILTER {
    s|^\s*use strict.*$||mg;
    s|^\s*use warnings.*$||mg;
};

1;

Now, if a module uses your My::Filter module, all mentions of 'use strict' and 'use warnings' will be removed, allowing for much easier compiling and running!

Of course, Filter::Simple can do many more things. It can discriminate between different kinds of things it might find in source code. For example,

It can filter based on whether a part of text is:
  • code (sections of source that are not quotelike, POD or __DATA__)
  • executable (sections of source that are not POD or __DATA__)
  • quotelike (sections that are Perl quotelikes as interpreted by Text::Balanced)
  • string (string literal parts of a Perl quotelike, like either half of tr///)
  • regex (sections of source that are regexes, like qr// and m//)
  • all (the default, behaves the same as the FILTER block)

Also, you can apply the same filter multiple times, and it will be checked in order. For example, here's a simple macro-preprocessor that is only applied within regexes, with a final debugging pass that prints the resulting source code:


    use Regexp::Common;
    FILTER_ONLY
        regex => sub { s/!\[/[^/g },
        regex => sub { s/%d/$RE{num}{int}/g },
        regex => sub { s/%f/$RE{num}{real}/g },
        all   => sub { print if $::DEBUG };

It understands the 'no My::Filter' directive and does not filter that part of the source.

So you can say:


    use My::Filter;

        { .. this code is filtered .. }

    no My::Filter;

        { .. this code is not .. }

If you want to learn more about source filtering, take a look at the Filter::Simple manpage.

Answers to the comment game

  1. Bliss
  2. Fortran
  3. Focal
  4. Intercal
  5. Pilot

This week on Perl 6 (week ending 2002-08-11)

Far out in the uncharted backwaters of the unfashionable end of the Western Spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-eight million miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think inventing programming languages is a pretty neat idea.

So, as one amazingly primitive life form to another, here's what's been going on in Perl 6 development this week. As is becoming tediously traditional, we'll start with the internals list...

Array vs. PerlArray

Way back in the mists of a fortnight ago (that's 'two weeks' for the unenlightened Americans among us), Melvin Smith announced that he was thinking about copying some of the fixes that had been made to the PerlArray PMC over to array.pmc as well. This week, Steve Fink sent in a patch taking an alternative approach, which makes PerlArray a subclass of Array. Being appropriately lazy, Steve also added some rather useful-looking keywords to the PMC parser, which added some shiny OO features. He then delivered an extended, three-post monologue in which he kept replying to himself and refining his explanation of what the patch did. When he finally stopped talking to himself, Dan told him to just commit it.

A side point in Steve's discussion with himself (and Sean O'Rourke off-list, apparently) was the idea of dramatically simplifying default.pmc so it just throws an exception for operations it doesn't know how to deal with, rather than trying manfully to satisfy the demands of an unreasonable programmer. Dan reckons he should go ahead and rip out the old, obsequious code.

http://groups.google.com/groups -- Start here

Unifying PMCs and Buffers for GC

Mike Lambert had stepped up to this task and gave us a brain dump of his plans. This week Jerome Vouillon wondered about the long term plans for the Garbage collector, and wondered if we knew how to implement a generational garbage collector when one has several pools. Dan's long term plan for the outside world is that 'GC is a black box that just works.' Internally, he doesn't care, so long as it satisfies the 'just works' criterion. Mike Lambert thinks that a generational collector shouldn't be too hard.

Josef 'how do I type "ö" again?' Höök wondered how/if this would affect his matrix implementation, and when/whether his code would be merged with the current tree. Dan answered that he was a little concerned about PMCs which add code to the core, and Josef explained that he wanted to be able to reuse some of the code in a future multiarray.pmc.

http://groups.google.com/groups

http://groups.google.com/groups

Register allocation for the JIT

Nicholas Clark and Daniel Grunblatt held a learned discussion about this, with particular reference to the ARM and i386 architectures. There were diagrams. And discussions of hardware documentation. It was both scary and surprisingly easy to understand. And a consensus was reached, which is nice.

http://groups.google.com/groups

Stack mark ops & such

Dan announced that he was 'about half a step from putting pushmark, popmark, stack marks, and suchlike things into the core.' and that this was everyone's opportunity to tell him what a bad idea it was. Jerome Vouillon wondered how they'd interact with continuations and coroutines. Answer: 'Interestingly'. Jerome offered a code sample which may be surprising, and asked Dan a hard question. Which Dan hasn't answered yet.

Melvin Smith wondered if there was a real issue with continuations, since each continuation got its own copy of the call stack. There is an issue, but it'll likely be solved by documenting its existence.

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

Exceptions

Dan posted his road map to exception handling. Or rather, he described the two ops (push and throw) needed to do exception handling, but punted on what an exception object should look like.

Florian Weimer wondered if it wouldn't be possible to handle exception objects in the throwing context. Dan reckoned that this could be problematic, which is why we're not going to do it 'at the moment'.

Jerome wondered if we really needed the pushx opcode at all. Tanton Gibbs disagreed and there was some back and forth about that, and about whether it made sense to think of a return as an exception (it does, but only rarely, when you're writing your own control structures; Damian will explain more later).

Dan also thought that pushx would be necessary, and Jerome clarified his suggestion. Simon Glover wondered what happened when a program didn't define an exception handler and was reassured that there would be some kind of default (probably language specific) handler in place, and if an exception got past that then Parrot would handle it itself with a suitably 'big fit to STDERR' before dying in a fit of pique.

http://groups.google.com/groups

http://groups.google.com/groups -- Discussion of whether pushx is necessary starts here.

Regex speedup

Angel reposted his patch to the regex subsystem, this time with appropriate diff flags. Brent Dax needed more clarification, which Angel gave; Dan accepted that, given docs and tests, the patch could go in. Angel made another patch with docs and tests, although Mike Lambert then asked for more clarification, and pointed out some issues with the inline keyword not being standard across all C compilers. He also wondered if Brent's old 'Tutorial' docs could be updated and reinstated. Angel responded with more clarification, and a promise that the tutorial would go in when he'd rewritten it, and commented that 'The previous Rx version had excellent documentation, so it will take a bit of time to get there.' I don't think this patch has been applied yet.

http://groups.google.com/groups -- Start here.

Lexicals and globals assembly question

Brian Wheeler has been looking at store_lex, find_lex, store_global and find_global and pointed out that the store_* ops seemed to be the wrong way round, since the parrot rule is that the destination is always the first operand. The response to this can be summed up in one word: "Oops." Brian has patched it, and added tests. Leopold Toetsch fixed things up in Builtins.pm.

http://groups.google.com/groups

Questions about pdd03_calling_conventions.pod

Jonathan Sillito asked a few questions about calling conventions, and noticed that the callcc op appeared to be going the way of all flesh. Dan answered the questions and remarked that callcc was to be reinstated and shouldn't have disappeared in the first place. Sean O'Rourke wondered why, since it could be replaced with calls to invoke and other cleverness. Dan agreed. So callcc will stay dead. (The only reason it was on the perch in the first place was because it'd been nailed there.)

http://groups.google.com/groups -- Start here

Hash optimisations

Having had a patch to remove a function call from the hash algorithm accepted, Jason Greene offered a more comprehensive set of hash optimisations. Dan applied the patch. Nicholas Clark and Steve Fink had a few questions. Nicholas wondered why a constant had been changed, and Steve worried about memory use. Dan reckoned that worrying about memory to the tune of an unsigned int + alignment bytes at the expense of speed was the wrong trade-off to make. However, the other subthread had led to a different design which saved the space and only gave a small performance hit in the less common case. Dan agreed.

http://rt.perl.org/rt2/Ticket/Display.html

Various PMC issues

Stephen Rawls is still trying to write tuple.pmc. Which means he's been nosing around the sources of all the other PMCs, looking at how they do things, so, he had a bunch of questions, mostly to do with PerlNums and PerlInts. The answer to many of the questions appears to be 'the problem will go away when we get multimethod dispatch'. Dan is pouting because he had 'wanted to go with a left-side-wins scheme, but alas correctness has trumped speed...'. Dan has promised details of Parrot multimethods 'soonish'.

http://groups.google.com/groups

http://groups.google.com/groups

Never-ending story: Keys

Josef wondered where we are with multi keyed access. Tom Hughes, who has promised to have a go at fixing things wondered if he was waiting for hell to freeze over. At which point it started 'snowing in Texas' as Dan came through with the design. There was much rejoicing, and a snowball fight was threatened. Tom had some questions (no surprise there: when you're the one doing the implementation, there's almost never enough information). Clarification was provided, as were supplementary questions, and supplementary answers. As Blur put it so eloquently: "Woohoo!"

http://groups.google.com/groups

http://groups.google.com/groups

Hatchet job on assemble.pl / the fixup table

Brian Wheeler took an axe to the assembler and some associated files and asked for comments on his plans. Dan liked the sound of it, so Brian posted a patch. Leopold "How do you pronounce" Toetsch suggested making it a standalone library so stuff could just use Assembler instead of calling yet another external program. Sean O'Rourke asked for a summary of what had changed, and Brian gave him one. (Oo er).

http://groups.google.com/groups -- proposal

http://groups.google.com/groups -- patch

http://groups.google.com/groups -- summary

PMC assignment stuff

Dan has been a positive goldmine of designs and specs this week. This time he offered the design for a new PMC assignment opcode. We already have SET, which copies the pointers, and CLONE, which makes a 'full clone', but we also need ASSIGN to 'stuff a value from one PMC to another'. Peter Gibbs made a start on the implementation and asked for some clarification, which was forthcoming, and then supplied a patch with his implementation, along with tests. Go Peter!

http://groups.google.com/groups

http://groups.google.com/groups

Status on matrix patch?

Josef is worried about what's happening to the matrix patch. Simon Cozens used this as an opportunity to crack the worst joke yet seen on any perl6 mailing list, but Josef appears not to have seen the film and didn't get the joke. Anyhoo, back on the main line of the thread, the problem seems to be that in order to have things sensibly arranged, it made sense for Josef to break the .pmc file up into a couple of .c and .h files, along with matrix.pmc, but there's currently no scheme for multi-file PMC classes. Dan also thinks that putting code into the parrot core just to facilitate code sharing between PMCs is probably not the right answer. This appears to have opened a can of worms. Dan is searching for a bigger can -- sorry -- designing the infrastructure that this needs.

http://groups.google.com/groups -- Thread starts here.

http://groups.google.com/groups -- Terrible joke...

Constant & opcode swap ops

Dan offered designs to support multiple segment bytecodes, or at least, to swap in constant tables. Steve Fink wondered if it might be a good idea to wrap the interpreter in a PMC and use vtable methods to do constant table manipulations etc. Dan thinks that's a great idea for introspection, but worries that it'll slow things down if we make the interpreter itself go through those hoops all the time. Nicholas Clark also offered a few caveats ('Are you sure you want to do it this way?', words to strike fear into the heart of any designer I think). Dan isn't sure.

Faster assembler

Nicholas Clark offered a patch which makes Assembler::_generate_bytecode faster. By around 1.5%. Dan applied it. Sean O'Rourke offered more tweaks. Mattia Barbon had a patch which meant that the tests could just 'use Parrot::Assembler' rather than call assemble.pl, but the tweaks have outdated it. Dan and Nicholas asked for the patch anyway because it would make remaking it easier.

http://rt.perl.org/rt2/Ticket/Display.html

Perl 6 regexes...

Dan threw down the gauntlet of implementing Perl 6's regexes from Apocalypse. Sean O'Rourke has apparently already taken it up. I think we'll be seeing more of this thread next week...

http://groups.google.com/groups

Meanwhile, in perl6-language

The terribly named thread about different default values of true and false rumbled on. Damian averred that code which 'relies on the poorly specified standard values of truth and falsehood deserves to break'. Damian also argued that Perl ought to have a proper boolean type (after all, it has proper numeric and string types) and that it should be used by all built-ins. (Which does lead me to wonder about code like 0 but true.) Chip reckons that the standard values of truth and falsehood aren't that badly specified, it's not like they've ever changed since perl 1, though he does agree that a 'real' boolean would be nice. Elsewhere in the thread, Chip worried about the scoping of operator definitions and Damian reassured him a bit.

http://groups.google.com/groups

Use of regular expression on non-strings

Threads in perl6-language seem to be longer lived than those over in internals. I wonder why? This was another thread that started last week and kept on going. David Whipp clarified his original question and various people responded. Err... I'm not entirely sure how to summarise this one, and following the threading is a tad tricky because Mr Whipp's mailer doesn't do the In-Reply-To: thing.

http://groups.google.com/groups is probably as good a place as any to start looking.

Autovivi

Luke Palmer wondered how autovivification was going to work with specific reference to print %foo{bar}{baz}. Would %foo{bar} still be created if it didn't exist? Answer: "NO! Thank ghod!". Unless Larry says otherwise.

http://groups.google.com/groups

Perl summary for week ending 2002-08-04

See recursion. Miko O'Sullivan queried the throw-away comment that regexes were now called 'rules' and wondered if the term 'regex' would be going away. It turns out that I wasn't quite accurate, but that Damian still reckons that 'regexes' should be deprecated in favour of either 'rules' or 'patterns', at least in his own writing. For some reason this ended up spinning off into a discussion of regular, context-free and context-dependent languages and the nitty gritty details of when an expression was regular or not. Mark J. Reed ended up posting a fine essay to the list on the differences between the various types.

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups -- Mark's essay.

while <> { in Perl 6

Adam Lopresto wondered if the while (<>) { idiom (with appropriate DWIMmery) would be necessary in Perl 6 since, given a lazy list implementation, one could probably get by with for <> {. Trey Harris agreed. So did Larry, sort of. He's leaning towards only having for <> {...} do implicit topicalisation; while will require you to be explicit (damn, another spamassassin trigger word...).

http://groups.google.com/groups

http://groups.google.com/groups


In brief

Piers Cawley missed the point about doing copy on write of the entire stack when doing closuresque things. Various people explained, very politely that he was wrong. In a moment of selfish editorial wossname, I'll spare Piers's blushes by omitting the URL.

Josef added a couple of useful ops. Dan applied the patch, but wondered where the tests were. Josef has, of course, tested the ops, but Dan wants 'Comprehensive tests in t/ for everything' so we notice when we break things later. Nicholas Clark raised the spectre of 'The Schwern with the big stick.' if we failed to meet that testing goal.

Mike Lambert posted the first of his patches which will, eventually, unify PMCs and buffers.

Steve Fink posted a patch to convert the hashtable from pointer based to index based, making the GC system happier. Applied.

Angel Faus posted a 'loop discovery' patch for imcc which attempts to avoid array spilling whilst in an inner loop. This patch was applied. Then Leopold Toetsch pointed out a few bugs, and Angel posted another patch, which hasn't been applied yet.

Andy Dougherty sent some patches to eliminate warnings under Solaris 8. Applied.

Jarkko sent in more warnings patches. Applied.

Sean O'Rourke offered a patch to hashes to do deep cloning. Dan asked him to hold on a sec, adding that if he hadn't addressed the issue inside a day or two we should nudge him. Dan, consider yourself heartily nudged.

Peter Gibbs has removed the set_string_unicode and set_string_other vtable methods, which are, frankly, unnecessary. Leopold caught perlundef.pmc, which had been missed from the original purge.

Jonathan Sillito offered a patch to give scratchpads a pointer to their parent pads, and ended up with a new Scratchpad PMC. Warnock's Dilemma applies.

Peter Gibbs offers perlscalar.pmc, Aldo Calpini asked a few questions, and Warnock currently applies.

Daniel Grunblatt announced that the PPC JIT is now working.

Dan noted that there are some size restrictions -- we need to make sure that our chosen INTVAL is at least as big as a pointer. Andy Dougherty pointed out some possible caveats and noted that his warnings patch gave some assistance in this area.

Nicholas Clark found a bug in the assembler under perl 5.005_03. Simon Glover agreed it was a bug, but nobody is quite sure how to fix it. Nicholas is working on it though.

Boris "Thank heavens for cut and paste" Tschirschwitz wondered if the headers of the various PDD (Parrot Design Documents) were kept up to date, and if he could, in general, trust the docs to be accurate. Dan reckons they should be mostly up to date. Patches are almost certainly welcome.

In perl6-language, Chris Dutton wondered if we'd be able to create anonymous classes ( my $foo = class {...} ) in the same way as we currently create anonymous subs. Personally, I hope so. Chris worries slightly about my $foo_class $foo_obj = $foo_class.new , but I think he's getting his compile time and runtime constructs mixed up. But what do I know?

SpamAssassin marked last week's summary as spam in some mailboxes. John Porter suggested that 'Maybe people should add SpamAssassin rule that deducts 5 points if the message contains /leon brocard/ ?'


Who's who in Perl 6

Who are you?
Dave Mitchell. I work for a small UK-based software company.
What do you do for/with Perl 6?
I wrote PDD7 (coding standards), then got sidetracked into fixing perl 5 instead. I intend to get more involved in Perl 6 post 5.8.0 release.
Where are you coming from?
Sheffield?
When do you think Perl 6 will be released?
In about 2 years, then ready for production use in a further year.
Why are you doing this?
Because I love Perl and want to contribute to OSS development
You have 5 words. Describe yourself.
Lazy, apathetic, enjoyerofbugfixing, unsufferinggladlyoffools, rabidsceptic.
Do you have anything to declare?
My lack of genius?

Acknowledgements, corrections, threats and funding drives

Nicholas Clark would just like to clarify that, as far as right shifts of signed integers go, we should offer both arithmetic and logical right shifts. He doesn't appear to have any preference about which should be the default.

This summary was again prepared with the aid of GNER tea, supplemented this week by a ham and cheese toasted sandwich, the toasted sandwich of the gods (though their bacon and tomato toastie also has its adherents I remain faithful to good old ham and cheese.)

Thanks are also due to the wonderful Pete Sergeant, and to a certain person who I'll not mention again (but his favourite colour is orange) for their excellent proofreading skills. Anything which may have snuck past them (especially in this paragraph) is, of course, entirely my fault.

If your name appears in this, or any previous summary and you've still not sent me your answers to the perl6 questionnaire, please consider doing so. My "Perl 6 Who's who?" archive hasn't run out yet, but it'd be good to know that I'd got a good supply to fall back on.

Still no T?iBook, but this week's haul of egoboo is well up there. Many thanks to the flatterers in perl-golf@perl.org.

If you didn't like this summary, write your own. Go on, I dare you. If you did like it, send money to the Perl Foundation at http://donate.perl-foundation.org/ and remember, suitably large donations will earn you or your company a plug in a future summary. (You have to let me know though). See last week for details. And you'll be helping to fund the next generation of Perl, which will give you the warm fuzzies and a general feeling of virtue and well being. In the words of Mrs Doyle, "Go on, go on, go on, go on!"

Proxy Objects

As a Perl programmer, it's almost inevitable that at some point you will have to manage in-memory tree structures of some sort. When you do, it becomes important to be aware of how Perl manages memory, and of the situations where Perl will not free its memory -- situations that arise easily, as we'll see below.

In writing the XML::XPath module (a library that implements some of the XML Document Object Model, or DOM), I came across a particular problem with Perl's memory-management mechanism. In this article, I will detail the problem and demonstrate a technique for building data structures that avoid it.

The problem is circular references.

What Is a Circular Reference?

A circular reference is simply a self-referential data structure -- a complex data structure that at some point contains a reference to a part of itself further up the hierarchy. Circular references are a useful idiom when you need two parts of a structure to refer to each other -- we often see them in parent/child relationships, where a child object needs access to the methods in the parent:

figure 1

The case we're concerned with here is an XML tree, where we have a root node, and each node can have one or more children. In an XML DOM, the child nodes need the ability to refer back to their parents:

my $parent_name = $node->getParentNode()->getNodeName();

While it is often extremely useful to encode this type of relationship in your code, it is rather problematic for Perl.

Reference Counting

Before we see why circular references are problematic, we first need to understand how Perl's memory management works. To return memory for use by other parts of your program once it is no longer needed, Perl uses a technique called "reference-counted garbage collection".

Perl keeps a count, for each variable, of the number of references to it -- the number of other things in your program that refer to it. This lets Perl's garbage collector ensure timely destruction of lexical variables (those created with my). Each time the interpreter sees something take a reference to a variable, it increments that variable's internal reference count. When the count drops to zero, the variable's destructor is called and its memory is freed.

You can see the reference count of a variable using the Devel::Peek module:

use Devel::Peek;
my $x = "Hello";
my $y = \$x;
Dump($x);

Which outputs:

SV = PV(0x804b85c) at 0x8057620
  REFCNT = 2
  FLAGS = (PADBUSY,PADMY,POK,pPOK)
  PV = 0x805d800 "Hello"\0
  CUR = 5
  LEN = 6

The important field for these purposes is the REFCNT field, which shows us there are two references to our variable - one for the main copy of the variable (the one we see as $x) and one for the reference that $y holds.
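If you just want the count as a number, the core B module can read it directly. A minimal sketch (note that passing \$x into the call adds one temporary reference of its own, just as taking a reference for Dump() does):

```perl
use strict;
use warnings;
use B ();

my $x = "Hello";
# \$x creates one temporary reference for the duration of the call,
# so a freshly declared lexical reports 2 rather than 1
my $before = B::svref_2object(\$x)->REFCNT;

my $y = \$x;   # $y now holds a second, persistent reference
my $after  = B::svref_2object(\$x)->REFCNT;

print "before: $before, after: $after\n";   # before: 2, after: 3
```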

Reference Counting with Circular References

Reference counting works well until you need to build self-referencing data structures. Let's look again at the design of our XML DOM library, where children need access to their parent nodes. The DOM specification requires you to maintain a two-way relationship between the parent and the child nodes. This leads to our circular reference -- the parent holds a reference to the child, and the child holds a reference to the parent.

Circular References in Detail

Circular references occur when a variable either directly or indirectly refers to itself somehow. We can easily show this using hash references and Devel::Peek again:

use Devel::Peek;
for (1..1) { # enter scope here
  my %x = ();
  my %y = ();
  $x{child} = \%y;
  $y{parent} = \%x;
  Dump(\%x);
} # leave scope here

Now we can see that both variables have a refcount of 2 (the 3 for %x is because we have to take an extra reference in order to Dump() it):

...
SV = PVHV(0x8060b40) at 0x805702c
  REFCNT = 3               ### This is %x
  ...
  Elt "child" HASH = 0x77420b6
  SV = RV(0x8063008) at 0x804b494
    REFCNT = 1
    FLAGS = (ROK)
    RV = 0x8056fe4
    SV = PVHV(0x8060ba0) at 0x8056fe4
      REFCNT = 2           ### This is %y
      ...
      Elt "parent" HASH = 0xcc03940
      SV = RV(0x806300c) at 0x8056f84
        REFCNT = 1
        FLAGS = (ROK)
        RV = 0x805702c

Now problems arise. When %y goes out of scope, it cannot be garbage collected, because its reference count is 2, not 1. %x also cannot be garbage collected for the same reason (if it could, then it would free the extra reference to %y). So we end up with two zombie variables, which can neither be garbage collected nor returned to the system. This is why we get what appear to be memory leaks when we use circular references. Had we made our for () loop repeat more than once, we would see our memory usage steadily growing.

Rather than proving this by watching our memory grow, though, we can demonstrate what is happening with objects. We can create a simple object that outputs something in its DESTROY method:

package CircObj;
sub new { bless {}, shift }
sub parent {
  my $self = shift;
  @_ ? $self->{parent} = shift : $self->{parent};
}
sub child {
  my $self = shift;
  @_ ? $self->{child} = shift : $self->{child};
}
sub DESTROY { warn("CircObj::DESTROY\n") }

Now a normal instance of this will output the destroy method as soon as its scope exits:

{
  my $x = CircObj->new();
  warn("Leaving scope\n");
}
warn("Scope left\n");

Results in the following output:

Leaving scope
CircObj::DESTROY
Scope left

However, if we fill in our circular references, then we get a different result:

for (1..1) {
  my $parent = CircObj->new;
  my $child = CircObj->new;
  $parent->child($child);
  $child->parent($parent);
  warn("Leaving scope\n");
}
warn("Scope left\n");

And we see the output:

Leaving scope
Scope left
CircObj::DESTROY
CircObj::DESTROY

What this means is our variables are only getting DESTROYed when the Perl interpreter does its global cleanup -- at the time our program exits, not in the timely manner we are used to with normal objects. This may not be too problematic for some scripts, but if you're running any sort of large program, or a long-running persistent interpreter like mod_perl, then you will see your memory steadily grow as you do this repeatedly.

A common way to "fix" this problem is to use a manual destructor -- a method or function call that breaks the circle somehow. In our original hashes example, this is as simple as adding delete $x{child} to our code (deleting $y{parent} is then optional, as the circle is already broken), and in the object example we can supply a destroy() method for the user to call before leaving the scope. But neither of those options is terribly user friendly, and I believe people who use my modules should be able to expect them to "just work".
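As a sketch of the manual-destructor approach, here is a hypothetical Pair class with a break_link() method (neither is from any module discussed here); counting DESTROY calls shows that breaking the circle lets both objects be freed at scope exit:

```perl
use strict;
use warnings;

package Pair;
sub new    { bless {}, shift }
sub parent { my $s = shift; @_ ? ($s->{parent} = shift) : $s->{parent} }
sub child  { my $s = shift; @_ ? ($s->{child}  = shift) : $s->{child} }
# manual destructor: the user must remember to call this before
# letting the objects go out of scope
sub break_link { my $s = shift; delete $s->{child} }
sub DESTROY    { $main::destroyed++ }

package main;
our $destroyed = 0;
{
    my $p = Pair->new;
    my $c = Pair->new;
    $p->child($c);
    $c->parent($p);
    $p->break_link;   # without this line, neither object is ever freed
}
print "$destroyed objects destroyed\n";   # 2 objects destroyed
```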

Fixing Circular References - with Perl 5.6+

Perl 5.6.0 introduced a new feature to "fix" all of the problems with circular references. This feature is called weakrefs. The basic idea is to flag a reference as weakened, so as to not include it in the reference counting. In order to use weakrefs, you need to install the Scalar::Util module (which is included with Perl 5.8.0). It is simple to use. Let's see what happens with our example above:

package CircObj;
use Scalar::Util qw(weaken);
sub new { bless {}, shift }
sub parent {
  my $self = shift;
  @_ ? weaken($self->{parent} = shift) : $self->{parent};
}
sub child {
  my $self = shift;
  @_ ? $self->{child} = shift : $self->{child};
}
sub DESTROY { warn("CircObj::DESTROY\n") }

for (1..1) {
  my $parent = CircObj->new;
  my $child  = CircObj->new;
  $parent->child($child);
  $child->parent($parent);
  warn("Leaving scope\n");
}
warn("Scope left\n");

It's important to note that we only need to weaken our parent reference -- breaking one link of the circle is enough. But this time, it outputs the expected:

Leaving scope
CircObj::DESTROY
CircObj::DESTROY
Scope left

The weaken function is well-proven, and is a stable way to ensure that circular references don't mess up the operation of Perl's garbage collector. You should definitely use it if you can -- for example in your in-house code.
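Scalar::Util also provides isweak(), which lets you confirm that a reference really has been weakened -- a small sketch:

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken isweak);

my %parent;
my %child;
$parent{child} = \%child;            # strong reference, still counted
weaken($child{parent} = \%parent);   # weak: ignored by the refcounter

my $kind = isweak($child{parent}) ? "weak" : "strong";
print "parent link is $kind\n";      # parent link is weak
```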

However, that still leaves CPAN module authors and those stuck with Perl 5.005 with a problem: what do we do with older versions of Perl?

Fixing Circular References - with Perl 5.00503 and Lower

Anyone who works for a large company, or who puts out open-source Perl modules, will know that there are still an awful lot of people using Perl 5.00503 -- upgrading takes time, and many people are running older OSes that only come with Perl 5.00503.

So we have to get inventive -- we need some proxy objects.

Proxy objects are a way to access objects indirectly via another object. The term proxy object is from the Design Patterns book (http://hillside.net/patterns/DPBook/DPBook.html). A proxy object works by being an intermediary between an object and its methods -- passing all communication on to the real object, perhaps after doing something based on the method called. We can implement a basic proxy object in Perl using AUTOLOAD:

package ProxyObject;

sub new {
  my $class = shift;
  my $real_obj = shift;
  my $self = { real_obj => $real_obj };
  return bless $self, $class;
}

sub AUTOLOAD {
  my $self = shift;
  my $method = $AUTOLOAD;
  $method =~ s/.*:://;
  warn("Proxying: $method\n");
  $self->{real_obj}->$method(@_);
}

1;

And we can use that as follows:

use ProxyObject;
use Time::localtime;
my $time  = localtime();              # create a localtime object
my $proxy = ProxyObject->new($time);  # create a proxy to that object
print $time->hour, " is the same as ", $proxy->hour, "\n";

Which gives us the output:

Proxying: hour
14 is the same as 14
Proxying: DESTROY

So, this all looks very interesting, but you're probably wondering how that helps with circular references. Well, let's look at a circular-reference example that uses ProxyObject:

package CircObj;
sub new { bless {}, shift }
sub parent {
  my $self = shift;
  @_ ? $self->{parent} = shift : $self->{parent};
}
sub child {
  my $self = shift;
  @_ ? $self->{child} = shift : $self->{child};
}
sub DESTROY { warn("CircObj::DESTROY\n") }

use ProxyObject;
for (1..1) {
  my $parent = CircObj->new;
  my $child = CircObj->new;
  my $proxy = ProxyObject->new($parent);
  $parent->child($child);
  $child->parent($parent);
  warn("Leaving scope\n");
}
warn("Scope left\n");

Now what we see from our output is:

Leaving scope
Proxying: DESTROY
CircObj::DESTROY
Scope left
CircObj::DESTROY
CircObj::DESTROY

So we have made some progress; the DESTROY method on our $parent variable is now called, albeit twice. But what we can do now is use that call to DESTROY to break our reference loop. To do this, we implement a small DESTROY method in our class that clears out the circular reference:

package CircObj;
...
sub DESTROY {
  my $self = shift;
  warn("CircObj::DESTROY\n");
  if ($self->{child}) {
    # set the child's parent to undef, breaking the loop
    $self->{child}->parent(undef);
  }
}

Which finally leads to the desired output:

Leaving scope
Proxying: DESTROY
CircObj::DESTROY
CircObj::DESTROY
CircObj::DESTROY
Scope left

Of course, this being Perl, it is possible to wrap all of this up into a fairly simple class, which I'll demonstrate in the next section.

Problems With This Technique

There is one major problem with this technique: It's possible to confuse the garbage collector with it. The easiest way to see this happen is if you build a tree using the above technique, and then hold onto a sub-tree while letting the root of the tree go out of scope. All of a sudden you'll find your data structure has become untangled, like so much wool from a snagged jumper. Unfortunately, there's little you can do about this, except perhaps for allowing the use of a proxy object as an option, and providing a method for freeing the tree that manually breaks the links. This is what XML::XPath does: By specifying $XML::XPath::SafeMode = 1 at runtime, you can switch this behavior on and off.
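To see that snagged-jumper effect concretely, here is a self-contained sketch (the Node and Proxy classes are condensed, hypothetical stand-ins for the CircObj and ProxyObject classes above). We hold on to the child while the root's proxy goes out of scope; the proxy's DESTROY fires and severs the parent link we were still relying on:

```perl
use strict;
use warnings;

package Node;
sub new    { bless {}, shift }
sub parent { my $s = shift; @_ ? ($s->{parent} = shift) : $s->{parent} }
sub child  { my $s = shift; @_ ? ($s->{child}  = shift) : $s->{child} }
sub DESTROY {
    my $s = shift;
    # break the child's link back to us, as in the article's technique
    $s->{child}->parent(undef) if $s->{child};
}

package Proxy;
our $AUTOLOAD;
sub new { my ($class, $obj) = @_; bless { real => $obj }, $class }
sub AUTOLOAD {
    my $s = shift;
    (my $method = $AUTOLOAD) =~ s/.*:://;
    $s->{real}->$method(@_);   # forward everything, including DESTROY
}

package main;
our $kept_child;
{
    my $parent = Node->new;
    my $child  = Node->new;
    my $proxy  = Proxy->new($parent);   # proxy guards the root
    $parent->child($child);
    $child->parent($parent);
    $kept_child = $child;               # hold on to the sub-tree
}
# the proxy's DESTROY fired at scope exit and severed the parent link,
# even though we still hold part of the tree
print defined $kept_child->parent ? "parent intact\n" : "parent gone\n";
```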

Another problem is that every property access now has to go through a method call. If you have the proxy object (a simple hash ref in this example, though it could be a scalar ref), you can't access its hash entries directly and expect to reach the entries of the object being proxied. You might be able to make that work via the TIE mechanism, but it doesn't come for free. So we potentially lose some speed this way, though you should be using methods anyway for the sake of encapsulation.

Doing the Right Thing

What we can now do (in true Blue Peter fashion) is create a module that has totally transparent weak-references support on Perl 5.6, while still allowing garbage collection on lower Perl versions. This is a little more complex than the scenario above, but here's how it works. First, we create a base class that our complex data-structure classes can subclass:

# base class for all circular-reffing classes
# rename this before use
package BaseClass;
use strict;
use Carp;   # the AUTOLOAD below uses croak()

In that class we do some compile-time checking for Scalar::Util:

BEGIN {
  eval "use Scalar::Util qw(weaken);";
  if ($@) {
    $BaseClass::WeakRefs = 0;
  }
  else {
    $BaseClass::WeakRefs = 1;
  }
}

Next, we create a clever constructor. This constructor turns our class name (e.g. "XML::MyDOM::Node") into an implementation class name ("XML::MyDOM::NodeImpl") and constructs an object of that type. If weak refs are available, it returns that object directly; otherwise, it constructs a proxy object that forwards to the real object, and returns the proxy instead:

sub new {
  my $class = shift;
  no strict 'refs';
  my $impl = $class . "Impl";
  my $this = $impl->new(@_);
  if ($BaseClass::WeakRefs) {
    return $this;
  }
  my $self = \$this;
  return bless $self, $class;
}

Finally, we have a more advanced version of our proxy object's AUTOLOAD method. This does some error checking to ensure that unknown methods are detected correctly, and that methods on this class are executed in the correct way. But otherwise it's doing exactly the same job as our original:

sub AUTOLOAD {
  my $method = $AUTOLOAD;
  $method =~ s/.*:://; # strip the package name
  no strict 'refs';
  *{$AUTOLOAD} = sub {
    my $self = shift;
    my $olderror = $@; # store previous exceptions
    my $obj = eval { $$self };
    if ($@) {
      if ($@ =~ /Not a SCALAR reference/) {
        croak("No such method $method in " . ref($self));
      }
      croak $@;
    }
    if ($obj) {
      # make sure $@ propagates if this method call was the result
      # of losing scope because of a die().
      if ($method =~ /^(DESTROY|del_parent_link)$/) {
        $obj->$method(@_);
        $@ = $olderror if $olderror;
        return;
      }
      return $obj->$method(@_);
    }
  };
  goto &$AUTOLOAD;
}

package BaseClassImpl; # Implementation class

sub new { die "Virtual base class" }

# All base class methods go below here

1;

Now, all classes derived from that need to look like this:

package CircRefClass;
@ISA = ('BaseClass');

package CircRefClassImpl;
@ISA = ('BaseClassImpl', 'CircRefClass');

sub new {
  my $class = shift;
  # ordinary constructor here
  my $self = bless {}, $class;
  # ... yada yada yada
  return $self;
}

sub DESTROY {
  my $self = shift;
  # Break child's link to me here
  $self->child->del_parent_link();
}

sub set_parent {
  my $self = shift;
  if ($BaseClass::WeakRefs) {
    Scalar::Util::weaken($self->{parent} = shift);
  }
  else {
    $self->{parent} = shift;
  }
}

sub del_parent_link {
  my $self = shift;
  $self->{parent} = undef;
}

# All class methods here

1;

And that's about all you need to change. The rest of your methods can remain the same, as long as they are moved to the PackageImpl class rather than the Package class. The code will auto-detect the presence of Scalar::Util's weaken() function and use it accordingly if it's available.

The complicated work is in the AUTOLOAD method in the BaseClass package. The reason it looks so complex is that it implements a method cache for autoloaded methods, ensuring that this implementation won't slow down your program.

Also note that you can rename the del_parent_link and set_parent methods if those names aren't appropriate for your application. They just happen to work for a tree structure.

Conclusions

Circular references are a useful tool, and sometimes a necessary evil. By using methods available to Perl 5.6 and up, combined with a custom proxying technique, we can create classes that implement circular references while still managing to be garbage collected on all versions of Perl. This is a powerful design to add to your Perl programmer's toolbox.

This week on Perl 6 (week ending 2002-08-04)

Back on the regular schedule complete with new and improved article links via those lovely people at Google groups. Hopefully, we'll be staying on this schedule for the foreseeable future.

As usual, we'll kick off with perl6-internals, which was, once again, the busiest group this week.

We Need More Ops

This thread rumbled on (slowly) with a mixture of serious and humorous messages. Melvin Smith worried that parrot would turn into "a Linux kernel and forever accumulates custom ops, PMCs."

The appropriately named Eric Kidder proposed ':-) the Positivity operator. Nobody threw peanuts.

http://groups.google.com/groups

http://groups.google.com/groups

Of Mops, JIT and perl6

This thread kicked off the week before, but carried on into this week. Leopold Toetsch had got some MOPS numbers for various bits of parrot, including the perl6 mini language. Dan wondered whether the perl6 numbers included the time to generate the assembly and assemble it, because the assembler is 'rather slow.' Sean reckoned that, compared to the parser, the assembler was quick and that he suspects that even without that the perl6 numbers "would suck, since the compiler does some pretty heavy pessimization." Melvin Smith reckoned that imcc may have had something to do with the slowness as well, because it didn't do the right thing with loop invariants, but that "things can only get better."

Leopold then told us that the numbers were pure runtime and that the compilation phase added 2.3 seconds to the whole process. Ick. However, following a bugfix to perlarray.pmc he posted new numbers that were "not that abysmal," and that we already have the same MOPS as perl5 ... .

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

Dan Sugalski Is Back

And it's like he never went away. At the top of his list is catching up with his mail, sorting out keys, defining the extension mechanism, the exception infrastructure, ruthless efficiency and a fanatical devotion to the Pope.

He also suggested that if someone were to implement the cmpi and cmps ops for integer and string comparison, then that would be a good thing. Sean O'Rourke asked questions about that, and the semantics of 2 < $i++ < 23 were discussed.

http://groups.google.com/groups -- Start here.

Regex Speedup

Angel Faus has been working on the regex engine and provided a patch for it, "designed with the single goal of seriously cheating for speed." Dan applauded and claimed to be feeling "decidedly superfluous", but wondered whether Angel could use the -u or -c switches to diff next time.

Also on the regex front, Stephen Rawls offered some better documentation of the regex subsystem.

http://rt.perl.org/rt2/Ticket/Display.html

http://rt.perl.org/rt2/Ticket/Display.html

ARM JIT v2

Nicholas Clark released a "very minimal" ARM JIT framework. The initial version only JITed a small number of trivial ops, but it's the framework that was important. Daniel Grunblatt sent Nicholas a copy of the work he'd done in that area, and we ended up with a new, unified patch that handled a whole lot more ops. That prompted some wondering about where the PPC JIT had got to. It turns out that Daniel is working on that, too. Go Daniel.

On the subject of JITs, Jarkko Hietaniemi, fresh from his trials as the perl 5.8 pumpking popped up with a survey of the state of the JIT art in parrot. Apparently we're still missing PPC and POWER, MIPS, HPPA and IA64. Jarkko intends to "do nothing on these except raise gui^H^H^Hawareness :-)". Which is nice of him.

Nicholas also wondered whether he was "allowed to write ancillary functions I want the JIT to call in assembler?" Dan reckoned that, in a JIT, "evil things for speed reasons are almost obligatory."

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

Lexical Scope Ops, Test and Example

Jonathan Sillito offered a patch that implements lexical pads and some ops to manipulate them. Stephen Rawls pointed out a bug in the implementation, and Dan agreed that it was a bug but that once that was fixed the patch should go in, because "it gets a good chunk of the semantics we need." (Speaking as someone who keeps making noises about implementing Scheme in parrot, having parrot native lexicals will be a *huge* win.) So, Jonathan fixed it, and Melvin applied it.

This led Jerome Vouillon to wonder whether we actually needed explicit lexicals and scratchpads. The thread is rather tricky to summarize, so I'll just point you at it. However, Dan chimed in to say that, for now at least, he'd rather keep the separate interface to lexicals and globals because it would make changing the implementation possible without breaking existing bytecode.

http://rt.perl.org/rt2/Ticket/Display.html

http://rt.perl.org/rt2/Ticket/Display.html

http://groups.google.com/groups

Unifying PMCs and Buffers for GC.

Dan has accepted that we need to unify PMCs and strings/buffers for GC purposes, and asked for volunteers. Mike Lambert stepped forward (or maybe everyone else took one pace back). Again, this thread is hard to summarize because it's so information dense. If you're interested, then you should read the whole thread.

http://groups.google.com/groups

Negative Indices in Arrays.

Stephen Rawls wondered about what happened when someone did @foo = (1,2); @foo[-3]. This is an error in Perl5. As Dan put it, so elegantly, "Larry? Semantic call for you on the white telephone." Larry has, so far, not responded.

http://groups.google.com/groups

resize_array (PerlArray)

Aldo Calpini discovered that the parrot equivalent of @foo = (1,2); print @foo[9999]; would magically extend @foo to contain 10,000 elements -- which isn't exactly ideal. This spun off into a discussion of autovivification, especially with nested arrays. Dan pointed out that Perl 5's current behavior when you, for example, print $a[10000][0] is an artifact of the way multidimensional arrays work in Perl 5, and that most people would be happy if it went away in Perl 6. (Personally, I'd be happy if it went away in perl 5.9, but that's another mailing list.)

http://groups.google.com/groups

sub/continuation/dlsym/coroutine cleanup

Sean O'Rourke offered a patch that "implements native extensions and continuations as PMCs." It also cleans up the existing Sub and Coroutine types, and removes a load of now-obsolete ops that are handled through invoke. For some bizarre reason, SpamAssassin thought his message was infected with Klez -- which isn't good.

The patch doesn't handle lexicals automatically yet. This may be good or bad; we haven't reached consensus, and Dan hasn't settled on a position either. Jerome Vouillon suggested a couple of changes, and Sean O'Rourke agreed. Nicholas Clark wondered whether this meant that loop variables would only have to be allocated once, which looked nice.

http://groups.google.com/groups

Meanwhile, Over in Perl6-language

Phew. I always like it when I get to switch lists. It generally means the finish line is in sight because perl6-language is usually much quieter than p6i. (My how times have changed; I'm really glad I wasn't writing the summaries back in the RFC days ... .)

"A Light and Refreshing Summer Fruit Salad"

That's how Miko O'Sullivan described his collection of ideas for the Perl6 language. At least one of his proposals is already implemented in perl5, but it still sparked a lively and generally interesting thread. Especially when Damian started doing tricks with operator definitions and grammar munging. Trey Harris proposed the rather wonderful no strict 'physics', which made me smile. Damian also points out in this thread that the new, Perl6 word for `regex' will be `rule' because, well, they're no longer regular, and they'll often live in grammars.

Damian also mentioned that there was some thought of making composite objects the topic within their subscript brackets, which would enable some powerful slicing operations: @public = %hash{ /^(<-[_]+>.*)/ }; anyone? There's some debate as to whether this would violate the principle of least surprise though.

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

Use of Regular Expressions on Nonstrings.

David Whipp wondered about using regular expressions (that's `rules' now, of course ...) on non-strings, and wondered about using them to query databases. (It would appear to me that, if you tied a database to a hash, then Damian's proposed slicing syntax above would be a start down that road, though implementation could be tricky ...)

In Brief ...

Sean O'Rourke told us that his perl6 parser/compiler/all-round groovy thing now works with perl 5.005_03 and with 5.6.1, but that it didn't play well with 5.6.0.

Remember RECALL, and how it got renamed to AVOID? Well, actually, it didn't. It got renamed to AGAIN, but Tanton Gibbs had a brain fade when he typed the subject line of the submitted patch.

http://rt.perl.org/rt2/Ticket/Display.html -- Stephen Rawls offers patches for genclass.pl and introduces an addclass.pl script, intended to make the business of setting up a new PMC type that much easier.

http://groups.google.com/groups -- Michel Lambert offered a 'Getting Started' FAQ.

Simon Glover posted some code that makes GC segfault. Richard Cameron fixed it. Mike Lambert applied it.

Angel Faus offered a patch to make imcc take into account the data-flow info it gathered. Melvin Smith applied it.

Jarkko has been doing sterling work on getting parrot to compile on SGIs, and generally tidying up code.

Brian Ingerson patched assemble.pl to allow it to accept code on STDIN. Applied.

Jason Gloudon offered a patch that revised the JIT docs, and added some stuff to the SPARC JIT overview. He also changed the way the x86 JIT invokes functions. Applied.

Josef Höök offered a matrix implementation. Warnock's dilemma applies. You can find the patch at http://rt.perl.org/rt2/Ticket/Display.html if you're interested.

Sean O'Rourke pointed to http://lambda.weblogs.com/discuss/msgReader$3850.

Aldo Calpini offered docs for the parrot debugger. http://groups.google.com/groups

"Mr. Nobody" has patched things to allow Configure to run on Windows 9x. Applied.

Nicholas Clark wondered what to do about right shifting signed integers. Nicholas reckoned that sign extending signed types when doing right shifts was the Right Thing.

Daniel Grunblatt has committed a register allocator for the JIT. Not optimal yet, but it's a start. Nicholas Clark was impressed. So were the rest of us, probably.

Simon Glover has added some more tests of the GC ops. Applied.

Mark J. Reed gave the language crowd an update on Unicode, UTF-16 and Java. http://groups.google.com/groups

Leon Brocard, who I thought I wasn't going to be able to mention this week, announced the publication of his targeting parrot slides to the list. But I told you about those last week, so maybe I shouldn't have mentioned Leon this week after all. Hmm ...

Who's Who in Perl6

Who are you?
Josef Höök
What do you do for/with Perl 6?
Working with matrix and multidimensional array implementation in parrot.
Where are you coming from?
Sweden
When do you think Perl 6 will be released?
1 year
Why are you doing this?
It's fun and i learn a lot.
You have five words. Describe yourself.
icant, that was only 2words
Do you have anything to declare?
nope

Acknowledgements, Threats and Funding Drives.

This summary was once again prepared with the aid of GNER tea, and with the unwitting assistance of the fine folks at Google who provided me with a way of generating links to articles that don't require me to surf the fine Web in order to work out what the URLs should be.

Once again, if you didn't like this, then don't read it. If you did, then consider donating some of your hard-earned money (or your employer's hard-earned money) to the Perl Foundation at http://donate.perl-foundation.org/ and help support the ongoing development of Perl 6. If you're going to donate $100 or more in response to this summary, then please let me know and I'll give you a mention in the acknowledgements. If you're going to donate $250 or more, again, let me know and you'll get a "This week's summary was sponsored by Joe Bloggs" (for appropriate values of "Joe Bloggs") at the top of the summary.

Sadly, nobody has yet sent me a TiBook, which is probably a good thing; I'd only want a pony next.

Google is almost certainly a trademark. I should probably mention that shouldn't I?
