June 2005 Archives

This Week in Perl 6, June 21-28, 2005


All--

Long time no see ... err, write ... uh, read ... um ... this. Yeah, long time no this. As Piers hinted, two weeks ago I moved. Moving sucks. For those of you who care, I am still in Cambridge; for those of you who care more, I think you misunderstand the summarizer/summary reader relationship. Essentially it revolves around summaries; the summary of my move is Cambridge to Cambridge.

As Piers noted last week, this is a low-volume, high-action week, in no small part due to the hack-a-thons. Last week's was in Austria; this week's is near Toronto. Perhaps some nice soul who was actually at these hack-a-thons will summarize them when they are over.

Perl 6 Compiler

PGE Announcements

Patrick announced that PGE now supports grammars and more built-in rules. He even offered to field requests for built-in rules (although he would prefer patches).

Caller's Context

Gerd Pokorra wanted to know how to determine if his sub is called in void context. He conjectured that want might fill his wants. There is no response yet.

Self-Hosting Goals

Millsa Erlas explained that one good reason for Perl 6 to be self-hosting is that it would allow the people who love it most (Perl hackers) to hack on it. The theory is that low-level languages like C unnecessarily narrow the field of contributors (especially those that only know Perl). People expressed some concerns about confusion over the language Ponie should be written in. No one disputes that this is C.

Parrot

Indexing Hashtables

Klaas-Jan Stol asked for a clue bat with respect to indexing hash tables in PIR. Joshua Juran and Leo each took a swing.

Parrot Loses with Fedora Core 4

Patrick reported that Fedora Core 4 and Parrot don't get along well. Leo suggested a possible solution. Patrick has posted no response.

Default Method Resolution Order

Roger Browne wondered what the default MRO order is. Leo provided the answer: left-to-right, depth-first, discard all but the last occurrence of duplicates, divine intervention.
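For readers who want to see the stated rule in action (minus the divine intervention), here is a minimal Perl 5 sketch. It is not Parrot's implementation; the class graph and the sub names dfs_walk and mro are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical class graph: each class maps to its parents, left to right.
my %parents = (
    D => [qw(B C)],
    B => [qw(A)],
    C => [qw(A)],
    A => [],
);

# Depth-first, left-to-right walk that records every visit, duplicates included.
sub dfs_walk {
    my ($class) = @_;
    return ($class, map { dfs_walk($_) } @{ $parents{$class} || [] });
}

# Keep only the last occurrence of each class in the walk.
sub mro {
    my ($class) = @_;
    my @walk = dfs_walk($class);
    my %seen;
    # Scan from the end, so the first copy we meet is the last occurrence.
    my @kept = grep { !$seen{$_}++ } reverse @walk;
    return reverse @kept;
}

print join(' ', mro('D')), "\n";    # D B C A
```

The raw walk for D is D B A C A; dropping the earlier A leaves D B C A, so the shared base class sorts after both of its children.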

Win32 Tests Failing

Craig the Last-Nameless-One posted a list of failing tests and problems on Windows. Leo provided a few answers.

Method Inheritance Needs Perl Loving

Leo announced a Perl job for the interested: method inheritance in the PMC compiler. This naturally led to discussion of numerical hierarchies. I was a little disappointed that quaternions appeared, but Hamiltonian and Surreal Numbers did not. Honestly, where are our priorities?

Tracing and Debugging Pain

Matt Diephouse posted a general description of the problems he was having with tracing, debugging, and GC. Warnock might apply in a day or two.

Segmented Context and Register Memory

Chip posted a partial reply to Leo's context and register overhaul patch. Andy Dougherty responded to some of Chip's finer points. If you find the nuances of C's pointer pain interesting, this thread is for you.

Improving Parrot's Test Framework

Chromatic wants to improve Parrot's test framework by stealing ideas from Test::Class. He wants to know if anyone else has an interest.

setattribute Fails with Multi-Level Inheritance

Roger Browne opened a ticket describing an error with setattribute when using several layers of inheritance.

Register Allocation Bug

Leo opened a ticket for a problem with improper control flow tracking. Bill Coffman wondered whether the new register design is in place yet.

Pass by Value PMCs

Klaas-Jan Stol mused that the new calling conventions could work to allow passing PMCs by value.

Parrot Fall Down Go Boom

Matt Fowles reported a segfaulting Parrot that passes its tests. Sadly, no one solved his problem in the four hours between his posting it and writing the summary.

Perl 6 Language

You Know That, But You Go On

As Piers noted, arguments about ./method versus .method continue. Like Piers, I don't like ./. I guess I was the only person who liked $^ as the invocant. Ah well, I guess I will just go on summarizing.

Binding Functions

Piers wanted to use a Ruby idiom involving rebinding functions. Damian told him that he could, but also pointed him to wrap.

OO Questions

BÁRTHÁZI András posted a question about method calls in Perl 6. Juerd and Piers provided answers.

AUTOLOAD and $_

Last week's thread about AUTOLOAD continued. It still seems to be fishing for some official decision.

Magic Mutators and Proxies

Sam Vilain wondered if he could make proxies behave the way he wanted them to. Luke Palmer explained that yes, he could, but he would need to use binding instead of assignment.

Quasiquoting and PPI

Brad Bowman asked how Quasiquoting and PPI would interact with the AST. Autrijus posted some explanation and Adam Kennedy cleared up some terms.

The Usual Footer

To post to any of these mailing lists please subscribe by sending email to perl6-internals-subscribe@perl.org, perl6-language-subscribe@perl.org, or perl6-compiler-subscribe@perl.org. If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send feedback to

Annotating CPAN


AnnoCPAN is a new website that shows the documentation for every Perl module available on CPAN and allows anyone to post annotations in the margins of the documents. The notes are public, so everyone can read and reuse them under the same terms as Perl itself (the entire note database is available as an XML dump). The inspiration came from other open source documentation websites such as those for PHP and MySQL, but the implementation has adapted to the idiosyncrasies of Perl documentation regarding document length and versioning. This article discusses the origins and the ideas behind this project and how I implemented them.

How it All Started

People often complain about Perl documentation. This is not entirely fair, given the prodigious amount of available documents. Although the quality of the documentation for Perl modules outside of the core distribution ranges from excellent to non-existent, I believe that it is, overall, pretty good. But there will always be some minor niggling details, inaccuracies, or omissions in the documents. I know that, as a programmer writing documentation, it is easy to forget what should actually go in there for the benefit of the user, because I have the advantage (or disadvantage?) of knowing too much about the product. Often the users themselves are in the best position to point out the gaps in the documentation.

A typical response when a user complains about the documentation of an open source project is "send a patch!" However, this is often not practical, whether because the user doesn't know how to do it or doesn't want to spend the time, or the maintainer is unresponsive. Even under ideal conditions it often takes too long for the patch to make it to the official documentation, and too long a delay in gratification discourages further participation. Several projects allow users to modify the documentation directly, perhaps by using a wiki or by letting the users post comments. Notable examples include MySQL and PHP. On more than one occasion, people have asked why Perl didn't have something like that, and most agreed that it would be good if someone did something about it. Well, I decided to try to do the dirty job.

Note: While I was working on this project, CPAN::Forum appeared. It shares some of the goals of AnnoCPAN but has some differences, as well. Both have the goal of providing a place for discussion about Perl modules that have no other discussion venues, but only AnnoCPAN shows that discussion right next to the relevant parts of the documentation.

Where to Put the Notes?

One of the first things I noticed when analyzing the problem was that documentation pages for PHP and MySQL are usually fairly small--the sites have split the manual into small chunks, which allows the comments to remain close to the relevant part of the documentation. (It wouldn't be as helpful to add a footnote at the end of a 50-page-long document saying "By the way, the first paragraph is wrong!"). The problem is that Perl documentation usually comes in fairly long documents--just look at the PODs for CGI, DBI, or perlfunc! Splitting these documents into "reasonably sized" chunks is not trivial, because there's little standardization of the organization of the documents: for some documents I might find that =head1 sections are short enough, while for others I have to split at the =head3 or =item level. While this is an interesting and maybe even tractable problem, I decided to take a different approach: rather than splitting the document, I decided to attach the user comments to specific paragraphs and show them right on the margin or between the paragraphs (depending on a user-configurable stylesheet).

The decision to attach the notes to specific paragraphs opened another can of worms: if someone adds a note to paragraph 42 of My::Module version 0.10, where should that note go (if anywhere) in the documentation for My::Module version 0.20?

To prepare the AnnoCPAN site, first I had to create a full CPAN mirror. That wasn't hard, but it requires quite a bit of space (2.5GB). Then I pre-parsed it using a module derived from Pod::Parser (discussed later) and loaded each paragraph into a database. The database schema underwent several revisions, due to the difficulties in modeling CPAN.

CPAN is a Wild Jungle!

Something that anyone who tries to parse anything out of the full CPAN archive quickly finds out is that it is a maze of exceptions and corner cases. While most distributions, prepared with ExtUtils::MakeMaker or something similar, share certain structural and naming conventions, some authors deviate from the convention and package their distribution in strange ways. The first hurdle is how to figure out the distribution name and version number from the distribution filename. Luckily, Graham Barr (from search.cpan.org) has already worked on that, so I just used his module CPAN::DistnameInfo.
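To give a feel for the problem, here is a deliberately naive sketch of splitting a distribution filename into a name and a version with a single regular expression. It is not the code from CPAN::DistnameInfo, which copes with far more irregular filenames; the sub name split_distname and the sample paths are assumptions made for this example.

```perl
use strict;
use warnings;

# Naive split of a distribution filename into (name, version).
# Returns an empty list when the filename does not fit the common pattern.
sub split_distname {
    my ($file) = @_;
    # Strip any directory part, then require a known archive extension.
    (my $base = $file) =~ s{^.*/}{};
    $base =~ s{\.(?:tar\.gz|tgz|zip)$}{} or return;
    # Name, a hyphen, then a version starting with an optional "v" and a digit.
    return $base =~ /^(.+)-(v?\d[\w.]*)$/ ? ($1, $2) : ();
}

for my $file (qw(DBI-1.48.tar.gz some/dir/My-Module-0.10.zip oddball.ppm)) {
    my ($dist, $version) = split_distname($file);
    print defined $dist
        ? "$file: $dist, $version\n"
        : "$file: not recognized\n";
}
```

Even this small version has to decide what to do with files it cannot classify, which is a hint of why a dedicated module is the better choice.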

The next hurdle is the structure of the package itself. Most packages are .tar.gz files that unwrap to a directory with the same name as the distribution filename (sans the .tar.gz extension). Some packages are .zip files, and a few are .ppm files, or something else. Even for the .tar files there are inconsistencies, depending on the version of the program used to create them. I couldn't even open some of them correctly! Then there are files that don't unwrap to a single directory. I decided to deal only with reasonably clean .tar.gz and .zip packages and ignore everything else.

After unwrapping the distribution, my program had to figure out which files were significant for documentation purposes. I wanted to include only the modules, scripts, and standalone POD documents, excluding test files, examples, and bundled modules that belong to some other distribution. First the program filters based on the filename, to exclude some obvious negatives such as MANIFEST and META.yml, and include some likely positives such as .pm and .pod files. In uncertain cases, it opens the file and sees if it has some POD, such as a =head1 line.
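The two-stage filter described above could look roughly like the following sketch. It is an illustration under stated assumptions rather than the AnnoCPAN code: the sub name wanted_for_docs is invented, and the rule that skips a t/ directory and .t files is a guess at what "test files" means in practice.

```perl
use strict;
use warnings;

# Decide whether a file from an unpacked distribution is worth indexing.
# $content is the file's text; it is inspected only in uncertain cases.
sub wanted_for_docs {
    my ($path, $content) = @_;
    # Obvious negatives: packaging metadata.
    return 0 if $path =~ m{(?:^|/)(?:MANIFEST|META\.yml)$};
    # Assumed rule: anything under t/ or ending in .t is a test file.
    return 0 if $path =~ m{(?:^|/)t/} or $path =~ /\.t$/;
    # Likely positives: modules and standalone POD documents.
    return 1 if $path =~ /\.(?:pm|pod)$/;
    # Uncertain: accept only if the file contains a POD heading.
    return $content =~ /^=head1\b/m ? 1 : 0;
}

print wanted_for_docs('My-Module-0.10/lib/My/Module.pm', ''), "\n";
print wanted_for_docs('My-Module-0.10/MANIFEST', ''), "\n";
print wanted_for_docs('My-Module-0.10/bin/tool', "#!perl\n\n=head1 NAME\n"), "\n";
```

Checking the filename first keeps the common cases cheap, since only the leftovers need their contents read.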

Having decided that a file has documentation in POD format and should be included, the program has to figure out the title of the document. This is not as easy as it seems. I decided to use this rule, which seems to work most of the time: if the first POD paragraph is =head1 NAME, take the first word from the second paragraph and use it as the title. If not, guess the name from the pathname of the file. This is easy with "modern-style" packages; for example, My-Module-0.10/lib/My/Module.pm turns into My::Module. "Old-style" distributions are a bit trickier: My-Module-0.10/Module.pm also turns into My::Module. A third option that I haven't used is to look for the first package declaration in .pm files, but that wouldn't work for most .pod and .pl files, and that's without even considering that some .pm files have zero or more than one package declaration!
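As a rough sketch of that rule (not the actual AnnoCPAN code), the function below takes the file's path and its POD paragraphs. The name guess_title is made up, and the handling of "old-style" layouts, which prefixes the path with the distribution name minus its last component, is one assumed way to get from My-Module-0.10/Module.pm to My::Module.

```perl
use strict;
use warnings;

# Guess a document title from its POD paragraphs, falling back to its path.
# $paragraphs is an array ref of POD paragraphs in document order.
sub guess_title {
    my ($path, $paragraphs) = @_;
    my ($first, $second) = @$paragraphs;
    if (defined $first and defined $second and $first =~ /^=head1\s+NAME\s*$/) {
        # First word of the paragraph that follows "=head1 NAME".
        my ($word) = $second =~ /(\S+)/;
        return $word if defined $word;
    }
    # Fallback: split "Dist-Dir/rest/of/path.ext" into its pieces.
    my ($dist_dir, $rest) = $path =~ m{^([^/]+)/(.+)\.\w+$} or return;
    unless ($rest =~ s{^lib/}{}) {
        # Old-style layout (assumed rule): prepend the distribution's
        # parent namespace, e.g. "My" for My-Module-0.10.
        (my $dist = $dist_dir) =~ s/-v?\d[\w.]*$//;
        my @parts = split /-/, $dist;
        pop @parts;
        $rest = join '/', @parts, $rest;
    }
    $rest =~ s{/}{::}g;
    return $rest;
}

print guess_title('My-Module-0.10/lib/My/Module.pm',
    ["=head1 NAME\n\n", "My::Module - does things\n\n"]), "\n";
print guess_title('My-Module-0.10/lib/My/Module.pm', []), "\n";
print guess_title('My-Module-0.10/Module.pm', []), "\n";
```

All three calls print My::Module, one through the NAME section and two through the pathname fallback.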

All of the above leads me to wish that, if someone were to start CPAN from scratch again, it would be a bit more strict in the structure required for distributions, especially with the filename of the distribution package. The way things are now, it is impossible to know for sure if distribution A has a "higher" version number than distribution B. (It's possible to compare the dates, but what if there is more than one active branch?) Luckily, there is already work in that direction; having a way of measuring kwalitee may encourage authors to pack their modules in standard ways.

Ontological Questions

OK, so my program has produced a nicely unwrapped distribution with some PODs. What is a distribution and what is a POD, though? These may seem silly questions to ask, but they are very important. Is this POD a different version of some other POD? Is this distribution just a different version of some other distribution, or is it a completely different distribution? For distributions, I've assumed that if two files have the same filename, except for the version part, they are indeed two different versions of the same distribution (for example, DBI-1.47.tar.gz and DBI-1.48.tar.gz). This is a reasonable assumption, but unfortunately, there's no guarantee that it will always be true, because anyone could upload a file called DBI-1.49.tar.gz with completely unrelated contents (luckily, I haven't seen that happen yet). Note that a distribution can have more than one author, with various versions in each author's directory.

The problem becomes more complicated for modules, because unfortunately, there are many known cases of modules that appear in more than one distribution. The most common situation is when an author maintains a module as a separate CPAN distribution that is also part of the Perl core (that is, the perl distribution). A common example is the CGI module. However, there's no guarantee that two documents with the same name are indeed versions of the same module. The most dramatic example is the number of Install or Tutorial documents that have no relationship to each other. Luckily, this is not as common for real modules as it is for other documents, but I decided to play it safe and assume by default that two documents are the same only if they have the same name and belong to distributions with the same name. There is a manual override, however. For example, I can tell the system that CGI in the perl distribution is the same as CGI in the CGI.pm distribution.
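The default rule and its manual override can be pictured with a small lookup table. This is only a sketch under assumptions: the table %same_as, the "distribution/document" key format, and the sub names are invented for the example and are not taken from AnnoCPAN.

```perl
use strict;
use warnings;

# Hypothetical override table: maps "distribution/document" to the
# canonical "distribution/document" it should be treated as.
my %same_as = (
    'perl/CGI' => 'CGI.pm/CGI',
);

# Canonical identity key for a document inside a distribution.
sub doc_key {
    my ($dist, $doc) = @_;
    my $key = "$dist/$doc";
    return exists $same_as{$key} ? $same_as{$key} : $key;
}

# Two documents count as versions of each other when their keys agree.
sub same_document {
    my ($dist_a, $doc_a, $dist_b, $doc_b) = @_;
    return doc_key($dist_a, $doc_a) eq doc_key($dist_b, $doc_b) ? 1 : 0;
}

print same_document('perl', 'CGI', 'CGI.pm', 'CGI'), "\n";        # 1
print same_document('Foo', 'Tutorial', 'Bar', 'Tutorial'), "\n";  # 0
```

Without an entry in the table, two Tutorial documents in different distributions stay separate, which is the safe default described above.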

Loading the Database

Because I wanted to have paragraph granularity for attaching notes to modules, I loaded all of the CPAN documentation into my database, one row per paragraph. To parse the POD, I created a very simple subclass of Pod::Parser (which comes with perl). The subclass only overrides the paragraph-level methods and uses them to store the POD in the database without any further processing.

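package AnnoCPAN::PodParser;

use strict;
use warnings;
use base qw(Pod::Parser);

# VERBATIM, TEXTBLOCK, and COMMAND are section-type constants
# defined elsewhere in AnnoCPAN (not shown here).

# Called by Pod::Parser for each verbatim (indented) paragraph.
sub verbatim {
    my ($self, $text, $line_num, $pod_para) = @_;
    $self->store_section(VERBATIM, $text);
}

# Called for each ordinary text paragraph.
sub textblock {
    my ($self, $text, $line_num, $pod_para) = @_;
    $self->store_section(TEXTBLOCK, $text);
}

# Called for each POD command such as =head1 or =item;
# the raw text keeps the command itself along with its argument.
sub command {
    my ($self, $cmd, $text, $line_num, $pod_para) = @_;
    $self->store_section(COMMAND, $pod_para->raw_text);
}

sub store_section {
    my ($self, $type, $content) = @_;
    # ... 
    # load $content into database 
    # ...
}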

Here again I encountered problems with some of the modules that exist in the wild. Pod::Parser generally works very well, but it becomes extremely slow when a document has a very long paragraph (with thousands of lines). Most modules don't have paragraphs with more than a hundred lines, so the problem had likely never surfaced before, but I found a few modules that appear to contain lots of machine-generated data. They took about ten minutes each to parse. I went into the code of Pod::Parser and found that by deleting one line (an apparently unnecessary line!), the scaling problem goes away and parsing takes under a second.

For the database access itself, I used Class::DBI, which simplifies things enormously. For example, this is the code for creating a section (i.e., a paragraph):

$section = AnnoCPAN::DBI::Section->create({
    podver  => $podver,
    type    => $type,
    content => $content,
    pos     => $pos,
});

Translating the Notes

By translating, I mean "figuring out where the note goes in a different version of the same document," not "translating into a different language." Suppose that someone adds a note next to some paragraph of My::Module 0.10. To figure out where to put the note in the POD for My::Module 0.20, I decided to place it next to the paragraph in 0.20 that is most "similar" to the reference paragraph in 0.10. To decide which paragraph is most similar, I used the String::Similarity module by Marc Lehmann. The essential code is something like:

package AnnoCPAN::DBI::Note;
use String::Similarity 'similarity';

sub guess_section {
    my ($self, $podver) = @_;
    # $podver is a specific version of a pod

    my $ref_section = $self->section;
    my $orig_cont   = $ref_section->content;

    my $max_sim = AnnoCPAN::Config->option('min_similarity');
    my $best_sect;
    for my $sect ($podver->raw_sections) {
        # don't attach notes to commands
        next if $sect->{type} & COMMAND;
        my $sim = similarity($orig_cont, 
            $sect->{content}, $max_sim);
        if ($sim > $max_sim) {
            $max_sim   = $sim;
            $best_sect = $sect;
        }
    }
    if ($best_sect) {
        AnnoCPAN::DBI::NotePos->create({ note => $self, 
            section => $best_sect->{id}, 
            score => int($max_sim * SCALE),
            status => CALCULATED });
    }
}
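To try the selection logic without a database or String::Similarity, here is a self-contained sketch. The similarity measure is a stand-in (shared words divided by total distinct words), not the algorithm String::Similarity uses, and the names word_similarity and best_match, the sample paragraphs, and the 0.4 threshold are all assumptions for illustration.

```perl
use strict;
use warnings;

# Stand-in similarity: shared words divided by total distinct words (0 to 1).
sub word_similarity {
    my ($x, $y) = @_;
    my (%in_x, %in_y);
    $in_x{lc $_} = 1 for $x =~ /\w+/g;
    $in_y{lc $_} = 1 for $y =~ /\w+/g;
    my $shared = grep { $in_y{$_} } keys %in_x;
    my %all    = (%in_x, %in_y);
    my $total  = keys %all;
    return $total ? $shared / $total : 0;
}

# Return the index of the paragraph most similar to $reference,
# or undef when nothing scores above $min_similarity.
sub best_match {
    my ($reference, $min_similarity, @paragraphs) = @_;
    my ($best, $max) = (undef, $min_similarity);
    for my $i (0 .. $#paragraphs) {
        my $sim = word_similarity($reference, $paragraphs[$i]);
        ($best, $max) = ($i, $sim) if $sim > $max;
    }
    return $best;
}

my @new_version = (
    "Connects to the database and returns a handle.",
    "Returns the list of rows that match the query.",
);
my $idx = best_match("Connect to the database, returning a handle.",
    0.4, @new_version);
print defined $idx ? "note moves to paragraph $idx\n" : "note stays unattached\n";
```

Starting the running maximum at the threshold, as guess_section does, means a note is left unattached when no paragraph in the new version is similar enough.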

Adding a Web Interface

The web interface combines the strengths of Class::DBI and the Template Toolkit, using the methods discussed in "How to Avoid Writing Code--Using Template Toolkit and Class::DBI," by Kake Pugh. The only thing remaining, besides writing the templates, was to provide a controller module (so called for its role in the Model-View-Controller, or MVC, design pattern). The controller module has to parse the CGI parameters and cookies, decide what to do with them, authenticate the user if necessary, fetch something from the database, choose the template to use, and pass all of the required information to the Template Toolkit rendering engine. Some people advocate using modules such as CGI::Application as a base class for the controller module, but I found that writing it by hand was simple enough for my purposes.

Conclusion

In this article, I have discussed some of the logic and technical problems behind the design and implementation of AnnoCPAN. What remains to be done is to ensure that people use the site so that it becomes a valuable resource. That depends on users (which means you!) adding helpful annotations. Please take a look at annocpan.org!

Acknowledgments

I thank The Perl Foundation for a grant for working on this project and BUU for hosting the website.

This Week in Perl 6, June 8-21, 2005


Surprise! It's me again. You may be wondering what happened to last week's summary (I know I was) and where Matt had gone. Well, I'm not entirely sure where exactly he is now, but last week was moving week for him.

Those of you reading this on the mailing lists may also be wondering why this summary is so late. Um ... three words: World of Warcraft.

This Week in perl6-compiler

As a summarizer, when you see the "last fortnight" view of a mailing list containing 21 messages, several thoughts spring, unbidden, to your mind: "Is my mail broken again?" "Has everyone given up?" "Phew, this group won't take long to do."

It turns out that the answer to both of those questions is "No." What actually happened was that most of the stuff that normally happens in mail happened at the Austrian Perl Workshop and Leo Tötsch's house, with a side order of IRC conversation and a bunch of spinoff threads in p6l and p6i.

In the last fortnight, Pugs reached the point where it has a (mostly) working Parrot back end, and BÁRTHÁZI András wondered if we shouldn't start a perl6-general mailing list.

This Week in perl6-internals

140 messages in this one. p6c lulled me into a false sense of security. Again, you may notice a bewilderingly fast rate of change in this summary. It turns out that they weren't just working on Pugs at Leo's house. Perl 6 Hackathons give great productivity.

This Is Not Your Father's Parrot

There's been some serious work going on under the Parrot hood in the last two weeks. Leo and Chip have drastically reworked the calling conventions, which now use four new opcodes: set_args, set_returns, get_params, and get_results. At the time of writing, IMCC doesn't give you full syntactic help with them, but they're easy enough to use explicitly for the time being and the help is getting there. Check out the Parrot Calling Conventions PDD for details.

Also getting rejigged is the continuation/register frame architecture. Taking advantage of the fact that this is a virtual machine, we now have an unlimited number of registers per register frame. Combine this with the new calling conventions, in which arguments are passed outside of the register frame, and all of a sudden a full continuation becomes a simple pointer to the register frame and everything gets saved as if by magic. That opens up a whole bunch of possibilities and has interesting implications for the register allocator.

Chip's design notes

New Generational GC Scheme

Alexandre Buisse posted his outline for a Google Summer of Code project to implement a shiny new Generational Garbage Collection scheme. Discussion of tunability and threading issues followed.

Ordered Hashes: More Thoughts

Steve Tolkin helpfully provided a summary of his thoughts about ordered hashes: "An ordered hash that does not support deletes could cause a user-visible bug. At a minimum, it should support the special case of delete that is supported by the Perl each() operator." Dan pointed out that reusing the ordered hash code for anything other than the lexical pad it was specifically implemented for was just looking for trouble.

The Thread That I Really Hoped Matt Would Be Summarizing

AKA "Attack of the 50-foot register allocator vs. the undead continuation monster." Piers Cawley and Chip had something of a disagreement about interactions between continuations and the register allocator. After discussion on IRC, it became apparent that they were talking past each other. The new "the register frame is the continuation" design means that yes, the register allocator definitely can't rely on being able to reuse registers that persist over function calls, but that's all right because you can always grab more registers.

Missing MMD Default Functions

Remember the missing multimethod functions I mentioned last time? At the time, Chip hadn't ruled on whether taking them out was the Right Thing or not. He has since ruled that it was.

This is probably not quite the right place to suggest this, but what the heck. Maybe in future, when planning user-visible changes of this sort, the affected features should spend at least one release period deprecated and throwing warnings when used.

PGE, Namespaced Rules

William Coleda worried that PGE subrules appear to be globally scoped. It turns out that Patrick worries, too, but is currently in the process of thrashing out how to scope them. He outlined his current thinking.

PMCs and Objects Question

Klaas-Jan Stol wondered about the possibilities of overriding PMC behavior with Parrot classes. He outlined possibilities and wondered if he was correct. Chip thought that it should be possible to implement (for instance) Perl's datatypes in pure PIR, if only for debugging and fun. I'm still not entirely sure if it's possible to make a ParrotClass that inherits from a PMC, though.

Software Transactional Memory

It seems the design team have drunk deeply of the Software Transactional Memory (STM) Kool Aid. STM is, to quote Chip, a "wicked cool" way of doing threading. Expect a more-fleshed-out design document eventually.

Parrot bc

According to the configuration scripts, Parrot looks for the GNU version of bc solely for checking that Parrot bc is working. This is all very well, but there is no Parrot implementation of bc in the SVN repository. Apparently, there's a broken version of it sitting on Bernhard Schmalhofer's local hard disk.

None of which addressed the issue of why, even with a "working" version, the tests need to access GNU bc. Surely it's possible to write tests statically. The only time you'd need an authoritative version would be when you were adding tests. Oops, editorializing again.

Substituting for PGE

Will Coleda wondered if it was possible to do substitutions with PGE yet. "Yes, sort of," Patrick replied. You can substitute the first occurrence by grabbing the match data and using substr. Everything else is for another day.
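The idea is easy to picture in Perl 5 terms. The sketch below is only an analogy: it uses Perl 5's own regex engine and match offsets, not PGE or PIR, and the sample string is made up.

```perl
use strict;
use warnings;

my $text = "Parrot speaks PIR, and PIR is fun";

# Match once, then splice the replacement in at the recorded offsets.
# $-[0] and $+[0] are the start and end offsets of the last successful match.
if ($text =~ /PIR/) {
    substr($text, $-[0], $+[0] - $-[0], "bytecode");
}

print "$text\n";    # Parrot speaks bytecode, and PIR is fun
```

Only the first occurrence changes, which matches the "sort of" in Patrick's answer; replacing every occurrence would need a loop that keeps track of shifting offsets.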

Unexpected Behavior Calling Method

Klaas-Jan Stol had some problems implementing delegated addition. Apparently it's because the signatures of the __add methods caught him out. Also, it's a really bad idea to delegate to a method called __add, because Parrot expects some very particular behavior from it. Think about calling it add instead.

Parrot Goals and Priorities

Chip's put the slides of his Austrian Perl Workshop talk on the Parrot project and its priorities up on feather. Check them out; they're good.

New TODOs

Will Coleda's been busy injecting a bunch of handy TODO items in the Parrot RT system. Check 'em out, you might be able to do some of them.

New List for Pirate

Michal Wallace announced the creation of a new list for work on Pirate, a Python compiler for Parrot. If Python on Parrot is your bag, I suggest you sign up.

Adding Methods to Existing Classes

Patrick wondered how to add methods to existing classes. It turns out that the trick is to use find_type instead of findclass. According to Leo, findclass is deprecated.

Meanwhile, in perl6-language

Hmm. 1242 GMT+1 on Thursday as I write this, and there are, oh, 246 messages in perl6-language. This could get sketchy.

Reduce Metaoperator on an Empty List

Wow! The "Reduce metaoperator on an empty list" discussion is still going.

return() in Pointy Type Blocks

Much to my personal chagrin, it looks like return() inside of a pointy block will use an escape continuation and will probably be picky about making sure that you can only invoke the pointy block from somewhere dynamically "below" the block in which it was created. This means no cunning tricks like:

sub call_with_current_continuation(Code $code) {
  $code({ return $^cc })
}

which is probably a good thing.

caller and want

Gaal Yahas asked for clarification about the behavior of the caller builtin. Larry provided it.

Musing on Registerable Event Handlers for Some Specific Events

Adam Kennedy hoped that Perl 6 would have some sort of minimal set of hooks for handling events. (Personally, I'd like a maximal set of hooks for anything that changes the runtime structure of Perl, but I'm greedy like that.) Larry said that there would be such a thing, but that it wasn't designed yet. He appeared to volunteer Adam as an initial designer. Discussion ensued, but there's no concrete design yet. Slightly tangentially, Dan discussed his thoughts about a Parrot notifications manager on his blog, which might be useful to some.

Speed Bump Placement

In a thread discussing adding an eval STRING-type behavior to the right-hand side of a substitution, Larry said that "Deciding where (and where not) to put the speed bumps is a pretty good description of my job. It's impossible to have a language without bumps, so I reserve the right to put the necessary bumps where I think they'll do the most good and/or least harm."

Well, I thought that was worth reading by more than just the list subscribers.

MMD Vs. Anonymous Parameter Types Referencing Early Parameters

Chip threw up his hands and despaired of ever efficiently implementing:

  multi sub is_equal(Integer $a, Integer where { $_ == $a } $b: ) { 1 }

Which is cute, but Chip claims you need Jedi mind powers if you want to make it work.

Then Thomas Sandlaß popped up to say that actually, there was already a language called Cecil that allowed you to do precisely that sort of thing (called Predicate Dispatch) and there were several efficient implementation strategies. After a nudge from Chip, he even provided a link. Larry thought it eminently doable, too, and sketched out a strategy.

That strategy (which applies almost everywhere in Perl, when you think about it) boils down to "If you can't do it at compile time, do it at runtime (and pretend you did it at runtime)."

State of the Design Documents

Joshua Gatcomb worries about the state of the synopses. He argued (quite persuasively) that the thing to do would be to put the synopses into public change control with global read access, but with write access limited to @Larry. The community could then provide new documentation in the form of patches, which @Larry would approve, reject, or modify as appropriate, which all hangs on whether @Larry has sufficient tuits.

Patrick pointed out that this already exists and that he had volunteered as gatekeeper and patch dispatcher, but that there were very few patches so far. But now you all know about it, right?

Some discussion followed about how to flesh out things, but the important thing is the Perl 6 design document repository URL.

How Much Do We Close Over?

Piers Cawley wants to be able to write code like:

sub foo { my $x = 1; return sub { eval $^codestring } }
say foo().('$x'); # 1

In Perl 5, this gives warnings about an undeclared variable. Chip maintained that this is actually the Right Thing. Piers understood that it may not be the right thing in all cases, but he wanted to be able to make it work when needed, if necessary, with predeclaration. There was some discussion, but nothing from @Larry yet.

BEGIN {...} and IO

Ingo Blechschmidt noted that BEGIN {...} can be a little scary when you want to compile to bytecode. Consider:

my $fh = BEGIN { open "some_file" err ... }

This is okay, until you have a version of Perl that compiles that to bytecode. The response ran along the lines of "Don't do that, then!"

Personally I'd write that as:

my $fh = INIT { open "some_file" err ... }

That assumes my recollection is correct that INIT blocks happen after the code is compiled but before it starts to run--or do I mean a CHECK block?

Anonymous Macros

Ingo also wondered if anonymous macros (at compile time) were allowed. Larry had no problem with macros being first-class objects during the compile. He also went on to wonder if they should be multidispatch, too.

Perl Defined Object, Array, Hash Classes

While toying with Pugs, Eric Hodges managed to overwrite the internal definition of the Object class, which obviously caused him pain. Larry reckons we'll have constructs like:

class Object is augmented { ... };
class Object is replaced { ... };

(names up for grabs). My personal preference is for making augmented the default behavior, but I'll live if I can have a pragma that makes it that way.

%hash1 »...« %hash2

David Formosa wondered about the behavior of hyperops when applied to a pair of hashes. He wanted things arranged so that if you had two hashes with keys in common, then the hypering process would keep them together. Luke agreed that it would be useful (so do I, for that matter) and then everyone started talking about inner and outer joins and my database comprehension head swapped out for the moment.

Binding Slices

With a small correction for syntactical niceness, Piers wondered if:

my @y := @foo[0...][1]

would bind @y to a column of the two-dimensional matrix represented by @foo[][], so that writing to @y would affect @foo and vice versa. @Larry hasn't said anything yet.

alias the RubyMeter

BÁRTHAZI Andras wondered if Perl 6 would have something like Ruby's rather lovely alias. Larry thought you should be able to write a macro to do the job, but wasn't entirely sure how exactly it would be done. Further discussion centered on whether the feature was a good idea and whether it had the right name. One school of thought thinks it already exists as :=, but I'm not quite so sure.

&?CALLER::BLOCK Vs. Any Hope of Efficiency

Chip hopes that using &?CALLER::BLOCK as a general-purpose block promoter will be disallowed unless the calling block has already marked itself as callable. Larry thought that this would be okay, noting that he saw &?CALLER::BLOCK being mostly used for introspective purposes.

Creating a Web Templating Engine

Wow! Perl 6 isn't even finished, and already Andras is talking about writing a web templating engine for it. He outlined his plan and wondered how to go about implementing it. He and Ingo discussed it.

Hyper Concat

Thomas Klausner has been playing with »~« and uncovered some weirdness. Said weirdness led to a discussion of the default strings/patterns in split and join.

sub my_zip (...?) {}

Autrijus worried that the current Pugs implementation of zip was signatureless, which, among other things, makes it uncompilable to Parrot. He wondered what its function signature should be. Larry came up with the (admittedly slightly weird) goods.

Ignoring Parameters

Gaal Yahas wondered if he'd be able to write a class method as:

method greet(Class undef:) {...}

when his class methods made no references to the class object itself. Damian thought that the syntax should actually be:

method greet(FooClass ::class) {...}

and that subs and methods should complain about unused non-optional non-invocant parameters. There's more; see the sub for details.

Scalar Dereferencing

Autrijus wondered about the semantics of a scalar reference in the face of stringification and numification. He provided an example of Pugs' current behavior that may or may not be correct. Larry described broken behavior before thinking again and describing the really correct behavior, along with a summary of his raccoon problems.

Taking given as Read

Piers wondered how to write a function that would look like a given block to any whens inside of it. It turns out that you can't, yet. Damian thought that the right way to do it would be:

sub factorial (Int $n is topic) {
  return 1 when 0;
  $n * factorial($n - 1);
}

Reading this again, I find myself wondering if the return is really necessary.

./method

People don't like ./method. Other people don't like .method in methods. I think we have what we have on the "least worst option" principle--but I would say that I don't like ./method.

AUTOLOAD and $_

Sam Vilain wondered about the prototype of AUTOLOAD. In the discussion that ensued, some people felt that whatever happened, AUTOLOAD should return a code ref that perl would call.

Th-Th-The-That's All, Folks!

I remember now why I gave up writing summaries in the first place. First, I started missing weeks, which meant that there was so much to write up in the fortnightly summaries, and then discussions grew interesting, which meant writing them took so much longer because there were hard things to understand first.

Still, once in a while is refreshing, but I really should stop putting things off until the last minute.

Ahem.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl.

Or, you can check out my website. Maybe now I'm back writing stuff I'll start updating it. There are also vaguely pretty photos by me.

Data Munging with Sprog


We've all been there--a data translation problem rears its head and you reach for your toolkit of Perl snippets. It might involve parsing a CSV file, extracting MIME attachments, generating bulk SQL insert statements, or scraping data from a web application. You know you have code lying around that'll take you halfway there, if only you could find it. Then there's the problem of pulling it all together.

Wouldn't it be great if there was a way to catalog your code snippets? How about a way to browse or search by keyword, a way to modularize your code for easy reuse, and a way to document it and easily access that documentation? Wouldn't it be even better if you could pull the pieces together to assemble a solution without having to actually write code at all?

Now there is. Now there's Sprog.

The Assignment

Picture yourself as a sysadmin at Example Corp. Your boss calls you in to say he's setting up an LDAP server and he needs you to whip up an LDAP Data Interchange Format (LDIF) file containing every employee's name, phone number, and email account information. Oh, and he needs it this afternoon, so you'd better get typing.

You sit back down at your desk to contemplate your fate. Who ya gonna call? The answer hits you--the company phone list on the intranet! It has all of the information you need (Figure 1); you just need to get it out.

Figure 1. The company phone list web page

Getting Started

With a second flash of inspiration, you download the latest version of Sprog. You install the Perl Gtk bindings and a few other modest prerequisites and before you know it, you're looking at a clean green GUI (Figure 2).

Figure 2. The Sprog workspace

A quick scan of the palette on the left reveals something labeled Retrieve URL, which sounds like a good start. Reading the instructions at the bottom of the window, you learn that the thing is a gear and that you can drag it across and drop it on the workspace.

Having dragged the gear onto the workspace, you right-click on it and select Properties. Up pops a properties dialog (Figure 3), with a handy box where you paste in the URL of the phone list.

Figure 3. A properties dialog

Next, you drag across the Text Window gear from the palette and drop it on top of the Retrieve URL gear. It snaps reassuringly into position so that the two gears' connectors fasten together securely (Figure 4).

Figure 4. Two gears connected together

When you click the Run button on the toolbar, the machine leaps into life. The gears turn, it retrieves data, and a text window appears to display the HTML of the phone list page. Okay, great, you haven't written a line of code and already you've replicated your browser's View Source function.

You save the machine to a file called phonelist.sprog and then look for the next clue.

Making Connections

Returning to the palette, you find a gear labelled Parse HTML Table. It looks promising, so you drag it onto the workbench. You pull apart the first two gears and attempt to add the new one in between them. Unfortunately, the new gear has a funny shaped output connector and the Text Window gear doesn't seem to fit onto it.

You right-click on the new gear and select Help. From the help page you learn that the output connector is a list connector. The gear takes a stream of HTML text and outputs rows of data, where each row is a list of values plucked from adjacent table cells.

Once more, back at the palette, you discover a List To CSV gear, which has an input connector to match the table parser and an output connector to match the text window gear. You drag it over and snap them all together (Figure 5).

Figure 5. A machine to produce CSV

Now when you run the machine again, the text window fills with lovely CSV data. It's not that you want CSV data, of course, but at least you can see that the machine has parsed the relevant data out from the HTML page. Or has it?

On closer inspection, you realize that the machine has parsed the wrong table from the HTML. In true 1998 style, the page designer used nested tables to lay out the page. Even the list of navigation links is a table. Oh dear!

The properties dialog for the Parse HTML Table gear allows you to specify which table you want to parse. The help page explains that you can enter just a number (such as 2 for the second table) or an XPath expression. There's even an example XPath expression which you can cut and paste to select a table based on the contents of the first cell in the first row:

//table[./tr[1]/th[1][contains(text(), 'First Name')]]

As it happens, a bit of trial and error reveals that the phone list data is in the third table, so you set the selector to 3. Now when you run the machine, you see exactly the data you want in beautiful CSV format (not that you want CSV data, of course).

Hang on! The data still isn't quite right. The data values don't contain any HTML tags, but they do seem to have lots of leading and trailing white space and embedded newlines. Racing back to the palette, you grab the Strip Whitespace gear, slot it into your machine and tweak its properties to specify exactly which white space you want stripped. Now when you run the machine again, you do get truly lovely CSV data (Figure 6).

Figure 6. CSV output

Of course, there's no getting away from the fact that you still don't want CSV data.

What Were You Trying To Do, Again?

Did you know? Your email client can import LDIF files into your address book.

To import an LDIF file into your Thunderbird address book:

  1. Open your address book.
  2. Select Tools -> Import.
  3. Select Address Books and then Next.
  4. Select "Text file (LDIF ...)" and then Next.
  5. Select the LDIF file created from Sprog and then Open.
  6. Select Finish.

Remembering your original orders, you do a bit of reading about LDIF files. Your research shows that LDIF is a fairly simple text format. You need to generate a text file with an entry for each person separated by blank lines and formatted something like:

dn: uid=cat,ou=Staff,ou=People,dc=example,dc=com
objectClass: person
objectClass: inetOrgPerson
cn: Catherine Trenton
uid: cat 
sn: Trenton
givenName: Catherine
mail: cat@example.com
organizationName: Example Corp
telephoneNumber: 555-2349
mobileTelephoneNumber: 555-9623

This looks like a job for a template, and sure enough, you find a gear entitled Apply Template (TT2) in the palette. The template gear's input connector is unlike either the pipe or the list connectors you've encountered so far. The help page tells you it's a record connector that passes data using Perl hashes rather than arrays.

Back on the palette, you find a handy gear called "List to Record" that automagically converts lists to records by assuming the first row contains column headings, which it uses for field names (hash keys). You remove the List To CSV gear and replace it with "List to Record," followed by the Apply Template (TT2) gear. With a few clicks and drags, you reassemble your machine into its "almost final" shape (Figure 7).

Figure 7. Machine including template gear

In the properties dialog for the template gear, you add a template:

dn: uid=[% email %],ou=Staff,ou=People,dc=example,dc=com
objectClass: person
objectClass: inetOrgPerson
cn: [% first_name %] [% surname %]
uid: [% email %]
sn: [% surname %]
givenName: [% first_name %]
mail: [% email %]
organizationName: Example Corp
telephoneNumber: [% phone %]
mobileTelephoneNumber: [% cell %]

The results are almost exactly what you want, except that there are a couple of places where you wanted a uid field and the closest you had available was email. You need to strip out the @example.com from the dn and uid lines. You unplug the Text Window gear and insert a "Find and Replace" gear to fix the first occurrence (Figure 8):

Figure 8. Adding a "Find and Replace" gear

and another one to fix the second (Figure 9):

Figure 9. Adding another "Find and Replace" gear

and now the output is exactly like what you wanted (Figure 10).

Figure 10. Final output
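
For readers who would rather reach for the semicolon key after all, the two "Find and Replace" steps amount to a pair of substitutions. The sketch below is only a rough plain-Perl equivalent: the strip_domain helper and both patterns are assumptions made for illustration, not the settings shown in Figures 8 and 9.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the mail domain from the dn and uid lines
# of one LDIF entry, leaving the mail line untouched.
sub strip_domain {
    my ($entry) = @_;
    $entry =~ s/^(dn: uid=[^\@,]+)\@example\.com/$1/mg;
    $entry =~ s/^(uid: [^\@\s]+)\@example\.com$/$1/mg;
    return $entry;
}

my $entry = <<'LDIF';
dn: uid=cat@example.com,ou=Staff,ou=People,dc=example,dc=com
cn: Catherine Trenton
uid: cat@example.com
mail: cat@example.com
LDIF

print strip_domain( $entry );
```

Anchoring each pattern to the start of a line keeps the mail field intact, which is the same reason the machine uses two separate gears rather than one global replace.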

You swap out the Text Window gear for a Write File gear, select a filename, and run the machine one last time (Figure 11).

Figure 11. The final machine

You've finished the job and you never had to touch your semicolon key once.

More About Sprog

I hope this article has given you some idea of what's possible with Sprog. It can be a useful addition to the toolbox of people who write scripts to transform data. Beyond that, though, I intend it to be a useful tool for people who don't write scripts--scripting for the GUI guys, if you will. Anyone who's smart enough to drive a spreadsheet is smart enough to drive Sprog. It's just a different way of working with data. Even if the only thing those people use it for is getting data into a form their spreadsheets can handle, then that's surely useful.

Sprog is under active development, with the framework being extended and new gears added all the time. Writing your own gears is easier than you might imagine, and there is a mailing list to ask questions and share your ideas.

Understanding and Using Iterators

The purpose of this tutorial is to give a general overview of what iterators are, why they are useful, how to build them, and things to consider to avoid common pitfalls. I intend to give the reader enough information to begin using iterators, though this article assumes some understanding of idiomatic Perl programming. Please consult the "See Also" section if you need supplemental information.

What Is an Iterator?

Iterators come in many forms, and you have probably used one without even knowing it. The readline and glob functions, as well as the flip-flop operator, are all iterators when used in scalar context. A user-defined iterator usually takes the form of a code reference that, when executed, calculates the next item in a list and returns it. When the iterator reaches the end of the list, it returns an agreed-upon value. While implementations vary, a subroutine that creates a closure around any necessary state variables and returns the code reference is common. This technique is called a factory and facilitates code reuse.
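
To make the comparison concrete, here is a minimal sketch that puts a built-in iterator (readline in scalar context) next to a hand-rolled one. The gen_lines factory is a hypothetical name chosen for this illustration, and the in-memory filehandle is used only so the example runs without any files on disk.

```perl
use strict;
use warnings;

# Built-in iterator: readline in scalar context returns one line per call
# and undef at end of file.
my $text = "alpha\nbeta\ngamma\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
while ( defined( my $line = <$fh> ) ) {
    chomp $line;
    print "readline: $line\n";
}
close $fh;

# User-defined counterpart: a factory returning a closure over its state.
sub gen_lines {
    my @lines = split /\n/, shift;
    my $index = 0;
    return sub {
        return undef if $index > $#lines;   # agreed-upon exhaustion value
        return $lines[ $index++ ];
    };
}

my $next = gen_lines( $text );
while ( defined( my $line = $next->() ) ) {
    print "closure:  $line\n";
}
```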

Why Are Iterators Useful?

The most straightforward way to use a list is to define an algorithm to generate the list and store the results in an array. There are several reasons why you might want to consider an iterator instead:

  • The list in its entirety would use too much memory.

    Iterators have tiny memory footprints, because they can store only the state information necessary to calculate the next item.

  • The list is infinite.

    Iterators return after each iteration, allowing the traversal of an infinite list to stop at any point.

  • The list should be circular.

    Iterators contain state information, as well as logic allowing a list to wrap around.

  • The list is large but you only need a few items.

    Iterators allow you to stop at any time, avoiding the need to calculate any more items than is necessary.

  • The list needs to be duplicated, split, or modified into multiple variants.

    Iterators are lightweight and have their own copies of state variables.

How to Build an Iterator

The basic structure of an iterator factory looks like this:

sub gen_iterator {
    my @initial_info = @_;

    # state variables shared by every call to the iterator
    my ($current_state, $done);

    return sub {
        my $next_state;
        # code to calculate $next_state or set $done
        return undef if $done;
        return $current_state = $next_state;
    };
}

To make the factory more flexible, the factory may take arguments to decide how to create the iterator. The factory declares all necessary state variables and possibly initializes them. It then returns a code reference--in the same scope as the state variables--to the caller, completing the transaction. Upon each execution of the code reference, the state variables are updated and the next item is returned, until the iterator has exhausted the list.

The basic usage of an iterator looks like this:

my $next = gen_iterator( 42 );
while ( my $item = $next->() ) {
    print "$item\n";
}

Example: The List in Its Entirety Would Use Too Much Memory

You work in genetics and you need every possible sequence of DNA strands of lengths 1 to 14. Even if there were no memory overhead in using arrays, it would still take nearly five gigabytes of memory to accommodate the full list. Iterators come to the rescue:

my @DNA = qw/A C T G/;
my $seq = gen_permutate(14, @DNA);
while ( my $strand = $seq->() ) {
    print "$strand\n";
}

sub gen_permutate {
    my ($max, @list) = @_;
    my @curr;
    return sub {
        if ( (join '', map { $list[ $_ ] } @curr) eq $list[ -1 ] x @curr ) {
            @curr = (0) x (@curr + 1);
        }
        else {
            my $pos = @curr;
            while ( --$pos > -1 ) {
                ++$curr[ $pos ], last if $curr[ $pos ] < $#list;
                $curr[ $pos ] = 0;
            }
        }
        return undef if @curr > $max;
        return join '', map { $list[ $_ ] } @curr;
    };
}
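
The "nearly five gigabytes" estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the dna_totals helper is a hypothetical name, and it assumes one byte per base with no per-string overhead at all.

```perl
use strict;
use warnings;

# Count the strands of each length and the characters needed to hold them,
# assuming one byte per base and no per-string overhead.
sub dna_totals {
    my ($max_len, $alphabet_size) = @_;
    my ($strands, $chars) = (0, 0);
    for my $len ( 1 .. $max_len ) {
        my $count = $alphabet_size ** $len;
        $strands += $count;
        $chars   += $count * $len;
    }
    return ($strands, $chars);
}

my ($strands, $chars) = dna_totals( 14, 4 );
printf "%.0f strands, %.0f characters, about %.2f GB\n",
    $strands, $chars, $chars / 1e9;
```

For lengths 1 to 14 over four bases this works out to 357,913,940 strands and 4,891,490,532 characters, which is where the figure of just under five gigabytes comes from.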

Example: The List Is Infinite

You need to assign IDs to all current and future employees and ensure that it is possible to determine if an ID is valid with nothing more than the number itself. You have already taken care of persistence and number validation (using the Luhn formula). Iterators take care of the rest:

my $start = $ARGV[0] || 999999;
my $next_id = gen_id( $start );
print $next_id->(), "\n" for 1 .. 10;  # Next 10 IDs

sub gen_id {
    my $curr = shift;
    return sub {
        0 while ! is_valid( ++$curr );
        return $curr;
    };
}

sub is_valid {
    my ($num, $chk) = (shift, '');
    my $tot;
    for ( 0 .. length($num) - 1 ) {
        my $dig = substr($num, $_, 1);
        $_ % 2 ? ($chk .= $dig * 2) : ($tot += $dig);
    }

    $tot += $_ for split //, $chk;

    return $tot % 10 == 0 ? 1 : 0;
}

Example: The List Should Be Circular

You need to support legacy apps with hardcoded filenames, but want to keep logs for three days before overwriting them. You have everything you need except a way to keep track of which file to write to:

my $next_file = rotate( qw/FileA FileB FileC/ );
print $next_file->(), "\n" for 1 .. 10;

sub rotate {
    my @list  = @_;
    my $index = -1;

    return sub {
        $index++;
        $index = 0 if $index > $#list;
        return $list[ $index ];
    };
}

Adding one state variable and an additional check would provide the ability to loop a user-defined number of times.
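
As a minimal sketch of that idea, the variant below takes the maximum number of passes as its first argument. The name rotate_n and the argument order are assumptions for illustration; the extra state variable is $loops and the extra check is the first line of the closure.

```perl
use strict;
use warnings;

# Hypothetical variant of rotate(): the first argument caps the number of
# full passes through the list; after that the iterator returns undef.
sub rotate_n {
    my ($max_loops, @list) = @_;
    my $index = -1;
    my $loops = 0;    # the extra state variable

    return sub {
        return undef if $loops >= $max_loops;   # the extra check
        my $item = $list[ ++$index ];
        if ( $index == $#list ) {
            $index = -1;    # wrap around
            $loops++;       # one full pass completed
        }
        return $item;
    };
}

my $next_file = rotate_n( 2, qw/FileA FileB FileC/ );
while ( defined( my $file = $next_file->() ) ) {
    print "$file\n";
}
```

With a limit of 2 and three filenames, the loop prints six names and then stops.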

Example: The List Is Large But Only a Few Items May Be Needed

You have forgotten the password to your DSL modem and the vendor charges more than the cost of a replacement to unlock it. Fortunately, you remember that it was only four lowercase characters:

my $next_pw = fix_size_perm( 4, 'a' .. 'z' );

while ( my $pass = $next_pw->() ) {
    if ( unlock( $pass ) ) {
        print "$pass\n";
        last;
    }
}

sub fix_size_perm {

    my ($size, @list) = @_;
    my @curr          = (0) x ($size - 1);

    push @curr, -1;

    return sub {
        if ( (join '', map { $list[ $_ ] } @curr) eq $list[ -1 ] x @curr ) {
            @curr = (0) x (@curr + 1);
        }
        else {
            my $pos = @curr;
            while ( --$pos > -1 ) {
                ++$curr[ $pos ], last if $curr[ $pos ] < $#list;
                $curr[ $pos ] = 0;
            }
        }

        return undef if @curr > $size;
        return join '', map { $list[ $_ ] } @curr;
    };
}

sub unlock { $_[0] eq 'john' }

Example: The List Needs To Be Duplicated, Split, or Modified into Multiple Variants

Duplicating the list is useful when each item of the list requires multiple functions applied to it, if you can apply them in parallel. If there is only one function, it may be advantageous to break the list up and run duplicate copies of the function. In some cases, multiple variations are necessary, which is why factories are so useful. For instance, multiple lists of different letters might come in handy when writing a crossword solver.

The following example uses the idea of breaking up the list to enhance the employee ID example. Assigning ranges to departments adds additional meaning to the ID.

my %lookup;

@lookup{ qw/sales support security management/ }
    = map { { start => $_ * 10_000 } } 1..4;

$lookup{$_}{iter} = gen_id( $lookup{$_}{start} ) for keys %lookup;

# ....

my $dept = $employee->dept;
my $id   = $lookup{$dept}{iter}->();
$employee->id( $id );

Things To Consider

The Iterator's @_ Is Different Than the Factory's

The following code doesn't work as you might expect:

sub gen_greeting {
    return sub { print "Hello ", $_[0] };
}

my $greeting = gen_greeting( 'world' );
$greeting->();

It may seem obvious, but closures need lexicals to close over, as each subroutine has its own @_. The fix is simple:

sub gen_greeting {
    my $msg = shift;
    return sub { print "Hello ", $msg };
}

The Return Value Indicating Exhaustion Is Important

Attempt to identify a value that will never occur in the list. Using undef is usually safe, but not always. Document your choice well, so calling code can behave correctly. Using while ( my $answer = $next->() ) { ... } would result in an infinite loop if 42 indicated exhaustion.

If it is not possible to know in advance valid values in the list, allow users to define their own values as an argument to the factory.
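
One way to do that is sketched below. The gen_range factory and its third argument are hypothetical, introduced only for this illustration: the caller may pass any value to signal exhaustion, and undef is used when none is given.

```perl
use strict;
use warnings;

# Hypothetical factory: counts from $from to $to, then returns whatever
# sentinel the caller supplied (undef when none is given).
sub gen_range {
    my ($from, $to, $exhausted) = @_;
    my $curr = $from;
    return sub {
        return $exhausted if $curr > $to;
        return $curr++;
    };
}

# 0 is a legitimate item here, so test with defined() rather than truth.
my $next = gen_range( -2, 2 );
while ( defined( my $item = $next->() ) ) {
    print "$item\n";
}

# If undef could be a valid item, a unique reference is a safer sentinel.
my $DONE = \'done';
my $iter = gen_range( 1, 3, $DONE );
while ( 1 ) {
    my $item = $iter->();
    last if ref($item) && $item == $DONE;
    print "$item\n";
}
```

A reference makes a convenient sentinel because no ordinary list item can compare equal to it by address.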

References to External Variables for State May Cause Problems

Problems can arise when factory arguments needed to maintain state are references. This is because the variable being referred to can have its value changed at any time during the course of iteration. A solution might be to de-reference and make a copy of the result. In the case of large hashes or arrays, this may be counterproductive to the ultimate goal. Document your solution and your assumptions so that the caller knows what to expect.
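
The copying approach might look like the following sketch. The gen_walker factory is a hypothetical example, not part of any library; it trades memory for isolation by taking a shallow copy of the caller's array.

```perl
use strict;
use warnings;

# Hypothetical factory that walks an array passed by reference. Copying
# the array up front isolates the iterator from later changes by the caller.
sub gen_walker {
    my ($list_ref) = @_;
    my @copy  = @$list_ref;   # shallow copy: costs memory for large lists
    my $index = 0;
    return sub {
        return undef if $index > $#copy;
        return $copy[ $index++ ];
    };
}

my @queue = qw/a b c/;
my $next  = gen_walker( \@queue );
shift @queue;                        # caller modifies the original
print $next->(), "\n" for 1 .. 3;    # the iterator still sees a, b, c
```

Note that the copy is shallow: if the items are themselves references, the data they point to can still change underneath the iterator.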

You May Need to Handle Edge Cases

Sometimes, the first or the last item in a list requires more logic than the others in the list. Consider the following iterator for the Fibonacci numbers:

sub gen_fib {
    my ($low, $high) = (1, 0);

    return sub {
        ($low, $high) = ($high, $low + $high);
        return $high;
    };
}

my $fib = gen_fib();
print $fib->(), "\n" for 1 .. 20;

Besides the funny initialization of $low being greater than $high, it also misses 0, which should be the first item returned. Here is one way to handle it:

sub gen_fib {

    my ($low, $high) = (1, 0);

    my $seen_edge;

    return sub {
        return 0 if ! $seen_edge++;
        ($low, $high) = ($high, $low + $high);
        return $high;
    };
}
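
Another way, sketched here under the hypothetical name gen_fib_from_zero, is to start from (0, 1) and return the current value before advancing. That removes both the odd initialization and the extra flag.

```perl
use strict;
use warnings;

# Alternative sketch: start at (0, 1) and hand back the current value
# before advancing, so no special case is needed for the leading 0.
sub gen_fib_from_zero {
    my ($curr, $next) = (0, 1);

    return sub {
        my $result = $curr;
        ($curr, $next) = ($next, $curr + $next);
        return $result;
    };
}

my $fib = gen_fib_from_zero();
print $fib->(), "\n" for 1 .. 10;   # 0 1 1 2 3 5 8 13 21 34, one per line
```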

State Variables Persist As Long As the Iterator

Reaching the end of the list does not necessarily free the iterator and state variables. Because Perl uses reference counting for its garbage collection, the state variables will exist as long as the iterator does.

Though most iterators have a small memory footprint, this is not always the case. Even if a single iterator doesn't consume a large amount of memory, it isn't always possible to foresee how many iterators a program will create. Be sure to document how the caller can destroy the iterator when necessary.
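
For a closure-based iterator, destroying it usually just means dropping the last reference to the code ref. The sketch below makes that visible; the Tracker class and gen_counter factory are hypothetical and exist only to report when the state is freed.

```perl
use strict;
use warnings;

package Tracker;   # hypothetical class, used only to make cleanup visible

sub new     { my ($class, $freed) = @_; return bless { n => 0, freed => $freed }, $class }
sub advance { my $self = shift; return ++$self->{n} }
sub DESTROY { my $self = shift; ${ $self->{freed} } = 1 }

package main;

sub gen_counter {
    my $state = Tracker->new( shift );    # state variable held by the closure
    return sub { return $state->advance };
}

my $freed = 0;
my $iter  = gen_counter( \$freed );
print $iter->(), "\n" for 1 .. 3;
print "state freed? ", ( $freed ? 'yes' : 'no' ), "\n";

undef $iter;    # drop the only reference to the closure
print "state freed? ", ( $freed ? 'yes' : 'no' ), "\n";
```

The first check reports "no" and the second "yes": the state goes away as soon as the closure's reference count reaches zero, not when the list runs out.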

In addition to documentation, you may also want to undef the state variables at exhaustion, and perhaps warn the caller if the iterator is being called after exhaustion.

sub gen_countdown {
   my $curr = shift;

   return sub {
       return $curr++ || 'blast off';
   }
}

my $t = gen_countdown( -10 );
print $t->(), "\n" for 1..12; # off by 1 error

Becomes:

sub gen_countdown {
   my $curr = shift;

   return sub {
       if ( defined $curr && $curr == 0 ) {
           undef $curr, return 'blast off';
       }

       warn 'List already exhausted' and return if ! $curr;

       return $curr++;
   }
}

See Also

This Week in Perl 6, June 1-7, 2005

Crumbs. I've remembered to write the summary this week. Now if I can just remember to bill O'Reilly for, err, 2003's summaries. Heck, it's not like waiting for the dollar to get stronger has paid off.

Ah well, no use crying over spilled milk. On with the show. Maybe, just maybe, darwinports will work its magic and I'll have a working Haskell compiler by the time I've finished writing.

This Week in perl6-compiler

undef Issues

I'd probably forgotten this, but Larry pointed out that in Perl 6 there will no longer be a function undef() and a value undef. Instead there'll be a function undefine() and a value undef, but he thinks that we should usually fail() to construct our undefined values.

This Week in perl6-internals

Keys

I'm not sure I understood what TOGoS was driving at with a suggestion about keys and properties. Luckily Leo, Dan, and Chip all seemed to. The discussion continued through the week.

Loop Improvements

Oh no! It's the register allocator problems again. One of these days, I swear I'm going to swot up on this stuff properly, work out whether it's really the case that full continuations break any conceivable register allocator, and summarize all of the issues for everyone in a nice white paper/summary.

HP-UX Build Notes

Nick Glencross posted some of his issues with getting Parrot up on an HP-UX machine. After a good deal of discussion and tool-chain fettling, he made things build and posted a patch to fix the knowledge, which was promptly applied (r8280, for those of you with the svn chops to know how to take advantage of that).

mod_pugs Status

Jeff Horwitz announced that mod_parrot now comes bundled with mod_pugs, which means that you can now write Apache extensions in Perl 6. I don't know about you, but my mind is still boggling.

Parrot 0.2.1

Parrot spent most of the week in a feature freeze for the release of Parrot 0.2.1 "APW," which went ahead as planned on the 4th of June.

Parrot on Solaris

Peter Sinnott reported problems with Parrot on Solaris. It turns out that different implementations of atan behave slightly differently, which isn't good. I believe the problem remains unresolved.

Parrot on the Mac OS

Joshua Juran's questions about getting Parrot running on Mac OS Classic went Warnocked.

Parrot Tests Get TODO

Continuing the drive for consistent testing structures everywhere in Perl land, Chromatic applied a patch to Parrot::Test that makes TODO tests work in a way that Test::Builder understands. Hurrah!

Missing MMD Default Functions

Dan was somewhat bemused to find that the MMD functions' defaults had disappeared when he did a sync with subversion. He wondered whether this was deliberate. Turns out that it was. I'm not sure whether Chip's ruled that it was right, though.

Google's Summer of Code 2005

Remember earlier when I talked about IMCC's register allocation? Well, Dheeraj Khumar Arora is looking at working on improving IMCC's optimizations as part of Google's Summer of Code 2005. The usual thread ensued.

Building nci/dynclasses on HP-UX

Not content with getting Parrot to build on HP-UX, Nick Glencross next set his sights on making nci/dynclasses work on HP-UX. It sounds like there'll be a patch forthcoming some time next week.

Nick Paints the Big HP-UX Picture

Announcing Amber for Parrot 0.2.1

Roger Browne announced another new language that targets Parrot: Amber. It borrows a good deal of syntax and semantics from Eiffel, with a large sprinkling of Ruby for good measure.

A note WRT exception handlers

Leo posted a quick discussion of the correct use of exception handlers in Parrot. Essentially, the rule is that your exception handler should jump back to the point just after the exception handler block:

    push_eh except_N
    # Code that might fail
    clear_eh
resume_N:
    ...
except_N:
    ...
    goto resume_N

Easy, eh?

Meanwhile in perl6-language

The Reduce Metaoperator Thread

Remember when I discussed this thread two weeks ago? It's still going strong.

Larry ended up stating that there will be an optional property, identval, on operators which will be set by default on all operators with obvious identity values. Or it might be called initvalue.

Larry Makes Up His Mind

Construction Clarification

Carl Franks wondered about how object constructors will work. It turned out that the code he'd carefully written by hand pretty much described the default behavior. Damian and Larry provided details. Hopefully, some keen p6porter has already incorporated any new information into the appropriate Synopses.

A Comprehensive List of Perl 6 Rule Tokens

Patrick responded to his own post last week to clarify some things about the capturing behavior of various rule types. He, Japhy, and Thomas Sandlaß thrashed out the gory details.

Default Invocant of Methods

Larry addressed Ingo Blechschmidt's questions about class methods.

Class is a role? My head hurts.

returns and Context

Gaal Yahas wondered how to specify the signature of a context-sensitive function. The consensus seems to be to use a junction, like so:

sub foo() returns Str|Int {...}

Declarations of Constants

Adam Kennedy had wondered how much compile-time optimization of constants would happen. Damian thought not as much as Adam thought, but suggested that he could use macros to get more optimization if he needed it.

Time Functions

The good thing about localtime et al. is that everyone knows them. The bad thing about them is that they're at such a low level that you either end up reinventing wheels, getting it wrong, or boggling at the size of the library you need to install to get access to good time manipulation. I wonder what Perl 6 will end up with.

Empty Hash

Luke wondered if {} should be an empty hash rather than empty code, and why { %hash } no longer makes a shallow copy of the hash, but code that returns %hash. There was some discussion, but no answers came from anyone else on the design team.

chars in a List Context

Joshua Gatcomb revisited a long-Warnocked subject. He wants:

@chars = 'hello'.chars; # <h e l l o>

That is, in a list context, chars should return a list of the characters in the string. Stuart Cook thought it was a good idea.

Transparent/Opaque References

Um ... I'm not sure what Thomas Sandlaß and Juerd were talking about. I'll tell you what, let's swap places: you read the thread and write me a summary of it.

Idea for Making @, %, and $ Optional

Millsa Erlas wondered if it would be possible to make variable sigils optional. The short answer is yes, with a pragma, and probably left for CP6AN.

Using Rules

BÁRTHAZI András wondered about using rules in a web templating system he was working on. Aankhen supplied an answer.

(Look, it's two messages. Any summary I wrote that told you more than the above sentence would be about as long as the original messages.)

(Multi)Subroutine Names

Dakkar wondered how he could get at the long name of a multi sub. Rod Adams thought it'd be:

&foo<Array, Int>
&foo<Hash, Int>

but also thought it might have been changed. Thomas Sandlaß agreed that it had changed to:

&foo:(Array, Int)
&foo:(Hash, Int)

Easy.

Flattening Arguments

BÁRTHAZI András wondered about the behavior of flattening arguments in Pugs when compared to that described in Perl 6 and Parrot Essentials. Answer: The book's right, they're just not implemented in Pugs. Yet.

return() in Pointy Blocks

Oh boy. Ingo Blechschmidt opened a can of worms when he asked about return within pointy subs. However, because the worms were slow in starting, you'll have to wait for Matt's summary next week when he explains:

sub callcc (Code $code) { $code(-> $r {return $r}) }

Meanwhile, in Another Place

Once upon a long time ago, Jon Orwant threw coffee cups and swore and Perl 6 was born. Later that afternoon, Dan Sugalski started doodling design sketches for what was to become Parrot. Parrot's first README in CVS dates from August 11th, 2001, and the first archived mailing list post is from August 1st, 2000, but that's a reply.

As well as being Parrot's original developer, Dan is also Parrot's first commercial user.

Last week, he announced in his blog that, having already given up his designer's hat earlier this year, he's stopped doing any Parrot development. The plan is that he'll be publishing a few design documents and historical explanations of various bits of Parrot design on his blog, but otherwise, that's all he wrote.

I'm not the first, and I'm sure I won't be the last to say this. Dan, thank you very much for all the work you've put into Parrot over the years. Good luck with whatever you do next.

The End ... for Now

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl.

Or, you can check out my website. Maybe now I'm back writing stuff I'll start updating it. There are also vaguely pretty photos by me.

Independently Parsing Perl


A few years into my programming career, I found myself involved in a somewhat unusual web project for an enormous global IT company. Due to some odd platform issues, we could write the intranet half of the project only in Perl and the almost-identical public internet half only in Java.

In my efforts to pick up enough Java to help my Perl code interoperate with the code from the Java guys, I stumbled on a relatively new editor with the rather expansive name of JetBrains IntelliJ IDEA.

What a joy! It quite simply made learning Java an absolute pleasure, with comprehensive tab completion, light and simple API docs, easy exploration of the mountain of Java classes, and unobtrusive indicators showing me my mistakes and offering to fix them. In short, it had lots of brains and a fast, clean user interface.

Where Is IntelliPerl?

Although I only needed it heavily for a few months, it's been my gold standard ever since, and my definition of what a "great" editor should be. I install every new Perl editor and IDE I come across in the hope that Perl might one day get an editor half as good as what Java has had for years.

These great editors are spreading. Java is now up to one and a half (Eclipse is nearly great but still seems not quite "effortless" enough about what it does). Dreamweaver gave HTML people their great editor years ago, and I've heard that Python may now have something that qualifies.

Interestingly, these great editors seem to share one major thing in common.

How to Build a Great Editor

Rather than relying on the language's parser to examine code, great editors seem to implement special parsers of their own. These parsers treat a file less like code and more like a generic document (that just also happens to be code).

It's a key distinction, and one that provides two critical capabilities.

First, it creates a "round-trip" capability, parsing a file into an internal model and back out again without moving a single white space character out of place. Even if parts of a file are broken or badly formatted, you can still change other parts of the file and save it correctly without it changing anything you don't alter.

Second, it makes the parser extremely safe and error-tolerant. Any code open in an editor is there for a reason--generally because it isn't finished yet, is broken, or needs changing. A document parser can hit a problem, flag it, stumble for a character or so until it finds something it recognizes, and then continue on.

Parsing as code is an entirely different task, and one often unsuited to these types of faults.

For example, take the following.

print "Hello World!\n";
  
}
  
MyModule->foobar;

For an editor using Perl itself to understand this code, it's game over once it hits the naked closing brace, because the code is invalid. Without knowledge of what is below the brace, you lose all of the intelligence that needs the parser: syntax highlighting, module checking, helpful tips, the lot.

It's simply not a reasonable way to build an editor, where a file can be both unfinished and have dozens of bugs.
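
To make the round-trip and error-tolerance ideas concrete, here is a toy tokenizer in plain Perl. It is only a sketch of the general "document parser" approach and not how PPI works internally; the tokenize function and its four token types are invented for this illustration. Every character of the input lands in exactly one token, so joining the tokens reproduces the file exactly, and the stray closing brace is just another token rather than a fatal error.

```perl
use strict;
use warnings;

# Toy "document" tokenizer (hypothetical, for illustration only).
# Every character ends up in exactly one token, so nothing is lost.
sub tokenize {
    my ($source) = @_;
    my @tokens;
    while ( $source =~ /\G(?:(\s+)|(\w+)|("(?:[^"\\]|\\.)*")|(.))/gcs ) {
        my ( $type, $content ) =
              defined $1 ? ( whitespace => $1 )
            : defined $2 ? ( word       => $2 )
            : defined $3 ? ( string     => $3 )
            :              ( other      => $4 );   # anything unrecognized
        push @tokens, { type => $type, content => $content };
    }
    return @tokens;
}

# The broken example from above, including the naked closing brace
my $source = <<'END_SOURCE';
print "Hello World!\n";

}

MyModule->foobar;
END_SOURCE

my @tokens = tokenize($source);

# Round trip: joining the tokens gives back the original text
my $rebuilt = join '', map { $_->{content} } @tokens;
print $rebuilt eq $source ? "round trip ok\n" : "round trip FAILED\n";

# The code after the stray brace is still visible to the "editor"
my @words = map { $_->{content} } grep { $_->{type} eq 'word' } @tokens;
print "words: @words\n";    # words: print MyModule foobar
```

A real document parser needs far more token types and a tree on top of them, but the two properties shown here, lossless output and tolerance of broken input, are the ones the rest of this article relies on.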

Building a Document Parser for Perl

Even without an editor to put it in (yet), a document parser for Perl would be extraordinarily useful for all sorts of tasks. At the time, though, all I really wanted was a really accurate HTML syntax highlighter.

Some time in early 2002, I was bored one afternoon and had a first stab at the problem. The result was pretty predictable, given patterns I've seen in others trying the same thing. It was A) based on regular expressions, and B) useless for anything even remotely interesting.

Between then and the start of The Perl Foundation grant in December 2004, I've spent a day or so a month on the problem, rewriting and throwing away code. I've junked two tokenizers, one lexer, an analysis package, three syntax highlighters, an obfuscation package, a quote engine, and half of the classes in the current object tree.

Now, finally, PPI is complete, bar some minor features and testing. It is 100 percent round-trip safe, and it's been stress tested against the 38,000 (non-Acme) modules in CPAN, handling all but 28 of the most broken and bizarre.

What Does It Do?

PPI should be the basis for any task where you need to parse, analyze or manipulate Perl, and it finally provides a platform for doing these tasks to their full potential. This covers a huge range of possible tasks; far too many to cover in any depth here.

For this article, I want to demonstrate how PPI can improve existing tools that currently only do a very basic job, when there is the potential for so much more.

One of these is part of the PAR application-packaging module. When PAR bundles a module into the internal include directory, it tries to reduce the size of the modules by stripping out POD. Of course, what would be better would be to strip out everything that is excess and cut PAR file sizes even more.

This is a form of compression, but given the potential confusion in using something like "Compress::Perl" as a name, I'm picking my own term. I hereby anoint the term "Squish". A squished module occupies as little space as possible, having had redundant characters removed. It will be extremely small, although it might look a little "squished" :)

Perl::Squish

Rather than showing you the final project, I prefer to show the process of squishing a single module.

# Where is File::Spec on our system?
use Class::Inspector;
my $filename = Class::Inspector->resolved_filename( 'File::Spec' );

# Load File::Spec as a document
use PPI;
my $Document = PPI::Document->new( $filename );

Everything you do with PPI starts and finishes with PPI::Document objects. If you find yourself using the lexer directly, you are probably doing something wrong.

Where can I start cutting out the fat? For starters, many core modules have an __END__ section.

# Get the (one and only) __END__ section
my $End = $Document->find_first( 'Statement::End' );
  
# Delete it from the document
$End->delete if $End;

PPI provides a set of search methods that you can use on any element that has children. find_first is a safe guess, because there can only be one __END__ section. The search methods actually take &wanted functions like File::Find, so 'Statement::End' is really syntactic sugar for:

sub wanted {
    my ($Document, $Element) = @_;
    $Element->isa('PPI::Statement::End');
}

Of course, there's a faster way to do the same thing. The prune method finds and immediately deletes all elements that match a particular condition.

# Delete all comments and POD
$Document->prune( 'Token::Pod' );
$Document->prune( 'Token::Comment' );

For a more serious example, here's how to strip the non-compulsory braces from ->method():

# Remove useless braces
$Document->prune( sub {
    my $Braces = $_[1];
    $Braces->isa('PPI::Structure::List')      or return '';
    $Braces->children == 0                    or return '';
    my $Method = $Braces->sprevious_sibling   or return '';
    $Method->isa('PPI::Token::Word')          or return '';
    $Method->content !~ /:/                   or return '';
    my $Operator = $Method->sprevious_sibling or return '';
    $Operator->isa('PPI::Token::Operator')    or return '';
    $Operator->content eq '->'                or return '';
    return 1;
    } );

It's a little bit wordy, but is relatively straightforward to write. Just add conditions and discard as you go. You can get other elements, calculate anything or call sub-searches.

When you have finished, be sure to save the file.

# Save the file
$Document->save( "$filename.squish" );

Wrapping It All Up

All you need to do now is wrap it all up in some typical module boilerplate.

package Perl::Squish;
  
use strict;
use PPI;
  
our $VERSION = '0.01';
  
# Squish a file in place
# Perl::Squish->file( $filename )
sub file {
    my ($class, $file) = @_;
    my $Document = PPI::Document->new( $file ) or return undef;
    $class->document( $Document ) or return undef;
    $Document->save( $file );
}
  
# Squish a document object
# Perl::Squish->document( $Document );
sub document {
    my ($squish, $Document) = @_;
      
    # Remove the stuff we did earlier
    $Document->prune('Statement::End');
    $Document->prune('Token::Comment');
    $Document->prune('Token::Pod');
      
    $Document->prune( sub {
        my $Braces = $_[1];
        $Braces->isa('PPI::Structure::List')      or return '';
        $Braces->children == 0                    or return '';
        my $Method = $Braces->sprevious_sibling   or return '';
        $Method->isa('PPI::Token::Word')          or return '';
        $Method->content !~ /:/                   or return '';
        my $Operator = $Method->sprevious_sibling or return '';
        $Operator->isa('PPI::Token::Operator')    or return '';
        $Operator->content eq '->'                or return '';
        return 1;
        } );

    # Let's also do some whitespace cleanup.
    # find() returns an array reference, or false if nothing matched.
    my $whitespace = $Document->find('Token::Whitespace') || [];
    foreach ( @$whitespace ) {
        $_->{content} = $_->{content} =~ /\n/ ? "\n" : " ";
    }
      
    1;
}
  
1;

Finally, the last step is to wrap it all up as a proper module. You can see the finished product prettied up with PPI's syntax highlighter at CPAN::Squish. I've added a few additional small features to the basic code described above, but you get the idea. See also Perl::Squish for more details.

In 15 minutes, I've knocked together a pretty simple module that dramatically improves on what you could do without something like PPI. Now imagine the hard things it makes possible.

This Week in Perl 6, May 25, 2005-May 31, 2005

All~

Welcome to another Perl 6 summary, brought to you by Aliya's new friends, Masha Nannifer and Philippe, and my own secret running joke. Without further ado, I bring you Perl 6 Compiler.

Perl 6 Compiler

Method Chaining

Alex Gutteridge discovered that he couldn't chain attribute access such as $bowl.fish.eyes.say; in Pugs. Later he provided his example in test form (in case anyone wanted to add it). Maybe someone added them, maybe not; Warnock applies.

Pugs Link Issues on Debian Sid

BÁRTHÁZI András had trouble making Pugs work on Debian Sid with Perl 5 support. Autrijus provided helpful pointers. I assume from his final silence that the final pointer worked.

Pugs.AST.* Compilation

Samuel Bronson wanted to speed up the compilation of Pugs.AST.* modules by turning off optimizations. Autrijus told him that this was a core module that needed its speed, and optimizations would stay.

Pugs.Types Export List

Samuel Bronson added an export list to Pugs.Types. Autrijus happily applied it and sent him a commit bit.

Export withArgs from Main

Samuel Bronson added an export to Main. Samuel Bronson happily applied it himself this time.

Out-of-Date hs-plugins

Vadim had trouble compiling Pugs with Parrot support. Autrijus helped him fix his problem, and there was much rejoicing.

chomp Problem

Jens Rieks found a problem with chomp and submitted a test. Warnock applies.

Pugs makefile Issue

Grégoire Péan noticed that Pugs was creating a useless Pugs.exe.bat. Autrijus asked if he would be willing to investigate a patch. He responded that he would put it in his queue.

loop or do

Gerd Pokorra wondered why do { ... } was in Pugs, reasoning that loop { ... } while was the correct thing. Luke Palmer explained that do { ... } was part of the with-or-without-a-postfix while.

PxPerl 5.8.6.2 with Pugs 6.2.5 and Parrot 0.2.0

Grégoire Péan announced the release of PxPerl 5.8.6.2, which includes Pugs 6.2.5 and Parrot 0.2.0. This means that Windows folk can test Pugs and Parrot without having to fight with compilers.

BUILD Errors

Carl Franks found the handling of a named argument to a constructor confusing. He asked for confirmation but no one provided it. Perhaps this poor summary can save him.

White Space and Function Calls

David D. Zuhn didn't know about the forbidding of white space between a function call and its parentheses. Carl told him that and about the .() variant that allows white space.

Pugs' make clean Issues Long Commands

Carl Franks noticed that make clean issued a command so long that it broke his nmake. Fortunately he had a really old nmake and updating fixed the problem.

Parrot

thr_windows.h with MinGW

François Perrad provided a patch fixing two compilation problems in thr_windows.h. Warnock applies.

Parrot Slides?

Adam Preble posted a request for slides and notes on Parrot and Perl 6 for a presentation he was working on. Many people provided links in various languages. I usually steal from Dan's presentations when I need something like this.

Problems with Perl 5.6.1

François Perrad had a problem building Parrot with MinGW and Perl 5.6.1, related to Windows and its binary vs. text distinction. This problem will also crop up if you ever try to seek on files in Windows; not that I have ever lost several days debugging that problem.

Ordered Hash Thoughts

Leo posted his thoughts on a modification to ordered hash, as adding a new element by index breaks the string-hashing part of it. Dan suggested that the ordered hash just pitch exceptions in the bad cases, as it was designed to be lightweight and fast.

Subrules Tests

Dino Morelli provided a patch adding tests for subrules to PGE. Warnock applied, at least until Patrick read this summary.

Python on Parrot

Bloves inquired as to the state of Python on Parrot. The phrasing of the question itself provided some confusion. Michal Wallace provided a link to Pirate, hoping it would help.

Resizable*Array Defeats list.c

Slowly but steadily, my {Fixed,Resizable}<type>Array PMCs are defeating the less consistent array implementations. Leo offered the job of slaying list.c to any interested party. Jerry Gay expressed interest.

Encodings on "Char Constants"

Bob Rogers wants to be able to supply an encoding for his character constants that use '. He also wanted to find the integer that corresponds to a character. Leo explained how he could do the former and that ord is useful for the latter.

Broken Links

Fayland Lam pointed out that the links from the last summary were a little broken. Hopefully this summary will be better.

Refcounts and DOD

Michal Wallace wondered how best to make Python's refcounts work for embedding it in Parrot. Nicholas Clark pointed out that Parrot_[un]register_pmc would work. Dan noted that if the Python library were to become a Parrot extension, these could become no-ops as Parrot's DOD would do the necessary work.

BigInt Fixes

Kevin Tew added some tests and fixes to BigInt.pmc. Leo applied the patch.

MinGW and GMP

François Perrad provided a patch fixing GMP for MinGW. Leo provided a slight correction, which François incorporated. Leo then applied the patch.

index Failures

Roger Browne found a failure in the index opcode. Leo fixed it.

MinGW and GDBM

François Perrad provided a patch fixing GDBM for MinGW. Leo applied the patch.

mod Operation Fails with Negative Integers

Roger Browne noticed that moding with or by negative integers could produce negatives, such as 3 mod -3 = -3. Leo fixed them to provide 0. I hate that fact about C; not that I have ever lost several days debugging that problem, either.

Tracing and Debugging

Leo noted that debugging Parrot has grown more difficult, as the number of abstractions has increased. Is your compiler, IMCC, your PMC, or Parrot broken? Maybe two or three of them? To facilitate debugging, Leo suggested a debug_break opcode and a Debugger PMC. It sounds nifty. He also added support for lexically scoped trace and debug flags.

Adding Unicode, Hex, and Octal Escapes

Will Coleda added more complete escape sequence support to Tcl. Matt Diephouse integrated the patch into his latest version.

The State of ParTcl

Will Coleda proudly noted that as of r8193, ParTcl passed all tests, even with gc-debug. Much praise goes to Matt Diephouse for his cleanup of the Tcl parser.

Strength Reduction Optimization

Curtis Rawls provided a flurry of patches improving Parrot's strength reduction optimization. Leo applied the patches.

TODO: readline Support

Leo put out a request for adding readline support to Parrot.

get_mmd_dispatch_type Fix

Vladimir Lipsky provided a patch that fixes a bug in mmd.c. Leo applied it.

Uninitialized Variable

Vladimir Lipsky fixed an uninitialized variable. Leo applied the patch.

Improved loadlib Handling

Bob Rogers improved loadlib's handling of absolute paths. Leo applied the patch.

DOD Sweep Fix

Vladimir Lipsky prevented the NULL PMC_EXT from being added to the PMC ext pool during a DOD sweep. Leo applied the patch.

Packfile Double Destroy

Vladimir Lipsky fixed a problem with double destruction of nested packfiles. Leo applied the patch.

Tcl Autoconverts List <-> String

Tcl can autoconvert between lists and strings. Will Coleda wondered how to implement this behavior to best support language interoperability. People offered suggestions, but there was no real agreement on the best solution.

TODO Classification

Chromatic, inspired by Pugs, added TODO classification to Parrot::Test. He threatened to apply the patch if there were no objections ... none yet.

HLL Group Support

Leo added support for loading high-level language PMC groups dynamically using the .HLL directive. This will load the lib dynamically and change the return type of some ops to reflect the HLL's preferences. It is nifty.

Pistol-Wielding Parrot

Leo put out a request for PIR versions of the Computer Language Shootout tests. This will provide a means of gauging Parrot's performance against other languages. Kinda nifty.

nmake v1.5 Issues

Nigel Sandever also had trouble with overly long lines and nmake. Upgrading nmake fixed his problem, too.

Optimizer Producing Bad Code

Nick Glencross noticed that the optimizer was producing some bad code. Leo fixed one of the problems, but missed the other.

Keys Design

Dan posted an explanation of his original design for keys.

Loop Improvements

Curtis Rawls provided a patch to improve the loop struct in the IMCC optimizer. Leo applied it and asked if he would take a whack at reducing the resource consumption of Bill Coffman's register allocation patch. Dan and Bob Rogers both expressed interest in speeding up the compiler.

Perl 6 Language

Hash Slices

Carl Franks thought he was having trouble with hash slices. Actually he was having trouble with the s/->/./ in his Perl 5 conversion.

Perl 6 and Refactoring Support

Piers Cawley resurrected Matisse Enzer's thread about IDEs and tools for Perl 6. He observed that Perl 6 might provide a great deal of support for such things. Deborah Pickett noted that it might not be theoretically possible to parse Perl 6 safely. Luke Palmer felt that it would not be possible, given BEGIN blocks and the like.

$*OS but OS::unix

Rob Kinyon suggested that $*OS be a class that mixes in the correct OS::class. Then MMD could do the heavy lifting. I like this idea.

Reduction Junctions and Cribbage Scoring

Rob Kinyon wanted to use junctions and reductions to score cribbage hands. Unfortunately, he used junctions as a set. This led to discussions of the correct implementation, and of a Set module that should be part of Perl 6. I want such a set module to have a powerset function that returns the powerset of a particular set (preferably lazily instantiated). Also, my cribbage scoring algorithm is better: 1) lay down hand, 2) announce score, 3) peg.

Syntax for Using Perl 5

Autrijus added support to Pugs for using Perl 5 modules. This led him to wonder what the correct syntax for this actually was. Many suggestions, but no decisions, arrived.

MMD and SMD Interaction

Yuval Kogman wondered how MMD and SMD would interact. Warnock applied.

Making Perl 6 Grammars Generative

Aufrank wondered if Perl 6 grammars could be made generative. I would say that this does not belong in the core simply because of its niche application; however, if I were to do this, I would start by using the Perl 6 grammar grammar and modify the way the parse tree is used. Sadly, aufrank posted to Google Groups, so nobody else expressed opinions.

Links and References

Thomas Sandlaß suggested a Link class to fill the role of auto-dereferencing variables that Luke called "transparent" references.

Use Syntax

Rob Kinyon wondered how exactly ranges of versions and multi-language interoperability would work in Perl 6. Rod Adams provided a few answers.

Anonymous Classes

Simon Cozens announced that he was having a lot of fun converting Maypole to Perl 6. Then he asked how to make anonymous subclasses that inherit from other classes and add new methods. Ingo Blechschmidt provided the answers.

Introspectable Code Objects

Ingo Blechschmidt thought it would be nifty if code objects were fully introspectable. Luke agreed, but felt that being able to access them at the statement level might be problematic. I think most of what Ingo would want this for is doable with macros that parse normally (or modify a block), but then munge the resulting match object appropriately.

Signatures As First-Class Types

Yuval Kogman hoped that signatures would be available as first-class types in Perl 6. Ingo Blechschmidt agreed. Sam Vilain pointed to the start of such a translation.

new and MMD

Carl Franks wanted to specialize new on its arguments using MMD. Damian told him that his technique was a way to do it, and that bless would still call BUILDALL.

Code Ownership and Debuggability

Yuval Kogman posted his thoughts on code ownership and debuggability in the age of frameworks and generated code.

Strongly Typed Containers

Yuval Kogman wondered how to make a strongly typed container class (similar to C++ templates). Sam Vilain provided a pointer to earlier threads and mentioned Haskell's Generalized Algebraic Data Types.

Constants and Optimizations

Ingo Blechschmidt wondered how to create constants so that the optimizer would be able to do as much as possible. Damian suggested that macros would be one solution. This makes me wonder if there is a way to declare a function so that the compiler can create a macro version for constants automatically. That would be nifty.

Date/Time Formatting

Nathan Gray wondered what sort of date/time formatting Perl 6 would support. Rob Kinyon suggested porting DateTime. This certainly sounds like something that belongs in a module.

And Pow, I Found Illumination

After much discussion, @Larry has weighed in on the thread I started about reductions on empty lists. Damian feels that it should fail, as finding an appropriate identity operator is no simple task. Somehow a discussion of modulus and division slipped in too.

Sub Call Versus MMD

Luke Palmer wanted a sanity check on sub calls versus MMD. Larry provided an answer, but did not weigh in on his sanity.

(1,(2,3),4)[2]

People continue to feel very divided about (1,(2,3),4)[2] and @x = [1,2,3]. There are strong opinions on both sides about arrayrefs in array and scalar contexts. I appear to be allied with the losing side. Hopefully things will change.

Unicode Cheat Sheet

Rob Kinyon posted a request for a Unicode cheat sheet so he could make his own nifty symbols. Gaal Yahas and Sam Vilain provided pointers.

Comprehensive List of Rules Tokens

Jeff Pinyan wants a comprehensive list of Perl 6 rule tokens so he can create a Perl 5 module to parse them. Much discussion ensued.

Default Invocant of Methods

Ingo Blechschmidt wondered if a method's Class could be a default invocant or only instances of it. Somehow this led Larry to musing about Class as a role that people can mix in instead of inherit from. He confuses me.

The Usual Footer

Posting via the Google Groups interface does not work. To post to any of these mailing lists please subscribe by sending email to perl6-internals-subscribe@perl.org, perl6-language-subscribe@perl.org, or perl6-compiler-subscribe@perl.org. If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send feedback to

Catalyst

Web frameworks are an area of significant interest at the moment. Now that we've all learned the basics of web programming, we're ready to get the common stuff out of the way to concentrate on the task at hand; no one wants to spend time rewriting the same bits of glue to handle parameter processing, request dispatching, and the like.

A model currently favored for web applications is MVC, or Model-View-Controller. This design pattern, originally from Smalltalk, supports the separation of the three main areas of an application--handling application flow (Controller), processing information (Model), and outputting results (View)--so that it is possible to change or replace any one without affecting the others.

Catalyst is a new MVC framework for Perl. It is currently under rapid development, but the core API is now stable, and a growing number of projects use it. Catalyst borrows from other frameworks, such as Ruby on Rails and Apache Struts, but its main goal is to be a flexible, powerful, and fast framework for developing any type of web project in Perl. This article, the first of a series of two, introduces Catalyst and shows a simple application; a later article will demonstrate how to write a more complex project.

Inspirations

Catalyst grew out of Maypole, an MVC framework developed by Simon Cozens (and discussed last year on Perl.com; see "Rapid Web Application Development with Maypole," for example). Maypole works well for typical CRUD (Create, Retrieve, Update, Delete) databases on the Web. It includes a variety of useful methods and prewritten templates and template macros that make it very easy to set up a powerful web database. However, it focuses so strongly on CRUD that it is less flexible for other tasks. One of the goals of Catalyst is to provide a framework well suited for any web-related project.

Ruby on Rails was another inspiration; this popular system has done much to promote interest in the Ruby programming language. Features we borrowed from RoR are the use of helper scripts to generate application components and the ability to have multiple controllers in a single application. Both RoR and Struts allow the use of forwarding within applications, which also proved useful for Catalyst.

Features

Speed

We planned Catalyst as an enterprise-level framework, able to handle a significant load. It makes heavy use of caching. Catalyst applications register their actions in the dispatcher at compile time, making it possible to process runtime requests quickly, without needing elaborate checks. Regex dispatches are all precompiled. Catalyst builds only the structures it needs, so there are no delays to generate (for example) unused database relations.

Simplicity

Components

Catalyst has many prebuilt components and plugins for common modules and tasks. For example, there are View classes available for Template Toolkit, HTML::Template, Mason, Petal, and PSP. Plugins are available for dozens of applications and functions, including Data::FormValidator, authentication based on LDAP or Class::DBI, several caching modules, HTML::FillInForm, and XML-RPC.

Catalyst supports component auto-discovery; if you put a component in the correct place, Catalyst will find and load it automagically. Just place a Catalog controller in /AppName/Controller/Catalog.pm (or, in practice, in the shortened /AppName/C/Catalog.pm); there's no need to use each item. You can also declare plugins in the application class with short names, so that:

use Catalyst qw/Email Prototype Textile/;

will load Catalyst::Plugin::Email, Catalyst::Plugin::Prototype, and Catalyst::Plugin::Textile in one shot.
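
As a rough illustration of what the short names stand for, the following plain-Perl sketch expands them into full class names and shows the file each class name corresponds to under Perl's usual module naming rule. The expand_plugins and class_to_file helpers are hypothetical names made up for this example; this is not Catalyst's own loading code.

```perl
use strict;
use warnings;

# Hypothetical helper: turn short plugin names into full class names
sub expand_plugins {
    my ( $prefix, @short_names ) = @_;
    return map { $prefix . '::' . $_ } @short_names;
}

# Standard Perl convention: Foo::Bar lives in Foo/Bar.pm somewhere in @INC
sub class_to_file {
    my ($class) = @_;
    ( my $file = "$class.pm" ) =~ s{::}{/}g;
    return $file;
}

my @classes = expand_plugins( 'Catalyst::Plugin', qw/Email Prototype Textile/ );
my @files   = map { class_to_file($_) } @classes;

print "$classes[$_] => $files[$_]\n" for 0 .. $#classes;
```

The same naming rule is what makes auto-discovery possible: a component's location on disk is enough to work out its class name.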

Development

Catalyst comes with a built-in lightweight HTTP server for development purposes. This runs on any platform; you can quickly restart it to reload any changes. This server functions similarly to production-level servers, so you can use it throughout the testing process--or longer; it's a great choice if you want to deliver a self-contained desktop application. Scalability is simple, though: when you want to move on, it is trivial to switch the engine to use plain CGI, mod_perl1, mod_perl2, FastCGI, or even the Zeus web server.

Debugging (Figure 2) and logging (Figure 1) support is also built-in. With debugging enabled, Catalyst sends very detailed reports to the error log, including summaries of the loaded components, fine-grained timing of each action and request, argument listings for requests, and more. Logging works by using the Catalyst::Log class; you can log any action for debugging or information purposes by adding lines like:

$c->log->info("We made it past the for loop");
$c->log->debug( $sql_query );

Log screenshot
Figure 1. Logging

Crashes will display a flashy debug screen showing details of relevant data structures, software and OS versions, and the line numbers of errors.

Debug screenshot
Figure 2. Debugging

Helper scripts, generated with Template Toolkit, are available for the main application and most components. These allow you to quickly generate starter code (including basic unit tests) for the application framework. With a single line, you can create a Model class based on Class::DBI that pulls in the appropriate Catalyst base model class, sets up the pattern for the CDBI configuration hash, and generates a perldoc skeleton.

Flexibility

Catalyst allows you to use multiple models, views, and controllers--not just as an option when setting up an application, but as a totally flexible part of an application's flow. You can mix and match different elements within the same application or even within the same method. Want to use Class::DBI for your database storage and LDAP for authentication? You can have two models. Want to use Template Toolkit for web display and PDF::Template for print output? No problem.

Catalyst uses a simple building-block approach to its add-ins: if you want to use a component, you say so, and if you don't say so, Catalyst won't use it. With so many components and plugins available, based on CPAN modules, it's easy to use what you want, but you don't have to use something you don't need.

Catalyst features advanced URL-to-action dispatching. There are multiple ways to map a URL to an action (that is, a Catalyst method), depending on your requirements. First, there is literal dispatching, which will match a specific path:

package MyApp::C::Quux;

# matches only http://localhost:3000/foo/bar/yada
sub baz : Path('foo/bar/yada') { }

A top-level, or global, dispatch matches the method name directly at the application base:

package MyApp::C::Foo;

# matches only http://localhost:3000/bar
sub bar : Global { }

A local, or namespace-prefixed, dispatch acts only in the namespace derived from the name of your Controller class:

package MyApp::C::Catalog::Product;

# matches http://localhost:3000/catalog/product/buy
sub buy : Local { }

package MyApp::C::Catalog::Order;

# matches http://localhost:3000/catalog/order/review
sub review : Local { }

The most flexible is a regex dispatch, which acts on a URL that matches the pattern in the key. If you use capturing parentheses, the matched values are available in the $c->request->snippets array.

package MyApp::C::Catalog;

# will match http://localhost:3000/item23/order189
sub bar : Regex('^item(\d+)/order(\d+)$') { 
   my ( $self, $c ) = @_;
   my $item_number  = $c->request->snippets->[0];
   my $order_number = $c->request->snippets->[1];
   # ...    
}

The regex will act globally; if you want it to act only on a namespace, use the name of the namespace in the body of the regex:

sub foo : Regex('^catalog/item(\d+)$') { # ...
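
The relationship between capturing parentheses and the snippets array can be tried out in plain Perl, without Catalyst. The sketch below reuses the two patterns from the examples above; snippets_for is a hypothetical function for illustration only, and the sample paths are assumptions.

```perl
use strict;
use warnings;

# Illustration only: return the captured groups for a path, in order.
# In list context a successful match yields its captures, which is the
# shape of data that the snippets array holds.
sub snippets_for {
    my ( $pattern, $path ) = @_;
    my @captures = $path =~ $pattern;
    return @captures;
}

# The global pattern from the bar action above.
my ( $item_number, $order_number ) =
  snippets_for( qr{^item(\d+)/order(\d+)$}, 'item23/order189' );
print "item=$item_number order=$order_number\n";

# The namespace-restricted pattern: it only matches under catalog/.
my @inside  = snippets_for( qr{^catalog/item(\d+)$}, 'catalog/item7' );
my @outside = snippets_for( qr{^catalog/item(\d+)$}, 'item7' );
print "inside=@inside outside_count=", scalar(@outside), "\n";
```

A path that does not match yields an empty list, so the second pattern captures 7 for catalog/item7 and nothing for item7.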

Finally, you can have private methods, which are never available through URLs. You can only reach them from within the application, with a namespace-prefixed path:

package MyApp::C::Foo;
# matches nothing, and is only available via $c->forward('/foo/bar').
sub bar : Private { }

A single Context object ($context, or more usually its alias, $c) is available throughout the application, and is the primary way of interacting with other elements. Through this object, you can access the request object ($c->request->params will return or set parameters, $c->request->cookies will return or set cookies), share data among components, and control the flow of your application. A response object contains response-specific information (for example, $c->response->status(404) sets the status code), and the Catalyst::Log class is directly available, as shown above. The stash is a universal hash for sharing data among application components:

$c->stash->{error_message} = "You must select an entry";

# then, in a TT template:
[% IF error_message %]
   <h3>[% error_message %]</h3>
[% END %]

Stash values go directly into the templates, but the entire context object is also available:

<h1>[% c.config.name %]</h1>

To show a Mason example, if you want to use Catalyst::View::Mason:

% foreach my $k (keys %{ $c->req->params }) {
  param: <% $k %>: value: <% $c->req->params->{$k} %>
% }

Sample Application: MiniMojo, an Ajax-Based Wiki in 30 Lines of Written Code

Now that you have a sense of what Catalyst is, it's time to look at what it can do. The example application is MiniMojo, a wiki based on Ajax, a technique that uses JavaScript and the XMLHttpRequest object to create highly dynamic web pages without needing to send full pages back and forth between the server and client.

Remember that from the Catalyst perspective, Ajax is just a case of sending more text to the browser, except that this text is in the form of client-side JavaScript that talks to the server, rather than a boilerplate copyright notice or a navigation sidebar. It makes no difference to Catalyst.

Installation

Catalyst has a relatively large number of requirements; most, however, are easy to install, along with their dependencies, from CPAN. The following list should take care of everything you need for this project:

Generate the Application Skeleton

Run these commands:

$ catalyst.pl MiniMojo
$ cd MiniMojo

You've just created the skeleton for your entire application, complete with a helper script keyed to MiniMojo to generate individual classes, basic test scripts, and more.

Run the built-in server:

$ script/minimojo_server.pl

MiniMojo is already running, though it isn't doing much just yet. Point a web browser to http://localhost:3000/ and you should receive a web page consisting solely of the text "Congratulations, MiniMojo is on Catalyst!" Press Ctrl-C to stop the server.

Add Basic Methods to Your Application Class

Add a private end action to your application class, lib/MiniMojo.pm, by editing the new file:

sub end : Private {
    my ( $self, $c ) = @_;
    $c->forward('MiniMojo::V::TT') unless $c->res->output;
}

Catalyst automatically calls the end action at the end of a request cycle. It's one of four built-in Private actions. It's a typical pattern in Catalyst to use end to forward the application to the View component for rendering, though if necessary you could do it yourself (for example, if you want to use different Views in the same application--perhaps one to generate web pages with Template Toolkit and another to generate PDFs with PDF::Template).

Replace the existing, helper-generated default action in the same class with:

sub default : Private {
    my ( $self, $c ) = @_;
    $c->forward('/page/show');
}

If the client has specified no other appropriate action, this forwards to the page controller's show method. Because these are Private actions, nothing outside the application can call them, but any method within the application can. The default action is another built-in Private action, along with begin, auto, and end. Again, Catalyst calls them automatically at relevant points in the request cycle.
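
As a rough mental model of where the built-in actions sit in a request, here is a toy dispatcher in plain Perl. None of it is Catalyst code: run_cycle and the two hashes are hypothetical, and the ordering shown (begin, auto, the matched action or default, then end) is an assumption made for this sketch rather than something the text above spells out.

```perl
use strict;
use warnings;

# Toy model of one request cycle. Nothing here is Catalyst code; the
# ordering (begin, auto, matched action or default, end) is an assumption.
sub run_cycle {
    my ( $builtin, $public, $path ) = @_;
    my @trace;

    # Call an action if it is defined, recording its name in the trace.
    my $call = sub {
        my ( $name, $code ) = @_;
        return unless $code;
        push @trace, $name;
        $code->();
    };

    $call->( 'begin', $builtin->{begin} );
    $call->( 'auto',  $builtin->{auto} );

    # Use the action registered for this path, or fall back to default.
    if ( my $action = $public->{$path} ) {
        $call->( $path, $action );
    }
    else {
        $call->( 'default', $builtin->{default} );
    }

    $call->( 'end', $builtin->{end} );
    return @trace;
}

# MiniMojo defines only default and end in its application class.
my %builtin = ( default => sub { }, end => sub { } );
my %public  = ( 'page/edit' => sub { } );

print join( ' -> ', run_cycle( \%builtin, \%public, 'page/edit' ) ),    "\n";
print join( ' -> ', run_cycle( \%builtin, \%public, 'no/such/page' ) ), "\n";
```

With only default and end defined, the first call prints page/edit -> end and the second prints default -> end: a request that matches nothing else falls through to default, and end runs either way.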

Set Up the Model (SQLite Database) and Use the Helper to Create Model Classes

Next, create a file, minimojo.sql, that contains the SQL for setting up your page table in SQLite.

-- minimojo.sql
CREATE TABLE page (
    id INTEGER PRIMARY KEY,
    title TEXT,
    body TEXT
);

Create a database from it, using the sqlite command-line program:

$ sqlite minimojo.db < minimojo.sql

Depending on your setup, it might be necessary to call this as sqlite3.

Use the helper to create model classes and basic unit tests (Figure 3 shows the results):

$ script/minimojo_create.pl model CDBI CDBI dbi:SQLite:/path/to/minimojo.db

Figure 3. Creating the model

The minimojo_create.pl script is a helper that uses Template Toolkit to automate the creation of particular modules. The previous command creates a model (in contrast to a controller or a view) called CDBI.pm, using the CDBI helper and setting the connection string to dbi:SQLite:/path/to/minimojo.db, the database you just created. (Use the appropriate path for your system.) The helper will write the models into lib/MiniMojo/M/. The helper scripts accept various options; the only required arguments are the type and the name. (You can also create your own modules from scratch, without using the helper.)

Set Up the View (Template Toolkit) and Use the Helper to Create View Classes

Use the helper to create a view class:

$ script/minimojo_create.pl view TT TT

View classes go into lib/MiniMojo/V/.

Set Up a Controller Class Using the Helper

Create a controller class called Page with the helper:

$ script/minimojo_create.pl controller Page

Controller classes live in lib/MiniMojo/C/.

Add a show action to lib/MiniMojo/C/Page.pm:

sub show : Regex('^(\w+)\.html$') {
    my ( $self, $c ) = @_;
    $c->stash->{template} = 'view.tt';
    # $c->forward('page');
}

The Regex dispatch matches a URL of the form foo.html, where foo is any sequence of word characters. This sequence is available in the $context->request->snippets array, where the page action uses it to display an existing page or to create a new one. The rest of this action sets the appropriate template and sends the application to the page action. (Leave the forward command commented out until you have written the page action.)

Restart the server with $ script/minimojo_server.pl and point a web browser to http://localhost:3000/show/ to see the debug screen (you don't yet have the template that show is trying to send people to).

Create root/view.tt:

<html>
    <head><title>MiniMojo</title></head>
    <body>
        <h1>MiniMojo is set up!</h1>
    </body>
</html>

Test again by killing the server with Ctrl-C and restarting it, and go to http://localhost:3000/show/. You should see the page you just defined.

Add the Display and Edit Code

Modify the application class lib/MiniMojo.pm to include the Prototype and Textile plugins:

use Catalyst qw/-Debug Prototype Textile/;

Note that you can use the plugins by specifying their base names; Catalyst figures out what you mean without making you use Catalyst::Plugin::Prototype.

Modify the page controller, lib/MiniMojo/C/Page.pm, to add page-view and editing code:

sub page : Private {
    my ( $self, $c, $title ) = @_;
    $title ||= $c->req->snippets->[0] || 'Frontpage';
    my $query = { title => $title };
    $c->stash->{page} = MiniMojo::M::CDBI::Page->find_or_create($query);
}

The private page method sets a title--whether passed in to it, taken from the snippets array (that matches the regex in show), or defaulting to "Frontpage." The $query variable holds a hashref used for Class::DBI's find_or_create method, seeding the stash for the page variable with the result of this CDBI query. At the end of the method, control flow returns to the calling method.
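
The fallback chain and the find-or-create step can be sketched without a database. In the code below, %pages is an in-memory stand-in for the page table, and resolve_title and find_or_create_page are hypothetical names used only for illustration; the real action relies on Class::DBI's find_or_create against SQLite.

```perl
use strict;
use warnings;

# In-memory stand-in for the page table (title => row). Illustrative only.
my %pages;

# Same fallback chain as the page action: an explicit argument first, then
# the first regex capture, then the literal 'Frontpage'.
sub resolve_title {
    my ( $title, $snippets ) = @_;
    $title ||= $snippets->[0] || 'Frontpage';
    return $title;
}

# Hypothetical in-memory equivalent of find_or_create: return the existing
# row for this title, or create an empty one and return that.
sub find_or_create_page {
    my ($title) = @_;
    return $pages{$title} ||= { title => $title, body => undef };
}

print resolve_title( 'Explicit', ['FromUrl'] ), "\n";
print resolve_title( undef,      ['FromUrl'] ), "\n";
print resolve_title( undef,      [] ),          "\n";

my $first  = find_or_create_page('Frontpage');
my $second = find_or_create_page('Frontpage');
print "same row\n" if $first == $second;
```

Because the chain uses ||, any false value (undef, an empty string, or the string "0") falls through to the next candidate, which is acceptable for a sketch of this size.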

Now uncomment the $c->forward('page'); line in the show action, then add an edit action to the same controller:

sub edit : Local {
    my ( $self, $c, $title ) = @_;
    $c->forward('page');
    $c->stash->{page}->body( $c->req->params->{body} )
      if $c->req->params->{body};
    my $body = $c->stash->{page}->body || 'Just type something...';
    my $html = $c->textile->process($body);

    my $base = $c->req->base;
    $html    =~ s{(?<![\?\\\/\[])(\b[A-Z][a-z]+[A-Z]\w*)}
                 {<a href="$base$1.html">$1</a>}g;

    $c->res->output($html);
}

The edit method first forwards the action off to page, so that the stash's page object contains the result of the CDBI query. If the request supplies a value for body, the method sets it on the page object; the text to render is then the page's body or, if that is empty, the default "Just type something..." The code then processes the body with Textile, which converts plain text to HTML, and then runs the result through a regex to convert camel-case text into links, with the URL base taken from the Catalyst request object. Finally, it outputs the HTML.
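
To see what the camel-case substitution does on its own, it can be lifted out of the action and run without Catalyst or Textile. The link_wiki_words wrapper and the sample strings below are assumptions for illustration; the regex itself is copied unchanged from the edit action.

```perl
use strict;
use warnings;

# The substitution from the edit action, wrapped in a standalone function.
# The wrapper name is hypothetical; the pattern is the one shown above.
sub link_wiki_words {
    my ( $html, $base ) = @_;

    # Link any word that starts with a capital, continues in lower case,
    # and contains a later capital, unless it is preceded by ?, \, / or [.
    $html =~ s{(?<![\?\\\/\[])(\b[A-Z][a-z]+[A-Z]\w*)}
              {<a href="$base$1.html">$1</a>}g;

    return $html;
}

my $base = 'http://localhost:3000/';

print link_wiki_words( 'See FrontPage for help.',               $base ), "\n";
print link_wiki_words( 'Not linked: /FrontPage and Frontpage.', $base ), "\n";
```

The first line comes back with FrontPage wrapped in a link to http://localhost:3000/FrontPage.html. The second comes back unchanged: the lookbehind rejects a camel-case word that follows a slash, and Frontpage has no second capital.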

Set Up the Wiki with Ajax

Modify root/view.tt to include Ajax code:

<html>
     <head><title>MiniMojo</title></head>
     [% c.prototype.define_javascript_functions %]
     [% url = base _ 'page/edit/' _ page.title %]
     <body Onload="new Ajax.Updater( 'view',  '[% url %]' )">
         <h1>[% page.title %]</h1>
         <div id="view"></div>
         <textarea id="editor" rows="24" cols="80">[% page.body %]</textarea>
         [% c.prototype.observe_field( 'editor', {
             url => url,
             with => "'body='+value",
             update => 'view' }
         ) %]
     </body>
</html>

The line:

[% c.prototype.define_javascript_functions %]

includes the whole prototype.js library in a script block. Note that the prototype plugin is available in the context object.

The section

[% url = base _ 'page/edit/' _ page.title %] 
<body Onload="new Ajax.Updater( 'view',  '[% url %]' )">
<h1>[% page.title %]</h1>
<div id="view"></div>

constructs the Ajax URL and updates the view div when loading the page.

Finally:

<textarea id="editor" rows="24" cols="80">[% page.body %]</textarea>
    [% c.prototype.observe_field( 'editor', {
        url => url,
        with => "'body='+value",
        update => 'view' }
    ) %]

periodically checks the textarea for changes and makes an Ajax request on demand.

That's it! Now you can re-run the server and your wiki is up and running (Figure 4). To use the wiki, simply start typing in the textarea. As you type, the wiki will regularly echo your entry above, passing it through the formatter. When you type something in camel case, it will automatically create a link you can click to go to the new page.

Figure 4. The running wiki

Enjoy your new Catalyst-powered Ajax wiki!

Resources

For more information, see the Catalyst documentation, in particular the Catalyst::Manual::Intro module, which gives a thorough introduction to the framework. There are two Catalyst mailing lists, a general list and a developer list. The best place to discuss Catalyst, though, is the #catalyst IRC channel at irc.perl.org. The Catalyst home page is currently just a collection of a few links, but we will extend it in the near future.

Thanks to Catalyst lead developer Sebastian Riedel for help with this article and, of course, for Catalyst itself.
