July 2003 Archives

Exegesis 6

Editor's note: this document is out of date and remains here for historic interest. See Synopsis 6 for the current design information.

As soon as she walked through my door I knew her type: she was an argument waiting to happen. I wondered if the argument was required... or merely optional? Guess I'd know the parameters soon enough.

"I'm Star At Data," she offered.

She made it sound like a pass. But was the pass by name? Or by position?

"I think someone's trying to execute me. Some caller."

"Okay, I'll see what I can find out. Meanwhile, we're gonna have to limit the scope of your accessibility."

"I'd prefer not to be bound like that," she replied.

"I see you know my methods," I shot back.

She just stared at me, like I was a block. Suddenly I wasn't surprised someone wanted to dispatch her.

"I'll return later," she purred. "Meanwhile, I'm counting on you to give me some closure."

It was gonna be another routine investigation.

— Dashiell Hammett, "The Maltese Camel"

This Exegesis explores the new subroutine semantics described in Apocalypse 6. Those new semantics greatly increase the power and flexibility of subroutine definitions, providing required and optional formal parameters, named and positional arguments, a new and extended operator overloading syntax, a far more sophisticated type system, multiple dispatch, compile-time macros, currying, and subroutine wrappers.

As if that weren't bounty enough, Apocalypse 6 also covers the object-oriented subroutines: methods and submethods. We will, however, defer a discussion of those until Exegesis 12.

Playing Our Parts

Suppose we want to be able to partition a list into two arrays (hereafter known as "sheep" and "goats"), according to some user-supplied criterion. We'll call the necessary subroutine &part, because it partitions a list into two parts.

In the most general case, we could specify how &part splits the list up by passing it a subroutine. &part could then call that subroutine for each element, placing the element in the "sheep" array if the subroutine returns true, and into the "goats" array otherwise. It would then return a list of references to the two resulting arrays.

For example, calling:

($cats, $chattels) = part &is_feline, @animals;

would result in $cats being assigned a reference to an array containing all the animals that are feline and $chattels being assigned a reference to an array containing everything else that exists merely for the convenience of cats.

Note that in the above example (and throughout the remainder of this discussion), when we're talking about a subroutine as an object in its own right, we'll use the & sigil; but when we're talking about a call to the subroutine, there will be no & before its name. That's a distinction Perl 6 enforces too: subroutine calls never have an ampersand; references to the corresponding Code object always do.

Part: The First

The Perl 6 implementation of &part would therefore be:

sub part (Code $is_sheep, *@data) {
	my (@sheep, @goats);
	for @data {
		if $is_sheep($_) { push @sheep, $_ }
		else             { push @goats, $_ }
	}
	return (\@sheep, \@goats);
}

As in Perl 5, the sub keyword declares a subroutine. As in Perl 5, the name of the subroutine follows the sub and — assuming that name doesn't include a package qualifier — the resulting subroutine is installed into the current package.

Unlike Perl 5, in Perl 6 we are allowed to specify a formal parameter list after the subroutine's name. This list consists of zero or more parameter variables. Each of these parameter variables is really a lexical variable declaration, but because they're in a parameter list we don't need to (and aren't allowed to!) use the keyword my.

Just as with a regular variable, each parameter can be given a storage type, indicating what kind of value it is allowed to store. In the above example, for instance, the $is_sheep parameter is given the type Code, indicating that it is restricted to objects of that type (i.e. the first argument must be a subroutine or block).

Each of these parameter variables is automatically scoped to the body of the subroutine, where it can be used to access the arguments with which the subroutine was called.

A word about terminology: an "argument" is an item in the list of data that is passed as part of a subroutine call. A "parameter" is a special variable inside the subroutine itself. So the subroutine call sends arguments, which the subroutine then accesses via its parameters.

Perl 5 has parameters too, but they're not user-specifiable. They're always called $_[0], $_[1], $_[2], etc.

Not-So-Secret Alias

However, one way in which Perl 5 and Perl 6 parameters are similar is that, unlike Certain Other Languages, Perl parameters don't receive copies of their respective arguments. Instead, Perl parameters become aliases for the corresponding arguments.

That's already the case in Perl 5. So, for example, we can write a temperature conversion utility like:

# Perl 5 code...
sub Fahrenheit_to_Kelvin {
	$_[0] -= 32;
	$_[0] /= 1.8;
	$_[0] += 273.15;
}

# and later...

Fahrenheit_to_Kelvin($reactor_temp);

When the subroutine is called, within the body of &Fahrenheit_to_Kelvin the $_[0] variable becomes just another name for $reactor_temp. So the changes the subroutine makes to $_[0] are really being made to $reactor_temp, and at the end of the call $reactor_temp has been converted to the new temperature scale.

That's very handy when we intend to change the values of arguments (as in the above example), but it's potentially a very nasty trap too. Many programmers, accustomed to the pass-by-copy semantics of other languages, will unconsciously fall into the habit of treating the contents of $_[0] as if they were a copy. Eventually that will lead to some subroutine unintentionally changing one of its arguments — a bug that is often very hard to diagnose and frequently even harder to track down.
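To make the trap concrete, here is a minimal Perl 5 sketch. The two helper subroutines and the $oven_f variable are invented for illustration; they are not part of the original example.

```perl
# Perl 5 code...
use strict;
use warnings;

# Hypothetical helper: only meant to *report* a Celsius value, but it
# operates directly on $_[0], which is an alias for the caller's variable.
sub careless_celsius {
    $_[0] -= 32;
    $_[0] /= 1.8;
    return $_[0];
}

# The usual Perl 5 defence: copy the argument into a lexical first.
sub careful_celsius {
    my ($temp) = @_;
    $temp -= 32;
    $temp /= 1.8;
    return $temp;
}

my $oven_f = 212;

my $safe = careful_celsius($oven_f);
print "after careful call:  $oven_f\n";    # caller's variable is untouched

my $oops = careless_celsius($oven_f);
print "after careless call: $oven_f\n";    # caller's variable has been overwritten
```

Both calls return the same converted value; only the second one quietly rewrites $oven_f along the way.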

So Perl 6 modifies the way parameters and arguments interact. Explicit parameters are still aliases to the original arguments, but in Perl 6 they're constant aliases by default. That means, unless we specifically tell Perl 6 otherwise, it's illegal to change an argument by modifying the corresponding parameter within a subroutine.

All of which means that the naïve translation of &Fahrenheit_to_Kelvin to Perl 6 isn't going to work:

# Perl 6 code...
sub Fahrenheit_to_Kelvin(Num $temp) {
	$temp -= 32;
	$temp /= 1.8;
	$temp += 273.15;
}

That's because $temp (and hence the actual value it's an alias for) is treated as a constant within the body of &Fahrenheit_to_Kelvin. In fact, we'd get a compile-time error message like:

Cannot modify constant parameter ($temp) in &Fahrenheit_to_Kelvin

If we want to be able to modify arguments via Perl 6 parameters, we have to say so up front, by declaring them is rw ("read-write"):

sub Fahrenheit_to_Kelvin (Num $temp is rw) {
	$temp -= 32;
	$temp /= 1.8;
	$temp += 273.15;
}

This requires a few extra keystrokes when the old behaviour is needed, but saves a huge amount of hard-to-debug grief in the most common cases. As a bonus, an explicit is rw declaration means that the compiler can generally catch mistakes like this:

$absolute_temp = Fahrenheit_to_Kelvin(212);

Because we specified that the $temp argument has to be read-writeable, the compiler can easily catch attempts to pass in a read-only value.
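For contrast, here is a sketch of what happens with the Perl 5 version of the subroutine. The eval wrapper and the $survived and $error variables are additions for illustration. Perl 5 compiles the call happily and only objects at run time, when the subroutine actually tries to modify the literal:

```perl
# Perl 5 code...
use strict;
use warnings;

sub Fahrenheit_to_Kelvin {
    $_[0] -= 32;
    $_[0] /= 1.8;
    $_[0] += 273.15;
}

# Passing a literal compiles without complaint; the failure only appears
# when the subroutine tries to modify the read-only value.
my $survived = eval { Fahrenheit_to_Kelvin(212); 1 };
my $error    = $@;

print "Run-time error: $error" unless $survived;
```

The message Perl 5 reports mentions an attempted modification of a read-only value, but by then the program is already running.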

Alternatively, we might prefer that $temp not be an alias at all. We might prefer that &Fahrenheit_to_Kelvin take a copy of its argument, which we could then modify without affecting the original, ultimately returning it as our converted value. We can do that too in Perl 6, using the is copy trait:

sub Fahrenheit_to_Kelvin(Num $temp is copy) {
	$temp -= 32;
	$temp /= 1.8;
	$temp += 273.15;
	return $temp;
}

Defining the Parameters

Meanwhile, back at the &part, we have:

sub part (Code $is_sheep, *@data) {...}

which means that &part expects its first argument to be a scalar value of type Code (or Code reference). Within the subroutine that first argument will thereafter be accessed via the name $is_sheep.

The second parameter (*@data) is what's known as a "slurpy array". That is, it's an array parameter with the special marker (*) in front of it, indicating to the compiler that @data is supposed to grab all the remaining arguments passed to &part and make each element of @data an alias to one of those arguments.

In other words, the *@data parameter does just what @_ does in Perl 5: it grabs all the available arguments and makes its elements aliases for those arguments. The only differences are that in Perl 6 we're allowed to give that slurpy array a sensible name, and we're allowed to specify other individual parameters before it — to give separate sensible names to one or more of the preliminary arguments to the call.

But why (you're probably wondering) do we need an asterisk for that? Surely if we had defined &part like this:

sub part (Code $is_sheep, @data) {...}   # note: no asterisk on @data

the array in the second parameter slot would have slurped up all the remaining arguments anyway.

Well, no. Declaring a parameter to be a regular (non-slurpy) array tells the subroutine to expect the corresponding argument to be an actual array (or an array reference). So if &part had been defined with its second parameter just @data (rather than *@data), then we could call it like this:

part \&selector, @animal_sounds;

or this:

part \&selector, ["woof","meow","ook!"];

but not like this:

part \&selector, "woof", "meow", "ook!";

In each case, the compiler would compare the type of the second argument with the type required by the second parameter (i.e. an Array). In the first two cases, the types match and everything is copacetic. In the third case, the second argument is a string, not an array or array reference, so we get a compile-time error message:

Type mismatch in call to &part: @data expects Array but got Str instead

Another way of thinking about the difference between slurpy and regular parameters is to realize that a slurpy parameter imposes a list (i.e. flattening) context on the corresponding arguments, whereas a regular, non-slurpy parameter doesn't flatten or listify. Instead, it insists on a single argument of the correct type.

So, if we want &part to handle raw lists as data, we need to tell the @data parameter to take whatever it finds — array or list — and flatten everything down to a list. That's what the asterisk on *@data does.

Because of that all-you-can-eat behaviour, slurpy arrays like this are generally placed at the very end of the parameter list and used to collect data for the subroutine. The preceding non-slurpy arguments generally tell the subroutine what to do; the slurpy array generally tells it what to do it to.

Splats and Slurps

Another aspect of Perl 6's distinction between slurpy and non-slurpy parameters can be seen when we write a subroutine that takes multiple scalar parameters, then try to pass an array to that subroutine.

For example, suppose we wrote:

sub log($message, $date, $time) {...}

If we happen to have the date and time in a handy array, we might expect that we could just call log like so:

log("Starting up...", @date_and_time);

We might then be surprised when this fails even to compile.

The problem is that each of &log's three scalar parameters imposes a scalar context on the corresponding argument in any call to log. So "Starting up..." is first evaluated in the scalar context imposed by the $message parameter and the resulting string is bound to $message. Then @date_and_time is evaluated in the scalar context imposed by $date, and the resulting array reference is bound to $date. Then the compiler discovers that there is no third argument to bind to the $time parameter and kills your program.

Of course, it has to work that way, or we don't get the ever-so-useful "array parameter takes an unflattened array argument" behaviour described earlier. Unfortunately, that otherwise admirable behaviour is actually getting in the way here and preventing @date_and_time from flattening as we want.

So Perl 6 also provides a simple way of explicitly flattening an array (or a hash for that matter): the unary prefix * operator:

log("Starting up...", *@date_and_time);

This operator (known as "splat") simply flattens its argument into a list. Since it's a unary operator, it does that flattening before the arguments are bound to their respective parameters.

The syntactic similarity of a "slurpy" * in a parameter list, and a "splatty" * in an argument list is quite deliberate. It reflects a behavioral similarity: just as a slurpy asterisk implicitly flattens any argument to which its parameter is bound, so too a splatty asterisk explicitly flattens any argument to which it is applied.
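It may help to see the opposite default at work. The sketch below is Perl 5, where arrays always flatten in an argument list and it is the unflattened case that needs extra syntax. The subroutines log_event and count_fields, and the sample date and time, are invented for illustration (log_event is used because log is already a Perl 5 builtin).

```perl
# Perl 5 code...
use strict;
use warnings;

# Hypothetical logger taking three scalar values.
sub log_event {
    my ($message, $date, $time) = @_;
    return "[$date $time] $message";
}

my @date_and_time = ('2003-07-29', '09:15');    # sample data

# Perl 5 flattens the array automatically, so $date and $time are both filled.
print log_event("Starting up...", @date_and_time), "\n";

# To keep the array intact in Perl 5, the caller must pass a reference instead.
sub count_fields {
    my ($message, $fields) = @_;
    return scalar @$fields;
}

print count_fields("Starting up...", \@date_and_time), "\n";
```

In other words, Perl 5 flattens unless we take a reference, whereas the Perl 6 design described here keeps the array whole unless we splat it.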

I Do Declare

By the way, take another look at those examples above — the ones with the {...} where their subroutine bodies should be. Those dots aren't just metasyntactic; they're real executable Perl 6 code. A subroutine definition with a {...} for its body isn't actually a definition at all. It's a declaration.

In the same way that the Perl 5 declaration:

# Perl 5 code...
sub part;

states that there exists a subroutine &part, without actually saying how it's implemented, so too:

# Perl 6 code...
sub part (Code $is_sheep, *@data) {...}

states that there exists a subroutine &part that takes a Code object and a list of data, without saying how it's implemented. In fact, the old sub part; syntax is no longer allowed; in Perl 6 you have to yada-yada-yada when you're making a declaration.

Body Parts

With the parameter list taking care of getting the right arguments into the right parameters in the right way, the body of the &part subroutine is then quite straightforward:

{
    my (@sheep, @goats);
    for @data {
        if $is_sheep($_) { push @sheep, $_ }
        else             { push @goats, $_ }
    }
    return (\@sheep, \@goats);
}

According to the original specification, we need to return references to two arrays. So we first create those arrays. Then we iterate through each element of the data (which the for aliases to $_, just as in Perl 5). For each element, we take the Code object that was passed as $is_sheep (let's just call it the selector from now on) and we call it, passing the current data element. If the selector returns true, we push the data element onto the array of "sheep", otherwise it is appended to the list of "goats". Once all the data has been divvied up, we return references to the two arrays.

Note that, if this were Perl 5, we'd have to unpack the @_ array into a list of lexical variables and then explicitly check that $is_sheep is a valid Code object. In the Perl 6 version there's no @_, the parameters are already lexicals, and the type-checking is handled automatically.
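For comparison, a Perl 5 rendering of &part might look something like the following sketch. The error message wording and the even/odd test data are invented for illustration.

```perl
# Perl 5 code...
use strict;
use warnings;
use Carp qw(croak);

sub part {
    # Unpack @_ by hand into lexicals...
    my ($is_sheep, @data) = @_;

    # ...and do the type check ourselves.
    croak "part: first argument must be a code reference"
        unless ref($is_sheep) eq 'CODE';

    my (@sheep, @goats);
    for (@data) {
        if ($is_sheep->($_)) { push @sheep, $_ }
        else                 { push @goats, $_ }
    }
    return (\@sheep, \@goats);
}

my ($even, $odd) = part(sub { $_[0] % 2 == 0 }, 1 .. 6);
print "even: @$even\n";    # even: 2 4 6
print "odd:  @$odd\n";     # odd:  1 3 5
```

Note that unpacking @_ this way copies the arguments, so this version also gives up the aliasing that the Perl 6 *@data parameter preserves.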

Call of the Wild

With the explicit parameter list in place, we can use &part in a variety of ways. If we already have a subroutine that is a suitable test:

sub is_feline ($animal) {
    return $animal.isa(Cat);
}

then we can just pass that to &part, along with the data to be partitioned, then grab the two array references that come back:

($cats, $chattels) = part &is_feline, @animals;

This works fine, because the first parameter of &part expects a Code object, and that's exactly what &is_feline is. Note that we couldn't just put is_feline there (i.e. without the ampersand), since that would indicate a call to &is_feline, rather than a reference to it.

In Perl 5 we'd have had to write \&is_feline to get a reference to the subroutine. However, since the $is_sheep parameter specifies that the first argument must be a scalar (i.e. it imposes a scalar context on the first argument slot), in Perl 6 we don't have to create a subroutine reference explicitly. Putting a code object in the scalar context auto-magically enreferences it (just as an array or hash is automatically converted to a reference in scalar context). Of course, an explicit Code reference is perfectly acceptable there too:

($cats, $chattels) = part \&is_feline, @animals;

Alternatively, rather than going to the trouble of declaring a separate subroutine to sort our sheep from our goats, we might prefer to conjure up a suitable (anonymous) subroutine on the spot:

($cats, $chattels) = part sub ($animal) { $animal.isa(Animal::Cat) }, @animals;

In a Bind

So far we've always captured the two array references returned from the part call by assigning the result of the call to a list of scalars. But we might instead prefer to bind them to actual arrays:

(@cats, @chattels) := part sub($animal) { $animal.isa(Animal::Cat) }, @animals;

Using binding (:=) instead of assignment (=) causes @cats and @chattels to become aliases for the two anonymous arrays returned by &part.

In fact, this aliasing of the two return values to @cats and @chattels uses exactly the same mechanism that is used to alias subroutine parameters to their corresponding arguments. We could almost think of the lefthand side of the := as a parameter list (in this case, consisting of two non-slurpy array parameters), and the righthand side of the := as being the corresponding argument list. The only difference is that the variables on the lefthand side of a := are not implicitly treated as constant.

One consequence of the similarities between binding and parameter passing is that we can put a slurpy array on the left of a binding:

(@Good, $Bad, *@Ugly) := (@Adams, @Vin, @Chico, @OReilly, @Lee, @Luck, @Britt);

The first pseudo-parameter (@Good) on the left expects an array, so it binds to @Adams from the list on the right.

The second pseudo-parameter ($Bad) expects a scalar. That means it imposes a scalar context on the second element of the righthand list. So @Vin evaluates to a reference to the original array and $Bad becomes an alias for \@Vin.

The final pseudo-parameter (*@Ugly) is slurpy, so it expects the rest of the righthand side to be a list it can slurp up. In order to ensure that, the slurpy asterisk causes the remaining pseudo-arguments on the right to be flattened into a list, whose elements are then aliased to successive elements of @Ugly.

Who Shall Sit in Judgment?

Conjuring up an anonymous subroutine in each call to part is intrinsically neither good nor bad, but it sure is ugly:

($cats, $chattels) = part sub($animal) { $animal.isa(Animal::Cat) }, @animals;

Fortunately, there's a cleaner way to specify the selector within the call to part. We can use a parameterized block instead:

($cats, $chattels) = part -> $animal { $animal.isa(Animal::Cat) } @animals;

A parameterized block is just a normal brace-delimited block, except that you're allowed to put a list of parameters out in front of it, preceded by an arrow (->). So the actual parameterized block in the above example is:

-> $animal { $animal.isa(Animal::Cat) }

In Perl 6, a block is a subspecies of Code object, so it's perfectly okay to pass a parameterized block as the first argument to &part. Like a real subroutine, a parameterized block can be subsequently invoked and passed an argument list. The body of the &part subroutine will continue to work just fine.

It's important to realize that parameterized blocks aren't subroutines though. They're blocks, and so there are important differences in their behaviour. The most important difference is that you can't return from a parameterized block, the way you can from a subroutine. For example, this:

part sub($animal) { return $animal.size < $breadbox }, @creatures

works fine, returning the result of each size comparison every time the anonymous subroutine is called within &part.

But in this "pointier" version:

part -> $animal { return $animal.size < $breadbox } @creatures

the return isn't inside a nested subroutine; it's inside a block. The first time the parameterized block is executed within &part it causes the subroutine in which the block was defined (i.e. the subroutine that's calling part) to return!

Oops.

The problem with that second example, of course, is not that we were too Lazy to write the full anonymous subroutine. The problem is that we weren't Lazy enough: we forgot to leave out the return. Just like a Perl 5 do or eval block, a Perl 6 parameterized block evaluates to the value of the last statement executed within it. We only needed to say:

part -> $animal { $animal.size < $breadbox } @creatures

Note too that, because the parameterized block is a block, we don't need to put a comma after it to separate it from the second argument. In fact, anywhere a block is used as an argument to a subroutine, any comma before or after the block is optional.

Cowabunga!

Even with the slight abbreviation provided by using a parameterized block instead of an anonymous subroutine, it's all too easy to lose track of the actual data (i.e. @animals) when it's buried at the end of that long selector definition.

We can help it stand out a little better by using a new feature of Perl 6: the "pipeline" operator:

($cats, $chattels) = part sub($animal) { $animal.isa(Animal::Cat) } <== @animals;

The <== operator takes a subroutine call as its lefthand argument and a list of data as its righthand arguments. The subroutine being called on the left must have a slurpy array parameter (e.g. *@data) and the list on the operator's right is then bound to that parameter.

In other words, a <== in a subroutine call marks the end of the specific arguments and the start of the slurped data.

Pipelines are more interesting when there are several stages to the process, as in this Perl 6 version of the Schwartzian transform:

@shortest_first = map  { .key }                     # 4
              <== sort { $^a.value <=> $^b.value }  # 3
              <== map  { $_ => .height }            # 2
              <== @animals;                         # 1

This example takes the array @animals, flattens it into a list (#1), pipes that list in as the data for a map operation (#2), takes the resulting list of object/height pairs and pipes that in to the sort (#3), then takes the resulting sorted list of pairs and maps out just the sorted objects (#4).

Of course, since the data lists for all of these functions always come at the end of the call anyway, we could have just written that as:

@shortest_first = map  { .key }                     # 4
                  sort { $^a.value <=> $^b.value }  # 3
                  map  { $_ => .height }            # 2
                  @animals;                         # 1

But there's no reason to stint ourselves: the pipelines cost nothing in performance, and often make the flow of data much clearer.
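For readers who haven't met it before, here is the classic Perl 5 form of the transform, as a sketch. The %height_of hash and its values are invented stand-ins for the .height method used above.

```perl
# Perl 5 code...
use strict;
use warnings;

# Invented data: heights in metres, standing in for the .height method.
my %height_of = (giraffe => 5.5, cat => 0.25, horse => 1.6);
my @animals   = ('giraffe', 'cat', 'horse');

my @shortest_first =
    map  { $_->[0] }                    # 4: keep just the animal
    sort { $a->[1] <=> $b->[1] }        # 3: sort on the cached height
    map  { [ $_, $height_of{$_} ] }     # 2: pair each animal with its height
    @animals;                           # 1: the data

print "@shortest_first\n";              # cat horse giraffe
```

Here the data also runs bottom-to-top, with anonymous arrays doing the job that pairs do in the Perl 6 version.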

One problem that many people have with pipelined list processing techniques like the Schwartzian Transform is that the pipeline flows the "wrong" way: the code reads left-to-right/top-to-bottom but the data (and execution) runs right-to-left/bottom-to-top. Happily, Perl 6 has a solution for that too. It provides a "reversed" version of the pipeline operator, to make it easy to create left-to-right pipelines:

@animals ==> map  { $_ => .height }              # 1
         ==> sort { $^a.value <=> $^b.value }    # 2
         ==> map  { .key }                       # 3
         ==> @shortest_first;                    # 4

This version works exactly the same as the previous right-to-left/bottom-to-top examples, except that now the various components of the pipeline are written and performed in the "natural" order.

The ==> operator is the mirror-image of <==, both visually and in its behaviour. That is, it takes a subroutine call as its righthand argument and a list of data on its left, and binds the lefthand list to the slurpy array parameter of the subroutine being called on the right.

Note that this last example makes use of a special dispensation given to both pipeline operators. The argument on the "sharp" side is supposed to be a subroutine call. However, if it is a variable, or a list of variables, then the pipeline operator simply assigns the list from its "blunt" side to the variable (or list) on its "sharp" side.

Hence, if we preferred to partition our animals left-to-right, we could write:

@animals ==> part sub ($animal) { $animal.isa(Animal::Cat) } ==> ($cats, $chattels);

The Incredible Shrinking Selector

Of course, even with a parameterized block instead of an anonymous subroutine, the definition of the selector argument is still klunky:

($cats, $chattels) = part -> $animal { $animal.isa(Animal::Cat) } @animals;

But it doesn't have to be so intrusive. There's another way to create a parameterized block. Instead of explicitly enumerating the parameters after a ->, we could use placeholder variables instead.

As explained in Apocalypse 4, a placeholder variable is one whose sigil is immediately followed by a caret (^). Any block containing one or more placeholder variables is automatically a parameterized block, without the need for an explicit -> or parameter list. Instead, the block's parameter list is determined automatically from the set of placeholder variables enclosed by the block's braces.

We could simplify our partitioning to:

($cats, $chattels) = part { $^animal.isa(Animal::Cat) } @animals;

Here $^animal is a placeholder, so the block immediately surrounding it becomes a parameterized block — in this case with exactly one parameter.

Better still, any block containing a $_ is also a parameterized block — with a single parameter named $_. We could dispense with the explicit placeholder and just write our partitioning statement:

($cats, $chattels) = part { $_.isa(Animal::Cat) } @animals;

which is really a shorthand for the parameterized block:

($cats, $chattels) = part -> $_ { $_.isa(Animal::Cat) } @animals;

Come to think of it, since we now have the unary dot operator (which calls a method using $_ as the invocant), we don't even need the explicit $_:

($cats, $chattels) = part { .isa(Animal::Cat) } @animals;

Part: The Second

But wait, there's even...err...less!

We could very easily extend &part so that we don't even need the block in that case; so that we could just pass the raw class in as the first argument:

($cats, $chattels) = part Animal::Cat, @animals;

To do that, the type of the first parameter will have to become Class, which is the (meta-)type of all classes. However, if we changed &part's parameter list in that way:

sub part (Class $is_sheep, *@data) {...}

then all our existing code that currently passes Code objects as &part's first argument will break.

Somehow we need to be able to pass either a Code object or a Class as &part's first argument. To accomplish that, we need to take a short detour into...

The Wonderful World of Junctions

Perl 6 introduces an entirely new scalar data-type: the junction. A junction is a single scalar value that can act like two or more values at once. So, for example, we can create a value that behaves like any of the values 1, 4, or 9, by writing:

$monolith = any(1,4,9);

The scalar value returned by any and subsequently stored in $monolith is equal to 1. And at the same time it's also equal to 4. And to 9. It's equal to any of them. Hence the name of the any function that we used to set it up.

What good is that? Well, if it's equal to "any of them" then, with a single comparison, we can test if some other value is also equal to "any of them":

if $dave == any(1,4,9) { print "I'm sorry, Dave, you're just a square." }

That's considerably shorter (and more maintainable) than:

if $dave == 1 || $dave == 4 || $dave == 9 { print "I'm sorry, Dave, you're just a square." }

It even reads more naturally.

Better still, Perl 6 provides an n-ary operator that builds the same kinds of junctions from its operands:

if $dave == 1|4|9 { print "I'm sorry, Dave, you're just a square." }

Once you get used to this notation, it too is very easy to follow: if Dave equals 1 or 4 or 9....

(Yes, the Perl 5 bitwise OR is still available in Perl 6; it's just spelled differently now).

The any function is more useful when the values under consideration are stored in a single array. For example, we could check whether a new value is bigger than any we've already seen:

if $newval > any(@oldvals) { print "$newval isn't the smallest." }

In Perl 5 we'd have to write that:

if (grep { $newval > $_ } @oldvals) { print "$newval isn't the smallest." }

which isn't as clear and isn't as quick (since the any version will short-circuit as soon as it knows the comparison is true, whereas the grep version will churn through every element of @oldvals no matter what).
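The churning is easy to observe. The following Perl 5 sketch counts how often each test block runs. It borrows the any function from List::Util (version 1.33 or later), which is an ordinary list function rather than a Perl 6 junction, but which stops at the first success in the same spirit. The data values and counter variables are invented for illustration.

```perl
# Perl 5 code...
use strict;
use warnings;
use List::Util qw(any);    # any() needs List::Util 1.33 or later

my @oldvals = (1, 50, 60, 70);    # invented data
my $newval  = 10;

my ($grep_tests, $any_tests) = (0, 0);

# grep always visits every element...
my $grep_found = grep { $grep_tests++; $newval > $_ } @oldvals;

# ...whereas any() stops as soon as one test succeeds.
my $any_found  = any  { $any_tests++;  $newval > $_ } @oldvals;

print "grep ran its block $grep_tests times\n";
print "any  ran its block $any_tests times\n";
```

With this data the very first comparison succeeds, so any gets away with a single test while grep still makes all four.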

An any is even more useful when we have a collection of new values to check against the old ones. We can say:

if any(@newvals) > any(@oldvals) { print "Already seen at least one smaller value." }

instead of resorting to the horror of nested greps:

if (grep { my $old = $_; grep { $_ > $old } @newvals } @oldvals) { print "Already seen at least one smaller value." }

What if we wanted to check whether all of the new values were greater than any of the old ones? For that we use a different kind of junction — one that is equal to all our values at once (rather than just any one of them). We can create such a junction with the all function:

if all(@newvals) > any(@oldvals) {
    print "These are all bigger than something already seen."
}

We could also test if all the new values are greater than all the old ones (not merely greater than at least one of them), with:

if all(@newvals) > all(@oldvals) {
    print "These are all bigger than everything already seen."
}

There's an operator for building all junctions too. No prizes for guessing. It's n-ary &. So, if we needed to check that the maximal dimension of some object is within acceptable limits, we could say:

if $max_dimension < $height & $width & $depth {
    print "A maximal dimension of $max_dimension is okay."
}

That last example is the same as:

if $max_dimension < $height
&& $max_dimension < $width
&& $max_dimension < $depth {
    print "A maximal dimension of $max_dimension is okay."
}

any junctions are known as disjunctions, because they act like they're in a boolean OR: "this OR that OR the other". all junctions are known as conjunctions, because they have an implicit AND between their values — "this AND that AND the other".

There are two other types of junction available in Perl 6: abjunctions and injunctions. An abjunction is created using the one function and represents exactly one of its possible values at any given time:

if one(@roots) == 0 {
    print "Unique root to polynomial.";
}

In other words, it's as though there were an implicit n-ary XOR between each pair of values.

Injunctions represent none of their values and hence are constructed with a built-in named none:

if $passwd eq none(@previous_passwds) {
    print "New password is acceptable.";
}

They're like a multi-part NEITHER...NOR...NOR...
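As a rough guide for Perl 5 programmers, the four kinds of test can be approximated with ordinary list functions. This is only a sketch: any, all, and none come from List::Util (1.33 or later) and are plain functions, not junctions, and the "exactly one" test is spelled here with a counting grep. All data values and variable names are invented for illustration.

```perl
# Perl 5 code...
use strict;
use warnings;
use List::Util qw(any all none);    # needs List::Util 1.33 or later

my @oldvals = (3, 8, 5);
my $newval  = 6;

my $beats_some = any { $newval > $_ } @oldvals;    # cf. $newval > any(@oldvals)
my $beats_all  = all { $newval > $_ } @oldvals;    # cf. $newval > all(@oldvals)

my @previous_passwds = ('hunter2', 'swordfish');
my $passwd           = 'xyzzy';
my $is_fresh = none { $passwd eq $_ } @previous_passwds;    # cf. $passwd eq none(...)

my @roots = (-2, 0, 3);
my $unique_zero = 1 == grep { $_ == 0 } @roots;    # cf. one(@roots) == 0

print "bigger than something: ", ($beats_some  ? "yes" : "no"), "\n";
print "bigger than everything: ", ($beats_all   ? "yes" : "no"), "\n";
print "password is new: ",        ($is_fresh    ? "yes" : "no"), "\n";
print "unique root: ",            ($unique_zero ? "yes" : "no"), "\n";
```

What the functions can't do is travel: a junction is a value that can be stored, passed around, and compared later, whereas each of these tests has to be written out at the point of use.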

We can build a junction out of any scalar type. For example, strings:

my $known_title = 'Mr' | 'Mrs' | 'Ms' | 'Dr' | 'Rev';
if %person{title} ne $known_title {
    print "Unknown title: %person{title}.";
}

or even Code references:

my &ideal := \&tall & \&dark & \&handsome;
if ideal($date) {   # Same as: if tall($date) && dark($date) && handsome($date)
    swoon();
}

The Best of Both Worlds

So a disjunction (any) allows us to create a scalar value that is either this or that.

In Perl 6, classes (or, more specifically, Class objects) are scalar values. So it follows that we can create a disjunction of classes. For example:

Floor::Wax | Dessert::Topping

gives us a type that can be either Floor::Wax or Dessert::Topping. So a variable declared with that type:

my Floor::Wax|Dessert::Topping $shimmer;

can store either a Floor::Wax object or a Dessert::Topping object. A parameter declared with that type:

sub advertise(Floor::Wax|Dessert::Topping $shimmer) {...}

can be passed an argument that is of either type.

Matcher Smarter, not Harder

So, in order to extend &part to accept a Class as its first argument, whilst still allowing it to accept a Code object in that position, we just use a type junction:

sub part (Code|Class $is_sheep, *@data) {
    my (@sheep, @goats);
    for @data {
        when $is_sheep { push @sheep, $_ }
        default        { push @goats, $_ }
    }
    return (\@sheep, \@goats);
}

There are only two differences between this version and the previous one. The first difference is, of course, that we have changed the type of the first parameter. Previously it was Code; now it's Code|Class.

The second change is in the body of the subroutine itself. We replaced the partitioning if statement:

for @data {
    if $is_sheep($_) { push @sheep, $_ }
    else             { push @goats, $_ }
}

with a switch:

for @data {
    when $is_sheep { push @sheep, $_ }
    default        { push @goats, $_ }
}

Now the actual work of categorizing each element as a "sheep" or a "goat" is done by the when statement, because:

when $is_sheep { push @sheep, $_ }

is equivalent to:

if $_ ~~ $is_sheep { push @sheep, $_; next }

When $is_sheep is a subroutine reference, that implicit smart-match will simply pass $_ (the current data element) to the subroutine and then evaluate the return value as a boolean. On the other hand, when $is_sheep is a class, the smart-match will check to see if the object in $_ belongs to the same class or some derived class.

The single when statement handles either type of selector — Code or Class — auto-magically. That's why it's known as smart-matching.

Having now allowed class names as selectors, we can take the final step and simplify:

($cats, $chattels) = part { .isa(Animal::Cat) } @animals;

to:

($cats, $chattels) = part Animal::Cat, @animals;

Note, however, that the comma is back. Only blocks can appear in argument lists without accompanying commas, and the raw class isn't a block.

Partitioning Rules!

Now that the when's implicit smart-match is doing the hard work of deciding how to evaluate each data element against the selector, adding new kinds of selectors becomes trivial. For example, here's a third version of &part which also allows Perl 6 rules (i.e. patterns) to be used to partition a list:

sub part (Code|Class|Rule $is_sheep, *@data) {
    my (@sheep, @goats);
    for @data {
        when $is_sheep { push @sheep, $_ }
        default        { push @goats, $_ }
    }
    return (\@sheep, \@goats);
}

All we needed to do was to tell &part that its first argument was also allowed to be of type Rule. That allows us to call &part like this:

($cats, $chattels) = part /meow/, @animal_sounds;

In the scalar context imposed by the $is_sheep parameter, the /meow/ pattern evaluates to a Rule object (rather than immediately doing a match). That Rule object is then bound to $is_sheep and subsequently used as the selector in the when statement.

Note that the body of this third version is exactly the same as that of the previous version. No change is required because, when it detects that $is_sheep is a Rule object, the when's smart-matching will auto-magically do a pattern match.

In the same way, we could further extend &part to allow the user to pass a hash as the selector:

my %is_cat = (
    cat => 1, tiger => 1, lion => 1, leopard => 1, # etc.
);

($cats, $chattels) = part %is_cat, @animal_names;

simply by changing the parameter list of &part to:

sub part (Code|Class|Rule|Hash $is_sheep, *@data) {
    # body exactly as before
}

Once again, the smart-match hidden in the when statement just Does The Right Thing. On detecting a hash being matched against each datum, it will use the datum as a key, do a hash look up, and evaluate the truth of the corresponding entry in the hash.
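
To make the dispatch concrete, here is a small Python sketch of what the smart-match is described as doing for each kind of selector. It is an analogy under stated assumptions rather than the Perl 6 mechanism: `smart_match` is a made-up helper, a compiled regular expression stands in for a Rule, and a dict stands in for a hash.

```python
import re

def smart_match(value, selector):
    """Hypothetical stand-in for `$_ ~~ $selector`, covering four selector kinds."""
    # Classes are callable too, so the class check has to come first.
    if isinstance(selector, type):          # class: instance-of check
        return isinstance(value, selector)
    if isinstance(selector, re.Pattern):    # rule: pattern match
        return selector.search(value) is not None
    if isinstance(selector, dict):          # hash: truth of the looked-up entry
        return bool(selector.get(value))
    if callable(selector):                  # code: call it, take the truth value
        return bool(selector(value))
    raise TypeError("selector must be a class, pattern, dict, or callable")

def part(is_sheep, *data):
    """Split data into (sheep, goats) according to the selector."""
    sheep, goats = [], []
    for item in data:
        (sheep if smart_match(item, is_sheep) else goats).append(item)
    return sheep, goats

is_cat = {"cat": 1, "tiger": 1, "lion": 1, "leopard": 1}
cats, chattels = part(is_cat, "cat", "dog", "tiger")
```

The body of `part` never changes as selector kinds are added; only `smart_match` grows, which is the point the article is making about `when`.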

Of course, the ever-increasing disjunction of allowable selector types is rapidly threatening to overwhelm the entire parameter list. At this point it would make sense to factor the type-junction out, give it a logical name, and use that name instead. To do that, we just write:

type Selector ::= Code | Class | Rule | Hash;
sub part (Selector $is_sheep, *@data) {
    # body exactly as before
}

The ::= binding operator is just like the := binding operator, except that it operates at compile-time. It's the right choice here because types need to be fully defined at compile-time, so the compiler can do as much static type checking as possible.

The effect of the binding is to make the name Selector an alias for Code | Class | Rule | Hash. Then we can just use Selector wherever we want that particular disjunctive type.

Out with the New and in with the Old

Let's take a step back for a moment.

We've already seen how powerful and clean these new-fangled explicit parameters can be, but maybe you still prefer the Perl 5 approach. After all, @_ was good enough fer Grandpappy when he lernt hisself Perl as a boy, dangnabit!

In Perl 6 we can still pass our arguments the old-fashioned way and then process them manually:

# Still valid Perl 6...
sub part {
    # Unpack and verify args...
    my ($is_sheep, @data) = @_;
    croak "First argument to &part is not Code, Hash, Rule, or Class"
        unless $is_sheep.isa(Selector);

    # Then proceed as before...
    my (@sheep, @goats);
    for @data {
        when $is_sheep { push @sheep, $_ }
        default        { push @goats, $_ }
    }
    return (\@sheep, \@goats);
}

If we declare a subroutine without a parameter list, Perl 6 automatically supplies one for us, consisting of a single slurpy array named @_:

sub part {...}      # means: sub part (*@_) {...}

That is, any un-parametered Perl 6 subroutine expects to flatten and then slurp up an arbitrarily long list of arguments, binding them to the elements of a parameter called @_. That's pretty much what a Perl 5 subroutine does. The only important difference is that in Perl 6 that slurpy @_ is, like all Perl 6 parameters, constant by default. So, if we want the exact behaviour of a Perl 5 subroutine — including being able to modify elements of @_ — we need to be explicit:

sub part (*@_ is rw) {...}

Note that "declare a subroutine without a parameter list" doesn't mean "declare a subroutine with an empty parameter list":

sub part    {...}   # without parameter list
sub part () {...}   # empty parameter list

An empty parameter list specifies that the subroutine takes exactly zero arguments, whereas a missing parameter list means it takes any number of arguments and binds them to the implicit parameter @_.
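
The same distinction exists in other languages. As a loose Python analogy (the function names here are invented for the example), a catch-all `*args` parameter plays the part of the implicit @_, while empty parentheses mean "no arguments at all":

```python
def part_any(*args):
    """Only a catch-all parameter: accepts any number of arguments."""
    return len(args)

def part_none():
    """Empty parameter list: accepts exactly zero arguments."""
    return 0

count = part_any("a", "b", "c")   # all three land in args

try:
    part_none("a")                # one argument too many
    rejected = False
except TypeError:
    rejected = True
```

Python's `args` is a tuple, so it cannot be modified in place, which loosely echoes the constant-by-default @_ described above.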

Of course, by using the implicit @_ instead of named parameters, we're merely doing extra work that Perl 6 could do for us, as well as making the subroutine body more complex, harder to maintain, and slower. We're also eliminating any chance of Perl 6 identifying argument mismatches at compile-time. And, unless we're prepared to complexify the code even further, we're preventing client code from using named arguments (see "Name Your Poison" below).

But this is Perl, not Fascism. We're not in the business of imposing the One True Coding Style on Perl hackers. So if you want to pass your arguments the old-fashioned way, Perl 6 makes sure you still can.

A Pair of Lists in a List of Pairs

Suppose now that, instead of getting a list of array references back, we wanted to get back a list of key=>value pairs, where each value was one of the array refs and each key some kind of identifying label (we'll see why that might be particularly handy soon).

The easiest solution is to use two fixed keys (for example, "sheep" and "goats"):

sub part (Selector $is_sheep, *@data) returns List of Pair {
    my %herd;
    for @data {
        when $is_sheep { push %herd{"sheep"}, $_ }
        default        { push %herd{"goats"}, $_ }
    }
    return *%herd;
}

The parameter list of the subroutine is unchanged, but now we've added a return type after it, using the returns keyword. That return type is List of Pair, which tells the compiler that any return statements in the subroutine are expected to return a list of values, each of which is a Perl 6 key=>value pair.

Parametric Types

Note that this type is different from those we've seen so far: it's compound. The of Pair suffix is actually an argument that modifies the principal type List, telling the container type what kind of value it's allowed to store. This is possible because List is a parametric type. That is, it's a type that can be specified with arguments that modify how it works. The idea is a little like C++ templates, except not quite so brain-meltingly complicated.

The specific parameters for a parametric type are normally specified in square brackets, immediately after the class name. The arguments that define a particular instance of the class are likewise passed in square brackets. For example:

class Table[Class $of] {...}
class Logfile[Str $filename] {...}
module SecureOps[AuthKey $key] {...}

# and later:

sub typeset(Table of Contents $toc) {...}
# Expects an object whose class is Table
# and which stores Contents objects

my Logfile["./log"] $file;
# $file can only store logfiles that log to ./log

$plaintext = SecureOps[$KEY]::decode($cryptotext);
# Only use &decode if our $KEY entitles us to

Note that type names like Table of Contents and List of Pair are really just tidier ways to say Table[of=>Contents] and List[of=>Pair].

By convention, when we pass an argument to the $of parameter of a parametric type, we're telling that type what kind of value we're expecting it to store. For example: whenever we access an element of List of Pair, we expect to get back a Pair. Similarly we could specify List of Int, Array of Str, or Hash of Num.

Admittedly List of Pair doesn't seem much tidier than List[of=>Pair], but as container types get more complex, the advantages start to become obvious. For example, consider a data structure consisting of an array of arrays of arrays of hashes of numbers (such as one might use to store, say, several years' worth of daily climatic data). Using the of notation that's just:

type Climate::Record ::= Array of Array of Array of Hash of Num;

Without the of keyword, it's:

type Climate::Record ::= Array[of=>Array[of=>Array[of=>Hash[of=>Num]]]];

which is starting to look uncomfortably like Lisp.

Parametric types may have any number of parameters with any names we like, but only type parameters named $of have special syntactic support built into Perl.
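
For comparison only, Python's built-in generics give a feel for the same nesting. This sketch assumes Python 3.9 or later and assumes string keys for the innermost hash (the article doesn't say what the keys are). Unlike the design described here, Python does not enforce these types at run time.

```python
from typing import get_args, get_origin

# Array of Array of Array of Hash of Num, approximated with built-in generics.
# The str key type is an assumption made for this example.
ClimateRecord = list[list[list[dict[str, float]]]]

# Peel off one layer: the outer container and what it is declared to hold.
outer = get_origin(ClimateRecord)      # the unparameterised container: list
(inner,) = get_args(ClimateRecord)     # the declared element type
```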

TMTOWTDeclareI

While we're talking about type declarations, it's worth noting that we could also have put &part's new return type out in front (just as we've been doing with variable and parameter types). However, this is only allowed for subroutines when the subroutine is explicitly scoped:

# lexical subroutine
my List of Pair sub part (Selector $is_sheep, *@data) {...}

or:

# package subroutine
our List of Pair sub part (Selector $is_sheep, *@data) {...}

The return type goes between the scoping keyword (my or our) and the sub keyword. And, of course, the returns keyword is not used.

Contrariwise, we can also put variable/parameter type information after the variable name. To do that, we use the of keyword:

my sub part ($is_sheep of Selector, *@data) returns List of Pair {...}

This makes sense, when you think about it. As we saw above, of tells the preceding container what type of value it's supposed to store, so $is_sheep of Selector tells $is_sheep it's supposed to store a Selector.

You Are What You Eat -- Not!

Careful though: we have to remember to use of there, not is. It would be a mistake to write:

my sub part ($is_sheep is Selector, *@data) returns List of Pair {...}

That's because Perl 6 variables and parameters can be more precisely typed than variables in most other languages. Specifically, Perl 6 allows us to specify both the storage type of a variable (i.e. what kinds of values it can contain) and the implementation class of the variable (i.e. how the variable itself is actually implemented).

The is keyword indicates what a particular container (variable, parameter, etc.) is — namely, how it's implemented and how it operates. Saying:

sub bark(@dogs is Pack) {...}

specifies that, although the @dogs parameter looks like an Array, it's actually implemented by the Pack class instead.

That declaration is not specifying that the @dogs variable stores Pack objects. In fact, it's not saying anything at all about what @dogs stores. Since its storage type has been left unspecified, @dogs inherits the default storage type — Any — which allows its elements to store any kind of scalar value.

If we'd wanted to specify that @dogs was a normal array, but that it can only store Dog objects, we'd need to write:

sub bark(@dogs of Dog) {...}

and if we'd wanted it to store Dogs but be implemented by the Pack class, we'd have to write:

sub bark(@dogs is Pack of Dog) {...}

Appending is SomeType to a variable or parameter is the Perl 6 equivalent of Perl 5's tie mechanism, except that the tying is part of the declaration. For example:

my $Elvis is King of Rock&Roll;

rather than a run-time function call like:

# Perl 5 code...
my $Elvis;
tie $Elvis, 'King', stores=>all('Rock','Roll');

In any case, the simple rule for of vs is is: to say what a variable stores, use of; to say how the variable itself works, use is.

Many Happy Returns

Meanwhile, we're still attempting to create a version of &part that returns a list of pairs. The easiest way to create and return a suitable list of pairs is to flatten a hash in a list context. This is precisely what the return statement does:

return *%herd;

using the splatty star. Although, in this case, we could have simply written:

return %herd;

since the declared return type (List of Pair) automatically imposes list context (and hence list flattening) on any return statement within &part.

Of course, it will only make sense to return a flattened hash if we've already partitioned the original data into that hash. So the bodies of the when and default statements inside &part have to be changed accordingly. Now, instead of pushing each element onto one of two separate arrays, we push each element onto one of the two arrays stored inside %herd:

for @data {
    when $is_sheep { push %herd{"sheep"}, $_ }
    default        { push %herd{"goats"}, $_ }
}

It Lives!!!!!

Assuming that each of the hash entries (%herd{"sheep"} and %herd{"goats"}) will be storing a reference to one of the two arrays, we can simply push each data element onto the appropriate array.

In Perl 5 we'd have to dereference each of the array references inside our hash before we could push a new element onto it:

# Perl 5 code...
push @{$herd{"sheep"}}, $_;

But in Perl 6, the first parameter of push expects an array, so if we give it an array reference, the interpreter can work out that it needs to dereference that first argument. So we can just write:

# Perl 6 code...
push %herd{"sheep"}, $_;

(Remember that, in Perl 6, hashes keep their % sigil, even when being indexed).

Initially, of course, the entries of %herd don't contain references to arrays at all; like all uninitialized hash entries, they contain undef. But, because push itself is defined like so:

sub push (@array is rw, *@data) {...}

an actual read-writable array is expected as the first argument. If a scalar variable containing undef is passed to such a parameter, Perl 6 detects the fact and autovivifies the necessary array, placing a reference to it into the previously undefined scalar argument. That behaviour makes it trivially easy to create subroutines that autovivify read/write arguments, in the same way that Perl 5's open does.

It's also possible to declare a read/write parameter that doesn't autovivify in this way, by using the is ref trait instead of is rw:

sub push_only_if_real_array (@array is ref, *@data) {...}

is ref still allows the parameter to be read from and written to, but throws an exception if the corresponding argument isn't already a real referent of some kind.
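
Autovivification has a familiar cousin in Python's standard `collections.defaultdict`. The sketch below is an analogy, not the is rw mechanism itself: a plain dict complains about a missing entry, whereas a `defaultdict(list)` conjures an empty list the first time an entry is touched (on reads as well as writes, which is one way it differs).

```python
from collections import defaultdict

# Plain dict: the entry must exist before we can append to it.
plain = {}
try:
    plain["sheep"].append("Dolly")
    failed = False
except KeyError:
    failed = True

# defaultdict(list): a missing entry springs into existence as an empty
# list the first time it is touched, much like an autovivified array.
herd = defaultdict(list)
herd["sheep"].append("Dolly")
herd["goats"].append("Billy")
herd["sheep"].append("Shaun")
```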

A Label by Any Other Name

Mandating fixed labels for the two arrays being returned seems a little inflexible, so we could add another — optional — parameter via which user-selected key names could be passed...

sub part (Selector $is_sheep,
          Str ?@labels is dim(2) = <<sheep goats>>,
          *@data
         ) returns List of Pair
{
    my ($sheep, $goats) is constant = @labels;
    my %herd = ($sheep=>[], $goats=>[]);
    for @data {
        when $is_sheep { push %herd{$sheep}, $_ }
        default        { push %herd{$goats}, $_ }
    }
    return *%herd;
}

Optional parameters in Perl 6 are prefixed with a ? marker (just as slurpy parameters are prefixed with *). Like required parameters, optional parameters are passed positionally, so the above example means that the second argument is expected to be an array of strings. This has important consequences for backwards compatibility — as we'll see shortly.

As well as declaring it to be optional (using a leading ?), we also declare the @labels parameter to have exactly two elements, by specifying the is dim(2) trait. The is dim trait takes one or more integer values. The number of values it's given specifies the number of dimensions the array has; the values themselves specify how many elements long the array is in each dimension. For example, to create a four-dimensional array of 7x24x60x60 elements, we'd declare it:

my @seconds is dim(7,24,60,60);

In the latest version of &part, the @labels is dim(2) declaration means that @labels is a normal one-dimensional array, but that it has only two elements in that one dimension.

The final component of the declaration of @labels is the specification of its default value. Any optional parameter may be given a default value, to which it will be bound if no corresponding argument is provided. The default value can be any expression that yields a value compatible with the type of the optional parameter.

In the above version of &part, for the sake of backwards compatibility we make the optional @labels default to the list of two strings <<sheep goats>> (using the new Perl 6 list-of-strings syntax).

Thus if we provide an array of two strings explicitly, the two strings we provide will be used as keys for the two pairs returned. If we don't specify the labels ourselves, "sheep" and "goats" will be used.

Name Your Poison

With the latest version of &part defined to return named pairs, we can now write:

@parts = part Animal::Cat, <<cat chattel>>, @animals;
#    returns: (cat=>[...], chattel=>[...])
# instead of: (sheep=>[...], goats=>[...])

The first argument (Animal::Cat) is bound to &part's $is_sheep parameter (as before). The second argument (<<cat chattel>>) is now bound to the optional @labels parameter, leaving the @animals argument to be flattened into a list and slurped up by the @data parameter.

We could also pass some or all of the arguments as named arguments. A named argument is simply a Perl 6 pair, where the key is the name of the intended parameter, and the value is the actual argument to be bound to that parameter. That makes sense: every parameter we ever declare has to have a name, so there's no good reason why we shouldn't be allowed to pass it an argument using that name to single it out.

An important restriction on named arguments is that they cannot come before positional arguments, or after any arguments that are bound to a slurpy array. Otherwise, there would be no efficient, single-pass way of working out which unnamed arguments belong to which parameters. Apart from that one overarching restriction (which Larry likes to think of as a zoning law), we're free to pass named arguments in any order we like. That's a huge advantage in any subroutine that takes a large number of parameters, because it means we no longer have to remember their order, just their names.

For example, using named arguments we could rewrite the above part call as any of the following:

# Use named argument to pass optional @labels argument...
@parts = part Animal::Cat, labels => <<cat chattel>>, @animals;

# Use named argument to pass both @labels and @data arguments...
@parts = part Animal::Cat, labels => <<cat chattel>>, data => @animals;

# The order in which named arguments are passed doesn't matter...
@parts = part Animal::Cat, data => @animals, labels => <<cat chattel>>;

# Can pass *all* arguments by name...
@parts = part is_sheep => Animal::Cat,
                labels => <<cat chattel>>,
                  data => @animals;

# And the order still doesn't matter...
@parts = part data => @animals,
              labels => <<cat chattel>>,
              is_sheep => Animal::Cat;

# etc.

As long as we never put a named argument before a positional argument, or after any unnamed data for the slurpy array, the named arguments can appear in any convenient order. They can even be pulled out of a flattened hash:

@parts = part *%args;

Who Gets the Last Piece of Cake?

We're making progress. Whether we pass its arguments by name or positionally, our call to part produces two partitions of the original list. Those partitions now come back with convenient labels that we can specify via the optional @labels parameter.

But now there's a problem. Even though we explicitly marked it as optional, it turns out that things can go horribly wrong if we don't actually supply that optional argument. Which is not very "optional". Worse, it means there's potentially a problem with every single legacy call to part that was coded before we added the optional parameter.

For example, consider the call:

@pets = ('Canis latrans', 'Felis sylvestris');

@parts = part /:i felis/, @pets;

# expected to return: (sheep=>['Felis sylvestris'], goats=>['Canis latrans'] )
# actually returns:   ('Canis latrans'=>[], 'Felis sylvestris'=>[])

What went wrong?

Well, when the call to part is matching its argument list against &part's parameter list, it works left-to-right as follows:

  1. The first parameter ($is_sheep) is declared as a scalar of type Selector, so the first argument must be a Code or a Class or a Hash or a Rule. It's actually a Rule, so the call mechanism binds that rule to $is_sheep.
  2. The second parameter (?@labels) is declared as an array of two strings, so the second argument must be an array of two strings. @pets is an array of two strings, so we bind that array to @labels. (Oops!)
  3. The third parameter (*@data) is declared as a slurpy array, so any remaining arguments should be flattened and bound to successive elements of @data. There are no remaining arguments, so there's nothing to flatten-and-bind, so @data remains empty.

That's the problem. If we pass the arguments positionally and there are not enough of them to bind to every parameter, the parameters at the start of the parameter list are bound before those towards the end. Even if those earlier parameters are marked optional. In other words, argument binding is "greedy" and (for obvious efficiency reasons) it never backtracks to see if there might be better ways to match arguments to parameters. Which means, in this case, that our data is being preemptively "stolen" by our labels.
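
Greedy left-to-right binding isn't peculiar to Perl 6. The following Python sketch reproduces the same mishap under the assumption that a function with an optional parameter placed before a catch-all is a fair stand-in for &part. The `is_felis` helper is invented for the example and takes the place of the /:i felis/ rule.

```python
def part(is_sheep, labels=("sheep", "goats"), *data):
    """Optional `labels` sits before the catch-all `data`, as in the article."""
    sheep, goats = labels
    herd = {sheep: [], goats: []}
    for item in data:
        herd[sheep if is_sheep(item) else goats].append(item)
    return herd

def is_felis(name):
    """Hypothetical selector: case-insensitive search for 'felis'."""
    return "felis" in name.lower()

pets = ["Canis latrans", "Felis sylvestris"]

# The list of pets is taken as `labels`; nothing is left over for `data`.
stolen = part(is_felis, pets)

# Spelling out the labels puts the pets back where they belong.
intended = part(is_felis, ("sheep", "goats"), *pets)
```

Here `stolen` comes back as two empty partitions keyed by the pets' names, exactly the "actually returns" result shown above.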

Pipeline to the Rescue!

So in general (and in the above example in particular) we need some way of indicating that a positional argument belongs to the slurpy data, not to some preceding optional parameter. One way to do that is to pass the ambiguous argument by name:

@parts = part /:i felis/, data=>@pets;

Then there can be no mistake about which argument belongs to what parameter.

But there's also a purely positional way to tell the call to part that @pets belongs to the slurpy @data, not to the optional @labels. We can pipeline it directly there. After all, that's precisely what the pipeline operator does: it binds the list on its blunt side to the slurpy array parameter of the call on its sharp side. So we could just write:

@parts = part /:i felis/ <== @pets;

# returns: (sheep=>['Felis sylvestris'], goats=>['Canis latrans'])

Because @pets now appears on the blunt end of a pipeline, there's no way it can be interpreted as anything other than the slurped data for the call to part.

A Natural Assumption

Of course, as a solution to the problem of legacy code, this is highly sub-optimal. It requires that every single pre-existing call to part be modified (by having a pipeline inserted). That will almost certainly be too painful.

Our new optional labels would be much more useful if their existence itself were also optional; that is, if we could somehow add a single statement to the start of any legacy code file and thereby cause &part to work like it used to in the good old days before labels. In other words, what we really want is an impostor &part subroutine that pretends that it only has the original two parameters ($is_sheep and @data), but then when it's called surreptitiously supplies an appropriate value for the new @labels parameter and quietly calls the real &part.

In Perl 6, that's easy. All we need is a good curry.

We write the following at the start of the file:

use List::Part;   # Supposing &part is defined in this module

my &part ::= &List::Part::part.assuming(labels => <<sheep goats>>);

That second line is a little imposing so let's break it down. First of all:

List::Part::part

is just the fully qualified name of the &part subroutine that's defined in the List::Part module (which, for the purposes of this example, is where we're saying &part lives). So:

&List::Part::part

is the actual Code object corresponding to the &part subroutine. So:

&List::Part::part.assuming(...)

is a method call on that Code object. This is the tricky bit, but it's no big deal really. If a Code object really is an object, we certainly ought to be able to call methods on it. So:

&List::Part::part.assuming(labels => <<sheep goats>>)

calls the assuming method of the Code object &part and passes the assuming method a named argument whose name is labels and whose value is the list of strings <<sheep goats>>.

Now, if we only knew what the .assuming method did...

That About Wraps it Up

What the .assuming(...) method does is place an anonymous wrapper around an existing Code object and then return a reference to (what appears to be) an entirely separate Code object. That new Code object works exactly like the original — except that the new one is missing one or more of the original's parameters.

Specifically, the parameter list of the wrapper subroutine doesn't have any of the parameters that were named in the call to .assuming. Instead those missing parameters are automatically filled in whenever the new subroutine is called, using the values of those named arguments to .assuming.

All of which simply means that the method call:

&List::Part::part.assuming(labels => <<sheep goats>>)

returns a reference to a new subroutine that acts like this:

sub ($is_sheep, *@data) {
    return part($is_sheep, labels=><<sheep goats>>, *@data)
}

That is, because we passed a labels => <<sheep goats>> argument to .assuming, we get back a subroutine without a labels parameter, but which then just calls part and inserts the value <<sheep goats>> for the missing parameter.

Or, as the code itself suggests:

&List::Part::part.assuming(labels => <<sheep goats>>)

gives us what &List::Part::part would become under the assumption that the value of @labels is always <<sheep goats>>.

How does that help with our source code backwards compatibility problem? It completely solves it. All we have to do is to make Perl 6 use that carefully wrapped, two-parameter version of &part in all our legacy code, instead of the full three-parameter one. To do that, we merely create a lexical subroutine of the same name and bind the wrapped version to that lexical:

my &part ::= &List::Part::part.assuming(labels => <<sheep goats>>);

The my &part declares a lexical subroutine named &part (in exactly the same way that a my $part would declare a lexical variable named $part). The my keyword says that it's lexical and the sigil says what kind of thing it is (& for subroutine, in this case). Then we simply install the wrapped version of &List::Part::part as the implementation of the new lexical &part and we're done.

Just as lexical variables hide package or global variables of the same name, so too a lexical subroutine hides any package or global subroutine of the same name. So my &part hides the imported &List::Part::part, and every subsequent call to part(...) in the rest of the current scope calls the lexical &part instead.

Because that lexical version is bound to a label-assuming wrapper, it doesn't have a labels parameter, so none of the legacy calls to &part are broken. Instead, the lexical &part just silently "fills in" the labels parameter with the value we originally gave to .assuming.

If we needed to add another partitioning call within the scope of that lexical &part, but we wanted to use those sexy new non-default labels, we could do so by calling the actual three-parameter &part via its fully qualified name, like so:

@parts = List::Part::part(Animal::Cat, <<cat chattel>>, @animals);
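
The wrapping trick translates readily to other languages. Here is a minimal Python sketch of the idea, assuming a three-parameter `part` like the one in this section. The helper `assuming_labels` is a name invented for this example and is far less general than the .assuming method described above.

```python
def part(is_sheep, labels, *data):
    """Stand-in for the full three-parameter &List::Part::part."""
    sheep, goats = labels
    herd = {sheep: [], goats: []}
    for item in data:
        herd[sheep if is_sheep(item) else goats].append(item)
    return herd

def assuming_labels(func, labels):
    """Hypothetical helper: return a wrapper that lacks the `labels` parameter."""
    def wrapper(is_sheep, *data):
        # Quietly fill in the missing parameter and call the real routine.
        return func(is_sheep, labels, *data)
    return wrapper

# Legacy code sees a two-parameter routine with the labels filled in.
legacy_part = assuming_labels(part, ("sheep", "goats"))

pets = ["Canis latrans", "Felis sylvestris"]
result = legacy_part(lambda name: "felis" in name.lower(), *pets)
```

Python's standard `functools.partial` offers a more general version of the same idea for fixing leading positional or keyword arguments.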

Pair Bonding

One major advantage of having &part return a list of pairs rather than a simple list of arrays is that now, instead of positional binding:

# with original (list-of-arrays) version of &part...
(@cats, @chattels) := part Animal::Cat <== @animals;

we can do "named binding":

# with latest (list-of-pairs) version of &part...
(goats=>@chattels, sheep=>@cats) := part Animal::Cat <== @animals;

Named binding???

Well, we just learned that we can bind arguments to parameters by name, but earlier we saw that parameter binding is merely an implicit form of explicit := binding. So the inevitable conclusion is that the only reason we can bind parameters by name is because := supports named binding.

And indeed it does. If a := finds a list of pairs on its righthand side, and a list of simple variables on its lefthand side, it uses named binding instead of positional binding. That is, instead of binding first to first, second to second, etc., the := uses the key of each righthand pair to determine the name of the variable on its left to which the value of the pair should be bound.

That sounds complicated, but the effect is very easy to understand:

# Positional binding...
($who, $why) := ($because, "me");
# same as: $who := $because; $why := "me";

# Named binding...
($who, $why) := (why => $because, who => "me");
# same as: $who := "me"; $why := $because;

Even more usefully, if the binding operator detects a list of pairs on its left and another list of pairs on its right, it binds the value of the first pair on the right to the value of the identically named pair on the left (again, regardless of where the two pairs appear in their respective lists). Then it binds the value of the second pair on the right to the value of the identically named pair on the left, and so on.

That means we can set up a named := binding in which the names of the bound variables don't even have to match the keys of the values being bound to them:

# Explicitly named binding...
(who=>$name, why=>$reason) := (why => $because, who => "me");
# same as: $name := "me"; $reason := $because;

The most common use for that feature will probably be to create "free-standing" aliases for particular entries in a hash:

(who=>$name, why=>$reason) := *%explanation;
# same as: $name := %explanation{who}; $reason := %explanation{why};

or to convert particular hash entries into aliases for other variables:

*%details := (who=>"me", why=>$because);
# same as: %details{who} := "me", %details{why} := $because;
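
To get a feel for the matching rules, here is a rough sketch in Python rather than Perl 6 (chosen only so that the example can actually be run). The helper names bind_positional and bind_named are hypothetical, and the sketch models only which value lands under which name: it copies values, whereas := creates true aliases.

```python
def bind_positional(names, values):
    """First name gets the first value, second gets the second, and so on."""
    return dict(zip(names, values))


def bind_named(targets, pairs):
    """Match by key. `targets` maps each key to the variable name that
    should receive it; `pairs` maps each key to the value being bound."""
    return {variable: pairs[key] for key, variable in targets.items()}


# Models: ($who, $why) := ($because, "me")
positional = bind_positional(["who", "why"], ["because", "me"])

# Models: (who=>$name, why=>$reason) := (why => $because, who => "me")
named = bind_named({"who": "name", "why": "reason"},
                   {"why": "because", "who": "me"})
```

Under these assumptions, positional pairs up first with first and second with second, while named pairs each value with its target purely by key, whatever order the keys arrive in.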

An Argument in Name Only

It's pretty cool that Perl 6 automatically lets us specify positional arguments — and even return values — by name rather than position.

But what if we'd prefer that some of our arguments could only be specified by name? After all, the @labels parameter isn't really in the same league as the $is_sheep parameter: it's only an option, and one that most people probably won't use. It shouldn't really be a positional parameter at all.

We can specify that the labels argument is only to be passed by name...by changing the previous declaration of the @labels parameter very slightly:

sub part (Selector $is_sheep,
          Str +@labels is dim(2) = <<sheep goats>>,
          *@data
         ) returns List of Pair
{
    my ($sheep, $goats) is constant = @labels;
    my %herd = ($sheep=>[], $goats=>[]);
    for @data {
        when $is_sheep { push %herd{$sheep}, $_ }
        default        { push %herd{$goats}, $_ }
    }
    return *%herd;
}

In fact, there's only a single character's worth of difference in the whole definition. Whereas before we declared the @labels parameter like this:

Str ?@labels is dim(2) = <<sheep goats>>

now we declare it like this:

Str +@labels is dim(2) = <<sheep goats>>

Changing that ? prefix to a + changes @labels from an optional positional-or-named parameter to an optional named-only parameter. Now if we want to pass in a labels argument, we can only pass it by name. Attempting to pass it positionally will result in some extreme prejudice from the compiler.

Named-only parameters are still optional parameters however, so legacy code that omits the labels:

%parts = part Animal::Cat <== @animals;

still works fine (and still causes the @labels parameter to default to <<sheep goats>>).

Better yet, converting @labels from a positional to a named-only parameter also solves the problem of legacy code of the form:

%parts = part Animal::Cat, @animals;

@animals can't possibly be intended for the @labels parameter now. We explicitly specified that labels can only be passed by name, and the @animals argument isn't named.

So named-only parameters give us a clean way of upgrading a subroutine and still supporting legacy code. Indeed, in many cases the only reasonable way to add a new parameter to an existing, widely used, Perl 6 subroutine will be to add it as a named-only parameter.

Careful with that Arg, Eugene!

Of course, there's no free lunch here. The cost of solving the legacy code problem is that we changed the meaning of any more recent code like this:

%parts = part Animal::Cat, <<cat chattel>>, @animals;     # Oops!

When @labels was positional-or-named, the <<cat chattel>>  argument could only be interpreted as being intended for @labels. But now, there's no way it can be for @labels (because it isn't named), so Perl 6 assumes that the list is just part of the slurped data. The two-element list will now be flattened (along with @animals), resulting in a single list that is then bound to the @data parameter, as if we'd written:

%parts = part Animal::Cat <== 'cat', 'chattel', @animals;

This is yet another reason why named-only should probably be the first choice for optional parameters.
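
The same trade-off can be seen in Python, whose keyword-only parameters behave much like named-only ones. This is only an analogy, not Perl 6: the part function below is a hypothetical Python rendering, is_even is an invented selector, and Python keeps a positionally passed tuple as one datum instead of flattening it.

```python
def part(is_sheep, *data, labels=("sheep", "goats")):
    """Partition data; labels follows *data, so it can only be passed by name."""
    sheep, goats = labels
    herd = {sheep: [], goats: []}
    for item in data:
        herd[sheep if is_sheep(item) else goats].append(item)
    return herd


def is_even(n):
    return isinstance(n, int) and n % 2 == 0


# Labels omitted: the default applies, so older calls keep working.
default_labels = part(is_even, 1, 2, 3, 4)

# Labels passed by name.
named_labels = part(is_even, 1, 2, 3, 4, labels=("even", "odd"))

# Labels passed positionally: quietly treated as one more piece of data.
oops = part(is_even, ("even", "odd"), 1, 2)
```

In the last call the would-be labels end up among the goats, which is the Python flavour of the "Oops!" above.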

Temporal Life Insurance

Being able to add named-only parameters to existing subroutines is an important way of future-proofing any calls to the subroutine. So long as we continue to add only named-only parameters to &part, the order in which the subroutine expects its positional and slurpy arguments will be unchanged, so every existing call to part will continue to work correctly.

Curiously, the reverse is also true. Named-only parameters also provide us with a way to "history-proof" subroutine calls. That is, we can allow a subroutine to accept named arguments that it doesn't (yet) know how to handle! Like so:

sub part (Selector $is_sheep,
          Str +@labels is dim(2) = <<sheep goats>>,
          *%extras,         # <-- NEW PARAMETER ADDED HERE
          *@data,
         ) returns List of Pair
{
    # Handle extras...
    carp "Ignoring unknown named parameter '$_'" for keys %extras;

    # Remainder of subroutine as before...
    my ($sheep, $goats) is constant = @labels;
    my %herd = ($sheep=>[], $goats=>[]);
    for @data {
        when $is_sheep { push %herd{$sheep}, $_ }
        default        { push %herd{$goats}, $_ }
    }
    return *%herd;
}

# and later...

%parts = part Animal::Cat, labels=><<Good Bad>>, max=>3, @data;

# warns: "Ignoring unknown named parameter 'max' at future.pl, line 19"

The *%extras parameter is a "slurpy hash". Just as the slurpy array parameter (*@data) sucks up any additional positional arguments for which there's no explicit parameter, a slurpy hash sucks up any named arguments that are unaccounted for. In the above example, for instance, &part has no $max parameter, so passing the named argument max=>3 would normally produce a (compile-time) exception:

Invalid named parameter ('max') in call to &part

However, because &part now has a slurpy hash, that extraneous named argument is simply bound to the appropriate entry of %extras and (in this example) used to generate a warning.
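
Python's **kwargs plays a similar role to a slurpy hash, so the idea can be sketched there. This is an analogy under stated assumptions, not the Perl 6 mechanism: the part function is a hypothetical Python rendering, and warnings.warn stands in for carp.

```python
import warnings


def part(is_sheep, *data, labels=("sheep", "goats"), **extras):
    # Handle extras: warn about each named argument we don't recognise.
    for name in extras:
        warnings.warn(f"Ignoring unknown named parameter '{name}'")

    # Remainder of the function as before.
    sheep, goats = labels
    herd = {sheep: [], goats: []}
    for item in data:
        herd[sheep if is_sheep(item) else goats].append(item)
    return herd


# Capture the warnings so they can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parts = part(str.isupper, "A", "b", "C", labels=("Good", "Bad"), max=3)

messages = [str(w.message) for w in caught]
```

Here the unknown max argument is collected into extras and reported, while the call itself still succeeds.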

The more common use of such slurpy hashes is to capture the named arguments that are passed to an object constructor and have them automatically forwarded to the constructors of the appropriate ancestral classes. We'll explore that technique in Exegesis 12.

The Greatest Thing Since Sliced Arrays

So far we've progressively extended &part from the first simple version that only accepted subroutines as selectors, to the most recent versions that can now also use classes, rules, or hashes to partition their data.

Suppose we also wanted to allow the user to specify a list of integer indices as the selector, and thereby allow &part to separate a slice of data from its "anti-slice". In other words, instead of:

%data{2357}  = [ @data[2,3,5,7]            ];
%data{other} = [ @data[0,1,4,6,8..@data-1] ];

we could write:

%data = part [2,3,5,7], labels=>["2357","other"], @data;

We could certainly extend &part to do that:

type Selector ::= Code | Class | Rule | Hash | (Array of Int);

sub part (Selector $is_sheep,
          Str +@labels is dim(2) = <<sheep goats>>,
          *@data
         ) returns List of Pair
{
    my ($sheep, $goats) is constant = @labels;
    my %herd = ($sheep=>[], $goats=>[]);
    if $is_sheep.isa(Array of Int) {
        for @data.kv -> $index, $value {
            if $index == any(@$is_sheep) { push %herd{$sheep}, $value }
            else                         { push %herd{$goats}, $value }
        }
    }
    else {
        for @data {
            when $is_sheep { push %herd{$sheep}, $_ }
            default        { push %herd{$goats}, $_ }
        }
    }
    return *%herd;
}

# and later, if there's a prize for finishing 1st, 2nd, 3rd, or last...

%prize = part [0, 1, 2, @horses-1],
              labels => << placed  also_ran >>,
              @horses;

Note that this is the first time we couldn't just add another class to the Selector type and rely on the smart-match inside the when to work out how to tell "sheep" from "goats". The problem here is that when the selector is an array of integers, the value of each data element no longer determines its sheepishness/goatility. It's now the element's position (i.e. its index) that decides its fate. Since our existing smart-match compares values, not positions, the when can't pick out the right elements for us. Instead, we have to consider both the index and the value of each data element.

To do that we use the @data array's .kv method. Just as calling the .kv method on a hash returns key, value, key, value, key, value, etc., so too calling the .kv method on an array returns index, value, index, value, index, value, etc. Then we just use a parameterized block as our for block, specifying that it has two arguments. That causes the for to grab two elements of the list it's iterating (i.e. one index and one value) on each iteration.

Then we simply test to see if the current index is any of those specified in $is_sheep's array and, if so, we push the corresponding value:

for @data.kv -> $index, $value {
    if $index == any(@$is_sheep) { push %herd{$sheep}, $value }
    else                         { push %herd{$goats}, $value }
}
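
For comparison, the index-driven loop can be sketched in Python, where enumerate yields the same index, value pairs that .kv provides. The function name part_by_index and the horse names are hypothetical, and a set membership test is used here as a stand-in for the any(...) junction.

```python
def part_by_index(indices, data, labels=("sheep", "goats")):
    sheep, goats = labels
    wanted = set(indices)                  # stands in for any(@$is_sheep)
    herd = {sheep: [], goats: []}
    for index, value in enumerate(data):   # index, value pairs, like .kv
        herd[sheep if index in wanted else goats].append(value)
    return herd


horses = ["Ajax", "Bolt", "Comet", "Dash", "Echo"]

# A prize for finishing 1st, 2nd, 3rd, or last.
prize = part_by_index([0, 1, 2, len(horses) - 1], horses,
                      labels=("placed", "also_ran"))
```

Note that it is each horse's position, not its name, that decides which list it joins.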

A Parting of the ... err ... Parts

That works okay, but it's not perfect. In fact, as it's presented above the &part subroutine is now both an ugly solution and an inefficient one.

It's ugly because &part is now twice as long as it was before. The two branches of control-flow within it are similar in form but quite different in function. One partitions the data according to the contents of a datum; the other, according to a datum's position in @data.

It's inefficient because it effectively tests the type of the selector argument twice: once (implicitly) when it's first bound to the $is_sheep parameter, and then again (explicitly) in the call to .isa.

It would be cleaner and more maintainable to break these two nearly unrelated behaviours out into separate subroutines. And it would be more efficient if we could select between those two subroutines by testing the type of the selector only once.

Of course, in Perl 6 we can do just that — with a multisub.

What's a multisub? It's a collection of related subroutines (known as "variants"), all of which have the same name but different parameter lists. When the multisub is called and passed a list of arguments, Perl 6 examines the types of the arguments, finds the variant with the same name and the most compatible parameter list, and calls that variant.

By the way, you might be more familiar with the term multimethod. A multisub is a multiply dispatched subroutine, in the same way that a multimethod is a multiply dispatched method. There'll be much more about those in Exegesis 12.

Multisubs provide facilities somewhat akin to function overloading in C++. We set up several subroutines with the same logical name (because they implement the same logical action). But each takes a distinct set of argument types and does the appropriate things with those particular arguments.

However, multisubs are more "intelligent" than mere overloaded subroutines. With overloaded subroutines, the compiler examines the compile-time types of the subroutine's arguments and hard codes a call to the appropriate variant based on that information. With multisubs, the compiler takes no part in the variant selection process. Instead, the interpreter decides which variant to invoke at the time the call is actually made. It does that by examining the run-time type of each argument, making use of its inheritance relationships to resolve any ambiguities.

To see why a run-time decision is better, consider the following code:

class Lion is Cat {...}    # Lion inherits from Cat

multi sub feed(Cat  $c) { pat $c; my $glop = open 'Can'; spoon_out($glop); }
multi sub feed(Lion $l) { $l.stalk($prey) and kill; }

my Cat $fluffy = Lion.new;

feed($fluffy);

In Perl 6, the call to feed will correctly invoke the second variant because the interpreter knows that $fluffy actually contains a reference to a Lion object at the time the call is made (even though the nominal type of the variable is Cat).

If Perl 6 multisubs worked like C++'s function overloading, the call to feed($fluffy) would invoke the first version of feed, because all that the compiler knows for sure at compile-time is that $fluffy is declared to store Cat objects. That's precisely why Perl 6 doesn't do it that way. We prefer to leave the hand-feeding of lions to other languages.
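
Run-time selection of this kind can be tried out in Python using functools.singledispatch. Treat it as a loose analogy: singledispatch looks only at the first argument, the strings returned are invented stand-ins for the two feeding strategies, and the Cat annotation on fluffy is purely nominal.

```python
from functools import singledispatch


class Cat:
    pass


class Lion(Cat):    # Lion inherits from Cat
    pass


@singledispatch
def feed(animal):
    # Fallback when no registered variant matches.
    raise TypeError(f"don't know how to feed {type(animal).__name__}")


@feed.register(Cat)
def _feed_cat(cat):
    return "pat it and spoon out the can"


@feed.register(Lion)
def _feed_lion(lion):
    return "let it stalk its prey"


fluffy: Cat = Lion()    # declared as a Cat, but actually holds a Lion
meal = feed(fluffy)
```

The variant is picked from the object's actual class at the moment of the call, so fluffy gets the Lion treatment despite the Cat declaration.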

Many Parts

As the above example shows, in Perl 6, multisub variants are defined by prefixing the sub keyword with another keyword: multi. The parameters that the interpreter is going to consider when deciding which variant to call are specified to the left of a colon (:), with any other parameters specified to the right. If there is no colon in the parameter list (as above), all the parameters are considered when deciding which variant to invoke.

We could re-factor the most recent version of &part like so:

type Selector ::= Code | Class | Rule | Hash;

multi sub part (Selector $is_sheep:
                Str +@labels is dim(2) = <<sheep goats>>,
                *@data
               ) returns List of Pair
{
    my ($sheep, $goats) is constant = @labels;
    my %herd = ($sheep=>[], $goats=>[]);
    for @data {
        when $is_sheep { push %herd{$sheep}, $_ }
        default        { push %herd{$goats}, $_ }
    }
    return *%herd;
}

multi sub part (Int @sheep_indices:
                Str +@labels is dim(2) = <<sheep goats>>,
                *@data
               ) returns List of Pair
{
    my ($sheep, $goats) is constant = @labels;
    my %herd = ($sheep=>[], $goats=>[]);
    for @data.kv -> $index, $value {
        if $index == any(@sheep_indices) { push %herd{$sheep}, $value }
        else                             { push %herd{$goats}, $value }
    }
    return *%herd;
}

Here we create two variants of a single multisub named &part. The first variant will be invoked whenever &part is called with a Selector object as its first argument (that is, when it is passed a Code or Class or Rule or Hash object as its selector).

The second variant will be invoked only if the first argument is an Array of Int. If the first argument is anything else, an exception will be thrown.

Notice how similar the body of the first variant is to the earlier subroutine versions. Likewise, the body of the second variant is almost identical to the if branch of the previous (subroutine) version.

Notice too how the body of each variant only has to deal with the particular type of selector that its first parameter specifies. That's because the interpreter has already determined what type of thing the first argument was when deciding which variant to call. A particular variant will only ever be called if the first argument is compatible with that variant's first parameter.
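
The same refactoring can be approximated in Python, again with functools.singledispatch choosing a variant from the type of the first argument. Everything here is an assumption made for illustration: the helper _herd is hypothetical, only plain functions and lists are registered as selectors, and Python does not check that the list really contains integers.

```python
from functools import singledispatch
from types import FunctionType


def _herd(labels):
    """Shared setup: the two label names and an empty herd."""
    sheep, goats = labels
    return sheep, goats, {sheep: [], goats: []}


@singledispatch
def part(selector, *data, labels=("sheep", "goats")):
    # No variant matches the selector's type, so raise an exception.
    raise TypeError(f"no variant of part for {type(selector).__name__}")


@part.register(FunctionType)
def _part_by_code(is_sheep, *data, labels=("sheep", "goats")):
    sheep, goats, herd = _herd(labels)
    for item in data:
        herd[sheep if is_sheep(item) else goats].append(item)
    return herd


@part.register(list)
def _part_by_indices(sheep_indices, *data, labels=("sheep", "goats")):
    sheep, goats, herd = _herd(labels)
    for index, value in enumerate(data):
        herd[sheep if index in sheep_indices else goats].append(value)
    return herd


by_value = part(lambda n: n > 2, 1, 2, 3, 4)
by_index = part([0, 3], "a", "b", "c", "d")
```

Each variant's body deals with exactly one kind of selector, and no explicit type test appears inside either of them.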

Call Me Early

Suppose we wanted more control over the default labels that &part uses for its return values. For example, suppose we wanted to be able to prompt the user for the appropriate defaults — before the program runs.

The default value for an optional parameter can be any valid Perl expression whose result is compatible with the type of the parameter. We could simply write:

my Str @def_labels;

BEGIN {
    print "Enter 2 default labels: ";
    @def_labels = split(/\s+/, <>, 3).[0..1];
}

sub part (Selector $is_sheep,
          Str +@labels is dim(2) = @def_labels,
          *@data
         ) returns List of Pair
{
    # body as before
}

We first define an array variable:

my Str @def_labels;

This will ultimately serve as the expression that the @labels parameter uses as its default:

Str +@labels is dim(2) = @def_labels

Then we merely need a BEGIN block (so that it runs before the program starts) in which we prompt for the required information:

print "Enter 2 default labels: ";

read it in:

<>

split the input line into three pieces using whitespace as a separator:

split(/\s+/, <>, 3)

grab the first two of those pieces:

split(/\s+/, <>, 3).[0..1]

and assign them to @def_labels:

@def_labels = split(/\s+/, <>, 3).[0..1];

We're now guaranteed that @def_labels has the necessary default labels before &part is ever called.

Core Breach

Built-ins like &split can also be given named arguments in Perl 6, so, alternatively, we could write the BEGIN block like so:

BEGIN {
    print "Enter 2 default labels: ";
    @def_labels = split(str=><>, max=>3).[0..1];
}

Here we're leaving out the split pattern entirely and making use of &split's default split-on-whitespace behaviour.

Incidentally, an important goal of Perl 6 is to make the language powerful enough to natively implement all its own built-ins. We won't actually implement it that way, since screamingly fast performance is another goal, but we do want to make it easy for anyone to create their own versions of any Perl built-in or control structure.

So, for example, &split would be declared like this:

sub split( Rule|Str ?$sep = /\s+/,
           Str ?$str = $CALLER::_,
           Int ?$max = Inf
          )
{
    # implementation here
}

Note first that every one of &split's parameters is optional, and that the defaults are the same as in Perl 5. If we omit the separator pattern, the default separator is whitespace; if we omit the string to be split, &split splits the caller's $_ variable; if we omit the "maximum number of pieces to return" argument, there is no upper limit on the number of splits that may be made.

Note that we can't just declare the second parameter like so:

Str ?$str = $_,

That's because, in Perl 6, the $_ variable is lexical (not global), so a subroutine doesn't have direct access to the $_ of its caller. That means that Perl 6 needs a special way to access a caller's $_.

That special way is via the CALLER:: namespace. Writing $CALLER::_ gives us access to the $_ of whatever scope called the current subroutine. This works for other variables too ($CALLER::foo, @CALLER::bar, etc.) but is rarely useful, since we're only allowed to use CALLER:: to access variables that already exist, and $_ is about the only variable that a subroutine can rely upon to be present in any scope it might be called from.

A Constant Source of Joy

Setting up the @def_labels array at compile-time and then using it as the default for the @labels parameter works fine, but there's always the chance that the array might somehow be accidentally reassigned later. If that's not desirable, then we need to make the array a constant. In Perl 6 that looks like this:

my @def_labels is constant = BEGIN {
    print "Enter 2 default labels: ";
    split(/\s+/, <>, 3).[0..1];
};

The is constant trait is the way we prevent any Perl 6 variable from being reassigned after it's been declared. It effectively replaces the STORE method of the variable's implementation with one that throws an exception whenever it's called. It also instructs the compiler to keep an eye out for compile-time-detectable modifications to the variable and die violently if it finds any.

Whenever a variable is declared is constant it must be initialized as part of its declaration. In this case we use the return value of a BEGIN block as the initializer value.

Oh, by the way, BEGIN blocks have return values in Perl 6. Specifically, they return the value of the last statement executed inside them (just like a Perl 5 do or eval block does, except that BEGINs do it at compile-time).

In the above example the result of the BEGIN is the return value of the call to split. So @def_labels is initialized to the two default labels, which cannot thereafter be changed.

BEGIN at the Scene of the Crime

Of course, the @def_labels array is really just a temporary storage facility for transferring the results of the BEGIN block to the default value of the @labels parameter.

We could easily do away with it entirely, by simply putting the BEGIN block right there in the parameter list:

sub part (Selector $is_sheep,
          Str +@labels is dim(2) = BEGIN {
                      print "Enter 2 default labels: "; 
                      split(/\s+/, <>, 3).[0..1];
                    },
          *@data
         ) returns List of Pair
{
    # body as before
}

And that works fine.
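
Python happens to evaluate a default-value expression once, when the def statement executes, which makes it a convenient (if loose) analogy for a default computed early by a BEGIN block. In this sketch read_labels is a hypothetical helper and the fixed string stands in for the line the user would type. Bear in mind that "when def executes" is still run-time in Python, not compile-time.

```python
reads = []    # records every time the default labels are computed


def read_labels(n, line):
    """Split line into at most n+1 whitespace-separated pieces; keep the first n."""
    reads.append(line)
    return tuple(line.split(None, n)[:n])


# The default expression below is evaluated here, as the function is
# defined, and not again on each later call to part.
def part(is_sheep, *data, labels=read_labels(2, "cat chattel and the rest")):
    sheep, goats = labels
    herd = {sheep: [], goats: []}
    for item in data:
        herd[sheep if is_sheep(item) else goats].append(item)
    return herd


first = part(str.islower, "a", "B")
second = part(str.islower, "c", "D")
```

Both calls share the labels that were computed once up front, and reads holds a single entry however many times part is called.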

Macro Biology

The only problem is that it's ugly, brutish, and not at all short. If only there were some way of calling the BEGIN block at that point without having to put the actual BEGIN block at that point....

Well, of course there is such a way. In Perl 6 a block is just a special kind of nameless subroutine... and a subroutine is just a special name-ful kind of block. So it shouldn't really come as a surprise that BEGIN blocks have a name-ful, subroutine-ish counterpart. They're called macros and they look and act very much like ordinary subroutines, except that they run at compile-time.

So, for example, we could create a compile-time subroutine that requests and returns our user-specified labels:

macro request(int $n, Str $what) returns List of Str {
    print "Enter $n $what: ";
    my @def_labels = split(/\s+/, <>, $n+1);
    return { @def_labels[0..$n-1] };
}

# and later...

sub part (Selector $is_sheep,
          Str +@labels is dim(2) = request(2,"default labels"),
          *@data
         ) returns List of Pair
{
    # body as before
}

Calls to a macro are invoked during compilation (not at run-time). In fact, like a BEGIN block, a macro call is executed as soon as the parser has finished parsing it. So, in the above example, when the parser has parsed the declaration of the @labels parameter and then the = sign indicating a default value, it comes across what looks like a subroutine call. As soon as it has parsed that subroutine call (including its argument list) it will detect that the subroutine &request is actually a macro, so it will immediately call &request with the specified arguments (2 and "default labels").

Whenever a macro like &request is invoked, the parser itself intercepts the macro's return value and integrates it somehow back into the parse tree it is in the middle of building. If the macro returns a block (as &request does in the above example), the parser extracts the contents of that block and inserts the parse tree of those contents into the program's parse tree. In other words, if a macro returns a block, a precompiled version of whatever is inside the block replaces the original macro call.

Alternatively, a macro can return a string. In that case, the parser inserts that string back into the source code in place of the macro call and then reparses it. This means we could also write &request like this:

macro request(int $n, Str $what) returns List of Str {
    print "Enter $n $what: ";
    return "<< @(split(/\s+/, <>, $n+1).[0..$n-1]) >>";
}

in which case it would return a string containing the characters "<<", followed by the two labels that the request call reads in, followed by closing double angles. The parser would then substitute that string in place of the macro call, discover it was a <<...>> word list, and use that list as the default labels.

Macros for BEGIN-ners

Macros are enormously powerful. In fact, in Perl 6, we could implement the functionality of BEGIN itself using a macro:

macro MY_BEGIN (&block) {
    my $context = want;
    if $context ~~ List {
        my @values = block();
        return { *@values };
    }
    elsif $context ~~ Scalar {
        my $value = block();
        return { $value };
    }
    else {
        block();
        return;
    }
}

The MY_BEGIN macro declares a single parameter (&block). Because that parameter is specified with the Code sigil (&), the macro requires that the corresponding argument must be a block or subroutine of some type. Within the body of &MY_BEGIN that argument is bound to the lexical subroutine &block (just as a $foo parameter would bind its corresponding argument to a lexical scalar variable, or a @foo parameter would bind its argument to a lexical array).

&MY_BEGIN then calls the want function, which is Perl 6's replacement for wantarray. want returns a scalar value that simultaneously represents any of the contexts in which the current subroutine was called. In other words, it returns a disjunction of various classes. We then compare that context information against the three possibilities: List, Scalar, and (by elimination) Void.

If MY_BEGIN was called in a list context, we evaluate its block/closure argument in a list context, capture the results in an array (@values), and then return a block containing the contents of that array flattened back to a list. In a scalar context we do much the same thing, except that MY_BEGIN's argument is evaluated in scalar context and a block containing that scalar result is returned. In a void context (the only remaining possibility), the argument is simply evaluated and nothing is returned.

In the first two cases, returning a block causes the original macro call to be replaced by a parse tree, specifically, the parse tree representing the values that resulted from executing the original block passed to MY_BEGIN.

In the final case — a void context — the compiler isn't expecting to replace the macro call with anything, so it doesn't matter what we return, just as long as we evaluate the block. The macro call itself is simply eliminated from the final parse-tree.

Note that MY_BEGIN could be written more concisely than it was above, by taking advantage of the smart-matching behaviour of a switch statement:

macro MY_BEGIN (&block) {
    given want {
        when List   { my @values = block(); return { *@values }; }
        when Scalar { my $value  = block(); return {  $value  }; }
        when Void   {              block(); return               }
    }
}

A Macro by Any Other Syntax ...

Because macros are called by the parser, it's possible to have them interact with the parser itself. In particular, it's possible for a macro to tell the parser how the macro's own argument list should be parsed.

For example, we could give the &request macro its own non-standard argument syntax, so that instead of calling it as:

request(2,"default labels")

we could just write:

request(2 default labels)

To do that we'd define &request like so:

macro request(int $n, Str $what) 
    is parsed( /:w \( (\d+) (.*?) \) / )
    returns List of Str
{
    print "Enter $n $what: ";
    my @def_labels = split(/\s+/, <>, $n+1);
    return { @def_labels[0..$n-1] };
}

The is parsed trait tells the parser what to look for immediately after it encounters the macro's name. In the above example, the parser is told that, after encountering the sequence "request" it should expect to match the pattern:

/ :w        # Allow whitespace between the tokens
  \(        # Match an opening paren
  (\d+)     # Capture one-or-more digits
  (.*?)     # Capture everything else up to...
  \)        # ...a closing paren
/

Note that the one-or-more-digits and the anything-up-to-paren bits of the pattern are in capturing parentheses. This is important because the list of substrings that an is parsed pattern captures is then used as the argument list to the macro call. The captured digits become the first argument (which is then bound to the $n parameter) and the captured "everything else" becomes the second argument (and is bound to $what).

Normally, of course, we don't need to specify the is parsed trait when setting up a macro. Since a macro is a kind of subroutine, by default its argument list is parsed the same as any other subroutine's — as a comma-separated list of Perl 6 expressions.
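
The way captures turn into arguments can be tried out with an ordinary regex engine. The following Python sketch is an approximation only: parse_request_args is a hypothetical helper, the pattern uses Python's re syntax rather than Perl 6 rules, and explicit \s* stands in for the whitespace tolerance that :w provides.

```python
import re

# Rough counterpart of /:w \( (\d+) (.*?) \) /
REQUEST_ARGS = re.compile(r"\(\s*(\d+)\s*(.*?)\s*\)")


def parse_request_args(source):
    """Return (n, what) captured from the text that follows the macro name."""
    match = REQUEST_ARGS.fullmatch(source)
    if match is None:
        raise SyntaxError(f"cannot parse request arguments: {source!r}")
    count, what = match.groups()
    return int(count), what


# The first capture would be bound to $n, the second to $what.
n, what = parse_request_args("(2 default labels)")
```

The digits become the first argument and everything else up to the closing paren becomes the second, just as described above.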

Refactoring Parameter Lists

By this stage, you might be justified in feeling that &part's parameter list is getting just a leeeeettle too sophisticated for its own good. Moreover, if we were using the multisub version, that complexity would have to be repeated in every variant.

Philosophically though, that's okay. The later versions of &part are doing some fairly sophisticated things, and the complexity required to achieve that has to go somewhere. Putting that extra complexity in the parameter list means that the body of &part stays much simpler, as do any calls to &part.

That's the whole point: Complexify locally to simplify globally. Or maybe: Complexify declaratively to simplify procedurally.

But there's precious little room for the consolations of philosophy when you're swamped in code and up to your assembler in allomorphism. So, rather than having to maintain those complex and repetitive parameter lists, we might prefer to factor out the common infrastructure. With, of course, yet another macro:

macro PART_PARAMS {
    my ($sheep,$goats) = request(2 default labels);
    return "Str +\@labels is dim(2) = <<$sheep $goats>>, *\@data";
}

multi sub part (Selector $is_sheep, PART_PARAMS) {
    # body as before
}

multi sub part (Int @sheep_indices, PART_PARAMS) {
    # body as before
}

Here we create a macro named &PART_PARAMS that requests and extracts the default labels and then interpolates them into a string, which it returns. That string then replaces the original macro call.

Note that we reused the &request macro within the &PART_PARAMS macro. That's important, because it means that, as the body of &PART_PARAMS is itself being parsed, the default names are requested and interpolated into &PART_PARAMS's code. That ensures that the user-supplied default labels are hardwired into &PART_PARAMS even before it's compiled. So every subsequent call to PART_PARAMS will return the same default labels.

On the other hand, if we'd written &PART_PARAMS like this:

macro PART_PARAMS {
    print "Enter 2 default labels: ";
    my ($sheep,$goats) = split(/\s+/, <>, 3);
    return "Str +\@labels is dim(2) = <<$sheep $goats>>, *\@data";
}

then each time we used the &PART_PARAMS macro in our code, it would re-prompt for the labels. So we could give each variant of &part its own default labels. Either approach is fine, depending on the effect we want to achieve. It's really just a question of how much work we're willing to put in in order to be Lazy.

Smooth Operators

By now it's entirely possible that your head is spinning with the sheer number of ways Perl 6 lets us implement the &part subroutine. Each of those ways represents a different tradeoff in power, flexibility, and maintainability of the resulting code. It's important to remember that, however we choose to implement &part, it's always invoked in basically the same way:

%parts = part $selector, @data;

Sure, some of the above techniques let us modify the return labels, or control the use of named vs positional arguments. But with all of them, the call itself starts with the name of the subroutine, after which we specify the arguments.

Let's change that too!

Suppose we preferred to have a partitioning operator, rather than a subroutine. If we ignore those optional labels, and restrict our list to be an actual array, we can see that the core partitioning operation is binary ("apply this selector to that array").

If &part is to become an operator, we need it to be a binary operator. In Perl 6 we can make up completely new operators, so let's take our partitioning inspiration from Moses and call our new operator: ~|_|~

We'll assume that this "Red Sea" operator is to be used like this:

%parts = @animals ~|_|~ Animal::Cat;

The left operand is the array to be partitioned and the right operand is the selector. To implement it, we'd write:

multi sub infix:~|_|~ (@data, Selector $is_sheep)
    is looser(&infix:+)
    is assoc('non')
{
    return part $is_sheep, @data;
}

Operators are often overloaded with multiple variants (as we'll soon see), so we typically implement them as multisubs. However, it's also perfectly possible to implement them as regular subroutines, or even as macros.

To distinguish a binary operator from a regular multisub, we give it a special compound name, composed of the keyword infix: followed by the characters that make up the operator's symbol. These characters can be any sequence of non-whitespace Unicode characters (except left parenthesis, which can only appear if it's the first character of the symbol). So instead of ~|_|~ we could equally well have named our partitioning operator any of:

infix:¥
infix:¦
infix:^%#$!
infix:<->
infix:∇

The infix: keyword tells the compiler that the operator is placed between its operands (as binary operators always are). If we're declaring a unary operator, there are three other keywords that can be used instead: prefix:, postfix:, or circumfix:. For example:

sub prefix:±       (Num $n) is equiv(&infix:+)    { return +$n|-$n }

sub postfix:²      (Num $n) is tighter(&infix:**) { return $n**2 }

sub circumfix:⌊...⌋ (Num $n) { return POSIX::floor($n) }

# and later...

$error = ±⌊$x²⌋;

The is tighter, is looser, and is equiv traits tell the parser what the precedence of the new operator will be, relative to existing operators: namely, whether the operator binds more tightly than, less tightly than, or with the same precedence as the operator named in the trait. Every operator has to have a precedence and associativity, so every operator definition has to include one of these three traits.

The is assoc trait is only required on infix operators and specifies whether they chain to the left (like +), to the right (like =), or not at all (like ..). If the trait is not specified, the operator takes its associativity from the operator that's specified in the is tighter, is looser, or is equiv trait.

Arguments Both Ways

On the other hand, we might prefer that the selector come first (as it does in &part):

%parts = Animal::Cat ~|_|~ @animals;

in which case we could just add:

multi sub infix:~|_|~ (Selector $is_sheep, @data)
    is equiv( &infix:~|_|~(Array,Selector) )
{
    return part $is_sheep, @data;
}

so now we can specify the selector and the data in either order.

Because the two variants of the &infix:~|_|~ multisub have different parameter lists (one is (Array,Selector), the other is (Selector,Array)), Perl 6 always knows which one to call. If the left operand is a Selector, the &infix:~|_|~(Selector,Array) variant is called. If the left operand is an array, the &infix:~|_|~(Array,Selector) variant is invoked.

Note that, for this second variant, we specified is equiv instead of is tighter or is looser. This ensures that the precedence and associativity of the second variant are the same as those of the first. That's also why we didn't need to specify an is assoc.

Parting Is Such Sweet Sorrow

Phew. Talk about "more than one way to do it"!

But don't be put off by these myriad new features and alternatives. The vast majority of them are special-purpose, power-user techniques that you may well never need to use or even know about.

For most of us it will be enough to know that we can now add a proper parameter list, with sensibly named parameters, to any subroutine. What we used to write as:

sub feed {
    my ($who, $how_much, @what) = @_;
    ...
}

we now write as:

sub feed ($who, $how_much, *@what) {
    ...
}

or, when we're feeling particularly cautious:

sub feed (Str $who, Num $how_much, Food *@what) {
    ...
}

Just being able to do that is a huge win for Perl 6.

Parting Shot

By the way, here's (most of) that same partitioning functionality implemented in Perl 5:

# Perl 5 code...
sub part {
    my ($is_sheep, $maybe_flag_or_labels, $maybe_labels, @data) = @_;
    my ($sheep, $goats);
    if ($maybe_flag_or_labels eq "labels" && ref $maybe_labels eq 'ARRAY') { 
        ($sheep, $goats) = @$maybe_labels;
    }
    elsif (ref $maybe_flag_or_labels eq 'ARRAY') {
        unshift @data, $maybe_labels;
        ($sheep, $goats) = @$maybe_flag_or_labels;
    }
    else {
        unshift @data, $maybe_flag_or_labels, $maybe_labels;
        ($sheep, $goats) = qw(sheep goats);
    }
    my $arg1_type = ref($is_sheep) || 'CLASS';
    my %herd;
    if ($arg1_type eq 'ARRAY') {
        for my $index (0..$#data) {
            my $datum = $data[$index];
            my $label = grep({$index==$_} @$is_sheep) ? $sheep : $goats;
            push @{$herd{$label}}, $datum;
        }
    }
    else {
        croak "Invalid first argument to &part"
            unless $arg1_type =~ /^(Regexp|CODE|HASH|CLASS)$/;
        for (@data) {
            if (  $arg1_type eq 'Regexp' && /$is_sheep/
               || $arg1_type eq 'CODE'   && $is_sheep->($_)
               || $arg1_type eq 'HASH'   && $is_sheep->{$_}
               || UNIVERSAL::isa($_,$is_sheep)
               ) {
                push @{$herd{$sheep}}, $_;
            }
            else {
                push @{$herd{$goats}}, $_;
            }
        }
    }
    return map {bless {key=>$_,value=>$herd{$_}},'Pair'} keys %herd;
}

... which is precisely why we're developing Perl 6.

Overloading

Introduction: What is Overloading?

All object-oriented programming languages have a feature called overloading, but in most of them this term means something different from what it means in Perl. Take a look at this Java example:

public Fraction(int num, int den);
public Fraction(Fraction F);
public Fraction();

In this example, we have three methods called Fraction. Java, like many languages, is very strict about the number and type of arguments that you can pass to a function. We therefore need three different methods to cover the three possibilities. In the first example, the method takes two integers (a numerator and a denominator) and it returns a Fraction object based on those numbers. In the second example, the method takes an existing Fraction object as an argument and returns a copy (or clone) of that object. The final method takes no arguments and returns a default Fraction object, maybe representing 1/1 or 0/1. When you call one of these methods, the Java Virtual Machine determines which of the three methods you wanted by looking at the number and type of the arguments.

In Perl, of course, we are far more flexible about what arguments we can pass to a method. Therefore the same method can be used to handle all of the three cases from the Java example. (We'll see an example of this in a short while.) This means that in Perl we can save the term "overloading" for something far more interesting — operator overloading.

Number::Fraction — The Constructor

Imagine you have a Perl object that represents fractions (or, more accurately, rational numbers, but we'll call them fractions as we're not all math geeks). In order to handle the same situations as the Java class we mentioned above, we need to be able to run code like this:

use Number::Fraction;

my $half       = Number::Fraction->new(1, 2);
my $other_half = Number::Fraction->new($half);
my $default    = Number::Fraction->new;

To do this, we would write a constructor method like this:

sub new {
    my $class = shift;
    my $self;
    if (@_ >= 2) {
        return if $_[0] =~ /\D/ or $_[1] =~ /\D/;
        $self->{num} = $_[0];
        $self->{den} = $_[1];
    } elsif (@_ == 1) {
        if (ref $_[0]) {
            if (UNIVERSAL::isa($_[0], $class)) {
                return $class->new($_[0]->num, $_[0]->den);
            } else {
                croak "Can't make a $class from a ", ref $_[0];
            }
        } else {
            return unless $_[0] =~ m|^(\d+)/(\d+)|;

            $self->{num} = $1;
            $self->{den} = $2;
        }
    } elsif (!@_) {
        $self->{num} = 0;
        $self->{den} = 1;
    }

    bless $self, $class;
    $self->normalise;
    return $self;
}

As promised, there's just one method here, and it does everything that the three Java methods did and even more, so it's a good example of why we don't need method overloading in Perl. Let's look at the various parts in some detail.

sub new {
    my $class = shift;
    my $self;

The method starts out just like most Perl object constructors. It grabs the class which is passed in as the first argument and then declares a variable called $self which will contain the object.

    if (@_ >= 2) {
        return if $_[0] =~ /\D/ or $_[1] =~ /\D/;
        $self->{num} = $_[0];
        $self->{den} = $_[1];

This is where we start to work out just how the method was called. We look at @_ to see how many arguments we have been given. If we've got two arguments then we assume that they are the numerator and denominator of the fraction. Notice that there's also another check to ensure that both arguments contain only digits. If this check fails, we return undef from the constructor.

     } elsif (@_ == 1) {
        if (ref $_[0]) {
            if (UNIVERSAL::isa($_[0], $class)) {
                return $class->new($_[0]->num, $_[0]->den);
            } else {
                croak "Can't make a $class from a ", ref $_[0];
            }
        } else {
            return unless $_[0] =~ m|^(\d+)/(\d+)|;
            $self->{num} = $1;
            $self->{den} = $2;
        }

If we've been given just one argument, then there are a couple of things we can do. First we see if the argument is a reference, and if it is, we check that it's a reference to another Number::Fraction object (or a subclass). If it's the right kind of object then we get the numerator and denominator (using the accessor functions) and use them to call the two-argument form of new. If the argument is the wrong type of reference then we complain bitterly to the user.

If the single argument isn't a reference then we assume it's a string of the form num/den, which we can split apart to get the numerator and denominator of the fraction. Once more we check for the correct format using a regex and return undef if the check fails.

     } elsif (!@_) {
        $self->{num} = 0;
        $self->{den} = 1;
    }

If we are given no arguments, then we just create a default fraction which is 0/1.

    bless $self, $class;
    $self->normalise;
    return $self;
}

At the end of the constructor we do more of the normal OO Perl stuff. We bless the object into the correct class and return the reference to our caller. Between these two actions we pause to call the normalise method, which converts the fraction to its simplest form. For example, it will convert 12/16 to 3/4.
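
The normalise method itself isn't shown in this article. As a rough sketch of what such a method might do, assuming it simply divides both parts by their greatest common divisor, something like the following would turn 12/16 into 3/4. The _gcd helper and the plain hash reference standing in for an object are assumptions made for illustration, not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: greatest common divisor by Euclid's algorithm.
sub _gcd {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x % $y) while $y;
    return $x;
}

# Hypothetical normalise: reduce the fraction to its lowest terms.
sub normalise {
    my $self = shift;
    my $gcd  = _gcd($self->{num}, $self->{den});
    if ($gcd > 1) {
        $self->{num} /= $gcd;
        $self->{den} /= $gcd;
    }
    return $self;
}

# A plain hash reference stands in for a Number::Fraction object here.
my $fraction = normalise({ num => 12, den => 16 });
print "$fraction->{num}/$fraction->{den}\n";   # prints 3/4
```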

Number::Fraction — Doing Calculations

Having now created fraction objects, we will want to start doing calculations with them. For that we'll need methods that implement the various mathematical functions. Here's the add method:

sub add {
    my ($self, $delta) = @_;

    if (ref $delta) {
        if (UNIVERSAL::isa($delta, ref $self)) {
            $self->{num} = $self->num  * $delta->den
                + $delta->num * $self->den;
            $self->{den} = $self->den  * $delta->den;
        } else {
            croak "Can't add a ", ref $delta, " to a ", ref $self;
        }
    } else {
        if ($delta =~ m|(\d+)/(\d+)|) {
            $self->add(Number::Fraction->new($1, $2));
        } elsif ($delta !~ /\D/) {
            $self->add(Number::Fraction->new($delta, 1));
        } else {
            croak "Can't add $delta to a ", ref $self;
        }
    }
    $self->normalise;
}

Once more we try to handle a number of different types of arguments. We can add the following things to our fraction object:

  • Another object of the same class (or a subclass).
  • A string in the format num/den.
  • An integer. This is converted to a fraction with a denominator of 1.

This then allows us to write code like this:

my $half           = Number::Fraction->new(1, 2);
my $quarter        = Number::Fraction->new(1, 4);
my $three_quarters = $half;
$three_quarters->add($quarter);

In my opinion, this code looks pretty horrible. It also has a nasty, subtle bug. Can you spot it? (Hint: What will be in $half after running this code?) To tidy up this code we can turn to operator overloading.
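
The bug comes from the fact that assigning an object copies only the reference. A minimal sketch, using plain hash references in place of Number::Fraction objects (an assumption made to keep the example self-contained), shows both variables pointing at the same data:

```perl
use strict;
use warnings;

my $half           = { num => 1, den => 2 };
my $three_quarters = $half;      # copies the reference, not the data

# Stand-in for $three_quarters->add($quarter): modify the object in place.
@{$three_quarters}{qw(num den)} = (3, 4);

print "$half->{num}/$half->{den}\n";   # prints 3/4: $half has changed too

# A shallow copy of the hash gives an independent object instead.
my $copy = { %$half };
$copy->{num} = 1;
print "$half->{num}/$half->{den}\n";   # still prints 3/4
```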

Number::Fraction — Operator Overloading

The module overload.pm is a standard part of the Perl distribution. It allows your objects to define how they will react to a number of Perl's operators. For example, we can add code like this to Number::Fraction:

use overload '+' => 'add';

Whenever a Number::Fraction is used as one of the operands to the + operator, the add method will be called instead. Code like:

$three_quarters = $half + '1/4';

is converted to:

$three_quarters = $half->add('1/4');

This is getting closer, but it still has a serious problem. The add method works on the $half object. In general, however, that's not how an assignment should work. If you were working with ordinary scalars and had code like:

$foo = $bar + 0.75;

You would be very surprised if this altered the value of $bar. Our objects need to work in the same way. We need to change our add method so that it doesn't alter $self but instead returns the new fraction.

sub add {
    my ($l, $r) = @_;
    if (ref $r) {
        if (UNIVERSAL::isa($r, ref $l)) {
            return Number::Fraction->new($l->num * $r->den + $r->num * $l->den,
                                         $l->den * $r->den);
        } else {
            ...
        }
    } else {
        ...
    }
}

In this example, I've only shown one of the sections, but I hope it's clear how it would work. Notice that I've also renamed $self and $delta to $l and $r. I find this makes more sense as we are working with the left and right operands of the + operator.

Overloading Non-Commutative Operators

We can now happily handle code like:

$three_quarters = $half + '1/4';

Our object will do the right thing — $three_quarters will end up as a Number::Fraction object that contains the value 3/4. What will happen if we write code like this?

$three_quarters = '1/4' + $half;

The overload module handles this case as well. If your object is either operand of one of the overloaded operators, then your method will be called. You get passed an extra argument which indicates whether your object was the left or right operand of the operator. This argument is false if your object is the left operand and true if it is the right operand.

For commutative operators you probably don't need to take any notice of this argument as, for example:

$half + '1/4'

is the same as:

'1/4' + $half

However, for non-commutative operators (like - and /) you will need to do something like this:

sub subtract {
    my ($l, $r, $swap) = @_;

    ($l, $r) = ($r, $l) if $swap;
    ...
}
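
To watch the swap flag at work, here is a small self-contained sketch. Tiny::Number is a hypothetical class invented for this illustration; it is not part of Number::Fraction.

```perl
package Tiny::Number;   # hypothetical class, for illustration only
use strict;
use warnings;

use overload '-' => 'subtract';

sub new {
    my ($class, $value) = @_;
    return bless { value => $value }, $class;
}

sub subtract {
    my ($l, $r, $swap) = @_;

    # Unwrap both operands to plain numbers; $r may be an object or a number.
    my ($x, $y) = map { ref $_ ? $_->{value} : $_ } $l, $r;

    # Our object was the right-hand operand, so restore the original order.
    ($x, $y) = ($y, $x) if $swap;

    return Tiny::Number->new($x - $y);
}

package main;

my $ten   = Tiny::Number->new(10);
my $left  = $ten - 3;    # called as subtract($ten, 3, false)
my $right = 3 - $ten;    # called as subtract($ten, 3, true)

print "$left->{value} $right->{value}\n";   # prints "7 -7"
```

Without the swap line, both expressions would produce 7.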

Overloadable Operators

Just about any Perl operator can be overloaded in this way. This is a partial list:

  • Arithmetic: +, +=, -, -=, *, *=, /, /=, %, %=, **, **=, <<, <<=, >>, >>=, x, x=, ., .=
  • Comparison: <, <=, >, >=, ==, !=, <=>, lt, le, gt, ge, eq, ne, cmp
  • Increment/Decrement: ++, -- (both pre- and post- versions)

A full list is given in the overload documentation (perldoc overload).

It's a very long list, but thankfully you rarely have to supply an implementation for more than a few operators. Perl is quite happy to synthesize (or autogenerate) many of the missing operators. For example:

  • ++ can be derived from +
  • += can be derived from +
  • - (unary) can be derived from - (binary)
  • All numeric comparisons can be derived from <=>
  • All string comparisons can be derived from cmp
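
As a sketch of this autogeneration, the hypothetical Tiny::Counter class below (invented for this example) defines only + and <=>, yet +=, ++, == and < all work on its objects:

```perl
package Tiny::Counter;   # hypothetical class, not part of Number::Fraction
use strict;
use warnings;

# Only two operators are defined; everything else is left to Perl.
use overload
    '+'   => 'plus',
    '<=>' => 'compare';

sub new {
    my ($class, $n) = @_;
    return bless { n => $n }, $class;
}

# Accept either another counter or a plain number as an operand.
sub _value { my $v = shift; return ref $v ? $v->{n} : $v }

sub plus {
    my ($l, $r) = @_;
    return Tiny::Counter->new(_value($l) + _value($r));
}

sub compare {
    my ($l, $r, $swap) = @_;
    my $cmp = _value($l) <=> _value($r);
    return $swap ? -$cmp : $cmp;
}

package main;

my $c = Tiny::Counter->new(1);
$c += 2;                        # derived from +
$c++;                           # derived from +
my $equals_four = ($c == 4);    # derived from <=>
my $below_ten   = ($c < 10);    # derived from <=>

print "value: $c->{n}\n";       # prints "value: 4"
```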

Two other special operators give finer control over this autogeneration of methods. nomethod defines a subroutine that is called when no other function is found and fallback controls how hard Perl tries to autogenerate a method. fallback can have one of three values:

undef
Attempt to autogenerate methods and die if a method can't be autogenerated. This is the default.
0
Never try to autogenerate methods.
1
Attempt to autogenerate methods but fall back on Perl's default behavior for the object if a method can't be autogenerated.

Here's an example of an object that will die gracefully when an unknown operator is called. Notice that the nomethod subroutine is passed the usual three arguments (left operand, right operand, and the swap flag) together with an extra argument containing the operator that was used.

use overload
    '-' => 'subtract',
    fallback => 0,
    nomethod => sub { 
        croak "illegal operator $_[3]" 
};

Three special operators are provided to control type conversion. They define methods to be called if the object is used in string, numeric, and boolean contexts. These operators are denoted by q{""}, 0+, and bool. Here's how we can use these in Number::Fraction:

use overload
    q{""} => 'to_string',
    '0+'  => 'to_num';

sub to_string {
    my $self = shift;
    return "$self->{num}/$self->{den}";
}

sub to_num {
    my $self = shift;
    return $self->{num} / $self->{den};
}

Now, when we print a Number::Fraction object, it will be displayed in num/den format. When we use the object in a numeric context, Perl will automatically convert it to its numeric equivalent.

We can use these type-conversion and fallback operators to cut down the number of operators we need to define even further.

use overload
    '0+' => 'to_num',
    fallback => 1;

Now, whenever our object is used where Perl is expecting a number and we haven't already defined an overloading method, Perl will try to use our object as a number, which will, in turn, trigger our to_num method. This means that we only need to define operators where their behavior will differ from that of a normal number. In the case of Number::Fraction, we don't need to define any numeric comparison operators since the numeric value of the object will give the correct behavior. The same is true of the string comparison operators if we define to_string.
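
Here is a minimal sketch of that behavior. Tiny::Ratio is a hypothetical stand-in for Number::Fraction that defines only the two conversion operators and sets fallback to 1:

```perl
package Tiny::Ratio;   # hypothetical class, for illustration only
use strict;
use warnings;

use overload
    '0+'     => 'to_num',
    q{""}    => 'to_string',
    fallback => 1;

sub new {
    my ($class, $num, $den) = @_;
    return bless { num => $num, den => $den }, $class;
}

sub to_num    { my $self = shift; return $self->{num} / $self->{den} }
sub to_string { my $self = shift; return "$self->{num}/$self->{den}" }

package main;

my $half    = Tiny::Ratio->new(1, 2);
my $quarter = Tiny::Ratio->new(1, 4);

my $is_bigger = ($half > $quarter);   # compares 0.5 with 0.25 via to_num
my $sum       = $half + $quarter;     # a plain number, not an object
my $text      = "$half";              # uses to_string

print "$text $sum\n";                 # prints "1/2 0.75"
```

Note that $sum is an ordinary number here, which is why a class like Number::Fraction still defines + itself: that is one of the places where its behavior needs to differ from a normal number.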

Overloading Constants

We've come a long way with our overloaded objects. Instead of nasty code like:

use Number::Fraction;

$f = Number::Fraction->new(1, 2);
$f->add('1/4');

we can now write code like:

use Number::Fraction;

$f = Number::Fraction->new(1, 2) + '1/4';

There are still, however, two places where we need to use the full name of the class — when we load the module and when we create a new fraction object. We can't do much about the first of these, but we can remove the need for that ugly new call by overloading constants.

You can use overload::constant to control how Perl interprets constants in your program. overload::constant expects a hash where the keys identify various kinds of constants and the values are subroutines which handle the constants. The keys can be any of integer (for integers), float (for floating point numbers), binary (for binary, octal, and hex numbers), q (for strings), and qr (for the constant parts of regular expressions).

When a constant of the right type is found, Perl will call the associated subroutine, passing it the string representation of the constant and the way that Perl would interpret the constant by default. Subroutines associated with q or qr also get a third argument (either qq, q, s, or tr) which indicates how the string is being used in the program.

As an example, here is how we would set up constant handlers so that strings of the form num/den are always converted to the equivalent Number::Fraction object:

my %_const_handlers = 
    (q => sub { 
        return __PACKAGE__->new($_[0]) || $_[1] 
});

sub import {
    overload::constant %_const_handlers if $_[1] eq ':constants';
}

sub unimport {
    overload::remove_constant(q => undef);
}

We've defined a hash, %_const_handlers, which only contains one entry as we are only interested in strings. The associated subroutine calls the new method in the current package (which will be Number::Fraction or a subclass) passing it the string as found in the program source. If this string can be used to create a valid Number::Fraction object, a reference to that object is returned.

If a valid object isn't returned then the subroutine returns its second argument, which is Perl's default interpretation of the constant. As a result, any strings in the program that can be interpreted as a fraction are converted to the correct Number::Fraction object and other strings are left unchanged.
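
The same mechanism can be tried out in a single file, without a module or an import subroutine. In the sketch below the handler is installed from a BEGIN block (a substitution for the import-based setup used by Number::Fraction), and it turns num/den strings into plain numbers rather than objects. Both choices are assumptions made to keep the example self-contained.

```perl
use strict;
use warnings;
use overload ();

my ($half, $name);
{
    # Install the handler at compile time; it applies to the rest of this block.
    BEGIN {
        overload::constant(
            q => sub {
                my ($source, $default, $how) = @_;
                if ($source =~ m{^(\d+)/(\d+)$} && $2 != 0) {
                    return $1 / $2;   # replace the string with its value
                }
                return $default;      # leave every other string alone
            },
        );
    }

    $half = '1/2';     # handled: becomes the number 0.5
    $name = 'camel';   # not a fraction, so left unchanged
}

my $outside = '1/2';   # outside the block, so still an ordinary string

print "$half $name $outside\n";   # prints "0.5 camel 1/2"
```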

The constant handler is loaded as part of our package's import subroutine. Notice that it is only loaded if the import subroutine is passed the optional argument :constants. This is because this is a potentially big change to the way that a program's source code is interpreted so we only want to turn it on if the user wants it. Number::Fraction can be used in this way by putting the following line in your program:

use Number::Fraction ':constants';

If you don't want the scary constant-refining stuff you can just use:

use Number::Fraction;

Also note that we've defined an unimport subroutine which removes the constant handler. An unimport subroutine is called when a program calls no Number::Fraction (it's the opposite of use). If you're going to make major changes to the way that Perl parses a program then it's only polite to undo your changes if the programmer asks you to.

Conclusion

We've finally managed to get rid of most of the ugly class names from our code. We can now write code like this:

use Number::Fraction ':constants';

my $half = '1/2';
my $three_quarters = $half + '1/4';
print $three_quarters;  # prints 3/4

I hope you can agree that this has the potential to make code far easier to read and understand.

Number::Fraction is available on the CPAN. Please feel free to take a closer look at how it is implemented. If you come up with any more interesting overloaded modules, I'd love to hear about them.

This week on Perl 6, week ending 2003-07-20

Welcome back to an interim Perl 6 Summary, falling between two conference weeks — OSCON and YAPC::Europe. For reasons involving insanity, a EuroStar ticket going begging, and undeserved generosity, I shall be bringing my conference haul up to 3 for the year. Yay me! Now if I can just finagle a talk into the schedule I will have spoken at all three too. I'll certainly have heckled at all three.

The State of the Onion

Tim Howell wondered if Larry's State of the Onion address to OSCON would be available anywhere. Robert "White Camel" Spier didn't rest on his laurels, but posted a link to Larry's slides and transcript. chromatic popped up with a different link which neatly combines slides and transcript. However, they both lack the funny balloon pony which I gave Larry just before the talk. Maybe there will be video later.

http://groups.google.com/groups

http://www.perl.com/pub/a/2003/07/16/soto2003.html

A Small Perl Task for the Interested

A few weeks back, Dan asked for volunteers to improve Parrot's build system. Lars Balker Rasmussen stepped up to the plate and offered a simple wrapper. Josh Wilmes seems to be somewhat sceptical about the whole endeavour, pointing at libtool as an illustration of the size of the problem.

http://groups.google.com/groups

env.pmc

Parrot is in the process of getting an env PMC, for accessing environment variables. Dan and Leo's discussion of the patch took in iterators and the wisdom of caching values. In the end, Dan prevailed and the environment PMC doesn't cache values.

http://groups.google.com/groups

Dan on threading

"Yes, [there is some plan of how threading will work in Parrot]. (Though there's some disagreement [as] to the solidity and sanity of the plan)".

Dan went on to outline the plan.

http://groups.google.com/groups

Event handling

Event handling appears to be the topic of the week. The consensus seems to be that getting it right is hard. Lots of people offered constructive suggestions and problem cases. I sat on the sidelines and trusted them to get it Right (a strategy which seems to pay off remarkably often, thankfully).

Damien Neil was rather less sure of the wisdom of Parrot's events and asynchronous IO model, arguing for a system based on threading. Dan was unconvinced. I don't think Damien was convinced either.

http://groups.google.com/groups

http://groups.google.com/groups

http://groups.google.com/groups

IMCC sub names are not labels

Luke Palmer had some problems with using continuation passing style subroutines in IMCC. Leo initially pointed out that IMCC didn't support CPS. Then he fixed it so that it did (sort of: the user still has to do a chunk of the work, but sometimes that's a good thing).

In an example of synchronicity, Will Coleda had a similar problem. Leo's fix helped him too, and he went off to continue hacking on Tcl. (It helps me too, as I have this cunning plot....)

http://groups.google.com/groups

More on targeting GCC

In last week's summary I wondered what Tupshin Harper was talking about when he suggested emulating a "more traditional, stack-oriented processor". This week, Tupshin wished that "people had the decency to tell me how insane [the idea] is". Dan obliged, telling Tupshin that he was insane (but he did it with a smile).

The problem boils down to GCC's assumptions about the "shape" of the stack. GCC assumes a conventional, contiguous area of stack memory filled with stack frames. Parrot assumes a garbage collected chain of continuations, which is about as far as you can get from GCC's beliefs.

http://groups.google.com/groups

Parrot_sprintf not recognizing 7 in precision

mrnobo1024 posted a patch to fix a minor niggle with Parrot_sprintf. Leon Brocard applied it.

http://groups.google.com/groups

Problems with new object ops

Dan has started adding real classes and objects to Parrot. (Yay!)

Simon Glover found some bugs. (Not entirely unexpected.)

Simon also sent in a patch fixing the bugs. (Yay! Yay! and thrice Yay!)

http://groups.google.com/groups

The big core.ops split

Brent Dax has started work on splitting core.ops from being the second largest file in the Parrot distribution (108k) into a total of ten more narrowly defined .ops files. Benjamin Goldberg wondered if this would be a good time for tidying up the parrot directory structure. At the very least, he called for moving the source files into a src subdirectory.

Dan asked for a volunteer to rough out a move plan covering which files move where and what the new directory structure would look like, which would allow the perl.org jockeys to rearrange things in such a way that the various files remember their histories.

http://groups.google.com/groups

Copyrights

Josh Wilmes posted a monster patch to unify the various copyright notices attached to the files in the parrot distribution. Everything is now (correctly) copyrighted by the Perl Foundation.

http://groups.google.com/groups

Meanwhile, in perl6-language

Somebody needs to set an Exegesis among the pigeons. There's all of 14 messages in there, almost discussing the semantics of aliasing an array slice.

http://groups.google.com/groups — Alias those slices

Parsers with Pre-processors

I didn't quite understand what Dave Whipp was driving at when he talked about overloading the <ws> pattern as a way of doing preprocessing of Perl 6 patterns. I didn't understand Luke Palmer's answer either. Help.

http://groups.google.com/groups

Protocols

Luke Palmer's been toying with Objective-C (and why not, it's a nice language) and would like Perl 6 to steal Objective-C's protocols. Protocols are a little like Java's interfaces, but nicer. I'm waiting for him to latch onto the idea of Categories, which are Objective-C's way of adding methods to a class at runtime.

Discussion of Luke's proposal centered on a couple of his peripheral points. Nobody's yet addressed his main point, but it looks like a decent idea to me.

http://groups.google.com/groups

Acknowledgements, Announcements, and Apologies

First of all, I plead insanity for my mistake of last week's summary. PONIE does not stand for "Perl On New Internal Architecture", it obviously stands for "Perl On New Internal Engine". I'm very, very sorry and I'll try not to do it again.

Editor's note: the Ponie expansion was corrected in both summaries before publication. The summarizer has been whacked appropriately.

Secondly, an announcement: I've started one of those punditry/bloggy things. It's at http://pc1.bofhadsl.ftech.co.uk:8080/ (Snappy URL eh?) and it's called 'Just A Summary'. I'm expecting it to concentrate on Perl and Perl 6 related issues, but at the time of this writing I've only got two 'real' articles up there so anything could happen.

As ever, if you've appreciated this summary, please consider one or more of the following options:

  • Send money to the Perl Foundation at http://donate.perl-foundation.org/ and help support the ongoing development of Perl.
  • Get involved in the Perl 6 process. The mailing lists are open to all. http://dev.perl.org/perl6/ and http://www.parrotcode.org/ are good starting points with links to the appropriate mailing lists.
  • Send feedback, flames, money, requests for consultancy, photographic and writing commissions, or an Apple 23" Cinema Display to p6summarizer@bofh.org.uk (One of these days that begging request will work, and I'll be flabbergasted).

State of the Onion 2003

This is the 7th annual State of the Perl Onion speech, wherein I tell you how Perl is doing. Perl is doing fine, thank you. Now that that's out of the way, I'd like to spend the rest of the time telling jokes.

In fact, the conference organizers have noticed that I spend most of the time telling jokes. So each year they give me a little less time, so I have to chop out more of the serious subject matter so as to leave time for the jokes.

Extrapolating several years into the future, they'll eventually chop my time down to ten seconds. I'll have just enough time to say: "I'm really, really excited about what is happening with Perl this year. And I'd like to announce that, after lengthy negotiations, Guido and I have finally decided... <gong> ["Time's up. Next speaker please"]

Well, you didn't really want to know that anyway...

Since this is a State of the Union speech, or State of the Onion, in the particular case of Perl, I'm supposed to tell you what Perl's current state is. But I already told you that the current state of Perl is just fine. Or at least as fine as it ever was. Maybe a little better.

But what you really want to know about is the future state of Perl. That's nice. I don't know much about the future of Perl. Nobody does. That's part of the design of Perl 6. Since we're designing it to be a mutable language, it will probably mutate. If I did know the future of Perl, and if I told you, you'd probably run away screaming.

As I was meditating on this subject, thinking about how I don't know the future of Perl, and how you probably don't want to know it anyway, I was reminded of a saying that I first saw posted in the 1960's. You may feel like this on some days.

We the unwilling,
led by the unknowing,
are doing the impossible
for the ungrateful.
We have done so much for so long with so little
We are now qualified to do anything with nothing

I think of it as the Blue-Collar Worker's Creed.

This has been attributed to various people, none of whom are Ben Franklin, Abraham Lincoln, or Mark Twain. My favorite attribution is to Mother Teresa. She may well have quoted it, but I don't think she coined it, because I don't think Mother Teresa thought of herself as "unwilling". After all, Mother Teresa got a Nobel prize for being one of the most willing people on the face of the earth.

It's also been attributed to the Marines in Vietnam, and it certainly fits a little better. But since I grew up in a Navy town, I'd like to think it was invented by a civilian shipyard worker working for the Navy. In any event, I first saw it posted in a work area at Puget Sound Naval Shipyard back in the 1960's. Now, you may well be wondering what I was doing in a Naval Shipyard in the 1960's. That's a secret.

Anyway, you may also be wondering why I brought it up at all. Well, last year I used the table of contents from an issue of Scientific American as my outline. This year I'd like to use this as my outline.

I'd like to, but I won't.

But if I did, here's what I'd say.

From the postmodern point of view, this is a text that needs to be deconstructed. It was obviously written by someone in a position of power pretending not to be. And by making light of the plight of blue collar workers, and allowing the oppressed workers to post this copy-machine meme in the workplace, this white-collar wolf in blue-collar sheep's clothing has managed to persuade the oppressed workers that being powerless is something to be proud of.

Now, some of you young folks are too steeped in postmodernism to know anything about postmodernism, so let's review. Postmodernism in its most vicious form started out with the notion that there exist various cultural constructs, or texts, or memes, that allow some human beings to oppress other human beings. Of course, in Soviet Russia it's the other way around. Which is why they managed to deconstruct themselves, I guess.

Anyway, deconstructionism is all about throwing out the bad cultural memes, where "bad" is defined as anything an oppressed person doesn't like. Which is fine as far as it goes, but the spanner in the works is that you can only be an oppressed person if the deconstructionists say you are. Dead white males need not apply. Fortunately, I'm not dead yet. Though I'm trying. As some of you know, several weeks ago I was in the hospital with a bleeding ulcer. I guess I'm a little like Soviet Russia. I oppress myself, so I deconstruct myself.

Oh, by the way, I got better. In case you hadn't noticed.

Though I'm not allowed to drink anything brown anymore. Sigh. That's why this speech is so boring — I wrote it under the non-influence.

But back to postmodernism. Postmodern critics have invented a notation for using a word and denying its customary meaning at the same time, since most customary meanings are oppressive to someone or other, and if not, they ought to be. Or something like that.

Anyway, I'm going to borrow that notation for my own oppressive purposes, and strike out a few of these words that don't mean exactly what I want them to mean. I hope that doesn't make me a postmodern critic. Or maybe it does. As Humpty Dumpty said, the question is who's to be master, that's all.

So let's start by striking out "unwilling", because there are quite a few willing people around here. Or at least willful.

And let's strike out "unknowing" too, because you wouldn't be sitting here listening to us leaders here tonight if you thought we didn't know anything. On the other hand, maybe you just came for the jokes...

Now let's strike out the "impossible". Actually, I hesitate to strike that one out, because what we're trying to do with Perl is to be all things to all people, and in the long run that is completely impossible, technically, socially, and theologically speaking.

But that doesn't stop us from trying. And who knows, maybe more of it is possible than we imagine.

We definitely have to strike out ungrateful, because we know many people are grateful. Nevertheless, a number of people find it impossible to be grateful, and we should be working to please them as well. Love your enemies, and all that. Another impossible task. Or... perhaps the same one.

I like to please people who did not expect to be pleased. One day when I was a lot younger than I am now, I performed a piece on my violin. A lady came up to me afterward and said, "You know, I don't like the violin. But I liked that."

I treasure that sort of compliment, just as I treasure the email messages that say, "I had given up on computer programming because it wasn't any fun, and then I discovered Perl." That's what I mean when I say we should work to please the people who don't expect to be grateful.

Anyway, back to our Creed here. I can't see anything wrong with the last two lines. In fact, they're directly applicable.

We have done so much for so long with so little

That's Perl 5.

We are now qualified to do anything with nothing.

That's Perl 6. I suppose I need to strike that out too, since it doesn't really exist yet, except in our heads.

Well, maybe that's not such a bad outline after all. Let's talk a little more about those things.

the unwilling

We the unwilling

Here in the open source community, we're willing to help out, but that's because we're not willing to put up with the status quo. And that's generally due to our inflated sense of Laziness, Impatience, and Hubris. But then a really funny thing happens. A number of us will get together and agree about something that needs doing because of our Laziness, Impatience, and Hubris, and then we'll start working on that project with a great deal of industriousness, patience, and humility, which seem to be the very opposite qualities to those that motivated us in the first place.

I've tried to figure out a rationale for that, but I've pretty much come to the conclusion that it's not rational or reasonable. It's just who we are. Here's a favorite quotation of mine.

"The reasonable man adapts himself to the world; the unreasonable one persists in trying to adapt the world to himself. Therefore all progress depends on the unreasonable man."

I think that all of us agree that this is true. We just can't always agree on what to be unreasonable about. Of course, this was written by George Bernard Shaw, who had his own ideas of the most reasonable ways to be unreasonable. This is, after all, the guy who wrote Pygmalion, upon which the musical My Fair Lady was based, with dear old 'Enry 'Iggins and Eliza Dolit'le going at each other's throats. And over linguistics of all things. Fancy that.

The only problem with this quote is that it's false. A lot of progress comes from unreasonable women.

Well, okay, maybe Shaw meant "he or she" when he only said "he". Still, if we're going to please unreasonable people in the twenty-first century, maybe we need to rewrite it like this:

strike out man and him

On the other hand, some people are impossible to please. We should probably just strike out "George Bernard Shaw" since he's a dead white male.

unknowing

We the unwilling, led by the unknowing

That's me all over. Which is what the bug said after he hit the windshield.

Or as the bug's friend said, "Bet you don't have the guts to do that again."

Whether I have the guts to do Perl again is another question. My guts are still in sad shape at the moment, according to the doctor...

Anyway, back to "me the unknowing". I admit that there's an awful lot that I don't know. I'd love to tell you how much I don't know, but I don't know that either.

So I'll have to talk about what I know instead. If you are so inclined, you may infer that I am totally oblivious to anything I don't talk about today.

One thing I do know about is the universal architectural diagram. It looks like this.

It doesn't have to be chartreuse. How about pink, to match the fireworks up in the corner. I put the fireworks up in the corner there in case you missed the fireworks on the 4th of July.

pink

Anyway, this is the universal architectural diagram because you can represent almost any architecture with it, if you try hard enough. Here's a common enough one:

CPU

Here we have a bus that's common across the other three components of our computer, the memory, the CPU, and the I/O system. Within the computer we have other entities such as strings, which you can view either as a whole or as a sequence of characters.

string

An integer is just like a string, only it's a sequence of bits.

integer

We can go from very small ideas like integers to very large ideas like government:

government

Or even alternate forms of government.

Borg

The diagram is even more versatile because you can rotate it on its side.

plain right

Now, for some reason, this particular orientation seems to engender the most patriotism. It might just be accidental, but if you color it like this

flag

people start thinking about saluting it. Kinda goes with the fireworks, I guess.

A little more dangerous is this diagram.

object

It's amazing how many people will salute that one. And people will even go to war for this one:

class

But you know, the whole notion of objects like this is that there are ways in which you treat them as a single thing, and ways in which you treat them as multiple things. Every structured object is wrapped up in its own identity. That's really what this little diagram is getting at.

Well, let's keep rotating it and see what we get.

rotate God

Okay, if you happen to be a Christian of the trinitarian persuasion like me, then you believe that God is a structured object that is simultaneously singular and plural depending on how you look at it. Of course, nobody ever fights about that sort of thing, right?

plain left

It's kind of unusual to see the diagram in this orientation, probably due to linguistic considerations.

one out of many

But whether you say "one out of many" or "e pluribus unum",

pluribus

it means much the same thing. In a language that reads left to right, perhaps it's more naturally suited to processes that lose information, such as certain kinds of logic.

or

Again, we can go from the very small to the very large.

black hole

If you feed three random planets to a black hole, you also lose information. Or at least you hide it very well, depending on your theory of how black holes work.

If you feed one of these diagrams to a black hole, it turns into a piece of spaghetti.

But let's not, and say we did.

Oddly enough, what I'd really like to talk about today is Perl. If we look at our goal for the Parrot project, it looks something like this.

Borg Parrot

Oops, wrong slide.

Parrot

That is, Parrot is designed to be a single engine upon which we can run both Perl 5 and Perl 6. And... stuff. Admittedly, this is a rather Perl-centric view of reality, to the extent you can call this reality.

Well, okay, I'll cheat and show you the other stuff we'd like to do.

detail

We'd also like to support, for example, PHP, Ruby, Python, BASIC, Scheme, COBOL, Java, Befunge, TECO, Rebol, REXX, and... I can't quite make out that one on the bottom there. And if I could, I wouldn't say it anyway, because there are children present, and I wouldn't want to fuck up their brains.

Okay, I admit this is not quite reality yet. I just put in all those languages because I'm a white male who is trying to oppress you before I'm quite dead. So I'd better strike out a few things that aren't really there yet.

strikes

Could I interest you in a really fast BASIC interpreter?

Parrot + BASIC

Well, it's time to move on to our next point.

impossible

We the unwilling, led by the unknowing, are doing the impossible.

Is what we're doing really impossible? It's possible. But we won't know till we try. More precisely, till we finish trying. Sometimes things seem impossible to us, but maybe that's just because we're all slackers.

And because we oversimplify.

Let's take another look at the pink tennis court. I mean, the universal architectural diagram. It really isn't quite as universal as I've made it out to be. First, let's get rid of the pink.

black

Maybe I should give equal time to blue.

blue

Nah.

black

Anyway, as I was saying, this isn't universal enough. Here's the real universal diagram.

line widget

This is what's known as an impossible object. I like it. I'm impossible object oriented. This particular impossible object is often called a widget. But you knew that already.

What you might not have known is that, up till now, it's been thought impossible to color such an object accurately. But as you can see,

colorized

that is false. There are still some perceptual difficulties with it, but I'm sure that problem is just a relic of our reptile brain. Or was it our bird brain. I forget. In any event, if you have trouble perceiving this object correctly, just use the universal clarification tool.

cloud

I'll assume you can supply your own cloud from now on.

Should be easy here in Portland... I'm allowed to make jokes about Portland because I grew up in the Pacific Northwet.

As you can see, this more accurate universal architectural diagram can actually be rotated in 3-d with properly simulated lighting.

rotate rotate rotate rotate

It's extensible.

6comb 12comb

Comb structures are important in a programming language. That's why we're adding a switch statement to Perl 6.

It's also a more accurate representation of Parrot.

parrot

It's also more sophisticated linguistically.

widget

Not only can it represent singular and plural concepts, but also the old Indo-European notion of dual objects.

We still have vestiges of that in English.

oxen

One ox, many oxes, two oxen yoked together pulling your plow.

regexen

Or one regex, many regexes, but two regexen working together.

Vaxen

You always wanted to know the proper name for a two-headed Vax?

Everything is possible. You should be grateful.

ungrateful

On to the ungrateful undead.

There's been a lot of carping lately about how slow Perl 6 development is going. Some of it comes from well intentioned folks, but some of it comes from our poison pen pals who live in the troll house. Still, I think a lot of the criticism shows a lack of understanding of the basic laws of development. These laws can be illustrated with this diagram:

widget

Basically, perfect development is impossible. Development can be fast, good, and cheap. Pick two.

Actually, that's unrealistic.

Pick one.

Which one would you pick? You want fast? You want cheap? No, I think you want this one.

Good.

Good design is neither fast nor cheap. Every time we crank out a new chunk of the design of Perl 6 or of Parrot, it's a bit like writing a master's thesis. It's a lot of reading, and a lot of writing, and a lot of thinking, and a lot of email, and a lot of phone conferences. It's really complicated and multidimensional.

escher

There's a lot going on behind the scenes that you don't hear about every day. Many people have sacrificed to give us time to work on these things. People have donated their own time and money to it. O'Reilly and Associates have donated phone conferences and other infrastructure. The Perl 6 design team in particular has borne a direct financial cost but also a tremendous opportunity cost in pursuing this at the expense of career and income. I'm not looking for sympathy, but I want you to know that I almost certainly could have landed a full-time job 20 months ago if I'd been willing to forget about Perl 6. I'm extremely grateful for the grants the Perl Foundation has been able to give toward the Perl 6 effort. But I just want you to know that it's costing us more than that.

But Perl 6 is all about freedom, and that's why we're willing to pledge our lives, our fortunes, and our sacred honor.

Times are tough, and I'm not begging for more sacrifice from you good folks. I just want to give a little perspective, and fair warning that at some point soon I'm going to have to get a real job with real health insurance because I can't live off my mortgage much longer. It's bad for my ulcer, and it's bad for my family.

Fortunately, the basic design of Perl 6 is largely done, appearances to the contrary notwithstanding. Damian and I will be talking about that in the Perl 6 session later in the week.

Well, enough ranting. I don't want to sound ungrateful myself, because I'm not. In any event, the last three years have been extremely exciting, and I think the coming years will be just as interesting.

In particular, I have a great announcement to make at the end of my talk about what's going to be happening next. But let me explain a bit first what's happened, again using our poor, abused widget.

implementations

In this case, time is flowing in the upward direction.

Originally we just had one implementation of Perl, and the general perception as we started developing Perl 6 was that we were going to have two implementations of Perl.

But in actual fact, we're going to have at least three implementations of Perl.

First, the good old Perl 5 that's based on C. And on the right, the Perl 6 that's based on Parrot. But there in the middle is a Perl 5 that is also based on Parrot.

ellipses

Note that the left two are the same language, while the right two share the same platform.

So what's that Perl 5 doing there in the middle? If you've been following Perl 6 development, you'll know that from the very beginning we've said that there has to be a migration strategy, and that that strategy has two parts. First, we have to be able to translate Perl 5 to Perl 6. If that were all of it, we wouldn't need the middle Perl there. But not only do people need to be able to translate from Perl 5 to Perl 6, it is absolutely crucial that they be allowed to do it piecemeal. You can't translate a complicated set of modules all at once and expect them to work. Instead, we want people to be able to run some of their modules in Perl 5, and others in Perl 6, all under the same interpreter.

So that's one good reason to have a Perl 5 compiler for Parrot. Another good reason is that we expect Perl 5 to run faster on Parrot, by and large.

hands

Yet another reason is that we have a little bootstrapping issue with the Perl 6 grammar. The Perl 6 grammar is defined in Perl 6 regexes. But those regexes are parsed with the Perl 6 grammar. Catch 22. The solution to this involves two things. First, a magical module of Damian's that translates Perl 6 regexes back into Perl 5 regexes. Second, a Perl 5 regex interpreter to run those regexes. Now, it'd be possible to do it with old Perl 5, but it'll be cleaner to run it with the new Perl 5 running on Parrot.

widget

Now, it's awfully cumbersome to keep saying "Perl 5 over Parrot" and such, so we need to do some namespace cleanup here. We can drop the "over Parrot" for Perl 6, because that's redundant.

drop parrot

Likewise, people always think of the original when we say "Perl 5".

drop C

That means we need a code name for this thing in the middle. We've decided to call it "Ponie".

Ponie

We have lots of reasons to call it that. To be sure, none of them are good reasons, but I'm told it will make the London.pm'ers deliriously happy if I say, "I want a Ponie".

And I do want a Ponie. "I want the Ponie, I want the whole Ponie. I want it now."

versions

The plan is for Ponie version 5.10 to be a drop-in replacement for Perl 5.10. Eventually there will be a Ponie 5.12, and if Ponie is good enough, there may not be an old-fashioned 5.12. We'll just stop with 5.10.

So we're gonna start on Ponie right now. Since I've been carping about lack of resources, you might wonder how we're gonna do this.

Well, as it happens, a nice company called Fotango has a lot of Perl 5 code they want to run on Parrot, and they are clued enough to have authorized one of their employees, our very own Arthur Bergman, to spend company time porting Perl 5 to Parrot.

Is that cool or what? I'm out of time, so read the press release. But I'm really excited by our vision for the future, and if you're not excited, maybe you need to have your vision checked.

vision

Thanks for listening, and I hope that from now on you'll all be completely unreasonable.

How to Avoid Writing Code


One of the most boring programming tasks in the world has to be pulling data out of a database and displaying it on a web site. Yet it's also one of the most ubiquitous. Perl programmers being lazy, there are tools to help make boring programming tasks less painful, and two of these tools, Class::DBI and the Template Toolkit, create a whole which is far more drudgery-destroying than its parts.

Both these tools can do more complicated stuff than that described in this article, but my aim is to motivate people who may not have tried them out to give them a go and see how much work they can save you for even simple tasks.

I've assumed that you know the basics of designing a database--why you have several tables and JOIN them rather than putting everything in the same table. I've also assumed that you're not allergic to reading documentation, so I'm going to spend more space on saying why I use particular features of the modules rather than explaining exactly how they work.

Synergy

The reason that Class::DBI and the Template Toolkit work so well together is simple. Template Toolkit templates can call methods on objects passed to them--so there's no need to explicitly pull every column out of the database before you process the template--and Class::DBI saves you the bother of writing methods to retrieve database columns. You're essentially going straight from the database to HTML with only a very small amount of Perl in the middle.

Suppose you're writing a web application to store details of books and their authors, and reviews of the books by users of the site. You'd like to have a page that displays all the books in your database and, for each book, offers links to all the reviews already written. With suitably set-up classes you can write a couple of lines of Perl:

  #!/usr/bin/perl -w
  use strict;

  use Bookworms::Book;
  use Bookworms::Template;

  my @books = Bookworms::Book->retrieve_all;
  @books = sort { $a->title cmp $b->title } @books;
  print Bookworms::Template->output( template => "book_list.tt",
                                     vars     => { books => \@books } );

hand your designer a simple template to pretty up:

  [% page_title = "List all books" %]
  [% INCLUDE header.tt %]

    <ul>
      [% FOREACH book = books %]
        <li>[% book.title %] ([% book.author.name %])
            [% FOREACH review = book.reviews %]
              (<a href="review.cgi?review=[% review.uid %]">Read review
              by [% review.reviewer.name %]</a>)
            [% END %]
        </li>
      [% END %]
    </ul>

  [% INCLUDE footer.tt %]

and your task is done. You don't have to explicitly select the reviews; you don't have to then cross-reference to another table to find out the reviewer's name; you don't have to mess with HERE-documents or fill your program with print statements. You hardly have to do anything.

Except, of course, write the Bookworms::* classes in the first place, but that's easy.

Simple, Small Classes

For convenience, we write a class containing all the SQL needed to set up our database schema. This is very useful for running tests as well as for deploying a new install of the application.

  package Bookworms::Setup;
  use strict;
  use DBI;

  # Hash for table creation SQL - keys are the names of the tables,
  # values are SQL statements to create the corresponding tables.
  my %sql = (
      author => qq {
          CREATE TABLE author (
              uid   int(10) unsigned NOT NULL auto_increment,
              name  varchar(200),
              PRIMARY KEY (uid)
          )
      },
      book => qq{
          CREATE TABLE book (
              uid           int(10) unsigned NOT NULL auto_increment,
              title         varchar(200),
              first_name    varchar(200),
              author        int(10) unsigned, # references author.uid
              PRIMARY KEY (uid)
          )
      },
      review => qq{
          CREATE TABLE review (
              uid       int(10) unsigned NOT NULL auto_increment,
              book      int(10) unsigned, # references book.uid
              reviewer  int(10) unsigned, # references reviewer.uid
              PRIMARY KEY (uid)
          )
      },
      reviewer => qq{
          CREATE TABLE reviewer (
              uid   int(10) unsigned NOT NULL auto_increment,
              name  varchar(200),
              PRIMARY KEY (uid)
          )
      }
  );

This class has a single method that sets up a database conforming to the schema above. Here's the rendered POD for it; the implementation is pretty simple. The "force_clear" option is very useful for testing.

    setup_db( dbname      => 'bookworms',
              dbuser      => 'username',
              dbpass      => 'password',
              force_clear => 0            # optional, defaults to 0
            );

  Sets up the tables. Unless "force_clear" is supplied and set to a
  true value, any existing tables with the same names as we want to
  create will be left alone, whether or not they have the right
  columns etc. If "force_clear" is true, then any tables that are "in
  the way" will be removed. _Note that this option will nuke all your
  existing data._

  The database user "dbuser" must be able to create and drop tables in
  the database "dbname".

  Croaks on error, returns true if all OK.
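
The decision that "force_clear" controls can be sketched without a database at all. The helper below is hypothetical (plan_statements is not part of the article's code, and the two-table %sql map is cut down for illustration); it only works out which statements would be run, given the names of the tables that already exist.

```perl
use strict;
use warnings;

# Hypothetical helper, not the article's implementation: given the tables
# that already exist, a table => CREATE statement map, and the force_clear
# flag, return the SQL statements to run, in order.
sub plan_statements {
    my (%args) = @_;
    my %existing = map { $_ => 1 } @{ $args{existing} || [] };
    my $sql      = $args{sql};
    my @plan;

    foreach my $table ( sort keys %$sql ) {
        if ( $existing{$table} ) {
            # Leave existing tables alone unless told to clear them.
            next unless $args{force_clear};
            push @plan, "DROP TABLE $table";
        }
        push @plan, $sql->{$table};
    }
    return @plan;
}

# Cut-down schema, for illustration only.
my %sql = (
    author => "CREATE TABLE author (uid int, name varchar(200))",
    book   => "CREATE TABLE book (uid int, title varchar(200))",
);

my @keep  = plan_statements( existing => ["author"], sql => \%sql );
my @clear = plan_statements( existing => ["author"], sql => \%sql,
                             force_clear => 1 );

print scalar(@keep),  " statement(s) without force_clear\n";
print scalar(@clear), " statement(s) with force_clear\n";
```

In a real setup_db, each planned statement would presumably be handed to $dbh->do on a DBI handle, croaking if any of them fails.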

Now, another class to wrap around the Template Toolkit; we want to grab global variables like the name of the site, and so on, from a config class. (There are plenty of config modules on CPAN; you're bound to find one you like. I quite like Config::Tiny; other people swear by AppConfig--and since the latter is a prerequisite of the Template Toolkit, you'll have it installed already.) Bookworms::Config is just a little wrapper class around Config::Tiny, so if I change to a different config method later I don't have to rewrite lots of code.

  package Bookworms::Template;
  use strict;
  use Bookworms::Config;
  use Carp qw(croak);
  use CGI;
  use Template;

  # We have one method, which returns everything you need to send to
  # STDOUT, including the Content-Type: header.

  sub output {
      my ($class, %args) = @_;

      my $config = Bookworms::Config->new;
      my $template_path = $config->get_var( "template_path" );
      my $tt = Template->new( { INCLUDE_PATH => $template_path } );

      my $tt_vars = $args{vars} || {};
      $tt_vars->{site_name} = $config->get_var( "site_name" );

      my $header = CGI::header();

      my $output;
      $tt->process( $args{template}, $tt_vars, \$output)
          or croak $tt->error;
      return $header . $output;
  }

Now we can start writing the classes to manage our database tables. Here's the class to handle book objects:

  package Bookworms::Book;
  use base 'Bookworms::DBI';
  use strict;

  __PACKAGE__->set_up_table( "book" );
  __PACKAGE__->has_a( author => "Bookworms::Author" );
  __PACKAGE__->has_many( "reviews",
                         "Bookworms::Review" => "book" );

  1;

Yes, that's all you need. This simple class, by its ultimate inheritance from Class::DBI, has auto-created constructors and accessors for every aspect of a book as defined in our database schema. And moreover, because we've told it (using has_a) that the author column in the book table is actually a foreign key for the primary key of the table modeled by Bookworms::Author, when we use the ->author accessor we actually get a Bookworms::Author object, which we can then call methods on:

  my ($hobbit) = Bookworms::Book->search( title => "The Hobbit" );
  print "The Hobbit was written by " . $hobbit->author->name;

There are a couple of supporting classes that we need to write, but they're not complicated either.

First a base class, as with all Class::DBI applications, to set the database details:

  package Bookworms::DBI;
  use base "Class::DBI::mysql";

  __PACKAGE__->set_db( "Main", "dbi:mysql:bookworms", 
    "username", "password" );

  1;

Our base class inherits from Class::DBI::mysql instead of plain Class::DBI, so we can save ourselves the trouble of directly specifying the table columns for each of our database tables--the database-specific base classes will auto-create a set_up_table method to handle all this for you.

At the time of writing, base classes for MySQL, PostgreSQL, Oracle, and SQLite are available on CPAN. There's also Class::DBI::BaseDSN, which allows you to specify the database type at runtime.

We'll also want a class for each of the author, review, and reviewer tables, but these are even simpler than the Book class. For example, the author class could be as trivial as:

  package Bookworms::Author;
  use base 'Bookworms::DBI';
  use strict;

  __PACKAGE__->set_up_table( "author" );

  1;

If we wanted to be able to access all the books by a given author, we could add the single line

  __PACKAGE__->has_many( "books",
                         "Bookworms::Book" => "author" );

and an accessor to return an array of Bookworms::Book objects would be automatically created, to be used like so:

  my ($author) = Bookworms::Author->search( name => "J K Rowling" );
  my @books = $author->books;

Or indeed:

  <h1>[% author.name %]</h1>

  <ul>
    [% FOREACH book = author.books %]
      <li>[% book.title %]</li>
    [% END %]
  </ul>

Simple, small, almost trivial classes, taking a minute or two each to write.

What Does This Get Me?

The immediate benefits of all this are obvious:

  • You don't have to mess about with HTML, since the very simplistic use of the Template Toolkit means that templates are comprehensible to competent web designers.
  • You don't have to maintain classes full of copy-and-paste code, since the repetitive programming tasks like creating constructors and simple accessors are done for you.

A large hidden benefit is testing. Since the actual CGI scripts--which can be a pain to test--are so simple, you can concentrate most of your energy on testing the underlying modules.

It's probably worth writing a couple of simple tests to make sure that you've set up your classes the way you intended to, particularly in your first couple of forays into Class::DBI.

  use Test::More tests => 5;
  use strict;

  use_ok( "Bookworms::Author" );
  use_ok( "Bookworms::Book" );
  my $author = Bookworms::Author->create({ name => "Isaac Asimov" });
  isa_ok( $author, "Bookworms::Author" );
  my $book = Bookworms::Book->create({ title  => "Foundation",
                                       author => $author });
  isa_ok( $book, "Bookworms::Book" );
  is( $book->author->name, "Isaac Asimov", "right author" );

However, the big testing win with this technique of separating out the heavy lifting from the CGI scripts into modules is when you'd like to add something more complicated. Say, for example, fuzzy matching. It's well known that people can't spell, and you'd like someone typing in "Isaac Assimov" to find the author they're looking for. So, let's process the author names as we create the author objects, and store some kind of canonicalized form in the database.

Class::DBI allows you to define "triggers"--methods that are called at given points during the lifetime of an object. We'll want to use an after_create trigger, which is called after an object has been created and stored in the database. We use this in preference to a before_create trigger, since we want to know the uid of the object, and this is only created (via the auto_increment primary key) once the object has been written to the database.

We use Search::InvertedIndex to store the canonicalized names, for quick access. We start with a very simple canonicalization--stripping out vowels and collapsing repeated letters. (I've found that this can pick up about half of name misspellings found in the wild, which is pretty impressive.)

We'll write a couple of tests before we move on to code. Here are some that check that our class is doing what we told it to--removing vowels and collapsing repeated consonants.

  use Test::More tests => 2;
  use strict;

  use Bookworms::Author;

  my $author = Bookworms::Author->create({ name => "Isaac Asimov" });
  my @matches = Bookworms::Author->fuzzy_match( name => "asemov" );
  is_deeply( \@matches, [ $author ], 
    "fuzzy matching catches wrong vowels" );
  @matches = Bookworms::Author->fuzzy_match( 
    name => "assimov" );
  is_deeply( \@matches, [ $author ], 
    "fuzzy matching catches repeated letters" );

We should also write some other tests to run our algorithms over various misspellings that we've captured from actual users, to give an idea of whether "what we told our class to do" is the right thing.

Here's the first addition to the Bookworms::Author class, to store the indexed data:

  use Search::InvertedIndex;

  my $database = Search::InvertedIndex::DB::Mysql->new(
                     -db_name    => "bookworms",
                     -username   => "username",
                     -password   => "password",
                     -hostname   => "",
                     -table_name => "sii_author",
                     -lock_mode  => "EX"
    ) or die "Couldn't set up db";

  my $map = Search::InvertedIndex->new( -database => $database )
    or die "Couldn't set up map";
  $map->add_group( -group => "author_name" );

  __PACKAGE__->add_trigger( after_create => sub {
      my $self = shift;
      my $update = Search::InvertedIndex::Update->new(
          -group => "author_name",
          -index => $self->uid,
          -data  => $self->name,
          -keys  => { map { $self->_canonicalise($_) => 1 }
                      split(/\s+/, $self->name)
                    }
      );
      $map->update( -update => $update );
  } );

  sub _canonicalise {
      my ($class, $word) = @_;
      return "" unless $word;
      $word = lc($word);
      $word =~ s/[aeiou]//g;    # remove vowels
      $word =~ s/(\w)\1+/$1/g;  # collapse doubled 
                                # (or tripled, etc) letters
      return $word;
  }

(We'll also want similar triggers for after_update and after_delete, in order that our indexing is kept up to date with our data.)
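To see why the two tests above should pass, it helps to run the canonicalisation rule on its own. The following is a minimal standalone sketch: the function is a plain subroutine named canonicalise (not the class method), and nothing here touches the database or Search::InvertedIndex.

```perl
use strict;
use warnings;

# Standalone copy of the canonicalisation rule: lowercase the word,
# strip vowels, then collapse runs of the same character.
sub canonicalise {
    my ($word) = @_;
    return "" unless defined $word && length $word;
    $word = lc $word;
    $word =~ s/[aeiou]//g;
    $word =~ s/(\w)\1+/$1/g;
    return $word;
}

# All three spellings should end up under the same index key.
for my $name (qw(Asimov asemov assimov)) {
    printf "%-8s => %s\n", $name, canonicalise($name);
}
```

All three spellings reduce to the key smv. The rule is deliberately crude: Nicolas and Nicholas reduce to ncls and nchls, which do not match each other.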

Then we can write the fuzzy_match method:

  sub fuzzy_match {
      my ($class, %args) = @_;
      return () unless $args{name};
      my @terms = map { $class->_canonicalise($_) }
                      split(/\s+/, $args{name});
      my @leaves;
      foreach my $term (@terms) {
          push @leaves, Search::InvertedIndex::Query::Leaf->new(
              -key   => $term,
              -group => "author_name" );
      }

      my $query = Search::InvertedIndex::Query->new( -logic => 'and',
                                                     -leafs => \@leaves );
      my $result = $map->search( -query => $query );

      my @matches;
      my $num_results = $result->number_of_index_entries || 0;
      if ( $num_results ) {
          for my $i ( 1 .. $num_results ) {
              my ($index, $data) = $result->entry( -number => $i - 1 );
              push @matches, $data;
          }
      }

      return @matches;
  }
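The 'and' logic of the query can be illustrated without a database. The sketch below is an assumption-laden stand-in, not Search::InvertedIndex itself: %index, index_author and fuzzy_match_uids are hypothetical names, and the index lives in a plain hash keyed by canonicalised word.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: canonical key => { uid => 1, ... }.
my %index;

sub canonicalise {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/[aeiou]//g;
    $word =~ s/(\w)\1+/$1/g;
    return $word;
}

# Record every canonicalised word of $name against $uid.
sub index_author {
    my ($uid, $name) = @_;
    $index{ canonicalise($_) }{$uid} = 1 for split /\s+/, $name;
    return;
}

# Return the uids whose index entries contain every term of the query.
sub fuzzy_match_uids {
    my ($name) = @_;
    my @terms = grep { length } map { canonicalise($_) } split /\s+/, $name;
    return () unless @terms;

    my %hits;
    for my $term (@terms) {
        $hits{$_}++ for keys %{ $index{$term} || {} };
    }
    return sort { $a <=> $b } grep { $hits{$_} == @terms } keys %hits;
}

index_author(1, "Isaac Asimov");
index_author(2, "Isaac Newton");

print join(",", fuzzy_match_uids("asemov")), "\n";        # only author 1
print join(",", fuzzy_match_uids("isac")), "\n";          # both authors
print join(",", fuzzy_match_uids("isaac azimov")), "\n";  # no match
```

Because every term must match, adding a second word to the query can only narrow the result, which is why the misspelt surname in the last query returns nothing.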

(The matching method can be improved. I've found that neither Text::Soundex nor Text::Metaphone is much of an improvement over the simple approach already detailed, but Text::DoubleMetaphone is definitely worth plugging in, to catch misspellings such as Nicolas/Nicholas and Asimov/Azimof.)

There are plenty of other features that our little web application would benefit from, but I shall leave those as an exercise for the reader. I hope I've given you some insight into my current preferred web development techniques--and I'd love to see a finished Bookworms application if it does scratch anyone's itch.

This week on Perl 6, week ending 2003-07-13

Welcome once again to the Perl 6 Summary, in a week of major developments and tantalizing hints. We start, as usual, with what's happening in perl6-internals:

Targeting Parrot from GCC

Discussion in the thread entitled 'WxWindows Support / Interfacing Libraries' centred on writing a Parrot backend to GCC. (No, I have no idea what that has to do with the thread subject.) Tupshin Harper, Leo Tötsch and Benjamin Goldberg discussed possibilities and potential pit- and pratfalls. At one point, Tupshin suggested emulating a "more traditional stack-oriented processor". I don't think he was joking.

http://groups.google.com/groups

Timely destruction and TRACE_SYSTEM_AREAS

Jürgen Bömmels' Parrot IO rewrite is causing some problems with garbage collection. (IO handles are the canonical examples of resources that need timely destruction).

Leo tracked down the source of the resource leak to a problem with handles being found on the C stack. Jürgen wasn't happy about this (he's not keen on the stack walking approach to garbage collection). He proposed that we get rid of the stack walk in favour of some other solution to the infant mortality problem and offered a few candidates. Leo said that he didn't like walking the C stack, going so far as to state that "Timely destruction and solving infant mortality don't play together or are mutually exclusive - in the current approach." Dan hasn't commented on this yet.

http://groups.google.com/groups

Parrot is not feature frozen

Some old email with the subject "Parrot is feature-frozen until Wednesday" made its way into a small number of inboxes, sowing confusion as it went. Suffice to say that Parrot is not currently feature frozen, though Steve Fink did say that he was considering a point release once the imcc/parrot integration was complete. If Dan gets objects and exceptions finished, then it might even warrant a 0.1.0 version number rather than 0.0.11.

http://groups.google.com/groups

Perl* Abstraction

Luke Palmer has "finally" started to implement his Infinity PMC and has noticed a lot of redundant code in the Perl* classes. He also noticed that Parrot doesn't seem to have the distinction between container and value that has been confusing people on the language list.

http://groups.google.com/groups

Fun with ParrotIO

First, Jürgen Bömmels sent in a patch to excise integer file descriptors from Parrot except when they are managed via ParrotIO PMCs. Leo applied this.

Clinton Pierce thought that this patch meant that a Win32 bug could be closed in the Parrot bug database. This sparked a discussion with Leo and Jürgen, but I'm not entirely sure of the status of the bug.

http://groups.google.com/groups

http://groups.google.com/groups

Jako groks basic PMCs

Gregor N Purdy seems to have started working on Jako again, and checked in some changes allowing Jako to manipulate PMCs. People agreed that this was cool.

http://groups.google.com/groups

I want a Ponie!

The ponie-dev@perl.org mailing list was announced and I'll be summarizing it as of next week when I've joined, caught up, and generally recovered from America.

What's Ponie? Ponie is "Perl On New Internal Engine" or, as Thomas Klausner put it, "A version of Perl 5 that will run on Parrot". Larry announced Ponie at his OSCON 'State of the Onion' address.

Discussion of Ponie on the perl6-internals list centered on the "What is ponie?" question, with a certain amount of "Why ponie-dev, not perl6-ponie?" thrown in for good measure.

Brian Ingerson announced that he'd set up a Ponie Wiki. Leon Brocard pointed at the use.perl story announcing Ponie. Your summarizer punted on writing a description of the project himself.

http://groups.google.com/groups

http://www.poniecode.org/ — More on Ponie

http://ponie.kwiki.org/ — Ingy's Ponie Wiki

http://use.perl.org/article.pl — use.perl announcement

Exceptions!

Leo Tötsch checked in the beginnings of an exceptions system. Then he checked in the beginnings of an events system.

http://groups.google.com/groups

http://groups.google.com/groups

Meanwhile, in perl6-language

There were all of 6 messages, all of them discussing the effects of aliasing an array slice.

http://groups.google.com/groups

Perl 6 Rules at OSCON

No, wait, that should be Perl6::Rules.

For his last talk at OSCON, Damian spoke about Perl6::Rules, his implementation of Perl 6's rules system in pure Perl 5. Oh boy, was it tantalizing. He demonstrated working code that supports a large chunk of Perl 6 matching semantics, complete with handy debugging information, diagnostic dumping and all the other useful stuff.

When we were all gagging for him to release it to CPAN immediately, he told us that it wasn't finished yet; that he'd implemented it all during the week of OSCON, in 700 lines of code; that he was going on holiday for a month once he got home; and that the module would be completed and released to CPAN as time and money allowed and would be out by Christmas.

He didn't say which Christmas.

Trust me, we want this. A lot.

Acknowledgements, Announcements, and Apologies

Hmm... this summary is later than last week's. How did that happen?

Thanks to Darren Stalder and Cindy Fry, our kind hosts in Seattle; to Shiro san for some fantastic sushi on Sunday night; to Jesse Broksmith and Melissa Cain for being our Seattle native guides; to Ward Cunningham for a lift in his Jeep and for just being Ward; and to Casey West for reasons I promised not to go into in the summary.

For months now, I've been half joking about sending me jobs in the last little bit of this summary, but I really mean it now. If anyone's looking for an experienced OO Perl programmer and half experienced writer, I'm hunting work. Please get in touch at the address below.

As ever, if you've appreciated this summary, please consider one or more of the following options:

Integrating mod_perl with Apache 2.1 Authentication

Scratching Your Own Itch

Some time ago I became intrigued with Digest authentication, which uses the same general mechanism as the familiar Basic authentication scheme but offers significantly more password security without requiring an SSL connection. At the time it was really just an academic interest—while some browsers supported Digest authentication, many of the more popular ones did not. Furthermore, even though the standard Apache distribution came with modules to support both Basic and Digest authentication, Apache (and thus mod_perl) only offered an API for interacting with Basic authentication. If you wanted to use Digest authentication, flat files were the only password storage medium available. With both of these restrictions, it seemed impractical to deploy Digest authentication in all but the most limited circumstances.

Fast forward two years. Practically all mainstream browsers now support Digest authentication, and my interest spawned what is now Apache::AuthDigest, a module that gives mod_perl 1.0 developers an API for Digest authentication that is very similar to the Basic API that mod_perl natively supports. The one lingering problem is probably not surprising—Microsoft Internet Explorer. As it turns out, using the Digest scheme with MSIE requires a fully RFC-compliant Digest implementation, and Apache::AuthDigest was patterned after Apache 1.3's mod_digest.c, which is sufficient for most browsers but not MSIE.

In my mind, opening up Digest authentication through mod_perl still needed work to be truly useful, namely full RFC compliance to support MSIE. Wading through RFCs is not how I like to spend my spare time, so I started searching for a shortcut. Because Apache 2.0 did away with mod_digest.c and replaced it with the fully compliant mod_auth_digest.c, I was convinced that there was something in Apache 2.0 I could use to make my life easier. In Apache 2.1, the development version of the next generation Apache server, I found what I was looking for.

In this article, we're going to examine a mod_perl module that provides Perl support for the new authentication provider hooks in Apache 2.1. These authentication providers make writing Basic authentication handlers easier than it has been in the past. At the same time, the new provider mechanism opens up Digest authentication to the masses, making the Digest scheme a real possibility for filling your dynamic authentication needs. While the material is somewhat dense, the techniques we will be looking at are some of the most interesting and powerful in the mod_perl arsenal. Buckle up.

To follow along with the code in this article, you will need at least mod_perl version 1.99_10, which is currently only available from CVS. You will also need Apache 2.1, which is also only available from CVS. Instructions for obtaining the sources for both can be found here. When compiling Apache, keep in mind that the code presented here only works under the prefork MPM — making it thread-safe is the next step in the adventure.

Authentication Basics

Because there is lots of material to cover, we'll skip over the requisite introductory discussion of HTTP authentication, the Apache request cycle, and other material that is probably already familiar, and go straight to the mod_perl authentication API. In both mod_perl 1.0 and mod_perl 2.0, the PerlAuthenHandler represents Perl access to the Apache authentication phase, where incoming user credentials are traditionally matched to those stored within the application. A simple PerlAuthenHandler in mod_perl 2.0 might look like the following.

package My::BasicHandler;

use Apache::RequestRec ();
use Apache::Access ();

use Apache::Const -compile => qw(OK DECLINED HTTP_UNAUTHORIZED);

use strict;

sub handler {
  my $r = shift;

  # get the client-supplied credentials
  my ($status, $password) = $r->get_basic_auth_pw;

  # only continue if Apache says everything is OK
  return $status unless $status == Apache::OK;

  # user1/basic1 is ok
  if ($r->user eq 'user1' && $password eq 'basic1') {
    return Apache::OK;
  }

  # user2 is denied outright
  if ($r->user eq 'user2') {
    $r->note_basic_auth_failure;
    return Apache::HTTP_UNAUTHORIZED;
  }

  # all others are passed along to the Apache default
  # handler, which reads from the AuthUserFile
  return Apache::DECLINED;
}

1;

Although simple and impractical, this handler illustrates the API nicely. The process begins with a call to get_basic_auth_pw(), which does a few things behind the scenes. If a suitable Basic Authorization header is found, get_basic_auth_pw() will parse and decode the header, populate the user slot of the request record, and return OK along with the user-supplied password in clear text. Any value other than OK should be immediately propagated back to Apache, which effectively terminates the current request.
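As a rough illustration of the "parse and decode" step, the Basic scheme in RFC 2617 carries the credentials as the base64 encoding of user:password. The sketch below assumes nothing about Apache internals; parse_basic_header is a hypothetical helper, and the real get_basic_auth_pw() performs additional checks that are not modelled here.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Build the header value a browser would send for user1/basic1.
my $header = 'Basic ' . encode_base64('user1:basic1', '');

# Hypothetical helper: returns (user, password), or an empty list if the
# value is not a well-formed Basic credential.
sub parse_basic_header {
    my ($value) = @_;
    return unless defined $value && $value =~ /^Basic\s+(\S+)$/i;

    # split on the first colon only, since passwords may contain colons
    my ($user, $password) = split /:/, decode_base64($1), 2;
    return unless defined $password;
    return ($user, $password);
}

my ($user, $password) = parse_basic_header($header);
print "user=$user password=$password\n";
```

Note that base64 is an encoding, not encryption, which is why the password is described above as arriving in clear text.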

The next step in the process is where the real authentication logic resides. Our handler is responsible for digging out the username from $r->user() and applying some criteria for determining whether the user-supplied credentials are acceptable. If they are, the handler simply returns OK and the request is allowed to proceed. If they are not, the handler has a decision to make: either call note_basic_auth_failure() and return HTTP_UNAUTHORIZED (which is the same as the old AUTH_REQUIRED) to indicate failure, or return DECLINED to pass authentication control to the next authentication handler.

For the most part, the mod_perl API is identical to the API Apache offers to C module developers. The benefit that mod_perl adds is the ability to easily extend authentication beyond Apache's default flat-file mechanism to the areas where Perl support is strong, such as relational databases or LDAP. However, despite the versatility and strength that programming the authentication phase offered, I never liked the look and feel of the API. While in some respects the process is dictated by the nuances of RFC 2617 and the HTTP protocol itself, the interface always struck me as somewhat inconsistent and difficult for new users to grasp. Additionally, as already mentioned, the API covers only Basic authentication, which is a real drawback as more and more browsers support the Digest scheme.

Apparently I wasn't alone in some of these feelings. Apache 2.1 has taken steps to improve the overall process for module developers. The result is a new, streamlined API that focuses on a new concept: authentication providers.

Authentication Providers in Apache 2.1

While in Apache 2.0 module writers were responsible for a large portion of the authentication logic—calling routines to parse and set authentication headers, digging out the user from the request record, and so on — the new authentication mechanism in Apache 2.1 delegates all HTTP and RFC logic out to two standard modules. mod_auth_basic handles Basic authentication and is enabled in the default Apache build. The standard mod_auth_digest, not enabled by default, handles the very complex world of Digest authentication. Regardless of the authentication scheme you choose to support, these modules are responsible for the details of parsing and interpreting the incoming request headers, as well as generating properly formatted response headers.

Of course, managing authentication on an HTTP level is only part of the story. What mod_auth_basic and mod_auth_digest leave behind is the job of digging out the server-side credentials and matching them to their incoming counterpart. Enter authentication providers.

Authentication providers are modules that supply server-side credential services to mod_auth_basic or mod_auth_digest. For instance, the default mod_authn_file digs the username and password out of the flat file specified by the AuthUserFile directive, similar to the default mechanism in Apache 1.3 and 2.0. An Apache 2.1 configuration that explicitly provides the same flat file behavior as Apache 2.0 would look similar to the following.

<Location /protected>
  Require valid-user
  AuthType Basic
  AuthName realm1

  AuthBasicProvider file

  AuthUserFile realm1
</Location>

The new part of this configuration is the AuthBasicProvider directive, which is implemented by mod_auth_basic and used to specify the provider responsible for managing server-side credentials. There is also a corresponding AuthDigestProvider directive if you have mod_auth_digest installed.

While it could seem as though Apache 2.1 is merely adding another directive to achieve essentially the same results, the shift to authentication providers adds significant value for module developers: a new API that is far simpler than before. Skipping ahead to the punch line, programming with the new Perl API for Basic authentication, which follows the Apache API almost exactly, would look similar to the following.

package My::BasicProvider;

use Apache::Const -compile => qw(OK DECLINED HTTP_UNAUTHORIZED);

use strict;

sub handler {
  my ($r, $user, $password) = @_;

  # user1/basic1 is ok
  if ($user eq 'user1' && $password eq 'basic1') {
    return Apache::OK;
  }

  # user2 is denied outright
  if ($user eq 'user2') {
    return Apache::HTTP_UNAUTHORIZED;
  }

  # all others are passed along to the next provider
  return Apache::DECLINED;
}

1;

As you can see, not only are the incoming username and password supplied in the argument list, removing the need for get_basic_auth_pw() and its associated checks, but gone is the need to call note_basic_auth_failure() before returning HTTP_UNAUTHORIZED. In essence, all that module writers need to be concerned with is validating the user credentials against whatever back-end datastore they choose. All in all, the API is a definite improvement. To add even more excitement, the API for Digest authentication looks almost exactly the same (but more on that later).
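To make the "validate against a datastore" idea concrete, here is a minimal sketch of a provider backed by a lookup table. Everything in it is an assumption for illustration: My::TableProvider and %password_for are hypothetical, and the three constants are local stand-ins for the Apache::Const values so that the code runs outside the server.

```perl
package My::TableProvider;    # hypothetical provider name

use strict;
use warnings;

# Stand-ins for Apache::OK, Apache::DECLINED and Apache::HTTP_UNAUTHORIZED
# so the sketch runs outside the server.
use constant { OK => 0, DECLINED => -1, HTTP_UNAUTHORIZED => 401 };

# Hypothetical datastore; a real provider might query a database here.
my %password_for = ( user1 => 'basic1', alice => 'secret' );

sub handler {
    my ($r, $user, $password) = @_;

    # unknown users are left for the next provider in the list
    return DECLINED unless exists $password_for{$user};

    return $password eq $password_for{$user} ? OK : HTTP_UNAUTHORIZED;
}

package main;

printf "%s => %d\n", $_->[0], My::TableProvider::handler(undef, @$_)
    for [ 'user1', 'basic1' ], [ 'user1', 'wrong' ], [ 'nobody', 'x' ];
```

Unlike the example above, this sketch rejects a known user with a wrong password rather than declining. That is a design choice: declining would let a later provider accept the same username.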

Because the new authentication provider approach represents a significant change in the way Apache handles authentication internally, it is not part of the stable Apache 2.0 tree and is instead being tested in the development tree. Unfortunately, until the provider mechanism is backported to Apache 2.0, or an official Apache 2.2 release, it is unlikely that authentication providers will be supported by core mod_perl 2.0. However, this does not mean that mod_perl developers are out of luck—by coupling mod_perl's native directive handler API with a bit of XS, we can open up the new Apache provider API to Perl with ease. The Apache::AuthenHook module does exactly that.

Introducing Apache::AuthenHook

Over in the Apache C API, authentication providers have a few jobs to do: they must register themselves by name as a provider while supplying a callback interface for the schemes they wish to support (Basic, Digest, or both). In order to open up the provider API to Perl modules our gateway module Apache::AuthenHook will need to accomplish these tasks as well. Both of these are accomplished at the same time through a call to the official Apache API function ap_register_provider.

Usually, mod_perl provides direct access to the Apache C API for us. For instance, a Perl call to $r->get_basic_auth_pw() is proxied off to ap_get_basic_auth_pw, but in this case ap_register_provider only exists in Apache 2.1 and, thus, is not supported by mod_perl 2.0. Therefore, part of what Apache::AuthenHook needs to do is open up this API to Perl. One of the great things about mod_perl is the ease with which it allows itself to be extended even beyond its own core functionality. Opening up the Apache API past what mod_perl allows is relatively easy with a dash of XS.

Our module opens with AuthenHook.xs, which is used to expose ap_register_provider through the Perl function Apache::AuthenHook::register_provider().

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

#include "mod_perl.h"
#include "ap_provider.h"
#include "mod_auth.h"

...

static const authn_provider authn_AAH_provider =
{
  &check_password,
  &get_realm_hash,
};

MODULE = Apache::AuthenHook    PACKAGE = Apache::AuthenHook

PROTOTYPES: DISABLE

void
register_provider(provider)
  SV *provider

  CODE:

    ap_register_provider(modperl_global_get_pconf(),
                         AUTHN_PROVIDER_GROUP,
                         SvPV_nolen(newSVsv(provider)), "0",
                         &authn_AAH_provider);

Let's start at the top. Any XS module you write will include the first three header files, while any mod_perl XS extension will require at least #include "mod_perl.h". The remaining two included header files are specific to what we are trying to accomplish—ap_provider.h defines the ap_register_provider function, while mod_auth.h defines the AUTHN_PROVIDER_GROUP constant we will be using, as well as the authn_provider struct that holds our callbacks.

Skipping down a bit, we can see our implementation of Apache::AuthenHook::register_provider(). The MODULE and PACKAGE declarations place register_provider() into the Apache::AuthenHook package. Following that is the definition of the register_provider() function itself.

As you can see, register_provider() accepts a single argument, a Perl scalar representing the name of the provider to register, making its usage something akin to the following.

Apache::AuthenHook::register_provider('My::BasicProvider');

The user-supplied name is then used in the call to ap_register_provider from the Apache C API to register My::BasicProvider as an authentication provider.

The twist in the process is that our call to ap_register_provider registers Apache::AuthenHook's callbacks (the check_password and get_realm_hash routines, not shown here) for each Perl provider. In essence, this means that Apache::AuthenHook will be acting as a go-between for Perl providers. Much in the same way that mod_perl proper is called by Apache for each phase and dispatches to different Perl*Handlers, Apache::AuthenHook will be called by Apache's authentication modules and dispatch to the appropriate Perl provider at runtime.

If this boggles your mind a bit, not to worry, it is only being presented to give you a feel for the bigger picture and show how easy it is to open up closed parts of the Apache C API with mod_perl and just a few lines of XS. However, the fun part of Apache::AuthenHook (and the part that you are more likely to use in your own mod_perl modules) is handled over in Perl space.

Setting the Stage

Now that we have the ability to call ap_register_provider, we need to link that into the Apache configuration process somehow. What we do not want to do is replace current PerlAuthenHandler functionality, since that directive is for inserting authentication handler logic in place of Apache's defaults. In our case, we need the default modules to run so they can call our Perl providers. Instead, we want to make it possible for Perl modules to register themselves as authentication providers. While we could have Perl providers call our new register_provider() function directly, Apache::AuthenHook chose to make the process transparent, using mod_perl's directive handler API to call register_provider() silently as httpd.conf is parsed.

Apache::AuthenHook makes sneaky use of directive handlers to extend the default Apache AuthBasicProvider and AuthDigestProvider directives so they register Perl providers on-the-fly. The net result is that Perl providers will be fully registered and configured via standard Apache directives, similar to the following.

AuthBasicProvider My::BasicProvider file

At configuration time, Apache::AuthenHook will intercept AuthBasicProvider and register My::BasicProvider. At request time, mod_auth_basic will attempt to authenticate the user, first using My::BasicProvider, followed by the default file provider if My::BasicProvider declines the request.
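The fall-through behaviour described here can be modelled in a few lines of plain Perl. This is a toy model only, not how mod_auth_basic is implemented: authenticate and My::FileStandIn are hypothetical, the constants are local stand-ins for the Apache::Const values, and the rule "the first provider that does not decline decides" is an assumption based on the description above.

```perl
use strict;
use warnings;

# Stand-ins for the Apache::Const values so the sketch runs outside Apache.
use constant { OK => 0, DECLINED => -1, HTTP_UNAUTHORIZED => 401 };

{
    package My::BasicProvider;     # same logic as the provider shown earlier
    sub handler {
        my ($r, $user, $password) = @_;
        return main::OK() if $user eq 'user1' && $password eq 'basic1';
        return main::HTTP_UNAUTHORIZED() if $user eq 'user2';
        return main::DECLINED();
    }
}

{
    package My::FileStandIn;       # hypothetical substitute for "file"
    my %passwords = ( user3 => 'file3' );
    sub handler {
        my ($r, $user, $password) = @_;
        return main::DECLINED() unless exists $passwords{$user};
        return $passwords{$user} eq $password
            ? main::OK() : main::HTTP_UNAUTHORIZED();
    }
}

# Ask each provider in turn; the first one that does not decline decides.
sub authenticate {
    my ($providers, $user, $password) = @_;
    for my $name (@$providers) {
        my $handler = $name->can('handler') or next;
        my $status  = $handler->(undef, $user, $password);
        return $status unless $status == DECLINED;
    }
    return HTTP_UNAUTHORIZED;      # nobody vouched for the user
}

my @chain = qw(My::BasicProvider My::FileStandIn);
print authenticate(\@chain, 'user1', 'basic1'), "\n";  # accepted by the first
print authenticate(\@chain, 'user3', 'file3'),  "\n";  # accepted by the second
print authenticate(\@chain, 'user2', 'any'),    "\n";  # denied by the first
```

Looking the handler up by package name at call time mirrors the go-between role described earlier, where a single set of callbacks dispatches to whichever Perl provider was named in the configuration.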

A nice side effect to this is that through our implementation we will be giving mod_perl developers a feature they have never had before—the ability to interlace Perl handlers and C handlers within the same phase.

AuthDigestProvider My::DigestProvider file My::OtherDigestProvider

Exciting, no? Let's take a look at AuthenHook.pm and see how the directive handler API works in mod_perl 2.0.

Directive Handlers with mod_perl

Directive handlers are a very powerful but little used feature of mod_perl. For the most part, their lack of use probably stems from the complex and intimidating API in mod_perl 1.0. However, in mod_perl 2.0, the API is much simpler and should lend itself to adoption by a larger audience.

The directive handler API allows mod_perl modules to define their own custom configuration directives that are understood by Apache. For example, enabling modules to make use of configuration variables like:

Foo "bar"

in httpd.conf requires only a few relatively simple settings you can code directly in Perl.

While using directive handlers simply to replace PerlSetVar behavior might seem a bit flashy, the techniques used by Apache::AuthenHook are some of the most powerful mod_perl has to offer.

As previously mentioned, we will be extending the new AuthBasicProvider and AuthDigestProvider directives to apply to Perl providers as well, silently registering each provider as the directive itself is parsed. To do this, we redefine these core directives, manipulate their configuration data, then disappear and allow Apache to handle the directives as if we were never there.

The code responsible for this is in AuthenHook.pm.

package Apache::AuthenHook;

use 5.008;

use DynaLoader ();

use mod_perl 1.99_10;     # DECLINE_CMD and $parms->info support
use Apache::CmdParms ();  # $parms->info

use Apache::Const -compile => qw(OK DECLINE_CMD OR_AUTHCFG RAW_ARGS);

use strict;

our @ISA     = qw(DynaLoader);
our $VERSION = '2.00_01';

__PACKAGE__->bootstrap($VERSION);

our @APACHE_MODULE_COMMANDS = (
  { name         => 'AuthDigestProvider',
    errmsg       => 'specify the auth providers for a directory or location',
    args_how     => Apache::RAW_ARGS,
    req_override => Apache::OR_AUTHCFG,
    cmd_data     => 'digest' },

  { name         => 'AuthBasicProvider',
    errmsg       => 'specify the auth providers for a directory or location',
    args_how     => Apache::RAW_ARGS,
    req_override => Apache::OR_AUTHCFG,
    func         => 'AuthDigestProvider',
    cmd_data     => 'basic' },
);

At the top of our module we import a few required items, some that are new and some that should already be familiar. DynaLoader and the bootstrap() method are required to pull in the register_provider() function from our XS implementation and, unlike with mod_perl 1.0, have nothing to do with the actual directive handler implementation. The Apache::CmdParms class provides the info() method we will be illustrating shortly, while Apache::Const gives us access to the constants we will need throughout the process.

The @APACHE_MODULE_COMMANDS array is where the real interface for directive handlers begins. @APACHE_MODULE_COMMANDS holds an array of hashes, each of which defines the behavior of an Apache directive. Let's focus on the first directive our handler implements, AuthDigestProvider, forgetting for the moment that mod_auth_digest also defines this directive.

While it should be obvious that the name key specifies the name of the directive, it is not so obvious that it also specifies the default Perl subroutine to call when Apache encounters AuthDigestProvider while parsing httpd.conf. Later on, we will need to implement the AuthDigestProvider() subroutine, which will contain the logic for all the activities we want to perform when Apache sees the AuthDigestProvider directive.

The args_how and req_override fields tell Apache specifically how the directive is supposed to behave in the configuration. req_override defines how our directive will interact with the core AllowOverride directive, in our case allowing AuthDigestProvider in .htaccess files only on directories governed by AllowOverride AuthConfig. Similarly, args_how defines how Apache should interact with our AuthDigestProvider() subroutine when it sees our directive in httpd.conf. In the case of RAW_ARGS, it means that Apache will pass our callback whatever follows the directive as a single string. Other possible values for both of these keys can be found in the documentation pointed to at the end of this article.

The final important key in our first hash is the cmd_data key, in which we can store a string of our choosing. This will become important in a moment.

The second hash in @APACHE_MODULE_COMMANDS defines the behavior of the AuthBasicProvider directive, which for the most part is identical to AuthDigestProvider. The differences are important, however, and begin with the addition of the func key. Although the default Perl subroutine callback for handling directives is the same as the name of the directive, the func key allows us to point to a different subroutine instead. Here we will be reusing AuthDigestProvider() to process both directives. How will we know which directive is actually being parsed? The cmd_data slot will contain digest when processing AuthDigestProvider and basic when processing AuthBasicProvider.
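The name/func fallback can be pictured with a small lookup over trimmed copies of the two entries. This is only a sketch of the rule as described above, not mod_perl's own dispatch code; resolve_directive is a hypothetical helper.

```perl
use strict;
use warnings;

# Trimmed copies of the two entries, keeping only the keys that matter here.
my @commands = (
    { name => 'AuthDigestProvider', cmd_data => 'digest' },
    { name => 'AuthBasicProvider',  cmd_data => 'basic',
      func => 'AuthDigestProvider' },
);

# Hypothetical helper: report which subroutine would handle a directive and
# which cmd_data string it would see.
sub resolve_directive {
    my ($directive) = @_;
    for my $cmd (@commands) {
        next unless $cmd->{name} eq $directive;
        my $callback = defined $cmd->{func} ? $cmd->{func} : $cmd->{name};
        return ($callback, $cmd->{cmd_data});
    }
    return;
}

for my $directive (qw(AuthDigestProvider AuthBasicProvider)) {
    my ($callback, $data) = resolve_directive($directive);
    print "$directive -> $callback() with cmd_data '$data'\n";
}
```

Both directives resolve to the same subroutine, and only the cmd_data string tells them apart.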

At this point, we have defined what our directives will look like and how they will interact with Apache in httpd.conf. What we have not shown is the logic that sits behind our directives. As we mentioned, both of our directives will be calling the Perl subroutine AuthDigestProvider, defined in AuthenHook.pm.

sub AuthDigestProvider {
  my ($cfg, $parms, $args) = @_;

  my @providers = split ' ', $args;

  foreach my $provider (@providers) {

    # if the provider looks like a Perl handler...
    if ($provider =~ m/::/) {

      # save the config for later
      push @{$cfg->{$parms->info}}, $provider;

      # and register the handler as an authentication provider
      register_provider($provider);
    }
  }

  # pass the directive back to Apache "unprocessed"
  return Apache::DECLINE_CMD;
}

The first argument passed to our directive handler callback, $cfg, represents the configuration object for our module, which we can populate with whatever data we choose and access again at request time. The second argument is an Apache::CmdParms object, which we will use to dig out the string we specified in the cmd_data slot of our configuration hash using the info() method.

While the first two arguments are standard and will be there for any directive handler you write, the third argument can vary somewhat. Because we specified RAW_ARGS as our args_how setting in the configuration hash, $args contains everything on the httpd.conf line following our directive. The standard Auth*Provider directives we are overriding can take more than one argument, so we split on whitespace and break apart the configuration into an array of providers, each of which we then process separately.

Each provider is examined using a cursory check to see whether the specified provider is a Perl provider. If the provider meets our criteria, we call the register_provider() function defined in AuthenHook.xs and keep track of the provider by storing it in our $cfg configuration object.
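The bookkeeping half of the callback can be tried outside Apache. In the sketch below, collect_perl_providers is a hypothetical pure-Perl helper: $scheme plays the role of $parms->info, and the call to register_provider() is left out entirely.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl version of the bookkeeping: $scheme plays the role
# of $parms->info, and nothing is registered with Apache.
sub collect_perl_providers {
    my ($cfg, $scheme, $args) = @_;
    for my $provider (split ' ', $args) {
        next unless $provider =~ m/::/;     # same cursory check as above
        push @{ $cfg->{$scheme} }, $provider;
    }
    return $cfg;
}

my $cfg = { digest => [], basic => [] };
collect_perl_providers($cfg, 'digest',
    'My::DigestProvider file My::OtherDigestProvider');
collect_perl_providers($cfg, 'basic', 'file');

print "digest: @{ $cfg->{digest} }\n";
print "basic:  @{ $cfg->{basic} }\n";
```

One consequence of the cursory check is that a Perl provider in a top-level package (a name with no ::) would be treated as a C provider and skipped.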

The final part of our callback brings the entire process together. The constant DECLINE_CMD has special meaning to Apache. Just as you might return DECLINED from a PerlTransHandler to trick Apache into thinking no translation took place, returning DECLINE_CMD from a directive handler tricks Apache into thinking that the directive was unprocessed. So, after our AuthDigestProvider() subroutine runs, Apache will continue along until it finds mod_auth_digest, which will then process the directive as though we were never there.

The one final piece of AuthenHook.pm that we need to discuss is directive merging. In order to deal properly with situations when directives meet, such as when AuthBasicProvider is specified in both an .htaccess file as well as the <Location> that governs the URI, we need to define DIR_CREATE() and DIR_MERGE() subroutines.

DIR_CREATE() is called at various times in the configuration process, including when <Location> and related directives are parsed at configuration time, as well as whenever an .htaccess file enters the request cycle. This is where we create the $cfg object our callback uses to store configuration data. While it is not required, DIR_CREATE() is a good place to initialize fields in the object as well, which prevents accidentally dereferencing nonexistent references.

sub DIR_CREATE {
  return bless { digest => [],
                 basic  => [], }, shift;
}

DIR_MERGE, generally called at request time, defines how we handle places where directives collide. The following code is standard for allowing the current configuration (%$base) to inherit only missing parameters from higher configurations (%$add), which is the behavior you are most likely to want.

sub DIR_MERGE {
  my ($base, $add) = @_;

  my %new = (%$add, %$base);

  return bless \%new, ref($base);
}

1;

Thus ends AuthenHook.pm.
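If the list-flattening idiom in DIR_MERGE() is unfamiliar, the following standalone sketch shows what it does with colliding keys. The hash contents, including the extra key, are hypothetical and exist only to make the behavior visible.

```perl
use strict;
use warnings;

# Hypothetical configuration data; 'extra' exists only to show a key
# that is present in one hash but not in the other.
my $base = { digest => ['My::DigestProvider'], basic => ['file'] };
my $add  = { digest => ['file'], extra => 'only in add' };

# Same expression as in DIR_MERGE: when a key appears in both lists,
# the later pair (from %$base) replaces the earlier one (from %$add).
my %new = (%$add, %$base);

print "digest: @{ $new{digest} }\n";   # value from $base
print "basic:  @{ $new{basic} }\n";    # value from $base
print "extra:  $new{extra}\n";         # only $add had this key
```

Swapping the order of the two hashes inside the parentheses reverses which side wins, so the order is the whole policy.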

The final result is pretty amazing. By secretly intercepting the AuthDigestProvider directive before mod_auth_digest has the chance to process it, we have provided an interface that makes the presence of Apache::AuthenHook all but undetectable. To enable the new provider mechanism for mod_perl developers, all that is required is to load Apache::AuthenHook using the new PerlLoadModule directive

PerlLoadModule Apache::AuthenHook

and their Perl providers will be magically inserted into the authentication phase at the appropriate time.

Taking a Step Back

Let's recap what we have accomplished so far. AuthenHook.pm redefines AuthDigestProvider and AuthBasicProvider so that any Perl providers listed in the configuration are automagically registered and inserted into the authentication process. At request time, one of the default Apache authentication handlers will call on the configured providers to supply server-side credentials. All registered Perl providers really point to the callbacks in AuthenHook.xs which have the arduous task of proxying the request for server-side credentials to the proper Perl provider. All in all, Apache::AuthenHook covers lots of ground, even if the gory details of what happens over in XS land have been left out.

As we mentioned earlier, Apache::AuthenHook not only provides the ability to write authentication providers in Perl, but it also follows the Apache API very closely. While diving deep into the XS code that Apache::AuthenHook uses to implement the check_password and get_realm_hash callbacks is far beyond the scope of this article, you may find it interesting that the callback signature for check_password

static authn_status check_password(request_rec *r, const char *user,
                                   const char *password)
{
  ...
}

is practically identical to what Apache::AuthenHook passes on to Perl providers supporting the Basic authentication scheme.

sub handler {
  my ($r, $user, $password) = @_;

  ...
}

If you recall, we started investigating the Apache 2.1 provider mechanism as a way to combine the security of the Digest authentication scheme with the strength of Perl. The signature for the Digest authentication callback, get_realm_hash, is only slightly different than check_password.

static authn_status get_realm_hash(request_rec *r, const char *user,
                                   const char *realm, char **rethash)
{
  ...
}

How does this translate into a Perl API? It is surprisingly simple. As it turns out, the name check_password is significant—for Basic authentication, the provider is expected to take steps to see if the incoming username and password match the username and password stored on the server back-end. For Digest authentication, as the name get_realm_hash might suggest, all a provider is responsible for is retrieving the hash for a user at a given realm. mod_auth_digest does all the heavy lifting.

Digest Authentication for the People

While we didn't take the time to explain how Basic authentication over HTTP actually works, briefly explaining Digest authentication is probably worth the time, if only to allow you to appreciate the elegance of the new provider mechanism.

When a request comes in for a resource protected by Digest authentication, the server begins the process by returning a WWW-Authenticate header that contains the authentication scheme, realm, a server generated nonce, and various other bits of information. A fully RFC-compliant WWW-Authenticate header might look like the following.

WWW-Authenticate: Digest realm="realm1", 
nonce="Q9equ9C+AwA=195acc80cf91ce99828b8437707cafce78b11621", 
algorithm=MD5, qop="auth"

On the client side, the username and password are entered by the end user based on the authentication realm sent from the server. Unlike Basic authentication, in which the client transmits the user's password practically in the clear, Digest authentication never exposes the password over the wire. Instead, both the client and server handle the user's credentials with care. For the client, this means rolling up the user credentials, along with other parts of the request such as the server-generated nonce and request URI, into a single MD5 hash, which is then sent back to the server via the Authorization header.

Authorization: Digest username="user1", realm="realm1", 
qop="auth", algorithm="MD5", uri="/index.html",
nonce="Q9equ9C+AwA=195acc80cf91ce99828b8437707cafce78b11621", 
nc=00000001, cnonce="3e4b161902b931710ae04262c31d9307", 
response="49fac556a5b13f35a4c5f05c97723b32"

The server, of course, needs to have its own copy of the user credentials around for comparison. Now, because the client and server have had (at various points in time) access to the same dataset—the user-supplied username and password, as well as the request URI, authentication realm, and other information shared in the HTTP headers—both ought to be able to generate the same MD5 hash. If the hash generated by the server does not match the one sent by the client in the Authorization header, the difference can be attributed to the one piece of information not mutually agreed upon through the HTTP request: the password.

As you can see from the headers involved, there is quite a lot of information to process and interpret with the Digest authentication scheme. However, if you recall, one of the benefits of the new provider mechanism is that mod_auth_digest takes care of all the intimate details of the scheme internally, relieving you from the burden of understanding it at all.

All a Digest provider is required to do is match the incoming user and realm to a suitable digest, stored in the medium of its choosing, and return it. With the hash in hand, mod_auth_digest will do all the subsequent manipulations and decide whether the hash the provider supplied is indeed sufficient to allow the user to continue on its journey to the resource it is after.
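For the curious, the manipulations that mod_auth_digest performs can be sketched in a few lines of Perl. This is an illustration of the qop="auth" calculation from RFC 2617 and not mod_auth_digest's actual code; the digest_response() helper and the password are assumptions made for the example, so the printed value will not match the response shown in the headers above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Compute the "response" value for qop="auth" as laid out in RFC 2617.
# digest_response() is a hypothetical helper, not part of any module.
sub digest_response {
    my (%p) = @_;

    # The per-user hash: this is the value a Digest provider hands back
    my $ha1 = md5_hex(join ':', @p{qw(username realm password)});

    # The per-request hash
    my $ha2 = md5_hex(join ':', @p{qw(method uri)});

    return md5_hex(join ':', $ha1, @p{qw(nonce nc cnonce qop)}, $ha2);
}

# The password here is an assumption, not the one behind the sample headers
my $response = digest_response(
    username => 'user1',
    realm    => 'realm1',
    password => 'secret',
    method   => 'GET',
    uri      => '/index.html',
    nonce    => 'Q9equ9C+AwA=195acc80cf91ce99828b8437707cafce78b11621',
    nc       => '00000001',
    cnonce   => '3e4b161902b931710ae04262c31d9307',
    qop      => 'auth',
);

print qq{response="$response"\n};
```

Only the first of the three hashes depends on the password, which is why a provider never needs to see anything but the username and realm.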

With that background behind us, we can proceed with a sample Perl Digest provider.

package My::DigestProvider;

use Apache::Log;

use Apache::Const -compile => qw(OK DECLINED HTTP_UNAUTHORIZED);

use strict;

sub handler {
  my ($r, $user, $realm, $hash) = @_;

  # user1 at realm1 is found - pass to mod_auth_digest
  if ($user eq 'user1' && $realm eq 'realm1') {
    $$hash = 'eee52b97527306e9e8c4613b7fa800eb';
    return Apache::OK;
  }

  # user2 is denied outright
  if ($user eq 'user2' && $realm eq 'realm1') {
    return Apache::HTTP_UNAUTHORIZED;
  }

  # all others are passed along to the next provider
  return Apache::DECLINED;
}

1;

Note the only slight difference between the interface for Digest authentication as compared to Basic authentication. Because the authentication realm is an essential part of the Digest scheme, it is passed to our handler() subroutine in addition to the request record, $r, and username we received with the Basic scheme.

Knowing the username and authentication realm, our provider can choose whatever method it desires to retrieve the MD5 hash associated with the user. Returning the hash for comparison by mod_auth_digest is simply a matter of populating the scalar referenced by $hash and returning OK. While using references in this way may feel a bit strange, it follows the same pattern as the official Apache C API, so I guess that makes it ok.

If the user cannot be found, the provider can choose to return HTTP_UNAUTHORIZED and deny access to the user, or return DECLINED to pass authority for the user to the next provider. Remember, unlike with the Perl handlers for all the other phases of the request, you can intermix Perl providers with C providers, sandwiching the default file provider with Perl providers of your own choosing.

The one question that remains is how to generate a suitable MD5 digest to pass back to mod_auth_digest. For the default file provider, the return digest is typically generated using the htdigest binary that comes with the Apache installation. However, a Perl one-liner that can be used to generate a suitable MD5 digest for Perl providers would look similar to the following.

$ perl -MDigest::MD5 -e'print Digest::MD5::md5_hex("user:realm:password")'

That is all there is to it. No hash checking, no header manipulations, no back flips or somersaults. Simply dig out the user credentials and pass them along. At last, Digest authentication for the (Perl) people.
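To see the scalar-reference convention and the digest format working together without a running server, here is a rough stand-in for a provider. The %digest_for table, the lookup_digest() name, and the password are all hypothetical, and plain 1 and 0 take the place of the Apache::OK and Apache::DECLINED constants a real handler would return.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical credential store keyed by "user:realm"; a real provider
# might consult a database instead.
my %digest_for = (
    'user1:realm1' => md5_hex('user1:realm1:secret'),
);

# Stand-in for a provider: fills the scalar behind $hash and returns
# true on success, false when the user is unknown.
sub lookup_digest {
    my ($user, $realm, $hash) = @_;

    my $digest = $digest_for{"$user:$realm"};
    return 0 unless defined $digest;

    $$hash = $digest;
    return 1;
}

my $found;
if (lookup_digest('user1', 'realm1', \$found)) {
    print "digest for user1: $found\n";
}
```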

Don't Forget the Tests!

Of course, no module would be complete without a test suite, and the Apache-Test framework introduced last time gives us all the tools we need to write a complete set of tests.

For the most part, the tests for Apache::AuthenHook are not that different from those presented before. LWP supports Digest authentication natively, so all our test scripts really need to do is make a request to a protected URI and let LWP do all the work. Here is a snippet from one of the tests.

plan tests => 10, (have_lwp &&
                   have_module('mod_auth_digest'));

my $url = '/digest/index.html';

$response = GET $url, username => 'user1', password => 'digest1';
ok $response->code == 200;

When we plan the tests, we first check for the existence of mod_auth_digest—both mod_auth_basic and mod_auth_digest can be enabled or disabled for any given installation, so we need to check for them where appropriate. Passing the username and password credentials is pretty straightforward, using the username and password keys after the URL when formatting the request.

Actually, while the username and password keys have special meaning, you can use the same technique to send any arbitrary headers in the request.

# fetch the Last-Modified header from one response...
my $last_modified = $response->header('Last-Modified');

# and use it in the next request
$response = GET $uri, 'If-Modified-Since' => $last_modified;
ok ($response->code == HTTP_NOT_MODIFIED);

That's something to note just in case you need that functionality sometime later in your testing life.

One final note about our tests will apply to anyone writing a mod_perl XS extension. Instead of using extra.conf.in to configure Apache, we used extra.last.conf.in. The difference between the two is that extra.last.conf.in is guaranteed to be loaded last in the configuration order. If our PerlLoadModule directive is processed before mod_perl gets the chance to add the proper blib entries, nothing will work, so ensuring our configuration is loaded after everything else is in place is important.

Whew

mod_perl is truly exciting. With surprisingly little work, we have managed to open an entire new world within Apache 2.1 to the Perl masses. I know of no other blend of technologies that allows for such remarkable flexibility beyond what each individually brings to the table. Hopefully, this article has not only introduced you to new Apache and authentication concepts, but has also brought to light ways in which you can leverage mod_perl that you never thought of before.

More Information

I apologize if this article is a little on the heavy side, teasing you with only cursory introductions to cool concepts while leaving out the finer details. So, if you want to explore these concepts in more detail, I leave you with the following required reading.

A nice overall introduction to the new provider mechanism can be found in Safer Apache Driving with AAA. The mechanics of Basic authentication can be found in lots of places, but decent explanations of Digest authentication are harder to find. Both are covered to some level of detail in Chapter 13 of the mod_perl Developer's Cookbook, which is freely available online. Recipe 13.8 in particular includes the code that became the splinter in my mind and eventually this article.

A more detailed explanation of directive handlers in mod_perl 2.0 can be found in the mod_perl 2.0 documentation. Although covering only mod_perl 1.0 directive handlers, whose implementation is very different, Chapter 8 in Writing Apache Modules with Perl and C and Recipes 7.8 through 7.11 in the mod_perl Developer's Cookbook provide excellent explanations of concepts that are universal to both platforms, and are essential reading if you plan on using directive handlers yourself. If you are curious about the intricate details of directive merging, Chapter 21 in Apache: the Definitive Guide presents probably the most comprehensive explanation available.

Finally, if you are interested in the gory details of the XS that really drives Apache::AuthenHook, there is no better single point of reference than Extending and Embedding Perl, which was my best friend while writing this module and absolutely deserves a place on your bookshelf.

Thanks

Many thanks to Stas Bekman and Philippe Chiasson for their feedback and review of the several patches to mod_perl core that were required for the code in this article, as well as to Jörg Walter, who was kind enough to take the time to review this article and give valuable feedback.

This week on Perl 6, week ending 2003-07-06

Perl 6 Summary for the week ending 20030706

Welcome to this week's Perl 6 Summary, coming to you live from a gatecrashed Speakers' lounge at OSCON/TPC, surrounded by all the cool people like Dan Sugalski, Lisa Wolfisch, Graham Barr, and Geoff Young, who aren't distracting me from writing this summary at all.

10 minutes later, after the arrival of James Duncan and Adam Turoff, I'm still not being distracted. Oh no... Leon Brocard's just arrived, which helps with the running joke, but not with the 'getting the summary written'.

So, we'll start as usual with the goings on in perl6-internals.

More on Parrot's multiple stack implementations

Dan pointed out that, although Control, User, and Pads share the same stack engine, Pads should be implemented using a simple linked list. He also confessed that all the register backing stacks should share an implementation, but that he'd been too lazy to macro things up properly.

http://groups.google.com/groups

Building IMCC as parrot

Leo Tötsch posted an initial patch to make IMCC build as the parrot executable. And there was much Makefile debugging.

http://groups.google.com/groups

Parrot IO work

Work on Parrot's IO system was ongoing this week.

http://groups.google.com/groups

Parrot Exceptions

Discussion of possibly resumable exceptions continued, and morphed into a discussion of the workings of warnings when Benjamin Goldberg wondered if warnings were being implemented as exceptions.

They aren't. I think Benjamin was being confused by Perl 6's fail function which can either issue a warning and return undef to its caller's caller, or throw an exception, depending on a pragma whose name I can't for the life of me remember.

http://groups.google.com/groups

Parrot's build system

Last week Dan asked for help writing a configuration and build system that would allow per-C-file compiler flag overrides and various other complexities. This week Alan Burlison worried that this approach looked "very suspect". This sparked a discussion of the proper way to do interrupt safe queueing, reentrant code, and other scary things.

I may be misreading what's being said in the thread, but it seems that the scariest thing about the whole issue is that there's no portable way of doing the Right Thing, which leads to all sorts of painful platform dependencies and hoopage (From "to jump through hoops", anything which requires you to jump through a lot of hoops has a high degree of hoopage. For the life of me I can't remember if it was me or Leon Brocard who coined the term).

Swamps were mentioned. Monty Python skits were quoted. Uri Guttman was overtaken by (MUAHAHAHAHAHA!) megalomania.

http://groups.google.com/groups

Klaas-Jan Stol Explains Everything (part 1)

Last week I mentioned that I didn't know what Klaas-Jan Stol was driving at when he proposed a general, language-independent "argument" PMC class and hoped that he would provide an explanation, with code fragments.

This week, Klaas-Jan came through, and I think I understand what he wants. Go read his explanation and see if you understand it too. It's to do with when a parameter should be passed by reference or by value. Parrot currently assumes that Strings and PMCs are passed by reference, and that integers and floats are passed by value.

There was some discussion of whether support for optionally passing PMCs by value should be added at the parrot level or whether individual language compilers should be responsible for calling the appropriate PMC methods.

http://groups.google.com/groups

Moving ParrotIO to PMCs

Jürgen Bömmels is working hard on moving Parrot's IO-system to a fully garbage collected PMC based system. He posted an initial patch transforming ParrotIO structures into PMCs using a simple wrapping approach.

The patch had problems because the new ParrotIO PMCs were marked as needing active destruction, which could lead to problems with files being closed twice and memory structures getting cleaned up twice, which tends to make memory leak detecting tools a little unhappy. This bug proved to be easy to fix. But then another one reared its head and has so far proved rather harder to fix.

http://groups.google.com/groups

Jako gets modules (sort of)

Gregor N Purdy announced that he's added rudimentary module support to his Jako "little" (?) language. A little later he announced that he'd added rather less rudimentary module support.

http://groups.google.com/groups

Stupid Parrot Tricks

Clinton Pierce, shooting for his "mad genius" badge, announced that he'd implemented a simple CGI script in Parrot BASIC. Everyone gasped. Robert Spier decided that the time may have come to start playing with mod_parrot (embedding Parrot in Apache) again.

http://groups.google.com/groups

http://www.camfriends.org/testform.html -- that BASIC CGI URL

Lazy Arrays

Luke Palmer is thinking about implementing LazyArray and LazySequence PMCs and outlined his design approach for the list. Benjamin Goldberg and Leo Tötsch both contributed answers to some of the questions Luke raised.

http://groups.google.com/groups

wxWindows Support

David Cuny asked if there was any interest in supporting wxWindows (an open source, cross platform, native UI framework for a pile of operating systems).

Leo thought that one could use wxWindows from Parrot via the NCI (Native Call Interface), but that it wouldn't be at all easy. He thought that the best way would be a custom dynamically loaded PMC, tied to good object support in Parrot. Neither of which we have (yet).

Leo also thought that using Parrot as a GCC backend would be a reasonably sensible idea, but that it "would need some extensions at both sides". He wondered if there were any GCC people listening.

http://groups.google.com/groups

Hash Iterators

Leo checked in his "buggy initial try" at implementing a hash iterator for Parrot, but he wasn't entirely happy with his implementation. Sean O'Rourke suggested a Better Way, which Leo immediately took advantage of.

http://groups.google.com/groups


Meanwhile, in perl6-language

Almost nothing happened.

printf like formatting in interpolated strings

Jonadab the Unsightly One made another proposal in this longrunning thread.

http://groups.google.com/groups

Perl 6 Daydreams

The directed daydreaming continued. Jonadab is looking forward to Perl 6's new improved object model (him and many, many others I believe). This morphed into a discussion of porting Inform to run on Parrot, which spawned thoughts of when Parrot achieves its final goal of being able to load and run z-code based interactive fiction. (The theory is that getting z-code working on Parrot will be the final goal because once that happens the dev team will stop working on Parrot and spend all their time playing Zork).

http://groups.google.com/groups

Aliasing an array slice

Dan Brook wondered if it would be either possible or sane to bind a variable to an array slice:

my @array_slice := @array[1,3,5]

Luke Palmer thought it would have to be, because if it weren't

my *@fibs := (1, 1, map { $^a  + $^b} zip(@fibs, @fibs[1...]));

wouldn't work (and that would be a bad thing because?).

Damian pointed out that it sort of worked already in Perl 5. This started people discussing what the Right Thing would finally be in Perl 6.

http://groups.google.com/groups


Acknowledgements, Announcements and Apologies

Look, I'm sorry the Summary's late, okay? OSCON is an incredibly distracting thing to be present at; every time I settled down to write I got drawn into another fascinating conversation and before I knew it it was time to go back to my hotel, which is, of course, where I managed to knuckle down and write this.

Thanks to Curtis Poe for putting us up for the first couple of nights in Portland, and to Tom Phoenix for being a top notch Native Guide. No thanks to Powell's City of Books http://www.powells.com/ for having far too much good stuff in stock. Thankfully, they ship.

As ever, if you've appreciated this summary, please consider one or more of the following options:

Power Regexps, Part II

In the previous article, we looked at some of the more intermediate features of regular expressions, including multiline matching, quoting, and interpolation. This time, we're going to look at more-advanced features. We'll also look at some modules that can help us handle regular expressions.

Look Forward, Look Back

Perhaps the most misunderstood facilities of regular expressions are the lookahead and lookbehind operators; let's begin with the simplest, the positive lookahead operator.

This operator, spelled (?= ), attempts to match a pattern, and if successful, promptly forgets all about it. As its name implies, it peeks forward into the string to see whether the next part of the string matches the pattern. For instance:


    $a="13.15    Train to London"; 
    $a=~ /(?=.*London)([\d\.]+)/

This is perhaps an inefficient way of writing:


    $a =~ /([\d\.]+).*London/;

and it can be read as "See if this string has 'London' in it somewhere, and if so, capture a series of digits or periods."
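A short script makes the "peek, then forget" behavior concrete. This is only a sketch; the second timetable line is invented to show a non-match.

```perl
use strict;
use warnings;

# The second line is a made-up entry that should not match
my @lines = ("13.15    Train to London", "14.02    Train to Leeds");
my @times;

for my $line (@lines) {
    # The lookahead checks for "London" without consuming anything,
    # so the capture still starts at the beginning of the string.
    if ($line =~ /(?=.*London)([\d\.]+)/) {
        push @times, $1;
    }
}

print "London departures: @times\n";
```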

Here's an example of it in real-life code; I want to turn some file names into names of Perl modules. I'll have a name like /Library/Perl/Mail/Miner/Recogniser/Phone.pm - this is part of my Mail::Miner module, so I can guarantee that the name of the module will start with Mail/Miner - and I want to get Mail::Miner::Recogniser::Phone. Here's the code that does it:


    our @modules = map {
        s/\.pm$//;
        s{.*(?=Mail/Miner)}{};
        join "::", splitdir($_)
    } @files;

We look at each of our files, and first take off the .pm from the end. Now what we need to do is remove everything before the Mail/Miner portion, stripping off /Library/Perl or whatever our path happens to be. Now we could write this as:


    s{.*Mail/Miner}{Mail/Miner};

removing everything which appears before Mail/Miner and then the text Mail/Miner itself, and then replacing all that with Mail/Miner again. This is obviously horribly long-winded, and it's much more natural to think of this in terms of "get rid of everything but stop when you see Mail/Miner". In most cases, you can think of (?= ) as meaning "up to".
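Here is a self-contained version to experiment with. It is a sketch under a few assumptions: the paths in @files are invented, splitdir comes from File::Spec::Functions, and, unlike the snippet above, it works on a copy of each name so that @files is left untouched.

```perl
use strict;
use warnings;
use File::Spec::Functions qw(splitdir);

# Hypothetical input paths, following the layout described above
my @files = (
    '/Library/Perl/Mail/Miner/Recogniser/Phone.pm',
    '/usr/local/lib/perl5/Mail/Miner/Attachment.pm',
);

my @modules = map {
    my $name = $_;                     # work on a copy, leave @files intact
    $name =~ s/\.pm$//;
    $name =~ s{.*(?=Mail/Miner)}{};    # delete up to, but not including, Mail/Miner
    join '::', splitdir($name);
} @files;

print "$_\n" for @modules;
```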

Similar but subtly different is the negative counterpart (?! ). This again peeks forward into the string, but ensures that it doesn't match the pattern. A good way to think of this is "so long as you don't see". Damian Conway's Text::Autoformat contains some code for detecting quoted lines of text, such as may be found in an e-mail message:


    % Will all this regular expression scariness go away in 
    % Perl 6?

    Yes, definitely; we're replacing it with a completely different set
    of scariness.

Here the first two lines are quoted, and the expressions that check for this look like so:


    my $quotechar = qq{[!#%=|:]};
    my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};

$quotechar contains the characters that we consider to signify a quotation, and $quotechunk has two options for what a quotation looks like. The second is most natural: a greater-than sign, possibly preceded by some initials, such as produced by the popular Supercite emacs package:


    SC> You're talking nonsense, you odious little gnome!

The left-hand side of the alternation in $quotechunk is a little more interesting. We look for one of our quotation characters, such as % as in the example above, but then we make sure that the next character we see is not alphabetic; this may be a quotation:


    % I think that all right-thinking people...

but this almost certainly isn't


    %options = ( verbose => 1, debug => 0 );

The (?!) acts as a "make sure you don't see" directive.
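The three sample lines can be run through the two patterns directly. This is a sketch rather than Text::Autoformat's own logic: anchoring with ^\s* and adding /i (so that the upper-case initials in "SC>" are accepted) are assumptions made for the demonstration.

```perl
use strict;
use warnings;

my $quotechar  = qq{[!#%=|:]};
my $quotechunk = qq{(?:$quotechar(?![a-z])|[a-z]*>+)};

my @lines = (
    '% I think that all right-thinking people...',
    '%options = ( verbose => 1, debug => 0 );',
    q{SC> You're talking nonsense, you odious little gnome!},
);

# 1 if the line starts with something that looks like a quote marker.
# The ^\s* anchor and the /i flag are assumptions for this sketch.
my @verdicts = map { /^\s*$quotechunk/i ? 1 : 0 } @lines;

for my $i (0 .. $#lines) {
    my $label = $verdicts[$i] ? 'quoted    ' : 'not quoted';
    print "$label  $lines[$i]\n";
}
```

The hash declaration is rejected because the character after % is a letter, which is exactly the case the negative lookahead exists to catch.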

The mistake everyone makes at least once with this is to assume you can say:


    /(?!foo)bar/;

and wonder why it matches against foobar. After all, we've made sure we didn't see a foo before the bar, right? Well, not exactly. These are lookahead operators, and so can't be used to find things "before" anything at all; they're only used to determine what we can or can't see after the current position. To understand why this is wrong, imagine what it would mean if it were a positive assertion:


    /(?=foo)bar/;

This means "are the next three characters we see foo? If so, the next three characters we see are bar". This is obviously never going to happen, since a string can't contain both foo and bar at the same position and the same time. (Although I believe Damian has a paper on that.) So the negative version means "are the next three characters we see not foo? Then match bar". foo is not bar, so this matches any bar. What was probably meant was a lookbehind assertion, which we will look at imminently.

Now we've seen the two forward-facing assertions, we can turn (ha, ha) to the backward-facing assertions, positive and negative lookbehind. There's one important difference between these and their forward-facing counterparts; while lookahead operators can contain more or less any kind of regular expression pattern, for reasons of implementation the lookbehind operators must have a fixed width computable at compile time. That is, you're not allowed to use any indefinite quantifiers in your subpatterns.

The positive lookbehind assertion is (?<=), and the only thing you need to know about it is that it's so rare I can't remember the last time I saw it in real code. I don't think I've ever used it, except possibly in error. If you think you want to use one of these, then you almost certainly need to rethink your strategy. Here's a quick example, though, from IPC::Open3:


    $@ =~ s/(?<=value attempted) at .*//s;

The context for this is that we've just done the equivalent of


    eval { $_[0] = ... };

and if someone maliciously passes a constant value to the subroutine, we want to throw the Modification of a read-only value attempted error back in their face. We check we're seeing the error we expect, then strip off the at .../IPC/Open3.pm, line 154 part of the message so that it can be fed to croak. The less Tom-Christianseny way to do this would be something like:


    croak "You fed me bogus parameters" if $@ =~ /attempted/;
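To watch the positive lookbehind do its job, we can apply the substitution to a sample message. The message text below is a stand-in in the shape Perl produces; the path and line number are placeholders.

```perl
use strict;
use warnings;

# A sample message in the shape Perl produces; the file name and line
# number are stand-ins.
my $error = "Modification of a read-only value attempted at /path/to/IPC/Open3.pm line 154.\n";

# Only strip the location when it follows the text we expect to see
(my $message = $error) =~ s/(?<=value attempted) at .*//s;

print "$message\n";
```

Because the lookbehind consumes nothing, the words "value attempted" survive the substitution and only the location is removed.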

The negative lookbehind assertion, on the other hand, is considerably more common; this is the answer to our "bar not preceded by foo" problem of the previous section.


    /(?<!foo)bar/;

This will match bar, peeking backward into the string to make sure it doesn't see foo first. To take another example, suppose we're preparing some text for sending over the network, and we want to make sure that all the line feeds (\n) have carriage returns (\r) before them. Here's the truly lazy way to do it:


    # Make sure there's an \r in there somewhere
    s{\n}  {\r\n}g;
    # And then strip out duplicates
    s{\r\r}{\r}  g;
 
This is fine (if somewhat inefficient) unless it's OK for two carriage returns to appear without a line feed in the way. Here's the finesse:

    s/(?<!\r)\n/\r\n/g;

If you see a line feed that is not preceded by a carriage return, then stick a carriage return in there -- much cleaner, and much more efficient.
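The difference between looking ahead and looking behind is easiest to see side by side. The following sketch uses made-up strings to compare the two assertions and then applies the line-ending fix.

```perl
use strict;
use warnings;

for my $string ('foobar', 'bazbar') {
    # (?!foo) tests what comes next; (?<!foo) tests what came before
    my $ahead  = $string =~ /(?!foo)bar/  ? 'matches' : 'does not match';
    my $behind = $string =~ /(?<!foo)bar/ ? 'matches' : 'does not match';
    print "${string}: lookahead $ahead, lookbehind $behind\n";
}

# Mixed line endings: the second line already has its carriage return
my $text = "one\ntwo\r\nthree\n";
(my $wire = $text) =~ s/(?<!\r)\n/\r\n/g;

# Show the control characters visibly
(my $visible = $wire) =~ s/\r/\\r/g;
$visible =~ s/\n/\\n/g;
print "$visible\n";
```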

split, //g and other shenanigans

In the previous article, we had a nice piece of multiline, formatted data, such as one might expect to parse with Perl:


    Name: Mark-Jason Dominus
    Occupation: Perl trainer
    Favourite thing: Octopodes

    Name: Simon Cozens
    Occupation: Hacker
    Favourite thing: Sleep

Now, there's a boring way to parse this. If you're coming from a C or Java background, then you might try:


    my $record = {};
    my @records;
    for (split /\n/, $text) {
        chomp;
        if (/([^:]+): (.*)/) {
            $record->{$1} = $2;
        } elsif ($_ =~ /^\s*$/) {
            # Blank line => end of current record
            push @records, $record;
            $record = {};
        } else {
            die "Wasn't expecting to see '$_' here";
        }
    }
    # The final record has no blank line after it, so save it here
    push @records, $record if %$record;

And, of course, this will work. But there are several more Perl-ish solutions than this. When you know the fields provided by your data, it's rather nice to have a regular expression that reflects the data structure:


    while ($text =~ /Name:\s(.*)\n
                     Occupation:\s(.*)\n 
                     Favourite.*:\s(.*)/gx) {
        push @records, { name => $1, occupation => $2, favourite => $3 }
    }

Here we use the /g modifier, which allows us to resume the match from where it last left off.
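As a complete script, with the sample data supplied in a here-document, the loop looks like this. It is a sketch; the variable names and the printf at the end are only there to make the result visible.

```perl
use strict;
use warnings;

my $text = <<'END';
Name: Mark-Jason Dominus
Occupation: Perl trainer
Favourite thing: Octopodes

Name: Simon Cozens
Occupation: Hacker
Favourite thing: Sleep
END

my @records;

# In scalar context, each pass of //g picks up where the previous
# match ended, so the loop body runs once per record.
while ($text =~ /Name:\s(.*)\n
                 Occupation:\s(.*)\n
                 Favourite.*:\s(.*)/gx) {
    push @records, { name => $1, occupation => $2, favourite => $3 };
}

printf "%s (%s) likes %s\n", @{$_}{qw(name occupation favourite)} for @records;
```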

If we don't know the fields while we're writing our program, then we'll have to break the process up into two stages. First, we extract individual records: records are delimited by a blank line:


    my @texts = split /\n\s*\n/, $text;

And then for each record, we can either use the /g trick again, or simply split each record into lines. I prefer the latter, for reasons you'll see in a second:


    for (@texts) {
        my $record = {};
        for (split /\n/, $_) {
            /([^:]+): (.*)/;
            $record->{$1} = $2;
        }
        push @records, $record;
    }

This is not dissimilar from the initial solution, but it allows us to make some interesting improvements. For starters, when you see code that transforms data with a for loop, you should wonder whether it could be better written with a map statement. This goes double if you're using push inside the for loop as we are here. So this version is a natural evolution:


    @records = map {
        my $record = {};
        for (split /\n/, $_) { 
            /([^:]+): (.*)/;
            $record->{$1} = $2;
        }
        $record;
    } split /\n\s*\n/, $text;

And we can actually do away with the inner for loop too:


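    # The leading + makes Perl treat the inner braces as an anonymous
    # hash rather than a block; a non-matching line contributes nothing.
    @records = map {
        +{
            map { /([^:]+): (.*)/ ? ($1 => $2) : () } split /\n/, $_
        }
    } split /\n\s*\n/, $text;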
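As a rough end-to-end check of the nested map approach, here is a small sketch run against invented sample data. It writes the hash constructor as +{ ... } so that Perl parses it as an anonymous hash rather than a bare block, and it uses a ternary so that a line which fails to match contributes an empty list; both choices are this sketch's own.

```perl
use strict;
use warnings;

# Hypothetical sample data; the second record has a field the first lacks.
my $text = <<'END';
Name: Alice Example
Occupation: Tester

Name: Bob Sample
Occupation: Reviewer
Favourite colour: Blue
END

# Outer map: one anonymous hash per blank-line-delimited record.
# Inner map: one key/value pair per "field: value" line.
my @records = map {
    +{
        map { /([^:]+): (.*)/ ? ($1 => $2) : () } split /\n/, $_
    }
} split /\n\s*\n/, $text;

for my $record (@records) {
    print join(', ', map { $_ . '=' . $record->{$_} } sort keys %$record), "\n";
}
```

Since the field names are taken from the data itself, each record ends up with whatever keys its lines happened to supply.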

But if we're prepared to be a little lax about trailing whitespace, there's actually an even nicer way to do it, using the one thing that everyone forgets about split: if your split pattern contains capturing parentheses, then the captured text is inserted into the list returned by split. That is, the following code:


    split( /(\W+)/, "perl-5.8.0.tar.gz")

will produce the list


    ("perl", "-", "5", ".", "8", ".", "0", ".", "tar", ".", "gz")
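To see the difference the parentheses make, this short sketch runs the same split with and without them. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $name = "perl-5.8.0.tar.gz";

# Without capturing parentheses the separators are thrown away.
my @plain = split /\W+/, $name;

# With capturing parentheses each separator is returned as well.
my @with_seps = split /(\W+)/, $name;

print join('|', @plain), "\n";
print join('|', @with_seps), "\n";

# Nothing was discarded, so joining the pieces rebuilds the original.
print join('', @with_seps) eq $name ? "round trip ok\n" : "mismatch\n";
```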

So we can actually use the field name, colon and space at the start of each line as the split expression itself:


    split /^([^:]+):\s*/m

There is a slight problem with this idea: because the first thing in each record is the delimiter we're looking for, the first thing returned by split will be an empty string. But we can easily get around this by putting an undef in front of the list, which pairs up with that empty string to provide a fake undef => '' hash element. This allows us to reduce the parser code to:


    # +{ forces an anonymous hash; a bare { here would be parsed as a block
    @records = map {
                     +{ undef, split /^([^:]+):\s*/m, $_ }
                   } split /\n\s*\n/, $text;

It may not be pretty, but it's quick and it works.
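The sketch below shows what "lax about trailing whitespace" means in practice, using invented sample data. It puts an explicit empty string where the code above uses undef, on the assumption that this keeps things quiet under use warnings, and then tidies up afterwards; the cleanup loop is this sketch's addition.

```perl
use strict;
use warnings;

# Hypothetical sample data.
my $text = <<'END';
Name: Alice Example
Occupation: Tester

Name: Bob Sample
Occupation: Reviewer
END

# split returns ('', 'Name', "Alice Example\n", 'Occupation', ...);
# the extra '' in front turns that into an even-sized key/value list.
my @records = map {
    +{ '', split /^([^:]+):\s*/m, $_ }
} split /\n\s*\n/, $text;

for my $record (@records) {
    delete $record->{''};             # drop the fake '' => '' element
    s/\s+\z// for values %$record;    # trim the newlines left on values
}

print "$_->{Name} works as a $_->{Occupation}\n" for @records;
```

Without the cleanup, most values would still carry the newline that ended their line, and every record would have a stray empty-string key.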

Of course, you may also use lookahead and lookbehind assertions with split; I sometimes use the following code to break a string into tokens:


    split /(?<=\W)|(?=\W)/, $string;

This is almost the same as


    split /(\W)/, $string

but with a subtle difference. Again, as Perl wants to see a nonword character as a delimiter, it will return an empty string between two adjacent nonwords:


    split /(\W)/, '$foo := $bar';
    # '', '$', 'foo', ' ', '', ':', '', '=', '', ' ', '', '$', 'bar'

Splitting on a word boundary goes too much the other way:


    split /\b/, '$foo := $bar';
    # '$', 'foo', ' := $', 'bar'

And so it turns out that we want to cleave the string where we've just seen a nonword character, or if we're about to see one:


    split /(?<=\W)|(?=\W)/, $string;
    # '$', 'foo', ' ', ':', '=', ' ', '$', 'bar'

And this gives us the sort of tokenisation we want.
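Wrapped up as a subroutine, the tokeniser might look like this. The name tokenize and the decision to throw away whitespace-only tokens are assumptions of this sketch.

```perl
use strict;
use warnings;

# tokenize() is a hypothetical helper name.
sub tokenize {
    my ($string) = @_;
    # Split wherever a nonword character was just seen or is about to
    # be seen, then discard tokens that are nothing but whitespace.
    return grep { /\S/ } split /(?<=\W)|(?=\W)/, $string;
}

my @tokens = tokenize('$foo := $bar');
print join(' | ', @tokens), "\n";   # prints: $ | foo | : | = | $ | bar
```

Because both alternatives are zero-width assertions, no characters are consumed as delimiters, so every character of the input ends up in some token before the grep filters the whitespace out.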

Regexp Modules

Now, though, we are getting into the sort of regular expressions that are not written lightly, and we may need some help constructing and debugging these expressions. Thankfully, there are plenty of modules which make regexp handling much easier for us.

re

The re module is as invaluable as it is obscure. It's one of those hidden treasures of the Perl core that Casey was talking about last month. As well as turning on two features of the regular expression engine, tainting subexpressions and evaluated assertions, it provides a debugging facility that allows you to watch your expression being compiled and executed.

Here's a relatively simple expression:


    $a =~ /([^:]+):\s*(.*)/;

When this code is run under -Mre=debug, then the following will be printed when the regexp is compiled:


    Compiling REx `([^:]+):\s*(.*)'
    size 25 first at 4
       1: OPEN1(3)
       3:   PLUS(13)
       4:     ANYOF[\0-9;-\377](0)
      13: CLOSE1(15)
      15: EXACT <:>(17)
      17: STAR(19)
      18:   SPACE(0)
      19: OPEN2(21)
      21:   STAR(23)
      22:     REG_ANY(0)
      23: CLOSE2(25)
      25: END(0)

This tells us the instructions for the little machine that the regular expression compiler creates: it should first open a bracket, then go into a loop (PLUS) finding characters that are ANYOF character zero through to 9 and ; through to character 255 - that is, everything apart from a :. Then we close the bracket, look for a specific character, and so on. The number in brackets after each instruction is the line to jump to on completion; when the PLUS loop exits, it should go on to line 13, CLOSE1, and so on.

Next, when we try to run this match against some text:


    $a = "Name: Mark-Jason Dominus";

It will first tell us something about the optimizations it performs:


    Guessing start of match, REx `([^:]+):\s*(.*)' against `Name: ...'
    Found floating substr `:' at offset 4...
    Does not contradict STCLASS...
    Guessed: match at offset 0

What this means is that it has found the constant element : in the regular expression, tries to locate that in the string, and then works backward to find out where it should start the match. Since the : is at position four in our string, it will go on to deduce that the match should start at the beginning and...


    Matching REx `([^:]+):\s*(.*)' against `Name: Mark-Jason Dominus'
    Setting an EVAL scope, savestack=3
    0 <> <Name: Mark-J>    |  1:  OPEN1
    0 <> <Name: Mark-J>    |  3:  PLUS
    ANYOF[\0-9;-\377] can match 4 times out of 32767...

The [^:] can match four times, since it knows there are four things that are not colons there.

The re module is absolutely essential for heavy-duty study of how the regular expression engine works, and why it doesn't do what you think it should.
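The same output can be requested from inside a script with the pragma form, use re 'debug', instead of the -Mre=debug switch. The sketch below confines it to a block; how strictly that scoping is honoured depends on your version of Perl, so treat it as an assumption to check against perldoc re.

```perl
use strict;
use warnings;

my $line = "Name: Mark-Jason Dominus";
my ($field, $value);

{
    # Ask for regexp debugging output (sent to STDERR) in this block only.
    use re 'debug';
    ($field, $value) = $line =~ /([^:]+):\s*(.*)/;
}

print "field=$field value=$value\n";
```

The debugging trace goes to standard error, so the script's normal output can still be redirected separately.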

YAPE::Regex::Explain

The description given by re is a little low-level for some people; well, most people. YAPE::Regex::Explain aims to put the explanation at a much higher level; for instance,


     % perl -MYAPE::Regex::Explain -e 'print 
       YAPE::Regex::Explain->new(qr/(?<=\W)|(?=\W)/)->explain'

will produce quite a verbose explanation of the regular expression like so:


    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (?<=                     look behind to see if there is:
    ----------------------------------------------------------------------
        \W                       non-word characters (all but a-z, A-Z,
                                 0-9, _)
    ----------------------------------------------------------------------
    ...

GraphViz::Regex

I find that one of the best ways to debug and understand a complex procedure is to draw a picture. GraphViz::Regex uses the graphviz visualization library to draw a state machine diagram for a given regular expression:


    use GraphViz::Regex;

    my $regex = '(([abcd0-9])|(foo))';

    my $graph = GraphViz::Regex->new($regex);
    print $graph->as_png;

Regexp::Common

So much for explaining complicated regular expressions; what about generating them? The Regexp::Common module aims to be a repository for all kinds of commonly needed regular expressions, such as URIs, balanced texts, domain names and IP addresses. The interface is a little freaky, but it can hugely help to clarify complex regexps:


    use Regexp::Common;

    my $ts = qr/\d+:\d+:\d+\.\d+/;
    $tcpdump =~ /$ts ($RE{net}{IPv4}) > ($RE{net}{IPv4}) : (tcp|udp) (\d+)/;

Text::Balanced

Finally, one particularly common family of things to match is quoted, parenthesised or tagged text. Damian's Text::Balanced module helps produce both regular expressions and subroutines to match and extract balanced text sequences. For instance, we can create a regular expression for matching double-quoted strings like so:


    use Text::Balanced qw(gen_delimited_pat);
    $pat = gen_delimited_pat(q{"});
    # (?:\"(?:[^\\\"]*(?:\\.[^\\\"]*)*)\")

This pattern will match quoted text, but will also be aware of escape sequences like \" and \\, and hence not break off in the middle of


    "\"So\", he said, \"How about lunch?\""

Text::Balanced also contains routines for extracting tagged text, finding balanced pairs of parentheses, and much more.
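As a small illustration of both styles, the sketch below uses the generated pattern to pull a quoted string out of a longer line, and extract_bracketed to peel a balanced pair of parentheses off the front of a string. The sample strings are invented.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat extract_bracketed);

# Build the pattern for double-quoted strings and use it in a match.
my $pat  = gen_delimited_pat(q{"});
my $line = q{He wrote "\"So\", he said, \"How about lunch?\"" and left};
my ($quoted) = $line =~ /($pat)/;
print "$quoted\n";

# In list context, extract_bracketed returns the balanced chunk
# followed by the remainder of the string.
my ($chunk, $rest) = extract_bracketed('(a (nested) list) and more', '()');
print "$chunk\n";
print "left over:$rest\n";
```

The pattern is handy when the quoted text can appear anywhere in a string, while the extract_ routines suit parsers that consume their input piece by piece from the front.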

Summary

We've looked at some slightly more complex features of regular expressions, and shown how we can use these to slice and dice text with Perl. As these regexes get more complicated, the need for tools to help us debug them increases; and so we've also looked at re, YAPE::Regex::Explain and GraphViz::Regex.

Finally, the Regexp::Common and Text::Balanced modules help us create complex regular expressions of our own.
