March 2004 Archives

This week on Perl 6, week ending 2004-03-28

... and we're back! Another interesting week in Perl 6. Your Summarizer even wrote some [parrot] code and it's been simply ages since he did that. In accordance with ancient custom, we'll start the summary with perl6-internals.

Building with miniparrot

Back in the early days, Dan proposed, and it was generally agreed, that the Parrot build process wouldn't be Perl-dependent; instead there would be a few OS-specific 'bootstrap' scripts, enough to get miniparrot up and running. Miniparrot would have just enough smarts to be able to complete the configuration and build the final, full parrot.

After last week's discussion about reinventing metaconfig, I wondered if the miniparrot plan was still in place. It seems I'd missed the discussion of stat that ended up talking about how miniparrot would be able to do its job. I find myself wondering what else is needed to get miniparrot to the point where it can start doing configuration work.

http://groups.google.com

Continuations continued (and fun with stacks)

Warning: The following discussion of the Continuation discussions is irrevocably biased; I find it very hard to be objective about discussions I participate in, and I was rather loud-mouthed in this one.

The previous discussions of the uses and semantics of continuations carried over into this week. Piers Cawley argued that the current stack architecture seemed to be optimized for the wrong thing, with the special-case RetContinuations being a symptom. He argued that the current architecture (where a single stack frame can accommodate multiple pushes, with copy-on-write semantics being used to handle full continuations) should be replaced with a 'naïve' architecture using linked lists of immutable, simple stack frames, one frame per push. Switching to this approach, he argued, would do away with a great deal of code complexity, and the cost of the higher object-creation overhead could be offset by using free lists and preallocation to reuse stack frames. Oh yes, and there'd be no difference between a RetContinuation and a full Continuation with this scheme.
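
To illustrate the argument (a toy sketch of mine in plain Perl, nothing like Parrot's actual C internals):

    sub stack_push { my ($top, $value) = @_; return { value => $value, next => $top } }
    sub stack_pop  { my ($top) = @_; return ( $top->{value}, $top->{next} ) }

    my $top;
    $top = stack_push($top, $_) for 1 .. 3;   # frames: 3 -> 2 -> 1
    my $continuation = $top;                  # capture is O(1): just keep a pointer
    (my $popped, $top) = stack_pop($top);     # popping doesn't disturb $continuation

Because every frame is immutable and shared by reference, capturing a full Continuation is nothing more than saving a pointer, which is exactly why RetContinuations would stop being a special case.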

Leo Tötsch wasn't convinced. Dan was, though, and made the decision to switch to single-item-per-frame, immutable, non-COW stacks. Leo implemented it. His first cut was rather slow; later refinements added free lists and other handy stuff to start pulling the performance back up. I'm sure there's more refinement to come though.

http://groups.google.com

http://groups.google.com

http://groups.google.com

http://groups.google.com

http://groups.google.com

Variadic functions

Ilya Martynov had some questions about how to handle variadic functions. Leo clarified some things and pointed Ilya at the foldup op. Jens Rieks suggested aliasing the registers I[1-4] to argc[ISPN], which Leo liked. I'm not sure he's implemented it yet though.

http://groups.google.com

GCC compiling to Parrot

In previous weeks Gerard Butler had posted wondering about getting GCC to target Parrot. The initial response was rather negative, pointing out that GCC and Parrot saw memory very differently, to the extent that there would probably be a need to have special PMCs for GCC-managed memory, which would make communication between GCC-implemented languages and Parrot-implemented ones rather tricky.

Undeterred, Gerard mapped out a way forward and asked for opinions. Dan thought the scheme looked reasonable, but hedged that with the caveat that he knows nothing about GCC's internals.

http://groups.google.com

Safe execution core and ops classification

Leo checked in some patches from Jens Rieks to allow classification of ops. He thought that this meant we were a good way along the road to providing a 'Safe' run-core option, though there was still a lot to do. He outlined a road map and asked for comments (and implementations). Comments were forthcoming, and Dan eventually bundled his comments up into a single post with some design in it. For some reason this induced some industrial hand waving about Proof Carrying Code from Steve Fink (he accused himself of hand waving, not me).

Jarkko Hietaniemi (Emacs' dynamic completion suggested 'Jarkko Himself' for that one. Well, it made me smile) offered the pathological

    eval 'while(push@a,0){}'

as an example of the kind of bad things that can happen if you allow eval EXPR in your 'safe' code, even with strict rules on what it's allowed to compile. (Dan pointed out that quotas would help in this particular case...)

If there's one lesson to take from the discussion it's this: Code Safety is Hard. Whether it's AI Hard or not is left as an exercise for the interested reader.

http://groups.google.com

http://groups.google.com

http://groups.google.com -- Dan's big(gish) post

UNO (Universal Network Objects) interface for Parrot?

Tim Bunce pointed everyone at OpenOffice's Universal Network Objects and wondered if anyone had had a look to see what is needed to plug Parrot into them. And was promptly Warnocked.

http://groups.google.com

http://udk.openoffice.org/ -- More on UNO

Load paths

In Perl, it's possible to write require Some::Module, and Perl will go off and hunt for the appropriate file in the various directories in its @INC. You can do something similar in most languages.
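
For reference, the Perl 5 mechanism being described is just this (the directory name here is made up):

    use lib '/home/me/perl-lib';   # prepend a directory to @INC
    require Some::Module;          # hunts along @INC for Some/Module.pm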

Right now, you can't do it in Parrot though; Parrot's load_bytecode and other such ops take filesystem paths so, if things aren't set up exactly as the programmer expects, Bad Things can happen.

As Dan (and others on IRC and elsewhere, I'm sure) points out, this is suboptimal. He posted an overview of the issue and a few possible ways forward and asked for comments. There were several, mostly along the lines of 'core support for full over-the-net URIs for bytecode loading would be unutterably Bad'.

http://groups.google.com

Tcl, looking for a few good people

Will Coleda's Tcl implementation has apparently reached the point where he'd appreciate assistance. He said as much on the list. If you're interested in helping to get a full Tcl implementation that targets Parrot up and running, drop him a line.

http://groups.google.com

Ulterior Reference Counting for DoD

Andre Pang pointed the list at a paper on yet another Garbage Collection strategy called 'Ulterior Reference Counting' that looks potentially interesting. However, it turns out that it doesn't quite work that well with Parrot since Parrot guarantees that objects don't move around.

http://groups.google.com

Multi Method Dispatch vtable functions in bytecode

Dan announced that he'd started adding opcode support for multimethod dispatch. Leo had a bunch of questions with no answers so far.

http://groups.google.com

So that's where Jürgen's been

After a long absence, Jürgen Bömmels appeared on the list and explained that he'd got a new job, moved to a new town and had had no connection to the Internet. He's currently working through a huge backlog of mail and trying to get familiar with the current state of Parrot. It sounds like it might be a while before he starts contributing patches to ParrotIO again. Still, welcome back Jürgen.

http://groups.google.com

ParrotUnit

Piers Cawley posted his initial version of ParrotUnit, a port of the xUnit OO testing framework. Warnock applies.

http://groups.google.com

Behaviour of PMCs on assignment

Dan noted that, right now, binary vtable functions take three arguments: the destination, the left-hand side, and the right-hand side. This allows them either to take the type of the destination into account, or simply to replace it with a new value. The advantage of this approach is that vtable functions have the potential to be more efficient when, say, the left-hand side is the same as the destination. The disadvantage is that you have to make a PMC to receive the results of the operation before you can actually do the operation, which can be a pain (and suboptimal). Dan offered three different options and asked for opinions.

TOGoS argued that the three-argument form was actually the Wrong Thing in general and that vtable methods should simply create a new PMC and replace the destination with it. He argued that this behaviour is what most HLLs expect, and having it would make the compiler's life a great deal easier.

http://groups.google.com

Meanwhile, in perl6-language

They talked about Unicode a good deal, but (per my announcement a few weeks back) I won't be covering those bits.

Outer product considered useful

Luke Palmer proposed an outer(*@ranges) function to allow for what he called 'dynamically nested loops'. He even provided an implementation for it which used a coroutine. As Simon Cozens pointed out, the fact that something as powerful as Luke's proposal can be implemented in (initially buggy) pure Perl 6 with no need for any additions to the language itself is very nice, but really more of a side issue for the time being.

http://groups.google.com

Announcements, Acknowledgements, Apologies

No announcements (apart from "Look! ParrotUnit! It's jolly good! You should use it and send me patches!") this week. And if you think I'm apologizing...

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send me feedback at p6summarizer@bofh.org.uk, or drop by my website; maybe I'll really add some content to it this week.

http://donate.perl-foundation.org/ -- The Perl Foundation

http://dev.perl.org/perl6/ -- Perl 6 Development site

http://www.bofh.org.uk/ -- My website, "Just a Summary"

Making Dictionaries with Perl

When you woke up this morning, the last thing you are likely to have thought is "If only I had a dictionary!" But there are thousands of languages on Earth that many people want to learn but can't, because there are few or no materials to start with: no Pocket Mohawk-English Dictionary, no Cherokee Poetry Reader, no Everyday Otomi: Second Year. Only in the past few years have people realized that these languages are not just curiosities, but basic, indispensable, untranslatable parts of local cultures -- and they're disappearing in droves.

As I was learning Perl, the long arm of coincidence put me in contact with a good number of linguists who work on producing materials to help the study of these endangered languages. These folks produce textbooks and other "language materials," which is mostly straightforward, since the 1980s gave us "desktop publishing." But there was one real trouble spot: dictionaries. Writing a dictionary of any real size using just a word processor is maddening, like writing a novel on Post-Its. So they started using database programs, but had no way to turn a database into anything you could print and call a dictionary. They had no way to take this:

  Headword: dagiisláng
  Citation: HSD
  Part of speech: verb
  English: wave a piece of cloth
  Example:  Dáayaangwaay hal dagiislánggan. | He was waving a flag.

And turn it into a conventionally typeset dictionary entry: the headword in bold, the part of speech in italics, then the English definition and example.

"Well," I said and have been saying ever since, "This is no big deal, for you see, I am a programmer! Just export your database as CSV or something, email it to me, and I'll write a program that reads that and writes out a word-processor file with everything formatted all nice just like you want."

"A mere person, you, can program something that writes a word-processing document? But how can this be?! Surely this would require a year's work, a million lines of C++, and a bajillion dollars!"

"Yes. But instead I'll just use Perl, where I can do it in a few dozen lines of code, taking me just a few minutes." Because, you see, a conventionally formatted dictionary is just a glorified version of what people with business degrees would call a "database report", and people who work in cubicles generate such things all the time. And now I'll show you how it's done.

Reading the Input

Of course you'll need Perl, and that's not hard to come by. Then, at most, you just need a module for the input format and a module for the output format. And you don't even need that if the input and/or output formats are simple enough. In this case, the input format I'm often given is simple enough. It's called Shoebox Standard Format, and it looks like this:

  \hw dagiisláng
  \cit hsd
  \pos verb
  \engl wave a piece of cloth
  \ex Dáayaangwaay hal dagiislánggan. | He was waving a flag.

  \hw anáa
  \cit hsd; led-285
  \pos adverb
  \engl inside a house; at home
  
  \hw súut hlgitl'áa
  \cit hsd; led-149; led-411
  \engl speak harshly to someone; insult
  \ex 'Láa hal súut hlgitl'gán. | She said harsh words to her.
  
  \hw tlak'aláang
  \cit led-398
  \pos noun
  \engl the shelter of a tree

Namely, \fieldname fieldvalue, each record ("entry") starting with a \hw field, and the records and fields being in no particular order. (And the data, incidentally, is vocabulary from Haida, an endangered language spoken in the Southeast Alaskan islands, where I live.)

Now, one could parse this with a regexp and a bit of while(<IN>) {...}, but there's already a module for this that will read in a whole file as a big list-of-lists data structure. After just a glance at the module's documentation, we can write this simple program to read in the lexicon as an object, and dump it to make sure that it's getting well filled in:

  use Text::Shoebox::Lexicon;
  my $lex = Text::Shoebox::Lexicon->read_file( "haida.sf" );
  $lex->dump;

And that prints this:

 Lexicon Text::Shoebox::Lexicon=HASH(0x15550f0) contains 4 entries:
 
 Entry Text::Shoebox::Entry=ARRAY(0x1559104) contains:
   hw = "dagiisláng"
   cit = "hsd"
   pos = "verb"
   engl = "wave a piece of cloth"
   ex = "Dáayaangwaay hal dagiislánggan. | He was waving a flag."
 
 Entry Text::Shoebox::Entry=ARRAY(0x1559194) contains:
   hw = "anáa"
   cit = "hsd; led-285"
   pos = "adverb"
   engl = "inside a house; at home"
 
 Entry Text::Shoebox::Entry=ARRAY(0x155920c) contains:
   hw = "súut hlgitl'áa"
   cit = "hsd; led-149; led-411"
   engl = "speak harshly to someone; insult"
   ex = "'Láa hal súut hlgitl'gán. | She said harsh words to her."
 
 Entry Text::Shoebox::Entry=ARRAY(0x1559284) contains:
   hw = "tlak'aláang"
   cit = "led-398"
   pos = "noun"
   engl = "the shelter of a tree"

A further glance shows that $lex->entries returns a list of the entry objects, and that $entry->as_list returns the entry's contents as a list (key1, value1, key2, value2) -- exactly the kind of list that is ripe for dumping into a Perl hash. So:

  foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
  }

And that works perfectly, assuming we never have an entry like this:

  \hw súut hlgitl'áa
  \cit hsd; led-149; led-411
  \engl speak harshly to someone
  \engl insult
  \ex 'Láa hal súut hlgitl'gán. | She said harsh words to her.

In that case, because there are two "engl" fields, $entry->as_list would return this:

 (
  'hw'   => "súut hlgitl'áa",
  'cit'  => "hsd; led-149; led-411",
  'engl' => "speak harshly to someone",
  'engl' => "insult",
  'ex'   => "'Láa hal súut hlgitl'gán. | She said harsh words to her.",
 )

And once we dump that into the hash %e, we would end up with just this:

 (
  'hw'   => "súut hlgitl'áa",
  'cit'  => "hsd; led-149; led-411",
  'engl' => "insult",
  'ex'   => "'Láa hal súut hlgitl'gán. | She said harsh words to her.",
 )

...since, of course, hash keys have to be unique in Perl hashes. If you needed to deal with a lexicon that had such entries, there are various methods in the Text::Shoebox::Entry class to help. But for a simple lexicon where each field comes up just once per entry, you can just use a hash -- and you can even check that that's really the case by calling $entry->assert_keys_unique, which normally does nothing. If it sees duplicate field names in a given entry, though, it will abort the program and print a helpful error message about the offending entry.
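
So a belt-and-braces version of our loop might read (a sketch using the method just described):

  foreach my $entry ($lex->entries) {
    $entry->assert_keys_unique;   # a no-op unless a field name repeats,
    my %e = $entry->as_list;      # in which case it aborts with a message
  }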

But for our data, with its unique keys, a hash works just fine:

  foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
  }

We would then do things with the contents of %e in that loop: either generating output right there, or putting it into Perl variables whose contents will later be output by other subroutines of ours.

Making the Output

Since we've got the basic input code squared away, now we get to think about how to output data. Once we know that, we'll know better how to write the code to make the formats meet in the middle.

As output formats go, HTML is good for many purposes; practically all programmers can code in it pretty well, and just about everyone can hardcopy HTML with their browser or word processor. However, even after all these years, there are still some basic problems with HTML: as a typesetting language, there's still no reliable support for control of page-layout options like headers and page-numbering, page breaks, newspaper columns, and the like. More importantly, WYSIWYG HTML editors all seem to be harmless at best or disastrous at worst. In my experience, that has ruled out HTML as an output format for the many lexicons where the output file still needs various kinds of manual touching-up in a word processor.

Because of these problems with HTML, I have generally chosen RTF as my output format. RTF is technically a Microsoft format, but somehow, somehow, it avoided most of the lunacy that that usually entails. Moreover, just about every word processor supports it. And Microsoft Word both prints and edits RTF pretty much flawlessly. (After all, it had to be good at something.) And finally, there's good Perl support for generating RTF, via the CPAN modules RTF::Writer and RTF::Document, so you can almost completely insulate yourself from dealing directly with the language. I'll use RTF::Writer, simply because I'm more familiar with it. (This may be due to the fact that it was written by the author of the delightful O'Reilly book RTF Pocket Guide, a handsome and charming man whose modesty forbids him from revealing that he is me.)

A bit of skimming through the RTF::Writer documentation shows that, to send output to an RTF file, you create a sort of file handle for it, and then send data to it via its print or paragraph methods, like so:

  use RTF::Writer;
  my $rtf = RTF::Writer->new_to_file( "sample.rtf" );
  $rtf->prolog();  # sets up sane defaults

  $rtf->paragraph( "Hello world!" );
  $rtf->close;

That writes an RTF document consisting of just a sane header and then basically the text, "Hello world!":

  {\rtf1\ansi\deff0{\fonttbl
  {\f0 \froman Times New Roman;}}
  {\colortbl;\red255\green0\blue0;\red0\green0\blue255;}
  {\pard
  Hello world!
  \par}
  }

The RTF::Writer documentation comes with a list of some basic escape codes that are nearly all we need to format our lexicon. The notables are:

  \b     for bold
  \i     for italic
  \f2    switch to font #2 (i.e., entry 2, counting from 0, in the font table we declare)
  \fs40  switch text size to 20-point (40 = how many half-points)

RTF::Writer's interface is designed so that normal text passed to it will get escaped before being written to the RTF output file, and clearly you don't want that to happen to these codes -- you want the \b to be written as is, not escaped so that it'd show a literal backslash and a literal b in the document. To signal this to the RTF::Writer interface, you pass references to these strings, like so:

  $rtf->paragraph( \'\i', "Hello world!" );

You can also limit the effect of a code by wrapping it in an arrayref, i.e., with [code, text], like so:

  $rtf->paragraph(
    "And ",
    [ \'\i', "Hello world!" ],
    " is what I say."
  );

That'll produce a document saying: And Hello world! is what I say. (with "Hello world!" in italics, thanks to the \i).

That's just about all the RTF we'd need to know to produce some simple lexicon output. We can exercise this with some literal text:

  use RTF::Writer;
  my $rtf = RTF::Writer->new_to_file( "lex.rtf" );
  $rtf->prolog();  # sets up sane defaults

  $rtf->paragraph(
    [ \'\b',    "tlak'aláang: " ],
    [ \'\b\i',  "n." ],
    " the shelter of a tree"
  );
  $rtf->paragraph(
    [ \'\b',    "anáa: " ],
    [ \'\b\i',  "adv." ],
    " inside a house; at home"
  );

  $rtf->close;

And that gets us something very close to the kind of formatting you'd find in a typical fancy dictionary.

Of course, we'd like to tweak spacing and fonts a bit, but that can be left for later as just minor additions to the code. Knowing just as much as we do now, we can see the output code taking shape. It would be something like:

  foreach my $entry (...) {
    ...
    $rtf->paragraph(
      [ \'\b',    $headword, ": " ],
      [ \'\b\i',  $part_of_speech ],
      " ", $english,
      ...and something to drop in the example sentences, if any...
    );
  }

In fact, we can already cobble this together with our earlier input-reading code, to make a clunky but working prototype:

  use strict;
  use Text::Shoebox::Lexicon;
  my $lex = Text::Shoebox::Lexicon->read_file( "haida.sf" );

  use RTF::Writer;
  my $rtf = RTF::Writer->new_to_file( "lex.rtf" );
  $rtf->prolog();  # sets up sane defaults

  foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
    $rtf->paragraph(
      [ \'\b',    $e{'hw'}  || "?hw?", ": " ],
      [ \'\b\i',  $e{'pos'} || "?pos?" ],
      " ", $e{'engl'} || "?english?"
    );
  }
  $rtf->close;

And that produces a first, unpolished rendering of the dictionary.

Now, sure, the entries aren't in alphabetical order, we see "noun" instead of "n.", and the example sentences aren't in there yet. But consider that with not even twenty lines of Perl, we've got a working dictionary renderer. It's downhill from here.

Sorting and Duplicate Headwords

So how do we take entries in whatever order, and put them into alphabetical order? A first hack is something like this:

  my %headword2entry;
  foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
    $headword2entry{ $e{'hw'} } = \%e;
  }
  
  foreach my $headword (sort keys %headword2entry) {
    my %e = %{ $headword2entry{$headword} };
    ...and print it here...
  }

And that indeed works fine. But suppose one of the linguists comes by and adds these three entries into our little database:

  \hw gíi
  \pos auxiliary verb
  \engl already; always; often
  
  \hw gu
  \pos postposition
  \engl there
  
  \hw gíi
  \pos verb
  \engl swim away [of fish]

When we run our program, there's trouble with the output:

First off, the second "gíi" (the verb for fish swimming away) was stored as $headword2entry{'gíi'} and that overwrote the first "gíi" entry (the one that means already, always, or often). And secondly, "gíi" got sorted after "gu"!

The first problem can be solved by changing from the current data structure, which is like this:

  $headword2entry{ 'gíi' } = ...one_entry...;

over to a new data structure, which is like this:

  $headword2entries{ 'gíi' } =
    [ ...one_entry... , ...another_entry..., ...maybe_even_another... ];

...even though in most cases that list will hold just one entry.

That's simple to graft into our program, even if the syntax for dereferencing gets a bit thick:

  my %headword2entries;
  foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
    push @{ $headword2entries{ $e{'hw'} } },  \%e;
  }
  
  foreach my $headword (sort keys %headword2entries) {
    foreach my $entry ( @{ $headword2entries{$headword} } ) {
      ...code to print the entry...
    }
  }

And that works just right: both "gíi" entries show up.

Now how to get sort keys %headword2entries to sort "gíi" before "gu"? The default sort() that Perl uses just sorts ASCIIbetically, where "í" comes not just after "u", but actually after all the unaccented letters. We can get Perl to use a smarter sort() if we add a "use locale;" line and see about changing our current locale to French or German or something that'd know that "í" sorts before "u". This approach works in some cases, but suppose that you're dealing with a language that uses "dh" as a combined letter that comes after "d". You'd be out of luck, since there aren't any existing locales (as far as I know) that have "dh" as a letter after "d", and since under most operating systems you can't define your own locales.
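
(For completeness, the locale approach looks something like this; the locale name is a guess, and it has to actually be installed on your system:)

  use locale;
  use POSIX qw( setlocale LC_COLLATE );
  setlocale( LC_COLLATE, 'fr_FR.ISO8859-1' )  # assumed locale name; setlocale
    or die "Locale not installed";            # returns false if it's missing
  my @headwords = ( "gíi", "gu" );
  my @sorted = sort @headwords;               # with luck, "gíi" now sorts first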

But CPAN, once again, comes to the rescue. The CPAN module Sort::ArbBiLex lets you state a sort order and get back a function that sorts according to that order. We can just pull this example from the docs:

  use Sort::ArbBiLex (
    'custom_sort' =>    # that's the function name to define
    "
     a A à À á Á â Â ã Ã ä Ä å Å æ Æ
     b B
     c C ç Ç
     d D ð Ð
     e E è È é É ê Ê ë Ë
     f F
     g G
     h H
     i I ì Ì í Í î Î ï Ï
     j J
     k K
     l L
     m M
     n N ñ Ñ
     o O ò Ò ó Ó ô Ô õ Õ ö Ö ø Ø
     p P
     q Q
     r R
     s S ß
     t T þ Þ
     u U ù Ù ú Ú û Û ü Ü
     v V
     w W
     x X
     y Y ý Ý ÿ
     z Z
    "
  );

And if we need that "dh" to be a new letter between "d" and "e", it's a simple matter of adding a line to the above code:

     ...
     d D ð Ð
     dh Dh
     e E è È é É ê Ê ë Ë
     ...

And if the above sort order isn't right, we can fix this by just moving things around. For example, a few Haida words use an x-circumflex character for an odd pharyngeal sound, and since that character isn't in Latin-1, the folks working on Haida use a special font that replaces the Latin-1 þ character with the x-circumflex. To have that sort as a letter after x, we'd rearrange the end of the above sort-order to read like this:

     ...
     t T
     u U ù Ù ú Ú û Û ü Ü
     v V
     w W
     x X
     þ Þ
     y Y ý Ý ÿ
     z Z

Once we get the big use Sort::ArbBiLex (...); statement set up just the way we like it, we can just replace the "sort" in our "sort keys" with "custom_sort", like so:

  foreach my $headword (custom_sort keys %headword2entries) {
    foreach my $entry ( @{ $headword2entries{$headword} } ) {
      ...code to print the entry...
    }
  }

With that in place, our entries sort just right.

Reverse Indexing

The last thing anyone wants to do when they've finished working on a dictionary is to turn right around and write another one -- but that's exactly the problem that comes up in lexicography: you've been compiling a Haida-to-English dictionary, and then someone says "Gee, it'd be really handy to have an English-to-Haida one, too!"

In the bad old days before people used databases for their lexicons, this process of "reversing the dictionary" was manual. Now that we have databases, we just need a way to see the entry that expresses "gu" = "there" in our main lexicon, and then make an entry in a reverse lexicon that expresses "there" = "gu".

The reverse lexicon could be just %english2native with entries like:

  $english2native{'there'} = "gu";

But there could be several words that mean "there" -- like "gyaasdáan" -- so we'd have to use an array here, just as we did in %headword2entries, like this:

  $english2native{'there'} = [ "gu", "gyaasdáan" ];

We can implement this by changing our initial lexicon-scanning routine to add a line to push to @{$english2native{each_english_bit}}, like so:

  foreach my $entry ($lex->entries) {
    my %e = $entry->as_list;
    push @{ $headword2entries{ $e{'hw'} } },  \%e;
    foreach my $engl ( reversables( $e{'engl'} ) ) {
      push @{ $english2native{ $engl } }, $e{'hw'}
    }
  }

And later on, we can spit out the contents of %english2native after the main dictionary:

  $rtf->paragraph( "\n\nEnglish to Haida Index\n" );

  foreach my $engl ( custom_sort keys %english2native) {
    my $n = join "; ", custom_sort @{ $english2native{ $engl } };
    $rtf->paragraph( "$engl: $n" );
  }

All we need now is a routine, reversables(), that can take the string "already; always; often" (from the gíi entry) and turn it into the list ("already", "always", "often"), and take the string "the shelter of a tree" and turn it into the one-item list ("shelter of a tree"). (If we left the "the" on there, we'd have a huge bunch of entries under "the"!)

This function is a decent first hack:

  sub reversables {
    my $in = shift || return();
    my @english;
    foreach my $term ( split /\s*;\s*/, $in ) {
      $term =~ s/^(a|an|the|to)\s+//i;
       # Take the "to" off of "to swim away [of fish]",
       # and the "the" off of "the shelter of a tree"
      push @english, $term;
    }
    return @english;
  }

However, consider the entry anáa: "inside a house; at home" -- our reversables() function will return this as the list ("inside a house", "at home"). That seems passable, but if I were looking for a word like this in the English end of the dictionary, I'd probably want it to be under "home" and "house" as well.

Now, there are four alternatives here for how to have finer control over the reversing:

* Just don't bother, and instead just do this all manually in the editing of the final draft.
This is a bad approach because, in my experience, the people working on the lexicon get so used to the just-passable reversing algorithm that they end up thinking it's no big deal, and so in the end its defects never get fixed.
* Don't do automatic reversing, but have a mandatory field in each entry that says what English headword(s) should point to this native entry.
For example, if we call the field "ehw" (for "English headword"), then the entry for "at home; inside a house" could say something like "\ehw home, at; house, inside a". However, having this be mandatory is a real drag for simple entries like "gu," where you'd have to do:
        \hw gu
        \engl there
        \ehw there
* Make an "ehw" field optional, and when it's absent, use a smart reversing algorithm.
So when we have an entry like "\hw gu \engl there", of course the reversing algorithm would know to infer a "\ehw there". And it would somehow be smart enough to know to index "wave a piece of cloth" under "wave" and "cloth" but not under "a," "piece," or "of." The problem with very smart fallback algorithms like this is that people have to understand them completely, so that they can know whether the result is good enough or whether it should be overridden with an explicit "\ehw" field. But since nobody can remember all the hacks that get built into the smart algorithm, they either err on the side of doubt by always putting in a "\ehw" field (thus making the whole algorithm pointless), or they never put in a "\ehw" field, or, worse, some unpredictable and headachy mix of the two. So, ironically, a smart fallback algorithm is often a bad idea. That leads us to the final alternative:
* Make an "ehw" field optional, and when it's absent, use a dumb reversing algorithm.
By "dumb," I mean a maximum of two rules -- if it's any more complex than that, people will forget how it works and won't know when they should key in an explicit "\ehw" field.

So while we could add more and more things to our reversables() algorithm, it seems wisest to refrain from doing this, to be content with our one s/^(a|an|the|to)\s+//i rule, and instead just add support for an "\ehw" field. We can do that simply by changing the call to reversables(), from this:

    foreach my $engl ( reversables( $e{'engl'} ) ) {
      push @{ $english2native{ $engl } }, $e{'hw'}
    }

to this:

    my @reversed = $e{'ehw'} ? split( m/\s*;\s*/, $e{'ehw'} )
                             : reversables( $e{'engl'} );
    foreach my $engl ( @reversed ) {
      push @{ $english2native{ $engl } }, $e{'hw'}
    }

With that in place (and with a "\ehw home, at; house, inside a" line in the "anáa" entry just to get the ball rolling), our program runs and spits out an English index after the Haida dictionary.

Conditional Output and Example Sentences

There are two optional parts of the entries that we haven't used yet: the citation fields, like "\cit hsd; led-149; led-411", and the example-sentence field, like "\ex 'Láa hal súut hlgitl'gán. | She said harsh words to her.". The citation fields are typically only of importance to the editors, who might want to spot-check words against the places in the text where they were found. (And typically the editors are the only ones who would be fluent with the abbreviations, e.g., would know that "led-149" is short for "page 149 of the Leer-Edenso Dictionary of Haida".)

Ideally our program should produce output for the editors (with the citations), and output for normal users (without the citations). We can do this by having a $For_Editors variable that's set early on in the program:

  my $For_Editors = 0; # set to 1 to add citations

And then later on we have code that uses that variable:

  foreach my $headword ( custom_sort keys %headword2entries) {
    foreach my $entry ( @{ $headword2entries{$headword} } ) {
      print_entry( $entry );
    }
  }

  sub print_entry {
    my %e = %{$_[0]};
    $rtf->paragraph(
      [ \'\b',    $e{'hw'}  || "?hw?", ": " ],
      [ \'\b\i',  $e{'pos'} || "?pos?" ],
      " ", $e{'engl'} || "?english?", ".", 
      $For_Editors && $e{'cit'} ? " [$e{'cit'}]" : (),
    );
  }

Our new and punctuation-rich $For_Editors && $e{'cit'} line is just a concise way of saying "if this is for the editors and if there's a citation in this entry, then print a space and a bracket before it, and a bracket after it -- otherwise don't add anything".
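
The trick isn't RTF-specific, by the way; in any list, a false condition can contribute nothing at all (here $note is a hypothetical variable):

  my @parts = ( "always here", $note ? " [$note]" : () );
  # @parts has two elements when $note is true, and just one when it isn't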

Our example sentences ("\ex 'Láa hal súut hlgitl'gán. | She said harsh words to her".) should probably end up in any normal dictionary, but of course we wouldn't want to try adding the contents of $e{'ex'} with formatting codes around it if it weren't actually present in this entry. We can use the same sort of $value ? "...$value..." : () idiom we used before -- except that this time we need to first take out the "|" that separates the Haida part from the English translation. That's simple, though:

    my($ex, $ex_eng);
    ($ex, $ex_eng) = split m/\|/, $e{'ex'} if $e{'ex'};
    $rtf->paragraph(
      ...
      $ex_eng ? (" $ex = $ex_eng") : (),
    );

With that code in place, entries that have example sentences now show them.

Fancier Formatting

Now that basically everything else about our program is working, how about we polish it off with some formatting codes to make it look just right? We've already got some simple bold and italic codes, so the next thing is certainly to use different fonts. We could use, say, Bookman for the main headword and Times for the rest of the entry -- except in the example sentence, where we can use Bookman again for the Haida text and Arial for the English translation.

However, a glance at the RTF Pocket Guide shows no RTF code that means "change to the font 'Arial'" -- just a code that means "change to font number N". Declaring the fonts is just a matter of adding a 'fonts' => [ ...font names... ] parameter to that dull $rtf->prolog() method we called back when we created $rtf. As the RTF::Writer documentation notes, "You should be sure to declare all fonts that you switch to in your document (as with \'\f3', to change the current font to what's declared in entry 3 (counting from 0) in the font table)." So if we just change our prolog call to this...

  $rtf->prolog( 'fonts' => [ "Times New Roman", "Bookman", "Arial" ] );

...then we can use a \f0 to switch to Times New Roman (which, incidentally, is the default, since it's the first declared font), \f1 to switch to Bookman, and \f2 to switch to Arial.

And suppose we want everything to be in 10-point, except for the Arial part, which we want in specifically 9-point Arial so it won't steal attention from the rest of the text, as sans-serif fonts often do. That's just a matter of a \fs20 and \fs18 code -- "fs" for "font size", plus the desired font size, in half-points. (Odd, I know.)
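
If the half-point arithmetic ever gets annoying, a tiny helper of our own devising (entirely hypothetical) can hide it, since any scalar reference gets passed through as a raw code:

  sub pt { my $points = shift;  return \( '\fs' . $points * 2 ) }

  $rtf->paragraph( pt(10), "Ten-point text" );   # same as \'\fs20'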

With these extra codes in place, our print_entry routine now looks like this:

  sub print_entry {
    my %e = %{$_[0]};
    my($ex, $ex_eng);
    ($ex, $ex_eng) = split m/\|/, $e{'ex'} if $e{'ex'};
    $rtf->paragraph(  \'\fs20',  # Start out in ten-point
      [ \'\f1\b', $e{'hw'}  || "?hw?", ": " ],
      [ \'\b\i',  $e{'pos'} || "?pos?" ],
      " ", $e{'engl'} || "?english?", ".", 
      $For_Editors && $e{'cit'} ? " [$e{'cit'}]" : (),
      $ex_eng ? (" ", \'\f1', $ex, \'\f2\fs18', $ex_eng) : (),
    );
  }

It's dense, but then it does a lot of work! And the output now has all the fonts and sizes we asked for.

As to adding fancier formatting, this is usually best done by just flipping through the RTF Pocket Guide and looking for a mention of the effect you want. For example, in a lexicon we might be particularly interested in hanging indents (\fi-300), two-column pages (\cols2), and page numbering ({\header \pard\ql\plain p.\chpgn \par}).
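
For instance, a two-column, page-numbered document might start out like this (a sketch only; I'm assuming RTF::Writer's print method passes raw scalar-ref codes straight through, and you should check the codes against the RTF Pocket Guide before trusting them):

  $rtf->print( \'\cols2' );                                  # two newspaper columns
  $rtf->print( \'{\header \pard\ql\plain p.\chpgn \par}' );  # running page numbers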

Now suppose that you're trying to make the most of your xeroxing budget, trading off nice large readable point size against how many people get copies. One way to squeeze as much content as possible into as small a space as possible is to use abbreviations for the most repeated text in the dictionary -- the part-of-speech tags. So we can turn "noun" into just "n.", "verb" into "v.", and so on. Each time, we save only a little space, but it adds up quick. And doing this (or at least trying it out to see how it looks) is straightforward. We need only change one line in print_entry(), from this

      [ \'\b\i',  $e{'pos'} || "?pos?" ],

To this:

      [ \'\b\i',  $Abbrev{$e{'pos'}||''} || $e{'pos'} || "?pos?" ],

And earlier we'll have to define what should be in %Abbrev:

  my %Abbrev = (
   'auxiliary verb' => 'aux.',
   qw(noun n. verb v. adverb adv.),
  );

But that's all it takes to abbreviate the parts of speech in our output.

That continues to print "?pos?" when an entry is erroneously missing the part-of-speech field. And it doesn't abbreviate the term "postposition". (If we did so, it'd probably be "pp.", which people would probably think was "participle" or something.) But the most common terms, "noun" and "verb", got shrunk down, saving a few characters per entry, which could add up to a dozen pages in a large printout.

Other Formats

I've just been talking about producing conventionally formatted dictionaries, but the same database and the same kinds of Perl could be used to produce different output formats instead. Use a bit of fancy page layout and a double-sided printer (or copier), and the same data can be turned into ready-made flashcards. Or if you have a subject field in entries (like "plant", "color", "body part", "food"), it's easy to re-sort entries by topic and produce a "topical dictionary", which language teachers find very useful in planning classroom exercises.

World Enough and Time

As A. N. Whitehead's famous quote goes, "Civilization advances by extending the number of important operations which we can perform without thinking about them. Operations of thought are like cavalry charges in a battle - they are strictly limited in number, they require fresh horses, and must only be made at decisive moments." I've found this to be personally and critically true in dealing with endangered languages: it takes man-years of time to produce a dictionary of any useful size, both on the part of linguists and of members of the community. And with most of North America's native languages, the most fluent speakers are over 65, so there's no great surplus of man-years.

Whitehead was more right than he knew: saving time and effort doesn't just advance civilizations, it can help save them.

So when Perl helps us glue together a database program, a printer, and a word processor so that the typesetting phase of a dictionary takes not months, but minutes, this frees up the linguists and teachers and elders to spend scarce time and "decisive moments" working on preserving the language through study and teaching. We need every minute to work on revitalizing these languages that are the foundation of endangered cultures and civilizations -- with all their stories, poems, songs, sayings, proverbs, figures of speech, jokes, liturgy, and heaps of specialized jargon from botany and agriculture and healing and just plain ways of relating to people and the world, very little of which would survive mere translation to English.

We're in a hurry, and so we really appreciate Perl.

Finished Code for Sample Haida Dictionary

  use strict;
  use warnings;

  my $For_Editors = 0; # set to 1 to add citations

  use RTF::Writer;
  use Text::Shoebox::Lexicon;
  my $lex = Text::Shoebox::Lexicon->read_file( "haida.sf" );

  my $rtf = RTF::Writer->new_to_file( "lex.rtf" );
  $rtf->prolog( 'fonts' => [ "Times New Roman", "Bookman", "Arial" ] );

  use Sort::ArbBiLex (
    'custom_sort' =>
    "
     a A à À á Á â Â ã Ã ä Ä å Å æ Æ
     b B
     c C ç Ç
     d D ð Ð
     e E è È é É ê Ê ë Ë
     f F
     g G
     h H
     i I ì Ì í Í î Î ï Ï
     j J
     k K
     l L
     m M
     n N ñ Ñ
     o O ò Ò ó Ó ô Ô õ Õ ö Ö ø Ø
     p P
     q Q
     r R
     s S ß
     t T þ Þ
     u U ù Ù ú Ú û Û ü Ü
     v V
     w W
     x X
     y Y ý Ý ÿ
     z Z
    "
  );
  my %headword2entries;
  my %english2native;

  my %Abbrev = (
   'auxiliary verb' => 'aux.',
   qw(noun n. verb v. adverb adv.),
  );

  foreach my $entry ($lex->entries) {
    my(%e) = $entry->as_list;
    push @{ $headword2entries{ $e{'hw'} } },  \%e;
    my @reversed = $e{'ehw'} ? split( m/\s*;\s*/, $e{'ehw'} )
                             : reversables( $e{'engl'} );
    foreach my $engl ( @reversed ) {
      push @{ $english2native{ $engl } }, $e{'hw'}
    }
  }

  $rtf->paragraph( "Haida to English Dictionary\n\n" );

  foreach my $headword ( custom_sort keys %headword2entries) {
    foreach my $entry ( @{ $headword2entries{$headword} } ) {
      print_entry( $entry );
    }
  }

  $rtf->paragraph( "\n\nEnglish to Haida Index\n" );

  foreach my $engl ( custom_sort keys %english2native) {
    my $native = join "; ", custom_sort @{ $english2native{ $engl } };
    $rtf->paragraph( "$engl: $native" );
  }

  $rtf->close;
  exit;


  sub reversables {
    my $in = shift || return;
    my @english;
    foreach my $term ( grep $_, split /\s*;\s*/, $in ) {
      $term =~ s/^(a|an|the|to)\s+//i;
      push @english, $term;
    }
    return @english;
  }


  sub print_entry {
    my %e = %{$_[0]};
    my($ex, $ex_eng);
    ($ex, $ex_eng) = split m/\|/, $e{'ex'} if $e{'ex'};
    $rtf->paragraph(  \'\fs20',  # Start out in ten-point
      [ \'\f1\b', $e{'hw'}  || "?hw?", ": " ],
      [ \'\b\i',  $Abbrev{$e{'pos'}||''} || $e{'pos'} || "?pos?" ],
      " ", $e{'engl'} || "?english?", ".", 
      $For_Editors && $e{'cit'} ? " [$e{'cit'}]" : (),
      $ex_eng ? (" ", \'\f1', $ex, \'\f2\fs18', $ex_eng) : (),
    );
  }

This week on Perl 6, week ending 2004-03-21

Spring is sprung, the equinoctial gales seem to have blown themselves out, I'm a proud step-grandfather, and life is generally grand.

"So, what's been going on in perl6-internals?" I hear you ask. Let's find out shall we?

Parrot grabs SIGINT

It appears that embedded Parrot tries to do too much. In particular, it grabs signals that its embedder may not want it to deal with. Dan declared that at some point Parrot would have to treat signals as something the embedding environment controls (and the standalone Parrot interpreter becomes just another place where the Parrot core is embedded).

http://groups.google.com

Unprefixed global symbols

Unhygienic namespaces are bad, mm'kay?

Mitchell N Charity posted the results of doing

    $ nm ./blib/lib/libparrot.a | egrep ' [TDRC] |\.o:' | grep -v Parrot_

and the results are rather embarrassing. Parrot exports a bunch of symbols that have no Parrot specific prefix, and which have the potential to clash with symbols that the embedder is using for something else. Arthur Bergman agreed that doing something about this would be a jolly good idea and proposed prefixing all externally visible macros, types and functions with Parrot_.

There followed some discussion of the prefixes that are currently in use within Parrot (there are several); Arthur pointed out that many of them are still worryingly generic and proposed expanding most of the 'P's in them to a full 'Parrot'. Jeff Clites suggested that it would be a good idea to get the linker to only expose external entry points in the first place, though there are issues of cross-platform compatibility to deal with in order to implement that. Dan announced that, despite the potential pain, this would be the way forward.

http://groups.google.com

FAQs?

Tim Bunce wondered if anyone was tracking the mailing list and adding questions and (good) answers to the FAQ. Apparently, chromatic has taken this job upon himself, but he confessed that he hadn't actually been doing it recently. He's on it though.

http://groups.google.com

What does -0.0 mean to you?

Mitchell N Charity pointed out that PerlNum didn't appear to be retaining the sign of zero. (Which, frankly, does my head in every time I think about it; minus zero? What's that then?) Apparently retaining the sign is the Right Thing but, when Mitchell pointed it out, PerlNums were doing the intuitive thing (zero is zero is zero, and the sign's irrelevant). Retaining the sign is what the floating point standard mandates though, so things were changed.

http://groups.google.com

GC issues

Jens Rieks continues to stress the object stuff (he's working on writing an EBNF parser). He posted a test to the list that led Leo Tötsch to find and fix 3 bugs in the Garbage Collector's DOD phase.

http://groups.google.com

Will's questions

Will Coleda, who's working on implementing Tcl for Parrot, had a few questions. My particular favourite was "Unicode?", which seems to echo a lot of people's feelings on that particular issue.

Leo answered his questions (the answer to the Unicode one being "Needs a lot of work still.").

In the thread that followed, Gay, Jerry (Or should that be Jerry Gay? What's a Summarizer to do eh?) wondered why you wrote store_global "global", Pn, but find_global Pn, "global", which he thought was inconsistent. My rule of thumb for this is that the target always comes first on the opcode argument list. Jens Rieks pointed out that opcode arguments are always ordered so that any OUT arguments come first.

http://groups.google.com

New Tcl release

Will Coleda announced the latest version of his Tcl interpreter. Judging by the Changelog extract he posted, things are looking very good. You'll find it in the latest Parrots from CVS.

http://groups.google.com

Classes and metaclasses

Larry, Dan and chromatic had a discussion about what's responsible for dispatch and whether Roles are inherited or acquired by some other means. It got rather philosophical (I like philosophy). Dan got the last word.

http://groups.google.com

Numeric weirdness

Simon Glover found a bug in Parrot's string->number handling. It turns out that the route from -1.2 to a number is different to that from "-1.2" to a number. Which means that the two resulting numbers don't have the same value. Which is bad.

It turned out to be down to hysterical reasons from when IMCC just generated parrot assembly and then called parrot to do the actual execution; then parrot would use the same string conversion routines at compile and run time.

Leo fixed it.

http://groups.google.com

Configure.pl and the history of the world

Dan pointed out that, as the Ponie work goes on, integrating Parrot with Perl 5, we need to get the embedding interface fixed up so that it plays well with others.

He was also concerned that we seemed to be reinventing perl's Configure.SH in a horribly piecemeal fashion and suggested that we should just dig all the stuff out in one swell foop. Larry pointed everyone at metaconfig and discussion ensued.

Quite how metaconfig sits with the miniparrot-based configuration/build plan that Dan's talked about was left as an exercise for the interested reader.

http://groups.google.com

Method caching

Work continued on making objects more efficient. The object PMC had a good deal of fat/indirection removed, and work started on implementing a method cache. Dan reckoned that the two most useful avenues for exploration were method caching and thunked vtable lookups.

Zellyn Hunter suggested people take a look at papers on Smalltalk's dispatch system by Googling for [smalltalk cache].

Mitchell N Charity suggested a couple of possible optimizations (and benchmarks to see if they're worth trying).

There was some discussion of the costs of creating return continuations. (My personal view is that the current continuation and stacks implementation isn't the Right Thing, but I don't have the C skills to implement what I perceive to be the Right Thing. Which sucks.)

Leo reckons that, with a method cache and continuation recycling, he's seeing a 300% improvement in speed on the object oriented Fibonacci benchmark.

http://groups.google.com

http://groups.google.com

ICU incorporation

Jeff Clites gave everyone a heads up about the work he's doing on a patch to incorporate the use of ICU (the Unicode library Parrot will be using) and some changes to our internal representation of strings. Apparently the changes give us a simpler and faster internal representation, which can't be bad.

http://groups.google.com

Continuation usage

Jens Rieks and Piers Cawley both had problems with continuations. Leo Tötsch tried to explain what they were doing wrong. There seemed to be a fair amount of talking past each other going on (at least, that's how it felt from my point of view) but I think communication has been established now. Hopefully this will lead to a better set of tests for continuation usage and a better understanding of what continuations are for and how to use them.

http://groups.google.com

http://groups.google.com

Optimization in context

Mitchell Charity argued that we should think carefully before doing too much more optimization of Parrot until we've got stuff working correctly. Leo agreed, up to a point, but pointed out that optimizing for speed is a lot of fun. Brent Royal-Gordon thought that it was a balancing act: some things are painfully slow and need optimizing; at other times, things are painfully nonexistent and need to be implemented. Objects were both of those things for a while.

Piers Cawley said that, for all that objects were slow (getting faster), he thought they were rather lovely.

http://groups.google.com

Meanwhile, in perl6-language

Hash subscriptor

At the back end of the previous week, Larry introduced the idea of subscripting hashes with %hash«baz» when you mean %hash{'baz'}. This surprised John Williams (and others I'm sure, it certainly surprised me, but it's one of those "What? Oh... that makes a lot of sense" type surprises.) Larry explained his thinking on the issue. Apparently it arose because :foo('bar') was too ugly to live, but too useful to die, so :foo«bar» was invented, and once you have that, it is but a short step to %foo«bar». (If you've not read Exegesis 7, you probably don't know that :foo«bar» is equivalent to foo => 'bar', but you do now.) John wasn't convinced though. It remains to be seen if he's convinced Larry.

Larry: unfortunately it's an unavoidable part of my job description to decide how people should be surprised.

http://groups.google.com

Mutating methods

Oh lord... I'm really not following this particular thread. The mutating methods thread branched out in different directions that made my head hurt. I *think* we're still getting

    $aString.lc;  # Non-mutating; returns a new lower-case string
    $aString.=lc; # Mutating; makes $aString lower case

I'm going to bottle out of summarizing the rest of the thread. Hopefully subthread perpetrators will be kind to an ageing Summarizer and change subject lines to reflect the content of a given subthread. Ta.

Some questions about operators

Joe Gottman has been reading Synopsis 3 and had a bunch of questions. Much of the ensuing discussion covered the use of the 'broken bar' glyph as an infix form of the zip operator, which I didn't quite realise as I skimmed the thread during the week because Courier doesn't seem to distinguish between the broken bar and the standard bar. Larry later suggested using the yen (¥) symbol instead, which has the advantage of looking a little like a zipper. I really hope that firms up from suggestion to design call.

http://groups.google.com

Announcements, Acknowledgements, Apologies

Whee! I have an announcement! This summary is dedicated to my step grandson, Isaac Stamper, born 2004-03-17T13:13GMT at the RVI in Newcastle. There are (of course) photos online at http://www.bofh.org.uk/photos/Isaac/, but you don't have to go and look at them.

I'd also like to apologise to everyone on perl6-internals for my complete inability to post attachments to the list. I hope that those who are interested got to see my first cut at a Parrot implementation of xUnit in the end.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send me feedback at mailto:p6summarizer@bofh.org.uk, or drop by my website, maybe I'll add some content to it this week.

http://donate.perl-foundation.org/ -- The Perl Foundation

http://dev.perl.org/perl6/ -- Perl 6 Development site

http://www.bofh.org.uk/ -- My website, "Just a Summary"

Synopsis 3

Operator Renaming

Several operators have been given new names to increase clarity and better Huffman-code the language:

  • -> becomes ., like the rest of the world uses.
  • The string concatenation . becomes ~. Think of it as "stitching" the two ends of its arguments together.
  • Unary ~ now imposes a string context on its argument, and + imposes a numeric context (as opposed to being a no-op in Perl 5). Along the same lines, ? imposes a Boolean context.
  • Bitwise operators get a data type prefix: +, ~, or ?. For example, | becomes either +| or ~| or ?|, depending on whether the operands are to be treated as numbers, strings, or Boolean values. Left shift << becomes +<, and correspondingly with right shift. Unary ~ becomes either +^ or ~^ or ?^, since a bitwise NOT is like an exclusive-or against solid ones. Note that ?^ is functionally identical to !. ?| differs from || in that ?| always returns a standard Boolean value (either 1 or 0), whereas || returns the actual value of the first of its arguments that is true.
  • x splits into two operators: x (which concatenates repetitions of a string to produce a single string), and xx (which creates a list of repetitions of a list or scalar).
  • Trinary ? : becomes ?? ::.
  • qw{ ... } gets a synonym: « ... ». For those still living without the blessings of Unicode, that can also be written: << ... >>.
  • The scalar comma , now constructs a list reference of its operands. You have to use a [-1] subscript to get the last one.

New Operators

  • Binary // is just like ||, except that it tests its left side for definedness instead of truth. There is a low-precedence form, too: err.
  • Binary => is no longer just a "fancy comma." It now constructs a Pair object that can, among other things, be used to pass named arguments to functions.
  • ^^ is the high-precedence version of xor.
  • Unary . calls its single argument (which must be a method, or a dereferencer for a hash or array) on $_.
  • ... is a unary postfix operator that constructs a semi-infinite (and lazily evaluated) list, starting at the value of its single argument.
  • However, ... as a term is the "yada, yada, yada" operator, which is used as the body in function prototypes. It dies if it is ever executed.
  • $(...) imposes a scalar context on whatever it encloses. Similarly, @(...) and %(...) impose a list and hash context, respectively. These can be interpolated into strings.

Hyperoperators

The Unicode characters » (\x[BB]) and « (\x[AB]) and their ASCII digraphs >> and << are used to denote "hyperoperations" – "list" or "vector" or "SIMD" operations that are applied pairwise between corresponding elements of two lists (or arrays) and which return a list (or array) of the results. For example:

     (1,1,2,3,5) »+« (1,2,3,5,8);   # (2,3,5,8,13)

If one argument is insufficiently dimensioned, Perl "upgrades" it:

     (3,8,2,9,3,8) >>-<< 1;         # (2,7,1,8,2,7)

This can even be done with method calls:

     ("f","oo","bar")».«length; 
   # (1,2,3)

When using a unary operator, only put it on the operand's side:

     @negatives = -« @positives;

      @positions»++;            # Increment all positions

      @objects».run();

Junctive Operators

|, &, and ^ are no longer bitwise operators (see Operator Renaming) but now serve a much higher cause: they are now the junction constructors.

A junction is a single value that is equivalent to multiple values. Junctions thread through operations, returning another junction representing the result:

     1|2|3 + 4;                              # 5|6|7
     1|2 + 3&4;                              # (4|5) & (5|6)

Note how when two junctions are applied through an operator, the result is a junction representing the operator applied to each combination of values.

Junctions come with the functional variants any, all, one, and none.

This opens doors for constructions like:

     unless $roll == any(1..6) { print "Invalid roll" }

     if $roll == 1|2|3 { print "Low roll" }
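
The functional constructors read just as naturally in conditionals. A few more hedged examples in the same spirit:

     if all(@scores) >= 50         { print "Everyone passed" }
     if none(@guesses) == $answer  { print "Nobody has it yet" }
     if one(@doors) eq 'open'      { print "Exactly one door is open" }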

Chained Comparisons

Perl 6 supports the natural extension to the comparison operators, allowing multiple operands.

     if 3 < $roll <= 6              { print "High roll" }
     
     if 1 <= $roll1 == $roll2 <= 6  { print "Doubles!" }
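
The chained form reads as the conjunction of the individual comparisons, with the shared middle operand evaluated just once:

     if 3 < $roll && $roll <= 6    { print "High roll" }   # the long way round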

Binding

A new form of assignment is present in Perl 6, called "binding," used in place of typeglob assignment. It is performed with the := operator. Instead of replacing the value in a container like normal assignment, it replaces the container itself. For instance:

    my $x = 'Just Another';
    my $y := $x;
    $y = 'Perl Hacker';

After this, both $x and $y contain the string "Perl Hacker," since they are really just two different names for the same variable.

There is another variant, spelled ::=, that does the same thing at compile time.

There is also an identity test, =:=, which tests whether two names are bound to the same underlying variable. $x =:= $y would return true in the above example.
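
To see how binding differs from ordinary assignment (a small invented example):

    my $x = 'shared';
    my $y := $x;     # $y names the same container as $x
    my $z  = $x;     # $z copies the value into a fresh container
    $x = 'changed';
    # $y is now 'changed'; $z is still 'shared'
    # $x =:= $y is true; $x =:= $z is false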

List Flattening

Since typeglobs are being removed, unary * may now serve as a list-flattening operator. It is used to "flatten" an array into a list, usually to allow the array's contents to be used as the arguments of a subroutine call. Note that those arguments still must comply with the subroutine's signature, but the presence of * defers that test until runtime.

    my @args = (\@foo, @bar);
    push *@args;

Is equivalent to:

    push @foo, @bar;

Piping Operators

The new operators ==> and <== are akin to UNIX pipes, but work with functions that accept and return lists. For example,

     @result = map { floor($^x / 2) }
                 grep { /^ \d+ $/ }
                   @data;

Can also now be written:

     @data ==> grep { /^ \d+ $/ }
           ==> map { floor($^x / 2) }
           ==> @result;

Or:

     @result <== map { floor($^x / 2) }
             <== grep { /^ \d+ $/ }
             <== @data;

Either form more clearly indicates the flow of data. See Synopsis 6 for more of the (less-than-obvious) details on these two operators.

Invocant Marker

An appended : marks the invocant when using the indirect-object syntax for Perl 6 method calls. The following two statements are equivalent:

    $hacker.feed('Pizza and cola');
    feed $hacker: 'Pizza and cola';

zip

In order to support parallel iteration over multiple arrays, Perl 6 has a zip function that interleaves the elements of two or more arrays.

    for zip(@names, @codes) -> $name, $zip {
        print "Name: $name;   Zip code: $zip\n";
    }

zip has an infix synonym, the Unicode operator ¦.

Minimal Whitespace DWIMmery

Whitespace is no longer allowed before the opening bracket of an array or hash accessor. That is:

    %monsters{'cookie'} = Monster.new;  # Valid Perl 6
    %people  {'john'}   = Person.new;   # Not valid Perl 6

One of the several useful side-effects of this restriction is that parentheses are no longer required around the condition of control constructs:

    if $value eq $target {
        print "Bullseye!";
    }
    while 0 < $i { $i++ }

It is, however, still possible to align accessors by explicitly using the . operator:

     %monsters.{'cookie'} = Monster.new;
     %people  .{'john'}   = Person .new;
     %cats    .{'fluffy'} = Cat    .new;

This week on Perl 6, week ending 2004-03-14

Another week, another summary. It's been a pretty active week so, with a cunningly mixed metaphor, we'll dive straight into the hive of activity that is perl6-internals.

Benchmarking

Discussion and development of Sebastien Riedel's nifty Parrot comparative benchmarking script continued. Leo suggested a handy config file approach which would allow for adding benchmarks in other languages without having to change the script itself.

The initial results don't look good if your name's Dan and you want to avoid getting a pie in the face at OSCON, though, as Dan pointed out, there are a few tricks still to be played in this area. Anyway, benchmark.pl is now in the CVS tree if you want to play with it.

http://groups.google.com

Speeling mistacks

The ever helpful chromatic applied Bernhard Schmalhofer's patch to fix up an endemic speling mostake in some tests. Apparently DESCRIPTION isn't spelt "DECSRIPTION".

http://groups.google.com

Dates and Times

Discussion of parrot's handling of dates and times continued this week too. Joshua Hoblitt, who works on DateTime.pm (a very handy base for date/time handling in Perl 5; you should check it out), said that what the DateTime people really, really want is "an epoch that's absolutely fixed across all platforms."

There was also some discussion of where the opcode/library boundary should be placed, with some arguing that the likes of strftime should be implemented as opcodes. Melvin Smith responded to this with what seems to me to be a very telling point: "If we cannot provide a decently performing VM that makes people want to write stuff in bytecode (or compiled to bytecode) we have failed anyway".

Toward the end of the week, Dan announced some decisions and questions. Larry had a few quibbles, but otherwise there have been no other comments. Personally, if I never hear the phrase "leap second" again, I'll not be unhappy.

http://groups.google.com

http://groups.google.com -- Dan's decisions

Alternate function calling scheme

Dan has come to the view that we need an alternative, lightweight calling convention for calling vtable opcode functions, the idea being that this should speed up objects a tad. He asked for suggestions.

Leo Tötsch wasn't convinced that we really need special calling conventions, arguing (with numbers) that it would be best to concentrate on speeding up object instantiation by optimizing object layout. Simon Glover agreed with him, noting that simply changing the Array that used to store class, classname and attributes gave him a speedup of around 15% on the object benchmarks.

http://groups.google.com

Summary Correction

Last week I said that we can't yet do delegated method calls for vtable functions with objects. Leo pointed out that, actually, we can now. Leo also disclaimed any responsibility for helping Brent Royal-Gordon (né Dax?) fix up the support functions for Parrot::Config, though Brent later claimed that he was merely the person doing the typing...

Jerome Quelin noted that parrotbug has already reached version 0.2.1 (I wonder what its version will be when Parrot 1.0.0 gets released).

http://groups.google.com

http://groups.google.com

Dead Object Detection improved

Not content with his work on everything else this week, Leo has revisited the Garbage Collector and tracked down a couple of bugs including a really nasty sounding one that caused disappearing continuations. He even isolated the problem with a test.

http://groups.google.com

Rejigging trace output

Leo's rearranged the output of parrot -t slightly in an effort to make it more readable. Jens Rieks was impressed and pointed out a couple of issues, which Leo quickly fixed.

http://groups.google.com

Namespaces in IMCC

Dan's day job continues to be a handy driver of Parrot development. This time he needs to make use of namespaces and, whilst namespaces themselves aren't completely nailed down yet, there's enough there that the time has come to work out the syntax for working with them in IMCC. He proposed

    .namespace [foo; bar; baz]

as a way of setting the current namespace to foo::bar::baz (in Perl style, your language may vary). Leo was okay with that as far as it went, but pointed out that things were slightly more complicated than Dan's proposal implied. He suggested that the time was right to sort out the duties of the PIR assembler towards variable names, name mangling, lexical scopes, namespaces, globals and all that good stuff. Dan punted slightly on this latter part, saying that, in general it shouldn't be the assembler's job to track them. The current namespace would simply be used as the namespace in which to place any subsequently defined functions. There was the sound of a hand slapping a forehead from Austria, and Leo went off and implemented it.

http://groups.google.com

New library, objecthacks.imc

In the process of redoing the Parrot Data::Dumper to use objects, Jens Rieks built a library of helper functions to make object usage easier, so he submitted it to the list as a standalone library. Leo checked it in; you can find it at library/objecthacks.imc.

http://groups.google.com

Implementing stat?

Leo Tötsch proposed a stat opcode for finding out about things in the filesystem. He outlined a proposed interface. Dan agreed that we'd need something, but that Leo's proposal was far too Unix-centric to work for a cross-platform platform like Parrot. He suggested going back to first principles and working out what information would be needed (and possibly available). He also said that one of his guiding principles for Parrot was that he would "rather not re-implement and expose 35 years worth of 'Whoops, that turned out to be a bad idea' crud."

Josh Wilmes took this opportunity to remind everyone of the proposed miniparrot and pointed out that, if we want it to work again, there needs to be a smooth way to exclude opcodes or PMCs that won't work in miniparrot's environment. Dan agreed strongly, reminding everyone that miniparrot is intended to be the basis of Parrot's eventual build system. (The process will go: Platform specific shell script -> miniparrot -> parrot).

http://groups.google.com

http://groups.google.com -- Don't forget the miniparrot!

IMCC and method calls

Leo announced that he'd expanded IMCC's PIR parser a bit, allowing you to write:

    obj.method(args)
    ret = obj.method(...)
    (retvals) = obj.method(...)

where method is a label.

A couple of hours later Dan posted a design spec for how he wanted method calls to work in IMCC:

    object.variable(params)
    object."literal name"(params)

Methods would be declared like:

    .pcc_sub foo prototyped, method
        .param pmc foo
        .param pmc bar
        ...
    .end

Declaring a method in this way would also create a local self which points to the object PMC register. What do you know, Leo implemented it all. There is also a more 'explicit' way of making method calls for those occasions when you need more control. Check the docs/examples for details.

http://groups.google.com -- Leo implements at 3pm

http://groups.google.com -- Dan designs at 5pm

Data::Dumper test version

The excellent Jens Rieks posted a test version of his object oriented version of library/dumper.imc and asked for comments. (Mine was "Wow!" but I didn't post it to the list 'til now). Leo wondered if he should add it to the repository or wait for it to get rejigged to take account of all the improved object syntax changes that had gone into IMCC. Jens told him to hold fire until he'd converted everything to the new syntax.

http://groups.google.com

Months that Do The Right Thing

Dan's rejigged the date decoding logic to return months in the range 1-12 instead of 0-11. And there was some rejoicing.

http://groups.google.com

Problems calling methods on self

Jens Rieks discovered that he couldn't call self."method"(...), or even compile it for that matter. It turned out to be a problem with the grammar. Steve Fink provided a grammar fix, Leo tracked down a problem with registers not getting preserved properly, and there followed some discussion about the ambiguity of . being used for both method calls and string concatenation. Luke Palmer suggested (and Melvin Smith agreed) that it was probably better style to use the concat op anyway.

http://groups.google.com

Object instantiation/initialization

Dan's currently seesawing about how to customize object creation by passing arguments to the constructor function. At present, you create an object by calling the class's init method, which takes no arguments; any setup that depends on arguments then requires a separate method call on the new object afterwards. Dan outlined three ways forward and asked for opinions.

http://groups.google.com

Ponie problems

Nicholas Clark posted to say that Ponie was having problems when built using Parrot's garbage collection. After some encouragement from Leo, he managed to write a short test case that showed the issue in a few lines of C. Leo proposed a solution, but Nicholas and Arthur Bergman (Ponie's core team) weren't happy with either option. This hadn't been resolved by the end of the week.

http://groups.google.com

Why does Parrot have so many opcodes?

Matt Greenwood wondered why Parrot had so many opcodes, and what the criteria are for adding or deleting ops. Various rationales were offered. Dan's explanation was probably the most comprehensive though. Apparently, Parrot has opcode explosion issues because there's no runtime op variance and if you want to know what that means, read Dan's post.

http://groups.google.com

http://groups.google.com -- Dan explains it all

Parrot grabs SIGINT

Arthur "Poniemaster" Bergman noticed that Parrot grabs SIGINT, which makes some of Ponie's signal handling code break. He wondered if this was something that needed fixing, or a deliberate design decision. Dan says that, eventually and by design, Parrot will grab all the signals, though not necessarily in the embedded case (which is what Arthur is using of course.)

http://groups.google.com

Ponie gets a development release

Arthur Bergman has made the second development release of Ponie, the Perl5/Parrot hybrid. The initial response was positive.

http://groups.google.com

Per-class attribute offsets

Peter Haworth had a bunch of questions and worries about using numeric offsets to get at class attributes. Dan tried to reassure him, but Oli came up with a worrying corner case. Dan didn't think it was that big an issue, but confessed that it's still a worry.

http://groups.google.com

Using Ruby Objects with Parrot

Poor old Dan; once you've got some bits of objects done, it just means everyone wants more from you. Mark Sparshatt wondered how to handle languages like Ruby, where a class is also an object, which seems to run counter to Parrot's scheme using ParrotClass and ParrotObject PMCs where a ParrotClass isn't a ParrotObject. He suggested three ways forward.

Summarizing this stuff is Hard. On one level you can reduce the issue to a couple of sentences, but understanding those sentences involves rather a lot of underpinning knowledge that I'm not sure I have. I just know that life's an awful lot easier if I can treat classes as objects.

Dan tried to wrap his head 'round things and posted a summary of his initial understanding of what a metaclass is. He wondered if we'd need to have a 'call class method' operation. Paolo Molaro pointed out that most of this stuff only becomes a real issue when you have to deal with objects implemented in one language calling methods on objects implemented in a different one. Leo Tötsch explained Parrot's current (somewhat confusing to these eyes) hierarchy where:

  • A ParrotClass isa delegate PMC
  • A ParrotObject isa ParrotClass
  • A HLL class isa ParrotClass and a low-level PMC object
  • A HLL object isa all of above

Piers Cawley thought that having ParrotObject inherit from ParrotClass rather than vice versa seemed somewhat perverse. He also argued that method dispatch should be decoupled from method lookup. I think he and Leo failed to communicate.

Mitchell N Charity was good value on all this as well; definitely worth reading if you're into the theory behind how OO should work.

http://groups.google.com

http://groups.google.com

http://groups.google.com -- Mitchell N Charity on Objects, Classes and MetaClasses

Configure changes

Brent announced that he's been making some significant changes to Configure in the last week or so. He posted a big list. Leo liked what had been done so far, but thought the process needed more steps to probe for more stuff.

http://groups.google.com

PIR changes

Leo announced that he's added some syntactic sugar to PIR; you can now write:

    $N0 = sin 0
    $I0 = can $P0, "puts"
    $I0 = isa $P0, "scalar"

etc, and they'll be magically converted to

    PARROT_OP lhs[, args]

which is rather handy (with some ugly cases: P1 = invoke being the example Leo came up with).

http://groups.google.com

OO version of Data::Dumper

On Sunday, Jens Rieks released his OO version of Data::Dumper, along with a couple of helper libraries, including a rewritten objecthacks.imc, now called objects.imc. Leo committed it (and a followup bugfix) promptly and Parrot now has its first OO application/library. Yay Jens!

http://groups.google.com

Meanwhile, in perl6-language

Magic blocks

Remember last week I implied that magic UPPERCASE blocks for things like BEGIN, CHECK, INIT etc were being replaced by properties on variables? Well, I was wrong. We keep magic blocks but we also get handy properties for attaching actions to variables.

In fact, the handy properties were first discussed in April last year, as Larry pointed out.

http://groups.google.com

Mutating methods

Juerd wondered about the possibility of mutating and non-mutating methods in Perl 6. Consider the Perl 5 functions chomp and lc. chomp potentially modifies its argument, whereas lc returns a new string. Juerd liked the idea of having mutating and non-mutating versions of both chomp and lc (and other such string methods). He proposed using

    $aString.lc; # Non-mutating
    $aString.=lc; # Mutating

The general response to this was favourable.

The thread sparked a discussion of Perl 6 Rules and hypothetical variables (which just makes me hope Damian gets Perl6::Rules written really soon).

Larry posted a chunk of stuff from the forthcoming Apocalypse 12 that looks rather neat. Then he and Simon Cozens started talking Japanese and my head started spinning.

http://groups.google.com

http://groups.google.com -- Larry tells us more about A12

Operators that keep going and going...

Carissa wondered if Perl 6 would have what she called 'persistent' operators, so that you could do:

    my $a = 10;
    my $b = 5;

    my $c = $b + $a;
   
    print $c; # 15
   
    $a = 8;

    print $c; # 13

Several people pointed out that it's already possible to do this sort of thing in Perl 5 so it'd be a big surprise if you couldn't do the same thing in Perl 6. (See perldoc overload in the section "Really symbolic calculator" for details).
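
For the curious, here is a minimal Perl 5 sketch of the kind of thing that section of perldoc overload describes; the Lazy class and all its names are invented for illustration:

    package Lazy;
    use strict;
    use warnings;
    use Scalar::Util qw(blessed);
    use overload
        '+'  => \&_add,
        '""' => \&_stringify;

    # Wrap a thunk (code ref) that computes the value on demand.
    sub new { my ($class, $thunk) = @_; bless { thunk => $thunk }, $class }

    # Wrap a scalar reference, so later assignments show through.
    sub var { my ($class, $ref) = @_; $class->new(sub { $$ref }) }

    # Evaluate either a Lazy object or a plain value.
    sub value {
        my $thing = shift;
        return blessed($thing) && $thing->isa('Lazy')
            ? $thing->{thunk}->() : $thing;
    }

    sub _add {
        my ($x, $y) = @_;    # '+' is commutative, so the swap flag is ignored
        Lazy->new(sub { value($x) + value($y) });
    }

    sub _stringify { value($_[0]) }

    package main;
    my $a = 10;
    my $b = 5;
    my $c = Lazy->var(\$a) + Lazy->var(\$b);
    print "$c\n";    # 15
    $a = 8;
    print "$c\n";    # 13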

http://groups.google.com

Announcements, Apologies, Acknowledgements

I've had it with discussions of whether Unicode's a good idea or not, and from this summary onwards I'll not be including any such discussions unless Larry does an implausible volte face. That is all.

Now I can go back to being anxious about my stepdaughter who's been in labour for rather a long time.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send me feedback at mailto:p6summarizer@bofh.org.uk, or drop by my website, maybe I'll add some content to it this week.

http://donate.perl-foundation.org/ -- The Perl Foundation

http://dev.perl.org/perl6/ -- Perl 6 Development site

http://www.bofh.org.uk/ -- My website, "Just a Summary"

Simple IO Handling with IO::All

One of my favorite things about Perl is how flexible it is. When I don't like something about the language, I don't let it get me down. I just change the language!

The secret to doing this lies in Perl modules. Modules make this easy. Let's say you have a Perl idiom that you use every day in your programming, but it just seems clumsier than it needs to be. Usually, with a little cleverness, you can simply hide a few hundred lines of code inside a module, thereby turning your 3-line idiom into a 2-line one!

I'm joking here, but at the same time I'm not joking. While it may seem like a recipe for scratching your itch with a backhoe, if you share your module on CPAN, you end up potentially scratching a million itches simultaneously. So perhaps the backhoe is the appropriate solution. Let me give you an example.

Slurp Me Up, Scotty

One of the most common idioms in my day-to-day programming is to read the contents of a file into a single scalar. This is often referred to by Perl geeks as a slurp operation. Usually it looks something like this:

    open my $file_handle, './Scotty'
      or die "Scotty can't be slurped:\n$!";
    local $/;   # Set input to "slurp" mode.
    my $big_string = <$file_handle>;
    close $file_handle;
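
For comparison, the same idiom is often squeezed into a do block; that saves a line or two but hardly reads better:

    my $big_string = do {
        open my $file_handle, '<', './Scotty'
          or die "Scotty can't be slurped:\n$!";
        local $/;           # slurp mode, scoped to this block
        <$file_handle>;     # the handle closes when it goes out of scope
    };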

So it would seem that slurping is a big deal. I mean, it took me 5 lines of code to do. Five lines is a lot in Perl. Surely, there could be an easier way. If I had my druthers, I would be able to just do this:

    my $big_string << './Scotty';

And be done with it. Unfortunately, this doesn't work the way I want it to, even though it is valid (albeit useless) Perl. How do I know it's useless? Perl told me so when I turned on warnings: "Useless use of left bitshift (<<) in void context."

Now even though I could write a source filter to make the above code do what I wanted, it wouldn't be the right approach. Surely there is something just as simple, that uses valid Perl constructs. After thinking about it for a couple hours, I came up with this:

    my $big_string = io('./Scotty')->slurp;

Being quite satisfied with my new idiom, I sat down for a few more weeks, and wrote a few hundred lines of code, and hid it in a module called IO::All and uploaded it to CPAN. Now I can do my 5-line slurp in 1 line. Phew!

Extreme Simplicity

How on earth could a module to perform the slurp idiom be several hundred lines long? Well, IO::All does slurping and a whole lot more. The motivating idea behind this module is to simplify all of the Perl IO idioms as much as possible, and also to create new idioms for common-use cases that weren't really idiomatic to begin with.

In recent years, I've become a fan and student of Extreme Programming (XP). One of the principles of XP is to constantly refactor your code to make it simpler, easier to read, and ultimately more maintainable. In striving to do so I found that as my code became as clean as I could make it, the parts that still looked dirty were constructs imposed on me by the Perl language itself; especially the IO stuff. I didn't let it get me down. I changed the language!

What's Going on Here?

The basic idea of IO::All is that it exports a function called io, which returns a new IO::All object. For example:

    use IO::All;

    my $io = io('file1');
    # Is the same thing as:
    my $io = IO::All->new('file1');

Another principle of IO::All is that it takes as many cues as possible from its context. Consider the following example:

    my @lines = io('stuff')->slurp;
    my @good_lines = grep {not /bad/} @lines;
    io('good-stuff')->print(@good_lines);

Here we are basically censoring a file, removing all the bad lines. The first statement slurps up the file, but since it is called in list context, it returns all the lines instead of one long string. The second statement weeds out any filth, and the third statement writes the good lines to a new file.

But the question arises, "How did IO::All know to open the output file for output?"

The answer is that IO::All delays the open until the first IO operation and uses the operation to determine the context. The opening and closing of files happens automatically and you almost never need to indicate the file mode, although you can do all of this manually if you really want to.
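
For instance (the file name here is invented), the very same constructor call adapts to whichever operation comes first:

    use IO::All;

    io('notes.txt')->print("first line\n");   # first op is a write: opened for output
    my $line = io('notes.txt')->getline;      # first op is a read: opened for input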

Directory Assistance

I never really liked the directory commands in Perl. You know, opendir, readdir, and closedir. I thought, well why not let an IO::All object act as a directory in addition to a file? How would IO::All know the difference? Context, of course!

    my $dir = io('mydir');
    while (my $io = $dir->read) {
        print $io->name, "\n"
          if $io->is_file;
    }

In this example, IO::All opens a directory for reading and returns one entry at a time, much like readdir. The difference is that instead of a file or subdirectory name, you get back another IO::All object. This is really cool, because you can immediately perform actions on the new objects. In the above code, we print the filename (returned by the name method) if the object represents a file rather than a subdirectory (tested with the is_file method).

File::Find

Ask any experienced Perl programmer which core module has the most abysmal interface, and they'd probably say File::Find. Rather than explain how File::Find works (which would take me an hour of research to figure out again), here's an easy way to roll your own search.

    my @wanted_file_names = map {
        $_->name
    } grep {
        $_->name =~ /\.\w{3}$/ &&   # anchored: the extension is exactly three chars
        $_->slurp =~ /ingy/
    } io('my/directory')->all_files;

This search finds all the file names in a directory that have a three-character extension and contain the string 'ingy'. The all_files method is a shortcut that returns only the files. There are also all_dirs, all_links, and simply all methods.

A Poor Man's tar

This example reads all the files under a directory, and dumps them into one big file, separated by a line containing the file's name and size. This is analogous to what the Unix tar command does.

    use IO::All;
    for my $file (io('mydir')->all_files('-r')) {
        my $output = sprintf("--- %s (%s)\n", $file->name, -s $file->name)
                     . $file->slurp;
        io('tar_like')->append($output);
    }

In this usage, we pass a special flag, -r, that tells all_files to be recursive. That is, to find all files in all subdirectories. Also notice the append method. This is the same as print, but the file is opened for concatenation.

Double STanDards, String Cheese, and Temporary Insanity

IO::All has some handy shortcut names. In the Unix tradition, it uses a dash to mean STDIN, but it also uses it to mean STDOUT. Check out this one liner:

    io('-')->print(io('-')->slurp);

This just prints everything on STDIN to STDOUT. Once again context is used to determine which file handle the dash is actually referring to. A potentially more efficient way to write this is:

    my $stdin = io('-');
    my $stdout = io('-');
    $stdout->buffer($stdin->buffer);
    $stdout->write while $stdin->read;

The read and write methods work from an internal buffer, which defaults to 1k in size. What we've done in this example is to set the two objects to use the same buffer. Since the write method clears the buffer after writing it, the above idiom works nicely.

Another special character is the dollar sign. This means that the IO::All object will read/write to a Perl scalar rather than a file. This can be useful when you have a code base that writes to a file, but you want to fake it out and capture all the output in a string without changing the code base.

Finally, if you pass no arguments at all to the io function it will work as a temporary or nameless file. This is somewhat similar in effect to writing to a string, except that the data is actually going to your disk. The temporary file is opened for both read and write modes.

Here is a somewhat contrived example using all of these special cases.

    my $temp = io;
    $temp->print(io('-')->slurp);
    $temp->seek(0, 0);
    my $str = io('$');
    $str->print($_) for $temp->getlines;   # getlines returns all remaining lines
    my $data = ${$str->string_ref};
    $data =~ s/hate/love/;
    io('-')->print($data);

OK, listen up and repeat after me:

We slurp up all of STDIN, and slam it in a temp. We seek back to the start of it, and shove it in a string. We suck the soul right out of string and save it from its sin. Then ship the lot to STDOUT, and sing it once again.

Socket to Me

If IO::All objects can represent files, directories, streams, and strings then they can surely do the same for sockets. This example prints the header lines from an HTTP GET call:

    my $io = io('www.google.com:80');
    $io->print("GET / HTTP/1.1\n\n");
    print while ($_ = $io->getline) ne "\r\n";

Again, the context comes into play. Since www.google.com:80 looks like a socket address, the IO::All object does the right thing. It is worth noting that if you really wanted to open a file called 'www.google.com:80' or '-' or '$', you can explicitly override the IO::All heuristics like so:

    my $io1 = io(-filename => 'www.google.com:80');
    my $io2 = io(-filename => '-');
    my $io3 = io(-filename => '$');

For Fork's Sake

The one thing I always use O'Reilly's Perl Cookbook for is creating a forking socket server. Not because it's that hard, but it's just something I don't keep in my head. With IO::All I have no problem remembering how to do it because it's been made dead simple:

    use IO::All;
    my $server = io(':12345')->accept('-fork');
    $server->print($_) while <DATA>;
    $server->close;
    __DATA__
    One File, Two File
    Red File, Blue File

This server sits and listens for connections on port 12345. When it gets a connection, it forks off a sub-process and sends two lines to the receiver.

The client code to call this server might look like this:

    use IO::All;
    my $io = io('localhost:12345');
    print while $_ = $io->getline;

IO::All of It to Graham

It may strike you as silly, vain, or even foolish for someone to rewrite all of the Perl IO functions as a new module when older, more mature modules exist. But therein lies the beauty of IO::All: it doesn't rewrite anything. It simply provides a keen new interface to IO::File, IO::Dir, IO::Socket, IO::String, IO::Handle, and others. It ties all of these robust modules together into one cohesive unit. So even though IO::All is relatively new, it hopefully inherits well from this legacy of stability.

As far as I know, almost all of these modules were written by Perl superhero Graham Barr. I've met Graham personally, and I don't think it would be too forward of me to suggest that you send him a beer to thank him for making Perl so great. Unfortunately, I don't know his address.

Tie Me Up and Lock Me

I learned a neat trick from Gisle Aas by reading the code of his IO::String module. The trick is that you can tie an object to itself. This is especially handy when the object is an IO handle. It means that you can use the object as a regular file handle with all of Perl's built-in IO functions. And you can also use it as a regular object by calling methods on it. It's TMTOWTDI at its finest.

    use IO::All '-tie';
    my $file = io('myfile');
    my $line1 = $file->getline;
    my $line2 = <$file>;

Pretty nifty, eh? Note that you need to request that IO::All perform the tie with the -tie option. That's because a bug in Perl 5.8.0 caused things tied to themselves to not go out of scope.

Here's another nifty feature: automatic file locking. If you specify the -lock option, IO::All will call flock after every file open. You still have to worry about things like deadlock, but at least the mechanism is simple. Here is a sample where every message appended to a log file locks the file first:

    use IO::All '-lock';
    io('log')->appendln(localtime() . " - I'm still here");

The appendln method is a cousin of the println method. Both print a newline after your output. Note that the above code is the same as the following:

    use IO::All;
    io(-lock => 'log')->appendln(localtime() . " - I'm still here");

That's because any parameters that are passed to IO::All are simply passed along to all invocations of the io function.

The Methods in My Madness

IO::All has over 60 methods that you can call to perform various IO-related actions. Not all methods make sense in all contexts for all flavors of IO. Most of the methods are simply direct proxies for methods found in the core modules that IO::All is built upon.

Some of the methods have been enhanced to be more flexible than their ancestors. Take the getline function, for example. The IO::Handle version simply gets the next line. The IO::All version takes an optional argument that is used as the record separator. To read a paragraph of text you could do this:

    my $paragraph = io('myfile')->getline('');

Lots of the methods have been presented in this article. For complete information on all the available IO::All methods, see the IO::All documentation.

It's Not Just Keen, It's Spiffy

IO::All is a little bit different from most Perl modules in that it exports the io function. As I said before, the io function acts as an object constructor, returning a new IO::All object for each invocation. This property is gained from IO::All's base class, Spiffy.pm.

Spiffy is a new kind of generic base class. Its primary magic trick is that it supports a unique feature that I call inheritable exporting. Normally, if you use a module as a base class, it is strictly an object-oriented thing; you don't also export functions. Furthermore, if you were to export some functions, your subclasses would need to manually re-export those functions to their own subclasses.

Spiffy is set up so that all of the @EXPORT arrays in all the modules in the @ISA tree of a class are combined to act as one big export list. The magic is then taken one step further: functions like io that act as object constructors are smart enough to return an object of the subclass, not just an IO::All object.

This is demonstrated in the next section. See the Spiffy documentation for details on this and other exciting features.

EIE::IO

Let's write an extension of IO::All that adds a new method called pruls, which is slurp backwards. Its purpose will be to return the whole file with its lines in reverse order. Here is the code:

    package EIE::IO;
    use IO::All '-base';
    sub pruls {
        my $self = shift;
        my @lines = reverse $self->slurp;
        wantarray ? @lines : join '', @lines;
    }
    1;

That's all there is to it. This module will act just like IO::All, except with one more method. You would use it just like this:

    use EIE::IO;
    print io('mystuff')->pruls;

Note that the io function is still exported, but it returns a new EIE::IO object. That's Spiffy!

Conclusion

IO::All is still quite a young module. There is room for many, many more idioms. There is also the possibility of including even more types of IO, like shared memory, IPC, and Unix sockets. If you have a use case that you think would make a nice addition to this Swiss Army Light Sabre of Perl IO, please let me know. Let's change the Perl language for the better.

This week on Perl 6, week ending 2004-03-07

Time marches on, and another summary gets written, sure as eggs are eggs and chromatic is a chap with whom I will never start a sentence. We start, as always, with perl6-internals.

Platform games

Work continued this week on expanding the number of known (and preferably known good) platforms in the PLATFORMS file.

Languages tests

Dan reckons it's time to be a little more aggressive with tests for ancillary stuff, in particular the contents of the languages subdirectory. He called for language maintainers (and any other interested parties) to at least get minimal tests written for all the languages in the languages directory, and to get those welded to the languages-test makefile target.

http://groups.google.com

IMCC and objects/methods

Tim Bunce congratulated everyone on Parrot 0.1.0 before asking about where we stood with IMCC and objects/methods. Leo confirmed Tim's supposition that there is no syntactic support for objects and methods in IMCC, at least in part because there's been no discussion of how such syntax should look.

http://groups.google.com

Parrotbug reaches 0.0.1

Jerome Quelin responded to Dan's otherwise ignored request for a parrot equivalent of perlbug when he offered an implementation of parrotbug for everyone's perusal, but didn't go so far as to add it to the distribution. I don't think it's been checked into the repository yet, but it'll probably go in tools/dev/ when it does.

Later in the week, he actually got it working, sending mail to the appropriate mailing lists. With any luck the mailing lists themselves will be up and running by the time you read this.

http://groups.google.com

Subclassing bug

Jens Rieks found what looked like a problem with subclassing, which turned out to be a problem with clone not going deep enough. Simon Glover tracked it to its den and Dan Sugalski fixed it.

http://groups.google.com

Good news! Bad news!

Good news! Dan says the infrastructure is in place to do delegated method calls for vtable functions with objects. Bad news! It doesn't actually work, because it's impossible to inherit from delegate.pmc properly. Dan pleaded for someone to take a look at pmc2c2.pl and/or lib/Parrot/Pmc2c.pm and fix things so that the generated code is heritable.

http://groups.google.com

Parrot m4 updated

Bernhard Schmalhofer posted a patch to improve the Parrot implementation of the m4 macro language.

http://groups.google.com

Use vim?

I don't use vim, but it seems that Leo Tötsch does. He's added some handy dandy vim syntax files for IMC code. If you're a vim user you might like to check it out. Leo points out that the syntax tables might well be handy if you don't know all 1384 opcode variants by heart.

http://groups.google.com

Parrotris

Sadly, Jens Rieks' Parrot and SDL implementation of tetris didn't quite make it under the wire for the 0.1.0 release. However, Leo has got round to trying it and is impressed, though he did spot a few bugs (it doesn't spot that the game is over, for instance). Jens is working on fixing those (and on adding new features), which he reckons will go a good deal faster when IMCC has better syntactic support for OO techniques.

http://groups.google.com

Dates and Times

To paraphrase Barbie: Dates and Times are Hard. Not that hardness has ever stopped Dashing Dan Sugalski before. This time he waded into the swamp that is Parrot's handling of dates, times, intervals and all that other jazz. He started by soliciting opinions. He got quite a few.

The discussion can probably be divided into two camps: KISS (Keep It Simple) people, and DTRT (Do The Right Thing) people. But KISS still has its complexities (which epoch do we want? Should time be a floating point value?) and what, exactly, is the Right Thing? The catch is, time is a messily human thing, and computers are really bad at messy humanity.

Larry said that Dan could do what he wants with Parrot, but he wants Perl 6's standard interface to be a floating point seconds since 2000. He argues that "floating point will almost always have enough precision for the task at hand, and by the time it doesn't, it will. :-)". He also argued that normal users "should never have to remember the units of the fractional seconds".

Zellyn Hunter pointed everyone at Dave Rolsky's excellent article on the complexities of handling dates and times with a computer.

Discussion is ongoing, but it seems that Larry and Dan are leaning towards the KISS approach.

http://groups.google.com

http://www.perl.com/pub/a/2003/03/13/datetime.html

Initializers, finalizers and fallbacks

Anyone who has been reading the internals list for any length of time, or who has chatted to Dan about Parrot on the #parrot irc channel will be only too aware that Dan isn't the greatest fan of the OO silver bullet. So, getting the initial objects implementation out there was the sort of thing he probably hoped would mean he didn't have to come back to objects for a while. Except, once you've got part of an object implementation, the need for the rest of it just seems to become more pressing.

So, poor Dan's been doing more OO stuff. This time he sketched out where he was going with initialization, finalization and fallback method location. The basic idea is that, instead of mandating particular method names for different tasks (which seems like the easy approach, but doesn't work across languages), we mandate particular properties which tag various methods as being initializers/finalizers/fallbacks. He outlined his initial set of standard properties and asked for comments.

Leo liked the basic idea, but suggested that, in the absence of any such properties on a class's methods, Parrot should fall back to some methods on a delegate PMC.

http://groups.google.com

OO Benchmarks

Leo posted the results of a couple of benchmarks. They weren't very encouraging. It seems there was a memory leak. And some inefficiencies in object creation. And some more inefficiencies in object creation. By the end of the mini thread, things had got a good deal faster. Quite where it puts us against Perl 5 and Python wasn't mentioned.

http://groups.google.com

Parrot dependencies

Michael Scott, who has been doing sterling work on Parrot's documentation, wanted to remove some non-modified, non-parrot Perl modules from the Parrot distribution and have people install them from CPAN. Dan disagreed quite strongly. The current rule of thumb is that Parrot is allowed to depend on finding a Perl 5.005_0x distribution and a C compiler. Any Perl modules it needs that can't be found in that distribution should be provided as part of the Parrot distribution.

Larry argued that we should separate the notion of the developer distribution from the user distribution. The developer codebase is allowed to have any number of external dependencies ("dependencies out the wazoo" was Larry's chosen phrase), while the user distribution (for plural values of 'the') comes with all the bells and whistles a distributor sees fit to include. He argued that the developer codebase should be completely unusable by anyone but developers, to prevent ISPs from installing that and then claiming to "support Perl".

http://groups.google.com

Vtables as collectible objects

Dan wondered whether we were going to need to treat vtables as garbage collectible objects, and if we did, what kind of hit that would entail. He and Leo discussed it, and Leo reinvented refcounting as being a possible way to go rather than full DOD type collection.

http://groups.google.com

Continuing dumper improvement

Jens Rieks added ParrotObject support to his Parrot data dumper. Dan applied it. Go Jens.

http://groups.google.com

Freezing strings

Brent Dax is working on writing a parrot implementation of Parrot::Config. His initial idea for implementation involves generating a PerlHash of configuration info and then freezing it to disk. However, when he tried it he had problems with freezing strings, so he asked for help. It turned out to be a simple matter of unimplemented functions, which he and Leo rapidly implemented. A patch with Brent's implementation appeared shortly afterwards.

http://groups.google.com

A Perl task - Benchmarking

Leo wondered if anyone could implement an equivalent of perlbench for parrot's benchmarks to do speed comparisons of Parrot, Perl, Python etc implementations. Sebastien Riedel delivered the goods.

http://groups.google.com

Meanwhile, in perl6-language

Exegesis 7

Everyone carried on discussing Damian's Exegesis 7 on Perl6::Format, there was even a surprise appearance by Tom Christiansen, who demonstrated a novel, if computationally intractable, approach to generating fully justified text.

Multi matching

Remember last week, Larry had proposed using & in Perl 6 rules as a way of matching multiple rules at the same point in the string. Damian liked it a lot and said he'd be delighted to add support for it to the semi mythical Perl6::Rules module. So, Larry said the word ("Execute!", which wasn't quite the word everyone expected) so Damian tugged his forelock and added it to his todo list. Questions like "What are the precedence rules for & and | in regular expressions then?" haven't been asked yet.

http://groups.google.com

Compile-time undefined sub detection

Nigel Sandever wondered if Perl 6 will be able to detect undefined subs at compile time. Larry thought it would be in theory, if you ask it to check in a CHECK block, and you're prepared to assume that no eval or INIT block will fill in the blank later, and there's no AUTOLOAD that might catch it. Sounds rather like "Not in the general case, no" to me.

Bringing CHECK and INIT up prompted Rafael Garcia-Suarez to ask what the rules would be for them in Perl 6 because they're pretty broken in Perl 5. (Sometimes you want CHECK and INIT semantics to work well when you're loading a module at runtime, and Perl 5 doesn't do that). It looks like he's going to get his heart's desire on this (A big "Yay!" from the direction of Gateshead). Dan Sugalski popped over from perl6-internals to point out that Parrot would be using properties rather than specifically named blocks for this sort of stuff.

Larry eventually made a ruling which uses properties in a rather cunning fashion. Check his message for the details.

http://groups.google.com

http://groups.google.com -- Larry on magic blocks

Announcements, Apologies, Acknowledgements

Crumbs, another Monday night, another summary finished. I've got to be careful here, it might be habit-forming.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send me feedback at mailto:p6summarizer@bofh.org.uk, or drop by my website.

http://donate.perl-foundation.org/ -- The Perl Foundation

http://dev.perl.org/perl6/ -- Perl 6 Development site

http://www.bofh.org.uk/ -- My website, "Just a Summary"

Distributed Version Control with svk

I started to use Subversion one year ago and liked the elegant file-system design a lot. Soon it became impossible for me to go back to CVS. This means that I felt uncomfortable whenever I was working on projects using CVS, and I wanted to see a tool to keep my Subversion repository in sync with a CVS repository. This would not only save me time importing snapshots into vendor branches, but it would also give me the whole history when I'm not online.

I found Barrie's VCP and wrote a Subversion driver. Then I understood why people said Subversion was slow. My driver invoked the svn command, and it took something like 30 hours to convert from a CVS repository that resulted in 3000 revisions in the Subversion repository.

Fortunately the Subversion developers made the code easy and ready for wrapping into different languages using SWIG. At that time, only Python bindings were implemented, so I had to do the Perl bindings myself. With the Perl bindings in place, VCP got much faster, and I also started writing SVN::Mirror, a module that enables mirroring between Subversion repositories. When I felt bored, I would add Subversion back-end support to tools like Blosxom and Kwiki.

Then the season for traveling came. As I'm far more productive and creative while disconnected from the Internet, I realized I need a distributed version control system, and decided to give myself a year break to develop such a tool to enable me to be even more productive in the future. svk was born soon after my birthday in September 2003.

Why?

There are other distributed version control systems available: Arch, monotone, darcs. The functionalities they offer are more or less equivalent.

svk, however, is written in Perl, and so might be more hackable by a large community. svk also has a set of commands similar to those of cvs. On top of this, svk plans to implement transparent interpolation between different version-control systems.

As I don't see any strong argument suggesting one system over another, it's really up to you to try and decide.

Design Decisions

Subversion has a layered design:

  • fs: Underlying versioned tree library using bdb.
  • repos: Higher-level support for the fs, like log messages.
  • ra: Repository access. Abstracted protocol handlers.
  • wc: Working copy handling.
  • client: Implements the commands for clients.

The first design decision was to drop the wc and ra layers. Elaborating on the Subversion design mentality ("bandwidth is expensive, disks are cheap"), we should really keep a local copy of every revision -- and SVN::Mirror is already available for such purposes.

Having everything in a local repository, we don't need anything like the bloated wc implementation at all. The wc library not only has the .svn metadata directory to confuse your favorite utilities like diff and grep, but also stores a text-base that makes your checkout twice the size of the actual content. XD (which is the character-wise increment of wc) was written to maintain checkout copies in a lightweight manner.

Next, I found the most important component of Subversion is not on the above list. It's the delta library that defines the API for describing tree deltas; this is definitely the core thing in tree-based version control systems. It's called "Delta Editor."

For example, running a delta between revision 1 and revision 3 will generate a series of method calls (add_directory, open_file, apply_textdelta, close_file, close_directory, etc.), to the targeted editor object. These method calls describe the changes made from revision 1 to revision 3.
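
As a rough sketch of the shape of that interface (a hand-rolled stand-in, not the real SVN::Delta editor API), an editor is just an object whose methods a delta driver calls in tree order:

    package ToyEditor;
    sub new             { bless {}, shift }
    sub add_directory   { my ($self, $path) = @_; print "A  $path/\n"; return $path }
    sub open_file       { my ($self, $path) = @_; print "M  $path\n"; return $path }
    sub apply_textdelta { }    # the actual content changes would arrive here
    sub close_file      { }
    sub close_directory { }

    package main;
    # A delta run describing "revision 1 to revision 3" drives the editor like so:
    my $editor = ToyEditor->new;
    my $dir    = $editor->add_directory('trunk/lib');
    my $file   = $editor->open_file('trunk/README');
    $editor->apply_textdelta($file);
    $editor->close_file($file);
    $editor->close_directory($dir);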

svk was self-hosting within two months of development, and I then started to refactor the existing code to center around this interface. With Perl, I could easily stack the editors together, making each editor do its own job, adding arbitrary callbacks as extensions to the API, and all of the other fun things you know you can do with Perl. Much of the functionality was abstracted out, resulting in the following core components of svk:

  • An editor that receives delta calls to modify the checkout copy.
  • A function to generate delta calls for describing the modification done to the checkout copy.
  • An editor that takes delta calls and merges them with a tree to generate non-conflicting calls.

Together with these, the logic behind most of the commands became just a question of gluing together a delta generator and the appropriate editors.

Additionally, with Perl's flexible PerlIO layers system, keyword expansion (like $Id$ in cvs) was done within one hour. The reusable part of this was abstracted out to the PerlIO::via::dynamic module on CPAN.

Now let's see svk in action.

A First Look

I hate typing those long URLs when using Subversion. So mapping repositories to shorter names is a must:


    $ svk depotmap

This will help you create a default repository at ~/.svk/local, and you could refer to it by // in the future. If you have a Subversion repository on the disk, you could add another line: test: '/path/to/repos'. Then you have immediate access to the existing repository -- only with the shorter name /test/ instead of file:///long-path-plus-auto-complete-wont-work.

Now let's put something in it:


    $ svk import //project/vendor /path/to/project-0.01

This will do what you think: import things into //project/vendor. Repeat the command with a newer version of this project, say 0.02, and you'll have a vendor branch tracked on the path.

Like Subversion, branches and tags are implemented as cheap file system copies:


    $ svk cp -m 'development trunk for project' //project/vendor //project/trunk

Now let's check it out:


    $ svk checkout //project/trunk ~/work/project

If you have experience with cvs or Subversion, you'll find it familiar when trying to add, modify, remove, or commit files. svk log will give a change history of files or directories.

Suppose you import project-0.02 after branching trunk, and want to merge the changes from the vendor branch. You just need to:


    $ svk smerge //project/vendor //project/trunk

svk remembers branch and merge history, so it does things automatically for you. If there are conflicts, just replace //project/trunk with a checkout path such as ~/project/trunk. You will be able to see the conflicts; resolve them and commit once done. Merging is no longer painful.

Once merged, you could bring the checkout copy of your trunk to the latest revision with svk update.

Working with Remote Repositories

As mentioned earlier, svk uses SVN::Mirror to handle remote repository access. You need to mirror them before you can use them:


    $ svk mirror //project/trunk https://svn.somewhere.org/repos/trunk
    $ svk sync //project/trunk

Currently you need to set up a Subversion server (either using Apache2 or svnserve). See relevant articles or tutorials about it.

Now create a local branch, and prepare for traveling:


    $ svk cp -m 'create a local branch' //project/trunk //project/local

You could now check out //project/local and work on it just as above. Of course you could still create your own branch with cp //project/local //project/new-feature.

Use svk sync to sync the latest trunk when connected. Merging the new changes from trunk to your local branch works just like the previous example of merging from a vendor branch. How about merging your local changes back to the remote repository?


    $ svk smerge //project/local //project/trunk

Transparent, isn't it?

You should use smerge -C in advance to check if there are conflicts. Even if your local branch is not merged from the latest trunk, svk will merge the changes for you and commit to the remote repository directly, provided there are no conflicts. But be sure to sync the latest trunk first.
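
For example, to preview that merge without committing anything:

    $ svk smerge -C //project/local //project/trunk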

In fact, if you are online and about to commit a minor change, you could forget about the process "modify on local branch, then merge back." Just do:


    $ svk switch //project/trunk
    $ svk commit

The first line switches us from the local branch to trunk, which is the path containing the mirrored archive. The switch command will keep your local changes and apply them to trunk, as if the modifications had been made to a checkout of trunk. Then svk commit on the mirrored path will commit the changes directly to the remote server and sync the path for you. If the server is temporarily unavailable, just switch back to local and merge back later.

You could also merge individual changes. Find the change number you want with svk log, and use:


    $ svk cmerge -c 113,125-128,130 //project/trunk //project/stable

Now if you are working on projects where you don't have the permission to commit, you could easily generate a diff and submit it to the author:


    $ svk diff //project/trunk //project/local

Working with Multiple Repositories

Many people track development of several projects. Once you use svk to mirror the projects, you can run svk sync -a to sync all of them.

Now suppose another hacker uses svk and adds a feature to the project and publishes his own branch, and you wish to experiment with or utilize his feature:


    $ svk mirror //project/new-feature http://svn.somewhere.else/repos/trunk
    $ svk sync //project/new-feature

Then you could merge the changes from him:


    $ svk smerge //project/new-feature ~/work/project

Or you might decide to merge that branch to trunk directly:


    $ svk smerge //project/new-feature //project/trunk

You could also use the cmerge command described above to merge specific changes only from that new-feature branch.

This is the minimum case of the distributed development model. The idea is that everyone can create a private branch of the product, which the maintainer then merges back. There have been arguments against such a model, but I am not going into them here. Although tools inevitably promote certain models for solving problems, we can change the model, or just use another tool, when we have to.

There are several features planned in the near future:

Changeset signing and verification

Signing the modified files in a commit with gpg. This is already done; it's just that the SVN::Mirror side hasn't yet been able to propagate and verify the signatures.

VCP integration

This would enable mirroring (and thus branching) from alien version control systems, like cvs or perforce. Imagine:

    $ svk mirror //foo/fromcvs cvs://cvs.server/foo@trunk

This will make me (and perhaps other people) more comfortable when working with projects which use other version control systems, and also less confused when switching between the different command sets used by different projects.

Patch manager

Non-committers can already generate a diff easily, as shown above. It would be good to register a merge history with the patch manager too; large projects that need to merge many developer-submitted patches would find it handy to be able to review, test, and then click-to-apply a particular change.

The development of svk is rather rapid, so expect these features soon!

Conclusion

Five months after the birth of svk, it has become a fast, full-featured distributed version control system. This is possible mainly because of the flexibility of Perl, and the spirit of Perl: use what already exists to create new things. Besides, the commands are designed to DWIM!

If you find it interesting, get a copy from the home page and install it just like any other Perl module. Hopefully I'll then receive your complaints, make svk better, and make the open source world more productive.
