This Week on p5p 2000/12/03

Dec 4, 2000 by Simon Cozens

Notes
Tests
Charnames
Regular Expression Bug
xsubpp
Perlipc Examples Buggy
PerlIO news
Dodgy Function Names
Lvalue Subs
Various

Notes

You can subscribe to an e-mail version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to simon@cozens.net.

Tests

I opted not to mention this last week, but Casey Tweten pointed out that it was quite important for module authors: regression tests. You write regression tests, right? Of course, you do. The problem is that there are umpteen gazillion ways to write regression tests, which makes it horrible to debug them and find out what’s really happening when a test fails. There’s a core module called Test that gives a neat framework for writing tests. There was some noise on perl5-porters to the effect that people wanted the core regression tests to use Test, but the counter-argument was that the core tests should be kept as free from outside interference as possible - the Test module may contain some constructs that the core tests are trying to test. It’s no good loading a module to help with your tests if one of your tests is whether you can load a module or not! Nevertheless, it would be nice if some of the more advanced tests were converted to the Test interface. (That was a hint, by the way, for anyone who fancies doing that.)

The real outcome of this was a patch by Casey to convert the standard module template generated by h2xs to use the Test module, and to encourage (i.e. force) module authors to use it. So, module authors, if you don’t know about Test.pm, you will soon.
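For anyone who hasn't met it yet, here is a minimal sketch of a test script written against Test.pm. The `add` function is a made-up stand-in for whatever your module provides; only `plan` and `ok` come from the Test module itself.

```perl
use strict;
use warnings;
use Test;

# Declare up front how many tests will run; Test.pm prints the plan line.
BEGIN { plan tests => 3 }

# Hypothetical function under test, standing in for your module's code.
sub add { my ($x, $y) = @_; return $x + $y }

ok(1);                              # one argument: passes if it is true
ok(add(2, 3), 5);                   # two arguments: got, then expected
ok("perl5-porters", qr/porters/);   # expected may also be a regex
```

Every test goes through the same `ok` call, so a failure report always looks the same no matter who wrote the test.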

Charnames

This patch from Ilya took me a while to get my head around, but now that I have, I think it’s beautiful. When doing Unicode testing and entering Unicode data without a Unicode editor, we have to resort to things like

    $x =
    "\x{395}\x{3CD}\x{3B1}\x{3B3}\x{3B3}\x{3B5}\x{3BB}\x{3CA}\x{3B1}";

    $x =
    v917.973.945.947.947.949.955.970.945;

or even

    $x = 
    "\N{GREEK CAPITAL LETTER EPSILON}\N{GREEK SMALL LETTER UPSILON}...";

This is a nightmare.

Ilya’s solution allows you to enter Unicode texts in foreign languages as Latin transliterations. He gives a module that provides Russian transliterations, so with Ilya’s module you can now do:

    use charnames qw(cyrillic);
    $x = "\N{Il'ya Zakharevich}";

and Perl will do the right thing. The suggestion is to have a few transliteration modules in the core for testing and to have less-commonly used ones on CPAN.

However, in many non-Latin languages, transliteration to the Latin alphabet is vague at best, and there are usually several different methods of doing so; worse, the mappings are sometimes nonreversible and/or non-one-to-one. Ilya’s module for Russian is neat, but doesn’t cover everything.
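To see why the existing spellings are interchangeable (and equally painful), here is a small sketch that builds the first two characters of the Greek string above in all three ways and compares them. It uses only the standard charnames pragma, not Ilya's transliteration patch; the full name of the second character is taken from the Unicode character database, since the example above abbreviates it.

```perl
use strict;
use warnings;
use charnames ':full';

# The same two characters, spelled three different ways.
my $hex   = "\x{395}\x{3CD}";
my $vstr  = v917.973;
my $named = "\N{GREEK CAPITAL LETTER EPSILON}"
          . "\N{GREEK SMALL LETTER UPSILON WITH TONOS}";

print "all three spellings agree\n"
    if $hex eq $vstr && $hex eq $named;

# charnames::viacode maps a code point back to its official name.
for my $char (split //, $hex) {
    printf "U+%04X %s\n", ord($char), charnames::viacode(ord($char));
}
```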

Regular Expression Bug

Jarkko has been turning up all sorts of wonders with his experiments in UTF8 regular-expression land. This time, he has found that

    use utf8; @a=("b" =~ /(.)/)

will cause a segmentation fault, which is horrid. Worse, this only seems to fail on 64-bit platforms, regardless of the setting of use64bitint, which suggested some hidden assumption. Eventually, it was traced to a careless read in sv_utf8_downgrade; Jarkko says:

Why the different platforms behave so differently (core dump vs. no core dump) on this bug is a bit of a mystery, but if I had to guess I would mumble something like ‘alignment.’

This is why being the Configure pumpkin is such a demanding job.

Another core dump came from

    use utf8; "," =~ /([^,]*,)*/

and another from

    use utf8; 
    $x = $^R = 67;
    "foot" =~ /foo(?{ $^R + 12 })((?{ $x = 12; $^R + 17 })[xy])?/;

which was traced to a failure to save and restore the parentheses count. Again, the symptoms were confusingly different on different machines.
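Crashers like these are natural candidates for regression tests. As a sketch only (the table layout and descriptions are my own, not taken from the core test suite), the first two cases could be pinned down like this; on a perl with the bugs fixed, both checks should simply pass.

```perl
use strict;
use warnings;
use utf8;

# Each case pairs a description with a check that returns true on success.
my @cases = (
    [ 'single-character capture' => sub {
          my @a = ("b" =~ /(.)/);
          return @a == 1 && $a[0] eq 'b';
      } ],
    [ 'repeated capture group' => sub {
          return "," =~ /([^,]*,)*/ && $1 eq ',';
      } ],
);

# Run every check and collect the descriptions of the ones that fail.
my @failed = map { $_->[0] } grep { !$_->[1]->() } @cases;
print @failed ? "failed: @failed\n" : "all regex cases passed\n";
```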

xsubpp

Ilya produced a patch for xsubpp that allows the OUT and IN_OUT keywords; this is in addition to the old IN_OUTLIST and OUTLIST keywords.

These are somewhat confusing, but here’s my understanding of what they do: A parameter in a C function marked OUTLIST will have its value at the end of the function added to the list of return values to Perl. A parameter labeled IN_OUT will be read from a Perl variable at the beginning of the C function, and the value of the C variable at the end of the C function will be put back into the Perl variable. In effect, IN_OUT gives you a pointer to write through, which is “tied” to a Perl variable. IN_OUTLIST does the same, but instead of writing the value back to the Perl variable, it goes onto the list of return values.

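An OUT parameter, as far as I can tell, works like IN_OUT except that the initial value of the Perl variable is never read: only the value at the end of the C function is written back. Decide for yourself.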
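If the XS keywords are hard to picture, a pure-Perl analogy may help. This is not XS and not what xsubpp generates; the three subs below are invented purely to mimic, from the caller's side, the behaviour described above for IN_OUT, IN_OUTLIST and OUTLIST.

```perl
use strict;
use warnings;

# IN_OUT analogy: read the caller's variable and write the new value
# back into it (elements of @_ alias the caller's arguments).
sub double_in_out { $_[0] *= 2; return }

# IN_OUTLIST analogy: read the argument, leave the caller's variable
# alone, and append the final value to the list of return values.
sub double_in_outlist {
    my ($n) = @_;
    $n *= 2;
    return ('done', $n);
}

# OUTLIST analogy: the extra result only ever comes back in the list.
sub divmod_outlist {
    my ($num, $den) = @_;
    return (int($num / $den), $num % $den);
}

my $value = 21;
double_in_out($value);                          # $value itself changes
my ($status, $twice) = double_in_outlist($value);
my ($quot, $rem)     = divmod_outlist(17, 5);
print "value=$value twice=$twice quot=$quot rem=$rem\n";
```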

Perlipc Examples Buggy

Nicholas Clark gave what I shall call an “impassioned appeal” about the state of the perlipc documentation; some of the examples didn’t even compile, much less do what they claimed to do. This also turned up a problem with Net::hostent, which was particularly embarrassing since Net::hostent didn’t have a regression test. Nicholas wiped up the worst of the perlipc bugs, and provided a basic regression test, which Robert Spier expanded. As Jarkko pointed out, writing a portable test for it is tricky, but any test is better than none.

(Hey, maybe someone would like to try writing a program that automatically extracts example code from the documentation and makes sure it compiles?)
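As a starting point for anyone tempted, here is a rough sketch of such a checker. It assumes that examples live in verbatim (indented) POD paragraphs and that each one is complete enough to compile on its own, which many documentation fragments will not be. The function names `extract_verbatim` and `compiles` are invented for this sketch. Note that `perl -c` still runs BEGIN blocks and `use` statements, so don't point it at code you don't trust.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Collect verbatim (indented) paragraphs that appear inside POD,
# joining adjacent ones so a multi-paragraph example stays together.
sub extract_verbatim {
    my ($pod) = @_;
    my (@blocks, $in_pod, $prev_verbatim);
    for my $para (split /\n\s*\n/, $pod) {
        if ($para =~ /^=(\w+)/) {               # POD command paragraph
            $in_pod        = $1 ne 'cut';
            $prev_verbatim = 0;
        }
        elsif ($in_pod && $para =~ /^[ \t]/) {  # verbatim paragraph
            if ($prev_verbatim) { $blocks[-1] .= "\n\n$para" }
            else                { push @blocks, $para }
            $prev_verbatim = 1;
        }
        else {                                  # ordinary prose
            $prev_verbatim = 0;
        }
    }
    return @blocks;
}

# Write the code to a temporary file and ask perl to compile it only.
sub compiles {
    my ($code) = @_;
    my ($fh, $file) = tempfile(SUFFIX => '.pl', UNLINK => 1);
    print {$fh} $code;
    close $fh or die "close failed: $!";
    my $output = qx{"$^X" -c "$file" 2>&1};     # output is discarded
    return $? == 0;
}

# A tiny made-up document: one good example and one broken one.
my $pod = join "\n",
    '=head1 EXAMPLE', '', 'Some prose.', '', '    my $x = 1;', '',
    'More prose.', '', '    my $y = ;', '', '=cut', '';

for my $block (extract_verbatim($pod)) {
    printf "%-5s %s\n", compiles($block) ? 'ok' : 'FAILS', $block;
}
```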

PerlIO news

Using my magic crystal ball, I found that this week saw 500 patches to the Perl repository. Naturally, the bulk of them - a massive 400 - were the development main line, 32 were Sarathy integrating bunches of patches into 5.6.1-to-be, but the remaining 67 were Nick beavering away on the PerlIO branch. This should remind you that most of the PerlIO improvements happen without much advertisement, and it’s easy to be unaware of exactly how much work is going on there.

Here’s what Nick says about how PerlIO is going:

-Duseperlio now works as a replacement for stdio on UNIX platforms. As of last weekend, it was also working in “same functions as before” mode on Win32 in Win32’s “simple” configuration. There has been some progress, but not success, in getting OS/2 in line. (Nothing on VMS yet.)

This week’s target is the PERL_IMPLICIT_SYS scheme on Win32 that is needed for fork() emulation. Once that is built, the plan is to replace low-level pseudo-unix read() on Win32 with our own version.

The other area of work is to turn on use of PerlIO to allow files to be read/written as utf8 under programmer control.

Once that works, then we hook PerlIO to Encode - and we are “done” ;-) (This is actually a bit messy right now as PerlIO is deep under the core, and Encode.pm is an external XS module.)

Since Nick is going to be allowing layers to be accessible under programmer control, we need to know what layers ought to do, and this was Nick’s question: “So would anyone care to remind me what the Unicode issues were that we want to solve?”

Briefly, we want to be able to read UTF8-encoded text into UTF8-encoded SVs, and to write it back out the same way. Layers could also provide the CRLF translation magic used on DOS-derived systems, and could replace the source filter mechanism.
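For a feel of what programmer-controlled layers look like in practice, here is a sketch using the layer syntax that eventually shipped with perl 5.8 and later (`:encoding(...)`, `:crlf`, `:raw`). None of this existed in this form at the time of writing, so read it as an illustration of the goal rather than of Nick's branch. The `slurp_raw` helper is my own.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a file back as raw bytes, with no layer translating anything.
sub slurp_raw {
    my ($file) = @_;
    open my $fh, '<:raw', $file or die "open $file: $!";
    local $/;
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

my (undef, $file) = tempfile(UNLINK => 1);

# Characters go in; the encoding layer turns them into UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $file or die "open $file: $!";
print {$out} "\x{395}\x{3CD}\n";
close $out or die "close $file: $!";
my $utf8_bytes = slurp_raw($file);

# Reading through the same layer turns the bytes back into characters.
open my $in, '<:encoding(UTF-8)', $file or die "open $file: $!";
my $chars = <$in>;
close $in;

# The :crlf layer does the DOS-style newline translation on output.
open my $dos, '>:crlf', $file or die "open $file: $!";
print {$dos} "line\n";
close $dos or die "close $file: $!";
my $crlf_bytes = slurp_raw($file);

printf "%d characters came back from %d bytes on disk\n",
    length($chars), length($utf8_bytes);
```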

Dodgy Function Names

This causes a syntax error:

    sub f {}
    $x-f($y);

This is because Perl assumes that -f is a file-test operator, and wonders what it’s doing next to a variable with no binary operator in the middle. Some people, including Jarkko, thought that was silly; if I define a sub f, Perl should know that I’m trying to call that subroutine.

This naturally applies not just to the file tests, but to any other operators that look like functions, such as y and s. Several solutions were proposed, such as forcing Perl to use the subroutine, or outlawing subroutines with “reserved” names. In the end, Jarkko produced a patch for the file tests that spits out a warning in the above case - I think the y, m, and other cases are still on the loose. The whole thread (36 messages) is worth reading, if only so you can get an idea of what nefarious things Perl porters get up to when syntax goes bad.
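In the meantime, the ambiguity is easy to sidestep in your own code. These two spellings are my own suggestions rather than anything proposed in the thread: either put whitespace after the minus sign, or call the subroutine with an explicit `&`. The body of `f` is made up for the demonstration.

```perl
use strict;
use warnings;

# A one-letter sub whose name collides with the -f file test.
sub f { return 2 * $_[0] }

my ($x, $y) = (10, 3);

# Whitespace after the minus means "-" cannot be read as part of "-f".
my $spaced = $x - f($y);

# An explicit & marks what follows as a subroutine call.
my $sigil = $x-&f($y);

print "spaced=$spaced sigil=$sigil\n";
```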

Lvalue Subs

Casey asked for more useful lvalue subroutines. At the moment, you can say things like:

  package Person;
  sub new { my ($class, $name) = @_; bless { name => $name }, $class }
  sub name : lvalue { $_[0]->{name} }

  package main;

  my $p    = Person->new;
  $p->name = "casey";
  print $p->name . "\n";

Note that $p->name on the left-hand side of that assignment is actually a method call returning an lvalue. Cool, huh?

Casey mentioned that he’d really like some way of getting at the right-hand side of the assignment as well, in order to do things like implementing substr in pure Perl. Rick Delaney suggested that you could return a tied lvalue, but Casey replied that that was slow; the alternative was yet another global. Piers appealed for a faster tie system, which is fine, but someone has to design it, code it and make it better than the current one while doing all the same things.

Yitzchak pointed out that there’s more to lvalues than just the assignment context, and having a way to get at the rvalue would probably break in nonassignment cases.
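To make Rick's suggestion concrete, here is a sketch of an lvalue method that returns a tied scalar, so that the right-hand side of the assignment arrives in STORE where it can be inspected. The NonEmpty class and its validation rule are invented for the example, and it pays exactly the tie overhead Casey was complaining about, since a fresh tie is set up on every call.

```perl
use strict;
use warnings;

package NonEmpty;    # hypothetical tie class that vets each assignment

sub TIESCALAR { my ($class, $slot) = @_; return bless { slot => $slot }, $class }
sub FETCH     { my ($self) = @_; return ${ $self->{slot} } }
sub STORE {
    my ($self, $value) = @_;          # $value is the right-hand side
    die "name must not be empty\n"
        unless defined $value && length $value;
    ${ $self->{slot} } = $value;
}

package Person;

sub new { my ($class) = @_; return bless { name => undef }, $class }

sub name : lvalue {
    my ($self) = @_;
    # Tie a proxy to the real slot, then hand the proxy back as the lvalue.
    tie my $proxy, 'NonEmpty', \$self->{name};
    $proxy;
}

package main;

my $p = Person->new;
$p->name = "casey";                   # goes through NonEmpty::STORE
print $p->name, "\n";

eval { $p->name = "" };               # STORE refuses the empty string
print "rejected: $@" if $@;
```

As Yitzchak's point suggests, this only covers plain assignment; other lvalue contexts may not behave the way you would hope.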