This Week on p5p 2001/08/15

Aug 15, 2001 by Simon Cozens

Notes

This Week on P5P

• POD Specification
• Unicode Normalisation
• Threading Semantics
• perlmodstyle
• Shrinking down the Perl install
• Various

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month. Changes and additions to the perl5-porters biographies are particularly welcome.

I’m back. This week saw the usual just-over-400 messages.

POD Specification

Sean Burke attempted to save the world:

Over the past several years, I have heard nothing but griping from all quarters about the perpetually underspecified state of perlpod, when considered as a pod specification. A markup language without a clear specification simply invites everyone, including implementors, to have their own shaky idea of what the language means. And that, along with the general tendency for markup language parsing to produce write-only code, explains much of the upsetting current state of Pod::* modules.

And so he did something about it. He wrote two documents, a completely rewritten perlpod and a new perlpodspec, which clarified the sense of POD without adding anything much in the way of new features. The spec in particular is extremely useful for those contemplating writing POD formatters. Jarkko reported that he’d be including them in 5.8.0, but Russ complained that the translators didn’t fully match the specification. Jarkko’s primary concern was cleaning up the horrendous state of the L<..> tag, which is often abused. Russ picked up primarily on Sean's request that translators assume the input text to be UTF-8, although what that actually means is not specified. Sean clarified what he meant in quite a long message:

First off, my intent is to declare Unicode to be POD’s “the reference character set” (ugh, and I thought I could get out of this without using SGML jargon), for purposes of resolving E<number> sequences.

So, for example, if I say E<233>, that is to mean the e-acute character, because in Unicode, code point 233 is e-acute.

Whether “print ord 233” on your terminal prints an e-acute, an ess-tsett, a gimel, a “tsu” katakana, a double-dagger, or whether it lasers a hole thru your monitor’s glass, is a whole different problem.

E<233> does NOT mean “simply pass a literal 233 blindly thru to the formatter”. E<233> and its exact synonym E<eacute> both merely mean “make a reasonable attempt to make an e-acute”.

This wandered off into the usual Unicode semantics debate. Philip Newton threw a brilliantly unexpected spanner in the works:

(Oh, by the way, if someone writes POD on an EBCDIC machine, *all* of the bytes will have code points > 127 AFAIK if the author sticks only to letters and numbers.)

Peter Prymmer also chastised the use of the word “ASCII” in the specification.

Sarathy suggested we mandate POD to be in Unicode, and move towards assuming all Perl code to be in UTF8, and mandating unicode semantics on things like chr $x when [$x] is less than 255. There did not appear to be anyone forging his posts.

Unicode Normalisation

And while we’re on the subject of Unicode, Sadahiro Tomoyuki rocks my world. Oh my. Not only has he produced an alternative and much cleaner (and more complete) module to handle normalization of Unicode data, he achieved the near-impossible and implemented the Unicode collation algorithm detailed in Unicode Technical Report 10. This is a major bonus, since it allows us to correctly compare and store Unicode strings.

(Incidentally, Nick Ing-Simmons pointed out that “normalize” is incorrect, according to OED and Fowler. Surprisingly for p5p, this generated less comment than the technical content of the thread.)

Threading Semantics

Artur reported on the remaining problems with iThreads:

Request for threads->kill, to kill a given thread, is this something we should have? Should we try to do what POSIX does with all these cancelation points and cancelation callbacks? We know what mutexes we own so we can make sure they are all canceled. My belief [is] that this is needed.

On unix this can be done by a signal(?), I don’t know about Win32.
Right now, it seems like the parent interpreter must still stick around for its children. I think it will have to stay like this for 5.8 and something that can be fixed for 5.10 with a global SV arena; right now we use the first interpreter as the store.
Quitting the main thread; now, this kills all threads (as normal threading). However this usually comes at bad times. And we don’t get proper cleanup (and segfaults). I think we should wait for all threads to finish, letting the user kill them.
lock, unlock, share, are now implmented in threads::shared, You have to manually unlock(), it is not scope based. I think the implementation of lock() unlock() should perhaps move into pp_lock and pp_unlock, share() cond_*() needs new reference prototype.
Sharing probably needs a new kind of magic, but I think that can wait until we see sharing working as we want.
A shared blessed object: should the destructor be called once for each thread, or in the thread where it actually is destroyed? Is there a way to catch blessings to magic/tied variables? If not, I guess we can not rebless shared data structures.

Dan Sugalski suggested that people ought to write shutdown synchronisation code themselves, to ensure that the shared data structures are in a sane state. Benjamin Stuhl asked for a detach method to disown other threads, so that the main interpreter can shutdown while blissfully ignorant of what’s going on elsewhere. Dan made the very good point that we can provide surprising behaviour because:

When you come to threaded programming without any experience, everything is surprising, and lots of things don’t make sense. Inexperienced thread programmers *will* screw themselves up. Repeatedly. Threads are a very powerful tool, but one with no guards to speak of.

The discussion got heated soon afterwards, with Artur trying to avoid coredumps because coredumps are bad, but Dan maintaining that most things that happen with threads are bad, and coredumps are occasionally unavoidable - if you exit your interpreter without shutting down all your threads, that’s an abnormal exit and anything could happen.

Artur did, however, go ahead with the new shared_sv implementation, adding two files to core and core functionality for sharing, locking and correctly refcounting shared SVs. Wow.

perlmodstyle

Kirrily Robert (Skud) stormed onto the scene with a fantastic first contribution: she submitted a document on Perl module style conventions, which was generally pretty well received, with a few minor nits.

The ensuing thread had some very interesting thoughts on various style considerations. Benjamin Frantz did a benchmark of different parameter passing conventions, finding that passing arrays is 200,000ths of a second slower than passing hashes. So if you need a bit more speed from your Perl code, turn your arrays into hashes. This massive efficiency gain didn’t impress Michael Schwern, though:

Speed isn’t really the issue, clarity of interface is. You don’t want to confuse things in first-time module author docs by bringing up efficiency arguments. It just confuses things.

This then turned into Yet Another CPAN Argument. Elaine Ashton did plea for help on behalf of the CPAN testers; if you want to be a hero, see testers.cpan.org.

Skud also hinted she’s working on a perltesttut. She also tried to herd date and time module authors onto the datetime list. Not a bad start. Think you can do better?

Shrinking down the Perl install

Alan Burlison gave out the now-common howl: Perl has doubled in size from 5.005_03 to 5.6.1. However, he was trying to cram it onto the Solaris miniroot CD so that people can write Jumpstart scripts in Perl. Jarkko suggested throwing out pod/, (which Alan had already done) unicode/ (with a caution that attempting to use Unicode stuff would make demons fly out of your nose.) the headers from CORE/, plus a bunch of the libraries.

Dan Sugalski suggested stuffing everything into miniperl, statically linking in all the extensions.

Dave Mitchell asked what the big Unicode files were; Jarkko explained that they could be stored compressed or left out completely, since the information in them is stripped out into the various unicode/*.pl files. (By the way, unicode/ has been renamed unicore/ so that we can merge in Unicode::* modules without having to worry about case-folding filesystems.) The only casualty would be UnicodeCD.

Andy Dougherty pointed to the Debian perl-base package, which is a meagre 1.2Mb. This is possible by turning off the autoloader in some of the packages, and chopping down lib/ to the absolute bare essentials. The really serious cut, of course, would be to just ship perl, libperl.so and Config.pm.

Various

There was a reasonably big thrad on wild-card expansion on Windows that I just couldn’t follow at all.

Artur asked why perl_run and not perl_destruct called the END blocks; Sarathy said it was a bug and wanted a patch, which Artur provided.

Jerrad Pierce asked how to get modules that exist in the core and not on CPAN, such as the fixed-up version of English. Schwern encouraged him to take the module back to CPAN.

I discovered two old, dust-covered opcodes; Abhijit Menon-Sen breathed live into one of them, rcatline, which optimizes $a .= <foo>. Actually, Abhijit did all sorts this week, including: fixing the calling of FIRSTKEY on tied hashes, allowing localised tying on filehandles, stopping FETCH being called twice on tainted data, and, to defend accusations that he only fixes bugs without adding new functionality, added the ever-useful panic operator.

Tony Bowden asked what would happen to the my Dog $sam declaration now pseudohashes are dead. The responses were: i) pseudohashes aren’t dead yet, they just smell that way, and ii) nothing, since the fields pragma would still do the trick.

The old more-than-256-files-open-on-Solaris question came up again: this time, though, we had a sensible answer. PerlIO can be used to get around the Solaris limitation. Alan explained that there was a section in perlsolaris about the limitation, which was due to the extreme backwards compatibility of Solaris harking back to a time before it could conceive of more than 256 files.

Someone discovered that you can tie a variable with an object. The utility of this was debated, and Schwern conceded that there’d be no harm in documenting it. That’d be a nice little small task for someone.

Andy Dougherty dropped in a couple of BSD patches, to the Makefile generation and the hints file. Paul Johnson made B::Concise recognise padops.

James Duncan allowed BEGIN blocks to be more visible from B. Robin Houston asked whether or not we should move PL_minus_c from a bool to a U8 to give us more flexibility; apparently we currently to (PL_minus_c & 0x10) which, as Robin points out, is a “rather wrong thing to do to a bool”. I wonder when the Perl core was formally forbidden from doing wrong things.

Until next week I remain, your humble and obedient servant,

Simon Cozens

Tags

community