Notes

You can subscribe to an email version of this summary by sending an empty message to perl5-porters-digest-subscribe@netthink.co.uk.

Please send corrections and additions to perl-thisweek-YYYYMM@simon-cozens.org where YYYYMM is the current year and month.

We saw about 325 messages this week, not including test results. Note that the mailing list for this summary has moved over to NetThink, joining the Perl 6 summary which you can subscribe to by sending a message to perl6-digest-subscribe@netthink.co.uk.

Unicode Fun

You know I can't resist it; there were three fairly big Unicode threads this week.

I kicked off by providing a patch which enables Unicode literals of the form U+89AB. Jarkko and Andreas reasonably complained that we have enough magical token types already; I countered by suggesting v0x89AB, but haven't produced a patch yet. Philip Newton pointed out that this was pretty much the same as "\x{89AB}", so was probably not necessary. However, a few people found it intuitive, so I'll look at it when the next supply of tuits arrives.
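To make Philip's point concrete, here is a minimal sketch of the ways Perl can already spell that character. It only uses the existing "\x{...}" escape, chr, and a single-ordinal vstring; neither proposed syntax (U+89AB or v0x89AB) appears, since no patch for the latter exists.

```perl
use strict;
use warnings;

# The existing escape: one character with code point U+89AB.
my $escaped = "\x{89AB}";

# chr() with a hex number builds the same one-character string.
my $from_chr = chr(0x89AB);

# A vstring with a single decimal ordinal (0x89AB == 35243) does too.
my $from_vstring = v35243;

my $all_equal = ($escaped eq $from_chr && $from_chr eq $from_vstring) ? 1 : 0;

# Print only numbers, so no wide-character output is involved.
printf "U+%04X, length %d, all equal: %d\n",
    ord($escaped), length($escaped), $all_equal;
```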

Jarkko, meanwhile, found a really, really weird Unicode bug related to string sharing in hash keys and certain 64 bit platforms. The code

    use utf8;
    $FOO{BAR};

produced a warning about cleaning up the string when perl exits.

I say "certain platforms", but it proved very difficult indeed to reproduce, and Jarkko finally tracked it down with a combination of brute force and finesse.

There were a number of problems of understanding with versions and vstrings this week; several serious, and one silly. Firstly, $] generates 5.006, rather than 5.6; Perl still juggles the two versioning conventions, but this means that

    if ($] < 5.6) ... 

doesn't do the right thing. You should remember that $] produces old-style "floating point" versions, and $^V produces new-style Unicode versions.

Next, there was the misunderstanding about how vstrings construct strings: vstrings are not bare words, but they are another way of generating strings. Hence,

    if ($^V lt "v5.6")

won't work; indeed, the v doesn't show up in v5.6 at all, and v5.6 is very different from "5.6".
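A small sketch of the comparisons that do behave, assuming you want to test for 5.6.0 or later. The variable names are made up for illustration.

```perl
use strict;
use warnings;

# $] is an old-style number such as 5.006, so compare it numerically
# against the padded form, not against 5.6.
my $old_style_ok = ($] >= 5.006) ? 1 : 0;

# 5.006 is numerically smaller than 5.6, which is why "$] < 5.6" misfires.
my $naive_is_wrong = (5.006 < 5.6) ? 1 : 0;

# A vstring literal is a string made of characters with the given ordinals.
my $vstring       = v5.6.0;
my $same_as_chars = ($vstring eq chr(5) . chr(6) . chr(0)) ? 1 : 0;

# The quoted string "v5.6" starts with a real letter "v"; the vstring does not.
my $quoted_differs = (v5.6 ne "v5.6") ? 1 : 0;

# Compare $^V against a vstring literal, not a quoted string.
my $new_style_ok = ($^V ge v5.6.0) ? 1 : 0;

# sprintf's %vd format prints the ordinals joined by dots.
my $printable = sprintf("%vd", $vstring);

print "old-style ok: $old_style_ok, new-style ok: $new_style_ok, ",
      "vstring prints as $printable\n";
```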

For the silly one, you'll just have to read the list. :)

nlink and temporary file systems

Linux recently introduced a temporary file system, tmpfs, which is like the Solaris tmpfs but done right. Unfortunately, it wasn't done right enough for us: File::Find was having problems because the nlink field of stat was inaccurate; one suggestion was just to set the dont_use_nlink configuration variable, but that's horrendously slow. Tels provided a patch to File::Find which avoids using nlink where possible, but Andy pointed out that it isn't a general solution, because it could get expensive to stat every directory to make sure you haven't passed a filesystem boundary and need to change your nlink usage. Andreas got in touch with the author of Linux's tmpfs who provided a patch to it. Alan Cox bitched that applications demanding traditional Unix semantics for nlink were buggy, but applied the patch anyway, so tmpfs and File::Find will play nice again at some unspecified point in the future.
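For reference, a rough sketch of the slow-but-safe workaround. It assumes the knob you reach for is File::Find's package variable $File::Find::dont_use_nlink; the temporary directory and file names are invented for the example, and this is not Tels's patch.

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Stop File::Find from trusting the nlink count of directories.
# Every entry then gets examined, which is why this is the slow option.
$File::Find::dont_use_nlink = 1;

# Build a tiny throwaway tree: <root>/sub/leaf.txt
my $root = tempdir(CLEANUP => 1);
my $sub  = File::Spec->catdir($root, 'sub');
mkdir $sub or die "mkdir $sub: $!";
my $file = File::Spec->catfile($sub, 'leaf.txt');
open my $fh, '>', $file or die "open $file: $!";
close $fh or die "close $file: $!";

# Collect the names of plain files found under the root.
my @seen;
find(sub { push @seen, $_ if -f $_ }, $root);

print "found: @seen\n";
```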

Parsing XML

One of the targets for Perl 5.8 is to speed up XML parsing, but nobody really has any idea how to do that. The current XML::Parser uses an external library, which means that a lot of speed is lost in flapping around in XS. The idea was mentioned of a pure-Perl version, which we would then be able to ship in core. Jarkko says:

Okay, I lied. I do have an opinion: relying on an external library to do XML parsing is weird. expat is nice and is a de facto standard, and reinventing the wheel that has already been extensively invented and debugged is silly in the extreme -- but we are, after all, supposed to be The Text Processing Language.

Matt Sergeant, as ever, had good XML suggestions:

If you do that, I suggest/recommend at least doing it the Python way - by letting XML experts (i.e. a SIG) discuss what would be the best way to do it. Note also that ActivePerl ships with XML::Parser, though in a few months it may not necessarily be the best option any more.

Also, if you do add one, it should definitely conform to the Perl SAX API (either v1 or v2), as that's the way perl-xml is heading.

He also pointed out that:

The speed problem is that expat is basically a callback/event based parser, so you have a storm of events crossing the XS/Perl barrier, meaning that you're constantly building SV's. Orchard can get around this by doing the parsing to a tree structure in C. (Note that Orchard is also based on expat). Or it can also do SAX based event passing, but again that's about as slow as XML::Parser.

Doing it all in Perl is possible, but not entirely trivial to get exactly right. XML has a lot of annoying nuances that were left in from SGML, mostly to do with DTDs. And while I don't use most of the annoying features, I think I'd be upset if the core Perl started shipping an "XML" parser that wasn't fully XML compliant.

I wonder if the perl-xml mailing list could go chew on that and let us know the best way to proceed.

IV Preservation Wars

There was an overly long and underly helpful discussion between Ilya, Jarkko, and Nicholas Clark about the merits and the implementation of the IV preservation patch. (That's the thing that means $a = $b+$c is done as an integer rather than a floating point operation where possible.)
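A minimal sketch of why that matters, assuming a perl built with 64-bit IVs (checked through the Config module). Past 2**53 an IEEE double can no longer represent every integer, so an addition done in floating point would lose the final digit.

```perl
use strict;
use warnings;
use Config;

my $big = 9007199254740992;   # 2**53
my $one = 1;

# With IV preservation and 64-bit IVs this is an exact integer add.
my $sum = $big + $one;

if ($Config{ivsize} >= 8) {
    print "integer add: $sum\n";
}
else {
    # Assumption not met: the literal does not fit in an IV on this build.
    print "32-bit IVs here, so the sum is done in floating point: $sum\n";
}
```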

For some reason this morphed into an overly long and underly helpful discussion between Ilya, Jarkko and the other Nick about Unicode. I stood back and let it happen.

Nobody had any code in either branch of the discussion, so the current implementation remains, no matter how dirty people think it is. If you really want to wade through it all, start here.

Memory Leaks and Refcounting Loops

Alan "The Plumber" Burlison has been exceptionally hard at work again this week tracking down memory leaks and situations where SVs are not being freed properly due to reference cycles.

Part of the problem with memory leaks is that because of the arena mechanism, they're pretty hard to spot; the "arena" is a chunk of memory which is allocated in advance and split up and divided out when new SVs are requested. The problem with this is that since everything comes from the same chunk of allocated memory, it's hard to detect where leaks are really happening; according to Alan:

...the current SV arena cleanup carries a very big shovel and bucket, and scrapes up all the camel dung at the end. However, in the process it *does* clean up things that really are *only* still referenced by the arena allocator, and therefore do in fact constitute memory that isn't accessible from anywhere else inside the interpreter (a squeak?). The upshot of this is that the arena allocator has the fortunate (?) side-effect of cleaning up squeaked SVs when the interpreter exits.

However it has the unfortunate side-effect of hiding squeaks in the rest of the interpreter, and it also hides the fact that the current attempt to delete PL_defstash doesn't work, and that magical things inside a stash squeak when they are deleted, and that refcounts don't get updated correctly when you delete something magical from a stash. Basically a lot of nasty stuff is hiding under the existing arena allocator. My aim is to dig out the cesspit.

He also found a nasty problem with removing XSUBs: because CVs have a pointer back to their GV, the GV has a reference count of 2. When the GV is deleted, that reference count drops to 1, and the CV is left stranded, without being cleanly removed. His suggested fix was to stop the artificial increasing of the GV's reference count. Nick Ing-Simmons said that he thought this was done to stop the function from disappearing while it was being called, meaning Perl would segfault.

Alan also found a really, really horrible problem where entries in a stash with a circular reference never get freed. This is, of course, one of the shortcomings of a reference counting garbage collection system, but it's still possible to delete an SV and remove all its references at the time you remove it from a stash. The usual "just add mark and sweep GC" discussion miraculously failed to materialise; perhaps everyone's tired out after having exactly the same discussion on perl6-internals. At any rate, changing Perl 5's garbage collection system is not on the horizon, but Alan's idea is simple and should work. Alan asked for comment, but everyone was too busy gawping wide-eyed in wonder.
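The underlying shortcoming is easy to show at the Perl level. The sketch below is not Alan's stash fix; it uses a made-up Node class and, as an assumption beyond what was discussed on the list, Scalar::Util's weaken to break the loop by hand.

```perl
use strict;
use warnings;
use Scalar::Util qw(weaken);

my $destroyed = 0;

{
    package Node;   # hypothetical class, for illustration only
    sub new     { return bless {}, shift }
    sub DESTROY { $destroyed++ }
}

# Two nodes pointing at each other: when the lexicals go out of scope
# each node is still referenced by its peer, so neither refcount hits zero.
{
    my $left  = Node->new;
    my $right = Node->new;
    $left->{peer}  = $right;
    $right->{peer} = $left;
}
my $after_cycle = $destroyed;

# Same shape, but one link is weak and does not hold a reference count,
# so both nodes are freed when the block ends.
{
    my $left  = Node->new;
    my $right = Node->new;
    $left->{peer}  = $right;
    $right->{peer} = $left;
    weaken($right->{peer});
}
my $after_weak = $destroyed;

print "destroyed after plain cycle: $after_cycle\n";
print "destroyed after weakened cycle: $after_weak\n";
```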

Test::Harness and the t/ directory

Mr. Schwern has, yet again, been doing some fantastic stuff with Perl QA. His suggestions include rearranging the test directory to allow multiple tests per module; allowing tests for core modules to be exactly the same as the tests in the CPAN version of the module; reworking t/TEST so that it uses Test::Harness; and honouring "todo" and "skip" notifications in a test suite so that known failures can be safely identified. Basically, you can now output

    not ok 13 # TODO cure cancer

in your test script, and the harness won't call it a bug. Schwern also extracted Test::Harness from the core and put it back as a CPAN module so it can be used by non-bleeding-edge people. It's a very useful module, but it doesn't seem to have appeared on CPAN yet; check it out when it does!
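A minimal sketch of a test script's output using both kinds of notification. The tap_lines helper and the skip reason are invented for illustration; only the TODO line comes from the example above.

```perl
use strict;
use warnings;

# Build the lines a test script would print: a plan, then one result per test.
# tap_lines is a hypothetical helper name.
sub tap_lines {
    return (
        "1..3",
        "ok 1",
        "not ok 2 # TODO cure cancer",   # known failure, flagged as todo
        "ok 3 # skip no network here",   # skipped test with a reason
    );
}

print "$_\n" for tap_lines();
```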

Various

Olaf Flebbe asked whether you can rsync the perl 5.6.x tree as well as the bleadperl; the answer is yes, you can. Simply change perl-current to perl-5.6.x in your rsync line, and you can follow the maintenance track.

Sarathy asked if anyone was interested in playing with waitpid on Windows and its fork emulation, so that calling waitpid would close the handle to the "child" thread. If that's something that would interest you, read Sarathy's mail.

Oh, and we got some spam. The first piece in a long while, thanks to the work of Richard and Ask.

Until next week I remain, your humble and obedient servant,


Simon Cozens