
Where Wizards Fear To Tread

So you're a Perl master. You've got XS sorted. You know how the internals work. Hey, there's nothing we can teach you on perl.com that you don't already know. You think? Where Wizards Fear To Tread brings you the information you won't find anywhere else concerning the very top level of Perl hackery.

Putting Down Your Roots

This month, we look at the Perl op tree. Every Perl program is compiled into an internal representation before it is executed. Functions, subroutine calls, variable accesses, control structures, and all that makes up a Perl program, are converted into a series of different fundamental operations (ops) and these ops are strung together into a tree data structure.

For more on the different types of ops available, how they fit together, and how to manipulate them with the B compiler module, look at the Perl 5 internals tutorial. Right now, though, we're going to take things a step further.
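Before going further, it's worth seeing how little code it takes to look at the op tree at all. Here's a minimal sketch using only the core B module (which ships with Perl): it walks this very program's ops in execution order and prints each op's name.

```perl
#!/usr/bin/perl
# A minimal sketch using only the core B module: walk the main
# program's ops in execution order and print each op's name.
use strict;
use warnings;
use B ();

# B::main_start is the first op of the main program in execution
# order; following the next pointers visits every op that will run.
my $op = B::main_start();
while ($$op) {              # a null op pointer marks the end of the chain
    print $op->name, "\n";
    $op = $op->next;
}
```

Running this prints op names such as nextstate (one per statement) and print, which is the flat, execution-order view of the tree structure described above.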

B and Beyond With B::Utils

The B module allows us to get at a wealth of information about an op, but it can become incredibly frustrating to know which op you want to deal with, and to perform simple manipulation on a range of ops. It also offers limited functionality for navigating around the op tree, meaning that you need to hold onto a load of additional state about which op is where. This gets complicated quickly. Finally, it's not easy to get at the op trees for particular subroutines, or indeed, all subroutines both named and anonymous.

B::Utils was created at the request of Michael Schwern to address these issues. It offers much more high-level functionality for navigating through the tree, such as the ability to move ``upward'' or ``backward,'' to return the old name of an op that has since been optimized away, to get a list of an op's children, and so on. It can return arrays of anonymous subroutines, and hashes of subroutine op roots and starts. It also contains functions for walking through the op tree from various starting points in various orders, optionally filtering out ops that don't match certain conditions while performing actions on the ops that do. Finally, B::Utils provides carp and croak routines which report errors from the point of view of the original source code.

But one of the most useful functions provided by B::Utils is the opgrep routine. This allows you to filter a series of ops based on a pattern that represents their attributes and their position in a tree. The major advantage over doing it yourself is that opgrep takes care of making sure that the attributes are present before testing them - the seasoned B user is likely to be accustomed to the carnage that results from accidentally trying to call name on a B::NULL object.
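For comparison, here's roughly what the defensive version looks like without opgrep, using only the core B module; every accessor has to be guarded by hand. (A sketch: the is_entersub helper and the inline anonymous sub are hypothetical.)

```perl
use strict;
use warnings;
use B qw(svref_2object);

# Without opgrep, every attribute test needs its own guard:
# calling ->name on a B::NULL object is a fatal error.
sub is_entersub {
    my $op = shift;
    return 0 unless $$op;               # null op pointer: nothing here
    return 0 if $op->isa('B::NULL');    # op was optimized away
    return $op->name eq 'entersub';
}

# Count the subroutine calls in a code ref by walking its ops in
# execution order, from START along the next pointers.
my $cv    = svref_2object(sub { some_sub() });
my $count = 0;
for (my $op = $cv->START; $$op; $op = $op->next) {
    $count++ if is_entersub($op);
}
print "found $count entersub op(s)\n";
```

opgrep rolls all of those guards into the pattern match, which is why the filters below can stay short.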

For instance, we can find all the subroutine calls in a program with


    walkallops_filtered (
        sub { opgrep( { name => "entersub" }, @_) },
        sub { print "Found one: $_[0]\n"; }
    );

opgrep supports alternation and negation of attribute queries. For instance, here are all the scalar variable accesses, whether to globals or lexicals:


    @svs = opgrep( { name => ["padsv", "gvsv"] }, @ops);

And as for checking an op's position in the tree, here are all the exec ops followed by a nextstate and then followed by something other than exit, warn or die:


  walkallops_filtered(
      sub { opgrep( {
                        name => "exec",
                        next => {
                           name    => "nextstate",
                           sibling => { 
                                         name => [qw(! exit warn die)] 
                                      }
                                }
                    }, @_)},
      sub {
            carp("Statement unlikely to be reached");
            carp("\t(Maybe you meant system() when you said exec()?)\n");
      }
  )

Don't Do That, Do This

So, what can we do with all this? The answer is, of course, ``anything we want.'' If you can mess about with the op tree, then you have complete control over Perl's operation. Let's take an example.

Damian Conway recently released the Acme::Don't module, which doesn't do anything:


    don't { print "Something\n" }

doesn't print anything. Very clever. But not clever enough. You see, I like double negatives:


    my $x = 1;
    don't { print "Something\n" } unless $x;

doesn't print anything either, and if you like double negatives, then you might agree that it should print something. But how on earth are we going to get Perl to do something when a test proves false? By messing about with the op tree, of course.

The way to solve any problem like this is to think about the op tree that we've currently got, work out what we'd rather do instead, and work out the differences between the op trees. Then, we write something that looks for a given pattern in a program's op tree and modifies it to be what we want.

There are several ways of achieving what we want, but the simplest one is this: add a second parameter to don't which, if set, actually does run the code. This allows us to replace any occurrence of

    don't { print "Something\n" } if (condition);

with


    don't(sub { print "Something\n" }, 1) unless (condition);

Let's now look at this in terms of op trees. Here's the relevant part of the op tree for don't { ... } if $x, produced by running perl -MO=Terse and then using sed to trim out the unsightly hex addresses:


    UNOP  null
        LOGOP  and
            UNOP  null [15]
                SVOP  *x
            UNOP  entersub [2]
                UNOP  null [141]
                    OP  pushmark
                    UNOP  refgen
                        UNOP  null [141]
                            OP  pushmark
                            SVOP  anoncode  SPECIAL #0 Nullsv
                    UNOP  null [17]
                        SVOP  *don::t

As we can see, the if is represented as an and op internally, which makes sense if you think about it. The two ``legs'' of the and, called ``first'' and ``other,'' are a call to fetch the value of $x, and a subroutine call. Look at the subroutine call closely: the ops ``inside'' it set up a mark to say where the parameters start, push a reference to anonymous code (that's our { ... }) onto the stack, and then push the glob for *don::t on there.
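We can poke at those two legs directly with the core B module. Here's a sketch (using a lexical $x rather than a global, and a hypothetical find_op helper that searches the tree depth-first):

```perl
use strict;
use warnings;
use B qw(svref_2object);

my $x = 1;
my $cv = svref_2object(sub { warn "hi" if $x });

# Depth-first search of the tree for the and op that the
# statement-modifier if compiles to.
sub find_op {
    my ($op, $want) = @_;
    return $op if $op->name eq $want;
    if ($op->can('first')) {
        for (my $kid = $op->first; $$kid; $kid = $kid->sibling) {
            my $found = find_op($kid, $want);
            return $found if $found;
        }
    }
    return undef;
}

my $and = find_op($cv->ROOT, 'and');
print "first: ", $and->first->name, "\n";  # the test leg
print "other: ", $and->other->name, "\n";  # start of the true branch
```

The ``other'' leg prints as pushmark rather than the op you might expect from the tree dump; the next section explains why.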

So, we need to do two things: We need to insert another parameter between refgen and the null attached to *don::t, and we need to invert the sense of the test.

Now that we know what we've got to do, let's start doing it. Remember our solution: stage one, write code to find the pattern.

This is actually pretty simple: We're looking for either an and or an or op, and the ``other'' leg of the op is going to be a call to *don::t. However, we have to be a bit clever here, since Perl internally performs a few optimizations on the op tree that even the B::* reporting modules don't tell you about. When Perl threads the next pointers around an op tree, it does something special for a short-circuiting binary op like and or or - it sets the other pointer to be not the first sibling in the tree, but the first op in execution order. In this case, that's pushmark, as we can see from running perl -MO=Terse,exec:


    LOGOP (0x80fa008) and
    AND => {
        OP (0x80f9f88) pushmark
        OP (0x80f9f20) pushmark
        SVOP (0x80f9ec0) anoncode  SPECIAL #0 Nullsv
        ...

With this knowledge, we can create a pattern to pass to opgrep:


    {
        name => ["and", "or"],
        other => {
            name => "pushmark",
            sibling => { next => { name => "gv" }}
        }
    }

Unfortunately, this doesn't tell us the whole story, since we actually need to check that the subroutine call is to don't, rather than to any other given subroutine that might be called conditionally. Hence, our filter looks like this:


    sub {
        my $op = shift;
        opgrep(
            {
                name => ["and", "or"],
                other => {
                    name => "pushmark",
                    sibling => { next => { name => "gv" }}
                }
            }, $op) or return;
        my $gv = $op->other->sibling->next->gv;
        return unless $gv->STASH->NAME eq "don" and $gv->NAME eq "t";
        return 1;
    }

We grab the GV (we know exactly where it's going to be because of our pattern!) and test that it's in the don stash and is called t.
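The STASH/NAME pair is easy to experiment with on an ordinary named subroutine. Here's a self-contained sketch (the Demo package is hypothetical):

```perl
use strict;
use warnings;
use B qw(svref_2object);

package Demo;
sub hello { "hi" }

package main;

# A named sub's CV carries the glob it was defined under; the glob
# knows its own name and the stash (package) it lives in.
my $cv = svref_2object(\&Demo::hello);
my $gv = $cv->GV;
print $gv->STASH->NAME, "::", $gv->NAME, "\n";   # prints "Demo::hello"
```

In the filter above we reach the GV through the op tree rather than through a CV, but the STASH and NAME accessors are the same.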

Part one done - we have located the ops that we want to change. Now how on earth do we change ops in an op tree?

Fixing It Up With B::Generate

B::Generate was written to allow users to create their own ops and insert them into the op tree. The original intent was to be able to create bytecode for other languages to be run on the Perl virtual machine, but it's found plenty of use manipulating existing Perl op trees.

It provides ``constructor'' methods in all of the B::*OP classes, and makes many of the accessor methods read-write instead of read-only. Let's see how we can apply it to this problem. Remember that we want to negate the sense of the test, and then to add another argument to the call to don't.

For the first of these tasks, B::Generate provides the handy mutate and convert methods on each B::OP-derived object to change one op's type into another. The decision as to which of them to use is slightly subtle: mutate can only turn an op into another op of the same class - for instance, you cannot use it to mutate a binary op into a unary op. However, convert produces a completely new op, which then needs to be threaded back into the op tree. So convert is much more powerful, but mutate is much more convenient. In this case, since we're just flipping between and and or, we can get away with using mutate:


    require B::Generate;
    my $op = shift;
    if ($op->name eq "and") {
        $op->mutate("or");
    } else {
        $op->mutate("and");
    }

Now to insert the additional parameter. For this, remember that entersub works by popping off the top entry in the stack and calling that as a subroutine, and the remaining stack entries become parameters to the subroutine. So we want to add a const op to put a constant on the stack. We use the B::SVOP->new constructor to create a new one, and then thread the next pointers so that Perl's main loop will call it between $op->other->sibling (the refgen op) and the op after it (the GV which represents *don::t).


    my $to_insert = $op->other->sibling;
    my $newop = B::SVOP->new("const", 0, 1);
    $newop->next($to_insert->next);
    $to_insert->next($newop);

All that's left is to replace the definition of don't so that, depending on the parameters, it sometimes does:


    sub don't (&;$) { $_[0]->() if $_[1] }
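The (&;$) prototype is what makes this work: the block arrives as a code ref in $_[0], and the flag is an optional second argument. A small demonstration with a hypothetical maybe function:

```perl
use strict;
use warnings;

# (&;$): the first argument is a block (arriving as a code ref in
# $_[0]); the second is an optional flag that switches execution on.
sub maybe (&;$) { $_[0]->() if $_[1] }

my @ran;
maybe { push @ran, "skipped" };      # no flag: the block never runs
maybe { push @ran, "ran" } 1;        # flag set: the block runs
print "@ran\n";                      # prints "ran"
```

Note that, as with grep and List::Util's first, no comma is needed between the block and the arguments that follow it.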

And there we have it:


    package Acme::Don't;
    CHECK {
        use B::Utils qw(opgrep walkallops_filtered);
        walkallops_filtered(
            sub {
                my $op = shift;
                opgrep(
                {
                    name => ["and", "or"],
                    other => {
                        name => "pushmark",
                        sibling => { next => { name => "gv" }}
                    }
                }, $op) or return;
                my $gv = $op->other->sibling->next->gv;
                return unless $gv->STASH->NAME eq "don" and $gv->NAME eq "t";
                return 1;
            },
            sub {
                require B::Generate;
                my $op = shift;
                if ($op->name eq "and") {
                    $op->mutate("or");
                } else {
                    $op->mutate("and");
                }
                
                my $to_insert = $op->other->sibling;
                my $newop = B::SVOP->new("const", 0, 1);
                $newop->next($to_insert->next);
                $to_insert->next($newop);
            }
       );
    }
    
    sub don't (&;$) { $_[0]->() if $_[1] }

This will turn


    $false = 0; $true = 1;
    
    don't { print "Testing" } if $false;
    don't { print "Testing again" } unless $true;

into


    $false = 0; $true = 1;
    
    don't(sub { print "Testing" }, 1) unless $false;
    don't(sub { print "Testing again" }, 1) if $true;

negating the conditions and making don't actually run the code. A neat trick? We think so.

Where To From Here?

But that's not all! Of course, this simple transformation doesn't cater for some of the more complex constructions people can create, such as


    if ($x) {
        do_something();
        don't { do_the_other_thing() };
        do_something_else();
    }

or even


    if ($x) {
        do_that();
        don't { do_this() }
    } else {
        do_the_other();
        don't { do_something_else() }
    }

But these can be solved in just the same way. For instance, you would want to turn the first one into


    if ($x) {
        do_something();
        do_something_else();
    } else {
        don't(sub { do_the_other_thing() }, 1);
    }

and the second into


    if ($x) {
        do_that();
        don't(sub { do_something_else() }, 1);
    } else {
        do_the_other();
        don't(sub { do_this() }, 1);
    }

Both of these transformations can be done by applying the method above: compare the op trees, work out the difference, find the pattern you want to look for, then write some code to manipulate the op tree into the desired output. An easy task for the interested reader ...

And we really haven't scratched the surface of what can be done with B::Generate and B::Utils; the B::Generate test suite shows what sort of mayhem can be caused to existing Perl programs, and there have been experiments using B::Generate to generate op trees for other languages - a B::Generate port of Leon Brocard's shiny Ruby interpreter could produce Perl bytecode for simple Ruby programs; chromatic is working on an idea to turn Perl programs into XML, manipulate them and use B::Generate to turn them back into Perl op trees.

Later in our ``Where Wizards Fear To Tread'' series, we'll have articles about Perl and Java interaction, iThreads, and more.

Sapphire


Table of Contents

Design Principles
So, uh, what is it?
Where did it get?
What else can be done?
What can't be done?
Structure of a Sapphire
The future of Sapphire
Reflections through a Sapphire

Naming languages after stones is getting a bit old-hat these days. We all know and love Perl; you might have heard of the Ruby language, which I'll talk more about another time. There's also Chip Salzenberg's Topaz project, an idea to rewrite Perl in C++, which ended with the announcement of the Perl 6 effort. And now, there's Sapphire. So what's this all about?

Sapphire is one of the many projects which was started purely and simply to prove a point. In this case, the point was that building a large program from scratch in this day and age is crazy. I was going to prove it by showing how rapidly software can be developed when using established libraries, and that was done by seeing how quickly I could rewrite Perl 5.

Also, as a secondary goal, I wanted to show the flexibility of some of my design ideas for Perl 6. It's dangerous when people are sitting around for a long time discussing software design without implementations, without benchmarks and without a single line of code. I prefer getting up and doing something rather than talking about it. So I was going to show my ideas in software, not in words.

Design Principles

Here are some of the ideas I was intending to showcase:

Being Good Open-Source Citizens
What do I mean by this? Perl 5 is extremely self-sufficient. Once you've got the source kit, it'll compile anywhere, on almost any platform and requiring very few ``support'' libraries. It'll make do with what you have. One of the unfortunate side-effects of this is that if Perl wants to do something, it implements it itself. As a result, Perl contains a load of interesting routines, but keeps them to itself. It also doesn't use any of the perfectly fine implementations of those interesting routines which are already out there.

Some people think this is a feature; I think it's a wart. If we can give things back to the open-source community, and work with them to help improve their tools, then everyone benefits.

Generalizing Solutions
One of the great design choices of Perl 5 which appears to have been completely and utterly rejected in the discussions on Perl 6's proposed language is that we do things in the most general way possible. This is why Perl doesn't need huge numbers of new built-ins - it just needs a way to make user-defined syntax with the same status as built-ins. It doesn't need beautiful new OO programming models - it just needs a way to help people write their own OO models. Sapphire tries to do things in the most general way possible.

Modularity
Perl 5 consists of a number of parts, including a stunning regular expression engine and a decent way of dealing with multityped variables. Unfortunately, in the current implementation, these parts are all highly interdependent in twisty ways. Separating them out into modules means that you can test them independently, distribute them as independent libraries, and upgrade them independently.

Seems like a winner to me!

So, uh, what is it?

Sapphire, then, is a partial implementation of the Perl 5 API. I wasn't setting out to create a new interpreter, although that would have been necessary for some of the API routines, such as those which execute Perl code. What I wanted was to re-create the programming environment which Perl 5 gives you internally - a sort of ``super C,'' a C customized for creating things such as Perl interpreters.

Specifically, I wasn't trying to do anything new. I like Perl 5. It has a lot going for it. Of course, it could be tidier, since five years of cruft have accumulated all around it now. It could be less quirky, and it could display the design goals I have just mentioned. That's what I wanted to do.

Where did it get?

I gave myself a week. I was going to hack on it for one week, and see where that got me. I didn't spend a lot of time on it, but I still managed to achieve a fair amount: I had scalars, arrays and hashes working, as well as Unicode manipulation to the level of Perl 5 and slightly beyond.

How? Well, I started by looking at the glib library, at http://developer.gnome.org/doc/API/glib/. This provided a fantastic amount of what I needed. The GPtrArray corresponds nicely with a Perl AV, and glib also implements hashes, which saved a lot of time, although to have HEs (hash entries) you need to dig a little into the glib source.

All the Unicode support was there. I initially used GNOME's libunicode, but then found that the development version of glib added UTF8 support and was much easier to deal with. There were a few functions I needed which Perl 5 already had, and I'll be pushing those back to the glib maintainers for potential inclusion.

Perl uses a lot of internal variable types to ensure portability - an I32 is an integer type guaranteed to be 32 bits, no matter where it runs. Not surprisingly, I didn't have much work to do there, since glib provides a family of types, such as gint32, to do the same thing. Differing byte orders are also catered for. The ``super C'' environment the Perl 5 API provides is largely out there, in existing code.

Oh, and let's be honest: There was one large piece of existing code that was just too tempting not to use, and that was Perl itself. When you're trying to replicate something and you've got a working version in front of you, it's tricky not to borrow from it; it seems a shame to throw away five years of work without looking for what can be salvaged from it. A lot of the scalar handling code came from Perl 5, although I did rearrange it and make it a lot more sensible and maintainable. I wasn't just interested in developing with external libraries. I also wanted to see if I could correct some other misfeatures of Perl's internals.

The first problem to solve was the insidious use of macros on macros on macros. The way I went about this was by first outlawing lvalue macros. That is, for example,

    SvPVX(sv) = "foo";

had to turn into

    sv_setpv(sv, "foo");

Incidentally, this is how perlapi says it should be done. Perl 5 often optimizes for speed (sometimes too enthusiastically) at the expense of maintainability - Sapphire questions that trade-off, preferring to trust compiler optimization and Moore's Law.

Next, I wrote a reasonably sophisticated Perl program to convert inline functions into macros. That is, it would take

    #ifdef EXPAND_MACROS
    INLINE void sv_setpv (SV* sv, char * pv) {
       ((XPV*)  SvANY(sv))->xpv_pv = pv;
    }
    #endif

and turn it, automatically, into:

    #ifdef EXPAND_MACROS
    #ifdef EXPAND_HERE
    INLINE void sv_setpv (SV* sv, char * pv) {
       ((XPV*)  SvANY(sv))->xpv_pv = pv;
    }
    #endif
    #else
    #define sv_setpv(sv, pv) ((XPV*)  SvANY(sv))->xpv_pv = pv
    #endif

Now you can choose whether your macros should be expanded by flipping on -DEXPAND_MACROS and whether they should be inline by playing with -DINLINE. But what's EXPAND_HERE for? Well, the above code snippet would go into an include file, maybe sv.h, and one C file - let's call it sv_inline.c - would contain the following code:

    #include <sapphire.h>
    #define EXPAND_HERE
    #include <sv.h>

Then if EXPAND_MACROS was defined, the function definitions would all be provided in one place; if macros were not expanded, sv_inline.c would define no functions. The function prototypes would be extracted automatically with C::Scan.

With the state of compiler optimization these days, it's likely that turning everything into macros makes no significant speed difference. In which case, it's best to turn on EXPAND_MACROS to assist with source-level debuggers which cannot read macros. However, you can't tell until you benchmark, and the ``optional expansion'' method gives you a nice easy way to do that.
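The converter itself can be approximated in a few lines of Perl. This is a toy version handling only single-statement bodies (the real program was considerably more sophisticated, and the inline_to_macro name is hypothetical):

```perl
use strict;
use warnings;

# A toy sketch of the inline-function-to-macro converter: pull the
# name, argument list, and (single-statement) body out of an INLINE
# function and emit the equivalent #define.
sub inline_to_macro {
    my $src = shift;
    my ($ret, $name, $args, $body) = $src =~
        /INLINE \s+ (\w+) \s+ (\w+) \s* \( ([^)]*) \) \s* \{ \s* (.*?) ;? \s* \}/xs;
    # Parameter names are the last word of each C declaration.
    my $params = join ", ",
        map { (split ' ')[-1] =~ s/^\*+//r } split /,/, $args;
    return "#define $name($params) $body";
}

my $c = <<'END';
INLINE void sv_setpv (SV* sv, char * pv) {
   ((XPV*)  SvANY(sv))->xpv_pv = pv;
}
END
print inline_to_macro($c), "\n";
```

Fed the sv_setpv example above, it produces the macro form shown earlier; the #ifdef scaffolding would be wrapped around both forms by the same script.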

I also took a swipe at memory allocation; it seems the first job in every large project is to write your own memory allocator. I had heard from perl5-porters and other places that the biggest speed overhead in XS routines is SV creation, so I wrote an allocator which would maintain pools of ready-to-ship variables, refreshing the pools when there was nothing else to do, like a McDonald's burger line.

What else can be done?

If I had given myself two weeks, where would I be? Sticking with glib, I could very easily have safe signal handling, portable loadable module support, a main event dispatch loop and a safe threading model. It's all there, ready to go. It's free software, and that's just one library.

To be honest, I wouldn't advocate the use of glib for everything I could do with it. For example, I replaced Perl's main run loop (Perl_runops_standard in run.c) with a GMainLoop and benchmarked the two. The glib version, although signal safe, was at least five times slower. (However, you may want to contemplate what it means for graphical application programming if you have, effectively, a GNOME event loop right there in your interpreter.)

Heavier Unicode support would probably need libunicode. What about regular expressions? Well, the glib developers are working on a fully Unicode-aware Perl-compatible regular expression library (which, frankly, is more than we have). If they don't finish that, Philip Hazel's Perl Compatible Regular Expression library does exactly what it says on the tin.

What can't be done?

There are some areas of Perl's programming environment where I'm not aware of a pre-existing solution. For instance, scope.c in the Perl source distribution gives C the concept of ``dynamic scope,'' allowing you to save variables and restore them at the end of a block, just like the local operator in Perl.
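At the Perl level, that save-and-restore behavior is exactly what the local operator gives you; scope.c is the machinery underneath it. For instance:

```perl
use strict;
use warnings;

# local saves a global's value on the savestack and arranges for it
# to be restored when the enclosing block exits: dynamic scope.
our $setting = "outer";
sub show { print "$setting\n" }

{
    local $setting = "inner";   # saved...
    show();                     # prints "inner"
}                               # ...and restored here
show();                         # prints "outer"
```

Note that show sees "inner" even though the assignment happens in a different scope entirely; that is the dynamic (rather than lexical) part.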

And some problems just can't be solved in C. There's no good way, for instance, to get a partitioned namespace. I didn't bother trying. Once you've told the developer what the API is, it's his responsibility to ensure it works.

On the other hand, C is not meant to be a language that gives you this type of support. Some would argue that C++ solves these problems, but in my experience, C++ never solves anything.

Structure of a Sapphire

As I've mentioned, I tried to plan the structure of Sapphire along modular lines, so that pieces could be tested individually and upgraded. My proposed structure was a series of libraries, like this:

libsvar
Standing for ``Sapphire variables,'' libsvar contains all the functions for manipulating SVs, AVs and HVs. This is an interesting library in its own right which can be used for programming outside of the Sapphire environment - having SVs in an ordinary C program without the overhead of a Perl interpreter really expands your programming possibilities, and, as far as I'm aware, there isn't a good variable-handling library around.

libre
The regular expression engine would be separated into its own library, again so that external applications can use it without an entire Perl interpreter. I didn't implement this myself, leaving it to PCRE or glib to provide it.

libutf8
Again, we can split off the Unicode handling functions into their own library, although this functionality can be implemented by libunicode or glib.

libscope
The present-day scope.c and scope.h solve a problem in C by giving it dynamic scoping; this is something that contributes to the friendliness of the Perl programming environment, and something we can separate and share.

libpp
Although this wouldn't be useful outside of Sapphire, libpp would contain the ``push-pop'' code which runs the operations inside the interpreter.

libutil
Libutil would contain everything else which was potentially useful outside of Sapphire - the memory allocation, the stack manipulation and so on.

The future of Sapphire

So, what am I going to do with Sapphire now? To be honest, nothing at all. I hope it has served its purpose by presenting the argument for reusable code and stable design principles.

I certainly don't, at present, want to be placed in a position where I'm choosing between spending time fiddling with Sapphire and spending time contributing to coding Perl 6. Please understand: Sapphire is emphatically not intended to be a fork of Perl, merely an interesting interlude, and this is shown by the fact that I didn't try to make any exciting changes.

If anyone has some interesting ideas on how to take this ball and run with it, feel free. It's free software, and this is exactly what you should be doing. Contact me if you'd like a copy of the source.

I do have some thoughts on what my next experiment is going to be, however ... .

Reflections through a Sapphire

What have I learned from all this? I've learned a lot about the structure of Perl 5. I've realized that roughly half of it is support infrastructure for the other half, the business half. Is this good or bad? Well, it certainly means that we're not beholden to anyone else - an external library may suddenly change its implementation, semantics or interface, and Sapphire would have to struggle to catch up. Perhaps it's all about control. By implementing everything ourselves, the porters retain control over Perl.

I've also learned that Perl 5, internally, has a lot to share, yet, even though we claim to believe in code reuse where the CPAN's concerned, we do very little of it on a lower level, neither really giving nor really taking.

I've learned that rapid development can come out of a number of things. First, having external code already written to do the work for you helps a lot, even though you don't have much control over it.

Second, having an existing implementation of what you're trying to program also helps, although you have to walk a fine line. Taking Perl 5 code wholesale meant I either had to do a lot of surgery or support things I didn't really want to support, but ignoring the whole of the existing code base would feel like throwing the baby out with the bathwater. (Hence I would caution the Perl 6 internals people to thresh carefully the Perl 5 code; there is some wheat in that chaff, or else you wouldn't be using it ... .)

Finally, rapid development can come from having a well-organized and disciplined team. My team swiftly agreed on all matters of design and implementation and got down to coding without interminable and fruitless discussions, taking unanimous decisions on how to get around problems - because I was my team.

Would I say the Sapphire experiment was a success? Well, since it taught me all the above, it certainly couldn't have been a failure. Did it prove the point that developing with reusable code is worth the sacrifice in terms of control? That remains to be seen.
