September 2004 Archives

Don't Be Afraid to Drop the SOAP

SOAP has great hype; portable, simple, efficient, flexible, and open, SOAP has it all. According to many intelligent people, writing a web service with SOAP should be a snap, and the results will speak for themselves. So they do, although what they have to say isn't pretty.

Two years ago I added a SOAP interface to the Bricolage open source content management system. I had high expectations. SOAP would give me a flexible and efficient control system, one that would be easy to develop and simple to debug. What's more, I'd be out on the leading edge of cool XML tech.

Unfortunately the results haven't lived up to my hopes. The end result is fragile and a real resource hog. In this article I'll explore what went wrong and why.

Last year, I led the development of a new content-management system called Krang, and I cut SOAP out of the mix. Instead, I created a custom XML file-format based on TAR. Performance is up, development costs are down, and debugging is a breeze. I'll describe this system in detail at the end of the article.

What is SOAP?

In case you've been out to lunch, SOAP (Simple Object Access Protocol) is a relatively new RPC (Remote Procedure Call) system that works by exchanging XML messages over a network connection, usually over HTTP. In an RPC system, a server offers routines (procedures) that clients may call over a network connection. SOAP surpasses its direct predecessor, XML-RPC, with an enhanced type system and an improved error-handling system. Despite the name, SOAP is neither particularly simple nor object-oriented.

Bricolage Gets SOAP

When I joined the Bricolage project, it lacked a good way to control the application aside from the browser-based GUI. In particular, we needed a way to import data and trigger publish runs. Bricolage is a network application, and some useful tasks require interaction with multiple Bricolage servers. SOAP seemed like an obvious choice. I read "Programming Web Services with Perl" and I was ready to go.

I implemented the Bricolage SOAP interface as a set of classes that map SOAP requests to method calls on the underlying objects, with some glue code to handle XML serialization and deserialization. I used XML Schema to describe an XML vocabulary for each object type, which we used to validate input and output for the SOAP methods during testing.

By far the most important use-case for this new system was data import. Many of our customers were already using content management systems (CMSs) and we needed to move their data into Bricolage. A typical migration involved processing a database dump from the client's old system and producing XML files to load in Bricolage via SOAP requests.

The SOAP interface could also move content from one system to another, most commonly when moving completed template changes into production. Finally, SOAP helped to automate publish runs and other system maintenance tasks.

To provide a user interface to the SOAP system, I wrote a command-line client called bric_soap. The bric_soap script is a sort of Swiss Army knife for the Bricolage SOAP interface; it can call any available method and pipe the results from command to command. For example, to find and export all the story objects with the word foo in their title:

$ bric_soap story list_ids --search "title=%foo%" |
	bric_soap story export - > stories.xml

Later we wrote several single-purpose SOAP clients, including bric_republish for republishing stories and bric_dev_sync for moving templates and elements between systems.

What Went Right

  • The well-documented XML format for Bricolage objects made developing data import systems straightforward. Compared to previous projects that attempted direct-to-SQL imports, the added layer of abstraction and validation was an advantage.
  • The interface offered by the Bricolage SOAP classes is simpler and more regular than the underlying Bricolage object APIs. This, coupled with the versatile bric_soap client, allowed developers to easily script complex automations.

What Went Wrong

  • SOAP is difficult to debug. The SOAP message format is verbose even by XML standards, and decoding it by hand is a great way to waste an afternoon. As a result, development took almost twice as long as anticipated.
  • The fact that all requests happened live over the network further hampered debugging. Unless the user was careful to log debugging output to a file it was difficult to determine what went wrong.
  • SOAP doesn't handle large amounts of data well. This became immediately apparent as we tried to load a large data import in a single request. Since SOAP requires the entire request to travel in one XML document, SOAP implementations usually load the entire request into memory. This required us to split large jobs into multiple requests, reducing performance and making it impossible to run a complete import inside a transaction.
  • SOAP, like all network services, requires authentication to be safe against remote attack. This means that each call to bric_soap required at least two SOAP requests — one to login and receive a cookie and the second to call the requested method. Since the overhead of a SOAP request is sizable, this further slowed things down. Later we added a way to save the cookie between requests, which helped considerably.
  • Network problems affected operations that needed to access multiple machines, such as the program responsible for moving templates and elements — bric_dev_sync. Requests would frequently timeout in the middle, sometimes leaving the target system in an inconsistent state.
  • At the time, there was no good Perl solution for validating object XML against an XML Schema at runtime. For testing purposes I hacked together a way to use a command-line verifier using Xerces/C++. Although not a deficiency in SOAP itself, not doing runtime validation led to bad data passing through the SOAP interface and ending up in the database where we often had to perform manual cleanup.

Round Two: Krang

When I started development on Krang, our new content management system, I wanted to find a better way to meet our data import and automation needs. After searching in vain for better SOAP techniques, I realized that the problems were largely inherent in SOAP itself. SOAP is a network system, tuned for small messages and it carries with it complexity that resists easy debugging.

On the other hand, when I considered the XML aspects of the Bricolage system, I found little to dislike. XML is easy to understand and is sufficiently flexible to represent all the data handled by the system. In particular, I wanted to reuse my hard-won XML Schema writing skills, although I knew that I'd need runtime validation.

In designing the new system I took a big step back from the leading edge. I based the new system on the TAR archive file format, which dates back to the mid-70s!


Figure 1.

I named the file format "Krang Data Set" (KDS). A KDS file is a TAR archive containing a set of XML files. A special file, index.xml, contains data about all the files contained in the KDS file, providing class names and IDs. To reduce their size, it's possible to compress KDS files using gzip.

I wrote two scripts, krang_import and krang_export, to read and write KDS files. Each object type has its own XML Schema document describing its structure. Krang classes implement their own deserialize_xml() and serialize_xml() methods. For example, to export all templates into a file called templates.kds:

$ krang_export --templates --output templates.kds

To import those templates, possibly on a different machine:

$ krang_import templates.kds

If the object being exported has any dependencies, the KDS file will include them. In this way a KDS file generated by krang_export is guaranteed to import successfully.

By using a disk-based system for importing and exporting data I cut the network completely out of the picture. This alone accomplishes a major reduction in complexity and a sizable performance increase. Recently we completed a very large import into Krang comprising 12,000 stories and 160,000 images. This took around 4 hours to complete, which may seem like a long time but it's a big improvement over the 28 hours the same import required using SOAP and Bricolage!

For system automation such as running publish jobs from cron, I decided to code utilities directly to Krang's Perl API. This means these tools must run on the target machine, but in practice this is usually how people used the Bricolage tools. When an operation must run across multiple machines, perhaps when moving templates from beta to production, the administrator simply uses scp to transfer the KDS files.

I also took the opportunity to write XML::Validator::Schema, a pure-Perl XML Schema validator. It's far from complete, but it supports all the schema constructs I needed for Krang. This allows Krang to perform runtime schema validation on KDS files.

What Went Right

  • The new system is fast. Operating on KDS files on disk is many times faster than SOAP network transfers.
  • Capacity is practically unlimited. Since KDS files separate objects into individual XML files, Krang never has to load them all into memory at once. This means that a KDS file containing 10,000 objects is just as easy to process as one containing 10.
  • Debugging is much easier. When an import fails the user simply sends me the KDS file and I can easily examine the XML files or attempt an import on my own system. I don't have to wade through SOAP XML noise or try to replicate network operations to reproduce a bug. Separating each object into a single XML file made working on the data much easier because each file is small enough to load into Emacs.
  • Runtime schema validation helps find bugs faster and prevents bad data from ending up in the database.
  • Because Krang's design accounted for the XML system from the start it has a much closer integration with the overall system. This gives it greater coverage and stability.

What Went Wrong

  • Operations across multiple machines require the user to manually transfer KDS files across the network.
  • Users who have developed expertise in using the Bricolage SOAP clients must learn a new technology.

Conclusion

SOAP isn't a bad technology, but it does have limits. My experience developing a SOAP interface for Bricolage taught me some important lessons that I've tried to apply to Krang. So far the experiment is a success, but Krang is young and problems may take time to appear.

Does this mean you shouldn't use SOAP for your next project? Not necessarily. It does mean that you should take a close look at your requirements and consider whether an alternative implementation would help you avoid some of the pitfalls I've described.

The best candidates for SOAP applications are lightweight network applications without significant performance requirements. If your application doesn't absolutely require network interaction, or if it will deal with large amounts of data then you should avoid SOAP. Maybe you can use TAR instead!

Resources

This Week on Perl 6, Week Ending 2004-09-24

The Perl 6 Summary for the week ending 2004-09-24

This is my last summary before I start my teaching practice. Hopefully I've things set up so writing the summary isn't going to interfere with that, and vice versa.

This week in perl6-compiler

State of Rules

Discussion of the state of the Perl 6 compiler (with particular reference to the rules engine) continued. People wanted to make sure that the rules engine that Luke and Patrick are working on would be flexible enough to cope with different languages in closures.

Synopsis 5 updated

Ed Peschko asked that there be some way of "turning the rules engine inside out" to make something which, given a rule, would generate strings that could match against it. Actually, this is the second time Ed asked for this, as Luke reminded him. Luke went on to implement a generator in hypothetical perl 6, which seemed to please everyone but Ed.

Rod Adams wins the "making the summarizer smile wryly" occasional prize.

Meanwhile, in perl6-internals

Problems Reimplementing Parrot Forth

Matt Diephouse fell afoul of problems with the compile and compreg opcodes in his efforts to reimplement Parrot Forth in PIR. Steve Fink made some suggestions for workarounds based on his work on writing a regular expression compiler.

From further discussion, it seems that, if you're implementing a stack based language, you'd do well to manage the language's stack yourself rather than using Parrot's User stack which is interestingly scoped.

__init not being magically called

Will Coleda had some problems with a class's __init function not being called magically. Nobody else could reproduce the problem. After a certain amount of confusion, Will did make realclean; perl Configure.pl; make; make test and all was well again.

If you're experiencing a weird problem, it's probably best to do the rebuild before posting to the list. Or you could fix the build system to have more accurate dependencies...

Incremental collector and finalization

Jeff Clites had some questions about how finalizers interact with the work that Leo's done while implementing an incremental garbage collector for Parrot. Leo had some thoughts, but noted that there's still a problem with ordered finalization and destruction.

[Your summarizer is really starting to understand why old school GC types really don't like finalizers...]

Python bytecode volunteers

Dan asked for volunteers to finish the task of getting Python bytecode working on Parrot; he reckoned that the work was mostly done, but that neither he nor Leo have the time to go the last couple of yards.

Come on people, this would definitely be a cool thing to have.

mod_parrot 0.0

Jeff Horwitz announced the release of version 0.0 of his mod_parrot Apache module. It's remarkably powerful for version 0.0.

The compile op and building compilers

Dan had some thoughts on tidying up the spec for the compreg and compile operators and asked for comments before he nailed the spec down. Steve Fink and Leo had comments.

Misc. remarks about YAPC::EU

Leo popped up to thank everyone who'd donated to The Perl Foundation and thus supported the purchase of shiny new Apple Powerbook G4 that he'd used to run his presentation at YAPC Europe in Belfast.

He went on to outline some of the things he'd done and heard in Belfast, including the fact that one French teacher is using Parrot for teaching assembly language.

Parrot m4 0.0.8

Bernhard Schmalhofer announced version 0.0.8 of Parrot m4. There's no new functionality, "just" some structural improvement and tidying.

Parrot TCL

Will Coleda posted a progress report on his Parrot TCL implementation which is progressing gradually towards being a full blown TCL implementation; he's working towards using special Tcl* PMCs with real TCL semantics instead of the current scheme which uses Perl PMCs.

Namespaces, Part 1

Dan posted the first part of his Namespaces spec. There was, of course, much discussion. Inevitably, there was another revision, and further discussion.

Dan's initial post

The revised version

Towards a new call scheme

Leo posted an overview of the work he was doing on Parrot's internals to put a faster calling scheme in place (as discussed endlessly). The usual perl6-internals discussion and revision process swung into action.

Hello everybody

Remember the French teacher that Leo mentioned? Well, the man himself, Christian Aperghis-Tramoni, popped up on the list and pointed everyone at his work so far, and asked for help in finding more information.

If anyone would like to translate Christian's work from French to English...

Bits of introspection

Leo announced that he'd started work on adding more introspection features to Parrot, accessible through the interpinfo op. This looks very cool.

Why lexical pads

Klaas-Jan wondered why Parrot had support for lexical pads, as he thought that PIR's .local syntax was good enough.

Several people explained. Essentially, lexical pads are really handy, bordering on the essential, when you're implementing a language with closures.

Meanwhile, in perl6-language

In Brief

Discussion of various synopses continued.

Larry can be very persuasive when he's right.

Michele Dondi is a chap. (See last week's question.)

Trying to run a thread across multiple mailing lists is the sort of thing that annoys a summarizer.

You have to write something before it goes in the core.

Patrick hopes to have a first cut at the Perl 6 rules engine available within a couple of weeks.

Pipeline Performance

Luke showed off Perl 6's little known gather {...; take ... } construct in some example code. People were impressed.

Unary dot and custom control

Luke Palmer wondered about topicalization and scope in:

    method foo () {
        preserve {
            .bar;
        }
    }

In particular, he hoped that the topic that .bar sees is the topic that's lexically current.

Larry set his mind to rest. (Well, he set my mind to rest).

attributes/methods on sigils

Michele Dondi wondered if sigils could be "(sort of special) operators ... thus allowing attributes/methods or even adverbs". Larry's response was superbly deadpan.

The usual footer

Hmm... maybe I should trying doing the perl6-language summary like that every week; it's certainly quicker to write like that. Let me know what you think.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send feedback or contributions to a getting Piers to OSCON 2005 fund.

The Perl Foundation

Perl 6 Development site

Or, you can check out my website.

This Week on Perl 6, Week Ending 2004-09-17

Another week, another summary, and I'm running late. So:

This week in perl6-compiler

The current state of the compiler

Discussion of the current state of the nascent perl 6 compiler and how best to contribute to its development even before code has been released continued. The best way to contribute right now is "Write tests". Don't worry about what the test harness should look like; simple tables of rules, test strings and expected matches will be very welcome.

The status discussion also touched on how to handle different languages in the closures embedded in rules.

Bootstrapping the grammar

Uri Guttman had some thoughts on bootstrapping Perl 6's grammar. He hoped that his suggested approach would enable lots of people to work on the thing at once without necessarily getting in each other's way. Adam Turoff pointed everyone at a detailed description of how Squeak (a free Smalltalk) got bootstrapped.

Synopsis 5 updated

Larry announced that he has updated Synopsis 5, which covers Grammars, rules and all that good stuff. It's now only a week out of date instead of two years and counting.

This week on perl6-internals

Namespaces

Discussion of Dan's namespace proposal really got going this week.

Buffered IO and Parrot Forth

Matt Diephouse fell afoul of a problem with IO buffering when he was taking a look at Parrot Forth, so he asked the list for help. Leo supplied the help, so Matt supplied a patch to Parrot Forth which made it print its prompts correctly when run under modern (CVS) Parrot.

Pragma @LOAD is not always honoured

Stéphane Payrard was bemused to discovered that parrot routines declared with the @LOAD pragma don't automatically execute if they're in the main segment. He suggested that the issue be either fixed or documented.

Leo documented it.

NCI basics

Charles Somebody tried to crash the monomonikered big leagues by failing to vouchsafe his surname when he posted a question about getting NCI to work with wxWindows. For reasons that escape me, the answers (and, indeed, Charles's surname — Lowell) appeared in a different thread.

Sadly the answers were more along the lines of "Oops, that's a bug that is, we'll add it to the RT queue". Still, better to have it identified than festering away undiscovered.

Language::Zcode

Who says Perl 6 is the only language that's taking a long time to appear on Parrot? Amir Karger posted his first annual update on his attempt to get Parrot to emulate the Z-machine. Hopefully subsequent updates will be more frequent.

Meanwhile, in perl6-language

Ordinals, Hashes and Arrays, oh my!

David Green had some thoughts on Perl 6's compound data structures. Larry didn't sound convinced.

Writing pack, or something like it

Michele Dondi wondered how to write pack-like functions in Perl 6, where the first argument is a string which specifies the signature of the rest of the function call. The proposal stumped me, but maybe you all can make something of it.

But is it intuitive?

No it isn't.

S5 Grammar compositions

While peacefully reading Synopsis 5 (Rules & Grammars), Dave Whipp noticed that grammatical inheritance wasn't as flexible as the Role based compositions that can be used when working with classes. Larry wondered allowed about having grammar roles, but I don't think they've been officially mandated yet...

Still about subroutines...

Michele Dondi continues to make my head hurt with zir proposals. In part it's because I've still not worked out whether zie is male or female, and in part because, well, zir proposals are challenging. In this particular proposal zie wondered if there would be a way to magically write recursive anonymous functions without having to introduce a new symbol of some sort.

Luke and Larry think there will be such a way, but the precise syntax hasn't settled just yet.

Range quantifier woes

Jonathan Scott Duff wasn't happy with the new range quantifier syntax in Synopsis 5. He posted a bunch of questions that were nagging at him. Larry had some good answers (if you're interested in contributing to the design of Perl 6 you should really read Larry's replies).

Announcements, Apologies, Acknowledgements

And so ends another summary. I hope you liked it. Sorry for the delay if you're reading this on the mailing list; this teacher training malarkey is remarkably tiring.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send feedback or contributions to a getting Piers to OSCON 2005 fund.

The Perl Foundation

Perl 6 Development site

Or, you can check out my web site.

Building a Finite State Machine Using DFA::Simple

I am converting some articles from MS Word to HTML by hand. I often use bulleted outlines so I face a lot of work creating lists with nested sub-lists. It didn't take opening and closing many <ul> and <li> tags to realize I wanted to automate the task.

It's very easy to use filters with my text editor and I've written several in Perl to speed up my HTML formatting. I decided to create a filter to handle this annoying chore. After a little thought I decided the logic involved was just tricky enough to use a finite state machine (FSM) to keep things straight.

Large programs always require some thought and planning while most small programs can be written off the top of the head. Once in a while I run into a small problem with just enough complexity that I know "off the top" will quickly become "not so quick, very dirty, and probably wrong." Finite state machines can help to solve these knotty problems cleanly and correctly.

There have been several times I wanted to use an FSM but didn't because I lacked a good starting skeleton. I would do it some other way or hand code an FSM using a loop, nested if statements, and a state variable. This ad hoc approach was tedious and unsatisfying.

I knew there had to be a Perl module or two implementing FSMs and decided it was time to add this tool to my toolbox.

What Are Finite State Machines?

A finite state machine is a set of states, one of which is the starting state. Each state has a set of transitions to other states, based on the next input token. A transition has a condition and an action. If the condition is met, the FSM performs the action and then enters the new state. This cycle repeats until reaching the end state.

Finite state machines are important in the theory of computation. A Wikipedia article on finite state machines goes into the computer science aspect. FSMs are valuable because it is possible to identify precisely what problems they can and cannot solve. To solve a new class of problems, add features to create related machines. In turn, you can characterize problems by the kind of machine needed to solve it. Eventually you work up to a Turing machine, the most theoretically complete computing device. It is a finite state machine attached to a tape with potentially infinite capacity.

The formal and highly abstract computer science approach to FSMs does not hint at their usefulness for solving practical problems.

The Bubble Diagram

Figure 1 shows the parts of a finite state machine. The circles represent states. A double circle indicates the start state. The arcs represent the transitions from one state to another. Each arc label is a condition. The slashes separate actions.

Finite state machine elements
Figure 1. Finite state machine elements.

I like FSMs because they are graphical. I can grab a pen and sketch the solution to a problem. The bubble diagram helps me see the combinations of states and input conditions in a logical and organized way.

Hand drawn finite state bubble diagram
Figure 2. Initial rough diagram.

Figure 2 is the initial diagram I drew for the HTML list problem. The diagram turned out to be have an omission — it does not handle blank lines. Having discovered this bug, it was easy to make sure to consider blank lines in every state.

Final finite state diagram
Figure 3. The final diagram.

Figure 3 is a formal diagram of the final FSM, complete with blank-line handling, prepared for this article. Normally, I would never be so elaborate for such a small problem. You can see that it is an excellent way to communicate a solution to others.

Looking at DFA::Simple

I have yet to look for a Perl module to do something and come up totally empty handed. That's absolutely my favorite thing about Perl.

I looked for existing FSM modules and found DFA::Command and DFA::Simple. DFA::Command looks good but was more than I needed. (By the way, the simplest FSM is also a "deterministic finite automata," thus the DFA in the module names.)

The "simple" in DFA::Simple sounded promising, but the module's documentation is poor. The sample program is difficult to follow and does not process input from a file. Using the module in the real world is left as an exercise for the reader: the docs do not mention handling errors and the end-of-file condition.

While FSMs start as circles and arcs, they end up as tables of states and transitions. My interest was in exactly how the packages created these tables. Having clean, easy-to-look-at code is very important to me. I'm probably the one who'll have to change it someday!

I was very unhappy that DFA::Simple defines its sample FSM in a position-dependent way. The state being defined depends upon its textual position in the source. It uses numeric constants for state IDs within the tables and you must synchronize them with any changes. Comments indicate which state number you think you are working on, but they could easily unsynchronize as you add, drop, and move states.

To clarify, I saw this:

my $states = [
    # State 0 
    [ ...jump to state 2... ],
    # State 1 
    [ ...jump to state 0... ],
    # State 2 
    [ ...jump to state 1... ],
    # ...
];

But I wanted to see something like this:

my %states;
$states{stStart}   = [ ...jump to stSubitem... ];
$states{stItem}    = [ ...jump to stStart...   ];
$states{stSubitem} = [ ...jump to stItem...    ];
# ...

This was a big negative. I have run into enough bugs caused by position-dependency that I refused to consider using this code. I almost decided to write my own module, but first went back and took a last look at DFA::Simple to see if there was any way I could make it work.

I realized that an FSM use symbolic constants for state definitions. Array references using these symbolic constants would provide the self-documenting and maintainable approach I demanded. It was easy to develop a way to process input (handling errors and EOF) that works in most cases. I put it together, gave it a try, and it worked!

The Input and Output

The input comes from copying an MS Word document and pasting it into my text editor. My bulleted outlines look like this:

o Major bullet points
    - each line is a point
    - a leading "o " is removed
o Minor bullet points
    - indicated by a leading "- "
    - may not start a list
    - only one level of minor points is supported
o Future enhancements
    - numbering of points
    - handling of paragraphs

o Start a new group of points
    - a blank line indicates end of old group
    - old group is closed, new one opened

This outline actually documents key features of the program.

A correctly nested HTML list must come entirely within the parent list item. This is what made the filter tricky enough that I wanted to use a finite state machine. When you open a top-level list element you have to wait for the next input to see if you should close it or start a sub-list. When you look at an input, you have to know if you have an open list element pending. The current state and the input determine what to do next, exactly fitting the finite state machine model.

A snippet of output:

<ul>
  <li>Major bullet points
    <ul>
      <li>Each line is a point</li>
      <li>A leading "o " is removed</li>
    </ul>
  </li>
...
</ul>

Not being an HTML expert, I initially tried to insert my sub-list between two higher-level list elements. The page rendered correctly but would not validate, revealing my error and starting this project.

The Parts of the Machine

The first step in building the machine was deciding how to represent state names symbolically rather than numerically. The natural choice is the constant pragma.

use constant {
      stStart     => 0,
      stItem      => 1,
      stSubItem   => 2,
      stEnd       => 3,
};

I defined the machine as two arrays. The first is @actions defining what to do when entering and exiting a state. You might use it to initialize a data structure on entry and save it on exit, or allocate resources on entry and de-allocate them on exit.

Each element of @actions is a two-element array. The first is the enter action and the second the exit action. My program logs a message on entry and leaves the exit action undefined.

my @actions;

# log entry into each state for debugging
$actions[stStart]   = [sub{print LOG "Enter Start\n";}  ];
$actions[stItem]    = [sub{print LOG "Enter Item\n";}   ];
$actions[stSubItem] = [sub{print LOG "Enter SubItem\n";}];
$actions[stEnd]     = [sub{print LOG "Enter End\n";}    ];

Each element of the second array, @states, is an array of transitions for a state. A transition has three elements: the new state; the test condition; and the action. The condition is a subroutine returning true or false, or undef, which always succeeds.

The machine tests the transitions of the current state in sequence until a condition returns true or is undefined. Then it takes the action and enters the new state. If the new state is different than the old one, the respective exit and entry actions execute.

As with the final else in a nested if statement, consider having an undef at the end of the transitions. If nothing else, log a "can't happen" error message to catch holes in your logic as well as holes caused by future changes.

Here is the transition array for a state:

my @states;

# Next State, Test,        Action

  $states[stSubItem] = [
    [stEnd,     sub{$done},  sub{end_sublist; end_list}  ],
    [stSubItem, sub{/^-\s/}, sub{output_subitem}         ],
    [stStart,   sub{/^$/},   sub{end_sublist; end_list}  ],
    [stItem,    undef,       sub{end_sublist; start_item}],
];

This uses the symbolic constants for state IDs in both tables. The states' numerical values are irrelevant. I followed convention and numbered the states consecutively from zero, but I tested the program using random state numbers. It worked fine.

This is much better than the definition of the sample FSM in DFA::Simple. Now adding, deleting, and shuffling states will not introduce position-dependent bugs.

Final Assembly

The parts of the machine are built. Assembling them is easy:

use DFA::Simple

my $fsm = new DFA::Simple \@actions, \@states;
$fsm->State(stStart);

This creates a new DFA::Simple object using references to the @actions and @states arrays. The State method retrieves or sets the machine's state, invoking entry and exit routines as needed. Here, the call sets the initial state.

Putting the Machine in Motion

The major goal of this project was to develop a skeleton to develop FSMs quickly, with an emphasis on filters. I wanted to develop a way of handling file input and processing errors that would work for future projects.

This loop accomplishes that:

while (<>) {
      chomp;

      # log input line for debugging
      print LOG "Input <$_>\n";

      # trim input and get rid of leading "o "
      s/^\s+//;
      s/\s+$//;
      s/^o\s+//;

      # process this line of input
      $fsm->Check_For_NextState();

      # get out if an error occurred
      last if $error;
}

The while (<>) loop is standard for filters. The application cleans up input as needed. In this case, it chomps each line and trims and strips it of some text bullet characters.

The log message is for initial debugging. This, along with the new state message, gives a clear picture of what is happening.

The most important statement is $fsm->Check_For_NextState();, which does the major work — scanning the current state's transition table to see what to do next.

The FSM must be able to exit early if it spots an error. This requires the use of a global $error switch. It starts out as false. If $error is true, the program ends and returns the value of $error.

End-of-file handling also uses a global switch, $done. It is initially false and changes to true when the program exhausts standard input. One last cycle runs to give the machine a chance to clean up. Transition conditions checking for EOF can simply return $done.

Here is the end-of-file code:

unless ($error) {
      # finish up
      $done = 1;
      $fsm->Check_For_NextState();
}

The program does not support early normal termination. You can enter a state with no transitions and the program will read (but ignore) any remaining input. However, there is no way to avoid reading all of the input without setting $error. You can implement early exit by adding a $exit switch and handling it similarly to $error.

Appropriately for filters, this program is line oriented. Many FSMs read input a character at time: lexical scanners and regular expression parsers, for example. This program could easily change to handle character input.

The Action Routines

Now that the machine is purring along, it needs to do something. This is the purpose of the actions. Those in my program are typical of a finite state machine. They invoke small subroutines that form a mini-language for building lists with sub-lists. The subroutines are very simple, but running them in the right order does the job.

Thus, this approach breaks a complex problem into smaller pieces, each of which is simply solvable. This is what programming is all about.

Conclusion

The next time you have a knotty little problem like this one, try sketching a solution using a bubble diagram and implementing it with DFA::Simple. You'll have added a great new tool to your toolbox.

Resources

This Week on Perl 6, Week Ending 2004-09-10

This week on perl6-compiler

Yes you read that right; development of the Perl 6 compiler now has its own mailing list. Hopefully, in the coming weeks, the current perl6-internals list will get renamed parrot-internals to reflect that split.

As I write this, groups.google.com hasn't picked up the new list, but I'm sure it will do eventually, so I'm going to continue to use Google URLs here on the theory that they'll work eventually. If you're desperate to read the whole threads before then, start at perl6-compiler on Perl.org — there's all of 35 messages there as I type this...

Bundles

In a thread that migrated over from perl6-language Gregory Keeney continued discussion of whether Perl 6 will/should have an officially blessed bundling system, or whether it should simply have good enough hooks that someone could write one. There was some discussion about whether perl6-compiler is actually the correct list for the discussion...

Current state?

Herbert Snorrason wondered what state the Perl 6 compiler was in. (I think we all did to be honest, but Herbert got to ask the question.)

Patrick said that he and Luke (I think) are at the beginning stages of building a basic perl 6 grammar engine that compiles to parrot and handles some basic optimizations. In tandem with that, Patrick was also working on writing a Perl 6 grammar. The plan is (and no plan survives contact with the enemy) to get a working basic grammar engine up in the next month or so then use that to build the Perl 6 parser.

As soon as there's code, announcements will be made in all the right places (and who knows? maybe even on Slashdot).

If you want to help with the process of writing the grammar engine, your best bet (at least until there's some running code) is to snag the appropriate apocalypse and use that to write tests and post 'em to the list.

Meanwhile, in perl6-internals

Fix ops

Last week Leo had discussed unhooked ops without definite opcode numbers and asked for comments on what to do with them. Apparently Dan finds many of the boolean functions useful, so they're going to be kept, probably hidden behind a namespace, once we know how namespaces work.

Perl free configuration

Right now, Parrot's configuration system palms off a lot of the work to the local Perl installation by querying the perl config. Which isn't ideal in the long run. So Dan suggested that now's the time to make a list of what info configure.pl grabs from the perl installation and instead to write(appropriate) probing code ourselves so we can do perl-free configuration. The usual "autoconf!" "No! Metaconfig!" "No! Something else entirely!" discussion occurred. (For the record, Dan seems to favour the "something else entirely" option).

The bike shed remains unpainted.

Oh My. Minesweeper!

Jens Rieks pointed everyone at the shiny new examples/sdl/minesweeper/mines.imc. Speaking as a compulsive Minesweeper in recovery, this is a bad, bad thing.

Dynamic PMC libraries

Steve Fink posted a patch to Parrot's build system to make the process of building libraries of dynamic PMCs rather less platform specific. After a sanity check from Leo it got committed.

Semantics for regexes

The discussion of required semantics to implement a regular expression engine continued. Chip Salzenberg chipped in with some experience from the Topaz project (a brave project to reimplement Perl 5 in C++ that, sadly failed whilst teaching Chip a great deal). I confess that the thought of an open_up_and_say_aaah method on Perl Scalars delighted your summarizer.

Namespaces

(Am I the only person who wants to repeat Namespaces! with the same intonation used for "Monorail!" in the Simpsons?)

As Dan pointed out, Parrot needs namespaces, more than that, it needs them to be designed. So, like the excellent designer he is, he laid out his plan for namespaces and invited comments. And comments there were. We can probably expect a PDD soon. Ish.

Patrick R. Michaud Speaks!

Mark Patrick's words: There will be a Perl 6.

Now all that remains is to find out when.

mod_parrot progress

Jeff Horwitz updated everyone on his progress with updating (rewriting) a mod_parrot plugin for Apache 2.

Continuations and GC. Continued

They're baack! Just when you thought return continuations had been sorted, they're back and causing memory leaks.

No Autoconf, dammit!

Dan restated his position on using Autoconf in the parrot build system. It's really rather simple:

We're not using autoconf. Period.

In the discussion, Herbert Snorrason suggested we institute a "Rule One" for Dan. In other words, Dan is always right. Unless he changes his mind ("Rule Two"). Or is overridden by Larry ("The One True Rule One").

It works for me.

Constant strings

Dan posted a discussion of const_string, a Parrot function for making use of string constants from C. It's very useful, but it doesn't seem to be as useful as Dan would like it to be, so he's extended the API.

Personally, I'd like Symbols as first class data structures in Parrot, but not enough that I'm likely to implement the darned things.

Constant strings

Part two

Multiple languages clarification

Self-described newbie Richard Jolly asked for clarification on what mixing languages will look like. Dan pointed out that it hadn't actually been specified yet. Besides, the internals list didn't do syntax.

He then outlined what he thought the syntax would look like. He doubts that people will end up mixing languages in single source files though. Further discussion pretty much agreed with Dan, generally arguing that most uses of multiple languages would come from using other languages' libraries.

NCI and the running process image

Jeff "mod_parrot" Horwitz revived a thread from a year ago. He wanted to be able to use the running process image (say httpd) and grab its various API functions using dlfunc. Sadly, although this works for functions that Apache uses from shared libraries, it doesn't work for those functions that are statically linked into the binary. He wondered if this was a bug, or if he was going to have to stick to using the workaround he'd come up with.

It turns out that it isn't a bug, but it can be done by passing a NULL as the filename from which you're dynamically loading a function. Thanks Leo.

Meanwhile, in perl6-language

Synopsis 9, the discussion continues

Synopsis 9 covers Perl's data types. And, from such a small acorn grows one might oak of a thread. Some of the discussion was even about Perl's data types. There was also a snide remark about summaries and the correct pronunciation of "Z" — in revenge for which I plan to make sure that David Green is never mentioned in a Perl 6 Summary.

Oops.

Sub signatures without parens

Juerd wondered if the parentheses around the signatures of function/method were still strictly necessary. Larry explained that they were and why.

The reach of macros

Last week, Larry pointed out that even existing keywords could be overridden with macros. Piers Cawley pointed out that they'd be no fun if they didn't. Larry added the caveat that macros couldn't work with precompiled library code (which is a shame, but understandable). This thread developed into the (occasionally intemperate) discussion of Bundles that later migrated to perl6-compiler.

Iterators and for

David Wheeler wondered if the (Smalltalk by way of) Rubyish style of eschewing for in favour appropriate iterators and code blocks would eventually become good Perl 6 style. (I expect it will in my code). Larry didn't think it would, but pointed out that the syntax had been modified recently to make it relatively easy to work that way if you wanted to. Instead of writing, say:

@foo.each({...})

we'll be able to write

@foo.each:{...}

Announcements, Apologies, Acknowledgements

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send feedback or contributions to a getting Piers to OSCON 2005 fund.

The Perl Foundation

Perl 6 Development site

Or, you can check out my website.

Embedded Databases

The expression "Embedded Database" requires an explanation. A "database" is an application that allows the targeted retrieval of stored data -- a log-file is not a database. By "embedded" I mean a database that does not run in a separate process, but instead is directly linked ("embedded") into the application requiring access to the stored data. This is in contrast to more conventional database management systems (such as Postgres or Oracle), which run as a separate process, and to which the application connects using some form of Inter Process Communication (such as TCP/IP sockets, for instance).

The prime advantage of embedded database systems lies in their availability and ease of administration. Since the data is kept in ordinary files in the user's space, there is no need to obtain special permissions to connect to the database process or to obtain a database account. Furthermore, since embedded databases require nothing more than a normal library, they can be useful in constrained environments (think shared web hosting), where no "proper" database is available. They can even be linked to an application and shipped with it.

Sequential Access: Tie::File

The simplest database is still a flat file of records, separated by a record separator. If the separator is chosen to be the new line character "\n", as it usually is, then a record corresponds to a line.

Flat files allow only sequential access. This can be both a blessing and a curse: if the application primarily requires the lookup of individual records in arbitrary order, then flat files become impractical once the number of records becomes too large. On the other hand, if the typical access pattern is sequential for at least a substantial subset of records, then a flat file can be very practical.

Here is a typical (if a little construed) example: Assume that we have several thousand very long strings (say, each string having more than one million characters) and we want to determine whether any two strings are equal. To do so, we have to compare each string to all remaining ones, using two nested loops. However, the total amount of data is too large to be all kept in memory. On the other hand, we access all the records in a predictable, sequential order.

The Perl module Tie::File allows us to do so in a particularly convenient fashion: it ties a Perl array to the file in such a way that each record corresponds to an element of the array.

Here then is a code snippet that demonstrates the use of Tie::File to our string comparison problem introduced above:


use Tie::File;

tie @array, 'Tie::File', 'data.txt' or die "Could not tie to file: $!";

$len = scalar( @array );

for( $i=0; $i<$len; $i++ ) {
  $string1 = $array[$i];

  for( $j=$i; $j<$len; $j++ ) {
    $string2 = $array[$j];

    compare_strings( $string1, $string2 );
  }
}

untie @array;

sub compare_strings {
  my ( $s1, $s2 ) = @_;

  # do something smart...
}
     

There is no question that this program requires heavy disk access, and if we had additional information about the data in question, we should try hard to use it and avoid the brute force approach of comparing "everything to everything". However, if we have no choice, then Tie::File makes code like the above as efficient as can be: It does not attempt to load the entire file into memory, so that arbitrarily large files can be processed. The first time we iterate over the file, Tie::File builds up an index containing the offset of each record from the beginning of the file, so that subsequent accesses can go directly to the proper position in the file using seek(). Finally, accessed records are cached in memory and the cache size can be adjusted passing the memory option to the tie command.

However, any changes to the array (and its underlying file), in particular insertions and deletions in the middle, will always be slow, since all the subsequent records will have to be moved. If your application requires such capabilities, don't use Tie::File, use a Berkeley DB!

Fast Lookup: Berkeley DB

A Berkeley DB is almost the exact opposite of a flat file in regards to the access patterns it supports. Scanning through a large number of successive records is difficult if not entirely impossible, but reading, inserting, updating, and deleting random records is very, very fast!

Conceptually, a Berkeley DB is a hash, saved to a file. Each record is found by its key, and the value of the record is treated as a string. It is up to the application to interpret this string - for instance, to break it up into parts, representing different fields.

The following code opens a Berkeley DB (creating a new file, if necessary) and writes a single record, containing information suitable for the /etc/passwd file. We then look up the same record, using the login name as key, and split the data into its constituent fields.


use BerkeleyDB;

$login = 'root';
@info  = ( 0, 0, 'SuperUser', '/root', '/bin/tcsh' );

$db = new BerkeleyDB::Hash( -Filename => 'data.dbm',
			    -Flags => DB_CREATE ) or die "Cannot open file: $!";

$db->db_put( $login, join(':', @info) );

$db->db_get( $login, $val );

( $uid, $gid, $name, $home, $shell ) = split ':', $val;

print "$login:\t$uid|$gid $name \t$home\t$shell\n";

$db->db_close();
      

The code above uses the Perl interface to the Berkeley DB modelled after its C API. The BerkeleyDB module also supports a tie interface, tying the database file to a Perl hash, making code like the following possible: $hash{ $login } = join( ':', @info );. The API-style interface, on the other hand, follows the native interface of the Berkeley DB quite closely, which means that most of the (quite extensive) original C API documentation also applies when programming Perl - which is good, since the perldoc for BerkeleyDB is a bit sketchy in places. Try man db_intro to get started on the native Berkeley DB interface, or visit the Berkeley DB's homepage.

Complex data structures can be stored in a Berkeley DB provided they have been serialized properly. If a simple solution using join and split as demonstrated above is not sufficient, we can use modules such as Data::Dumper or (better) Storable to obtain a flat string representation suitable for storage in the Berkeley DB. There is also the MLDBM ("Mulit-Level DBM") module, which provides a convenient wrapper interface, bundling serialization and storage.

For suitable problems, solutions based on Berkeley DBs can work very well. However, there are a couple of problems. One is that the Berkeley DB must be available - it has to be installed separately (usually as /usr/lib/libdb.so). The bigger limitation is that although there are quite a few problems out there for which a single-key lookup is sufficient, for many applications such a simple access pattern will not work. This is when we require a relational database -- such as SQLite.

Complex Queries: SQLite

There are situations when neither of the aforementioned straightforward access patterns applies. Berkeley DBs are great if we know the key for the record - but what happens if we only know part of the key, or we are interested in a set of records, all matching some pattern or logical expression? Alas, the Berkeley DB does not allow wildcard searches. Another frequently arising problem occurs when each record has several fields or attributes and we want to be able to look up records by either one of them (such as being able to search a list of books by author as well as title). This can be achieved in a Berkeley DB using multiple lookups, but one can't help thinking that there got to be a better way! Finally, the simple key/value mapping of the Berkeley DB occasionally forces duplication of data, since all the relevant information has to be present in each record - this both wastes storage space and creates maintenance problems.

Relational databases address all these points. They permit wildcard and logical expression searches, multiple searchable columns, and joins to represent relationships. Unfortunately, search capabilities such as these used to be restricted to big, standalone relational database engines (such as Postgres or Oracle), often making them unavailable for the quick-and-dirty Perl project.

This is where the SQLite project comes in: Founded in 2000, SQLite tries to combine the convenience of the Berkeley DB architecture with the power of a relational (SQL) database. Its latest release is version 2.8.14, with the next major revision, version 3.0, currently in beta. SQLite implements most of standard SQL (with some exceptions, such as correlated subqueries).

However, with added power comes additional complexity: a relational database no longer ties neatly to one of Perl's built-in data structures. Instead, one has to use the DBI (DataBaseInterface) module and code the queries in SQL. Furthermore, since the structure of the data is no longer predetermined, the programmer has to define and construct the required tables explicitly.

This is not the place for a full discussion of the Perl-DBI, but the example below is enough to get started. First, SQLite has to be installed on the system. Conveniently, the DBI-driver for SQLite, DBD::SQLite, already contains the database engine itself as part of the module - so installing this module is all that is required to be able to use SQLite from Perl.

We begin by connecting to the SQLite database contained in the file "data.dbl". Note that the connect method loads the appropriate database driver - all we need to do is specify use DBI;. We create and populate two tables, and then run a query joining the two tables and using an SQL wildcard. The results are returned as a reference to an array. Each array element is an array itself, containing the column values in its elements.


use DBI;

$dbh = DBI->connect( "dbi:SQLite:data.dbl" ) || die "Cannot connect: $DBI::errstr";


$dbh->do( "CREATE TABLE authors ( lastname, firstname )" );
$dbh->do( "INSERT INTO authors VALUES ( 'Conway', 'Damian' ) " );
$dbh->do( "INSERT INTO authors VALUES ( 'Booch', 'Grady' ) " );

$dbh->do( "CREATE TABLE books ( title, author )" );
$dbh->do( "INSERT INTO books VALUES ( 'Object Oriented Perl',
                                          'Conway' ) " );
$dbh->do( "INSERT INTO books VALUES ( 'Object-Oriented Analysis and Design',
                                          'Booch' ) ");
$dbh->do( "INSERT INTO books VALUES ( 'Object Solutions', 'Booch' ) " );


$res = $dbh->selectall_arrayref( q( SELECT a.lastname, a.firstname, b.title
				    FROM books b, authors a
				    WHERE b.title like '%Orient%'
				    AND a.lastname = b.author ) );

foreach( @$res ) {
  print "$_->[0], $_->[1]:\t$_->[2]\n";
}

$dbh->disconnect;
      

The books table references the authors table via a foreign key. This is a one-to-many relationship: each book has only a single author, but an author can have written multiple books. What would we have done if books could have been co-authored by multiple writers? We would have used a cross-reference table to represent the resulting many-to-many relationship. Try that with Berkeley DBs!

SQLite shares with the Berkeley DB the approach of being mostly typeless and treating data as simple strings. In our table definitions above, we did not specify the data types of any of the columns, as would be required by Oracle or Postgres. SQLite provides a bit more structure, in that it provides separate columns, so that distinct values do not have to be glued together to form a string, as was required when using a Berkeley DB. Furthermore, SQLite will try to convert strings to floating point numbers if numerical comparisons or operations (such as SUM or AVG) are performed on them.

Since SQLite is a relational database, the data architecture (the "schema") can be arbitrarily complex, with data distributed across multiple tables and with foreign keys to represent relationships. This is a big topic -- any book on the relational model will provide plenty of information, including issues such as database normalization and tools such as Entity/Relationship-Models.

Summary

In this paper, we considered three ways to add database capabilities to a Perl program, without requiring major external database installations or administrative support. Depending on the typical access patterns, flat files or Berkeley DBs may be suitable choices. For more complex queries, the SQLite project provides all the power of relational database architecture in a single Perl DBD module.

Resources

This Week on Perl 6, Week Ending 2004-09-03

Another week, a free weekend, and still I haven't started writing the summary until Monday. Still, I don't actually start at college 'til next week, so that's all right then.

We start with perl6-internals.

Compile op with return values

The discussion of how to return something from dynamically compiled code continued with Leo, Dan and Steve Fink all working to clarify and address the issues.

Compile op with return values

Library loading

Dan started the process of nailing down Parrot's dynamic loading API so that it can be added to the embedding interface. Steve Fink and Aaron Sherman had suggestions.

Library loading

Pathological register allocation scenarios

Gregor N Purdy had asked Dan if his work compiler could be made to spit out structurally equivalent C code to the Parrot code that was breaking IMCC. His idea being that we could then see how C compilers dealt with such nastiness. Dan thought that, whilst this was a good idea, it would be too much work to implement. Gregor wasn't so sure.

Pathological Register Allocation Scenarios

Dan and Leo demonstrate comic timing. Again.

14:17:09 GMT Dan: PerlHash test 20 is failing? Anyone know what's up
so we can fix it?
15:30:41 GMT Leo: It stopped failing at 15:55 CEST (13:55 GMT)
16:32:29 GMT Dan: D'oh!

We love it when a patch comes together.

Failing perlhash test

PMC Instantiation

Leo had raised issues with the current scheme for PMC instantiation. This week Dan came through with some design which got discussed and (I think) implemented.

PMC Instantiation

Last bits of the basic math semantics

If you believe Barbie, "Math is hard". She's right, up to a point. The list's spent a couple of weeks now sorting out the design of Parrots underlying mathematical and numeric systems to make sure that maths works right (for sufficiently useful values of "right"). This particular line of discussion covers rotations and stuff, where you're actually treating a number as a bit field.

Last bits of the basic math semantics

Cross-compiling parrot

And you thought compiling Parrot on a Win32 box was hard. Robert Schwebel wants to cross compile Parrot and isn't having a good time. Dan wasn't surprised because the Parrot build process still gets most of its information from the local perl installation which will generally be wrong when you're cross compiling.

Dan noted that part of the problem is that we don't have people on the team with a need or the experience of doing cross compilation and added that he'd be thrilled if this were to change. Any patches to make things better for cross compilers will, of course, be gratefully received.

Cross-compiling Parrot?

Proposal for a new PMC layout and more

Leo's concerned that the current PMC layout isn't the Right Thing, and laid out a proposal describing some changes he thinks would be worthwhile. In essence, he argues for removing the requirement for fixed sized PMC headers and separate variable sized buffers in favour of unifying buffers and PMCs so that PMCs become variable sized, thus eliminating some time consuming indirection, and space consuming overheads.

Nicholas Clark thought the proposal was interesting, but said that, since the proposed changes would be invisible to the user, he'd be far happier with a functionally complete implementation of parrot with stable, useful APIs.

Dan rejected the proposal (both for technical reasons and because he agreed with Nicholas). I don't think Leo was convinced by the technical reasons, but the "Let's get the interfaces finished!" argument clinched it.

Leo's layout proposal

Dan explains why not

Semantics for regexes

Dan appears to have opened an entertaining can of worms when he outlined his view of the minimum string semantics required to support a regular expression engine and asked for comments. Boy did he get them. And boy did they run off in all sorts of strange directions. Interesting directions mind. Anyway, further down the thread, Dan, Chip Salzenburg and Patrick Michaud seemed to reach something approximating agreement about the low level semantics required.

Semantics for regexes

TODOs and calls for volunteers

Leo came up with a list of things that need fixing/implementing and asked for volunteers. These include sorting out what happens with the experimental ops, implementing new_extended for every PMC class and finishing up BigInt's MMD and vtable functions.

He also had some proposals for how we should get the Integer classes properly implemented now we know what the semantics will be.

TODOish fix ops

Takers wanted: new_extended

Takers wanted - BigInt/BigNum

Integer PMCs

Meanwhile, in perl6-language

Roles trying to be nice

Abhijit Mahabal had some questions about making roles work. Luke, Patrick, and Jonathan Scott Duff set about answering them. I'm not entirely sure that any of the answers so far are enough for Abhijit, but then I'm not entirely sure that any answer could be enough. At some point you have to take things on trust and hope that nothing breaks at runtime.

Roles trying to be nice

Pipeline performance

Rod Adams brought up some issues with the performance of "pipelining" arrays in Perl 5 — in general doing say grep {...} map {...} @ary is rather slower than writing an explicit loop. He wondered if Perl 6 would be faster. Larry's answer that all lists function lazily if they can in Perl 6 seems to imply that yes, Perl 6 will be faster.

Pipeline performance

Synopsis 9 draft 1

"Synopsis 9?" I hear you ask "But we haven't seen Apocalypse 9 yet!". Indeed we haven't, but that's not stopped Larry writing it. Synopsis 9 gives an overview of Perl 6's data structures (hopefully enough for anyone who happens to be starting work on a rough cut of a Perl 6 implementation) which will be covered in more detail when the Apocalypse itself comes out.

The usual storm of discussion and general proofreading goodness went on.

Synopsis 9 draft 1

The range operator

Joe Gottman wondered if there would be sufficiently lazy way to generate ranges. In particular, he wanted to know if he'd be able to write reverse (1..5) and have Perl handle that lazily, or if he could do 5 .. 1 :by(-1). Larry thought that, if range objects were made sufficiently smart, there would be no reason why the reverse approach couldn't be lazy.

Reverse .. operator

Can PRE and POST be removed from program flow?

John Siracusa wondered if it would be possible to turn off the PRE and POST Design By Contract hooks in production code to improve performance. (Leaving aside arguments about whether this is sensible; personally I reckon that a production environment is where you should be most worried about values that might fail the PRE/POST hooks). Larry reckoned it would be possible to simply clobber the global definitions of PRE and POST to make them no ops. This wasn't enough for John, who wanted to be able to get rid of them entirely so even the call to the no op didn't happen. So Damian showed him the obvious macros...

S4: Can PRE and POST be removed from program flow?

Announcements, Apologies, Acknowledgements

It looks like the Google groups gateway is working again, so I'll keep with the Google style linking.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send feedback or contributions to a getting Piers to OSCON 2005 fund.

The Perl Foundation

Perl 6 Development site

Or, you can check out my website.

Lightning Articles



Simon Cozens

Serendipity -- it means those occasions when things come together to give you moments of inspiration. While preparing perl.com one week, I was editing an article on how to give lightning talks by Mark Fowler and at the same time I was dealing with another author who said he was having difficulty stretching out an article -- a very good article, on a topic I wanted to see covered -- to a full 2,500-words-or-so length.

I then realized there were probably a load of people out there with interesting things to say about what they're doing with Perl, but who couldn't or didn't want to write a full-sized article. This is, after all, the principle that makes lightning talks so popular. Maybe we could create a forum where people could have short, informal articles published on an interesting Perl topic of their choice -- lightning articles.

In the same vein as lightning talks, they can be both about an interesting use of the technology, or social commentary on Perl and its community, or just a bit of fun. If you've got something you want to get off your chest, or you've got an interesting new idea you want to talk about, but don't think you could fill 2,500 words, try writing a lightning article. You have an absolute maximum of 500 words -- as measured by wc -w (or perl -0lne 'print scalar split/\s+/') on your POD or plain text file -- to say what you need to say.

Send them to chromatic@oreilly.com, and when we've got a good batch of five or six together, we'll publish them here.

Detecting Problems Automatically

Mark Fowler

How many times have you shipped something, and then as soon as it's gone live you've spotted a glaring mistake that, despite staring you in the face the entire time you were developing the code, you've somehow overlooked?

One thing we have problems with at work is double-encoded HTML entities in our web pages. We often use HTML entities to encode letters that aren't ASCII, since this way we then don't have to worry about the text encoding we're using.

For example, we want to render the string Hello Léon> into our HTML document. So, instead of including the é into our document directly we transform it into its entity form, replacing it with &eacute;:

  <p>Hello L&eacute;on</p>

This can be done automatically by Perl in numerous ways, e.g. with HTML::Entities:

  use HTML::Entities;
  
  my $string = "Hello $name";
  encode_entity($string);

Or if you're using the Template Toolkit through the html_entity filter:

  <p>Hello [% name | html_entity %]</p>

The root of the troubles we were experiencing was that entity encoding could occur in our code in multiple, different places depending on where the string we were about to render came from. And if we accidentally did it more than once then we ended up with HTML that looked like this:

   <p>Hello L&amp;eactue;on</p>

This means the web browser sees the &amp; and converts it to an ampersand and then renders the rest of the eacute; as normal text:

  Hello L&eacute;on

Not exactly what we wanted. Of course, these things are fairly trivial to fix once you spot them. The real problem we were having is that these errors kept repeatedly popping up, and having our testing department coming back to us every release with another one of these errors was getting embarrassing. We'd gone blind to the errors -- working so closely with the web site we'd glance at the page and not notice what should have been staring us in the face.

So we decided to automatically test for the problem.

In the end I decided to write Test::DoubleEncodedEntities, a Test::Builder module that would test for these errors and run under Test::Harness like all our other tests. The ok_dee function relies on the fact that none of our web sites would ever use strings like &amp;eacute; (this is true of most web sites - the only web sites that do are ones like this that feature articles on how to encode HTML).

  use LWP::Simple qw(get);
  use Test::More tests => 2;
  use Test::DoubleEncodedEntities;
  
  my $page = get "http://testserver/index.html";
  ok($page, "got page okay");
  ok_dee($page, "check for double encoded entities")

If the test fails then we get some useful diagnostic output:

  1..2
  ok 1 - got page okay
  not ok 2 - check for double encoded entities
  #     Failed test (t/website.t at line 7)
  # Found 1 "&amp;eacute;"
  # Looks like you failed 1 test of 2.

And now we don't even have to look for these things. Our test suite catches double-encoded entities for us and brings them to our attention. Problem solved.

CPAN, Miniaturized

Ricardo Signes

Everyone has seen a problem too boring to solve. Rather than keep a stiff upper lip and trudge forward, you head to the CPAN and find the pre-packaged solution that probably already exists. It's just another display of your complete impatience and laziness, and that's great: with the CPAN at your side, you can solve boring problems effortlessly.

The problem, of course, is that CPAN isn't always at your side. Sure, a simple install Class::DBI might be enough to implement half of your project, but when you're offline and stuck on the plane, good luck getting to your usual mirror. I've found myself in that position a number of times, and usually when I've most wanted to get some work done. On the way home from conventions, I've sat in cars and planes, wishing I'd remembered to install Test::Smoke or DBD::SQLite before leaving home.

The solution, of course, is to just mirror the whole CPAN. It's only three gigs, and if you've got a week to spare on your dial-up, that's just fine. After all, later rsyncs are just a few hours a week!

Other problems loom, not the least of which is the possibility of losing those three gigs when your drive crashes on the road. You can always back up the mirror to a DVD in case you need to re-mirror it quickly... but by this point the solution to your problem has become tedious, and I know how you feel about solving tedious problems.

A better solution to this problem was published a few years ago by Randal Schwartz: mini-CPAN. Its guiding principle is an old programmer standard: "You aren't going to need 90 percent of that crap."

Randal's script only retrieves the CPAN indices, and then the modules listed in the package index -- basically, only the newest version of every distribution -- the only files ever used by CPAN.pm and CPANPLUS to install modules. On subsequent runs, only those distributions that have changed are updated.

With this miniature CPAN, you've cut CPAN down to about 400 MB. Not only does it take a fraction of the time to mirror, but it fits nicely onto a CD. You can stick it in your bag, right behind your rescue disk, and know that no matter what happens, the CPAN will be right by your side.

With the script configured and run, you'll have your own personal CPAN sitting on your machine, ready to be used. Pointing CPAN.pm at it is easy as can be:

        cpan> o conf urllist unshift file:///home/japh/minicpan

Alternately, just edit your CPAN::Config or CPANPLUS::Config.

The only problem left with mini-CPAN is that it was so hard to find. It's been such a fantastic addition to my toolbox that I feel slighted, having spent two years oblivious to its existence. To help others avoid this pain, I've tweaked the script, shoved its guts into a module, and stuck it onto CPAN. Just by installing CPAN::Mini, you can have minicpan dropped into place and ready to go:

 minicpan -r http://your.favorite.mirror/of/cpan -l /home/japh/minicpan

...and your personal CPAN is waiting.

Bit-twiddling in your Database

Aaron Trevena

Why would you use a bit mask in your database ?

They can be useful where you have a many-to-many relationship where one side changes rarely and the other frequently. A good example is facilities provided by a Tourist Resort, where the actual amenities change rarely but resorts are added and updated often.

Normalization would add an intermediate table between them, but that can be painfully slow if you have a significant number of amenities and frequently queried records about many of them.

The same problem can occur to a lesser degree within a single table; perhaps you are doing some statistical analysis on demographics with columns representing gender, marital status, etc. For more than a few thousand records, querying subsets of the table based on these criteria can become expensive very quickly.

How would you use bit masks?

Instead of holding information in a separate table or a group of columns, use a bit mask with each bit representing a column in the table or a record in a second table. For example, use 8 bits to represent gender, marital status, employment status, smoker, drinker, driver, car-owner, and house-owner. A query to find drivers under 25 who don't drink or smoke and own their own car contains six conditions:

    select id, age from people where age < 25 and employment_status =
    'employed' and smoker = 'N' and drinker = 'N' and car_owner = 'Y'
    and driver = 'Y'

While a bitmap would use two:

    select id, age from people where age < 25 where bitmap_col & 00000110

To allow employment status to have values for child, student, unemployed, employed, and retired, you would add extra bits representing each value. This applies to any column with a low number of potential values.

This is a simple bitmap index, and you can use all kinds of bitwise operations on your queries against it. Complex queries can be simplified to a couple of binary operations. Of course, there is a downside. It's harder to maintain, and if mapping a many-to-many relationship, you need to ensure consistency between the bit mask order and the contents of the other table. This can be enforced using a trigger or within the application.

If you split the bitmap into individual columns and rotate it you can make a compressed bitmap index that only stores ranges of records that are true for each column or value of a column. Oracle provides this feature.

These compressed bitmap indexes are even faster to query when used together in combination, and take up very little space. However, as multiple records' values can be held in a single entry in the index, updates to indexed columns can be slower and suffer from excessive locking on the index itself.

Well-designed bitmap indices can have a huge impact on the performance of queries, as they are much smaller than standard b-tree indices and much faster on queries where a large proportion of that dataset is being queried.

Goodbye, Farewell, Amen

Simon Cozens

Let me first apologize for this personal note; I won't do it again.

Around the time I was busy being born, or maybe just a little after, Larry Wall was planning to be a missionary with the Wycliffe Bible Translators. Working on a degree in "Natural and Artificial Languages," and then in linguistics graduate school, he learned about the ideas of tagmemics, semantics, and the cultural and anthropological concepts that have found their expression in the Perl language and culture. Unfortunately due to health problems, Larry couldn't do the missionary thing, so he invented Perl. That's why we're all here.

It was the beginning of May 2001 when Mark Dominus asked me if I'd be interesting in taking over the job of managing editor here at perl.com. I was delighted, and excited about the thought of working with the Perl community and hopefully producing and publishing some great resources for Perl programmers. I hope I've done a fair bit of that over the last three years, but now my time is up. I'm moving on, and starting next week the man simply known as "chromatic" will become Mr. Perl.com. Please treat him as well as you've treated me!

I need to thank a bunch of people, who've done all the hard work behind the scenes that you don't hear about: Mark, of course, for getting me involved here; Chris Coleman and Bruce Stewart from the O'Reilly Network who've had the curious experience of trying to manage me; Steve McCannell and Chris Valdez have been the producers of perl.com, and worked incessantly to get articles up on the site, often on shockingly short notice; Tara McGoldrick and others have been the copy editors; and of course, I've worked with a wide range of great authors and contributors. Thank you all. And thanks, of course, to the Perl community -- that's you -- without whom this wouldn't be half as much fun.

And about that missionary thing? Well, if Larry's not going to be able to do it, someone has to. Like many Perl programmers, and indeed Larry himself, I've been interested in Japan for a very long time. In fact, I lived in Japan for a year, and was studying Japanese for my university major back when I started at perl.com; last year I decided that the time was right to prepare to move back to Japan, as a fulltime missionary.

So in two weeks I'll be going to All Nations University here in England for a two-year course to get me ready, and then I shall be off! I'm sure you won't have heard the last of me, though, and I certainly won't be stopping programming -- missionaries have things they need to automate too... But for now, farewell! It's been fun, and now it's going to be fun in a different way.

Take care out there, and happy hacking!

This Week on Perl 6, Week Ending 2004-08-27

Where does the time go? I blame folk festivals. Once I'm getting busy with the teacher training I'm going to have weekends free to write the summaries. And if you'll believe that...

We start with perl6-internals.

Incremental Garbage Collection

Discussion of implementing incremental garbage collection continued.

Calling Conventions for Un-prototyped Subroutines

Mattia Barbon asked for some clarification of how the calling conventions work for unprototyped subroutines in IMCC. Specifically, whether IMCC was behaving as documented in the Parrot Design Document (PDD3). Leo reassured him that the TODO entry that had caused his disquiet was obsolete and that IMCC does the right thing.

http://groups.google.com/groups?threadm=Mahogany-0.66.0-1060-20040821-182506.00@rbnet.it

Compile op with Return Values

Steve Fink asked how to go about returning something from dynamically compiled code. He, Leo, and Dan went over the issues. An unnamed (and as yet unvolunteered) volunteer will probably write it all up and add it to the parrot compiler FAQ.

http://groups.google.com/groups?threadm=20040822060343.GI563@foxglove

NCI and Callback Functions

Stephane Peiry and Leo continued to work on getting NCI callbacks working with the GTK library. They finally tracked the problem down, but it looks awfully like Stephane's going to have to re-implement the gtk_main loop to get everything working. Best of luck Stephane. ParrotGTK would definitely be a cool thing.

http://groups.google.com/groups?threadm=20040822164755.GA23921@pittypanda

Planet Parrot

Robert Spier announced the creation of Planet Parrot, an aggregator of Parrot related blogs. If you have such a blog, Robert would probably be interested in hearing from you.

http://groups.google.com/groups?threadm=4129706F.6090304@pobox.com

http://planet.parrotcode.org/

GC/DOD API

Leo and Dan continued to discuss and work on documenting Parrot's memory management API.

http://groups.google.com/groups?threadm=a0611040cbd4f9cce3f0b@[172.24.10.164]

NCI GC Issues

Dan noticed a problem with garbage collection and NCI PMCs. Leo tracked the problem down and fixed it.

I love it when a plan comes together.

http://groups.google.com/groups?threadm=200408271316.i7RDGIE02447@thu8.leo.home

Opinions on Base Behavior

Base behavior is disgusting and should be stamped out I tell you. We have a moral duty to...

Ah...

Actually, the subject refers to Dan and Leo's desire to nail down the behavior of Parrot's basic types. He outlined the issues and asked for discussion. And that's what he got.

It looks like Parrot's going to have a reasonably complete set of numeric types which will allow accuracy nerds to avoid reals for as long as possible. Which will be nice.

http://groups.google.com/groups?threadm=a06110403bd502ac7855b@[172.24.10.164]

http://groups.google.com/groups?threadm=a0611040ebd51295d034c@[172.24.10.164]

http://groups.google.com/groups?threadm=a06110416bd52357268b0@[172.24.10.164]

http://groups.google.com/groups?threadm=a0611041abd52737df346@[172.24.10.164]

http://groups.google.com/groups?threadm=a06110400bd53ecfa74a4@[10.0.1.2]

Concat, Cows, & Basil

Comfortably winning the "weirdest subject line of the week" award, Sam Phillips laid out some of the problems he's having implementing a Basil compiler that targets Parrot. Leo supplied answers.

http://groups.google.com/groups?threadm=412B2D64.7000204@mac.com

Low-level Math[s] op Behavior

As well as discussing basic types, Dan opened a discussion of the proper semantics of Parrot's low-level maths (yeah, he spelt it "math," but I'm English and I'll spell it correctly) operators. In particular he proposed variants that throw exceptions if strange things happen.

http://groups.google.com/groups?threadm=a06110409bd5107fa798b@[172.24.10.164]

Benchmark Stuff

 Joshua Gatcomb posted some patches he'd made to F<parrotbench.pl> to
 make it a little less fragile and more useful. A few days later he
 announced a slightly more finished set of patches (Leo had checked his
 original patches in rather sooner than expected). 

http://groups.google.com/groups?threadm=20040824182920.63794.qmail@web60804.mail.yahoo.com

http://groups.google.com/groups?threadm=20040825225912.13884.qmail@web60809.mail.yahoo.com

Resuscitating a mod_parrot

Dan Sugalski looked upon mod_parrot and saw that it was cool, to the point of being laid out on a mortuary slab. He thought that the time had come to try and put some life back into the project, now that Parrot's so much more capable. He asked for volunteers and Jeff Horwitz seems to be (the only?) one of 'em.

http://groups.google.com/groups?threadm=a0611041dbd52934064f5@[172.24.10.164]

PMC Instantiation

Leo posted another of his discussion documents, this time on instantiating PMCs. No response so far.

http://groups.google.com/groups?threadm=412D9577.4090409@toetsch.at

Meanwhile, in perl6-language

Return with no Expression

In the usual freewheeling perl6-language way, a thread entitled "Return with no Expression" ended up with a discussion of how to disambiguate a ::, which is part of ... ?? ... :: ... expression and :: in its role as module sigil. (According to Luke it's easy; white space is required.)

http://groups.google.com/groups?threadm=20040820162108.GA18201@wall.org

Synopses Discussed

Discussion of the draft synopses continued.

http://groups.google.com/groups?threadm=BD4C401F.1E983%siracusa@mindspring.com - Synopsis 2

POD Tables?

Aaron Sherman wondered if Perl 6's incarnation of POD would have support for tables. Luke wasn't overly keen on Aaron's proposed syntax, and Larry pointed out that they'd managed to use Perl 5's POD and some clever processing to write tables for the third edition of Programming Perl. Once again, it seems that everyone agrees that POD is great, but could be greater, but nobody can agree on how to make it greater. Me, I think Larry got it right enough.

http://groups.google.com/groups?threadm=4126D251.9070105@ajs.com

Constructors and Mixins

Ovid had some questions about how Constructors work in Perl 6 and about how to disambiguate methods when using Roles. Larry and Damian provided answers.

http://groups.google.com/groups?threadm=20040822150448.25180.qmail@web60804.mail.yahoo.com

Instantiation

Instantiation was on Aaron Sherman's mind, too. He proposed some syntactic sugar for creating objects at "use" time. For some reason, syntactic sugar proposals do seem to create lots of discussion.

http://groups.google.com/groups?threadm=1093279983.1171.104.camel@pps

Announcements, Apologies, Acknowledgements

I am reliably informed that a lot of last week's links were broken. Which is odd because the code to generate them hasn't changed. It looks like the problem's down to the gateway between nntp.perl.org and groups.google.com being slightly radished at present.

However, I'm going to stick with the Google links in this summary in the hopes that normal service will be returned shortly and things will get back-filled. If anyone has a handy tool for getting from a message id to an archive URL that works then I'd be happy to hear from them.

If you find these summaries useful or enjoyable, please consider contributing to the Perl Foundation to help support the development of Perl. You might also like to send feedback or contributions to a "Getting Piers to OSCON 2005" fund to mailto:pdcawley@bofh.org.uk.

http://donate.perl-foundation.org/ -- The Perl Foundation

http://dev.perl.org/perl6/ -- Perl 6 Development site

Amazingly, there's a small amount of newer content on my web site this week. Will wonders never cease?

http://www.bofh.org.uk/

Visit the home of the Perl programming language: Perl.org

Sponsored by

Monthly Archives

Powered by Movable Type 5.13-en