Lexing and Parsing Continued

This article is the second part of a series which started with An Overview of Lexing and Parsing. That article aimed to discuss lexing and parsing in general terms, while trying to minimize the amount on how to actually use Marpa::R2 to do the work. In the end, however, it did have quite a few specifics. This article has yet more detail with regard to working with both a lexer and a parser. BTW, Marpa's blog

(For more information, see the Marpa blog or download the example files for this article.)

Brief Recap: The Two Grammars

Article 1 defined the first sub-grammar as the one which identifies tokens and the second sub-grammar as the one which specifies which combinations of tokens are legal within the target language. As I use these terms, the lexer implements the first sub-grammar and the parser implements the second.

Some Context

Consider this image:

It's actually a copy of the image of a manual page for Graph::Easy. Note: My module Graph::Easy::Marpa is a complete re-write of Graph::Easy. After I offered to take over maintenance of the latter, I found the code so complex I literally couldn't understand any of it.

There are three ways (of interest to us) to specify the contents of this image:

  • As a Perl program using the GraphViz2 module
  • This is teamwork.pl:

        #!/usr/bin/env perl
    
        use strict;
        use warnings;
    
        use GraphViz2;
    
        # ---------------
    
        my($graph) = GraphViz2 -> new
            (
            graph  => {rankdir => 'LR'},
            edge   => {color => 'grey'},
            logger => '',
            );
    
        $graph -> default_node(fontsize => '12pt', shape => 'rectangle',
                               style    => 'filled, solid');
    
        $graph -> add_node(name => 'Teamwork', fillcolor => 'red');
        $graph -> add_node(name => 'Victory',  fillcolor => 'yellow');
    
        # The dir attribute makes the arrowhead appear.
    
        $graph -> add_edge(dir => 'forward', from  => 'Teamwork',
                           to  => 'Victory', label => 'is the key to');
    
        my($format)      = shift || 'svg';
        my($output_file) = shift || "teamwork.$format";
    
        $graph -> run(format => $format, output_file => $output_file);
  • As a Graphviz DOT file written in a little language
  • This approach uses the Graph::Easy language invented by the author (Tels) of the Graph::Easy Perl module. Call this teamwork.easy. It's actually input for Graph::Easy::Marpa:

        graph {rankdir: LR}
        node {fontsize: 12pt; shape: rectangle; style: filled, solid}
        [Teamwork]{fillcolor: yellow}
        -> {label: is the key to}
        [Victory]{fillcolor: red}

    Note: In some rare cases, the syntax supported by Graph::Easy::Marpa will not be exactly identical to the syntax supported by the original Graph::Easy.

  • As a DOT file
  • Call this teamwork.dot:

        digraph Perl
        {
        graph [ rankdir="LR" ]
        node  [ fontsize="12pt" shape="rectangle" style="filled, solid" ]
        edge  [ color="grey" ]
        "Teamwork" [ fillcolor="yellow" ]
        "Victory"  [ fillcolor="red" ]
        "Teamwork" -> "Victory" [ label="is the key to" ]
        }

This article is about using GraphViz2::Marpa to parse DOT files.

Of course the Graphviz package itself provides a set of programs which parse DOT files in order to render them into many different formats. Why then would someone write a new parser for DOT? One reason is to practice your Marpa skills. Another is, perhaps, to write an on-line editor for Graphviz files.

Alternately you might provide add-on services to the Graphviz package. For instance, some users might want to find all clusters of nodes, where a cluster is a set of nodes connected to each other, but not connected to any nodes outside the cluster. Yet other uses might want to find all paths of a given length emanating from a given node.

I myself have written algorithms which provide these last two features. See the module GraphViz2::Marpa::PathUtils and the PathUtils demo page.

But back to using Marpa::R2 from within GraphViz2::Marpa.

Scripts for Testing

The code being developed obviously needs to be tested thoroughly, because any such little language has many ways to get things right and a horrendously large number of ways to get things slightly wrong, or worse. Luckily, because graphs specified in DOT can be very brief, it's a simple matter to make up many samples. Further, other more complex samples can be copied from the Graphviz distro's graphs/directed/ and graphs/undirected/ directories.

The GraphViz2::Marpa distro I developed includes with 86 data/*.gv (input) files and the 79 corresponding data/*.lex, data/*.parse, data/*.rend, and html/*.svg (output) files. The missing files are due to deliberate errors in the input files, so they do not have output files. The distribution also includes obvious scripts such as lex.pl (lex a file), parse.pl (parse a lexed file), rend.pl (render a parsed file back into DOT), and one named vaguely after the package, g2m.pl, which runs the lexer and the parser.

Why a rend.pl? If the code can't reconstruct the input DOT file, something got lost in translation....

The distribution also includes scripts which operate on a set of files.

  • data/*.gv -> dot2lex.pl -> runs lex.pl once per file -> data/*.lex (CSV files).
  • data/*.lex -> lex2parse.pl -> runs parse.pl once per file -> data/*.parse (CSV files).
  • data/*.parse -> parse2rend.pl -> runs rend.pl once per file -> data/*.rend (dot files).
  • data/*.rend -> rend2svg.pl -> html/*.svg.

Finally, generate.demo.pl creates the GraphViz2 Marpa demo page.

Normal users will use g2m.pl exclusively. The other scripts help developers with testing. See the GraphViz2 Marpa scripts documentation for more information.

Some Modules

GraphViz2::Marpa::Lexer::DFA is a wrapper around Set::FA::Element. It has various tasks to do:

  • Process the State Transition Table (STT)
  • The STT comes in via GraphViz2::Marpa::Lexer, which produced it from within its own source code or an external CSV file. The lexer has already validated the structure of the STT's data.

  • Transform the STT from the input form (spreadsheet/CSV file) into what Set::FA::Element expects
  • Set up the logger
  • In truth, it gets this from its caller, GraphViz2::Marpa::Lexer.

  • Provide the code for all the functions which handle enter-state and exit-state events
  • This is the code which can apply checking above and beyond what was built into the set of regexps which came from the spreadsheet. For example, I could have used Perl's \w instead of /[a-zA-Z_][a-zA-Z_0-9]*/ to find alphanumeric tokens, and at this point rejected those starting with a digit, because DOT imposes that restriction.

    Most importantly, this code stockpiles the tokens themselves with metadata to identify the type of each token (hence the two columns in the upcoming sample data/27.lex just below).

  • Run the DFA
  • Check the result of that run
  • Did the DFA end up in an accepting state? Yes is okay and no is an error.

Here is some sample data which ships with GraphViz2::Marpa, formatted for maximum clarity:

  • A DOT file, data/27.gv, which is input to the lexer:
  •     digraph graph_27
        {
            node_27_1
            [
                color     = red
                fontcolor = green
            ]
            node_27_2
            [
                color     = green
                fontcolor = red
            ]
            node_27_1 -> node_27_2
        }
  • A token file, data/27.lex, which is output from the lexer:
  •     "type","value"
        strict          , "no"
        digraph         , "yes"
        graph_id        , "graph_27"
        start_scope     , "1"
        node_id         , "node_27_1"
        open_bracket    , "["
        attribute_id    , "color"
        equals          , "="
        attribute_value , "red"
        attribute_id    , "fontcolor"
        equals          , "="
        attribute_value , "green"
        close_bracket   , "]"
        node_id         , "node_27_2"
        open_bracket    , "["
        attribute_id    , "color"
        equals          , "="
        attribute_value , "green"
        attribute_id    , "fontcolor"
        equals          , "="
        attribute_value , "red"
        close_bracket   , "]"
        node_id         , "node_27_1"
        edge_id         , "->"
        node_id         , "node_27_2"
        end_scope       , "1"

You can see the details on the GraphViz2 Marpa demo page.

Some Notes on the STT

Firstly, note that the code allows whole-line comments (matching m!^(?:#|//)!. These lines are discarded when the input file is read, and so do not appear in the STT.

Working With An Incomplete BNF

Suppose you've gone to all of the work to find or create a BNF (Backus-Naur Form) grammar for your input language. You might encounter the situation where you can use BNF to specify your language in general, but not precisely in every situation. DOT is one offender.

DOT IDs can be surrounded by double-quotes, and in some case must be surrounded by double-quotes. To be more specific, if we regard an attribute declaration to be of the form key=value, both the key and the value can be embedded in double-quotes, and sometimes the value must be.

Even worse, IDs can be attributes. For instance, you might want your font color to be green. That appears to be simple, but note: attributes must be attached to some component of the graph, something like node_27_1 [fontcolor = green].

Here's the pain. In DOT, the thing to which the attribute belongs may be omitted as implied. That is, the name of the thing's owner is optional. For instance, you might want a graph which is six inches square. Here's how you can specify that requirement:

  • graph [size = "6,6"]
  • The double-quotes are mandatory in this context.

  • [size = "6,6"]
  • This also works.

  • size = "6,6"
  • So does this.

But wait, there's more! The value of the attribute can be omitted, if it's true. Hence the distro, and the demo page, have a set of tests, called data/42.*.gv, which test that feature of the code. Grrrr.

All this means is that when the terminator of the attribute declaration (in the input stream) is detected, and we switch states from within-attribute to after-attribute, the code which emits output from the lexer has to have some knowledge of what the hell is going on, so that it can pretend it received the first of these three forms even if it received the second or third form. It must output graph (the graph as a whole) as the owner of the attribute in question.

As you've seen, any attribute declaration can contain a set of attribute key=value pairs as in node_27_2 [ color = green fontcolor = red ].

You can't solve this with regexps, unless you have amazing superpowers and don't care if anyone else can maintain your code. Instead, be prepared to add code in two places:

  • At switch of state
  • After all input has been parsed
  • Indeed, GraphViz2::Marpa::Lexer::DFA contains a long sub, _clean_up(), which repeatedly fiddles the array of detected tokens, fixing up the list before it's fit to inflict on the parser.

Understanding the State Transition Table

I included this diagram in the first article:

             DOT's Grammar
                  |
                  V
        ---------------------
        |                   |
     strict                 |
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
     digraph     or       graph
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
       $id                  |
        |                   |
        ---------------------
                  |
                  V
                {...}

A dot file starts with an optional strict, followed by either graph or digraph. (Here di stands for directed, meaning edges between nodes have arrowheads on them. Yes, there are many attributes which can be attached to edges. See http://www.graphviz.org/content/attrs and look for an E (edge) in the second column of the table of attribute names).

The /(di|)graph/ is in turn followed by an optional ID.

Line One

From the first non-heading line of the STT, you can see how I ended up with:

  • A start state flag => Yes
  • The state defined on this line is the start state.

  • An accept flag => ''
  • Because the initial state can't be an accepting start, this column--in the row--must be empty.

    Later STT states have Yes in this column.

  • A state name => initial
  • I chose to call it initial. Other people call it start.

  • An event => strict
  • This--although it might not yet be clear--is actually a regexp, /(?:strict)/. The DFA adds the do-not-capture prefix /(?: and suffix )/.

  • A next state => graph
  • This is the state to which to jump if a match occurs. Here, match means a match between the regexp (event) in the previous column and the head of the input stream.

  • An entry function => ''
  • I don't use it here, but if I did, it would mean to call a particular function when entering the named state.

  • An exit function => save_prefix
  • Similar to the entry function, this says to call a particular function when exiting the named state.

  • A regexp => (?:"[^"]*"|<\s*<.*?\s*>|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*))>
  • I can save a set of regexps in this column and use spreadsheet formula elsewhere to refer to them.

  • An interpretation => ID
  • These are my notes to myself. This one says that regexp in the previous column specifies what an ID must match in the DOT language.

Line Two

The event in the second line, /(?:graph|digraph)/, indicates what to do if the strict is absent from the input stream.

To clarify this point, recall that the DFA matches entries in the Event column one at a time, from the first listed against the name of the state--here, /(?:strict)/--down to the last for the given state--here, /(?:\s+)/. The first regexp to match wins, in that the first regexp to match will triggersthe exit state logic and that its entry in the Next column specifies the next state to enter, which in turn specifies the (next) state's entry function to call, if any.

If strict is not at the head of the input stream, and it can definitely be absent, as is seen in the above diagram, this regexp--/(?:graph|digraph)/--is the next one tested by the DFA's logic.

Line Three

The hard-to-read regexp \/\*.*\*\/ says to skip C-language-style multi-line (/* ... */) comments. The skip takes place because the Next state is initial, the current state. In other words, discard any text at the head of the input stream which this regexp will gobble.

Why does it get discarded? That's the way Set::FA::Element operates. Looping within a state does not trigger the exit-state and enter-state functions, and so there is no opportunity to stockpile the matched text. That's good in this case. There's no reason to save it, because it's a comment.

Think about the implications for a moment. Once the code has discarded a comment (or anything else), you can never recreate the verbatim input stream from the stockpiled text. Hence you should only discard something once you fully understand the consequences. If you're parsing code to execute it (whatever that means), fine. If you're writing a pretty printer or indenter, you cannot discard comments.

Lastly, we can say this regexp is used often, meaning we accept such comments at many places in the input stream.

Line Four

The regexp \s+ says to skip spaces (in front of or between interesting tokens). As with the previous line, we skip to the very same state.

This state has four regexps attached to it.

More States

Re-examining the STT shows two introductory states, for input with and without a (leading) strict. I've called these states by the arbitrary names initial and graph.

If the initial strict is present, state initial handles it (in the exit function) and jumps to state graph to handle what comes next. If, however, strict is absent, state initial still handles the input, but then jumps to state graph_id.

A (repeated) word of warning about Set::FA::Element. A loop within a state does not trigger the exit-state and enter-state functions. Sometimes this can actually be rather unfortunate. You can see elsewhere in the STT where I have had to use pairs of almost-identically named states (such as statement_list_1 and statement_list_2), and designed the logic to rock the STT back and forth between them, just to allow the state machine to gobble up certain input sequences. You may have to use this technique yourself. Be aware of it.

Proceeding in this fashion, driven by the BNF of the input language, eventually you can construct the whole STT. Each time a new enter-state or exit-state function is needed, write the code, then run a small demo to test it. There is no substitute for that testing.

The graph State

You reach this state simply by the absence of a leading strict in the input stream. Apart from not bothering to cater for comments (as did the initial state), this state is really the same as the initial state.

A few paragraphs back I warned about a feature designed into Set::FA::Element, looping within a state. That fact is why the graph state exists. If the initial state could have looped to itself upon detecting strict, and executed the exit or entry functions, there would be no need for the graph state.

The graph_id State

Next, look for an optional graph id, at the current head of the input stream (because anything which matched previously is gone).

Here's the first use of a formula: Cell H2 contains (?:"[^"]*"|<[^>]*>|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*)). This accepts a double-quoted ID, or an ID quoted with < and >, or an alphanumeric (but not starting with a digit) ID, or a number.

When the code sees such a token, it jumps to the open_brace state, meaning the very next non-whitespace character had better (barring comments) be a {, or there's an error, so the code will die. Otherwise, it accepts { without an ID and jumps to the statement_list_1 state, or discards comments and spaces by looping within the graph_id state.

The Remaining States

What follows in the STT gets complex, but in reality is more of the same. Several things should be clear by now:

  • The development of the STT is iterative
  • You need lots of tiny but different test data files, to test these steps
  • You need quite a lot of patience, which, unfortunately, can't be downloaded from the internet...

Lexer Actions (Callbacks)

Matching something with a DFA only makes sense if you can capture the matched text for processing. Hence the use of state-exit and state-entry callback functions. In these functions, you must decide what text to output for each recognized input token.

To help with this, I use a method called items(), accessed in each function via $myself. This method manages an stack (array) of items of type Set::Array. Each element in this array is a hashref:

    {
        count => $integer, # 1 .. N.
        name  => '',       # Unused.
        type  => $string,  # The type of the token.
        value => $value,   # The value from the input stream.
    }

Whenever a token is recognized, push a new item onto the stack. The value of the type string is the result of the DFA's work identifying the token. This identification process uses the first of the two sub-grammars mentioned in the first article.

A long Exit-state Function

The save_prefix function looks like:

    # Warning: This is a function (i.e. not a method).

    sub save_prefix
    {
        my($dfa)   = @_;
        my($match) = trim($dfa -> match);

        # Upon each call, $match will be 1 of:
        # * strict.
        # * digraph.
        # * graph.

        # Note: Because this is a function, $myself is a global alias to $self.

        $myself -> log(debug => "save_prefix($match)");

        # Input     => Output (a new item, i.e. a hashref):
        # o strict  => {name => strict,  value => yes}.
        # o digraph => {name => digraph, value => yes}.
        # o graph   => {name => digraph, value => no}.

        if ($match eq 'strict')
        {
            $myself -> new_item($match, 'yes');
        }
        else
        {
            # If the first token is '(di)graph' (i.e. there was no 'strict'),
            # jam a 'strict' into the output stream.

            if ($myself -> items -> length == 0) # Output stream is empty.
            {
                $myself -> new_item('strict', 'no');
            }

            $myself -> new_item('digraph', $match eq 'digraph' ? 'yes' : 'no');
        }

    } # End of save_prefix.

A tiny Exit-state Function

Here's one of the shorter exit functions, attached in the STT to the open_brace and start_statement states:

    sub start_statements
    {
        my($dfa) = @_;

        $myself -> new_item('open_brace', $myself -> increment_brace_count);

    } # End of start_statements.

The code to push a new item onto the stack is just:

    sub new_item
    {
        my($self, $type, $value) = @_;

        $self -> items -> push
            ({
                count => $self -> increment_item_count,
                name  => '',
                type  => $type,
                value => $value,
             });

    } # End of new_item.

Using Marpa in the Lexer

Yes, you can use Marpa in the lexer, as discussed in the first article. I prefer to use a spreadsheet full of regexps--but enough of the lexer. It's time to discuss the parser.

The Parser's Structure

The parser incorporates the second sub-grammar and uses Marpa::R2 to validate the output from the lexer against this grammar. The parser's structure is very similar to that of the lexer:

  • Initialize using the parameters to new()
  • Declare the grammar
  • Run Marpa
  • Save the output

Marpa Actions (Callbacks)

As with the lexer, the parser works via callbacks, which are functions named within the grammar and called by Marpa::R2 whenever the input sequence of lexed items matches some component of the grammar. Consider these four rule descriptors in the grammar declared in GraphViz2::Marpa::Parser's grammar() method:

    [
    ...
    {   # Prolog stuff.
        lhs => 'prolog_definition',
        rhs => [qw/strict_definition digraph_definition graph_id_definition/],
    },
    {
        lhs    => 'strict_definition',
        rhs    => [qw/strict/],
        action => 'strict', # <== Callback.
    },
    {
        lhs    => 'digraph_definition',
        rhs    => [qw/digraph/],
        action => 'digraph', # <== Callback.
    },
    {
        lhs    => 'graph_id_definition',
        rhs    => [qw/graph_id/],
        action => 'graph_id', # <== Callback. See sub graph_id() just below.
    },
    ...
    ]

In each case the lhs is a name I've chosen so that I can refer to each rule descriptor in other rule descriptors. That's how I chain rules together to make a tree structure. (See the Chains and Trees section of the previous article.)

This grammar fragment expects the input stream of items from the lexer to consist (at the start of the stream, actually) of three components: a strict thingy, a digraph thingy, and a graph_id thingy. Because I wrote the lexer, I can ensure that this is exactly what the lexer produces.

To emphasise, the grammar says that these the items are the only things it will accept at this point in the input stream, and that only if they are in the given order, and that they must literally consist of the three tokens (see rhs): strict, digraph and graph_id.

These latter three come from the type key in the array of hashrefs built by the lexer. The three corresponding value keys in those hashrefs are yes or no for strict, yes or no for digraph, and an id or the empty string for graph_id.

As with the lexer, when in incoming token (type) matches expectations, Marpa::R2 triggers a call to an action, here called (for clarity) the same as the rhs.

Consider one of those functions:

    sub graph_id
    {
        my($stash, $t1, undef, $t2)  = @_;

        $myself -> new_item('graph_id', $t1);

        return $t1;

    } # End of graph_id.

The parameter list is courtesy of how Marpa::R2 manages callbacks. $t1 is the incoming graph id. In data/27.gv (shown earlier), that is graph_27.

Marpa does not supply the string graph_id to this function, because there's no need. I designed the grammar such that this function is only called when the value of the incoming type is graph_id, so I know precisely under what circumstances this function was called. That's why I could hard-code the string graph_id in the body of the graph_id() function.

The Grammar in Practice

Now you might be thinking: Just a second! That code seems to be doing no more than copying the input token to the output stream. Well, you're right, sort of.

True understanding comes when you realize that Marpa calls that code only at the appropriate point precisely because the type graph_id and its value graph_27 were at exactly the right place in the input stream. By that I mean that the location of the pair:

    (type => value)
    ('graph_id' => 'graph_27')

in the input stream was exactly where it had to be to satisfy the grammar initialized by Marpa::R2::Grammar. If it had not been there, Marpa would have thrown an exception, which we would recognize as a syntax error--a syntax error in the input stream fed into the lexer, but which Marpa picked up by testing that input stream against the grammar declared in the parser. The role of the lexer as an intermediary is to simplify the logic of the code as a whole with a divide-and-conquer strategy.

In other words, it's no accident that that function gets called at a particular point in time during the parser's processing of its input stream.

Consider another problem which arises as you build up the set of rule descriptors within the grammar.

Trees Have Leaves

The first article discussed chains and trees (see the prolog_definition mentioned earlier in this article). Briefly, each rule descriptor must be chained to other rule descriptors.

The astute reader will have already seen a problem: How do you define the meanings of the leaves of this tree when the chain of definitions must end at each leaf?

Here's part of the data/27.lex input file:

    "type","value"
    strict          , "no"
    digraph         , "yes"
    graph_id        , "graph_27"
    start_scope     , "1"
    node_id         , "node_27_1"
    open_bracket    , "["
    attribute_id    , "color"
    equals          , "="
    attribute_value , "red"
    attribute_id    , "fontcolor"
    equals          , "="
    attribute_value , "green"
    close_bracket   , "]"
    ...

The corresponding rules descriptors look like:

    [
    ...
    {
        lhs => 'attribute_statement',
        rhs => [qw/attribute_key has attribute_val/],
    },
    {
        lhs    => 'attribute_key',
        rhs    => [qw/attribute_id/], # <=== This is a terminal.
        min    => 1,
        action => 'attribute_id',
    },
    {
        lhs    => 'has',
        rhs    => [qw/equals/],
        min    => 1,
    },
    {
        lhs    => 'attribute_val',
        rhs    => [qw/attribute_value/], # <=== And so is this.
        min    => 1,
        action => 'attribute_value',
    },
    ...
    ]

The items marked as terminals (standard parsing terminology) have no further definitions, so attribute_key and attribute_val are leaves in the tree of rule descriptors. What does that mean? The terminals attribute_id and attribute_value must appear literally in the input stream.

Switching between attribute_key and attribute_id is a requirement of Marpa to avoid ambiguity in the statement of the grammar. Likewise for attribute_val and attribute_value.

The min makes the attributes mandatory. Not in the sense that nodes and edges must have attributes, they don't, but in the sense that if the input stream has an attribute_id token, then it must have an attribute_value token and vice versa.

Remember the earlier section "Working With An Incomplete BNF"? If the original *.gv file used one of:

    size = "6,6"
    [size = "6,6"]
    graph [size = "6,6"]

... then the one chosen really represents the graph attribute:

    graph [size = "6,6"]

To make this work, the lexer must force the output to be:

    "type","value"
    ...
    class_id        , "graph"
    open_bracket    , "["
    attribute_id    , "size"
    equals          , "="
    attribute_value , "6,6"
    close_bracket   , "]"

This matches the requirements, in that both attribute_id and attribute_value are present, is their (so to speak) owner, the object itself, which is identified by the type class_id.

All of this should reinforce the point that the design of the lexer is intimately tied to the design of the parser. By taking decisions like this in the lexer you can standardize its output and simplify the work that the parser needs to don.

Where to go from here

The recently released Perl module MarpaX::Simple::Rules takes a BNF and generates the corresponding grammar in the format expected by Marpa::R2.

Jeffrey Kegler (author of Marpa) has blogged about MarpaX::Simple::Rules.

This is a very interesting development, because it automates the laborious process of converting a BNF into a set of Marpa's rule descriptors. Consequently, it makes sense for anyone contemplating using Marpa::R2 to investigate how appropriate it would be to do so via MarpaX::Simple::Rules.

Wrapping Up and Winding Down

You've seen samples of lexer output and some parts of the grammar which both define the second sub-grammar of what to expect and what should match precisely the input from that lexer. If they don't match, it is in fact the parser which issues the dread syntax error message, because only it (not the lexer) knows which combinations of input tokens are acceptable.

Just like in the lexer, callback functions stockpile items which have passed Marpa::R2's attempt to match up input tokens with rule descriptors. This technique records exactly which rules fired in which order. After Marpa::R2 has run to completion, you have a stack of items whose elements are a (lexed and) parsed version of the original file. Your job is then to output that stack to a file, or await the caller of the parser to ask for the stack as an array reference. From there, the world.

For more details, consult my July 2011 article on Marpa::R2.

The Lexer and the State Transition Table - Revisited

The complexity of the STT in GraphViz2::Marpa justifies the decision to split the lexer and the parser into separate modules. Clearly that will not always be the case. Given a sufficiently simple grammar, the lexer phase may be redundant. Consider this test data file, data/sample.1.ged, from Genealogy::Gedcom:

    0 HEAD
    1 SOUR Genealogy::Gedcom
    2 NAME Genealogy::Gedcom::Reader
    2 VERS V 1.00
    2 CORP Ron Savage
    3 ADDR Box 3055
    4 STAE Vic
    4 POST 3163
    4 CTRY Australia
    3 EMAIL ron@savage.net.au
    3 WWW http://savage.net.au
    2 DATA
    3 COPR Copyright 2011, Ron Savage
    1 NOTE
    2 CONT This file is based on test data in Paul Johnson's Gedcom.pm
    2 CONT Gedcom.pm is Copyright 1999-2009, Paul Johnson (paul@pjcj.net)
    2 CONT Version 1.16 - 24th April 2009
    2 CONT
    2 CONT Ron's modules under the Genealogy::Gedcom namespace are free
    2 CONT
    2 CONT The latest versions of these modules are available from
    2 CONT my homepage http://savage.net.au and http://metacpan.org
    1 GEDC
    2 VERS 5.5.1-5
    2 FORM LINEAGE-LINKED
    1 DATE 10-08-2011
    1 CHAR ANSEL
    1 SUBM @SUBM1@
    0 TRLR

Each line matches /^(\d+)\s([A-Z]{3,4})\s(.+)$/: an integer, a keyword, and a string. In this case I'd skip the lexer, and have the parser tokenize the input. So, horses for courses. (GEDCOM defines genealogical data; see the GEDCOM definition for more details).

Sample Output

I've provided several links of sample output for your perusal.

Happy lexing and parsing!

Consuming RESTful Services with Perl

In my previous article I described how to create board game images using Image::Magick, thus allowing you to design board games using Perl. This time I want to show you how to upload those images to The Game Crafter so you get get a professional copy of the game manufactured.

In the last article I created a board game version of the video game called The Lacuna Expanse. This time I'll show how to upload those images to the site to create a custom board game. Don't worry if you're not ready to create a board game or if you'll never be ready; the principles of designing a useful API and using it apply to all sorts of services you might want to use, from weather tracking to stocks to medical systems. I picked games for two reasons. One, me and I team just built this system, so it's shiny and new and I've learned a lot. Two, games are more fun (and visual) than showing how to record invoice information in a RESTful ERP application.

Getting Ready

First, get yourself a copy of TheGameCrafter::Client. It's a Perl module that makes it trivial to interact with The Game Crafter's RESTful web service API. When you use TheGameCrafter::Client; it imports tgc_get(), tgc_post(), tgc_put(), and tgc_delete() into your program for easy use. Good APIs are descriptive and obvious. From there you can have a look at The Game Crafter's developer documentation to do any custom stuff you want. Good APIs have useful documentation.

You'll also need to activate the developer setting in your TGC account, and request an API key. Good APIs demonstrate security in authorization and authentication.

A Little about The Game Crafter's API

TheGameCrafter::Client is just a tiny wrapper around our RESTful web services. I designed the web services atop Dancer and DBIx::Class. My goal with this was to build a very reliable and consistent API not only for external use but internal. You see, the entire TGC web site actually runs off these web services. Not only that, but these web services tie directly into our manufacturing facility, so they are controlling the physical world in addition to the virtual. Good APIs allow you to build multiple clients with different uses. (They don't require multiple clients, but they don't forbid it and do enable it.)

With Lacuna, I built JSON::RPC::Dispatcher (JRD), which is a JSON-RPC 2.0 web service handler on top of Plack. I love JRD, but it has two weaknesses, one of them fatal. One weakness is that you must format parameters using JSON, which means that it's not easy to just call a URL and get a result with something like curl. (Good RESTful APIs allow multiple clients. If you can't use curl, you probably have a problem.) The fatal weakness of JSON-RPC 2.0 is that there is no way to do file uploads within the spec. The Game Crafter is all about file uploads, so that meant I either needed to handle those separately (aka inconsistently), or develop something new. I opted for the latter.

With TGC's web services I decided to adopt some of the things I really liked about JSON-RPC, namely the way it handles responses whether they be result sets or errors. So you always get a consistent return:

{ "result" : { ... } }

{
   "error" : {
        "code" : 404,
        "message" : "File not found.",
        "data" : "file_id"
   }
}

With TGC I also wanted a consistent and easy way of turning DBIx::Class into web services through Dancer. I looked into things like AutoCRUD, but I'm not a fan of Catalyst, and it also took too much configuration (in my opinion) to get it working. I wanted something simpler and faster, so I decided to roll my own. The result was a thin layer of glue between Dancer and DBIx::Class that allows you to define your web service interface in your normal DBIx::Class declarations. It automatically then generates the web services, databases tables, web form handling, and more for you. This little glue layer is now in use in all web app development within Plain Black, and eventually we'll be releasing it onto CPAN for all to use for free. The best part of that is that you know you're getting a production-ready system because it's been running The Game Crafter and other sites for over a year now. (Good APIs are often extracted from working systems.) More on that in a future article.

Let's Do This Thing Already

Before you can make any API calls, you need to authenticate.

 my $session = tgc_post('session',{
   username    => 'me',
   password    => '123qwe',
   api_key_id  => 'abcdefhij',
 });

Before you can start uploading, fetch your user account information. This contains several pieces of info that you can use.

 my $user = tgc_get('user', {
   session_id    => $session->{id},           # using our session to do stuff
});

Think of TGC projects like filesystems: you have folders which contain folders and files. First create a folder, then upload a file:

 my $folder = tgc_post('folder', {
  session_id  => $session->{id},
  name        => 'Lacuna',
  user_id     => $user->{id},
  parent_id   => $user->{root_folder_id},  # putting this in the home folder
 });

 my $file = tgc_post('file', {
  session_id  => $session->{id},
  name        => 'Mayhem Training',
  file        => ['mayhem.png'],         # the array ref signifies this is a file path
  folder_id   => $folder->{id},       # putting it in the just-created folder
 });

Assuming at this point you've uploaded all of your files, you can now build out your game. The Game Crafter has this notion of a "Designer", which is sort of like your very own publishing company. Games are attached to the designer, so first you must create the designer, then the game.

 my $designer = tgc_post('designer', {
  session_id  => $session->{id},
  user_id     => $user->{id},
  name        => 'Lacuna Expanse Corp',
 });

 my $game = tgc_post('game', {
  session_id  => $session->{id},
  designer_id => $designer->{id},
  name        => 'Lacuna Expanse: A New Empire',
 });

With a game created and assets uploaded, you can now create a deck of cards. This is pretty straight forward just like before.

 my $deck = tgc_post('minideck', {
  session_id => $session->{id},
  name       => 'Planet',
  game_id    => $game->{id},
 });

 my $card = tgc_post('minicard', {
  session_id => $session->{id},
  name       => 'Mayhem Training',
  face_id    => $file->{id},
  deck_id    => $deck->{id},
 });

You have probably noticed already how closely this resembles CRUD operations, because it does. Behind the scenes, who knows what TGC does with this information? (I do, but that's because I wrote it.) It doesn't matter to the API, because all of those details are hidden behind a good API. Good APIs expose only the necessary details—in this case, the relationships between folders and files and between designers and games.

You can also see that the API is as stateless as possible, where the session identifier is part of every API call. It's easy to imagine a more complicated API which hides this, but I stuck with the bare-bones REST for at least two reasons: it's simple, and it's easy to see what's happening. Someone could build over the top of this API if desired. Good APIs allow extension and further abstraction.

Just like that, you've created a game and added a deck of cards to it. There are of course lots of other fancy things you can do with the API, but this should get you started. I wouldn't leave you hanging there, however. I've open sourced the actual code I used to create the Lacuna Expanse board game so you'd have something to reference. There's also a developer's forum if you have any questions. Good luck to you, and happy gaming!

Designing Board Games With Perl

Board games are hotter than they've ever been. In fact, the board game market has grown 25% in the past year while the video game market shrank 20%. But you're a Perl hacker, not an Adobe Illustrator, so how can you design a board game? Well, that's exactly what I aim to show you in this article.

First, you need an idea. You can turn literally anything into a game. Whether you just want to design your own custom playing cards, or you want to make a full on custom board or card game, the options are limitless. (Full disclosure: I'm one of the owners of The Game Crafter, which itself is written entirely in Perl. )

For the purposes of this article, I'm going to make a board game version of the popular Perl-based web game The Lacuna Expanse. (I'm also one of the owners of Lacuna.) I chose this because I already have some artwork for it, albeit not in board game form. However, you can get free art from various sites around the web.

My Lacuna-based board game will be a tile placement game where all the players work together cooperatively to fend off an alien invasion.

Let's Get To The Perl Already!

There are several great image manipulation libraries on the CPAN, but my personal favorite is Image::Magick. I started by creating a base image which I could manipulate in any way that I wanted. (I based my choice off of The Game Crafter's list of component sizes and prices.) I decided to use mini cards, because the table would fill up too quickly with full poker-sized cards; there'll be a lot of cards on the table!

 my $card = Image::Magick->new(size=>'600x825');
 say $card->ReadImage('canvas:white');
Note that I used say in front of the ReadImage call. Image::Magick will emit a textual exception on each call if anything goes wrong. I could easily wrap that with better error handling, but for now printing to the screen is sufficient for my needs.

When printing things (really printing them, with ink and all) you also have to take into account something called bleed and cut lines. It's easy to draw the cut line on the card in red as the boundary of the printable image content.

 say $card->Draw(stroke=>'red', fill => 'none', strokewidth=>1, primitive=>'rectangle', points=>'38,38 562,787');
blank with cut lines

So far so good. The next step is to give this card a background so that it starts to look like a card. For this I'll take one of the planet surface images from the Lacuna Expanse and rotate it and stretch it to fit the shape of the card.

 my $surface = Image::Magick->new;
 say $surface->ReadImage('surface-p17.jpg');
 say $surface->Rotate(90);
 say $surface->Resize('600x825!');
 say $card->Composite(compose => 'over', image => $surface, x => 0, y => 0);

Note the exclamation point (!) on the Resize command. That tells Image::Magick to distort the native aspect ratio of the image. In other words, stretch the image to fill the size I've specified.

background

You may have noticed that this image looks enormous. That's because it's for print (on paper!) rather than screens. Print has more pixels per inch/centimeter than screens, thus the image looks bigger when you display it on a screen.

Now the card needs a title. Adding text to the image is straightforward.

 $card->Annotate(text => 'Mayhem Training', font => 'ALIEN5.ttf', y => -275, fill => 'white', pointsize => 70, gravity => 'Center');
title

As you can see I've used a custom font. Image::Magick is capable of using nearly any OpenType or TrueType font.

With a background and a title, the next step is to overlay the card with a picture of the Mayhem Training building from the video game.

 my $image = Image::Magick->new;
 say $image->ReadImage('mayhemtraining9.png');
 say $card->Composite(compose => 'over', image => $image, x => 100, y => 165);
added image

Now we're finally getting somewhere! This is really starting to look like a card. Use the same technique to overlay an icon onto the card. As in so many games, these icons symbolize an ability that the card grants the player who uses it. You can get free icons from all over the web; one of my favorite libraries is Glyphish.

 my $icon = Image::Magick->new;
 say $icon->ReadImage('target.png');
 say $card->Composite(compose => 'over', image => $icon, x => 100, y => 570);
added icon

You can't get away with icons all the time; a little text will explain things to new players. Adding some explanation to the card would be really tricky, if it weren't for some really neat code that Gabe Schaffer contributed to the ImageMagick forums a long time ago. Basically without this code you'd have to make the text wrap at word boundaries yourself, but with it, you can just do a simple Annotate call like this:

 $card->Set(font => 'promethean.ttf', pointsize => 35);
 my $text = 'Demolish one of your buildings to use this ability.';
 my $text_wrapped = wrap($text, $card, 400);
 say $card->Annotate(text => $text_wrapped, x => 100, y => 690, font => 'promethean.ttf', fill => 'white', pointsize => 35);
added text

A game like this wouldn't be very interesting if you could place any card anywhere you want. To solve this, I want to to add something to the card to indicate how other cards can connect to it. This is the most challenging part yet, because I want to make a half-circle/half-rectangle connector. Because this is a bit more complicated and I want to use it for drawing connection points on various sides of the card, I'll turn it into a subroutine.

 sub draw_connection_point {
   my ($card, $color, $rotation, $x, $y) = @_;

   # draw a half circle, it's a half cuz we're drawing outide the image
   my $half_circle  = Image::Magick->new(size=>'70x35');
   say $half_circle->ReadImage('canvas:transparent');
   say $half_circle->Draw(stroke => $color, fill => $color, strokewidth=>1, primitive=>'circle', points=>'35,35, 35,70');

   # create the connection point image
   my $connection = Image::Magick->new(size=>'70x85');
   say $connection->ReadImage('canvas:transparent');

   # add the half circle to the connection point
   say $connection->Composite(compose => 'over', image => $half_circle, x => 0, y => 0);

   # extend the connection point the the edge
   say $connection->Draw(stroke=>$color, fill => $color, strokewidth=>1, primitive=>'rectangle', points=>'0,35 70,85');

   # orient the connection point for its position
   say $connection->Rotate($rotation);

   # apply the connection point to the image
   say $card->Composite(compose => 'over', image => $connection, x => $x, y => $y);
 }

 draw_connection_point($card, 'purple', 0, 265, 740);
connection added

Sometimes it's nice to give players hints about stuff so they can form better strategies. To that end, I added a series of pips above the title to indicate how many copies of this card are in the deck. In this case, this card is unique.

 my $quantity = 1;
 my $pips = '.' x $quantity;
 say $card->Annotate(text => $pips, y => -340, fill => 'white', pointsize => 70, gravity => 'Center');
finished

Remember to remove the cut line before you save the file.

 #say $card->Draw(stroke=>'red', fill => 'none', strokewidth=>1, primitive=>'rectangle', points=>'38,38 562,787');
 say $card->Write('mayhem.png');
cut lines removed

Rationale

Now that I've shown you how to create a card, you may have one question. Why would you go through the work of coding it rather than just using Photoshop or the Gimp? There are lots of reasons to code it including things like you don't know how to use image editors. However the really important reason is the same reason you write code to do anything... automation! A game isn't made of just one card. Likewise, games aren't designed in just one try. It takes lots of play testing and revisions. If you design your board game using code you can whip out a new revision as easily as changing a config file.

Of course, automatic image generation isn't only for games....

Next Time

I've shown you how to create the images for a game. If you're like me, the next thing you want to do is print your game. You could do this at home, but it will cost you a lot of time and money (ink jet ink costs more than human blood). You could take it to Kinkos, but you won't get a nice quality product because they don't specialize in making games. Instead, you can upload your files to The Game Crafter, where you'll get a custom game that looks like it game from the game store. There's a nice easy to use web interface to do this, but you're a Perl programmer. Why do something manually if you can automate it?

Besides that, it's a real-world example of interacting with a web service written completely in Perl—on both sides. Who wouldn't be interested in that?

For The Impatient

The Lacuna Expanse Board Game is available for purchase now if you're interested. Also, I've released the code I wrote to develop it via this public GitHub repository.

Visit the home of the Perl programming language: Perl.org

Sponsored by

Monthly Archives

Powered by Movable Type 5.13-en