<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <title>Perl.com</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/" />
    <link rel="self" type="application/atom+xml" href="http://www.perl.com/pub/atom.xml" />
    <id>tag:www.perl.com,2010-07-21:/pub//2</id>
    <updated>2013-02-01T02:57:42Z</updated>
    <subtitle>news and views of the Perl programming language</subtitle>
    <generator uri="http://www.sixapart.com/movabletype/">Movable Type Pro 5.13-en</generator>

<entry>
    <title>Lexing and Parsing Continued</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2013/01/lexing-and-parsing-continued.html" />
    <id>tag:www.perl.com,2013:/pub//2.2081</id>

    <published>2013-02-01T02:50:54Z</published>
    <updated>2013-02-01T02:57:42Z</updated>

    <summary>Many practical programming problems require you to parse data. Ron Savage continues his demonstration of Marpa and other tools and techniques for lexing and parsing data. Put down the regexps; get it right this time.</summary>
    <author>
        <name>Ron Savage</name>
        <uri>http://savage.net.au/index.html</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p>This article is the second part of a series which started with <a
    href="http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html">An
    Overview of Lexing and Parsing</a>. That article aimed to discuss lexing
and parsing in general terms, while keeping to a minimum the details of how to
actually use Marpa::R2 to do the work. In the end, however, it did have quite a
few specifics. This article has yet more detail with regard to working with
both a lexer and a parser.</p>

<p>(For more information, see the Marpa blog or download the example files for this article.)</p>

<h3>Brief Recap: The Two Grammars</h3>

<p>Article 1 defined the first sub-grammar as the one which identifies tokens and the second sub-grammar as the one which specifies which combinations of tokens are legal within the target language. As I use these terms, the lexer implements the first sub-grammar and the parser implements the second.</p>

<h3>Some Context</h3>

<p>Consider this image:</p>

<div class="HTML">

<img src="http://savage.net.au/Ron/html/graphviz2.marpa/teamwork.svg" /></div>

<p>It's actually a copy of the image of a manual page for <a
    href="http://search.cpan.org/perldoc?Graph::Easy">Graph::Easy</a>. Note: My
module <a
    href="http://search.cpan.org/perldoc?Graph::Easy::Marpa">Graph::Easy::Marpa</a>
is a complete re-write of <code>Graph::Easy</code>. After I offered to take
over maintenance of the latter, I found the code so complex I literally
couldn't understand any of it.</p>

<p>There are three ways (of interest to us) to specify the contents of this image:</p>

<ul>

<li>As a Perl program using the <a href="http://search.cpan.org/perldoc?GraphViz2">GraphViz2</a> module</li>

<p>This is <em>teamwork.pl</em>:</p>

<pre><code>    #!/usr/bin/env perl

    use strict;
    use warnings;

    use GraphViz2;

    # ---------------

    my($graph) = GraphViz2 -&gt; new
        (
        graph  =&gt; {rankdir =&gt; &#39;LR&#39;},
        edge   =&gt; {color =&gt; &#39;grey&#39;},
        logger =&gt; &#39;&#39;,
        );

    $graph -&gt; default_node(fontsize =&gt; &#39;12pt&#39;, shape =&gt; &#39;rectangle&#39;,
                           style    =&gt; &#39;filled, solid&#39;);

    $graph -&gt; add_node(name =&gt; &#39;Teamwork&#39;, fillcolor =&gt; &#39;yellow&#39;);
    $graph -&gt; add_node(name =&gt; &#39;Victory&#39;,  fillcolor =&gt; &#39;red&#39;);

    # The dir attribute makes the arrowhead appear.

    $graph -&gt; add_edge(dir =&gt; &#39;forward&#39;, from  =&gt; &#39;Teamwork&#39;,
                       to  =&gt; &#39;Victory&#39;, label =&gt; &#39;is the key to&#39;);

    my($format)      = shift || &#39;svg&#39;;
    my($output_file) = shift || &quot;teamwork.$format&quot;;

    $graph -&gt; run(format =&gt; $format, output_file =&gt; $output_file);</code></pre>

<li>As a <code>Graph::Easy</code> file written in a little language</li>

<p>This approach uses the <code>Graph::Easy</code> language invented by the author (Tels) of the <code>Graph::Easy</code> Perl module. Call this <em>teamwork.easy</em>. It's actually input for <code>Graph::Easy::Marpa</code>:</p>

<pre><code>    graph {rankdir: LR}
    node {fontsize: 12pt; shape: rectangle; style: filled, solid}
    [Teamwork]{fillcolor: yellow}
    -&gt; {label: is the key to}
    [Victory]{fillcolor: red}</code></pre>

<p>Note: In some rare cases, the syntax supported by <code>Graph::Easy::Marpa</code> will not be exactly identical to the syntax supported by the original <code>Graph::Easy</code>.</p>

<li>As a DOT file</li>

<p>Call this <em>teamwork.dot</em>:</p>

<pre><code>    digraph Perl
    {
    graph [ rankdir=&quot;LR&quot; ]
    node  [ fontsize=&quot;12pt&quot; shape=&quot;rectangle&quot; style=&quot;filled, solid&quot; ]
    edge  [ color=&quot;grey&quot; ]
    &quot;Teamwork&quot; [ fillcolor=&quot;yellow&quot; ]
    &quot;Victory&quot;  [ fillcolor=&quot;red&quot; ]
    &quot;Teamwork&quot; -&gt; &quot;Victory&quot; [ label=&quot;is the key to&quot; ]
    }</code></pre>

</ul>

<p>This article is about using <code>GraphViz2::Marpa</code> to parse DOT files.</p>

<p>Of course the Graphviz package itself provides a set of programs which parse DOT files in order to render them into many different formats. Why then would someone write a new parser for DOT? One reason is to practice your Marpa skills. Another is, perhaps, to write an on-line editor for Graphviz files.</p>

<p>Alternatively, you might provide add-on services to the Graphviz package. For instance, some users might want to find all clusters of nodes, where a cluster is a set of nodes connected to each other, but not connected to any nodes outside the cluster. Yet other users might want to find all paths of a given length emanating from a given node.</p>

<p>I myself have written algorithms which provide these last two features. See the module <a href="http://search.cpan.org/perldoc?GraphViz2::Marpa::PathUtils">GraphViz2::Marpa::PathUtils</a> and the <a href="http://savage.net.au/Perl-modules/html/graphviz2.pathutils/index.html">PathUtils demo page</a>.</p>
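
<p>To make the idea of a cluster concrete, here is a minimal sketch in core Perl. It is not the algorithm used in <code>GraphViz2::Marpa::PathUtils</code>. The edge list and the name <code>find_clusters()</code> are invented for illustration, and edges are treated as undirected:</p>

<pre><code>    #!/usr/bin/env perl

    use strict;
    use warnings;

    # Hypothetical edge list. In real life it would come from a parsed DOT file.

    my(@edge) = ([&#39;A&#39;, &#39;B&#39;], [&#39;B&#39;, &#39;C&#39;], [&#39;D&#39;, &#39;E&#39;]);

    sub find_clusters
    {
        my(@edge) = @_;

        # Build an undirected adjacency list.

        my(%neighbour);

        for my $edge (@edge)
        {
            my($from, $to) = @$edge;

            $neighbour{$from}{$to} = 1;
            $neighbour{$to}{$from} = 1;
        }

        my(%seen);
        my(@cluster);

        for my $start (sort keys %neighbour)
        {
            next if ($seen{$start});

            # Walk outwards from $start, collecting every reachable node.

            my(@todo) = ($start);
            my(@member);

            $seen{$start} = 1;

            while (defined(my $node = shift @todo) )
            {
                push @member, $node;

                for my $next (sort keys %{$neighbour{$node} })
                {
                    next if ($seen{$next});

                    $seen{$next} = 1;

                    push @todo, $next;
                }
            }

            push @cluster, [sort @member];
        }

        return @cluster;

    } # End of find_clusters.

    # Print one cluster per line.

    print join(&#39;, &#39;, @$_), &quot;\n&quot; for find_clusters(@edge);</code></pre>

<p>With the sample edges, nodes A, B and C form one cluster, and D and E form another.</p>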

<p>But back to using <code>Marpa::R2</code> from within <code>GraphViz2::Marpa</code>.</p>

<h2>Scripts for Testing</h2>

<p>The code being developed obviously needs to be tested thoroughly, because any such little language has many ways to get things right and a horrendously large number of ways to get things slightly wrong, or worse. Luckily, because graphs specified in DOT can be very brief, it's a simple matter to make up many samples. Further, other more complex samples can be copied from the Graphviz distro's <em>graphs/directed/</em> and <em>graphs/undirected/</em> directories.</p>

<p>The <code>GraphViz2::Marpa</code> distro I developed includes 86 <em>data/*.gv</em> (input) files and 79 corresponding <em>data/*.lex</em>, <em>data/*.parse</em>, <em>data/*.rend</em>, and <em>html/*.svg</em> (output) files. Some input files contain deliberate errors, so they have no output files. The distribution also includes obvious scripts such as <em>lex.pl</em> (lex a file), <em>parse.pl</em> (parse a lexed file), <em>rend.pl</em> (render a parsed file back into DOT), and one named vaguely after the package, <em>g2m.pl</em>, which runs the lexer and the parser.</p>

<p>Why a <em>rend.pl</em>? If the code <em>can't</em> reconstruct the input DOT file, something got lost in translation.</p>

<p>The distribution also includes scripts which operate on a set of files.</p>

<ul>

<li><em>data/*.gv</em> -&gt; <em>dot2lex.pl</em> -&gt; runs <em>lex.pl</em> once per file -&gt; <em>data/*.lex</em> (CSV files).</li>

<li><em>data/*.lex</em> -&gt; <em>lex2parse.pl</em> -&gt; runs <em>parse.pl</em> once per file -&gt; <em>data/*.parse</em> (CSV files).</li>

<li><em>data/*.parse</em> -&gt; <em>parse2rend.pl</em> -&gt; runs <em>rend.pl</em> once per file -&gt; <em>data/*.rend</em> (DOT files).</li>

<li><em>data/*.rend</em> -&gt; <em>rend2svg.pl</em> -&gt; <em>html/*.svg</em>.</li>

</ul>

<p>Finally, <em>generate.demo.pl</em> creates <a href="http://savage.net.au/Perl-modules/html/graphviz2.marpa/">the GraphViz2 Marpa demo page</a>.</p>

<p>Normal users will use <em>g2m.pl</em> exclusively. The other scripts help developers with testing. See the <a href="http://savage.net.au/Perl-modules/html/GraphViz2/Marpa.html#Scripts">GraphViz2 Marpa scripts documentation</a> for more information.</p>

<h3>Some Modules</h3>

<p><a
    href="http://search.cpan.org/perldoc?GraphViz2::Marpa::Lexer::DFA">GraphViz2::Marpa::Lexer::DFA</a>
is a wrapper around <a
    href="http://search.cpan.org/perldoc?Set::FA::Element">Set::FA::Element</a>.
It has various tasks to do:</p>

<ul>

<li>Process the State Transition Table (STT)</li>

<p>The <a href="http://savage.net.au/Perl-modules/html/graphviz2.marpa/stt.html">STT</a> comes in via <code>GraphViz2::Marpa::Lexer</code>, which produced it from within its own source code or an external CSV file. The lexer has already validated the structure of the STT's data.</p>

<li>Transform the STT from the input form (spreadsheet/CSV file) into what Set::FA::Element expects</li>

<li>Set up the logger</li>

<p>In truth, it gets this from its caller, GraphViz2::Marpa::Lexer.</p>

<li>Provide the code for all the functions which handle enter-state and exit-state events</li>

<p>This is the code which can apply checking above and beyond what was built into the set of regexps which came from the spreadsheet. For example, I could have used Perl's <code>\w</code> instead of <code>/[a-zA-Z_][a-zA-Z_0-9]*/</code> to find alphanumeric tokens, and at this point rejected those starting with a digit, because <a href="http://www.graphviz.org/content/dot-language">DOT</a> imposes that restriction.</p>

<p>Most importantly, this code stockpiles the tokens themselves with metadata to identify the type of each token (hence the <em>two</em> columns in the upcoming sample <em>data/27.lex</em> just below).</p>

<li>Run the DFA</li>

<li>Check the result of that run</li>

<p>Did the DFA end up in an accepting state? Yes is okay and no is an error.</p>

</ul>

<p>Here is some sample data which ships with <code>GraphViz2::Marpa</code>, formatted for maximum clarity:</p>

<ul>

<li>A DOT file, <em>data/27.gv</em>, which is input to the lexer:</li>

<pre><code>    digraph graph_27
    {
        node_27_1
        [
            color     = red
            fontcolor = green
        ]
        node_27_2
        [
            color     = green
            fontcolor = red
        ]
        node_27_1 -&gt; node_27_2
    }</code></pre>

<li>A token file, <em>data/27.lex</em>, which is output from the lexer:</li>

<pre><code>    &quot;type&quot;,&quot;value&quot;
    strict          , &quot;no&quot;
    digraph         , &quot;yes&quot;
    graph_id        , &quot;graph_27&quot;
    start_scope     , &quot;1&quot;
    node_id         , &quot;node_27_1&quot;
    open_bracket    , &quot;[&quot;
    attribute_id    , &quot;color&quot;
    equals          , &quot;=&quot;
    attribute_value , &quot;red&quot;
    attribute_id    , &quot;fontcolor&quot;
    equals          , &quot;=&quot;
    attribute_value , &quot;green&quot;
    close_bracket   , &quot;]&quot;
    node_id         , &quot;node_27_2&quot;
    open_bracket    , &quot;[&quot;
    attribute_id    , &quot;color&quot;
    equals          , &quot;=&quot;
    attribute_value , &quot;green&quot;
    attribute_id    , &quot;fontcolor&quot;
    equals          , &quot;=&quot;
    attribute_value , &quot;red&quot;
    close_bracket   , &quot;]&quot;
    node_id         , &quot;node_27_1&quot;
    edge_id         , &quot;-&gt;&quot;
    node_id         , &quot;node_27_2&quot;
    end_scope       , &quot;1&quot;</code></pre>

</ul>

<p>You can see the details on <a href="http://savage.net.au/Perl-modules/html/graphviz2.marpa/">the GraphViz2 Marpa demo page</a>.</p>

<h2>Some Notes on the STT</h2>

<p>Firstly, note that the code allows whole-line comments (matching <code>m!^(?:#|//)!</code>). These lines are discarded when the input file is read, and so do not appear in the STT.</p>
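
<p>As a minimal sketch of that filtering step, assuming the STT has already been read into an array of lines (the sample rows below are made up for illustration):</p>

<pre><code>    use strict;
    use warnings;

    # Hypothetical STT rows, as they might be read from a CSV file.

    my(@line) =
    (
        &#39;# This line is a comment&#39;,
        &#39;// So is this one&#39;,
        &#39;Yes,,initial,strict,graph,,save_prefix&#39;,
    );

    # Keep only the lines which do not start with # or //.

    my(@stt) = grep{! m!^(?:#|//)!} @line;

    print &quot;$_\n&quot; for @stt;</code></pre>

<p>Only the third row survives, because the regexp is anchored to the start of each line.</p>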

<h3>Working With An Incomplete BNF</h3>

<p>Suppose you've gone to all of the work to find or create a BNF (<a
    href="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form">Backus-Naur
    Form</a>) grammar for your input language. You might encounter the
situation where you can use BNF to specify your language in general, but not
precisely in every situation. DOT is one offender.</p>

<p>DOT IDs can be surrounded by double-quotes, and in some cases <em>must</em> be surrounded by double-quotes. To be more specific, if we regard an attribute declaration to be of the form <code>key=value</code>, <em>both</em> the key and the value can be embedded in double-quotes, and sometimes the value <em>must</em> be.</p>

<p>Even worse, IDs can be attributes. For instance, you might want your font color to be green. That appears to be simple, but note: attributes <em>must</em> be attached to some component of the graph, something like <code>node_27_1 [fontcolor = green]</code>.</p>

<p>Here's the pain. In DOT, the <em>thing</em> to which the attribute belongs <em>may be omitted</em> as implied. That is, the name of the attribute's owner is <em>optional</em>. For instance, you might want a graph which is six inches square. Here's how you can specify that requirement:</p>

<ul>

<li><code>graph [size = "6,6"]</code></li>

<p>The double-quotes are mandatory in this context.</p>

<li><code>[size = "6,6"]</code></li>

<p>This also works.</p>

<li><code>size = "6,6"</code></li>

<p>So does this.</p>

</ul>

<p>But wait, there's more! The <em>value</em> of the attribute can be omitted, if it's true. Hence the distro, and the demo page, have a set of tests, called <em>data/42.*.gv</em>, which test that feature of the code. Grrrr.</p>

<p>All this means that when the terminator of the attribute declaration (in the input stream) is detected, and we switch states from <code>within-attribute</code> to <code>after-attribute</code>, the code which emits output from the lexer has to have some knowledge of what the hell is going on, so that it can pretend it received the first of these three forms even if it received the second or third form. It <em>must</em> output <code>graph</code> (the graph as a whole) as the owner of the attribute in question.</p>

<p>As you've seen, any attribute declaration can contain a set of attribute <code>key=value</code> pairs as in <code>node_27_2 [ color = green fontcolor = red ]</code>.</p>

<p>You can't solve this with regexps, unless you have amazing superpowers and don't care if anyone else can maintain your code. Instead, be prepared to add code in two places:</p>

<ul>

<li>At switch of state</li>

<li>After all input has been parsed</li>

<p>Indeed, <code>GraphViz2::Marpa::Lexer::DFA</code> contains a long sub,
<code>_clean_up()</code>, which repeatedly fiddles the array of detected
tokens, fixing up the list before it's fit to inflict on the parser.</p>

</ul>
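
<p>The following sketch shows the flavour of such a fix-up pass. It is not the code in <code>_clean_up()</code>. It assumes each token is a <code>[type, value]</code> pair, and the token type <code>class_id</code> (for the owner) and the name <code>add_owner()</code> are invented for illustration. It rewrites the second and third forms of the <code>size</code> attribute so that they look like the first:</p>

<pre><code>    use strict;
    use warnings;

    # Each token is [type, value]. The type name class_id is an assumption.

    sub add_owner
    {
        my(@token)  = @_;
        my($inside) = 0; # True between [ and ].
        my($bare)   = 0; # True while wrapping a bracketless attribute.
        my(@output);

        for my $token (@token)
        {
            my($type) = $$token[0];

            if ($type eq &#39;open_bracket&#39;)
            {
                my($previous) = @output ? $output[-1][0] : &#39;&#39;;

                # A [ with no owner in front of it belongs to the graph.

                push @output, [&#39;class_id&#39;, &#39;graph&#39;] if ($previous !~ /^(?:class|node)_id$/);

                $inside = 1;
            }
            elsif ( ($type eq &#39;attribute_id&#39;) &amp;&amp; ! $inside)
            {
                # A bracketless attribute gets an owner and a [.

                push @output, [&#39;class_id&#39;, &#39;graph&#39;], [&#39;open_bracket&#39;, &#39;[&#39;];

                $bare = 1;
            }

            push @output, $token;

            if ($type eq &#39;close_bracket&#39;)
            {
                $inside = 0;
            }
            elsif ( ($type eq &#39;attribute_value&#39;) &amp;&amp; $bare)
            {
                # Close the bracket opened for the bracketless attribute.

                push @output, [&#39;close_bracket&#39;, &#39;]&#39;];

                $bare = 0;
            }
        }

        return @output;

    } # End of add_owner.

    my(@attribute) = ([&#39;attribute_id&#39;, &#39;size&#39;], [&#39;equals&#39;, &#39;=&#39;], [&#39;attribute_value&#39;, &#39;6,6&#39;]);
    my(@bracketed) = ([&#39;open_bracket&#39;, &#39;[&#39;], @attribute, [&#39;close_bracket&#39;, &#39;]&#39;]);

    # The three forms: with owner, with brackets only, and bare.

    for my $form ([ [&#39;class_id&#39;, &#39;graph&#39;], @bracketed], [@bracketed], [@attribute])
    {
        print join(&#39; &#39;, map{ $$_[0] } add_owner(@$form) ), &quot;\n&quot;;
    }</code></pre>

<p>All three forms produce the same sequence of token types, so whatever consumes the tokens only ever has to deal with one form. A real pass also has to cope with edges, subgraphs and valueless attributes, which this sketch ignores.</p>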

<h3>Understanding the State Transition Table</h3>

<p>I included this diagram in the first article:</p>

<pre><code>             DOT&#39;s Grammar
                  |
                  V
        ---------------------
        |                   |
     strict                 |
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
     digraph     or       graph
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
       $id                  |
        |                   |
        ---------------------
                  |
                  V
                {...}</code></pre>

<p>A DOT file starts with an optional <code>strict</code>, followed by either
<code>graph</code> or <code>digraph</code>. (Here <code>di</code> stands for
directed, meaning edges between nodes have arrowheads on them. Yes, there are
many attributes which can be attached to edges. See <a
    href="http://www.graphviz.org/content/attrs">http://www.graphviz.org/content/attrs</a>
and look for an <code>E</code> (edge) in the second column of the table of
attribute names).</p>

<p>The <code>/(di|)graph/</code> is in turn followed by an optional ID.</p>

<h4>Line One</h4>

<p>From the first non-heading line of the STT, you can see how I ended up with:</p>

<ul>

<li>A start state flag =&gt; <code>Yes</code></li>

<p>The state defined on this line is the start state.</p>

<li>An accept flag =&gt; <code>''</code></li>

<p>Because the initial state can't be an accepting state, this column--in this row--must be empty.</p>

<p>Later STT states have <code>Yes</code> in this column.</p>

<li>A state name =&gt; <code>initial</code></li>

<p>I chose to call it <code>initial</code>. Other people call it <code>start</code>.</p>

<li>An event =&gt; <code>strict</code></li>

<p>This--although it might not yet be clear--is actually a regexp, <code>/(?:strict)/</code>. The DFA adds the do-not-capture prefix <code>/(?:</code> and suffix <code>)/</code>.</p>

<li>A next state =&gt; <code>graph</code></li>

<p>This is the state to which to jump if a match occurs. Here, match means a match between the regexp (event) in the previous column and the head of the input stream.</p>

<li>An entry function =&gt; <code>''</code></li>

<p>I don't use it here, but if I did, it would mean to call a particular function when entering the named state.</p>

<li>An exit function =&gt; <code>save_prefix</code></li>

<p>Similar to the entry function, this says to call a particular function when exiting the named state.</p>

<li>A regexp =&gt; <code>(?:"[^"]*"|&lt;\s*&lt;.*?&gt;\s*&gt;|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*))</code></li>

<p>I can save a set of regexps in this column and use spreadsheet formulas elsewhere to refer to them.</p>

<li>An interpretation =&gt; <code>ID</code></li>

<p>These are my notes to myself. This one says that the regexp in the previous column specifies what an ID must match in the DOT language.</p>

</ul>

<h4>Line Two</h4>

<p>The event in the second line, <code>/(?:graph|digraph)/</code>, indicates what to do if the <code>strict</code> is absent from the input stream.</p>

<p>To clarify this point, recall that the DFA matches entries in the Event column one at a time, from the first listed against the name of the state--here, <code>/(?:strict)/</code>--down to the last for the given state--here, <code>/(?:\s+)/</code>. The first regexp to match wins: it triggers the <em>exit</em> state logic, and its entry in the Next column specifies the next state to enter, which in turn specifies the (next) state's <em>entry</em> function to call, if any.</p>

<p>If <code>strict</code> is not at the head of the input stream, and it can definitely be absent, as is seen in the above diagram, this regexp--<code>/(?:graph|digraph)/</code>--is the next one tested by the DFA's logic.</p>

<h4>Line Three</h4>

<p>The hard-to-read regexp <code>\/\*.*\*\/</code> says to skip C-language-style multi-line (<code>/* ... */</code>) comments. The skip takes place because the Next state is <code>initial</code>, the current state. In other words, discard any text at the head of the input stream which this regexp will gobble.</p>

<p>Why does it get discarded? That's the way <code>Set::FA::Element</code>
operates. Looping <em>within</em> a state does <em>not</em> trigger the
exit-state and enter-state functions, and so there is no opportunity to
stockpile the matched text. That's good in this case. There's no reason to save
it, because it's a comment.</p>

<p>Think about the implications for a moment. Once the code has discarded a comment (or anything else), you can never recreate the verbatim input stream from the stockpiled text. Hence you should only discard something once you fully understand the consequences. If you're parsing code to execute it (whatever that means), fine. If you're writing a pretty printer or indenter, you cannot discard comments.</p>

<p>Lastly, we can say this regexp is used often, meaning we accept such comments at many places in the input stream.</p>

<h4>Line Four</h4>

<p>The regexp <code>\s+</code> says to skip spaces (in front of or between interesting tokens). As with the previous line, we skip to the very same state.</p>

<p>This state has four regexps attached to it.</p>
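
<p>The way these four lines drive the DFA can be sketched in core Perl. This is not <code>Set::FA::Element</code>, just an illustration of the same ideas: events are tried in order, the first match wins, and the exit function runs only when the state changes. The table layout and the <code>run()</code> function are invented for this sketch, and the exit function receives the matched text directly rather than a DFA object:</p>

<pre><code>    use strict;
    use warnings;

    my(@item);

    sub save_prefix
    {
        my($match) = @_;

        push @item, $match;
    }

    # State =&gt; list of [event, next state, exit function], tried in order.

    my(%stt) =
    (
        initial =&gt;
        [
            [&#39;strict&#39;,        &#39;graph&#39;,    \&amp;save_prefix],
            [&#39;graph|digraph&#39;, &#39;graph_id&#39;, \&amp;save_prefix],
            [&#39;\/\*.*\*\/&#39;,    &#39;initial&#39;,  undef],
            [&#39;\s+&#39;,           &#39;initial&#39;,  undef],
        ],
        graph =&gt;
        [
            [&#39;graph|digraph&#39;, &#39;graph_id&#39;, \&amp;save_prefix],
            [&#39;\s+&#39;,           &#39;graph&#39;,    undef],
        ],
    );

    sub run
    {
        my($input) = @_;
        my($state) = &#39;initial&#39;;

        pos($input) = 0;

        # Stop at graph_id, because this sketch only covers the prolog.

        while ( ($state ne &#39;graph_id&#39;) &amp;&amp; (pos($input) &lt; length($input) ) )
        {
            my($matched) = 0;

            for my $row (@{$stt{$state} })
            {
                my($event, $next, $exit) = @$row;

                # \G anchors the match at the head of the remaining input.

                if ($input =~ /\G($event)/gc)
                {
                    # Leaving the state triggers the exit function. Looping does not.

                    $exit -&gt; ($1) if ($exit &amp;&amp; ($next ne $state) );

                    $state   = $next;
                    $matched = 1;

                    last;
                }
            }

            die &quot;No event matches in state $state\n&quot; if (! $matched);
        }

        return $state;

    } # End of run.

    my($state) = run(&#39;/* Demo */ strict digraph graph_27&#39;);

    print &quot;Stopped in state $state. Saved: @item\n&quot;;</code></pre>

<p>The comment and the spaces match events whose next state is the current state, so they are consumed without any function being called. Only <code>strict</code> and <code>digraph</code> reach <code>save_prefix()</code>.</p>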

<h3>More States</h3>

<p>Re-examining the STT shows two introductory states, for input with and without a (leading) <code>strict</code>. I've called these states by the arbitrary names <code>initial</code> and <code>graph</code>.</p>

<p>If the initial <code>strict</code> is present, state <code>initial</code> handles it (in the exit function) and jumps to state <code>graph</code> to handle what comes next. If, however, <code>strict</code> is absent, state <code>initial</code> still handles the input, but then jumps to state <code>graph_id</code>.</p>

<p>A (repeated) word of warning about <code>Set::FA::Element</code>. A loop
<em>within</em> a state does <em>not</em> trigger the exit-state and
enter-state functions. Sometimes this can actually be rather unfortunate. You
can see elsewhere in the STT where I have had to use pairs of
almost-identically named states (such as <code>statement_list_1</code> and
<code>statement_list_2</code>), and designed the logic to rock the STT back and
forth between them, just to allow the state machine to gobble up certain input
sequences. You may have to use this technique yourself. Be aware of it.</p>

<p>Proceeding in this fashion, driven by the BNF of the input language, eventually you can construct the whole STT. Each time a new enter-state or exit-state function is needed, write the code, then run a small demo to test it. There is no substitute for that testing.</p>

<h4>The <code>graph</code> State</h4>

<p>You reach this state when the input stream starts with a leading <code>strict</code>. Apart from not bothering to cater for comments (as did the <code>initial</code> state), this state is really the same as the <code>initial</code> state.</p>

<p>A few paragraphs back I warned about a feature designed into Set::FA::Element, looping within a state. That fact is why the <code>graph</code> state exists. If the <code>initial</code> state could have looped to itself upon detecting <code>strict</code>, <em>and</em> executed the exit or entry functions, there would be no need for the <code>graph</code> state.</p>

<h4>The <code>graph_id</code> State</h4>

<p>Next, look for an optional graph id, at the current head of the input stream (because anything which matched previously is gone).</p>

<p>Here's the first use of a formula: Cell H2 contains
<code>(?:"[^"]*"|&lt;[^&gt;]*&gt;|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*))</code>.
This accepts a double-quoted ID, or an ID quoted with <code>&lt;</code> and
<code>&gt;</code>, or an alphanumeric (but not starting with a digit) ID, or a
number.</p>
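
<p>A quick way to get a feel for this regexp is to run it over a few candidate strings. In the sketch below the regexp is copied from cell H2, but the <code>\A</code> and <code>\z</code> anchors are my addition, so that the whole string must match. In the DFA the regexp is instead matched against the head of the input stream:</p>

<pre><code>    use strict;
    use warnings;

    # The regexp from cell H2, anchored so the whole string must match.

    my($id) = qr/\A(?:&quot;[^&quot;]*&quot;|&lt;[^&gt;]*&gt;|[a-zA-Z_][a-zA-Z_0-9]*|-?(?:\.[0-9]+|[0-9]+(?:\.[0-9])*))\z/;

    for my $candidate (&#39;&quot;node 1&quot;&#39;, &#39;&lt;b&gt;&#39;, &#39;node_27_1&#39;, &#39;-.5&#39;, &#39;27&#39;, &#39;27_node&#39;)
    {
        printf &quot;%-10s =&gt; %s\n&quot;, $candidate, ($candidate =~ $id ? &#39;ID&#39; : &#39;not an ID&#39;);
    }</code></pre>

<p>Every candidate except the last is reported as an ID. <code>27_node</code> fails because an alphanumeric ID may not start with a digit, and a number may not be followed by letters.</p>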

<p>When the code sees such a token, it jumps to the <code>open_brace</code> state, meaning the very next non-whitespace character had better (barring comments) be a <code>{</code>, or there's an error, so the code will die. Otherwise, it accepts <code>{</code> without an ID and jumps to the <code>statement_list_1</code> state, or discards comments and spaces by looping within the <code>graph_id</code> state.</p>

<h4>The Remaining States</h4>

<p>What follows in the STT gets complex, but in reality is more of the same. Several things should be clear by now:</p>

<ul>

<li>The development of the STT is iterative</li>

<li>You need lots of tiny but different test data files, to test these steps</li>

<li>You need quite a lot of patience, which, unfortunately, can't be downloaded from the internet...</li>

</ul>

<h2>Lexer Actions (Callbacks)</h2>

<p>Matching something with a DFA only makes sense if you can capture the matched text for processing. Hence the use of state-exit and state-entry callback functions. In these functions, you must decide what text to output for each recognized input token.</p>

<p>To help with this, I use a method called <code>items()</code>, accessed in each function via <code>$myself</code>. This method manages a stack (array) of items, held in an object of type <code>Set::Array</code>. Each element in this array is a hashref:</p>

<pre><code>    {
        count =&gt; $integer, # 1 .. N.
        name  =&gt; &#39;&#39;,       # Unused.
        type  =&gt; $string,  # The type of the token.
        value =&gt; $value,   # The value from the input stream.
    }</code></pre>

<p>Whenever a token is recognized, push a new item onto the stack. The value of the <code>type</code> string is the result of the DFA's work identifying the token. This identification process uses the first of the two sub-grammars mentioned in the first article.</p>

<h3>A Long Exit-state Function</h3>

<p>The <code>save_prefix</code> function looks like:</p>

<pre><code>    # Warning: This is a function (i.e. not a method).

    sub save_prefix
    {
        my($dfa)   = @_;
        my($match) = trim($dfa -&gt; match);

        # Upon each call, $match will be 1 of:
        # * strict.
        # * digraph.
        # * graph.

        # Note: Because this is a function, $myself is a global alias to $self.

        $myself -&gt; log(debug =&gt; &quot;save_prefix($match)&quot;);

        # Input     =&gt; Output (a new item, i.e. a hashref):
        # o strict  =&gt; {type =&gt; strict,  value =&gt; yes}.
        # o digraph =&gt; {type =&gt; digraph, value =&gt; yes}.
        # o graph   =&gt; {type =&gt; digraph, value =&gt; no}.

        if ($match eq &#39;strict&#39;)
        {
            $myself -&gt; new_item($match, &#39;yes&#39;);
        }
        else
        {
            # If the first token is &#39;(di)graph&#39; (i.e. there was no &#39;strict&#39;),
            # jam a &#39;strict&#39; into the output stream.

            if ($myself -&gt; items -&gt; length == 0) # Output stream is empty.
            {
                $myself -&gt; new_item(&#39;strict&#39;, &#39;no&#39;);
            }

            $myself -&gt; new_item(&#39;digraph&#39;, $match eq &#39;digraph&#39; ? &#39;yes&#39; : &#39;no&#39;);
        }

    } # End of save_prefix.</code></pre>

<h3>A Tiny Exit-state Function</h3>

<p>Here's one of the shorter exit functions, attached in the STT to the <code>open_brace</code> and <code>start_statement</code> states:</p>

<pre><code>    sub start_statements
    {
        my($dfa) = @_;

        $myself -&gt; new_item(&#39;open_brace&#39;, $myself -&gt; increment_brace_count);

    } # End of start_statements.</code></pre>

<p>The code to push a new item onto the stack is just:</p>

<pre><code>    sub new_item
    {
        my($self, $type, $value) = @_;

        $self -&gt; items -&gt; push
            ({
                count =&gt; $self -&gt; increment_item_count,
                name  =&gt; &#39;&#39;,
                type  =&gt; $type,
                value =&gt; $value,
             });

    } # End of new_item.</code></pre>
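
<p>To experiment with this logic without installing anything, here is a self-contained sketch. It makes several assumptions: the package name <code>Sketch::Lexer</code> is invented, a plain array reference stands in for the <code>Set::Array</code> object, and <code>save_prefix()</code> is written as a method which receives the matched text directly:</p>

<pre><code>    use strict;
    use warnings;

    package Sketch::Lexer; # Hypothetical stand-in for the real lexer class.

    sub new
    {
        my($class) = @_;

        return bless({item_count =&gt; 0, items =&gt; []}, $class);
    }

    sub items
    {
        my($self) = @_;

        return $$self{items};
    }

    sub new_item
    {
        my($self, $type, $value) = @_;
        my($item) =
        {
            count =&gt; ++$$self{item_count},
            name  =&gt; &#39;&#39;,
            type  =&gt; $type,
            value =&gt; $value,
        };

        push @{$self -&gt; items}, $item;
    }

    sub save_prefix
    {
        my($self, $match) = @_;

        if ($match eq &#39;strict&#39;)
        {
            $self -&gt; new_item($match, &#39;yes&#39;);
        }
        else
        {
            # No &#39;strict&#39; was seen, so jam one into the output stream.

            $self -&gt; new_item(&#39;strict&#39;, &#39;no&#39;) if (@{$self -&gt; items} == 0);
            $self -&gt; new_item(&#39;digraph&#39;, $match eq &#39;digraph&#39; ? &#39;yes&#39; : &#39;no&#39;);
        }
    }

    package main;

    my($lexer) = Sketch::Lexer -&gt; new;

    # Simulate the lexer meeting &#39;digraph&#39; with no leading &#39;strict&#39;.

    $lexer -&gt; save_prefix(&#39;digraph&#39;);

    print &quot;$$_{count}: $$_{type} =&gt; $$_{value}\n&quot; for @{$lexer -&gt; items};</code></pre>

<p>The stack ends up holding two items, <code>strict =&gt; no</code> followed by <code>digraph =&gt; yes</code>, which correspond to the first two data rows of <em>data/27.lex</em>.</p>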

<h2>Using Marpa in the Lexer</h2>

<p>Yes, you can use Marpa in the lexer, as discussed in the first article. I prefer to use a spreadsheet full of regexps--but enough of the lexer. It's time to discuss the parser.</p>

<h2>The Parser's Structure</h2>

<p>The parser incorporates the second sub-grammar and uses <code>Marpa::R2</code> to validate the output from the lexer against this grammar. The parser's structure is very similar to that of the lexer:</p>

<ul>

<li>Initialize using the parameters to <code>new()</code></li>

<li>Declare the grammar</li>

<li>Run Marpa</li>

<li>Save the output</li>

</ul>

<h2>Marpa Actions (Callbacks)</h2>

<p>As with the lexer, the parser works via callbacks, which are functions named within the grammar and called by <code>Marpa::R2</code> whenever the input sequence of lexed items matches some component of the grammar. Consider these four <em>rule descriptors</em> in the grammar declared in <code>GraphViz2::Marpa::Parser</code>'s <code>grammar()</code> method:</p>

<pre><code>    [
    ...
    {   # Prolog stuff.
        lhs =&gt; &#39;prolog_definition&#39;,
        rhs =&gt; [qw/strict_definition digraph_definition graph_id_definition/],
    },
    {
        lhs    =&gt; &#39;strict_definition&#39;,
        rhs    =&gt; [qw/strict/],
        action =&gt; &#39;strict&#39;, # &lt;== Callback.
    },
    {
        lhs    =&gt; &#39;digraph_definition&#39;,
        rhs    =&gt; [qw/digraph/],
        action =&gt; &#39;digraph&#39;, # &lt;== Callback.
    },
    {
        lhs    =&gt; &#39;graph_id_definition&#39;,
        rhs    =&gt; [qw/graph_id/],
        action =&gt; &#39;graph_id&#39;, # &lt;== Callback. See sub graph_id() just below.
    },
    ...
    ]</code></pre>

<p>In each case the <code>lhs</code> is a name I've chosen so that I can refer to each rule descriptor in other rule descriptors. That's how I chain rules together to make a tree structure. (See the <em>Chains and Trees</em> section of the previous article.)</p>

<p>This grammar fragment expects the input stream of items from the lexer to consist (at the start of the stream, actually) of three components: a strict thingy, a digraph thingy, and a graph_id thingy. Because I wrote the lexer, I can ensure that this is exactly what the lexer produces.</p>

<p>To emphasise, the grammar says that these items are the only things it will accept at this point in the input stream, that it will accept them only in the given order, and that they must literally consist of the three tokens (see <code>rhs</code>): <em>strict</em>, <em>digraph</em> and <em>graph_id</em>.</p>

<p>These latter three come from the <em>type</em> key in the array of hashrefs built by the lexer. The three corresponding <em>value</em> keys in those hashrefs are <code>yes</code> or <code>no</code> for <em>strict</em>, <code>yes</code> or <code>no</code> for <em>digraph</em>, and an id or the empty string for <em>graph_id</em>.</p>

<p>As with the lexer, when an incoming token (<em>type</em>) matches expectations, <code>Marpa::R2</code> triggers a call to an <em>action</em>, here named (for clarity) the same as the <code>rhs</code>.</p>

<p>Consider one of those functions:</p>

<pre><code>    sub graph_id
    {
        my($stash, $t1, undef, $t2)  = @_;

        $myself -&gt; new_item(&#39;graph_id&#39;, $t1);

        return $t1;

    } # End of graph_id.</code></pre>

<p>The parameter list is courtesy of how <code>Marpa::R2</code> manages callbacks. <code>$t1</code> is the incoming graph id. In <em>data/27.gv</em> (shown earlier), that is <code>graph_27</code>.</p>

<p>Marpa does not supply the string <code>graph_id</code> to this function, because there's no need. I designed the grammar such that this function is only called when the value of the incoming <em>type</em> is <code>graph_id</code>, so I know precisely under what circumstances this function was called. That's why I could hard-code the string <code>graph_id</code> in the body of the <code>graph_id()</code> function.</p>
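
<p>To see the pieces working together, here is a cut-down sketch which feeds just the three prolog tokens to <code>Marpa::R2</code>, using the same rule-descriptor interface as the grammar above. It assumes <code>Marpa::R2</code> is installed. The package name <code>Sketch::Actions</code> and the actions <code>token</code> and <code>prolog</code> are invented for this sketch, and they return values rather than pushing items onto a stack:</p>

<pre><code>    use strict;
    use warnings;

    use Marpa::R2;

    my($grammar) = Marpa::R2::Grammar -&gt; new
    ({
        start   =&gt; &#39;prolog_definition&#39;,
        actions =&gt; &#39;Sketch::Actions&#39;, # Hypothetical package name.
        rules   =&gt;
        [
            {
                lhs    =&gt; &#39;prolog_definition&#39;,
                rhs    =&gt; [qw/strict_definition digraph_definition graph_id_definition/],
                action =&gt; &#39;prolog&#39;, # Added for this sketch.
            },
            {lhs =&gt; &#39;strict_definition&#39;,   rhs =&gt; [qw/strict/],   action =&gt; &#39;token&#39;},
            {lhs =&gt; &#39;digraph_definition&#39;,  rhs =&gt; [qw/digraph/],  action =&gt; &#39;token&#39;},
            {lhs =&gt; &#39;graph_id_definition&#39;, rhs =&gt; [qw/graph_id/], action =&gt; &#39;token&#39;},
        ],
    });

    # Each action receives a per-parse variable, then one value per rhs symbol.

    sub Sketch::Actions::token
    {
        my(undef, $t1) = @_;

        return $t1;
    }

    sub Sketch::Actions::prolog
    {
        my(undef, @value) = @_;

        return [@value];
    }

    $grammar -&gt; precompute;

    my($recognizer) = Marpa::R2::Recognizer -&gt; new({grammar =&gt; $grammar});

    # Feed the (type, value) pairs in the order the lexer emits them.

    $recognizer -&gt; read(@$_) for ([&#39;strict&#39;, &#39;no&#39;], [&#39;digraph&#39;, &#39;yes&#39;], [&#39;graph_id&#39;, &#39;graph_27&#39;]);

    my($result) = $recognizer -&gt; value;

    print join(&#39;, &#39;, @$$result), &quot;\n&quot; if ($result);</code></pre>

<p>If the three tokens arrive in the order the grammar expects, <code>$result</code> is a reference to whatever <code>prolog()</code> returned, here an arrayref of the three token values. Swapping the order of any two tokens is a simple way to see how Marpa reacts to input which does not fit the grammar.</p>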

<h2>The Grammar in Practice</h2>

<p>Now you might be thinking: Just a second! That code seems to be doing no more than copying the input token to the output stream. Well, you're right, sort of.</p>

<p>True understanding comes when you realize that Marpa calls that code only at the appropriate point precisely because the <em>type</em> <code>graph_id</code> and its <em>value</em> <code>graph_27</code> were at exactly the right place in the input stream. By that I mean that the location of the pair:</p>

<pre><code>    (type =&gt; value)
    (&#39;graph_id&#39; =&gt; &#39;graph_27&#39;)</code></pre>

<p>in the input stream was exactly where it had to be to satisfy the grammar initialized by <code>Marpa::R2::Grammar</code>. If it had not been there, Marpa would have thrown an exception, which we would recognize as a syntax error--a syntax error in the input stream fed into the <em>lexer</em>, but which Marpa picked up by testing that input stream against the grammar declared in the <em>parser</em>. The role of the lexer as an intermediary is to simplify the logic of the code as a whole with a divide-and-conquer strategy.</p>

<p>In other words, it's no accident that that function gets called at a particular point in time during the parser's processing of its input stream.</p>
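
<p>To make the idea concrete, here is a minimal hand-rolled sketch of the same ordering check. It is not how <code>Marpa::R2</code> works internally and uses none of its API; <code>check_prolog()</code> and <code>@expected</code> are names invented for this illustration. It only shows that a token in the wrong place becomes a syntax error, while tokens in the right place reach the code that handles them.</p>

<pre><code>    use strict;
    use warnings;

    # Hand-rolled stand-in for the ordering check; this is not Marpa's API.
    my @expected = qw/strict digraph graph_id/;

    sub check_prolog
    {
        my(@tokens) = @_; # A list of {type =&gt; ..., value =&gt; ...} hashrefs.

        for my $i (0 .. $#expected)
        {
            my $got = $i &lt;= $#tokens ? $tokens[$i]{type} : '(end of input)';

            die "Syntax error at token $i: expected $expected[$i], got $got\n"
                if ($got ne $expected[$i]);
        }

        return map{ $_-&gt;{value} } @tokens[0 .. $#expected];

    } # End of check_prolog.

    my @values = check_prolog
    (
        {type =&gt; 'strict',   value =&gt; 'no'},
        {type =&gt; 'digraph',  value =&gt; 'yes'},
        {type =&gt; 'graph_id', value =&gt; 'graph_27'},
    );

    print "@values\n"; # Prints: no yes graph_27</code></pre>

<p>Swap any two of the tokens and the function dies instead of returning. A real grammar does far more than this, but the principle is the same.</p>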

<p>Consider another problem which arises as you build up the set of <em>rule descriptors</em> within the grammar.</p>

<h2>Trees Have Leaves</h2>

<p>The first article discussed chains and trees (see the <code>prolog_definition</code> mentioned earlier in this article). Briefly, each <em>rule descriptor</em> must be chained to other <em>rule descriptors</em>.</p>

<p>The astute reader will have already seen a problem: How do you define the meanings of the leaves of this tree when the chain of definitions must end at each leaf?</p>

<p>Here's part of the <em>data/27.lex</em> input file:</p>

<pre><code>    &quot;type&quot;,&quot;value&quot;
    strict          , &quot;no&quot;
    digraph         , &quot;yes&quot;
    graph_id        , &quot;graph_27&quot;
    start_scope     , &quot;1&quot;
    node_id         , &quot;node_27_1&quot;
    open_bracket    , &quot;[&quot;
    attribute_id    , &quot;color&quot;
    equals          , &quot;=&quot;
    attribute_value , &quot;red&quot;
    attribute_id    , &quot;fontcolor&quot;
    equals          , &quot;=&quot;
    attribute_value , &quot;green&quot;
    close_bracket   , &quot;]&quot;
    ...</code></pre>

<p>The corresponding rule descriptors look like:</p>

<pre><code>    [
    ...
    {
        lhs =&gt; &#39;attribute_statement&#39;,
        rhs =&gt; [qw/attribute_key has attribute_val/],
    },
    {
        lhs    =&gt; &#39;attribute_key&#39;,
        rhs    =&gt; [qw/attribute_id/], # &lt;=== This is a terminal.
        min    =&gt; 1,
        action =&gt; &#39;attribute_id&#39;,
    },
    {
        lhs    =&gt; &#39;has&#39;,
        rhs    =&gt; [qw/equals/],
        min    =&gt; 1,
    },
    {
        lhs    =&gt; &#39;attribute_val&#39;,
        rhs    =&gt; [qw/attribute_value/], # &lt;=== And so is this.
        min    =&gt; 1,
        action =&gt; &#39;attribute_value&#39;,
    },
    ...
    ]</code></pre>

<p>The items marked as terminals (standard parsing terminology) have no further definitions, so <code>attribute_key</code> and <code>attribute_val</code> are leaves in the tree of <em>rule descriptors</em>. What does that mean? The terminals <code>attribute_id</code> and <code>attribute_value</code> must appear literally in the input stream.</p>

<p>Switching between <code>attribute_key</code> and <code>attribute_id</code> is a requirement of Marpa to avoid ambiguity in the statement of the grammar. Likewise for <code>attribute_val</code> and <code>attribute_value</code>.</p>

<p>The <code>min</code> makes the attributes mandatory. This is not in the sense that nodes and edges <em>must</em> have attributes (they don't), but in the sense that if the input stream has an <code>attribute_id</code> token, then it <em>must</em> have an <code>attribute_value</code> token and vice versa.</p>

<p>Remember the earlier section "Working With An Incomplete BNF"? If the original <em>*.gv</em> file used one of:</p>

<pre><code>    size = &quot;6,6&quot;
    [size = &quot;6,6&quot;]
    graph [size = &quot;6,6&quot;]</code></pre>

<p>... then the one chosen really represents the graph attribute:</p>

<pre><code>    graph [size = &quot;6,6&quot;]</code></pre>

<p>To make this work, the lexer must force the output to be:</p>

<pre><code>    &quot;type&quot;,&quot;value&quot;
    ...
    class_id        , &quot;graph&quot;
    open_bracket    , &quot;[&quot;
    attribute_id    , &quot;size&quot;
    equals          , &quot;=&quot;
    attribute_value , &quot;6,6&quot;
    close_bracket   , &quot;]&quot;</code></pre>

<p>This matches the requirements, in that both <code>attribute_id</code> and <code>attribute_value</code> are present, as is their (so to speak) owner, the object itself, which is identified by the <em>type</em> <code>class_id</code>.</p>

<p>All of this should reinforce the point that the design of the lexer is intimately tied to the design of the parser. By taking decisions like this in the lexer you can standardize its output and simplify the work that the parser needs to do.</p>
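
<p>As a rough sketch of that kind of standardization, the function below takes a bare attribute name and value and emits the full token sequence shown above. The name <code>normalize_graph_attribute()</code> is hypothetical and is not taken from <code>GraphViz2::Marpa</code>; the real lexer reaches the same output by way of its state transition table.</p>

<pre><code>    use strict;
    use warnings;

    # Hypothetical helper, not part of GraphViz2::Marpa: give a bare
    # attribute an explicit owner and brackets, whatever form the input took.
    sub normalize_graph_attribute
    {
        my($id, $value) = @_;

        return
        (
            {type =&gt; 'class_id',        value =&gt; 'graph'},
            {type =&gt; 'open_bracket',    value =&gt; '['},
            {type =&gt; 'attribute_id',    value =&gt; $id},
            {type =&gt; 'equals',          value =&gt; '='},
            {type =&gt; 'attribute_value', value =&gt; $value},
            {type =&gt; 'close_bracket',   value =&gt; ']'},
        );

    } # End of normalize_graph_attribute.

    printf "%-16s, \"%s\"\n", $_-&gt;{type}, $_-&gt;{value}
        for normalize_graph_attribute('size', '6,6');</code></pre>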

<h2>Where to go from here</h2>

<p>The recently released Perl module <a
    href="http://search.cpan.org/perldoc?MarpaX::Simple::Rules">MarpaX::Simple::Rules</a>
takes a BNF and generates the corresponding grammar in the format expected by
<code>Marpa::R2</code>.</p>

<p>Jeffrey Kegler (author of Marpa) <a href="http://jeffreykegler.github.com/Ocean-of-Awareness-blog/individual/2012/06/the-useful-the-playful-the-easy-the-hard-and-the-beautiful.html">has blogged about MarpaX::Simple::Rules</a>.</p>

<p>This is a very interesting development, because it automates the laborious process of converting a BNF into a set of Marpa's <em>rule descriptors</em>. Consequently, it makes sense for anyone contemplating using <code>Marpa::R2</code> to investigate how appropriate it would be to do so via <code>MarpaX::Simple::Rules</code>.</p>

<h2>Wrapping Up and Winding Down</h2>

<p>You've seen samples of lexer output and some parts of the grammar that define the second sub-grammar, which must match the input from that lexer precisely. If they don't match, it is in fact the parser which issues the dreaded syntax error message, because only it (not the lexer) knows which combinations of input tokens are acceptable.</p>

<p>Just as in the lexer, callback functions stockpile items which have passed <code>Marpa::R2</code>'s attempt to match up input tokens with rule descriptors. This technique records exactly which rules fired in which order. After <code>Marpa::R2</code> has run to completion, you have a stack of items whose elements are a (lexed and) parsed version of the original file. Your job is then to output that stack to a file, or wait for the caller of the parser to ask for the stack as an array reference. From there, the world.</p>
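
<p>The stockpiling itself needs very little code. The sketch below is an assumption about the general shape, not the module's actual implementation: <code>new_item()</code> here is a plain function standing in for the method of the same name called in <code>graph_id()</code>, and the three calls simulate three rules firing in order.</p>

<pre><code>    use strict;
    use warnings;

    my @items; # The stack which the callbacks fill as rules fire.

    # Hypothetical stand-in for the new_item() method used in graph_id().
    sub new_item
    {
        my($type, $value) = @_;

        push @items, {type =&gt; $type, value =&gt; $value};

        return $value;

    } # End of new_item.

    # Simulate three rules firing in order.
    new_item(strict   =&gt; 'no');
    new_item(digraph  =&gt; 'yes');
    new_item(graph_id =&gt; 'graph_27');

    # Hand the stack back to the caller as an array reference.
    my $stack = \@items;

    print "$_-&gt;{type} =&gt; $_-&gt;{value}\n" for @$stack;</code></pre>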

<p>For more details, consult <a href="http://savage.net.au/Ron/html/writing.graph.easy.marpa.html">my July 2011 article on Marpa::R2</a>.</p>

<h2>The Lexer and the State Transition Table - Revisited</h2>

<p>The complexity of the STT in <code>GraphViz2::Marpa</code> justifies the
decision to split the lexer and the parser into separate modules. Clearly that
will not always be the case. Given a sufficiently simple grammar, the lexer
phase may be redundant. Consider this test data file,
<em>data/sample.1.ged</em>, from <a
    href="http://search.cpan.org/perldoc?Genealogy::Gedcom">Genealogy::Gedcom</a>:</p>

<pre><code>    0 HEAD
    1 SOUR Genealogy::Gedcom
    2 NAME Genealogy::Gedcom::Reader
    2 VERS V 1.00
    2 CORP Ron Savage
    3 ADDR Box 3055
    4 STAE Vic
    4 POST 3163
    4 CTRY Australia
    3 EMAIL ron@savage.net.au
    3 WWW http://savage.net.au
    2 DATA
    3 COPR Copyright 2011, Ron Savage
    1 NOTE
    2 CONT This file is based on test data in Paul Johnson&#39;s Gedcom.pm
    2 CONT Gedcom.pm is Copyright 1999-2009, Paul Johnson (paul@pjcj.net)
    2 CONT Version 1.16 - 24th April 2009
    2 CONT
    2 CONT Ron&#39;s modules under the Genealogy::Gedcom namespace are free
    2 CONT
    2 CONT The latest versions of these modules are available from
    2 CONT my homepage http://savage.net.au and http://metacpan.org
    1 GEDC
    2 VERS 5.5.1-5
    2 FORM LINEAGE-LINKED
    1 DATE 10-08-2011
    1 CHAR ANSEL
    1 SUBM @SUBM1@
    0 TRLR</code></pre>

<p>Each line matches <code>/^(\d+)\s([A-Z]{3,5})(?:\s(.+))?$/</code>: an integer, a
keyword, and an optional string. In this case I'd skip the lexer, and have the parser
tokenize the input. So, horses for courses. (GEDCOM defines genealogical data;
see <a href="http://wiki.webtrees.net/File:Ged551-5.pdf">the GEDCOM
    definition</a> for more details.)</p>
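
<p>Here is a minimal sketch of that regex-only approach. It assumes a pattern which allows keywords of three to five letters (for <code>EMAIL</code>) and makes the trailing string optional (lines such as <code>0 HEAD</code> carry no data). The function name <code>parse_gedcom_line()</code> is made up for this example and is not part of <code>Genealogy::Gedcom</code>.</p>

<pre><code>    use strict;
    use warnings;

    # Regex-only tokenizer sketch; parse_gedcom_line() is a hypothetical name.
    sub parse_gedcom_line
    {
        my($line) = @_;

        $line =~ s/^\s+|\s+$//g; # Ignore indentation and trailing blanks.

        my($level, $tag, $data) = $line =~ /^(\d+)\s([A-Z]{3,5})(?:\s(.+))?$/
            or die "Syntax error in line: $line\n";

        return {level =&gt; $level, tag =&gt; $tag, data =&gt; $data // ''};

    } # End of parse_gedcom_line.

    for my $line ('0 HEAD', '1 SOUR Genealogy::Gedcom', '3 EMAIL ron@savage.net.au')
    {
        my $token = parse_gedcom_line($line);

        print join('|', @$token{qw/level tag data/}), "\n";
    }</code></pre>

<p>Each hashref plays the role of a token, so the parser can consume the lines directly without a separate lexing pass.</p>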

<h2>Sample Output</h2>

<p>I've provided several links to sample output for your perusal.</p>

<ul>

<li><a href="http://savage.net.au/Perl-modules/html/graphviz2/">GraphViz2 (non-Marpa)</a></li>

<li><a href="http://savage.net.au/Perl-modules/html/graphviz2.marpa/">GraphViz2::Marpa</a></li>

<li><a href="http://savage.net.au/Perl-modules/html/graphviz2.pathutils/">GraphViz2::Marpa::PathUtils</a></li>

<li><a href="http://savage.net.au/Perl-modules/html/graph.easy.marpa/">Graph::Easy::Marpa</a></li>

</ul>

<p>Happy lexing and parsing!</p>]]>
        
    </content>
</entry>

<entry>
    <title>Consuming RESTful Services with Perl</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/12/consuming-restful-services-with-perl.html" />
    <id>tag:www.perl.com,2012:/pub//2.2079</id>

    <published>2012-12-31T14:00:01Z</published>
    <updated>2013-01-01T03:32:24Z</updated>

    <summary>When JT Smith ported his web game The Lacuna Expanse to a board game, he used Perl to create the board game itself. Here&apos;s how he built the web service behind The Game Crafter.</summary>
    <author>
        <name>JT Smith</name>
        <uri>http://www.plainblack.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p>In my previous article I described <a
    href="http://www.perl.com/pub/2012/11/designing-board-games-with-perl.html">how
    to create board game images using <code>Image::Magick</code></a>, thus
allowing you to design board games using Perl. This time I want to show you how
to upload those images to <a href="https://www.thegamecrafter.com/">The Game
    Crafter</a> so you can get a professional copy of the game
manufactured.</p>

<p>In the last article I <a
    href="https://www.thegamecrafter.com/games/lacuna-expanse:-a-new-empire">created
    a board game version</a> of the video game called <a
    href="http://www.lacunaexpanse.com">The Lacuna Expanse</a>. This time I'll
show how to upload those images to the site to create a custom board game.
Don't worry if you're not ready to create a board game or if you'll never be
ready; the principles of designing a useful API and using it apply to all sorts
of services you might want to use, from weather tracking to stocks to medical
systems. I picked games for two reasons. One, my team and I just built this
system, so it's shiny and new and I've learned a lot. Two, games are more fun
(and visual) than showing how to record invoice information in a RESTful ERP
application.</p>

<h2>Getting Ready</h2>

<p>First, get yourself a copy of <a
    href="http://search.cpan.org/~rizen/TheGameCrafter-Client/lib/TheGameCrafter/Client.pm">TheGameCrafter::Client</a>.
It's a Perl module that makes it trivial to interact with The Game Crafter's
RESTful web service API. When you <code>use TheGameCrafter::Client;</code> it
imports <code>tgc_get()</code>, <code>tgc_post()</code>,
<code>tgc_put()</code>, and <code>tgc_delete()</code> into your program for
easy use. <em>Good APIs are descriptive and obvious.</em> From there you can
have a look at The Game Crafter's <a
    href="https://www.thegamecrafter.com/developer/">developer
    documentation</a> to do any custom stuff you want. <em>Good APIs have
    useful documentation.</em></p>

<p>You'll also need to activate the <a
    href="https://www.thegamecrafter.com/account">developer setting in your TGC
    account</a>, and <a
    href="https://www.thegamecrafter.com/account/apikeys">request an API
    key</a>. <em>Good APIs demonstrate security in authorization and
    authentication.</em></p>

<h2>A Little about The Game Crafter's API</h2>

<p><code>TheGameCrafter::Client</code> is just a tiny wrapper around our
RESTful web services. I designed the web services atop <a
    href="http://search.cpan.org/~xsawyerx/Dancer/lib/Dancer.pm">Dancer</a> and
<a
    href="http://search.cpan.org/~frew/DBIx-Class/lib/DBIx/Class.pm">DBIx::Class</a>.
My goal with this was to build a very reliable and consistent API not only for
external use but also for internal use. You see, the <em>entire</em> TGC web site actually
runs off these web services. Not only that, but these web services tie directly
into our manufacturing facility, so they are controlling the physical world in
addition to the virtual. <em>Good APIs allow you to build multiple clients with
    different uses.</em> (They don't <em>require</em> multiple clients, but
they don't forbid it and do enable it.)</p>

<p>With Lacuna, I built <a
    href="http://search.cpan.org/~rizen/JSON-RPC-Dispatcher/lib/JSON/RPC/Dispatcher.pm.orig">JSON::RPC::Dispatcher</a>
(JRD), which is a JSON-RPC 2.0 web service handler on top of <a
    href="http://search.cpan.org/~miyagawa/Plack/lib/Plack.pm">Plack</a>. I
love JRD, but it has two weaknesses, one of them fatal. One weakness is that
you must format parameters using JSON, which means that it's not easy to just
call a URL and get a result with something like <code>curl</code>. (<em>Good
    RESTful APIs allow multiple clients.</em> If you can't use
<code>curl</code>, you probably have a problem.) The fatal weakness of JSON-RPC
2.0 is that there is no way to do file uploads within the spec. The Game
Crafter is all about file uploads, so that meant I either needed to handle
those separately (aka inconsistently), or develop something new. I opted for
the latter.</p>

<p>With TGC's web services I decided to adopt some of the things I really liked
about JSON-RPC, namely the way it handles responses whether they be result sets
or errors. So you always get a consistent return:</p>

<pre><code>{ "result" : { ... } }

{
   "error" : {
        "code" : 404,
        "message" : "File not found.",
        "data" : "file_id"
   }
}</code></pre>
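
<p>Because the shape of the response is always the same, a client can handle it in one place. The helper below is a sketch, not part of <code>TheGameCrafter::Client</code> (which may well do this for you); <code>unwrap_response()</code> is a hypothetical name, and the only assumption is the result/error structure shown above. It uses the core <code>JSON::PP</code> module.</p>

<pre><code>use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: return the result, or die with the error details.
sub unwrap_response {
    my ($json)   = @_;
    my $response = decode_json($json);

    if (my $error = $response-&gt;{error}) {
        die "$error-&gt;{code}: $error-&gt;{message} ($error-&gt;{data})\n";
    }

    return $response-&gt;{result};
}

my $result = unwrap_response('{ "result" : { "id" : "abc" } }');
print $result-&gt;{id}, "\n";

eval {
    unwrap_response('{ "error" : { "code" : 404, "message" : "File not found.", "data" : "file_id" } }');
};
print "Failed: $@";</code></pre>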

<p>With TGC I also wanted a consistent and easy way of turning
<code>DBIx::Class</code> into web services through Dancer. I looked into things
like <a
    href="http://search.cpan.org/~oliver/Catalyst-Plugin-AutoCRUD-2.122460/lib/Catalyst/Plugin/AutoCRUD.pm">AutoCRUD</a>,
but I'm not a fan of Catalyst, and it also took too much configuration (in my
opinion) to get it working. I wanted something simpler and faster, so I decided
to roll my own. The result was a thin layer of glue between Dancer and
<code>DBIx::Class</code> that allows you to define your web service interface
in your normal <code>DBIx::Class</code> declarations. It then automatically
generates the web services, database tables, web form handling, and more for
you. This little glue layer is now in use in all web app development within <a
    href="http://www.plainblack.com/">Plain Black</a>, and eventually we'll be
releasing it onto CPAN for all to use for free. The best part of that is that
you know you're getting a production-ready system because it's been running The
Game Crafter and other sites for over a year now. (<em>Good APIs are often
    extracted from working systems.</em>) More on that in a future article.</p>

<h2>Let's Do This Thing Already</h2>

<p>Before you can make any API calls, you need to authenticate.</p>

<pre><code> my $session = tgc_post('session',{
   username    =&gt; 'me',
   password    =&gt; '123qwe',
   api_key_id  =&gt; 'abcdefhij',
 });
</code></pre>

<p>Before you can start uploading, fetch your user account information. This
contains several pieces of info that you can use.</p>

<pre><code> my $user = tgc_get('user', {
   session_id    =&gt; $session-&gt;{id},           # using our session to do stuff
});
</code></pre>

<p>Think of TGC projects like filesystems: you have folders which contain
folders and files. First create a folder, then upload a file:</p>

<pre><code> my $folder = tgc_post('folder', {
  session_id  =&gt; $session-&gt;{id},
  name        =&gt; 'Lacuna',
  user_id     =&gt; $user-&gt;{id},
  parent_id   =&gt; $user-&gt;{root_folder_id},  # putting this in the home folder
 });

 my $file = tgc_post('file', {
  session_id  =&gt; $session-&gt;{id},
  name        =&gt; 'Mayhem Training',
  file        =&gt; ['mayhem.png'],         # the array ref signifies this is a file path
  folder_id   =&gt; $folder-&gt;{id},       # putting it in the just-created folder
 });</code></pre>

<p>Assuming at this point you've uploaded all of your files, you can now build
out your game. The Game Crafter has this notion of a "Designer", which is sort
of like your very own publishing company. Games are attached to the designer,
so first you must create the designer, then the game.</p>

<pre><code> my $designer = tgc_post('designer', {
  session_id  =&gt; $session-&gt;{id},
  user_id     =&gt; $user-&gt;{id},
  name        =&gt; 'Lacuna Expanse Corp',
 });

 my $game = tgc_post('game', {
  session_id  =&gt; $session-&gt;{id},
  designer_id =&gt; $designer-&gt;{id},
  name        =&gt; 'Lacuna Expanse: A New Empire',
 });</code></pre>

<p>With a game created and assets uploaded, you can now create a deck of cards. This is pretty straightforward, just like before.</p>

<pre><code> my $deck = tgc_post('minideck', {
  session_id =&gt; $session-&gt;{id},
  name       =&gt; 'Planet',
  game_id    =&gt; $game-&gt;{id},
 });

 my $card = tgc_post('minicard', {
  session_id =&gt; $session-&gt;{id},
  name       =&gt; 'Mayhem Training',
  face_id    =&gt; $file-&gt;{id},
  deck_id    =&gt; $deck-&gt;{id},
 });</code></pre>

<p>You have probably noticed already how closely this resembles CRUD
operations, because it does. Behind the scenes, who knows what TGC does with
this information? (I do, but that's because I wrote it.) It doesn't matter to
the API, because all of those details are hidden behind a good API. <em>Good
    APIs expose only the necessary details</em>&mdash;in this case, the
relationships between folders and files and between designers and games.</p>

<p>You can also see that the API is as stateless as possible, where the session
identifier is part of every API call. It's easy to imagine a more complicated
API which hides this, but I stuck with the bare-bones REST for at least two
reasons: it's simple, and it's easy to see what's happening. Someone could
build over the top of this API if desired. <em>Good APIs allow extension and
    further abstraction.</em></p>
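
<p>As one example of building over the top, here is a small sketch of a closure which adds the session identifier to every call. <code>with_session()</code> is a hypothetical helper, not something the client module provides, and the fake transport exists only so the example runs without a network connection. In real use you might pass <code>\&amp;tgc_post</code> and <code>$session-&gt;{id}</code> instead.</p>

<pre><code>use strict;
use warnings;

# Hypothetical wrapper: $call stands in for tgc_post() or tgc_get().
sub with_session {
    my ($call, $session_id) = @_;

    return sub {
        my ($path, $params) = @_;
        return $call-&gt;($path, { %{ $params || {} }, session_id =&gt; $session_id });
    };
}

# A fake transport that just echoes its arguments, so this runs offline.
my $fake_post = sub {
    my ($path, $params) = @_;
    return { path =&gt; $path, %$params };
};

my $post = with_session($fake_post, 'session-123');
my $deck = $post-&gt;('minideck', { name =&gt; 'Planet' });

print "$deck-&gt;{path} $deck-&gt;{name} $deck-&gt;{session_id}\n";</code></pre>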

<p>Just like that, you've created a game and added a deck of cards to it. There
are of course lots of other fancy things you can do with the API, but this
should get you started. I wouldn't leave you hanging there, however. I've <a
    href="https://github.com/plainblack/Lacuna-Board-Game">open sourced the
    actual code</a> I used to create <a
    href="https://www.thegamecrafter.com/games/lacuna-expanse:-a-new-empire">the
    Lacuna Expanse board game</a> so you'd have something to reference. There's
also a <a
    href="https://community.thegamecrafter.com/forums/developers">developer's
    forum</a> if you have any questions. Good luck to you, and happy
gaming!</p>]]>
        
    </content>
</entry>

<entry>
    <title>Designing Board Games With Perl</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/11/designing-board-games-with-perl.html" />
    <id>tag:www.perl.com,2012:/pub//2.2077</id>

    <published>2012-11-30T14:00:01Z</published>
    <updated>2012-12-03T00:38:00Z</updated>

    <summary>When JT Smith ported his web game The Lacuna Expanse to a board game, he used Perl to automate things. Here&apos;s how he did it.</summary>
    <author>
        <name>JT Smith</name>
        <uri>http://www.plainblack.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p>Board games are hotter than they've ever been. In fact, <a
    href="http://www.icv2.com/articles/news/24066.html">the board game market
    has grown 25% in the past year while the video game market shrank 20%</a>.
But you're a Perl hacker, not an Adobe Illustrator, so how can you design a
board game? Well, that's exactly what I aim to show you in this article.</p>

<p>First, you need an idea. You can turn literally anything into a game.  <a
    href="https://www.thegamecrafter.com/games/adventurelings"
    title="Adventurelings">Whether</a> <a
    href="https://www.thegamecrafter.com/games/plague-the-card-game"
    title="Plague">you</a> <a href="https://www.thegamecrafter.com/games/merc"
    title="MERC">just</a> <a
    href="https://www.thegamecrafter.com/games/the-decktet-firmament-"
    title="The Decktet">want</a> <a
    href="https://www.thegamecrafter.com/games/zombiezone"
    title="ZombieZone">to</a> <a
    href="https://www.thegamecrafter.com/games/surviving-design-projects"
    title="Surviving Design Projects">design</a> <a
    href="https://www.thegamecrafter.com/games/rejection-therapy-the-game"
    title="Rejection Therapy">your</a> <a
    href="https://www.thegamecrafter.com/games/hackers-agents" title="Hackers
    and Agents">own</a> <a
    href="https://www.thegamecrafter.com/games/the-tarat" title="The
    TaRat">custom</a> <a
    href="https://www.thegamecrafter.com/games/wild-pursuit-" title="Wild
    Pursuit">playing</a> <a
    href="https://www.thegamecrafter.com/games/jump-gate" title="Jump
    Gate">cards</a>, <a
    href="https://www.thegamecrafter.com/games/sandwich-city" title="Sandwich
    City">or</a> <a href="https://www.thegamecrafter.com/games/shake-out-"
    title="Shake Out">you</a> <a
    href="https://www.thegamecrafter.com/games/trade-fleet" title="Trade
    Fleet">want</a> <a href="https://www.thegamecrafter.com/games/diggity"
    title="Diggity">to</a> <a
    href="https://www.thegamecrafter.com/games/elemental-clash:-the-basic-set"
    title="Elemental Clash">make</a> <a
    href="https://www.thegamecrafter.com/games/black-and-red" title="Black and
    Red Playing Cards">a</a> <a
    href="https://www.thegamecrafter.com/games/frogs-" title="Frogs!">full</a>
<a href="https://www.thegamecrafter.com/games/phytocats"
    title="Phytocats">on</a> <a
    href="https://www.thegamecrafter.com/games/city-of-gears" title="City of
    Gears">custom</a> <a href="https://www.thegamecrafter.com/games/gibs"
    title="Gibs">board</a> <a
    href="https://www.thegamecrafter.com/games/dr-pergias-race" title="Doctor
    Pergaias' Race Across The Continent">or</a> <a
    href="https://www.thegamecrafter.com/games/goblin-warlord" title="Goblin
    Warlord">card</a> <a
    href="https://www.thegamecrafter.com/games/braaaaains-"
    title="Braaaaains!">game</a>, <a
    href="https://www.thegamecrafter.com/games/the-great-race1" title="The
    Great Race">the</a> <a
    href="https://www.thegamecrafter.com/games/road-to-magnate" title="Road to
    Magnate">options</a> <a
    href="https://www.thegamecrafter.com/games/zerpang-"
    title="Zerpang!">are</a> <a
    href="https://www.thegamecrafter.com/games/angels-elements" title="Angels
    Elements">limitless</a>.  (Full disclosure: I'm one of the owners of <a
    href="https://www.thegamecrafter.com/">The Game Crafter</a>, which itself
is written entirely in Perl. )</p>

<p>For the purposes of this article, I'm going to make <a
    href="https://www.thegamecrafter.com/games/lacuna-expanse:-a-new-empire">a
    board game version</a> of the popular Perl-based web game <a
    href="http://www.lacunaexpanse.com">The Lacuna Expanse</a>. (I'm also one
of the owners of Lacuna.) I chose this because I already have some artwork for
it, albeit not in board game form. However, <a
    href="https://community.thegamecrafter.com/publish/file-preparation/art-resources">you
    can get free art from various sites around the web</a>.</p>


<p>My Lacuna-based board game will be a tile placement game where all the
players work together cooperatively to fend off an alien invasion.</p>

<h2>Let's Get To The Perl Already!</h2>

<p>There are several great image manipulation libraries on the CPAN, but my
personal favorite is <a
    href="http://search.cpan.org/~jcristy/PerlMagick-6.77/Magick.pm.in">Image::Magick</a>.
I started by creating a base image which I could manipulate in any way that I
wanted. (I based my choice off of <a
    href="https://www.thegamecrafter.com/publish/pricing">The Game Crafter's
    list of component sizes and prices</a>.) I decided to use <a
    href="https://community.thegamecrafter.com/publish/templates/cards/mini-cards">mini
    cards</a>, because the table would fill up too quickly with full
poker-sized cards; there'll be a lot of cards on the table!</p>

<pre><code> my $card = Image::Magick-&gt;new(size=&gt;'600x825');
 say $card-&gt;ReadImage('canvas:white');</code></pre>

<p>Note that I used <code>say</code> in front of the <code>ReadImage</code> call.
<code>Image::Magick</code> will emit a textual exception on each call if
anything goes wrong. I could easily wrap that with better error handling, but
for now printing to the screen is sufficient for my needs.</p>
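
<p>If you do want stricter error handling, one possible sketch is a small helper which dies as soon as a call returns any message. The name <code>check()</code> is hypothetical, and the example feeds it plain strings so it runs without <code>Image::Magick</code> installed. It treats every non-empty status as fatal, which may be stricter than you want if a call only produces a warning.</p>

<pre><code> use strict;
 use warnings;

 # Hypothetical helper: an empty status means success, anything else
 # is the textual exception, so stop right away.
 sub check {
   my ($what, $status) = @_;
   die "$what failed: $status\n" if defined $status &amp;&amp; "$status" ne '';
   return 1;
 }

 check('ReadImage', '');   # success: nothing happens

 eval { check('ReadImage', 'Exception 435: unable to open image') };
 print $@;</code></pre>

<p>With a real image object the call would look something like <code>check('ReadImage', scalar $card-&gt;ReadImage('canvas:white'))</code>.</p>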

<p>When printing things (really <em>printing</em> them, with ink and all) you
also have to take into account something called <a
    href="http://youtu.be/NqZSFpmS2dM">bleed and cut lines</a>. It's easy to
draw the cut line on the card in red as the boundary of the printable image
content.</p>

<pre><code> say $card-&gt;Draw(stroke=&gt;'red', fill =&gt; 'none', strokewidth=&gt;1, primitive=&gt;'rectangle', points=&gt;'38,38 562,787');
</code></pre>

<img src="/pub/2012/11/blank.jpg" alt="blank with cut lines">
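
<p>The numbers in that <code>points</code> string are just the card size with a 38-pixel margin taken off each side. If you plan to draw cut lines on components of other sizes, a tiny helper can do the arithmetic. <code>cut_line_points()</code> is a name invented for this sketch.</p>

<pre><code> use strict;
 use warnings;

 # Hypothetical helper: inset a rectangle by $margin pixels on every side.
 sub cut_line_points {
   my ($width, $height, $margin) = @_;
   return sprintf '%d,%d %d,%d',
     $margin, $margin, $width - $margin, $height - $margin;
 }

 print cut_line_points(600, 825, 38), "\n";   # prints 38,38 562,787</code></pre>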

<p>So far so good. The next step is to give this card a background so that it
starts to look like a card. For this I'll take one of the planet surface images
from the Lacuna Expanse and rotate it and stretch it to fit the shape of the
card.</p>

<pre><code> my $surface = Image::Magick-&gt;new;
 say $surface-&gt;ReadImage('surface-p17.jpg');
 say $surface-&gt;Rotate(90);
 say $surface-&gt;Resize('600x825!');
 say $card-&gt;Composite(compose =&gt; 'over', image =&gt; $surface, x =&gt; 0, y =&gt; 0);</code></pre>

<p>Note the exclamation point (<strong>!</strong>) on the <code>Resize</code>
command. That tells <code>Image::Magick</code> to distort the native aspect
ratio of the image. In other words, stretch the image to fill the size I've
specified.</p>

<img src="/pub/2012/11/background.jpg" alt="background">

<p>You may have noticed that this image looks enormous. That's because it's for
print (on paper!) rather than screens. Print has more <a
    href="http://proshooter.com/article_whatisa300dpiJPeg.htm">pixels per
    inch/centimeter</a> than screens, thus the image looks bigger when you
display it on a screen.</p>

<p>Now the card needs a title. Adding text to the image is straightforward.</p>

<pre><code> $card-&gt;Annotate(text =&gt; 'Mayhem Training', font =&gt; 'ALIEN5.ttf', y =&gt; -275, fill =&gt; 'white', pointsize =&gt; 70, gravity =&gt; 'Center');</code></pre>

<img src="/pub/2012/11/title.jpg" alt="title">

<p>As you can see I've used a custom font. <code>Image::Magick</code> is
capable of using nearly any OpenType or TrueType font.</p>

<p>With a background and a title, the next step is to overlay the card with a
picture of the Mayhem Training building from the video game.</p>

<pre><code> my $image = Image::Magick-&gt;new;
 say $image-&gt;ReadImage('mayhemtraining9.png');
 say $card-&gt;Composite(compose =&gt; 'over', image =&gt; $image, x =&gt; 100, y =&gt; 165);</code></pre>

<img src="/pub/2012/11/image.jpg" alt="added image">

<p>Now we're finally getting somewhere! This is really starting to look like a
card. Use the same technique to overlay an icon onto the card. As in so many
games, these icons symbolize an ability that the card grants the player who
uses it. You can get free icons from all over the web; one of my favorite
libraries is <a href="http://www.glyphish.com">Glyphish</a>.</p>

<pre><code> my $icon = Image::Magick-&gt;new;
 say $icon-&gt;ReadImage('target.png');
 say $card-&gt;Composite(compose =&gt; 'over', image =&gt; $icon, x =&gt; 100, y =&gt; 570);</code></pre>

<img src="/pub/2012/11/icon.jpg" alt="added icon">

<p>You can't get away with icons all the time; a little text will explain
things to new players. Adding some explanation to the card would be really
tricky, if it weren't for <a
    href="http://www.imagemagick.org/discourse-server/viewtopic.php?f=7&amp;t=3708">some
    really neat code that Gabe Schaffer contributed to the ImageMagick forums a
    long time ago</a>. Basically without this code you'd have to make the text
wrap at word boundaries yourself, but with it, you can just do a simple
<code>Annotate</code> call like this:</p>

<pre><code> $card-&gt;Set(font =&gt; 'promethean.ttf', pointsize =&gt; 35);
 my $text = 'Demolish one of your buildings to use this ability.';
 my $text_wrapped = wrap($text, $card, 400);
 say $card-&gt;Annotate(text =&gt; $text_wrapped, x =&gt; 100, y =&gt; 690, font =&gt; 'promethean.ttf', fill =&gt; 'white', pointsize =&gt; 35);
</code></pre>

<img src="/pub/2012/11/text.jpg" alt="added text">

<p>A game like this wouldn't be very interesting if you could place any card
anywhere you want. To solve this, I want to add something to the card to
indicate how other cards can connect to it. This is the most challenging part
yet, because I want to make a half-circle/half-rectangle connector. Because
this is a bit more complicated and I want to use it for drawing connection
points on various sides of the card, I'll turn it into a subroutine.</p>

<pre><code> sub draw_connection_point {
   my ($card, $color, $rotation, $x, $y) = @_;

   # draw a half circle; it's a half because we're drawing outside the image
   my $half_circle  = Image::Magick-&gt;new(size=&gt;'70x35');
   say $half_circle-&gt;ReadImage('canvas:transparent');
   say $half_circle-&gt;Draw(stroke =&gt; $color, fill =&gt; $color, strokewidth=&gt;1, primitive=&gt;'circle', points=&gt;'35,35, 35,70');

   # create the connection point image
   my $connection = Image::Magick-&gt;new(size=&gt;'70x85');
   say $connection-&gt;ReadImage('canvas:transparent');

   # add the half circle to the connection point
   say $connection-&gt;Composite(compose =&gt; 'over', image =&gt; $half_circle, x =&gt; 0, y =&gt; 0);

   # extend the connection point to the edge
   say $connection-&gt;Draw(stroke=&gt;$color, fill =&gt; $color, strokewidth=&gt;1, primitive=&gt;'rectangle', points=&gt;'0,35 70,85');

   # orient the connection point for its position
   say $connection-&gt;Rotate($rotation);

   # apply the connection point to the image
   say $card-&gt;Composite(compose =&gt; 'over', image =&gt; $connection, x =&gt; $x, y =&gt; $y);
 }

 draw_connection_point($card, 'purple', 0, 265, 740);</code></pre>

<img src="/pub/2012/11/connection.jpg" alt="connection added">

<p>Sometimes it's nice to give players hints about stuff so they can form
better strategies. To that end, I added a series of pips above the title to
indicate how many copies of this card are in the deck. In this case, this card
is unique.</p>

<pre><code> my $quantity = 1;
 my $pips = '.' x $quantity;
 say $card-&gt;Annotate(text =&gt; $pips, y =&gt; -340, fill =&gt; 'white', pointsize =&gt; 70, gravity =&gt; 'Center');</code></pre>

<img src="/pub/2012/11/finished.jpg" alt="finished">

<p>Remember to remove the cut line before you save the file.</p>

<pre><code> #say $card-&gt;Draw(stroke=&gt;'red', fill =&gt; 'none', strokewidth=&gt;1, primitive=&gt;'rectangle', points=&gt;'38,38 562,787');
 say $card-&gt;Write('mayhem.png');</code></pre>

<img src="/pub/2012/11/cut-line-removed.jpg" alt="cut lines removed">

<h2>Rationale</h2>

<p>Now that I've shown you how to create a card, you may have one question. Why
would you go through the work of coding it rather than just using Photoshop or
the Gimp? There are lots of reasons to code it, including not knowing how to
use an image editor. However, the really important reason is the same reason
you write code to do anything... automation! A game isn't made of just one
card. Likewise, games aren't designed in just one try. It takes lots of play
testing and revisions. If you design your board game using code, you can whip
out a new revision as easily as changing a config file.</p>

<p>Of course, automatic image generation isn't only for games....</p>

<h2>Next Time</h2>

<p>I've shown you how to create the images for a game. If you're like me, the
next thing you want to do is print your game. You could do this at home, but it
will cost you a lot of time and money (ink jet ink costs more than human
blood). You could take it to Kinkos, but you won't get a nice quality product
because they don't specialize in making games. Instead, you can upload your
files to <a href="https://www.thegamecrafter.com">The Game Crafter</a>, where
you'll get a custom game that looks like it came from the game store. There's a
nice, easy-to-use web interface to do this, but you're a Perl programmer. Why do
something manually if you can automate it?</p>

<p>Besides that, it's a real-world example of interacting with a web service
written completely in Perl&mdash;on both sides. Who wouldn't be interested in
that?</p>

<h2>For The Impatient</h2>

<p><a
    href="https://www.thegamecrafter.com/games/lacuna-expanse:-a-new-empire">The
    Lacuna Expanse Board Game</a> is available for purchase now if you're
interested. Also, I've <a
    href="https://github.com/plainblack/Lacuna-Board-Game">released the code I
    wrote to develop it via this public GitHub repository</a>.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Newcomer Experience in the Perl Community Survey</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/10/newcomer-experience-in-the-perl-community-survey.html" />
    <id>tag:www.perl.com,2012:/pub//2.2076</id>

    <published>2012-10-15T13:00:01Z</published>
    <updated>2012-10-14T19:52:25Z</updated>

    <summary>Kevin Carillo asks for assistance surveying newcomers to the Perl community about their experiences, positive and otherwise.</summary>
    <author>
        <name>Kevin Carillo</name>
        <uri>http://kevincarillo.org/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p>I am currently running <a
href="https://limesurvey.sim.vuw.ac.nz/index.php?sid=89971&amp;lang=en">a
survey looking at the Perl community</a> specifically and I am looking for
people who joined the Perl community within the last two years. I am equally
interested in hearing from people having had either positive or negative
experiences as newcomers.</p>

<h2>Perl subcultures and newcomer experience in Perl</h2>

<p>The Perl community has been a growing and evolving multi-faceted entity that
is now comprised of a large number of social groups (for example, 262 Perl
Mongers groups as of 10/10/12) and various sub-projects that all have their own
history and subculture.</p>

<p>Some projects are very newcomer conscious while others are not. Projects may
rely on different strategies and means when trying to integrate new recruits.
Younger projects like <a href="http://mojolicio.us/">Mojo</a> or <a
href="http://perldancer.org/">Dancer</a> may do things differently compared to
older projects such as <a href="http://catalystframework.org/">Catalyst</a> or
<a href="http://moose.perl.org/">Moose</a>. Depending on a project's
subculture, there may be more or fewer resources dedicated to helping
newcomers, people may be more or less supportive, or people may or may not
voluntarily mentor newcomers. We could also debate about the differences
between Perl 5 and Perl 6 when dealing with newcomers.</p>

<p>There is, then, a lot of variation in the way the overall Perl community
handles newcomers, which makes each person's newcomer experience pretty
unique.</p>

<p>Among all these different types of experience, some of them will lead one to
become a valued sustainable contributor (from the Perl community perspective)
while others may simply end with people slowly giving up or even running
away.</p>

<p>This study tries to identify the important aspects of one's newcomer
experience in Perl that have a positive influence in generating "good"
contributors&mdash;"good" in terms of being good citizens for the Perl
community.</p>

<h2>How is it going to help Perl?</h2>

<p>The data will help gain insights about the experience of newcomers within
the Perl community. In addition, it will help us understand how to design
effective newcomer initiatives to ensure that Perl will remain a successful and
healthy community.</p>

<h2>Where is the project going after that?</h2>

<p>This survey is the first step of the research project, as a refined survey
will later be administered to other FOSS communities at the same time. This
will make it possible to generate cross-community results that will be, to some
extent, generalizable. This research project overall aims at helping FOSS communities
to design newcomer initiatives in line with their well-being and
sustainability.</p>

<h2>About the survey</h2>

<p>This survey is anonymous, and no information that would identify you is
being collected. I expect the survey to take around 20 minutes of your
time.</p>

<p>The survey is available at <a
href="https://limesurvey.sim.vuw.ac.nz/index.php?sid=89971&amp;lang=en">https://limesurvey.sim.vuw.ac.nz/index.php?sid=89971&amp;lang=en</a>.
It will be available until Monday, 22 October, 2012.</p>

<p>If you know members of the Perl community who you think would be interested
in completing it, please let them know about the survey.</p>

<p>I will post news about my progress with this research, and the results on my
blog at <a href="http://kevincarillo.org/">kevincarillo.org</a>. Don't hesitate
to contact me at <a
href="mailto:kevin.carillo@vuw.ac.nz">kevin.carillo@vuw.ac.nz</a>.</p>]]>
        
    </content>
</entry>

<entry>
    <title>An Overview of Lexing and Parsing</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/10/an-overview-of-lexing-and-parsing.html" />
    <id>tag:www.perl.com,2012:/pub//2.2074</id>

    <published>2012-10-01T13:00:01Z</published>
    <updated>2012-10-01T19:13:31Z</updated>

    <summary>Perl programmers spend a lot of time reading, modifying, and writing data. When regular expressions aren&apos;t enough, turn to something more powerful: parsing.</summary>
    <author>
        <name>Ron Savage</name>
        <uri>http://savage.net.au/index.html</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p>Perl programmers spend a lot of time munging data: reading it in,
transforming it, and writing out the results. Perl's great at this ad hoc text
transformation, with all sorts of string manipulation tools, including regular
expressions.</p>

<p>Regular expressions will only get you so far: witness the repeated advice
that you cannot parse HTML or XML with regular expressions alone. When
Perl's built-in text processing tools aren't enough, you have to turn to
something more powerful.</p>

<p>That something is <em>parsing</em>.</p>

<h1>An Overview of Lexing and Parsing</h1>

<p>For a more formal discussion of what exactly lexing and parsing are, start
with Wikipedia's definitions: <a
    href="http://en.wikipedia.org/wiki/Lexing">Lexing</a> and <a
    href="http://en.wikipedia.org/wiki/Parsing">Parsing</a>.</p>

<p>Unfortunately, when people use the word parsing, they sometimes include the
idea of lexing. Other times they don't. This can cause confusion, but I'll
try to keep them clear. Such situations arise with other words, and our minds
usually resolve the specific meaning intended by analysing the context in which
the word is used. So, keep your mind in mind.</p>

<p>The lex phase and the parse phase can be combined into a single process, but
I advocate always keeping them separate. Trust me for a moment; I'll explain
shortly. If you're having trouble keeping the ideas separate, note that the
phases very conveniently run in alphabetical order: first we lex, and then we
parse.</p>

<h1>A History Lesson - In Absentia</h1>

<p>At this point, an article such as this would normally provide a summary of
historical developments in this field, to explain how the world ended up where
it is.  I won't do that, especially as I first encountered parsing many years
ago, when the only tools (lex, bison, yacc) were so complex to operate I took
the pledge to abstain.  Nevertheless, it's good to know such tools are still
available, so here are a few references:</p>

<p><a href="http://directory.fsf.org/wiki/Flex">Flex</a> is a successor to <a
    href="http://en.wikipedia.org/wiki/Lex_programming_tool">lex</a>, and <a
    href="http://www.gnu.org/software/bison/">Bison</a> is a successor to <a
    href="http://en.wikipedia.org/wiki/Yacc">yacc</a>. These are
well-established (old) tools to keep you from having to build a lexer or parser
by hand. This article explains why I (still) don't use any of these.</p>

<h1>But Why Study Lexing and Parsing?</h1>

<p>There are many situations where the only path to a solution requires a lexer
and a parser:</p>

<ol>
    <li><p><em>Running a program</em></p>

<p>This is trivial to understand, but not to implement.  In order to run a
program we need to set up a range of pre-conditions:</p>

<ul>
<li>Define the language, perhaps called Perl</li>
<li>Write a compiler (combined lexer and parser) for that language's grammar</li>
<li>Write a program in that language</li>
<li><p><em>Lex and parse</em> the source code</p>

<p>After all, it must be syntactically correct before we run it.  If not, we
display syntax errors. The real point of this step is to determine the
programmer's <em>intention</em>, that is, the reason for writing the code. We
don't <em>run</em> the code in this step, but we do get output. How do we do
that?</p></li>

<li><p>Run the code</p>

<p>Then we can gaze at the output which, hopefully, is correct.  Otherwise,
perhaps, we must find and fix logic errors.</p></li>

</ul></li>

<li><p><em>Rendering a web page of HTML + content</em></p>

<p>The steps are identical to those of the first example, with HTML replacing
Perl, although I can't bring myself to call writing HTML writing a program.</p>

<p>This time, we're asking: What is the web page designer's <em>intention</em>?
What would they like to render and how?  Of course, syntax checking is far
looser than with a programming language, but must still be undertaken.  For
instance, here's an example of clearly-corrupt HTML which can be parsed by <a
    href="http://www.jeffreykegler.com/marpa">Marpa</a>:</p>

<pre><code>        &#60;title&#62;Short&#60;/title&#62;&#60;p&#62;Text&#60;/head&#62;&#60;head&#62;</code></pre>

<p>See <a href="http://metacpan.org/module/Marpa::HTML">Marpa::HTML</a> for
more details. So far, I have used <a
    href="http://metacpan.org/module/Marpa::R2">Marpa::R2</a> in all my work,
which does not involve HTML.</p></li>

<li><p>Rendering an image, perhaps in SVG</p>

<p>Consider this file, written in the <a
    href="http://www.graphviz.org/content/dot-language">DOT</a> language, as
used by the <a href="http://www.graphviz.org/">Graphviz</a> graph visualizer
(<em>teamwork.dot</em>):</p>

<pre><code>        digraph Perl
        {
        graph [ rankdir=&#34;LR&#34; ]
        node  [ fontsize=&#34;12pt&#34; shape=&#34;rectangle&#34; style=&#34;filled, solid&#34; ]
        edge  [ color=&#34;grey&#34; ]
        &#34;Teamwork&#34; [ fillcolor=&#34;yellow&#34; ]
        &#34;Victory&#34;  [ fillcolor=&#34;red&#34; ]
        &#34;Teamwork&#34; -&#62; &#34;Victory&#34; [ label=&#34;is the key to&#34; ]
        }</code></pre>

<p>Given this "program", a renderer gives effect to the author's
<em>intention</em> by rendering an image:</p>

<img src="teamwork.svg" />

<p>What's required to do that? As above, <em>lex</em>, <em>parse</em>,
<em>render</em>. Using Graphviz's <code>dot</code> command to carry out these
tasks, we would run:</p>

<pre><code>        shell&#62; dot -Tsvg teamwork.dot &#62; teamwork.svg</code></pre>

<p>Note: Files used in these examples can be downloaded from <a href="http://savage.net.au/Ron/html/graphviz2.marpa/teamwork.tgz">http://savage.net.au/Ron/html/graphviz2.marpa/teamwork.tgz</a>.</p>

<p>The link to the DOT language points to a definition of DOT's syntax, written
in a somewhat casual version of BNF: <a
    href="http://en.wikipedia.org/wiki/Backus%E2%80%93Naur_Form">Backus-Naur
    Form</a>. This is significant, as it's usually straightforward to
translate a BNF description of a language into code within a lexer and
parser.</p></li>

<li><p>Rendering that same image, using a different language in the input file</p>

<p>Suppose that you decide that the Graphviz language is too complex, and hence
you write a wrapper around it, so end users can code in a simplified version of
that language. This actually happened, with the original effort available in
the now-obsolete Perl module <a
    href="http://metacpan.org/module/Graph::Easy">Graph::Easy</a>. Tels, the
author, devised his own very clever <a
    href="http://en.wikipedia.org/wiki/Little_languages">little language</a>,
which he called <a
    href="http://bloodgate.com/perl/graph/manual/"><code>Graph::Easy</code></a>.</p>

<p>When I took over maintenance of <a
    href="http://metacpan.org/module/Graph::Easy">Graph::Easy</a>, I found the
code too complex to read, let alone work on, so I wrote another implementation
of the lexer and parser, released as <a
    href="http://metacpan.org/module/Graph::Easy::Marpa">Graph::Easy::Marpa</a>.
I'll have much more to say about that in another article. For now, let's
discuss the previous graph rewritten in <code>Graph::Easy</code>
(<em>teamwork.easy</em>):</p>

<pre><code>        graph {rankdir: LR}
        node {fontsize: 12pt; shape: rectangle; style: filled, solid}
        edge {color: grey}
        [Teamwork]{fillcolor: yellow}
        -&#62; {label: is the key to}
        [Victory]{fillcolor: red}</code></pre>

<p>That's simpler for sure, but how does <a href="http://metacpan.org/module/Graph::Easy::Marpa">Graph::Easy::Marpa</a> work? Easy: <em>lex</em>, <em>parse</em>, render. More samples of <a href="http://metacpan.org/module/Graph::Easy::Marpa">Graph::Easy::Marpa</a>'s work are available on the page of <a href="http://savage.net.au/Perl-modules/html/graph.easy.marpa/index.html">my Graph::Easy::Marpa examples</a>.</p></li>

</ol>

<p>It should be clear by now that lexing and parsing are in fact widespread,
although they often operate out of sight, with only their rendered output
visible to the average programmer, user, or web surfer.</p>

<p>All of these problems have in common a complex but well-structured source
text format, with a bit of hand-waving over the tacky details available to
authors of documents in HTML. In each case, it is the responsibility of the
programmer writing the lexer and parser to honour the intention of the original
text's author. We can only do that by recognizing each token in the input as a
discrete unit of meaning (where a word such as <code>print</code>
<em>means</em> to output something of the author's choosing), and by bringing
that meaning to fruition (for <code>print</code>, to make the output visible on
a device).</p>

<p>With all that I can safely claim that the ubiquity and success of lexing and
parsing justify their recognition as vital constituents in the world of
software engineering. Why study them indeed!</p>

<h1>Good Solutions and Home-grown Solutions</h1>

<p>There's another&mdash;more significant&mdash; reason to discuss lexing and
parsing: to train programmers, without expertise in such matters, to resist the
understandable urge to opt for using tools they are already familiar with, with
regexps being the obvious choice.</p>

<p>Sure, regexps suit many simple cases, and the old standbys of flex and bison
are always available, but now there's a new kid on the block: <a
    href="http://www.jeffreykegler.com/marpa">Marpa</a>. Marpa draws heavily
from theoretical work done over many decades, and comes in various forms:</p>

<dl>
<dt>libmarpa</dt>

<dd><p>Hand-crafted in C.</p></dd>

<dt><code>Marpa::XS</code></dt>

<dd><p>The Perl and C-based interface to the previous version of libmarpa.</p></dd>

<dt><code>Marpa::R2</code></dt>

<dd><p>The Perl and C-based interface to the most recent version of libmarpa.
This is the version I use.</p></dd>

<dt><code>Marpa::R2::Advanced::Thin</code></dt>

<dd><p>The newest and thinnest interface to libmarpa, which documents how to
make Marpa accessible to non-Perl languages.</p></dd>

</dl>

<p>The question, of course, is whether or not any of these is a good, or even
excellent, choice. Good news! Marpa's advantages are huge.  It's well tested,
which alone is of great significance.  It has a Perl interface, so that I can
specify my task in Perl and let Marpa handle the details. It has its own <a
    href="http://groups.google.com/group/marpa-parser?hl=en">Marpa Google
    Group</a>. It's already used by various modules on the CPAN (see <a
    href="https://metacpan.org/search?q=Marpa">a search for Marpa on the
    CPAN</a>); because it is open source, you can see exactly how other people
use it.</p>

<p>Even better, Marpa has a very simple syntax, once you get used to it, of
course! If you're having trouble, just post on the Google Group. (If you've
ever worked with flex and bison, you'll be astonished at how simple it is to
drive Marpa.) Marpa is also very fast, with libmarpa written in C. Its speed is
a bit surprising, because new technology usually needs some time to surpass
established technology while delivering the all-important stability.</p>

<p>Finally, Marpa is being improved all the time.  For instance, recently the
author eliminated the dependency on Glib, to improve portability. His work
continues, so that users can expect a series of incremental improvements for
some time to come.</p>

<p>I myself use Marpa in <a
    href="http://metacpan.org/module/Graph::Easy::Marpa">Graph::Easy::Marpa</a>
and <a href="http://metacpan.org/module/GraphViz2::Marpa">GraphViz2::Marpa</a>,
but this is not an article on Marpa specifically.</p>

<h1>The Lexer's Job Description</h1>

<p>As I mentioned earlier, the stages conveniently run in English alphabetical
order. First you lex. Then you parse.</p>

<p>Here, I use <em>lexing</em> to mean the comparatively simple (compared to
parsing) process of tokenising a stream of text, which means chopping that
input stream into discrete tokens and identifying the type of each. The output
is a new stream, this time of stand-alone tokens.</p>

<p>Lexing does nothing more than identify tokens. Questions about the meanings
of those tokens or their acceptable order or anything else are matters for the
parser. The lexer will say: I have found another token and have identified it
as being of some type T. Hence, for each recognized token, the lexer will
produce two items:</p>

<ul>
<li>The type of the token</li>
<li>The value of the token</li>
</ul>

<p>Because the lexing process happens repeatedly, the lexer will produce an
output of an array of token elements, with each element needing at least these
two components: type and value.</p>

<p>In practice, I prefer to represent these elements as a hashref:</p>

<pre><code>        {
                count =&#62; $integer, # 1 .. N.
                name  =&#62; '',       # Unused.
                type  =&#62; $string,  # The type of the token.
                value =&#62; $value,   # The value from the input stream.
        }</code></pre>

<p>... with the array managed by an object of type <a
    href="http://metacpan.org/module/Set::Tiny">Set::Tiny</a>. The latter
module has many nice methods, making it very suitable for such work. Up until
recently I used <a href="http://metacpan.org/module/Set::Array">Set::Array</a>,
which I did not write but which I do now maintain. However, insights from a
recent report of mine, <a
    href="http://savage.net.au/Perl-modules/html/setops.report.html">Set-handling
    modules</a>, comparing a range of similar modules, have convinced me to
switch to <a href="http://metacpan.org/module/Set::Tiny">Set::Tiny</a>. For an
application which might best store its output in a tree, the Perl module <a
    href="http://metacpan.org/module/Tree::DAG_Node">Tree::DAG_Node</a> is
superb.</p>

<p>The <code>count</code> field, apparently redundant, is sometimes useful in
the clean-up phase of the lexer, which may need to combine tokens unnecessarily
split by the regexps used in lexing. Also, it is available to the parser if
needed, so I always include it in the hashref.</p>

<p>The <code>name</code> field really is unused, but gives people who fork or
sub-class my code a place to work with their own requirements, without worrying
that their edits will affect the fundamental code.</p>
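
<p>To make the hashref format concrete, here is a minimal sketch of a
hand-rolled lexer which produces such elements. It is not taken from any
published module: the token types, the <code>@rules</code> table and the
<code>lex()</code> function are assumptions invented for this illustration, and
a plain Perl array stands in for the set object.</p>

<pre><code>        use strict;
        use warnings;
        use feature 'say';

        # Hypothetical token types, tried in the order listed.
        my(@rules) =
        (
                [open_bracket  =&#62; qr/\[/],
                [close_bracket =&#62; qr/\]/],
                [equals        =&#62; qr/=/],
                [id            =&#62; qr/[a-zA-Z_][a-zA-Z_0-9]*/],
        );

        sub lex
        {
                my($input) = @_;
                my(@tokens);

                pos($input) = 0;

                while (pos($input) &#60; length($input) )
                {
                        next if ($input =~ /\G\s+/gc); # Skip whitespace.

                        my($found) = 0;

                        for my $rule (@rules)
                        {
                                my($type, $regexp) = @$rule;

                                if ($input =~ /\G($regexp)/gc)
                                {
                                        push @tokens,
                                        {
                                                count =&#62; scalar(@tokens) + 1,
                                                name  =&#62; '',
                                                type  =&#62; $type,
                                                value =&#62; $1,
                                        };

                                        $found = 1;

                                        last;
                                }
                        }

                        die 'Cannot lex input at offset ' . pos($input) if (! $found);
                }

                return \@tokens;
        }

        # Print one line per token: count, type, value.
        for my $token (@{lex('node [ shape = box ]')})
        {
                say join(' ', $$token{count}, $$token{type}, $$token{value});
        }</code></pre>

<p>Note that the sketch only classifies tokens. It would happily lex
<code>] = [</code>, because deciding whether that order is legal is not the
lexer's business.</p>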

<h1>The Parser's Job Description</h1>

<p>The parser concerns itself with the context in which each token appears,
which is a way of saying it cares about whether or not the sequence and
combination of tokens actually detected fits the expected grammar.</p>

<p>Ideally, the grammar is provided in BNF. This makes it easy to
translate into the form acceptable to Marpa. If you have a grammar in another
form, your work will probably be more difficult, simply because someone else
has <em>not</em> done the hard work of formalizing the grammar.</p>

<p>That's a parser. What's a grammar?</p>

<h1>Grammars and Sub-grammars</h1>

<p>I showed an example grammar earlier, for the <a
    href="http://www.graphviz.org/content/dot-language">DOT</a> format. How
does a normal person understand a block of text written in BNF? Training helps.
Besides that, I've gleaned a few things from practical experience.  To us
beginners eventually comes the realization that grammars, no matter how
formally defined, contain within them two sub-grammars:</p>

<h2>Sub-grammar #1</h2>

<p>One sub-grammar specifies what a token looks like, meaning the range of
forms it can assume in the input stream. If the lexer detects an
incomprehensible candidate, the lexer can generate an error, or it can activate
a strategy called <a
    href="http://blogs.perl.org/users/jeffrey_kegler/2011/11/marpa-and-the-ruby-slippers.html">Ruby
    Slippers</a> (no relation to the Ruby programming language). This technique
was named by Jeffrey Kegler, the author of Marpa.</p>

<p>In simple terms, the Ruby Slippers strategy fiddles the current token (or an
even larger section of the input stream) in a way that satisfies the grammar
and restarts processing at the new synthesized token. Marpa is arguably unique
in being able to do this.</p>

<h2>Sub-grammar #2</h2>

<p>The other sub-grammar specifies the allowable ways in which these tokens may
combine, meaning if they don't conform to the grammar, the code generates a
syntax error of some sort.</p>

<p>Easy enough?</p>

<p>I split the grammar into two sub-grammars because it helps me express my
Golden Rule of Lexing and Parsing: <em>encode the first sub-grammar into the
    lexer and the second into the parser</em>.</p>

<p>If you know what tokens look like, you can tokenize the input stream by
splitting it into separate tokens using a lexer. Then you give those tokens to
the parser for (more) syntax checking, and for interpretation of what the user
presumably intended with that specific input stream (combination of
tokens).</p>
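
<p>As a rough illustration of the second sub-grammar, the sketch below checks
that a list of token hashrefs fits one made-up rule and, if so, extracts what
the author presumably intended. Everything here is an assumption for the sake
of the example: the rule, the token type names and the
<code>parse_attribute_list()</code> function are hypothetical, and this
hand-written check is far cruder than what Marpa does with a real grammar.</p>

<pre><code>        use strict;
        use warnings;
        use feature 'say';

        # Hypothetical rule:
        # attribute_list ::= open_bracket (id equals id)* close_bracket
        sub parse_attribute_list
        {
                my(@tokens) = @_;
                my($index)  = 0;

                # Consume one token of the given type, or die with a syntax error.
                my($expect) = sub
                {
                        my($type)  = @_;
                        my($token) = $tokens[$index];

                        if (! $token || ($$token{type} ne $type) )
                        {
                                die 'Syntax error: expected ' . $type . ' at token ' . ($index + 1);
                        }

                        $index++;

                        return $$token{value};
                };

                my(%attribute);

                $expect -&#62; ('open_bracket');

                while ($tokens[$index] and ($tokens[$index]{type} eq 'id') )
                {
                        my($name) = $expect -&#62; ('id');

                        $expect -&#62; ('equals');

                        $attribute{$name} = $expect -&#62; ('id');
                }

                $expect -&#62; ('close_bracket');

                die 'Syntax error: unexpected token after close_bracket' if ($index != scalar @tokens);

                return \%attribute;
        }

        my(@tokens) =
        (
                {count =&#62; 1, name =&#62; '', type =&#62; 'open_bracket',  value =&#62; '['},
                {count =&#62; 2, name =&#62; '', type =&#62; 'id',            value =&#62; 'shape'},
                {count =&#62; 3, name =&#62; '', type =&#62; 'equals',        value =&#62; '='},
                {count =&#62; 4, name =&#62; '', type =&#62; 'id',            value =&#62; 'box'},
                {count =&#62; 5, name =&#62; '', type =&#62; 'close_bracket', value =&#62; ']'},
        );

        my($attribute) = parse_attribute_list(@tokens);

        say 'shape is ', $$attribute{shape};</code></pre>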

<p>That separation between lexing and parsing gives a clear plan-of-attack for
any new project.</p>

<p>In case you think this is going to be complex, truly it only <em>sounds</em>
complicated. Yes, I've introduced a few new concepts (and will introduce a few
more), but don't despair. It's not really that difficult.</p>

<p>For any given grammar, you must somehow and somewhere manage the complexity
of the question "Is this a valid document?" Recognizing a token with a regex is
easy. (That's probably why so many people stop at the point of using regexes to
pick at documents instead of moving to parsing.) Keeping track of the context
in which that token appeared, and the context in which a grammar allows that
token, is hard.</p>

<p>The complexity of setting up and managing a formal grammar and its
implementation seems like a lot of work, but it's a specified and well
understood mechanism you don't have to reinvent every time. The lexer and
parser approach limits the code you have to write to two things: a set of rules
for how to construct tokens within a grammar and a set of rules for what
happens when we construct a valid combination of tokens. This limit allows you
to focus on the important part of the application&mdash;determining what a
document which conforms to the grammar means (the author's
<em>intention</em>)&mdash;and less on the mechanics of verifying that a
document matches the grammar.</p>

<p>In other words, you can focus on <em>what</em> you want to do with a
document more than <em>how</em> to do something with it.</p>

<h1>Coding the Lexer</h1>

<p>The lexer's job is to recognise tokens. Sub-grammar #1 specifies what those
tokens look like. Any lexer will have to examine the input stream, possibly one
character at a time, to see if the current input, appended to the immediately
preceding input, fits the definition of a token.</p>

<p>A programmer can write a lexer in many ways. I do so by combining regexps
with a DFA (<a
    href="http://en.wikipedia.org/wiki/Deterministic_finite_automaton">Deterministic
    Finite Automaton</a>) module. The blog entry <a
    href="http://blogs.perl.org/users/andrew_rodland/2012/01/more-marpa-madness.html">More
    Marpa Madness</a> discusses using Marpa in the lexer (as well as in the
parser, which is where I use it).</p>

<p>What is a DFA? Abusing any reasonable definition, let me describe one
thus. The <em>Deterministic</em> part means that given the same input at the
same stage, you'll always get the same result. The <em>Finite</em> part means
the input stream only contains a limited number of different tokens, which
simplifies the code. The <em>Automaton</em> is, essentially, a software
machine&mdash;a program. DFAs are also often called STTs (State Transition
Tables).</p>

<p>How do you make this all work in Perl? <a
    href="https://metacpan.org/">MetaCPAN</a> is your friend! In particular, I
like to use <a
    href="http://metacpan.org/module/Set::FA::Element">Set::FA::Element</a> to
drive the process. For candidate alternatives I assembled a list of Perl
modules with relevance in the area, while cleaning up the docs for
<code>Set::FA</code>. See <a
    href="https://metacpan.org/module/Set::FA#See-Also">Alternatives to
    Set::FA</a>.  I did not write <a
    href="http://metacpan.org/module/Set::FA">Set::FA</a>, nor <a
    href="http://metacpan.org/module/Set::FA::Element">Set::FA::Element</a>,
but I now maintain them.</p>

<p>Transforming a grammar from BNF (or whatever form you use) into a DFA
provides:</p>

<ul>
    <li><p><em>Insight into the problem</em></p>

    <p>To cast BNF into regexps means you must understand exactly what the
    grammar definition is saying.</p></li>

    <li><p><em>Clarity of formulation</em></p>

    <p>You end up with a spreadsheet which simply and clearly encodes your
    understanding of tokens.</p>

    <p>Spreadsheet? Yes, I store the derived regexps, along with other
    information, in a spreadsheet. I even incorporate this spreadsheet into the
    source code.</p></li>

</ul>

<h1>Back to the Finite Automaton</h1>

<p>In practice, building a lexer is a process of reading and rereading, many
times, the definition of the BNF (here the <a
    href="http://www.graphviz.org/content/dot-language">DOT</a> language) to
build up the corresponding set of regexps to handle each case. This is
laborious work, no doubt about it.</p>

<p>For example, by using a regexp like <code>/[a-zA-Z_][a-zA-Z_0-9]*/</code>,
you can get Perl's regexp engine to intelligently gobble up characters as long
as they fit the definition. In plain English, this regexp says: start with a
letter, upper- or lower-case, or an underline, followed by 0 or more letters,
digits or underlines. Look familiar? It's very close to what Perl's
<code>\w+</code> matches, but it disallows leading digits. Actually, <a
    href="http://www.graphviz.org/content/dot-language">DOT</a> disallows them
(in certain circumstances), but DOT does allow pure numbers in certain
circumstances.</p>
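
<p>The difference between the two patterns is easy to check. This short sketch
anchors both so that the whole candidate must match; the
<code>is_id()</code> helper and the sample strings are made up for the
illustration.</p>

<pre><code>        use strict;
        use warnings;
        use feature 'say';

        # Hypothetical helper: true if the whole string is an identifier.
        sub is_id
        {
                my($candidate) = @_;

                return $candidate =~ /\A[a-zA-Z_][a-zA-Z_0-9]*\z/ ? 1 : 0;
        }

        for my $candidate (qw(node _private x1 1x 42) )
        {
                say join
                (
                        ' ',
                        $candidate,
                        is_id($candidate) ? 'id' : 'not-id',
                        $candidate =~ /\A\w+\z/ ? 'word' : 'not-word',
                );
        }</code></pre>

<p>Every candidate satisfies <code>\w+</code>, but the last two, which start
with a digit, fail the identifier test.</p>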

<p>What is the result of all of these hand-crafted regexps? They're
<em>data</em> fed into the DFA, along with the input stream. The output of the
DFA is a flag that signifies Yes or No, the input stream matches/doesn't match
the token definitions specified by the given regexps. Along the way, the DFA
calls a callback function each time it recognizes a token, stockpiling them.
At the end of the run, you can output them as a stream of tokens, each with its
identifying type, as per The Lexer's Job Description I described earlier.</p>

<p>A note about callbacks: Sometimes it's easier to design a regexp to capture
more than seems appropriate, and to use code in the callback to chop up what's
been captured, outputting several token elements as a consequence.</p>
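
<p>A sketch of that idea follows, under stated assumptions: the regexp grabs
everything between a pair of braces in one go, and a callback splits the
captured text into several token elements. The function names, the token type
names and the sample input are all hypothetical.</p>

<pre><code>        use strict;
        use warnings;
        use feature 'say';

        my(@stockpile);

        # Hypothetical helper which adds one element to the stockpile.
        sub save_token
        {
                my($type, $value) = @_;

                push @stockpile, {count =&#62; scalar(@stockpile) + 1, name =&#62; '', type =&#62; $type, value =&#62; $value};
        }

        # Hypothetical callback: receives everything captured between { and }.
        sub save_attributes
        {
                my($captured) = @_;

                for my $pair (split(/\s*;\s*/, $captured) )
                {
                        my($name, $value) = split(/\s*:\s*/, $pair, 2);

                        save_token('attribute_name', $name);
                        save_token('attribute_value', $value);
                }
        }

        if ('node {fontsize: 12pt; shape: rectangle; style: filled, solid}' =~ /\{\s*(.+?)\s*\}/)
        {
                save_attributes($1);
        }

        say join(' ', $$_{count}, $$_{type}, $$_{value}) for @stockpile;</code></pre>

<p>One deliberately greedy capture thus becomes six token elements, which keeps
the regexp itself short.</p>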

<p>Because developing the state transition table is such an iterative process,
I recommend creating various test files with all sorts of example programs, as
well as scripts with very short names to run the tests (short names because
you're going to be running these scripts an unbelievable number of
times...).</p>

<h1>States</h1>

<p>What are states and why do you care about them?</p>

<p>At any moment, the STT (automaton, software machine) is in precisely
<em>one</em> state. Perhaps it has not yet received even one token (so that
it's in the start state), or perhaps it has just finished processing the
previous one. Whatever the case, the code maintains information so as to know
exactly what state it is in, and this leads to knowing exactly what set of
tokens is now acceptable. That is, it has a set of tokens, any of which will be
legal in its current state.</p>

<p>The implication is this: you must associate each regexp with a specific
state and vice versa. Furthermore, the machine will remain in its current state
as long as each new input character matches a regexp belonging to the current
state. It will jump (make a transition) to a new state when that character does
not match.</p>

<h1>Sample Lexer Code</h1>

<p>Consider this simplistic code from the synopsis of <a
    href="http://metacpan.org/module/Set::FA::Element">Set::FA::Element</a>:</p>

<pre><code>        my($dfa) = Set::FA::Element -&#62; new
        (
                accepting   =&#62; ['baz'],
                start       =&#62; 'foo',
                transitions =&#62;
                [
                        ['foo', 'b', 'bar'],
                        ['foo', '.', 'foo'],
                        ['bar', 'a', 'foo'],
                        ['bar', 'b', 'bar'],
                        ['bar', 'c', 'baz'],
                        ['baz', '.', 'baz'],
                ],
        );</code></pre>

<p>In the <em>transitions</em> parameter the first line says: "foo" is a
state's name, and "b" is a regexp. Jump to state "bar" if the next input char
matches that regexp. Other lines are similar.</p>

<p>To use <a
    href="http://metacpan.org/module/Set::FA::Element">Set::FA::Element</a>,
you must prepare that transitions parameter matching this format. Now you see
the need for states and regexps.</p>
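
<p>To show how such a table can drive a machine, here is a hand-rolled sketch
which walks the synopsis table shown above. It is emphatically not the
implementation inside <code>Set::FA::Element</code>: it assumes the rules for a
state are tried in the order listed and that input is consumed one character at
a time, and the <code>accepts()</code> function is invented for the
illustration. The real module may well behave differently on both counts.</p>

<pre><code>        use strict;
        use warnings;
        use feature 'say';

        my(%accepting)   = (baz =&#62; 1);
        my(@transitions) =
        (
                ['foo', 'b', 'bar'],
                ['foo', '.', 'foo'],
                ['bar', 'a', 'foo'],
                ['bar', 'b', 'bar'],
                ['bar', 'c', 'baz'],
                ['baz', '.', 'baz'],
        );

        # Hypothetical driver: returns 1 if the input ends in an accepting state.
        sub accepts
        {
                my($input) = @_;
                my($state) = 'foo'; # The start state.

                for my $char (split(//, $input) )
                {
                        my($next);

                        for my $transition (@transitions)
                        {
                                my($from, $regexp, $to) = @$transition;

                                if ( ($from eq $state) and ($char =~ /\A(?:$regexp)\z/) )
                                {
                                        $next = $to;

                                        last;
                                }
                        }

                        return 0 if (! defined $next); # No rule matched: reject.

                        $state = $next;
                }

                return $accepting{$state} ? 1 : 0;
        }

        say join(' ', $_, accepts($_) ? 'accepted' : 'rejected') for qw(abc ab bac xbcz);</code></pre>

<p>Because the rule for "b" is listed before the catch-all "." rule in state
"foo", the order of the rows matters in this sketch.</p>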

<p>This is code I've used, taken directly from <a
    href="http://metacpan.org/module/GraphViz2::Marpa::Lexer::DFA">GraphViz2::Marpa::Lexer::DFA</a>:</p>

<pre><code>        Set::FA::Element -&#62; new
        (
                accepting   =&#62; \@accept,
                actions     =&#62; \%actions,
                die_on_loop =&#62; 1,
                logger      =&#62; $self -&#62; logger,
                start       =&#62; $self -&#62; start,
                transitions =&#62; \@transitions,
                verbose     =&#62; $self -&#62; verbose,
        );</code></pre>

<p>Let's discuss these parameters.</p>

<dl>
<dt>accepting</dt>

<dd><p>This is an arrayref of state names. After processing the entire input
stream, if the machine ends up in one of these states, it has accepted that
input stream. All that means is that every input token matched an appropriate
regexp, where "appropriate" means every char matched the regexp belonging to
the current state, whatever the state was at the instant that char was
input.</p></dd>

<dt>actions</dt>

<dd>
<p>This is a hashref of function names so that the machine can call a function,
optionally, upon entering or leaving any state. That's how the stockpile for
recognized tokens works.</p>

<p>Because I wrote these functions myself and wrote the rules to attach each to
a particular combination of state and regexp, I encoded into each function the
knowledge of what type of token the DFA has matched. That's how the stockpile
ends up with (token, type) pairs to output at the end of the run.</p></dd>

<dt>die_on_loop</dt>

<dd>
<p>This flag, if true, tells the DFA to stop if none of the regexps belonging to the current state match the current input char.  Rather than looping forever, stop. Throw an exception.</p>

<p>You might wonder why stopping automatically is not the default, or even
mandatory. The default behavior allows you to try to recover from this bad
state, or at least give a reasonable error message, before dying.</p></dd>

<dt>logger</dt>

<dd><p>This is an (optional) logger object.</p></dd>

<dt>start</dt>

<dd><p>This is the name of the state in which the STT starts, so the code knows
which regexp(s) to try upon receiving the very first character of
input.</p></dd>

<dt>transitions</dt>

<dd><p>This is a potentially large arrayref which lists separately for all
states all the regexps which may possibly match the current input
char.</p></dd>

<dt>verbose</dt>

<dd>
<p>Specifies how much to report if the logger object is not defined.</p></dd>

</dl>
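
<p>As a rough illustration of the <em>actions</em> and stockpile ideas, the
following self-contained sketch attaches an exit function to each state. It is
not the code from <code>GraphViz2::Marpa::Lexer::DFA</code>. The states, the
regexps and the names <code>lex()</code>, <code>%exit_action</code> and
<code>@stockpile</code> are all assumptions made up for this example.</p>

<pre><code>        use strict;
        use warnings;

        my(@stockpile);   # The (token, type) pairs recognized so far.
        my($buffer) = ''; # Chars collected while in the current state.

        # Hypothetical exit functions. Each one knows which type it emits.
        my(%exit_action) =
        (
                word   =&#62; sub{push @stockpile, [$buffer, 'word'];   $buffer = ''},
                number =&#62; sub{push @stockpile, [$buffer, 'number']; $buffer = ''},
        );

        # For each state: [regexp, next state].
        my(%transitions) =
        (
                start  =&#62; [ ['[a-z]', 'word'],   ['[0-9]', 'number'] ],
                word   =&#62; [ ['[a-z]', 'word'],   ['[0-9]', 'number'] ],
                number =&#62; [ ['[0-9]', 'number'], ['[a-z]', 'word'] ],
        );

        sub lex
        {
                my($input) = @_;
                my($state) = 'start';
                @stockpile = ();
                $buffer    = '';

                for my $char (split(//, $input) )
                {
                        my($next);

                        for my $transition (@{$transitions{$state} })
                        {
                                my($regexp, $to) = @$transition;

                                if ($char =~ /\A(?:$regexp)\z/)
                                {
                                        $next = $to;

                                        last;
                                }
                        }

                        # Comparable in spirit to die_on_loop: refuse to continue.
                        die &#34;No regexp in state $state matches $char\n&#34; if (! defined $next);

                        if ($next ne $state)
                        {
                                $exit_action{$state} -&#62; () if ($exit_action{$state});

                                $state = $next;
                        }

                        $buffer .= $char;
                }

                $exit_action{$state} -&#62; () if ($exit_action{$state}); # Flush the last token.

                return [@stockpile];
        }</code></pre>

<p>Calling <code>lex('ab12cd')</code> returns three pairs: ('ab', 'word'),
('12', 'number') and ('cd', 'word'). Each exit function already knows the type
of the token it has just finished collecting.</p>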

<p>With all of that configured, the next problem is how to prepare the grammar
in such a way as to fit into this parameter list.</p>

<h1>Coding the Lexer - Revisited</h1>

<p>The coder thus needs to develop regexps etc which can be fed directly into
the chosen DFA, here <a
    href="http://metacpan.org/module/Set::FA::Element">Set::FA::Element</a>, or
which can be transformed somehow into a format acceptable to that module. So
far I haven't actually said how I do that, but now it's time to be
explicit.</p>

<p>I use a spreadsheet with nine columns:</p>

<dl>
<dt>Start</dt>

<dd><p>This contains one word, "Yes", against the name of the state which is
the start state.</p></dd>

<dt>Accept</dt>

<dd><p>This contains the word "Yes" against the name of any state which will be
an accepting state (the machine has matched an input stream).</p></dd>

<dt>State</dt>

<dd><p>This is the name of the state.</p></dd>

<dt>Event</dt>

<dd><p>This is a regexp. The event will fire if the current input char matches
this regexp.</p>

<p>Because the regexp belongs to a given state, we know the DFA will only
process regexps associated with the current state, of which there will usually
be one or at most a few.</p>

<p>When there are multiple regexps per state, I leave all other columns
empty.</p></dd>

<dt>Next</dt>

<dd><p>The name of the "next" state to which the STT will jump if the current
char matches the regexp given on the same line of the spreadsheet (in the
current state of course).</p></dd>

<dt>Entry</dt>

<dd><p>The optional name of the function the DFA is to call upon (just before)
entry to the (new) state.</p></dd>

<dt>Exit</dt>

<dd><p>The optional name of the function the DFA is to call upon exiting from
the current state.</p></dd>

<dt>Regexp</dt>

<dd><p>This is a working column, in which I put formulas so that I can refer to
them in various places in the Event column. It is not passed to the DFA in the
transitions parameter.</p></dd>

<dt>Interpretation</dt>

<dd><p>Comments to myself.</p></dd>

</dl>

<p>I've put the <a
    href="http://savage.net.au/Perl-modules/html/graphviz2.marpa/default.stt.html">STT
    for GraphViz2::Marpa</a> online.</p>

<p>This spreadsheet has various advantages:</p>

<p><em>Legibility.</em> It is very easy to read and to work with. Don't forget,
to start with you'll be basically switching back and forth between the grammar
definition document (hopefully in BNF) and this spreadsheet. I don't do much
(any) coding at this stage.</p>

<p><em>Exportability.</em> Because I have no code yet, there are several
possibilities. I could read the spreadsheet directly. The two problems with
this approach are the complexity of the code (in the external module which does
the reading of course), and the slowness of loading and running this code.</p>

<p>Because I use <a href="http://www.libreoffice.org/">LibreOffice</a> I can
either force end-users to install <a
    href="http://metacpan.org/module/OpenOffice::OODoc">OpenOffice::OODoc</a>,
or export the spreadsheet as an Excel file, in order to avail themselves of
this option. I have chosen to not support reading the <em>.ods</em> file
directly in the modules (<a
    href="http://metacpan.org/module/Graph::Easy::Marpa">Graph::Easy::Marpa</a>
and <a href="http://metacpan.org/module/GraphViz2::Marpa">GraphViz2::Marpa</a>)
I ship.</p>

<p>I could alternatively export the spreadsheet to a CSV file first. This way,
we can read a CSV file into the DFA fairly quickly, without loading the module
which reads spreadsheets.</p>

<p>Be careful here with LibreOffice, because it forces you to use Unicode for
the spreadsheet but exports odd character sequences, such as double-quotes as
the three byte sequence 0xe2, 0x80, 0x9c. When used in a regexp, this sequence
will never match a <em>real</em> double-quote in your input stream. Sigh. Do No
Evil. If only.</p>

<p>I could also incorporate the spreadsheet directly into my code. This is my
favorite approach. I do this in two stages. I export my data to a CSV file,
then append that file to the end of the source code of the module, after the
<code>__DATA__</code> token.</p>

<p>Such in-line data can be accessed effortlessly by the very neat and very
fast module <a
    href="http://metacpan.org/module/Data::Section::Simple">Data::Section::Simple</a>.
Because Perl has already loaded the module&mdash;and is executing
it&mdash;there is essentially no overhead whatsoever in reading data from
within it. Don't you just love Perl! And MetaCPAN of course. And a community
which contributes such wondrous code.</p>

<p>An advantage of this alternative is that it allows end users to edit the
shipped <em>.csv</em> or <em>.ods</em> files, after which they can use a
command line option on scripts to read their own file, overriding the built-in
STT.</p>

<p>After all this, it's just a matter of code to read and validate the
structure of the STT's data, then to reformat it into what <a
    href="http://metacpan.org/module/Set::FA::Element">Set::FA::Element</a>
demands.</p>
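
<p>That reading and reformatting step might look something like the sketch
below. It is not the shipped code. It assumes a CSV export with no embedded
commas, the first five columns in the order listed above, and that a row with
an empty State column belongs to the most recently named state. The helper
name <code>stt_to_dfa_params()</code> is hypothetical.</p>

<pre><code>        use strict;
        use warnings;

        # Hypothetical helper. Input: the lines of the exported CSV file.
        # Output: a hash of parameters in the shape used by the calls above.
        sub stt_to_dfa_params
        {
                my(@line) = @_;

                shift @line; # Skip the heading row.

                my($start, $state, %accept, @transitions);

                for my $line (@line)
                {
                        # Only the first 5 columns matter here. Pad missing ones with ''.
                        my(@field) = map{defined($_) ? $_ : ''} (split(/,/, $line, -1) )[0 .. 4];
                        my($is_start, $is_accept, $name, $event, $next) = @field;

                        $state = $name if (length $name); # Assumption: blank means same state.

                        die &#34;Row without a state: $line\n&#34;  if (! defined $state);
                        die &#34;Row without an event: $line\n&#34; if (! length $event);

                        $start          = $state if ($is_start eq 'Yes');
                        $accept{$state} = 1      if ($is_accept eq 'Yes');

                        push @transitions, [$state, $event, $next];
                }

                die &#34;No start state found\n&#34; if (! defined $start);

                return
                (
                        accepting   =&#62; [sort keys %accept],
                        start       =&#62; $start,
                        transitions =&#62; \@transitions,
                );
        }</code></pre>

<p>A real STT will contain regexps with commas and quotes in them, so
production code would want a proper CSV parser rather than
<code>split()</code>. The validation shown here is deliberately minimal.</p>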

<h1>Coding the Parser</h1>

<p>At this point, you know how to incorporate the first sub-grammar into the
design and code of the lexer. You also know that the second sub-grammar must be
encoded into the parser, for that's how the parser performs syntax
checking.</p>

<p>How you do this depends intimately on which pre-existing module, if any, you
choose to use to aid the development of the parser. Because I choose Marpa
(currently <a href="http://metacpan.org/module/Marpa::R2">Marpa::R2</a>), I am
orienting this article to that module. However, only in the next article will I
deal in depth with Marpa.</p>

<p>Whichever tool you choose, think of the parsing process like this: Your
input stream is a set of pre-defined tokens (probably but not necessarily output
from the lexer). You must now specify all possible legal combinations of those
tokens. This is the <em>syntax</em> of the language (or, more accurately, the
<em>remainder</em> of the syntax, because the first sub-grammar has already
handled all of the definitions of legal tokens).  At this point, assume all
incoming tokens are legal. In other words, the parser will not try to parse and
run a program containing token-based syntax errors, although it may contain
logic errors (even if written in Perl :-).</p>

<p>A combination of tokens which does not match any of the given legal
combinations can be immediately rejected as a syntax error. Keep in mind that
the friendliest compilers find as many syntax errors as possible per parse.</p>

<p>Because this check takes place on a token-by-token basis, you (ought to)
know precisely which token triggered the error, which means that you can emit a
nice error message, identifying the culprit and its context.</p>

<h1>Sample Parser Code</h1>

<p>Here's a sample of a <code>Marpa::R2</code> grammar (adapted from its
synopsis):</p>

<pre><code>        my($grammar) = Marpa::R2::Grammar -&#62; new
        ({
                actions =&#62; 'My_Actions',
                start   =&#62; 'Expression',
                rules   =&#62;
                [
                        { lhs =&#62; 'Expression', rhs =&#62; [qw/Term/] },
                        { lhs =&#62; 'Term',       rhs =&#62; [qw/Factor/] },
                        { lhs =&#62; 'Factor',     rhs =&#62; [qw/Number/] },
                        { lhs =&#62; 'Term',       rhs =&#62; [qw/Term Add Term/],
                                action =&#62; 'do_add'
                        },
                        { lhs =&#62; 'Factor',     rhs =&#62; [qw/Factor Multiply Factor/],
                                action =&#62; 'do_multiply'
                        },
                ],
                default_action =&#62; 'do_something',
        });</code></pre>

<p>Despite the differences between this and the calls to <code>Set::FA::Element
    -&#62; new()</code> in the lexer example, these two snippets are basically
the same:</p>

<dl>
<dt>actions</dt>

<dd><p>This is the name of a Perl package in which Marpa will look for actions
such as <code>do_add()</code> and <code>do_multiply()</code>. (Okay, the lexer
has no such option, as it defaults to the current package.)</p></dd>

<dt>start</dt>

<dd><p>This is the <em>lhs</em> name of the rule to start with, as with the
lexer.</p></dd>

<dt>rules</dt>

<dd><p>This is an arrayref of <em>rule descriptors</em> defining the syntax of
the grammar. It plays the role of the lexer's <em>transitions</em>
parameter.</p></dd>

<dt>default_action</dt>

<dd><p>Use this (optional) callback as the action for any rule element which does not explicitly specify its own action.</p></dd>

</dl>
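
<p>For illustration, an actions package for the sample grammar above could be
sketched as follows. This assumes the calling convention shown in the
<code>Marpa::R2</code> synopsis, where each action receives a per-parse object
followed by one value per <em>rhs</em> symbol. Check the
<code>Marpa::R2</code> documentation before relying on it.
<code>do_something()</code> here is a hypothetical default which just passes a
single child's value up the tree.</p>

<pre><code>        use strict;
        use warnings;

        package My_Actions;

        # Assumed convention: $_[0] is a per-parse object, and the remaining
        # arguments are the values of the rhs symbols, in order.

        sub do_add # For rhs [Term Add Term].
        {
                my(undef, $left, undef, $right) = @_;

                return $left + $right;
        }

        sub do_multiply # For rhs [Factor Multiply Factor].
        {
                my(undef, $left, undef, $right) = @_;

                return $left * $right;
        }

        sub do_something # Hypothetical default: pass the only child's value up.
        {
                my(undef, $value) = @_;

                return $value;
        }

        package main;</code></pre>

<p>These functions can be exercised without a parser at all. For example,
<code>My_Actions::do_add({}, 2, '+', 3)</code> returns 5.</p>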

<p>The real problem is recasting the syntax from BNF, or whatever, into a set
of <em>rule descriptors</em>. How do you think about this problem? I suggest
comparing and contrasting real code with what the grammar says it must be.</p>

<p>Here's the <em>teamwork.dot</em> file I explained earlier.</p>

<pre><code>        digraph Perl
        {
        graph [ rankdir=&#34;LR&#34; ]
        node  [ fontsize=&#34;12pt&#34; shape=&#34;rectangle&#34; style=&#34;filled, solid&#34; ]
        edge  [ color=&#34;grey&#34; ]
        &#34;Teamwork&#34; [ fillcolor=&#34;yellow&#34; ]
        &#34;Victory&#34;  [ fillcolor=&#34;red&#34; ]
        &#34;Teamwork&#34; -&#62; &#34;Victory&#34; [ label=&#34;is the key to&#34; ]
        }</code></pre>

<p>In general, a valid <a href="http://www.graphviz.org/">Graphviz</a> (DOT)
graph must start with one of:</p>

<pre><code>        strict digraph $id {...} # Case 1. $id is a variable.
        strict digraph     {...}
        strict   graph $id {...} # Case 3
        strict   graph     {...}
               digraph $id {...} # Case 5
               digraph     {...}
                 graph $id {...} # Case 7
                 graph     {...}</code></pre>

<p>... as indeed this real code does. The graph's id is <em>Perl</em>, which is
case 5. If you've ever noticed that you can write a BNF as a tree (right?), you
can guess what comes next. I like to write my <em>rule descriptors</em> from
the root down.</p>

<p>Drawing this as a tree gives:</p>

<pre><code>             DOT's Grammar
                  |
                  V
        ---------------------
        |                   |
     strict                 |
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
     digraph     or       graph
        |                   |
        ---------------------
                  |
                  V
        ---------------------
        |                   |
       $id                  |
        |                   |
        ---------------------
                  |
                  V
                {...}</code></pre>

<h1>Connecting the Parser back to the Lexer</h1>

<p>Wait, what's this? Didn't I say that <em>strict</em> is optional? It's not
optional, not in the parser. It is optional in the DOT language, but I designed
the lexer, and I therein ensured it would necessarily output <em>strict =&#62;
    no</em> when the author of the graph omitted the <em>strict</em>.</p>

<p>By the time the parser runs, <em>strict</em> is no longer optional.  I did
this to make life easier for consumers of the lexer's output stream, such
as authors of parsers. (Making the parser work less is often good.) </p>

<p>Likewise, for <em>digraph</em> versus <em>graph</em>, I designed the lexer to
output <em>digraph =&#62; 'yes'</em> in one case and <em>digraph =&#62;
    'no'</em> in the other. What does that mean? For <em>teamwork.dot</em>, the
lexer will output (in some convenient format) the equivalent of:</p>

<pre><code>        strict   =&#62; no
        digraph  =&#62; yes
        graph_id =&#62; Perl
        ...</code></pre>

<p>I chose <em>graph_id</em> because the DOT language allows other types of
ids, such as for nodes, edges, ports, and compass points.</p>
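
<p>A minimal sketch of that normalisation follows, using a single regexp
rather than the DFA described above. <code>lex_prolog()</code> is a
hypothetical function, and emitting an empty string for a missing graph id is
an assumption made for this example only.</p>

<pre><code>        use strict;
        use warnings;

        # Hypothetical function. Returns the prolog as a list of [name, value] pairs.
        sub lex_prolog
        {
                my($dot) = @_;

                # Captures: optional 'strict', optional 'di', optional graph id.
                my($strict, $di, $id) = $dot =~ /\A\s*(strict\s+)?(di)?graph\b\s*([A-Za-z_]\w*)?\s*\{/
                        or die &#34;Input does not start with a DOT prolog\n&#34;;

                return
                [
                        [strict   =&#62; $strict ? 'yes' : 'no'],
                        [digraph  =&#62; $di     ? 'yes' : 'no'],
                        [graph_id =&#62; defined($id) ? $id : ''], # Assumption: '' if omitted.
                ];
        }</code></pre>

<p>Given the start of <em>teamwork.dot</em>, this returns <em>strict =&#62;
    no</em>, <em>digraph =&#62; yes</em> and <em>graph_id =&#62; Perl</em>. The
consumer always receives the same three items, whatever the author of the graph
chose to omit.</p>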

<p>All of this produces the first six Marpa-friendly rules:</p>

<pre><code>        [
        {   # Root-level stuff.
                lhs =&#62; 'graph_grammar',
                rhs =&#62; [qw/prolog_and_graph/],
        },
        {
                lhs =&#62; 'prolog_and_graph',
                rhs =&#62; [qw/prolog_definition graph_sequence_definition/],
        },
        {   # Prolog stuff.
                lhs =&#62; 'prolog_definition',
                rhs =&#62; [qw/strict_definition digraph_definition graph_id_definition/],
        },
        {
                lhs    =&#62; 'strict_definition',
                rhs    =&#62; [qw/strict/],
                action =&#62; 'strict', # &#60;== Callback.
        },
        {
                lhs    =&#62; 'digraph_definition',
                rhs    =&#62; [qw/digraph/],
                action =&#62; 'digraph', # &#60;== Callback.
        },
        {
                lhs    =&#62; 'graph_id_definition',
                rhs    =&#62; [qw/graph_id/],
                action =&#62; 'graph_id', # &#60;== Callback.
        },
        ...
        ]</code></pre>

<p>In English, all of this asserts that the graph as a whole consists of a
prolog thingy, then a graph sequence thingy. (Remember, I made up the names
<code>prolog_and_graph</code>, etc.)</p>

<p>Next, a prolog consists of a strict thingy, which is now not optional, and
then a digraph thingy, which will turn out to match the lexer input of
<code>/^(di|)graph$/</code>, and the lexer output of <code>digraph =&#62;
    /^(yes|no)$/</code>, and then a graph_id, which is optional, and then some
other stuff which will be the precise definition of real live graphs,
represented by <code>{...}</code> in the list of the eight possible formats for
the prolog.</p>

<p>Whew.</p>

<h1>Something Fascinating about Rule Descriptors</h1>

<p>Take another look at those rule descriptors. They say <em>nothing</em> about
the values of the tokens! For instance, in <em>graph_id =&#62; Perl</em> what
happens to ids such as <em>Perl</em>? Nothing. They are ignored. That's just
how these grammars work.</p>

<p>Recall: it's the job of the <em>lexer</em> to identify valid graph ids based
on the first sub-grammar. By the time the data hits the parser, we know we have
a valid graph id, and as long as it plugs in to the <em>structure</em> of the
grammar in the right place, we are prepared to accept <em>any valid</em> graph
id. Hence <a href="http://metacpan.org/module/Marpa::R2">Marpa::R2</a> does not
even look at the graph id, which is a way of saying this one grammar works with
<em>every</em> valid graph id.</p>
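
<p>The idea can be sketched in a few lines of plain Perl. This is not how
<code>Marpa::R2</code> works internally, and <code>check_prolog()</code> is a
hypothetical name. The check below compares token names only and never reads a
value, yet it can still report exactly which token broke the structure.</p>

<pre><code>        use strict;
        use warnings;

        # Hypothetical checker. Each token is a [name, value] pair from the lexer.
        sub check_prolog
        {
                my(@token)    = @_;
                my(@expected) = qw/strict digraph graph_id/;

                for my $i (0 .. $#expected)
                {
                        # Only the token's name is examined. Its value is never read.
                        my($name) = $i &#60;= $#token ? $token[$i][0] : '(end of input)';

                        if ($name ne $expected[$i])
                        {
                                return &#34;Syntax error at token $i: expected $expected[$i] but got $name&#34;;
                        }
                }

                return 'ok';
        }</code></pre>

<p>The result is 'ok' for <em>graph_id =&#62; Perl</em> and equally for any
other id in that position. Dropping the <em>digraph</em> token instead gives
"Syntax error at token 1: expected digraph but got graph_id".</p>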

<p>This point also raises the tricky discussion of whether a specific
implementation of lexer/parser code can or must keep the two phases separate,
or whether in fact you can roll them into one without falling into the
premature optimisation trap. I'll just draw a veil over that discussion, as
I've already declared my stance: my implementation uses two separate
modules.</p>

<h1>Chains and Trees</h1>

<p>If these rules have to be chained into a tree, how do you handle the root?
Consider this call to <a
    href="http://metacpan.org/module/Marpa::R2">Marpa::R2</a>'s
<code>new()</code> method:</p>

<pre><code>        my($grammar) = Marpa::R2::Grammar -&#62; new(... start =&#62; 'graph_grammar', ...);</code></pre>

<p><em>graph_grammar</em> is precisely the <em>lhs</em> in the first rule
descriptor.</p>

<p>After that, every rule's <em>rhs</em>, including the root's, must be defined
later in the list of rule descriptors. These definitions form the links in the
chain. If you draw this, you'll see the end result is a tree.</p>
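
<p>One way to check the links in such a chain is sketched below. The helper
<code>unchained_symbols()</code> is hypothetical. It lists every <em>rhs</em>
symbol which never appears as an <em>lhs</em>. Such symbols must either be
tokens supplied by the lexer or rules which are still missing. The data is the
set of six rule descriptors shown earlier, with the actions omitted.</p>

<pre><code>        use strict;
        use warnings;

        my(@rules) =
        (
                {lhs =&#62; 'graph_grammar',       rhs =&#62; [qw/prolog_and_graph/]},
                {lhs =&#62; 'prolog_and_graph',    rhs =&#62; [qw/prolog_definition graph_sequence_definition/]},
                {lhs =&#62; 'prolog_definition',   rhs =&#62; [qw/strict_definition digraph_definition graph_id_definition/]},
                {lhs =&#62; 'strict_definition',   rhs =&#62; [qw/strict/]},
                {lhs =&#62; 'digraph_definition',  rhs =&#62; [qw/digraph/]},
                {lhs =&#62; 'graph_id_definition', rhs =&#62; [qw/graph_id/]},
        );

        # Hypothetical helper: list the rhs symbols which are never an lhs.
        sub unchained_symbols
        {
                my($start, $rules) = @_;
                my(%is_lhs) = map{($_ -&#62; {lhs} =&#62; 1)} @$rules;

                die &#34;Start symbol $start is not the lhs of any rule\n&#34; if (! $is_lhs{$start});

                my(%unchained);

                for my $rule (@$rules)
                {
                        for my $symbol (@{$rule -&#62; {rhs} })
                        {
                                $unchained{$symbol} = 1 if (! $is_lhs{$symbol});
                        }
                }

                return [sort keys %unchained];
        }</code></pre>

<p>For these six rules, <code>unchained_symbols('graph_grammar',
    \@rules)</code> returns <em>digraph</em>, <em>graph_id</em>,
<em>graph_sequence_definition</em> and <em>strict</em>. Three of those are
tokens from the lexer. The fourth is a rule which this excerpt of the grammar
does not include.</p>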

<p>Here's the full <a href="http://metacpan.org/module/Marpa::R2">Marpa::R2</a>
grammar for DOT (as used in the <a
    href="http://metacpan.org/module/GraphViz2::Marpa">GraphViz2::Marpa</a>
module) as an image: <a
    href="http://savage.net.au/Ron/html/graphviz2.marpa/Marpa.Grammar.svg">http://savage.net.au/Ron/html/graphviz2.marpa/Marpa.Grammar.svg</a>.
I created this image with (you guessed it!) <a
    href="http://www.graphviz.org/">Graphviz</a> via <a
    href="http://metacpan.org/module/GraphViz2">GraphViz2</a>. I've added
numbers to node names in the tree, otherwise Graphviz would regard any two
identical numberless names as one and the same node.</p>

<h1>Less Coding, More Design</h1>

<p>Here I'll stop building the tree of the grammar (see the next article), and
turn to some design issues.</p>

<h1>My Rules-of-Thumb for Writing Lexers/Parsers</h1>

<p>The remainder of this document is to help beginners orient their thinking
when tackling a problem they don't yet have much experience in. Of course, if
you're an expert in lexing and parsing, feel free to ignore everything I say,
and if you think I've misused lexing/parsing terminology here, please let me
know.</p>

<h2>Eschew Premature Optimisation</h2>

<p>Yep, this old one again. It has various connotations:</p>

<ul>

    <li><p><em>The lexer and the parser</em></p>

    <p>Don't aim to combine the lexer and parser, even though that might
    eventually happen. Do wait until the design of each is clear and finalized,
    before trying to jam them into a single module (or program).</p></li>

    <li><p><em>The lexer and the tokens</em></p>

    <p>Do make the lexer identify the existence of tokens, but not identify
    their ultimate role or meaning.</p></li>

    <li><p><em>The lexer and context</em></p>

    <p>Don't make the lexer do context analysis. Do make the parser
    disambiguate tokens with multiple meanings, by using the context. Let the
    lexer do the hard work of identifying tokens.</p>

    <p>And <a href="http://en.wikipedia.org/wiki/Context_analysis">context
        analysis for businesses</a>, for example, is probably not what you want
    either.</p></li>

    <li><p><em>The lexer and syntax</em></p>

    <p>Don't make the lexer do syntax checking. This is effectively the same as
    the last point.</p></li>

    <li><p><em>The lexer and its output</em></p>

    <p>Don't minimize the lexer's output stream. For instance, don't force the
    code which reads the lexer's output to guess whether or not a
    variable-length set of tokens has ended. Output a specific token as a set
    terminator. The point of this token is to tell the parser exactly what's
    going on. Without such a token, the next token has to do double-duty:
    Firstly it tells the parser the variable-length part has finished and
    secondly, it represents itself. Such overloading is unnecessary.</p></li>

    <li><p><em>The State Transition Table</em></p>

    <p>In the STT, don't try to minimize the number of states, at least not
    until the code has stabilized (that is, it's no longer under [rapid]
    development).</p>

    <p>I develop my STTs in a spreadsheet program, which means a formula
    (regexp) stored in one cell can be referred to by any number of other
    cells. This is <em>very</em> convenient.</p></li>

</ul>
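
<p>To illustrate the point above about not minimizing the lexer's output, here
is a small sketch of a variable-length set wrapped in explicit start and end
tokens. The token names (<em>open_attributes</em>, <em>attribute_id</em>,
<em>attribute_value</em>, <em>close_attributes</em>) and the function
<code>attribute_tokens()</code> are invented for this example, and are not
claimed to be what the real lexer emits.</p>

<pre><code>        use strict;
        use warnings;

        # Hypothetical token names. Each attribute is a [name, value] pair.
        sub attribute_tokens
        {
                my(@attribute) = @_;
                my(@token)     = (['open_attributes', '[']);

                for my $attribute (@attribute)
                {
                        push @token, ['attribute_id', $$attribute[0] ], ['attribute_value', $$attribute[1] ];
                }

                # The explicit terminator: the parser need not guess where the set ends.
                push @token, ['close_attributes', ']'];

                return [@token];
        }</code></pre>

<p>For one attribute, such as <em>color =&#62; grey</em>, this produces four
tokens, the last of which is always <em>close_attributes</em>. The token after
that only has to represent itself.</p>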

<h2>Divide and Conquer</h2>

<p>Hmmm, another ancient <a
    href="http://en.wikipedia.org/wiki/Aphorism">aphorism</a>. Naturally, these
persist precisely because they're telling us something important. Here, it
means study the problem carefully, and deal with each part (lexer, parser) of
it separately. Enough said.</p>

<h2>Don't Reinvent the Wheel</h2>

<p>Yes, I know <em>you'd</em> never do that.</p>

<p>The CPAN has plenty of Perl modules to help with things like the STT, such
as <a href="http://metacpan.org/module/Set::FA::Element">Set::FA::Element</a>.
Check its See Also (in <a
    href="http://metacpan.org/module/Set::FA">Set::FA</a>, actually) for other
STT helpers.</p>

<h2>Be Patient with the STT</h2>

<p>Developing the STT takes many iterations:</p>

<ul>
    <li><p>The test cases</p>

    <p>For each iteration, prepare a separate test case.</p></li>

    <li><p>The tiny script</p>

    <p>Have a tiny script which runs a single test. Giving it a
    short&mdash;perhaps temporary&mdash;name, makes each test just that little
    bit easier to run. You can give it a meaningful name later, when including
    it in the distro.</p></li>

    <li><p>The wrapper script</p>

    <p>Have a script which runs all tests.</p>

    <p>I keep the test data files in the data/ dir, and the scripts in the scripts/ dir. Then, creating tests in the t/ dir can perhaps use these two sets of helpers.</p>

    <p>Because I've only used <a
        href="http://metacpan.org/module/Marpa::R2">Marpa::R2</a> for graphical
    work, the output of the wrapper is a web page, which makes viewing the
    results simple. I like to include (short) input or output text files on
    such a page, beside the SVG images. That way I can see at a glance what the
    input was and hence I can tell what the output should be without switching
    to the editor's window.</p>

    <p>There's a little bit of effort initially, but after that it's
    <em>so</em> easy to check the output of the latest test.</p></li>

</ul>

<p>I've made available sample output from my wrapper scripts:</p>

<p><a href="http://savage.net.au/Perl-modules/html/graphviz2/">GraphViz2
    (non-Marpa)</a></p>

<p><a
    href="http://savage.net.au/Perl-modules/html/graphviz2.marpa/">GraphViz2::Marpa</a></p>

<p><a
    href="http://savage.net.au/Perl-modules/html/graph.easy.marpa/">Graph::Easy::Marpa</a></p>

<h2>Be Patient with the Grammar</h2>

<p>As with the STT, creating a grammar is, at least for me, very much a
trial-and-error process. I offer a few tips:</p>

<ul>

    <li><p>Paper, not code</p>

    <p>A good idea is not to start by coding with your editor, but to draw the
    grammar as a tree, on paper.</p></li>

    <li><p>Watch out for alternatives</p>

    <p>This refers to when one of several tokens can appear in the input
    stream. Learn exactly how to draw that without trying to minimize the
    number of branches in the tree.</p>

    <p>Of course, you will still need to learn how to code such a construct.
    Here's a bit of code from <a
        href="http://metacpan.org/module/Graph::Easy::Marpa">Graph::Easy::Marpa</a>
    which deals with this (note: we're back to the <code>Graph::Easy</code>
    language from here on!):</p>

<pre><code>        {   # Graph stuff.
                lhs =&#62; 'graph_definition',
                rhs =&#62; [qw/graph_statement/],
        },
        {
                lhs =&#62; 'graph_statement', # 1 of 3.
                rhs =&#62; [qw/group_definition/],
        },
        {
                lhs =&#62; 'graph_statement', # 2 of 3.
                rhs =&#62; [qw/node_definition/],
        },
        {
                lhs =&#62; 'graph_statement', # 3 of 3.
                rhs =&#62; [qw/edge_definition/],
        },</code></pre>

<p>This is telling you that a graph thingy can be any one of a group, node, or
edge. It's <a href="http://metacpan.org/module/Marpa::R2">Marpa::R2</a>'s job
to try these alternatives in order to see which (if any) matches the input
stream. This ruleset represents a point in the input stream where one of
several <em>alternatives</em> can appear.</p>

<p>The tree looks like:</p>

<pre><code>                        graph_definition
                               |
                               V
                        graph_statement
                               |
                               V
            ---------------------------------------
            |                  |                  |
            V                  V                  V
     group_definition   node_definition    edge_definition</code></pre>

    <p>My comment <code>3 of 3</code> says an edge can stand alone.</p></li>

    <li><p>Watch out for sequences</p>

    <p>... but consider the <em>node_definition</em>:</p>

<pre><code>        {   # Node stuff.
                lhs =&#62; 'node_definition',
                rhs =&#62; [qw/node_sequence/],
                min =&#62; 0,
        },
        {
                lhs =&#62; 'node_sequence', # 1 of 4.
                rhs =&#62; [qw/node_statement/],
        },
        {
                lhs =&#62; 'node_sequence', # 2 of 4.
                rhs =&#62; [qw/node_statement daisy_chain_node/],
        },
        {
                lhs =&#62; 'node_sequence', # 3 of 4.
                rhs =&#62; [qw/node_statement edge_definition/],
        },
        {
                lhs =&#62; 'node_sequence', # 4 of 4.
                rhs =&#62; [qw/node_statement group_definition/],
        },</code></pre>

    <p>Here <code>3 of 4</code> tells you that nodes can be followed by
    edges.</p>

    <p>A realistic sample is: <code>[node_1] -&#62; [node_2]</code>, where
    <code>[x]</code> is a node and <code>-&#62;</code> is an edge. The node
    can be followed by an edge (applying <code>3 of 4</code>), and an edge can
    in turn be followed by a node.</p>

    <p>This full example represents a point in the input stream where one of
    several specific <em>sequences</em> of tokens is allowed/expected. Here's
    the <em>edge_definition</em>:</p>

<pre><code>        {   # Edge stuff.
                lhs =&#62; 'edge_definition',
                rhs =&#62; [qw/edge_sequence/],
                min =&#62; 0,
        },
        {
                lhs =&#62; 'edge_sequence', # 1 of 4.
                rhs =&#62; [qw/edge_statement/],
        },
        {
                lhs =&#62; 'edge_sequence', # 2 of 4.
                rhs =&#62; [qw/edge_statement daisy_chain_edge/],
        },
        {
                lhs =&#62; 'edge_sequence', # 3 of 4.
                rhs =&#62; [qw/edge_statement node_definition/],
        },
        {
                lhs =&#62; 'edge_sequence', # 4 of 4.
                rhs =&#62; [qw/edge_statement group_definition/],
        },
        {
                lhs =&#62; 'edge_statement',
                rhs =&#62; [qw/edge_name attribute_definition/],
        },
        {
                lhs    =&#62; 'edge_name',
                rhs    =&#62; [qw/edge_id/],
                action =&#62; 'edge_id',
        },</code></pre></li>
</ul>
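
<p>To see alternatives and sequences side by side, it can help to group rule
descriptors by their <em>lhs</em>. The sketch below does that for a few of the
rules shown above. <code>alternatives()</code> is a hypothetical helper, not
part of <code>Marpa::R2</code>.</p>

<pre><code>        use strict;
        use warnings;

        my(@rules) =
        (
                {lhs =&#62; 'graph_definition', rhs =&#62; [qw/graph_statement/]},
                {lhs =&#62; 'graph_statement',  rhs =&#62; [qw/group_definition/]},
                {lhs =&#62; 'graph_statement',  rhs =&#62; [qw/node_definition/]},
                {lhs =&#62; 'graph_statement',  rhs =&#62; [qw/edge_definition/]},
                {lhs =&#62; 'node_sequence',    rhs =&#62; [qw/node_statement edge_definition/]},
        );

        # Hypothetical helper: map each lhs to the list of its rhs alternatives.
        sub alternatives
        {
                my($rules) = @_;
                my(%alternative);

                for my $rule (@$rules)
                {
                        push @{$alternative{$rule -&#62; {lhs} } }, join(' ', @{$rule -&#62; {rhs} });
                }

                return \%alternative;
        }</code></pre>

<p>Here <em>graph_statement</em> maps to three one-symbol alternatives, while
the single <em>node_sequence</em> entry is a two-symbol sequence. An
<em>lhs</em> which occurs more than once marks a choice, and an <em>rhs</em>
with more than one symbol marks a sequence.</p>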

<p>But, I have to stop somewhere, so...</p>

<h1>Wrapping Up and Winding Down</h1>

<p>I hope I've clarified what can be a complex and daunting part of
programming, and I also hope I've convinced you that working in Perl, with the
help of a spreadsheet, is the modern (or "only") path to lexer and parser
bliss.</p>

<p><em><a href="http://savage.net.au/index.html">Ron Savage</a></em> is a
longtime Perl programmer and prolific CPAN contributor.</p>]]>
        
    </content>
</entry>

<entry>
    <title>Show off Perl in Plat_Forms 2012</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/08/show-off-perl-in-plat-forms-2012.html" />
    <id>tag:www.perl.com,2012:/pub//2.2072</id>

    <published>2012-08-29T20:21:52Z</published>
    <updated>2012-08-29T20:23:47Z</updated>

    <summary>Thanks to Torsten Raudssus, who wrote in with this announcement. What is Plat_Forms? Plat_Forms is a contest and competition in which top-class teams of three programmers compete to implement the same requirements for a web-based system within two days, using...</summary>
    <author>
        <name>chromatic</name>
        <uri>http://www.modernperlbooks.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>Thanks to <a href="https://raudss.us/">Torsten Raudssus</a>, who wrote in with this announcement.</em></p>

<h1>What is Plat_Forms?</h1>

<p><a href="https://www.plat-forms.org/">Plat_Forms</a> is a contest and competition in which top-class teams of three programmers compete to implement the same requirements for a web-based system within two days, using different technology platforms (e.g. Java, .NET, Perl, PHP, Python, Ruby, Scala, Smalltalk, JavaScript or what-have-you).</p>

<p>Its purpose is not to determine "the" best platform, but rather to provide new insights into the real (rather than purported) pros, cons, and emergent properties of each platform. The evaluation will analyze many aspects of each solution, both external (scalability, functionality, reliability, security, performance, etc.) and internal (structure, modularity, understandability, flexibility, etc.). <a href="#fn:footnote1" defang_id="fnref:footnote1" class="footnote">1</a></p>

<p>In just two days, the teams will implement as much of the requested functionality as they can and at the same time optimize the usefulness of the resulting system (functionality, usability, reliability, etc.), the understandability of the code, the modifiability of the system design, the efficiency and scalability.</p>

<p>The contest will be conducted on October 9-10, 2012. At the end of the 2 days, the teams hand over their source code and a turnkey-runnable VMware configuration of their system. <a href="#fn:footnote2" defang_id="fnref:footnote2" class="footnote">2</a></p>

<p>The event is at the <a href="https://maps.google.com/maps?q=Freie+Universit%C3%83%C2%A4t+Berlin,+Berlin,+Germany&amp;hl=en&amp;ie=UTF8&amp;ll=52.456009,13.293457&amp;spn=22.652618,17.995605&amp;sll=52.446685,13.285786&amp;sspn=0.005509,0.004393&amp;oq=Freie+Universit%C3%83%C2%A4t+berlin&amp;t=h&amp;hq=Freie+Universit%C3%83%C2%A4t+Berlin,+Berlin,+Germany&amp;z=6">Freie Universit&auml;t Berlin in Berlin, Germany</a>.</p>

<h1>Perl and Plat_Forms</h1>

<p>We, of course, want to represent Perl at this event. But we need teams and sponsorships to make this happen. Sadly the end of the registration is very near, but I think that we still have a chance to gather some productive and business effective working teams. A team has 3 people. We can register a maximum of 4 teams.</p>

<p>Companies may wish to enter a team of their own developers, as this gives them a chance to perform in a new and different environment, test themselves as a team against others, and see how other teams work together. Beyond these opportunities, companies should also consider sponsoring their most famous web framework to help send a team to show that it IS the best web framework! :-) Please help us make this a representative event for the modern Perl world. Sadly, last year's <a href="https://www.plat-forms.org/results-2011">results</a> weren't that good, and we think we can do better ;).</p>

<p>If you want to participate, you must give us your details <strong>before Fri 2012-09-07</strong>, when registration for the event closes. Our contact email is at the end. If you want to sponsor the event, you can also contact us after the registration end date, but please do so <strong>ASAP</strong> :).</p>

<h1>Contact and More Info</h1>

<p><a href="https://raudss.us/">Torsten Raudssus</a> organizes the team and the sponsorships this year, so if you want to see your team here or would like to sponsor, please contact him (<strong>ASAP</strong>):</p>

<ul>
<li><p><a href="mailto:getty@cpan.org">getty@cpan.org</a></p></li>
<li><p>#Plat_Forms on irc.perl.org</p></li>
<li><p><a href="http://wiki.enlightenedperl.org/platforms2012">Wiki Page</a></p></li>
</ul>

<div class="footnotes">
<hr>
<ol>

<li defang_id="fn:footnote1"><p>Taken from <a href="https://www.plat-forms.org/">Plat_Forms homepage</a><a href="#fnref:footnote1" class="reversefootnote">&nbsp;↩</a></p></li>

<li defang_id="fn:footnote2"><p>Taken from <a href="https://www.plat-forms.org/platforms-2012-rev-2-announcement">Plat_Forms 2012 rev2 homepage</a><a href="#fnref:footnote2" class="reversefootnote">&nbsp;↩</a></p></li>

</ol>
</div>]]>
        
    </content>
</entry>

<entry>
    <title>Embrace the Reluctant Perl Programmer</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/08/embrace-the-reluctant-perl-programmer.html" />
    <id>tag:www.perl.com,2012:/pub//2.2070</id>

    <published>2012-08-27T13:00:01Z</published>
    <updated>2012-09-21T01:54:15Z</updated>

    <summary>editor&apos;s note: an earlier version of this article appeared at The Reluctant Perl Programmer. Per the suggestion of Ask Bjørn Hansen, this revision appears on Perl.com. Who We Are We all love Perl for different reasons. Some of us are...</summary>
    <author>
        <name>chromatic</name>
        <uri>http://www.modernperlbooks.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p><em>editor's note: an earlier version of this article appeared at <a
href="http://www.modernperlbooks.com/mt/2012/06/the-reluctant-perl-programmer.html">The
Reluctant Perl Programmer</a>. Per the suggestion of <a
href="http://www.askbjoernhansen.com/">Ask Bjørn Hansen</a>, this revision
appears on Perl.com.</em></p>

<h2>Who We Are</h2>

<p>We all love Perl for different reasons.</p>

<p>Some of us are programmers at heart. We love writing code. We love solving
problems. We love that the distance between a problem and its solution is often
a few lines of Perl which flow almost effortlessly from our minds through our
fingers to our screens.</p>

<p>Some of us are administrators. We love order. We love consistency. We love
knowing that the scripts we wrote in 2008 or 1998 or 1988 run unmodified on
every system we touch and we don't have to think about it. We love that Perl
doesn't get in the way of our solving problems, whether we have a few minutes
to fight a fire or a few weeks to plan something big.</p>

<p>Some of us are artists. We tinker. We play. We experiment. We write poems
and steal features from wherever we can find them. We color outside the lines,
and we love the flexibility we have to let our muses take us where they will,
because we know that Perl will stay out of our way.</p>

<p>Some of us are engineers. We love reliability. We love working software. We
love when an upgrade is boring, when there are no unpleasant surprises. We love
having the CPAN always within reach, with far more great software than we can
ever use there for us whenever we want it.</p>

<p>We are many and we are varied. We build great things, and we collectively
make up something even greater.</p>

<p>We started with the vision of one man far too lazy and hubristic to solve a
simple problem of cross-continent communication in the easy way. We grew as
system administrators acknowledged that something more powerful and consistent
than shell but simpler and more forgiving than C occupied an enormous
ecological niche. Then we grew again as we realized that a web server could do
more than just serve a plain static page, and that wrangling text was a job for
a powerful, malleable language.</p>

<p>Along the way we built something grand.</p>

<p>We're pragmatic. We're relentlessly pragmatic. We get things done. We
iterate and improve and refine. We've stolen the Unix ethos (build many small
tools, loosely coupled) and turned it into the CPAN and the ecosystem around
the CPAN such that, at times, the CPAN is our language far more than Perl is,
and many of us are all the better for it.</p>

<p>Yet not everyone can benefit from that.</p>

<h2>Who They Are</h2>

<p>We stand on little ceremony, but we do stand on ceremony. Some say the
"right" way to write Perl is a thin layer of glue connecting as much of the
CPAN as you need. Others suggest that the "right" way to write Perl is the code
they wrote in 1987 when Larry first introduced his work to the world. Most of
us are somewhere in between.</p>

<p>Most of us are somewhat wrong.</p>

<p>Consider the plight of the reluctant programmer who faces a problem. He or
she may pick up Perl, and what then?</p>

<p>Where does this reluctant programmer go for information? Where does this reluctant programmer go for help?</p>

<p>Where can you learn that the first dozen Perl tutorials easily found with a
search engine are out of date or even wrong? Where can you learn that any of a
dozen good text editors or IDEs are available and are better than
<code>notepad.exe</code>? Where can you learn how to install CPAN modules on
ActivePerl (or that Strawberry Perl exists)&mdash;and more importantly,
why?</p>

<p>Where can you go from "I need to process this report by the end of the day,
and I have this text file, and I heard PERL was good for that?" to "<em>I</em>
can solve my problems with this language, and I never considered myself a
programmer before!"</p>

<p>Those of us who've already undergone that transition may find it difficult
to remember the days gone by. We've spent so long shaping the language, our
tools, and our community to fit our problems, perhaps we've forgotten both the
energy of promise and the growing pains of our youths.</p>

<p>Once upon a time, we all just wanted to get something done. There is much to
admire in that approach. By all means, let us continue solving problems. Let us
encourage people to solve their problems. Let us solve problems so well we
never encounter them again.</p>

<p>Let us not forget, however, to let reluctant programmers solve their
problems immediately, however poorly, and only then help them open their eyes
to the new possibilities their successes now afford them.</p>

<h2>Who We May Be</h2>

<p>This change is a change of mindset, not of technology nor tools.</p>

<p>With few exceptions, our growth will not come from those who already know
the beauty of programming and the freedom of Perl. It will come from those who
merely (as if it were a mere thing!) wish to solve a problem. If and when they
succeed, they will need guidance to understand the new powers they possess, and
we can be there.</p>

<p>Yet first, we must accept that their goals are not our goals&mdash;not yet
anyhow, and perhaps not ever. Their goals may be strange to us, but they are
no less valuable for their peculiarities. In truth, that makes them more
valuable. These are more problems for us to solve, more ideas for us to adopt,
and more people to welcome into our community.</p>

<p>By all means, let us help them write great code and let us teach them the
value of working with the community in the structures and per the techniques we
have developed to harness our powers. Let us mentor them so that we may welcome
them as peers and equals in ability (even as we acknowledge them as worthy of
respect and praise from the start). Yet let us first welcome them into the
greater Perl community with all of the pragmatism we embrace.</p>

<h2>All are Welcome</h2>

<p>Reluctant programmers, come solve your problems with us. We are proud of
what we have built together, but we build beautiful things and share them with
you because in their <em>use</em> we find the greatest beauty.</p>

<p>You are welcome here.</p>

<p><em>chromatic is the author of <a
href="http://modernperlbooks.com/books/modern_perl/">Modern Perl: the book</a>
and the lead developer at <a href="http://bigbluemarblellc.com/">Big Blue
Marble</a>.</em></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Further Resources</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-further-resources.html" />
    <id>tag:www.perl.com,2012:/pub//2.2068</id>

    <published>2012-06-29T13:00:01Z</published>
    <updated>2012-07-10T20:31:41Z</updated>

    <summary>This series has shown you several features of Unicode by example, as well as several techniques for working with Unicode correctly and easily with recent releases of Perl 5. By now you know more than many programmers do about Unicode,...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<p>This series has shown you several features of Unicode by example, as well as
several techniques for working with Unicode correctly and easily with recent
releases of Perl 5. By now you know more than many programmers do about
Unicode, but your journey to mastery continues.</p>

<p>Perl 5 includes several pieces of documentation which explain Unicode and
Perl's Unicode support. See <a
href="http://search.cpan.org/perldoc?perlunicode">perlunicode</a>, <a
href="http://search.cpan.org/perldoc?perluniprops">perluniprops</a>, <a
href="http://search.cpan.org/perldoc?perlre">perlre</a>, <a
href="http://search.cpan.org/perldoc?perlrecharclass">perlrecharclass</a>, <a
href="http://search.cpan.org/perldoc?perluniintro">perluniintro</a>, <a
href="http://search.cpan.org/perldoc?perlunitut">perlunitut</a> and <a
href="http://search.cpan.org/perldoc?perlunifaq">perlunifaq</a>.</p>

<p>Perl 5 and the CPAN provide several modules and distributions to allow the
effective use of Unicode. As of Perl 5.16, many of these are in the core
library. Many of them work just as well with earlier versions of Perl 5, though
for the best and most correct support for Unicode as a whole, consider using
Perl 5.14 or 5.16.</p>

<p>These modules include:</p>

<ul>

<li><a href="http://search.cpan.org/perldoc?PerlIO">PerlIO</a></li>

<li><a href="http://search.cpan.org/perldoc?DB_File">DB_File</a></li>

<li><a href="http://search.cpan.org/perldoc?DBM_Filter">DBM_Filter</a></li>

<li><a href="http://search.cpan.org/perldoc?DBM_Filter::utf8">DBM_Filter::utf8</a></li>

<li><a href="http://search.cpan.org/perldoc?Encode">Encode</a></li>

<li><a href="http://search.cpan.org/perldoc?Encode::Locale">Encode::Locale</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::UCD">Unicode::UCD</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::Normalize">Unicode::Normalize</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::GCString">Unicode::GCString</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::LineBreak">Unicode::LineBreak</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::Collate">Unicode::Collate</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::Collate::Locale">Unicode::Collate::Locale</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::Unihan">Unicode::Unihan</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::CaseFold">Unicode::CaseFold</a></li>

<li><a href="http://search.cpan.org/perldoc?Unicode::Tussle">Unicode::Tussle</a></li>

<li><a href="http://search.cpan.org/perldoc?Lingua::JA::Romanize::Japanese">Lingua::JA::Romanize::Japanese</a></li>

<li><a href="http://search.cpan.org/perldoc?Lingua::ZH::Romanize::Pinyin">Lingua::ZH::Romanize::Pinyin</a></li>

<li><a href="http://search.cpan.org/perldoc?Lingua::KO::Romanize::Hangul">Lingua::KO::Romanize::Hangul</a></li>

</ul>

<p>The <small>CPAN</small> distribution <a
href="http://search.cpan.org/perldoc?Unicode::Tussle"><code>Unicode::Tussle</code></a>
includes many command-line programs for working with Unicode,
including these programs to fully or partly replace standard utilities:
<em>tcgrep</em> instead of <em>egrep</em>, <em>uniquote</em> instead of <em>cat
-v</em> or <em>hexdump</em>, <em>uniwc</em> instead of <em>wc</em>,
<em>unilook</em> instead of <em>look</em>, <em>unifmt</em> instead of
<em>fmt</em>, and <em>ucsort</em> instead of <em>sort</em>. For exploring
Unicode character names and character properties, see its <em>uniprops</em>,
<em>unichars</em>, and <em>uninames</em> programs. It also supplies these
programs, all of which are general filters that do Unicode-y things:
<em>unititle</em> and <em>unicaps</em>; <em>uniwide</em> and
<em>uninarrow</em>; <em>unisupers</em> and <em>unisubs</em>; <em>nfd</em>,
<em>nfc</em>, <em>nfkd</em>, and <em>nfkc</em>; and <em>uc</em>, <em>lc</em>,
and <em>tc</em>.</p>

<p>Finally, see <a href="http://unicode.org/standard/standard.html">the
published Unicode Standard</a> (page numbers are from version 6.0.0),
including these specific annexes and technical reports:</p>

<ul>

<li>§3.13 Default Case Algorithms, page 113</li>
<li>§4.2 Case, pages 120-122</li>
<li>Case Mappings, pages 166-172, especially Caseless Matching starting on page 170</li>
<li><a href="http://unicode.org/reports/tr44/"><small>UAX</small> #44: Unicode Character Database</a></li>
<li><a href="http://unicode.org/reports/tr18/"><small>UTS</small> #18: Unicode Regular Expressions</a></li>
<li><a href="http://unicode.org/reports/tr15/"><small>UAX</small> #15: Unicode Normalization Forms</a></li>
<li><a href="http://unicode.org/reports/tr10/"><small>UTS</small> #10: Unicode Collation Algorithm</a></li>
<li><a href="http://unicode.org/reports/tr29/"><small>UAX</small> #29: Unicode Text Segmentation</a></li>
<li><a href="http://unicode.org/reports/tr14/"><small>UAX</small> #14: Unicode Line Breaking Algorithm</a></li>
<li><a href="http://unicode.org/reports/tr11/"><small>UAX</small> #11: East Asian Width</a></li>

</ul>

<p>Tom Christiansen &lt;tchrist@perl.com&gt; wrote this series, with occasional
kibitzing from Larry Wall and Jeffrey Friedl in the background.</p>

<p>Most of these examples came from the current edition of the "Camel Book";
that is, from the <a
href="http://shop.oreilly.com/product/9780596004927.do">4<sup>th</sup>
Edition of <em>Programming Perl</em></a>, Copyright © 2012 Tom Christiansen
<em>et al.</em>, 2012-02-13 by O'Reilly Media. The code itself is freely
redistributable, and you are encouraged to transplant, fold, spindle, and
mutilate any of the examples in this series however you please for inclusion
into your own programs without any encumbrance whatsoever. Acknowledgement via
code comment is polite but not required.</p>

<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-demo-of-unicode-collation-and-printing.html">℞ 44: Demo of Unicode Collation and Printing</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Demo of Unicode Collation and Printing</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-demo-of-unicode-collation-and-printing.html" />
    <id>tag:www.perl.com,2012:/pub//2.2066</id>

    <published>2012-06-22T13:00:01Z</published>
    <updated>2012-07-10T20:30:58Z</updated>

    <summary>℞ 44: PROGRAM: Demo of Unicode collation and printing The past several weeks of Unicode recipes have explained how Unicode works and shown how to use it in your programs. If you&apos;ve gone through those recipes, you now understand more...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<h2 id="PROGRAM:-Demo-of-Unicode-collation-and-printing">℞ 44: <small>PROGRAM</small>: Demo of Unicode collation and printing</h2>

<p>The past several weeks of Unicode recipes have explained how Unicode works
and shown how to use it in your programs. If you've gone through those recipes,
you now understand more than most programmers.</p>

<p>How about putting everything together?</p>

<p>Here's a full program showing how to make use of locale-sensitive sorting,
Unicode casing, and managing print widths when some of the characters take up
zero or two columns, not just one column each time. When run, the following
program produces this nicely aligned output (though the quality of the alignment depends on the quality of your Unicode font, of course):</p>

<pre><code>    Crème Brûlée....... €2.00
    Éclair............. €1.60
    Fideuà............. €4.20
    Hamburger.......... €6.00
    Jamón Serrano...... €4.45
    Linguiça........... €7.00
    Pâté............... €4.15
    Pears.............. €2.00
    Pêches............. €2.25
    Smørbrød........... €5.75
    Spätzle............ €5.50
    Xoriço............. €3.00
    Γύρος.............. €6.50
    막걸리............. €4.00
    おもち............. €2.65
    お好み焼き......... €8.00
    シュークリーム..... €1.85
    寿司............... €9.99
    包子............... €7.50</code></pre>

<p>Here's that program; tested on v5.14.</p>

<pre><code> #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting and unicode_strings
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     &quot;γύρος&quot;             =&gt; 6.50, # gyros, Greek
     &quot;pears&quot;             =&gt; 2.00, # like um, pears
     &quot;linguiça&quot;          =&gt; 7.00, # spicy sausage, Portuguese
     &quot;xoriço&quot;            =&gt; 3.00, # chorizo sausage, Catalan
     &quot;hamburger&quot;         =&gt; 6.00, # burgermeister meisterburger
     &quot;éclair&quot;            =&gt; 1.60, # dessert, French
     &quot;smørbrød&quot;          =&gt; 5.75, # sandwiches, Norwegian
     &quot;spätzle&quot;           =&gt; 5.50, # Bayerisch noodles, little sparrows
     &quot;包子&quot;              =&gt; 7.50, # bao1 zi5, steamed pork buns, Mandarin
     &quot;jamón serrano&quot;     =&gt; 4.45, # country ham, Spanish
     &quot;pêches&quot;            =&gt; 2.25, # peaches, French
     &quot;シュークリーム&quot;    =&gt; 1.85, # cream-filled pastry like éclair, Japanese
     &quot;막걸리&quot;            =&gt; 4.00, # makgeolli, Korean rice wine
     &quot;寿司&quot;              =&gt; 9.99, # sushi, Japanese
     &quot;おもち&quot;            =&gt; 2.65, # omochi, rice cakes, Japanese
     &quot;crème brûlée&quot;      =&gt; 2.00, # tasty broiled cream, French
     &quot;fideuà&quot;            =&gt; 4.20, # more noodles, Valencian (Catalan=fideuada)
     &quot;pâté&quot;              =&gt; 4.15, # gooseliver paste, French
     &quot;お好み焼き&quot;        =&gt; 8.00, # okonomiyaki, Japanese
 );

 # find the widest allowed width for the name column
 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won&#39;t freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = Unicode::Collate::Locale-&gt;new( locale =&gt; &quot;ja&quot; );

 for my $item ($coll-&gt;sort(keys %price)) {
     print pad(entitle($item), $width, &quot;.&quot;);
     printf &quot; €%.2f\n&quot;, $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString-&gt;new($str)-&gt;columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str     =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }</code></pre>

<p>Simple enough, isn't it? Put together, everything just works nicely.</p>
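
<p>The <code>colwidth()</code> helper exists because a string's length in characters is not its width on screen. As a minimal sketch (the sample strings are illustrative, and it assumes <code>Unicode::GCString</code> is installed from the CPAN), compare <code>length</code> with <code>columns</code>:</p>

<pre><code> use utf8;
 use strict;
 use warnings;
 use Unicode::GCString;           # from CPAN

 my $wide     = &quot;寿司&quot;;          # two East Asian wide characters
 my $combined = &quot;e\x{301}&quot;;      # e + COMBINING ACUTE ACCENT

 # length() counts codepoints; columns() counts display cells
 my %report = (
     wide     =&gt; [ length($wide),     Unicode::GCString-&gt;new($wide)-&gt;columns ],
     combined =&gt; [ length($combined), Unicode::GCString-&gt;new($combined)-&gt;columns ],
 );

 for my $name (sort keys %report) {
     printf &quot;%-8s length=%d columns=%d\n&quot;, $name, @{ $report{$name} };
 }
 # combined length=2 columns=1
 # wide     length=2 columns=4</code></pre>

<p>Padding by <code>length</code> alone would misalign both kinds of rows, which is why the menu program measures columns instead.</p>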
<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-unicode-text-in-dbm-files-the-easy-way.html">℞ 43: Unicode Text in DBM Files (the easy way)</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>
<p>Next: <a href="http://www.perl.com/pub/2012/06/perlunicook-further-resources.html">℞ 45: Further Resources</a></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Unicode Text in DBM Files (the easy way)</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-unicode-text-in-dbm-files-the-easy-way.html" />
    <id>tag:www.perl.com,2012:/pub//2.2058</id>

    <published>2012-06-20T13:00:01Z</published>
    <updated>2012-07-10T20:29:28Z</updated>

    <summary>℞ 43: Unicode text in DBM hashes, the easy way Some Perl libraries require you to jump through hoops to handle Unicode data. Would that everything worked as easily as Perl&apos;s open pragma! For DBM files, here&apos;s how to implicitly...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<h2 id="Unicode-text-in-DBM-hashes-the-easy-way">℞ 43: Unicode text in <small>DBM</small> hashes, the easy way</h2>

<p><a
href="http://www.perl.com/pub/2012/06/perlunicook-unicode-text-in-stubborn-libraries.html">Some
Perl libraries require you to jump through hoops to handle Unicode data</a>.
Would that everything worked as easily as Perl's <a
href="http://perldoc.perl.org/open.html">open</a> pragma!</p>

<p>For DBM files, here's how to implicitly manage the translation; all encoding
and decoding is done automatically, just as with streams that have a particular
encoding attached to them. The <a
href="http://search.cpan.org/perldoc?DBM_Filter">DBM_Filter</a> module allows
you to apply filters to keys and values to manipulate their contents before
storing or fetching. The module includes a "utf8" filter. Use it like:</p>

<pre><code>    use DB_File;
    use DBM_Filter;

    my $dbobj = tie %dbhash, &quot;DB_File&quot;, &quot;pathname&quot;;
    $dbobj-&gt;Filter_Push(&quot;utf8&quot;);  # this is the magic bit: filters keys and values

 # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    $dbhash{$uni_key} = $uni_value;

  # FETCH

    # $uni_key holds a normal Perl string (abstract Unicode)
    my $uni_value = $dbhash{$uni_key};</code></pre>
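
<p>To see what the filter does on disk, here is a hedged, self-contained sketch. The scratch file name <code>menu.db</code> and the sample key are illustrative, not part of the original recipe. It uses <code>Filter_Push</code>, which DBM_Filter documents as applying a filter to both keys and values, and then reopens the file without any filter to look at the raw bytes:</p>

<pre><code>    use utf8;
    use strict;
    use warnings;
    use DB_File;
    use DBM_Filter;
    use File::Temp qw(tempdir);

    # hypothetical scratch file, removed when the program exits
    my $path = tempdir(CLEANUP =&gt; 1) . &quot;/menu.db&quot;;

    # write through the utf8 filter
    my %dbhash;
    my $dbobj = tie %dbhash, &quot;DB_File&quot;, $path or die &quot;cannot tie $path: $!&quot;;
    $dbobj-&gt;Filter_Push(&quot;utf8&quot;);
    $dbhash{&quot;crème brûlée&quot;} = &quot;2.00&quot;;
    my ($filtered_key) = keys %dbhash;     # decoded on fetch
    undef $dbobj;
    untie %dbhash;

    # read back with no filter: the stored key is UTF-8 bytes
    my %raw;
    tie %raw, &quot;DB_File&quot;, $path or die &quot;cannot tie $path: $!&quot;;
    my ($raw_key) = keys %raw;
    untie %raw;

    printf &quot;filtered key: %d characters\n&quot;, length $filtered_key;   # 12
    printf &quot;raw key:      %d bytes\n&quot;,      length $raw_key;        # 15</code></pre>

<p>The three accented letters each take two bytes in UTF-8, which accounts for the difference between the two counts.</p>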

<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-unicode-text-in-stubborn-libraries.html">℞ 42: Unicode Text in Stubborn Libraries</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>
<p>Next: <a href="http://www.perl.com/pub/2012/06/perlunicook-demo-of-unicode-collation-and-printing.html">℞ 44: Demo of Unicode Collation and Printing</a></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Unicode Text in Stubborn Libraries</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-unicode-text-in-stubborn-libraries.html" />
    <id>tag:www.perl.com,2012:/pub//2.2060</id>

    <published>2012-06-18T13:00:01Z</published>
    <updated>2012-07-10T20:29:18Z</updated>

    <summary>℞ 42: Unicode text in DBM hashes, the tedious way While Perl 5 has long been very careful about handling Unicode correctly inside the world of Perl itself, every time you leave the Perl internals, you cross a boundary at...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<h2 id="Unicode-text-in-DBM-hashes-the-tedious-way">℞ 42: Unicode text in <small>DBM</small> hashes, the tedious way</h2>

<p>While Perl 5 has long been very careful about handling Unicode correctly
inside the world of Perl itself, every time you leave the Perl internals, you
cross a boundary at which <em>something</em> may need to handle decoding and
encoding. This happens when performing IO across a network or to files, when
speaking to a database, or even when using XS to use a shared library from
Perl.</p>

<p>For example, consider the core module <a
href="http://search.cpan.org/perldoc?DB_File">DB_File</a>, which allows you to
use Berkeley DB files from Perl&mdash;persistent storage for key/value
pairs.</p>

<p>Using a regular Perl string as a key or value for a
<small>DBM</small> hash will trigger a wide character exception if any
codepoints won't fit into a byte. Here's how to manually manage the
translation:</p>

<pre><code>    use DB_File;
    use Encode qw(encode decode);
    tie %dbhash, &quot;DB_File&quot;, &quot;pathname&quot;;

 # STORE

    # assume $uni_key and $uni_value are abstract Unicode strings
    my $enc_key   = encode(&quot;UTF-8&quot;, $uni_key, 1);
    my $enc_value = encode(&quot;UTF-8&quot;, $uni_value, 1);
    $dbhash{$enc_key} = $enc_value;

 # FETCH

    # assume $uni_key holds a normal Perl string (abstract Unicode)
    my $enc_key   = encode(&quot;UTF-8&quot;, $uni_key, 1);
    my $enc_value = $dbhash{$enc_key};
    my $uni_value = decode(&quot;UTF-8&quot;, $enc_value, 1);</code></pre>

<p>By performing this manual encoding and decoding yourself, you know that your
storage file will have a consistent representation of your data. The correct
encoding depends on the type of data you store and the capabilities of the
external code, of course.</p>
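
<p>One detail worth knowing when you pass a check value: the Encode documentation notes that when <code>CHECK</code> is set without the <code>LEAVE_SRC</code> bit, <code>encode()</code> and <code>decode()</code> may overwrite the source string in place. The following is a minimal sketch (the sample string is illustrative) of a round trip that keeps the source intact by combining <code>FB_CROAK</code> with <code>LEAVE_SRC</code>:</p>

<pre><code>    use utf8;
    use strict;
    use warnings;
    use Encode qw(encode decode);

    # die on malformed data, but leave the source string untouched
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    my $uni_value = &quot;smørbrød&quot;;                           # 8 characters
    my $enc_value = encode(&quot;UTF-8&quot;, $uni_value, $check);  # 10 bytes
    my $round     = decode(&quot;UTF-8&quot;, $enc_value, $check);  # characters again

    printf &quot;%d characters, %d bytes, round trip %s\n&quot;,
        length($uni_value), length($enc_value),
        ($round eq $uni_value ? &quot;ok&quot; : &quot;FAILED&quot;);
    # 8 characters, 10 bytes, round trip ok</code></pre>

<p>If you reuse the same key variable for a later fetch, keeping the source string intact matters.</p>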

<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-unicode-linebreaking.html">℞ 41: Unicode Linebreaking</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>
<p>Next: <a href="http://www.perl.com/pub/2012/06/perlunicook-unicode-text-in-dbm-files-the-easy-way.html">℞ 43: Unicode Text in DBM Files (the easy way)</a></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Unicode Linebreaking</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-unicode-linebreaking.html" />
    <id>tag:www.perl.com,2012:/pub//2.2062</id>

    <published>2012-06-12T13:00:01Z</published>
    <updated>2012-07-10T20:29:04Z</updated>

    <summary>℞ 41: Unicode linebreaking If you&apos;ve ever tried to fit a large amount of text into a display area too narrow for the full width of the text, you&apos;ve dealt with the joy of linebreaking (or word wrapping). As you...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<h2 id="Unicode-linebreaking">℞ 41: Unicode linebreaking</h2>

<p>If you've ever tried to fit a large amount of text into a display area too
narrow for the full width of the text, you've dealt with the joy of
linebreaking (or word wrapping). As you may have come to expect from Unicode
now, the specification provides a <a
href="http://www.unicode.org/reports/tr14/">Unicode Line Breaking Algorithm</a>
which respects the available line breaking opportunities provided by Unicode
text.</p>

<p>Unicode characters, of course, may have properties which influence these
rules.</p>

<p>As you have come to expect from Perl, a module implements the Unicode Line
Breaking Algorithm. Install <a
href="http://search.cpan.org/perldoc?Unicode::LineBreak">Unicode::LineBreak</a>.
This module respects direct and indirect break points as well as the grapheme
width of the string. Its basic use is simple:</p>

<pre><code> use Unicode::LineBreak;
 use charnames qw(:full);

 my $para = &quot;This is a super\N{HYPHEN}long string. &quot; x 20;
 my $fmt  = Unicode::LineBreak-&gt;new;
 print $fmt-&gt;break($para), &quot;\n&quot;;</code></pre>

<p>In list context, its <code>break()</code> method returns the lines broken at
valid points. (The default maximum number of columns is 76, so this example
works well for email and console use. See the module's documentation for other
configuration options.)</p>
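
<p>As a hedged sketch of one such option, the module documents a <code>ColMax</code> setting for the maximum number of columns. The 40-column limit and the sample text below are arbitrary choices for illustration:</p>

<pre><code> use strict;
 use warnings;
 use Unicode::LineBreak;          # from CPAN

 my $para = &quot;The quick brown fox jumps over the lazy dog. &quot; x 5;

 # wrap at 40 columns instead of the default 76
 my $fmt   = Unicode::LineBreak-&gt;new(ColMax =&gt; 40);
 my @lines = $fmt-&gt;break($para);   # list context: one element per line

 print @lines, &quot;\n&quot;;</code></pre>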

<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-locale-comparison.html">℞ 40: Case- and Accent-insensitive Locale Comparisons</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>
<p>Next: <a href="http://www.perl.com/pub/2012/06/perlunicook-unicode-text-in-stubborn-libraries.html">℞ 42: Unicode Text in Stubborn Libraries</a></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Case- and Accent-insensitive Locale Comparisons</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-locale-comparison.html" />
    <id>tag:www.perl.com,2012:/pub//2.2064</id>

    <published>2012-06-11T13:00:01Z</published>
    <updated>2012-07-10T20:31:29Z</updated>

    <summary>℞ 40: Case- and accent-insensitive locale comparisons You now know how to compare Unicode strings while ignoring case and accent differences. This approach uses the standard Unicode collation algorithm. To perform a similar comparison while respecting a specific locale&apos;s rules,...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<h2 id="Case--and-accent-insensitive-locale-comparisons">℞ 40: Case- <em>and</em> accent-insensitive locale comparisons</h2>

<p>You now know how to <a
href="http://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html">compare
Unicode strings while ignoring case and accent differences</a>. This approach
uses the standard Unicode collation algorithm. To perform a similar comparison
while respecting a specific locale's rules, use <a
href="http://search.cpan.org/perldoc?Unicode::Collate::Locale">Unicode::Collate::Locale</a>:</p>

<pre><code> use Unicode::Collate::Locale;

 my $de = Unicode::Collate::Locale-&gt;new(
            locale =&gt; &quot;de__phonebook&quot;,
            level  =&gt; 1,   # compare base letters only: ignore case and accents
          );

 # now this is true:
 $de-&gt;eq(&quot;tschüß&quot;, &quot;TSCHUESS&quot;);  # notice ü =&gt; UE, ß =&gt; SS</code></pre>

<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html">℞ 39: Case- and Accent-insensitive Comparison</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>
<p>Next: <a href="http://www.perl.com/pub/2012/06/perlunicook-unicode-linebreaking.html">℞ 41: Unicode Linebreaking</a></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Case- and Accent-insensitive Comparison</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html" />
    <id>tag:www.perl.com,2012:/pub//2.2056</id>

    <published>2012-06-08T13:00:01Z</published>
    <updated>2012-07-10T20:25:55Z</updated>

    <summary>℞ 39: Case- and accent-insensitive comparisons As you&apos;ve noticed by now, many Unicode strings have multiple possible representations. Comparing two Unicode strings for equality requires far more than merely comparing their codepoints. Not only must you account for multiple representations,...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<h2 id="Case--and-accent-insensitive-comparisons">℞ 39: Case- <em>and</em> accent-insensitive comparisons</h2>

<p>As you've noticed by now, many Unicode strings have multiple possible
representations. Comparing two Unicode strings for equality requires far more
than merely comparing their codepoints. Not only must you account for multiple
representations, you must decide which types of differences are significant: do
you care about the case of individual characters? How about the presence or
absence of accents?</p>

<p>Use a collator object to compare Unicode text by character instead of by
codepoint. To perform comparisons without regard for case or accent
differences, choose the appropriate comparison level. <a
href="http://search.cpan.org/perldoc?Unicode::Collate">Unicode::Collate</a>'s
<code>eq()</code> method offers customizable Unicode-aware equality:</p>

<pre><code> use Unicode::Collate;
 my $es = Unicode::Collate-&gt;new(
     level         =&gt; 1,      # base letters only: ignore accents and case
     normalization =&gt; undef   # skip normalization of the input strings
 );

 # now both are true:
 $es-&gt;eq(&quot;García&quot;,  &quot;GARCIA&quot; );
 $es-&gt;eq(&quot;Márquez&quot;, &quot;MARQUEZ&quot;);</code></pre>
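
<p>To see what "the appropriate comparison level" means in practice, here is a
small sketch that tests the same strings at levels 1 through 3. The helper
<code>equal_at_level()</code> is a hypothetical name for this example, not part
of <code>Unicode::Collate</code>. The sketch leaves normalization at its
default:</p>

<pre><code> use utf8;   # this source file contains literal non-ASCII characters
 use Unicode::Collate;

 # one collator per level: 1 = base letters, 2 = plus accents, 3 = plus case
 my %collator;
 $collator{$_} = Unicode::Collate-&gt;new( level =&gt; $_ ) for 1 .. 3;

 # hypothetical helper: are the two strings equal at the given level?
 sub equal_at_level {
     my ($level, $left, $right) = @_;
     return $collator{$level}-&gt;eq($left, $right) ? 1 : 0;
 }

 for my $level (1 .. 3) {
     printf &quot;level %d: accent ignored? %d  case ignored? %d\n&quot;,
         $level,
         equal_at_level($level, &quot;García&quot;, &quot;Garcia&quot;),
         equal_at_level($level, &quot;Garcia&quot;, &quot;GARCIA&quot;);
 }</code></pre>

<p>Level 1 ignores both kinds of difference, level 2 starts to distinguish
accents, and level 3 also distinguishes case. Pick the lowest level that still
separates the strings you consider different.</p>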

<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-make-cmp-work-on-text-instead-of-codepoints.html">℞ 38: Make cmp Work on Text instead of Codepoints</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>
<p>Next: <a href="http://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-locale-comparison.html">℞ 40: Case- and Accent-insensitive Locale Comparisons</a></p>]]>
        
    </content>
</entry>

<entry>
    <title>Perl Unicode Cookbook: Make cmp Work on Text instead of Codepoints</title>
    <link rel="alternate" type="text/html" href="http://www.perl.com/pub/2012/06/perlunicook-make-cmp-work-on-text-instead-of-codepoints.html" />
    <id>tag:www.perl.com,2012:/pub//2.2054</id>

    <published>2012-06-07T13:00:01Z</published>
    <updated>2012-07-10T20:25:44Z</updated>

    <summary>℞ 38: Making cmp work on text instead of codepoints Even with Perl 5.12&apos;s &quot;unicode_strings&quot; feature, some of Perl&apos;s core operations do not perform as expected on Unicode strings by default. For example, how is the cmp operator to know...</summary>
    <author>
        <name>Tom Christiansen</name>
        <uri>http://training.perl.com/</uri>
    </author>
    
    
    <content type="html" xml:lang="en" xml:base="http://www.perl.com/pub/">
        <![CDATA[<h2 id="Making-cmp-work-on-text-instead-of-codepoints">℞ 38: Making <code>cmp</code> work on text instead of codepoints</h2>

<p>Even with Perl 5.12's <a
href="http://perldoc.perl.org/feature.html#The-%27unicode_strings%27-feature">"unicode_strings"
feature</a>, some of Perl's core operations do not perform as expected on
Unicode strings by default. For example, how is the <code>cmp</code> operator
to know whether its arguments are octets, larger codepoints, or graphemes, or
whether a specific collation should be in effect?</p>

<p>Where you might write:</p>

<pre><code> @srecs = sort {
     $b-&gt;{AGE}   &lt;=&gt;  $a-&gt;{AGE}
                 ||
     $a-&gt;{NAME}  cmp  $b-&gt;{NAME}
 } @recs;</code></pre>

<p>... a Unicode-aware comparison should instead use <a
href="http://search.cpan.org/perldoc?Unicode::Collate">Unicode::Collate</a>:</p>

<pre><code> use Unicode::Collate;

 my $coll = Unicode::Collate-&gt;new();
 # compute each record's sort key once, before sorting
 for my $rec (@recs) {
     $rec-&gt;{NAME_key} = $coll-&gt;getSortKey( $rec-&gt;{NAME} );
 }
 @srecs = sort {
     $b-&gt;{AGE}       &lt;=&gt;  $a-&gt;{AGE}
                     ||
     $a-&gt;{NAME_key}  cmp  $b-&gt;{NAME_key}
 } @recs;</code></pre>

<p>This module's <code>getSortKey()</code> method returns a <a
href="http://www.unicode.org/reports/tr10/#Step_3">sort key</a> for a given
Unicode string that reflects the collator's collation rules (and collation
level). Sort keys are binary strings built so that plain <code>cmp</code>
orders them correctly.</p>
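
<p>For short lists, a simpler variation is to call the collator's own
<code>cmp()</code> method inside the sort block instead of precomputing keys.
The sketch below assumes the same record layout as above;
<code>sort_records()</code> and the sample data are invented for
illustration:</p>

<pre><code> use Unicode::Collate;

 my $coll = Unicode::Collate-&gt;new();

 # hypothetical helper: oldest first, then by name in collation order
 sub sort_records {
     my @recs = @_;
     my @sorted = sort {
         $b-&gt;{AGE}  &lt;=&gt;  $a-&gt;{AGE}
                    ||
         $coll-&gt;cmp( $a-&gt;{NAME}, $b-&gt;{NAME} )
     } @recs;
     return @sorted;
 }

 my @srecs = sort_records(
     { NAME =&gt; &quot;Zed&quot;,   AGE =&gt; 30 },
     { NAME =&gt; &quot;apple&quot;, AGE =&gt; 30 },
 );
 print &quot;$_-&gt;{NAME}\n&quot; for @srecs;</code></pre>

<p>The collator places "apple" before "Zed", where a plain codepoint
<code>cmp</code> would put the uppercase name first. The trade-off is that
<code>cmp()</code> rebuilds sort keys on every comparison, so precomputing them
with <code>getSortKey()</code> is the better choice for large lists.</p>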

<p>Previous: <a href="http://www.perl.com/pub/2012/06/perlunicook-unicode-locale-collation.html">℞ 37: Unicode Locale Collation</a></p>
<p>Series Index: <a href="http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html">The Standard Preamble</a></p>
<p>Next: <a href="http://www.perl.com/pub/2012/06/perlunicook-case--and-accent-insensitive-comparison.html">℞ 39: Case- and Accent-insensitive Comparison</a></p>]]>
        
    </content>
</entry>

</feed>
