Lexing Your Data
by Curtis Poe
To make this simpler, use the HOP::Lexer module from CPAN. This module, described by Mark Jason Dominus in his book Higher Order Perl, makes creating lexers nearly trivial and makes them a bit more powerful than the earlier example. Here's the new code:
use HOP::Lexer 'make_lexer';
my @sql = $sql;
my $lexer = make_lexer(
sub { shift @sql },
[ 'KEYWORD', qr/(?i:select|from|as)/ ],
[ 'COMMA', qr/,/ ],
[ 'OP', qr{[-=+*/]} ],
[ 'PAREN', qr/\(/, sub { [shift, 1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],
[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text ],
[ 'SPACE', qr/\s*/, sub {} ],
);
sub text {
my ($label, $value) = @_;
$value =~ s/^["']//;
$value =~ s/["']$//;
return [ $label, $value ];
}
This certainly doesn't look any easier to read, but bear with me.
The make_lexer subroutine takes as its first argument an iterator, which returns the next chunk of text to match on every call. In this case, you only have one snippet of text to match, so the iterator merely shifts it off an array. If you were reading lines from a log file, the iterator would be quite handy.
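To illustrate the log-file case, here is a minimal sketch of an iterator that hands back one line per call and undef once the input runs out. The in-memory filehandle and the sample lines are assumptions made so that the snippet runs on its own; a real program would open an actual file.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for a real log file.
my $log = "select a from t\nselect b from u\n";
open my $fh, '<', \$log or die "Cannot open in-memory file: $!";

# Each call returns the next line, or undef once the input is exhausted.
my $iterator = sub { scalar <$fh> };

my @chunks;
while ( defined( my $line = $iterator->() ) ) {
    push @chunks, $line;
}
print scalar(@chunks), " chunks read\n";
```

A closure of this shape should be usable as the first argument to make_lexer in place of sub { shift @sql }.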
After the first argument comes a series of array references. Each array reference holds two mandatory elements and one optional element:
[ $label, $pattern, $optional_subroutine ]
The $label is the name of the token. The pattern should match whatever the label identifies. The third argument, a subroutine reference, takes as arguments the label and the text the label matched, and returns whatever you wish for a token.
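The calling convention for the optional subroutine can be sketched without the lexer at all. The lower-casing handler below is a hypothetical example, not part of the lexer in this article; it only shows that the subroutine receives the label and the matched text and decides what the token looks like.

```perl
use strict;
use warnings;

# Hypothetical token definition: normalize keywords to lower case.
my $definition = [
    'KEYWORD',
    qr/(?i:select|from|as)/,
    sub { my ( $label, $value ) = @_; return [ $label, lc $value ] },
];

my ( $label, $pattern, $handler ) = @$definition;

# Call the handler by hand with the label and some text the pattern matches.
my $matched = 'SELECT';
die "pattern did not match" unless $matched =~ /\A$pattern\z/;
my $token = $handler->( $label, $matched );

print "$token->[0] $token->[1]\n";
```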
Consider a typical token definition passed to the make_lexer subroutine. It has only a label and a pattern:
[ 'KEYWORD', qr/(?i:select|from|as)/ ],
Here's an example of how to transform the data before making the token:
[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text ],
As mentioned previously, the regular expression might be naive, but leave that for now and focus on the &text subroutine.
sub text {
my ($label, $value) = @_;
$value =~ s/^["']//;
$value =~ s/["']$//;
return [ $label, $value ];
}
This says, "Take the label and the value, strip leading and trailing quotes from the value and return them in an array reference."
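As a quick check, the subroutine can be called directly on a few sample values. The inputs here are made up for illustration; single-quoted, double-quoted, and bare words should all come back as the same value.

```perl
use strict;
use warnings;

sub text {
    my ( $label, $value ) = @_;
    $value =~ s/^["']//;    # drop a leading quote, if any
    $value =~ s/["']$//;    # drop a trailing quote, if any
    return [ $label, $value ];
}

# Hypothetical sample inputs: the same word in three quoting styles.
my @tokens = map { text( 'TEXT', $_ ) } q{'date'}, q{"date"}, q{date};

print join( ' ', map { "$_->[0]=$_->[1]" } @tokens ), "\n";
```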
To strip the white space you don't care about, simply return nothing:
[ 'SPACE', qr/\s*/, sub {} ],
Now that you have your lexer, put it to work. Remember that column aliases are the TEXT not in parentheses, but immediately prior to commas or the from keyword. How do you know if you're inside of parentheses? Cheat a little bit:
[ 'PAREN', qr/\(/, sub { [shift, 1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],
With that, you can add one to a running total whenever you get to an opening parenthesis and subtract one when you get to a closing one. Whenever the total is zero, you know that you're outside of parentheses.
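The counting trick can be tried on a hand-written token list before wiring it to the lexer. This is a minimal sketch; the tokens below are assumed sample data in the shape the PAREN rules produce, and the names $depth and @outside are made up for the example.

```perl
use strict;
use warnings;

# Hand-written tokens standing in for lexer output: round(x), y
my @tokens = (
    [ 'TEXT',  'round' ],
    [ 'PAREN', 1 ],
    [ 'TEXT',  'x' ],
    [ 'PAREN', -1 ],
    [ 'COMMA', ',' ],
    [ 'TEXT',  'y' ],
);

my $depth = 0;
my @outside;    # TEXT values seen while $depth is zero

for my $token (@tokens) {
    my ( $label, $value ) = @$token;
    if ( $label eq 'PAREN' ) {
        $depth += $value;    # +1 for "(", -1 for ")"
        next;
    }
    push @outside, $value if $label eq 'TEXT' && $depth == 0;
}

print "@outside\n";
```

Here x is skipped because it appears while the running total is one.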
To get the tokens, call the $lexer iterator repeatedly.
while ( defined( my $token = $lexer->() ) ) { ... }
The tokens look like this:
[ 'KEYWORD', 'select' ]
[ 'TEXT', 'the_date' ]
[ 'KEYWORD', 'as' ]
[ 'TEXT', 'date' ]
[ 'COMMA', ',' ]
[ 'TEXT', 'round' ]
[ 'PAREN', 1 ]
[ 'TEXT', 'months_between' ]
[ 'PAREN', 1 ]
And so on.
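Putting the pieces together, the loop below is one possible sketch of how such tokens might be consumed to pick out column aliases. It does not use HOP::Lexer: the $lexer here is a stand-in iterator over hand-written tokens, and the query, the variable names, and the stopping rule are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Stand-in for the real lexer: an iterator over hand-written tokens for
#   select the_date as date, round(x) months from t
my @queue = (
    [ 'KEYWORD', 'select' ], [ 'TEXT', 'the_date' ], [ 'KEYWORD', 'as' ],
    [ 'TEXT', 'date' ], [ 'COMMA', ',' ], [ 'TEXT', 'round' ],
    [ 'PAREN', 1 ], [ 'TEXT', 'x' ], [ 'PAREN', -1 ],
    [ 'TEXT', 'months' ], [ 'KEYWORD', 'from' ], [ 'TEXT', 't' ],
);
my $lexer = sub { shift @queue };

my $depth = 0;
my $last_text;    # most recent TEXT seen outside parentheses
my @aliases;

while ( defined( my $token = $lexer->() ) ) {
    my ( $label, $value ) = @$token;
    if ( $label eq 'PAREN' ) {
        $depth += $value;
    }
    elsif ( $label eq 'TEXT' ) {
        $last_text = $value if $depth == 0;
    }
    elsif ( $depth == 0
        && ( $label eq 'COMMA' || lc($value) eq 'from' ) )
    {
        # TEXT immediately before a comma or "from" is treated as an alias.
        push @aliases, $last_text if defined $last_text;
        last if $label eq 'KEYWORD';    # stop once "from" is reached
    }
}

print "@aliases\n";
```

Note that a column with no alias would be reported under its own name by this sketch, since it is also TEXT immediately prior to a comma.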

