Lexing Your Data
by Curtis Poe
|
Pages: 1, 2, 3, 4
Here's how to process the tokens:
1: my $inside_parens = 0;
2: while ( defined (my $token = $lexer->()) ) {
3: my ($label, $value) = @$token;
4: $inside_parens += $value if 'PAREN' eq $label;
5: next if $inside_parens || 'TEXT' ne $label;
6: if (defined (my $next = $lexer->('peek'))) {
7: my ($next_label, $next_value) = @$next;
8: if ('COMMA' eq $next_label) {
9: print "$value\n";
10: }
11: elsif ('KEYWORD' eq $next_label && 'from' eq $next_value) {
12: print "$value\n";
13: last; # we're done
14: }
15: }
16: }
This is pretty straightforward, but there are some tricky bits. Each token is a two-element array reference, so line 3 makes the label and value fairly explicit. Lines 4 and 5 use the "cheat" for handling parentheses. Line 5 also skips anything that isn't text and therefore cannot be a column alias.
Line 6 is a bit odd. In HOP::Lexer, passing the string peek to the lexer will return the next token without actually advancing the $lexer iterator. From there, it's straightforward logic to find out if the value is a column alias that matches the criteria.
Putting all of this together makes:
#!/usr/bin/perl
use strict;
use warnings;
use HOP::Lexer 'make_lexer';
my $sql = <<END_SQL;
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
when a=b then 'c'
else 'd'
end tough_one
from XXX
END_SQL
my @sql = $sql;
my $lexer = make_lexer(
sub { shift @sql },
[ 'KEYWORD', qr/(?i:select|from|as)/ ],
[ 'COMMA', qr/,/ ],
[ 'OP', qr{[-=+*/]} ],
[ 'PAREN', qr/\(/, sub { [shift, 1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],
[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text ],
[ 'SPACE', qr/\s*/, sub {} ],
);
sub text {
my ( $label, $value ) = @_;
$value =~ s/^["']//;
$value =~ s/["']$//;
return [ $label, $value ];
}
my $inside_parens = 0;
while ( defined ( my $token = $lexer->() ) ) {
my ( $label, $value ) = @$token;
$inside_parens += $value if 'PAREN' eq $label;
next if $inside_parens || 'TEXT' ne $label;
if ( defined ( my $next = $lexer->('peek') ) ) {
my ( $next_label, $next_value ) = @$next;
if ( 'COMMA' eq $next_label ) {
print "$value\n";
}
elsif ( 'KEYWORD' eq $next_label && 'from' eq $next_value ) {
print "$value\n";
last; # we're done
}
}
}
That prints out the column aliases:
date
months_old
product
year
tough_one
So are you done? No, probably not. What you really need now are many other examples of the SQL generated in the first problem statement. Maybe the &text subroutine is naive. Maybe there are other operators you forgot. Maybe there are floating-point numbers embedded in the SQL. When you have to lex data by hand, fine-tuning the lexer to match your actual data can take a few tries.
It's also important to note that precedence is very important here. &make_lexer evaluates each array reference passed in the order it receives them. If you passed the TEXT array reference before the KEYWORD array reference, the TEXT regular expression would match keywords before the KEYWORD could, thus generating spurious results.
Happy lexing!
You must be logged in to the O'Reilly Network to post a talkback.
Showing messages 1 through 2 of 2.
- Good article, small typo
2006-01-09 16:34:49 larryl [Reply]
- Good article, small typo
2006-01-09 16:36:20 larryl [Reply]
Dangit, hit return accidentally...
Anyway, good article, it inspired me to give this scheme a try.
FYI, in the example tokens, 'x' should be a 'VAR' and '+' should be an 'OP'.
Larry
- Good article, small typo



