Sign In/My Account | View Cart  
advertisement


Listen Print Discuss

Lexing Your Data
by Curtis Poe | Pages: 1, 2, 3, 4

Here's how to process the tokens:

 1:  my $inside_parens = 0;
 2:  while ( defined (my $token = $lexer->()) ) {
 3:      my ($label, $value) = @$token;
 4:      $inside_parens += $value if 'PAREN' eq $label;
 5:      next if $inside_parens || 'TEXT' ne $label;
 6:      if (defined (my $next = $lexer->('peek'))) {
 7:          my ($next_label, $next_value) = @$next;
 8:          if ('COMMA' eq $next_label) {
 9:              print "$value\n";
10:          }
11:          elsif ('KEYWORD' eq $next_label && 'from' eq $next_value) {
12:              print "$value\n";
13:              last; # we're done
14:          }
15:      }
16:  }

This is pretty straightforward, but there are some tricky bits. Each token is a two-element array reference, so line 3 makes the label and value fairly explicit. Lines 4 and 5 use the "cheat" for handling parentheses. Line 5 also skips anything that isn't text and therefore cannot be a column alias.

Line 6 is a bit odd. In HOP::Lexer, passing the string peek to the lexer will return the next token without actually advancing the $lexer iterator. From there, it's straightforward logic to find out if the value is a column alias that matches the criteria.

Putting all of this together makes:

#!/usr/bin/perl

use strict;
use warnings;
use HOP::Lexer 'make_lexer';

my $sql = <<END_SQL;
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
  when a=b then 'c'
    else 'd'
      end tough_one
      from XXX
END_SQL

my @sql   = $sql;
my $lexer = make_lexer(
    sub { shift @sql },
    [ 'KEYWORD', qr/(?i:select|from|as)/          ],
    [ 'COMMA',   qr/,/                            ],
    [ 'OP',      qr{[-=+*/]}                      ],
    [ 'PAREN',   qr/\(/,      sub { [shift,  1] } ],
    [ 'PAREN',   qr/\)/,      sub { [shift, -1] } ],
    [ 'TEXT',    qr/(?:\w+|'\w+'|"\w+")/, \&text  ],
    [ 'SPACE',   qr/\s*/,     sub {}              ],
);

sub text {
    my ( $label, $value ) = @_;
    $value =~ s/^["']//;
    $value =~ s/["']$//;
    return [ $label, $value ];
}

my $inside_parens = 0;
while ( defined ( my $token = $lexer->() ) ) {
    my ( $label, $value ) = @$token;
    $inside_parens += $value if 'PAREN' eq $label;
    next if $inside_parens || 'TEXT' ne $label;
    if ( defined ( my $next = $lexer->('peek') ) ) {
        my ( $next_label, $next_value ) = @$next;
        if ( 'COMMA' eq $next_label ) {
            print "$value\n";
        }
        elsif ( 'KEYWORD' eq $next_label && 'from' eq $next_value ) {
            print "$value\n";
            last; # we're done
        }
    }
}

That prints out the column aliases:

date
months_old
product
year
tough_one

So are you done? No, probably not. What you really need now are many other examples of the SQL generated in the first problem statement. Maybe the &text subroutine is naive. Maybe there are other operators you forgot. Maybe there are floating-point numbers embedded in the SQL. When you have to lex data by hand, fine-tuning the lexer to match your actual data can take a few tries.

It's also important to note that precedence is very important here. &make_lexer evaluates each array reference passed in the order it receives them. If you passed the TEXT array reference before the KEYWORD array reference, the TEXT regular expression would match keywords before the KEYWORD could, thus generating spurious results.

Happy lexing!