Lexing Your Data
by Curtis Poe
|
Pages: 1, 2, 3, 4
Lexing Basics
As mentioned earlier, lexing is essentially the task of analyzing data and breaking it down into a series of easy-to-use tokens. While the data may be in other forms, usually this means analyzing strings. To give a trivial example, consider the expression:
x = (3 + 2) / y
When lexed, you might get a series of tokens, such as:
my @tokens = (
[ OP => 'x' ],
[ OP => '=' ],
[ OP => '(' ],
[ INT => '3' ],
[ VAR => '+' ],
[ INT => '2' ],
[ OP => ')' ],
[ OP => '/' ],
[ VAR => 'y' ],
);
With a proper grammar, you could then read this series of tokens and take actions based upon their values, perhaps to build a simple language interpreter or translate this code into another programming language. Even without a grammar, you can find these tokens useful.
Identifying Tokens
The first step in building a lexer is identifying the tokens you wish to parse. Look again at the SQL.
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
when a=b then 'c'
else 'd'
end tough_one
from ...
where ...
There's nothing really to care about anything after the from keyword. In looking at this closer, everything you do care about is immediately prior to a comma or the from keyword. However, splitting on commas isn't enough, as there are some commas embedded in function parentheses.
The first thing to do is to identify the various things you can match with simple regular expressions.
These "things" appear to be parentheses, commas, operators, keywords, and random text. A first pass at it might look something like this:
my $lparen = qr/\(/;
my $rparen = qr/\)/;
my $keyword = qr/(?i:select|from|as)/; # this is all this problem needs
my $comma = qr/,/;
my $text = qr/(?:\w+|'\w+'|"\w+")/;
my $op = qr{[-=+*/<>]};
The text matching is somewhat naive and you might want Regexp::Common for some of the regular expressions, but keep this simple for now.
The operators are a bit more involved. Assume that some SQL might have math statements embedded in them.
Now create the actual lexer. One way to do this is to make your own. It might look something like this:
sub lexer {
my $sql = shift;
return sub {
LEXER: {
return ['KEYWORD', $1] if $sql =~ /\G ($keyword) /gcx;
return ['COMMA', ''] if $sql =~ /\G ($comma) /gcx;
return ['OP', $1] if $sql =~ /\G ($op) /gcx;
return ['PAREN', 1] if $sql =~ /\G $lparen /gcx;
return ['PAREN', -1] if $sql =~ /\G $rparen /gcx;
return ['TEXT', $1] if $sql =~ /\G ($text) /gcx;
redo LEXER if $sql =~ /\G \s+ /gcx;
}
};
}
my $lexer = lexer($sql);
while (defined (my $token = $lexer->())) {
# do something with the token
}
Without going into the detail of how that works, it's fair to say that this is not the best solution. By looking at the original Perlmonks post, you should find that you need to make two passes through the data to extract what you want. I've left the explanation an exercise for the reader.

