Lexing Your Data
by Curtis PoeJanuary 05, 2006
s/(?<!SHOOTING YOURSELF IN THE )FOOT/HEAD/g
Most of us have tried at one time or another to use regular expressions to do things we shouldn't: parsing HTML, obfuscating code, washing dishes, etc. This is what the technical term "showing off" means. I've done it too:
$html =~ s{
(<a\s(?:[^>](?!href))*href\s*)
(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
([^>]+>)
}
{$1 . decode_entities($2) . $4}gsexi;
I was strutting like a peacock when I wrote that, followed quickly by eating crow when I ran it. I never did get that working right. I'm still not sure what I was trying to do. That regular expression forced me to learn how to use HTML::TokeParser. More importantly, that was the regular expression that taught me how difficult regular expressions can be.
The Problem with Regular Expressions
Look at that regex again:
/(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/
|
Related Reading
Mastering Regular Expressions |
Do you know that matches? Exactly? Are you sure? Even if it works, how easily can you modify it? If you don't know what it was trying to do (and to be fair, don't forget it's broken), how long did you spend trying to figure it out? When's the last time a single line of code gave you such fits?
The problem, of course, is that this regular expression is trying to do far more work than a single line of code is likely to do. When facing with a regular expression like that, there are a few things I like to do.
- Document it carefully.
- Use the
/xswitch so I can expand it over several lines. - Possibly, encapsulate it in a subroutine.
Sometimes, though, there's a fourth option: lexing.
Lexing
When developing code, we typically take a problem and break it down into a series of smaller problems that are easier to solve. Regular expressions are code and you can break them down into a series of smaller problems that are easier to solve. One technique is to use lexing to facilitate this.
Lexing is the act of breaking data down into discrete tokens and assigning meaning to those tokens. There's a bit of fudging in that statement, but it pretty much covers the basics.
Parsing typically follows lexing to convert the tokens into something more useful. Parsing is frequently the domain of some tool that applies a well-defined grammar to the lexed tokens.
Sometimes well-defined grammars are not practical for extracting and reporting information. There might not be a grammar available for a company's ad-hoc log file format. Other times you might find it easier to process the tokens manually then to spend the time writing a grammar. Still other times you might only care about part of the data you've lexed, not all of it. All three of these reasons apply to some problems.
Parsing SQL
Recently, on Perlmonks (parse a query string), someone had some SQL to parse:
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
when a=b then 'c'
else 'd'
end tough_one
from ...
where ...
The poster needed the alias for each column from that SQL. In this case, the aliases are date, months_old, product, year, and tough_one. Of course, this was only one example. There's actually plenty of generated SQL, all with subtle variations on the column aliases, so this is not a trivial task. What's interesting about this, though, is that we don't give a fig about anything except the column aliases. The rest of the text is merely there to help us find those aliases.
Your first thought might be to parse this with SQL::Statement. As it turns out, this module does not handle CASE statements. Thus, you must figure out how to patch SQL::Statement, submit said patch, and hope it gets accepted and released in a timely fashion. (Note that SQL::Statement uses SQL::Parser, so the latter is also not an option.)
Second, many of us have worked in environments where we have problems to solve in production now, but we still have to wait three weeks to get the necessary modules installed, if we can get them approved at all.
The most important reason, though, is even if SQL::Statement could handle this problem, this would be an awfully short article if you used it instead of a lexer.

