Power Regexps, Part II
by Simon Cozens
split, //g and other shenanigans
In the previous article, we had a nice piece of multiline, formatted data, such as one might expect to parse with Perl:
Name: Mark-Jason Dominus
Occupation: Perl trainer
Favourite thing: Octopodes

Name: Simon Cozens
Occupation: Hacker
Favourite thing: Sleep
Now, there's a boring way to parse this. If you're coming from a C or Java background, then you might try:
my $record = {};
my @records;
for (split /\n/, $text) {
    if (/([^:]+): (.*)/) {
        $record->{$1} = $2;
    } elsif (/^\s*$/) {
        # Blank line => end of current record
        push @records, $record;
        $record = {};
    } else {
        die "Wasn't expecting to see '$_' here";
    }
}
# The last record may not be followed by a blank line
push @records, $record if %$record;
And, of course, this will work. But there are several more Perl-ish solutions than this. When you know the fields provided by your data, it's rather nice to have a regular expression that reflects the data structure:
while ($text =~ /Name:\s(.*)\n
                 Occupation:\s(.*)\n
                 Favourite.*:\s(.*)/gx) {
    push @records, { name => $1, occupation => $2, favourite => $3 };
}
Here we use the /g modifier, which allows us to resume the match from
where it last left off.
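To watch the resumption happen, here is a minimal sketch that matches one field per iteration and records where each match ended. The sample text and the @seen array are illustrative assumptions, not part of the parser above; pos() is the built-in that reports the offset at which the next /g match on a string will start.

```perl
use strict;
use warnings;

my $text = "Name: Simon Cozens\nOccupation: Hacker\nFavourite thing: Sleep\n";

# In scalar context, each successful /g match starts where the previous
# one ended; pos() reports that offset.
my @seen;
while ($text =~ /^([^:\n]+):\s*(.*)$/mg) {
    push @seen, [ $1, $2, pos($text) ];
}

printf "%-16s %-14s ends at offset %d\n", @$_ for @seen;
```

Once the match finally fails, the loop ends and the position is reset, so a later /g match on the same string starts from the beginning again.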
If we don't know the fields while we're writing our program, then we'll have to break the process up into two stages. First, we extract the individual records, which are delimited by a blank line:
my @texts = split /\n\s*\n/, $text;
And then for each record, we can either use the /g trick again, or
simply split each record into lines. I prefer the latter, for reasons
you'll see in a second:
for (@texts) {
    my $record = {};
    for (split /\n/, $_) {
        # Skip lines that don't look like "field: value"
        /([^:]+): (.*)/ or next;
        $record->{$1} = $2;
    }
    push @records, $record;
}
This is not dissimilar from the initial solution, but it allows us to
make some interesting improvements. For starters, when you see code
that transforms data with a for loop, you should wonder whether it
could be better written with a map statement. This goes double if
you're using push inside the for loop as we are here. So this
version is a natural evolution:
@records = map {
    my $record = {};
    for (split /\n/, $_) {
        # Skip lines that don't look like "field: value"
        /([^:]+): (.*)/ or next;
        $record->{$1} = $2;
    }
    $record;
} split /\n\s*\n/, $text;
And we can actually do away with the inner for loop too:
@records = map {
    # The leading + tells Perl that this { starts a hash, not a block
    +{
        map { /([^:]+): (.*)/ ? ($1 => $2) : () } split /\n/, $_
    }
} split /\n\s*\n/, $text;
But if we're prepared to be a little lax about trailing whitespace,
there's actually an even nicer way to do it, using the one thing that
everyone forgets about split: if your split pattern contains
parentheses, then the captured text is inserted into the list returned
by split. That is, the following code:
split( /(\W+)/, "perl-5.8.0.tar.gz")
will produce the list
("perl", "-", "5", ".", "8", ".", "0", ".", "tar", ".", "gz")
So we can actually use the field name, colon and space at the start of
each line as the split expression itself:
split /^([^:]+):\s*/m
There is a slight problem with this idea: because the first
thing in each record is the delimiter we're looking for, the first thing
returned by split will be an empty string. But we can easily get
around this by putting an undef at the front of the list, so that the
empty string becomes the value of a fake undef => '' hash element.
This allows us to reduce the parser code to:
@records = map {
    # The leading + tells Perl that this { starts a hash, not a block
    +{ undef, split /^([^:]+):\s*/m, $_ }
} split /\n\s*\n/, $text;
It may not be pretty, but it's quick and it works.
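For anyone who wants to try this on the sample data, here is a self-contained sketch of a slightly tidier variant. It rests on a few assumptions of my own: an empty string stands in for the undef (so that nothing complains under warnings), the fake element is deleted afterwards, and the newline that split leaves on each value is chomped away. None of that cleanup is part of the one-liner above.

```perl
use strict;
use warnings;

my $text = <<'END';
Name: Mark-Jason Dominus
Occupation: Perl trainer
Favourite thing: Octopodes

Name: Simon Cozens
Occupation: Hacker
Favourite thing: Sleep
END

my @records = map {
    # The empty string fills the slot that undef would otherwise take.
    my %record = ('', split /^([^:]+):\s*/m, $_);
    delete $record{''};          # drop the fake element
    chomp for values %record;    # trim the newline left on each value
    \%record;
} split /\n\s*\n/, $text;

for my $record (@records) {
    print "$_ = $record->{$_}\n" for sort keys %$record;
    print "\n";
}
```

The price of the cleanup is three extra statements, which is the trade-off hinted at by being "a little lax about trailing whitespace".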
Of course, you may also use lookahead and lookbehind assertions with
split; I sometimes use the following code to break a string into
tokens:
split /(?<=\W)|(?=\W)/, $string;
This is almost the same as
split /(\W)/, $string
but with a subtle difference. Because split treats every single nonword character as a delimiter, it returns an empty string between two adjacent nonword characters, and before a leading one:
split /(\W)/, '$foo := $bar';
# '', '$', 'foo', ' ', '', ':', '', '=', '', ' ', '', '$', 'bar'
Splitting on a word boundary goes too much the other way:
split /\b/, '$foo := $bar';
# '$', 'foo', ' := $', 'bar'
And so it turns out that we want to cleave the string where we've just seen a nonword character, or if we're about to see one:
split /(?<=\W)|(?=\W)/, $string;
# '$', 'foo', ' ', ':', '=', ' ', '$', 'bar'
And this gives us the sort of tokenisation we want.
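As a sketch of how this might be wrapped up for reuse, the following helper applies the same split and then throws away the tokens that are pure whitespace. The name tokenise and the whitespace filter are assumptions made for illustration; they do not come from any module.

```perl
use strict;
use warnings;

# tokenise() is a hypothetical helper name, not something from a module.
sub tokenise {
    my ($string) = @_;
    # Cut after every nonword character and before every nonword character,
    # then discard the tokens that are nothing but whitespace.
    return grep { /\S/ } split /(?<=\W)|(?=\W)/, $string;
}

my @tokens = tokenise('$foo := $bar');
print join(' | ', @tokens), "\n";
```

Note that := still comes back as two separate tokens, : and =, so a language with multi-character operators would need a further step to glue them back together.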
Regexp Modules
Now, though, we are getting into the sort of regular expressions that are not written lightly, and we may need some help constructing and debugging these expressions. Thankfully, there are plenty of modules which make regexp handling much easier for us.
re
The re module is as invaluable as it is obscure. It's one of those
hidden treasures of the Perl core that Casey was talking about last
month. As well as turning on two features of the regular expression
engine, tainting subexpressions and evaluated assertions, it provides a
debugging facility that allows you to watch your expression being
compiled and executed.
Here's a relatively simple expression:
$a =~ /([^:]+):\s*(.*)/;
When this code is run under -Mre=debug, then the following will be
printed when the regexp is compiled:
Compiling REx `([^:]+):\s*(.*)'
size 25 first at 4
1: OPEN1(3)
3: PLUS(13)
4: ANYOF[\0-9;-\377](0)
13: CLOSE1(15)
15: EXACT <:>(17)
17: STAR(19)
18: SPACE(0)
19: OPEN2(21)
21: STAR(23)
22: REG_ANY(0)
23: CLOSE2(25)
25: END(0)
This tells us the instructions for the little machine that the regular
expression compiler creates: it should first open a bracket, then go
into a loop (PLUS) finding characters that are ANYOF character
zero through to 9 and ; through to character 255 - that is,
everything apart from a :. Then we close the bracket, look for a
specific character, and so on. The number in brackets after each
instruction is the line to jump to on completion; when the
PLUS loop exits, it should go on to line 13, CLOSE1, and so on.
Next, when we try to run this match against some text:
$a = "Name: Mark-Jason Dominus";
It will first tell us something about the optimizations it performs:
Guessing start of match, REx `([^:]+):\s*(.*)' against `Name: ...'
Found floating substr `:' at offset 4...
Does not contradict STCLASS...
Guessed: match at offset 0
What this means is that it has found the constant element : in the
regular expression, and tries to locate that in the string, and then
work backward to find out where it should start the match. Since the
: is at position four in our string, it will go on to deduce that
the match should start at the beginning and...
Matching REx `([^:]+):\s*(.*)' against `Name: Mark-Jason Dominus'
Setting an EVAL scope, savestack=3
0 <> <Name: Mark-J> | 1: OPEN1
0 <> <Name: Mark-J> | 3: PLUS
ANYOF[\0-9;-\377] can match 4 times out of 32767...
The [^:] can match four times, since it knows there are four things
that are not colons there.
The re module is absolutely essential for heavy-duty study of how the
regular expression engine works, and why it doesn't do what you think it
should.
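If you would rather not trace every regexp in a program, the pragma can also be switched on from inside the code. The sketch below assumes Perl 5.10 or later, where use re 'debug' is lexically scoped, so only the match inside the bare block is traced. The trace goes to STDERR, and its exact format varies between Perl versions, so it may not look exactly like the listings above.

```perl
use strict;
use warnings;

my $line = "Name: Mark-Jason Dominus";
my ($field, $value);

{
    # Assumption: Perl 5.10 or later, where this pragma is lexically
    # scoped. The trace is written to STDERR.
    use re 'debug';
    ($field, $value) = $line =~ /([^:]+):\s*(.*)/;
}

# Regexps out here are compiled and run without tracing.
print "$field => $value\n" if $line =~ /^Name/;
```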
YAPE::Regex::Explain
The description given by re is a little low-level for some people;
well, most people. YAPE::Regex::Explain aims to put the explanation
at a much higher level; for instance,
% perl -MYAPE::Regex::Explain -e 'print
YAPE::Regex::Explain->new(qr/(?<=\W)|(?=\W)/)->explain'
will produce quite a verbose explanation of the regular expression like so:
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
(?<= look behind to see if there is:
----------------------------------------------------------------------
\W non-word characters (all but a-z, A-Z,
0-9, _)
----------------------------------------------------------------------
...
GraphViz::Regex
I find that one of the best ways to debug and understand a complex
procedure is to draw a picture. GraphViz::Regex uses the graphviz
visualization library to draw a state machine diagram for a given
regular expression:
use GraphViz::Regex;
my $regex = '(([abcd0-9])|(foo))';
my $graph = GraphViz::Regex->new($regex);
print $graph->as_png;
Regexp::Common
So much for explaining complicated regular expressions; what about
generating them? The Regexp::Common module aims to be a repository
for all kinds of commonly needed regular expressions, such as URIs,
balanced texts, domain names and IP addresses. The interface
is a little freaky, but it can hugely help to clarify complex regexps:
my $ts = qr/\d+:\d+:\d+\.\d+/;
$tcpdump =~ /$ts ($RE{net}{IPv4}) > ($RE{net}{IPv4}) : (tcp|udp) (\d+)/;
Text::Balanced
Finally, one particularly common family of things to match is quoted,
parenthesised or tagged text. Damian's Text::Balanced module helps
produce both regular expressions and subroutines to match and extract
balanced text sequences. For instance, we can create a regular
expression for matching double-quoted strings like so:
use Text::Balanced qw(gen_delimited_pat);
my $pat = gen_delimited_pat(q{"});
# (?:\"(?:[^\\\"]*(?:\\.[^\\\"]*)*)\")
This pattern will match quoted text, but will also be aware of escape
sequences like \" and \\, and hence not break off in the middle of
"\"So\", he said, \"How about lunch?\""
Text::Balanced also contains routines for extracting tagged text,
finding balanced pairs of parentheses, and much more.
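Here is a small sketch of both kinds of helper in use. gen_delimited_pat and extract_bracketed are real Text::Balanced functions; the sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat extract_bracketed);

# A pattern for double-quoted strings with backslash escapes.
my $pat = gen_delimited_pat(q{"});

my $code = q{print "\"So\", he said, \"How about lunch?\"" if $hungry;};
my ($quoted) = $code =~ /($pat)/;
print "Quoted string: $quoted\n";

# In list context, extract_bracketed returns the balanced chunk first,
# then whatever is left over after it.
my $input = '(a (nested) list) and more';
my ($chunk, $rest) = extract_bracketed($input, '()');
print "Bracketed: $chunk\n";
print "Remainder:$rest\n";
```

The generated pattern is handy when the quoted string is one piece of a larger regexp, while the extract_ routines suit text that has to be consumed one balanced chunk at a time.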
Summary
We've looked at some slightly more complex features of regular expressions,
and shown how we can use these to slice and dice text with Perl. As these
regexes get more complicated, the need for tools to help us debug them
increases, and so we've also looked at re, YAPE::Regex::Explain and
GraphViz::Regex. Finally, the Regexp::Common and Text::Balanced modules
help us create complex regular expressions of our own.

