Optimizing Your Perl
Choosing the right data structure
by Robert SpierFebruary 12, 2002
Is your Perl program taking too long to run? This might be because you've chosen a data structure or algorithm that takes a long time to run. By rethinking how you've implemented a function, you might be able to realize huge gains in speed.
Some Simple Complexity Theory
Before we can talk about speeding up something, we need a way to describe how long something takes. Because we're talking about algorithms that may have varying amounts of input, the actual "time" to do something isn't conclusive. Computer scientists and mathematicians use a system called big-O notation to describe the order of magnitude of how long something will take. Big-O notation represents a worst-case analysis. There are other notations to represent the magnitude of minimum and actual runtimes.
Don't let the talk of computer scientists and mathematicians scare you. The next few paragraphs talk about a way to describe the difference between orders of magnitude such as a second, minute, hour or day. Or 1, 10, 100 and 1,000. You don't need to know any fancy or scary math to understand it.
|
Related Reading
Perl in a Nutshell, 2nd Edition |
It's simple. If the runtime of a function is constant, then this is written as O(1). (Note the capital or "big" O.) Constant means that no matter how much data is processed, (i.e., how much input there is) the time doesn't change.
If the runtime is directly related to the amount of input
(a.k.a. linear), then this is O(n). If there is twice as much input, then the
algorithm will take twice as long to run. Some functions are O(n^2)
(quadratic) or O(n^3) (cubic) or even worse.
Because we are only looking at orders of magnitude (is the operation
going to take one minute or one hour) we can ignore constant factors and
smaller terms. O(2n) and O(n) are equivalent.
As are O(n+1) and
O(n) or O(n^2+n) and O(n^2). The smaller term
is insignificant
compared to the larger.
There are rules for analyzing code and determining its big-O, but the simplest way is to look at how many times you iterate over the data. if you don't iterate, then it's O(1). If you iterate over it once, then it's O(n). If you have two nested loops, then it's probably O(n^2).

Example #1: Graphing
I developed a new graph module for a project at my day job, because the existing one was overkill, yet did not support a feature I needed. It was easy to implement and worked very well, until the graph got very big. I needed flattened subgraphs to do dependency checks, but it was taking more than two minutes to compute (on a 1Ghz P3) for a 1,000 node graph. I needed the computation to be much faster, or my users would get mad.
My initial Graph module looked like this. (Simplified to only show relevant parts.)
package Graph; # for Directed Graphs
use strict;
sub new {
my $this = { edges => [],
vertices => {} };
bless $this, "Graph";
}
sub add_edge { # from x to y
my ($this,$x,$y) = @_;
$this->{vertices}{$x}++;
$this->{vertices}{$y}++;
push @{$this->{edges}},[$x=>$y];
}
sub in_edges {
my ($this,$v) = @_;
grep { $_->[1] eq $v } @{$this->{edges}};
}
The add_edge method is O(1) because it will always take the same
amount of time no matter how many nodes/edges are in the graph. The
in_edges method is O(n) because it has to iterate over each edge in
the graph. (The iteration is "inside" the grep.)
The problematic section of the code looked like this:
sub flat_subgraph {
my ($graph,$start) = @_;
my %seen;
my @worklist = ($start);
while (@worklist) {
my $x = shift @worklist;
push @worklist, $graph->in_edges($x)
unless $seen{$x}++;
}
return keys %seen;
}
And I actually needed to do that for multiple vertices;
my %dependencies;
for (keys %{$graph->{vertices}}) {
$dependencies{$_} = [ flat_subgraph( $graph, $_ ) ];
}
This would make the entire flattening operation O(n^3). ( The dependency loop is O(n), flat_subgraph is effectively O(n), and in_edges is O(n)). That means that it takes much longer to compute as the Graph becomes bigger. (Imagine a curve with the following values: 1,8,27,64..., that's the curve of relative times.)
Caching
With my users expecting instant feedback, something had to be done. The first thing I tried was to cache the in_edges computation.
sub in_edges {
my ($this,$v) = @_;
if (exists $this->{cache}{in_edges}{$v}) {
return @{$this->{cache}{in_edges}{$v}};
} else {
my @t = grep { $_->[1] eq $v } @{$this->{edges}};
$this->{cache}{in_edges}{$v} = [@t];
return @t;
}
}
This helped, but not as much as I had hoped. The in_edges calculation
was still expensive when it wasn't cached. It's still O(n) in
the worst case, but becomes O(1) when cached. On average, that means
the entire flattening operation is somewhere between O(n^2) and O(n^3),
which is still potentially slow.
In situations where the same functions are called with the same arguments, caching can be useful. Mark-Jason Dominus has written a module called Memoize, which makes it easy to write caching functions. See his Web site http://perl.plover.com/MiniMemoize/ for more information.
Change the Structure
The next thing I tried was to change the Graph data structure. When I originally designed it, my goal was simplicity, not necessarily speed. Now speed was becoming important and I had to make changes.
I modified graph so that in_edges were calculated when the new edge was being added.
package Graph; # for Directed Graphs
use strict;
sub new {
my $this = { edges => [],
vertices => {},
in_edges => {} };
bless $this, "Graph";
}
sub add_edge { # from x to y
my ($this,$x,$y) = @_;
$this->{vertices}{$x}++;
$this->{vertices}{$y}++;
push @{$this->{in_edges}{$y}}, $x;
push @{$this->{edges}},[$x=>$y];
}
sub in_edges {
my ($this,$v) = @_;
return @{$this->{in_edges}{$v}};
}
add_edge remains O(1) - it still executes in a constant time.
in_edges is O(1) as well now; its runtime will not vary based on the
number of edges.
This was the solution I needed, getting a full flattening down from more than four minutes to only six seconds.
Pages: 1, 2 |


