A Refactoring Example
by Michael Schwern
|
Pages: 1, 2, 3, 4, 5
Eliminate Unnecessary Longhand
Boy, we sure use $page_ids[$idx] a lot. It's the current page
ID. But don't we have a variable for that?
Replace all the unnecessary array accesses and just use the more
concise and descriptive $page_id.
my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ?
SQL
my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data = ?
SQL
my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ? AND
data <= ?
SQL
open (INPUT, "< $filepageid") || &file_open_error("$filepageid");
while ($page_id=<INPUT>){
$idx++;
chomp($page_id);
$page_ids[$idx] = $page_id;
$hits_upto_sth->execute($page_id, $start_date);
$hits{upto}[$idx] = $hits_upto_sth->fetchrow_array();
$hits_daily_sth->execute($page_id, $today);
$hits{today}[$idx] = $hits_daily_sth->fetchrow_array();
$hits_daily_sth->execute($page_id, $yesterday);
$hits{yesterday}[$idx] = $hits_daily_sth->fetchrow_array();
$hits_range_sth->execute($page_id, $start_of_30_days,
$end_of_30_days,);
$hits{monthly}[$idx] = $hits_range_sth->fetchrow_array();
}
Test. Log. Check in.
Rearrange Data Structures to Fit Their Use
Currently, %hits is accessed by the order the page ID was read out
of the file. Well, that doesn't seem very useful at all. Its purpose
seems to be for listing the page counts in exactly the same order as
you read them in. Even then you need to iterate through @page_ids
simultaneously because no where in %hits is the page ID stored.
Consider a common operation, looking up the hit counts for a given page ID. You have to iterate through the whole list of page IDs to do it.
foreach my $idx (0..$#page_ids) {
if( $page_ids[$idx] eq $our_page_id ) {
print "Hits for $our_page_id today: $hits{today}[$idx]\n";
last;
}
}
Cumbersome. A much better layout would be a hash keyed on the page ID.
$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();
Now we can directly access the data for a given page ID. If
necessary, we can still list the hits in the same order they were read
in by iterating through @page_ids.
my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ?
SQL
my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data = ?
SQL
my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ? AND
data <= ?
SQL
open (INPUT, "< $filepageid") || &file_open_error("$filepageid");
while ($page_id=<INPUT>){
$idx++;
chomp($page_id);
$page_ids[$idx] = $page_id;
$hits_upto_sth->execute($page_id, $start_date);
$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();
$hits_daily_sth->execute($page_id, $today);
$hits{today}{$page_id} = $hits_daily_sth->fetchrow_array();
$hits_daily_sth->execute($page_id, $yesterday);
$hits{yesterday}{$page_id} = $hits_daily_sth->fetchrow_array();
$hits_range_sth->execute($page_id, $start_of_30_days,
$end_of_30_days,);
$hits{monthly}{$page_id} = $hits_range_sth->fetchrow_array();
}
Test. Log. Check in.
Eliminate Unnecessary Variables
Now that %hits is no longer ordered by how it was read in, $idx
isn't used much anymore. It's only used to stick $page_id onto the
end of @page_ids, but we can do that with push.
This is minor but little things build up to cause big messes.
my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ?
SQL
my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data = ?
SQL
my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ? AND
data <= ?
SQL
open (INPUT, "< $filepageid") || &file_open_error("$filepageid");
while ($page_id=<INPUT>){
chomp($page_id);
push @page_ids, $page_id;
$hits_upto_sth->execute($page_id, $start_date);
$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();
$hits_daily_sth->execute($page_id, $today);
$hits{today}{$page_id} = $hits_daily_sth->fetchrow_array();
$hits_daily_sth->execute($page_id, $yesterday);
$hits{yesterday}{$page_id} = $hits_daily_sth->fetchrow_array();
$hits_range_sth->execute($page_id, $start_of_30_days,
$end_of_30_days,);
$hits{monthly}{$page_id} = $hits_range_sth->fetchrow_array();
}
Test. Log. Check in.
Pull Logical Chunks Out into Functions
Our final refactoring is one of the most common and most useful.
Let's assume that we need to generate page counts somewhere else in the code. Rather than repeat the code to do this, we want to put it in a subroutine so it can be reused. One subroutine for each statement.
In order to do this, start by identifying the code that would go into the routine.
$hits_upto_sth->execute($page_id, $start_date);
$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();
Then wrap a subroutine around it.
sub hits_upto {
$hits_upto_sth->execute($page_id, $start_date);
$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();
}
Now look at all the variables used.
$hits_upto_sth is a global (well, file-scoped lexical) and is
defined entirely outside the function. We can keep using it in our
subroutine in the same way we are now.
$hits{upto}{$page_id} is receiving the result of the calculation.
It contains the return value. So it goes outside the function to
receive the return value. Where its assignment was, we put a return.
sub hits_upto {
$hits_upto_sth->execute($page_id, $start_date);
return $hits_upto_sth->fetchrow_array();
}
$page_id and $start_date vary from call to call. These are our
function arguments.
sub hits_upto {
my($page_id, $start_date) = @_;
$hits_upto_sth->execute($page_id, $start_date);
return $hits_upto_sth->fetchrow_array();
}
Finally, rename things in a more generic manner. This is a subroutine
for calculating the number of hits up to a certain date. Instead of
$start_date which was specific to one calculation, we'd call it
$date.
sub hits_upto {
my($page_id, $date) = @_;
$hits_upto_sth->execute($page_id, $date);
return $hits_upto_sth->fetchrow_array();
}
And that's our new subroutine, does the same thing as the original code. Then it's a simple matter to use it in the code.
$hits{upto}{$page_id} = hits_upto($page_id, $start_date);
my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ?
SQL
my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data = ?
SQL
my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM lognew
WHERE pageid = ? AND
data >= ? AND
data <= ?
SQL
open (INPUT, "< $filepageid") || &file_open_error("$filepageid");
while ($page_id=<INPUT>){
chomp($page_id);
push @page_ids, $page_id;
$hits{upto}{$page_id} = hits_upto($page_id, $start_date);
$hits{today}{$page_id} = hits_daily($page_id, $today);
$hits{yesterday}{$page_id} = hits_daily($page_id, $yesterday);
$hits{monthly}{$page_id} = hits_range($page_id, $start_of_30_days,
$end_of_30_days,);
}
sub hits_upto {
my($page_id, $date) = @_;
$hits_upto_sth->execute($page_id, $date);
return scalar $hits_upto_sth->fetchrow_array();
}
sub hits_daily {
my($page_id, $date) = @_;
$hits_daily_sth->execute($page_id, $date);
return scalar $hits_daily_sth->fetchrow_array();
}
sub hits_range {
my($page_id, $start, $end) = @_;
$hits_range_sth->execute($page_id, $start, $end);
return scalar $hits_range_sth->fetchrow_array();
}
Test. Log. Check in.
Undo.
Some may balk at putting that small of a snippet of code into a subroutine like that. There are definite performance concerns about adding four subroutine calls to a loop. But I'm not worried about that at all.
One of the beauties of Refactoring is that it's reversible. Refactorings don't change how the program works. We can reverse any of these refactorings and the code will work exactly the same. If a refactoring turns out to be a bad idea, undo it. Logging each refactoring in version control makes the job even easier.
So if it turns out moving the executes into their own functions causes a performance problem the change can easily be undone.
Done?
At this point, things are looking pretty nice. The code is well structured, readable, and efficient. The variables are sensibly named. The data is organized in a fairly flexible manner.
It's good enough. This is not to say that there's not more that could be done, but we don't need to. And Refactoring is about doing as much redesign as you need instead of what you might need.
Refactoring and the Swiss Army Knife
As programmers we have a tendency towards over-design. We like to design our code to deal with any possible situation that might arise, since it was hard to change the design later. This is known as Big Design Up Front (BDUF). It's like one of those enormous Swiss Army Knives with 50 functions. Most of the time all you need is a knife with something to open your beer with and then maybe pick your teeth afterwards but you never know. So you over-engineer because it's hard to improve it later. If it never gets used then a lot of effort has been wasted.
Refactoring turns design on its ear. Now you can continually evolve your design as needed. There's no longer a need to write for every possible situation up front so you can focus on just what you need right now. If you need more flexibility later, you can add that flexibility through refactoring. It's like having a Swiss Army knife that you can add tools to as you need them.
Further Reference
- The original thread on Fun With Perl
- The Portland Pattern Repository answers the question -- WhatIsRefactoring?
- The Refactoring Book by Martin Fowler

