Ten Essential Development Practices


The following ten tips come from Perl Best Practices, a new book of Perl coding and development guidelines by Damian Conway.

1. Design the Module's Interface First

The most important aspect of any module is not how it implements the facilities it provides, but the way in which it provides those facilities in the first place. If the module's API is too awkward, or too complex, or too extensive, or too fragmented, or even just poorly named, developers will avoid using it. They'll write their own code instead. In that way, a poorly designed module can actually reduce the overall maintainability of a system.

Designing module interfaces requires both experience and creativity. Perhaps the easiest way to work out how an interface should work is to "play test" it: to write examples of code that will use the module before implementing the module itself. These examples will not be wasted when the design is complete. You can usually recycle them into demos, documentation examples, or the core of a test suite.

The key, however, is to write that code as if the module were already available, and write it the way you'd most like the module to work.
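
For instance, a play test for a hypothetical text-tidying module might be nothing more than the following (the module name and its interface are invented here purely for illustration):

# Wishful thinking: code for the module we *wish* existed...
use Text::Tidier qw( tidy );

# The ideal interface: one exported subroutine, named options, sane defaults...
my $neat = tidy( $messy_text, indent => 4, max_width => 78 );

print $neat;

If writing that code feels natural, the interface is probably right; if the play test is already awkward to write, the real module will be worse.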

Once you have some idea of the interface you want to create, convert your "play tests" into actual tests (see Tip #2). Then it's just a Simple Matter Of Programming to make the module work the way that the code examples and the tests want it to.

Of course, it may not be possible for the module to work the way you'd most like, in which case attempting to implement it that way will help you determine what aspects of your API are not practical, and allow you to work out what might be an acceptable alternative.

2. Write the Test Cases Before the Code

Probably the single best practice in all of software development is writing your test suite first.

A test suite is an executable, self-verifying specification of the behavior of a piece of software. If you have a test suite, you can--at any point in the development process--verify that the code works as expected. If you have a test suite, you can--after any changes during the maintenance cycle--verify that the code still works as expected.

Write the tests first. Write them as soon as you know what your interface will be (see Tip #1). Write them before you start coding your application or module. Unless you have tests, you have no unequivocal specification of what the software should do, and no way of knowing whether it does it.

Writing tests always seems like a chore, and an unproductive chore at that: you don't have anything to test yet, so why write tests? Yet most developers will--almost automatically--write driver software to test their new module in an ad hoc way:

> cat try_inflections.pl

# Test my shiny new English inflections module...

use Lingua::EN::Inflect qw( inflect );

# Try some plurals (both standard and unusual inflections)...

my %plural_of = (
   'house'         => 'houses',
   'mouse'         => 'mice',
   'box'           => 'boxes',
   'ox'            => 'oxen',
   'goose'         => 'geese',
   'mongoose'      => 'mongooses', 
   'law'           => 'laws',
   'mother-in-law' => 'mothers-in-law',
);
 
# For each of them, print both the expected result and the actual inflection...

for my $word ( keys %plural_of ) {
   my $expected = $plural_of{$word};
   my $computed = inflect( "PL_N($word)" );
 
   print "For $word:\n", 
         "\tExpected: $expected\n",
         "\tComputed: $computed\n";
}

A driver like that is actually harder to write than a test suite, because you have to worry about formatting the output in a way that is easy to read. It's also much harder to use the driver than it would be to use a test suite, because every time you run it you have to wade through that formatted output and verify "by eye" that everything is as it should be. That's also error-prone; eyes are not optimized for picking out small differences in the middle of large amounts of nearly identical text.

Instead of hacking together a driver program, it's easier to write a test program using the standard Test::Simple module. Instead of print statements showing what's being tested, you just write calls to the ok() subroutine, specifying as its first argument the condition under which things are okay, and as its second argument a description of what you're actually testing:

> cat inflections.t

use Lingua::EN::Inflect qw( inflect );

use Test::Simple qw( no_plan );

my %plural_of = (
   'mouse'         => 'mice',
   'house'         => 'houses',
   'ox'            => 'oxen',
   'box'           => 'boxes',
   'goose'         => 'geese',
   'mongoose'      => 'mongooses', 
   'law'           => 'laws',
   'mother-in-law' => 'mothers-in-law',
);

for my $word ( keys %plural_of ) {
   my $expected = $plural_of{$word};
   my $computed = inflect( "PL_N($word)" );

   ok( $computed eq $expected, "$word -> $expected" );
}

Note that this code loads Test::Simple with the argument qw( no_plan ). Normally that argument would be tests => count, indicating how many tests to expect, but here the tests are generated from the %plural_of table at run time, so the final count will depend on how many entries are in that table. Specifying a fixed number of tests when loading the module is useful if you happen to know that number at compile time, because then the module can also "meta-test": verify that you carried out all the tests you expected to.
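
For example, if this particular file were guaranteed to run exactly eight tests, it could declare that expectation when the module is loaded:

use Test::Simple tests => 8;    # Plan declared: complain unless exactly 8 tests run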

The Test::Simple program is slightly more concise and readable than the original driver code, and the output is much more compact and informative:

> perl inflections.t

ok 1 - house -> houses
ok 2 - law -> laws
not ok 3 - mongoose -> mongooses
#     Failed test (inflections.t at line 21)
ok 4 - goose -> geese
ok 5 - ox -> oxen
not ok 6 - mother-in-law -> mothers-in-law
#     Failed test (inflections.t at line 21)
ok 7 - mouse -> mice
ok 8 - box -> boxes
1..8
# Looks like you failed 2 tests of 8. 

More importantly, this version requires far less effort to verify the correctness of each test. You just scan down the left margin looking for a "not ok" and its associated comment line.

You might prefer to use the Test::More module instead of Test::Simple. Then you can specify the actual and expected values separately, by using the is() subroutine, rather than ok():

use Lingua::EN::Inflect qw( inflect );
use Test::More qw( no_plan ); # Now using more advanced testing tools

my %plural_of = (
   'mouse'         => 'mice',
   'house'         => 'houses',
   'ox'            => 'oxen',
   'box'           => 'boxes',
   'goose'         => 'geese',
   'mongoose'      => 'mongooses', 
   'law'           => 'laws',
   'mother-in-law' => 'mothers-in-law',
);

for my $word ( keys %plural_of ) {
   my $expected = $plural_of{$word};
   my $computed = inflect( "PL_N($word)" );

   # Test expected and computed inflections for string equality...
   is( $computed, $expected, "$word -> $expected" );
}

Apart from no longer having to type the eq yourself, this version also produces more detailed error messages:

> perl inflections.t

ok 1 - house -> houses
ok 2 - law -> laws
not ok 3 - mongoose -> mongooses
#     Failed test (inflections.t at line 20)
#          got: 'mongeese'
#     expected: 'mongooses'
ok 4 - goose -> geese
ok 5 - ox -> oxen
not ok 6 - mother-in-law -> mothers-in-law
#     Failed test (inflections.t at line 20)
#          got: 'mothers-in-laws'
#     expected: 'mothers-in-law'
ok 7 - mouse -> mice
ok 8 - box -> boxes
1..8
# Looks like you failed 2 tests of 8.

The Test::Tutorial documentation that comes with Perl 5.8 provides a gentle introduction to both Test::Simple and Test::More.

3. Create Standard POD Templates for Modules and Applications

One of the main reasons documentation can often seem so unpleasant is the "blank page effect." Many programmers simply don't know how to get started or what to say.

Perhaps the easiest way to make writing documentation less forbidding (and hence, more likely to actually occur) is to circumvent that initial empty screen by providing a template that developers can cut and paste into their code.

For a module, that documentation template might look something like this:

=head1 NAME

<Module::Name> - <One-line description of module's purpose>

=head1 VERSION

The initial template usually just has:

This documentation refers to <Module::Name> version 0.0.1.

=head1 SYNOPSIS

   use <Module::Name>;

   # Brief but working code example(s) here showing the most common usage(s)
   # This section will be as far as many users bother reading, so make it as
   # educational and exemplary as possible.

=head1 DESCRIPTION

A full description of the module and its features.

May include numerous subsections (e.g., =head2, =head3, etc.).

=head1 SUBROUTINES/METHODS

A separate section listing the public components of the module's interface.

These normally consist of either subroutines that may be exported, or methods
that may be called on objects belonging to the classes that the module
provides.

Name the section accordingly.

In an object-oriented module, this section should begin with a sentence (of the
form "An object of this class represents ...") to give the reader a high-level
context to help them understand the methods that are subsequently described.

=head1 DIAGNOSTICS

A list of every error and warning message that the module can generate (even
the ones that will "never happen"), with a full explanation of each problem,
one or more likely causes, and any suggested remedies.

=head1 CONFIGURATION AND ENVIRONMENT

A full explanation of any configuration system(s) used by the module, including
the names and locations of any configuration files, and the meaning of any
environment variables or properties that can be set. These descriptions must
also include details of any configuration language used.

=head1 DEPENDENCIES

A list of all of the other modules that this module relies upon, including any
restrictions on versions, and an indication of whether these required modules
are part of the standard Perl distribution, part of the module's distribution,
or must be installed separately.

=head1 INCOMPATIBILITIES

A list of any modules that this module cannot be used in conjunction with.
This may be due to name conflicts in the interface, or competition for system
or program resources, or due to internal limitations of Perl (for example, many
modules that use source code filters are mutually incompatible).

=head1 BUGS AND LIMITATIONS

A list of known problems with the module, together with some indication of
whether they are likely to be fixed in an upcoming release.

Also, a list of restrictions on the features the module does provide: data types
that cannot be handled, performance issues and the circumstances in which they
may arise, practical limitations on the size of data sets, special cases that
are not (yet) handled, etc.

The initial template usually just has:

There are no known bugs in this module.

Please report problems to <Maintainer name(s)> (<contact address>)

Patches are welcome.

=head1 AUTHOR

<Author name(s)>  (<contact address>)

=head1 LICENSE AND COPYRIGHT

Copyright (c) <year> <copyright holder> (<contact address>).
All rights reserved.

followed by whatever license you wish to release it under.

For Perl code that is often just:

This module is free software; you can redistribute it and/or modify it under
the same terms as Perl itself. See L<perlartistic>.  This program is
distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.

Of course, the specific details that your templates provide may vary from those shown here, according to your other coding practices. The most likely variation will be in the license and copyright, but you may also have specific in-house conventions regarding version numbering, the grammar of diagnostic messages, or the attribution of authorship.

4. Use a Revision Control System

Maintaining control over the creation and modification of your source code is utterly essential for robust team-based development. And not just over source code: you should be revision controlling your documentation, and data files, and document templates, and makefiles, and style sheets, and change logs, and any other resources your system requires.

Just as you wouldn't use an editor without an Undo command or a word processor that can't merge documents, so too you shouldn't use a file system you can't rewind, or a development environment that can't integrate the work of many contributors.

Programmers make mistakes, and occasionally those mistakes will be catastrophic. They will reformat the disk containing the most recent version of the code. Or they'll mistype an editor macro and write zeros all through the source of a critical core module. Or two developers will unwittingly edit the same file at the same time and half their changes will be lost. Revision control systems can prevent those kinds of problems.

Moreover, occasionally the very best debugging technique is to just give up, stop trying to get yesterday's modifications to work correctly, roll the code back to a known stable state, and start over again. Less drastically, comparing the current condition of your code with the most recent stable version from your repository (even just a line-by-line diff) can often help you isolate your recent "improvements" and work out which of them is the problem.
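
For instance, under Subversion that line-by-line comparison is a single command (the module path here is purely illustrative):

> svn diff lib/Lingua/EN/Inflect.pm

which prints exactly what has changed in your working copy since the last committed revision.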

Revision control systems such as RCS, CVS, Subversion, Monotone, darcs, Perforce, GNU arch, or BitKeeper can protect against calamities, and ensure that you always have a working fallback position if maintenance goes horribly wrong. The various systems have different strengths and limitations, many of which stem from fundamentally different views on what exactly revision control is. It's a good idea to audition the various revision control systems, and find the one that works best for you. Pragmatic Version Control Using Subversion, by Mike Mason (Pragmatic Bookshelf, 2005) and Essential CVS, by Jennifer Vesperman (O'Reilly, 2003) are useful starting points.

5. Create Consistent Command-Line Interfaces

Command-line interfaces have a strong tendency to grow over time, accreting new options as you add features to the application. Unfortunately, the evolution of such interfaces is rarely designed, managed, or controlled, so the set of flags, options, and arguments that a given application accepts are likely to be ad hoc and unique.

This also means they're likely to be inconsistent with the unique ad hoc sets of flags, options, and arguments that other related applications provide. The result is inevitably a suite of programs, each of which is driven in a distinct and idiosyncratic way. For example:

> orchestrate source.txt -to interim.orc

> remonstrate +interim.rem -interim.orc 

> fenestrate  --src=interim.rem --dest=final.wdw
Invalid input format

> fenestrate --help
Unknown option: --help.
Type 'fenestrate -hmo' for help

Here, the orchestrate utility expects its input file as its first argument, while the -to flag specifies its output file. The related remonstrate tool uses -infile and +outfile options instead, with the output file coming first. The fenestrate program seems to require GNU-style "long options": --src=infile and --dest=outfile, except, apparently, for its oddly named help flag. All in all, it's a mess.

When you're providing a suite of programs, all of them should appear to work the same way, using the same flags and options for the same features across all applications. This enables your users to take advantage of existing knowledge--instead of continually asking you.

Those three programs should work like this:

> orchestrate -i source.txt -o dest.orc

> remonstrate -i source.orc -o dest.rem

> fenestrate  -i source.rem -o dest.wdw
Input file ('source.rem') not a valid Remora file
(type "fenestrate --help" for help)

> fenestrate --help
fenestrate - convert Remora .rem files to Windows .wdw format
Usage: fenestrate [-i <infile>] [-o <outfile>] [-cstq] [-h|-v]
Options:
   -i <infile>    Specify input source            [default: STDIN]
   -o <outfile>   Specify output destination      [default: STDOUT]
   -c             Attempt to produce a more compact representation
   -h             Use horizontal (landscape) layout
   -v             Use vertical (portrait) layout
   -s             Be strict regarding input
   -t             Be extra tolerant regarding input
   -q             Run silent
   --version      Print version information
   --usage        Print the usage line of this summary
   --help         Print this summary
   --man          Print the complete manpage

Here, every application that takes input and output files uses the same two flags to do so. A user who wants to use the substrate utility (to convert that final .wdw file to a subroutine) is likely to be able to guess correctly the required syntax:

> substrate  -i dest.wdw -o dest.sub

Anyone who can't guess that probably can guess that:

> substrate --help

is likely to render aid and comfort.

A large part of making interfaces consistent is being consistent in specifying the individual components of those interfaces. Some conventions that may help to design consistent and predictable interfaces include the following (a sketch of how they might be supported in code appears after the list):

  • Require a flag preceding every piece of command-line data, except filenames.

    Users don't want to have to remember that your application requires "input file, output file, block size, operation, fallback strategy," and requires them in that precise order:

    > lustrate sample_data proc_data 1000 normalize log

    They want to be able to say explicitly what they mean, in any order that suits them:

    > lustrate sample_data proc_data -op=normalize -b1000 --fallback=log
  • Provide a flag for each filename, too, especially when a program can be given files for different purposes.

    Users might also not want to remember the order of the two positional filenames, so let them label those arguments as well, and specify them in whatever order they prefer:

    > lustrate -i sample_data -op normalize -b1000 --fallback log -o proc_data
  • Use a single - prefix for short-form flags, up to three letters (-v, -i, -rw, -in, -out).

    Experienced users appreciate short-form flags as a way of reducing typing and limiting command-line clutter. Don't make them type two dashes in these shortcuts.

  • Use a double -- prefix for longer flags (--verbose, --interactive, --readwrite, --input, --output).

    Flags that are complete words improve the readability of a command line (in a shell script, for example). The double dash also helps to distinguish between the longer flag name and any nearby file names.

  • If a flag expects an associated value, allow an optional = between the flag and the value.

    Some people prefer to visually associate a value with its preceding flag:

    > lustrate -i=sample_data -op=normalize -b=1000 --fallback=log -o=proc_data

    Others don't:

    > lustrate -i sample_data -op normalize -b1000 --fallback log -o proc_data

    Still others want a bit each way:

    > lustrate -i sample_data -o proc_data -op=normalize -b=1000 --fallback=log

    Let the user choose.

  • Allow single-letter options to be "bundled" after a single dash.

    It's irritating to have to type repeated dashes for a series of flags:

    > lustrate -i sample_data -v -l -x

    Allow experienced users to also write:

    > lustrate -i sample_data -vlx
  • Provide a multi-letter version of every single-letter flag.

    Short-form flags may be nice for experienced users, but they can be troublesome for new users: hard to remember and even harder to recognize. Don't force people to do either. Give them a verbose alternative to every concise flag; full words that are easier to remember, and also more self-documenting in shell scripts.

  • Always allow - as a special filename.

    A widely used convention is that a dash (-) where an input file is expected means "read from standard input," and a dash where an output file is expected means "write to standard output."

  • Always allow -- as a file list marker.

    Another widely used convention is that the appearance of a double dash (--) on the command line marks the end of any flagged options, and indicates that the remaining arguments are a list of filenames, even if some of them look like flags.
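
Most of these conventions can be supported directly by the standard Getopt::Long module. The following is a rough sketch only, with illustrative option names, not a drop-in implementation:

# Enable bundling of single-letter flags (-vlx)...
use Getopt::Long qw( :config bundling );

GetOptions(
    'input|i=s'  => \my $input_file,     # -i <file> or --input=<file>
    'output|o=s' => \my $output_file,    # -o <file> or --output=<file>
    'verbose|v'  => \my $verbose,        # bundles with other short flags
    'help'       => \my $help,
) or die "Type '$0 --help' for help\n";

# Honor the "- means STDIN" convention for input files...
my $in_fh
    = !defined $input_file || $input_file eq q{-} ? \*STDIN
    : do { open my $fh, '<', $input_file or die "Can't open $input_file: $!"; $fh };

Getopt::Long also honors the -- end-of-options marker automatically, leaving any remaining arguments in @ARGV.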

6. Agree Upon a Coherent Layout Style and Automate It with perltidy

Formatting. Indentation. Style. Code layout. Whatever you choose to call it, it's one of the most contentious aspects of programming discipline. More and bloodier wars have been fought over code layout than over just about any other aspect of coding.

What is the best practice here? Should you use classic Kernighan and Ritchie style? Or go with BSD code formatting? Or adopt the layout scheme specified by the GNU project? Or conform to the Slashcode coding guidelines?

Of course not! Everyone knows that <insert your personal coding style here> is the One True Layout Style, the only sane choice, as ordained by <insert your favorite Programming Deity here> since Time Immemorial! Any other choice is manifestly absurd, willfully heretical, and self-evidently a Work of Darkness!

That's precisely the problem. When deciding on a layout style, it's hard to decide where rational choices end and rationalized habits begin.

Adopting a coherently designed approach to code layout, and then applying that approach consistently across all your coding, is fundamental to best-practice programming. Good layout can improve the readability of a program, help detect errors within it, and make the structure of your code much easier to comprehend. Layout matters.

However, most coding styles--including the four mentioned earlier--confer those benefits almost equally well. While it's true that having a consistent code layout scheme matters very much indeed, the particular code layout scheme you ultimately decide upon does not matter at all! All that matters is that you adopt a single, coherent style; one that works for your entire programming team, and, having agreed upon that style, that you then apply it consistently across all your development.

In the long term, it's best to train yourself and your team to code in a consistent, rational, and readable style. However, the time and commitment necessary to accomplish that isn't always available. In such cases, a reasonable compromise is to prescribe a standard code-formatting tool that must be applied to all code before it's committed, reviewed, or otherwise displayed in public.

There is now an excellent code formatter available for Perl: perltidy. It provides an extensive range of user-configurable options for indenting, block delimiter positioning, column-like alignment, and comment positioning.

Using perltidy, you can convert code like this:

if($sigil eq '$'){
   if($subsigil eq '?'){ 
       $sym_table{substr($var_name,2)}=delete $sym_table{locate_orig_var($var)};
       $internal_count++;$has_internal{$var_name}++
   } else {
       ${$var_ref} =
           q{$sym_table{$var_name}}; $external_count++; $has_external{$var_name}++;
}} elsif ($sigil eq '@'&&$subsigil eq '?') {
   @{$sym_table{$var_name}} = grep
       {defined $_} @{$sym_table{$var_name}};
} elsif ($sigil eq '%' && $subsigil eq '?') {
delete $sym_table{$var_name}{$EMPTY_STR}; } else
{
${$var_ref}
=
q{$sym_table{$var_name}}
}

into something readable:

if ( $sigil eq '$' ) {
   if ( $subsigil eq '?' ) {
       $sym_table{ substr( $var_name, 2 ) }
           = delete $sym_table{ locate_orig_var($var) };
       $internal_count++;
       $has_internal{$var_name}++;
   }
   else {
       ${$var_ref} = q{$sym_table{$var_name}};
       $external_count++;
       $has_external{$var_name}++;
   }
}
elsif ( $sigil eq '@' && $subsigil eq '?' ) {
   @{ $sym_table{$var_name} }
       = grep {defined $_} @{ $sym_table{$var_name} };
}
elsif ( $sigil eq '%' && $subsigil eq '?' ) {
   delete $sym_table{$var_name}{$EMPTY_STR};
}
else {
   ${$var_ref} = q{$sym_table{$var_name}};
}

Mandating that everyone use a common tool to format their code can also be a simple way of sidestepping the endless objections, acrimony, and dogma that always surround any discussion on code layout. If perltidy does all the work for them, then it will cost developers almost no effort to adopt the new guidelines. They can simply set up an editor macro that will "straighten" their code whenever they need to.
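
For example, with perltidy installed (and a shared .perltidyrc holding the team's agreed settings), straightening a file in place is a one-liner; the filename here is illustrative:

> perltidy -b lib/SymbolTable.pm

The -b option tells perltidy to modify the file in place, saving the original with a .bak extension.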

7. Code in Commented Paragraphs

A paragraph is a collection of statements that accomplish a single task: in literature, it's a series of sentences conveying a single idea; in programming, a series of instructions implementing a single step of an algorithm.

Break each piece of code into sequences that achieve a single task, placing a single empty line between each sequence. To further improve the maintainability of the code, place a one-line comment at the start of each such paragraph, describing what the sequence of statements does. Like so:

# Process an array that has been recognized...
sub addarray_internal {
   my ($var_name, $needs_quotemeta) = @_;

   # Cache the original...
   $raw .= $var_name;

   # Build meta-quoting code, if requested...
   my $quotemeta = $needs_quotemeta ?  q{map {quotemeta $_} } : $EMPTY_STR;

   # Expand elements of variable, conjoin with ORs...
   my $perl5pat = qq{(??{join q{|}, $quotemeta \@{$var_name}})};

   # Insert debugging code if requested...
   my $type = $quotemeta ? 'literal' : 'pattern';
   debug_now("Adding $var_name (as $type)");
   add_debug_mesg("Trying $var_name (as $type)");

   return $perl5pat;
}

Paragraphs are useful because humans can focus on only a few pieces of information at once. Paragraphs are one way of aggregating small amounts of related information, so that the resulting "chunk" can fit into a single slot of the reader's limited short-term memory. Paragraphs enable the physical structure of a piece of writing to reflect and emphasize its logical structure.

Adding comments at the start of each paragraph further enhances the chunking by explicitly summarizing the purpose of each chunk (note: the purpose, not the behavior). Paragraph comments need to explain why the code is there and what it achieves, not merely paraphrase the precise computational steps it's performing.

Note, however, that the contents of paragraphs are only of secondary importance here. It is the vertical gaps separating each paragraph that are critical. Without them, the readability of the code declines dramatically, even if the comments are retained:

sub addarray_internal {
   my ($var_name, $needs_quotemeta) = @_;
   # Cache the original...
   $raw .= $var_name;
   # Build meta-quoting code, if required...
   my $quotemeta = $needs_quotemeta ?  q{map {quotemeta $_} } : $EMPTY_STR;
   # Expand elements of variable, conjoin with ORs...
   my $perl5pat = qq{(??{join q{|}, $quotemeta \@{$var_name}})};
   # Insert debugging code if requested...
   my $type = $quotemeta ? 'literal' : 'pattern';
   debug_now("Adding $var_name (as $type)");
   add_debug_mesg("Trying $var_name (as $type)");
   return $perl5pat;
}

8. Throw Exceptions Instead of Returning Special Values or Setting Flags

Returning a special error value on failure, or setting a special error flag, is a very common error-handling technique. Collectively, they're the basis for virtually all error notification from Perl's own built-in functions. For example, the built-ins eval, exec, flock, open, print, stat, and system all return special values on error. Unfortunately, they don't all use the same special value. Some of them also set a flag on failure. Sadly, it's not always the same flag. See the perlfunc manpage for the gory details.

Apart from the obvious consistency problems, error notification via flags and return values has another serious flaw: developers can silently ignore flags and return values, and ignoring them requires absolutely no effort on the part of the programmer. In fact, in a void context, ignoring return values is Perl's default behavior. Ignoring an error flag that has suddenly appeared in a special variable is just as easy: you simply don't bother to check the variable.

Moreover, because ignoring a return value is the void-context default, there's no syntactic marker for it. There's no way to look at a program and immediately see where a return value is deliberately being ignored, which means there's also no way to be sure that it's not being ignored accidentally.
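
For example, nothing visibly distinguishes a call whose return value is deliberately being ignored from one where the check was simply forgotten (the $config_file variable here is illustrative):

# Void context: if this open fails, its false return value simply evaporates...
open my $fh, '<', $config_file;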

The bottom line: regardless of the programmer's (lack of) intention, an error indicator is being ignored. That's not good programming.

Ignoring error indicators frequently causes programs to propagate errors in entirely the wrong direction. For example:

# Find and open a file by name, returning the filehandle
# or undef on failure...
sub locate_and_open {
   my ($filename) = @_;

   # Check acceptable directories in order...
   for my $dir (@DATA_DIRS) {
       my $path = "$dir/$filename";

       # If file exists in an acceptable directory, open and return it...
       if (-r $path) {
           open my $fh, '<', $path;
           return $fh;
       }
   }

   # Fail if all possible locations tried without success...
   return;
}

# Load file contents up to the first <DATA/> marker...
sub load_header_from {
   my ($fh) = @_;

   # Use DATA tag as end-of-"line"...
   local $/ = '<DATA/>';

   # Read to end-of-"line"...
   return <$fh>;
}

# and later...
for my $filename (@source_files) {
   my $fh = locate_and_open($filename);
   my $head = load_header_from($fh);
   print $head;
}

The locate_and_open() subroutine simply assumes that the call to open works, immediately returning the filehandle ($fh), whatever the actual outcome of the open. Presumably, the expectation is that whoever calls locate_and_open() will check whether the return value is a valid filehandle.

Except, of course, "whoever" doesn't check. Instead of testing for failure, the main for loop takes the failure value and immediately propagates it "across" the block, to the rest of the statements in the loop. That causes the call to load_header_from() to propagate the error value "downwards." It's in that subroutine that the attempt to treat the failure value as a filehandle eventually kills the program:

readline() on unopened filehandle at demo.pl line 28.

Code like that--where an error is reported in an entirely different part of the program from where it actually occurred--is particularly onerous to debug.

Of course, you could argue that the fault lies squarely with whoever wrote the loop, for using locate_and_open() without checking its return value. In the narrowest sense, that's entirely correct--but the deeper fault lies with whoever actually wrote locate_and_open() in the first place, or at least, whoever assumed that the caller would always check its return value.

Humans simply aren't like that. Rocks almost never fall out of the sky, so humans soon conclude that they never do, and stop looking up for them. Fires rarely break out in their homes, so humans soon forget that they might, and stop testing their smoke detectors every month. In the same way, programmers inevitably abbreviate "almost never fails" to "never fails," and then simply stop checking.

That's why so very few people bother to verify their print statements:

if (!print 'Enter your name: ') {
   print {*STDLOG} warning => 'Terminal went missing!'
}

It's human nature to "trust but not verify."

Human nature is why returning an error indicator is not best practice. Errors are (supposed to be) unusual occurrences, so error markers will almost never be returned. Those tedious and ungainly checks for them will almost never do anything useful, so eventually they'll be quietly omitted. After all, leaving the tests off almost always works just fine. It's so much easier not to bother. Especially when not bothering is the default!

Don't return special error values when something goes wrong; throw an exception instead. The great advantage of exceptions is that they reverse the usual default behaviors, bringing untrapped errors to immediate and urgent attention. Ignoring an exception, by contrast, requires a deliberate and conspicuous effort: you have to provide an explicit eval block to neutralize it.
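
For example, the only way to quietly discard an exception is to say so, explicitly and visibly (a minimal sketch, using the locate_and_open() subroutine shown below):

# Deliberately--and conspicuously--ignoring any failure...
my $fh = eval { locate_and_open($filename) };
if (!defined $fh) {
    warn "Skipping $filename: $@";
}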

The locate_and_open() subroutine would be much cleaner and more robust if the errors within it threw exceptions:

use Carp;    # for croak(), which reports errors from the caller's perspective

# Find and open a file by name, returning the filehandle
# or throwing an exception on failure...
sub locate_and_open {
   my ($filename) = @_;

   # Check acceptable directories in order...
   for my $dir (@DATA_DIRS) {
       my $path = "$dir/$filename";

       # If file exists in acceptable directory, open and return it...
       if (-r $path) {
           open my $fh, '<', $path
               or croak( "Located $filename at $path, but could not open");
           return $fh;
       }
   }

   # Fail if all possible locations tried without success...
   croak( "Could not locate $filename" );
}

# and later...
for my $filename (@source_files) {
   my $fh = locate_and_open($filename);
   my $head = load_header_from($fh);
   print $head;
}

Notice that the main for loop didn't change at all. The developer using locate_and_open() still assumes that nothing can go wrong. Now there's some justification for that expectation, because if anything does go wrong, the thrown exception will automatically terminate the loop.

Exceptions are a better choice even if you are the careful type who religiously checks every return value for failure:

SOURCE_FILE:
for my $filename (@source_files) {
   my $fh = locate_and_open($filename);
   next SOURCE_FILE if !defined $fh;
   my $head = load_header_from($fh);
   next SOURCE_FILE if !defined $head;
   print $head;
}

Constantly checking return values for failure clutters your code with validation statements, often greatly decreasing its readability. In contrast, exceptions allow an algorithm to be implemented without having to intersperse any error-handling infrastructure at all. You can factor the error-handling out of the code and either relegate it to after the surrounding eval, or else dispense with it entirely:

for my $filename (@directory_path) {

   # Just ignore any source files that don't load...
   eval {
       my $fh = locate_and_open($filename);
       my $head = load_header_from($fh);
       print $head;
   }
}

9. Add New Test Cases Before You Start Debugging

The first step in any debugging process is to isolate the incorrect behavior of the system, by producing the shortest demonstration of it that you reasonably can. If you're lucky, this may even have been done for you:

To: DCONWAY@cpan.org
From: sascha@perlmonks.org
Subject: Bug in inflect module

Zdravstvuite,

I have been using your Lingua::EN::Inflect module to normalize terms in a
data-mining application I am developing, but there seems to be a bug in it,
as the following example demonstrates:

   use Lingua::EN::Inflect qw( PL_N );
   print PL_N('man'), "\n";       # Prints "men", as expected
   print PL_N('woman'), "\n";     # Incorrectly prints "womans"

Once you have distilled a short working example of the bug, convert it to a series of tests, such as:

use Lingua::EN::Inflect qw( PL_N );
use Test::More qw( no_plan );
is( PL_N('man'),   'men',   'man -> men'     );
is( PL_N('woman'), 'women', 'woman -> women' );

Don't try to fix the problem straight away, though. Instead, immediately add those tests to your test suite. If that testing has been well set up, that can often be as simple as adding a couple of entries to a table:

my %plural_of = (
   'mouse'         => 'mice',
   'house'         => 'houses',
   'ox'            => 'oxen',
   'box'           => 'boxes',
   'goose'         => 'geese',
   'mongoose'      => 'mongooses', 
   'law'           => 'laws',
   'mother-in-law' => 'mothers-in-law', 

   # Sascha's bug, reported 27 August 2004...
   'man'           => 'men',
   'woman'         => 'women',
);

The point is: if the original test suite didn't report this bug, then that test suite was broken. It simply didn't do its job (finding bugs) adequately. Fix the test suite first by adding tests that cause it to fail:

> perl inflections.t
ok 1 - house -> houses
ok 2 - law -> laws
ok 3 - man -> men
ok 4 - mongoose -> mongooses
ok 5 - goose -> geese
ok 6 - ox -> oxen
not ok 7 - woman -> women
#     Failed test (inflections.t at line 20)
#          got: 'womans'
#     expected: 'women'
ok 8 - mother-in-law -> mothers-in-law
ok 9 - mouse -> mice
ok 10 - box -> boxes
1..10
# Looks like you failed 1 tests of 10.

Once the test suite is detecting the problem correctly, then you'll be able to tell when you've correctly fixed the actual bug, because the tests will once again fall silent.
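
For example, once the fix is in, re-running the suite under the prove utility that ships with Perl's Test::Harness gives a one-line verdict, so a fully fixed bug is unmistakable (output abbreviated):

> prove inflections.t
inflections.t....ok
All tests successful.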

This approach to debugging is most effective when the test suite covers the full range of manifestations of the problem. When adding test cases for a bug, don't just add a single test for the simplest case. Make sure you include the obvious variations as well:

my %plural_of = (
   'mouse'         => 'mice',
   'house'         => 'houses',
   'ox'            => 'oxen',
   'box'           => 'boxes',
   'goose'         => 'geese',
   'mongoose'      => 'mongooses', 
   'law'           => 'laws',
   'mother-in-law' => 'mothers-in-law', 

   # Sascha's bug, reported 27 August 2004...
   'man'           => 'men',
   'woman'         => 'women',
   'human'         => 'humans',
   'man-at-arms'   => 'men-at-arms', 
   'lan'           => 'lans',
   'mane'          => 'manes',
   'moan'          => 'moans',
);

The more thoroughly you test the bug, the more completely you will fix it.

10. Don't Optimize Code--Benchmark It

If you need a function to remove duplicate elements of an array, it's natural to think that a "one-liner" like this:

sub uniq { return keys %{ { map {$_=>1} @_ } } }

will be more efficient than two statements:

sub uniq {
   my %seen;
   return grep {!$seen{$_}++} @_;
}

Unless you are deeply familiar with the internals of the Perl interpreter (in which case you already have far more serious personal issues to deal with), intuitions about the relative performance of two constructs are exactly that: unconscious guesses.

The only way to know for sure which of two--or more--alternatives will perform better is to actually time each of them. The standard Benchmark module makes that easy:

# A short list of not-quite-unique values...
our @data = qw( do re me fa so la ti do );

# Various candidates...
sub unique_via_anon {
   return keys %{ { map {$_=>1} @_ } };
}

sub unique_via_grep {
   my %seen;
   return grep { !$seen{$_}++ } @_;
}

sub unique_via_slice {
   my %uniq;
   @uniq{@_} = ();
   return keys %uniq;
}

# Compare the current set of data in @data
sub compare {
   my ($title) = @_;
   print "\n[$title]\n";

   # Create a comparison table of the various timings, making sure that
   # each test runs at least 10 CPU seconds...
   use Benchmark qw( cmpthese );
   cmpthese -10, {
       anon  => 'my @uniq = unique_via_anon(@data)',
       grep  => 'my @uniq = unique_via_grep(@data)',
       slice => 'my @uniq = unique_via_slice(@data)',
   };

   return;
}

compare('8 items, 10% repetition');

# Two copies of the original data...
@data = (@data) x 2;
compare('16 items, 56% repetition');

# One hundred copies of the original data...
@data = (@data) x 50;
compare('800 items, 99% repetition');

The cmpthese() subroutine takes a number, followed by a reference to a hash of tests. The number specifies either the exact number of times to run each test (if the number is positive), or the minimum number of CPU seconds for which to run each test (if the number is negative). Typical values are around 10,000 repetitions or ten CPU seconds, but the module will warn you if the test is too short to produce an accurate benchmark.

The keys of the test hash are the names of your tests, and the corresponding values specify the code to be tested. Those values can be either strings (which are eval'd to produce executable code) or subroutine references (which are called directly).

The benchmarking code shown above would print out something like the following:

[8 items, 10% repetition]
         Rate  anon  grep slice
anon  28234/s    --  -24%  -47%
grep  37294/s   32%    --  -30%
slice 53013/s   88%   42%    --

[16 items, 56% repetition]
         Rate  anon  grep slice
anon  21283/s    --  -28%  -51%
grep  29500/s   39%    --  -32%
slice 43535/s  105%   48%    --

[800 items, 99% repetition]
        Rate  anon  grep slice
anon   536/s    --  -65%  -89%
grep  1516/s  183%    --  -69%
slice 4855/s  806%  220%    --

Each of the tables printed has a separate row for each named test. The first column lists the absolute speed of each candidate in repetitions per second, while the remaining columns allow you to compare the relative performance of any two tests. For example, in the final test, tracing across the grep row to the anon column reveals that the grepped solution was 183 percent faster than using an anonymous hash (that is, 2.83 times as fast). Tracing further across the same row also indicates that grepping was 69 percent slower (-69 percent faster) than slicing.

Overall, the indication from the three tests is that the slicing-based solution is consistently the fastest for this particular set of data on this particular machine. It also appears that, as the data set increases in size, slicing scales much better than either of the other two approaches.

However, those two conclusions are effectively drawn from only three data points (namely, the three benchmarking runs). To get a more definitive comparison of the three methods, you'd also need to test other possibilities, such as a long list of non-repeating items, or a short list with nothing but repetitions.

Better still, test on the real data that you'll actually be "unique-ing."

For example, if that data is a sorted list of a quarter of a million words, with only minimal repetitions, and which has to remain sorted, then test exactly that:

use Perl6::Slurp;    # provides the slurp() used below

our @data = slurp '/usr/share/biglongwordlist.txt';

use Benchmark qw( cmpthese );

cmpthese 10, {
    # Note: the non-grepped solutions need a post-uniqification re-sort
    anon  => 'my @uniq = sort(unique_via_anon(@data))',
    grep  => 'my @uniq = unique_via_grep(@data)',
    slice => 'my @uniq = sort(unique_via_slice(@data))',
};

Not surprisingly, this benchmark indicates that the grepped solution is markedly superior on a large sorted data set:

      s/iter  anon slice  grep
anon    4.28    --   -3%  -46%
slice   4.15    3%    --  -44%
grep    2.30   86%   80%    --

Perhaps more interestingly, the grepped solution still benchmarks as being marginally faster when the two hash-based approaches aren't re-sorted. This suggests that the better scalability of the sliced solution as seen in the earlier benchmark is a localized phenomenon, and is eventually undermined by the growing costs of allocation, hashing, and bucket-overflows as the sliced hash grows very large.

Above all, that last example demonstrates that benchmarks only benchmark the cases you actually benchmark, and that you can only draw useful conclusions about performance from benchmarking real data.

A Refactoring Example

About a year ago, someone asked the Fun With Perl mailing list about some code he had written to do database queries. It's important to note that he was posting from an .it address; the reason will become apparent later. The code was reading records in from a text file and then doing a series of queries based on that information. He wanted to make it faster.

Here's his code:

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
$nump++;
chop($riga);
$pagina[$nump] = $riga;

$sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
pageid='$pagina[$nump]' and data>='$startdate'");
$sth->execute;
$totalvisit[$nump] = $sth->fetchrow_array();

$sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
(pageid='$pagina[$nump]' and data='$dataoggi')");
$sth->execute;
$totalvisittoday[$nump] = $sth->fetchrow_array();

 $sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
(pageid='$pagina[$nump]' and data='$dataieri')");
$sth->execute;
$totalyvisit[$nump] = $sth->fetchrow_array();

 $sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
(pageid='$pagina[$nump]' and data<='$fine30gg' and data>='$inizio30gg')");
$sth->execute;
$totalmvisit[$nump] = $sth->fetchrow_array();

 }

I decided that rather than try to read through this code and figure out what it's doing and how to make it faster, I'd clean it up first. Clean it up before you figure out how it works? Yes, using a technique called Refactoring.

Refactoring?

In his book Refactoring: Improving the Design of Existing Code, Martin Fowler defines Refactoring as "the process of changing a software system in such a way that it does not alter the external behavior of the code yet improves its internal structure." In other words, you clean up your code but don't change what it does.

Refactoring can be as simple as changing this code:

$exclamation = 'I like '.$pastry.'!';

To this:

$exclamation = "I like $pastry!";

It still does the same thing, but it's easier to read.

It's important to note that I don't need to know anything about the contents of $pastry or how $exclamation is used. The change is completely self-contained and does not affect surrounding code or change what it does. This is Refactoring.

On the principle of "show me, don't tell me," rather than talk about it further, we'll dive right into refactoring our bit of code.

Fix the Indentation

Your first impulse when faced with a hunk of code slammed against the left margin is to indent it. This is our first refactoring.

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
                         pageid='$pagina[$nump]' and data>='$startdate'");
    $sth->execute;
    $totalvisit[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
                          (pageid='$pagina[$nump]' and data='$dataoggi')");
    $sth->execute;
    $totalvisittoday[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare("SELECT count(*) FROM lognew WHERE
                          (pageid='$pagina[$nump]' and data='$dataieri')");
    $sth->execute;
    $totalyvisit[$nump] = $sth->fetchrow_array();

    $sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
                         (pageid='$pagina[$nump]' and data<='$fine30gg' 
						 and data>='$inizio30gg')");
    $sth->execute;
    $totalmvisit[$nump] = $sth->fetchrow_array();
}

close (INPUT);

Already it looks better. We can see that we're iterating over a file, performing some SELECTs on each line and shoving the results into a bunch of arrays.

A Single, Simple Change

One of the most important principles of Refactoring is that you work in small steps. This re-indentation is a single step. And part of this single step includes running the test suite, logging the change, and checking it into CVS.

Checking into CVS after something this simple? Yes. Many programmers ask the question, "When should I check in?" When you're refactoring it's simple: check in when you've done one refactoring and have tested that it works. Our re-indentation is one thing; we test that it works and check it in.

This may seem excessive, but it prevents us from entangling two unrelated changes. By doing one change at a time, we know that any new bugs were introduced by that one change. Also, you will often decide in the middle of a refactoring that it's not such a good idea. When you've checked in at every step, you can simply roll back to the last version rather than having to undo the change by hand. Convenient, and you're sure no stray bits of your aborted change are hanging around.

So our procedure for doing a proper refactoring is:

  • Make one logical change to the code.
  • Make sure it passes tests.
  • Log and check in.
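
With CVS, for instance, that last step might look like this (the file name and log message here are illustrative):

> cvs commit -m "Re-indent query loop; no change in behavior" stats.pl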

Big Refactorings from Small

The goal of this refactoring is to make the code go faster. One of the simplest ways to achieve that is to pull unnecessary work out of the loop. Preparing four new statements in every iteration of the loop seems really unnecessary. We'd like to pull those prepare() statements out of the loop. This is a refactoring. To achieve this larger refactoring, a series of smaller refactorings must be done.

Use Bind Variables

Each time through the loop, a new set of SQL statements is created based on the line read in. But they're all basically the same; only the data changes. If we could pull that data out of the statements, we'd be closer to our goal of pulling the prepare()s out of the loop.

So my next refactoring pulls variables out of the SQL statements and replaces them with placeholders. Then the data is bound to the statement using bind variables. This means we're now prepare()ing the same statements every time through the loop.

Before refactoring:

$sth= $dbh->prepare("SELECT count(*) FROM lognew WHERE
                     pageid='$pagina[$nump]' and data>='$startdate'");
$sth->execute;

After refactoring:

$sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                     pageid=? and data>=?');
$sth->execute($pagina[$nump], $startdate);

Bind variables also prevent a naughty user from trying to slip some extra SQL into your program via the data you read in. As a side effect of our code cleanup, we've closed a potential security hole.
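
To see why, imagine a contrived line of input such as:

somepage' OR 'x'='x

Interpolated directly into the original quoted SQL, that line rewrites the WHERE clause to match every row in lognew. Passed as a bind value instead, it's just an implausible page ID that matches nothing.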

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                         pageid=? and data>=?');
    $sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                          (pageid=? and data=?)');
    $sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $sth->fetchrow_array();

    $sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                          (pageid=? and data=?)');
    $sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $sth->fetchrow_array();

    $sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE
                         (pageid=? and data<=? and data>=?)');
    $sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Split a Poorly Reused Variable

The next stumbling block to pulling the prepare() statements out of the loop is that they all use the same variable, $sth. We'll have to change the code so they all use different variables. While we're at it, we'll name those statement handles something more descriptive of what each statement does. Since at this point we haven't figured out what the statements do, we can base each name on the array its result gets assigned to.

We'll also throw in some my() declarations to limit the scope of these variables to just the loop.

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    my $totalvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                                        pageid=? and data>=?');
    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    my $totalvisittoday_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                             (pageid=? and data=?)');
    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    my $totalyvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                         (pageid=? and data=?)');
    $totalyvisit_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalyvisit_sth->fetchrow_array();

    my $totalmvisit_sth= $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                        (pageid=? and data<=? and data>=?)');
    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Getting Better All the Time

The new names are better, but they're not great. That's OK. Naming is something people often get hung up on. One can spend hours racking their brains thinking of the perfect name for a variable or a function. If you can think of a better one than what's there right now, use it. The beauty of Refactoring is you can always improve upon it later.

This is an important lesson of Refactoring. Voltaire said, "the best is the enemy of the good." We often get so wound up trying to make code great that we fail to improve it at all. In refactoring, it's not so important to make your code great in one leap, just a little better all the time (it's a little-known fact that John Lennon was into Refactoring). These small improvements will build up into a clean piece of code, with fewer bugs, more surely than a large-scale code cleanup would.

Pull Code Out of the Loop

Now it's a simple cut and paste to pull the four prepare() statements out of the loop.

my $totalvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE 
                                    pageid=? and data>=?');

my $totalvisittoday_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                         (pageid=? and data=?)');

my $totalyvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                     (pageid=? and data=?)');

my $totalmvisit_sth = $dbh->prepare('SELECT count(*) FROM lognew WHERE
                                     (pageid=? and data<=? and data>=?)');

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalyvisit_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalyvisit_sth->fetchrow_array();

    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Already the code is looking better. With the SQL separated, the inner workings of the loop are much less daunting.

Test. Log. Check in.

A Place to Stop

Remember our goal: to make this code run faster. By pulling the prepare() statements outside the loop, we've likely achieved it. Additionally, the code still does exactly what it did before, even though we still don't fully understand what that is. If this were a real project, you'd do some benchmarking to see if the code is now fast enough, and then move on to another task.

Since this is an example, I'll continue with more refactorings with the goal of clarifying the code further and figuring out what it does.

Keep in mind that after every refactoring, the code still does exactly what it did before. This means we can choose to stop after any refactoring. If a more pressing task suddenly pops up, we can pause our refactoring work and attend to it, confident that we didn't leave any broken code lying around.

Reformat SQL for Better Readability

In order to make sense of the code, we have to make sense of the SQL. The simplest way to better understand the SQL is to put it into a clearer format.

The three major parts of an SQL SELECT statement are:

  • The select list (e.g., SELECT count(*))
  • The table (e.g., FROM lognew)
  • The predicate (e.g., WHERE pageid = ...)

I've chosen a new format that highlights these parts.

I've also removed some unnecessary parentheses, because they just serve to clutter things up rather than disambiguate an expression.

I've also decided to change the quoting style from single quotes to a here-doc. It would have also been okay to use q{}.

my $totalvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $totalvisittoday_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalyvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalmvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalyvisit_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalyvisit_sth->fetchrow_array();

    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Remove Redundancy

With the SQL in a more readable format, some commonalities become clear.

  • All the statements are doing a count(*).
  • They're all using the lognew table.
  • They're all looking for a certain pageid.

In fact, $totalvisittoday_sth and $totalyvisit_sth are exactly the same! Let's eliminate one of them; it doesn't matter which, since we're going to rename them in a moment anyway. $totalyvisit_sth goes away, making sure to change all references to it to $totalvisittoday_sth.

my $totalvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $totalvisittoday_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalmvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($riga=<INPUT>){
    $nump++;
    chop($riga);
    $pagina[$nump] = $riga;

    $totalvisit_sth->execute($pagina[$nump], $startdate);
    $totalvisit[$nump] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataoggi);
    $totalvisittoday[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pagina[$nump], $dataieri);
    $totalyvisit[$nump] = $totalvisittoday_sth->fetchrow_array();

    $totalmvisit_sth->execute($pagina[$nump], $fine30gg, $inizio30gg);
    $totalmvisit[$nump] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Fix Conflicting Styles

Now the only difference between the statements is the choice of data ranges.

Looking at the variables that are passed into each statement, we can make some more deductions. Let's have a look...

  • $startdate
  • $dataoggi
  • $dataieri
  • $fine30gg, $inizio30gg

One of these things is not like the others. What's $startdate doing there? Everything else is talking about 'data'. What's 'ieri'? 'oggi'? Remember, the programmer who submitted this code is Italian. Maybe the names are in Italian. Grabbing an Italian-English dictionary, we find out that 'data' is Italian for 'date'! Now it makes sense: this code was probably originally written in English, then worked on by an Italian (or vice versa).

This code has committed a cardinal stylistic sin: it uses two different languages for naming variables. Not just different languages, but languages which have different meanings for the same words. Taken out of context, we can't know whether $data represents a hunk of facts or "Thursday."

Since the styles conflict, one of them has to go. Since I don't speak Italian, I'm going to translate it into English.

Pulling out our Italian-to-English dictionary...

  • "riga" is "line"
  • "pagina" is "page"
  • "nump" is probably short for "numero pagina" which is "page number"
  • "data" is "date"
  • "oggi" is "today"
  • "ieri" is "yesterday"
  • "inizio" is "start"
  • "fine" is "end"
  • "gg" is probably short for "giorni" which is "days"
    • "fine30gg" would then be "the end of 30 days"
    • "inizio30gg" would be "the beginning of 30 days"

It would be a straightforward matter of a bunch of search-and-replaces in any good editor, but for one snag: the SQL column 'data'. We'd like to change this to the English 'date', but databases are very global, with possibly lots of other programs using them. So we can't change the column name without breaking other code. While in a well-organized programming shop you might be able to find all the code which uses your database, we won't assume we have that luxury here. For the moment, then, we'll leave it be and deal with it in a separate refactoring.
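As an aside, when that separate refactoring does happen, one way to quarantine the misleading column name in the meantime might be to name it exactly once and interpolate it. This is only a hypothetical sketch, not part of the listing below; note the here-doc has to switch to the interpolating <<"SQL" form:

# 'data' is Italian for 'date'; only this one line needs to know that.
my $DATE_COLUMN = 'data';

my $totalvisit_sth = $dbh->prepare(<<"SQL");
SELECT count(*)
FROM   lognew
WHERE  pageid       =  ? AND
       $DATE_COLUMN >= ?
SQL

For now, though, here's the code with just the Perl variables renamed.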

my $totalvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $totalvisittoday_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $totalmvisit_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($line=<INPUT>){
    $page_num++;
    chop($line);
    $pages[$page_num] = $line;

    $totalvisit_sth->execute($pages[$page_num], $start_date);
    $totalvisit[$page_num] = $totalvisit_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pages[$page_num], $today);
    $totalvisittoday[$page_num] = $totalvisittoday_sth->fetchrow_array();

    $totalvisittoday_sth->execute($pages[$page_num], $yesterday);
    $totalyvisit[$page_num] = $totalvisittoday_sth->fetchrow_array();

    $totalmvisit_sth->execute($pages[$page_num], $end_of_30_days,
                                                 $start_of_30_days);
    $totalmvisit[$page_num] = $totalmvisit_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Better Names

With decent variable names in place, the purpose of the program becomes much clearer. This is a program to calculate the number of visits to a page for various date ranges. Based on this new information we can give the statement handles and the arrays they put data into better names.

Looking at the SQL we see we've got:

  • One to get all the visits up to a single day.
  • One to get the visits for a certain date.
  • One to get the visits for a range of dates.

A good set of new names would be:

  • up to
  • daily
  • range

Also, Total Visits is too long. We could shorten that to just Visits, or even shorter to Hits.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($line=<INPUT>){
    $page_num++;
    chop($line);
    $pages[$page_num] = $line;

    $hits_upto_sth->execute($pages[$page_num], $start_date);
    $totalvisit[$page_num] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $today);
    $totalvisittoday[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $yesterday);
    $totalyvisit[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($pages[$page_num], $end_of_30_days,
                                                $start_of_30_days);
    $totalmvisit[$page_num] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Changing Global Variable Names

The array names need work, too. Currently, they're rather ambiguous. Take @totalyvisit: what does the 'y' mean? Looking at each variable name and the variables that got passed to execute() to produce it...

  • @totalvisit counts visits up to a $start_date. So that can be @hits_upto.
  • @totalvisittoday comes from $today and is pretty obvious: @hits_today.
  • @totalyvisit comes from $yesterday, so 'y' must be for 'yesterday': @hits_yesterday.
  • @totalmvisit comes from the range produced by $start_of_30_days and $end_of_30_days, so 'm' must be 'month': @hits_monthly.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($line=<INPUT>){
    $page_num++;
    chop($line);
    $pages[$page_num] = $line;

    $hits_upto_sth->execute($pages[$page_num], $start_date);
    $hits_upto[$page_num] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $today);
    $hits_today[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($pages[$page_num], $yesterday);
    $hits_yesterday[$page_num] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($pages[$page_num], $end_of_30_days,
                                                $start_of_30_days);
    $hits_monthly[$page_num] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test... uh-oh, test failed!

There's something very different about this change compared to the others. The variables we changed were not declared in our little code block. Likely they're used in other parts of the code, such as our test, which is why it broke.

In the Real World, we would be sure to replace all occurrences of the variable. The simplest way to do this is to use your editor to perform a search-and-replace, rather than doing it with your all-too-fallible hands. If the variable could be used across a set of files, grep through those files for all occurrences and change them as well.

# If you don't have rgrep, grep -r does the same thing.
rgrep '[@$]totalvisit' /path/to/your/project

I do this so often that I've taken to calling grep -r 'Refactoring Grep'. Other languages whose syntax is -- ummm -- not as inspired as Perl's, such as Java, C++ and Python, have tools for doing this sort of thing automatically. Because of the complexity of Perl's syntax, we still have to do it mostly by hand, though there are some efforts underway to rectify this.
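For a cross-file rename, a Perl one-liner is a handy middle ground between an editor and a real refactoring tool. A sketch; the file names are placeholders for wherever the grep above found the variable:

# Rename @totalvisit/$totalvisit to @hits_upto/$hits_upto in place,
# keeping .bak backups.  The \b keeps it from touching @totalvisittoday.
perl -pi.bak -e 's/([\@\$])totalvisit\b/${1}hits_upto/g' counter.pl t/counter.t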

Changing the array names in our test as well, we get it to pass.

Log. Check in.

Improve Overly Generic Names

Continuing with our variable name improvements, we're left with the last few unimproved names. Let's start with $line.

Since we can see clearly that $line = <INPUT>, calling the variable 'line' tells us nothing new. A better name might be what each line contains. Looking at how the line is used, we see $pages[$page_num] = $line, and how that is then used in the SQL: it's a page ID.

But it doesn't make much sense to put a page ID into an array called @pages. It doesn't contain pages, it contains page IDs. Call it @page_ids.

What about $page_num? It doesn't contain a page number; it contains the line number of the file we're reading in. Or, more conventionally, an $index or $idx.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   <= ? AND 
       data   >= ?
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chop($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits_today[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits_yesterday[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $end_of_30_days,
                                              $start_of_30_days);
    $hits_monthly[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Fixing Odd Interfaces

What's wrong with this picture?

$hits_range_sth->execute($page_ids[$idx], $end_of_30_days,
                                          $start_of_30_days);

Isn't it a little odd to specify a date range with the end first? Sure is. It also guarantees someone is going to get it backwards. Reverse it. Don't forget the SQL, too.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chop($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits_today[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits_yesterday[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $start_of_30_days,
                                              $end_of_30_days);
    $hits_monthly[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

s/chop/chomp/

Now that we've stared at the code for a while, you might have noticed the use of chop(). Using chop() to strip a newline is asking for portability problems, so let's fix it by using chomp().

Technically this isn't a refactoring since we altered the behavior of the code by fixing the bug. But using chop() where you meant chomp() is such a common mistake we'll make it an honorary refactoring.
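In case the difference isn't fresh in your mind: chomp() removes a trailing input record separator ($/) only if one is present, while chop() blindly removes the last character, whatever it is. A quick illustration:

my $line = "12\n";
chomp( my $chomped = $line );    # "12" -- the newline is removed
chop ( my $chopped = $line );    # "12" -- same result, by luck

my $last = "12";                 # a final line with no trailing newline
chomp( my $safe = $last );       # "12" -- chomp leaves it alone
chop ( my $oops = $last );       # "1"  -- chop eats real data

With that in mind, here's the code with the fix applied.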

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits_today[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits_yesterday[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $start_of_30_days,
                                              $end_of_30_days);
    $hits_monthly[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Collect Related Variables into Hashes

The common prefix hits_ is a dead giveaway that much of the data in this code is related. Related variables should be grouped together into a single structure, probably a hash, to make the relation obvious and to allow them to be passed around to subroutines more easily. It's easier to move around one hash than four arrays.

I've decided to collect all the @hits_ arrays into a single hash, %hits, since they'll probably be used together in other parts of the program. If this code snippet represents a function, it means I can return one hash reference rather than four array refs. It also makes future expansion easier: rather than returning an additional array, a new range simply becomes another key in the hash.

Before.

$hits_upto[$idx] = $hits_upto_sth->fetchrow_array();

After.

$hits{upto}[$idx] = $hits_upto_sth->fetchrow_array();

It's interesting to note what a small, natural change this is. That's circumstantial evidence that this is a good refactoring.
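As a hypothetical illustration of the payoff, compare what a call site might look like once this data gets handed to a reporting routine; report_hits() is invented for this example.

# With four parallel arrays, everything travels separately.
report_hits(\@hits_upto, \@hits_today, \@hits_yesterday, \@hits_monthly);

# With one hash, a single reference carries the whole structure,
# and a new date range later is just another key in %hits.
report_hits(\%hits);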

As before, since these arrays are global data, we must be sure to change them everywhere. This includes the tests.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_ids[$idx], $start_date);
    $hits{upto}[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $today);
    $hits{today}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_ids[$idx], $yesterday);
    $hits{yesterday}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_ids[$idx], $start_of_30_days,
                                              $end_of_30_days);
    $hits{monthly}[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

When Not to Refactor

The statement handles are also related, but I'm not going to collect them together into a hash. The statement handles are short-lived lexicals; they're never likely to be passed around. Their short scope and grouping within the code makes their relationship obvious. The design would not be improved by the refactoring.

Refactoring is not a set of rules to be slavishly followed; it's a collection of tools. And like any other tool, you must carefully consider when and when not to use it. Since collecting the statement handles together doesn't improve the design, I won't do it.

Eliminate Unnecessary Longhand

Boy, we sure use $page_ids[$idx] a lot. It's the current page ID. But don't we have a variable for that?

Replace all the unnecessary array accesses and just use the more concise and descriptive $page_id.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}[$idx] = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $today);
    $hits{today}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $yesterday);
    $hits{yesterday}[$idx] = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_id, $start_of_30_days,
                                       $end_of_30_days);
    $hits{monthly}[$idx] = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Rearrange Data Structures to Fit Their Use

Currently, %hits is accessed by the order the page ID was read out of the file. That doesn't seem very useful at all. Its only purpose seems to be listing the page counts in exactly the same order as you read them in. Even then you need to iterate through @page_ids simultaneously, because nowhere in %hits is the page ID stored.

Consider a common operation, looking up the hit counts for a given page ID. You have to iterate through the whole list of page IDs to do it.

foreach my $idx (0..$#page_ids) {
    if( $page_ids[$idx] eq $our_page_id ) {
        print "Hits for $our_page_id today: $hits{today}[$idx]\n";
        last;
    }
}

Cumbersome. A much better layout would be a hash keyed on the page ID.

$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

Now we can directly access the data for a given page ID. If necessary, we can still list the hits in the same order they were read in by iterating through @page_ids.
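The cumbersome scan above collapses into a single lookup, using the same hypothetical $our_page_id:

print "Hits for $our_page_id today: $hits{today}{$our_page_id}\n";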

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    $idx++;
    chomp($page_id);
    $page_ids[$idx] = $page_id;

    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $today);
    $hits{today}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $yesterday);
    $hits{yesterday}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_id, $start_of_30_days,
                                       $end_of_30_days);
    $hits{monthly}{$page_id} = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Eliminate Unnecessary Variables

Now that %hits is no longer ordered by how it was read in, $idx isn't used much anymore. It's only used to stick $page_id onto the end of @page_ids, but we can do that with push.

This is minor but little things build up to cause big messes.

my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    chomp($page_id);
    push @page_ids, $page_id;

    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $today);
    $hits{today}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_daily_sth->execute($page_id, $yesterday);
    $hits{yesterday}{$page_id} = $hits_daily_sth->fetchrow_array();

    $hits_range_sth->execute($page_id, $start_of_30_days,
                                       $end_of_30_days);
    $hits{monthly}{$page_id} = $hits_range_sth->fetchrow_array();
}

close (INPUT);

Test. Log. Check in.

Pull Logical Chunks Out into Functions

Our final refactoring is one of the most common and most useful.

Let's assume that we need to generate page counts somewhere else in the code. Rather than repeat the code to do this, we want to put it in a subroutine so it can be reused. One subroutine for each statement.

In order to do this, start by identifying the code that would go into the routine.

$hits_upto_sth->execute($page_id, $start_date);
$hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();

Then wrap a subroutine around it.

sub hits_upto {
    $hits_upto_sth->execute($page_id, $start_date);
    $hits{upto}{$page_id} = $hits_upto_sth->fetchrow_array();
}

Now look at all the variables used.

$hits_upto_sth is a global (well, file-scoped lexical) and is defined entirely outside the function. We can keep using it in our subroutine in the same way we are now.

$hits{upto}{$page_id} is receiving the result of the calculation. It contains the return value. So it goes outside the function to receive the return value. Where its assignment was, we put a return.

sub hits_upto {
    $hits_upto_sth->execute($page_id, $start_date);
    return $hits_upto_sth->fetchrow_array();
}

$page_id and $start_date vary from call to call. These are our function arguments.

sub hits_upto {
    my($page_id, $start_date) = @_;
    $hits_upto_sth->execute($page_id, $start_date);
    return $hits_upto_sth->fetchrow_array();
}

Finally, rename things in a more generic manner. This is a subroutine for calculating the number of hits up to a certain date. Instead of $start_date which was specific to one calculation, we'd call it $date.

sub hits_upto {
    my($page_id, $date) = @_;
    $hits_upto_sth->execute($page_id, $date);
    return $hits_upto_sth->fetchrow_array();
}

And that's our new subroutine; it does the same thing as the original code. Then it's a simple matter to use it in the code. (In the final listing below I've also added a scalar to each return, so that fetchrow_array() always hands back a single count regardless of the context the subroutine is called in.)

    $hits{upto}{$page_id} = hits_upto($page_id, $start_date);


my $hits_upto_sth = $dbh->prepare(<<'SQL');
SELECT count(*)
FROM   lognew 
WHERE  pageid =  ? AND 
       data   >= ?
SQL

my $hits_daily_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid = ? AND 
       data   = ?
SQL

my $hits_range_sth = $dbh->prepare(<<'SQL');
SELECT count(*) 
FROM   lognew 
WHERE  pageid =  ? AND
       data   >= ? AND
       data   <= ? 
SQL

open (INPUT, "< $filepageid") || &file_open_error("$filepageid");

while ($page_id=<INPUT>){
    chomp($page_id);
    push @page_ids, $page_id;

    $hits{upto}{$page_id}      = hits_upto($page_id, $start_date);
    $hits{today}{$page_id}     = hits_daily($page_id, $today);
    $hits{yesterday}{$page_id} = hits_daily($page_id, $yesterday);
    $hits{monthly}{$page_id}   = hits_range($page_id, $start_of_30_days,
                                                      $end_of_30_days);
}

close (INPUT);

sub hits_upto {
    my($page_id, $date) = @_;
    $hits_upto_sth->execute($page_id, $date);
    return scalar $hits_upto_sth->fetchrow_array();
}

sub hits_daily {
    my($page_id, $date) = @_;
    $hits_daily_sth->execute($page_id, $date);
    return scalar $hits_daily_sth->fetchrow_array();
}

sub hits_range {
    my($page_id, $start, $end) = @_;
    $hits_range_sth->execute($page_id, $start, $end);
    return scalar $hits_range_sth->fetchrow_array();
}

Test. Log. Check in.

Undo.

Some may balk at putting such a small snippet of code into a subroutine like that. There are definite performance concerns about adding four subroutine calls to a loop. But I'm not worried about that at all.

One of the beauties of Refactoring is that it's reversible. Refactorings don't change how the program works. We can reverse any of these refactorings and the code will work exactly the same. If a refactoring turns out to be a bad idea, undo it. Logging each refactoring in version control makes the job even easier.

So if it turns out moving the executes into their own functions causes a performance problem the change can easily be undone.

Done?

At this point, things are looking pretty nice. The code is well structured, readable, and efficient. The variables are sensibly named. The data is organized in a fairly flexible manner.

It's good enough. This is not to say that there's nothing more that could be done, just that we don't need to do it. Refactoring is about doing as much redesign as you need, not as much as you might someday need.

Refactoring and the Swiss Army Knife

As programmers we have a tendency toward over-design. We like to design our code to deal with any possible situation that might arise, because it's hard to change the design later. This is known as Big Design Up Front (BDUF). It's like one of those enormous Swiss Army knives with 50 functions. Most of the time all you need is a knife, something to open your beer with, and maybe a toothpick for afterwards -- but you never know, so you over-engineer because it's hard to improve the knife later. If the extra tools never get used, a lot of effort has been wasted.

Refactoring turns design on its ear. Now you can continually evolve your design as needed. There's no longer a need to write for every possible situation up front so you can focus on just what you need right now. If you need more flexibility later, you can add that flexibility through refactoring. It's like having a Swiss Army knife that you can add tools to as you need them.

Red Flags Return

Astute readers had a number of comments about last week's Program Repair Shop and Red Flags article.

Control Flow Puzzle

In the article, I had a section of code that looked like this:

       $_ = <INFO> until !defined($_) || /^(\* Menu:|\037)/;
       return @header if !defined($_) || /^\037/;

I disliked the structure and especially the repeated tests. I played with it, changing it to

        while (<INFO>) {
          last if /^\* Menu:/;
          return @header if /^\037/;
        }
        return @header unless defined $_;

and then used Simon Cozens' suggestion of

        do { 
          $_ = <INFO>; 
          return @header if /^\037/ || ! defined $_ 
        } until /^\* Menu:/ ;

This still bothered me, because do...until is unusual. But I was out of time, so that's what I used.

Readers came up with two interesting alternatives. Jeff Pinyan suggested:

        while (<INFO>) {
          last if /^\* Menu:/;
          return %header if /^\037/ or eof(INFO);
        }

This is perfectly straightforward, and the only reason I didn't think of it was my prejudice against eof(). In the article, I recommended avoiding eof(), and that's a good rule of thumb. But in this case, I think it was probably the wrong way to go.

After I saw Jeff's solution, I thought more about eof() and tried to remember what its real problems are. The conclusion I came to is that the big problem with eof() occurs when you use it on a filehandle that is involved in an interactive dialogue, such as a terminal.

Consider code like this:

        my ($name, $fav_color);
        print "Enter your name: ";
        chomp($name = <STDIN>);
        unless (eof(STDIN)) {
          print "Enter your favorite color: ";
          chomp($fav_color = <STDIN>);
        }

This seems straightforward, but it doesn't work. (Try it!) After the user enters their name, we ask for eof(). This tries to read another character from STDIN, which means that the program waits for user input before printing the second prompt! The program hangs forever at the eof test, and the only way it can continue is if the user clairvoyantly guesses that they are supposed to enter their favorite color. If they do, the program prints the prompt and immediately continues. Not very useful behavior! And under some circumstances, this can cause deadlock.

However, in the example program I was discussing, no deadlock is possible because the information flows in only one direction -- from a file into the program. So the use of eof() would have been safe.

Ilya Zakharevich suggested a solution that I like even better:

      while (<INFO>) {
          return do_menu() if /^\* Menu:/;
          last if /^\037/;
      }
      return %header;

Here, instead of requiring the loop to fall through to process the menu, we simply put the menu-processing code into a subroutine and process it inside the loop.

Ilya also pointed out that the order of the tests in the original code is backward:

	return @header if /^\037/ || ! defined $_

It should have looked like this:

	return @header if ! defined $_  || /^\037/;

Otherwise, we're trying to do a pattern-match operation on a possibly undefined value.

Ilya also suggested another alternative:

    READ_A_LINE: {
      return %header if not defined ($_ = <INFO>) or /^\037/;
      redo READ_A_LINE unless /^\* Menu:/;
    }

Randal Schwartz suggested something similar. This points out a possible rule of thumb: When Perl's control-flow constructions don't seem to be what you want, try decorating a bare block.

Oops!

I said:

now invoke the function like this:
	$object = Info_File->new('camel.info');

Unfortunately, the function in question was named open_info_file, not new. The call should have been

	$object = Info_File->open_info_file('camel.info');

I got the call right in my test program (of course I had a test program!) but then mixed up the name when I wrote the article. Thanks to Adam Turoff for spotting this.

Pattern Matching

In the article, I replaced this:

	($info_file) = /File:\s*([^,]*)/;
        ($info_node) = /Node:\s*([^,]*)/;
        ($info_prev) = /Prev:\s*([^,]*)/;
        ($info_next) = /Next:\s*([^,]*)/;
        ($info_up)   = /Up:\s*([^,]*)/;

With this:

	for my $label (qw(File Node Prev Next Up)) {
          ($header{$label}) = /$label:\s*([^,]*)/;
        }

Then I complained that Perl must recompile the regex each time through the loop, five times per node. Ilya pointed out the obvious solution:

	 $header{$1} = $2 
	     while /(File|Node|Prev|Next|Up):\s*([^,]*)/g;

I wish I had thought of this, because you can produce it almost mechanically. In fact, I think my original code betrays a red flag itself. Whenever you have something like this:

	for $item (LIST) {
          something involving m/$item/;
        }

this is a red flag, and you should consider trying to replace it with this:

	my $pat = join '|', LIST;
        something involving m/$pat/o;

As a simple example, consider this common construction:

	@states = ('Alabama', 'Alaska', ..., 
	           'West Virginia', 'Wyoming');
        $matched = 0;
        for $state (@states) {
          if ($input =~ /$state/) { 
            $matched = 1; last;
          }
        }

It's more efficient to use this instead:

	my $pat = join '|', @states;
        $matched = ($input =~ /$pat/o);

Applying this same transformation to the code in my original program yields Ilya's suggestion.

Synthetic Variables

My code looked like this:

	while (<INFO>) {
          return 1 if /^\037/;    # end of node, success.
          next unless /^\* \S/;   # skip non-menu-items
          if (/^\* ([^:]*)::/) {  # menu item ends with ::
              $key = $ref = $1;
          } elsif (/^\* ([^:]*):\s*([^.]*)[.]/) {
              ($key, $ref) = ($1, $2);
          } else {
              print STDERR "Couldn't parse menu item\n\t$_";
              next;
          }
          $info_menu{$key} = $ref;
        }

Ilya pointed out that in this code, $key and $ref may be synthetic variables. A synthetic variable isn't intrinsic to the problem you're trying to solve; rather, it's an artifact of the way the problem is expressed in a programming language. I think $key and $ref are at least somewhat natural, because the problem statement does include menu items with names that refer to nodes, and $key is the name of a menu item and $ref is the node it refers to. But some people might prefer Ilya's version:

       while (<INFO>) {
           return 1 if /^\037/;        # end of node, success.
           next unless s/^\* (?=\S)//; # skip non-menu-items
           $info_menu{$1} = $1, next if /^([^:]*)::/; 
           $info_menu{$1} = $2, next if /^([^:]*):\s*(.*?)\./;
           print STDERR "Couldn't parse menu item\n\t* $_";
       }

Whatever else you say about it, this reduces the code from eleven lines to six, which is good.

Old News

Finally, a belated correction. In the second Repair Shop and Red Flags Article way back in June, I got the notion that you shouldn't use string operations on numbers. While I still think this is good advice, I then tried to apply it outside of the domain in which it made sense.

I was trying to transform a number like 12345678 into an array like ('12', ',', '345', ',', '678'). After discussing several strategies, all of which worked, I ended with the following nonworking code:

	sub convert {
          my ($number) = shift;
          my @result;
          while ($number) {
            push @result, ($number % 1000) , ',';
            $number = int($number/1000);
          }
          pop @result;      # Remove trailing comma
          return reverse @result;
        }

If you ask this subroutine to convert the number 1009, you get ('1', ',', '9'), which is wrong; it should have been (1, ',', '009'). Many people wrote to point this out; I think Mark Lybrand was the first. Oops! Of course, you can fix this with sprintf, but really the solutions I showed earlier in the article are better.
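For the record, a sprintf-based repair might look something like this; it's my sketch, not code from the original article:

	sub convert {
          my $number = shift;
          my @result;
          # Peel off three digits at a time, zero-padding every
          # group except the leading one.
          while ($number >= 1000) {
            unshift @result, ',', sprintf('%03d', $number % 1000);
            $number = int($number / 1000);
          }
          unshift @result, $number;
          return @result;
        }

convert(1009) now returns (1, ',', '009'), as it should.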

The problem here is that I became too excited about my new idea. I still think it's usually a red flag to treat a number like a string. But there's an exception: When you are formatting a number for output, you have to treat it like a string, because output is always a string. I think Charles Knell hit the nail on the head here:

By inserting commas into the returned value, you ultimately treat the number as a string. Why not just give in and admit you're working with a string.

Thanks, Charles.

People also complained that the subroutine returns a rather peculiar list instead of a single scalar, but that was the original author's decision and I didn't want to tamper with it without being sure why he had done it that way. People also took advantage of the opportunity to send in every bizarre, convoluted way they could think of to accomplish the same thing (or even a similar thing), often saying something like this:

You are doing way too much work! Why don't you simply use this, like everyone else does?
	sub commify {
          $_ = shift . '*';
          "nosehair" while s/(.{1,3})\*/*,$1/;
          substr($_,2);
        }

I think this just shows that all code is really simple if you already happen to understand it.

Send More Code

Finally, thanks to everyone who wrote in, especially the people I didn't mention. These articles have been quite popular, and I'd like to continue them. But that can't happen unless I have code to discuss. So if you'd like to see another "Red Flags" article, please consider sending me a 20- to 50-line section of your own code. If you do, I won't publish the article without showing it to you beforehand.
