A Chromosome at a Time with Perl, Part 2
by James D. TisdallOctober 15, 2003
James D. Tisdall is the author of the recently released Mastering Perl for Bioinformatics.
In my previous article, A Chromosome at a Time with Perl, Part I, I showed you some programming "tricks" that help you avoid the trap of using up all your main memory when coding for very long strings, such as chromosomes and entire genomes.
The basic approach involved improving your code's running time by limiting the amount of memory space the program uses. The tricks discussed were calling subroutines with references as arguments, and searching for a pattern in a very large file by keeping only a small "window" of the file in memory at any one time, in a buffer.
This article will continue that discussion. I'll show you more about how references can greatly speed up a subroutine call by avoiding making copies of very large strings. I'll show you how you can bypass the overhead of subroutine calls entirely. I'll extend the previous example of a buffer window into a large file, making it suited to any situation where you know the minimum and maximum length of a pattern for which you're searching. And I'll show you how to quantify the behavior of your code by measuring its speed and space usage.
|
Related Reading
Mastering Perl for Bioinformatics |
Why Space Puts a Lower Bound on Time
In Perl, as in any programming system, the size of the data that the program uses is an absolute lower bound on how fast the program can perform.
In fact, algorithms are typically classified by how fast they perform on inputs of varying sizes, by giving their speed as a function of the size of the input. So a program that has to do 2n computations on an input of size n is a hell of a lot slower than a program that has to do n2 computations on an input of size n. The first is called intractable and exponential, or "bad"; the second is called tractable and polynomial, or "good." For instance, if n, the size of the input, is 100, then n2 is 10,000, while 2n is bigger than the number of atoms in the universe. But who's counting? And is the universe really finite? Oh well ... back to your regularly scheduled program.
This way of measuring an algorithm is called time complexity. It's usually written in a shorthand called big Oh notation. (See the Suggested Reading at the end of this article, if you get that far.)
In particular, if an algorithm gets an input of size n, and then just to write the answer it must write an output of size 2n, then the algorithm is taking 2n time, at least. So the space that an algorithm uses is intimately connected to the time it uses. Of course, a program could use just a small, constant amount of space and still use 2n time, for instance if it added and subtracted the number one over and over, 2n times, for some perverse reason. Still, the amount of space that an algorithm uses establishes a lower bound for how much time the algorithm takes to complete.
What's all this got to do with Perl programming in bioinformatics? Quite a lot, if you're writing code that manipulates, say, the 3 gigabases of human DNA.
If you're new to the field, a base is one of the letters A, C, G, or T that represents one of the four molecules that are the principal building blocks of DNA. Each base is typically represented in a computer language as one ASCII character taking one 8-bit byte, so 3 gigabases equals 3 gigabytes. Of course, you could represent each of the four bases using only 2 bits, so considerable compression is possible; but such space efficiency is not commonly employed. Hmmm ... makes an interesting idea for another article, however! "Perl and a Chromosome, Two Bits." Watch this space.
Just to read in the 3 gigabytes of DNA is going to take you some time. If you're also making copies of the 3 gigabytes in your variables, you're going to need more main memory than most computers have available. So the crafty Perl programmer needs to think of programming solutions that minimize the amount of space used when computing with such large input. If she does, not only will she have a program that fits into her computer's memory (always a wise move); she may also have a program that runs pretty fast, if she does say so herself, with all due humility.
Subroutines Without Extra Space
In Part I, I briefly discussed how passing references to subroutines can save you considerable space. I'll just expand a little on that discussion in this section. There are three main ways that references can be used to save space and therefore time in the subroutines of your Perl program.
One: Collect Data in a Subroutine Argument
First, let's say you call a subroutine to get some data. Typically this takes a form such as this:
my $chromosome1 = get_chromosome( 1 );
Assuming that the data is about 100 megabases long, the Perl programmer can see that the subroutine "get_chromosome" is collecting 100 megabases of DNA and then "returning" it, which means that copies of those 100 megabases are being made. And that's a Bad Thing.
Instead, the wily hacker could pass a reference to the $chromosome1 string into the
subroutine, which could be written to "fill" the string with the 100 megabases without
the need for further copying. Then after the subroutine call, say like so:
get_chromosome(1, \$chromosome1);
the variable $chromosome1 would contain the same data as in the previous version, but
it would have gotten there without being copied by means of being "returned" from
a subroutine. And that's a Good Thing. The only gotcha here is that the subroutine
call is definitely changing what's in the $chromosome1 variable. No problem as long
as you remember that's what's happening.
Two: Pass the Subroutine a Reference to the Data
Here's a second way to use references as subroutine arguments to good advantage.
Let's say you have a chromosome's worth of DNA in a variable $chromosome1, and
you want to pass it to a subroutine that counts how many As, Cs, Gs, and Ts there
are in it. (This might be important if you were looking for genes in the chromosome,
for instance -- in humans, genes tend to be GC rich.)
If you write your code like this:
my($a, $c, $g, $t) = countacgt( $chromosome1 );
then the "countacgt" subroutine is going to make a copy of the data in the argument
$chromosome1. That's a Regrettable Occurence.
On the other hand, if you pass the subroutine a reference to the data in
$chromosome1, the data will not be copied, and that's a Fortunate Happenstance:
my($a, $c, $g, $t) = countacgt( \$chromosome1 );
However, once again you'll have to be aware that the subroutine has the power to
change what's in the $chromosome1 variable, and either avoid changing it or take note of the change.
As another alternative, you could use
my($a, $c, $g, $t) = countacgt( $chromosome1 );
but then don't assign a new variable to the argument within the countacgt subroutine, like so:
my($chrom) = @_;
Instead, just use $_[0] to access the chromosome data without copying it. And that's a Perl Idiom. (For readability, you may want to add a comment explaining what kind of data $_[0] is, since you won't have an informative
variable name.)
Three: Return a Reference to the Data
Now a third and final way to use references instead of space: if you have a subroutine that collects a large amount of data, you can have it return a reference to the data instead of the data itself; this will also avoid the large string copies, which Ain't Bad:
my $chromosome1ref = get_chromosome( 1 );
Eliminating Subroutines Altogether
Organizing the code for a computation into a logical set of subroutines can make for clean, easy-to-understand, and elegant programming.
Unfortunately, it can also make your program perform much slower. Take this small example. (An exon is a stretch of a chromosome's DNA, transcribed into RNA, that contains part of the code for a gene. In organisms such as humans, most genes are composed of multiple exons separated by non-coding introns; the exons are spliced together to get the actual code for the gene):
while ((my $begin, my $end) = each %exon_endpoints) {
print get_exon($chromosome, $begin, $end), "\n\n";
}
sub get_exon {
my($chromosome, $begin, $end) = @_;
# The arguments to substr are: string, beginning, length
return substr($chromosome, $begin - 1, $end - $begin + 1);
}
This code takes the information about exon endpoints stored in the hash %exon_endpoints
(key = begin, value = end) to extract the exons from a chromosome and print them.
(You may remember from Part I that I translated between the Perl idea of location, where
the first location of a string is position 0, and the biologist's idea of location, where the
first location is position 1.) The code is short, to the point, and gets the job done.
Unfortunately, it also makes as many copies of the entire chromosome as there are
exons to print out. Ouch.
In such circumstances, you can save a boatload of pain by eliminating the subroutine entirely, like so:
while ((my $begin, my $end) = each %exon_endpoints) {
print substr($chromosome, $begin - 1, $end - $begin + 1), "\n\n";
}
The bad news: now the details of how to extract the desired exon from the chromosome
are right in the loop, instead of being nicely tucked away in the
subroutine get_exon. The good news: the program will finish running before the weekend.
Pages: 1, 2 |

