A Chromosome at a Time with Perl, Part 2
by James D. Tisdall
|
Pages: 1, 2
Sequence Motifs with Bounded Lengths
In Part I, I showed you how to search for a small pattern in a very large file of DNA data (in FASTA format) by only keeping at most two lines of the file in the program's memory at any one time.
Here is some code that generalizes that approach. It is more general because it allows you to declare the maximum and minimum pattern sizes. It uses no more than a certain maximum amount of main memory for the data at any one time. For instance, if you're looking for a pattern and you know that any match must be between 300 and 2,000 bases long, you can use this subroutine to search any amount of DNA while keeping the amount of main memory used for the DNA to within about 4,000 bytes, twice the maximum pattern size. Only matching patterns between 300 and 2,000 bases long will be reported.
#!/usr/bin/perl
#
# find_size_bounded_pattern : find a pattern known to be between a min and max length
# Keep small size of memory but handle arbitrarily large input files
#
# Copyright (c) 2003 James Tisdall
#
use warnings;
use strict;
use Carp;
my $pattern = "GGTGGAC[GT].{50,1500}[AC][GT][CG]ATAT";
my $filename = "Fly.dna";
my $min = 65;
my $max = 1515;
my @locations = find_size_bounded_pattern($pattern, $filename, $min, $max);
print "@locations\n";
exit;
### End of main program
##################################################
##################################################
### Subroutines:
sub find_size_bounded_pattern {
################# Arguments:
# $pattern - the pattern to search for, as a regular _expression
# $filename - the name of the DNA fasta file (may have multiple records)
# $min - the shortest length of a usable match to the pattern
# $max - the longest length of a usable match to the pattern
#################
my($pattern, $filename, $min, $max) = @_;
################# Other variables:
# $buffer - a buffer to store the DNA text, usually (2 * $max) in length
# $position - the position of the beginning of the buffer in the DNA
# @locations - the locations where the pattern was found, to be returned
# @locations also includes headers for each fasta record
# $header - the one-line fasta header for each record in a fasta file
my $buffer = '';
my $position = 0;
my @locations = ();
my $header = '';
#################
# Open the DNA file
open(DNA,"<$filename") or croak("Cannot open $filename:$!\n");
# Get the input lines and compute
while(my $newline = <DNA> ) {
# If new fasta header, reinitialize buffer and location counter
if($newline =~ /^>/) {
# Complete previous search in buffer which contains end of fasta record
while($buffer =~ /$pattern/gi) {
if($-[0] <= length($buffer) - $min) {
unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
push(@locations, $position + $-[0] + 1);
}
}
}
# Reset $header, $buffer, $position for new fasta record
$header = $newline;
push(@locations, "\n$header");
buffer = '';
$position = 0;
# Get new line from file
next;
}
# If new line of DNA data
# Add the new line to the buffer
chomp $newline;
$buffer .= $newline;
if(length($buffer) < (2 * $max) ) {
next;
}
# Search for the DNA pattern
# (Report the character at position 0 as at position 1, as usual in biology)
while($buffer =~ /$pattern/gi) {
if($-[0] < $max) {
unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
push(@locations, $position + $-[0] + 1);
}
}else{
last;
}
}
# Reset the position counter
# (will be accurate after you reset the buffer, next)
$position = $position + $max;
# Reset the buffer
# Discard the first $max worth of data in the buffer
$buffer = substr($buffer, $max);
}
# Complete search in final buffer
while($buffer =~ /$pattern/gi) {
if($-[0] <= length($buffer) - $min) {
unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
push(@locations, $position + $-[0] + 1);
}
}
}
# Computation complete
return @locations;
}
How the Code Works
This code gets its DNA from a FASTA-formatted file (FASTA is the most common format for a file of DNA sequence data). It would be fairly easy to rewrite this so that you could give multiple FASTA filenames on a command line and all the files would be processed by this code. As it is, it can handle a single FASTA file that contains multiple FASTA records.
The subroutine find_size_bounded_pattern returns an array of all the
locations in the DNA that
contain the pattern.
Since the input may have several FASTA records, the one-line header of
each record is also returned,
to
help identify the start of each new record. For instance, I tested this
program on a file, Fly.dna, that contains all the
chromosomes of the fruit fly, Drosophila melanogaster. In this file,
each new chromosome begins with
a new FASTA
header, which is added to the returned array. The locations reported
start afresh from 1 for each
chromosome.
The pattern to be searched for is only reported if it's between a
certain minimum and maximum
length.
Twice the maximum desired pattern length (plus the length of an input
line)
is the limit of the amount of DNA data that is read into the program's
buffer.
That way you can search a $max worth of DNA for the beginning locations
of patterns that may be up
to $max long.
The overall structure of this code is pretty simple, and the comments in the code should do most of the explaining. There are two situations dealt with in the loop as it reads input lines. First is when there is a new FASTA header. Then you have to finish searching in the buffer, and reset the variables to begin a search in a new sequence of DNA from a new FASTA record. Second is when there is a new line of DNA. And finally, after all the lines have been read and you exit the loop, there may still be some unsearched DNA in the buffer, so the subroutine ends by searching the DNA remaining in the last buffer.
In this code, the devil is in the details of how the specific locations
and sizes are set. The
intermediate level Perl
programmer should be able to puzzle this out given the comments in the
code. Note that after a
successful pattern match
the builtin variable $-[0] has the offset of the beginning of the
match, and $+[0] has the offset of
the end of the match. This
avoids the use of the special variable $&, the use of which causes all
manner of space to be used to
hold this and several
other special variables. But if your regular expression has any
parentheses, that's enough to make
the special variables and
their considerable space get used too.
Of course, regular expressions have their own rules of behavior, such
as greedy matching and
so forth, that are not addressed by this code. (Could you modify this
program to find
patterns that overlap each other? What happens if $max is less than
the input line size?
What other assumptions are made by this code?)
Profiling the Speed of your Perl Program
You can profile the speed of your Perl program fairly easily. Let's say I put the program in a file called sizebound.pl. Then I can get a report on the time the various parts of the program require by running the program like this:
[tisdall@coltrane]$ perl -d:DProf sizebound.pl
And then getting a summary of the report (from the file tmon.out that DProf creates) like so:
[tisdall@coltrane]$ dprofpp
Total Elapsed Time = 95.1899 Seconds
User+System Time = 94.9099 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
99.9 94.87 94.870 1 94.870 94.870 main::find_size_bounded_pattern
0.02 0.020 0.020 3 0.0067 0.0067 main::BEGIN
0.00 0.000 -0.000 1 0.0000 - warnings::BEGIN
0.00 0.000 -0.000 2 0.0000 - Exporter::import
0.00 0.000 -0.000 1 0.0000 - warnings::import
0.00 0.000 -0.000 1 0.0000 - strict::import
0.00 0.000 -0.000 1 0.0000 - strict::bits
When you have lots of subroutines, this can really help you see where the most time is being spent. Here, I'm really just getting the information that the program took about a minute and a half to look for the pattern in the DNA of the fruit fly.
It's also possible to get information about the space usage of a program, but you have to use a version of Perl that was compiled with -DDEBUG, which is not usually the case. If you have such a version, then the following will give you some information:
[tisdall@coltrane]$ perl -DL sizebound.pl
But that's enough for here and now; take a look at the Perl documentation section called perldebguts. And drive safely.
Suggested Reading
Here are some of the many books that you might find useful. I cut my teeth on the Bentley books, but the older ones are hard to find.
- Introduction to Algorithms, Second Edition, by Cormen et al, MIT Press, 2001.
- Writing Efficient Programs, by Jon Bentley, Prentice Hall 1982.
- Refactoring: Improving the Design of Existing Code, by Fowler et al, Addison-Wesley, 1999.
- Programming Pearls, Second Edition, by Jon Bentley, Prentice Hall 1999.
- More Programming Pearls: Confessions of a Coder, by Jon Bentley, Pearson Education, 1988.
O'Reilly & Associates recently released (September 2003) Mastering Perl for Bioinformatics.
-
Sample Chapter 9, Introduction to Bioperl, is available free online.
-
You can also look at the Table of Contents, the Index, and the Full Description of the book.
-
For more information, or to order the book, click here.

