Sign In/My Account | View Cart  
advertisement


Listen Print

A Chromosome at a Time with Perl, Part 2
by James D. Tisdall | Pages: 1, 2

Sequence Motifs with Bounded Lengths

In Part I, I showed you how to search for a small pattern in a very large file of DNA data (in FASTA format) by only keeping at most two lines of the file in the program's memory at any one time.

Here is some code that generalizes that approach. It is more general because it allows you to declare the maximum and minimum pattern sizes. It uses no more than a certain maximum amount of main memory for the data at any one time. For instance, if you're looking for a pattern and you know that any match must be between 300 and 2,000 bases long, you can use this subroutine to search any amount of DNA while keeping the amount of main memory used for the DNA to within about 4,000 bytes, twice the maximum pattern size. Only matching patterns between 300 and 2,000 bases long will be reported.


#!/usr/bin/perl
#
# find_size_bounded_pattern : find a pattern known to be between a min and max length
#   Keep small size of memory but handle arbitrarily large input files
#
#  Copyright (c) 2003 James Tisdall
#

use warnings;
use strict;
use Carp;

my $pattern  = "GGTGGAC[GT].{50,1500}[AC][GT][CG]ATAT";
my $filename = "Fly.dna";
my $min      = 65;
my $max      = 1515;

my @locations = find_size_bounded_pattern($pattern, $filename, $min, $max);

print "@locations\n";

exit;

### End of main program
##################################################

##################################################
### Subroutines:

sub find_size_bounded_pattern {

    ################# Arguments:
    # $pattern   - the pattern to search for, as a regular _expression
    # $filename  - the name of the DNA fasta file (may have multiple records)
    # $min       - the shortest length of a usable match to the pattern    
    # $max       - the longest length of a usable match to the pattern    
    #################
    my($pattern, $filename, $min, $max) = @_;

    ################# Other variables:
    # $buffer    - a buffer to store the DNA text, usually (2 * $max) in length
    # $position  - the position of the beginning of the buffer in the DNA
    # @locations - the locations where the pattern was found, to be returned
    #              @locations also includes headers for each fasta record
    # $header    - the one-line fasta header for each record in a fasta file
    my $buffer = '';
    my $position = 0;
    my @locations = ();
    my $header = '';
    #################
    
    # Open the DNA file
    open(DNA,"<$filename") or croak("Cannot open $filename:$!\n");

    # Get the input lines and compute
    while(my $newline = <DNA> ) {

        # If new fasta header, reinitialize buffer and location counter
        if($newline =~ /^>/) {
            # Complete previous search in buffer which contains end of fasta record
            while($buffer =~ /$pattern/gi) {
                if($-[0] <= length($buffer) - $min) {
                    unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
                        push(@locations, $position + $-[0] + 1);
                    }
                }
            }
            # Reset $header, $buffer, $position for new fasta record
            $header = $newline;
            push(@locations, "\n$header");
            buffer = '';
	    $position = 0;

            # Get new line from file
            next;
        }

        # If new line of DNA data

        # Add the new line to the buffer
        chomp $newline;
        $buffer .= $newline;

        if(length($buffer) < (2 * $max) ) {
            next;
        }
    
        # Search for the DNA pattern
        # (Report the character at position 0 as at position 1, as usual in biology)
        while($buffer =~ /$pattern/gi) {
            if($-[0] < $max) {
                unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
                    push(@locations, $position + $-[0] + 1);
                }
            }else{
                last;
            }
        }
    
        # Reset the position counter
        # (will be accurate after you reset the buffer, next)
        $position = $position + $max;
    
        # Reset the buffer
        # Discard the first $max worth of data in the buffer
        $buffer = substr($buffer, $max);
    }

    # Complete search in final buffer
    while($buffer =~ /$pattern/gi) {
        if($-[0] <= length($buffer) - $min) {
            unless(($+[0] - $-[0] < $min) or ($+[0] - $-[0] > $max)) {
                push(@locations, $position + $-[0] + 1);
            }
        }
    }

    # Computation complete
    return @locations;
}

How the Code Works

This code gets its DNA from a FASTA-formatted file (FASTA is the most common format for a file of DNA sequence data). It would be fairly easy to rewrite this so that you could give multiple FASTA filenames on a command line and all the files would be processed by this code. As it is, it can handle a single FASTA file that contains multiple FASTA records.

The subroutine find_size_bounded_pattern returns an array of all the locations in the DNA that contain the pattern. Since the input may have several FASTA records, the one-line header of each record is also returned, to help identify the start of each new record. For instance, I tested this program on a file, Fly.dna, that contains all the chromosomes of the fruit fly, Drosophila melanogaster. In this file, each new chromosome begins with a new FASTA header, which is added to the returned array. The locations reported start afresh from 1 for each chromosome.

The pattern to be searched for is only reported if it's between a certain minimum and maximum length. Twice the maximum desired pattern length (plus the length of an input line) is the limit of the amount of DNA data that is read into the program's buffer. That way you can search a $max worth of DNA for the beginning locations of patterns that may be up to $max long.

The overall structure of this code is pretty simple, and the comments in the code should do most of the explaining. There are two situations dealt with in the loop as it reads input lines. First is when there is a new FASTA header. Then you have to finish searching in the buffer, and reset the variables to begin a search in a new sequence of DNA from a new FASTA record. Second is when there is a new line of DNA. And finally, after all the lines have been read and you exit the loop, there may still be some unsearched DNA in the buffer, so the subroutine ends by searching the DNA remaining in the last buffer.

In this code, the devil is in the details of how the specific locations and sizes are set. The intermediate level Perl programmer should be able to puzzle this out given the comments in the code. Note that after a successful pattern match the builtin variable $-[0] has the offset of the beginning of the match, and $+[0] has the offset of the end of the match. This avoids the use of the special variable $&, the use of which causes all manner of space to be used to hold this and several other special variables. But if your regular expression has any parentheses, that's enough to make the special variables and their considerable space get used too. Of course, regular expressions have their own rules of behavior, such as greedy matching and so forth, that are not addressed by this code. (Could you modify this program to find patterns that overlap each other? What happens if $max is less than the input line size? What other assumptions are made by this code?)

Profiling the Speed of your Perl Program

You can profile the speed of your Perl program fairly easily. Let's say I put the program in a file called sizebound.pl. Then I can get a report on the time the various parts of the program require by running the program like this:


[tisdall@coltrane]$ perl -d:DProf sizebound.pl 

And then getting a summary of the report (from the file tmon.out that DProf creates) like so:


[tisdall@coltrane]$ dprofpp
Total Elapsed Time =  95.1899 Seconds
  User+System Time =  94.9099 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 99.9   94.87 94.870      1   94.870 94.870  main::find_size_bounded_pattern
 0.02   0.020  0.020      3   0.0067 0.0067  main::BEGIN
 0.00   0.000 -0.000      1   0.0000      -  warnings::BEGIN
 0.00   0.000 -0.000      2   0.0000      -  Exporter::import
 0.00   0.000 -0.000      1   0.0000      -  warnings::import
 0.00   0.000 -0.000      1   0.0000      -  strict::import
 0.00   0.000 -0.000      1   0.0000      -  strict::bits

When you have lots of subroutines, this can really help you see where the most time is being spent. Here, I'm really just getting the information that the program took about a minute and a half to look for the pattern in the DNA of the fruit fly.

It's also possible to get information about the space usage of a program, but you have to use a version of Perl that was compiled with -DDEBUG, which is not usually the case. If you have such a version, then the following will give you some information:


[tisdall@coltrane]$ perl -DL sizebound.pl 

But that's enough for here and now; take a look at the Perl documentation section called perldebguts. And drive safely.

Suggested Reading

Here are some of the many books that you might find useful. I cut my teeth on the Bentley books, but the older ones are hard to find.

  • Introduction to Algorithms, Second Edition, by Cormen et al, MIT Press, 2001.
  • Writing Efficient Programs, by Jon Bentley, Prentice Hall 1982.
  • Refactoring: Improving the Design of Existing Code, by Fowler et al, Addison-Wesley, 1999.
  • Programming Pearls, Second Edition, by Jon Bentley, Prentice Hall 1999.
  • More Programming Pearls: Confessions of a Coder, by Jon Bentley, Pearson Education, 1988.

O'Reilly & Associates recently released (September 2003) Mastering Perl for Bioinformatics.