Sign In/My Account | View Cart  
advertisement


Listen Print

Parsing Protein Domains with Perl
by James D. Tisdall | Pages: 1, 2

Perl Subroutine to Parse Prosite Records into Their Line Types

The other task we need to accomplish is to parse the various types of lines, so that, for instance, we can get the ID and the PA pattern lines easily. The next subroutine accomplishes this task: given a Prosite record, it returns a hash with the lines of each type indexed by a key that is the two-character "line type". The keys we'll be interested in are the ID key for the line that has the identification information; and the PA key for the line(s) that have the pattern information.

This "get_line_types" subroutine does more than we need. It makes a hash index on all the line types, not just the ID and PA lines that we'll actually use here. But that's OK. The subroutine is short and simple enough, and we may want to use it later to do things with some of the other types of lines in a Prosite record.

By building our hash to store the lines of a record, we can extract any of the data lines from the record that we like, just by giving the line type code (such as ID for identification number). We can use this hash to extract two line types that will interest us here, the ID identifier line and the PA pattern line. Then, by translating the Prosite pattern into a Perl regular expression (using our first subroutine), we will be in a position to actually look for all the patterns in a protein sequence. In other words, we will have extracted the pattern information and made it available for use in our Perl program, so we can search for the patterns in the protein sequence.


If you're interested in learning Perl, don't miss O'Reilly's best-selling Learning Perl, 3rd Edition, which has been updated to cover Perl version 5.6 and rewritten to reflect the needs of programmers learning Perl today. For a complete list of O'Reilly's books on Perl, go to perl.oreilly.com.


Here, then, is our second subroutine, which accepts a Prosite record, and returns a hash which has the lines of the record indexed by their line types:


#
# Parse a PROSITE record into "line types" hash
# 
sub get_line_types {

  #
  # Collect the PROSITE record
  #
  my($record) = @_;

  #
  # Initialize the hash
  #   key   = line type
  #   value = lines
  #
  my %line_types_hash = ();

  #
  # Split the PROSITE record to an array of lines
  #
  my @records = split(/\n/,$record);

  #
  # Loop through the lines of the PROSITE record
  #
  foreach my $line (@records) {

    #
    # Extract the 2-character name
    # of the line type
    #
    my $line_type = substr($line,0,2);

    #
    # Append the line to the hash
    # indexed by this line type
    #
    (defined $line_types_hash{$line_type})
    ?  ($line_types_hash{$line_type} .= $line)
    :  ($line_types_hash{$line_type} = $line);
  }

  #
  # Return the hash 
  #
  return %line_types_hash;
}

Main Program

Now let's see the code at work. The following program uses the subroutines we've just defined to read in the Prosite records one at a time from the database in the flat file prosmall.txt. It then separates the different kinds of lines (such as "PA" for pattern), and translates the patterns into regular expressions, using the subroutine PROSITE_2_regexp we already wrote. Finally, it searches for the regular expressions in the protein sequence, and reports the position of the matched pattern in the sequence.


#!/usr/bin/perl
#
# Parse patterns from the PROSITE database, and
# search for them in a protein sequence
#

#
# Turn on useful warnings and constraints
#
use strict;
use warnings;

#
# Declare variables
#

#
# The PROSITE database
#
my $prosite_file = 'prosmall.dat';

#
# A "handle" for the opened PROSITE file
#
my $prosite_filehandle; 

#
# Store each PROSITE record that is read in
#
my $record = '';

#
# The protein sequence to search
# (use "join" and "qw" to keep line length short)
#
my $protein = join '', qw(
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSG
QSTVSGELQDSVLQDRSMPHQEILAADEVLQESE
MRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
);

#
# open the PROSITE database or exit the program
#
open($prosite_filehandle, $prosite_file)
 or die "Cannot open PROSITE file $prosite_file";

#
# set input separator to termination line //
#
$/ = "//\n";

#
# Loop through the PROSITE records
#
while($record = <$prosite_filehandle>) {

  #
  # Parse the PROSITE record into its "line types"
  #
  my %line_types = get_line_types($record);

  #
  # Skip records without an ID (the first record)
  #
  defined $line_types{'ID'} or next;

  #
  # Skip records that are not PATTERN
  # (such as MATRIX or RULE)
  #
  $line_types{'ID'} =~ /PATTERN/ or next;

  #
  # Get the ID of this record
  #
  my $id = $line_types{'ID'};
  $id =~ s/^ID   //;
  $id =~ s/; .*//;

  #
  # Get the PROSITE pattern from the PA line(s)
  #
  my $pattern = $line_types{'PA'};
  # Remove the PA line type tag(s)
  $pattern =~ s/PA   //g;

  #
  # Calculate the Perl regular expression
  # from the PROSITE pattern
  #
  my $regexp =  PROSITE_2_regexp($pattern);

  #
  # Find the PROSITE regular expression patterns
  # in the protein sequence, and report
  #
  while ($protein =~ /$regexp/g) {
    my $position = (pos $protein) - length($&) +1;
    print "Found $id at position $position\n";
    print "   match:   $&\n";
    print "   pattern: $pattern\n";
    print "   regexp:  $regexp\n\n";
  }
}

#
# Exit the program
#
exit;

This program is available online as the file parse_prosite. The tiny example Prosite database is available as the file prosmall.dat. If you save these files on your (Unix, Linux, Macintosh, or Windows) computer, you can enter the following command at your command-line prompt (in the same folder in which you saved the two files):


% perl parse_prosite

and it will produce the following output:


Found PKC_PHOSPHO_SITE at position 22
   match:   SSR
   pattern: [ST]-x-[RK].
   regexp:  [ST].[RK]

Found PKC_PHOSPHO_SITE at position 86
   match:   TVK
   pattern: [ST]-x-[RK].
   regexp:  [ST].[RK]

Found CK2_PHOSPHO_SITE at position 76
   match:   SHDE
   pattern: [ST]-x(2)-[DE].
   regexp:  [ST].{2}[DE]

Found MYRISTYL at position 30
   match:   GGVSGQ
   pattern: G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}.
   regexp:  G[^EDRKHPFYW].{2}[STAGCN][^P]

As you see, our short program goes through the Prosite database one record at a time, parsing each record according to the types of lines within it. If the record has an ID and a pattern, it then extracts them, creates a Perl regular expression from the pattern, and finally searches in a protein sequence for the regular expression, reporting on the patterns found.

The Next Step

This article has shown you how to take biological data from the Prosite database and use it in your own programs. With this ability, you can write programs specific to your particular research needs.

Many kinds of data discovery are possible: you could combine searches for Prosite patterns with some other computation. For instance, you may want to also search the associated genomic DNA or cDNA for restriction sites surrounding a particular Prosite pattern in the translated protein, in preparation for cloning.


James Tisdall has also written Why Biologists Want to Program Computers for oreilly.com.


While such programs are interesting in their own right, their importance in laboratory research really lies in the fact that their use can save enormous amounts of time; time which can then be used for other, less routine, tasks on which biological research critically depends.

This article gives an example of using Perl to extract and use data from a flat file database, of which there are many in biological research. In fact, some of the most important biological databases are in flat file format, including GenBank and PDB, the primary databases for DNA sequence information and for protein structures.

With the ability to write your own programs, the true power of bioinformatics can be applied in your lab. Learning the Perl programming language can give you a direct entry into this valuable new laboratory technique.


O'Reilly & Associates recently released (October 2001) Beginning Perl for Bioinformatics.