
Data Munging for Non-Programming Biologists

Have you ever renamed 768 files? Merged the content from 96 files into a spreadsheet? Filtered 100 lines out of a 20,000-line file?

Have you ever done these things by hand?

Disciples of laziness--one of the three Perl programmer's virtues--know that you should never repeat anything five times, let alone 768. It dismayed me to learn that biologists do this kind of thing all the time.

On the Origin of Scripts: The Problem

Experimental biologists increasingly face large sets of large files in often-incompatible formats, which they need to filter, reformat, merge, and otherwise munge (definition 3). Biologists who can't write Perl (most of them) often end up editing large files by hand. When they have the same problem a week later, they do the same thing again--or they just give up.

My job description includes helping biologists to use computers. I could just write tailored, one-off scripts for them, right? As an answer, let me tell you about Neeraj. Neeraj is a typical NPB (non-programming biologist) who works down the hall. He came into my office, saying, "I have 12,000 sequences that I need to make primers for." I said, "Make what?" (I haven't been doing biology for very long.) Luckily, we figured out pretty quickly that all he wants to do is get characters 201-400 from each DNA sequence in a file. Those of you who have been Perling for a while can do this with your eyes closed (if you can touch type):

perl -ne 'print substr($_, 200, 200), "\n"' sequences.in > primers.out

Voilà! I gave Neeraj his output file and he went away, happy, to finish building his clone army to take over the world. (Or was he going to genetically modify rice to solve world hunger? I keep forgetting.)

Unfortunately, that wasn't the end. The next day, Neeraj came back, because he also wanted primers from the back end of the sequences (substr($_, -400, 200)). Because he's doing cutting-edge research, he may have totally different requirements next month, when he finishes his experiments. With just a few people in our group supporting hundreds or even thousands of biologists, writing tailored scripts, even quick one-liners, doesn't scale. Other common solutions, such as teaching biologists Perl or creating graphical workflow managers, didn't seem to fully address the data manipulation problem, especially for occasional users, who won't be munging every day.

We need some tool that allows Neeraj, or any NPB, to munge his own data, rather than relying on (and explaining biology to) a programmer. Keeping the biologist in the loop this way gives him the best chance of applying the relevant data and algorithms to answer the right questions. The tool must be easy for a non-programmer to learn and to remember after a month harvesting fish eyes in Africa. It should also be TMTOWTDI-compliant, allowing him to play with data until he can sculpt it in the most meaningful way. While we're at it, the tool will need to evolve rapidly as biologists ask new questions and create new kinds of data at an ever-increasing rate.

When I told Neeraj's story to others in our group, they said that they have struggled with this problem for years. During one of our brainstorming sessions, my not-so-pointy-haired boss, Eitan Rubin, said, "Wouldn't it be nice if we could just give them a book of magical data-munging scripts that Just Work?" "Hm--a sort of Script Tome?" And thus the Scriptome was born. (The joke here is that every self-respecting area of study in biology these days needs to have "ome" in its name: the genome, proteome, metabolome. There's even a journal called OMICS now.)

Harnessing the Power of the Atom

The Scriptome is a cookbook for munging biological data. The cookbook model nicely fits the UNIX paradigm of small tools that do simple operations. Instead of UNIX pipes, though, we encourage the use of intermediate files to avoid errors.

We use a couple of tricks in order to make this cookbook accessible to NPBs. We use the familiar web browser as our GUI and harness the power of hyperlinking to develop a highly granular, hierarchical table of contents for the tools. This means we can include dozens to hundreds of tools, without requiring users to remember command names. Another trick is syntax highlighting. We gray out most of the Perl, to signify that reading it is optional. Parameters--such as filenames, or maximum values to filter a certain column by--we highlight in attention-getting red. Finally, we make a conscious effort to avoid computer science or invented terminology. Instead, we use language biologists find familiar. For example, tools are "atoms," rather than "snippets."

Each Scriptome tool consists of a Perl one-liner in a colored box, along with a couple of sentences of documentation (any more than that and no one will read it), and sample inputs and outputs. In order to use a tool, you:

  • Pick a tool type, perhaps "Choose" to choose certain lines or columns from a file.
  • Browse a hierarchical table of contents.
  • Cut and paste the code from the colored box onto a Unix, Mac OS X, or Windows command line. (Friendlier interfaces are in alpha testing--a later section explains more.)
  • Change red text as desired, using arrow keys or a text editor.
  • Hit Enter.
  • That's it!

Figure 1. A Scriptome tool for finding unique lines in a file.

The tool in Figure 1 reads the input line by line, using Perl's -n option, and prints each line only when it sees the value in a given, user-editable column for the first time. The substitution removes a newline, even if it's a Windows file being read on a UNIX machine. Then the line is split on tabs. A hash keeps track of unique values in the given column, deciding which lines to print. Finally, the script prints to the screen a very quick diagnostic, specifically how many lines it chose out of how many total lines it read. (Choosing all lines or zero lines may mean you're not filtering correctly.)
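
The actual code isn't reproduced in the figure here, but a one-liner along these lines behaves as described; this is a rough sketch rather than the Scriptome's real code, with the filenames and the column index (2, i.e. the third column) standing in for the red, editable parameters:

perl -ne 's/\r?\n//; @F = split /\t/; $total++;
    print "$_\n" unless $seen{$F[2]}++;   # print a line only the first time its column value appears
    END { print STDERR "Chose ", scalar(keys %seen), " out of $total lines\n" }' data.tab > unique_rows.tab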

By cutting and pasting tools like this, a biologist can perform basic data munging operations, without any programming knowledge or program installation (except for ActivePerl on Windows). Unfortunately, that's still not really enough to solve real-world problems.

Splicing the Scriptome

As it happens, the story I told you about Neeraj earlier wasn't entirely accurate. He actually wanted to print both the beginning and ending substrings from his sequences. Also, his input was in the common FASTA format, where each sequence has an ID line like >A2352334 followed by a variable number of lines with DNA letters. We don't have one tool that parses FASTA and takes out two different substrings; writing every possible combination of tools would take even longer than writing Perl 6 (ahem). Instead, again following UNIX, we leave it up to the biologist to combine the tools into problem-specific solutions. In this case, that solution would involve using the FASTA-to-table converter, followed by a tool to pull out the sequence column, and then two copies of the substring tool.
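
As a rough illustration only--these are not the actual Scriptome tools, and the column numbers and filenames are my own placeholders--the protocol might look something like this, with the column-extraction and substring steps folded together:

# 1. Flatten the FASTA file into a table: one row per sequence, ID <tab> sequence
perl -ne 'chomp; if (/^>(\S+)/) { print "\n" if $id; $id = $1; print "$id\t" }
    else { print } END { print "\n" }' sequences.fasta > sequences.tab

# 2. Front primers: characters 201-400 of the sequence column
perl -ne 'chomp; @F = split /\t/; print substr($F[1], 200, 200), "\n"' sequences.tab > front_primers.out

# 3. Back primers: 200 characters starting 400 bases from the end
perl -ne 'chomp; @F = split /\t/; print substr($F[1], -400, 200), "\n"' sequences.tab > back_primers.out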

We're asking biologists to break a problem down into pieces--each of which is solvable using some set of tools--and then to string those tools together in the right order with the right parameters. That sounds an awful lot like programming, doesn't it? Although you may not think about it anymore, some of the fundamental concepts of programming are new and difficult. Luckily, it turns out that biologists learned more in grad school than how to extract things out of (reluctant) other things. In fact, they already know how to break down problems, loop, branch, and debug; instead of programming, though, they call it developing protocols. They also already have cookbooks for experimental molecular biology. Such a protocol might include lines like:

  1. Add 1 ml of such-and-such enzyme to the DNA.
  2. Incubate test tube at 90 degrees C for an hour.
  3. If the mixture turns clear, goto step 5.
  4. Repeat steps 2-3 three times.
  5. Pour liquid into a sterile bottle very carefully.

We borrowed the term "protocol" to describe an ordered set of parameterized Scriptome tools that solves a larger problem. (The right word for this is a script, but don't tell our users--they might realize they're learning how to program.) We feature some pre-written protocols on the website. Note that because each tool is a command-line command, a set of them together is really just an executable shell script.

The Scriptome may be even more than a high-level, mostly syntax-free, non-toy language for NPBs. Because it exposes the Perl directly on the website--giving new meaning to the term "open source"--some curious biologists may even start reading the short, simple, relevant examples of Perl code. (Unfortunately, putting the command into one line makes it harder to read. One of our TODOs is an Explain button next to each tool, which would show you a commented, multi-line version of each script.) From there, it's a short hop to tweaking the tools, and before you know it, we'll have more annoying newbie posts on comp.lang.perl.misc!

Intelligent Design: The Geeky Details

If you've read this far, you may have realized by now that the Scriptome is not a programming project at heart. Design, interface, documentation, and examples are as important as the programming itself, which is pretty easy. This being an article on Perl.com, though, I want to discuss the use of Perl throughout the project.

Why Perl?

Several people asked me why we didn't write the Scriptome in Python, or R, or just use UNIX sh for everything. Well, other than the obvious ("It's the One True Language!"), Perl data munges by design, it's great for fast tool development, it's portable to many platforms, and it's already installed on every Unix and Mac OS X box. Moreover, the Bioperl modules offer me a huge number of tools to steal, um, reuse. Finally, Perl is the preferred language of the entire Scriptome development team (me).

What kind of Perl?

Perl allows you to write pretty impressive tools in only a couple of hundred characters, with Perl Golf tricks such as the -n option, autovivification, and the implicit $_ variable. On the other hand, we want the code to be readable, especially if we want newbies to learn from it, so we can't use too many Golf shortcuts. (For example, here's the winning solution in the Perl Golf contest for a script to find the last non-zero digit of N factorial by Juho Snellman:

 #!perl -l
 $_*=$`%9e9,??for+1=~?0*$?..pop;print$`%10

Some might consider this difficult for newbies to read.)

The Scriptome Build

Even though we're trying to keep the tools pretty generic and simple, we know we'll need several dozen at least, to be at all useful. In addition, data formats and biologists' interests will change over time. We knew we had to make the process of creating a new tool fast and automatic.

I write the tool pages in POD, which lets me use Vim rather than a fancy web-page editor. My Makefile runs pod2html to create a nice, if simple, web page that includes the table of contents for free. A Perl filter then adds a navigation bar and some simple interface-enhancing JavaScript, and makes the parameters red. I may give in and switch to a templating system, database back end, or XML eventually, and automated testing would be great. For now, keeping it simple means I can create, test, document, and publish a new tool in under an hour. (Okay, I didn't include debugging in that time.)
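
For the curious, the whole build boils down to something like the following two commands (the file and script names here are made up for illustration, not the project's real ones):

pod2html --infile=tools.pod --outfile=tools.html --title="Scriptome Tools"
perl add_navbar.pl tools.html   # hypothetical filter: navigation bar, JavaScript, red parameters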

Perl Culture

There's lots of Perl code in the project, but I'm trying to incorporate some Perl attitude as well. The "Aha!" moment of the Scriptome came when we realized we could just post a bunch of hacked one-liners on the Web to help biologists now, rather than spend six or 12 months crafting the perfect solution. While many computational biologists focus on writing O(N) programs for sophisticated sequence analysis or gene expression studies, we're not ashamed to write glue instead; we solve the unglamorous problem of taking the output from their fancy programs and throwing it into tabular format, so that a biologist can study the results in Excel. After all, if data munging is even one step in Neeraj's pipeline, then he still can't get his paper published without these tools. Finally, we're listening aggressively to our users, because only they can tell us which easy things to make easy and which hard things to make possible.

Filling the Niche: The Scriptome and Other Solutions

One of my greatest concerns in talking to people about biologists' data munging is that people don't even realize that there's a problem, or they think it's already been solved. Biologists--who happily pipette things over and over and over again--don't realize that computers could save them lots of time. Too many programmers figure that anyone who needs to can just read Learning Perl. I'm all for that, of course, but experimental biologists need to spend much more of their time getting data (dissecting bee brains, say) than analyzing it, so they can't afford the time it takes to become programmers. They shouldn't have to. Does the average biologist need multiple inheritance, getprotobyname(), and negative look-behind regexes? There's a large body of problems out there that are too diverse for simple, inflexible tools to handle, but are too simple to need full-fledged programming.

How about teaching a three-hour course with just enough Perl to munge simple data? At minimum, it should teach variables, arrays, hashes, regular expressions, and control structures--and then there's syntax. "Wait, what's the difference between @{$a[$a]} and @a{$a[$a]} again?" "Oh, my, look at the time." As Damian Conway writes in "Seven Deadly Sins of Introductory Programming Language Design" (PDF link), syntax oddities often distract newbies from learning basic programming concepts. How much can you teach in three hours, and how much will they remember after a month without practicing?

Another route would be building a graphical program that can do everything a biologist would want, where pipelines are developed by dragging and dropping icons and connectors. Unfortunately, a comprehensive graphical environment requires a major programming effort to build, and to keep current. Not only that, but the interface for such a full-featured, graphical program will necessarily be complex, raising the learning barrier.

In building the Scriptome, we purposely narrowed our scope, to maximize learnability and memorability for occasional users. While teaching programming and graphical tools are effective solutions for some, I believe the Scriptome fills an empty niche in the data munging ecosphere (the greposphere?).

Creation Is Not Easy

How much progress have we made in addressing the problem space between tool use and programming? Our early reviews have been mostly positive, or at least constructive. Suzy, our first power user, started out skeptical, saying she'd probably have to learn Perl because any tools we gave her wouldn't be flexible enough. I encouraged her to use the Scriptome in parallel with learning Perl. She ended up a self-described "Scriptome monster," tweaking tool code and creating a 16-step protocol that did real bioinformatics. Still, one good review won't get you any Webby awards. Our first priority at this point is to build a user base and to get feedback on the learnability, memorability, and effectiveness of the website, with its 50 or so tools.

It will take more than just feedback to implement the myriad ideas we have for improving the Scriptome, which is why I'm here to make a bald-faced plea for your help. The project needs lots of new tools, new protocols, and possibly new interfaces. You, the Perl.com reader, can certainly write code for new tools; the real question is whether you (unlike certain, unnamed CPAN contributors) can also write good documentation and examples, or find bugs in early versions of tools. We would also love to get relevant protocol ideas. Check out the Scriptome project page and send something to me or the scriptome-users mailing list.

Here's a little challenge. I really did have a client who renamed 768 files by hand before I could Perl it for him. Can you write a generic renaming atom that a NPB could use? (Hint: "Tell the user to learn regular expressions" is not a valid solution.) The winner will receive a commemorative plaque (<bgcolor="gold">) on the Scriptome website.

Speaking of new interfaces, one common concern we hear from programmers is that NPBs won't be able or willing to handle the command-line paradigm shift and the few commands needed (cd, more, dir/ls) to use the Scriptome. In case our users do tell us it's a problem, we're exploring a few different ways to wrap the Scriptome tools, such as:

  • A Firefox plugin that gives you a command line in a toolbar and displays your output file in the browser. (Currently being developed by Rob Miller and his group at MIT.)
  • An Excel VBA macro that lets you put command lines into a column and creates a shell script out of them.
  • Wrapping the command-line tools in Pise (web forms around shell commands) or GenePattern (a more general GUI bio tool).

We'll probably try several of these avenues, because they allow us to keep using the command-line interface if desired.

As for the future, well, who says that only biologists are interested in munging tabular data? Certainly, chemists and astronomers could get into this. I set my sights even higher. How about a Scriptome for a business manager wanting to munge reports? An Apache Scriptome to munge your website's access logs? An iTunes Scriptome to manage your music? Let's give users the power to do what they want with their data.

Sorry, GUI Neanderthalis, but you can't adapt to today's data munging needs. Make room for Homo Scriptiens!

A Chromosome at a Time with Perl, Part 1

James D. Tisdall is the author of the soon-to-be-released Mastering Perl for Bioinformatics.

For some time now, the use of Perl in biology has been standard practice. Perl remains the most popular language among biologists for a multitude of programming tasks. The same reasons why Perl has been a success story among system administrators, as well as one of the big success stories in the early days of the Web and CGI programming, have also made it the lingua franca of programming in biology, known as bioinformatics.

One of the reasons why Perl has been equally well suited to dealing with things like DNA and protein sequence data is that it's so easy to declare and use a string. You just use it, without worrying about allocating memory, or managing memory as the string shrinks or grows. DNA and proteins and other standard biological data are almost always represented in Perl as strings, so this facility with strings translates directly into a facility with DNA and proteins.

For example, say you have a subroutine get_chromosome that returns a string of all the DNA in a human chromosome. In humans, this might be a string about 100Mb in length. This snippet of code calls get_chromosome to initialize a scalar variable, $chromosome1, with the string of DNA sequence data that summarizes human chromosome 1:

$chromosome1 = get_chromosome( 1 );

This programming is as easy as cake. I mean, simple as pie. Well, you know what I mean.

But beneath this wonderful ease of programming lurks a problem. It's a problem that can make your wonderful, intuitive code for tossing around chromosomes and genomes--which looks so elegant in your printout, and which appears so neatly divided into intuitively satisfying, interacting subroutines--an inefficient mess that barely runs at all, when it's not completely clogging up your computer.

So, in this short article I'll show you a handful of tricks that enable you to write code for dealing with large amounts of biological sequence data--in this case, very long strings--while still getting satisfactory speed from the program.

Memory is the Bottleneck

What is the problem, exactly? It usually comes down to this: by dealing with very large strings, each one of which uses a significant portion of the main memory that your computer uses to hold a running program, you can easily overtax the amount of main memory available.

When a program on your computer (a process on your Linux, Unix, or Mac OS X computer) runs out of main memory, its performance starts to seriously degrade. It may try to overcome the lack of fast and efficient main memory by enlisting a portion of disk space to hold the part of the running program that it can no longer fit.

But when a program starts writing and reading to and from hard disk memory it can get awfully slow awfully fast. And depending on the nature of the computation, the program may start "thrashing," that is, repeatedly writing and reading large amounts of data between main memory and hard disk. Your elegant program has turned into a greedy, lazy misanthrope that grabs up all the resources available and then seems to just sit there. You've created a monster!

Take the snippet of code above that calls get_chromosome. Without knowing anything more about the subroutine, it's a pretty good bet that it is fetching the 100Mb of data from somewhere, perhaps a disk file, or a relational database, or a web site. To do so, it must be using at least 100Mb of memory. Then, when it returns the data to be stored in $chromosome1, the program uses another 100Mb of memory. Now, perhaps you want to do a regular expression search on the chromosome, saving the desired expression with parentheses that set the special variables $1, $&, and so on. These special variables can be quite large, and that means use of even more memory by your program.

And since this is elegant, simple code you've written, you may well make other copies of the chromosome data or portions of it, in your tenacious hunt for the elusive cure for a deadly disease. The resulting code may be clear, straightforward to understand, and correct--all good and proper things for code to be--but the number of string copies will land you in the soup. Not only does copying a large string take up memory, but the actual copying can itself be slow, especially if there's a lot of it.

Space Efficiency

You may need to add a new constraint to your program design when you've got a large amount of data in a running program. The constraint is "Use minimal memory." Often, a program that barely runs at all and takes many hours of clogging up the computer, can be rewritten to run in a few minutes by reworking the algorithm so that it uses only a small fraction of the memory.

It's a case of decreasing time by first decreasing space. (Astrophysicists, take note.)

References

There's one easy way to cut down on the number of big strings in a program.

If you need a subroutine to return a large string, as in the get_chromosome subroutine I've used as an example, you can use references to eliminate some of this memory usage.

The practice of passing references to a subroutine is familiar to experienced Perl programmers. In our example, we can rewrite the subroutine so that the return value is placed into a string that is passed in as an argument. But we don't pass a copy of the string--we pass a reference to the string, which takes almost no additional space, and which still enables the subroutine to provide the entire chromosome 1 DNA to the calling program. Here's an example:

load_chromosome( 1, \$chromosome1 );

This new subroutine has two arguments. The 1 presumably will tell the subroutine which human chromosome we want (we want the biggest human chromosome, chromosome 1).

The second argument is a reference to a scalar variable. Inside the subroutine, the reference is most likely used to initialize an argument like this:

my($chromnumber, $chromref) = @_;

And then the DNA data is put into the string by calling it $$chromref, for instance like so:

$$chromref = 'ACGTGTGAACGGA';

No return value is needed. After the subroutine call, the main program will find that the contents of $chromosome1 have changed, and now consist of "ACGTGTGAACGGA." (Of course, a chromosome is much longer than this little fragment.)
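
Putting the pieces together, here is a sketch of what such a load_chromosome subroutine might look like if the DNA happened to live in one plain FASTA file per chromosome (the file-naming scheme and the details are my own assumptions, purely for illustration):

sub load_chromosome {
    my($chromnumber, $chromref) = @_;

    # Assume, for illustration, one FASTA file per chromosome: chr1.fa, chr2.fa, ...
    open my $fh, '<', "chr$chromnumber.fa"
        or die "Cannot open chr$chromnumber.fa: $!";

    $$chromref = '';                 # start the caller's string off empty
    while (my $line = <$fh>) {
        next if $line =~ /^>/;       # skip the FASTA header line
        chomp $line;
        $$chromref .= $line;         # append DNA directly into the caller's variable
    }
    close $fh;
}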

Using references is also a great way to pass a large amount of data into a subroutine without making copies of it. In this case, however, the fact that the subroutine can change the contents of the referenced data is something to watch out for. Sometimes you just want a subroutine to get to use the data, but you expect the variable containing the data to still have the same data after the subroutine gets a look at it. So you have to watch what you do when you're passing references to data into a subroutine, and make sure you know what you want.
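
Here is a small made-up example of the read-only case: the subroutine below only looks at the DNA through the reference, so no copy of the chromosome is ever made (the GC-content calculation and the subroutine name are mine, purely for illustration):

my $gc_fraction = gc_content( \$chromosome1 );
printf "Chromosome 1 is %.1f%% G+C\n", 100 * $gc_fraction;

sub gc_content {
    my($seqref) = @_;
    my $gc = ($$seqref =~ tr/GCgc//);   # tr/// counts matching bases without copying the string
    return $gc / length($$seqref);
}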

Managing Memory with Buffers

One of the most efficient ways to deal with very large strings is to deal with them a little at a time.

Here's an example of a program for searching an entire chromosome for a particular 12-base pattern, using very little memory. (A base is one of the four molecules that are the principal building blocks of DNA. The four bases are represented in Perl strings as the characters A, C, G, and T. You'll often hear biologists talking about "megabases" instead of "megabytes" in a string. If you hear that, you're probably talking to a bioinformatician.)

When writing a program that will search for any regular expression in a chromosome, it's hard to see how you could avoid putting the whole chromosome in a string. But very often there's a limit to the size of what you're searching for. In this program, I'm looking for the 12-base pattern "ACGTACGTACGT." And I'm going to get the chromosome data from a disk file.

My trick is going to be to just read in the chromosome data a line or two at a time, search for the pattern, and then reuse the memory to read in the next line or two of data.

The extra work I have to do in programming is, first, I need to keep track myself of how much of the data has been read in, so I can report the locations in the chromosome of successful searches. Second, I need to keep aware that my pattern might start at the end of one line and complete at the beginning of the next line, so I need to make sure I search across line breaks as well as within lines of data from the input file.

Here's a small program that reads in a FASTA file given as an argument on the command line and searches for my pattern in any amount of DNA--a whole chromosome, a whole genome, even all known genetic data, just assuming that the data is in a FASTA file named in the command line. I'll call my program find_fragment, and assuming the DNA is in a FASTA file called human.dna, I'll call it like so:

[tisdall@coltrane]$ perl find_fragment human.dna

For testing purposes I made a very short FASTA DNA file, human.dna, which contains:

> human dna--Not!  The fragment ACGTACGTACGT appears at positions 10, 40, and 98
AAAAAAAAAACGTACGTACGTCCGCGCGCGCGCGCGCGCACGTACGTACG
TGGGGGGGGGGGGGGGCCCCCCCCCCGGGGGGGGGGGGAAAAAAAAAACG
TACGTACGTTTTTTTTTTTTTTTTTTTTTTTTTTT

Here's the code for the program find_fragment:

#!/usr/bin/perl

#
# find_fragment : find 'ACGTACGTACGT' in a very large DNA FASTA file 
# using minimal memory
#
#  N.B. This example program does no checking of the input to ensure 
#       that it is DNA data in FASTA format; it just assumes that 
#       it is. This program also assumes there is just one FASTA
#       record in the input file.
#
#  Copyright (c) 2003 James Tisdall
#

use warnings;
use strict;
use Carp;

# Make sure the program is called with one argument, presumably a 
# FASTA file of DNA
my $USAGE = "perl find_fragment file.FASTA";
unless(@ARGV == 1) { croak "$USAGE:$!\n" }

# $fragment: the pattern to search for
# $fraglen:  the length of $fragment
# $buffer:   a buffer to hold the DNA from the input file
# $position: the position of the buffer in the total DNA
my($fragment, $fraglen, $buffer, $position) = ('ACGTACGTACGT', 12, '', 0);

# The first line of a FASTA file is a header and begins with '>'
my $header = <>;

# Get the first line of DNA data, to start the ball rolling
$buffer = <>;
chomp $buffer;

# The remaining lines are DNA data ending with newlines
while(my $newline = <>) {

    # Add the new line to the buffer
    chomp $newline;
    $buffer .= $newline;

    # Search for the DNA fragment, which has a length of 12
    # (Report the character at string position 0 as being at position 1, 
    # as usual in biology)
    while($buffer =~ /$fragment/gi) {
        print "Found $fragment at position ", $position + $-[0] + 1, "\n";
    }

    # Advance the position counter to where the shortened buffer (set next) will start in the total DNA
    $position = $position + length($buffer) - $fraglen + 1;

    # Discard the data in the buffer, except for a portion at the end
    # so patterns that appear across line breaks are not missed
    $buffer = substr($buffer, length($buffer) - $fraglen + 1, $fraglen - 1);
}

Here's the output of running the command perl find_fragment human.dna:

Found ACGTACGTACGT at position 10
Found ACGTACGTACGT at position 40
Found ACGTACGTACGT at position 98

How the Code Works

After the highly recommended use strict and use warnings are turned on, and the Carp module is loaded so the program can "croak" when needed, the program variables are declared and initialized.

The first line of the FASTA file is a header and is not needed here, so it's read and not used. Then the first line of DNA data is read into the buffer and its newline character is removed. I start with this because I want to search for the fragment even if it is broken by new lines, so I'll have to look at least at the first two lines; here I get the first line, and in the while loop that follows I'll start by adding the second line to the buffer.

Then the while loop, which does the main work of the program, starts reading in the next line of the FASTA file named on the command line, in this case the FASTA file human.dna. The newline is removed with "chomp," and the new line is added to the buffer.

Then comes the short while loop that does the regular expression pattern match of the $fragment in the $buffer. It has modifiers "g" for global search (the fragment may appear more than once in the buffer); and "i" for case insensitive search, that is, either uppercase or lowercase DNA data (e.g. ACGT or acgt).

When the fragment is found the program simply prints out the position. $position holds the position of the beginning of the buffer in the total DNA, and is something I have to keep track of. $-[0] is a special variable that gives the offset of the start of the last successful pattern match in the string. I also add 1, because biologists always say that the first base in a sequence of DNA is at position 1, whereas Perl says that the first character in a string is at position 0. So I add 1 to the Perl position to get the biologist's position.

The last two lines of code adjust the position counter and then reset the buffer by eliminating the beginning part of it. The buffer is shortened so that it just keeps the part at the very end that might be part of a pattern match that crosses over the lines of the input file. This would be the tail part of the buffer that is just one base shorter than the length of the fragment.

In this way, the program keeps at most two lines' worth of DNA in $buffer, but still manages to search the entire genome (or chromosome or whatever is in the FASTA file) for the fragment. It performs very quickly, compared to a program that reads in a whole genome and blows out the memory in the process.

When You Should Bother

A space-inefficient program might well work fine on your better computers, but it won't work well at all when you need to run it on another computer with less main memory installed. Or, it might work fine on the fly genome, but slow to a crawl on the human genome.

The rule of thumb is that if you know you'll be dealing with large data sets, consider the amount of space your program uses as an important constraint when designing and coding. Then you won't have to go back and redo the entire program when a large amount of DNA gets thrown at you.

Editor's note: Stay tuned for part two in this two-part series later this month. In it, James will take a more in-depth look at space efficiency, and include a more general version of a program that uses a buffer. In particular, part two will cover running subroutines with minimal space, eliminating subroutines altogether, and sequence motifs with bounded lengths.


O'Reilly & Associates will soon release (September 2003) Mastering Perl for Bioinformatics.

Beginning Bioinformatics

Bioinformatics, the use of computers in biology research, has been increasing in importance during the past decade as the Human Genome Project went from its beginning to the announcement last year of a "draft" of the complete sequence of human DNA.

The importance of programming in biology stretches back before the previous decade. And it certainly has a significant future now that it is a recognized part of research into many areas of medicine and basic biology. This may not be news to biologists. But Perl programmers may be surprised to find that their handsome language has become one of the most popular--if not the most popular--computer languages used in bioinformatics.

My new book Beginning Perl for Bioinformatics from O'Reilly & Associates addresses the needs of biologists who want to learn Perl programming. In this article, I'm going to approach the subject from another, almost opposite, angle. I want to address the needs of Perl programmers who want to learn biology and bioinformatics.

First, let me talk about ways to go from Perl programmer to "bioinformatician". I'll describe my experience, and give some ideas for making the jump. Then, I'll try to give you a taste of modern biology by talking about some of the technology used in the sequencing of genomes.

My Experience

Bioinformaticians generally have either a biology or programming background, and then receive additional training in the other field. The common wisdom is that it's easier for biologists to pick up programming than the other way around; but, of course, it depends on the individual. How does one take the skills learned while programming in, say, the telecommunications industry, and bring them to a job programming for biology?

I used to work at Bell Labs in Murray Hill, N.J., in the Speech Research Department. It was my first computer programming job; I got to do stuff with computer sound, and learn about speech science and linguistics as well. I also got to do some computer music on the side, which was fantastic for me. I became interested in the theory of computer science, and entered academia full time for a few years.

When it became time for me to get back to a regular salary, the Human Genome Project had just started a bioinformatics lab at the university where I was studying. I had a year of molecular biology some years before as an undergraduate, but that was before the PCR technique revolutionized the field. At that time, I read Watson's classic "The Molecular Biology of the Gene" and so I had an inkling about DNA, which probably helped, and I knew I liked the subject. I went over to meet the directors and leveraged my Unix and C and Bell Labs background to get a job as the systems manager. (PCR, the polymerase chain reaction, is the way we make enough copies ("clones") of a stretch of DNA to be able to do experiments on it. After learning the basics of DNA -- keep reading! -- PCR would be a great topic to start learning about molecular biology techniques. I'll explain how in just a bit.)

In my new job I started working with bioinformatics software, both supporting and writing it. In previous years, I'd done practically no programming, having concentrated on complexity theory and parallel algorithms. Now I was plunged into a boatload of programming -- C, Prolog, Unix shell and FORTRAN were the principal languages we used. At that time, just as I was starting the job, a friend at the university pressed his copy of Programming Perl into my hands. It made a strong impression on me, and in short order I was turning to Perl for most of the programming jobs I did.


I also started hanging out with the genome project people. I took some graduate courses in human genetics and molecular biology, which helped me a lot in understanding what everyone around me was doing.

After a few years, when the genome project closed down at my university, I went to other organizations to do bioinformatics, first at a biotech startup, then at a national comprehensive cancer center, and now consulting for biology researchers. So that's my story in a nutshell, which I offer as one man's path from programming to bioinformatics.

Bringing Programming to Biology

Especially now that bioinformatics is seen as an important field, many biology researchers are adding bioinformatics to their grant proposals and research programs. I believe the kind of path that I took is even more possible now than then, simply due to the amount of bioinformatics funding and jobs that are now staffed. Find biology research organizations that are advertising for programmers, and let them know you have the programming skills and the interest in biology that would make you an asset to their work.

But what about formal training? It's true that the ideal bioinformatician has graduate degrees in both computer science and biology. But such people are extremely rare. Most workers in the field have a good mix of computer and biology skills, but their degrees tend to come from one or the other. Still, formal training in biology is a good way for a computer programmer to learn about bioinformatics, either preceding or concurrently with a job in the field.

I can understand the reluctance to face another degree. (I earned my degrees with a job and a family to support, and it was stressful at times.) Yes, it is best to get a degree if you're going to be working in biology. A master's degree is OK, but most of the best jobs go to those who have their doctorate. People with doctorates are, however, in ample supply and often get relatively low pay, as in postdoc positions that are frequently inhabited for many years. So the economic benefit of formal training in biology is not great, compared to what you may be earning as a computer expert. But at present bioinformatics pays OK.

On the other hand, to really work in biology, training is a good thing. It's a deep subject, and in many ways quite dissimilar to computer science or electrical engineering or similar fields. It has many surprises, and the whole "wet lab" experimental approach is hard to get out of books.

For self-study, there's one book that I think is a real gem for Perl programmers who want to learn about modern biology research. The book is called "Recombinant DNA," by the co-discoverer of the structure of DNA, James Watson, and his co-authors Gilman, Witkowski, and Zoller. The book was deliberately written for a wide audience, so you can start at the beginning with an explanation of what, exactly, DNA and proteins are--the two most important types of molecules in biology. But it goes on to introduce a wide range of fundamental topics in biology research, including explanations of the molecular biology laboratory techniques that form the basis of the revolution and the golden age in biology that we're now experiencing. I particularly like the use of illustrations to explain the techniques and the biology -- they're outstanding. In my jobs as manager of bioinformatics, I've always strongly urged the programmers to keep the book around and to dip into it often.

The book does have one drawback, however. It was published in 1992. Ten years is as long in biology as it is in computer technology; so "Recombinant DNA" will not go into newer stuff such as microarrays or SNPs. (And don't get the even earlier "Recombinant DNA: A Short Course" -- the 1992 edition is the one to get for now.) But what it does give you is a chance to really understand the fundamental techniques of modern molecular biology; and if you want to bring your Perl programming expertise to a biology research setting, then this is a great way to get a good start getting the general idea.

There are a few other good books out, and several more coming during the next year, in the bioinformatics field. Waterman; Mount; Grant and Ewens; Baxevanis et al.; and Pevzner have written a few of the most popular books (some more theoretical than others). My book, although for beginning programmers, may be helpful in the later chapters to get an idea of basic biological data and programs. Gibas and Jambeck's book Developing Bioinformatics Computer Skills gives a good overview of much of the software and the general computational approach that's used in bioinformatics, although it also includes some beginning topics unsuitable for the experienced programmer.

Of all the bioinformatics programs that one might want to learn about, the Perl programmer will naturally gravitate toward the Bioperl project. This is an open-source, international collaborative effort to write useful Perl bioinformatics modules, and it has reached a point during the past few years where it is quite useful stuff. The 1.0 release may be available by the time you read this. Exploring this software, available at http://www.bioperl.org, is highly recommended, with one caveat: It does not include much tutorial material, certainly not for teaching basic biology concepts. Still, you'll find lots of great stuff to explore and use in Bioperl. It's a must for the Perl bioinformatician.

Apart from self-study, you may also want to try to get into some seminars or reading groups at the local university or biotech firm, or generally meet people. If you're job hunting, then you may want to go introduce yourself to the head of the biology department at the U, and let her (yes, there are a lot of women working in biology research, a much better situation than in programming) know that you want a bioinformatics job and that you are a wizard at 1) programming in general, 2) Web programming, and 3) getting a lot out of computers for minimal money. But be prepared for them to have sticker shock when it comes to salaries. Maybe it's getting a little better now, but I've often found that biologists want to pay you about half of what you're worth on the market. Their pay level is just lower than that in computer programming. When you get to that point, you might have to be a bit hard-nosed during salary negotiations to maintain your children's nutritional requirements.

I don't know of a book or training program that's specifically targeted at programmers interested in learning about biology. However, many universities have started offering bioinformatics courses, training programs, and even degrees, and some of their course offerings are designed for the experienced programmer. You might consider attending one of the major bioinformatics conferences. However, there will be a tutorial aimed at you in the upcoming O'Reilly bioinformatics conference -- indeed, the main focus of that conference is from the programming side more than the biology side.

Apart from the upcoming O'Reilly conference already mentioned, there is the ISMB conference, the largest in the bioinformatics field, which is in Calgary this coming summer--a good place to meet people and learn. It will also play host to the Bioperl yearly meeting, which is directly on target. Actually, if you check out the presenters at the ISMB, RECOMB, or O'Reilly conferences, then you will find computer science people who are specializing in biology-related problems, as well as biologists specializing in informatics, and many of these will be lab heads or managers who maintain staffs of programmers.

The thing about biology is that it's a very large area. Most researchers stake out a claim to some particular system -- say, the regulation of nervous system development in the fly -- and work there. So it's hard to really prepare yourself for the particular biology you might find on the job. The "Recombinant DNA" book will give you an overview of some of the more important techniques that are commonly used in most labs.

A Taste of Molecular Biology

Now that I've given you my general take on how a Perl programmer could move into biology research, I'll turn my attention to two basic molecular biology techniques that are fundamental in biology research, as for instance in the Human Genome Project: restriction enzymes and cloning with PCR.

First, we have to point out that the two most important biological molecules, DNA and proteins, are both polymers, which are chains of smaller building block molecules. DNA is made out of four building blocks, the nucleotides or "bases"; proteins are made from 20 amino acids. DNA has a regular structure, usually the famous double helix of two complementary strands intertwined; whereas proteins fold up in a wide variety of ways that have an important effect on what the proteins are able to do inside the cell. DNA is the primary genetic material that transmits traits to succeeding generations. Finally, DNA contains the coded templates from which proteins are made; and proteins accomplish much of the work of the cell.

One important class of proteins are enzymes, which promote certain specific chemical reactions in the cell. In 1978 the Nobel Prize was awarded to Werner Arber, Daniel Nathans, and Hamilton Smith for their discovery and work on restriction enzymes in the 1960s and early 1970s. Restriction enzymes are a large group of enzymes that have the useful property of cutting DNA at specific locations called restriction sites. This has been exploited in several important ways. It has been an important technique in fingerprinting DNA, as is used in forensic science to identify individuals. It has been instrumental in developing physical maps, which are known positions along DNA and are used to zero in on the location of genes, and also serve as a reference point for the lengthy process of the determination of the entire sequence of bases in DNA.

Restriction enzymes are fundamental to modern biological research. To learn more about them, you could go to the REBASE restriction enzyme database where detailed information about all known restriction enzymes is collected. Many of them are easily ordered from supply houses for use in the lab.

One of the most common restriction enzymes is called EcoRI. When it finds the six bases GAATTC along a strand of DNA, it cleaves the DNA.
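
As a toy Perl illustration (my own example, not from the article), finding EcoRI sites in a stretch of DNA is a one-regex job, reporting positions the way biologists count them, starting from 1:

my $dna = 'TTGAATTCAAACCCGGGAATTCTT';   # a made-up scrap of DNA
while ($dna =~ /GAATTC/g) {
    print "EcoRI site at position ", $-[0] + 1, "\n";
}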

The other main technique I want to introduce is one already mentioned: PCR, or the polymerase chain reaction. This is the most important way that DNA samples are cloned, that is, have copies made. PCR is very powerful at this; in a short time many millions of copies of a stretch of DNA can be created, at which point there is enough of a "sample" of the DNA to perform other molecular biology experiments, such as determining what exactly is the sequence of bases in the DNA (as has been accomplished for humans in the Human Genome Project.)

PCR also won a Nobel prize for its invention, by Kary Mullis in 1983. The basic idea is quite simple. We've mentioned that the two intertwined strands of the double helix of DNA are complementary. They are different, but given one strand we know what the other strand is, as they always pair in a specific way. PCR exploits this property.
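
Here is one more small made-up Perl example: computing the complementary strand, which is exactly the pairing property PCR exploits. Read in the opposite direction, A pairs with T and C pairs with G:

my $strand = 'GAATTC';                               # the EcoRI site from above
(my $revcomp = reverse $strand) =~ tr/ACGTacgt/TGCAtgca/;
print "$strand pairs with $revcomp\n";               # GAATTC pairs with GAATTC (a palindrome)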

Motivation

It's clear that a short article is not going to get very far in introducing a major science such as biology. But I hope I've given you enough pointers to enable you to make a good start at learning about this explosive science, and about how a Perl programmer might be able to bring needed skills to the great challenge of understanding life and curing disease.

In the 10 years I've been working in biology, I've found it to be a really exciting field, very stimulating intellectually; and I've found that going to work to try to help to cure cancer, Alzheimer's disease, and others, has been very satisfying emotionally.

I wish you the very best of luck. If you make it to the O'Reilly conference, please look me up!
