Data Munging for Non-Programming Biologists
by Amir Karger, Eitan Rubin | Pages: 1, 2, 3

Splicing the Scriptome

As it happens, the story I told you about Neeraj earlier wasn't entirely accurate. He actually wanted to print both the beginning and ending substrings from his sequences. Also, his input was in the common FASTA format, where each sequence has an ID line like >A2352334 followed by a variable number of lines with DNA letters. We don't have one tool that parses FASTA and takes out two different substrings; writing every possible combination of tools would take even longer than writing Perl 6 (ahem). Instead, again following UNIX, we leave it up to the biologist to combine the tools into problem-specific solutions. In this case, that solution would involve using the FASTA-to-table converter, followed by a tool to pull out the sequence column, and then two copies of the substring tool.

We're asking biologists to break a problem down into pieces--each of which is solvable using some set of tools--and then to string those tools together in the right order with the right parameters. That sounds an awful lot like programming, doesn't it? Although you may not think about it anymore, some of the fundamental concepts of programming are new and difficult. Luckily, it turns out that biologists learned more in grad school than how to extract things out of (reluctant) other things. In fact, they already know how to break down problems, loop, branch, and debug; instead of programming, though, they call it developing protocols. They also already have cookbooks for experimental molecular biology. Such a protocol might include lines like:

  1. Add 1 ml of such-and-such enzyme to the DNA.
  2. Incubate test tube at 90 degrees C for an hour.
  3. If the mixture turns clear, goto step 5.
  4. Repeat steps 2-3 three times.
  5. Pour liquid into a sterile bottle very carefully.

We borrowed the term "protocol" to describe an ordered set of parameterized Scriptome tools that solves a larger problem. (The right word for this is a script, but don't tell our users--they might realize they're learning how to program.) We feature some pre-written protocols on the website. Note that because each tool is a command-line command, a set of them together is really just an executable shell script.

The Scriptome may be even more than a high-level, mostly syntax-free, non-toy language for NPBs. Because it exposes the Perl directly on the website--giving new meaning to the term "open source"--some curious biologists may even start reading the short, simple, relevant examples of Perl code. (Unfortunately, putting the command into one line makes it harder to read. One of our TODOs is an Explain button next to each tool, which would show you a commented, multi-line version of each script.) From there, it's a short hop to tweaking the tools, and before you know it, we'll have more annoying newbie posts on comp.lang.perl.misc!

Intelligent Design: The Geeky Details

If you've read this far, you may have realized by now that the Scriptome is not a programming project at heart. Design, interface, documentation, and examples are as important as the programming itself, which is pretty easy. This being an article on Perl.com, though, I want to discuss the use of Perl throughout the project.

Why Perl?

Several people asked me why we didn't write the Scriptome in Python, or R, or just use UNIX sh for everything. Well, other than the obvious ("It's the One True Language!"), Perl data munges by design, it's great for fast tool development, it's portable to many platforms, and it's already installed on nearly every Unix and Mac OS X box. Moreover, the Bioperl modules offer me a huge number of tools to steal, um, reuse. Finally, Perl is the preferred language of the entire Scriptome development team (me).

What kind of Perl?

Perl allows you to write pretty impressive tools in only a couple of hundred characters, with Perl Golf tricks such as the -n option, autovivification, and the implicit $_ variable. On the other hand, we want the code to be readable, especially if we want newbies to learn from it, so we can't use too many Golf shortcuts. (For example, here's Juho Snellman's winning solution in the Perl Golf contest for a script to find the last non-zero digit of N factorial:

 #!perl -l
 $_*=$`%9e9,??for+1=~?0*$?..pop;print$`%10

Some might consider this difficult for newbies to read.)

The Scriptome Build

Even though we're trying to keep the tools pretty generic and simple, we know we'll need at least several dozen of them to be at all useful. In addition, data formats and biologists' interests will change over time. We knew we had to make the process of creating a new tool fast and automatic.

I write the tool pages in POD, which lets me use Vim rather than a fancy web-page editor. My Makefile runs pod2html to create a nice, if simple, web page that includes the table of contents for free. A Perl filter then adds a navigation bar and some simple interface-enhancing JavaScript, and makes the parameters red. I may give in and switch to a templating system, database back end, or XML eventually, and automated testing would be great. For now, keeping it simple means I can create, test, document, and publish a new tool in under an hour. (Okay, I didn't include debugging in that time.)

Perl Culture

There's lots of Perl code in the project, but I'm trying to incorporate some Perl attitude as well. The "Aha!" moment of the Scriptome came when we realized we could just post a bunch of hacked one-liners on the Web to help biologists now, rather than spend six or 12 months crafting the perfect solution. While many computational biologists focus on writing O(N) programs for sophisticated sequence analysis or gene expression studies, we're not ashamed to write glue instead; we solve the unglamorous problem of taking the output from their fancy programs and throwing it into tabular format, so that a biologist can study the results in Excel. After all, if data munging is even one step in Neeraj's pipeline, then he still can't get his paper published without these tools. Finally, we're listening aggressively to our users, because only they can tell us which easy things to make easy and which hard things to make possible.
