Data Munging for Non-Programming Biologists
by Amir Karger, Eitan Rubin

Filling the Niche: The Scriptome and Other Solutions

One of my greatest concerns in talking about biologists' data munging is that many people don't even realize there's a problem, or they think it's already been solved. Biologists--who happily pipette things over and over and over again--don't realize that computers could save them lots of time. Too many programmers figure that anyone who needs to can just read Learning Perl. I'm all for that, of course, but experimental biologists need to spend much more of their time getting data (dissecting bee brains, say) than analyzing it, so they can't afford the time it takes to become programmers. They shouldn't have to. Does the average biologist need multiple inheritance, getprotobyname(), and negative look-behind regexes? There's a large body of problems out there that are too diverse for simple, inflexible tools to handle, but too simple to need full-fledged programming.

How about teaching a three-hour course with just enough Perl to munge simple data? At a minimum, it would have to cover variables, arrays, hashes, regular expressions, and control structures--and then there's syntax. "Wait, what's the difference between @{$a[$a]} and @a{$a[$a]} again?" "Oh, my, look at the time." As Damian Conway writes in "Seven Deadly Sins of Introductory Programming Language Design," syntax oddities often distract newbies from learning basic programming concepts. How much can you teach in three hours, and how much will they remember after a month without practicing?

Another route would be building a graphical program that can do everything a biologist would want, where pipelines are developed by dragging and dropping icons and connectors. Unfortunately, a comprehensive graphical environment requires a major programming effort to build, and to keep current. Not only that, but the interface for such a full-featured, graphical program will necessarily be complex, raising the learning barrier.

In building the Scriptome, we purposely narrowed our scope to maximize learnability and memorability for occasional users. While teaching programming and building graphical tools are both effective solutions for some, I believe the Scriptome fills an empty niche in the data munging ecosphere (the greposphere?).

Creation Is Not Easy

How much progress have we made in addressing the problem space between tool use and programming? Our early reviews have been mostly positive, or at least constructive. Suzy, our first power user, started out skeptical, saying she'd probably have to learn Perl because any tools we gave her wouldn't be flexible enough. I encouraged her to use the Scriptome in parallel with learning Perl. She ended up a self-described "Scriptome monster," tweaking tool code and creating a 16-step protocol that did real bioinformatics. Still, one good review won't get you any Webby awards. Our first priority at this point is to build a user base and to get feedback on the learnability, memorability, and effectiveness of the website, with its 50 or so tools.

It will take more than just feedback to implement the myriad ideas we have for improving the Scriptome, which is why I'm here to make a bald-faced plea for your help. The project needs lots of new tools, new protocols, and possibly new interfaces. You, the Perl.com reader, can certainly write code for new tools; the real question is whether you (unlike certain, unnamed CPAN contributors) can also write good documentation and examples, or find bugs in early versions of tools. We would also love to get relevant protocol ideas. Check out the Scriptome project page and send something to me or the scriptome-users mailing list.

Here's a little challenge. I really did have a client who renamed 768 files by hand before I could Perl it for him. Can you write a generic renaming atom that an NPB (non-programming biologist) could use? (Hint: "Tell the user to learn regular expressions" is not a valid solution.) The winner will receive a commemorative plaque (<bgcolor="gold">) on the Scriptome website.
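To make the challenge concrete, here is one rough way such a renamer might behave. It is sketched in Python rather than as a Scriptome tool, and everything in it is an assumption for illustration: the function names `plan_renames` and `apply_renames` are hypothetical, the file names are invented, and the design choice (replace a literal piece of text in each file name, preview first, never overwrite) is just one possible answer, not the winning one.

```python
import os
import tempfile


def plan_renames(directory, old_text, new_text):
    """Return (old_name, new_name) pairs for files whose names contain old_text.

    Only plain text replacement is used, so the user never has to
    write a regular expression. (Hypothetical helper for illustration.)
    """
    if not old_text:
        raise ValueError("old_text must not be empty")
    plan = []
    for name in sorted(os.listdir(directory)):
        is_file = os.path.isfile(os.path.join(directory, name))
        if is_file and old_text in name:
            plan.append((name, name.replace(old_text, new_text)))
    return plan


def apply_renames(directory, plan):
    """Carry out a plan, checking every target before touching any file."""
    targets = [new_name for _, new_name in plan]
    if len(set(targets)) != len(targets):
        raise ValueError("two files would end up with the same name")
    for new_name in targets:
        if os.path.exists(os.path.join(directory, new_name)):
            raise FileExistsError(f"{new_name} already exists; nothing renamed")
    for old_name, new_name in plan:
        os.rename(os.path.join(directory, old_name),
                  os.path.join(directory, new_name))


if __name__ == "__main__":
    # Demonstrate on throwaway files in a temporary directory.
    with tempfile.TemporaryDirectory() as workdir:
        for name in ("plate1_raw.txt", "plate2_raw.txt", "notes.txt"):
            open(os.path.join(workdir, name), "w").close()
        plan = plan_renames(workdir, "_raw", "_clean")
        for old_name, new_name in plan:
            print(old_name, "->", new_name)  # preview before renaming
        apply_renames(workdir, plan)
```

Splitting the work into a preview step and an apply step lets an occasional user see exactly what will happen to all 768 files before anything changes.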

Speaking of new interfaces, one common concern we hear from programmers is that NPBs won't be able or willing to handle the command-line paradigm shift and the few commands needed (cd, more, dir/ls) to use the Scriptome. In case our users do tell us it's a problem, we're exploring a few different ways to wrap the Scriptome tools, such as:

  • A Firefox plugin that gives you a command line in a toolbar and displays your output file in the browser. (Currently being developed by Rob Miller and his group at MIT.)
  • An Excel VBA macro that lets you put command lines into a column and creates a shell script out of them.
  • Wrapping the command-line tools in Pise (web forms around shell commands) or GenePattern (a more general GUI bio tool).
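The spreadsheet idea in the list above, a column of command lines turned into a shell script, can be sketched in a few lines. This is a Python illustration rather than VBA, and it makes several assumptions: the sheet has been exported as CSV text, the commands sit in the first column under a header row, and the function name `commands_to_script` and the sample commands are hypothetical.

```python
import csv
import io


def commands_to_script(csv_text, column=0, has_header=True):
    """Build a shell script from one column of a CSV-exported spreadsheet.

    Blank cells are skipped. (Hypothetical helper for illustration.)
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    if has_header:
        rows = rows[1:]
    lines = ["#!/bin/sh", "set -e  # stop at the first failing step"]
    for row in rows:
        if column < len(row) and row[column].strip():
            lines.append(row[column].strip())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # An invented two-step protocol with an empty row in the middle.
    sheet = (
        "command,notes\n"
        "sort input.txt > sorted.txt,sort rows\n"
        ",\n"
        "head -n 5 sorted.txt > top5.txt,keep first five\n"
    )
    print(commands_to_script(sheet), end="")
```

Keeping a notes column next to the commands would let the spreadsheet double as documentation for the protocol.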

We'll probably try several of these avenues, because each of them leaves the command-line interface available to anyone who prefers it.

As for the future, well, who says that only biologists are interested in munging tabular data? Certainly, chemists and astronomers could get into this. I set my sights even higher. How about a Scriptome for a business manager wanting to munge reports? An Apache Scriptome to munge your website's access logs? An iTunes Scriptome to manage your music? Let's give users the power to do what they want with their data.

Sorry, GUI Neanderthalis, but you can't adapt to today's data munging needs. Make room for Homo Scriptiens!