November 2001 Archives

Request Tracker

If you've ever had to do user support, then you'll be familiar with the following scenario: A question comes in from a user, and you write an e-mail reply. However, by the time you've hit "send," another member of your team has already dealt with the problem. Or maybe you're fortunate enough not to have to support end-users, but you have a terrible time managing the various projects that you're working on. Or maybe you just can't remember what you have to do today. How can Perl help here?

Enter Request Tracker. Request Tracker is a free trouble ticketing system written in Perl. It is used widely for everything from bug tracking and customer support to personal project management and to-do lists.

Some of you may be familiar with the first edition of Request Tracker, RT1. RT1 was successful, but internally, it was a bit crufty. RT2 is a complete rewrite designed to be much more capable, flexible and extensible.

Overview

So what is it?

On the simplest level, RT tracks requests - tickets - and correspondence about them. If you aren't already familiar with such systems, consider sending a support request to your ISP. It might go something like this:

        You: My e-mail isn't working.

        ISP: Have you looked at the FAQ list?  
		Can you be more specific about what isn't working?

        You: When I hit send in mumble mailer, it says 
		"Cannot connect to server"

        ISP: Please make sure you have your SMTP server configured 
		as documented ...

        You: Thanks!  That did it!

This exchange most likely won't happen all at once, and could happen over the span of several days involving several technicians from your ISP. Each technician will want to know the specifics of what the previous technician told you, and your response. The ISP will also want to know who is handling a particular ticket, how long it's been open, how many total tickets are open, what category your ticket falls into ("e-mail problems"), etc. This is the problem that RT was built to solve. Some companies use tools such as "Remedy" or "Clarify" for similar purposes, but RT is a free and open-source solution.

RT isn't only for help-desk situations. The same system can be used to track bugs in software, outstanding action items or any other issue. In this domain, RT overlaps with programs such as Bugzilla and Gnats. In fact, RT's being used by the Perl 6 developers to track what they need to work on.

Presentation and Interface

RT provides a natural way of displaying exchanges such as the one in the example above. Every request is displayed as a sequence of transactions, starting with the initial message that created it. A transaction can also represent a change in metadata - for instance, ownership may change as another member of the team takes over, the ticket may be re-prioritized, or its status may change to reflect whether it's new, open, resolved or waiting for more input from the requestor.

Requests live in queues, where a queue provides a loose grouping of both what the ticket is about and who is likely to respond to it. For instance, queries to the Web master of a site could end up in the "web" queue, whereas each development team would have a queue per project they were working on.

There are several ways to get data into RT. The two most popular are the Web interface and the e-mail interface. You can also have RT automatically insert data from CVS commit logs, use the command-line interface, or work through more specialized tools.

The e-mail interface is easy to use and uses a simple tag in the subject line to determine what to do with a message. This tag will generally look like [site #40] where "site" is a special tag identifying your RT setup and the number (40) is the ticket number. All e-mails sent to RT with that information in the subject will be appended to the appropriate request. Most RT systems are configured so that an e-mail without a recognized tag in the subject line automatically creates a new request.


    From: John Doe <john@doe.com>
    To: rt@myisp.com
    Subject: Re: [myisp #3120] Email doesn't work
    Date: Tue Nov 27 21:57:58 PST 2001

    I tried to configure Netscape Communicator as your instructions
    said, but I'm still getting the same error as before.  What else
    would you suggest?

The Web interface allows for more flexible display of the information about a ticket. You can enter new information about a ticket or change its metadata.

Meta-data

RT stores lots of information about each ticket. By default, it maintains the requestor, the owner (the person who's currently working on the ticket), status, subject, creation date, due date, priority, queue and links to other tickets, as well as the people who are interested in that particular ticket.

This information is important for sorting, categorization and reporting. RT also allows for custom meta-data to be added to a ticket in the form of keywords. Returning to our ISP example, they might configure their RT setup to record the platform and software being used. One reason to do this is so that specialists can be assigned to focus on issues in their areas of specialization. Similarly, you can gather statistics on, say, the number of users reporting problems using Outlook Express.

  • As a side note, it's impossible to present all the possible configurations of RT in this article. As we'll continue to see, RT is extremely flexible, and each organization will need to configure it for its particular needs.

Scrips and Templates

Out of the box, RT does not send any e-mail. The end user needs to use the Web interface to configure RT's "scrips." Scrips are a way of telling RT to trigger certain actions when particular events happen.

For example, one important scrip would be:

   When a new ticket is opened, send an e-mail to the person who created it.

This way, your requestor would get an autoresponse telling them that their query was being looked at. Another useful scrip would be:

   When someone e-mails new information into a ticket, send that information
   to everyone who is interested in the ticket.

Correspondence and Comments

One important aspect of RT is how it differentiates between correspondence and comments. Correspondence is something that is sent by one of the RT users to the requestor, to solicit further information or inform them of developments; comments are normally set up to be internal to RT users, and never sent to the end-user. Think of it as being the support team's chance to be rude to the end-user behind their back.

RT is smart. When correspondence is e-mailed out to the user, it appears as if the author had written it, but has a tweaked From: line so that replies are also sent back into RT and added to the ticket.

The ISP, using RT

Here is the above ISP example, as it might look as an RT ticket:

 Ticket #3120
 Opened: Nov 17, 2001.
 Subject: Email problems
 Requestor: John Doe <john@doe.com>
 Owner: Stef
 Current Status: Resolved

 Correspondence from John Doe <john@doe.com> on Nov 17, 2001 3pm
 > My Email Isn't Working

 Taken by Stef at Nov 17, 2001, 3:48pm

 Status Changed from New to Open by Stef at Nov 17, 2001, 3:49pm
 
 Correspondence from Stef Murky <stef@userfriendly.comic> on Nov 17, 2001 4:00pm
 > Have you looked at the FAQ list?  Can you be more specific
   about what isn't working?

 Correspondence from John Doe <john@doe.com> on Nov 18, 2001 1:00pm
 > When I hit send in mumble mailer, it says "Cannot connect to
   server"

 Comment from Stef Murky <stef@userfriendly.comic> on Nov 18, 2001 3:15pm
 > Looked at our records; it appears that John uses Netscape under Windows.

 Correspondence from Stef Murky <stef@userfriendly.comic> on Nov 18, 2001 3:19pm
 > Please make sure you have your SMTP server configured as
   documented at ...

 Correspondence from John Doe <john@doe.com> on Nov 19, 2001 9:05am
 > Thanks!  That did it!

 Status Changed from Open to Resolved by Stef

The Guts

You may be thinking that RT is a horribly complicated piece of software, impossible to understand, use and extend. If you are, you're completely wrong. Thanks to extensive modularization and a well-thought-out architecture and design, RT is quite easy to understand, use and extend.

RT2 is built upon standard modules that many people use. There isn't anything esoteric used; in fact, you may already be using many of the modules on which it depends.

You might think that installing all the dependent modules would be a chore. Again, RT is smart: it comes with a script that uses the CPAN.pm module to retrieve and install the correct versions of everything it needs.

This is probably the hardest part of installing RT (assuming you have CPAN.pm configured properly). You run "make fixdeps," and it all happens for you. It may be necessary to run it multiple times or to install one or two modules by hand, but that's easy compared to manually installing almost 30 modules:

 DBI DBIx::DataSource DBIx::SearchBuilder HTML::Entities MLDBM
 Net::Domain Net::SMTP Params::Validate HTML::Mason CGI::Cookie
 Apache::Cookie Apache::Session Date::Parse Date::Format MIME::Entity
 Mail::Mailer Getopt::Long Tie::IxHash Text::Wrapper Text::Template
 File::Spec Errno FreezeThaw File::Temp Log::Dispatch DBD::mysql or DBD::Pg

The most important module is probably DBIx::SearchBuilder. It provides a standard mechanism for persistent object-oriented storage. By using it, RT doesn't need to worry about the details of the SQL queries to access its database. The details of DBIx::SearchBuilder are beyond the scope of this article - in fact, they'll be covered in a future perl.com article - but in a nutshell, your classes will subclass the SearchBuilder class, and the module will take care of the persistence for you. SearchBuilder also makes it easy to port RT to your SQL backend of choice; since it's all done through the DBI, the architecture is completely database-independent.
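
To give a flavor of the approach before that future article arrives, here is a rough sketch of what a class built on DBIx::SearchBuilder might look like. The table, columns and connection details below are hypothetical, following the pattern in the module's documentation rather than RT's actual source:


package MyTicket;
use base qw(DBIx::SearchBuilder::Record);

#
# Tell SearchBuilder which table this record class wraps
# (the 'Tickets' table here is invented for illustration)
#
sub _Init {
    my $self   = shift;
    my $handle = shift;
    $self->Table('Tickets');
    $self->_Handle($handle);
}

package main;
use DBIx::SearchBuilder::Handle;

#
# Connect to the database; SearchBuilder writes the SQL for us
#
my $handle = DBIx::SearchBuilder::Handle->new;
$handle->Connect( Driver   => 'mysql',
                  Database => 'rt2',
                  User     => 'rt',
                  Password => 'secret' );

my $ticket = MyTicket->new($handle);
$ticket->LoadById(3120);    # SELECT ... WHERE id = 3120, behind the scenes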

Configurability & Extensibility

Ninety-five percent of RT is configured via the Web interface. Once the system is up and running, most changes can be made through the Configuration menu. Everything from adding users to configuring automatic e-mail responses can be done there.

Because the Web interface is created using HTML::Mason, it's easy to extend RT using the internal API. Whole pages or just individual Mason elements can be easily overridden or updated. Common customizations include specialized reports, user interface tweaks or new authentication systems. Any trouble ticketing system will need adapting to the needs of the local users, and RT makes it easy to do this. The API by which the Mason site accesses the ticketing database is the same thing that all of the interfaces use natively, so it's possible to completely re-implement any interface from scratch - in fact, you can write your own tools to access the system quickly and easily by using the RT::* modules in your own code.

More Configuration Examples

rt.cpan.org

Recently, to demonstrate the strength of RT, as well as to provide a needed service to the community, Jesse Vincent, the author of RT, set up an RT instance for all Perl modules on CPAN.

rt.cpan.org showcases many features of RT.

  • Scalability
    There are thousands of modules on CPAN and thousands of users
  • Access Control
    Module Authors can only manage requests for their own projects
  • External Authentication
    rt.cpan.org authenticates authors from PAUSE (Perl Authors Upload Service)
  • Regular Data Import
    As items are added to CPAN, rt.cpan.org must stay up to date. Jesse has written scripts that take the CPAN data and keep RT in sync.
  • Custom User Interface
    rt.cpan.org shares the search.cpan.org design motif.

For more information on rt.cpan.org, visit it, or see this article on use.perl.

Currently, I'm using RT for several different projects. Here are a few more details about each RT setup to provide you with more ideas and examples.

  • Bug Tracking
    As mentioned above, the Perl 6 project is using RT to track bugs and to-do items. Right now the Parrot project has one queue set up; activity is a little slow at the moment, but will be ramping up as Parrot stabilizes. A custom report was written to show only items marked with the "Todo" keyword, so that we can update a Web page listing the things that need doing. The scrips are configured to keep the requestor in the loop with the progress of their issue.
  • Help-Desk
    perl.org maintains many mailing lists, which means dealing with lots of users who have trouble subscribing to, or more likely unsubscribing from, lists. The list-owner@perl.org e-mail address is filtered into RT, where a ticket is created automatically. Thus, each user's case can be tracked. For this, we wrote a special template system that allows us to easily insert common answers into correspondence.
  • Project Management
    This is similar to bug tracking, but doesn't require the same kind of notification or categorization. At perl.org we use our RT setup to track the status of a variety of internal projects, such as our CVS server and Web site development.
  • Personal Todo and Information Store
    At home, I use RT as "yet another TODO list" and information store. Instead of cluttering my e-mail inbox, or stuffing things into folders where I might forget about them, I open tickets in my "personal RT." I can then categorize them and add comments to them. The things I stick into RT range from "Remember to look at this Web site" to "Get new cellphone." For the latter, I add comments that include the details of my research. I can easily access this from work or anywhere there is Web access.

More!

There's a lot more to RT than I've covered in this article. Some things I've glossed over include:

  • Access control
    RT has a sophisticated access control system that supports different levels of access to tickets based on a user's identity, group memberships or role. You can grant permissions globally or per queue. It's possible to configure read-only access or only allow someone to see tickets they requested.
  • Command Line Interface
    There is a full-featured command line interface that allows you to do almost anything. This provides another way to script and customize things.
  • Scalability
    Request Tracker is very scalable and is being used in production environments with several tens of thousands of tickets in the database. (Not on a 486, of course.)

Future Directions

The current version of RT2 is 2.0.9. The 2.1 development series will be starting soon, leading toward 2.2. While nothing has been finalized, items on the table for 2.2 include better ACL support, more flexible keywords, asset tracking, and several other cool things.

Best Practical

Commercial support for Request Tracker is available from Best Practical Solutions, LLC. Best Practical Solutions was formed by RT's author, Jesse Vincent, to sell support and custom development for RT. They do all sorts of customization, including interfaces and custom import tools.

Other URLs

http://bestpractical.com/rt/ - RT Site
mailto:sales@bestpractical.com - Support and customization inquiries
http://www.masonhq.com/ - HTML::Mason Site
http://rt.cpan.org/ - RT for every module in CPAN

Lightweight Languages

What happens if you get a bunch of academic computer scientists and implementors of languages such as Perl, Python, Smalltalk and Curl, and lock them in a room for a day? Bringing together the academic and commercial sides of language design and implementation was the interesting premise behind last weekend's Lightweight Languages Workshop, LL1, at the AI Lab at MIT, and I'm happy to say that it wasn't the great flame-fest you might imagine.

While there were occasional jibes from all sides (mainly in our direction), it was a good-natured and enjoyable event. The one-day workshop began with a keynote by Olin Shivers, associate professor of computing at Georgia Tech, on how "Lambda" was the ultimate little language. He was, of course, teaching us about the Joys of Scheme. Olin showed us how we could use Scheme as a basis for implementing domain-specific languages, such as awk, and how, by embedding these languages in Scheme, one would have not just the power of the original language, but also the implementation language available as well. The alert reader will remember that this is something Larry spoke about in the State of the Onion this year with regard to Perl 6, and something that Damian Conway and Leon Brocard have been working on in Perl 5.

Olin demonstrated the power of his "embedding" technique by calling from Scheme to Awk and from Awk back into Scheme in the same routine. He's even written a Scheme shell along the same lines, but there was some disagreement as to whether or not being able to do shell-like things with Scheme syntax means you have a Scheme shell. Olin also brushed aside suggestions that writing Awk, Scheme and Shell in the same syntax in the same function might be potentially confusing, saying that it had never tripped him up.

Next came an enlightening panel discussion on the merits of the Worse-Is-Better philosophy; that is, the idea that doing it right is not as important as doing it right now. I spoke first, explaining how Parrot development has to strike a balance between being correct, maintainable and fast, and ended by asking why doing the right thing is almost always the opposite of doing the fast thing; Jeremy Hylton, one of the Python developers and a really nice guy, replied that Python did not have this problem - for them, correctness and maintainability were more important than performance, giving them more flexibility in terms of doing it Right.

Dan Weinreb piped up with the idea that programming implementation tends to be driven by managerial paranoia; thankfully, this is something that both the open-source language implementation community and the academic community are pretty much immune from. Nevertheless, the pressure to release can have adverse effects on future performance, as illustrated by Guy Steele, one of the designers of Java, who told us a sad story about the dangers of Worse Is Better - putting some vital component of Java (I think it was tail recursion, but it might have been something else) off in order to get release 1.0 out of the door made it much harder to implement later on. This resonated with the experience the Python implementors had adding lexical scoping, and the discussion rapidly degenerated into a debate about whether a language without lexical scoping or the lambda operator can be said to be properly thought out.

Joe Marshall gave a short talk on one element of the initial implementation of the Rebol programming language. Rebol 1.0 was slightly different from the "network programming language" that Rebol is today - it was essentially Scheme with the brackets filed off. He explained that in his original Scheme implementation, he used a terribly complicated trick to avoid outgrowing the stack; something that was completely rewritten in the next implementation. Again we found that maintainability was more important in the long run.

Jeremy Hylton then gave his presentation, on the design and implementation of Python. It was amusing but, at least to me, not entirely surprising, that Jeremy was the one developer at the conference that Dan Sugalski and I had most in common with. The implementation teams of Perl and Python are facing very much the same problems and approaching them in only very slightly different ways. That is, compared with many of the other approaches we saw presented, at least.

Then it was our turn. We were running very late by this point, so I skipped over most of my talk; Dan explained what a lightweight language usually needed from its interpreter, and how Parrot was planning to provide that. David Simmons, one of the developers of SmallScript, a "scripting" Smalltalk implementation, challenged us on the viability of JIT compiling Parrot code, something we were interested in after looking at the Mono project's JIT compiler.

We had been intrigued all day by the title of Shriram Krishnamurthi's talk, "The Swine Before Perl." We didn't know whether that referred to us or to Jeremy, who was talking before us. In the event, Shriram delivered a wonderfully interesting presentation about Scheme, and how he could apply "laziness, impatience and hubris" to Scheme programming. He mentioned a Scheme Web server that served dynamic pages many times faster than Apache/mod_perl, and a way of manipulating Scheme macros to provide a pre-compiled state-machine generator. He countered criticisms that Scheme is just lots of insane silly parentheses by demonstrating how XML was just lots of insane silly angle brackets. A fair point well made.

Waldemar Horwat led us through some of the more important design decisions involved in JavaScript version 2.0. No, don't laugh; it's a much more fully featured language than you'd think. His main point was, unsurprisingly, something else that Larry had picked up on in his State of the Onion talk: that modules need to provide ways of encoding the version of their API so as to protect their namespace from subclasses. That's to say, if there's a Perl module Foo and you subclass it to Foo::Advanced by adding a frobnitz method, then what happens when the original author of Foo produces the next version of Foo that already has a frobnitz method? Waldemar's solution was to use version-specific namespaces that only expose certain subsets of the whole API if a particular version number is asked for. This allows dependent modules to be future-proofed.

Christopher Barber gave a talk on Curl, a bizarre little language that fills the same sort of niche as Flash. David Simmons spoke about SmallScript and his work porting Smalltalk to the Microsoft CLR; he had high hopes for the way Microsoft .NET was panning out. Jonathan Bachrach, one of the AI Lab's researchers, gave a clever talk, which I missed due to being out in the corridors debating the finer points of threading and continuations with a Scheme hacker. I returned to see him explain an efficient but highly difficult-to-follow method of determining whether two classes are the same; he also said that his language compiles down to C and the C compiler is executed on the fly, since it's faster to call GCC than to start up your own virtual machine. Paul Graham rounded off the talks by describing his new dialect of Lisp, which he calls Arc. Arc is designed to be a language for "good programmers" only, and gets away without the shortcuts and safeguards that you need if you're trying to be accessible. His presentation was very entertaining, and I have high hopes for his little language.

As I've indicated, much of the interest of the workshop lay in what was going on outside the talks; Dan and I got to meet a load of interesting and clever people, and it was challenging for us to discuss our ideas with them - especially since we didn't always see eye to eye with our academic counterparts. Sadly, few people seemed to have heard much about Ruby, something they will probably come to regret in time. Dan seemed to have picked up a few more interesting technical tips, such as a way to collect reference count loops without walking all of the objects in a heap. Oh, and we found that you should pour liquid nitrogen into containers first rather than trying to make ice cream by pouring it directly into a mix of milk and butter. And that the ice cream so produced is exceptionally tasty.

But seriously, what did we learn? I think we learned that many problems that we're facing in terms of Perl implementation right now have already been thoroughly researched and dealt with as many as 30 years ago; but we also learned that if we want to get at this research, then we need to do a lot of digging. The academic community is good at solving tricky problems like threading, continuations, despatch and the like, but not very interested in working out all the implications. To bring an academic success to commercial fruition requires one, as Olin Shivers puts it, "to become Larry Wall for a year" - to take care of all the gritty implementation details, and that's not the sort of thing that gets a PhD.

So the onus is on us as serious language implementors to take the time to look into and understand the current state of the art in VM research, to avoid re-inventing the wheel. Conferences such as LL1, and the mailing list that has been established as a result of it, are a useful way for us to find out what's going on and exchange experience with the academic community, and I look forward intently to the next one!

O'Reilly & Associates was proud to be a co-sponsor of the Lightweight Languages Workshop.

Parsing Protein Domains with Perl

The Perl programming language is popular with biologists because of its practicality. In my book, Beginning Perl for Bioinformatics, I demonstrate how many of the things biologists want to write programs for are readily--even enjoyably--accomplished with Perl.

My book teaches biologists how to program in Perl, even if they have never programmed before. This article will use Perl at the level found in the middle-to-late chapters in my book, after some of the basics have been learned. However, this article can be read by biologists who do not (yet) know any programming. They should be able to skim the program code in this article, only reading the comments, to get a general feel for how Perl is used in practical applications, using real biological data.

Biological data on computers tends to be either in structured ASCII flat files--that is to say, in plain-text files--or in relational databases. Both of these data sources are easy to handle with Perl programs. For this article, I will discuss one of the flat-file data sources, the Prosite database, which contains valuable biological information about protein domains. I will demonstrate how to use Perl to extract and use the protein domain information. In Beginning Perl for Bioinformatics I also show how to work with several other similar data sources, including GenBank (Genetic Data Bank), PDB (Protein DataBank), BLAST (Basic Local Alignment Search Tool) output files, and REBASE (Restriction Enzyme Database).

What is Prosite?

Prosite stands for "A Dictionary of Protein Sites and Patterns." To learn more about the fascinating biology behind Prosite, visit the Prosite User Manual. Here's an introductory description of Prosite from the user manual:

"Prosite is a method of determining what is the function of uncharacterized proteins translated from genomic or cDNA sequences. It consists of a database of biologically significant sites and patterns formulated in such a way that with appropriate computational tools it can rapidly and reliably identify to which known family of protein (if any) the new sequence belongs."

In some cases, the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, it can be identified by the occurrence in its sequence of a particular cluster of residue types, variously known as a pattern, a motif, a signature, or a fingerprint. These motifs arise because of particular requirements on the structure of specific regions of a protein, which may be important, for example, for their binding properties, or for their enzymatic activity.

Prosite is available as a set of plain-text files that provide the data, plus documentation. The Prosite home page provides a user interface that allows you to query the database and examine the documentation. The database can also be obtained for local installation from the Prosite ftp site. Its use is free of charge for noncommercial users.

There is some fascinating and important biology involved here; and in the programs that follow there are interesting and useful Perl programming techniques. See the Prosite User Manual for the biology background, and Beginning Perl for Bioinformatics for the programming background. Or just keep reading to get a taste for what is possible when you combine programming skills with biological data.

Prosite Data

The Prosite data can be downloaded to your computer. It comes as an ASCII flat file called prosite.dat that is more than 4MB in size. A small version of this file created for this article, called prosmall.dat, is available here. This version of the data has just the first few records from the complete file, making it easier for you to download and test, and it's the file that we'll use in the code discussed later in this article.

Prosite also provides an accompanying data file, prosite.doc, which contains documentation for all the records in prosite.dat. Though we will not use it for this article, I do recommend you look at it and think about how to use the information along with the code presented here if you plan on doing more with Prosite.


O'Reilly Bioinformatics Technology Conference: James Tisdall will be speaking at O'Reilly's first Bioinformatics Technology Conference, January 28-31, 2002, in Tucson, Arizona. For more information, visit the Bioinformatics Conference Web site.


The Prosite data in prosite.dat (or our much smaller test file prosmall.dat) is organized in "records," each of which consists of several lines and always includes an ID line and a termination line containing "//". Every Prosite line begins with a two-character code that specifies the kind of data that appears on that line. Here's a breakdown, from the Prosite User Manual, of all the possible line types that a record may contain:

 ID  Identification (Begins each entry; one per entry)
 AC  Accession number (one per entry)
 DT  Date (one per entry)
 DE  Short description (one per entry)
 PA  Pattern (>=0 per entry)
 MA  Matrix/profile (>=0 per entry)
 RU  Rule (>=0 per entry)
 NR  Numerical results (>=0 per entry)
 CC  Comments (>=0 per entry)
 DR  Cross references to SWISS-PROT (>=0 per entry)
 3D  Cross references to PDB (>=0 per entry)
 DO  Pointer to the documentation file (one per entry)
 //  Termination line (Ends each entry; one per entry)

Each of these line types has certain kinds of information that are formatted in a specific manner, as is detailed in the Prosite documentation.
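
To make this concrete, here is an abridged sketch of one pattern record from prosite.dat - the record whose matches appear in this article's program output later on. A real record carries additional line types such as DT and CC, which are omitted here:


 ID   PKC_PHOSPHO_SITE; PATTERN.
 AC   PS00005;
 DE   Protein kinase C phosphorylation site.
 PA   [ST]-x-[RK].
 DO   PDOC00005;
 //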

Prosite Patterns

Let's look specifically at the Prosite patterns. These are presented in a kind of mini-language that describes a set of short stretches of protein that may be a region of known biological activity. Here's the description of the pattern "language" from the Prosite User Manual:

The PA (PAttern) lines contain the definition of a Prosite pattern. The patterns are described using the following conventions:

  • The standard IUPAC one-letter codes for the amino acids are used.
  • The symbol `x' is used for a position where any amino acid is accepted.
  • Ambiguities are indicated by listing the acceptable amino acids for a given position, between square parentheses `[ ]'. For example: [ALT] stands for Ala or Leu or Thr.
  • Ambiguities are also indicated by listing between a pair of curly brackets `{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.
  • Each element in a pattern is separated from its neighbor by a `-'.
  • Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parenthesis. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to x-x or x-x-x or x-x-x-x.
  • When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a `<' symbol or respectively ends with a `>' symbol.
  • A period ends the pattern.

Perl Subroutine to Translate Prosite Patterns into Perl Regular Expressions

In order to use this pattern data in our Perl program, we need to translate the Prosite patterns into Perl regular expressions, which are the main way that you search for patterns in data in Perl. For the sake of this article I will assume that you know the basic regular expression syntax. (If not, just read the program comments, and skip the Perl regular expressions.) As an example of what the following subroutine does, it will translate the Prosite pattern [AC]-x-V-x(4)-{ED}. into the equivalent Perl regular expression [AC].V.{4}[^ED]

Here, then, is our first Perl code, the subroutine PROSITE_2_regexp, to translate the Prosite patterns to Perl regular expressions:


#
# Calculate a Perl regular expression
#  from a PROSITE pattern
#
sub PROSITE_2_regexp {

  #
  # Collect the PROSITE pattern
  #
  my($pattern) = @_;

  #
  # Copy the pattern to a regular expression
  #
  my $regexp = $pattern;

  #
  # Now start translating the pattern to an
  #  equivalent regular expression
  #

  #
  # Remove the period at the end of the pattern
  #
  $regexp =~ s/\.$//;

  #
  # Replace 'x' with a dot '.'
  #
  $regexp =~ s/x/./g;

  #
  # Leave an ambiguity such as '[ALT]' as is.
  #   However, there are two patterns [G>] that need
  #   special treatment (and the PROSITE documentation
  #   is a bit vague, perhaps).
  #
  $regexp =~ s/\[G\>\]/(G|\$)/;
  
  #
  # Ambiguities such as {AM} translate to [^AM].
  #
  $regexp =~ s/{([A-Z]+)}/[^$1]/g;

  #
  # Remove the '-' between elements in a pattern
  #
  $regexp =~ s/-//g;

  #
  # Repetitions such as x(3) translate as x{3}
  #
  $regexp =~ s/\((\d+)\)/{$1}/g;

  #
  # Repetitions such as x(2,4) translate as x{2,4}
  #
  $regexp =~ s/\((\d+,\d+)\)/{$1}/g;

  #
  # '<' becomes '^' for "beginning of sequence"
  #
  $regexp =~ s/\</^/;

  #
  # '>' becomes '$' for "end of sequence"
  #
  $regexp =~ s/\>/\$/;

  #
  # Return the regular expression
  #
  return $regexp;
}

Subroutine PROSITE_2_regexp takes the Prosite pattern and translates its parts step by step into the equivalent Perl regular expression, as explained in the comments for the subroutine. If you do not know Perl regular expression syntax at this point, just read the comments--that is, the lines that start with the # character. That will give you the general idea of the subroutine, even if you don't know any Perl at all.
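
To see the subroutine in action on the pattern quoted earlier, a two-line test is enough (here the pattern is typed in directly rather than read from a file):


my $regexp = PROSITE_2_regexp('[AC]-x-V-x(4)-{ED}.');
print $regexp, "\n";    # prints: [AC].V.{4}[^ED]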


Learn more about the power of regular expressions from O'Reilly's Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools.


Perl Subroutine to Parse Prosite Records into Their Line Types

The other task we need to accomplish is to parse the various types of lines, so that, for instance, we can get the ID and the PA pattern lines easily. The next subroutine accomplishes this task: given a Prosite record, it returns a hash with the lines of each type indexed by a key that is the two-character "line type". The keys we'll be interested in are the ID key for the line that has the identification information; and the PA key for the line(s) that have the pattern information.

This "get_line_types" subroutine does more than we need. It makes a hash index on all the line types, not just the ID and PA lines that we'll actually use here. But that's OK. The subroutine is short and simple enough, and we may want to use it later to do things with some of the other types of lines in a Prosite record.

By building our hash to store the lines of a record, we can extract any of the data lines from the record that we like, just by giving the line type code (such as ID for identification number). We can use this hash to extract two line types that will interest us here, the ID identifier line and the PA pattern line. Then, by translating the Prosite pattern into a Perl regular expression (using our first subroutine), we will be in a position to actually look for all the patterns in a protein sequence. In other words, we will have extracted the pattern information and made it available for use in our Perl program, so we can search for the patterns in the protein sequence.


If you're interested in learning Perl, don't miss O'Reilly's best-selling Learning Perl, 3rd Edition, which has been updated to cover Perl version 5.6 and rewritten to reflect the needs of programmers learning Perl today. For a complete list of O'Reilly's books on Perl, go to perl.oreilly.com.


Here, then, is our second subroutine, which accepts a Prosite record, and returns a hash which has the lines of the record indexed by their line types:


#
# Parse a PROSITE record into "line types" hash
# 
sub get_line_types {

  #
  # Collect the PROSITE record
  #
  my($record) = @_;

  #
  # Initialize the hash
  #   key   = line type
  #   value = lines
  #
  my %line_types_hash = ();

  #
  # Split the PROSITE record to an array of lines
  #
  my @records = split(/\n/,$record);

  #
  # Loop through the lines of the PROSITE record
  #
  foreach my $line (@records) {

    #
    # Extract the 2-character name
    # of the line type
    #
    my $line_type = substr($line,0,2);

    #
    # Append the line to the hash
    # indexed by this line type
    #
    (defined $line_types_hash{$line_type})
    ?  ($line_types_hash{$line_type} .= $line)
    :  ($line_types_hash{$line_type} = $line);
  }

  #
  # Return the hash 
  #
  return %line_types_hash;
}
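
As a quick illustration, here is what the subroutine returns for a tiny hand-made record (the ID and pattern are invented, but follow the Prosite line format):


my %lines = get_line_types("ID   TEST_SITE; PATTERN.\nPA   [ST]-x-[RK].\n//\n");
print $lines{'ID'}, "\n";    # prints: ID   TEST_SITE; PATTERN.
print $lines{'PA'}, "\n";    # prints: PA   [ST]-x-[RK].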

Main Program

Now let's see the code at work. The following program uses the subroutines we've just defined to read the Prosite records one at a time from the database in the flat file prosmall.dat. It then separates the different kinds of lines (such as "PA" for pattern) and translates the patterns into regular expressions, using the subroutine PROSITE_2_regexp we already wrote. Finally, it searches for the regular expressions in the protein sequence, and reports the position of the matched pattern in the sequence.


#!/usr/bin/perl
#
# Parse patterns from the PROSITE database, and
# search for them in a protein sequence
#

#
# Turn on useful warnings and constraints
#
use strict;
use warnings;

#
# Declare variables
#

#
# The PROSITE database
#
my $prosite_file = 'prosmall.dat';

#
# A "handle" for the opened PROSITE file
#
my $prosite_filehandle; 

#
# Store each PROSITE record that is read in
#
my $record = '';

#
# The protein sequence to search
# (use "join" and "qw" to keep line length short)
#
my $protein = join '', qw(
MNIDDKLEGLFLKCGGIDEMQSSRTMVVMGGVSG
QSTVSGELQDSVLQDRSMPHQEILAADEVLQESE
MRQQDMISHDELMVHEETVKNDEEQMETHERLPQ
);

#
# open the PROSITE database or exit the program
#
open($prosite_filehandle, $prosite_file)
 or die "Cannot open PROSITE file $prosite_file";

#
# set input separator to termination line //
#
$/ = "//\n";

#
# Loop through the PROSITE records
#
while($record = <$prosite_filehandle>) {

  #
  # Parse the PROSITE record into its "line types"
  #
  my %line_types = get_line_types($record);

  #
  # Skip records without an ID (the first record)
  #
  defined $line_types{'ID'} or next;

  #
  # Skip records that are not PATTERN
  # (such as MATRIX or RULE)
  #
  $line_types{'ID'} =~ /PATTERN/ or next;

  #
  # Get the ID of this record
  #
  my $id = $line_types{'ID'};
  $id =~ s/^ID   //;
  $id =~ s/; .*//;

  #
  # Get the PROSITE pattern from the PA line(s)
  #
  my $pattern = $line_types{'PA'};
  # Remove the PA line type tag(s)
  $pattern =~ s/PA   //g;

  #
  # Calculate the Perl regular expression
  # from the PROSITE pattern
  #
  my $regexp =  PROSITE_2_regexp($pattern);

  #
  # Find the PROSITE regular expression patterns
  # in the protein sequence, and report
  #
  while ($protein =~ /$regexp/g) {
    my $position = (pos $protein) - length($&) + 1;
    print "Found $id at position $position\n";
    print "   match:   $&\n";
    print "   pattern: $pattern\n";
    print "   regexp:  $regexp\n\n";
  }
}

#
# Exit the program
#
exit;

This program is available online as the file parse_prosite. The tiny example Prosite database is available as the file prosmall.dat. If you save these files on your (Unix, Linux, Macintosh, or Windows) computer, you can enter the following command at your command-line prompt (in the same folder in which you saved the two files):


% perl parse_prosite

and it will produce the following output:


Found PKC_PHOSPHO_SITE at position 22
   match:   SSR
   pattern: [ST]-x-[RK].
   regexp:  [ST].[RK]

Found PKC_PHOSPHO_SITE at position 86
   match:   TVK
   pattern: [ST]-x-[RK].
   regexp:  [ST].[RK]

Found CK2_PHOSPHO_SITE at position 76
   match:   SHDE
   pattern: [ST]-x(2)-[DE].
   regexp:  [ST].{2}[DE]

Found MYRISTYL at position 30
   match:   GGVSGQ
   pattern: G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}.
   regexp:  G[^EDRKHPFYW].{2}[STAGCN][^P]

As you see, our short program goes through the Prosite database one record at a time, parsing each record according to the types of lines within it. If the record has an ID and a pattern, it then extracts them, creates a Perl regular expression from the pattern, and finally searches in a protein sequence for the regular expression, reporting on the patterns found.

The Next Step

This article has shown you how to take biological data from the Prosite database and use it in your own programs. With this ability, you can write programs specific to your particular research needs.

Many kinds of data discovery are possible: you could combine searches for Prosite patterns with some other computation. For instance, you may want to also search the associated genomic DNA or cDNA for restriction sites surrounding a particular Prosite pattern in the translated protein, in preparation for cloning.


James Tisdall has also written Why Biologists Want to Program Computers for oreilly.com.


While such programs are interesting in their own right, their importance in laboratory research really lies in the fact that their use can save enormous amounts of time; time which can then be used for other, less routine, tasks on which biological research critically depends.

This article gives an example of using Perl to extract and use data from a flat file database, of which there are many in biological research. In fact, some of the most important biological databases are in flat file format, including GenBank and PDB, the primary databases for DNA sequence information and for protein structures.

With the ability to write your own programs, the true power of bioinformatics can be applied in your lab. Learning the Perl programming language can give you a direct entry into this valuable new laboratory technique.


O'Reilly & Associates recently released (October 2001) Beginning Perl for Bioinformatics.

Create RSS channels from HTML news sites

Even if you haven't heard the RSS acronym before, you're likely to have used RSS in the past. Whether through the slashboxes at Slashdot or our own news summary at use.perl.org, the premise remains the same - RSS, or "Rich Site Summary," is a method used for providing an overview of the latest news to appear on a site.

RSS is an XML-based format in which the site's information is described in a way that simplifies the news down to a few key elements. In the example we're going to run through, we'll concentrate particularly on the title and link tags. If you're interested in the specifics of RSS, you can read more about it and see the full specification at netscape.com; for the purposes of this tutorial, though, I'm going to concentrate on how we can manipulate RSS with Perl and leave the RSS internals alone.
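
As a taste of what we'll be generating, a single news item in an RSS channel boils down to something like this (the headline and URL here are made up):


 <item>
   <title>Example headline goes here</title>
   <link>http://news.example.com/story/12345</link>
 </item>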

So, suppose you have a news site that requires an RSS feed - or there's a news site whose headlines you want in RSS format. O'Reilly's own Meerkat web site creates RSS descriptions of the news on other sites, and presents the latest news from all around the web as a news ticker. We're not going to use exactly the same method as Meerkat in this tutorial, but we'll use similar techniques to provide us with an RSS feed of the BBC News web site.

So, what does our Perl script need to do in order to turn the headlines on the site into an RSS channel? There are three main tasks that we'll be handling:

  • Downloading the page for our script to work with,
  • Parsing the HTML on the page to give us meaningful summary information, and
  • Encoding the summary information in RSS.

For anyone who's used the huge CPAN network of modules before, it's not going to be a huge surprise to hear that there are modules to help us accomplish each of our tasks here. We're going to be using LWP::Simple to handle downloading our page, HTML::TokeParser to parse our downloaded HTML into some meaningful English, and XML::RSS to create an RSS channel from our headlines and links.

Let's jump into the code. After declaring our use of the modules we're going to be using...


use strict;

use LWP::Simple;

use HTML::TokeParser;

use XML::RSS;

... we're ready to create each of the objects we'll be using with each module. Note that while LWP::Simple uses a procedural interface, both HTML::TokeParser and XML::RSS have object-oriented interfaces. If you aren't used to OO in Perl, Simon Cozens' recent article on Object-Oriented Perl might be a great help.


# First - LWP::Simple.  Download the page using get();.

my $content = get( "http://news.bbc.co.uk/" ) or die "Couldn't download page";



# Second - Create a TokeParser object, using our downloaded HTML.

my $stream = HTML::TokeParser->new( \$content ) or die $!;



# Finally - create the RSS object. 

my $rss = XML::RSS->new( version => '0.9' );

Our next step in laying the foundations for our script is to declare some variables and set the channel information on our RSS object. Every RSS channel carries some metadata; that's to say, data that provides information about the data it encodes. For instance, it usually carries at least the channel's name, description, and a URL link. We're going to set up the metadata by calling the channel method on our object, like this:


# Prep the RSS.

$rss->channel(

	title        => "news.bbc.co.uk",

	link         => "http://news.bbc.co.uk/",

	description  => "news.bbc.co.uk - World News from the BBC.");



# Declare variables.

my ($tag, $headline, $url);

We now have all three of our modules primed and ready for use, and can move on to step two - obtaining plaintext information from our HTML page. This is where we need to apply our analytical skills to determine the layout of the web site we're looking at, and to work out how to locate the headlines we wish to extract. The BBC's HTML layout is complex but predictable, and follows this routine:

  • A <div class="bodytext"> tag is present.
  • An <a> tag is opened, and then closed.
  • An <a> tag is opened, containing the URL we wish to grab, linking to the full text of the news article.
  • The <a> tag is closed, and a <b class="h1"> or <b class="h2"> tag is opened.
  • Our headline lies between this <b> tag and a </b> tag.

At first glance, it seems like acquiring the URL and headline in this situation is going to be awkward, but HTML::TokeParser makes light work of the page. The two important methods we are given by HTML::TokeParser are get_tag and get_trimmed_text.

get_tag skips forward in the HTML from our current position to the tag specified, and get_trimmed_text will grab plaintext from the current position to the end position specified. And so we can now translate our description of the BBC's layout into methods upon our HTML::TokeParser object, $stream.


# First indication of a headline - A <div> tag is present.

while ( $tag = $stream->get_tag("div") ) {



	# Inside this loop, $tag is at a <div> tag.

        # But does it have a class="bodytext" attribute, too?

	if ($tag->[1]{class} and $tag->[1]{class} eq 'bodytext') {



		# We do! 

                # The next step is an <a></a> set, which we aren't interested in.  

                # Let's go past it to the next <a>.

		$tag = $stream->get_tag('a'); $tag = $stream->get_tag('a');

		

		# Now, we're at the <a> with the headline in.

                # We need to put the contents of the 'href' token in $url.

		$url = $tag->[1]{href} || "--";

		

		# That's $url done.  We can move on to $headline, and <b>

		$tag = $stream->get_tag('b');



		# Now we can grab $headline, by using get_trimmed_text 

                # up to the close of the <b> tag.

		# We want <b class="h1"> or <b class="h2">.  

                # A regular expression will come in useful here. 

		$headline = $stream->get_trimmed_text('/b')
		    if ($tag->[1]{class} =~ /^h[12]$/);

We're getting there. We have the page downloaded, and we're inside a while loop that's going to grab every URL and headline pair on the page. All that's left to do is add $url and $headline to our RSS channel - but first, some tidying up...


		# We need to escape ampersands, as they start entity references in XML.

		$url =~ s/&/&amp;/g;

	

		# The <a> tags contain relative URLs - we need to qualify these.

		$url = 'http://news.bbc.co.uk'.$url;

		

		# And that's it.  We can add our pair to the RSS channel. 

		$rss->add_item( title => $headline, link => $url);

	}

}

By the time the loop has run through every <div> tag, our $rss object will contain every title and link from the page. Now all we need to do is save it somewhere, and that's done with the $rss->save method.


$rss->save("bbcnews.rss");

After executing our script, we get a 'bbcnews.rss' file dumped in the current directory, and this can be processed by any RSS parser - for example, here's my current mail client, Evolution, adding our data to the 'Executive Summary' feature.

If we make this accessible on the web, as all good RSS feeds are, we can even get at this from our Slashboxes or use.perl news boxes. If you do want to make it web accessible, however, it's probably better to periodically create the RSS file from a cron job or similar, rather than using CGI, especially if you think the RSS is going to be accessed more often than the target site will be updated.
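
For example, a crontab entry along these lines (the script path is hypothetical) would rebuild the feed twice an hour:


# Regenerate bbcnews.rss every 30 minutes
*/30 * * * * perl /home/user/rss/bbcnews.pl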

Of course, if you're creating an RSS feed for your own site, you have much greater control over the way you get your data. For instance, if you're using XML to produce your web site, you can use XML transformation techniques to produce RSS - but that's another tutorial for another time. For now, have fun, and happy spidering!

Object-Oriented Perl

I've recently started learning to play the game of Go. Go and Perl have many things in common -- the basic stuff of which they are made, the rules of the game, are relatively simple, and hide an amazing complexity of possibilities beneath the surface. But I think the most interesting thing I've found that Go and Perl have in common is that there are various different stages in your development as you learn either one. It's almost as if there are several different plateaus of experience, and you have to climb up a huge hill before getting onto the next plateau.

For instance, a Go player can play very simply and acquit himself quite decently, but to stop being a beginner and really get into the game, he has to learn how to attack and defend economically. Then, to move on to the next stage, he has to master fighting a repetitive sequence called a "ko." As I progress, I expect there to be other difficult strategies I need to master before I can become a better player.

Perl, too, is not without its plateaus of knowledge, and in my experience, the one that really separates the beginner from the intermediate programmer is an understanding of object-oriented (OO) programming. Once you've understood how to use OO Perl, the door is opened to a huge range of interesting and useful CPAN modules, new programming techniques, and mastery of the upper plateaus of Perl programming.

So what is it?

Object-oriented programming is one of those buzzwordy manager-speak phrases, but unlike most of them, it actually means something. Let's take a look at some perfectly ordinary procedural Perl code, bread and butter programming to most beginning programmers:


my $request = accept_request($client);
my $answer = process_request($request);
answer_request($client, $answer);
$new_request = redirect_request($client, $request, $new_url);

The example here is of something like a Web server: we receive a request from a client, process it in some way to obtain an answer, and send the answer to the client. Additionally, we can also redirect the request to a different URL.

The same code, written in an object-oriented style, would look a little different:


my $request = $client->accept();
$request->process();
$client->answer($request);
$new_request = $request->redirect($new_url);

What's going on here? What are these funny arrows? The thing to remember about object-oriented programming is that we're no longer passing the data around to subroutines to have the subroutines do things for us - now, we're telling the data to do things for itself. You can think of the arrows (->, formally the "method call operator") as instructions to the data. In the first line, we're telling the data that represents the client to accept a request and pass us something back.

What is this "data that represents the client," and what does it pass back? Well, if this is object-oriented programming, we can probably guess the answer: they're both objects. They look like ordinary Perl scalars, right? Well, that's just because objects really are like ordinary Perl scalars.

The only difference between $client and $request in each example is that in the object-oriented version, the scalars happen to know where to find some subroutines that they can call. (In OO speak, we call them "methods" instead of "subroutines.")

This is why we don't have to say process_request in the OO case: if we're calling the process method on something that knows it's a request, it knows that it's processing a request. Simple, eh? In OO speak, we say that the $request object is in the Request "class" -- a class is the "type of thing" that the object is, and classes are how objects locate their methods. Hence, $request->redirect and $mail->redirect will call completely different methods if $request and $mail are in different classes; what it means to redirect a Request object is very different to redirecting a Mail object.

You might wonder what's actually going on when we call a method. Since we know that methods are just the OO form of subroutines, you shouldn't be surprised to find that methods in Perl really are just subroutines. What about classes? Well, the purpose of a class is to distinguish one set of methods from another. And what's a natural way to distinguish one set of subroutines from another in Perl? You guessed it -- in Perl, classes are just packages. So if we've got an object called $request in the Request class and we call the redirect method, this is what actually happens:


# $request->redirect($new_url)

Request::redirect($request, $new_url)
    

That's right -- we just call the redirect subroutine in the appropriate package, and pass in the object along with any other parameters. Why do we pass in the object? So that redirect knows what object it's working on.
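
If you're curious what a class like Request might look like on the inside, here's a minimal sketch; the class and its behavior are invented for illustration:


package Request;

# A constructor: bless a hash reference into the class
sub new {
    my ($class, %args) = @_;
    return bless { %args }, $class;
}

# A method: the object arrives as the first argument
sub redirect {
    my ($self, $new_url) = @_;
    $self->{url} = $new_url;
    print "Redirecting to $new_url\n";
}

package main;

my $request = Request->new( url => "http://www.example.com/" );
$request->redirect("http://www.example.com/new");
# ... which is exactly Request::redirect($request, "http://www.example.com/new")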

At a very basic level, this is all OO Perl is - it's another syntax for writing subroutine calls so that it looks like you're performing actions on some data. And that, for most users of OO Perl modules, is as much as you need to know.

Why is it a win?

So if that's all it is, why does everyone think that OO Perl is the best thing since sliced bread? You'll certainly find that a whole host of interesting and useful modules out there depend on OO techniques. To understand what everyone sees in it, let's go back to procedural code for a moment. Here's something that extracts the sender and subject of a mail message:


sub mail_subject {
    my $mail = shift;
    my @lines = split /\n/, $mail;
    for (@lines) {
        return $1 if /^Subject: (.*)/;
        return if /^$/; # Blank line ends headers
    }
}
sub mail_sender {
    my $mail = shift;
    my @lines = split /\n/, $mail;
    for (@lines) {
        return $1 if /^From: (.*)/;
        return if /^$/;
    }
}

my $subject = mail_subject($mail);
my $from    = mail_sender($mail);

All well and good, but notice that we have to run through the whole mail each time we want to get new information about it. Now, it's true we could replace the body of these two subroutines with quite a complicated regular expression, but that's not the point: we're still doing more work than we ought to.

For our equivalent OO example, let's use the CPAN module Mail::Header. This takes a reference to an array of lines, and spits out a mail header object that we can then do things to.


my @lines = split /\n/, $mail;
my $header = Mail::Header->new(\@lines);

my $subject = $header->get("subject");
my $from    = $header->get("from");

Not only are we now looking at the problem from a perspective of "doing things to the header", we're also giving the module an opportunity to make this more efficient. How come?

One of the main benefits of CPAN modules is that they give us a set of functions we can call, and we don't have to care how they're implemented. OO programming calls this "abstraction" - the implementation is abstracted from the user's perspective. Similarly, we don't have to care what $header really is. It could just be our reference to an array of lines, but on the other hand, Mail::Header can do clever things with it.

In reality, $header is a hash reference under the hood. Again, we don't need to care whether it's a hash reference, an array reference or something else altogether; but because it is a hash reference, the constructor, new (a constructor is just a method that creates a new object), can do all the pre-processing on our array of lines once and for all, and then store the subject, sender and all sorts of other fields in hash keys. All that get does, essentially, is retrieve the appropriate value from the hash. This is obviously vastly more efficient than running through the whole message each time.
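To make that concrete, here's a hand-rolled sketch of the same idea. This is not Mail::Header's actual code, and the My::Header name is made up, but it shows the parse-once, look-up-many pattern:


package My::Header;

# The constructor parses the header lines once, storing each field
# in a hash; the blessed hash reference is the object.
sub new {
    my ($class, $lines) = @_;
    my %field;
    for (@$lines) {
        last if /^$/;                              # blank line ends the headers
        $field{lc $1} = $2 if /^([^:\s]+): (.*)/;  # "Subject: Hi" -> subject
    }
    return bless \%field, $class;
}

# Retrieving a field is now a single hash lookup -- no re-parsing.
sub get {
    my ($self, $tag) = @_;
    return $self->{lc $tag};
}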

That's what an object really is: something whose data the module is free to rearrange and represent however it likes, so that it's as efficient as possible to operate on in the future. You, as an end user, get the benefits of a smart implementation (assuming, of course, that the person who wrote the module is smart...) and you don't need to care about, or even actually see, what's going on underneath.

Using it

We've seen a simple use of OO techniques by using Mail::Header. Let's now look at a slightly more involved program, to solidify our knowledge. This is a very simple system information server for a Unix machine. (Don't be put off -- these programs will work on non-Unix systems as well.) Unix has a client/server protocol called "finger," by which you can contact a server and ask for information about its users. I run "finger" on my username at a local machine, and get:


% finger simon
Login name: simon       (messages off)  In real life: Simon Cozens
Office: Computing S
Directory: /v0/xzdg/simon               Shell: /usr/local/bin/bash
On since Nov  6 10:03:46                5 minutes 38 seconds Idle Time
   on pts/166 from riot-act.somewhere
On since Nov  6 12:28:08
   on pts/197 from riot-act.somewhere
Project: Hacking Perl for Sugalski
Plan:

Insert amusing anecdote here.

What we're going to do is write our own finger client, and a server which dishes out information about the current system, and we're going to do this using the object-oriented IO::Socket module. Of course, we could do this completely procedurally, using Socket.pm, but as we'll see, it's considerably easier to do it this way.

First, the client. The finger protocol, as much as we need to care about it, is really simple. The client connects and sends a line of text -- generally, a username. The server sends back some text, and then closes the connection.

By using IO::Socket to manage the connection, we can come up with something like this:


#!/usr/bin/perl
use IO::Socket::INET;

my ($host, $username) = @ARGV;

my $socket = IO::Socket::INET->new(
                        PeerAddr => $host,
                        PeerPort => "finger"
                      ) or die $!;

$socket->print($username."\n");

while ($_ = $socket->getline) {
    print;
}

This is pretty straightforward: the IO::Socket::INET constructor new gives us an object representing the connection to peer address $host on port finger. We can then call the print and getline methods to send and receive data from the connection. Compare this with the non-OO equivalent, and you may realize why people prefer dealing with objects:


#!/usr/bin/perl -w
use strict;
use Socket;
my ($remote,$port, $iaddr, $paddr, $proto, $user);

($remote, $user) = @ARGV; 

$port    = getservbyname('finger', 'tcp')   || die "no port";
$iaddr   = inet_aton($remote)               || die "no host: $remote";
$paddr   = sockaddr_in($port, $iaddr);

$proto   = getprotobyname('tcp');
socket(SOCK, PF_INET, SOCK_STREAM, $proto)  || die "socket: $!";
connect(SOCK, $paddr)                       || die "connect: $!";
print SOCK "$user\n";
while (<SOCK>) {
   print;
}

close (SOCK)            || die "close: $!";

Now, to turn to the server. We'll also use another OO module, Net::hostent, which allows us to treat the result of gethostbyaddr as an object, rather than as a list of values. This means we don't have to worry about remembering which element of the list holds which piece of information.


#!/usr/bin/perl -w
use IO::Socket;
use Net::hostent;

my $server = IO::Socket::INET->new( Proto     => 'tcp',
                                    LocalPort => 'finger',
                                    Listen    => SOMAXCONN,
                                    Reuse     => 1);
die "can't setup server" unless $server;

while (my $client = $server->accept()) {
  $client->autoflush(1);
  my $hostinfo = gethostbyaddr($client->peeraddr);
  printf "[Connect from %s]\n", $hostinfo ? $hostinfo->name : $client->peerhost;
  my $command = $client->getline();
  if    ($command =~ /^uptime/) { $client->print(`uptime`); }
  elsif ($command =~ /^date/)   { $client->print(scalar localtime, "\n"); }
  else  { $client->print("Unknown command\n"); }
  $client->close;
}

This is chock-full of OO Perl goodness -- a method call on nearly every line. We start in a very similar way to how we wrote the client: using IO::Socket::INET->new as a constructor. Did you notice anything strange about this? IO::Socket::INET is a package name, which means it must be a class, rather than an object. But we can still call methods on classes (they're generally called "class methods," for obvious reasons) and this is how most objects actually get instantiated: the class provides a method called new that produces an object for us to manipulate.
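Following the translation we saw earlier, the class method call works out as something like this (roughly, at least -- Perl will also search any parent classes to find new):


# IO::Socket::INET->new( Proto => 'tcp', ... )

IO::Socket::INET::new("IO::Socket::INET", Proto => 'tcp', ... )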

The big while loop calls the accept method that waits until a client connects and, when one does, returns another IO::Socket::INET object representing the connection to the client. We can call the client's autoflush method, which is equivalent to setting $| for its handle; the peeraddr method returns the address of the client, which we can feed to gethostbyaddr.

As we mentioned earlier, this isn't the usual Perl gethostbyaddr, but one provided by Net::hostent, and it returns yet another object! This object represents information about a given host, and we use its name method to find the host's name.

The rest isn't anything new. If you think back to our client, it sent a line and awaited a response -- so our server has to read a line, and send a response. You get bonus points for adding more possible responses to our server.
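For instance (just one possibility, and assuming the Unix who command is available), you could slot an extra branch into the if/elsif chain:


elsif ($command =~ /^who/)    { $client->print(`who`); }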

Conclusion

So there we are. We've seen a couple of examples of using object-oriented modules. It wasn't that bad, was it? Hopefully now you'll be well-enough equipped to be able to start using some of the many OO modules on CPAN for yourself.

If, on the other hand, you feel you need a little more in-depth coverage of OO Perl, you could take a look at the "perlboot," "perltoot," and "perltootc" pages in the Perl documentation. The Perl Cookbook, an invaluable book for any serious Perl programmer, has a very comprehensive and easy-to-follow treatment of OO techniques. Finally, the most in-depth treatment of all can be found in Damian Conway's "Object-Oriented Perl", which will see you through from a complete beginner way up to Perl 4 or 5 dan...
