January 2000 Archives

RSS and You

RSS is Born


Netscape had one of the first "portals" on the Web, a place where users could go to get most of their information needs fulfilled: search engines, news, email, and more. But Netscape soon wanted a portal that was more customizable by the user and contained content from any site that wanted to contribute. Hence My Netscape Network (MNN) was born.

At MNN, the user can choose what content to put on their own page—the latest headlines on Slashdot, recently uploaded files at freshmeat, or the most recent posts to a bulletin board at Network54. Each channel, as Netscape calls them, can also include information to display an image for that channel and a text input box for searching the channel's site.

The channels are described with formatted text files that are updated either at regular intervals, or whenever the site's content changes. The Netscape servers periodically download the updated channel files from the various sites providing them, and that information is then made immediately available to the users.

In addition to the list of items and associated hyperlinks, the channel files can also contain information to display a text box for a form, and an image link, as well as metadata about the site.

In order to do all this, a universal way for developers to describe their sites was needed. Netscape developed the RDF Site Summary (RSS) format, which uses XML and the Resource Description Framework (RDF), a hierarchical data model used primarily for describing web-based metadata. RSS 0.90 was the first version, released in March 1999. RSS is very simple to work with, and because it is XML, it is both human-readable and easily parsed by many different languages and programs.

The fundamental container for the RSS data is the channel. Properties within the channel include title, link, and description, most of which are optional. Optional containers inside the channel are image and textinput, each with their own properties. At least one (and up to 15) item is included in the channel. The Perl News RSS file is a typical example:

    <?xml version="1.0"?>
    <rss version="0.91">
    <channel>
    <title>Perl News</title>
    <link>http://www.news.perl.org/</link>
    <description>News for the Perl Community</description>
    <copyright>Copyright 1999, Chris Nandor</copyright>
    <pubDate>Sun, 02 Jan 2000 10:58:39 EST</pubDate>
    <lastBuildDate>Sun, 02 Jan 2000 10:58:39 EST</lastBuildDate>

    <image>
    <title>Perl News</title>
    <url>http://www.news.perl.org/perl-news-small.gif</url>
    <link>http://www.news.perl.org/</link>
    </image>

    <item>
    <title>Perl Conference Papers Deadline Extended Two Days</title>
    <link>http://www.news.perl.org/perl-news.cgi?item=...</link>
    </item>

    <item>
    <title>Robert Writes in Defense of Coding Standards</title>
    <link>http://www.news.perl.org/perl-news.cgi?item=...</link>
    </item>

    <textinput>
    <title>Search Perl News</title>
    <description>Search the Perl News database</description>
    <name>text</name>
    <link>http://www.news.perl.org/perl-news.cgi</link>
    </textinput>
    </channel>
    </rss>

Rendered on a portal page, the channel appears as a box of headlines, a search form, and a last-updated time:

 * Perl Conference Papers Deadline Extended Two Days
 * Robert Writes in Defense of Coding Standards
 * Netizen Releases Training Materials
 * New Modules 10-14 January 2000
 * PerlMonth Issue 8 Now Available
 * New Modules 5-9 January 2000
 * Linux Magazine Publishes Wall's Uncultured Perl
 * New Modules 4 January 2000

Search Perl News

January 15, 2000, 17:45 EST

RSS files like this are now used by thousands of sites on the Web. And Netscape is not the only one providing all of this content. UserLand, which had been using its own channel description format (called scriptingNews) since late 1997, was one of the early adopters of RSS for My UserLand. The important features of its scriptingNews format were integrated into RSS 0.91.

The distribution of content via channels resembles the distribution of content via cable TV in that not all channels are carried by all providers. A site that carries RSS channels related to freedom of speech might not carry the Perl News channel. Netscape carries any type of channel. UserLand carries news channels only. Slashdot (which has its own RSS channel) also allows users to customize their personal Slashdot page with other channels, which they call Slashboxes. But Slashdot has a much narrower focus than My UserLand and MNN, carrying only channels that relate to free software and hackers. xmlTree attempts to categorize as much of the XML content available on the Web as possible, much of which is RSS content.


Jonathan Eisenzopf, who has worked on a lot of XML projects for Perl and who runs the perlxml.com web site, wrote the XML::RSS module. It is based on XML::Parser, as most XML modules are, and uses an object-oriented syntax. It makes creation and parsing of RSS files easy.

To create an RSS channel, you first create an XML::RSS object:

    use XML::RSS;
    my $rss = new XML::RSS;

Then call any of the four primary methods: channel(), image(), textinput(), or add_item(). The channel() method sets up information about the channel (not all the options are included here):

    $rss->channel(
        title           => 'Perl News',
        'link'          => 'http://www.news.perl.org/',
        description     => 'News for the Perl Community',
    );

Then you can call the optional textinput() and image() methods. The image data will tell the site that uses the channel where to get the image and what to link it to. The textinput data describes the action, the submit button value, and the name of the text input box so that a form can be presented to the user.

    $rss->image(
        title   => 'Perl News',
        url     => 'http://www.news.perl.org/perl-news-small.gif',
        'link'  => 'http://www.news.perl.org/',
    );

    $rss->textinput(
        title       => 'Search Perl News',
        description => 'Search the Perl News database',
        name        => 'text',
        'link'      => 'http://www.news.perl.org/perl-news.cgi',
    );

And then, for each item to add to the channel, call the add_item() method:

    for my $i (keys %items) {
        $rss->add_item(
            title   => $items{$i},
            'link'  => "http://www.news.perl.org/perl-news.cgi?item=$i",
        );
    }

The final step is to save it. You can either get the data with the as_string() method and then save it, or just save it directly with the save() method.

    my $data = $rss->as_string;             # or ...
    $rss->save($file);

For an example of using XML::RSS to create RSS, see the source for the program that generates Perl News. The program that generates the HTML for the Perl News main page generates the RSS file at the same time.

It seems that, as with HTML, many (if not most) RSS files are created by hand, or at least using a template from a program. That is fine to do, but using XML::RSS has the advantage of creating valid, well-formed data (which is exceedingly important with XML); and, of course, as the RSS format evolves, the module can evolve with it. It's simply easier, in many cases.
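As a quick check of that claim, here is a minimal round trip, serializing a channel with XML::RSS and parsing it back; the channel data is made up for illustration, not a real feed:

```perl
use strict;
use XML::RSS;

# Build a small channel. The titles and URLs here are placeholders.
my $rss = XML::RSS->new(version => '0.91');
$rss->channel(
    title       => 'Example News',
    'link'      => 'http://www.example.org/',
    description => 'A hypothetical channel',
);
$rss->add_item(
    title  => 'First Post',
    'link' => 'http://www.example.org/1',
);

# Because XML::RSS wrote it, the output is well-formed and parses cleanly.
my $copy = XML::RSS->new;
$copy->parse($rss->as_string);
print $copy->{channel}{title}, "\n";   # Example News
```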


While it was pretty cool that RSS channels were being created for most of the sites I frequented, I was getting frustrated that the content providers out there did not provide all of the channels I wanted, or they weren't in the layout I wanted, or they just did something I didn't like. I wanted to control the content myself. That's the whole point, right? So I finally got around to doing something about it, and wrote a program for a new site I call my_portal.

The purpose of the project was to let me view the content I wanted from the sites I wanted in the format I wanted. And, of course, I wanted others to be able to do the same thing, because I was pretty sure that I was not the only one with this dilemma (and the feedback I've received confirms this). So I made a basic, lightweight program using Eisenzopf's XML::RSS. The program only does a few things: it fetches RSS channels, and it displays them on the Web.

To display the RSS channels, I need to parse them. XML::RSS and XML::Parser come to the rescue again, with the parse() and parsefile() methods:

    $rss->parse($data);                     # or ...
    $rss->parsefile("$dir/$channel.rss");

$rss is a blessed hashref, and I extract the elements as with any complex data structure:

    print qq{<A HREF="$rss->{channel}{'link'}">$rss->{channel}{title}</A>};

    for my $i (@{$rss->{items}}) {
        print qq{<A HREF="$i->{'link'}">$i->{title}</A>};
    }

    print qq{<FORM METHOD="GET" ACTION="$rss->{textinput}{'link'}">
             <INPUT TYPE="TEXT" NAME="$rss->{textinput}{name}">
             <INPUT TYPE="SUBMIT" VALUE="$rss->{textinput}{title}">
             </FORM>};

When the data structure gets confusing, I'll often print it out with Data::Dumper to get a good look at it:

    use Data::Dumper;
    print Dumper $rss;

The program's interface is fairly simple. From the command line, I can add a channel by typing its name and the URL of the RSS file. The channel is added to a DBM file, and then its RSS file is downloaded (and so is the image from the RSS file, if available). I can update each channel individually, or all at once. I set a cron job to download new data every time the big hand is on the twelve or the six, so the content is updated all day long. Using the LWP::Simple module's mirror() function, a new RSS file and image are only downloaded if the remote server says they have been modified.
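The conditional download can be sketched as follows; a file:// URL and temporary paths stand in here for a real channel URL and the program's storage directory, so the example is self-contained:

```perl
use strict;
use LWP::Simple qw(mirror);
use HTTP::Status qw(is_success RC_NOT_MODIFIED);

# Create a stand-in "remote" RSS file. In real use this would be a
# channel's URL on some web server.
my $src = '/tmp/demo-channel.rss';
open my $fh, '>', $src or die "cannot write $src: $!";
print $fh qq{<rss version="0.91"></rss>\n};
close $fh;

# mirror() sends an If-Modified-Since header, so the local copy is
# rewritten only when the server reports a newer file.
my $dest = '/tmp/demo-channel-copy.rss';
for my $pass (1, 2) {
    my $status = mirror("file://$src", $dest);
    if    ($status == RC_NOT_MODIFIED) { print "pass $pass: not modified\n" }
    elsif (is_success($status))        { print "pass $pass: downloaded\n" }
    else  { warn "pass $pass: mirror failed ($status)\n" }
}
```

On the second pass the source has not changed, so mirror() leaves the local copy alone.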

The fact that my_portal was designed primarily for my own use has not dissuaded people from requesting features for it, or me from trying to provide some of them. First, there was user configurability. A configure screen presents a form where a user can choose which channels to view, and in what order. Just for kicks, the user can also select the colors of the page. All of these things have my personal choices as the default, but the user can change them at will. The data for these preferences is stored in another DBM, and remembered through cookies. Because some users wanted to use more than one computer or browser, I went ahead and added usernames and passwords so that they could log in and use the same configuration from anywhere.
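A preferences DBM along those lines might look like this, using the core SDBM_File module; the username, separator, and paths are illustrative, not my_portal's actual layout:

```perl
use strict;
use Fcntl;        # exports O_RDWR, O_CREAT
use SDBM_File;

# One record per user, with the chosen channel order flattened
# into a single string.
my %prefs;
tie %prefs, 'SDBM_File', '/tmp/demo_prefs', O_RDWR | O_CREAT, 0640
    or die "cannot tie prefs DBM: $!";

# Store a user's channel list, then read it back.
$prefs{pudge} = join "\034", qw(perlnews slashdot freshmeat);
my @channels = split /\034/, $prefs{pudge};
print "channels for pudge: @channels\n";

untie %prefs;
```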

I also worked to make sure that my_portal constructs valid HTML. I am really annoyed by sites that don't work in certain browsers, or look bad in some, so I made sure to use valid HTML 4.0 for the entire site. It looks just fine in all browsers from Lynx to Mozilla. However, because the program pulls in content from other sources, it is possible that a page will be produced containing invalid HTML. Someone may try to embed incorrect HTML tags or entities in their RSS, or neglect to encode entities that need to be encoded.

Normally, XML::Parser will croak on many of these common problems, so they will never get to your HTML page anyway. If an item's title is "Amazon.com Sues Barnes & Noble", the ampersand will cause an exception, because it needs to be turned into an encoded entity (such as &amp;). Another problem I've run into is RSS files in Mac OS text format (using CRs instead of LFs or CRLFs), which for some odd reason were making the parser choke. So I wrote some filters that process the RSS files after LWP::Simple::mirror() downloads them, before they are passed to XML::RSS. A simple regex converts CRs and CRLFs to LFs, and then any ampersand that is not followed by [a-zA-Z0-9]+; or #\d+; is converted into &amp;.
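The ampersand rule by itself looks like this, applied here to a single title string rather than a whole file:

```perl
use strict;

# The negative lookahead skips anything that is already an entity,
# so only bare ampersands get encoded.
my $title = 'Amazon.com Sues Barnes & Noble &amp; Friends &#38; Co.';
$title =~ s/&(?![a-zA-Z0-9]+;|#\d+;)/&amp;/g;
print $title, "\n";
# Amazon.com Sues Barnes &amp; Noble &amp; Friends &#38; Co.
```

The bare ampersand is encoded, while &amp; and &#38; pass through untouched.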

        my @time = (stat $file)[8,9];
        local $^I = '.bak';
        local @ARGV = $file;
        while (<>) {
            s/\015\012?/\012/g;                    # CRs and CRLFs to LFs
            s/&(?![a-zA-Z0-9]+;|#\d+;)/&amp;/g;    # encode bare ampersands
            print;
        }
        # continued below

I need to do one more thing before I am done with the filtered file. An optional property in the channel data notes the time when the channel was published (lastBuildDate), and my_portal prints it on the web page, so users have some idea of how recently updated the channel is. If that optional data is not in the RSS file, the program uses the modification time of the RSS file on disk, which LWP::Simple::mirror() sets to whatever the remote server says the modification time of the remote file is. So before touching the original file, we use stat() to get the access and modification time. After saving the new file, we use utime() to set those values to the newly saved file, so the modification time of the file is preserved.

        # continued from above
        utime @time, $file;
        unlink "$file.bak";

You probably won't be able to construct filters to fix all potential problems with the RSS files—if they are totally broken, and your computer doesn't happen to have artificial intelligence, then there is nothing you can do—so make sure you catch exceptions when parsing XML:

    for my $channel (@channels) {
        eval { $rss->parsefile("$dir/$channel.rss") };
        warn "XML for channel $channel not well formed:\n$@"
            and next if $@;

        # do something with $rss data ...
    }

Future of my_portal

There are some features that my_portal, because of its limited design scope, won't accommodate in its current form. One suggestion was that users be able to add arbitrary RSS channels to their personal page through an external link ("click here to add Foo News to Your Portal!"). This is a fine idea, but it doesn't fit the scheme and scope of the project; the interface and backend would need to be rethunk to do something like this. It may happen in the future, though.

Also, my_portal currently does not support locking for the user's or channel DBMs. For the channel DBM, this is not a serious problem, since only one person would likely be changing that DBM anyway.

For the user's DBM, this could be a problem. Moving the data to MySQL first would solve the problem. Otherwise, locking may be added eventually. My bet is on moving to MySQL first.

The my_portal program is also a little bit slow; each time it displays the channel, it executes, reads in the user database to find the user, and then reads in all of the appropriate channels. If my_portal were to be extended for widespread use, the best thing to do would probably be to make it into a mod_perl process that uses MySQL. That would solve or significantly alleviate the speed problems and the data problems, since it would always be loaded in, have persistence to the database connection, handle simultaneous accesses, and so on.

Since it suits my personal purposes fine as it is, and since I don't get paid to work on it, I may or may not get to these and other changes in the near future. But the program is, after all, open source and available at http://www.news.perl.org/my_portal/my_portal.plx. Patches welcome!

In Defense of Coding Standards

How to Create Coding Standards that Work

The things we love most about Perl are its flexibility, its similarity to natural language, and the fact that There's More Than One Way To Do It. Of course, when I say "we" I mean Perl hackers; the implicit "them" in this case is management, people who prefer other languages, or people who have to maintain someone else's line noise.


Perl programmers tend to rebel at the idea of coding standards, or at having their creativity limited by arbitrary rules -- otherwise, they'd be coding in Python :). But I think that sometimes a little bit of consistency can be a good thing.

As Larry himself said in one of his State of the Onion talks, three virtues of coding are Diligence, Patience and Humility. Diligence (the opposite of Laziness, if you're paying attention) is necessary when you're working with other programmers. You can't afford to name your variables $foo, $bar, and $stimps_is_a_sex_goddess if someone has to come along after you and figure out what the hell you meant. This is where coding standards come in handy.

Let me tell you about my recent experiences writing coding standards.

I work in a small company with about half a dozen coders on staff. We code in languages such as Perl, Python, and C, with occasional excursions into things like SQL and non-programming languages like HTML.

We'd been working together a few months when it was decided that some development standards (slightly broader than coding standards, but mostly related to coding) would be a good idea. The difficulties we wanted to address were:

  • program design,
  • naming conventions,
  • formatting conventions,
  • documentation, and
  • licensing.

All these issues had popped up already in our few short months of working with each other, especially when one person handed a project over to another. We needed to create some standards to ensure that all our work was consistent enough for other people to follow, but we didn't want to do this at the expense of individuality or creativity. And we didn't want to insult our coders' intelligence by dictating every little thing to them.

Being the person who tends to write things in our company, I took it upon myself to put together some standards with the help of the developers. From the beginning, my plan was to set some general ground rules, then to expand on them language by language where necessary. I wanted the standards to be as brief as possible, while still conveying enough information for a hypothetical new hire to read and understand without having to guess at anything.

Here's what we came up with as our general rules:

  1. The verbosity of all names should be proportional to the scope of their use.
  2. The plurality of a variable name should reflect the plurality of the data it contains. In Perl, $name is a single name, while @names is an array of names.
  3. In general, follow the language's conventions in variable naming and other things. If the language uses variable_names_like_this, you should too. If it uses ThisKindOfName, follow that.
  4. Failing that, use UPPER_CASE for globals, StudlyCaps for classes, and lower_case for most other things. Note the distinction between words by using either underscores or StudlyCaps.
  5. Function or subroutine names should be verbs or verb clauses. It is unnecessary to start a function name with do_.
  6. Filenames should contain underscores between words, except where they are executables in $PATH. Filenames should be all lower case, except for class files, which may be in StudlyCaps if the language's common usage dictates it.

That's it. That's the core of our coding standards.

Those rules were developed during a one-hour meeting with all development staff. There's nothing there that anyone disagrees on at all, and I think that's because the rules are basically common sense.
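As a quick illustration, here is what code following those general rules might look like; all the names are made up for the example:

```perl
use strict;

# Rule 2: a plural name for plural data.
my @user_names = ('alice', 'bob');

# Rule 5: a subroutine name is a verb clause, with no do_ prefix.
sub count_users { return scalar @_ }

# Rule 1: a terse name like $n is fine in a tiny scope.
my $n = count_users(@user_names);
print "$n\n";   # prints 2
```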

Our standards then go on to give a few extra guidelines for each language. For Perl, we have the following standards:

1. Read perldoc perlstyle and follow all suggestions contained therein, except where they disagree with the general coding standards, which take precedence.

2. Use the -w command line flag and the strict pragma at all times, and -T (taint checking) where appropriate.

3. Name Perl scripts with a .pl extension, and CGI scripts with a .cgi extension. One exception: Perl scripts in $PATH may omit the .pl.

... and a few more, about one printed page in total. For instance, we have a couple of regexp-related guidelines, a couple of points about references and complex data structures (including when not to use them), and a list of our favourite modules that we recommend developers use (CGI, DBI, Text::Template, etc.).
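A minimal script honoring the first two Perl rules might start like this; the script name and body are invented for the example:

```perl
#!/usr/bin/perl -w
# word_count.pl - hypothetical script following the standards above
use strict;

my %word_counts;                             # plural name, plural data
$word_counts{lc $_}++ for split ' ', 'The quick the lazy';
print "$word_counts{the}\n";                 # prints 2
```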

Our documentation standards say to include at least a README, INSTALL and LICENSE file with each piece of software; that each source code file should include the name, author, description, version and copyright information; that any function that needs more than two lines of comments to explain what it does needs to be written more clearly; and that any more detailed documentation should be handed to professional technical writers.

Coding standards needn't be onerous. Just because there are bad coding standards out there, doesn't mean that all coding standards are bad.

I think the way to a good coding standard is to be as minimalist as possible. Anything more than a couple of pages long, or which deviates too far from common practice, will frustrate developers and won't be followed. And standards that are too detailed may obscure the fact that the code has deeper problems.

Here's a second rule: standardise early! Don't try to impose complex standards on a project or team that's been going for a long time -- the effort to bring existing code up to standard will be too great. If your standards are minimal and based on common sense, there's no reason to wait for the project to take shape or the team's preferences to become known.

If you do set standards late, don't set out on a crusade to bring existing code up to scratch. Either fix things as you come to them, or (better) rewrite from scratch. Chances are that what you had was pretty messy anyway, and could do with reworking.

Third rule? I suppose three rules is a good number. The third rule is to encourage a culture in which standards are followed, not because Standards Must Be Obeyed, but because everyone realises that things work better that way. Imagine what would happen if, for instance, mail transport agents didn't follow RFC822. MTAs follow RFC822 not because they're forced to, but because Internet email just wouldn't work without it. The thought of writing a non-compliant MTA is perverse (or Microsoft policy, one or the other).

If your development team understands that standards do make things easier and result in higher quality, more maintainable code, then the effort of enforcement will be small.

Damn, I seem to have found a fourth rule. Oh well.

Fourth rule: don't expect coders to document. Don't expect coders to do architecture or high-level design. Don't expect coders to have an eye for user interface. If they do, that's great, but no matter how many standards or methodologies you lay down, there's no way to change the fact that coding skill is not necessarily related to, and in fact may be inversely proportional to, those other necessary skills. Don't let a set of standards be your crutch when you really need to hire designers or documenters.
