RSS and You

Jan 25, 2000 by Chris Nandor

RSS is Born

Contents:
• RSS is Born
• XML::RSS
• my_portal
• Future of my_portal

Netscape had one of the first “portals” on the Web, a place where users could go to get most of their information needs fulfilled: search engines, news, email, and more. But Netscape soon wanted a portal that was more customizable by the user and contained content from any site that wanted to contribute. Hence My Netscape Network (MNN) was born. At MNN, the user can choose what content to put on their own page—the latest headlines on Slashdot, recently uploaded files at freshmeat, or the most recent posts to a bulletin board at Network54. Each channel, as Netscape calls them, can also include information to display an image for that channel and a text input box for searching the channel’s site.

The channels are described with formatted text files that are updated either at regular intervals, or whenever the site’s content changes. The Netscape servers periodically download the updated channel files from the various sites providing them, and that information is then made immediately available to the users.

In addition to the list of items and associated hyperlinks, the channel files can also contain information to display a text box for a form, and an image link, as well as metadata about the site.

In order to do all this, a universal way for developers to describe their sites was needed. Netscape developed the RDF Site Summary (RSS) format, which uses XML and the Resource Description Framework (RDF), a hierarchical data model used primarily for describing web-based metadata. RSS 0.90 was the first version, released in March 1999. RSS is very simple to work with, and because it is XML, it is both human-readable and easily parsed by many different languages and programs.

The fundamental container for the RSS data is the channel. Properties within the channel include title, link, and description, most of which are optional. Optional containers inside the channel are image and textinput, each with their own properties. At least one (and up to 15) item is included in the channel. The Perl News RSS file is a typical example:

...

<channel>
<title>Perl News</title>
<link>http://www.news.perl.org/</link>
<description>News for the Perl Community</description>
<language>en</language>
<copyright>Copyright 1999, Chris Nandor</copyright>
<pubDate>Sun, 02 Jan 2000 10:58:39 EST</pubDate>
<lastBuildDate>Sun, 02 Jan 2000 10:58:39 EST</lastBuildDate>
<managingEditor>news@perl.org</managingEditor>
<webMaster>news@perl.org</webMaster>

...

<item>
<title>Perl Conference Papers Deadline Extended Two Days</title>
<link>http://www.news.perl.org/perl-news.cgi?item=947976240|9809</link>
</item>

<item>
<title>Robert Writes in Defense of Coding Standards</title>
<link>http://www.news.perl.org/perl-news.cgi?item=947976265|9810</link>
</item>

...

<textinput>
<title>Search Perl News</title>
<description>Search the Perl News database</description>
<name>text</name>
<link>http://www.news.perl.org/perl-news.cgi</link>
</textinput>

</channel>
</rss>

Perl Conference Papers Deadline Extended Two Days Robert Writes in Defense of Coding Standards Netizen Releases Training Materials New Modules 10-14 January 2000 PerlMonth Issue 8 Now Available New Modules 5-9 January 2000 Linux Magazine Publishes Wall’s Uncultured Perl New Modules 4 January 2000 Search Perl News

January 15, 2000, 17:45 EST

RSS files like this are now used by thousands of sites on the Web. And Netscape is not the only one providing all of this content. UserLand, which had been using its own channel description format (called scriptingNews) since late 1997, was one of the early adopters of RSS for My UserLand. The important features of its scriptingNews format were integrated into RSS 0.91.

The distribution of content via channels resembles the distribution of content via cable TV in that not all channels are carried by all providers. A site that carries RSS channels related to freedom of speech might not carry the Perl News channel. Netscape carries any type of channel. UserLand carries news channels only. Slashdot (which has its own RSS channel) also allows users to customize their personal Slashdot page with other channels, which they call Slashboxes. But Slashdot has a much narrower focus than My UserLand and MNN, carrying only channels that relate to free software and hackers. xmlTree attempts to categorize as much of the XML content available on the Web as possible, much of which is RSS content.

XML::RSS

Jonathan Eisenzopf, who has worked on a lot of XML projects for Perl and who runs the perlxml.com web site, wrote the XML::RSS module. It is based on XML::Parser, as most XML modules are, and uses an object-oriented syntax. It makes creation and parsing of RSS files easy.

To create an RSS channel, you first create an XML::RSS object:

    use XML::RSS;
    my $rss = new XML::RSS;

Then call any of the four primary methods: channel(), image(), textinput(), or add_item(). The channel() method sets up information about the channel (not all the options are included here):

    $rss->channel(
        title           => 'Perl News',
        'link'          => 'http://www.news.perl.org/',
        description     => 'News for the Perl Community',
    );

Then you can call the optional textinput() and image() methods. The image data will tell the site that uses the channel where to get the image and what to link it to. The textinput data describes the action, the submit button value, and the name of the text input box so that a form can be presented to the user.

    $rss->image(
        title   => 'Perl News',
        url     => 'http://www.news.perl.org/perl-news-small.gif',
    );

    $rdf->textinput(
        title       => 'Search Perl News',
        description => 'Search the Perl News database',
        name        => 'text',
        'link'      => 'http://www.news.perl.org/perl-news.cgi',
    );

And then, for each item to add to the channel, call the add_item() method:

    for my $i (keys %items) {
        $rss->add_item(
            title   => $items{$i},
            'link'  => "http://www.news.perl.org/perl-news.cgi?item=$i",
        );
    }

The final step is to save it. You can either get the data with the as_string() method and then save it, or just save it directly with the save() method.

    my $data = $rss->as_string;             # or ...
    $rss->save("$dir/$channel.rss");

For an example of using XML::RSS to create RSS, see the source for the program that generates Perl News. The program that generates the HTML for the Perl News main page generates the RSS file at the same time.

It seems that, as with HTML, many (if not most) RSS files are created by hand, or at least using a template from a program. That is fine to do, but using XML::RSS has the advantage of creating valid, well-formed data (which is exceedingly important with XML); and, of course, as the RSS format evolves,the module can evolve with it. It’s simply easier, in many cases.

my_portal

While it was pretty cool that RSS channels were being created for most of the sites I frequented, I was getting frustrated that the content providers out there did not provide all of the channels I wanted, or they weren’t in the layout I wanted, or they just did something I didn’t like. I wanted to control the content myself. That’s the whole point, right? So I finally got around to doing something about it, and wrote a program for a new site I call my_portal.

The purpose of the project was to let me view the content I wanted from the sites I wanted in the format I wanted. And, of course, I wanted others to be able to do the same thing, because I was pretty sure that I was not the only one with this dilemma (and the feedback I’ve received confirms this). So I made a basic, lightweight program using Eisenzopf’s XML::RSS. The program only does a few things: it fetches RSS channels, and it displays them on the Web.

To display the RSS channels, I need to parse them. XML::RSS and XML::Parser come to the rescue again, with the parse() and parse_file() methods:

    $rss->parse($data);                     # or ...
    $rss->parsefile("$dir/$channel.rss");

$rss is a hashref, and I extract the elements as with any complex data structure:

    print qq{<A HREF="$rss->{channel}{'link'}">$rss->{channel}{title}</A>};

    for my $i (@{$rss->{items}}) {
        print qq{<A HREF="$i->{'link'}">$i->{title}</A>};
    }

    print qq{<FORM METHOD="GET" ACTION="$rss->{textinput}{'link'}">
             <INPUT TYPE="TEXT" NAME="$rss->{textinput}{name}">
             </FORM>};

When the data structure gets confusing, I’ll often print it out with Data::Dumper to get a good look at it:

    use Data::Dumper;
    print Dumper $rss;

The program’s interface is fairly simple. From the command line, I can add a channel by typing its name and the URL of the RSS file. The channel is added to a DBM file, and then its RSS file is downloaded (and so is the image from the RSS file, if available). I can update each channel individually, or all at once. I set a cron job to download new data every time the big hand is on the twelve or the six, so the content is updated all day long. Using the LWP::Simple module’s mirror() function, a new RSS file and image are only downloaded if the remote server says they have been modified.

The fact that my_portal was designed primarily for my own use has not dissuaded people from requesting features for it, or me from trying to provide some of them. First, there was user configurability. A configure screen presents a form where a user can choose which channels to view, and in what order. Just for kicks, the user can also select the colors of the page. All of these things have my personal choices as the default, but the user can change them at will. The data for these preferences is stored in another DBM, and remembered through cookies. Because some users wanted to use more than one computer or browser, I went ahead and added usernames and passwords so that they could log in and use the same configuration from anywhere.

I also worked to make sure that my_portal constructs valid HTML. I am really annoyed by sites that don’t work in certain browsers, or look bad in some, so I made sure to use valid HTML 4.0 for the entire site. It looks just fine in all browsers from Lynx to Mozilla. However, because the program pulls in content from other sources, it is possible that a page will be produced containing invalid HTML. Someone may try to embed incorrect HTML tags or entities in their RSS, or neglect to encode entities that need to be encoded.

Normally, XML::Parser will croak on many of these common problems, so they will never get to your HTML page anyway. If an item’s title is Amazon.com Sues Barnes & Noble, the ampersand will cause an exception, because it needs to be turned into an encoded entity (such as &). Another problem I’ve run into is RSS files in Mac OS text format (using CRs instead of LFs or CRLFs), which for some odd reason were making the parser choke. So I wrote some filters that process the RSS files after LWP::Simple::mirror() downloads them, before they are passed to XML::RSS. A simple regex converts CRs and CRLFs to LFs, and then any ampersand that is not followed by [a-zA-Z0-9]+; or #\d+; is converted into &.

    {
        my @time = (stat $file)[8,9];
        local $^I = '.bak';
        local @ARGV = $file;
        while (<>) {
            s/\015\012?/\012/g;
            s/&(?!(?:[a-zA-Z0-9]+|#\d+);))/&amp;/g;
            print;
        }
        # continued below
        ...
    }

I need to do one more thing before I am done with the filtered file. An optional property in the channel data notes the time when the channel was published (lastBuildDate), and my_portal prints it on the web page, so users have some idea of how recently updated the channel is. If that optional data is not in the RSS file, the program uses the modification time of the RSS file on disk, which LWP::Simple::mirror() sets to whatever the remote server says the modification time of the remote file is. So before touching the original file, we use stat() to get the access and modification time. After saving the new file, we use utime() to set those values to the newly saved file, so the modification time of the file is preserved.

    {
        ...
        # continued from above
        utime @time, $file;
        unlink "$file.bak";
    }

You probably won’t be able to construct filters to fix all potential problems with the RSS files—if they are totally broken, and your computer doesn’t happen to have artificial intelligence, then there is nothing you can do—so make sure you catch exceptions when parsing XML:

    for my $channel (@channels) {
        eval { $rss->parse("$dir/$channel.rss") }
        warn "XML for channel $channel not well formed:\n$@"
            and last if $@;

        # do something with $rss data ...
    }

Future of `my_portal`

There are some features that my_portal, because of its limited design scope, won’t accommodate in its current form. One suggestion was that users be able to add arbitrary RSS channels to their personal page through an external link (“click here to add Foo News to Your Portal!”). This is a fine idea, but it doesn’t fit the scheme and scope of the project; the interface and backend would need to be rethunk to do something like this. It may happen in the future, though.

Also, my_portal currently does not support locking for the user’s or channel DBMs. For the channel DBM, this is not a serious problem, since only one person would likely be changing that DBM anyway.

For the user’s DBM, this could be a problem. Moving the data to MySQL first would solve the problem. Otherwise, locking may be added eventually. My bet is on moving to MySQL first.

The my_portal program is also a little bit slow; each time it displays the channel, it executes, reads in the user database to find the user, and then reads in all of the appropriate channels. If my_portal were to be extended for widespread use, the best thing to do would probably be to make it into a mod_perl process that uses MySQL. That would solve or significantly alleviate the speed problems and the data problems, since it would always be loaded in, have persistence to the database connection, handle simultaneous accesses, and so on.

Since it suits my personal purposes fine as it is, and since I don’t get paid to work on it, I may or may not get to these and other changes in the near future. But the program is, after all, open source and available at http://www.news.perl.org/my_portal/my_portal.plx. Patches welcome!

Tags

web