April 2008 Archives

A Beginner's Introduction to Perl 5.10

First, a Little Sales Pitch

Editor's note: this series is based on Doug Sheppard's Beginner's Introduction to Perl. A Beginner's Introduction to Files and Strings with Perl 5.10 explains how to use files and strings, and A Beginner's Introduction to Regular Expressions with Perl 5.10 explores regular expressions, matching, and substitutions. A Beginner's Introduction to Perl Web Programming demonstrates how to write web programs.

Welcome to Perl.

Perl is the Swiss Army chainsaw of programming languages: powerful and adaptable. It was first developed by Larry Wall, a linguist working as a systems administrator for NASA in the late 1980s, as a way to make report processing easier. Since then, it has moved into a several other areas: automating system administration, acting as glue between different computer systems, web programming, bioinformatics, data munging, and even application development.

Why did Perl become so popular when the Web came along? Two reasons: First, most of what is being done on the Web happens with text, and is best done with a language that's designed for text processing. More importantly, Perl was appreciably better than the alternatives at the time when people needed something to use. C is complex and can produce security problems (especially with untrusted data), Tcl can be awkward, and Python didn't really have a foothold.

It also didn't hurt that Perl is a friendly language. It plays well with your personal programming style. The Perl slogan is "There's more than one way to do it," and that lends itself well to large and small problems alike. Even more so, Perl is very portable and widespread -- it's available pre-installed almost everywhere -- and of course there are thousands of freely-distributable libraries available from the CPAN.

In this first part of our series, you'll learn a few basics about Perl and see a small sample program.

A Word About Operating Systems

This series assumes that you're using a Unix or Unix-like operating system (Mac OS X and Cygwin qualify) and that you have the perl binary available at /usr/bin/perl. It's OK if you're running Windows through ActivePerl or Strawberry Perl; most Perl code is platform-independent.

Your First Perl Program

Save this program as a file called first.pl:

use feature ':5.10';
say "Hi there!";

(The traditional first program says Hello world!, but I'm an iconoclast.)

Run the program. From a command line, go to the directory with this file and type perl first.pl. You should see:

Hi there!

Friendly, isn't it?

I'm sure you can guess what say does. What about the use feature ':5.10'; line? For now, all you need to know is that it allows you to use nice new features found in Perl 5.10. This is a very good thing.

Functions and Statements

Perl has a rich library of built-in functions. They're the verbs of Perl, the commands that the interpreter runs. You can see a list of all the built-in functions in the perlfunc man page (perldoc perlfunc, from the command line). Almost all functions can take a list of commma-separated parameters.

The print function is one of the most frequently used parts of Perl. You use it to display things on the screen or to send information to a file. It takes a list of things to output as its parameters.

print "This is a single statement.";
print "Look, ", "a ", "list!";

A Perl program consists of statements, each of which ends with a semicolon. Statements don't need to be on separate lines; there may be multiple statements on one line. You can also split a single statement across multiple lines.

print "This is "; print "two statements.\n";
print "But this ", "is only one statement.\n";

Wait a minute though. What's the difference between say and print? What's this \n in the print statements?

The say function behaves just like the print function, except that it appends a newline at the end of its arguments. It prints all of its arguments, and then a newline character. Always. No exceptions. print, on the other hand, only prints what you see explicitly in these examples. If you want a newline, you have to add it yourself with the special character escape sequence \n.

use feature ':5.10';

say "This is a single statement.";
say "Look, ", "a ", "list!";

Why do both exist? Why would you use one over the other? Usually, most "display something" statements need the newline. It's common enough that say is a good default choice. Occasionally you need a little bit more control over your output, so print is the option.

Note that say is two characters shorter than print. This is an important design principle for Perl -- common things should be easy and simple.

Numbers, Strings, and Quotes

There are two basic data types in Perl: numbers and strings.

Numbers are easy; we've all dealt with them. The only thing you need to know is that you never insert commas or spaces into numbers in Perl. Always write 10000, not 10,000 or 10 000.

Strings are a bit more complex. A string is a collection of characters in either single or double quotes:

'This is a test.'
"Hi there!\n"

The difference between single quotes and double quotes is that single quotes mean that their contents should be taken literally, while double quotes mean that their contents should be interpreted. Remember the character sequence \n? It represents a newline character when it appears in a string with double quotes, but is literally the two characters backslash and n when it appears in single quotes.

use feature ':5.10';
say "This string\nshows up on two lines.";
say 'This string \n shows up on only one.';

(Two other useful backslash sequences are \t to insert a tab character, and \\ to insert a backslash into a double-quoted string.)

Variables

If functions are Perl's verbs, then variables are its nouns. Perl has three types of variables: scalars, arrays, and hashes. Think of them as things, lists, and dictionaries respectively. In Perl, all variable names consist of a punctuation character, a letter or underscore, and one or more alphanumeric characters or underscores.

Scalars are single things. This might be a number or a string. The name of a scalar begins with a dollar sign, such as $i or $abacus. Assign a value to a scalar by telling Perl what it equals:

my $i                = 5;
my $pie_flavor       = 'apple';
my $constitution1776 = "We the People, etc.";

You don't need to specify whether a scalar is a number or a string. It doesn't matter, because when Perl needs to treat a scalar as a string, it does; when it needs to treat it as a number, it does. The conversion happens automatically. (This is different from many other languages, where strings and numbers are two separate data types.)

If you use a double-quoted string, Perl will insert the value of any scalar variables you name in the string. This is often useful to fill in strings on the fly:

use feature ':5.10';
my $apple_count  = 5;
my $count_report = "There are $apple_count apples.";
say "The report is: $count_report";

The final output from this code is The report is: There are 5 apples..

You can manipulate numbers in Perl with the usual mathematical operations: addition, multiplication, division, and subtraction. (The multiplication and division operators in Perl use the * and / symbols, by the way.)

my $a = 5;
my $b = $a + 10;       # $b is now equal to 15.
my $c = $b * 10;       # $c is now equal to 150.
$a    = $a - 1;        # $a is now 4, and algebra teachers are cringing.

That's all well and good, but what's this strange my, and why does it appear with some assignments and not others? The my operator tells Perl that you're declaring a new variable. That is, you promise Perl that you deliberately want to use a scalar, array, or hash of a specific name in your program. This is important for two reasons. First, it helps Perl help you protect against typos; it's embarrassing to discover that you've accidentally mistyped a variable name and spent an hour looking for a bug. Second, it helps you write larger programs, where variables used in one part of the code don't accidentally affect variables used elsewhere.

You can also use special operators like ++, --, +=, -=, /= and *=. These manipulate a scalar's value without needing two elements in an equation. Some people like them, some don't. I like the fact that they can make code clearer.

my $a = 5;
$a++;        # $a is now 6; we added 1 to it.
$a += 10;    # Now it's 16; we added 10.
$a /= 2;     # And divided it by 2, so it's 8.

Strings in Perl don't have quite as much flexibility. About the only basic operator that you can use on strings is concatenation, which is a ten dollar way of saying "put together." The concatenation operator is the period. Concatenation and addition are two different things:

my $a = "8";    # Note the quotes.  $a is a string.
my $b = $a + "1";   # "1" is a string too.
my $c = $a . "1";   # But $b and $c have different values!

Remember that Perl converts strings to numbers transparently whenever necessary, so to get the value of $b, the Perl interpreter converted the two strings "8" and "1" to numbers, then added them. The value of $b is the number 9. However, $c used concatenation, so its value is the string "81".

Remember, the plus sign adds numbers and the period puts strings together. If you add things that aren't numbers, Perl will try its best to do what you've told it to do, and will convert those non-numbers to numbers with the best of its ability.

Arrays are lists of scalars. Array names begin with @. You define arrays by listing their contents in parentheses, separated by commas:

my @lotto_numbers = (1, 2, 3, 4, 5, 6);  # Hey, it could happen.
my @months        = ("July", "August", "September");

You retrieve the contents of an array by an index, sort of like "Hey, give me the first month of the year." Indexes in Perl start from zero. (Why not 1? Because. It's a computer thing.) To retrieve the elements of an array, you replace the @ sign with a $ sign, and follow that with the index position of the element you want. (It begins with a dollar sign because you're getting a scalar value.) You can also modify it in place, just like any other scalar.

use feature ':5.10';

my @months = ("July", "August", "September");
say $months[0];         # This prints "July".
$months[2] = "Smarch";  # We just renamed September!

If an array value doesn't exist, Perl will create it for you when you assign to it.

my @winter_months = ("December", "January");
$winter_months[2] = "February";

Arrays always return their contents in the same order; if you go through @months from beginning to end, no matter how many times you do it, you'll get back July, August, and September in that order. If you want to find the number of elements of an array, assign the array to a scalar.

use feature ':5.10';
my @months      = ("July", "August", "September");
my $month_count = @months;
say $month_count;  # This prints 3.

my @autumn_months; # no elements
my $autumn_count = @autumn_months;
say $autumn_count; # this prints 0

Some programming languages call hashes "dictionaries". That's what they are: a term and a definition. More precisely, they contain keys and values. Each key in a hash has one and only one corresponding value. The name of a hash begins with a percentage sign, like %parents. You define hashes by comma-separated pairs of key and value, like so:

my %days_in_month = ( "July" => 31, "August" => 31, "September" => 30 );

You can fetch any value from a hash by referring to $hashname{key}, or modify it in place just like any other scalar.

say $days_in_month{September}; # 30, of course.
$days_in_month{February} = 29; # It's a leap year.

To see what keys are in a hash, use the keys function with the name of the hash. This returns a list containing all of the keys in the hash. The list isn't always in the same order, though; while you can count on @months always to return July, August, September in that order, keys %days_in_month might return them in any order whatsoever.

my @month_list = keys %days_in_month;
# @month_list is now ('July', 'September', 'August', 'February')!

The three types of variables have three separate namespaces. That means that $abacus and @abacus are two different variables, and $abacus[0] (the first element of @abacus) is not the same as $abacus{0} (the value in %abacus that has the key 0).

Comments

Some of the code samples from the previous section contained code comments. These are useful for explaining what a particular piece of code does, and vital for any piece of code you plan to modify, enhance, fix, or just look at again. (That is to say, comments are important.)

Anything in a line of Perl code that follows a # sign is a comment, unless that # sign appears in a string.)

use feature ':5.10';
say "Hello world!";  # That's more like it.
# This entire line is a comment.

Loops

Almost every program ever written uses a loop of some kind. Loops allow you run a particular piece of code over and over again. This is part of a general concept in programming called flow control.

Perl has several different functions that are useful for flow control, the most basic of which is for. When you use the for function, you specify a variable to use as the loop index, and a list of values to loop over. Inside a pair of curly brackets, you put any code you want to run during the loop:

use feature ':5.10';

for my $i (1, 2, 3, 4, 5) {
     say $i;
}

This loop prints the numbers 1 through 5, each on a separate line. (It's not very useful; you're might think "Why not just write say 1, 2, 3, 4, 5;?". This is because say adds only one newline, at the end of its list of arguments.)

A handy shortcut for defining loop values is the range operator .., which specifies a range of numbers. You can write (1, 2, 3, 4, 5) as (1 .. 5) instead. You can also use arrays and scalars in your loop list. Try this code and see what happens:

use feature ':5.10';

my @one_to_ten = (1 .. 10);
my $top_limit  = 25;

for my $i (@one_to_ten, 15, 20 .. $top_limit) {
    say $i;
}

Of course, again you could write say @one_to_ten, 15, 20 .. $top_limit;

The items in your loop list don't have to be numbers; you can use strings just as easily. If the hash %month_has contains names of months and the number of days in each month, you can use the keys function to step through them.

use feature ':5.10';

for my $i (keys %month_has) {
    say "$i has $month_has{$i} days.";
}

for my $marx ('Groucho', 'Harpo', 'Zeppo', 'Karl') {
    say "$marx is my favorite Marx brother.";
}

The Miracle of Compound Interest

You now know enough about Perl -- variables, print/say, and for() -- to write a small, useful program. Everyone loves money, so the first sample program is a compound-interest calculator. It will print a (somewhat) nicely formatted table showing the value of an investment over a number of years. (You can see the program at compound_interest.pl)

The single most complex line in the program is:

my $interest = int( ( $apr / 100 ) * $nest_egg * 100 ) / 100;

$apr / 100 is the interest rate, and ($apr / 100) * $nest_egg is the amount of interest earned in one year. This line uses the int() function, which returns the integer value of a scalar (its value after any stripping off any fractional part). We use int() here because when you multiply, for example, 10925 by 9.25%, the result is 1010.5625, which we must round off to 1010.56. To do this, we multiply by 100, yielding 101056.25, use int() to throw away the leftover fraction, yielding 101056, and then divide by 100 again, so that the final result is 1010.56. Try stepping through this statement yourself to see just how we end up with the correct result, rounded to cents.

Play Around!

At this point you have some basic knowledge of Perl syntax and a few simple toys to play with. Try writing some simple programs with them. Here are two suggestions, one simple and the other a little more complex:

  • A word frequency counter. How often does each word show up in an array of words? Print out a report. (Hint: Use a hash to count of the number of appearances of each word.)
  • Given a month and the day of the week that's the first of that month, print a calendar for the month.

Using Amazon S3 from Perl

Data management is a critical and challenging aspect for any online resource. With exponentially growing data sizes and popularity of rich media, even small online resources must effectively manage and distribute a significant amount of data. Moreover, the peace of mind associated with an additional offsite data storage resource is invaluable to everyone involved.

At SundayMorningRides.com, we manage a growing inventory of GPS and general GIS (Geography Information Systems) data and web content (text, images, videos, etc.) for the end users. In addition, we must also effectively manage daily snapshots, backups, as well as multiple development versions of our web site and supporting software. For any small organization, this can add up to significant costs -- not only as an initial monetary investment but also in terms of ongoing labor costs for maintenance and administration.

Amazon Simple Storage Service (S3) was released specifically to address the problem of data management for online resources -- with the aim to provide "reliable, fast, inexpensive data storage infrastructure that Amazon uses to run its own global network of web sites." Amazon S3 provides a web service interface that allows developers to store and retrieve any amount of data. S3 is attractive to companies like SundayMorningRides.com as it frees us from upfront costs and the ongoing costs of purchasing, administration, maintenance, and scaling our own storage servers.

This article covers the Perl, REST, and the Amazon S3 REST module, walking through the development of a collection of Perl-based tools for UNIX command-line based interaction to Amazon S3. I'll also show how to set access permissions so that you can serve images or other data directly to your site from Amazon S3.

A Bit on Web Services

Web services have become the de-facto method of exposing information and, well, services via the Web. Intrinsically, web services provide a means of interaction between two networked resources. Amazon S3 is accessible via both Simple Object Access Protocol (SOAP) or representational state transfer (REST).

The SOAP interface organizes features into custom-built operations, similar to remote objects when using Java Remote Method Invocation (RMI) or Common Object Resource Broker Architecture (CORBA). Unlike RMI or CORBA, SOAP uses XML embedded in the body of HTTP requests as the application protocol.

Like SOAP, REST also uses HTTP for transport. Unlike SOAP, REST operations are the standard HTTP operations -- GET, POST, PUT, and DELETE. I think of REST operations in terms of the CRUD semantics associated with relational databases: POST is Create, GET is Retrieve, PUT is Update, and DELETE is Delete.

"Storage for the Internet"

Amazon S3 represents the data space in three core concepts: objects, buckets, and keys.

  • Objects are the base level entities within Amazon S3. They consist of both object data and metadata. This metadata is a set of name-attribute pairs defined in the HTTP header.
  • Buckets are collections of objects. There is no limit to the number of objects in a bucket, but each developer is limited to 100 buckets.
  • Keys are unique identifiers for objects.

Without wading through the details, I tend think of buckets as folders, objects as files, and keys as filenames. The purpose of this abstraction is to create a unique HTTP namespace for every object.

I'll assume that you have already signed up for Amazon S3 and received your Access Key ID and Secret Access Key. If not, please do so.

Please note that the S3::* modules aren't the only Perl modules available for connecting to Amazon S3. In particular, Net::Amazon::S3 hides a lot of the details of the S3 service for you. For now, I'm going to use a simpler module to explain how the service works internally.

Connecting, Creating, and Listing Buckets

Connecting to Amazon S3 is as simple as supplying your Access Key ID and your Secret Access Key to create a connection, called here $conn. Here's how to create and list the contents of a bucket as well as list all buckets.

#!/usr/bin/perl

use S3::AWSAuthConnection;
use S3::QueryStringAuthGenerator;

use Data::Dumper;

my $AWS_ACCESS_KEY_ID     = 'YOUR ACCESS KEY';
my $AWS_SECRET_ACCESS_KEY = 'YOUR SECRET KEY';

my $conn = S3::AWSAuthConnection->new($AWS_ACCESS_KEY_ID,
                                      $AWS_SECRET_ACCESS_KEY);

my $BUCKET = "foo";

print "creating bucket $BUCKET \n";
print $conn->create_bucket($BUCKET)->message, "\n";

print "listing bucket $BUCKET \n";
print Dumper @{$conn->list_bucket($BUCKET)->entries}, "\n";

print "listing all my buckets \n";
print Dumper @{$conn->list_all_my_buckets()->entries}, "\n";

Because every S3 action takes place over HTTP, it is good practice to check for a 200 response.

my $response = $conn->create_bucket($BUCKET);
if ($response->http_response->code == 200) {
    # Good
} else {
    # Not Good
}

As you can see from the output, the results come back in a hash. I've used Data::Dumper as a convenient way to view the contents. If you are running this for the first time, you will obviously not see anything listed in the bucket.

listing bucket foo
$VAR1 = {
          'Owner' => {
                     'ID' => 'xxxxx',
                     'DisplayName' => 'xxxxx'
                   },
          'Size' => '66810',
          'ETag' => '"xxxxx"',
          'StorageClass' => 'STANDARD',
          'Key' => 'key',
          'LastModified' => '2007-12-18T22:08:09.000Z'
        };
$VAR4 = '
';
listing all my buckets
$VAR1 = {
          'CreationDate' => '2007-11-28T17:31:48.000Z',
          'Name' => 'foo'
        };
';

Writing an Object

Writing an object is simply a matter of using the HTTP PUT method. Be aware that there is nothing to prevent you from overwriting an existing object; Amazon S3 will automatically update the object with the more recent write request. Also, it's currently not possible to append to or otherwise modify an object in place without replacing it.

my %headers = (
    'Content-Type' => 'text/plain'
);
$response = $conn->put( $BUCKET, $KEY, S3Object->new("this is a test"),
                        \%headers);

Likewise, you can read a file from STDIN:

my %headers;

FILE: while(1) {
    my $n = sysread(STDIN, $data, 1024 * 1024, length($data));
    if ($n < 0) {
        print STDERR "Error reading input: $!\n";
        exit 1;
    }
    last FILE if $n == 0;
}
$response = $conn->put("$BUCKET", "$KEY", $data, \%headers);

To add custom metadata, simply add to the S3Object:

S3Object->new("this is a test", { name => "attribute" })

By default, every object has private access control when written. This allows only the user that stored the object to read it back. You can change these settings. Also, note that each object can hold a maximum of 5 GB of data.

You are probably wondering if it is also possible to upload via a standard HTTP POST. The folks at Amazon are working on it as we speak -- see HTTP POST beta discussion for more information. Until that's finished, you'll have to perform web-based uploads via an intermediate server.

Reading an Object

Like writing objects, there are several ways to read data from Amazon S3. One way is to generate a temporary URL to use with your favorite client (for example, wget or Curl) or even a browser to view or retrieve the object. All you have to do is generate the URL used to make the REST call.

my $generator = S3::QueryStringAuthGenerator->new($AWS_ACCESS_KEY_ID,
    $AWS_SECRET_ACCESS_KEY);

...and then perform a simple HTTP GET request. This is a great trick if all you want to do is temporarily view or verify the data.

$generator->expires_in(60);
my $url = $generator->get($BUCKET, "$KEY");
print "$url \n";

You can also programmatically read the data directly from the initial connection. This is handy if you have to perform additional processing of the data.

my $response = $conn->get("$BUCKET", "$KEY");
my $data     = $response->object->data;

Another cool feature is the ability to use BitTorrent to download files from Amazon S3 . You can access any object that has anonymous access privileges via BitTorrent.

Delete an Object

By now you probably have the hang of the process. If you're going to create objects, you're probably going to have to delete them at some point.

$conn->delete("$BUCKET", "$KEY");

Set Access Permissions and Publish to a Website

As you may have noticed from the previous examples, all Amazon S3 objects access goes through HTTP. This makes Amazon S3 particularly useful as a online repository. In particular, it's useful to manage and serve website media. You could almost imagine Amazon S3 serving as mini Content Delivery Network for media on your website. This example will demonstrate how to build a very simple online page where the images are served dynamically via Amazon S3.

The first thing to do us to upload some images and set the ACL permissions to public. I've modified the previous example with one difference. To make objects publicly readable, include the header x-amz-acl: public-read with the HTTP PUT request.

my %headers = (
    'x-amz-acl' => 'public-read',
);

Additional ACL permissions include:

  • private (default setting if left blank)
  • public-read
  • public-read-write
  • authenticated-read

Now you know enough to put together a small script that will automatically display all images in the bucket to a web page (you'll probably want to spruce up the formatting).

...
my $BUCKET   = "foobar";
my $response = $conn->list_bucket("$BUCKET");

for my $entry (@{$response->entries}) {
    my $public_url   = $generator->get($BUCKET, $entry->{Key});
    my ($url, undef) = split (/\?/, $public_url);
    $images         .= "<img src=\"$url\"><br />";
}
($webpage =  <<"WEBPAGE");
<html><body>$images</body></html>
WEBPAGE
print $q->header();
print $webpage;

To add images to this web page, upload more files into the bucket and they will automatically appear the next time you load the page.

It's also simple to link to media one at a time for a webpage. If you examine the HTML generated by this example, you'll see that all Amazon S3 URLs have the basic form http://bucketname.s3.amazon.com/objectname. Also note that the namespace for buckets is shared with all Amazon S3 users. You may have already picked up on this.

Conclusion

Amazon S3 is a great tool that can help with the data management needs of all sized organizations by offering cheap and unlimited storage. For personal use, it's a great tool for backups (also good for organizations) and general file storage. It's also a great tool for collaboration. Instead of emailing files around, just upload a file and set the proper access controls -- no more dealing with 10 MB attachment restrictions!

At SundayMorningRides.com we use S3 as part of our web serving infrastructure to reduce the load on our hardware when serving media content.

When combined with other Amazon Web Services such as SimpleDB (for structured data queries) and Elastic Compute Cloud (for data processing) it's easy to envision a low cost solution for web-scale computing and data management.

More Resources and References

Visit the home of the Perl programming language: Perl.org

Sponsored by

Monthly Archives

Powered by Movable Type 5.13-en