February 2000 Archives

Ten Perl Myths


Introduction

Table of Contents

Perl is hard
Perl looks like line noise
Perl is hard because it has regexps
Perl is hard because it has references
Perl is just for Unix
Perl is just for one-liners - you can't build `real' programs with it
Perl is just for CGI
Perl is too slow
Perl is insecure
Perl is not commercial, so it can't be any good
Conclusion

One of the things you might not realize when you're thinking about Perl, and hearing about Perl, is that there is an awful lot of disinformation out there. It's really hard for someone who's not very familiar with Perl to separate the wheat from the chaff, and it's very easy to accept some of these things as gospel truth - sometimes without even realising it.

What I'm going to do, then, is to pick out the top ten myths that you'll hear bandied around, and give a response to them. I'm not going to try to persuade you to use Perl - the only way for you to know if it's for you is to get on and try it - but hopefully I can let you see that not all of what you hear is true.

First, though, let's make sure we understand what Perl is, and what it's for.

Perl is a general-purpose programming language. The answer to the question `Can I do this in Perl?' is very probably `yes'. It's often used for little system administration tasks and for CGI and other web stuff, but that's not the whole story. We'll see soon that you can use Perl for full-sized projects as well.

Perl is sometimes called a `scripting' language, but only by people who don't like it or don't understand it. First, there's no real difference between programming and scripting - it's all just telling the computer what you want it to do. Second, even if there were, Perl's as much of a scripting language as Java or C. I'm going to talk about Perl programs here, but you might hear some people call them Perl scripts. The people who call them `programs' on the whole write better ones.

Perl is hard

The first thing people will tell you is that Perl is hard: hard to learn, hard to use, hard to understand. Since Perl is so powerful, the logic goes, it must be difficult, right?

Wrong. For a start, Perl is built on a number of languages that will be familiar to almost every programmer these days. Know any C? Then you've got a head start with Perl. Know how to program the Unix shell? Or write an awk or sed program? If so, you'll immediately feel at home with some elements of Perl syntax.

And even if you don't know any of these languages, even if you're totally new to programming, I'd still say Perl was one of the easiest languages to begin programming with, for two good reasons.

Perl works the way you do. One of the Perl mottos is `There's more than one way to do it'. People approach tasks in very different ways, and sometimes come out with very different solutions to the same problem. Perl is accommodating - it doesn't force any particular style on you (unless you ask it to), and it allows you to express your programming intentions in a way that reflects how you as a person think about programming. Here's an example: suppose we've got a file which consists of two columns of data, separated by a colon. What we have to do is to swap the two columns around. This came up in a discussion the other day, and here's how I thought about doing it: Read a line, swap what's on either side of the colon, then print the line.

        while (<>) {
                s/(.*):(.*)/$2:$1/;
                print;
        }

It's not a hard problem to understand, so it shouldn't be hard to solve. It only needs a few lines - in fact, if you use some command line options to perl, you can dispense with everything apart from the second line. But let's not get too technical when we can get by without it.

Now, for those of you who don't know that much Perl: that diamond on the first line means `read in a line', and the s on the second means `substitute'. The brackets mean `remember', and .* means `any amount of anything'.

So, while we can read a line in, we do some kind of substitution, and then print it out again. What are we substituting? We take something which we remember, followed by a colon, then something else we remember. Then we replace all that with the second thing, a colon, and the first thing. That's one, fairly natural way to think about it.

Someone else, however, chose to do it another way:

        while (<>) {
                chomp;
                ($first, $second) = split /:/;
                print $second, ":", $first, "\n";
        }

Slightly longer, maybe a little easier to understand (maybe not, I don't know), but the point is, that's how he thought about approaching the problem. It's how he visualised it, and it's how his mind works. Perl hasn't imposed any particular way of thinking about programming on us.

To go through it, quickly: chomp takes off the new-line. Then we split (using the reasonably obviously named split function) the incoming text into two variables, around a colon. Finally, we put it back together in reverse order, the second bit, the colon, the first bit, and last of all putting the new-line back on the end where it belongs.

The second thing which makes Perl easy is that you don't need to understand all of it to get the job done. Sure, we could have written the above program on the command line, like this:

        % perl -p -e 's/(.*):(.*)/$2:$1/'

(-p makes Perl loop over the incoming data and print it out once you've finished fiddling with it.)

But we didn't need to know that. You can do a lot with a little knowledge of Perl. Obviously, it's easier if you know more, but it's also easy to get started, and to use what you know to get the job done.

Let's take another example. We want to examine an /etc/passwd file and show some details about the users. Perl has some built-in functions to read entries from the password file, so we could use them. However, even if we didn't know about them, we could do the job with what we do know already: we know how to read in and split up a colon-delimited input file, which is all we need. There's more than one way to do it.
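For instance, here's a minimal sketch that prints each user's name, uid and shell, using nothing but the line-reading and splitting we've already seen (it assumes the standard colon-separated /etc/passwd layout):

        while (<>) {
                chomp;
                # the seven standard /etc/passwd fields, split on colons
                ($user, $passwd, $uid, $gid, $gecos, $home, $shell) = split /:/;
                print "$user has uid $uid and logs in with $shell\n";
        }

Run it as perl users.pl /etc/passwd and off you go.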

So, thanks to its similarity to other languages, the fact that it doesn't force you to think in any particular way, and the fact that a little Perl knowledge goes a long way, we can happily consign the idea that `Perl is hard' to mythology.

Perl looks like line noise

The next most common thing you'll hear is that Perl is ugly, or untidy, or is a write-only language. Unfortunately for me, there's a large number of Perl programs out there that appear to back this up. But just because you can write ugly programs, this doesn't mean it's an ugly language - there was an Obfuscated C Contest long before there was an Obfuscated Perl one.

Each time I look at a piece of Perl that seems to have been uploaded in EBCDIC over a noisy serial line, I stop and wonder `what possesses someone to write something so ugly?' Over time, I've come to realise that a consequence of Perl being easy to use is that it's easy to abuse as well.

What I think happens goes something like this: you're faced with a data file which you need converted into another format, by yesterday. Perl to the rescue! In five minutes, you've come up with something that makes sense to you at the time and does the job - it might not look pretty, but it works. You save it away somewhere, just in case the same problem comes up again. Sooner or later, it does - but this time the input format's just a tiny bit different, so you make a quick modification to deal with the change. Eventually, you've got quite a sophisticated program. You never meant to write a huge translator, but it was just so easy to modify what you already had. Code reuse and rapid development have teamed up to create a monster.

The problem is that people then distribute this, because it works and because it's useful. And other people take one look at it and say, `Man, how could you write something so ugly?' And Perl gets a bad reputation.

But you don't have to do it like that. You could realise what's going to happen and spend time re-writing your program, probably from scratch, to make it readable and maintainable. You can sacrifice development speed for readability just as well as the other way around.

You see, it's perfectly possible to write programs in Perl that are absolutely crystal clear: shining examples of the art of programming that show off your clever algorithms in all their beauty. But Perl isn't going to make you. It's up to you.

In short, Perl doesn't write illegible Perl, people do. If you can stop yourself being one of them, we can agree that Perl's reputation for looking like line noise is no more than a myth.

Perl is hard because it has regexps

One of the parts of Perl that has contributed to the myth that Perl is an illegible language is the way matching is specified - regular expressions. As with the whole of Perl, these things are very powerful, and we all know that power corrupts. The basic idea is very simple: we are looking for certain things in a string. If you want to look for the three characters `abc', you write /abc/. So far, so good.

The next thing that comes along is the ability to match not just exact characters, but certain types of characters: a digit can be matched with \d, and so to match a digit then a colon, you say /\d:/. We've already seen that . matches any character. There's also ^ and $ to specify the beginning and the end of the line respectively. It's still pretty easy to get the hang of, yes? To match two digits at the beginning of the line, followed by any character and then a comma, you specify each of those things in turn: /^\d\d.,/
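If you want to check your reading, two throwaway lines will do it; this is just a sketch with made-up strings:

        print "found a digit-colon\n" if "3:45pm"      =~ /\d:/;
        print "matched the lot\n"     if "10p, please" =~ /^\d\d.,/;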

And so it goes on, getting more and more complex as you can express more complicated ideas about what you want to match. The important thing to remember is that the regular expression syntax is just like any other language: to express yourself in it, you need to get into the habit of being able to translate between it and your native language until you can think in the target language. So, even if I don't understand it by sight, I can work out what /^.,\d.$/ does because I can read it out. At the beginning of the line, we want to find each of the following items: any character, then a comma, then a digit, then any character, which brings us to the end of the line.

Once we get into not just matching but also substitution, we can produce some really nasty-looking stuff. Here's my current favourite example, which is relatively simple as far as regular expressions go. It corrects misspellings of `Teh' and `teh' to `The' and `the' respectively:

        s/\b([tT])eh\b/$1he/g

You could sit down and read it out yourself, but Perl's /x modifier allows us to break up the regular expression with whitespace and comments, so there's no reason for having incomprehensible regular expressions lying around. (Note that /x applies only to the pattern, not to the replacement, which is why all the comments go on the left-hand side.) Here's a fully documented version. Once you have practice reading and writing regular expressions, you'll be able to do this kind of expansion in your head (and you'll find it less distracting, too):

       s/\b    # Start with a word boundary, and
         (     # save away
         [tT]  # either a `t' or a `T',
         )     # stop saving,
         eh    # and then the rest of the word,
         \b    # ending at a word boundary,
        /$1he/gx;  # and replace it all with the character we saved
                   # - whether `t' or `T' - followed by the correct
                   # spelling, globally throughout the string.

Regular expressions can look difficult at first sight, but once you know the code and you can break them down and build them up as above, you'll soon find that they're as natural a way of expressing finding text as your own language. Saying that they make Perl difficult, then, would be a myth.

Perl is hard because it has references

The next one is actually two myths in one: first, there's the myth that Perl can't deal with complicated data structures. You've only got three data types available in Perl: a scalar, which holds numbers and text and looks like this: $foo; an array, which holds a list of scalars, and is represented like this: @bar; and a hash, which holds one-to-one correspondences between scalars, which we write like this: %baz. Hashes are usually used for storing `key-value' type records; we'll see an example later on.
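In code, the three look like this (a quick sketch):

        $foo = 42;                           # a scalar: one number or string
        @bar = ("one", "two", "three");      # an array: a list of scalars
        %baz = (beast => 666, pi => 3.14);   # a hash: key-value pairs

        print $bar[1];                       # an element is a scalar: "two"
        print $baz{beast};                   # look up a hash by key: 666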

Not enough, you will be told, to build up the complicated structures you need in `real programming'. Well, this one isn't even half true. Since Perl 5 came out, and that's five years ago now, Perl has been able to make complex structures out of references, and we'll see how it's done in a second.

So once you've got around that one, you'll hear the opposite: references are too complicated, and you end up with a mess of punctuation symbols. Interestingly, the people who find references most complicated are people used to C - references are sort of like pointers, but not quite, leaving C people getting hung up about memory and addresses and all sorts of irrelevant things. You don't have to worry about how memory is laid out in Perl; you've got more important things to do with your time. As usual, C programmers are confusing themselves by making things more complicated than they need to be.

References are, at their simplest, flat-pack storage for data. They turn any data - hashes, arrays, scalars, even subroutines - into a scalar that represents it. So, let's say we've got some hashes as follows:

        %billc = (
                name => "Bill Clinton",
                job  => "US President"
        );
        
        %billg = (
                name => "Bill Gates",
                job  => "Microsoft President"
        );

Of course, it's a hassle having an individual variable for each person, and there are a lot of Bills in the world - I get about four a month, and that's enough for me. Ideally, we want to put them all together, and we'll store that in an array. Problem! Arrays can only hold scalars, not hashes. No problem - use a reference to flat-pack each hash into a scalar. We do this simply by adding a backslash before the name:

        $billc_r = \%billc;
        $billg_r = \%billg;

Now we've got two scalars, which contain all the data in our hashes. To unpack the data back into a hash, you dereference it: just tell Perl you want to make it into a hash. We know hashes start with %, so we prefix the name of our reference with that:

        %billc can now be accessed via %$billc_r
        %billg can now be accessed via %$billg_r

And now we can store these references into an array:

        @bills = ( $billc_r, $billg_r );

or

        @bills = ( \%billc, \%billg );

Hey presto! An array which contains two hashes - a complex data structure.

Of course, there are a couple more tricks: ways of creating references to arrays and hashes directly instead of taking a reference to an already existing variable, and ways of getting to the data in a reference directly instead of dereferencing to a new variable. (See the symmetry?)
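Just so you can recognise them when you see them, here's a sketch of both tricks, building on our Bills - the braces build anonymous hash references directly, and the subscripts reach into them directly:

        @bills = (
                { name => "Bill Clinton", job => "US President" },
                { name => "Bill Gates",   job => "Microsoft President" },
        );

        print $bills[1]{name};        # "Bill Gates" - no temporary variable
        print ${ $bills[0] }{job};    # "US President" - the long way round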

But as before, you don't need to know the whole of the language to get things done. Yes, it makes your code clearer and shorter if you don't use temporary variables unnecessarily, but if you don't know how to do that, it doesn't stop you using references.

Granted, you may not understand references in their entirety now. You may not even see why they're useful; in fact, if you're just writing simple programs that throw text around you probably will never need to use them.

But hopefully now you can sail between the Scylla that says you can't handle complicated data, and the Charybdis that says you can but it's hopelessly confusing, and see that, like the rest of the Odyssey, it's just a myth.

Perl is just for Unix

Isn't Perl just for Unix? I hear this, and I'm finding it harder to answer it with a straight face. The standard Perl distribution contains support for over 70 operating systems. Separate porting projects exist for Windows and for the Macintosh, and many other systems besides. It's hard to find a computer that Perl doesn't run on these days: Perl now even runs on the Psion organiser and is close to being built on the Palm Pilot.

This means that Perl is one of the most portable - if not the most portable - languages around. A properly written program will need absolutely no changes to move from Unix to Windows NT and on to the Macintosh, and an improperly written one will probably need three or four lines of change.

Perl is, most definitely, not just for Unix. That one is, purely and simply, a myth.

Perl is just for one-liners - you can't build `real' programs with it

The same sort of people who say that Perl is `just a scripting language' will probably try and tell you that Perl isn't suitable for `serious programming'. You wouldn't write an operating system in Perl, so it can't be any good.

Well, maybe you wouldn't write an operating system in it. I know one or two people who are trying, but they're freaks. However, this doesn't mean you can't build large, sophisticated and important programs in Perl. It's just a programming language.

People have written some pretty big stuff in Perl: it manages Slashdot, which shows it can stand up to a fair amount of load; it handles the data from the Human Genome Project; and it drives innumerable network servers and clients. Programs in the hundreds of thousands of lines of Perl are not uncommon.

Furthermore, you can extend Perl to get at any C libraries you have around - anything you can do in C, you can do in Perl, and more besides. Yes, it's a good language for one-liners and short data mangling, but that's not all Perl's about.

To say that Perl isn't suited for `serious programming' shows a misunderstanding either of what Perl is, or of what `serious programming' is, and is, at any rate, a myth.

Perl is just for CGI

Ah, the great CGI/Perl confusion. Since Perl is the best language around for producing dynamic web content, people have managed to get a little muddled about the differences between Perl and CGI.

CGI is just a protocol - an agreement that a web server and a program are going to talk the same language. You can get Perl to speak that protocol, and that's what a lot of people do. You can get C to speak CGI, if you must. I've written programs that talk CGI in INTERCAL, but then, I'm like that. There's nothing Perl specific about CGI.

There's also nothing CGI specific about Perl. Yes, that might be what most people out there are using Perl for, and yes, that's because it's a task that Perl is particularly well suited to. But as we've seen, people can do, and are doing, far more with Perl than just messing about on the Web. Heck, Perl was around back when the Web was in nappies.

CGI is Perl? Perl is CGI? It's all a load of myths.

Perl is too slow

Maybe you'll hear people say that Perl is too slow to be any use.

In some cases, it might be pretty slow relative to something like C: C can be anything up to 50 times faster than an equivalent Perl program. But it depends on a lot of things.

It depends on how you write your program; if you write in a C-like style, you'll find it runs considerably slower than an equivalent program written with Perl idioms. For instance, you could look at a string character by character, like you would in C. You'd be doing a lot of work yourself, though, that you could probably do with a simple regular expression. The less you ask Perl to do, the faster it runs.
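To make that concrete, here's a sketch of the two styles counting how many times the letter a appears in $string; the first does the walking itself, the second lets the engine do it in one pass:

        # C-style: inspect one character at a time
        $count = 0;
        foreach $i (0 .. length($string) - 1) {
                $count++ if substr($string, $i, 1) eq "a";
        }

        # Idiomatic Perl: tr/// counts the matches for you
        $count = ($string =~ tr/a//);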

Sometimes, though, there are things that C finds hard and Perl breezes through; string manipulation is one, because Perl allows you to think of things at the string level, instead of forcing you to see them a character at a time.

There are, however, occasions when C is going to win hands down in terms of running time. But if you're writing software yourself, you have to consider another area: development time. The amount of time and emotional energy you have to exert in programming is important to you, and, since programmers cost a lot of money these days, to whoever pays for you, too.

Let's take a really simple example to show how it works: we've got a document with a series of numbered points. We've added another point at line 50, so after that, every number at the beginning of a line after line 50 should be increased by one. I'd rather spend a few seconds to cook up a bit of Perl like this:

        % perl -p -e 's/^(\d+)/1+$1/e if $. > 50'

than a good half hour trying to hack it up in C, with the associated medication fees resulting from me having had to bang my head against a brick wall for that length of time.

Granted, maybe we don't need the speed of C for a simple example like that, but the principle extends to big programs too: You might be able to write faster programs in C, but you can always write programs faster in Perl.

Perl is insecure

What about security? There must be some chinks in the armour there. People can read the source code to your programs, so you're vulnerable to all sorts of problems!

While it's true that the source must be readable in order for the Perl interpreter to run it, this doesn't mean that Perl is insecure. It might mean that what you've written is insecure, and you think it would be better to hide away the deficiencies, but these days very few people - and Microsoft - actually believe that. Security by obscurity isn't very much security at all.

Just like the readability of your code and the wonderful Y2K bug, you can't blame Perl for what you choose to write with it. Perl isn't going to magically make you write secure programs - sure, if you use the tainting mechanism, it'll try its hardest to stop you writing insecure code, but that's no substitute for knowing how to write properly yourself.
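Taint mode is switched on with the -T flag, and Perl then refuses to let data from the outside world near anything dangerous until you've vetted it with a pattern match. Here's a minimal sketch (the filename policy here is just an example, not a recommendation):

        #!/usr/bin/perl -T
        $file = $ARGV[0];                # tainted: it came from outside
        if ($file =~ /^([\w.-]+)$/) {    # vet it with a pattern match...
                $file = $1;              # ...the captured text is untainted
        } else {
                die "Suspicious file name: $file\n";
        }
        open(LOG, ">> $file")                      # fine now; with a tainted
                or die "Can't open $file: $!\n";   # name, -T would refuse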

If you really want, you can try to hide the source code: you can use source filters, or you can try compiling it with the Perl compiler, but none of these things guarantees that it can't be decrypted or decompiled - and none of them will fix any problems. Far better just to write secure code.

So, what you write might be insecure, but is Perl itself insecure? No, that's another myth.

Perl is not commercial, so it can't be any good

Finally, you'll get those who claim that, since Perl isn't commercial software, it can't be any good. There's no support for it, the documentation is provided by volunteers, and so on.

It amazes me that in a world of Linux and Apache and Samba and many other winning technologies, people can still think like this. But then, it shouldn't amaze me, because commercial vendors want them to think like this and spend a lot of money trying to frighten them into doing so.

I could spend my time saying that because it's supported by volunteers, people are doing it for love instead of for money, and this leads to a better product, but let's not bother fighting on philosophical grounds about the nature of volunteer development. Let's get down to facts.

The standard Perl distribution contains over 70,000 lines of documentation, which should really be enough for anyone. If not, there are innumerable tutorials available on the web. Add to that all the lines of documentation on CPAN modules, and we've got a pretty substantial base of prose. And that's just the freely available stuff. At last count, there were over 250 books dedicated to Perl, and probably as many again that include it.

Documentation is not something we have a problem with.

What about support? Well, if you've read through all that documentation and you still have a problem, there are at least five Usenet newsgroups dedicated to Perl, and at least one IRC channel. These are all again staffed by volunteers, so they don't have to be nice to you if you obviously haven't read the FAQs. But that doesn't mean they're not useful, and some of the big names in the Perl world hang around there. You can probably find answers to your questions, if you show enough common sense.

Of course, you may need more than that - thousands of firms offer Perl training, and you can buy real support contracts, shrink-wrapped Perl packages and everything that would make even the most pointy-haired of bosses feel comfortable with it. Just because it's free doesn't mean it isn't commercial, and the idea that being free stops it being any good is nothing more than a myth.

Conclusion

That's not all the myths you'll hear about Perl; I haven't time to list them all, but there's a lot of disinformation out there. If you've heard any of the things I've mentioned above, I'd ask you to take another look at Perl; it's easier than you think, it's faster than you think, and it's better than you think. Don't listen to the myths - don't even take my word for it. Get out there and try it for yourself.

My Life With Spam

I wrote Part 1 of this series back in October 1999 for the LinuxPlanet web site, but the editors decided not to publish the rest of the series. Since then, many people have asked for the continuation. This article is Part 2.

Part 1 of the series discussed my early experiences with the spam problem, first on Usenet and then in my e-mail box. I talked about how to set up a mail filter program and how to have it parse incoming messages. I discussed splitting up the header into logical lines and why this is necessary. For the details and the Perl code, you can read the original article.

I also talked about my filtering philosophy, which is to blacklist the domains that send me a lot of spam, and reject mail from those domains.

Domain Pattern Matching

One way to handle domains might have been to take the sender's address in the message and strip the host name down to its domain. However, this is impossible, because a host name is a domain. perl.com is both a host name and a domain; www.perl.com is both a host name and a domain, and so is chocolaty-goodness.www.perl.com. In practice, though, it's easy to use a simple heuristic:

  1. Split the host name into components.
  2. If the last component is com, edu, gov, net, or org, then the domain name is the last two components.
  3. Otherwise, the domain name is the last three components.

The theory is that if the final component is not a generic top-level domain like com, it is probably a two-letter country code. Most countries imitate the generic space at the second level of their domain. For example, the United Kingdom has ac.uk, co.uk, and org.uk corresponding to edu, com, and org, so when I get mail from someone@thelonious.new.ox.ac.uk, I want to recognize the domain as ox.ac.uk (Oxford University), not ac.uk.

Of course, this is a heuristic, which is a fancy way of saying that it doesn't work. Many top-level domains aren't divided up at the third level the way I assumed. For example, the to domain has no organization at all, the same as the com domain. If I get mail from hot.spama.to, my program will blacklist that domain only, not realizing that it's part of the larger spama.to domain owned by the same people. So far, however, this has never come up.

And I didn't include mil in my list of exceptions. However, I've never gotten spam from a mil.

Eventually the domain registration folks will introduce a batch of new generic top-level domains, such as .firm and .web. But they've been getting ready for it since 1996; they're still working up to it; and I might grow old and die waiting for it to happen. (See http://www.gtld-mou.org/ for more information.)

For all its problems, this method has worked just fine for a long time because hardly any of the problems ever actually come up. There's a moral here: The world is full of terrifically over-engineered software. Sometimes you can waste a lot of time trying to find the perfect solution to a problem that only needs to be partially solved. Or as my friends at MIT used to say, ``Good enough for government work!''

Here's the code for extracting the domain:

 1    my ($user, $site) = $s =~ /(.*)@(.*)/;
 2    next unless $site;
 3    my @components =  split(/\./, $site);
 4    my $n_comp = ($components[-1] =~ /^(?:edu|com|net|org|gov)$/) ? 2 : 3;
 5    my $domain = lc(join '.', @components[-$n_comp .. -1]);
 6    $domain =~ s/^\.//;  # Remove leading . if there is one.

The sender's address is in $s. I extract the site name from the address with a simple pattern match, which is also a wonderful example of the "good enough" principle. Messages appear in the comp.lang.perl.misc newsgroup every week asking for a pattern that matches an e-mail address. The senders get a lot of very long, complicated answers, or are told that it can't be done. And yet there it is. Sure, it doesn't work. Of course, if you get mail addressed to "@"@plover.com, it's going to barf. Of course, if you get source-routed mail with an address like @send.here.first:joe@send.here.afterwards, it isn't going to work.

But guess what? Those things never happen. A production mail server has to deal with these sorts of fussy details, but if my program fails to reject some message as spam because it made an overly simple assumption about the format of the mail addresses, there's no harm done.

On line 2, we skip immediately to the next address if there's no site name, since it now appears that this wasn't an address at all. Line 3 breaks the site name up into components; thelonious.new.ox.ac.uk is broken into thelonious, new, ox, ac, and uk.

Line 4 is the nasty heuristic: It peeks at the last component, in this case uk, and if it's one of the magic five (edu, com, net, org, or gov), it sets $n_comp to 2; otherwise to 3. $n_comp is going to be the number of components that are part of the domain, so the domain of saul.cis.upenn.edu is upenn.edu, and the domain of thelonious.new.ox.ac.uk is ox.ac.uk.

To get the last component, we subscript the component array @components with -1. -1 as an array subscript means to get the last element of the array. Similarly, -2 means to get the next-to-last element. In this case, the last component is uk, which doesn't match the pattern, so $n_comp is 3.
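A quick illustration, as a sketch:

        @components = qw(thelonious new ox ac uk);
        print $components[-1];    # "uk", the last element
        print $components[-3];    # "ox", the third from the end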

On line 5, -$n_comp .. -1 is really -3 .. -1, which is the list -3, -2, -1. We use a Perl feature called a "list slice" to extract just the elements -3, -2, and -1 from the @components array. The syntax

        @components[(some list)]

invokes this feature. The list is taken to be a list of subscripts, and the elements of @components with those subscripts are extracted, in order. This is why you can write

        ($year, $month, $day) = (localtime)[5,4,3];

to extract the year, month, and day from localtime, in that order--it's almost the same feature. Here we're extracting elements -3 (the third-to-last), -2 (the second-to-last), and -1 (the last) and joining them together again. If $n_comp had been 2, we would have gotten elements -2 and -1 only.

Finally, line 6 takes care of a common case in which the heuristic doesn't work. If we get mail from alcatel.at, and try to select the last three components, we'll get one undefined component--because there were really only two there--and the result of the join will be .alcatel.at, with a bogus null component on the front. Line 6 looks to see if there's an extra period on the front of the domain, and if so, it chops it off.


Now That You Have It, What Do You Do With It?

I've extracted the domain name. The obvious thing to do is to have a big hash with every bad domain in it, and to look this domain up in the hash to see if it's there. Looking things up in a hash is very fast.

However, that's not what I decided to do. Instead, I have a huge file with a regex in it for every bad domain, and I take the domain in question and match it against all the patterns in the file. That's a lot slower. A lot slower. Instead of looking up the domain instantaneously, it takes 0.24 seconds to match the patterns.

Some people might see that and complain that it was taking a thousand times as long as it should. And maybe that's true. But the patterns are more flexible, and what's a quarter of a second more or less? The mail filter handled 2,211 messages in the month of January. At 0.24 seconds each, the pattern matching is costing me less than 9 minutes per month.
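The matching loop itself is only a few lines. This is a sketch of the idea rather than my actual filter code; the file name and the reject() helper here are made up, and defer() is explained below:

        open(PATS, "bad_domain_patterns")
                or defer("Can't open the pattern file: $!\n");
        while ($pat = <PATS>) {
                chomp $pat;
                # each line of the file is a regex describing a bad domain
                reject("$domain matches a bad-domain pattern\n")
                        if $domain =~ /$pat/i;
        }
        close(PATS);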

So much for the downside. What's the upside? I get to use patterns. That's a big upside.

I have a pattern in my pattern file that rejects any mail from anyone who claims that their domain is all digits, such as 12345.com. That would have been impossible with a hash. I have a pattern that rejects mail from anyone with "casino" in their domain. That took care of spam from Planetrockcasino.com and Romancasino.com before I had ever heard of those places. Remember that I only do the pattern matching on the domain, so if someone sent me mail from casino.ox.ac.uk, it would get through.

The regexes actually do have a potential problem: The patterns are in the file, one pattern per line. Suppose I'm adding patterns to the file and I leave a blank line by mistake. Then some mail arrives. The filter extracts the domain name of the sender and starts working through the pattern file. 0.24 seconds later, it gets to the blank line.

What happens when you match a string against the empty pattern? It matches, that's what. Every string matches the empty pattern. Since the patterns are assumed to describe mail that I don't want to receive, the letter is rejected. So is the next letter. So is every letter. Whoops.
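It's easy to demonstrate; this two-liner is just a sketch:

        $pat = "";                                      # a stray blank line
        print "rejected!\n" if "perl.com" =~ /$pat/;    # always prints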

It's tempting to say that I should just check for blank patterns and skip them if they're there, but that won't protect me against a line that has only a period and nothing else--that will also match any string.

Instead, here's what I've done:

        $MATCHLESS = "qjdhqhd1!&@^#^*&!@#";

        if ($MATCHLESS =~ /$badsite_pat/i) {
          &defer("The bad site pattern matched `$MATCHLESS', 
                  so I assume it would match anything.  Deferring...\n");
        }

Since the patterns are designed to identify bad domain names, none of them should match qjdhqhd1!&@^#^*&!@#. If a pattern does match that string, it probably also matches a whole lot of other stuff that it shouldn't. In that case, the program assumes that the pattern file is corrupt, and defers the delivery. This means that it tells qmail that it isn't prepared to deliver the mail at the present time, and that qmail should try again later on. qmail will keep trying until it gets through or until five days have elapsed, at which point it gives up and bounces the message back to the sender. Chances are that I'll notice that I'm not getting any mail sometime before five days have elapsed, look in the filter log file, and fix the pattern file. As qmail retries delivery, the deferred messages will eventually arrive.

Deferring a message is easy when your mailer is qmail. Here's the defer subroutine in its entirety:

        use Carp;      # defer() uses carp() from the standard Carp module

        sub defer {
          my $msg = shift;
          carp $msg;   # log the complaint, reporting the caller's location
          exit 111;    # tell qmail to defer delivery and try again later
        }

When qmail sees the 111 exit status from the filter program, it interprets it as a request to defer delivery. (Similarly, 100 tells qmail that there was a permanent failure and it should bounce the message back to the sender immediately. The normal status of 0 means that delivery was successful.)
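By analogy, a companion subroutine for permanent rejection would look almost identical. This reject() is my sketch of the idea, not necessarily the filter's actual code:

        sub reject {
          my $msg = shift;
          carp $msg;
          exit 100;    # qmail: permanent failure; bounce it back to the sender
        }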

I would still be in trouble if I installed com as a pattern in the pattern file, because it matches more domains than it should, but the MATCHLESS test doesn't catch it. But unlike the blank line problem, it's never come up, so I've decided to deal with it when it arises.


'Received:' Lines

In addition to filtering the From:, Reply-To:, and envelope sender addresses, I also look through the list of forwarding hosts for bad domains. The From: and Reply-To: headers are easy to forge: The sender can put whatever they want in those fields and spammers usually do. But the Received: fields are a little different. When computer A sends a message to computer B, the receiving computer B adds a Received: header to the message, recording who it is, when it received the message, and from whom. If the message travels through several computers, there will be several received lines, with the earliest one at the bottom of the header and the later ones added above it. Here's a typical set of Received: lines:

 1  Received: (qmail 7131 invoked by uid 119); 22 Feb 1999 22:01:59 -0000
 2  Received: (qmail 7124 invoked by uid 119); 22 Feb 1999 22:01:58 -0000
 3  Received: (qmail 7119 invoked from network); 22 Feb 1999 22:01:53 -0000
 4  Received: from renoir.op.net (root@209.152.193.4)
 5    by plover.com with SMTP; 22 Feb 1999 22:01:53 -0000
 6  Received: from pisarro.op.net (root@pisarro.op.net [209.152.193.22]) by renoir.op.net (o1/$Revision: 1.18 $) with ESMTP id RAA24909 for <mjd@mail.op.net>; Mon, 22 Feb 1999 17:01:48 -0500 (EST)
 7  Received: from linc.cis.upenn.edu (LINC.CIS.UPENN.EDU [158.130.12.3]) by pisarro.op.net (o2/$Revision: 1.1 $) with ESMTP id RAA12310 for <mjd@op.net>; Mon, 22 Feb 1999 17:01:45 -0500 (EST)
 8  Received: from saul.cis.upenn.edu (SAUL.CIS.UPENN.EDU [158.130.12.4])
 9    by linc.cis.upenn.edu (8.8.5/8.8.5) with ESMTP id QAA15020
10    for <mjd@op.net>; Mon, 22 Feb 1999 16:56:20 -0500 (EST)
11  Received: from mail.cucs.org (root@cucs-a252.cucs.org [207.25.43.252])
12    by saul.cis.upenn.edu (8.8.5/8.8.5) with ESMTP id QAA09203
13    for <mjd@saul.cis.upenn.edu>; Mon, 22 Feb 1999 16:56:19 -0500 (EST)
14  Received: from localhost.cucs.org ([192.168.1.223])
15    by mail.cucs.org (8.8.5/8.8.5) with SMTP id QAA06332
16    for <mjd@saul.cis.upenn.edu>; Mon, 22 Feb 1999 16:54:11 -0500

This is from a message that someone sent to mjd@saul.cis.upenn.edu, an old address of mine. Apparently the sender's mail client, on localhost.cucs.org, initially passed the message to the organization's mail server, mail.cucs.org. The mail server then added lines 14-16 to the message header.

The mail server then delivered the message to saul.cis.upenn.edu over the Internet. saul added lines 11-13. Notice that the time on line 13 is 128 seconds after the time on line 16. This might mean that the message sat on mail.cucs.org for 128 seconds before it was delivered to saul, or it might mean that the two computers' clocks are not properly synchronized.

When the mail arrived on saul, the mailer there discovered that I have a .forward file there directing delivery to mjd@op.net. saul needed to forward the message to mjd@op.net. However, most machines in the University of Pennsylvania CIS department do not deliver Internet mail themselves. Instead, they forward all mail to a departmental mail hub, linc, which takes care of delivering all the mail outside the organization. Lines 8-10 were added by linc when the mail was delivered to it by saul.

linc looked up op.net in the domain name service and discovered that the machine pisarro.op.net was receiving mail for the op.net domain. Line 7 was added by pisarro when it received the mail from linc.

I don't know why pisarro then delivered the message to renoir, but we know that it did, because line 6 says so.

qmail on plover.com added lines 4-5 when the mail was delivered from renoir. Then the final three lines, 1-3, were added by qmail for various local deliveries to mjd, then mjd-filter (which runs my spam filter), and finally, mjd-filter-deliver, which is the address that actually leads to my mailbox.

What can we learn from all this? The Received: lines have a record of every computer that the message passed through on its way to being delivered. And unlike the From: and Reply-To: lines, it really does record where the message has been.

Suppose the original sender, at localhost.cucs.org had wanted to disguise the message's origin. Let's call him Bob. Bob cannot prevent cucs.org from being mentioned in the message header. Why? Because there it is in line 11. Line 11 was put there by saul.cis.upenn.edu, not by Bob, who has no control over computers in the upenn.edu domain.

Bob can try to confuse the issue by adding spurious Received: lines, but he can't prevent the other computers from adding the correct ones.

Now, when spammers send spam, they often forge the From: and the Reply-To: lines so that people don't know who they are and can't come and kill them. But they can't forge the Received: lines because it's another computer that puts those in. So when we're searching for domains to check against the list of bad domain patterns, we should look through the Received: lines too.

The difficulty with that is that there's no standard for what a Received: line should look like or what should be in it, and every different mailer does its own thing. You can see this in the example above. We need a way to go over the Received: lines and look for things that might be domains. This is just the sort of thing that Perl regexes were designed for.

 1    sub forwarders {
 2      return @forwarders if @forwarders;
 3
 4      @forwarders =
 5        grep { /[A-Za-z]/ } ($H{'Received'} =~ m/(?:[\w-]+\.)+[\w-]+/g);
 6
 7      @forwarders = grep { !/(\bplover\.com|\bcis\.upenn\.edu|\bpobox\.com|\bop\.net)$/i } @forwarders;
 8
 9      foreach $r (@forwarders) {
10        $r{lc $r} = 1;
11      }
12
13      @forwarders = keys %r;
14
15      return @forwarders;
16    }

The message header has already been parsed and placed into the %H hash. $H{Received} contains the concatenation of all the Received lines in the whole message. The purpose of the forwarders() function is to examine $H{Received}, extract all the domain names it can find, and place them in the array @forwarders.

Lines 4-5 are the heart of this process. Let's look a little more closely.

        $H{'Received'} =~ m/(?:[\w-]+\.)+[\w-]+/g

This does a pattern match in the Received: lines. [\w-] looks for a single letter, digit, underscore, or hyphen, while [\w-]+ looks for a sequence of such characters, such as saul or apple-gunkies. This is a domain component. [\w-]+\. looks for a domain component followed by a period, like saul. or apple-gunkies..

Ignore the ?: for the time being. Without it, the pattern is ([\w-]+\.)+[\w-]+, which means a domain component followed by a period, then another domain component followed by another period, and so on, and ending with a domain component and no period. So this is a pattern that will match something that looks like a domain.

The /g modifier on the match instructs Perl to find all matching substrings and to return a list of them. Perl will look through the Received: headers, pulling out all the things that look like domains, making them into a list, and returning the list of domains.

Another example of this feature:

$s = "Timmy is 7 years old and he lives at 350 Beacon St. 
Boston, MA 02134" @numbers = ($s =~ m/\d+/g);

Now @numbers contains (7, 350, 02134).

I still haven't explained that ?:. I have to confess to a lie. Perl only returns the list of matching substrings if the pattern contains no parentheses. If the pattern contains parentheses, the parentheses cause part of the string to be captured into the special $1 variable, and the match returns a list of the $1s instead of a list of the entire matching substrings. If I'd done

"saul.cis.upenn.edu plover.com" =~ m/([\w-]+\.)+[\w-]+/g

instead, I would have gotten the list ("saul.cis.upenn.", "plover."), which are the $1s, because the com parts match the final [\w-]+, which is not in parentheses. The ?: in the real pattern is nothing more than a switch to tell Perl not to use $1. Since $1 isn't being used, we get the default behavior, and the match returns a list of everything that matched.

 
  4   @forwarders =
  5     grep { /[A-Za-z]/ } ($H{'Received'} =~ m/(?:[\w-]+\.)+[\w-]+/g);

The pattern match generates a list of things that might be domains. The list initially looks like:

  renoir.op.net 209.152.193.4
  plover.com
  pisarro.op.net pisarro.op.net 209.152.193.22 renoir.op.net 1.18 mail.op.net
  linc.cis.upenn.edu LINC.CIS.UPENN.EDU 158.130.12.3 pisarro.op.net 1.1 op.net
  saul.cis.upenn.edu SAUL.CIS.UPENN.EDU 158.130.12.4
  linc.cis.upenn.edu 8.8.5 8.8.5
  op.net
  mail.cucs.org cucs-a252.cucs.org 207.25.43.252
  saul.cis.upenn.edu 8.8.5 8.8.5
  saul.cis.upenn.edu
  localhost.cucs.org 192.168.1.223
  mail.cucs.org 8.8.5 8.8.5
  saul.cis.upenn.edu

As you can see, it contains a lot of junk. Most notably, it contains several occurrences of 8.8.5, because the upenn.edu mailer was Sendmail version 8.8.5. There are also some IP addresses that we won't be able to filter, and some other things that look like version numbers. The grep filters this list of items and passes through only those that contain at least one letter, discarding the entirely numeric ones.

@forwarders is now

       
  renoir.op.net
  plover.com
  pisarro.op.net pisarro.op.net renoir.op.net mail.op.net
  linc.cis.upenn.edu LINC.CIS.UPENN.EDU pisarro.op.net op.net
  saul.cis.upenn.edu SAUL.CIS.UPENN.EDU
  linc.cis.upenn.edu
  op.net
  mail.cucs.org cucs-a252.cucs.org
  saul.cis.upenn.edu
  saul.cis.upenn.edu
  localhost.cucs.org
  mail.cucs.org
  saul.cis.upenn.edu

The rest of the function is just a little bit of cleanup. Line 7 discards several domain names that aren't worth looking at because they appear so often:

7  @forwarders = grep { !/(\bplover\.com|\bcis\.upenn\.edu|\bpobox\.com|\bop\.net)$/i }
   @forwarders;

Plover.com is my domain, and it's going to appear in all my mail, so there's no point in checking it. I worked at the University of Pennsylvania for four and a half years, and I get a lot of mail forwarded from there, so there's no point in checking cis.upenn.edu domains either. Similarly, I subscribe to the Pobox.com lifetime e-mail forwarding service, and I get a lot of mail forwarded through there. op.net is my ISP domain name, which handles mail for me when Plover is down. Line 7 discards all these domains from the @forwarders list, leaving only the following:

 
       mail.cucs.org cucs-a252.cucs.org
       localhost.cucs.org
       mail.cucs.org

Lines 9-13 now discard duplicate items, using a common Perl idiom:

  
 9     foreach $r (@forwarders) {
 10       $r{lc $r} = 1;
 11     }
 12
 13     @forwarders = keys %r;

We use the remaining items as keys in a hash. Since a hash can't have the same key twice, the duplicate mail.cucs.org has no effect on the hash, which ends up with the keys mail.cucs.org, cucs-a252.cucs.org, and localhost.cucs.org. The values associated with these keys are each 1, which doesn't matter. When we ask Perl for a list of keys on line 13, we get each key exactly once.

Finally, line 15 returns the list of forwarders to whoever needed it.

There's one little thing I didn't discuss:

  2       return @forwarders if @forwarders;

The first thing the function does is check to see if it has already processed the Received: lines and computed @forwarders. If so, it returns the list without computing it over again. That way I can just call forwarders() anywhere in my program that I need a list of forwarders, without worrying that I might be doing the same work more than once; after the first call, forwarders() returns immediately.


More to Come

Because of the long delay, I'll repeat the quiz from the first article: What's wrong with this header line?

Received: from login_2961.sayme2.net (mail.sayme2.net[103.12.210.92])
by sayme2.net (8.8.5/8.7.3) with SMTP id XAA02040
for creditc@aoI.net;  Thu, 28 August 1997 15:51:23 -0700 (EDT)

The story's not over. In the next article, I'll talk about some other rules I used to filter the spam; one of them would have thrown out messages when it saw a line like the one above. Another one throws out mail when there's no To: line in the message--a likely sign of bulk mail.

I'll also tell a cautionary tale of how I might have lost a lot of money because my system worked too well, and how I found out that sometimes, you want to get unsolicited bulk mail.
