My Life With Spam: Part 3

Mar 15, 2000 by Mark-Jason Dominus

How I Caught the Spam and What I Did With it When I Caught it

This article is the third in a series; you may want to read Part 1 and Part 2 if you haven’t yet. Here’s a short summary to refresh your memory:

I get a lot of spam mail, and I don’t like that. I used to get upset every time I read my mail and had to throw out a bunch of spam, so I decided to put some effort into writing a program that would filter my email and discard spam automatically. Then my blood pressure would stay low.

I decided that it was most important that nothing be thrown away for good, because I knew that no filter was perfect, and I was afraid that important business mail from a client might be silently discarded and I would lose my job. I also decided I wasn’t smart enough to filter on the actual content of messages—the touchstone example is that if someone sends me a letter complaining about spam, and they say ``Gosh, these Green Card Lottery messages are driving me crazy!’’ I don’t want to reject the message just because it mentions green card lotteries. I couldn’t figure out how to avoid that, so I resolved that I would filter on header lines only.

My main strategy is to maintain a blacklist of `bad domains’ that send a lot of spam, such as eastmail.com. Mail from these domains is presumed to be spam. Mail from anyone I don’t know in one of these domains is returned to the sender, with an apology, requesting them to re-send the message to a different address. Mail to that address is forwarded directly into my mailbox, and also into a program that adds their `From’ address to the list of `people I know’, so that next time they send me mail, it gets through the filter. The `people I know’ list is called a `whitelist’, and it’s just a DBM file. The code is very simple. After reading the message and parsing the header, as I described in Part 1, I look to see if the subject contains a secret password. If so, I add the message senders to my whitelist. Senders can also get on the whitelist by sending a message to a special address; this address invokes the program with a command-line argument that automatically enables whitelisting. The code looks something like this:

        if ($ARGV{Mode} eq 'Whitelist'
         || $H{Subject} =~ /$PASSWORD/) {
          foreach $s (senders()) {
            $db{lc $s} = time;
          }
        }

Drawbacks of This Approach

People occasionally ask what would happen if the spammers added the special whitelisting address to their mailing lists. This has never happened, but if they did, I would deal with it easily enough; I’d just change the address. Nobody ever needs to send more than one message to the whitelisting address, because once you’re on the whitelist, you’re on forever. So I wouldn’t be inconveniencing anyone by changing this address.

Of course, in the meantime, the spammer is still on the whitelist, and even if they don’t add the whitelist address to their address list, they might get the apology back in the mail and just add themselves to the whitelist. This has only happened a few times since I first started in late 1996—most spammers don’t provide valid return addresses, and those that do don’t read the replies to their mail.

For the couple of dedicated spammers who did try to add themselves to my whitelist, I have a special super-blacklist called the `losers list’. The filter rejects mail from losers no matter what address it comes to. It returns the message to the sender, with a note that says

        Hello.  I am an automatic reply.

        Your message is not welcome at this address.
        The recipient is not interested in hearing what
        you have to say.  Your mail being returned unread.

        Please do not send mail to this address in the future.

                Sincerely,

                The answering machine.

This has a side benefit: If I get really sick of someone I can add them to my loser list even if they’re not a spammer. It would probably be more mature to just have the filter throw away the message, rather than returning a note that says `I am ingoring you’, but I’m a spiteful guy.

So the logic in the filter program looks something like this:

if the message is from a loser
send it back with an `I am ignoring you’ note

else if the message is from someone on the whitelist
accept it

else if the message is from someone in a blacklisted domain
send it back with instructions about how to get whitelisted

else
accept it

The first two tests, for loserhood or whitelisting, are fast. The filter program keeps one big DBM file for both lists. The keys are `From:’ addresses, and the value is loser if the user is a loser, and the date of whitelisting if they user is on the whitelist. I keep the dates as an ASCII-formatted number of seconds since 1970, so the actual code looks something like this:

        if ($DB{$from} eq 'loser') {
          # they are a loser; reject mail
        } elsif ($DB{$from} > 0) {
          # they are on the whitelist; accept
        } elsif (domain_is_bad()) {
          # reject
        } else { 
          # accept
        }

domain_is_bad() takes longer, because it does many pattern matches. I discussed it in Part 2.

A Few More Heuristics

Checking the From:, Reply-To:, and Received: lines for the presence of domains that are `bad’ is the main part of the program. But there are a few extra miscellaneous rules I’ve put in that have worked really well. The function misc_bad_header checks these rules, returning true if it sees something bad, and false otherwise. For example:

  return 'Addressed To: "you" or "friend"' 
    if $H{To} =~ /(you|friend)/i;

This line looks at the To: line; if it’s addressed to you or to friend, it’s bad. Friend@public.com was particularly common in Spam for a while. The string returned by misc_bad_header is eventually entered in the log file as the reason for rejection.

At the end of the previous article, I asked what was wrong with the following Received: line:

        Received: from login_2961.sayme2.net (mail.sayme2.net[103.12.210.92])
          by sayme2.net (8.8.5/8.7.3) with SMTP id XAA02040
          for creditc@aoI.net;  Thu, 28 August 1997 15:51:23 -0700 (EDT)

People found all sorts of things wrong that hadn’t occurred to me! For example, underscores are not allowed in domain names, so the name login_2961.sayme2.net is illegal. However, many hosts do have illegal characters in their names, so this is probably not a good reason to reject the message.

Some people pointed out that the Received: line contains the dubious host name aoI.net, perhaps in hopes of fooling me into thinking that it actually said aol.net. Of course, they don’t look similar to the computer at all.

Jan-Pieter Cornet pointed out that the date is in the wrong format: It’s illegal to spell out August. It would be interesting to correlate the incidence of this sort of error with the incidence of Spam, but I haven’t done the analysis.

For me, the really interesting thing about this line is the time zone. EDT usually stands for `Eastern Daylight Time’, and the -0700 means that it’s the time zone seven hours behind Universal Coordinated Time. Guess what? EDT is four hours behind UTC, not seven. Similarly, EST is `Eastern Standard Time’, which is -0500. When messages like this started to show up, I added these lines to misc_bad_header():

  return 'Mangled time zone' if $H{Received} =~ /-0600 \(EST\)/
           || $H{Received} =~ /-0[57]00 \(EDT\)/;

I don’t know what caused the errors in the first place, but for a long time they were extremely reliable spam detectors. Neither one has come by in almost a year though, which is a shame; I miss them.

Yet More Heuristics

Here are some good ones:

          return 'X-PMFLAGS header' if exists $H{'X-PMFLAGS'};

I don’t know what an X-PMFLAGS line is, but it only seems to appear in spam.

          return 'Message was handled by bulk.mail' 
            if $H{Received} =~ /bulk.mail/i ;

There’s a bulk mailing program called bulk_mailer that identifies itself in the Received: line.

          return 'Subject contains "ad"/"adv"'
            if $H{Subject} =~ /\badv?\b/i;

Some spammers actually put a string like ADV: into the subject line of their message. Of course, the law of evolution by natural selection says that this trait will be selected against, and the next generation of spammers won’t do it as much. In the meantime, it works well.

The next one is obvious:

          return 'Subject: line contains $$$ '
            if $H{Subject} =~ /\${3}/;

The next one isn’t so obvious:

          return 'Username is all digits'
            if $H{From} =~ /^\d+\@/;

You’d be surprised at how often this comes up. Spammers seem to be moving away from it, though; now I’m more likely to get spam mail from an address like 23gfgd4@iperbole.bologna.it. I’d like to add a heuristic to throw away messages from unlikely-sounding usernames, but I don’t feel confident that I will be able to identify them reliably.

Finally, here’s a big heuristic that works well: Make a list of `bad’ words, and reject any message that includes any of the `bad’ words in any of the X-headers:

          my $pat = join '|', @bad_words;
          foreach $h (keys %H) {
            next unless $h =~ /^X-/;
            return "Header `$h' contains bad word $1"
              if $H{$h} =~ /($pat)/io;
          }

What words are `bad’? Oh, the usual naughtiness. `Cyberpromo’ heads the list. (Am I allowed to say that on a public web site?) `stealth’ is a good one.

There’s a trick here that you might not have seen before. You have a bunch of patterns, in this case the items in @bad_words, and a bunch of strings, in this case the values of %H, and you want to see if any of the strings match any of the patterns. The obvious technique is to use:

        foreach $string (values %H) { 
          foreach $pattern (@bad_words) {
            if ($string =~ /$pattern/) {
              return "String $string matches pattern $pattern"
            }
          }
        }

This is a mistake. Why? Because a regex ie like a little computer program: It must to be compiled, and then it can be executed and given an input, which is the string that you want to match against it. Compiling a regex is relatively slow, but you only need to do it once, and then you can give it many strings and see if any of them match, without recompiling.

Perl normally compiles your regexes at the same time as it compiles the rest of your program. But if the regexes change at run time, it can’t do that. Instead, it compiles each one just as it is about to be used. If there are 42 strings in the string list, each regex gets compiled 42 times.

Just reversing the order of the foreach loops speeds this up by a factor of 18, because each pattern is compiled only once, and then used for all the strings:

        foreach $pattern (@bad_words) {
          foreach $string (values %H) { 
            if ($string =~ /$pattern/) {
              return "String $string matches pattern $pattern"
            }
          }
        }

(I ran a simple test with 16 strings and nine patterns. If there were more patterns, the speedup would be greater.)

If misc_bad_header() is called only once, then compiling each pattern once is the best you can do. But if the loop just above is run twice, then all the patterns have to be compiled again the second time around; this is wasteful.

The solution is to construct one pattern that never varies at all, and compile it exactly once. We can construct one pattern by joining the real patterns together with |s in between. That’s what

          my $pat = join '|', @bad_words;

does; it takes a @bad_words list that looks like ('cyberpromo', 'stealth', ...) and turns it into a pattern that looks like cyberpromo|stealth|.... Then we match against this pattern:

         return "Header `$h' contains bad word $1"
           if $H{$h} =~ /($pat)/io;

The /o modifier tells Perl that even though $pat looks like a variable, we’re really using it as a constant. We promise perl that $pat will never change, and in return, Perl compiles the regex once, the first time it is needed, and never compiles it again.

In this program, this clever hack is probably not worth doing, because the filter program gets invoked once to handle one mail message, calls misc_bad_headers() exactly once, and then exits. If I were doing it over, I would have left out the trick—it comes under the heading of `premature optimization’, and premature optimization is the root of all evil. But the trick can be useful in other contexts.

Heuristics that Didn’t Work

I got tired of getting mail titled RE: BIOTECH BARGAIN! and **LET YOUR COMPUTER MAKE MONEY WHILE YOU SLEEP!**, so I thought I’d try putting in a filter to reject mail whose Subject: line was in all capitals:

             my $s = $H{Subject};
             # subject contains no lowercase letters
             return 'Subject: all capitals' if $s !~ /[a-z]/ ;

The very next email message I received was titled FYI. It was from my brother-in-law. The spam filter rejected it. I commented out the `all capitals’ test.

A few months later, I was ready for more abuse. I tried this::

             $s = tr/A-Za-z//cd;
             return 'Subject: all capitals' if $s !~ /[a-z]/ ;
               && length($s) > 6;  # and at least six capitals

This removes non-letters from the subject, then rejects the message if the subject is all capitals and is at least seven letters long. Soon after I tried this, I got mail from one of my clients titled URGENT FIX!!!. That was the end of that experiment.

The all-capitals subject line does not hold the record for the biggest spam filtering failure. That honor goes to this line:

        return 'no To: line' if $H{To} eq '';

This seemed fairly innoccuous, and still does—In fact, I’m still using it. Last November, when I was away in London, I received a bulk mail that fell afoul of this test and was sent to the spam file. All spam goes into a file so that I can glance over it later, just in case it turns out to be something important. The subject lines are logged separately. Usually I glance over the subject line log (which is short) looking for messages that might be interesting; if I see any, I go look in the spam file to see if it really is interesting.

When I got back from London, I glanced over the subject log, and saw that

        Subject: A message from VA Linux Systems (Mostly in English)

had been rejected because there was no To: line. I peeked at the message in the spam file, but I didn’t look at it closely, because I had a ton of unanswered mail to get to. I forgot about it until a couple of months later when I ran into Chris DiBona at a conference. ``Did you get the letter?’’ he asked.

``What letter?''

My spam filter had discarded VA Linux’s extremely lucrative IPO announcement.

Oops.

Why haven’t I taken out the To: line rule? Because it worked! It was supposed to detect bulk mail, and that’s just what it did do. The moral here is that there’s some bulk mail that you actually want to receive.

Notes From Last Time

I got a lot of interesting mail about the last article in this series. Several people wrote to me to suggest a different filtering strategy: Simply throw away mail unless one of your own addresses is in the To: line.

A lot of people wrote in about this, so it must work well for some people, but it won’t work for me. As I mentioned in Part 1, I had considered this back in 1996. I decided that it would probably result in a lot of false positives, where non-spam mail was discarded. (For example, the IPO mail would still have been rejected under this regime.) And then I went through my archive of spam and discovered that about 20% of it actually was addressed to me, so I would have a 20% false negative rate also, where spam would get through to me because it was addressed to me. With the system I do use, the false positive rate is well below 10%, and the false negative rate is much lower.

Someone also wrote to suggest that whenever I got spam, I could send an automatic complaint letter to the relevant ISP, demanding revocation of the sender’s account.

This is a perfectly terrible idea, for several reasons. Most obviously, most spam I get these days has a fake return address anyway, and complaining to the owners of hapless.com isn’t going to accomplish anything if the mail wasn’t actually sent from hapless.com, except to harass the hapless owners. So it’s at best useless and at worst a serious nuisance to people who are already trying to deal with a huge number of bogus bounced messages.

But the real reason I hate this idea so much is that it makes even a small false positive rate into a disaster. I can’t reliably identify all the non-spam messages I get; even with three layers of backup, I still lost out on the VA Linux IPO because I thought it was spam. But I can live with that; I knew when I set up the filter that I might one day miss something important because of it. I made an informed decision to take that risk.

But my correspondents, the people who write to me, did not make an informed decision. They are stuck with my policy whether they like it or not, and I owe it to them to inconvenience them as little as possible. A legitimate person who sends me mail and falls afoul of the spam filter has to send their message to me again, and even though they only have to do it once, it’s inconvenient and annoying. That inconvenience and annoyance to innocent people who have never done anything wrong other than to try to communicate with me is the biggest problem with the whole filtering strategy.

But if inconveniencing my correspondent by forcing them to re-send a message is a selfish thing to do, how much more selfish is it to inconvenience them by mailing their ISP to demand that their account be shut down? I can apologize to someone for making them send a second message. How could I apologize for demanding that their internet access be revoked? I just don’t have that much gall.

Notes From Next Time

Coming next month, the last article in this series: Why I got 85 spam messages last Wednesday, why I’m not happy with filtering, and what I’m doing about it. This involves, of all things, a hack to the Apache web server.

Tags