March 2000 Archives

My Life With Spam: Part 3

How I Caught the Spam and What I Did With it When I Caught it


This article is the third in a series; you may want to read Part 1 and Part 2 if you haven't yet. Here's a short summary to refresh your memory:

I get a lot of spam mail, and I don't like that. I used to get upset every time I read my mail and had to throw out a bunch of spam, so I decided to put some effort into writing a program that would filter my email and discard spam automatically. Then my blood pressure would stay low.

I decided that it was most important that nothing be thrown away for good, because I knew that no filter was perfect, and I was afraid that important business mail from a client might be silently discarded and I would lose my job. I also decided I wasn't smart enough to filter on the actual content of messages---the touchstone example is that if someone sends me a letter complaining about spam, and they say ``Gosh, these Green Card Lottery messages are driving me crazy!'' I don't want to reject the message just because it mentions green card lotteries. I couldn't figure out how to avoid that, so I resolved that I would filter on header lines only.

My main strategy is to maintain a blacklist of `bad domains' that send a lot of spam, such as eastmail.com. Mail from these domains is presumed to be spam. Mail from anyone I don't know in one of these domains is returned to the sender, with an apology, requesting them to re-send the message to a different address. Mail to that address is forwarded directly into my mailbox, and also into a program that adds their `From' address to the list of `people I know', so that next time they send me mail, it gets through the filter.

The `people I know' list is called a `whitelist', and it's just a DBM file. The code is very simple. After reading the message and parsing the header, as I described in Part 1, I look to see if the subject contains a secret password. If so, I add the message's senders to my whitelist. Senders can also get on the whitelist by sending a message to a special address; this address invokes the program with a command-line argument that automatically enables whitelisting. The code looks something like this:

        # Whitelist the senders if the program was invoked in
        # whitelisting mode, or if the subject contains the password
        if ($ARGV{Mode} eq 'Whitelist'
         || $H{Subject} =~ /$PASSWORD/) {
          foreach $s (senders()) {
            $db{lc $s} = time;   # record the date of whitelisting
          }
        }
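
That snippet assumes %db has already been tied to the on-disk whitelist. A minimal sketch of the setup (the file name here is made up):

        # bind %db to the whitelist DBM file; entries persist between runs
        dbmopen(%db, "$ENV{HOME}/.filter/whitelist", 0600)
          or die "Couldn't open whitelist database: $!";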


Drawbacks of This Approach

People occasionally ask what would happen if the spammers added the special whitelisting address to their mailing lists. This has never happened, but if they did, I would deal with it easily enough; I'd just change the address. Nobody ever needs to send more than one message to the whitelisting address, because once you're on the whitelist, you're on forever. So I wouldn't be inconveniencing anyone by changing this address.

Of course, in the meantime, the spammer is still on the whitelist, and even if they don't add the whitelist address to their address list, they might get the apology back in the mail and just add themselves to the whitelist. This has only happened a few times since I first started in late 1996---most spammers don't provide valid return addresses, and those that do don't read the replies to their mail.

For the couple of dedicated spammers who did try to add themselves to my whitelist, I have a special super-blacklist called the `losers list'. The filter rejects mail from losers no matter what address it comes to. It returns the message to the sender, with a note that says

        Hello.  I am an automatic reply.

        Your message is not welcome at this address.
        The recipient is not interested in hearing what
        you have to say.  Your mail is being returned unread.

        Please do not send mail to this address in the future.

                Sincerely,

                The answering machine.

This has a side benefit: If I get really sick of someone I can add them to my loser list even if they're not a spammer. It would probably be more mature to just have the filter throw away the message, rather than returning a note that says `I am ignoring you', but I'm a spiteful guy.

So the logic in the filter program looks something like this:

        if the message is from a loser
            send it back with an `I am ignoring you' note
        else if the message is from someone on the whitelist
            accept it
        else if the message is from someone in a blacklisted domain
            send it back with instructions about how to get whitelisted
        else
            accept it

The first two tests, for loserhood or whitelisting, are fast. The filter program keeps one big DBM file for both lists. The keys are `From:' addresses, and the value is loser if the user is a loser, or the date of whitelisting if the user is on the whitelist. I keep the dates as an ASCII-formatted number of seconds since 1970, so the actual code looks something like this:

        if ($DB{$from} eq 'loser') {
          # they are a loser; reject mail
        } elsif ($DB{$from} > 0) {
          # they are on the whitelist; accept
        } elsif (domain_is_bad()) {
          # reject
        } else { 
          # accept
        }

domain_is_bad() takes longer, because it does many pattern matches. I discussed it in Part 2.


A Few More Heuristics

Checking the From:, Reply-To:, and Received: lines for the presence of domains that are `bad' is the main part of the program. But there are a few extra miscellaneous rules I've put in that have worked really well. The function misc_bad_header checks these rules, returning true if it sees something bad, and false otherwise. For example:

  return 'Addressed To: "you" or "friend"' 
    if $H{To} =~ /\b(you|friend)\b/i;

This line looks at the To: line; if it's addressed to you or to friend, it's bad. Friend@public.com was particularly common in spam for a while. The string returned by misc_bad_header is eventually entered in the log file as the reason for rejection.

At the end of the previous article, I asked what was wrong with the following Received: line:

        Received: from login_2961.sayme2.net (mail.sayme2.net[103.12.210.92])
          by sayme2.net (8.8.5/8.7.3) with SMTP id XAA02040
          for creditc@aoI.net;  Thu, 28 August 1997 15:51:23 -0700 (EDT)

People found all sorts of things wrong that hadn't occurred to me! For example, underscores are not allowed in domain names, so the name login_2961.sayme2.net is illegal. However, many hosts do have illegal characters in their names, so this is probably not a good reason to reject the message.

Some people pointed out that the Received: line contains the dubious host name aoI.net, perhaps in hopes of fooling me into thinking that it actually said aol.net. Of course, they don't look similar to the computer at all.

Jan-Pieter Cornet pointed out that the date is in the wrong format: It's illegal to spell out August. It would be interesting to correlate the incidence of this sort of error with the incidence of spam, but I haven't done the analysis.

For me, the really interesting thing about this line is the time zone. EDT usually stands for `Eastern Daylight Time', and the -0700 means that it's the time zone seven hours behind Coordinated Universal Time (UTC). Guess what? EDT is four hours behind UTC, not seven. Similarly, EST is `Eastern Standard Time', which is -0500. When messages like this started to show up, I added these lines to misc_bad_header():

  return 'Mangled time zone' if $H{Received} =~ /-0600 \(EST\)/
           || $H{Received} =~ /-0[57]00 \(EDT\)/;

I don't know what caused the errors in the first place, but for a long time they were extremely reliable spam detectors. Neither one has come by in almost a year though, which is a shame; I miss them.


Yet More Heuristics

Here are some good ones:

          return 'X-PMFLAGS header' if exists $H{'X-PMFLAGS'};

I don't know what an X-PMFLAGS line is, but it only seems to appear in spam.

          return 'Message was handled by bulk.mail' 
            if $H{Received} =~ /bulk.mail/i;   # the . also matches the _ in `bulk_mailer'

There's a bulk mailing program called bulk_mailer that identifies itself in the Received: line.

          return 'Subject contains "ad"/"adv"'
            if $H{Subject} =~ /\badv?\b/i;

Some spammers actually put a string like ADV: into the subject line of their message. Of course, the law of evolution by natural selection says that this trait will be selected against, and the next generation of spammers won't do it as much. In the meantime, it works well.

The next one is obvious:

          return 'Subject: line contains $$$ '
            if $H{Subject} =~ /\${3}/;

The next one isn't so obvious:

          return 'Username is all digits'
            if $H{From} =~ /^\d+\@/;

You'd be surprised at how often this comes up. Spammers seem to be moving away from it, though; now I'm more likely to get spam mail from an address like 23gfgd4@iperbole.bologna.it. I'd like to add a heuristic to throw away messages from unlikely-sounding usernames, but I don't feel confident that I will be able to identify them reliably.

Finally, here's a big heuristic that works well: Make a list of `bad' words, and reject any message that includes any of the `bad' words in any of the X-headers:

          my $pat = join '|', @bad_words;
          foreach $h (keys %H) {
            next unless $h =~ /^X-/;
            return "Header `$h' contains bad word $1"
              if $H{$h} =~ /($pat)/io;
          }

What words are `bad'? Oh, the usual naughtiness. `Cyberpromo' heads the list. (Am I allowed to say that on a public web site?) `stealth' is a good one.

There's a trick here that you might not have seen before. You have a bunch of patterns, in this case the items in @bad_words, and a bunch of strings, in this case the values of %H, and you want to see if any of the strings match any of the patterns. The obvious technique is to use:

        foreach $string (values %H) { 
          foreach $pattern (@bad_words) {
            if ($string =~ /$pattern/) {
              return "String $string matches pattern $pattern"
            }
          }
        }

This is a mistake. Why? Because a regex is like a little computer program: It must be compiled, and then it can be executed and given an input, which is the string that you want to match against it. Compiling a regex is relatively slow, but you only need to do it once, and then you can give it many strings and see if any of them match, without recompiling.

Perl normally compiles your regexes at the same time as it compiles the rest of your program. But if the regexes change at run time, it can't do that. Instead, it compiles each one just as it is about to be used. If there are 42 strings in the string list, each regex gets compiled 42 times.

Just reversing the order of the foreach loops speeds this up by a factor of 18, because each pattern is compiled only once, and then used for all the strings:

        foreach $pattern (@bad_words) {
          foreach $string (values %H) { 
            if ($string =~ /$pattern/) {
              return "String $string matches pattern $pattern"
            }
          }
        }

(I ran a simple test with 16 strings and nine patterns. If there were more patterns, the speedup would be greater.)
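If you want to reproduce the measurement, the standard Benchmark module makes it easy. Here's a rough sketch; the patterns and strings are made up, and the exact speedup depends on them:

        use Benchmark qw(timethese);

        my @patterns = qw(cyberpromo stealth bulkmail freemoney);
        my @strings  = map { "X-Mailer: mailer number $_" } 1 .. 16;

        timethese(5000, {
            # the pattern changes on every inner iteration,
            # so Perl recompiles it each time
            strings_outside  => sub {
                foreach my $s (@strings) {
                    foreach my $p (@patterns) { $s =~ /$p/; }
                }
            },
            # each pattern is compiled once, then tried
            # against all sixteen strings
            patterns_outside => sub {
                foreach my $p (@patterns) {
                    foreach my $s (@strings) { $s =~ /$p/; }
                }
            },
        });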

If misc_bad_header() is called only once, then compiling each pattern once is the best you can do. But if the loop just above is run twice, then all the patterns have to be compiled again the second time around; this is wasteful.

The solution is to construct one pattern that never varies at all, and compile it exactly once. We can construct one pattern by joining the real patterns together with |s in between. That's what

          my $pat = join '|', @bad_words;

does; it takes a @bad_words list that looks like ('cyberpromo', 'stealth', ...) and turns it into a pattern that looks like cyberpromo|stealth|.... Then we match against this pattern:

         return "Header `$h' contains bad word $1"
           if $H{$h} =~ /($pat)/io;

The /o modifier tells Perl that even though $pat looks like a variable, we're really using it as a constant. We promise Perl that $pat will never change, and in return, Perl compiles the regex once, the first time it is needed, and never compiles it again.

In this program, this clever hack is probably not worth doing, because the filter program gets invoked once to handle one mail message, calls misc_bad_header() exactly once, and then exits. If I were doing it over, I would leave out the trick---it comes under the heading of `premature optimization', and premature optimization is the root of all evil. But the trick can be useful in other contexts.


Heuristics that Didn't Work

I got tired of getting mail titled RE: BIOTECH BARGAIN! and **LET YOUR COMPUTER MAKE MONEY WHILE YOU SLEEP!**, so I thought I'd try putting in a filter to reject mail whose Subject: line was in all capitals:

             my $s = $H{Subject};
             # subject contains no lowercase letters
             return 'Subject: all capitals' if $s !~ /[a-z]/ ;

The very next email message I received was titled FYI. It was from my brother-in-law. The spam filter rejected it. I commented out the `all capitals' test.

A few months later, I was ready for more abuse. I tried this:

             (my $s = $H{Subject}) =~ tr/A-Za-z//cd;   # keep only the letters
             return 'Subject: all capitals'
               if $s !~ /[a-z]/       # no lowercase letters
               && length($s) > 6;     # and at least seven letters

This removes non-letters from the subject, then rejects the message if the subject is all capitals and is at least seven letters long. Soon after I tried this, I got mail from one of my clients titled URGENT FIX!!!. That was the end of that experiment.

The all-capitals subject line does not hold the record for the biggest spam filtering failure. That honor goes to this line:

        return 'no To: line' if $H{To} eq '';

This seemed fairly innocuous, and still does---in fact, I'm still using it. Last November, when I was away in London, I received a bulk mail that fell afoul of this test and was sent to the spam file. All spam goes into a file so that I can glance over it later, just in case it turns out to be something important. The subject lines are logged separately. Usually I glance over the subject line log (which is short) looking for messages that might be interesting; if I see any, I go look in the spam file to see if it really is interesting.

When I got back from London, I glanced over the subject log, and saw that

        Subject: A message from VA Linux Systems (Mostly in English)

had been rejected because there was no To: line. I peeked at the message in the spam file, but I didn't look at it closely, because I had a ton of unanswered mail to get to. I forgot about it until a couple of months later when I ran into Chris DiBona at a conference. ``Did you get the letter?'' he asked.

``What letter?''

My spam filter had discarded VA Linux's extremely lucrative IPO announcement.

Oops.

Why haven't I taken out the To: line rule? Because it worked! It was supposed to detect bulk mail, and that's just what it did do. The moral here is that there's some bulk mail that you actually want to receive.


Notes From Last Time

I got a lot of interesting mail about the last article in this series. Several people wrote to me to suggest a different filtering strategy: Simply throw away mail unless one of your own addresses is in the To: line.

A lot of people wrote in about this, so it must work well for some people, but it won't work for me. As I mentioned in Part 1, I had considered this back in 1996. I decided that it would probably result in a lot of false positives, where non-spam mail was discarded. (For example, the IPO mail would still have been rejected under this regime.) And then I went through my archive of spam and discovered that about 20% of it actually was addressed to me, so I would have a 20% false negative rate also, where spam would get through to me because it was addressed to me. With the system I do use, the false positive rate is well below 10%, and the false negative rate is much lower.

Someone also wrote to suggest that whenever I got spam, I could send an automatic complaint letter to the relevant ISP, demanding revocation of the sender's account.

This is a perfectly terrible idea, for several reasons. Most obviously, most spam I get these days has a fake return address anyway, and complaining to the owners of hapless.com isn't going to accomplish anything if the mail wasn't actually sent from hapless.com, except to harass the hapless owners. So it's at best useless and at worst a serious nuisance to people who are already trying to deal with a huge number of bogus bounced messages.

But the real reason I hate this idea so much is that it makes even a small false positive rate into a disaster. I can't reliably identify all the non-spam messages I get; even with three layers of backup, I still lost out on the VA Linux IPO because I thought it was spam. But I can live with that; I knew when I set up the filter that I might one day miss something important because of it. I made an informed decision to take that risk.

But my correspondents, the people who write to me, did not make an informed decision. They are stuck with my policy whether they like it or not, and I owe it to them to inconvenience them as little as possible. A legitimate person who sends me mail and falls afoul of the spam filter has to send their message to me again, and even though they only have to do it once, it's inconvenient and annoying. That inconvenience and annoyance to innocent people who have never done anything wrong other than to try to communicate with me is the biggest problem with the whole filtering strategy.

But if inconveniencing my correspondent by forcing them to re-send a message is a selfish thing to do, how much more selfish is it to inconvenience them by mailing their ISP to demand that their account be shut down? I can apologize to someone for making them send a second message. How could I apologize for demanding that their internet access be revoked? I just don't have that much gall.


Notes From Next Time

Coming next month, the last article in this series: Why I got 85 spam messages last Wednesday, why I'm not happy with filtering, and what I'm doing about it. This involves, of all things, a hack to the Apache web server.

This Week on p5p 2000/03/05



Notes

Well, my apologies to everyone for the long absence. I'm going to try to send out reports more frequently, and to let go some of my obsession about being completely complete, and see if I can stay on schedule.

This means that if you think I missed something of importance, I would be more grateful than ever if you would bring it to my attention.

Meta-Information

The most recent report will always be available at http://www.perl.com/p5pdigest.cgi.

You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.

Please send corrections and additions to mjd-perl-thisweek-YYYYMM@plover.com where YYYYMM is the current year and month.

Brief Update

Since last time, Perl has had three beta releases and now stands at version 5.5.670.

Announcement for version 5.5.670

Announcement for version 5.5.660

In the wake of this, list traffic is very high with various small bug reports and configuration problems.

Module Warnings

Paul Marquess added functionality to the warnings.pm lexical warnings module so that a module could peek at the warnings that are enabled in its caller, and issue similar sorts of warnings. For example, your function might issue a warning about a newline at the end of its filename argument, but only if your function's caller has the newline category of warnings enabled.
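For instance, a hypothetical function like this one can ask whether its caller has `newline' warnings enabled before it complains (a minimal sketch; the package and function names are made up):

        package Fred;
        use warnings;

        sub open_file {
            my ($name) = @_;
            # complain about a trailing newline, but only if the
            # caller has `newline' warnings turned on
            if ($name =~ /\n$/ and warnings::enabled('newline')) {
                warnings::warn('newline', "file name '$name' ends with a newline");
            }
            # ... now actually open the file ...
        }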

Paul asked whether it would be a good idea to allow modules to be able to register their own warning categories, and presented a possible interface to such functionality. Then you might say something like

        use Fred;
        use warnings 'Fred';

to turn on the warnings that were special to the Fred module. Paul asked for comments from the list about this. Original message

There was a lot of interesting discussion of this. Paul said that he had thought about making it possible for a module to define its own hierarchy of warnings, but that that seemed like overkill. Ron Kimball suggested that use warnings 'Fred' or no warnings 'Fred' might enable or disable warnings from module Fred regardless of whether or not Fred had registered its own warning category. Paul said he had considered this, but that it would suggest to people that doing use warnings 'Some::Module' would necessarily do something useful, when in fact Some::Module might not issue any warnings at all.
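For reference, here is a sketch of what registration looks like under the warnings::register interface that eventually shipped; whether it matches Paul's proposal exactly, I can't say, and the module and warning are made up:

        package Fred;
        use warnings::register;    # creates a `Fred' warnings category

        sub frob {
            # warn only if the caller has `Fred' warnings enabled
            warnings::warnif('Fred', "frob() called with no arguments")
              unless @_;
        }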

Yitzchak Scott-Thoennes pointed out that if module names are warning categories, then you could put explanations of your warnings into your module's POD, and then diagnostics would be able to find the explanations of your own module's warnings.

POD Changes

5.5.670 brought with it some changes to the structure of POD. In particular, the old problem of putting angle brackets into C<...> tags is probably solved. You can now write C<< $var->method >>, and the rule is that the text to be output is delimited by the double angle brackets and whitespace.
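For instance, here's how the same method call might be written in the old style and in the new (a made-up line of POD):

        The old style:   C<$var-E<gt>method>
        The new style:   C<< $var->method >>

Both render as $var->method; in the new style, the whitespace just inside the double brackets is stripped from the output.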

Changes went into Pod::Man, Pod::Parser, and Pod::Html to support this. Robin Barker pointed out that pod2latex still did not support this and other changes, and rather than dig into the horrible squiggly guts of pod2latex to fix it, he wrote a preprocessor (using Pod::Parser) that removes the new escape codes from a POD document and yields old-style POD. He bolted this to the front of pod2latex, which in my opinion was the right thing to do. Way to cut that Gordian knot, dude.

Race conditions in standard Perl utilities

Tom Christiansen posted a `Request for Hero' to fix the parts of the Perl distribution that write files to /tmp in an unsafe way, laying themselves open to race condition exploits and so forth. A typical problem is that /tmp is world-writable, so a malicious person could remove the temporary file and replace it with a new one before the program notices. Worse, if the superuser is running the program, a malicious user might remove the temporary file and replace it with a symbolic link to the password file; when the program updates what it thinks is the temporary file, it is really updating the password file.

Tom listed several utilities that may have these problems, including perlcc, perldoc, perlbug, and s2p, although he also said that none of these bugs were actually in the core. Read about it.

Tom called for a core File::Temp module that would encapsulate safe temporary file creation functions.

Tim Jenness said that he had written such a thing about a year ago, and he volunteered to be the hero. Discussion of various attacks via /tmp ensued.
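For the curious, the safe idiom that such a module encapsulates looks roughly like this (a sketch using the File::Temp interface; the template is made up):

        use File::Temp qw(tempfile);

        # tempfile() invents an unpredictable name and opens it with
        # O_CREAT|O_EXCL in one atomic step, so an attacker has no
        # window in which to slip in a symbolic link
        my ($fh, $filename) = tempfile("scratchXXXXXX",
                                       DIR    => "/tmp",
                                       UNLINK => 1);
        print $fh "scratch data\n";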

open Calls not checked in perldoc

Tom posted a related note, showing about half a dozen places in perldoc where a system call, usually open, was performed with no check of the return code.
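The fix in each case is the usual idiom; the file handle and variable names here are hypothetical:

        open(POD, "< $podfile")
          or die "Couldn't open $podfile: $!";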

Big Flame Wars

Somehow the `open calls' discussion turned into a gigantic flame war.

If it's not too late, I advise skipping it. The subject is `security (and stupidity) bugs in perldoc'.

Chip and Jarkko Quit

Chip Salzenberg and Jarkko Hietaniemi quit p5p, citing big flame wars as the reasons. Chip created a new mailing list, which is supposed to be just like p5p, except that ``verbal abuse will not be tolerated.''

Leaving aside any questions of how Chip plans to enforce this, I believe that forking p5p is an incredibly bad idea. I can only hope that the new list will turn out to be irrelevant, but I'm really afraid that it will draw off just enough of the productive people from the main list to prevent both groups from being effective.

I'd love to revisit this in a year and find out that I'm wrong.

Templates in pack and unpack

Way back in October, Ilya had an idea for making pack and unpack capable of automatically unpacking data in any format, even one that was not known in advance. I wrote this up in October, but Ilya's new messages now suggest that I didn't understand the point. Ilya says:

Ilya: There were some initial objections due to misunderstanding of the posted example: people thought that the examples of serialization I initially posted for illustration were the only ways to serialize data.

Ilya's recent message is probably worth reading, but skip the rest of the thread. It was a waste of time even before the tail end was sucked into the flame zone.

(Old news here)

use strict in the core modules

Peter Scott asked why 84 of the 109 core modules did not use strict. That is a good question, and I don't think the answer is going to turn out to be `carelessness and historical accident'. At least, not in every case.

If someone did go around putting use strict into all the core modules, I think it would be a real good idea if they also kept track of how many actual bugs they found and corrected as a result, and how many non-bugs.
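The classic actual bug that use strict catches is a misspelled variable name; a tiny made-up illustration:

        use strict;
        my $total = 0;
        $tofal += 1;   # compile-time error under strict:
                       # Global symbol "$tofal" requires explicit package name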

Various

A large collection of bug reports, bug fixes, non-bug reports, questions, and answers. Also spam.

Until next time I remain, your humble and obedient servant,


Mark-Jason Dominus