I've gotten really great results with Mark's SPAMJAM product, but as is my nature I'm always trying to improve.
I've been analyzing the spam that does get through to try to figure out how best to eliminate it. Richard really believes in the idea of Bayesian filtering -- he really likes solutions that are as purely mathematical as possible. I've noticed that one thing the spammer-scum do to avoid these filters is fill the mail with a large volume of hidden but otherwise perfectly good words. One of the most common patterns I've been seeing is a multi-part MIME message: an included image file as one MIME part, two plain-text paragraphs randomly cut from some online magazine source but hidden through the use of HTML tags, and a bit of HTML whose only real job is to display the image.
On your screen, what you see is the ad image. That's it. You can't filter very well on the text, or even do much Bayesian filtering on it, because the overwhelming majority of the text is going to be either entirely unique (id references for the MIME parts) or extremely common (text from the magazine article, HTML tags).
I spent pretty much the last two days working on this problem, and here's what I've come up with... it's designed to work as an ENHANCEMENT to SPAMJAM, as a post-processor.
Each MIME part is decoded (or ignored, for file attachments and images), then passed through a filter which does a few key things. First, it parses the text into "chunks" that are still in order, but distinct. Each chunk has two pieces: the text leading up to an HTML tag, and the tag itself with its attributes, if it has any. The next step is to pull out any attributes for size, color, bgcolor, href, and src. The rest of the tags are all dropped -- except comment tags, which are kept. These frequently have trackback data in them (e.g. <!-- 232983479212 -->) which can be useful from a Bayesian perspective.
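Here's a rough sketch of that chunking step, in Python just for illustration -- the real post-processor isn't necessarily written in Python, and the class and variable names here are my own illustrative choices:

    # Sketch: split a decoded MIME part into ordered (text, tag, attrs) chunks,
    # keeping only size/color/bgcolor/href/src attributes plus any comments.
    from html.parser import HTMLParser

    KEEP_ATTRS = {"size", "color", "bgcolor", "href", "src"}

    class ChunkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []      # list of (text, tag, kept_attrs)
            self._text = []       # text seen since the previous tag

        def handle_data(self, data):
            self._text.append(data)

        def _flush(self, tag, attrs):
            kept = {k: v for k, v in attrs if k in KEEP_ATTRS}
            self.chunks.append(("".join(self._text), tag, kept))
            self._text = []

        def handle_starttag(self, tag, attrs):
            self._flush(tag, attrs)

        def handle_endtag(self, tag):
            self._flush("/" + tag, [])

        def handle_comment(self, data):
            # comment tags are kept whole -- they often carry trackback ids
            self.chunks.append(("".join(self._text), "!--", {"comment": data}))
            self._text = []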
With color and bgcolor and a ton of computation and testing, I've come up with an algorithm that assigns a "contrast value". Using a constant at the top of the code, I've decided that any text with a contrast value of less than 60% gets ignored. Also, text which is set to a "size" less than 2 (or 5px, or 6pt) is ignored. Other text gets added to the "keep" pile. In my testing, that gets rid of 100% of the hidden inclusions designed to kill Bayesian filters. Values for href and src are "trimmed" to their least significant domain (no file name included) and also added to this pile. The pile is then trimmed of all whitespace other than a single space between strings of characters. Also, if more than 25% of the non-HTML-tag content is in "non-visible" space, I deny the message outright on the grounds that it's subterfuge.
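The contrast computation itself is where most of the tuning time went and I won't reproduce it here, but just to show the shape of the visibility check, here's a toy stand-in -- the luminance-difference formula below is NOT the real algorithm, only the 60% cutoff and the size thresholds come from what I described above:

    # Toy stand-in for the visibility check: NOT the tuned contrast algorithm,
    # just a simple luminance difference checked against the same 60% cutoff.
    CONTRAST_THRESHOLD = 0.60   # the constant at the top of the code

    def luminance(rgb_hex):
        # rough relative luminance of a #rrggbb color, 0.0 (black) .. 1.0 (white)
        r, g, b = (int(rgb_hex.lstrip("#")[i:i + 2], 16) / 255.0 for i in (0, 2, 4))
        return 0.2126 * r + 0.7152 * g + 0.0722 * b

    def is_visible(color, bgcolor, size):
        # text sized below 2 (or 5px, or 6pt) is treated as hidden
        digits = "".join(ch for ch in size if ch.isdigit())
        if digits:
            n = int(digits)
            if ("px" in size and n < 5) or ("pt" in size and n < 6):
                return False
            if "px" not in size and "pt" not in size and n < 2:
                return False
        # stand-in "contrast value": absolute luminance difference
        return abs(luminance(color) - luminance(bgcolor)) >= CONTRAST_THRESHOLD

So a call like is_visible("#fefefe", "#ffffff", "1") comes back False, and that near-white-on-white filler text never reaches the "keep" pile.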
What I'm left with, in most of the spam, is a surprisingly short list of "words" which fall into four categories:
a) valid words from the dictionary
b) URLs to images and websites
c) spammer id tags (often stored as html comments)
d) non-dictionary words - often using character substitution to avoid detection (e.g. V1C0DlN)
Now, running this against my mail file, after filtering out dictionary words that are not in my block list, I'm ending up with an average of only 10 words per message to check.
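For the curious, the bucketing is roughly this simple. In this sketch the dictionary and blocklist are assumed to be plain in-memory sets of lowercase words -- that's an assumption for illustration, not a statement about the real storage:

    # Sketch: bucket each surviving "word" into the four categories, then drop
    # dictionary words that aren't on the block list.
    import re

    URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)

    def classify(word, dictionary):
        if URL_RE.match(word):
            return "url"
        if word.startswith("<!--"):
            return "spammer_id"          # kept html comment / trackback tag
        if word.lower() in dictionary:
            return "dictionary"
        return "non_dictionary"          # e.g. V1C0DlN-style substitutions

    def words_to_check(words, dictionary, blocklist):
        keep = []
        for w in words:
            cat = classify(w, dictionary)
            if cat == "dictionary" and w.lower() not in blocklist:
                continue                 # plain english word, nothing to learn
            keep.append(w)
        return keep                      # in practice, ~10 words per message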
I'm seeing a VERY high success rate at spotting the very spam that gets through. Dictionary words that ARE in my blocklist, of course, trigger the deny right there.
With the non-dictionary words that are not in the blocklist, I'm planning the following steps:
1. Make a copy of the word and run it through a "cleansing" process of character swapping -- I have a hundred or so swaps I can perform, plus steps to remove extra periods and space characters (there's a rough sketch of this after the list). The 'cleansed' word is added to the list.
2. Check the cleansed word against the deny list & dictionary. If it's found in the deny list, of course it's a denial. If it's found in the dictionary, I deny as well -- because I assume the cleansing process has detected an attempt at subterfuge.
3. Now for the Bayesian part: classic Bayesian filtering says to look for the "high value" words which are strongly identified as appearing only in spam or only in non-spam, and do a count. My twist is that I'm able to do it on only the non-dictionary words (both cleansed and non-cleansed). That includes the URLs the spam points you to, and it means only about 10 words per message to manage. I'll probably do that with a call out to a servlet and manage the wordlist in a relational database, or else a simple binary tree stored on disk (there's a rough sketch of the scoring a bit further down).
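Here's the rough shape of steps 1 and 2. The swap table below is only a handful of entries for illustration (the real list has a hundred or so), and the dictionary and deny-list sets are assumed inputs:

    # Sketch of steps 1-2 for a non-dictionary, non-blocklisted word.
    SWAPS = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "$": "s", "@": "a"}

    def cleanse(word):
        # undo common character substitutions, then strip filler dots/spaces
        swapped = "".join(SWAPS.get(ch, ch) for ch in word.lower())
        return swapped.replace(".", "").replace(" ", "")

    def check_word(word, dictionary, deny_list):
        cleansed = cleanse(word)          # e.g. "V1AGRA" -> "viagra"
        if cleansed in deny_list:
            return "deny"                 # cleansed form hits the deny list
        if cleansed in dictionary:
            return "deny"                 # readable only after cleansing => subterfuge
        return "bayes"                    # pass both forms on to the Bayesian step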
What's really cool is that, thanks to SPAMJAM, I have a HUGE pool of data that is KNOWN spam, caught via other methods -- SpamCop checks, fraudulent use of AOL or other checkable IP sources, etc. Most spam is like other spam in a few ways.
The goal is that if just one spam message on the latest swiss watch deal is sent by someone with an easy-to-trap tell (like using an AOL from-address but not routing it through an AOL server, since the AOL server IPs are published), then I'll instantly have a high-value set of URLs (the links to the watch deal) to trigger the Bayesian catch of any other spam that comes through for the same thing. For me, the vast majority of spam is duplicated that way.
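To make the Bayesian step concrete, here's a minimal Graham-style scoring sketch over those ~10 surviving tokens. The spam/ham count tables stand in for the wordlist I'd keep in the database or on-disk tree, seeded from the SPAMJAM-caught corpus; all of the names here are illustrative assumptions:

    # Sketch: combine per-token spam probabilities over the surviving tokens
    # (URLs, comment ids, cleansed and raw non-dictionary words).
    import math

    def token_spamminess(token, spam_counts, ham_counts, n_spam, n_ham):
        # P(spam | token), clamped so a single token can't be absolute proof
        s = spam_counts.get(token, 0) / max(n_spam, 1)
        h = ham_counts.get(token, 0) / max(n_ham, 1)
        if s + h == 0:
            return 0.5                    # never seen before: no opinion
        return max(0.01, min(0.99, s / (s + h)))

    def message_score(tokens, spam_counts, ham_counts, n_spam, n_ham):
        # sum the log odds and squash back to a probability; ~0.9+ => deny
        log_odds = 0.0
        for t in tokens:
            p = token_spamminess(t, spam_counts, ham_counts, n_spam, n_ham)
            log_odds += math.log(p) - math.log(1.0 - p)
        return 1.0 / (1.0 + math.exp(-log_odds))

The nice part is that one trapped watch-deal message populates the spam counts for those URLs, and every duplicate after that scores high on them.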
If it works, I'll talk to Mark and see about releasing this pig to see how others fare.