I've gotten really great results with Mark's SPAMJAM product, but as is my nature I'm always trying to improve.
I've been analyzing the spam that does get through to try to figure out how best to eliminate it. Richard really believes in the idea of Bayesian filtering -- he really likes solutions that are as purely mathematical as possible. I've noticed that one thing the spammer-scum do to avoid these filters is fill the mail with a large volume of hidden but otherwise perfectly good words. One of the most common patterns I've been seeing is a multi-part MIME message: an included image file as one MIME part, two plain-text paragraphs randomly cut from some online magazine source but hidden through the use of HTML tags, and a bit of HTML whose only real job is to display the image.
On your screen, what you see is the ad image. That's it. You can't filter very well on the text, or even do much Bayesian filtering on it, because the overwhelming majority of the text is going to be either entirely unique (id references for the MIME parts) or extremely common (text from the magazine article, HTML tags).
I spent pretty much the last two days working on this problem, and here's what I've come up with... it's designed to work as an ENHANCEMENT to SPAMJAM, as a post-processor.
Each MIME part is decoded (or ignored, for file attachments and images), then passed through a filter which does a few key things. First, it parses the text into "chunks" that are still in order, but distinct. Each chunk has two pieces: the text leading up to an HTML tag, and the tag itself with its attributes, if it has any. The next step is to pull out any attributes for size, color, bgcolor, href, and src. The rest of the tags are all dropped -- except comment tags, which are kept. These frequently have trackback data in them (e.g. <!-- 232983479212 -->) which can be useful from a Bayesian perspective.
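Here's a rough sketch of that chunking step, in Python just for illustration -- the real post-processor isn't necessarily written in Python, and the class and variable names here are my own illustrative choices:

    # Sketch: split a decoded MIME part into ordered (text, tag, attrs) chunks,
    # keeping only size/color/bgcolor/href/src attributes plus any comments.
    from html.parser import HTMLParser

    KEEP_ATTRS = {"size", "color", "bgcolor", "href", "src"}

    class ChunkParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []      # list of (text, tag, kept_attrs)
            self._text = []       # text seen since the previous tag

        def handle_data(self, data):
            self._text.append(data)

        def _flush(self, tag, attrs):
            kept = {k: v for k, v in attrs if k in KEEP_ATTRS}
            self.chunks.append(("".join(self._text), tag, kept))
            self._text = []

        def handle_starttag(self, tag, attrs):
            self._flush(tag, attrs)

        def handle_endtag(self, tag):
            self._flush("/" + tag, [])

        def handle_comment(self, data):
            # comment tags are kept whole -- they often carry trackback ids
            self.chunks.append(("".join(self._text), "!--", {"comment": data}))
            self._text = []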
With color and bgcolor and a ton of computation and testing, I've come up with an algorithm that assigns a "contrast value". Using a constant at the top of the code, I've decided that any text with a contrast value of less than 60% gets ignored. Also, text which is set to a "size" less than 2 (or 5px, or 6pt) is ignored. Other text gets added to the "keep" pile. In my testing, that gets rid of 100% of the hidden inclusions designed to kill Bayesian filters. Values for href and src are "trimmed" to their least significant domain (no file name included) and also added to this pile. The pile is then trimmed of all whitespace other than a single space between strings of characters. Also, if more than 25% of the non-HTML-tag content is in "non-visible" space, I deny the message outright on the grounds that it's subterfuge.
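The contrast computation itself is where most of the tuning time went and I won't reproduce it here, but just to show the shape of the visibility check, here's a toy stand-in -- the luminance-difference formula below is NOT the real algorithm, only the 60% cutoff and the size thresholds come from what I described above:

    # Toy stand-in for the visibility check: NOT the tuned contrast algorithm,
    # just a simple luminance difference checked against the same 60% cutoff.
    CONTRAST_THRESHOLD = 0.60   # the constant at the top of the code

    def luminance(rgb_hex):
        # rough relative luminance of a #rrggbb color, 0.0 (black) .. 1.0 (white)
        r, g, b = (int(rgb_hex.lstrip("#")[i:i + 2], 16) / 255.0 for i in (0, 2, 4))
        return 0.2126 * r + 0.7152 * g + 0.0722 * b

    def is_visible(color, bgcolor, size):
        # text sized below 2 (or 5px, or 6pt) is treated as hidden
        digits = "".join(ch for ch in size if ch.isdigit())
        if digits:
            n = int(digits)
            if ("px" in size and n < 5) or ("pt" in size and n < 6):
                return False
            if "px" not in size and "pt" not in size and n < 2:
                return False
        # stand-in "contrast value": absolute luminance difference
        return abs(luminance(color) - luminance(bgcolor)) >= CONTRAST_THRESHOLD

So a call like is_visible("#fefefe", "#ffffff", "1") comes back False, and that near-white-on-white filler text never reaches the "keep" pile.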
What I'm left with, in most of the spam, is a surprisingly short list of "words" which fall into four categories:
a) valid words from the dictionary
b) URLs to images and websites
c) spammer id tags (often stored as html comments)
d) non-dictionary words - often using character substitution to avoid detection (e.g. V1C0DlN)
Now, running this against my mail file, after filtering out dictionary words that are not in my block list, I'm ending up with an average of only 10 words per message to check.
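For the curious, the bucketing is roughly this simple. In this sketch the dictionary and blocklist are assumed to be plain in-memory sets of lowercase words -- that's an assumption for illustration, not a statement about the real storage:

    # Sketch: bucket each surviving "word" into the four categories, then drop
    # dictionary words that aren't on the block list.
    import re

    URL_RE = re.compile(r"^(https?://|www\.)", re.IGNORECASE)

    def classify(word, dictionary):
        if URL_RE.match(word):
            return "url"
        if word.startswith("<!--"):
            return "spammer_id"          # kept html comment / trackback tag
        if word.lower() in dictionary:
            return "dictionary"
        return "non_dictionary"          # e.g. V1C0DlN-style substitutions

    def words_to_check(words, dictionary, blocklist):
        keep = []
        for w in words:
            cat = classify(w, dictionary)
            if cat == "dictionary" and w.lower() not in blocklist:
                continue                 # plain english word, nothing to learn
            keep.append(w)
        return keep                      # in practice, ~10 words per message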
I'm seeing a VERY high success rate at spotting the very spam that gets through. Dictionary words that ARE in my blocklist, of course, trigger the deny right there.
With the non-dictionary words that are not in the blocklist, I'm planning the following steps:
1. Make a copy of the word and run it through a "cleansing" process of character swapping -- I have a hundred or so swaps I can perform, plus steps to remove extra periods and space characters (there's a rough sketch of this after the list). The 'cleansed' word is added to the list.
2. Check the cleansed word against the deny list & dictionary. If it's found in the deny list, of course it's a denial. If it's found in the dictionary, I deny as well -- because I assume the cleansing process has detected an attempt at subterfuge.
3. Now for the Bayesian part: classic Bayesian filtering says to look for the "high value" words which are strongly identified as appearing only in spam or only in non-spam, and do a count. My twist is that I'm able to do it on only the non-dictionary words (both cleansed and non-cleansed). That includes the URLs the spam points you to, and it means only about 10 words per message to manage. I'll probably do that with a call out to a servlet and manage the wordlist in a relational database, or else a simple binary tree stored on disk (there's a rough sketch of the scoring a bit further down).
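Here's the rough shape of steps 1 and 2. The swap table below is only a handful of entries for illustration (the real list has a hundred or so), and the dictionary and deny-list sets are assumed inputs:

    # Sketch of steps 1-2 for a non-dictionary, non-blocklisted word.
    SWAPS = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "$": "s", "@": "a"}

    def cleanse(word):
        # undo common character substitutions, then strip filler dots/spaces
        swapped = "".join(SWAPS.get(ch, ch) for ch in word.lower())
        return swapped.replace(".", "").replace(" ", "")

    def check_word(word, dictionary, deny_list):
        cleansed = cleanse(word)          # e.g. "V1AGRA" -> "viagra"
        if cleansed in deny_list:
            return "deny"                 # cleansed form hits the deny list
        if cleansed in dictionary:
            return "deny"                 # readable only after cleansing => subterfuge
        return "bayes"                    # pass both forms on to the Bayesian step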
What's really cool is that, thanks to SPAMJAM, I have a HUGE pool of data that is KNOWN spam, caught via other methods -- SpamCop checks, fraudulent use of AOL or other checkable IP sources, etc. Most spam is like other spam in a few ways.
The goal is that if just one spam message on the latest swiss watch deal is sent by someone with an easy-to-trap tell (like using an AOL from-address but not routing it through an AOL server, since the AOL server IPs are published), then I'll instantly have a high-value set of URLs (the links to the watch deal) to trigger the Bayesian catch of any other spam that comes through for the same thing. For me, the vast majority of spam is duplicated that way.
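To make the Bayesian step concrete, here's a minimal Graham-style scoring sketch over those ~10 surviving tokens. The spam/ham count tables stand in for the wordlist I'd keep in the database or on-disk tree, seeded from the SPAMJAM-caught corpus; all of the names here are illustrative assumptions:

    # Sketch: combine per-token spam probabilities over the surviving tokens
    # (URLs, comment ids, cleansed and raw non-dictionary words).
    import math

    def token_spamminess(token, spam_counts, ham_counts, n_spam, n_ham):
        # P(spam | token), clamped so a single token can't be absolute proof
        s = spam_counts.get(token, 0) / max(n_spam, 1)
        h = ham_counts.get(token, 0) / max(n_ham, 1)
        if s + h == 0:
            return 0.5                    # never seen before: no opinion
        return max(0.01, min(0.99, s / (s + h)))

    def message_score(tokens, spam_counts, ham_counts, n_spam, n_ham):
        # sum the log odds and squash back to a probability; ~0.9+ => deny
        log_odds = 0.0
        for t in tokens:
            p = token_spamminess(t, spam_counts, ham_counts, n_spam, n_ham)
            log_odds += math.log(p) - math.log(1.0 - p)
        return 1.0 / (1.0 + math.exp(-log_odds))

The nice part is that one trapped watch-deal message populates the spam counts for those URLs, and every duplicate after that scores high on them.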
If it works, I'll talk to Mark and see about releasing this pig to see how others fare.