
Bayesian filtering



Hi, guys.

As some may have noticed, we've been experimenting with Bayesian filtering in SpamAssassin for the last five days or so. I approached this with skepticism, but after doing more research, I decided that it was possible to do a technically sound job. Unfortunately, after doing all the work, I'm not convinced that it's really doing anything at all for us.

Basically, we have a common Bayes database set up in the standard SpamAssassin way. Training is based on three components. All received mail that scores zero or lower is trained as non-spam. Mail that scores 8 points or higher is trained as spam. Also, for the last several days, all spam submissions made with the Report as spam button in webmail have been trained as spam. At the current time, we've had about 40,000 spams and 5,000 non-spams trained into the system. We're running a larger-than-default database of tokens and currently have 165,000 different tokens in the system.
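
For anyone who wants it spelled out, the training policy amounts to roughly this (a simplified Python sketch; the function names and bodies are purely illustrative stand-ins, not our actual code):

    # Simplified sketch of the training policy described above.
    # train_ham()/train_spam() are illustrative stubs, not real code.

    def train_ham(message: str) -> None:
        print("learn as ham:", message[:40])

    def train_spam(message: str) -> None:
        print("learn as spam:", message[:40])

    def auto_train(message: str, sa_score: float) -> None:
        if sa_score <= 0:
            train_ham(message)       # mail scoring zero or lower -> non-spam
        elif sa_score >= 8:
            train_spam(message)      # mail scoring 8 or higher -> spam
        # anything in between is not auto-trained

    def report_as_spam(message: str) -> None:
        # the webmail "Report as spam" button always trains as spam
        train_spam(message)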

There are several problems with the current system. It greatly increases the load on the mail filtering servers and the file server where the Bayes data is stored. It can cause mail to be delayed when mail backs up because of the load. These problems can be handled by throwing more servers at the problem. That's expensive, but possible if it contributes to greater accuracy.

Another problem is that it slows down held mail processing. Now, when you report a message as spam, the server needs to also train the Bayes system. This takes a non-negligible amount of time and when you add that up for 20 or 30 or 100 messages on a screenful of spam, it takes a lot longer to report a bunch of spam. Theoretically, we could make Bayes learning a user-selectable option. If you're unwilling to take the time, I guess you could turn it off for your spam submissions.

The real problem is that I don't feel like it's stopping much, if any, spam. Now, one note. If you see anything going through the blade6 server, you can ignore that because it's not doing Bayes right now. The other servers are, though.

I've been looking at a lot of spam recently. A ton of spam. I'm seeing a lot of stuff like this:

X-spam-Status:    hits=20.0 tests=BAYES_90,FAKE_HELO_YAHOO_CA,HTML_70_80, HTML_MESSAGE,HTML_TITLE_UNTITLED,MIME_BASE64_TEXT, MIME_HTML_NO_CHARSET,MIME_HTML_ONLY,RCVD_IN_BL_SPAMCOP_NET,SUBJ_BUY version=2.63

This was identified as spam by Bayes, but it already had a ton of points in SpamAssassin and was blocked by SpamCop anyway. So the whole Bayes thing was pretty much a waste here.

I'm also seeing a lot of stuff like this:

X-spam-Status:    hits=0.6 tests=BAYES_44,BIZ_TLD,HTML_50_60,HTML_MESSAGE, MIME_HTML_ONLY version=2.63

In this case, the Bayes engine said it might be spam, it might not, and basically gave it 0 points for spam. The subject of this email was FWD: Here's All Pills. V1AgR[at] * :X:ANAx ; V+a+lium & Fi0.ric3t # S|o|ma * Pnter:m:in vHMPH. That's exactly the kind of thing Bayes was supposed to help with, right? I guarantee that I personally have trained a few hundred of these messages, and the users in total have trained thousands. But Bayes has essentially no opinion on whether this email was spam or not.

And then we have messages with scores like this:

X-spam-Status:    hits=0.0 tests=BAYES_20,HTML_60_70,HTML_IMAGE_ONLY_02, HTML_MESSAGE version=2.63

In this case, the Bayes system actually reduced the score of a spam. I get a lot of false negatives: lots of spams that have BAYES_00 or BAYES_20 or BAYES_01 in them. Now, most of them didn't have high enough scores before Bayes to have been caught otherwise anyway. Still, it's annoying.

Basically, what I'm seeing falls into three categories: spams marked as definitely spam by Bayes, but which would have been blocked anyway; spams marked as not spam by Bayes, which in general wouldn't have been blocked otherwise either; and spams for which Bayes has no opinion.

What I haven't seen is spams that are held because Bayes put them over the top. BAYES_99 is worth over 4 points, but I'm not seeing spams in the 5 to 10 point range with BAYES_99.

Bayes is also supposed to help with false positives: a message that otherwise would have been held might get negative points from Bayes and therefore doesn't incorrectly get held. But false positives are few and far between with SpamAssassin as it is.

So, now you know what's going on. People were throwing around 99% blocking rates with Bayes, but I don't believe it. There's too much stuff like this:

arrive urging sumatra berkowitz multinomial floury brotherhood corvus quo amanita differentiable auntie bethel edematous abusable gordian tincture inveigle cryogenic nonsensical velours foxy balcony eerie voiceband stefan woe champagne opulent soya muriatic autocrat deject seattle interstitial etruscan gerry bulge
bivalve hop simultaneity checkpoint chump bloodshot aseptic nanking soapstone bagel solute bus
dauphine shelf shortish luxuriate basemen godsend beauregard homily cylinder device runyon evade speech incongruity partook waggle telecommunicate volcano kamikaze cousin boom sop wiretap dumbbell bufflehead eyelash

These lists of random words break Bayes. By switching them around for every spam and keeping the actual spam content short, the spammers ensure that nearly all of the tokens found in a message are ones found primarily in non-spam. I'm also seeing spam that's just an image.
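
To make the word-salad effect concrete, here's a rough Python sketch using the simple Graham-style combining formula. This is a simplification (SpamAssassin's actual combiner is different), and the per-token probabilities are invented for illustration:

    # Why "word salad" drags a Bayes score down: a few strongly spammy
    # tokens get swamped by a pile of random dictionary words that the
    # database has mostly seen in legitimate mail. Simple Graham-style
    # combining; token probabilities below are made up for the example.
    from functools import reduce

    def combine(token_probs):
        """Combine per-token spam probabilities into one message probability."""
        prod_p = reduce(lambda a, b: a * b, token_probs, 1.0)
        prod_not_p = reduce(lambda a, b: a * b, (1.0 - p for p in token_probs), 1.0)
        return prod_p / (prod_p + prod_not_p)

    spam_tokens = [0.95] * 5                 # a short, strongly spammy payload
    salad_tokens = spam_tokens + [0.2] * 40  # same payload plus 40 "hammy" random words

    print(combine(spam_tokens))    # ~0.9999996 -> looks like near-certain spam
    print(combine(salad_tokens))   # ~2e-18     -> looks like near-certain ham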

At this point, I'm inclined just to turn the Bayes stuff off. I'm willing to leave it in if it's actually blocking spam. If we're just wasting time, though, I'm going to shut it off to save the load on the system and make spam reporting fast again.

JT


JT,

Thanks for trying the Bayes system.

Another problem is that it slows down held mail processing. Now, when you report a message as spam, the server needs to also train the Bayes system. This takes a non-negligible amount of time and when you add that up for 20 or 30 or 100 messages on a screenful of spam, it takes a lot longer to report a bunch of spam.
Couldn't this part be done in batch mode?

By "report a message as spam", do you just mean "Report as spam" in Webmail, or do you also mean "Quick - report immediately and trash", "Queue for reporting (and move to trash)", and/or web-based reporting?

Also, shouldn't the Bayes system get smarter with more training?

Thanks!


shouldn't the Bayes system get smarter with more training

No doubt that this is the way it's supposed to work, but the fact that the spammers are working just as hard to foil the process kind of points back to JT's results .. it might be a cool tool on "your" system, but it looks pretty bad from JT's list .. perhaps the SpamCop reports are too large of a sampling, thus the reason for the "not really a yes/no decision" that JT's describing ...???

I see this as pretty similar to HotMail's experience with BrightMail filtering .. spammers would just jack around until they got "the right mix", then start the spam run ... there's little doubt in my mind that the Ralskys of the world run all the anti-spam stuff themselves and whup the hell out of it to figure out how to get around the blocks, filters, and tricks.

"Nothing is sadder than the murder of a beautiful theory by a gang of ugly facts."

- Jeffrey Zeldman, in A List Apart


JT,

Thanks for trying the Bayes system.

Couldn't this part be done in batch mode?

By "report a message as spam", do you just mean "Report as spam" in Webmail, or do you also mean "Quick - report immediately and trash", "Queue for reporting (and move to trash)", and/or web-based reporting?

Also, shouldn't the Bayes system get smarter with more training?

Thanks!

Well, there is no "batch mode" unless I write one. What I'm talking about is only in webmail with the Report as spam link. The way that link works now, since it's a web system running PHP, is that each spam is processed when you click the link, before the web page refreshes. You might be able to write a system that moves the spams to a different place and has a daemon do the learning on all of them later, but that's a lot more complicated.
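
If someone did write it, the daemon side might look roughly like this (a hypothetical Python sketch; the queue directory, file naming, and polling interval are all assumptions, not anything we run today):

    # Hypothetical batch learner: the webmail "Report as spam" link would
    # just drop a copy of the message into a queue directory and return
    # immediately; this daemon then trains the shared Bayes db in batches.
    import glob
    import os
    import subprocess
    import time

    QUEUE_DIR = "/var/spool/bayes-queue"   # made-up location
    BATCH_SIZE = 100

    def train_batch(paths):
        # one sa-learn invocation per batch keeps the per-message
        # overhead out of the web request
        subprocess.run(["sa-learn", "--spam"] + paths, check=True)
        for p in paths:
            os.remove(p)

    while True:
        queued = sorted(glob.glob(os.path.join(QUEUE_DIR, "*.eml")))
        for i in range(0, len(queued), BATCH_SIZE):
            train_batch(queued[i:i + BATCH_SIZE])
        time.sleep(60)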

As far as Bayes goes, I think it's as smart as it's going to get. The SpamAssassin guys say that there's really no advantage to training beyond 5,000 messages. I think the problem is that the spammers are actively defeating Bayes by simply putting lots of random words into the spams. It's trivial for them to do but defeats token-based systems. The spammers are also resorting to a lot more crazy spellings and such; some of this mail is barely readable. In a way, I guess it's a victory that the spammers are forced to resort to such measures. However, if there really are 5,000 ways to spell Cialis, then we'll never get enough training to accurately trap those emails.

JT


I'm not trying to "pile on" here but I want to present facts. If others have a different experience, please let me know.

I realized that the VER system is an easy way to identify SpamAssassin-tagged emails that might have been caught by Bayes. I looked through 250 spams in my personal Held Mail from today. Of those 250, about 6 or 7 were caught only by SpamAssassin and had scores in the range of 5 to 10. That range is the only place where a high Bayes score could have made the difference between a message being trapped or not.

When looking at these messages, the only one that was influenced one way or the other was a single message (a spam) that scored BAYES_00 and had its SA score reduced by the Bayes filter. It scored 8.0, but would have been over 12.0 if it had not been for Bayes.

Now, my sample of SpamAssassin messages is pretty small. But the sample of total spam was not. My feeling is that you'd have to look at hundreds of caught spam before you found a single message that was caught because of Bayes. If the system is only going to increase the spam blocking by 0.5% or less, it's not worth it.
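
For anyone who wants to repeat the count on their own held mail, what I did by hand amounts to something like this (a rough Python sketch; the mailbox path is a placeholder and the 5-to-10 cutoffs are just the range I described above):

    # Walk a mailbox of held spam and count messages where a high Bayes
    # score could plausibly have been the deciding factor: total score in
    # the 5-10 range with a BAYES_9x rule among the tests.
    import mailbox
    import re

    candidates = 0
    with_high_bayes = 0

    for msg in mailbox.mbox("/path/to/held-mail.mbox"):  # placeholder path
        status = msg.get("X-Spam-Status", "")
        m = re.search(r"hits=(-?[\d.]+)", status)
        if not m:
            continue
        score = float(m.group(1))
        if 5.0 <= score <= 10.0:
            candidates += 1
            if "BAYES_9" in status:     # BAYES_90 or BAYES_99
                with_high_bayes += 1

    print(candidates, "spams scored 5-10;", with_high_bayes, "of them had BAYES_9x")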

I'm going to leave it running for a few more days, but unless something radically changes, this will be labeled a failed experiment.

JT


Jefft, I think something else is failing and queering this experiment. See the new thread I started, as well as one other.

Sudden BARRAGE of messages with SA scores of 10+ (I am set to 5) that are passing through. See them in .spam.

VERY interested in your thoughts, as EVERYONE using SA in Bayes mode (including our own Ellen) reports that trained Bayesian filtering on their own computers ROCKS.

I will recheck the Blade 6 issue and see if I see any correlation or non-correlation with all the "aberrant" spam that is getting through...

David


Thanks for trying out SA's Bayes, JT. Here are some thoughts:

1) I have to seriously wonder if our bayes db is accurate. It seems to me that a spam message should never trigger BAYES_00, BAYES_10, etc. I noticed one SA admin on the SA mailing list state that this would indicate improper training of the db. See a brief conversation of this here.

If I am interpreting my messages correctly, then it seems one very damaging piece of training on our bayes db results from the "SpamCop Quick reporting data" messages that people receive from quick reporting in the web interface. Here is a section of the headers from one of the quick reporting messages I recently received:

X-spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on blade1
X-spam-Level:
X-spam-Status: hits=-104.5 tests=BAD_CREDIT,BAYES_00,USER_IN_WHITELIST version=2.63
X-SpamCop-Checked: 192.168.1.101 206.14.107.113 192.168.11.204

It is negative 104! I assume that this message was processed as ham and is therefore skewing our bayes db with improper weighting of tokens, taken at least from the many spam "From:" and "Subject:" lines it contains, and possibly also from domain names and IPs in the message. JT and fellow users, can you confirm that every "SpamCop Quick reporting data" email that we receive is processed as ham automatically, or did I just receive a fluke?

2) I also propose that our training is skewed towards spammy messages and not hammy ones. From JT's explanation above, it appears that the extreme cases (on both sides) are automatically trained; however, that is the only time that "ham" messages are getting trained. Spam messages get trained quite a bit more, because every time we submit spam those messages are trained as "spammy." Our current system does automatic training on clear-cut cases for both ham and spam, but beyond that, users can only train on spam. No legitimate email (false positive or otherwise) can be trained by users as ham under the current system. I do not think this reflects well on our bayes db and its effectiveness.

3) Maybe this is not a big deal, but blade6 adds further to the skew towards training on spam and not on ham. Right now blade6 is not performing bayes. This means that any message passing through blade6 does not get automatically classified as spam or ham; however, users still report messages as spam via reporting, which then trains those messages as spam. The end result (if I am thinking straight) is that NO legitimate mail that passes through blade6 gets trained as ham, while most of the spam that passes through blade6 does get trained as spam.

I seriously think our bayes db is weak. If "case 1" (above) is true, we should consider starting over with a new db so that the "SpamCop Quick reporting data" messages can be excluded from training.
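
If we did rebuild, one straightforward way to keep those messages out of the auto-training would be a check like this (a purely illustrative Python sketch; the subject-based test and the stub training functions are assumptions, not anything SpamCop actually runs):

    # Illustrative only: skip "SpamCop Quick reporting data" messages when
    # auto-training, so their spammy From:/Subject: tokens never enter the
    # shared bayes db. The training functions are stand-in stubs.
    QUICK_REPORT_SUBJECT = "SpamCop Quick reporting data"

    def train_ham(message: str) -> None:
        print("learn as ham")

    def train_spam(message: str) -> None:
        print("learn as spam")

    def auto_train(subject: str, message: str, sa_score: float) -> None:
        if subject.strip() == QUICK_REPORT_SUBJECT:
            return                   # never auto-train these, either way
        if sa_score <= 0:
            train_ham(message)
        elif sa_score >= 8:
            train_spam(message)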

Bayesian-based filtering is definitely one of many tools that work well, and combined with other methods it can reach fantastic filtering rates. To be clear about spammers' attempts at Bayesian poisoning by use of "word salad" in spam: they are simply not as effective as JT is painting them to be. There are many people who are routinely getting 99% filtering success using Bayesian-based methods today. The best success with Bayesian methods comes from individual bayes databases, so I am not claiming that SpamCop's current implementation of Bayesian filtering is going to get to 99%. Nor (for the record) am I asking that JT/SpamCop implement individual bayes databases.

I do think it is important to ensure that the current implementation of Bayesian filtering is optimal before deciding it has no merit. I think I remember JT mentioning somewhere along the line that SpamCop is filtering at 80%. Let's give Bayesian filtering an honest try and see if it can affect this percentage significantly.

JT, please consider speaking with some other SA admins regarding our current setup, perhaps Chris Santerre (not sure if he runs bayes or not). At a minimum, I think other admins would have very useful input with regard to training "ham" and their experiences with large user bases.

Thanks.


I am not an expert at all, and maybe I am missing something, but I wonder why Bayes is trained on results that already reflect the SCBL.

This way, it turns out to be kind of a duplicate, doesn't it?

Wouldn't it be better if training were independent of the BL and based only on users' choices?

Just my 1 cent ;)


independent of the BL and based only on users' choices

I see that as a hard one to answer, as you'd be trying to decide on those "user choices" .. back to the fact that some people just do one fell swoop on everything in their Held Folder (suggesting that the BL was already involved) ... others, due to time constraints, only go for "the ones that made it past the filters" - there'd be "new bad", but ... where are you going to get the "good" stuff to train the "ham" side of the Bayesian? I'm still of the belief that it's a reasonably good tool for the end user, but at the ISP level, ????


I know my SpamCop account gets far too many (percentage-wise) non-spam emails with the subject "SpamCop Quick reporting data" containing spammy subject lines for me to use my inbox as ham.

