March 21st, 2004

Problems with Bayesian…


I wrote MT-Bayesian as a quick hack last year because of my dislike of MT-Blacklist. I still don’t like the blacklist concept (been there, done that in email), but I have to admit there are serious problems with MT-Bayesian.

1. Bayesian is extremely CPU intensive.

The Bayesian algorithm isn’t complex, but it sucks up a lot of CPU resources. This is okay in a mail environment, but in MT, where you need to rebuild often, it is enough to crash some slower machines.

And it doesn’t help that I wrote the whole thing in Perl.

2. Bayesian accuracy requires training

Bayesian needs a lot of training… Some think it is kind of cute, as it sounds like a little Tamagotchi you can play with. Unfortunately, few have enough comments to train their Bayesian filter properly. Moreover, you can’t just train it with spam comments; you also need an equal number of ham (i.e., non-spam) comments.
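To show why ham matters as much as spam, here is a minimal sketch in Python of the training step. It is not the MT-Bayesian code (that one is in Perl); the function names, the whitespace tokenizer, and the 0.01/0.99 clamping are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace.
    return text.lower().split()


def train(comments):
    # Count, for each token, how many comments it appears in.
    counts = Counter()
    for comment in comments:
        counts.update(set(tokenize(comment)))
    return counts


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    # Share of spam comments and ham comments containing the token.
    spam_rate = spam_counts[token] / n_spam if n_spam else 0.0
    ham_rate = ham_counts[token] / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen in training: treat as neutral
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so that a single token can never be absolutely certain.
    return min(0.99, max(0.01, p))


spam = ["buy cheap pills", "cheap pills online"]
ham = ["nice post thanks", "thanks for the post"]
spam_counts, ham_counts = train(spam), train(ham)

p_cheap = token_spam_probability("cheap", spam_counts, ham_counts, len(spam), len(ham))
p_thanks = token_spam_probability("thanks", spam_counts, ham_counts, len(spam), len(ham))
p_unseen = token_spam_probability("hello", spam_counts, ham_counts, len(spam), len(ham))

# With no ham at all, every token the filter has seen looks spammy.
p_no_ham = token_spam_probability("cheap", spam_counts, Counter(), len(spam), 0)
```

With only spam in the training set, every known token ends up at the top of the scale, so the filter has no way to tell a spammy word from an ordinary one.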

3. Token selection

The basic Bayesian algorithm says you choose about the 15 most significant tokens (i.e., words) to make a determination. Sounds easy, but the selection of those 15 tokens seriously affects the result. In fact, it is quite a science, with a few variations of the algorithm out there.

I used the simplest algorithm, but unfortunately Occam’s Razor doesn’t apply here.
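For the curious, the simple variant can be sketched roughly like this in Python. It assumes "most significant" means "spam probability furthest from a neutral 0.5" and that the picked probabilities are combined with the naive Bayes product rule. The function names are made up for this example, and this is an illustration of the idea, not the actual MT-Bayesian code.

```python
import math


def most_significant(token_probs, limit=15):
    # Keep the probabilities that deviate most from a neutral 0.5.
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def combine(probs):
    # Naive Bayes combination: product of p against product of (1 - p).
    spam_side = math.prod(probs)
    ham_side = math.prod(1.0 - p for p in probs)
    total = spam_side + ham_side
    return spam_side / total if total else 0.5


# Hypothetical per-token spam probabilities for one comment.
probs = [0.99, 0.99, 0.5, 0.5, 0.2]
picked = most_significant(probs, limit=3)
score = combine(picked)
```

Changing `limit`, or the rule that decides which tokens count as significant, changes which probabilities reach `combine`, and that is exactly where the different variations part ways.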

What’s next? I think I should take another look at MT-Bayesian and correct these deficiencies, i.e., write it in C and tweak the algorithm to need less training and do better token selection. Luckily, there is already some work done, like DSPAM, which has shown even more accuracy than Bayesian with less training.

Ah, yet another project to look into when I get the time…
