Bayesian filter for MT – James Seng's Blog

Update: Michah Valine has taken over the maintenance of MT-Bayesian

Update: Please read the problems with MT-Bayesian.

Many people have complained that my “Solution for comments spams” is unfriendly to disabled users and to those who do not have a graphical browser.

Having heard your feedback, I spent the last 2 days working on a Bayesian plugin. To cut a long story short, the plugin lets you train your Movable Type blog to automatically identify spam comments and pings.

Download it here: mt-bayesian-1.1.tar.gz (Documentation)

[I just installed the SimpleComments plugin and modified the templates so that spam is also displayed, but inside a blockquote. Check it out. :-]

Bayesian filtering is a popular technique for fighting email spam. (See A Plan For Spam by Paul Graham.)

You train your MT blog to learn about spam. The system starts off quite dumb, but as you train it, it learns and gets better at identifying spam. Once it is sufficiently trained, it requires little or no further maintenance.

‘Training’ is just a fancy word. What you are actually doing is ‘blacklisting’ and ‘whitelisting’, except that the system takes the whole content (not just the IP or host) into consideration. Using the blacklist and whitelist you give it, it will attempt to guess (in a fuzzy, probabilistic way) whether other comments or pings are spam too.
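To make the blacklisting/whitelisting idea concrete, here is a minimal sketch in Python. It is not the plugin's code: it assumes a plain naive Bayes model with add-one smoothing and simple whitespace tokens, and the class name `TinyBayes` and its methods are made up for illustration.

```python
import math
from collections import Counter


class TinyBayes:
    """Illustrative naive Bayes spam scorer (not the MT-Bayesian implementation)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # "Blacklisting" is train(text, "spam"); "whitelisting" is train(text, "ham").
        self.counts[label].update(text.lower().split())
        self.docs[label] += 1

    def spam_probability(self, text):
        vocab_size = len(set(self.counts["spam"]) | set(self.counts["ham"])) or 1
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the smoothed prior, then add evidence from every word.
        log_odds = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for word in text.lower().split():
            p_spam = (self.counts["spam"][word] + 1) / (spam_total + vocab_size)
            p_ham = (self.counts["ham"][word] + 1) / (ham_total + vocab_size)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that exp() cannot overflow on long, one-sided texts.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


bayes = TinyBayes()
bayes.train("cheap pills buy now", "spam")
bayes.train("great post thanks for sharing", "ham")
print(bayes.spam_probability("buy cheap pills"))
print(bayes.spam_probability("thanks great post"))
```

Before any training every comment scores exactly 0.5, which is the "quite dumb" starting point; each blacklisted or whitelisted comment then shifts the word counts, so the score depends on the whole content rather than on a single IP or host.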

I have customized the Bayesian algorithm to handle blog comments and pings, which are normally shorter than emails, by adding special handling of URLs, email addresses, hosts and IP addresses.
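The post does not spell out what the special handling looks like, so the following Python sketch is only one plausible approach, not the plugin's actual method. It assumes that URLs, hosts, email addresses and IPs are turned into prefixed tokens (the prefixes `url:`, `host:`, `email:`, `ip:` and `subnet:` and the function `tokenize` are hypothetical), so that a short comment still yields strong features.

```python
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
WORD_RE = re.compile(r"[a-z0-9']+")


def tokenize(body, author_ip="", author_email=""):
    """Split a comment or ping into tokens, keeping URLs, hosts, emails and IPs whole."""
    tokens = []
    if author_ip:
        tokens.append("ip:" + author_ip)
        parts = author_ip.split(".")
        if len(parts) == 4:
            # The first three octets let nearby addresses share evidence.
            tokens.append("subnet:" + ".".join(parts[:3]))
    if author_email:
        tokens.append("email:" + author_email.lower())

    for match in URL_RE.findall(body):
        url = match.rstrip(".,;:!?)")
        tokens.append("url:" + url.lower())
        try:
            host = urlsplit(url).hostname
        except ValueError:
            host = None  # malformed URL: keep only the url: token
        if host:
            tokens.append("host:" + host)

    # Strip URLs and emails before splitting the remaining text into plain words.
    remainder = URL_RE.sub(" ", body)
    for email in EMAIL_RE.findall(remainder):
        tokens.append("email:" + email.lower())
    remainder = EMAIL_RE.sub(" ", remainder)
    tokens.extend(WORD_RE.findall(remainder.lower()))
    return tokens


print(tokenize("Nice post! Visit http://Cheap-Pills.example.com/buy now",
               author_ip="192.0.2.7"))
```

Keeping the host as its own token means a spammer who varies the wording but keeps linking to the same site still trips the filter, and a trusted commenter's host can count in their favour.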

The advantage of this solution over my previous captcha solution is that it works for both comments and pings, and it has no accessibility problem.

The advantage of this plugin over other solutions (e.g. blacklists) is that after a certain amount of training, it requires little or no maintenance. Training is also simpler than importing and exporting blacklists. In addition, it takes the whitelist into consideration rather than blindly banning a host or subnet. (This is useful for those who have the misfortune of being near a spammer.) It also takes the whole content into consideration, including URLs, IPs, common words, etc. (See Bayesian vs Blacklist)

The disadvantage of this plugin is that the AI engine is only as good as your training. If you don’t put in some initial effort to train it, it won’t work well. And if you train it wrongly, you get wrong results.

Future plans: Allow users to export their trained Bayesian database (or more specifically, hosts & URLs) into RDF, so that other users can import it into their own database. This will build a collective Bayesian blacklist and whitelist using data from people whom you trust.

[Notice that there is a spam rating for each comment in this blog? ^_^ Feedback is welcome. Email me at jseng_at_pobox.org.sg]