October 18th, 2003

Problem with blacklist


Jay Allen, the author of MT-Blacklist, commented in a discussion at Q Daily News:

While I will agree that many blacklist implementations and models are flawed, I still haven’t heard an valid criticism specifically of MT-Blacklist’s implementation (other than a couple of bugs which will be ironed out in the next version), but would be happy to hear some and adapt the program as necessary to best serve the needs of the community.

I took a quick look at MT-Blacklist. The whole logic for deciding whether a comment or ping is spam comes down to this:

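foreach my $deny (@blacklisted_strings) {
  # Case-insensitive regex match against the whole comment or ping text
  if ($str =~ m#$deny#i) {
    # Flag as spam; when denials are logged, also return the entry that matched
    return $config->{logDenials} ? (1, $deny) : 1;
  }
}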

Translation: for every blacklisted string, do a case-insensitive match to see whether it appears anywhere in the comment or ping. If it does, the comment or ping is treated as spam.

So, suppose I have “porn” as a blacklisted word (which I almost certainly will). Then a discussion of “Should we ban porn in Singapore?” becomes impossible, because every comment that mentions the word is rejected.
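To make the false positive concrete, here is a minimal sketch that applies the same kind of check to a few sample comments. The blacklist entries, the sample comments and the helper name first_blacklist_hit are made up for illustration; none of them come from MT-Blacklist itself.

```perl
use strict;
use warnings;

# Hypothetical blacklist for illustration; not MT-Blacklist's real list.
my @blacklisted_strings = ('porn', 'casino');

# Returns the first blacklist entry found in $str, or undef if none match.
# The test is the same kind as in the loop above: case-insensitive,
# matching anywhere in the text.
sub first_blacklist_hit {
    my ($str) = @_;
    foreach my $deny (@blacklisted_strings) {
        return $deny if $str =~ m#$deny#i;
    }
    return;
}

my @comments = (
    'Should we ban porn in Singapore?',   # legitimate discussion
    'Cheap PORN here, click now!!!',      # actual spam
    'Nice post, thanks for sharing.',     # harmless
);

foreach my $comment (@comments) {
    my $hit = first_blacklist_hit($comment);
    printf "%-34s => %s\n", $comment,
        defined $hit ? "blocked ($hit)" : 'allowed';
}
```

The first two comments are rejected for the same reason, even though only one of them is spam. The check has no way to tell them apart.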

The simple-mindedness of blacklist logic is the problem, whether it is an IP blacklist or a content blacklist. A Bayesian filter, on the other hand, analyses the whole content and gives you a probability that it is spam. Not perfect, but at least the answer is not just 1 or 0. Life isn’t binary…
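As a rough illustration of the difference, here is a toy naive Bayes scorer. It is a simplified sketch, not Paul Graham's exact formula and not anything MT-Blacklist does. The training messages are invented, the priors are assumed equal, and the names tokens, train and spam_probability are hypothetical. It needs Perl 5.10 or later for the // operator.

```perl
use strict;
use warnings;

# Toy training data, invented for illustration only.
my @spam = ('cheap porn click here', 'free porn pics click now',
            'buy cheap pills here');
my @ham  = ('should the government ban porn',
            'singapore and the ban on chewing gum',
            'we should discuss censorship here');

# Lowercase the text and split it into word tokens.
sub tokens {
    my ($text) = @_;
    return grep { length } split /[^a-z0-9]+/, lc $text;
}

# Count token occurrences over a set of messages.
sub train {
    my @messages = @_;
    my (%count, $total);
    $total = 0;
    foreach my $msg (@messages) {
        foreach my $tok (tokens($msg)) {
            $count{$tok}++;
            $total++;
        }
    }
    return { count => \%count, total => $total };
}

my $spam_model = train(@spam);
my $ham_model  = train(@ham);

# Vocabulary size, used for add-one (Laplace) smoothing.
my %vocab;
$vocab{$_} = 1 for keys %{ $spam_model->{count} }, keys %{ $ham_model->{count} };
my $vocab_size = scalar keys %vocab;

# Probability (between 0 and 1) that $text is spam, assuming equal priors.
sub spam_probability {
    my ($text) = @_;
    my $log_odds = 0;
    foreach my $tok (tokens($text)) {
        my $p_spam = (($spam_model->{count}{$tok} // 0) + 1)
                   / ($spam_model->{total} + $vocab_size);
        my $p_ham  = (($ham_model->{count}{$tok} // 0) + 1)
                   / ($ham_model->{total} + $vocab_size);
        $log_odds += log($p_spam / $p_ham);
    }
    return 1 / (1 + exp(-$log_odds));
}

printf "%.2f  %s\n", spam_probability($_), $_
    for 'Should we ban porn in Singapore?', 'Cheap porn, click here';
```

With this toy data the word “porn” pushes the score up in both comments, but the surrounding words pull the first one back down, so the two end up on opposite sides of 0.5. Where to draw the line is then a decision for the weblog owner, not something a single word decides.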

For further discussion, see Paul Graham’s note on Bayesian filtering vs. blacklists.

PS: Please don’t get me wrong. I think Jay should be commended for his effort to fight comment spam. But that does not mean I agree with the notion of a blacklist.
