# Copyright (c) 2003, James Seng. (http://james.seng.cc/)
# This code is released under the Artistic License.
#
# mt-bayesian-1.1.4

+++++++++++++++++++++++++++++++++++
INTRODUCTION
+++++++++++++++++++++++++++++++++++

This plugin lets you train your MT to identify spam comments and
trackback pings using Bayesian filtering, a technique that has been
hugely successful in fighting email spam. The Bayesian algorithm has
been modified here to cater for the special characteristics of comments
and pings.

The system starts off dumb, but as you train it, it becomes better at
identifying spam. Once it is sufficiently trained, it requires little
or no maintenance.

To see how this works, check out my blog at http://james.seng.cc/

You can contact me at jseng_at_pobox.org.sg if you have any questions.

['Training' is just a fancy word. What you are actually doing is
'blacklisting' and 'whitelisting' comments and pings, except that the
system takes the whole content into consideration (not just the IP or
host). After you do enough blacklisting and/or whitelisting, the system
will be able to identify spam comments or pings using fuzzy logic, or
calculated probability.]

+++++++++++++++++++++++++++++++++++
CHANGES
+++++++++++++++++++++++++++++++++++

0.9 (1st Public Release on 15th Oct 2003)

0.9.3
- introduced a minor tweak to recalculate spam probability once in a
  while

0.9.4
- fixed a minor bug introduced in 0.9.3 which stalled when you
  attempted to access the training menu

1.0 (Major Release on 17th Oct 2003)
- improved performance. Spam probabilities are cached until you force
  recalculation in the training page. This allows faster generation of
  comment and ping pages, especially dynamically generated ones.
- added a button to mark all entries as spam or not spam in the
  Training menu

1.0.1
- fixed a minor error in bayesian_list_comments.tmpl

1.0.2
- removed the confusing recalculate button.
  Instead, spam probabilities are recalculated automatically every time
  you rebuild the site (from mt-bayesian.cgi)
- better integrated mt-bayesian.cgi with MT so you can totally replace
  mt.cgi with mt-bayesian.cgi (if you choose to)

1.0.3
- if the user is using DBI (e.g. mysql), it won't work unless the
  necessary tables are created.

1.0.4
- added a script to create the necessary tables for DBI users

1.0.5
- fixed a typo in bayesian_mysql.dump (thanks to Patrick Berry)
- converted bayesian-init-db into a CGI

1.1 (Major Release on 20th Oct 2003)
- fixed all the related bugs with using mt-bayesian with a DBI database
  (e.g. mysql, etc)
- fixed a bug which created duplicated bayesian rows and also slowed
  down performance

1.1.1
- fixed a few other bugs which created duplicated bayesian rows

1.1.2
- fixed more duplicated bayesian row bugs

1.1.3
- added cleandb.cgi, a CGI script version of cleandb.pl

1.1.4
- added a button to delete only confirmed spam

+++++++++++++++++++++++++++++++++++
INSTALLATION
+++++++++++++++++++++++++++++++++++

1. Install (copy or move) these files in their respective locations.
   {MTDIR} refers to your base mt directory, e.g.
   /home/blog/public_html/mt

   {MTDIR}
     o mt-bayesian.cgi (make sure this is executable)

   {MTDIR}/plugins
     o mt-bayesian.pl

   {MTDIR}/lib/MT
     o Bayesian.pm
     o BayesianBlog.pm
     o BayesianToken.pm

   {MTDIR}/lib/MT/App
     o BayesianTrain.pm

   {MTDIR}/tmpl/cms
     o bayesian_menu.tmpl   <-- new addition in 1.0.2
     o bayesian_list_blog.tmpl
     o bayesian_list_comments.tmpl
     o bayesian_list_pings.tmpl

2. (Optional) Edit {MTDIR}/lib/MT/Bayesian.pm and modify the variable
   @ignore_tokens. Only add terms which appear frequently in both spam
   and non-spam on your site and are hence statistically irrelevant.
   If you are not sure, just leave the default alone.

3. (Optional - only if you use mysql, postgres or another DBI engine)

   3.1. move/copy the directory bdb/ to {MTDIR}/bdb/
   3.2. move/copy bayesian-init-db.cgi into {MTDIR}, then point your
        browser to http://your.host/path_to_mt/bayesian-init-db.cgi
        (e.g. http://my.blog.com/mt/bayesian-init-db.cgi)
   3.3. after the database is initialized, delete bayesian-init-db.cgi

   Note: If you get an error trying to execute bayesian-init-db.cgi
   from your browser, then execute it manually.

4. That's it! Simple, isn't it?

+++++++++++++++++++++++++++++++++++
Modification to Templates
+++++++++++++++++++++++++++++++++++

MT-Bayesian introduces 3 more tags which you can use in your templates.

o MTSpamProb  - returns the spam probability of the comment or
                trackback ping
o MTIfSpam    - returns true if the comment or trackback is identified
                as spam
o MTIfNotSpam - returns true if the comment or trackback is NOT
                identified as spam

Example (extracted from my 'Comment Listing Template'):
<$MTCommentBody$> Posted by <$MTCommentAuthorLink spam_protect="1"$> at <$MTCommentDate$> (Spam: <$MTSpamProb$>%)
Important Note: These tags must be used ***INSIDE*** the comments or
pings container, or else they won't work. In other words, the following
will NOT work:

...

Typically, you need to modify the following templates:

- Individual Entry Archive
  ...
- Comment Listing Template
  ...
- TrackBack Listing Template
  ...

+++++++++++++++++++++++++++++++++++
Training your MT blog
+++++++++++++++++++++++++++++++++++

To train your MT, point your browser to
http://your.blog.com/path_to_mt/mt-bayesian.cgi
e.g. http://my.blog.com/mt/mt-bayesian.cgi

Select "Manage Comments" or "Manage Pings" and you will see a listing
of the comments or pings on your blog.

On each comment or ping, you will find a checkbox which allows you to
train MT to identify it as 'Spam' or 'Not Spam'. The reason for the
latter is that MT will sometimes wrongly identify a comment or
trackback as spam, so you need to reverse it. (Think of it as
blacklisting and whitelisting.)

Select some checkboxes, and click on the "Train" button (at the top or
bottom).

You will immediately see the probability of the comments/pings change
after training.

WARNING WARNING: Clicking the "Delete Spam" button deletes ***ALL***
comments/pings which you or the system have identified as spam.

+++++++++++++++++++++++++++++++++++
FUTURE PLANS (if I have the time)
+++++++++++++++++++++++++++++++++++

- Allow users to export their trained bayesian database. More
  specifically, export the whitelisted and blacklisted hosts and URLs
  into RDF. Other users can then import these databases via RSS and use
  them for their filtering.
- Add a button to clean up the database and recalculate the spam
  probability after training in the Training interface. [done in 1.0]
- Add a button to "check all" in the Training interface. (Idea via
  Trackback from absoblogginlutely.net) [done in 1.0]
- Allow users to filter "display untrained comments/pings" in the
  Training interface.
  (Idea from Michael Seneadza)
- Add options to the management interface to allow users to define the
  spam threshold (default 0.9) and non-spam threshold (default 0.2).
  (Idea from Michael Seneadza)
- (URGENT) Write a script to create the tables for people using DBI
  [done in 1.0.4]
- Add a button to delete only confirmed spam [done in 1.1.4]
- Clean up the database when a BayesianBlog has been removed.

+++++++++++++++++++++++++++++++++++
FAQ
+++++++++++++++++++++++++++++++++++

Failed to get it to work? Check http://james.seng.cc/Bayesian-README.txt
and see if there is a version change at the top. If so, the problem you
encountered may already be solved there.

Otherwise, feel free to drop me an email at jseng_at_pobox.org.sg. I
will try my best to answer you.

Q1. After I click 'Train', those I trained are still on the list. Is it
    okay to 'retrain'?

    Yes, it is okay. The system knows you have trained that entry and
    will not retrain it, unless you switch from spam to not spam or
    vice versa.

Q2. Comment/ping listing generation has become very slow! What's wrong?

    Upgrade to MT-Bayesian 1.0. It has improved performance.

Q3. Most spam comments look fairly normal but contain links, which is
    the problem, unlike email spam. How can a Bayesian filter catch
    that?

    Actually, yes, it can. And it is done in MT-Bayesian.

Q4. How do you get "Delete Spam" to work?

    If it hangs on you, then upgrade to 1.x.

Q5. I am using SimpleComments. Will this work?

    Yes.

Q6. I trained my MT for my blog but my other blogs don't seem to get it.

    MT-Bayesian learns per blog, not per MT installation. In other
    words, you can train one blog to learn something but it will not
    affect the other blogs.

    e.g. you may dislike me, so you classify any posting from jseng as
    spam. That's fine because it is your blog after all, but your
    choice will not prevent me from posting on other blogs on the same
    MT installation.

Q7. Can I adapt this to ___fill in your blog engine___?
    First, if "I" means you hope that I will do it, then very sorry, I
    don't think I have the time. I wrote MT-Bayesian for fun and have
    no intention of turning this into my full-time job.

    If "I" means you wish to adapt this to your own blog engine, by all
    means go ahead. It isn't too difficult. As a measure, it took me
    2-3 hrs to build the Bayesian engine (based on the LISP code from
    Paul Graham) and another 8-10 hrs to figure out how to stick it
    into MT. (It took a bit of time here since I had never written an
    MT plugin before; scode is too simple to be counted.) So really, it
    isn't difficult.

Q8. The training doesn't seem to save and my comments stay at 50% no
    matter what.

    You are probably using a database engine like mysql or postgres to
    store your records. (Check your mt.cfg and look for a line called
    "DataSource".)

    Upgrade to 1.1. In anything before 1.1, the DBI engine does not
    work, so you can't train MT at all.

    During the upgrade, please remember to execute Step 3 (see above).
    If you don't initialize and create the necessary tables, it won't
    work.

Q9. When I train a comment/trackback, it is reflected in the training
    interface. But when I rebuild the pages, it is not reflected.

    There are some database row duplication bugs before 1.1. You need
    to upgrade to 1.1.3 and also run the cleandb script to clean up
    your database.

    There are two cleandb scripts provided: one is a CGI script and the
    other is a standalone script.

    CGI method: Put cleandb.cgi into {MTDIR}, make sure it is
    executable and then point your browser to
    http://your.host/path_to_mt/cleandb.cgi

    Standalone method: Put cleandb.pl into {MTDIR}, make sure it is
    executable and then type ./cleandb.pl.

    If you are already running 1.1.2 (or above), then run cleandb.pl
    once in a while. Rows get duplicated when two rebuilds of the same
    entry occur at the same time. The solution would be to lock the
    database beforehand, but that would slow down performance
    drastically.

Q10. Why are all the comments either 0%, 50% or 99%?
     Is there a bug?

     Short answer: Not enough training.

     Long answer: Bayesian takes about 15 significant tokens (words)
     from the content for its probability calculation. If you have not
     trained your bayesian sufficiently, the following cases are likely
     to happen:

     1. all 15 tokens have not been trained before, or are not
        sufficiently trained (sufficient being > 3 times) => 50%
     2. a few tokens are sufficiently trained but low in numbers (i.e.
        5 out of 6 spam or 4 out of 5 not spam) => 0% or 99%
     ...

     In practice, you need about 1,000 unique spam and 1,000 unique
     non-spam to have sufficient training. Few sites reach that level.
     (I know my site doesn't!)

     (This is why perhaps a distributed database would help. It is in
     my TODO once I get some free time...)

+++++++++++++++++++++++++++++++++++
CREDIT
+++++++++++++++++++++++++++++++++++

Special thanks to Paul Graham, who first proposed using the Bayesian
algorithm for spam fighting. See "A Plan for Spam"
http://www.paulgraham.com/spam.html

Boris for fixing a bug in the README installation
- (BayesianTrain.pm should go into {MTDIR}/lib/MT/App)

Kianseon for pointing out the bug in bayesian_list_comments.tmpl

Patrick Berry, who pointed out that if the tables are not created
beforehand in DBI, the package will not work. He also helped in testing
mysql. Thanks!

Michael Seneadza and Michael William, who pointed out the false
positive bug. Michael Seneadza also helped in debugging the problem.
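
+++++++++++++++++++++++++++++++++++
APPENDIX: Q10 worked example
+++++++++++++++++++++++++++++++++++

A footnote to Q10 in the FAQ: the sketch below shows how a Graham-style
combination of token scores can land on 50%, 0% or 99% when there is
little training data. It is written in Python for brevity and is NOT
the code in Bayesian.pm. The function names, the 0.5 score for
untrained tokens, the 0.01 to 0.99 clamp and the per-token formula are
assumptions made for illustration; only the 15-token limit and the
"> 3 times" rule are taken from the FAQ.

```python
from math import prod

MAX_TOKENS = 15   # "about 15 significant tokens" (from the FAQ)
MIN_COUNT = 3     # a token must be seen > 3 times (from the FAQ)
UNKNOWN = 0.5     # assumed score for an untrained token


def token_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Spam score of one token (hypothetical helper, simplified formula)."""
    if spam_hits + ham_hits <= MIN_COUNT:
        return UNKNOWN
    # Fraction of trained spam / non-spam messages containing the token.
    spam_rate = min(1.0, spam_hits / max(n_spam, 1))
    ham_rate = min(1.0, ham_hits / max(n_ham, 1))
    p = spam_rate / (spam_rate + ham_rate)
    # Clamp so a single token can never be fully decisive (assumed bounds).
    return min(0.99, max(0.01, p))


def spam_prob(token_probs):
    """Combine the token scores that lie furthest from the neutral 0.5."""
    picked = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    picked = picked[:MAX_TOKENS]
    spam = prod(picked)
    ham = prod(1.0 - p for p in picked)
    return spam / (spam + ham)


if __name__ == "__main__":
    # Case 1: nothing trained yet, every token is neutral.
    print(spam_prob([UNKNOWN] * 15))

    # Case 2: three tokens seen in 5 spam and 1 non-spam (10 trained each).
    p = token_prob(5, 1, 10, 10)
    print(spam_prob([p] * 3 + [UNKNOWN] * 12))
```

With these assumptions, case 1 comes out at exactly 0.5, and case 2
comes out just above 0.99, because a handful of lopsided tokens is
multiplied together with nothing on the other side to balance them.
Swapping the counts in case 2 (1 spam, 5 non-spam) gives a score just
below 0.01 by the same arithmetic.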