# Copyright (c) 2003, James Seng. (http://james.seng.cc/)
# This code is released under the Artistic License.
#
# mt-bayesian-1.0.6

+++++++++++++++++++++++++++++++++++
INTRODUCTION
+++++++++++++++++++++++++++++++++++

This plugin lets you train your MT installation to identify spam comments and
trackback pings using Bayesian filtering, a technique that has been hugely
successful in fighting email spam. The Bayesian algorithm has been modified
here to cater for the special characteristics of comments and pings.

The system starts off dumb, but as you train it, it becomes better at
identifying spam. Once it is sufficiently trained, it will require little or
no maintenance.

To see how this works, check out my blog at http://james.seng.cc/

You can contact me at jseng_at_pobox.org.sg if you have any questions.

['Training' is just a fancy word. What you are actually doing is
'blacklisting' and 'whitelisting' comments and pings, except that the system
takes the whole content into consideration (not just the IP or host).

After you do enough blacklisting and/or whitelisting, the system will be
able to identify spam comments or pings using fuzzy matching, that is, a
calculated probability.]
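
For readers curious what the 'calculated probability' looks like, here is a
minimal Python sketch of the scoring described in Paul Graham's "A Plan for
Spam". It is an illustration only and is not the plugin's Perl code:
MT-Bayesian modifies the algorithm for comments and pings, and those
modifications are not shown here. The function names and the choice of Python
are assumptions made for this example.

```python
from collections import Counter
from math import prod


def token_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Graham-style estimate of P(spam | token), clamped to [0.01, 0.99]."""
    bad = spam_counts[token]
    good = 2 * ham_counts[token]  # Graham doubles good counts to avoid false positives
    if good + bad < 5:
        return 0.4  # Graham's value for tokens seen too rarely to judge
    bad_ratio = min(1.0, bad / max(n_spam, 1))
    good_ratio = min(1.0, good / max(n_ham, 1))
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


def combined_probability(probs, most_interesting=15):
    """Combine token probabilities, using only those farthest from 0.5."""
    picked = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:most_interesting]
    spam = prod(picked)
    ham = prod(1.0 - p for p in picked)
    return spam / (spam + ham)


# Toy training data (hypothetical): 10 spam and 10 non-spam entries.
spam_counts = Counter({"casino": 10})
ham_counts = Counter({"thanks": 10})
probs = [token_probability(t, spam_counts, ham_counts, 10, 10)
         for t in ["casino", "unseen"]]
score = combined_probability(probs)
```

With no tokens at all, combined_probability returns 0.5, which is the neutral
score in this sketch.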

+++++++++++++++++++++++++++++++++++
CHANGES
+++++++++++++++++++++++++++++++++++
0.9 (1st Public Release on 15th Oct 2003)
0.9.3 - introduced a minor tweak to recalculate spam probability once in a while
0.9.4 - fixed a minor bug introduced in 0.9.3 which caused a stall when you
        attempted to access the training menu

1.0 (Major Release on 17th Oct 2003)
      - improved performance. Spam probabilities are cached until you force
        recalculation in the training page. This allows faster generation of
        comments and pings pages, especially dynamically generated ones.
      - added a button to mark all entries as spam or not spam in the
        Training menu
1.0.1 - fixed a minor error in bayesian_list_comments.tmpl 
1.0.2 - removed the confusing recalculate button. Instead, it will auto
        recalculate every time you rebuild the site (from mt-bayesian.cgi)
      - better integrate mt-bayesian.cgi with MT so you can totally replace
        mt.cgi with mt-bayesian.cgi (if you choose to)
1.0.3 - if the user is using DBI (e.g. mysql), it won't work unless the
        necessary tables are created.
1.0.4 - added a script to create the necessary tables for DBI users
1.0.5 - fixed a typo in bayesian_mysql.dump (thanks to Patrick Berry)
      - converted bayesian-init-db into a CGI

+++++++++++++++++++++++++++++++++++
INSTALLATION
+++++++++++++++++++++++++++++++++++

1. Install (copy or move) these files into their respective locations.
   {MTDIR} refers to your base mt directory, e.g. /home/blog/public_html/mt

   {MTDIR}
      o mt-bayesian.cgi             (make sure this is executable)

   {MTDIR}/plugins
      o mt-bayesian.pl

   {MTDIR}/lib/MT
      o Bayesian.pm
      o BayesianBlog.pm
      o BayesianToken.pm

   {MTDIR}/lib/MT/App
      o BayesianTrain.pm

   {MTDIR}/tmpl/cms
      o bayesian_menu.tmpl           <-- new addition in 1.0.2
      o bayesian_list_blog.tmpl
      o bayesian_list_comments.tmpl
      o bayesian_list_pings.tmpl

2. (Optional) Edit {MTDIR}/lib/MT/Bayesian.pm and modify the variable
@ignore_tokens. Only add terms that appear frequently in both spam
and non-spam on your site and are therefore statistically irrelevant. If
you are not sure, just leave the default alone.

3. (Optional - only if you use mysql, postgres or another DBI engine)
   3.1. move/copy the directory bdb/ to {MTDIR}/bdb/
   3.2. move/copy bayesian-init-db.cgi to {MTDIR}, then point your browser
        to http://your.host/path_to_mt/bayesian-init-db.cgi
        (e.g. http://my.blog.com/mt/bayesian-init-db.cgi)
   3.3. after the database is initialized, delete bayesian-init-db.cgi

4. That's it! Simple, isn't it?
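
The file layout from step 1 can be checked with a small script. The following
is a minimal Python sketch and is not part of MT-Bayesian; the names
EXPECTED, missing_files and cgi_is_executable are made up for illustration,
and the paths are copied from the list in step 1.

```python
import os
from pathlib import Path

# Relative directory -> files, copied from step 1 of this README.
EXPECTED = {
    "": ["mt-bayesian.cgi"],
    "plugins": ["mt-bayesian.pl"],
    "lib/MT": ["Bayesian.pm", "BayesianBlog.pm", "BayesianToken.pm"],
    "lib/MT/App": ["BayesianTrain.pm"],
    "tmpl/cms": [
        "bayesian_menu.tmpl",
        "bayesian_list_blog.tmpl",
        "bayesian_list_comments.tmpl",
        "bayesian_list_pings.tmpl",
    ],
}


def missing_files(mt_dir):
    """Return the expected files (relative to mt_dir) that are not present."""
    root = Path(mt_dir)
    missing = []
    for subdir, names in EXPECTED.items():
        for name in names:
            if not (root / subdir / name).is_file():
                missing.append((Path(subdir) / name).as_posix())
    return missing


def cgi_is_executable(mt_dir):
    """True if mt-bayesian.cgi exists and is executable by the current user."""
    return os.access(Path(mt_dir) / "mt-bayesian.cgi", os.X_OK)
```

Call missing_files("/home/blog/public_html/mt") with your own {MTDIR}; an
empty list means every file from step 1 is in place. Note that the executable
check runs as the current user, which may differ from the web server's user.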

+++++++++++++++++++++++++++++++++++
Modification to Templates
+++++++++++++++++++++++++++++++++++

MT-Bayesian introduces 3 more tags which you can use in your templates.

o MTSpamProb  - returns the spam probability of the comment or trackback ping
o MTIfSpam - returns true if the comment or trackback is identified as spam
o MTIfNotSpam - returns true if the comment or trackback is NOT identified as spam

Example (extracted from my 'Comment Listing Template'):
   <MTComments>
   <MTIfNotSpam>
   <div class="comments-body">
   <$MTCommentBody$>
   <span class="comments-post">Posted by <$MTCommentAuthorLink spam_protect="1"$> at <$MTCommentDate$> (Spam: <$MTSpamProb$>%)</span>
   </div>
   </MTIfNotSpam>
   </MTComments>

Important Note: <MTIfNotSpam> must be used ***INSIDE*** an <MTComments> or
<MTPings> container, or else it won't work. In other words, the following
will NOT work:
        <MTIfNotSpam><MTPings> ... </MTPings></MTIfNotSpam>

Typically, you need to modify the following templates:
   - Individual Entry Archive 
        <MTComments><MTIfNotSpam> ... </MTIfNotSpam></MTComments>
   - Comment Listing Template
        <MTComments><MTIfNotSpam> ... </MTIfNotSpam></MTComments>
   - TrackBack Listing Template
        <MTPings><MTIfNotSpam> ... </MTIfNotSpam></MTPings>

+++++++++++++++++++++++++++++++++++
Training your MT blog
+++++++++++++++++++++++++++++++++++

To train your MT, point your browser to  
http://your.blog.com/path_to_mt/mt-bayesian.cgi

e.g. http://my.blog.com/mt/mt-bayesian.cgi

Select "Manage Comments" or "Manage Pings" and you will see a listing of the
comments or pings on your blog.

On each comment or ping, you will find a checkbox which allows you to train MT
to identify it as 'Spam' or 'Not Spam'. The latter exists because MT will
sometimes wrongly identify a comment or trackback as spam, so you need a way
to reverse it. (Think of it as blacklisting and whitelisting.)

Select some checkboxes and click the "Train" button (at the top or bottom).

You will immediately see the probability of the comments/pings change after
training.

WARNING: Clicking the "Delete Spam" button deletes ***ALL*** comments/pings
that you or the system have identified as spam.

+++++++++++++++++++++++++++++++++++
FUTURE PLANS (if I have the time)
+++++++++++++++++++++++++++++++++++

- Allow users to export their trained Bayesian database. More specifically,
  export the whitelisted and blacklisted hosts and URLs into RDF. Other users
  can then import these databases via RSS and use them for their filtering.

- Add a button to clean up the database and recalculate the spam probability
  after training in the Training interface.

- Add a "check all" button to the Training interface. (Idea via Trackback
  from absoblogginlutely.net) [done in 1.0]

- Allow users to filter the Training interface to display only untrained
  comments/pings. (Idea from Michael Seneadza)

- Add options to the management interface to allow users to define the spam
  threshold (default 0.9) and the non-spam threshold (default 0.2).
  (Idea from Michael Seneadza)

- (URGENT) write a script to create the tables for people using DBI
  [done in 1.0.4]
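
To make the two thresholds mentioned above concrete, here is a minimal Python
sketch of a three-way decision using the stated defaults (0.9 and 0.2). It is
an assumption about how such thresholds could be applied, not a description
of how MTIfSpam or MTIfNotSpam decide; the function name, the labels and the
inclusive comparisons are hypothetical.

```python
SPAM_THRESHOLD = 0.9      # default spam threshold mentioned in this README
NOT_SPAM_THRESHOLD = 0.2  # default non-spam threshold mentioned in this README


def classify(probability,
             spam_threshold=SPAM_THRESHOLD,
             not_spam_threshold=NOT_SPAM_THRESHOLD):
    """Map a spam probability in [0, 1] to a label (boundaries are assumed)."""
    if probability >= spam_threshold:
        return "spam"
    if probability <= not_spam_threshold:
        return "not spam"
    return "unsure"
```

Anything between the two thresholds stays "unsure" in this sketch, which is
where further training would help most.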

+++++++++++++++++++++++++++++++++++
FAQ
+++++++++++++++++++++++++++++++++++

Can't get it to work? Check http://james.seng.cc/Bayesian-README.txt
and see whether the version number at the top has changed. If so, the
problem you encountered may already be solved in the newer version.

Otherwise, feel free to drop me an email at jseng_at_pobox.org.sg. I
will try my best to answer you.

Q1. After I click 'Train', the entries I trained are still on the list. Is it
    okay to 'retrain'?

Yes, it is okay. The system knows you have trained that entry and will not
retrain it unless you switch it from spam to not spam or vice versa.
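
The behaviour described in this answer can be sketched as follows. This is a
minimal Python illustration, not the plugin's code; the class name Trainer,
its attributes and the idea of keeping token counts in memory are all
assumptions made for the example.

```python
from collections import Counter


class Trainer:
    """Remembers how each entry was trained so that retraining is harmless."""

    def __init__(self):
        self.spam = Counter()   # token counts from entries marked as spam
        self.ham = Counter()    # token counts from entries marked as not spam
        self.labels = {}        # entry id -> True (spam) or False (not spam)

    def train(self, entry_id, tokens, is_spam):
        """Return True if the counts changed, False if this was a repeat."""
        previous = self.labels.get(entry_id)
        if previous == is_spam:
            return False  # already trained with this label: do nothing
        if previous is not None:
            # The label was switched: undo the earlier training first.
            (self.spam if previous else self.ham).subtract(tokens)
        (self.spam if is_spam else self.ham).update(tokens)
        self.labels[entry_id] = is_spam
        return True
```

Training the same entry twice with the same label leaves the counts
untouched, while switching the label moves the entry's tokens from one side
to the other.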

Q2. Comment/ping listing generation has become very slow! What's wrong?

Upgrade to MT-Bayesian 1.0 or later. It caches spam probabilities, which
improves performance.

Q3. Unlike email spam, most spam comments look fairly normal except that
    they contain links, and the links are the problem. How can a Bayesian
    filter catch that?

Yes, it can, and MT-Bayesian does this.
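
This README does not say how MT-Bayesian handles links internally. One common
approach, shown here as a minimal Python sketch under that caveat, is to turn
the host of every link into its own token so that a spammer's domain can be
learned like any other word. The "host:" prefix, the regular expressions and
the tiny ignore list are hypothetical; the ignore list only mirrors the idea
of @ignore_tokens from the installation steps.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+")
IGNORE_TOKENS = frozenset({"the", "and", "a", "to"})  # hypothetical examples


def tokenize(text, ignore=IGNORE_TOKENS):
    """Split a comment into host tokens (from links) and plain word tokens."""
    tokens = []
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname
        if host:
            tokens.append("host:" + host)
    # Remove the links before splitting the remaining text into words.
    words = re.findall(r"[a-z0-9'-]+", URL_RE.sub(" ", text).lower())
    tokens.extend(w for w in words if w not in ignore)
    return tokens


tokens = tokenize("Nice post! See http://cheap-pills.example/buy and the rest")
```

Here the otherwise normal-looking comment still yields the token
"host:cheap-pills.example", which is what training would act on.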

Q4. How do you get the "Delete Spam" to work?

If it hangs on you, upgrade to 1.x.

Q5. I am using SimpleComments. Will this work?

Yes.

Q6. I trained my MT for my blog but my other blogs don't seem to get it.

MT-Bayesian learns on a per-blog basis, not per MT installation. In other
words, what you train one blog to learn will not affect the other blogs.
E.g. you may dislike me, so you classify any posting from jseng as spam.
That's fine because it is your blog after all, but your choice will not
prevent me from posting on other blogs on the same MT installation.
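
Per-blog learning simply means the training data is keyed by blog. A minimal
Python sketch of that idea follows; the class name, the method names and the
in-memory storage are assumptions for illustration and say nothing about how
the plugin actually stores its data.

```python
from collections import Counter, defaultdict


class PerBlogCounts:
    """Keeps separate spam and non-spam token counts for each blog id."""

    def __init__(self):
        self._blogs = defaultdict(lambda: {"spam": Counter(), "ham": Counter()})

    def train(self, blog_id, tokens, is_spam):
        self._blogs[blog_id]["spam" if is_spam else "ham"].update(tokens)

    def spam_count(self, blog_id, token):
        return self._blogs[blog_id]["spam"][token]


store = PerBlogCounts()
store.train(1, ["jseng"], is_spam=True)  # blog 1 dislikes jseng
```

After this, blog 1 has counted "jseng" as spam once, while blog 2 has no
record of it at all.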

Q7. Can I adapt this to ___fill in your blog engine___?

First, if "I" means you hope that I will do it, then sorry, I don't
think I have the time. I wrote MT-Bayesian for fun and have no intention
of turning it into my full-time job.

If "I" means you wish to adapt this to your own blog engine, by all
means go ahead. It isn't too difficult. As a measure, it took me 2-3 hrs
to build the Bayesian engine (based on the LISP code from Paul Graham)
and another 8-10 hrs to figure out how to stick it into MT. (It took a bit
of time since I had never written an MT plugin before; scode is too simple
to count.) So really, it isn't difficult.

Q8. The training doesn't seem to save and my comments stay at 50% no
    matter what.

Check your mt.cfg and look for the "DataSource" line.

If it is something like "./db/" (the default), then make sure your
CGI has write access to the directory.

If it is mysql or another database engine, then you need to create
the necessary tables. See Step 3 in the installation procedure.
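
The first check can be scripted. The following minimal Python sketch assumes
that mt.cfg holds one "Name value" pair per line, that lines starting with
"#" are comments, and that a relative DataSource is resolved against the
directory containing mt.cfg; the function names are made up for this example.

```python
import os


def read_data_source(cfg_path):
    """Return the DataSource value from mt.cfg, or None if it is not set."""
    with open(cfg_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2 and parts[0] == "DataSource":
                return parts[1]
    return None


def data_source_is_writable(cfg_path):
    """True if DataSource names an existing directory the current user can write."""
    source = read_data_source(cfg_path)
    if source is None:
        return False
    base = os.path.dirname(os.path.abspath(cfg_path))
    path = os.path.join(base, source)
    return os.path.isdir(path) and os.access(path, os.W_OK)
```

The result reflects the user running the script. Your CGI may run as a
different user (for example the web server's), so run the check as that user
where possible.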

+++++++++++++++++++++++++++++++++++
CREDIT
+++++++++++++++++++++++++++++++++++

Special thanks to Paul Graham, who first proposed using a Bayesian algorithm
for spam fighting. See "A Plan for Spam": http://www.paulgraham.com/spam.html

Boris for fixing a bug in the README installation - (BayesianTrain.pm should
go into {MTDIR}/lib/MT/App)

Kianseon for pointing out the bug in bayesian_list_comments.tmpl

Patrick Berry, who pointed out that if the tables are not created beforehand
in DBI, the package will not work. He also helped in testing mysql.
Thanks!
