# Copyright (c) 2003, James Seng. (http://james.seng.cc/)
# This code is released under the Artistic License.
#
# mt-bayesian-1.1.4

+++++++++++++++++++++++++++++++++++
INTRODUCTION
+++++++++++++++++++++++++++++++++++

This plugin lets you train your MT installation to identify spam comments and
trackback pings using Bayesian filtering, a technique that has been hugely
successful in fighting email spam. The Bayesian algorithm has been modified
here to cater for the special characteristics of comments and pings.

The system starts off dumb, but as you train it, it becomes better at
identifying spam. Once it is sufficiently trained, it requires little or no
maintenance.

To see how this works, check out my blog at http://james.seng.cc/

You can contact me at jseng_at_pobox.org.sg if you have any questions.

['Training' is just a fancy word. What you are actually doing is
'blacklisting' and 'whitelisting' comments and pings, except that the system
takes the whole content into consideration (not just the IP or host).

After you do enough blacklisting and/or whitelisting, the system will be
able to identify spam comments or pings using fuzzy matching, i.e. a
calculated probability.]
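
As a rough illustration of what 'training on the whole content' means, here
is a minimal sketch in Python (not the plugin's Perl code). The tokenizer,
the TokenCounts class and its field names are all assumptions made for this
example; the real plugin may split and store tokens differently.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word-like tokens (a simplification)."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

class TokenCounts:
    """Hypothetical store of how often each token appeared in spam / non-spam."""

    def __init__(self):
        self.spam = Counter()   # token -> number of spam items containing it
        self.ham = Counter()    # token -> number of non-spam items containing it
        self.n_spam = 0         # number of items trained as spam
        self.n_ham = 0          # number of items trained as not spam

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per item
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

counts = TokenCounts()
counts.train("Cheap pills, pills, pills at http://pills.example", is_spam=True)
counts.train("Great post, thanks for sharing", is_spam=False)
```

Every word of the comment (including pieces of any URL) contributes to the
counts, which is what distinguishes this from a plain IP or host blacklist.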

+++++++++++++++++++++++++++++++++++
CHANGES
+++++++++++++++++++++++++++++++++++
0.9 (1st Public Release on 15th Oct 2003)
0.9.3 - introduced a minor tweak to recalculate spam probability once in a while
0.9.4 - fixed a minor bug introduced in 0.9.3 that caused a stall when you
        attempted to access the training menu

1.0 (Major Release on 17th Oct 2003)
      - improved performance. Spam probabilities are cached until you force
        recalculation in the training page. This allows faster generation of
        comments and pings pages, especially dynamically generated ones.
      - added a button to mark all entries as spam or not spam in the
        Training menu
1.0.1 - fixed a minor error in bayesian_list_comments.tmpl 
1.0.2 - removed the confusing recalculate button. Instead, probabilities are
        recalculated automatically every time you rebuild the site (from
        mt-bayesian.cgi)
      - better integrate mt-bayesian.cgi with MT so you can totally replace
        mt.cgi with mt-bayesian.cgi (if you choose to)
1.0.3 - if you are using DBI (e.g. mysql), the plugin won't work unless the
        necessary tables are created.
1.0.4 - added a script to create the necessary tables for DBI users
1.0.5 - fixed a typo in bayesian_mysql.dump (thanks to Patrick Berry)
      - converted bayesian-init-db into a CGI

1.1 (Major Release on 20th Oct 2003)
      - fixed all the bugs related to using mt-bayesian with a DBI database
        (e.g. mysql, etc)
      - fixed a bug which created duplicate bayesian rows and also
        slowed down performance

1.1.1 - fixed a few other bugs which created duplicate bayesian rows
1.1.2 - fixed more duplicate bayesian row bugs
1.1.3 - added cleandb.cgi, a CGI script version of cleandb.pl
1.1.4 - added a button to delete only confirmed spam

+++++++++++++++++++++++++++++++++++
INSTALLATION
+++++++++++++++++++++++++++++++++++

1. Install (copy or move) these files into their respective locations.
   {MTDIR} refers to your base mt directory, e.g. /home/blog/public_html/mt

   {MTDIR}
      o mt-bayesian.cgi             (make sure this is executable)

   {MTDIR}/plugins
      o mt-bayesian.pl

   {MTDIR}/lib/MT
      o Bayesian.pm
      o BayesianBlog.pm
      o BayesianToken.pm

   {MTDIR}/lib/MT/App
      o BayesianTrain.pm

   {MTDIR}/tmpl/cms
      o bayesian_menu.tmpl           <-- new addition in 1.0.2
      o bayesian_list_blog.tmpl
      o bayesian_list_comments.tmpl
      o bayesian_list_pings.tmpl

2. (Optional) Edit {MTDIR}/lib/MT/Bayesian.pm and modify the variable
@ignore_tokens. Only add terms that appear frequently in both spam and
non-spam on your site and are therefore statistically irrelevant. If
you are not sure, just leave the default alone.
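
To show the idea behind an ignore list, here is a small sketch in Python
(not the plugin's Perl). The list contents and the names IGNORE_TOKENS and
significant_tokens are made up for this example and are not taken from
Bayesian.pm.

```python
# Hypothetical ignore list: terms that show up equally in spam and non-spam.
IGNORE_TOKENS = {"the", "a", "and", "of", "to"}

def significant_tokens(text, ignore=IGNORE_TOKENS):
    """Return the words of text, lowercased, with ignored terms dropped."""
    words = text.lower().split()
    return [w for w in words if w not in ignore]

kept = significant_tokens("The quick and the dead")
```

Ignored terms never reach the probability calculation, so they cannot crowd
out the tokens that actually distinguish spam from non-spam.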

3. (Optional - only if you use mysql, postgres or another DBI engine)
   3.1. move/copy the directory bdb/ to {MTDIR}/bdb/
   3.2. move/copy bayesian-init-db.cgi to {MTDIR}, then point your browser
        to http://your.host/path_to_mt/bayesian-init-db.cgi
        (e.g. http://my.blog.com/mt/bayesian-init-db.cgi)
   3.3. after the database is initialized, delete bayesian-init-db.cgi

   Note: If you get an error trying to execute bayesian-init-db.cgi
   from your browser, then execute it manually.

4. That's it! Simple isn't it?

+++++++++++++++++++++++++++++++++++
MODIFICATION TO TEMPLATES
+++++++++++++++++++++++++++++++++++

MT-Bayesian introduces 3 more tags which you can use in your templates.

o MTSpamProb  - returns the spam probability of the comment or trackback ping
o MTIfSpam    - true if the comment or trackback ping is identified as spam
o MTIfNotSpam - true if the comment or trackback ping is NOT identified as spam

Example (extracted from my 'Comment Listing Template'):
   <MTComments>
   <MTIfNotSpam>
   <div class="comments-body">
   <$MTCommentBody$>
   <span class="comments-post">Posted by <$MTCommentAuthorLink spam_protect="1"$> at <$MTCommentDate$> (Spam: <$MTSpamProb$>%)</span>
   </div>
   </MTIfNotSpam>
   </MTComments>

Important Note: <MTIfNotSpam> must be used ***INSIDE*** <MTComments> or 
<MTPings> container or else it won't work. In other words, the following 
will NOT work:
        <MTIfNotSpam><MTPings> ... </MTPings></MTIfNotSpam>

Typically, you need to modify the following templates
   - Individual Entry Archive 
        <MTComments><MTIfNotSpam> ... </MTIfNotSpam></MTComments>
   - Comment Listing Template
        <MTComments><MTIfNotSpam> ... </MTIfNotSpam></MTComments>
   - TrackBack Listing Template
        <MTPings><MTIfNotSpam> ... </MTIfNotSpam></MTPings>

+++++++++++++++++++++++++++++++++++
TRAINING YOUR MT BLOG
+++++++++++++++++++++++++++++++++++

To train your MT, point your browser to  
http://your.blog.com/path_to_mt/mt-bayesian.cgi

e.g. http://my.blog.com/mt/mt-bayesian.cgi

Select "Manage Comments" or "Manage Pings" and you will see a listing of the
comments or pings on your blog.

On each comment or ping, you will find a checkbox which allows you to train MT
to identify it as 'Spam' or 'Not Spam'. The latter exists because MT will
sometimes wrongly identify a comment or trackback as spam, so you need to
reverse it. (Think of it as blacklisting and whitelisting.)

Select some checkboxes and click the "Train" button (at the top or bottom).

You will immediately see the probability of the comments/pings change after
training.

WARNING: Clicking the "Delete Spam" button deletes ***ALL*** comments/pings
which either you or the system has identified as spam.

+++++++++++++++++++++++++++++++++++
FUTURE PLANS (if I have the time)
+++++++++++++++++++++++++++++++++++

- Allow users to export their trained bayesian database. More specifically,
  export the whitelisted and blacklisted hosts and URLs as RDF. Other users
  can then import these databases via RSS and use them for their filtering.

- Add a button to clean up the database and recalculate the spam probability
  after training in the Training interface. [done in 1.0]

- Add button to "check all" in Training interface. (Idea via Trackback
  from absoblogginlutely.net) [done in 1.0]

- Allow users to filter the Training interface to display only untrained
  comments/pings. (Idea from Michael Seneadza)

- Add options to management interface to allow users to define spam
  threshold (default 0.9) and non-spam threshold (default 0.2).
  (Idea from Michael Seneadza)

- (URGENT) write a script to create the tables for people using DBI
  [done in 1.0.4]

- Add a button to delete only confirmed spam [done in 1.1.4]

- Clean up the database when a BayesianBlog has been removed.

+++++++++++++++++++++++++++++++++++
FAQ
+++++++++++++++++++++++++++++++++++

Failed to get it to work? Check http://james.seng.cc/Bayesian-README.txt
to see if there is a newer version listed at the top. If so, the problem
you encountered may already be solved there.

Otherwise, feel free to drop me an email at jseng_at_pobox.org.sg. I
will try my best to answer you.

Q1. After I click 'Train', the entries I trained are still on the list. Is it
    okay to 'retrain' them?

Yes, it is okay. The system knows you have already trained that entry and
will not retrain it, unless you switch it from spam to not spam or vice versa.
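
One way such 'train once, unless the label flips' behaviour could be
implemented is sketched below in Python. This is only an assumption about
the general approach, not the plugin's actual code; the Trainer class and
its method names are hypothetical.

```python
from collections import Counter

class Trainer:
    """Hypothetical trainer that remembers how each item was last labelled."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.labels = {}  # item_id -> "spam" or "ham"

    def train(self, item_id, tokens, label):
        """Train one item; return False if it was already trained this way."""
        previous = self.labels.get(item_id)
        if previous == label:
            return False                 # same label again: nothing to do
        unique = set(tokens)
        if previous is not None:
            # Label switched: undo the earlier training first.
            self.counts[previous].subtract(unique)
        self.counts[label].update(unique)
        self.labels[item_id] = label
        return True

t = Trainer()
first = t.train(1, ["cheap", "pills"], "spam")
again = t.train(1, ["cheap", "pills"], "spam")
flipped = t.train(1, ["cheap", "pills"], "ham")
```

Clicking 'Train' twice with the same choice leaves the counts unchanged,
while switching the choice moves the item's tokens from one side to the other.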

Q2. Comments/Pings listing generation has become very slow! What's wrong?

Upgrade to MT-Bayesian 1.0 or later. It has improved performance.

Q3. Most spam comments look fairly normal apart from the links they contain,
    which is the real problem, unlike email spam. Can a Bayesian filter
    catch that?

Yes, it can, and MT-Bayesian does this.

Q4. How do I get "Delete Spam" to work?

If it hangs on you, upgrade to 1.x.

Q5. I am using SimpleComments. Will this work?

Yes.

Q6. I trained MT for one blog but my other blogs don't seem to pick it up.

MT-Bayesian learns on a per-blog basis, not per MT installation. In other
words, training one blog does not affect the other blogs. For example, you
may dislike me, so you classify any posting from jseng as spam. That's fine
because it is your blog after all, but your choice will not prevent me from
posting on other blogs on the same MT installation.
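
A minimal sketch of per-blog training data, written in Python for
illustration. The PerBlogTraining class and its layout are assumptions made
for this example and do not describe how BayesianBlog.pm stores its data.

```python
from collections import Counter, defaultdict

class PerBlogTraining:
    """Hypothetical store that keeps separate token counts for each blog."""

    def __init__(self):
        # blog_id -> {"spam": Counter, "ham": Counter}
        self.by_blog = defaultdict(
            lambda: {"spam": Counter(), "ham": Counter()}
        )

    def train(self, blog_id, tokens, is_spam):
        label = "spam" if is_spam else "ham"
        self.by_blog[blog_id][label].update(set(tokens))

    def spam_count(self, blog_id, token):
        return self.by_blog[blog_id]["spam"][token]

store = PerBlogTraining()
store.train(1, ["jseng"], is_spam=True)   # blog 1 dislikes jseng
```

Because every lookup is keyed by blog, what blog 1 learns about 'jseng'
is invisible to blog 2 on the same installation.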

Q7. Can I adapt this to ___fill in your blog engine___?

First, if "I" means you hope that I will do it, then sorry, I don't
think I have the time. I wrote MT-Bayesian for fun and have no intention
of turning it into my full-time job.

If "I" means you wish to adapt this to your own blog engine, by all
means go ahead. It isn't too difficult. As a measure, it took me 2-3 hrs
to build the Bayesian engine (based on the LISP code from Paul Graham)
and another 8-10 hrs to figure out how to stick it into MT. (That took a
bit of time since I had never written an MT plugin before; scode is too
simple to count.) So really, it isn't difficult.

Q8. The training doesn't seem to be saved and my comments stay at 50% no
    matter what.

You are probably using a database engine like mysql or postgres to store
your records. (Check your mt.cfg and look for a line called "DataSource".)

Upgrade to 1.1 or later. In versions before 1.1, DBI support does not work,
so you can't train MT at all.

During the upgrade, please remember to carry out Step 3 of INSTALLATION
(see above). If you don't initialize the database and create the necessary
tables, it won't work.

Q9. When I train a comment/trackback, the change is reflected in the
    training interface. But when I rebuild the pages, it is not reflected.

There were some database row duplication bugs before 1.1. You need to upgrade
to 1.1.3 and also run the cleandb script to clean up your database.

Two cleandb scripts are provided: one is a CGI script and the other
is a standalone script.

CGI method: Put cleandb.cgi into {MTDIR}, make sure it is executable, and
then point your browser to http://your.host/path_to_mt/cleandb.cgi

Standalone method: Put cleandb.pl into {MTDIR}, make sure it is executable,
and then type ./cleandb.pl.

If you are already running 1.1.2 (or above), run cleandb.pl once in a
while. Rows get duplicated when two rebuilds of the same entry occur at the
same time. The solution would be to lock the database beforehand, but that
would slow down performance drastically.
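
For readers curious what a duplicate-row cleanup involves, here is a generic
sketch in Python. It is not a port of cleandb.pl: the row layout, the field
names (blog_id, token) and the 'keep the first row' policy are all
assumptions chosen for illustration.

```python
def drop_duplicate_rows(rows, key_fields=("blog_id", "token")):
    """Keep only the first row seen for each key; later duplicates are dropped.

    rows is a list of dicts; key_fields names the (hypothetical) columns
    that should identify a row uniquely.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            continue          # duplicate created by a concurrent rebuild
        seen.add(key)
        unique.append(row)
    return unique

rows = [
    {"blog_id": 1, "token": "pills", "spam": 4},
    {"blog_id": 1, "token": "pills", "spam": 4},   # duplicate
    {"blog_id": 2, "token": "pills", "spam": 1},
]
cleaned = drop_duplicate_rows(rows)
```

A periodic cleanup like this trades a little manual work for not having to
lock the database on every rebuild.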

Q10. Why are all the comments either 0%, 50% or 99%? Is there a bug?

Short answer: Not enough training.

Long answer: The Bayesian filter takes about 15 significant tokens (words)
from the content for its probability calculation. If you have not trained
your filter sufficiently, the following cases are likely to happen:

1. none of the 15 tokens has been trained before, or trained sufficiently
   (i.e. > 3 times) => 50%

2. a few tokens are sufficiently trained but their counts are low (e.g.
   5 out of 6 spam or 4 out of 5 not spam) => 0% or 99% ...

In practice, you need about 1,000 unique spam and 1,000 unique non-spam
items to have sufficient training. Few sites reach that level. (I know my
site doesn't!)

(This is why a distributed database would perhaps help. It is on my TODO
list for once I get some free time...)
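
The two cases above can be reproduced with a small sketch in Python, loosely
following the combination rule from Paul Graham's "A Plan for Spam". The
function names, the 0.5 value for untrained tokens, the 0.01/0.99 clamp and
the exact 'more than 3 times' cutoff are assumptions for this example;
MT-Bayesian's modified algorithm may differ in its details.

```python
def token_probability(spam_hits, ham_hits, n_spam, n_ham, min_seen=3):
    """Spam probability of one token; 0.5 (neutral) if seen too rarely."""
    if spam_hits + ham_hits <= min_seen or n_spam == 0 or n_ham == 0:
        return 0.5
    bad = min(1.0, spam_hits / n_spam)    # share of spam items with the token
    good = min(1.0, ham_hits / n_ham)     # share of non-spam items with it
    p = bad / (bad + good)
    return max(0.01, min(0.99, p))        # never fully certain

def combined_probability(token_probs, max_tokens=15):
    """Combine the most significant token probabilities into one score."""
    # Most significant = farthest from the neutral value 0.5.
    interesting = sorted(
        token_probs, key=lambda p: abs(p - 0.5), reverse=True
    )[:max_tokens]
    prod_spam = 1.0
    prod_ham = 1.0
    for p in interesting:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

untrained = combined_probability([0.5] * 15)               # case 1
few_spammy = combined_probability([0.99] * 3 + [0.5] * 12)  # case 2
few_clean = combined_probability([0.01] * 3 + [0.5] * 12)   # case 2
```

With only neutral tokens the score stays at 50%. As soon as a handful of
tokens lean strongly one way, they dominate the product and push the score
to one extreme, which is why a lightly trained filter shows almost nothing
in between.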

+++++++++++++++++++++++++++++++++++
CREDIT
+++++++++++++++++++++++++++++++++++

Special thanks to Paul Graham, who first proposed using a Bayesian algorithm
for spam fighting. See "A Plan for Spam": http://www.paulgraham.com/spam.html

Boris, for fixing a bug in the README installation instructions
(BayesianTrain.pm should go into {MTDIR}/lib/MT/App)

Kianseon for pointing out the bug in bayesian_list_comments.tmpl

Patrick Berry, who pointed out that if the tables are not created beforehand
in DBI, the package will not work. He also helped in testing mysql.
Thanks!

Michael Seneadza and Michael William who pointed out the false positive bug.
Michael Seneadza also helped in debugging the problem.