TAR Evaluation Toolkit Release 1.0.0

Copyright 2013, Gordon V. Cormack

This software is made available under the GNU General Public License, version 3.

Any other use requires explicit permission from the author.

Includes "Sofia-ML", released under the Apache license -- see COPYING in
the Sofia-ML folder.

-----

DOWNLOAD THE TOOLKIT (5.5 GB, tar.gz file)

tar-toolkit.tgz

DOWNLOAD THE TREC 2009 CORPUS (1.2 GB, not required if you use the features provided in the toolkit)

treclegal09docs.tgz

-----

OVERVIEW

This toolkit simulates an incremental Technology Assisted Review process
and computes recall as a function of the number of documents reviewed.
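
As a rough illustration of what "recall as a function of the number of
documents reviewed" means, the sketch below computes a recall curve with awk.
It is not the toolkit's code: the function name recall_curve, the input format
(one 0/1 relevance label per line, in review order), and the arguments are
assumptions made for this example.

```shell
# recall_curve LABELS BATCH TOTAL  (hypothetical helper, not part of the toolkit)
#   LABELS = file with one label per line, in review order (1 = relevant, 0 = not)
#   BATCH  = number of docs reviewed per batch
#   TOTAL  = number of relevant docs in the whole collection
# Prints "docs_reviewed recall" after each complete batch.
recall_curve() {
    LC_ALL=C awk -v batch="$2" -v total="$3" '
        { found += $1 }
        NR % batch == 0 { printf "%d %.3f\n", NR, found / total }
    ' "$1"
}
```

In the toolkit the batch size is 1,000 docs; a smaller batch size is handy when
trying the sketch on a toy label file.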

Two protocols are simulated:

   Continuous Active Learning (CAL) -- after starting with a seed set of
   1,000 docs, the system repeatedly selects a batch of the 1,000 docs most
   likely to be relevant, adds it to the training set, and retrains.

   Passive Supervised Learning (PSL) -- after starting with a seed set of
   1,000 docs, the system repeatedly selects a batch of 1,000 random docs,
   adds it to the training set, and retrains.
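
The two protocols differ only in how the next batch is chosen. A minimal sketch
of that difference, assuming a file of "docid score" lines for the docs not yet
reviewed (this file format and both function names are hypothetical and are
not taken from the toolkit's scripts):

```shell
# Both helpers read a file of "docid score" lines (an assumed format)
# and print the docids chosen for the next batch.

# CAL: take the SIZE docs with the highest classifier scores.
select_cal_batch() {
    LC_ALL=C sort -k2,2nr "$1" | head -n "$2" | awk '{ print $1 }'
}

# PSL: take SIZE docs at random, ignoring the scores.
# An optional third argument fixes the random seed for repeatability.
select_psl_batch() {
    LC_ALL=C awk -v seed="${3:-}" '
        BEGIN { if (seed != "") srand(seed); else srand() }
        { printf "%.9f %s\n", rand(), $1 }
    ' "$1" | LC_ALL=C sort -k1,1n | head -n "$2" | awk '{ print $2 }'
}
```

For example, "select_cal_batch scores.txt 1000" would print the 1,000
top-scoring docids, while "select_psl_batch scores.txt 1000" would print 1,000
docids drawn at random. In either case the chosen docs are then added to the
training set before retraining.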

Two classifiers are used:

    SVM (Pegasos Support Vector Machine, implemented by Sofia-ML)
    NB (Naive Bayes, custom implementation)

Two seed set protocols are implemented:

    Keyword-seed:  1,000 hits, taken at random, from a "seed query"
    Random-seed:   1,000 docs, selected at random.

The collection is the Enron collection used for the TREC 2009 Legal Track,
minus about 70,000 vacuous documents.  A binary byte 4-gram hashed feature
representation is included in this toolkit.  The document identifiers are
also included, but the raw text of the Enron collection is not.

-----

INSTALLATION

Install on a Linux/POSIX system with gnuplot installed.

   tar zxf tar-toolkit.tgz
   cd tar-toolkit
   make
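
Before running the commands above, it may help to confirm that the required
tools are on the PATH. The helper below is a hypothetical convenience, not part
of the toolkit; it only uses the standard "command -v" lookup.

```shell
# check_tool NAME  (hypothetical helper): report whether NAME is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# tar and make are used by the install commands; gnuplot is listed as a requirement.
for tool in tar make gnuplot; do
    check_tool "$tool" || true
done
```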

-----

SIMULATION

   ./run {actkeysvm, actransvm, paskeysvm, pasransvm, actkeynb, pasrannb} {201, 202, 203, 207} [iterations]
      act = continuous active learning
      pas = passive supervised learning
      key = keyword-selected seed set
      ran = random seed set
      svm = support vector machine
      nb = naive bayes
      201, 202, 203, 207 = TREC topic number
      iterations (optional, between 1 and 100, default 100) = 
         the number of feedback/training batches
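
The run names are built from the three abbreviations listed above. As a reading
aid, the following sketch decodes a run name into its parts. The function
describe_runid is hypothetical and only splits the name; it does not check
whether ./run accepts that particular combination.

```shell
# describe_runid RUNID  (hypothetical helper): spell out a run name such as "actkeysvm".
describe_runid() {
    runid=$1
    case $runid in
        act*) protocol="continuous active learning" ;;
        pas*) protocol="passive supervised learning" ;;
        *) echo "unrecognized run id: $runid" >&2; return 1 ;;
    esac
    rest=${runid#???}               # drop the 3-letter protocol prefix
    case $rest in
        key*) seed="keyword-selected seed set" ;;
        ran*) seed="random seed set" ;;
        *) echo "unrecognized run id: $runid" >&2; return 1 ;;
    esac
    case ${rest#???} in             # what remains names the classifier
        svm) classifier="support vector machine" ;;
        nb)  classifier="naive bayes" ;;
        *) echo "unrecognized run id: $runid" >&2; return 1 ;;
    esac
    echo "$protocol, $seed, $classifier"
}

describe_runid pasrannb
# prints: passive supervised learning, random seed set, naive bayes
```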

Note: it takes several hours to run one simulation protocol on one topic.
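
Because each run is slow, it can be convenient to list the commands for one
topic first and then decide which to start. The helper below is a hypothetical
sketch that only prints ./run command lines using the run names and argument
order shown above; it does not execute anything.

```shell
# print_run_commands TOPIC [ITERATIONS]  (hypothetical helper)
# Prints one ./run command per run name; ITERATIONS defaults to 100.
print_run_commands() {
    topic=$1
    iterations=${2:-100}
    for runid in actkeysvm actransvm paskeysvm pasransvm actkeynb pasrannb; do
        echo "./run $runid $topic $iterations"
    done
}
```

For example, "print_run_commands 202 10" lists six commands for topic 202 with
10 iterations each. Piping that output to sh from the tar-toolkit folder would
run them one after another, which by the note above could take a long time.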

-----

GRAPHING
  
   usage: ./graph topic run1 run2 run3 ...
      topic = one of "201", "202", "203", "207"
      run = runid or trainsize runid
      runid = one of actkeysvm actransvm paskeysvm pasransvm actkeynb pasrannb
      trainsize = a number indicating training set size in thousands of docs
         generally, graphs for PSL runs should specify trainsize, which
         shows the result when training is stopped after trainsize docs.
         graphs for CAL runs should not specify trainsize, as training
         is never stopped.
      example: ./graph 202 actkeysvm 2 pasransvm 5 pasransvm 8 pasransvm

Notes: 

    Must be run after the simulation.
    Creates output files "plot.eps" and "plot.pdf" in the current directory.
    For CAL, trainsize is normally *NOT* specified.
    For PSL, trainsize is typically a number between 1 and 20, indicating the
       number (in thousands) of training docs.
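
A common comparison is one CAL run against one PSL run stopped at several
training set sizes. The sketch below assembles such a ./graph command line
following the usage above; the function name graph_command is hypothetical, and
it only prints the command.

```shell
# graph_command TOPIC CAL_RUN PSL_RUN SIZE...  (hypothetical helper)
# The CAL run is listed without a trainsize; the PSL run is repeated once per SIZE.
graph_command() {
    topic=$1
    cal=$2
    psl=$3
    shift 3
    args="$topic $cal"
    for size in "$@"; do
        args="$args $size $psl"
    done
    echo "./graph $args"
}

graph_command 202 actkeysvm pasransvm 2 5 8
# prints: ./graph 202 actkeysvm 2 pasransvm 5 pasransvm 8 pasransvm
```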

-----

INTERNALS

The learning code is in the folder "logistic", and the raw results are in
subfolders such as "logistic/actransvm".  In the subfolders, "goldNN.TTT"
gives the recall achieved using NN-1 thousand training docs and K
thousand reviewed docs, where K is the line number in the file.
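
The naming rule can be applied with two small shell helpers. Both are
hypothetical sketches: training_docs converts the NN part of a file name into a
training set size, and line_k prints line K of a result file. The layout of
each line is not described here, so line_k prints the line unchanged.

```shell
# training_docs NN  (hypothetical helper): training set size for "goldNN.TTT",
# i.e. NN-1 thousand docs.
training_docs() {
    nn=$1
    # Strip leading zeros so that e.g. "08" is not parsed as an octal number.
    while [ "${#nn}" -gt 1 ] && [ "${nn#0}" != "$nn" ]; do
        nn=${nn#0}
    done
    echo $(( ($nn - 1) * 1000 ))
}

# line_k FILE K  (hypothetical helper): print line K of FILE,
# the entry for K thousand reviewed docs.
line_k() {
    sed -n "${2}p" "$1"
}
```

For example, "training_docs 03" prints 2000, and "line_k FILE 5" prints the
entry for 5,000 reviewed docs, where FILE is the path to one of the goldNN.TTT
files.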

logistic/selfNN.TTT gives recall based on the training-standard relevance
assessments, as opposed to the gold standard.

For details of the operation of the scripts, see the source code.