TAR Evaluation Toolkit
Release 1.0.0
Copyright 2013, Gordon V. Cormack

This software is made available under the GNU General Public License,
version 3.  Any other use requires explicit permission from the author.
Includes "Sofia-ML", released under the Apache license -- see COPYING in
the Sofia-ML folder.

-----

DOWNLOAD THE TOOLKIT (5.5 GB, tar.gz file)

    tar-toolkit.tgz

DOWNLOAD THE TREC 2009 CORPUS (1.2 GB, not required if you use the
features provided in the toolkit)

    treclegal09docs.tgz

-----

OVERVIEW

This toolkit simulates an incremental Technology Assisted Review (TAR)
process and computes recall as a function of the number of documents
reviewed.

Two protocols are simulated:

  Continuous Active Learning (CAL) -- starting from a seed set of 1,000
  docs, the system repeatedly selects a batch of the 1,000 most-likely
  relevant docs, adds it to the training set, retrains, and repeats.

  Passive Supervised Learning (PSL) -- starting from a seed set of 1,000
  docs, the system repeatedly selects a batch of 1,000 random docs, adds
  it to the training set, retrains, and repeats.

Two classifiers are used:

  SVM (Pegasos Support Vector Machine, implemented by Sofia-ML)
  NB  (Naive Bayes, custom implementation)

Two seed-set protocols are implemented:

  Keyword seed: 1,000 hits, taken at random, from a "seed query"
  Random seed:  1,000 docs, selected at random

The collection is the Enron collection used for the TREC 2009 Legal
Track, minus about 70,000 vacuous documents.  A binary byte 4-gram
hashed feature representation is included in this toolkit.  The document
identifiers are also included, but the raw text of the Enron collection
is not.

-----

INSTALLATION

Install on a Linux/POSIX system, with gnuplot.
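As a quick sanity check before unpacking (a sketch, not part of the toolkit), you can confirm from the shell that the tools the build and graphing steps rely on are on your PATH:

```shell
# Sketch (not part of the toolkit): report any missing prerequisite.
for tool in tar make gnuplot; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```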
    tar zxf tar-toolkit.tgz
    cd tar-toolkit
    make

-----

SIMULATION

usage: ./run {actkeysvm, actransvm, paskeysvm, pasransvm, actkeynb,
              pasrannb} {201, 202, 203, 207} [iterations]

  act = continuous active learning
  pas = passive supervised learning
  key = keyword-selected seed set
  ran = random seed set
  svm = support vector machine
  nb  = naive bayes

  201, 202, 203, 207 = TREC topic number

  iterations (optional, between 1 and 100; default 100) = the number of
  feedback/training batches

Note: it takes several hours to run one simulation protocol on one
topic.

-----

GRAPHING

usage: ./graph topic run1 run2 run3 ...

  topic     = one of "201", "202", "203", "207"
  run       = runid, or trainsize runid
  runid     = one of actkeysvm actransvm paskeysvm pasransvm actkeynb
              pasrannb
  trainsize = a number indicating training-set size, in thousands of
              docs

Generally, graphs for PSL runs should specify trainsize, which shows the
result when training is stopped after trainsize thousand docs.  Graphs
for CAL runs should not specify trainsize, as training is never stopped.

example: ./graph 202 actkeysvm 2 pasransvm 5 pasransvm 8 pasransvm

Notes:
  must be run after simulation
  creates output files "plot.eps" and "plot.pdf" in the current
  directory
  for CAL, trainsize is normally *NOT* specified
  for PSL, trainsize is typically a number between 1 and 20, indicating
  the number (in thousands) of training docs

-----

INTERNALS

The learning machinery is in the folder "logistic", and the raw results
are in subfolders such as "logistic/actransvm".  In the subfolders,
"goldNN.TTT" gives the recall for topic TTT achieved using NN-1 thousand
training docs and K thousand reviewed docs, where K is the line number
in the file.  "selfNN.TTT" gives recall based on the training-set
relevance assessments, as opposed to the gold standard.

For details of the operation of the scripts, see the source code.
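As a sketch of how these result files can be read: since line K of a gold file holds the recall after K thousand reviewed docs, a single sed call extracts one data point. The file below is mock data standing in for a real file such as logistic/actransvm/goldNN.TTT.

```shell
# Sketch using mock data: "mockgold" stands in for a real result file
# like logistic/actransvm/goldNN.TTT, one recall value per line.
printf '0.41\n0.67\n0.82\n0.90\n' > mockgold   # 4 review batches
K=3                                            # 3,000 docs reviewed
sed -n "${K}p" mockgold                        # prints 0.82
```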