TAR Evaluation Toolkit
Release 1.0.0
Copyright 2013, Gordon V. Cormack

This software is made available under the GNU General Public License,
version 3.  Any other use requires explicit permission from the author.
Includes "Sofia-ML", released under the Apache license -- see COPYING in
the Sofia-ML folder.

-----

DOWNLOAD THE TOOLKIT (5.5 GB, tar.gz file)

    tar-toolkit.tgz

DOWNLOAD THE TREC 2009 CORPUS (1.2 GB, not required if you use the
features provided in the toolkit)

    treclegal09docs.tgz

-----

OVERVIEW

This toolkit simulates an incremental Technology Assisted Review (TAR)
process and computes recall as a function of the number of documents
reviewed.

Two protocols are simulated:

  Continuous Active Learning (CAL) -- starting from a seed set of 1,000
  docs, the system repeatedly selects a batch of the 1,000 most-likely
  relevant docs, adds it to the training set, retrains, and repeats.

  Passive Supervised Learning (PSL) -- starting from a seed set of 1,000
  docs, the system repeatedly selects a batch of 1,000 random docs, adds
  it to the training set, retrains, and repeats.

Two classifiers are used:

  SVM (Pegasos Support Vector Machine, implemented by Sofia-ML)
  NB  (Naive Bayes, custom implementation)

Two seed-set protocols are implemented:

  Keyword seed: 1,000 hits, taken at random, from a "seed query"
  Random seed:  1,000 docs, selected at random

The collection is the Enron collection used for the TREC 2009 Legal
Track, minus about 70,000 vacuous documents.  A binary byte 4-gram
hashed feature representation is included in this toolkit.  The document
identifiers are also included, but the raw text of the Enron collection
is not.

-----

INSTALLATION

Install on a Linux/POSIX system, with gnuplot.
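As a quick sanity check before unpacking (a sketch, not part of the toolkit), you can confirm from the shell that the tools the build and graphing steps rely on are on your PATH:

```shell
# Sketch (not part of the toolkit): report any missing prerequisite.
for tool in tar make gnuplot; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```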
    tar zxf tar-toolkit.tgz
    cd tar-toolkit
    make

-----

SIMULATION

usage: ./run {actkeysvm, actransvm, paskeysvm, pasransvm, actkeynb,
              pasrannb} {201, 202, 203, 207} [iterations]

  act = continuous active learning
  pas = passive supervised learning
  key = keyword-selected seed set
  ran = random seed set
  svm = support vector machine
  nb  = naive bayes

  201, 202, 203, 207 = TREC topic number

  iterations (optional, between 1 and 100; default 100) = the number of
  feedback/training batches

Note: it takes several hours to run one simulation protocol on one
topic.

-----

GRAPHING

usage: ./graph topic run1 run2 run3 ...

  topic     = one of "201", "202", "203", "207"
  run       = runid, or trainsize runid
  runid     = one of actkeysvm actransvm paskeysvm pasransvm actkeynb
              pasrannb
  trainsize = a number indicating training-set size, in thousands of
              docs

Generally, graphs for PSL runs should specify trainsize, which shows the
result when training is stopped after trainsize thousand docs.  Graphs
for CAL runs should not specify trainsize, as training is never stopped.

example: ./graph 202 actkeysvm 2 pasransvm 5 pasransvm 8 pasransvm

Notes:
  must be run after simulation
  creates output files "plot.eps" and "plot.pdf" in the current
  directory
  for CAL, trainsize is normally *NOT* specified
  for PSL, trainsize is typically a number between 1 and 20, indicating
  the number (in thousands) of training docs

-----

INTERNALS

The learning machinery is in the folder "logistic", and the raw results
are in subfolders such as "logistic/actransvm".  In the subfolders,
"goldNN.TTT" gives the recall for topic TTT achieved using NN-1 thousand
training docs and K thousand reviewed docs, where K is the line number
in the file.  "selfNN.TTT" gives recall based on the training-set
relevance assessments, as opposed to the gold standard.

For details of the operation of the scripts, see the source code.
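As a sketch of how these result files can be read: since line K of a gold file holds the recall after K thousand reviewed docs, a single sed call extracts one data point. The file below is mock data standing in for a real file such as logistic/actransvm/goldNN.TTT.

```shell
# Sketch using mock data: "mockgold" stands in for a real result file
# like logistic/actransvm/goldNN.TTT, one recall value per line.
printf '0.41\n0.67\n0.82\n0.90\n' > mockgold   # 4 review batches
K=3                                            # 3,000 docs reviewed
sed -n "${K}p" mockgold                        # prints 0.82
```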