Navigating Imprecision in Relevance Assessments on the Road to Total Recall: Roger and Me

Gordon V. Cormack & Maura R. Grossman

To appear in Proceedings of SIGIR 2017

Abstract

Technology-assisted review ("TAR") systems seek to achieve "total recall"; that is, to approach, as nearly as possible, the ideal of 100% recall and 100% precision, with minimal human effort. The literature reports that TAR methods using relevance feedback can achieve considerably greater than the 65% recall and 65% precision reported by Voorhees in 2000 as the "practical upper bound on retrieval performance . . . since that is the level at which humans agree with one another" (Variations in Relevance Judgments and the Measurement of Retrieval Effectiveness, 36 Info. Proc. Mgmt. 697). This work argues that in order to build, as well as to evaluate, TAR systems that approach 100% recall and 100% precision, it is necessary to model human assessment, not as absolute ground truth, but as an indirect indicator of the amorphous property known as "relevance." The choice of model impacts both the evaluation of system effectiveness and the in vitro simulation of relevance feedback. Models are presented that better fit available data than the infallible ground-truth model. These models suggest ways to improve TAR-system effectiveness so that hybrid human-computer systems can improve on both the accuracy and efficiency of human review alone. This hypothesis is tested by simulating TAR using two datasets: the TREC 4 AdHoc collection, and a dataset consisting of 400,000 email messages that were manually reviewed and classified by a single individual, Roger, in his official capacity as Senior State Records Archivist. The results using the TREC 4 data show that TAR achieves higher recall and higher precision than the assessments by either of two independent NIST assessors. Blind adjudication conducted by Roger, more than two years after his original email review, shows that he could have achieved the same recall and better precision, while reviewing substantially fewer than 400,000 emails, had he employed TAR in place of exhaustive manual review.