Annotator Rationales for Labeling Tasks in Crowdsourcing

Authors: Mucahid Kutlu, Tyler McDonnell, Tamer Elsayed, Matthew Lease

JAIR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate this, we perform A/B testing of our Standard Task (Section 4.2) vs. our Rationale Task (Section 4.3). We measure inter-annotator agreement (Section 6.2) to test whether the crowd is internally consistent, regardless of their agreement with our gold standard. Next, we measure the accuracy of crowd judgments (individually and aggregated) vs. the TREC gold standard (Section 6.3)."
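With five crowd responses per page, multi-rater agreement statistics such as Fleiss' kappa apply. The quoted text does not name the paper's exact agreement measure, so the following is an illustrative sketch, not the paper's computation:

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-annotator agreement among multiple raters.

    ratings: N x k matrix; ratings[i][j] = number of raters who put
    item i into category j (each row sums to the same rater count n).
    """
    R = np.asarray(ratings, dtype=float)
    N, _ = R.shape
    n = R[0].sum()  # raters per item
    # observed agreement: fraction of agreeing rater pairs, averaged over items
    P_i = (np.square(R).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()
    # chance agreement from overall category proportions
    p_j = R.sum(axis=0) / (N * n)
    P_e = np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement (all raters in one category per item) yields kappa = 1, while agreement at chance level yields kappa near 0 or below.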
Researcher Affiliation | Academia | "Mucahid Kutlu, TOBB University of Economics and Technology, Ankara, Turkey; Tyler McDonnell, University of Texas at Austin, Austin, TX, USA; Tamer Elsayed, Qatar University, Doha, Qatar; Matthew Lease, University of Texas at Austin, Austin, TX, USA"
Pseudocode | Yes | "Algorithm 1 (Top-N Filtering): procedure Filter-By-Top-N(Jd, N) ...; Algorithm 2 (Threshold Filtering): procedure Filter-By-Threshold(Jd)"
Open Source Code | No | "For Dawid-Skene (DS), we adopt an existing open source package (https://github.com/dallascard/dawid_skene)."
Open Datasets | Yes | "We collect ad hoc Web search relevance judgments for the ClueWeb09 Web crawl (Callan et al., 2009) using the quaternary (4-point) scale described in Section 4.2. Search topics and judgments are drawn from the 2009 TREC Web Track (Clarke et al., 2010)."
Dataset Splits | Yes | "TREC gold judgments for our 700 documents are distributed as follows: 46% not relevant, 24% relevant, and 30% highly relevant. We evaluate collected crowd judgments against both this ternary gold standard (collapsing our probably/definitely not relevant distinctions) and a binarized version (collapsing TREC's relevant and highly relevant labels, and our own probably/definitely relevant distinctions), yielding 46% not relevant and 54% relevant documents. We collect 5 crowd responses per Web page (700 x 5 = 3,500 judgments) for each task design: Standard, Rationale, and Two-Stage."
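The scale collapsing described above can be sketched as follows. The integer encoding of the 4-point scale (0 = definitely not relevant, 1 = probably not relevant, 2 = probably relevant, 3 = definitely relevant) and the use of a plain majority vote over the five responses are illustrative assumptions, not the paper's stated implementation:

```python
from collections import Counter

def to_binary(label):
    """Collapse the assumed 4-point encoding to binary relevance:
    probably/definitely not relevant -> 0; probably/definitely relevant -> 1."""
    return 0 if label in (0, 1) else 1

def majority_vote(judgments):
    """Aggregate crowd responses by plurality; ties go to the
    first-encountered label (Counter insertion order)."""
    return Counter(judgments).most_common(1)[0][0]

# e.g., five crowd responses for one page, collapsed then aggregated
binary_label = majority_vote([to_binary(j) for j in [0, 1, 2, 3, 3]])
```

With five binary votes per page, a majority always exists, so the tie-breaking rule only matters on the uncollapsed scale.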
Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running the experiments are mentioned in the paper.
Software Dependencies | No | "For Dawid-Skene (DS), we adopt an existing open source package (https://github.com/dallascard/dawid_skene)."
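The paper adopts the dallascard/dawid_skene package for label aggregation. As a self-contained illustration of the underlying technique (not that package's API), here is a minimal Dawid-Skene EM sketch: it alternates between estimating each worker's confusion matrix and each item's posterior over true classes:

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Minimal Dawid-Skene EM: infer true labels from noisy worker labels.

    labels: dict mapping (item_id, worker_id) -> observed class index.
    Returns (sorted item ids, per-item posterior matrix of shape
    (n_items, n_classes)).
    """
    items = sorted({i for i, _ in labels})
    workers = sorted({w for _, w in labels})
    i_idx = {v: k for k, v in enumerate(items)}
    w_idx = {v: k for k, v in enumerate(workers)}
    # counts[i, w, c] = 1 iff worker w labeled item i as class c
    counts = np.zeros((len(items), len(workers), n_classes))
    for (i, w), c in labels.items():
        counts[i_idx[i], w_idx[w], c] = 1.0
    # initialize posteriors T from per-item vote fractions
    T = counts.sum(axis=1)
    T /= T.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        # M-step: class priors and worker confusion matrices conf[w, true, obs]
        priors = T.mean(axis=0)
        conf = np.einsum('ik,iwc->wkc', T, counts)
        conf /= conf.sum(axis=2, keepdims=True).clip(min=eps)
        # E-step: posterior over each item's true class
        logT = np.log(priors.clip(min=eps)) + np.einsum(
            'iwc,wkc->ik', counts, np.log(conf.clip(min=eps)))
        logT -= logT.max(axis=1, keepdims=True)  # stabilize before exp
        T = np.exp(logT)
        T /= T.sum(axis=1, keepdims=True)
    return items, T
```

Unlike majority vote, this down-weights unreliable workers: a worker who systematically flips labels is detected via the confusion matrix rather than counted at face value.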
Experiment Setup | Yes | "We set N = 3 for Top-N judgment filtering and round down to the nearest 10 for Threshold filtering, based on pilot experiments (Section 4.1)."
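The filtering procedures (Algorithms 1 and 2) appear above only as signatures. The sketch below is one plausible reading under stated assumptions: Jd is a list of (worker_id, label, quality_score) tuples for a document, Top-N keeps the N highest-scoring judgments, and Threshold keeps judgments whose score clears the best score rounded down to the nearest 10. The tuple layout and scoring semantics are assumptions, not the paper's definitions:

```python
def filter_by_top_n(judgments, n=3):
    """Keep the n judgments whose workers have the highest quality scores.

    judgments: list of (worker_id, label, score) tuples -- an assumed
    layout, not the paper's actual data structure.
    """
    return sorted(judgments, key=lambda j: j[2], reverse=True)[:n]

def filter_by_threshold(judgments):
    """Keep judgments whose score clears the best score rounded down
    to the nearest 10 (the rounding rule quoted from Section 4.1)."""
    cutoff = (max(j[2] for j in judgments) // 10) * 10
    return [j for j in judgments if j[2] >= cutoff]
```

For example, with scores [95, 91, 88, 72, 60], Top-3 keeps the first three, while Threshold computes a cutoff of 90 and keeps only the first two.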