Annotator Rationales for Labeling Tasks in Crowdsourcing

Authors: Mucahid Kutlu, Tyler McDonnell, Tamer Elsayed, Matthew Lease

JAIR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate this, we perform A/B testing of our Standard Task (Section 4.2) vs. our Rationale Task (Section 4.3). We measure inter-annotator agreement (Section 6.2) to test whether the crowd is internally consistent, regardless of their agreement with our gold standard. Next, we measure the accuracy of crowd judgments (individually and aggregated) vs. the TREC gold standard (Section 6.3)."
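With five crowd responses per page, multi-rater agreement statistics such as Fleiss' kappa apply. The quoted text does not name the paper's exact agreement measure, so the following is an illustrative sketch, not the paper's computation:

```python
import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-annotator agreement among multiple raters.

    ratings: N x k matrix; ratings[i][j] = number of raters who put
    item i into category j (each row sums to the same rater count n).
    """
    R = np.asarray(ratings, dtype=float)
    N, _ = R.shape
    n = R[0].sum()  # raters per item
    # observed agreement: fraction of agreeing rater pairs, averaged over items
    P_i = (np.square(R).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()
    # chance agreement from overall category proportions
    p_j = R.sum(axis=0) / (N * n)
    P_e = np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)
```

Perfect agreement (all raters in one category per item) yields kappa = 1, while agreement at chance level yields kappa near 0 or below.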
Researcher Affiliation | Academia | "Mucahid Kutlu, TOBB University of Economics and Technology, Ankara, Turkey; Tyler McDonnell, University of Texas at Austin, Austin, TX, USA; Tamer Elsayed, Qatar University, Doha, Qatar; Matthew Lease, University of Texas at Austin, Austin, TX, USA"
Pseudocode | Yes | "Algorithm 1 (Top-N Filtering): procedure Filter-By-Top-N(Jd, N) ...; Algorithm 2 (Threshold Filtering): procedure Filter-By-Threshold(Jd)"
Open Source Code | No | "For Dawid-Skene (DS), we adopt an existing open source package (https://github.com/dallascard/dawid_skene)."
Open Datasets | Yes | "We collect ad hoc Web search relevance judgments for the ClueWeb09 Web crawl (Callan et al., 2009) using the quaternary (4-point) scale described in Section 4.2. Search topics and judgments are drawn from the 2009 TREC Web Track (Clarke et al., 2010)."
Dataset Splits | Yes | "TREC gold judgments for our 700 documents are distributed as follows: 46% not relevant, 24% relevant, and 30% highly relevant. We evaluate collected crowd judgments against both this ternary gold standard (collapsing our probably/definitely not relevant distinctions) and a binarized version (collapsing TREC's relevant and highly relevant labels, and our own probably/definitely relevant distinctions), yielding 46% not relevant and 54% relevant documents. We collect 5 crowd responses per Web page (700 x 5 = 3,500 judgments) for each task design: Standard, Rationale, and Two-Stage."
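The scale collapsing described above can be sketched as follows. The integer encoding of the 4-point scale (0 = definitely not relevant, 1 = probably not relevant, 2 = probably relevant, 3 = definitely relevant) and the use of a plain majority vote over the five responses are illustrative assumptions, not the paper's stated implementation:

```python
from collections import Counter

def to_binary(label):
    """Collapse the assumed 4-point encoding to binary relevance:
    probably/definitely not relevant -> 0; probably/definitely relevant -> 1."""
    return 0 if label in (0, 1) else 1

def majority_vote(judgments):
    """Aggregate crowd responses by plurality; ties go to the
    first-encountered label (Counter insertion order)."""
    return Counter(judgments).most_common(1)[0][0]

# e.g., five crowd responses for one page, collapsed then aggregated
binary_label = majority_vote([to_binary(j) for j in [0, 1, 2, 3, 3]])
```

With five binary votes per page, a majority always exists, so the tie-breaking rule only matters on the uncollapsed scale.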
Hardware Specification | No | No specific hardware details (GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running the experiments are mentioned in the paper.
Software Dependencies | No | "For Dawid-Skene (DS), we adopt an existing open source package (https://github.com/dallascard/dawid_skene)."
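The paper adopts the dallascard/dawid_skene package for label aggregation. As a self-contained illustration of the underlying technique (not that package's API), here is a minimal Dawid-Skene EM sketch: it alternates between estimating each worker's confusion matrix and each item's posterior over true classes:

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """Minimal Dawid-Skene EM: infer true labels from noisy worker labels.

    labels: dict mapping (item_id, worker_id) -> observed class index.
    Returns (sorted item ids, per-item posterior matrix of shape
    (n_items, n_classes)).
    """
    items = sorted({i for i, _ in labels})
    workers = sorted({w for _, w in labels})
    i_idx = {v: k for k, v in enumerate(items)}
    w_idx = {v: k for k, v in enumerate(workers)}
    # counts[i, w, c] = 1 iff worker w labeled item i as class c
    counts = np.zeros((len(items), len(workers), n_classes))
    for (i, w), c in labels.items():
        counts[i_idx[i], w_idx[w], c] = 1.0
    # initialize posteriors T from per-item vote fractions
    T = counts.sum(axis=1)
    T /= T.sum(axis=1, keepdims=True)
    eps = 1e-12
    for _ in range(n_iter):
        # M-step: class priors and worker confusion matrices conf[w, true, obs]
        priors = T.mean(axis=0)
        conf = np.einsum('ik,iwc->wkc', T, counts)
        conf /= conf.sum(axis=2, keepdims=True).clip(min=eps)
        # E-step: posterior over each item's true class
        logT = np.log(priors.clip(min=eps)) + np.einsum(
            'iwc,wkc->ik', counts, np.log(conf.clip(min=eps)))
        logT -= logT.max(axis=1, keepdims=True)  # stabilize before exp
        T = np.exp(logT)
        T /= T.sum(axis=1, keepdims=True)
    return items, T
```

Unlike majority vote, this down-weights unreliable workers: a worker who systematically flips labels is detected via the confusion matrix rather than counted at face value.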
Experiment Setup | Yes | "We set N = 3 for Top-N judgment filtering and round down to the nearest 10 for Threshold filtering, based on pilot experiments (Section 4.1)."
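The filtering procedures (Algorithms 1 and 2) appear above only as signatures. The sketch below is one plausible reading under stated assumptions: Jd is a list of (worker_id, label, quality_score) tuples for a document, Top-N keeps the N highest-scoring judgments, and Threshold keeps judgments whose score clears the best score rounded down to the nearest 10. The tuple layout and scoring semantics are assumptions, not the paper's definitions:

```python
def filter_by_top_n(judgments, n=3):
    """Keep the n judgments whose workers have the highest quality scores.

    judgments: list of (worker_id, label, score) tuples -- an assumed
    layout, not the paper's actual data structure.
    """
    return sorted(judgments, key=lambda j: j[2], reverse=True)[:n]

def filter_by_threshold(judgments):
    """Keep judgments whose score clears the best score rounded down
    to the nearest 10 (the rounding rule quoted from Section 4.1)."""
    cutoff = (max(j[2] for j in judgments) // 10) * 10
    return [j for j in judgments if j[2] >= cutoff]
```

For example, with scores [95, 91, 88, 72, 60], Top-3 keeps the first three, while Threshold computes a cutoff of 90 and keeps only the first two.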