Adversarial Attacks on Crowdsourcing Quality Control

Authors: Alessandro Checco, Jo Bates, Gianluca Demartini

JAIR 2020

Reproducibility Variable | Result | LLM Response
Research Type: Experimental — "We implement and experimentally validate the gold question detection system, using real-world data from a popular crowdsourcing platform. Our experimental results show that crowd workers using the proposed system spend more time on signalled gold questions but do not neglect the others, thus achieving an increased overall work quality."
Researcher Affiliation: Academia — Alessandro Checco, Information School, The University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom; Jo Bates, Information School, The University of Sheffield, Regent Court, 211 Portobello, Sheffield S1 4DP, United Kingdom; Gianluca Demartini, School of Information Technology and Electrical Engineering, University of Queensland, GP South Building, Staff House Road, St Lucia QLD 4072, Australia
Pseudocode: No — The paper describes the system architecture and workflows (e.g., the client workflow using Simhash and the server workflow using clustering) through figures and detailed textual descriptions of steps, but it does not include a distinct, structured pseudocode or algorithm block.
Open Source Code: Yes — "The core functionalities of the plugin to replicate the following experiments are available at https://github.com/AlessandroChecco/all-that-glitters-is-gold."
Open Datasets: Yes — "We use the CSTA datasets and task logs described in (Benoit, Conway, Lauderdale, Laver, & Mikhaylov, 2016), consisting of crowdsourced annotations of political data. ... available from https://github.com/kbenoit/CSTA-APSR."
Dataset Splits: No — The paper reports dataset characteristics such as the percentage of gold questions (e.g., "12.4% of them are gold questions") and the number of judgements per non-gold question (e.g., "each non-gold question had been answered by 10 workers"), and it describes sub-sampling to vary these parameters. However, it does not specify explicit training/validation/test splits of the kind needed to directly reproduce a machine-learning data partitioning.
Hardware Specification: No — The paper describes a browser plug-in and an external server but does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments or the server infrastructure.
Software Dependencies: No — The paper mentions technologies such as a browser plug-in, a JavaScript bookmarklet, OrbitDB (with a GitHub link), and statistical models such as a Gaussian mixture model. However, it does not specify version numbers for any programming languages, libraries, or operating systems used in the implementation or experimental setup.
Experiment Setup: Yes — "Regarding the parameters of the system, we consider a realistic scenario: a job of 2000 tasks with an additional 5% (100 tasks) of gold questions. We consider the default automatic behaviour of Figure Eight: 10 gold questions are used at the beginning to train and test the ability of the worker (i.e. a quiz page). After that, pages of 10 tasks are shown to the worker, of which 9 are requested tasks and one is a gold question. To be considered trusted, workers are required, by default, to judge a minimum of four gold questions and to reach an accuracy threshold of 70%. ... Confidence: The worker will consider as gold all questions with signalled probability of being gold of at least 50%."
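The experiment setup quoted above can be made concrete with a small sketch. The following Python snippet is illustrative only, not the authors' code: it encodes the default Figure Eight trust rule (a worker is trusted after judging at least four gold questions with at least 70% accuracy) and the job composition described in the paper (2000 tasks plus 5% gold questions, shown in pages of 9 requested tasks and 1 gold question). Function and constant names are hypothetical.

```python
# Illustrative sketch of the paper's experimental parameters;
# names and structure are assumptions, not the authors' implementation.

MIN_GOLD_JUDGED = 4        # default minimum gold questions judged
ACCURACY_THRESHOLD = 0.70  # default accuracy threshold to be "trusted"

def is_trusted(gold_results):
    """gold_results: list of booleans, True if a gold question was answered correctly.
    Returns True if the worker meets Figure Eight's default trust rule as described."""
    if len(gold_results) < MIN_GOLD_JUDGED:
        return False
    accuracy = sum(gold_results) / len(gold_results)
    return accuracy >= ACCURACY_THRESHOLD

def job_layout(num_tasks=2000, gold_fraction=0.05, page_size=10):
    """Job composition from the paper: 5% gold questions, pages of 10 tasks
    containing 9 requested tasks and 1 gold question (after the initial quiz page)."""
    num_gold = int(num_tasks * gold_fraction)
    num_pages = num_tasks // (page_size - 1)  # 9 requested tasks per page
    return num_gold, num_pages

print(is_trusted([True, True, True, False]))  # 3/4 correct = 75% >= 70% -> True
print(job_layout())                           # (100, 222)
```

Under this rule, a worker who answers three of their first four gold questions correctly is trusted, while one who answers two of four (50%) is not, matching the 70% threshold quoted from the paper.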