ARTICLE: Annotator Reliability Through In-Context Learning
Authors: Sujan Dutta, Deepak Pandita, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhsh
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this framework on two offensive speech datasets using multiple LLMs and compare its performance with traditional methods. Our findings indicate that ARTICLE can be used as a robust method for identifying reliable annotators, hence improving data quality. Figures 2 and 3 illustrate the performance (F1-score on the test set) at various values of k for DTR and DVOICED, respectively. |
| Researcher Affiliation | Academia | 1 Rochester Institute of Technology, 2 George Mason University |
| Pseudocode | No | The paper includes schematic diagrams and descriptions of its method, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code https://github.com/Suji04/ARTICLE |
| Open Datasets | Yes | We consider two datasets on web toxicity: DTR and DVOICED. DTR contains 107,620 comments from multiple social web platforms (Twitter, Reddit, and 4chan) collectively annotated by 17,280 annotators. We sample 20,000 comments from DTR for our experiments, ensuring that each set of 20 comments is annotated by the same five annotators, thereby retaining the structure of the original dataset. DVOICED (Weerasooriya et al. 2023b) includes 2,338 YouTube comments on three major US cable news networks (KhudaBukhsh et al. 2021) annotated by 726 annotators. |
| Dataset Splits | Yes | For each annotator, we randomly split their annotations into two sets: the first set (training set) contains 10 data points, and the second (test set) contains the rest. ... For each group, we construct a training set using 70% of the data. The rest is used for testing. |
| Hardware Specification | Yes | We run all our experiments in a Google Colab (pro+) environment with a single A100 GPU (40 GB) and 52 GB RAM. |
| Software Dependencies | No | The paper mentions using specific LLMs (Mistral-7B-instruct, Llama3-8B-instruct, GPT-3.5-turbo), which are models, but does not list specific programming languages or libraries with their version numbers required to replicate the experiments. |
| Experiment Setup | Yes | For each annotator, we randomly split their annotations into two sets: the first set (training set) contains 10 data points, and the second (test set) contains the rest. ... We define a hyperparameter (k) that acts as a threshold. If, for a given annotator, the F1-score is less than k, we mark them as inconsistent and remove them from the dataset. ... For each group, we construct a training set using 70% of the data. The rest is used for testing. For each test instance, we randomly sample 15 examples from the training set and use them as in-context examples. |
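The experiment-setup row describes a filtering loop: split each annotator's labels into a 10-item in-context pool and a held-out set, score the model's agreement with the annotator by F1, and drop annotators scoring below a threshold k. A minimal sketch of that loop follows; `predict` is a hypothetical stand-in for the paper's LLM call, and the helper names and default k are illustrative, not taken from the ARTICLE codebase.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    # Binary F1 for the positive class, computed from precision and recall.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def filter_inconsistent(annotations, predict, k=0.5, n_train=10, seed=0):
    """Keep annotators whose in-context predictions reach F1 >= k.

    annotations: {annotator_id: [(text, label), ...]}
    predict: callable(in_context_examples, text) -> label
             (stands in for prompting an LLM with the annotator's examples)
    """
    rng = random.Random(seed)
    reliable = {}
    for ann_id, items in annotations.items():
        items = items[:]
        rng.shuffle(items)
        # 10 data points serve as the in-context pool; the rest are held out.
        train, test = items[:n_train], items[n_train:]
        y_true = [label for _, label in test]
        y_pred = [predict(train, text) for text, _ in test]
        if f1_score(y_true, y_pred) >= k:
            reliable[ann_id] = items  # annotator judged consistent
    return reliable
```

A trivial `predict` (e.g. the majority label of the in-context pool) is enough to exercise the loop; in the paper this role is played by an LLM such as Mistral-7B-instruct, Llama3-8B-instruct, or GPT-3.5-turbo.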