ARTICLE: Annotator Reliability Through In-Context Learning

Authors: Sujan Dutta, Deepak Pandita, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhsh

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate this framework on two offensive speech datasets using multiple LLMs and compare its performance with traditional methods. Our findings indicate that ARTICLE can be used as a robust method for identifying reliable annotators, hence improving data quality. Figures 2 and 3 illustrate the performance (F1-score on the test set) at various values of k for DTR and DVOICED, respectively.
Researcher Affiliation | Academia | (1) Rochester Institute of Technology; (2) George Mason University. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper includes schematic diagrams and descriptions of its method, but no explicit pseudocode or algorithm blocks are provided.
Open Source Code | Yes | Code: https://github.com/Suji04/ARTICLE
Open Datasets | Yes | We consider two datasets on web toxicity: DTR and DVOICED. DTR contains 107,620 comments from multiple social web platforms (Twitter, Reddit, and 4chan) collectively annotated by 17,280 annotators. We sample 20,000 comments from DTR for our experiments, ensuring that each set of 20 comments is annotated by the same five annotators, thereby retaining the structure of the original dataset. DVOICED (Weerasooriya et al. 2023b) includes 2,338 YouTube comments on three major US cable news networks (Khuda Bukhsh et al. 2021) annotated by 726 annotators.
Dataset Splits | Yes | For each annotator, we randomly split their annotations into two sets: the first set (training set) contains 10 data points, and the second (test set) contains the rest. ... For each group, we construct a training set using 70% of the data. The rest is used for testing.
Hardware Specification | Yes | We run all our experiments in a Google Colab (Pro+) environment with a single A100 GPU (40 GB) and 52 GB RAM.
Software Dependencies | No | The paper names the LLMs it uses (Mistral-7B-instruct, Llama3-8B-instruct, GPT-3.5-turbo), but these are models; it does not list the programming languages or libraries, with version numbers, required to replicate the experiments.
Experiment Setup | Yes | For each annotator, we randomly split their annotations into two sets: the first set (training set) contains 10 data points, and the second (test set) contains the rest. ... We define a hyperparameter (k) that acts as a threshold. If, for a given annotator, the F1-score is less than k, we mark them as inconsistent and remove them from the dataset. ... For each group, we construct a training set using 70% of the data. The rest is used for testing. For each test instance, we randomly sample 15 examples from the training set and use them as in-context examples.
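The experiment setup above (a per-annotator 10-point train split, a held-out F1-score per annotator, and a threshold k below which annotators are dropped) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `filter_annotators`, the from-scratch `f1_score`, and the toy `rule` predictor are assumed names, and the real method would replace `predict` with an LLM prompted using the annotator's training examples as in-context demonstrations.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 computed from scratch (no external dependency)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def filter_annotators(annotations, predict, k=0.5, n_train=10, seed=0):
    """Keep annotators whose held-out F1 is at least the threshold k.

    `annotations` maps annotator id -> list of (comment, label) pairs.
    `predict(train_examples, comment) -> label` stands in for the LLM:
    in the paper, the annotator's n_train labeled examples would be fed
    to the model as in-context examples before labeling each test comment.
    """
    rng = random.Random(seed)
    reliable = {}
    for annotator, items in annotations.items():
        items = items[:]
        rng.shuffle(items)
        train, test = items[:n_train], items[n_train:]
        y_true = [label for _, label in test]
        y_pred = [predict(train, comment) for comment, _ in test]
        if f1_score(y_true, y_pred) >= k:
            reliable[annotator] = items
    return reliable
```

A toy usage: with a predictor that labels any comment containing "bad" as offensive, an annotator who follows that rule consistently survives a threshold of k = 0.8, while an annotator with inverted (inconsistent) labels is removed.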