ARTICLE: Annotator Reliability Through In-Context Learning
Authors: Sujan Dutta, Deepak Pandita, Tharindu Cyril Weerasooriya, Marcos Zampieri, Christopher M. Homan, Ashiqur R. KhudaBukhsh
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate this framework on two offensive speech datasets using multiple LLMs and compare its performance with traditional methods. Our findings indicate that ARTICLE can be used as a robust method for identifying reliable annotators, hence improving data quality. Figures 2 and 3 illustrate the performance (F1-score on the test set) at various values of k for DTR and DVOICED, respectively. |
| Researcher Affiliation | Academia | 1 Rochester Institute of Technology, 2 George Mason University |
| Pseudocode | No | The paper includes schematic diagrams and descriptions of its method, but no explicit pseudocode or algorithm blocks are provided. |
| Open Source Code | Yes | Code https://github.com/Suji04/ARTICLE |
| Open Datasets | Yes | We consider two datasets on web toxicity: DTR and DVOICED. DTR contains 107,620 comments from multiple social web platforms (Twitter, Reddit, and 4chan) collectively annotated by 17,280 annotators. We sample 20,000 comments from DTR for our experiments, ensuring that each set of 20 comments is annotated by the same five annotators, thereby retaining the structure of the original dataset. DVOICED (Weerasooriya et al. 2023b) includes 2,338 YouTube comments on three major US cable news networks (KhudaBukhsh et al. 2021) annotated by 726 annotators. |
| Dataset Splits | Yes | For each annotator, we randomly split their annotations into two sets: the first set (training set) contains 10 data points, and the second (test set) contains the rest. ... For each group, we construct a training set using 70% of the data. The rest is used for testing. |
| Hardware Specification | Yes | We run all our experiments in a Google Colab (pro+) environment with a single A100 GPU (40 GB) and 52 GB RAM. |
| Software Dependencies | No | The paper mentions using specific LLMs (Mistral-7B-instruct, Llama3-8B-instruct, GPT-3.5-turbo), which are models, but does not list specific programming languages or libraries with their version numbers required to replicate the experiments. |
| Experiment Setup | Yes | For each annotator, we randomly split their annotations into two sets: the first set (training set) contains 10 data points, and the second (test set) contains the rest. ... We define a hyperparameter (k) that acts as a threshold. If, for a given annotator, the F1-score is less than k, we mark them as inconsistent and remove them from the dataset. ... For each group, we construct a training set using 70% of the data. The rest is used for testing. For each test instance, we randomly sample 15 examples from the training set and use them as in-context examples. |
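The experiment-setup row describes a filtering loop: split each annotator's labels into a 10-item in-context pool and a held-out set, score the model's agreement with the annotator by F1, and drop annotators scoring below a threshold k. A minimal sketch of that loop follows; `predict` is a hypothetical stand-in for the paper's LLM call, and the helper names and default k are illustrative, not taken from the ARTICLE codebase.

```python
import random


def f1_score(y_true, y_pred, positive=1):
    # Binary F1 for the positive class, computed from precision and recall.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def filter_inconsistent(annotations, predict, k=0.5, n_train=10, seed=0):
    """Keep annotators whose in-context predictions reach F1 >= k.

    annotations: {annotator_id: [(text, label), ...]}
    predict: callable(in_context_examples, text) -> label
             (stands in for prompting an LLM with the annotator's examples)
    """
    rng = random.Random(seed)
    reliable = {}
    for ann_id, items in annotations.items():
        items = items[:]
        rng.shuffle(items)
        # 10 data points serve as the in-context pool; the rest are held out.
        train, test = items[:n_train], items[n_train:]
        y_true = [label for _, label in test]
        y_pred = [predict(train, text) for text, _ in test]
        if f1_score(y_true, y_pred) >= k:
            reliable[ann_id] = items  # annotator judged consistent
    return reliable
```

A trivial `predict` (e.g. the majority label of the in-context pool) is enough to exercise the loop; in the paper this role is played by an LLM such as Mistral-7B-instruct, Llama3-8B-instruct, or GPT-3.5-turbo.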