Learning from Disagreement: A Survey
Authors: Alexandra N. Uma, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio
JAIR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. |
| Researcher Affiliation | Academia | Alexandra N. Uma (EMAIL), Queen Mary University of London; Tommaso Fornaciari (EMAIL), Dirk Hovy (EMAIL), Università Bocconi, Milano; Silviu Paun (EMAIL), Queen Mary University of London; Barbara Plank (EMAIL), IT University of Copenhagen; Massimo Poesio (EMAIL), Queen Mary University of London |
| Pseudocode | No | The paper describes algorithms and methods but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | All datasets and models employed in this paper are freely available as supplementary materials. |
| Open Datasets | Yes | The Phrase Detectives corpus can be found at http://www.phrasedetectives.org/; The Phrase Detectives 2 corpus is freely available from the LDC and from https://github.com/dali-ambiguity; The dataset by Dumitrache et al. (2018b) is available from https://github.com/CrowdTruth/Medical-Relation-Extraction; The dataset from Snow et al. (2008) is available from http://sites.google.com/site/nlpannotations; The LabelMe dataset can be found at http://labelme.csail.mit.edu/; The CIFAR-10 dataset is available at https://www.cs.toronto.edu/~kriz/cifar.html; The dataset from Peterson et al. (2019) is available from https://github.com/jcpeterson/cifar-10h |
| Dataset Splits | Yes | The training, development, and test data respectively contain 97,040, 4,753, and 5,855 markables. (pdis dataset); In our experiments, we randomly split the 10k images with crowd annotations into training and test data (8,882 and 1,118 images respectively) to allow for ground-truth and probabilistic evaluation. We used 500 gold-labeled images from the dataset as our development set. (ic-labelme dataset); In this paper, we used the CIFAR-10H dataset, referred to here as ic-cifar10h, for training and testing using a 70:30 random split while ensuring that the number of images per class remained balanced as in the original dataset. We also used a subset of the CIFAR-10 training dataset (3k images) as our development set. |
| Hardware Specification | No | The paper describes model architectures and training processes but does not specify any particular hardware such as GPU models, CPU types, or memory amounts used for running the experiments. It mentions using 'pretrained CNN layers of the VGG-16 deep neural network' and a 'ResNet34A model' and a 'publicly available PyTorch implementation of this ResNet model', but not the hardware they ran on. |
| Software Dependencies | No | The paper mentions several software components such as a 'BERT sentence classifier', the 'Adam optimizer', a 'PyTorch implementation', and 'pretrained GloVe embeddings'. However, it does not provide specific version numbers for these components or for the programming language used in their implementation, which is necessary for reproducibility. |
| Experiment Setup | Yes | The model was always trained for 20 epochs using the Adam optimizer (Kingma and Ba, 2015) at a learning rate of 0.001, with the model with the best development F1 saved at each epoch. (POS tagging); The IS model was trained for 10 epochs with training parameters set according to Lee et al. (2018). (information status classification); The model was trained for 4 epochs using 10-fold cross-validation at a learning rate of 2e-5. (relation extraction); The model was trained for 20 epochs using 10-fold cross-validation and the Adam optimizer (Kingma and Ba, 2015) at a learning rate of 0.0001. (recognizing textual entailment); Training was carried out for 50 epochs using the Adam optimizer (Kingma and Ba, 2015) at a learning rate of 0.001. (image classification: LabelMe); We trained the model for a total of 65 epochs divided into segments of 50, 5, and 10, using a learning rate of 0.1 and decaying the learning rate by 0.0001 at the end of every segment. (image classification: CIFAR-10H) |
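The per-class balanced 70:30 split reported for the ic-cifar10h dataset amounts to a stratified random split. A minimal sketch of such a split is shown below; the function name, seed, and toy label list are illustrative assumptions, not details from the paper:

```python
import random
from collections import defaultdict

def balanced_split(labels, train_frac=0.7, seed=0):
    """Split example indices train_frac : (1 - train_frac) while keeping
    each class equally represented in train and test (stratified split).
    NOTE: illustrative sketch, not the authors' actual code."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * train_frac))
        train.extend(idxs[:cut])
        test.extend(idxs[cut:])
    return train, test

# Toy usage: 10 classes x 100 items, mirroring the balanced CIFAR-10H setup.
labels = [c for c in range(10) for _ in range(100)]
train_idx, test_idx = balanced_split(labels)
print(len(train_idx), len(test_idx))  # 700 300
```

Shuffling within each class before cutting keeps the split random while guaranteeing that every class contributes exactly `train_frac` of its items to the training set.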