Interactive Label Cleaning with Example-based Explanations
Authors: Stefano Teso, Andrea Bontempelli, Fausto Giunchiglia, Andrea Passerini
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical evaluation shows that clarifying the reasons behind the model's suspicions by cleaning the counter-examples helps in acquiring substantially better data and models, especially when paired with our FIM approximation. We empirically address the following research questions: Q1: Do counter-examples contribute to cleaning the data? Q2: Which influence-based selection strategy identifies the most mislabeled counter-examples? Q3: What contributes to the effectiveness of the best counter-example selection strategy? |
| Researcher Affiliation | Academia | Stefano Teso, University of Trento, Trento, Italy; Andrea Bontempelli, University of Trento, Trento, Italy; Fausto Giunchiglia, University of Trento, Trento, Italy; Andrea Passerini, University of Trento, Trento, Italy |
| Pseudocode | Yes | The pseudo-code of CINCER is listed in Algorithm 1. |
| Open Source Code | Yes | The code for all experiments is available at: https://github.com/abonte/cincer. |
| Open Datasets | Yes | Data sets. We used a diverse set of classification data sets: Adult [27]: data set of 48,800 persons... Breast [27]: data set of 569 patients... 20NG [27]: data set of newsgroup posts... MNIST [29]: handwritten digit recognition data set... Fashion [30]: fashion article classification dataset... |
| Dataset Splits | Yes | For adult and breast, a random 80 : 20 training-test split is used, while for MNIST, fashion and 20NG the split provided with the data set is used. |
| Hardware Specification | Yes | All experiments were run on a 12-core machine with 16 GiB of RAM and no GPU. |
| Software Dependencies | No | We implemented CINCER using Python and TensorFlow [25] on top of three classifiers and compared different counter-example selection strategies on five data sets. |
| Experiment Setup | Yes | Upon receiving a new example, the classifier is retrained from scratch for 100 epochs using Adam [31] with default parameters, with early stopping when the accuracy on the training set reaches 90% for FC and CNN, and 70% for LR. The margin threshold is set to τ = 0.2. |
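The setup quoted in the last two rows can be illustrated with a minimal NumPy sketch. This is not the authors' code (their implementation uses TensorFlow with Adam; plain gradient descent on a toy logistic regression stands in here), and the toy data, the `fit` helper, and the assumption that the margin is the gap between the two most probable classes are illustrative only. It shows the random 80:20 split, early stopping once training accuracy reaches the target (70% for LR), and a suspicion check against the margin threshold τ = 0.2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable binary data standing in for a tabular data set.
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (X @ w_true > 0).astype(float)

# Random 80:20 training-test split, as used for adult and breast.
idx = rng.permutation(len(X))
cut = int(0.8 * len(X))
train, test = idx[:cut], idx[cut:]

def accuracy(w, X, y):
    return np.mean(((X @ w) > 0).astype(float) == y)

def fit(X, y, target_acc=0.7, max_epochs=100, lr=0.1):
    """Logistic regression trained by gradient descent, stopping early
    as soon as training accuracy reaches the target (the paper's rule)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)     # gradient step
        if accuracy(w, X, y) >= target_acc:
            break
    return w

w = fit(X[train], y[train])

def is_suspicious(probs, tau=0.2):
    """Flag an incoming example when the gap between its two most
    probable classes falls below the margin threshold tau."""
    top2 = np.sort(probs)[-2:]
    return bool(top2[1] - top2[0] < tau)
```

In this sketch the classifier retrains until its training accuracy crosses the threshold or the epoch budget of 100 runs out, mirroring the retrain-from-scratch step performed after each newly received example.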