Improving Label Error Detection and Elimination with Uncertainty Quantification
Authors: Johannes Jakubik, Michael Vössing, Manil Maskey, Christopher Wölfle, Gerhard Satzger
JAIR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We comprehensively evaluate our algorithms on four image classification benchmark datasets in two stages. In the first stage, we demonstrate that our UQ-LED algorithms outperform state-of-the-art confident learning in identifying label errors. In the second stage, we show that removing all identified errors from the training data based on our approach results in higher accuracies than training on all available labeled data. |
| Researcher Affiliation | Collaboration | Johannes Jakubik, Karlsruhe Institute of Technology / IBM Research Europe, Switzerland; Michael Voessing, Karlsruhe Institute of Technology / IBM Germany, Germany; Manil Maskey, NASA Marshall Space Flight Center, US; Christopher Wölfle, Karlsruhe Institute of Technology, Germany; Gerhard Satzger, Karlsruhe Institute of Technology, Germany |
| Pseudocode | No | The paper describes algorithms such as CL-MCD, CL-MCD + Entropy, CL-MCD-Ensemble, and Algorithm Ensemble in Section 3 and the noise generation in Section 4, but does not present them in structured pseudocode or algorithm blocks; Figures 3 and 4 are overviews, not pseudocode. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, a direct link to a code repository, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | We make use of the prominent datasets MNIST (LeCun et al. 2010), CIFAR-10 (Krizhevsky 2009), CIFAR-100 (Krizhevsky 2009), and Tiny-ImageNet (Le and Yang 2015). |
| Dataset Splits | Yes | For all datasets, we split the non-test data into 80% training and 20% validation data. During the training of all models, we drop the last partial batch of each epoch. According to (Northcutt, Jiang, et al. 2021), this improves stability by avoiding weight updates from just a few noisy samples. Noise Generation Before obtaining out-of-sample predicted probabilities from the MNIST, CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets, we add synthetic asymmetric label noise to each dataset based on the method proposed above. For the noise generation, we leverage the previously pretrained models. After pre-training, we let each model predict on its test set and use the softmax probabilities to calculate the class-specific similarity scores and flipping probabilities per dataset. Finally, using the calculated similarity scores, we generate the noise transition matrices for the three different noise rates τ₁ = 0.05, τ₂ = 0.1, and τ₃ = 0.2 for each dataset. Cross-Validation After the label noise has been added to each training set, we conduct a four-fold cross-validation to obtain out-of-sample probabilities. For the cross-validation, we use the same model and training settings per dataset as described above. To obtain out-of-sample predicted MCD probabilities during the cross-validation, we conduct F = 5 forward passes. After the out-of-sample predicted softmax and MCD probabilities are obtained via the cross-validation, we identify label errors in the training set using the different algorithms and measure their label error detection performance. |
| Hardware Specification | Yes | Every model is trained on an NVIDIA Tesla V100. |
| Software Dependencies | No | All code is developed in PyTorch (Paszke et al. 2019), and PyTorch Lightning (Falcon 2019), which is a high-level interface for PyTorch simplifying many repetitive tasks, like monitoring. The paper mentions PyTorch and PyTorch Lightning along with their publication years, but does not specify exact version numbers for these software packages, which is required for reproducibility. |
| Experiment Setup | Yes | Specifically, we use a VGG-11 model for the MNIST dataset and ResNet-50 models for CIFAR-10, CIFAR-100, and Tiny-ImageNet. We adapt both model architectures towards uncertainty quantification by adding five dropout layers after the five innermost encoder layers, following (Kendall et al. 2015). The dropout probability of those layers is set to the standard value of 0.5 (Srivastava et al. 2014). For the pre-training on the MNIST training set, we train the modified VGG-11 model over 15 epochs with a batch size of 256 and a learning rate of 0.01. For the pre-training on the CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets, we run three modified ResNet-50 models over 350 epochs with a batch size of 256. The learning rate scheduler reduces the initial learning rate of 0.1 after 150 epochs to 0.01 and, finally, after 250 epochs to 0.001. |
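The noise-generation step quoted above (class-similarity scores from softmax outputs → flipping probabilities → one transition matrix per noise rate τ) can be sketched roughly as follows. The function names `noise_transition_matrix` and `flip_labels` are hypothetical, and allocating the off-diagonal mass proportionally to similarity is an assumption about how the paper's similarity scores translate into flipping probabilities, since the paper does not publish code:

```python
import random

def noise_transition_matrix(similarity, tau):
    """Build a row-stochastic noise transition matrix from class-similarity
    scores: each row keeps probability 1 - tau on the true class and spreads
    the remaining tau over the other classes proportionally to similarity
    (assumption; the paper does not spell out the exact allocation)."""
    k = len(similarity)
    T = [[0.0] * k for _ in range(k)]
    for i in range(k):
        # Off-diagonal similarity mass for class i (self-similarity excluded).
        off = [similarity[i][j] if j != i else 0.0 for j in range(k)]
        total = sum(off)
        for j in range(k):
            T[i][j] = (1.0 - tau) if i == j else tau * off[j] / total
    return T

def flip_labels(labels, T, seed=0):
    """Sample a synthetic noisy label for each clean label according to T."""
    rng = random.Random(seed)
    return [rng.choices(range(len(T)), weights=T[y], k=1)[0] for y in labels]
```

With the paper's τ₁ = 0.05, a 10-class row would keep 0.95 on the diagonal and distribute 0.05 across the nine most-similar classes.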
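The Monte Carlo dropout (MCD) step, with its F = 5 stochastic forward passes, reduces to averaging the F softmax vectors into one predictive distribution per sample; entropy over that distribution is the uncertainty signal the CL-MCD + Entropy variant relies on. A minimal sketch, with hypothetical helper names and plain lists standing in for model outputs:

```python
import math

def mcd_predictive_distribution(forward_passes):
    """Average the softmax outputs of F dropout-enabled forward passes
    (one list of class probabilities per pass) into a single predictive
    distribution for the sample."""
    F = len(forward_passes)
    k = len(forward_passes[0])
    return [sum(p[c] for p in forward_passes) / F for c in range(k)]

def predictive_entropy(p):
    """Entropy of the predictive distribution; higher values indicate
    samples whose labels are more likely to be erroneous."""
    return -sum(pc * math.log(pc) for pc in p if pc > 0)
```

A confident sample (e.g. all passes near one-hot) yields entropy near 0, while disagreement across passes pushes entropy toward log(k).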
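The ResNet-50 learning-rate schedule described in the setup row (0.1, dropped to 0.01 after epoch 150 and to 0.001 after epoch 250, over 350 epochs) is a plain step schedule; a sketch of the epoch-to-rate mapping, with `learning_rate` as a hypothetical name:

```python
def learning_rate(epoch):
    """Step schedule for the ResNet-50 runs: 0.1 until epoch 150,
    0.01 until epoch 250, then 0.001 through epoch 350."""
    if epoch < 150:
        return 0.1
    if epoch < 250:
        return 0.01
    return 0.001
```

In PyTorch this would presumably correspond to `torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 250], gamma=0.1)`, though the paper does not name the scheduler class it used.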