Confident Learning: Estimating Uncertainty in Dataset Labels

Authors: Curtis Northcutt, Lu Jiang, Isaac Chuang

JAIR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, a proof is presented giving realistic sufficient conditions under which CL exactly finds label errors and exactly estimates the joint distribution of noisy and true labels. Second, experimental data are shared, showing that this CL algorithm is empirically performant on three tasks: (a) label noise estimation, (b) label error finding, and (c) learning with noisy labels, increasing ResNet accuracy on a cleaned ImageNet and outperforming seven recent highly competitive methods for learning with noisy labels on the CIFAR dataset. The results presented are reproducible with the implementation of CL algorithms, open-sourced as the cleanlab Python package. These contributions are presented beginning with the formal problem specification and notation (Section 2), then defining the algorithmic methods employed for CL (Section 3) and theoretically bounding expected behavior under ideal and noisy conditions (Section 4). Experimental benchmarks on the CIFAR, ImageNet, WebVision, and MNIST datasets, cross-comparing CL performance with that from a wide range of highly competitive approaches, including INCV (Chen et al., 2019), Mixup (Zhang et al., 2018), MentorNet (Jiang et al., 2018), and Co-Teaching (Han et al., 2018), are then presented in Section 5.
Researcher Affiliation | Collaboration | Curtis G. Northcutt (EMAIL), Massachusetts Institute of Technology, Department of EECS, Cambridge, MA, USA; Lu Jiang (EMAIL), Google Research, Mountain View, CA, USA; Isaac L. Chuang (EMAIL), Massachusetts Institute of Technology, Department of EECS, Department of Physics, Cambridge, MA, USA
Pseudocode | Yes | Algorithm 1 (Confident Joint) for class-conditional label noise characterization.
input: P̂, an n × m matrix of out-of-sample predicted probabilities, P̂[i][j] := p̂(ỹ = j; x, θ)
input: ỹ ∈ ℕ^n, an n × 1 array of noisy labels
procedure ConfidentJoint(P̂, ỹ):
  PART 1 (Compute thresholds)
  for j ∈ 1, ..., m do
    l ← new empty list []
    for i ∈ 1, ..., n do
      if ỹ[i] = j then append P̂[i][j] to l
    t[j] ← average(l)    ▷ may use percentile instead of average for more confidence
  PART 2 (Compute confident joint)
  C ← m × m matrix of zeros
  for i ∈ 1, ..., n do
    cnt ← 0
    for j ∈ 1, ..., m do
      if P̂[i][j] ≥ t[j] then
        cnt ← cnt + 1
        y* ← j    ▷ guess of true label
    ỹ ← ỹ[i]
    if cnt > 1 then    ▷ if label collision
      y* ← argmax_j P̂[i][j]
    if cnt > 0 then C[ỹ][y*] ← C[ỹ][y*] + 1
  output C, the m × m unnormalized counts matrix
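The pseudocode above translates naturally to NumPy. The sketch below is a hedged reading of Algorithm 1, not the cleanlab implementation: variable names are mine, it assumes every class appears at least once among the noisy labels (so each threshold is well defined), and label collisions are broken with the full-row argmax as in Part 2.

```python
import numpy as np

def confident_joint(pred_probs, noisy_labels):
    """Sketch of Algorithm 1 (Confident Joint).

    pred_probs: (n, m) out-of-sample predicted probabilities P-hat.
    noisy_labels: (n,) integer noisy labels in {0, ..., m-1}.
    Returns C, the (m, m) unnormalized counts matrix C[noisy][guessed-true].
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    n, m = pred_probs.shape

    # Part 1: threshold t[j] = average self-confidence among examples labeled j.
    thresholds = np.array(
        [pred_probs[noisy_labels == j, j].mean() for j in range(m)]
    )

    # Part 2: count each example into the confident joint.
    C = np.zeros((m, m), dtype=int)
    for i in range(n):
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if above.size == 0:
            continue  # no confident guess: the example is not counted
        if above.size == 1:
            y_star = int(above[0])
        else:
            # Label collision: fall back to the most probable class.
            y_star = int(np.argmax(pred_probs[i]))
        C[noisy_labels[i], y_star] += 1
    return C
```

Examples whose predicted probability clears no class threshold are simply dropped from the counts, which is what makes the estimate "confident" rather than a plain confusion matrix.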
Open Source Code | Yes | The results presented are reproducible with the implementation of CL algorithms, open-sourced as the cleanlab Python package. (Footnote 1: To foster future research in data cleaning and learning with noisy labels and to improve accessibility for newcomers, cleanlab is open-source and well-documented: https://github.com/cgnorthcutt/cleanlab/)
Open Datasets | Yes | We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training.
Dataset Splits | Yes | For a fair comparison, all mean accuracies in Table 7 are reported on the same held-out test set, created by splitting the Amazon Reviews dataset into a train set and test set such that every tenth example is placed in the test set and the remaining data is available for training (the Amazon Reviews 5-core dataset provides no explicit train/test split). The Amazon Reviews dataset is naturally noisy, but the fraction of noise in the dataset is estimated to be less than 4% (Northcutt et al., 2021), which makes studying the benefits of providing clean data for training challenging. To increase the percentage of noisy labels without adding synthetic noise, we subsample 1 million training examples from the train set by combining the label issues identified by all five CL methods from the original training data (244K examples) and a uniformly random subsample (766K examples) of the remaining cleaner training data.
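The every-tenth-example split described above is deterministic and easy to reproduce. A minimal sketch, assuming "every tenth" means indices 0, 10, 20, ... (the exact index convention is not stated in the quote):

```python
def tenth_split(examples):
    """Place every tenth example in the test set; the rest are train.

    Deterministic, so repeated runs produce the same held-out test set,
    which is what makes the Table 7 comparison fair.
    """
    test = examples[::10]
    train = [x for i, x in enumerate(examples) if i % 10 != 0]
    return train, test
```

Because the split depends only on position, any method evaluated later sees exactly the same held-out examples, with no seed to record.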
Hardware Specification | Yes | We benchmarked INCV using the official GitHub code on a machine with 128 GB of RAM and 4 RTX 2080 Ti GPUs. Due to memory-leak issues in the implementation (as of the February 2020 open-source release, tested on a macOS laptop with 16 GB RAM and an Ubuntu 18.04 LTS Linux server with 128 GB RAM), training frequently stopped due to out-of-memory errors. For a fair comparison, we restarted INCV training until all models completed at least 90 training epochs. For each experiment, Table S2 shows the total time required for training, epochs completed, and the associated accuracies.
Software Dependencies | No | The results presented are reproducible with the implementation of CL algorithms, open-sourced as the cleanlab Python package. [...] The built-in SGD optimizer in the open-sourced fastText library (Joulin et al., 2017) is used with settings: initial learning rate = 0.1, embedding dimension = 100, and n-gram = 3.
Experiment Setup | Yes | All models are trained using ResNet-50 with the common setting: learning rate 0.1 for epochs [0, 150), 0.01 for epochs [150, 250), 0.001 for epochs [250, 350); momentum 0.9; and weight decay 0.0001, except INCV, SCE-loss, and Co-Teaching, which are trained using their official GitHub code. Settings are copied from the kuangliu/pytorch-cifar GitHub open-source code and were not tuned by hand. We report the highest score across hyper-parameters α ∈ {1, 2, 4, 8} for Mixup and p ∈ {0.7, 0.8, 0.9} for MentorNet. For fair comparison with Co-Teaching, INCV, and MentorNet, we also train using the co-teaching approach with forget rate = 0.5 [noise fraction], and report the max accuracy of the two trained models for each method. We observe that dropping the last partial batch of each epoch during training improves stability by avoiding weight updates from, in some cases, a single noisy example. Exactly the same noisy labels are used for training all models for each column of Table 2. For our method, we fix its hyper-parameter, i.e., the number of folds in cross-validation, across different noise levels, and do not tune it on the validation set.
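The step learning-rate schedule in the setup above can be captured in a small helper (a sketch of the stated schedule only; in PyTorch the same effect comes from `MultiStepLR` with milestones [150, 250] and gamma 0.1):

```python
def lr_at_epoch(epoch):
    """Piecewise-constant schedule from the experiment setup:
    0.1 on epochs [0, 150), 0.01 on [150, 250), 0.001 on [250, 350)."""
    if epoch < 150:
        return 0.1
    if epoch < 250:
        return 0.01
    return 0.001
```

Writing the schedule as a pure function of the epoch makes it trivial to verify that every compared method saw identical learning rates at every step.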