Confident Learning: Estimating Uncertainty in Dataset Labels

Authors: Curtis Northcutt, Lu Jiang, Isaac Chuang

JAIR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | First, a proof is presented giving realistic sufficient conditions under which CL exactly finds label errors and exactly estimates the joint distribution of noisy and true labels. Second, experimental data are shared, showing that this CL algorithm is empirically performant on three tasks: (a) label noise estimation, (b) label error finding, and (c) learning with noisy labels, increasing ResNet accuracy on a cleaned ImageNet and outperforming seven recent highly competitive methods for learning with noisy labels on the CIFAR dataset. The results presented are reproducible with the implementation of CL algorithms, open-sourced as the cleanlab Python package. These contributions are presented beginning with the formal problem specification and notation (Section 2), then defining the algorithmic methods employed for CL (Section 3) and theoretically bounding expected behavior under ideal and noisy conditions (Section 4). Experimental benchmarks on the CIFAR, ImageNet, WebVision, and MNIST datasets, cross-comparing CL performance with that from a wide range of highly competitive approaches, including INCV (Chen et al., 2019), Mixup (Zhang et al., 2018), MentorNet (Jiang et al., 2018), and Co-Teaching (Han et al., 2018), are then presented in Section 5.
Researcher Affiliation | Collaboration | Curtis G. Northcutt (EMAIL), Massachusetts Institute of Technology, Department of EECS, Cambridge, MA, USA; Lu Jiang (EMAIL), Google Research, Mountain View, CA, USA; Isaac L. Chuang (EMAIL), Massachusetts Institute of Technology, Department of EECS, Department of Physics, Cambridge, MA, USA
Pseudocode | Yes | Algorithm 1 (Confident Joint) for class-conditional label noise characterization.
input: P̂, an n × m matrix of out-of-sample predicted probabilities, P̂[i][j] := p̂(ỹ = j; x, θ)
input: ỹ ∈ ℕ^n, an n × 1 array of noisy labels
procedure ConfidentJoint(P̂, ỹ):
  PART 1 (Compute thresholds)
  for j ∈ 1, ..., m do
    l ← new empty list []
    for i ∈ 1, ..., n do
      if ỹ[i] = j then append P̂[i][j] to l
    t[j] ← average(l)    ▷ may use percentile instead of average for more confidence
  PART 2 (Compute confident joint)
  C ← m × m matrix of zeros
  for i ∈ 1, ..., n do
    cnt ← 0
    for j ∈ 1, ..., m do
      if P̂[i][j] ≥ t[j] then
        cnt ← cnt + 1
        y* ← j    ▷ guess of true label
    ỹ ← ỹ[i]
    if cnt > 1 then    ▷ if label collision
      y* ← argmax_j P̂[i][j]
    if cnt > 0 then C[ỹ][y*] ← C[ỹ][y*] + 1
  output C, the m × m unnormalized counts matrix
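The pseudocode above translates naturally to NumPy. The sketch below is a hedged reading of Algorithm 1, not the cleanlab implementation: variable names are mine, it assumes every class appears at least once among the noisy labels (so each threshold is well defined), and label collisions are broken with the full-row argmax as in Part 2.

```python
import numpy as np

def confident_joint(pred_probs, noisy_labels):
    """Sketch of Algorithm 1 (Confident Joint).

    pred_probs: (n, m) out-of-sample predicted probabilities P-hat.
    noisy_labels: (n,) integer noisy labels in {0, ..., m-1}.
    Returns C, the (m, m) unnormalized counts matrix C[noisy][guessed-true].
    """
    pred_probs = np.asarray(pred_probs, dtype=float)
    noisy_labels = np.asarray(noisy_labels)
    n, m = pred_probs.shape

    # Part 1: threshold t[j] = average self-confidence among examples labeled j.
    thresholds = np.array(
        [pred_probs[noisy_labels == j, j].mean() for j in range(m)]
    )

    # Part 2: count each example into the confident joint.
    C = np.zeros((m, m), dtype=int)
    for i in range(n):
        above = np.flatnonzero(pred_probs[i] >= thresholds)
        if above.size == 0:
            continue  # no confident guess: the example is not counted
        if above.size == 1:
            y_star = int(above[0])
        else:
            # Label collision: fall back to the most probable class.
            y_star = int(np.argmax(pred_probs[i]))
        C[noisy_labels[i], y_star] += 1
    return C
```

Examples whose predicted probability clears no class threshold are simply dropped from the counts, which is what makes the estimate "confident" rather than a plain confusion matrix.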
Open Source Code | Yes | The results presented are reproducible with the implementation of CL algorithms, open-sourced as the cleanlab Python package. (Footnote 1: To foster future research in data cleaning and learning with noisy labels and to improve accessibility for newcomers, cleanlab is open-source and well-documented: https://github.com/cgnorthcutt/cleanlab/)
Open Datasets | Yes | We present sufficient conditions where CL exactly finds label errors, and show CL performance exceeding seven recent competitive approaches for learning with noisy labels on the CIFAR dataset. Uniquely, the CL framework is not coupled to a specific data modality or model (e.g., we use CL to find several label errors in the presumed error-free MNIST dataset and improve sentiment classification on text data in Amazon Reviews). We also employ CL on ImageNet to quantify ontological class overlap (e.g., estimating 645 missile images are mislabeled as their parent class projectile), and moderately increase model accuracy (e.g., for ResNet) by cleaning data prior to training.
Dataset Splits | Yes | For a fair comparison, all mean accuracies in Table 7 are reported on the same held-out test set, created by splitting the Amazon Reviews dataset into a train set and test set such that every tenth example is placed in the test set and the remaining data is available for training (the Amazon Reviews 5-core dataset provides no explicit train/test split). The Amazon Reviews dataset is naturally noisy, but the fraction of noise in the dataset is estimated to be less than 4% (Northcutt et al., 2021), which makes studying the benefits of providing clean data for training challenging. To increase the percentage of noisy labels without adding synthetic noise, we subsample 1 million training examples from the train set by combining the label issues identified by all five CL methods from the original training data (244K examples) and a uniformly random subsample (766K examples) of the remaining cleaner training data.
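The every-tenth-example split described above is deterministic and easy to reproduce. A minimal sketch, assuming "every tenth" means indices 0, 10, 20, ... (the exact index convention is not stated in the quote):

```python
def tenth_split(examples):
    """Place every tenth example in the test set; the rest are train.

    Deterministic, so repeated runs produce the same held-out test set,
    which is what makes the Table 7 comparison fair.
    """
    test = examples[::10]
    train = [x for i, x in enumerate(examples) if i % 10 != 0]
    return train, test
```

Because the split depends only on position, any method evaluated later sees exactly the same held-out examples, with no seed to record.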
Hardware Specification | Yes | We benchmarked INCV using the official GitHub code on a machine with 128 GB of RAM and 4 RTX 2080 Ti GPUs. Due to memory-leak issues in the implementation (as of the February 2020 open-source release, tested on a macOS laptop with 16 GB RAM and an Ubuntu 18.04 LTS Linux server with 128 GB RAM), training frequently stopped due to out-of-memory errors. For a fair comparison, we restarted INCV training until all models completed at least 90 training epochs. For each experiment, Table S2 shows the total time required for training, epochs completed, and the associated accuracies.
Software Dependencies | No | The results presented are reproducible with the implementation of CL algorithms, open-sourced as the cleanlab Python package. [...] The built-in SGD optimizer in the open-sourced fastText library (Joulin et al., 2017) is used with settings: initial learning rate = 0.1, embedding dimension = 100, and n-gram = 3.
Experiment Setup | Yes | All models are trained using ResNet-50 with the common setting: learning rate 0.1 for epochs [0, 150), 0.01 for epochs [150, 250), 0.001 for epochs [250, 350); momentum 0.9; and weight decay 0.0001, except INCV, SCE-loss, and Co-Teaching, which are trained using their official GitHub code. Settings are copied from the kuangliu/pytorch-cifar GitHub open-source code and were not tuned by hand. We report the highest score across hyper-parameters α ∈ {1, 2, 4, 8} for Mixup and p ∈ {0.7, 0.8, 0.9} for MentorNet. For fair comparison with Co-Teaching, INCV, and MentorNet, we also train using the co-teaching approach with forget rate = 0.5 [noise fraction], and report the max accuracy of the two trained models for each method. We observe that dropping the last partial batch of each epoch during training improves stability by avoiding weight updates from, in some cases, a single noisy example. Exactly the same noisy labels are used for training all models for each column of Table 2. For our method, we fix its hyper-parameter, i.e., the number of folds in cross-validation, across different noise levels, and do not tune it on the validation set.
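The step learning-rate schedule in the setup above can be captured in a small helper (a sketch of the stated schedule only; in PyTorch the same effect comes from `MultiStepLR` with milestones [150, 250] and gamma 0.1):

```python
def lr_at_epoch(epoch):
    """Piecewise-constant schedule from the experiment setup:
    0.1 on epochs [0, 150), 0.01 on [150, 250), 0.001 on [250, 350)."""
    if epoch < 150:
        return 0.1
    if epoch < 250:
        return 0.01
    return 0.001
```

Writing the schedule as a pure function of the epoch makes it trivial to verify that every compared method saw identical learning rates at every step.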