Label Noise Robustness of Conformal Prediction

Authors: Bat-Sheva Einbinder, Shai Feldman, Stephen Bates, Anastasios N. Angelopoulos, Asaf Gendler, Yaniv Romano

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theory and experiments suggest that conformal prediction and risk-controlling techniques with noisy labels attain conservative risk over the clean ground truth labels whenever the noise is dispersive and increases variability. In other adversarial cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure achieving the correct risk of the ground truth labels without score or data regularity. (...) In this paper, we present real-data experiments indicating that conformal prediction and risk-controlling methods achieve valid risk/coverage even with access only to noisy labels. (...) Section 4. Experiments
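The claim quoted above, that calibrating on noisy labels yields conservative coverage on clean labels when the noise is dispersive, can be illustrated with a toy simulation. All distributions below are illustrative assumptions, not the paper's experimental setup: a fraction of calibration scores is replaced by stochastically larger ones, so the calibrated threshold can only grow.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cal, n_test, eps = 2000, 8000, 0.1

# Conformity scores of the true label; smaller is better. A dispersive
# flip replaces a fraction eps of calibration scores with stochastically
# larger ones, which pushes the calibrated quantile upward.
clean_cal = rng.uniform(size=n_cal)
noisy_cal = np.where(rng.random(n_cal) < eps,
                     rng.uniform(0.5, 1.0, size=n_cal), clean_cal)

# Split-conformal quantile at target coverage 90%.
level = np.ceil((n_cal + 1) * 0.9) / n_cal
qhat_noisy = np.quantile(noisy_cal, level, method="higher")

# Coverage measured against clean test scores comes out conservative
# (at or above the nominal 90%).
clean_test = rng.uniform(size=n_test)
coverage = (clean_test <= qhat_noisy).mean()
```

Running this gives a coverage slightly above the nominal 90%, mirroring the conservative behavior the paper reports.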
Researcher Affiliation | Academia | Bat-Sheva Einbinder (EMAIL), Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology; Shai Feldman (EMAIL), Department of Computer Science, Technion - Israel Institute of Technology; Stephen Bates (EMAIL), Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology; Anastasios N. Angelopoulos (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley; Asaf Gendler (EMAIL), Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology; Yaniv Romano (EMAIL), Departments of Electrical and Computer Engineering and of Computer Science, Technion - Israel Institute of Technology
Pseudocode | Yes | Recipe 1 (Conformal prediction with noisy labels) 1. Consider i.i.d. data points (X1, Y1), ..., (Xn, Yn), (Xtest, Ytest), a corruption model g : Y → Y, and a score function s : X × Y → R. (...) Algorithm 1: Conformal risk control (...) Algorithm 2: Adaptive Conformal Inference (...) Algorithm 3: Rolling Risk Control
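The split-conformal step underlying Recipe 1 can be sketched as follows. This is a generic illustration of conformal calibration with an arbitrary score function, not the authors' implementation; the function names are ours.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected quantile of calibration scores s(X_i, Y_i)."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

def prediction_sets(test_scores, qhat):
    """For each test point, keep every candidate label with score <= qhat."""
    return [np.flatnonzero(row <= qhat) for row in test_scores]

# Toy usage: 500 calibration scores, then 10-class scores for 5 test points.
rng = np.random.default_rng(0)
qhat = conformal_quantile(rng.uniform(size=500), alpha=0.1)
sets = prediction_sets(rng.uniform(size=(5, 10)), qhat)
```

With noisy labels, the same routine is run on scores computed against the corrupted labels; the paper's analysis concerns when the resulting threshold remains valid for the clean ones.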
Open Source Code | Yes | Software is available online at https://github.com/bat-sheva/Conformal-Label-Noise, with all code needed to reproduce the numerical experiments.
Open Datasets | Yes | For this purpose, we use the CIFAR-10H data set, used by Peterson et al. (2019); Battleday et al. (2020); Singh et al. (2020) (...) using the Aesthetic Visual Analysis (AVA) data set, first presented by Murray et al. (2012). (...) We use the MS COCO data set (Lin et al., 2014) (...) We use the CIFAR-100N data set (Wei et al., 2022) (...) We experiment on a polyp segmentation task, pooling data from several polyp data sets: Kvasir, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. (...) We examine two real benchmarks: meps 19 and bio, used in Romano et al. (2019) (...) We experiment on a depth estimation task (Geiger et al., 2013)
Dataset Splits | Yes | We randomly select 2,000 observations from CIFAR-10H for calibration. The test set contains the remaining 8,000 samples (...) we generate a total of 60,000 data points, where 50,000 are used to fit a classifier, and the remaining ones are randomly split to form calibration and test sets, each of size 5,000. (...) Both models are trained on 34,000 noisy samples, calibrated on 7,778 noisy holdout points, and tested on 7,778 clean samples. (...) We fit a TResNet (Ridnik et al., 2021) model on 100K clean samples and calibrate it using 10^5 noisy samples (...) We fit a TResNet (Ridnik et al., 2021) model on 40K noisy samples and calibrate it using 2K noisy samples (...) We fit a TResNet (Ridnik et al., 2021) model on 100K clean samples and calibrate it using 10K noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2). We control the false-negative rate (FNR), defined in (7), at different levels and measure the FNR obtained over clean and noisy versions of the test set, which contains 30K samples. (...) We use PraNet (Fan et al., 2020) as a base model and fit it over 1,450 noisy samples. Then, we calibrate it using 500 noisy samples (...) For each data set and nominal risk level α, we fit a quantile regression model on 12K samples and learn the α/2 and 1 - α/2 conditional quantiles of the noisy labels. Then, we calibrate its outputs using another 12K samples of the data with conformal risk control (...) Finally, we evaluate the performance on the test set, which consists of 6K samples. (...) We use LeReS (Yin et al., 2021) as a base model, which was pre-trained on a clean training set that corresponds to timestamps 1, ..., 6000. We continue training it and updating the calibration scheme in an online fashion on the following 2,000 timestamps. We consider these samples, indexed by 6001 to 8000, as a validation set and use it to choose the calibration scheme's hyperparameters (...) Finally, we continue the online procedure on the test samples whose indexes correspond to 8001 to 10000
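The first split quoted above (2,000 CIFAR-10H observations for calibration, the remaining 8,000 for testing) amounts to a simple random partition. A minimal sketch, with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n_cal = 10_000, 2_000   # CIFAR-10H: 2,000 calibration / 8,000 test

# Shuffle all indices once, then cut into disjoint calibration and test sets.
perm = rng.permutation(n_total)
cal_idx, test_idx = perm[:n_cal], perm[n_cal:]
```

The same pattern covers the paper's other random calibration/test splits, with the sizes adjusted per data set.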
Hardware Specification | No | The paper mentions models like "ResNet18", "VGG-16", "TResNet", and "PraNet". It also mentions a "high performance GPU-dedicated architecture" when referring to TResNet, but this describes the model's design for GPU efficiency, not the specific hardware used for the experiments. There is no explicit mention of particular GPU/CPU models, memory amounts, or other detailed computer specifications used to run the experiments.
Software Dependencies | No | The paper describes optimization methods and hyperparameters such as "SGD optimizer", "Adam optimizer", "batch size of 128", "learning rate of 0.001", and "dropout regularization", but does not list specific software libraries (e.g., PyTorch, TensorFlow) with their version numbers.
Experiment Setup | Yes | Details regarding the training procedure can be found in Appendix B.1. The fraction of majority vote labels covered is demonstrated in Figure 1. This figure shows that when using the clean calibration set, the marginal coverage is 90%, as expected. (...) We set the function û(x) of the residual magnitude score as û(x) = 1. We follow Talebi and Milanfar (2018) and take a transfer learning approach to fit the predictive model using a VGG-16 model pretrained on the ImageNet data set. Details regarding the training strategy are in Appendix B.2. (...) We train the quantile regression model for 70 epochs using the SGD optimizer with a batch size of 128 and an initial learning rate of 0.001, decayed exponentially every 20 epochs with a rate of 0.95 and a frequency of 10. We apply dropout regularization with a rate of 0.2 to avoid overfitting. We train the classic regression model for 70 epochs using the Adam optimizer with a batch size of 128 and an initial learning rate of 0.00005, decayed exponentially every 10 epochs with a rate of 0.95 and a frequency of 10. The dropout rate in this case is 0.5. (...) The training and calibration data are corrupted using the label noise models we defined earlier, with a fixed flipping probability of ϵ = 0.05. Of course, the test set is not corrupted and contains the ground truth labels. We apply conformal prediction using both the HPS and the APS score functions, with a target coverage level 1 - α of 90%. (...) We use PraNet (Fan et al., 2020) as a base model and fit it over 1,450 noisy samples. Then, we calibrate it using 500 noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2), to control the false-negative rate (FNR) from (7) at different levels. (...) We consider the original depth values given in this data set as ground truth and artificially corrupt them according to the additive noise model defined in (3) to produce noisy labels. Specifically, we add to each depth pixel an independent random noise drawn from a normal distribution with zero mean and 0.7 variance. (...) We employ the calibration scheme Rolling RC (Feldman et al., 2023b), which constructs uncertainty sets in an online setting with a valid risk guarantee in the sense of (10). We follow the experimental protocol outlined in (Feldman et al., 2023b, Section 4.2) and apply Rolling RC with exponential stretching to control the image miscoverage loss at different levels on the observed, noisy labels. We use LeReS (Yin et al., 2021) as a base model, which was pre-trained on a clean training set that corresponds to timestamps 1, ..., 6000.
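The two corruption models quoted in this row, label flipping with probability ϵ = 0.05 and additive zero-mean Gaussian noise with variance 0.7 on depth pixels, can be sketched as below. This is an illustration of the quoted setup under our own function names, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_labels(y, n_classes, eps=0.05):
    """With probability eps, replace a label by a uniformly random other class."""
    y = y.copy()
    flip = rng.random(len(y)) < eps
    # Adding an offset in {1, ..., n_classes-1} mod n_classes never returns
    # the original class, so flipped labels always change.
    y[flip] = (y[flip] + rng.integers(1, n_classes, size=flip.sum())) % n_classes
    return y

def corrupt_depth(depth, variance=0.7):
    """Additive noise model: independent zero-mean Gaussian noise per pixel."""
    return depth + rng.normal(0.0, np.sqrt(variance), size=depth.shape)

# Toy usage on placeholder data.
y_noisy = flip_labels(np.zeros(10_000, dtype=int), n_classes=10)
depth_noisy = corrupt_depth(np.zeros((8, 8)))
```

In the paper's protocol, corruption of this kind is applied to the training and calibration data while the test set keeps the ground truth labels.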