Label Noise Robustness of Conformal Prediction

Authors: Bat-Sheva Einbinder, Shai Feldman, Stephen Bates, Anastasios N. Angelopoulos, Asaf Gendler, Yaniv Romano

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theory and experiments suggest that conformal prediction and risk-controlling techniques with noisy labels attain conservative risk over the clean ground truth labels whenever the noise is dispersive and increases variability. In other adversarial cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure achieving the correct risk of the ground truth labels without score or data regularity. (...) In this paper, we present real-data experiments indicating that conformal prediction and risk-controlling methods achieve valid risk/coverage even with access only to noisy labels. (...) Section 4. Experiments
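The claim quoted above, that calibrating on noisy labels yields conservative coverage on clean labels when the noise is dispersive, can be illustrated with a toy simulation. All distributions below are illustrative assumptions, not the paper's experimental setup: a fraction of calibration scores is replaced by stochastically larger ones, so the calibrated threshold can only grow.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cal, n_test, eps = 2000, 8000, 0.1

# Conformity scores of the true label; smaller is better. A dispersive
# flip replaces a fraction eps of calibration scores with stochastically
# larger ones, which pushes the calibrated quantile upward.
clean_cal = rng.uniform(size=n_cal)
noisy_cal = np.where(rng.random(n_cal) < eps,
                     rng.uniform(0.5, 1.0, size=n_cal), clean_cal)

# Split-conformal quantile at target coverage 90%.
level = np.ceil((n_cal + 1) * 0.9) / n_cal
qhat_noisy = np.quantile(noisy_cal, level, method="higher")

# Coverage measured against clean test scores comes out conservative
# (at or above the nominal 90%).
clean_test = rng.uniform(size=n_test)
coverage = (clean_test <= qhat_noisy).mean()
```

Running this gives a coverage slightly above the nominal 90%, mirroring the conservative behavior the paper reports.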
Researcher Affiliation | Academia | Bat-Sheva Einbinder (EMAIL), Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology; Shai Feldman (EMAIL), Department of Computer Science, Technion - Israel Institute of Technology; Stephen Bates (EMAIL), Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology; Anastasios N. Angelopoulos (EMAIL), Department of Electrical Engineering and Computer Science, University of California, Berkeley; Asaf Gendler (EMAIL), Department of Electrical and Computer Engineering, Technion - Israel Institute of Technology; Yaniv Romano (EMAIL), Departments of Electrical and Computer Engineering and of Computer Science, Technion - Israel Institute of Technology
Pseudocode | Yes | Recipe 1 (Conformal prediction with noisy labels) 1. Consider i.i.d. data points (X1, Y1), ..., (Xn, Yn), (Xtest, Ytest), a corruption model g : Y → Y, and a score function s : X × Y → R. (...) Algorithm 1: Conformal risk control (...) Algorithm 2: Adaptive Conformal Inference (...) Algorithm 3: Rolling Risk Control
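The split-conformal step underlying Recipe 1 can be sketched as follows. This is a generic illustration of conformal calibration with an arbitrary score function, not the authors' implementation; the function names are ours.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha):
    """Finite-sample-corrected quantile of calibration scores s(X_i, Y_i)."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

def prediction_sets(test_scores, qhat):
    """For each test point, keep every candidate label with score <= qhat."""
    return [np.flatnonzero(row <= qhat) for row in test_scores]

# Toy usage: 500 calibration scores, then 10-class scores for 5 test points.
rng = np.random.default_rng(0)
qhat = conformal_quantile(rng.uniform(size=500), alpha=0.1)
sets = prediction_sets(rng.uniform(size=(5, 10)), qhat)
```

With noisy labels, the same routine is run on scores computed against the corrupted labels; the paper's analysis concerns when the resulting threshold remains valid for the clean ones.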
Open Source Code | Yes | Software is available online at https://github.com/bat-sheva/Conformal-Label-Noise, with all code needed to reproduce the numerical experiments.
Open Datasets | Yes | For this purpose, we use the CIFAR-10H data set, used by Peterson et al. (2019); Battleday et al. (2020); Singh et al. (2020) (...) using the Aesthetic Visual Analysis (AVA) data set, first presented by Murray et al. (2012). (...) We use the MS COCO data set (Lin et al., 2014) (...) We use the CIFAR-100N data set (Wei et al., 2022) (...) We experiment on a polyp segmentation task, pooling data from several polyp data sets: Kvasir, CVC-ColonDB, CVC-ClinicDB, and ETIS-Larib. (...) We examine two real benchmarks: meps 19 and bio, used in Romano et al. (2019) (...) We experiment on a depth estimation task (Geiger et al., 2013)
Dataset Splits | Yes | We randomly select 2,000 observations from CIFAR-10H for calibration. The test set contains the remaining 8,000 samples (...) we generate a total of 60,000 data points, where 50,000 are used to fit a classifier, and the remaining ones are randomly split to form calibration and test sets, each of size 5,000. (...) Both models are trained on 34,000 noisy samples, calibrated on 7,778 noisy holdout points, and tested on 7,778 clean samples. (...) We fit a TResNet (Ridnik et al., 2021) model on 100K clean samples and calibrate it using 10^5 noisy samples (...) We fit a TResNet (Ridnik et al., 2021) model on 40K noisy samples and calibrate it using 2K noisy samples (...) We fit a TResNet (Ridnik et al., 2021) model on 100K clean samples and calibrate it using 10K noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2). We control the false-negative rate (FNR), defined in (7), at different levels and measure the FNR obtained over clean and noisy versions of the test set, which contains 30K samples. (...) We use PraNet (Fan et al., 2020) as a base model and fit it over 1,450 noisy samples. Then, we calibrate it using 500 noisy samples (...) For each data set and nominal risk level α, we fit a quantile regression model on 12K samples and learn the α/2 and 1 - α/2 conditional quantiles of the noisy labels. Then, we calibrate its outputs using another 12K samples of the data with conformal risk control (...) Finally, we evaluate the performance on the test set, which consists of 6K samples. (...) We use LeReS (Yin et al., 2021) as a base model, which was pre-trained on a clean training set that corresponds to timestamps 1, ..., 6000. We continue training it and updating the calibration scheme in an online fashion on the following 2,000 timestamps. We consider these samples, indexed by 6001 to 8000, as a validation set and use it to choose the calibration scheme's hyperparameters (...) Finally, we continue the online procedure on the test samples whose indexes correspond to 8001 to 10000
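The first split quoted above (2,000 CIFAR-10H observations for calibration, the remaining 8,000 for testing) amounts to a simple random partition. A minimal sketch, with illustrative variable names:

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n_cal = 10_000, 2_000   # CIFAR-10H: 2,000 calibration / 8,000 test

# Shuffle all indices once, then cut into disjoint calibration and test sets.
perm = rng.permutation(n_total)
cal_idx, test_idx = perm[:n_cal], perm[n_cal:]
```

The same pattern covers the paper's other random calibration/test splits, with the sizes adjusted per data set.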
Hardware Specification | No | The paper mentions models like "ResNet18", "VGG-16", "TResNet", and "PraNet". It also mentions a "high performance GPU-dedicated architecture" when referring to TResNet, but this describes the model's design for GPU efficiency, not the specific hardware used for the experiments. There is no explicit mention of particular GPU/CPU models, memory amounts, or other detailed computer specifications used to run the experiments.
Software Dependencies | No | The paper describes optimization methods and hyperparameters such as "SGD optimizer", "Adam optimizer", "batch size of 128", "learning rate of 0.001", and "dropout regularization", but does not list specific software libraries (e.g., PyTorch, TensorFlow) with their version numbers.
Experiment Setup | Yes | Details regarding the training procedure can be found in Appendix B.1. The fraction of majority vote labels covered is demonstrated in Figure 1. This figure shows that when using the clean calibration set, the marginal coverage is 90%, as expected. (...) We set the function û(x) of the residual magnitude score as û(x) = 1. We follow Talebi and Milanfar (2018) and take a transfer learning approach to fit the predictive model using a VGG-16 model pretrained on the ImageNet data set. Details regarding the training strategy are in Appendix B.2. (...) We train the quantile regression model for 70 epochs using the SGD optimizer with a batch size of 128 and an initial learning rate of 0.001, decayed exponentially every 20 epochs with a rate of 0.95 and a frequency of 10. We apply dropout regularization with a rate of 0.2 to avoid overfitting. We train the classic regression model for 70 epochs using the Adam optimizer with a batch size of 128 and an initial learning rate of 0.00005, decayed exponentially every 10 epochs with a rate of 0.95 and a frequency of 10. The dropout rate in this case is 0.5. (...) The training and calibration data are corrupted using the label noise models we defined earlier, with a fixed flipping probability of ϵ = 0.05. Of course, the test set is not corrupted and contains the ground truth labels. We apply conformal prediction using both the HPS and the APS score functions, with a target coverage level 1 - α of 90%. (...) We use PraNet (Fan et al., 2020) as a base model and fit it over 1,450 noisy samples. Then, we calibrate it using 500 noisy samples with conformal risk control, as outlined in (Angelopoulos et al., 2024, Section 3.2), to control the false-negative rate (FNR) from (7) at different levels. (...) We consider the original depth values given in this data set as ground truth and artificially corrupt them according to the additive noise model defined in (3) to produce noisy labels. Specifically, we add to each depth pixel an independent random noise drawn from a normal distribution with zero mean and 0.7 variance. (...) We employ the calibration scheme Rolling RC (Feldman et al., 2023b), which constructs uncertainty sets in an online setting with a valid risk guarantee in the sense of (10). We follow the experimental protocol outlined in (Feldman et al., 2023b, Section 4.2) and apply Rolling RC with exponential stretching to control the image miscoverage loss at different levels on the observed, noisy labels. We use LeReS (Yin et al., 2021) as a base model, which was pre-trained on a clean training set that corresponds to timestamps 1, ..., 6000.
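The two corruption models quoted in this row, label flipping with probability ϵ = 0.05 and additive zero-mean Gaussian noise with variance 0.7 on depth pixels, can be sketched as below. This is an illustration of the quoted setup under our own function names, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_labels(y, n_classes, eps=0.05):
    """With probability eps, replace a label by a uniformly random other class."""
    y = y.copy()
    flip = rng.random(len(y)) < eps
    # Adding an offset in {1, ..., n_classes-1} mod n_classes never returns
    # the original class, so flipped labels always change.
    y[flip] = (y[flip] + rng.integers(1, n_classes, size=flip.sum())) % n_classes
    return y

def corrupt_depth(depth, variance=0.7):
    """Additive noise model: independent zero-mean Gaussian noise per pixel."""
    return depth + rng.normal(0.0, np.sqrt(variance), size=depth.shape)

# Toy usage on placeholder data.
y_noisy = flip_labels(np.zeros(10_000, dtype=int), n_classes=10)
depth_noisy = corrupt_depth(np.zeros((8, 8)))
```

In the paper's protocol, corruption of this kind is applied to the training and calibration data while the test set keeps the ground truth labels.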