Conformal prediction under ambiguous ground truth

Authors: David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, Arnaud Doucet

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. P_vote under-covers expert annotations: calibrated for 72% coverage, it falls short by on average 10%; our Monte Carlo CP closes this gap both empirically and theoretically. We present experiments on skin condition classification following (Liu et al., 2020), multi-label classification, and data augmentation before concluding in Section 5.
Researcher Affiliation | Industry | David Stutz¹, Abhijit Guha Roy², Tatiana Matejovicova¹, Patricia Strachan², Ali Taylan Cemgil¹, Arnaud Doucet¹; ¹Google DeepMind, ²Google; EMAIL
Pseudocode | Yes | Algorithm 1: Monte Carlo CP with 1 − α coverage guarantee for m = 1 and 1 − 2α for m ≥ 2. Input: calibration examples (X_i, λ_i) for i ∈ [n]; test example X; confidence level α; number of samples m. Output: prediction set C(X) for test example X. [...] Algorithm 2: ECDF Monte Carlo CP with (1 − α)(1 − δ) coverage guarantee. Input: calibration examples (X_i, λ_i) for i ∈ [n]; test example X; confidence levels α, δ; data split 1 ≤ l ≤ n − 1; number of samples m. Output: prediction set C(X) for test example X.
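The Monte Carlo CP recipe quoted above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: the conformity score (1 minus the predicted probability of the sampled label) and the conservative quantile are standard split-conformal choices assumed here, and `monte_carlo_cp` is a hypothetical name.

```python
import numpy as np

def monte_carlo_cp(cal_probs, cal_label_dists, test_probs, alpha, m, rng):
    """Sketch of Monte Carlo CP: sample m plausible labels per calibration
    example from its annotator distribution lambda_i, then run split
    conformal prediction on the pooled n*m conformity scores."""
    n = len(cal_probs)
    scores = []
    for p, lam in zip(cal_probs, cal_label_dists):
        ys = rng.choice(len(lam), size=m, p=lam)  # m sampled plausible labels
        scores.extend(1.0 - p[ys])                # score = 1 - prob of sampled label
    scores = np.sort(scores)
    # Conservative ceil((n*m + 1)(1 - alpha)) quantile of the pooled scores.
    k = int(np.ceil((n * m + 1) * (1 - alpha)))
    tau = scores[min(k, n * m) - 1]
    # Prediction set: every label whose score does not exceed the threshold.
    return [np.nonzero(1.0 - p <= tau)[0] for p in test_probs]
```

For m = 1 this reduces to ordinary split CP on one sampled label per calibration example; pooling m > 1 dependent samples per example is what motivates the weaker 1 − 2α guarantee stated for Algorithm 1.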
Open Source Code | No | The paper does not contain any explicit statement about providing source code or a link to a code repository for the methodology described.
Open Datasets | No | In the main case study of this paper, we follow (Liu et al., 2020; Stutz et al., 2023) and consider a very ambiguous as well as safety-critical application in dermatology: skin condition classification from multiple images. We use the dataset of Liu et al. (2020) consisting of 1949 test examples and 419 classes [...]. The de-identified dermatology data used in this paper is not publicly available due to restrictions in the data-sharing agreements.
Dataset Splits | Yes | We randomly split the examples in two halves for calibration and testing. In Figure 3 (bottom), we plot the empirical coverage, i.e., the fraction of test examples for which (a) the true label (blue) or (b) the voted label (green) is included in the prediction set. [...] The true CDF is unknown, but we can split the original calibration examples into X_1, ..., X_l and X_{l+1}, ..., X_n and use the second split to obtain an empirical estimate F̂ of F.
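The splitting and ECDF-estimation steps quoted above are mechanical; the helpers below are a hypothetical sketch (function names and score layout are assumptions, not taken from the paper):

```python
import numpy as np

def random_half_split(n, rng):
    """Randomly split indices 0..n-1 into calibration and test halves."""
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2:]

def empirical_coverage(pred_sets, labels):
    """Fraction of test examples whose label lies in its prediction set."""
    return float(np.mean([y in s for s, y in zip(pred_sets, labels)]))

def ecdf(held_out_scores):
    """Empirical CDF F_hat built from a held-out split of scores,
    mirroring the X_{l+1}, ..., X_n split described in the paper."""
    s = np.sort(np.asarray(held_out_scores))
    return lambda t: np.searchsorted(s, t, side="right") / len(s)
```

Here `empirical_coverage` corresponds to the blue (true label) and green (voted label) curves of Figure 3, and `ecdf` to the empirical estimate F̂ of F.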
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers.
Experiment Setup | Yes | We chose a coverage level of 1 − α = 73% for our experiments (with results for α = 0.1 in the appendix) to stay comparable to the base model. [...] We trained 10 multi-layer perceptrons with 100 hidden units, one per digit, to determine whether the digit is present in the image. This simple classifier achieves 58.8% aggregated coverage when thresholding the 10 individual sigmoids at 0.5.
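The 58.8% baseline can be computed mechanically once the per-digit sigmoids are available. A minimal sketch, assuming "aggregated coverage" means that every digit actually present in the image must be predicted present (an interpretation for illustration, not a definition taken from the paper):

```python
import numpy as np

def aggregated_coverage(sigmoids, present, threshold=0.5):
    """Threshold each per-digit sigmoid at `threshold`; count an example
    as covered only if every digit actually present is predicted present.
    `sigmoids`: (n, d) float array; `present`: (n, d) boolean array."""
    preds = sigmoids >= threshold
    covered = np.all(preds | ~present, axis=1)  # present implies predicted
    return float(covered.mean())
```

The function is width-agnostic, so the same sketch applies to the 10-digit setting described above.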