Conformal prediction under ambiguous ground truth
Authors: David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, Arnaud Doucet
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. P_vote under-covers expert annotations: calibrated for 72% coverage, it falls short by 10% on average; our Monte Carlo CP closes this gap both empirically and theoretically. We present experiments on skin condition classification following (Liu et al., 2020), multi-label classification and data augmentation before concluding in Section 5. |
| Researcher Affiliation | Industry | David Stutz1, Abhijit Guha Roy2, Tatiana Matejovicova1, Patricia Strachan2, Ali Taylan Cemgil1, Arnaud Doucet1; 1Google DeepMind, 2Google |
| Pseudocode | Yes | Algorithm 1 Monte Carlo CP with 1 − α coverage guarantee for m = 1 and 1 − 2α for m ≥ 2. Input: Calibration examples (X_i, λ_i), i ∈ [n]; test example X; confidence level α; number of samples m. Output: Prediction set C(X) for test example X. [...] Algorithm 2 ECDF Monte Carlo CP with (1 − α)(1 − δ) coverage guarantee. Input: Calibration examples (X_i, λ_i), i ∈ [n]; test example X; confidence levels α, δ; data split 1 ≤ l ≤ n − 1; number of samples m. Output: Prediction set C(X) for test example X. |
| Open Source Code | No | The paper does not contain any explicit statement about providing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | In the main case study of this paper, we follow (Liu et al., 2020; Stutz et al., 2023) and consider a very ambiguous as well as safety-critical application in dermatology: skin condition classification from multiple images. We use the dataset of Liu et al. (2020) consisting of 1949 test examples and 419 classes [...]. The de-identified dermatology data used in this paper is not publicly available due to restrictions in the data-sharing agreements. |
| Dataset Splits | Yes | We randomly split the examples in two halves for calibration and testing. In Figure 3 (bottom), we plot the empirical coverage, i.e., the fraction of test examples for which (a) the true label (blue) or (b) the voted label (green) is included in the prediction set. [...] The true CDF F is unknown, but we can split the original calibration examples into X_1, ..., X_l and X_l+1, ..., X_n and use the second split to obtain an empirical estimate F̂ of F. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers. |
| Experiment Setup | Yes | We chose a coverage level of 1 − α = 73% for our experiments (with results for α = 0.1 in the appendix) to stay comparable to the base model. [...] We trained 10 multi-layer perceptrons with 100 hidden units for each digit to determine if the digit is present in the image. This simple classifier achieves 58.8% aggregated coverage when thresholding the 10 individual sigmoids at 0.5. |
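The Monte Carlo CP calibration quoted in the Pseudocode row can be sketched as follows. This is a hypothetical NumPy re-implementation, not the authors' code: the score orientation (smaller nonconformity is better) and the finite-sample quantile correction are assumptions carried over from standard split conformal prediction, with the n calibration scores replaced by n·m scores sampled from the per-example plausibilities λ_i.

```python
import numpy as np

def monte_carlo_cp_threshold(cal_scores, plausibilities, alpha, m, rng):
    """Calibrate a nonconformity threshold by sampling m labels per
    calibration example from its plausibility distribution lambda_i.

    cal_scores: (n, K) nonconformity scores s(X_i, y), one per class y.
    plausibilities: (n, K) per-example label distributions lambda_i.
    """
    n, num_classes = cal_scores.shape
    sampled = []
    for i in range(n):
        # Draw m annotator-style labels Y_ib ~ lambda_i for this example.
        labels = rng.choice(num_classes, size=m, p=plausibilities[i])
        sampled.extend(cal_scores[i, labels])
    sampled = np.asarray(sampled)
    # Conformal quantile over the pooled n*m Monte Carlo scores.
    level = min(1.0, np.ceil((n * m + 1) * (1 - alpha)) / (n * m))
    return np.quantile(sampled, level, method="higher")

def prediction_set(test_scores, tau):
    # Include every class whose nonconformity does not exceed the threshold.
    return [np.flatnonzero(s <= tau) for s in test_scores]
```

For m = 1 this reduces to standard split CP with its 1 − α guarantee; per the quoted Algorithm 1, pooling m ≥ 2 samples retains a 1 − 2α guarantee.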
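The evaluation protocol quoted in the Dataset Splits row, a random half split followed by counting how often a reference label lands in the prediction set, can be sketched with two small helpers (hypothetical names, not from the paper):

```python
import numpy as np

def empirical_coverage(pred_sets, labels):
    """Fraction of test examples whose reference label (true or voted)
    is included in the corresponding prediction set."""
    return float(np.mean([y in s for s, y in zip(pred_sets, labels)]))

def random_half_split(n, rng):
    """Randomly split n example indices into calibration and test halves."""
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2 :]
```

Running `empirical_coverage` once with true labels and once with voted labels reproduces the two curves (blue and green) described for Figure 3.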
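The 58.8% figure in the Experiment Setup row is a mechanical computation once the per-digit sigmoid outputs are available. A minimal sketch, assuming "aggregated coverage" means the thresholded multi-label prediction must exactly match the target label set (an interpretation, not confirmed by the quoted text):

```python
import numpy as np

def aggregated_coverage(sigmoids, targets, threshold=0.5):
    """Exact-match coverage for a multi-label digit classifier: an example
    counts as covered only if thresholding every per-digit sigmoid at
    `threshold` reproduces its full binary target vector."""
    preds = (np.asarray(sigmoids) >= threshold).astype(int)
    return float(np.mean(np.all(preds == np.asarray(targets), axis=1)))
```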