Conformal prediction under ambiguous ground truth

Authors: David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, Arnaud Doucet

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In a case study of skin condition classification with significant disagreement among expert annotators, we show that applying CP w.r.t. P_vote under-covers expert annotations: calibrated for 72% coverage, it falls short by on average 10%; our Monte Carlo CP closes this gap both empirically and theoretically. We present experiments on skin condition classification following (Liu et al., 2020), multi-label classification, and data augmentation before concluding in Section 5.
Researcher Affiliation | Industry | David Stutz¹, Abhijit Guha Roy², Tatiana Matejovicova¹, Patricia Strachan², Ali Taylan Cemgil¹, Arnaud Doucet¹; ¹Google DeepMind, ²Google; EMAIL
Pseudocode | Yes | Algorithm 1: Monte Carlo CP with 1 − α coverage guarantee for m = 1 and 1 − 2α for m ≥ 2. Input: calibration examples (X_i, λ_i) for i ∈ [n]; test example X; confidence level α; number of samples m. Output: prediction set C(X) for test example X. [...] Algorithm 2: ECDF Monte Carlo CP with (1 − α)(1 − δ) coverage guarantee. Input: calibration examples (X_i, λ_i) for i ∈ [n]; test example X; confidence levels α, δ; data split 1 ≤ l ≤ n − 1; number of samples m. Output: prediction set C(X) for test example X.
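The Monte Carlo CP recipe quoted above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: the conformity score (1 minus the predicted probability of the sampled label) and the conservative quantile are standard split-conformal choices assumed here, and `monte_carlo_cp` is a hypothetical name.

```python
import numpy as np

def monte_carlo_cp(cal_probs, cal_label_dists, test_probs, alpha, m, rng):
    """Sketch of Monte Carlo CP: sample m plausible labels per calibration
    example from its annotator distribution lambda_i, then run split
    conformal prediction on the pooled n*m conformity scores."""
    n = len(cal_probs)
    scores = []
    for p, lam in zip(cal_probs, cal_label_dists):
        ys = rng.choice(len(lam), size=m, p=lam)  # m sampled plausible labels
        scores.extend(1.0 - p[ys])                # score = 1 - prob of sampled label
    scores = np.sort(scores)
    # Conservative ceil((n*m + 1)(1 - alpha)) quantile of the pooled scores.
    k = int(np.ceil((n * m + 1) * (1 - alpha)))
    tau = scores[min(k, n * m) - 1]
    # Prediction set: every label whose score does not exceed the threshold.
    return [np.nonzero(1.0 - p <= tau)[0] for p in test_probs]
```

For m = 1 this reduces to ordinary split CP on one sampled label per calibration example; pooling m > 1 dependent samples per example is what motivates the weaker 1 − 2α guarantee stated for Algorithm 1.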
Open Source Code | No | The paper does not contain any explicit statement about providing source code or a link to a code repository for the methodology described.
Open Datasets | No | In the main case study of this paper, we follow (Liu et al., 2020; Stutz et al., 2023) and consider a very ambiguous as well as safety-critical application in dermatology: skin condition classification from multiple images. We use the dataset of Liu et al. (2020) consisting of 1949 test examples and 419 classes [...]. The de-identified dermatology data used in this paper is not publicly available due to restrictions in the data-sharing agreements.
Dataset Splits | Yes | We randomly split the examples in two halves for calibration and testing. In Figure 3 (bottom), we plot the empirical coverage, i.e., the fraction of test examples for which (a) the true label (blue) or (b) the voted label (green) is included in the prediction set. [...] The true CDF is unknown, but we can split the original calibration examples into X_1, ..., X_l and X_{l+1}, ..., X_n and use the second split to obtain an empirical estimate F̂ of F.
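The splitting and ECDF-estimation steps quoted above are mechanical; the helpers below are a hypothetical sketch (function names and score layout are assumptions, not taken from the paper):

```python
import numpy as np

def random_half_split(n, rng):
    """Randomly split indices 0..n-1 into calibration and test halves."""
    perm = rng.permutation(n)
    return perm[: n // 2], perm[n // 2:]

def empirical_coverage(pred_sets, labels):
    """Fraction of test examples whose label lies in its prediction set."""
    return float(np.mean([y in s for s, y in zip(pred_sets, labels)]))

def ecdf(held_out_scores):
    """Empirical CDF F_hat built from a held-out split of scores,
    mirroring the X_{l+1}, ..., X_n split described in the paper."""
    s = np.sort(np.asarray(held_out_scores))
    return lambda t: np.searchsorted(s, t, side="right") / len(s)
```

Here `empirical_coverage` corresponds to the blue (true label) and green (voted label) curves of Figure 3, and `ecdf` to the empirical estimate F̂ of F.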
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers.
Experiment Setup | Yes | We chose a coverage level of 1 − α = 73% for our experiments (with results for α = 0.1 in the appendix) to stay comparable to the base model. [...] We trained 10 multi-layer perceptrons with 100 hidden units, one per digit, to determine whether the digit is present in the image. This simple classifier achieves 58.8% aggregated coverage when thresholding the 10 individual sigmoids at 0.5.
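The 58.8% baseline can be computed mechanically once the per-digit sigmoids are available. A minimal sketch, assuming "aggregated coverage" means that every digit actually present in the image must be predicted present (an interpretation for illustration, not a definition taken from the paper):

```python
import numpy as np

def aggregated_coverage(sigmoids, present, threshold=0.5):
    """Threshold each per-digit sigmoid at `threshold`; count an example
    as covered only if every digit actually present is predicted present.
    `sigmoids`: (n, d) float array; `present`: (n, d) boolean array."""
    preds = sigmoids >= threshold
    covered = np.all(preds | ~present, axis=1)  # present implies predicted
    return float(covered.mean())
```

The function is width-agnostic, so the same sketch applies to the 10-digit setting described above.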