reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generalized Venn and Venn-Abers Calibration with Applications in Conformal Prediction

Authors: Lars Van Der Laan, Ahmed Alaa

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	5. Numerical experiments The utility of Venn and Venn-Abers calibration for classification and regression, as well as Venn multicalibration with the quantile loss in the context of conformal prediction (CP), has been demonstrated through synthetic and real data experiments in various works (...). In this section, we evaluate two novel instances of these methods: CP using Venn-Abers calibration with the quantile loss (Section 4.1) and Venn multicalibration for regression using the squared error loss. (...) Table 1. Metrics for each dataset: Marginal Coverage, Conditional Calibration Error (CCE), and Average Width.
Researcher Affiliation	Academia	(1) Department of Statistics, University of Washington; (2) Computational Precision Health, UC Berkeley and UCSF.
Pseudocode	Yes	Algorithm 1 Venn loss calibration; Algorithm 2 Venn-Abers loss calibration; Algorithm 3 Venn loss multicalibration
Open Source Code	Yes	Python code implementing Venn-Abers and Venn multicalibration methods for both squared error and quantile losses is available in the Venn Calibration package at the following GitHub repository: https://github.com/Larsvanderlaan/Venn_Calibration
Open Datasets	Yes	We evaluate conformal prediction intervals constructed using Venn-Abers quantile calibration on real datasets, including the Medical Expenditure Panel Survey (MEPS) dataset (Cohen et al., 2009; MEPS, 2021), as well as the Concrete, Community, STAR, Bike, and Bio datasets from Romano et al. (2019), which are available in the cqr package.
Dataset Splits	Yes	Each dataset is split into a training set (50%), a calibration set (30%), and a test set (20%).
Hardware Specification	No	The paper mentions that
Software Dependencies	No	The paper mentions the use of 'xgboost (Chen and Guestrin, 2016)' and the 'cqr package', but specific version numbers for these software components are not provided.
Experiment Setup	No	The paper states, "We implement Venn-Abers quantile calibration (VA) using absolute residual error as the conformity score and train the 1 − α quantile model f(·) of the conformity score using xgboost (Chen and Guestrin, 2016)." and "We train the model f using median regression with xgboost, such that the model is miscalibrated for the mean when the outcomes are skewed." While it describes the models and general approach, specific hyperparameters like learning rates, batch sizes, or number of epochs for xgboost are not provided.
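
The 50% / 30% / 20% training, calibration, and test split reported under Dataset Splits could be reproduced along the following lines. This is a minimal sketch, not the authors' code: the function name `split_indices`, the shuffling step, and the seed are assumptions, since the quoted text gives only the proportions.

```python
import random


def split_indices(n, fractions=(0.5, 0.3, 0.2), seed=0):
    """Shuffle range(n) and cut it into train / calibration / test index lists.

    Hypothetical helper: only the 50/30/20 proportions come from the paper;
    shuffling and the seed are assumptions made for illustration.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # local RNG, leaves global state untouched
    n_train = round(fractions[0] * n)
    n_cal = round(fractions[1] * n)
    train = idx[:n_train]
    cal = idx[n_train:n_train + n_cal]
    test = idx[n_train + n_cal:]  # whatever remains goes to the test set
    return train, cal, test


train, cal, test = split_indices(1000)
print(len(train), len(cal), len(test))
```

Fixing the seed makes the partition repeatable across runs, which is one of the details a reader would need in order to reproduce the reported metrics exactly.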