Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Authors: Konstantina Bairaktari, Jiayun Wu, Zhiwei Steven Wu

ICML 2025

Reproducibility variables, assessment results, and supporting LLM responses:
Research Type: Experimental. "We empirically evaluate the conditional coverage of Kandinsky conformal prediction on real-world tasks with natural groups: income prediction across US states (Ding et al., 2021) and toxic comment detection across demographic groups (Borkan et al., 2019; Koh et al., 2021). The data is divided into a training set for learning the base predictor, a calibration set for learning the conformal predictor, and a test set for evaluation. We repeat all experiments 100 times with reshuffled calibration and test sets."
Researcher Affiliation: Academia. "1: Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. 2: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Konstantina Bairaktari <EMAIL>, Jiayun Wu <EMAIL>, Zhiwei Steven Wu <EMAIL>."
Pseudocode: Yes. "Algorithm 1: Quantile Regression of Kandinsky CP. Algorithm 2: Prediction Set Function of Kandinsky CP."
Open Source Code: No. The paper does not explicitly provide a link to source code, nor does it state that code will be made publicly available.
Open Datasets: Yes. "We empirically evaluate the conditional coverage of Kandinsky conformal prediction on real-world tasks with natural groups: income prediction across US states (Ding et al., 2021) and toxic comment detection across demographic groups (Borkan et al., 2019; Koh et al., 2021). C.1. ACSIncome: We preprocess the dataset following Liu et al. (2023). C.2. Civil Comments: Following Koh et al. (2021), we split the dataset into..."
Dataset Splits: Yes. "The data is divided into a training set for learning the base predictor, a calibration set for learning the conformal predictor, and a test set for evaluation. We train the base Gradient Boosting Tree regressor on 31,000 samples with 10,000 from each state. The calibration set contains 4,000 samples per state and the test set contains 2,000 samples per state. Following Koh et al. (2021), we split the dataset into 269,038 training samples and 178,962 samples for calibration and test."
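The per-state calibration/test split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic data, the number of states, and the scaled-down per-state counts (400 calibration / 200 test instead of 4,000 / 2,000) are all assumptions made so the sketch runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for ACSIncome-style data: 3 "states", 600 samples each.
n_per_state = 600
states = np.repeat(np.arange(3), n_per_state)
X = rng.normal(size=(states.size, 5))
y = rng.normal(size=states.size)

# Draw a fixed number of calibration and test samples from each state,
# mirroring the paper's per-state split (here scaled down to 400 / 200).
n_cal, n_test = 400, 200
cal_idx, test_idx = [], []
for s in np.unique(states):
    idx = rng.permutation(np.flatnonzero(states == s))
    cal_idx.extend(idx[:n_cal])
    test_idx.extend(idx[n_cal:n_cal + n_test])

X_cal, y_cal = X[cal_idx], y[cal_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Reshuffling `cal_idx`/`test_idx` across repetitions (the paper reports 100 repeats) amounts to redrawing the permutation with a new seed.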
Hardware Specification: No. The paper does not provide specific hardware details such as GPU or CPU models used for running experiments.
Software Dependencies: No. "We use Histogram-based Gradient Boosting Tree through the implementation of scikit-learn (Pedregosa et al., 2011). We finetune a DistilBERT-base-uncased model with a classification head on the training set, following the configurations of Koh et al. (2021)." No version numbers are specified for these software packages.
Experiment Setup: Yes. "We apply default hyperparameters suggested by scikit-learn except that we set max_iter to 250. We finetune a DistilBERT-base-uncased model with a classification head on the training set, following the configurations of Koh et al. (2021)."