Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Authors: Konstantina Bairaktari, Jiayun Wu, Zhiwei Steven Wu

ICML 2025

Reproducibility variables, assessment results, and supporting LLM responses:
Research Type: Experimental. "We empirically evaluate the conditional coverage of Kandinsky conformal prediction on real-world tasks with natural groups: income prediction across US states (Ding et al., 2021) and toxic comment detection across demographic groups (Borkan et al., 2019; Koh et al., 2021). The data is divided into a training set for learning the base predictor, a calibration set for learning the conformal predictor, and a test set for evaluation. We repeat all experiments 100 times with reshuffled calibration and test sets."
Researcher Affiliation: Academia. "1: Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. 2: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. Correspondence to: Konstantina Bairaktari <EMAIL>, Jiayun Wu <EMAIL>, Zhiwei Steven Wu <EMAIL>."
Pseudocode: Yes. "Algorithm 1: Quantile Regression of Kandinsky CP. Algorithm 2: Prediction Set Function of Kandinsky CP."
Open Source Code: No. The paper does not explicitly provide a link to source code, nor does it state that code will be made publicly available.
Open Datasets: Yes. "We empirically evaluate the conditional coverage of Kandinsky conformal prediction on real-world tasks with natural groups: income prediction across US states (Ding et al., 2021) and toxic comment detection across demographic groups (Borkan et al., 2019; Koh et al., 2021). C.1. ACSIncome: We preprocess the dataset following Liu et al. (2023). C.2. Civil Comments: Following Koh et al. (2021), we split the dataset into..."
Dataset Splits: Yes. "The data is divided into a training set for learning the base predictor, a calibration set for learning the conformal predictor, and a test set for evaluation. We train the base Gradient Boosting Tree regressor on 31,000 samples with 10,000 from each state. The calibration set contains 4,000 samples per state and the test set contains 2,000 samples per state. Following Koh et al. (2021), we split the dataset into 269,038 training samples and 178,962 samples for calibration and test."
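The per-state calibration/test split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic data, the number of states, and the scaled-down per-state counts (400 calibration / 200 test instead of 4,000 / 2,000) are all assumptions made so the sketch runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for ACSIncome-style data: 3 "states", 600 samples each.
n_per_state = 600
states = np.repeat(np.arange(3), n_per_state)
X = rng.normal(size=(states.size, 5))
y = rng.normal(size=states.size)

# Draw a fixed number of calibration and test samples from each state,
# mirroring the paper's per-state split (here scaled down to 400 / 200).
n_cal, n_test = 400, 200
cal_idx, test_idx = [], []
for s in np.unique(states):
    idx = rng.permutation(np.flatnonzero(states == s))
    cal_idx.extend(idx[:n_cal])
    test_idx.extend(idx[n_cal:n_cal + n_test])

X_cal, y_cal = X[cal_idx], y[cal_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Reshuffling `cal_idx`/`test_idx` across repetitions (the paper reports 100 repeats) amounts to redrawing the permutation with a new seed.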
Hardware Specification: No. The paper does not provide specific hardware details such as GPU or CPU models used for running experiments.
Software Dependencies: No. "We use Histogram-based Gradient Boosting Tree through the implementation of scikit-learn (Pedregosa et al., 2011). We finetune a DistilBERT-base-uncased model with a classification head on the training set, following the configurations of Koh et al. (2021)." No version numbers are specified for these software packages.
Experiment Setup: Yes. "We apply default hyperparameters suggested by scikit-learn except that we set max_iter to 250. We finetune a DistilBERT-base-uncased model with a classification head on the training set, following the configurations of Koh et al. (2021)."