Rectifying Conformity Scores for Better Conditional Coverage
Authors: Vincent Plassier, Alexander Fishkov, Victor Dheur, Mohsen Guizani, Souhaib Ben Taieb, Maxim Panov, Eric Moulines
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally show that our method is highly adaptive to the local data structure and outperforms existing methods in terms of conditional coverage, improving the reliability of statistical inference in various applications. We evaluate our method on several benchmark datasets and compare it against state-of-the-art alternatives (see Section 7). Our results demonstrate improved performance, particularly in terms of conditional coverage metrics such as worst slab coverage (Romano et al., 2020) and conditional coverage error (Dheur et al., 2024). |
| Researcher Affiliation | Academia | 1Lagrange Mathematics and Computing Research Center 2Mohamed bin Zayed University of Artificial Intelligence 3Skolkovo Institute of Science and Technology 4University of Mons 5École Polytechnique. Correspondence to: Maxim Panov <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 The RCP algorithm |
| Open Source Code | Yes | The code to reproduce main experiments is available at https://github.com/stat-ml/rcp |
| Open Datasets | Yes | We use publicly available regression datasets which are also considered in (Tsoumakas et al., 2011; Feldman et al., 2023; Wang et al., 2023) and only keep datasets with at least 2 outputs and 2000 total instances. The characteristics of the datasets are summarized in Appendix C (Table 6): scm20d, rf1, scm1d (Tsoumakas et al., 2011); meps 21, meps 19, meps 20, house, bio, blog data (Feldman et al., 2023); taxi (Wang et al., 2023). |
| Dataset Splits | Yes | Each dataset is split randomly into train, calibration, and test parts. We reserve 2048 points for calibration and the remaining data is split between 70% for training and 30% for testing. Each dataset is shuffled and split 10 times to replicate the experiment. One fifth of the train dataset is reserved for early stopping. |
| Hardware Specification | Yes | All methods are run on CPU (AMD Ryzen Threadripper PRO 5965WX) with 6 CPU threads per experiment. |
| Software Dependencies | No | The paper discusses various models and optimizers (e.g., 'Adam optimizer', 'ReLU activations') but does not provide specific version numbers for programming languages or libraries (e.g., Python, PyTorch, TensorFlow, scikit-learn). |
| Experiment Setup | Yes | All our models are based on a fully connected neural network of three hidden layers with 100 neurons in each layer and ReLU activations. We consider three types of base models with appropriate output layers and loss functions... Training is performed with the Adam optimizer. One fifth of the train dataset is reserved for early stopping. |
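The split protocol quoted in the Dataset Splits row (2048 calibration points, remaining data split 70/30 into train/test, one fifth of the train part held out for early stopping) can be sketched as follows. This is a minimal stdlib-only sketch; the function name `split_dataset`, the seeding, and the ordering of the partitions are assumptions, not the authors' actual code (which is at the linked repository).

```python
import random

def split_dataset(n, seed=0):
    """Sketch of the paper's split: 2048 calibration points, then a
    70/30 train/test split of the rest, with one fifth of the train
    part reserved for early stopping. Details beyond the quoted
    proportions are assumptions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)          # "shuffled and split" per the paper
    calib, rest = idx[:2048], idx[2048:]      # reserve 2048 points for calibration
    n_train = int(0.7 * len(rest))            # 70% of the remainder for training
    train_full, test = rest[:n_train], rest[n_train:]
    n_val = len(train_full) // 5              # one fifth of train for early stopping
    val, train = train_full[:n_val], train_full[n_val:]
    return train, val, calib, test
```

Repeating this with 10 different seeds reproduces the "shuffled and split 10 times" replication described in the table.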
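The base architecture in the Experiment Setup row (a fully connected network with three hidden layers of 100 neurons and ReLU activations) can be sketched in NumPy as a plain forward pass. The initialization scheme and function names here are illustrative assumptions; the paper trains such networks with Adam and task-specific output layers and losses.

```python
import numpy as np

def init_mlp(d_in, d_out, hidden=100, n_hidden=3, seed=0):
    # Three hidden layers of 100 units each, matching the described setup;
    # He-style initialization is an assumption for the illustration.
    rng = np.random.default_rng(seed)
    sizes = [d_in] + [hidden] * n_hidden + [d_out]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU on every hidden layer; the output layer is left linear, since
    # the paper attaches different output layers per base-model type.
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b
```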