reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Wasserstein-Regularized Conformal Prediction under General Distribution Shift

Authors: Rui Xu, Chao Chen, Yue Sun, Parvathinathan Venkitasubramaniam, Sihong Xie

ICLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on six datasets prove that WR-CP can reduce coverage gaps to 3.2% across different confidence levels and outputs prediction sets 37% smaller than the worst-case approach on average. ... Experiments were conducted on six datasets: (a) the airfoil self-noise dataset (Brooks & Marcolini, 2014); (b) Seattle-loop (Cui et al., 2019), PeMSD4, PeMSD8 (Guo et al., 2019) for traffic speed prediction; (c) Japan-Prefectures, and U.S.-States (Deng et al., 2020) for epidemic spread forecasting.
Researcher Affiliation	Academia	Rui Xu, Sihong Xie The Hong Kong University of Science and Technology (Guangzhou) EMAIL, EMAIL Chao Chen Harbin Institute of Technology EMAIL Yue Sun, Parvathinathan Venkitasubramaniam Lehigh University EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 Wasserstein-regularized Conformal Prediction (WR-CP)
Open Source Code	Yes	The code of our work is released on https://github.com/rxu0112/WR-CP.
Open Datasets	Yes	Experiments were conducted on six datasets: (a) the airfoil self-noise dataset (Brooks & Marcolini, 2014); (b) Seattle-loop (Cui et al., 2019), PeMSD4, PeMSD8 (Guo et al., 2019) for traffic speed prediction; (c) Japan-Prefectures, and U.S.-States (Deng et al., 2020) for epidemic spread forecasting. ... The airfoil self-noise dataset from the UCI Machine Learning Repository (Brooks & Marcolini, 2014). DOI: https://doi.org/10.24432/C5VW2C.
Dataset Splits	No	We conducted 10 sampling trials for each dataset. Within each trial, we sampled $S^{(i)}_{XY}$ from each subset i, for i = 1, ..., k. After this step, we allocated the remaining elements within each subset for calibration and testing purposes. The parts intended for calibration across all subsets were then unified to form $S^{P}_{XY}$. Lastly, to create diverse testing scenarios, we generated multiple test sets by randomly mixing the parts designated for testing from each subset with replacement.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It only mentions using an MLP model.
Software Dependencies	No	To find the optimized bandwidth value of $\hat{P}_X$ and $\hat{D}^{(i)}_X$ for i = 1, ..., k on each dataset, we applied the grid search method with a bandwidth pool using the scikit-learn package (Pedregosa et al., 2011).
Experiment Setup	Yes	A multi-layer perceptron (MLP) with an architecture of (input dimension, 64, 64, 1) was utilized in all experimental setups to maintain comparison fairness. ... The β values for the WR-CP method are 9, 11, 9, 10, 13, and 13, respectively. ... The β values for the WR-CP method are 4.5, 9, 9, 6, 8, and 20, respectively. ... The selected β values for the results of Figure 5 are shown in Table 2.
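
The Dataset Splits row is marked "No" because the quoted procedure gives no sizes or ratios. The sketch below illustrates one sampling trial as described in that row, only to make the steps concrete. It is not the authors' code: the function name make_trial, the parameters n_train, n_test_sets and test_size, the 50/50 division of the remaining elements, and the use of random mixture weights per test set are all assumptions introduced here.

```python
import random


def make_trial(subsets, n_train, n_test_sets, test_size, seed=0):
    """Run one hypothetical sampling trial over k subsets (lists of examples).

    All sizes and the 50/50 calibration/test division are illustrative
    assumptions; the quoted text does not specify them.
    """
    rng = random.Random(seed)
    train, calibration, test_parts = [], [], []

    for subset in subsets:
        shuffled = list(subset)
        rng.shuffle(shuffled)
        # Sample drawn from subset i (S^(i)_XY in the quoted text).
        train.append(shuffled[:n_train])
        # Remaining elements are allocated to calibration and testing.
        rest = shuffled[n_train:]
        half = len(rest) // 2
        # Calibration parts from all subsets are unified (S^P_XY).
        calibration.extend(rest[:half])
        test_parts.append(rest[half:])

    test_sets = []
    for _ in range(n_test_sets):
        # Assumption: each test set uses its own random mixture weights
        # over the subsets, so the test scenarios differ from one another.
        weights = [rng.random() for _ in test_parts]
        picks = rng.choices(range(len(test_parts)), weights=weights, k=test_size)
        # Draw with replacement from the chosen subset's test part.
        test_sets.append([rng.choice(test_parts[i]) for i in picks])

    return train, calibration, test_sets
```

Calling make_trial ten times with different seeds would correspond to the 10 sampling trials mentioned in the row; the actual sample sizes would have to come from the paper or its released code.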