reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Leave-One-Out Stable Conformal Prediction

Authors: Kiljae Lee, Yuan Zhang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our method is theoretically justified and demonstrates superior numerical performance on synthetic and real-world data. We applied our method to a screening problem, where its effective exploitation of training data led to improved test power compared to state-of-the-art method based on split conformal.
Researcher Affiliation	Academia	Kiljae Lee, Yuan Zhang, The Ohio State University
Pseudocode	Yes	Algorithm 1: (LOO-StabCP) Leave-One-Out Stable Conformal Prediction Set; Algorithm 2: (LOO-cfBH) Conformal Selection by Prediction with Leave-One-Out p-values
Open Source Code	Yes	The code for reproducing numerical results is available at: https://github.com/KiljaeL/LOO-StabCP.
Open Datasets	Yes	We considered two models for $\mu(\cdot; \beta)$: linear $\mu(x; \beta) = \sum_{j=1}^{d} \beta_j x_j$ and nonlinear $\mu(x; \beta) = \sum_{j=1}^{d} \beta_j e^{x_j/10}$. The Boston Housing data (Harrison Jr & Rubinfeld, 1978) contain 506 different areas in Boston... The Diabetes data (Efron et al., 2004) measured 442 individuals at their baseline time points... We used the recruitment data set Ganatara (2020) that was also analyzed in Jin & Candès (2023).
Dataset Splits	Yes	We set n = m = 100, α = 0.1 and generated synthetic data... SplitCP: 70% training + 30% calibration... For each data set, we randomly held out m data points (as the test data) for performance evaluation and released the rest to all methods for training/calibration. We tested two settings: m = 1 and m = 100... each time leaving out 20% data points as the test data. In cfBH, the data was split into 70% for training and 30% for calibration.
Hardware Specification	No	The paper does not explicitly describe the hardware used to run its experiments, only the algorithms, datasets, and hyperparameters.
Software Dependencies	No	The paper mentions algorithmic details and methods like "robust linear regression, equipped with Huber loss", "gradient descent", and "SGD", but does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup	Yes	In this simulation, we set n = m = 100, α = 0.1 and generated synthetic data using $X_i \overset{\text{i.i.d.}}{\sim} N(0, \tfrac{1}{d}\Sigma)$ with $d = 100$, where $\Sigma_{i,j} = \rho^{\lvert i-j \rvert}$ (i.e., AR(1)). In particular, we chose ρ = 0.5 in this experiment. For the response variable we set $Y_i = \mu(X_i; \beta) + \epsilon_i$, where $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, 1)$. We considered two models for $\mu(\cdot; \beta)$: linear $\mu(x; \beta) = \sum_{j=1}^{d} \beta_j x_j$ and nonlinear $\mu(x; \beta) = \sum_{j=1}^{d} \beta_j e^{x_j/10}$. In both models, set $\beta_j \propto (1 - j/d)^5$ for $j \in [d]$, and normalize: $\lVert \beta \rVert_2^2 = d$. To fit the model, we used robust linear regression, equipped with Huber loss: $\ell(y, f_\theta(x)) = \begin{cases} \tfrac{1}{2}(y - f_\theta(x))^2, & \text{if } \lvert y - f_\theta(x) \rvert \le \epsilon, \\ \epsilon \lvert y - f_\theta(x) \rvert - \tfrac{1}{2}\epsilon^2, & \text{if } \lvert y - f_\theta(x) \rvert > \epsilon, \end{cases}$ where $f_\theta(x) = x^T \theta$ and we set ϵ = 1 throughout. We used absolute residual as non-conformity scores. In RLM, we set $\Omega(\theta) = \lVert \theta \rVert^2$ and solved it using gradient descent (Diamond & Boyd, 2016). Throughout, we ran SGD for R = 15 epochs for all methods, except R = 5 for the very slow Full CP. For both RLM and SGD, we set the learning rate to be η = 0.001. To further evaluate the performance of LOO-StabCP with non-convex learning methods, we conducted experiments with a neural network of a single hidden layer of 20 nodes and a sigmoid activation function. We set η = 0.001 and R = 30.
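
The synthetic setup quoted in the Experiment Setup row can be sketched in a few lines of standard-library Python. This is an illustrative reading of the quoted description, not the authors' released code: the function names (`make_beta`, `draw_x`, `mu`, `make_synthetic`, `huber_loss`) and the seed are assumptions, and the AR(1) covariate vector is drawn with the stationary recursion x_k = ρ·x_{k-1} + sqrt(1 − ρ²)·z_k, which yields covariance ρ^|i−j| without building the full matrix.

```python
import math
import random


def make_beta(d=100):
    """beta_j proportional to (1 - j/d)^5, rescaled so that ||beta||_2^2 = d."""
    raw = [(1 - j / d) ** 5 for j in range(1, d + 1)]
    scale = math.sqrt(d / sum(b * b for b in raw))
    return [scale * b for b in raw]


def draw_x(d, rho, rng):
    """One covariate vector X ~ N(0, Sigma / d) with Sigma_ij = rho^|i-j|."""
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    x = [z[0]]  # stationary start: unit variance
    for k in range(1, d):
        # AR(1) recursion keeps unit variance and gives corr rho^|i-j|
        x.append(rho * x[-1] + math.sqrt(1 - rho ** 2) * z[k])
    return [v / math.sqrt(d) for v in x]  # apply the 1/d covariance scaling


def mu(x, beta, nonlinear=False):
    """Linear: sum_j beta_j x_j. Nonlinear: sum_j beta_j exp(x_j / 10)."""
    if nonlinear:
        return sum(b * math.exp(v / 10) for b, v in zip(beta, x))
    return sum(b * v for b, v in zip(beta, x))


def make_synthetic(n=100, d=100, rho=0.5, nonlinear=False, seed=0):
    """Draw n pairs (X_i, Y_i) with Y_i = mu(X_i; beta) + N(0, 1) noise."""
    rng = random.Random(seed)  # seed is an assumption for repeatability
    beta = make_beta(d)
    X = [draw_x(d, rho, rng) for _ in range(n)]
    y = [mu(x, beta, nonlinear) + rng.gauss(0.0, 1.0) for x in X]
    return X, y, beta


def huber_loss(y, pred, eps=1.0):
    """Huber loss: quadratic for |residual| <= eps, linear beyond it."""
    r = abs(y - pred)
    return 0.5 * r * r if r <= eps else eps * r - 0.5 * eps * eps
```

Calling `make_synthetic()` with the defaults mirrors the quoted n = 100, d = 100, ρ = 0.5 linear setting; passing `nonlinear=True` switches to the exponential mean function. Model fitting (RLM, SGD, the neural network) and the conformal procedures themselves are not covered by this sketch.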