Prediction-Powered E-Values
Authors: Daniel Csillag, Claudio Jose Struchiner, Guilherme Tegoni Goedert
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the effectiveness of our framework across a wide range of inference tasks, from simple hypothesis testing and confidence intervals to more involved procedures for change-point detection and causal discovery, which were out of reach of previous techniques. ... In this section we present four case studies where we use our method, highlighting the modifications made to the base methods in the process of prediction-empowerment. |
| Researcher Affiliation | Academia | 1School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil. Correspondence to: Daniel Csillag <EMAIL>. |
| Pseudocode | Yes | The resulting algorithm for hypothesis testing is remarkably simple to implement, given its generality. The pseudocode can be found in Algorithm 1. Algorithm 1 Prediction-Powered E-Values |
| Open Source Code | Yes | The source code to reproduce the experiments in the paper, as well as additional experiments (e.g. varying seed, varying sampling budget, underpowered settings [i.e., with worse predictive models]), is available at https://github.com/dccsillag/experiments-prediction-powered-evalues. |
| Open Datasets | Yes | In this first case study we seek to estimate the prevalence of diabetes on a cohort, upon which we work atop the dataset of (CDC, 2015). ... We use the dataset of (CDC, 2015). It is a tabular dataset... ...CDC. CDC 2014 BRFSS Survey Data and Documentation, 2015. URL https://www.cdc.gov/brfss/annual_data/annual_2014.html. ... In our experiment, we work on the dataset of (Blackard, 1998). ... We use the dataset of (Blackard, 1998). Upon this dataset... ...Blackard, J. Covertype. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50K5N. |
| Dataset Splits | No | Only labelled samples: collect π_inf·n labelled samples, and use the standard, non-prediction-powered e-values of (Waudby-Smith & Ramdas, 2020) to estimate the mean. ... This method requires the prediction model to be fixed a priori, so we first split the collected labels in a training set to train it, and use the remaining labels for their prediction-powered inference method. ... In our experiment, we work on the dataset of (Blackard, 1998). For the sake of evaluation, we have access to all Yi, but will simulate the missingness. ... For the non-poisoned data stream in Section 3.2, where the null should not be rejected, we just use the data remaining after the training and validation splits. |
| Hardware Specification | No | No specific hardware details are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | For the bets λi, we use the GRAPA method proposed by (Waudby-Smith & Ramdas, 2020), bounded to (−1/(1−θ), 1/θ). ... We then have two predictive models: one which is the predictive model whose risk we want to monitor, µ, and another which is used for prediction-powered inference, which receives Xi and predicts the 0-1 loss for that point, 1[µ(Xi) ≠ Yi]. The first model µ is held static over the course of the inference, while the one for prediction-powered inference is updated whenever we collect a new label. Collection probabilities πi(Xi) are held constant at π_inf, leading to label collection matching the baseline of only using labelled samples. ... we first clip the p-values (prior to calibration) to lie within (10⁻⁷, 1] ... and then rescale the calibrated e-values by means of a rescaling function rescale_η(e) := η(e − 1) + 1, with η chosen so as to satisfy a labelling budget of π_inf = 10% ... the batch size B cannot be too small; we use B = 100. |
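The setup row above quotes two concrete ingredients: a betting e-process whose bets λi are constrained to (−1/(1−θ), 1/θ), and the rescaling function rescale_η(e) := η(e − 1) + 1. The following is a minimal sketch of both, with hypothetical names (`rescale`, `betting_e_process`, `lam`); it uses a single fixed bet for simplicity, whereas the paper adapts the bets with GRAPA.

```python
def rescale(e, eta):
    """Rescaling function from the setup: rescale_eta(e) = eta * (e - 1) + 1.

    For 0 < eta <= 1 this shrinks an e-value toward 1, trading power for a
    smaller effective labelling budget; eta = 1 leaves it unchanged.
    """
    return eta * (e - 1.0) + 1.0


def betting_e_process(ys, theta, lam=0.5):
    """Toy betting e-process for H0: E[Y] = theta, with observations in [0, 1].

    Each factor 1 + lam * (y - theta) is a nonnegative, unit-mean bet under H0
    provided lam lies in (-1/(1 - theta), 1/theta) -- the same interval the
    paper uses to bound the GRAPA bets. A large final value is evidence
    against H0.
    """
    assert -1.0 / (1.0 - theta) < lam < 1.0 / theta, "bet outside valid range"
    e = 1.0
    for y in ys:
        e *= 1.0 + lam * (y - theta)
    return e
```

For instance, observing four 1s when testing θ = 0.5 with bet λ = 0.5 multiplies the wealth by 1.25 per observation, yielding an e-value of 1.25⁴ ≈ 2.44; rescaling with η = 0.1 would shrink that to roughly 1.14.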