Prediction-Powered E-Values
Authors: Daniel Csillag, Claudio Jose Struchiner, Guilherme Tegoni Goedert
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the effectiveness of our framework across a wide range of inference tasks, from simple hypothesis testing and confidence intervals to more involved procedures for change-point detection and causal discovery, which were out of reach of previous techniques. ... In this section we present four case studies where we use our method, highlighting the modifications made to the base methods in the process of prediction-empowerment. |
| Researcher Affiliation | Academia | 1School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil. Correspondence to: Daniel Csillag <EMAIL>. |
| Pseudocode | Yes | The resulting algorithm for hypothesis testing is remarkably simple to implement, given its generality. The pseudocode can be found in Algorithm 1. Algorithm 1 Prediction-Powered E-Values |
| Open Source Code | Yes | The source code to reproduce the experiments in the paper, as well as additional experiments (e.g. varying seed, varying sampling budget, underpowered settings [i.e., with worse predictive models]), is available at https://github.com/dccsillag/experiments-prediction-powered-evalues. |
| Open Datasets | Yes | In this first case study we seek to estimate the prevalence of diabetes on a cohort, upon which we work atop the dataset of (CDC, 2015). ... We use the dataset of (CDC, 2015). It is a tabular dataset... ...CDC. CDC 2014 BRFSS Survey Data and Documentation, 2015. URL https://www.cdc.gov/brfss/annual_data/annual_2014.html. ... In our experiment, we work on the dataset of (Blackard, 1998). ... We use the dataset of (Blackard, 1998). Upon this dataset... ...Blackard, J. Covertype. UCI Machine Learning Repository, 1998. DOI: https://doi.org/10.24432/C50K5N. |
| Dataset Splits | No | Only labelled samples: collect π_inf·n labelled samples, and use the standard, non-prediction-powered e-values of (Waudby-Smith & Ramdas, 2020) to estimate the mean. ... This method requires the prediction model to be fixed a priori, so we first split the collected labels in a training set to train it, and use the remaining labels for their prediction-powered inference method. ... In our experiment, we work on the dataset of (Blackard, 1998). For the sake of evaluation, we have access to all Yi, but will simulate the missingness. ... For the non-poisoned data stream in Section 3.2, where the null should not be rejected, we just use the data remaining after the training and validation splits. |
| Hardware Specification | No | No specific hardware details are mentioned in the paper. |
| Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper. |
| Experiment Setup | Yes | For the bets λi, we use the GRAPA method proposed by (Waudby-Smith & Ramdas, 2020), bounded to (−1/(1−θ), 1/θ). ... We then have two predictive models: one which is the predictive model whose risk we want to monitor, µ, and another which is used for prediction-powered inference, which receives Xi and predicts the 0-1 loss for that point, 1[µ(Xi) ≠ Yi]. The first model µ is held static over the course of the inference, while the one for prediction-powered inference is updated whenever we collect a new label. Collection probabilities πi(Xi) are held constant at π_inf, leading to label collection matching the baseline of only using labelled samples. ... we first clip the p-values (prior to calibration) to lie within (10⁻⁷, 1] ... and then rescale the calibrated e-values by means of a rescaling function rescale_η(e) := η(e − 1) + 1, with η chosen so as to satisfy a labelling budget of π_inf = 10% ... the batch size B cannot be too small; we use B = 100. |
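The setup row above quotes two concrete ingredients: a betting e-process whose bets λi are constrained to (−1/(1−θ), 1/θ), and the rescaling function rescale_η(e) := η(e − 1) + 1. The following is a minimal sketch of both, with hypothetical names (`rescale`, `betting_e_process`, `lam`); it uses a single fixed bet for simplicity, whereas the paper adapts the bets with GRAPA.

```python
def rescale(e, eta):
    """Rescaling function from the setup: rescale_eta(e) = eta * (e - 1) + 1.

    For 0 < eta <= 1 this shrinks an e-value toward 1, trading power for a
    smaller effective labelling budget; eta = 1 leaves it unchanged.
    """
    return eta * (e - 1.0) + 1.0


def betting_e_process(ys, theta, lam=0.5):
    """Toy betting e-process for H0: E[Y] = theta, with observations in [0, 1].

    Each factor 1 + lam * (y - theta) is a nonnegative, unit-mean bet under H0
    provided lam lies in (-1/(1 - theta), 1/theta) -- the same interval the
    paper uses to bound the GRAPA bets. A large final value is evidence
    against H0.
    """
    assert -1.0 / (1.0 - theta) < lam < 1.0 / theta, "bet outside valid range"
    e = 1.0
    for y in ys:
        e *= 1.0 + lam * (y - theta)
    return e
```

For instance, observing four 1s when testing θ = 0.5 with bet λ = 0.5 multiplies the wealth by 1.25 per observation, yielding an e-value of 1.25⁴ ≈ 2.44; rescaling with η = 0.1 would shrink that to roughly 1.14.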