reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Online Conformal Prediction via Online Optimization

Authors: Felipe Areces, Christopher Mohri, Tatsunori Hashimoto, John Duchi

ICML 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Complementary to our theory, our experiments spanning over 15 datasets suggest that the performance improvement of our methods over baselines grows with the magnitude of the data's dependence, even when baselines are tuned on the test set. We put these findings to the test by pre-registering an experiment for electricity demand forecasting in Texas, where our algorithms achieve over a 10% reduction in confidence set sizes, more than a 30% improvement in quantile and absolute losses with respect to the observed errors, and significant outcomes on all 78 out of 78 pre-registered hypotheses. We provide documentation for the pypi package implementing our algorithms here: https://conformalopt.readthedocs.io/.
Researcher Affiliation	Academia	1Department of Electrical Engineering, Stanford University, Stanford, USA 2Department of Computer Science, Stanford University, Stanford, USA 3Department of Statistics, Stanford University, Stanford, USA.
Pseudocode	Yes	Algorithm 1 Batched projected online gradient descent
Open Source Code	Yes	We provide documentation for the pypi package implementing our algorithms here: https://conformalopt.readthedocs.io/.
Open Datasets	Yes	Stock data (AMZN, GOOGL, MSFT). Using stock data is common in online conformal work. Here we consider the returns of Amazon, Google, and Microsoft stock, which are datasets used in Angelopoulos et al. (2023) and contain roughly 3,000 observations each. Daily climate. This dataset has 1,575 daily temperature measurements in Delhi, India from 2013 to 2017, and is also used in Angelopoulos et al. (2023). Elec2 (Harries, 1999). This dataset consists of 45,312 hourly measurements of electricity demand in New South Wales, Australia from May 7, 1996 to December 5, 1998. As in Angelopoulos et al. (2024), we use a one-day delayed moving average as base forecaster, that is $\hat{Y}_t := \frac{1}{24} \sum_{i=1}^{24} Y_{t-24-i}$ and conformal scores $S_t := \lvert \hat{Y}_t - Y_t \rvert$. We gather data from the Electric Reliability Council of Texas (ERCOT), an organization that operates Texas's electrical grid. This data is accessible through the Grid Status API, which provides the true electricity load and a forecast for the load every 5 minutes.
Dataset Splits	Yes	In all experiments, we set the confidence level to $1 - \alpha = 0.9$. We always reserve the first scores as a validation set, and set the rest as the test set. We tune the hyperparameters for our algorithms on the validation set, and for the baselines, we directly tune the hyperparameters on the test set. We reserve the first 1/3 of the datasets as validation data and tune our hyperparameters with the hyperparameter grid in Appendix B.1, while still tuning baseline hyperparameters on the test set.
Hardware Specification	No	The paper does not provide specific details about the hardware used for running its experiments, such as GPU or CPU models. It mentions runtime in Table 1 but does not link it to specific hardware.
Software Dependencies	No	The paper mentions using 'the cvxpy python library' but does not specify its version or the versions of other key software components like Python itself, or other libraries/frameworks.
Experiment Setup	Yes	In all experiments, we set the confidence level to $1 - \alpha = 0.9$. We always reserve the first scores as a validation set, and set the rest as the test set. We tune the hyperparameters for our algorithms on the validation set, and for the baselines, we directly tune the hyperparameters on the test set. We provide a starting grid in Section B.1 of Appendix B, which is implemented in our code and used in all our experiments. The decaying step sizes are of the form $c \cdot t^{-0.6}$ as in Angelopoulos et al. (2024).
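
The Elec2 setup quoted in the table (a one-day delayed moving-average forecaster, absolute-residual conformal scores, a first-third validation split, and step sizes of the form $c \cdot t^{-0.6}$) can be sketched in Python as follows. This is a minimal illustration, not the paper's code or the conformalopt package API: every function name here is hypothetical, and `track_quantile` is a generic online subgradient update on the quantile loss, shown only to indicate where the decaying step size would enter. It is not the paper's Algorithm 1.

```python
import numpy as np

def delayed_moving_average(y, window=24, delay=24):
    """Forecast y[t] as the mean of y[t-delay-window], ..., y[t-delay-1].

    With window=24 and delay=24 this matches
    Y_hat_t = (1/24) * sum_{i=1}^{24} Y_{t-24-i}.
    Entries without enough history are left as NaN.
    """
    y = np.asarray(y, dtype=float)
    y_hat = np.full(len(y), np.nan)
    for t in range(window + delay, len(y)):
        y_hat[t] = y[t - delay - window : t - delay].mean()
    return y_hat

def conformal_scores(y, y_hat):
    """Absolute residuals S_t = |Y_hat_t - Y_t|, dropping undefined forecasts."""
    s = np.abs(np.asarray(y_hat, dtype=float) - np.asarray(y, dtype=float))
    return s[~np.isnan(s)]

def split_validation_test(scores):
    """Reserve the first third of the scores for validation, the rest for test."""
    n_val = len(scores) // 3
    return scores[:n_val], scores[n_val:]

def decaying_step_sizes(c, horizon, power=0.6):
    """Step sizes eta_t = c * t^(-power) for t = 1, ..., horizon."""
    t = np.arange(1, horizon + 1, dtype=float)
    return c * t ** (-power)

def track_quantile(scores, alpha=0.1, c=1.0, q0=0.0):
    """Generic online quantile tracking (assumed baseline, not Algorithm 1).

    The threshold moves up by eta_t * (1 - alpha) after a miss (S_t > q_t)
    and down by eta_t * alpha otherwise.
    """
    etas = decaying_step_sizes(c, len(scores))
    q = q0
    thresholds = np.empty(len(scores))
    for i, s in enumerate(scores):
        thresholds[i] = q  # threshold used to cover S_t, set before seeing it
        miss = float(s > q)
        q += etas[i] * (miss - alpha)
    return thresholds
```

A typical use would chain the pieces: compute `scores = conformal_scores(y, delayed_moving_average(y))`, split them with `split_validation_test`, choose `c` from a grid on the validation scores, and then run the tracker on the test scores with $\alpha = 0.1$ to target the 0.9 confidence level.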