Online Conformal Prediction via Online Optimization
Authors: Felipe Areces, Christopher Mohri, Tatsunori Hashimoto, John Duchi
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Complementary to our theory, our experiments spanning over 15 datasets suggest that the performance improvement of our methods over baselines grows with the magnitude of the data's dependence, even when baselines are tuned on the test set. We put these findings to the test by pre-registering an experiment for electricity demand forecasting in Texas, where our algorithms achieve over a 10% reduction in confidence set sizes, more than a 30% improvement in quantile and absolute losses with respect to the observed errors, and significant outcomes on all 78 out of 78 pre-registered hypotheses. We provide documentation for the PyPI package implementing our algorithms here: https://conformalopt.readthedocs.io/. |
| Researcher Affiliation | Academia | ¹Department of Electrical Engineering, Stanford University, Stanford, USA; ²Department of Computer Science, Stanford University, Stanford, USA; ³Department of Statistics, Stanford University, Stanford, USA. |
| Pseudocode | Yes | Algorithm 1 Batched projected online gradient descent |
| Open Source Code | Yes | We provide documentation for the PyPI package implementing our algorithms here: https://conformalopt.readthedocs.io/. |
| Open Datasets | Yes | Stock data (AMZN, GOOGL, MSFT). Using stock data is common in online conformal work. Here we consider the returns of Amazon, Google, and Microsoft stock, which are datasets used in Angelopoulos et al. (2023) and contain roughly 3,000 observations each. Daily climate. This dataset has 1,575 daily temperature measurements in Delhi, India from 2013 to 2017, and is also used in Angelopoulos et al. (2023). Elec2 (Harries, 1999). This dataset consists of 45,312 hourly measurements of electricity demand in New South Wales, Australia from May 7, 1996 to December 5, 1998. As in Angelopoulos et al. (2024), we use a one-day delayed moving average as base forecaster, that is $\hat{Y}_t := \frac{1}{24} \sum_{i=1}^{24} Y_{t-24-i}$, and conformal scores $S_t := \lvert \hat{Y}_t - Y_t \rvert$. We gather data from the Electric Reliability Council of Texas (ERCOT), an organization that operates Texas's electrical grid. This data is accessible through the Grid Status API, which provides the true electricity load and a forecast for the load every 5 minutes. |
| Dataset Splits | Yes | In all experiments, we set the confidence level to 1 − α = 0.9. We always reserve the first scores as a validation set, and set the rest as the test set. We tune the hyperparameters for our algorithms on the validation set, and for the baselines, we directly tune the hyperparameters on the test set. We reserve the first 1/3 of the datasets as validation data and tune our hyperparameters with the hyperparameter grid in Appendix B.1, while still tuning baseline hyperparameters on the test set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments, such as GPU or CPU models. It mentions runtime in Table 1 but does not link it to specific hardware. |
| Software Dependencies | No | The paper mentions using 'the cvxpy python library' but does not specify its version or the versions of other key software components like Python itself, or other libraries/frameworks. |
| Experiment Setup | Yes | In all experiments, we set the confidence level to 1 − α = 0.9. We always reserve the first scores as a validation set, and set the rest as the test set. We tune the hyperparameters for our algorithms on the validation set, and for the baselines, we directly tune the hyperparameters on the test set. We provide a starting grid in Section B.1 of Appendix B, which is implemented in our code and used in all our experiments. The decaying step sizes are of the form $c \cdot t^{-0.6}$ as in Angelopoulos et al. (2024). |
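
The setup quoted in the table can be illustrated with a short sketch. It computes the one-day delayed moving-average forecast and the absolute-error conformal scores described for Elec2, reserves the first third of the scores for validation, and runs a scalar online quantile update with step sizes $c \cdot t^{-0.6}$. This is not the paper's Algorithm 1 and not the `conformalopt` API: the update is a plain online subgradient step on the quantile loss in the style of the decaying-step baseline, and all function names, defaults, and the initial threshold are assumptions made for illustration.

```python
import numpy as np


def conformal_scores(y, window=24, delay=24):
    """Scores S_t = |Y_hat_t - Y_t| with Y_hat_t the mean of Y_{t-delay-i}, i = 1..window.

    Hypothetical helper; only time steps with a full history are scored.
    """
    y = np.asarray(y, dtype=float)
    start = window + delay  # first t for which all window values exist
    y_hat = np.array(
        [y[t - delay - window : t - delay].mean() for t in range(start, len(y))]
    )
    return np.abs(y_hat - y[start:])


def split_validation_test(scores):
    """Reserve the first third of the scores for validation, the rest for testing."""
    n_val = len(scores) // 3
    return scores[:n_val], scores[n_val:]


def quantile_tracker(scores, alpha=0.1, c=1.0, q0=0.0):
    """Online subgradient descent on the quantile loss (illustrative baseline).

    At step t the threshold q_t is used before S_t is seen, then updated as
    q_{t+1} = q_t + c * t^{-0.6} * (1{S_t > q_t} - alpha).
    c and q0 are assumed tuning parameters, not values from the paper.
    """
    q = q0
    thresholds = np.empty(len(scores))
    for t, s in enumerate(scores, start=1):
        thresholds[t - 1] = q
        miscovered = float(s > q)
        q = q + c * t ** -0.6 * (miscovered - alpha)
    return thresholds


# Usage on synthetic data (a stand-in for a real load series).
rng = np.random.default_rng(0)
y = np.sin(np.arange(2000) * 2 * np.pi / 24) + 0.1 * rng.standard_normal(2000)
scores = conformal_scores(y)
val_scores, test_scores = split_validation_test(scores)
thresholds = quantile_tracker(test_scores, alpha=0.1, c=1.0)
coverage = np.mean(test_scores <= thresholds)  # empirical coverage on the test part
```

In a setup like the one quoted above, `c` would be chosen on `val_scores` from a grid and the resulting tracker evaluated on `test_scores`; the prediction set at time t is then the interval of half-width `thresholds[t]` around the forecast.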