Stochastic Online Conformal Prediction with Semi-Bandit Feedback

Authors: Haosen Ge, Hamsa Bastani, Osbert Bastani

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our algorithm on a retrieval task, an image classification task, and an auction price-setting task, and demonstrate that it empirically achieves good performance compared to several baselines. Our experiments demonstrate that our algorithm generates prediction sets that converge to the optimal ones while maintaining the desired coverage rate. Moreover, our algorithm significantly outperforms three natural baselines; each baseline either achieves worse cumulative expected regret or does not satisfy the desired coverage rate. First, Figure 1 shows the cumulative regret of each approach on each task. Next, Figure 2 shows the coverage rate achieved by each algorithm for each task. Finally, Figure 3 shows the undercoverage count.
Researcher Affiliation Academia 1Wharton AI & Analytics Initiative, University of Pennsylvania 2Operations, Information and Decisions Department, The Wharton School, University of Pennsylvania 3Department of Computer and Information Science, University of Pennsylvania. Correspondence to: Haosen Ge <EMAIL>.
Pseudocode Yes Algorithm 1 Semi-bandit Prediction Set (SPS)
    Input: horizon T, desired quantile α, initial threshold τ1
    for t = 1 to T do
        if s_t ≥ τ_t then observe s_t, else learn only that s_t < τ_t
        Compute Ĝ_t according to (4)
        τ̂_{1−α,t} ← sup{τ ∈ ℝ | Ĝ_t(τ) ≤ 1 − α}
        τ_{t+1} ← max{τ̂_{1−α,t}, τ_t}
    end for
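The threshold update can be sketched on synthetic data. The following is a minimal illustration, not the authors' implementation: it assumes the prediction set at round t is {y : score(y) ≥ τ_t} (consistent with a nondecreasing threshold update), and a simple plug-in censored empirical CDF stands in for the paper's estimator (4); the names `sps` and `known_below` are ours.

```python
import bisect
import random

def sps(scores, alpha=0.9, tau1=0.0):
    """Sketch of the SPS threshold update under semi-bandit feedback.

    The score s_t is observed only when it lands inside the prediction
    set (s_t >= tau_t); otherwise we learn only that s_t < tau_t.
    """
    tau = tau1
    known_below = []  # sorted upper bounds: observed value, or censoring threshold
    taus = []
    for t, s in enumerate(scores, start=1):
        taus.append(tau)
        if s >= tau:
            bisect.insort(known_below, s)    # inside the set: observed exactly
        else:
            bisect.insort(known_below, tau)  # censored: we only know s < tau
        # Plug-in censored CDF G_hat(x) = #{bounds <= x} / t; it is exact for
        # x >= tau, because every censoring threshold seen so far is <= tau.
        k = int((1 - alpha) * t)  # largest rank with G_hat <= 1 - alpha
        if k >= 1:
            tau_hat = known_below[k - 1]  # ~ sup{x : G_hat(x) <= 1 - alpha}
            tau = max(tau_hat, tau)       # thresholds never decrease
    return taus

# Example: i.i.d. Uniform(0,1) scores; thresholds ratchet upward toward
# the empirical (1 - alpha)-quantile while the set stays conservative.
random.seed(0)
thresholds = sps([random.random() for _ in range(5000)])
```

Because the threshold never decreases, every censoring threshold in the history lies below the current one, which is what makes the plug-in CDF exact on the region where the supremum is taken.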
Open Source Code No The paper mentions "DPR's public GitHub repo: https://github.com/facebookresearch/DPR" as a source for data and a model used in their experiments, but it does not provide any statement or link for the source code of their own methodology described in the paper.
Open Datasets Yes We use the Vision Transformer (Dosovitskiy et al., 2020) model on the ImageNet dataset (Deng et al., 2009). Obtained from https://www.image-net.org/ with a custom, non-commercial license; we use the 16×16 downsampled version. Our dataset is the SQuAD question-answering dataset (Rajpurkar et al., 2016), a popular reading comprehension benchmark. The data were obtained from DPR's public GitHub repo: https://github.com/facebookresearch/DPR with licenses CC BY-SA 4.0 and CC BY-NC 4.0. Following standard practice (Mohri & Medina, 2014), we use a synthetic dataset adapted from eBay auction data (Jank & Shmueli, 2010).
Dataset Splits No The paper describes how candidate documents are constructed for the SQuAD dataset ('for each question, we include one ground truth document and all the irrelevant documents to create the set of candidate documents') and mentions using a '16×16 downsampled version' of ImageNet, but it does not provide explicit training, validation, or test splits (percentages, sample counts, or references to standard splits) for any of the datasets, which would be needed to reproduce the data partitioning.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory amounts, or cloud instance types) used for running the experiments.
Software Dependencies No The paper mentions using specific models like 'Vision Transformer (Dosovitskiy et al., 2020)' and 'Dense Passage Retriever (DPR) model (Karpukhin et al., 2020)', but it does not provide specific version numbers for any software libraries, programming languages, or development environments that would be necessary to replicate the experiment setup.
Experiment Setup Yes Experiment parameters. We use α = 0.9 and T = 10000, and report averages across 10 runs. We choose the learning rate γ via a grid search over the candidate set proposed in (Gibbs & Candès, 2024). We set the learning rate to the one used in the experiments of the original paper, i.e., η_t = t^(−1/2−ϵ) with ϵ = 0.1. We take λ1 = 0.1 and λ2 = 10.
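The decaying schedule η_t = t^(−1/2−ϵ) with ϵ = 0.1 is straightforward to reproduce. This is a sketch of the schedule only; the helper name `eta` is ours, and γ, λ1, and λ2 are separate knobs set as stated above.

```python
def eta(t: int, eps: float = 0.1) -> float:
    """Learning-rate schedule eta_t = t^(-1/2 - eps) from the experiment setup."""
    return t ** (-0.5 - eps)

# The schedule starts at eta_1 = 1.0 and decays polynomially over T = 10000 rounds.
```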