Contextual Linear Bandits with Delay as Payoff

Authors: Mengxiao Zhang, Yingfei Wang, Haipeng Luo

ICML 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | In this section, we implement and evaluate our algorithm for both the delay-as-loss and delay-as-reward settings. For simplicity, we only consider the non-contextual setting. Since there are no existing algorithms for this problem (to the best of our knowledge), we consider three simple benchmarks. ... Results: In Figure 1, we plot the mean and the standard deviation of the regret over 8 independent experiments with different random seeds, for each n ∈ {6, 8, 10} (the columns) and both delay-as-loss and delay-as-reward (the rows). We observe that our algorithm consistently outperforms all three benchmarks in all setups.
Researcher Affiliation | Academia | Mengxiao Zhang: ¹University of Iowa, ²University of Washington, ³University of Southern California. Correspondence to: Mengxiao Zhang <EMAIL>.
Pseudocode | Yes |
Algorithm 1: Phased Elimination via Volumetric Spanner for Linear Bandits with Delay-as-Loss
Algorithm 2: Reduction from Contextual Linear Bandits to Non-Contextual Linear Bandits (Hanna et al., 2023)
Algorithm 3: Phased Elimination via Volumetric Spanner for Linear Bandits with Delay-as-Loss with misspecification
Algorithm 4: Phased Elimination for Linear Bandits with Delay-as-Reward
Algorithm 5: LinUCB with Delayed Feedback
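To convey the overall structure shared by Algorithms 1, 3, and 4, the following is a generic phased-elimination skeleton for a finite linear action set. It is illustrative only: it explores active arms uniformly rather than via a volumetric spanner, it does not model delayed feedback, and all names (`phased_elimination`, `payoff_fn`, the pull counts and thresholds) are assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def phased_elimination(actions, payoff_fn, num_phases=8, base_pulls=32):
    """Generic phased elimination for a finite linear action set.

    Each phase: pull every active action several times, form a
    least-squares estimate of theta, then eliminate actions whose
    estimated payoff trails the best by more than a shrinking threshold.
    """
    actions = np.asarray(actions, dtype=float)
    active = list(range(len(actions)))
    for phase in range(num_phases):
        pulls = base_pulls * 2 ** phase          # explore more each phase
        X, y = [], []
        for i in active:
            for _ in range(pulls):
                X.append(actions[i])
                y.append(payoff_fn(actions[i]))
        theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
        est = actions[active] @ theta_hat
        threshold = 2.0 ** (-phase)              # confidence width, shrinking
        active = [i for i, e in zip(active, est) if e >= est.max() - threshold]
        if len(active) == 1:
            break
    return active
```

The paper's actual algorithms replace the uniform exploration with draws from a volumetric spanner of the active set and choose phase lengths to account for the delay-as-payoff structure.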
Open Source Code | No | The paper states: "Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments." and describes experiments in Section 5. However, there is no explicit statement about making the code open source, nor is a link to a repository provided.
Open Datasets | No | The experiments use synthetically generated data rather than a pre-existing dataset. Experiment setup: We set the dimension n ∈ {6, 8, 10} and the size of the action set |A| = 70. The model parameter θ is set to |ν|/‖ν‖₂, where ν is drawn from the n-dimensional standard normal distribution and |ν| denotes the entry-wise absolute value of ν, to make sure that θ ∈ R^n_+ ∩ B^n_2(1). Each action a ∈ A is constructed by first sampling a_i uniformly from [0, 1] for all i ∈ [n] and then normalizing it to unit ℓ₂-norm. When an action a_t is chosen at round t, the payoff u_t is drawn from a beta distribution with α = µ_{a_t} and β = 1 − µ_{a_t}.
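The synthetic instance described above can be reproduced with a short numpy sketch. The construction follows the stated setup (θ = |ν|/‖ν‖₂, actions sampled uniformly then ℓ₂-normalized); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_instance(n, num_actions=70):
    """Generate theta and the action set as in the described setup."""
    # theta = |nu| / ||nu||_2 with nu ~ N(0, I_n); taking entry-wise
    # absolute values keeps the l2-norm, so theta lies in the
    # nonnegative orthant intersected with the unit l2-ball.
    nu = rng.standard_normal(n)
    theta = np.abs(nu) / np.linalg.norm(nu)
    # Each action: entries uniform on [0, 1], then normalized to unit l2-norm.
    actions = rng.uniform(0.0, 1.0, size=(num_actions, n))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)
    return theta, actions

theta, actions = make_instance(n=6)
```

Because θ and every action are nonnegative unit vectors, each mean payoff ⟨θ, a⟩ lies in [0, 1], which is what makes the beta-distributed payoffs below well defined.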
Dataset Splits | No | The paper generates synthetic data for its experiments and mentions running "8 independent experiments with different random seeds." It does not describe traditional training/test/validation splits of a pre-existing dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments. It only describes the experimental setup in terms of parameters like dimension n, action set size, number of iterations, and maximum delay.
Software Dependencies | No | The paper mentions comparing their algorithm to "LinUCB", "OTFLinUCB", and "OTFLinTS", which are algorithms. However, it does not specify any software dependencies (e.g., programming languages, libraries, frameworks) with version numbers that would be needed to replicate the experiments.
Experiment Setup | Yes | Experiment setup: We set the dimension n ∈ {6, 8, 10} and the size of the action set |A| = 70. The model parameter θ is set to |ν|/‖ν‖₂, where ν is drawn from the n-dimensional standard normal distribution and |ν| denotes the entry-wise absolute value of ν, to make sure that θ ∈ R^n_+ ∩ B^n_2(1). Each action a ∈ A is constructed by first sampling a_i uniformly from [0, 1] for all i ∈ [n] and then normalizing it to unit ℓ₂-norm. When an action a_t is chosen at round t, the payoff u_t is drawn from a beta distribution with α = µ_{a_t} and β = 1 − µ_{a_t}. The number of iterations T is 16000 and the maximal possible delay D is 1000. For simplicity, we also ignore the role of B in our algorithms. ... Results: In Figure 1, we plot the mean and the standard deviation of the regret over 8 independent experiments with different random seeds, for each n ∈ {6, 8, 10} (the columns) and both delay-as-loss and delay-as-reward (the rows).
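The beta payoff model has a convenient property: Beta(α, β) has mean α/(α + β), so with α = µ_{a_t} and β = 1 − µ_{a_t} the expected payoff is exactly µ_{a_t} = ⟨θ, a_t⟩. A minimal sketch of the payoff draw (names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_payoff(theta, action):
    """Draw one payoff from Beta(mu, 1 - mu) with mu = <theta, action>.

    Since Beta(a, b) has mean a / (a + b), E[payoff] = mu as intended.
    Requires 0 < mu < 1, which holds for the construction above.
    """
    mu = float(theta @ action)
    return rng.beta(mu, 1.0 - mu)

# Example: both vectors nonnegative with unit l2-norm, so mu = 0.96.
theta = np.array([0.6, 0.8])
action = np.array([0.8, 0.6])
samples = np.array([draw_payoff(theta, action) for _ in range(20000)])
```

The empirical mean of `samples` concentrates around µ = 0.96, confirming that the beta parameterization leaves the linear mean-payoff structure intact.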