Contextual Linear Bandits with Delay as Payoff

Authors: Mengxiao Zhang, Yingfei Wang, Haipeng Luo

ICML 2025

Reproducibility Variable Result LLM Response
Research Type | Experimental | In this section, we implement and evaluate our algorithm for both the delay-as-loss and delay-as-reward settings. For simplicity, we only consider the non-contextual setting. Since there are no existing algorithms for this problem (to the best of our knowledge), we consider three simple benchmarks. ... Results: In Figure 1, we plot the mean and the standard deviation of the regret over 8 independent experiments with different random seeds, for each n ∈ {6, 8, 10} (the columns) and both delay-as-loss and delay-as-reward (the rows). We observe that our algorithm consistently outperforms all three benchmarks in all setups.
Researcher Affiliation | Academia | Mengxiao Zhang: ¹University of Iowa, ²University of Washington, ³University of Southern California. Correspondence to: Mengxiao Zhang <EMAIL>.
Pseudocode | Yes |
Algorithm 1: Phased Elimination via Volumetric Spanner for Linear Bandits with Delay-as-Loss
Algorithm 2: Reduction from Contextual Linear Bandits to Non-Contextual Linear Bandits (Hanna et al., 2023)
Algorithm 3: Phased Elimination via Volumetric Spanner for Linear Bandits with Delay-as-Loss with misspecification
Algorithm 4: Phased Elimination for Linear Bandits with Delay-as-Reward
Algorithm 5: LinUCB with Delayed Feedback
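To convey the overall structure shared by Algorithms 1, 3, and 4, the following is a generic phased-elimination skeleton for a finite linear action set. It is illustrative only: it explores active arms uniformly rather than via a volumetric spanner, it does not model delayed feedback, and all names (`phased_elimination`, `payoff_fn`, the pull counts and thresholds) are assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def phased_elimination(actions, payoff_fn, num_phases=8, base_pulls=32):
    """Generic phased elimination for a finite linear action set.

    Each phase: pull every active action several times, form a
    least-squares estimate of theta, then eliminate actions whose
    estimated payoff trails the best by more than a shrinking threshold.
    """
    actions = np.asarray(actions, dtype=float)
    active = list(range(len(actions)))
    for phase in range(num_phases):
        pulls = base_pulls * 2 ** phase          # explore more each phase
        X, y = [], []
        for i in active:
            for _ in range(pulls):
                X.append(actions[i])
                y.append(payoff_fn(actions[i]))
        theta_hat, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
        est = actions[active] @ theta_hat
        threshold = 2.0 ** (-phase)              # confidence width, shrinking
        active = [i for i, e in zip(active, est) if e >= est.max() - threshold]
        if len(active) == 1:
            break
    return active
```

The paper's actual algorithms replace the uniform exploration with draws from a volumetric spanner of the active set and choose phase lengths to account for the delay-as-payoff structure.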
Open Source Code | No | The paper states: "Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments." and describes experiments in Section 5. However, there is no explicit statement about making the code open source, nor is a link to a repository provided.
Open Datasets | No | The experiments use synthetically generated data rather than a pre-existing dataset. Experiment setup: We set the dimension n ∈ {6, 8, 10} and the size of the action set |A| = 70. The model parameter θ is set to |ν|/‖ν‖₂, where ν is drawn from the n-dimensional standard normal distribution and |ν| denotes the entry-wise absolute value of ν, to make sure that θ ∈ R^n_+ ∩ B^n_2(1). Each action a ∈ A is constructed by first sampling a_i uniformly from [0, 1] for all i ∈ [n] and then normalizing it to unit ℓ₂-norm. When an action a_t is chosen at round t, the payoff u_t is drawn from a beta distribution with α = µ_{a_t} and β = 1 − µ_{a_t}.
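The synthetic instance described above can be reproduced with a short numpy sketch. The construction follows the stated setup (θ = |ν|/‖ν‖₂, actions sampled uniformly then ℓ₂-normalized); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_instance(n, num_actions=70):
    """Generate theta and the action set as in the described setup."""
    # theta = |nu| / ||nu||_2 with nu ~ N(0, I_n); taking entry-wise
    # absolute values keeps the l2-norm, so theta lies in the
    # nonnegative orthant intersected with the unit l2-ball.
    nu = rng.standard_normal(n)
    theta = np.abs(nu) / np.linalg.norm(nu)
    # Each action: entries uniform on [0, 1], then normalized to unit l2-norm.
    actions = rng.uniform(0.0, 1.0, size=(num_actions, n))
    actions /= np.linalg.norm(actions, axis=1, keepdims=True)
    return theta, actions

theta, actions = make_instance(n=6)
```

Because θ and every action are nonnegative unit vectors, each mean payoff ⟨θ, a⟩ lies in [0, 1], which is what makes the beta-distributed payoffs below well defined.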
Dataset Splits | No | The paper generates synthetic data for its experiments and mentions running "8 independent experiments with different random seeds." It does not describe traditional training/test/validation splits of a pre-existing dataset.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory) used for running the experiments. It only describes the experimental setup in terms of parameters like dimension n, action set size, number of iterations, and maximum delay.
Software Dependencies | No | The paper mentions comparing their algorithm to "LinUCB", "OTFLinUCB", and "OTFLinTS", which are algorithms. However, it does not specify any software dependencies (e.g., programming languages, libraries, frameworks) with version numbers that would be needed to replicate the experiments.
Experiment Setup | Yes | Experiment setup: We set the dimension n ∈ {6, 8, 10} and the size of the action set |A| = 70. The model parameter θ is set to |ν|/‖ν‖₂, where ν is drawn from the n-dimensional standard normal distribution and |ν| denotes the entry-wise absolute value of ν, to make sure that θ ∈ R^n_+ ∩ B^n_2(1). Each action a ∈ A is constructed by first sampling a_i uniformly from [0, 1] for all i ∈ [n] and then normalizing it to unit ℓ₂-norm. When an action a_t is chosen at round t, the payoff u_t is drawn from a beta distribution with α = µ_{a_t} and β = 1 − µ_{a_t}. The number of iterations T is 16000 and the maximal possible delay D is 1000. For simplicity, we also ignore the role of B in our algorithms. ... Results: In Figure 1, we plot the mean and the standard deviation of the regret over 8 independent experiments with different random seeds, for each n ∈ {6, 8, 10} (the columns) and both delay-as-loss and delay-as-reward (the rows).
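The beta payoff model has a convenient property: Beta(α, β) has mean α/(α + β), so with α = µ_{a_t} and β = 1 − µ_{a_t} the expected payoff is exactly µ_{a_t} = ⟨θ, a_t⟩. A minimal sketch of the payoff draw (names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_payoff(theta, action):
    """Draw one payoff from Beta(mu, 1 - mu) with mu = <theta, action>.

    Since Beta(a, b) has mean a / (a + b), E[payoff] = mu as intended.
    Requires 0 < mu < 1, which holds for the construction above.
    """
    mu = float(theta @ action)
    return rng.beta(mu, 1.0 - mu)

# Example: both vectors nonnegative with unit l2-norm, so mu = 0.96.
theta = np.array([0.6, 0.8])
action = np.array([0.8, 0.6])
samples = np.array([draw_payoff(theta, action) for _ in range(20000)])
```

The empirical mean of `samples` concentrates around µ = 0.96, confirming that the beta parameterization leaves the linear mean-payoff structure intact.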