Contextual Bandits with Stage-wise Constraints
Authors: Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also prove a lower bound for this constrained problem, show how our algorithms and analyses can be extended to multiple constraints, and provide simulations to validate our theoretical results. In the high-probability setting, we describe the minimum requirements on the action set for our algorithm to be tractable. In the setting where the constraint is in expectation, we specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting with regret analysis. Finally, we extend our results to the case where the reward and cost functions are both non-linear. We propose an algorithm for this case and prove a regret bound for it that characterizes the function class complexity by the eluder dimension. Keywords: multi-armed bandits, contextual bandits, constraints, safety, eluder dimension |
| Researcher Affiliation | Collaboration | Aldo Pacchiano EMAIL Faculty of Computing & Data Sciences Boston University Boston, MA, USA Mohammad Ghavamzadeh EMAIL Amazon AGI Sunnyvale, CA, USA Peter Bartlett EMAIL Department of Electrical Engineering and Computer Sciences University of California Berkeley, CA, USA |
| Pseudocode | Yes | Algorithm 1 Linear Constraint Linear UCB (LC-LUCB) |
| Open Source Code | No | The paper does not explicitly state that the code for the methodology is open-sourced, nor does it provide a link to a code repository. |
| Open Datasets | No | In our first experiment, presented in Figure 1, we consider a linear bandit problem in which the safe action is the zero vector x0 = 0 and the arm sets, At, are 10-dimensional star-convex sets generated by the 10 cyclically shifted versions of the vector v/‖v‖, where v = (0, 1, . . . , 9). For all t, the action set At is the star-convex set defined by this set of actions and the lines emanating from the zero vector. We set θ = v/‖v‖, where v = (0, 1, . . . , 9), and µ = (9, 8, . . . , 0)/‖v‖, the ℓ2-normalized versions of v and (9, . . . , 0). In the experiment presented in Figure 3, we produce random instances of our constrained multi-armed bandit problem. In all the instances, we set the safe arm to have reward and cost 0. We generate different problem instances by sampling the Bernoulli mean rewards and costs of the rest of the arms uniformly at random from the interval [0, 1]. |
| Dataset Splits | No | The paper describes simulated experiments where data instances are generated, rather than using pre-existing datasets with defined train/test/validation splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In all our experiments, we run a regularized least-squares regression by setting λ = 1. In our first experiment, presented in Figure 1, we consider a linear bandit problem in which the safe action is the zero vector x0 = 0 and the arm sets, At, are 10-dimensional star-convex sets generated by the 10 cyclically shifted versions of the vector v/‖v‖, where v = (0, 1, . . . , 9). For all t, the action set At is the star-convex set defined by this set of actions and the lines emanating from the zero vector. We set θ = v/‖v‖, where v = (0, 1, . . . , 9), and µ = (9, 8, . . . , 0)/‖v‖, the ℓ2-normalized versions of v and (9, . . . , 0). In Figure 1, we plot the regret and cost evolution of LC-LUCB for different threshold values τ, and compare them with those for the Safe-LTS algorithm of Moradipari et al. (2019). The safe action is the zero vector and each plot is an average over 10 runs. |
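The linear-bandit instance described in the Experiment Setup row can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: the helper `cyclic_shifts` and all variable names are our own, and only the quantities named in the table (v = (0, 1, . . . , 9), θ = v/‖v‖, µ = (9, 8, . . . , 0)/‖v‖, the safe action x0 = 0, and the 10 cyclically shifted arms) are taken from the source.

```python
import numpy as np

def cyclic_shifts(v):
    """Return the d cyclic shifts of vector v as rows of a d x d array."""
    return np.stack([np.roll(v, k) for k in range(len(v))])

d = 10
v = np.arange(d, dtype=float)          # v = (0, 1, ..., 9)
theta = v / np.linalg.norm(v)          # reward parameter: l2-normalized v
mu = v[::-1] / np.linalg.norm(v)       # cost parameter: (9, 8, ..., 0)/||v||

# Extreme points of the star-convex arm set A_t: the 10 cyclic shifts
# of v/||v||. The full set also contains the segments joining each of
# these points to the safe action x0 = 0 (zero reward and zero cost).
arms = cyclic_shifts(theta)
x0 = np.zeros(d)

rewards = arms @ theta                 # expected reward of each extreme arm
costs = arms @ mu                      # expected cost of each extreme arm
```

Under this construction the arm equal to θ itself has expected reward ‖θ‖² = 1, the maximum over the extreme points, while the safe action x0 contributes zero reward and zero cost, matching the description in the table.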