Contextual Bandits with Stage-wise Constraints
Authors: Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We also prove a lower bound for this constrained problem, show how our algorithms and analyses can be extended to multiple constraints, and provide simulations to validate our theoretical results. In the high-probability setting, we describe the minimum requirements on the action set for our algorithm to be tractable. In the setting where the constraint is in expectation, we specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting with regret analysis. Finally, we extend our results to the case where the reward and cost functions are both non-linear. We propose an algorithm for this case and prove a regret bound for it that characterizes the function class complexity by the eluder dimension. Keywords: multi-armed bandits, contextual bandits, constraints, safety, eluder dimension |
| Researcher Affiliation | Collaboration | Aldo Pacchiano EMAIL Faculty of Computing & Data Sciences Boston University Boston, MA, USA Mohammad Ghavamzadeh EMAIL Amazon AGI Sunnyvale, CA, USA Peter Bartlett EMAIL Department of Electrical Engineering and Computer Sciences University of California Berkeley, CA, USA |
| Pseudocode | Yes | Algorithm 1 Linear Constraint Linear UCB (LC-LUCB) |
| Open Source Code | No | The paper does not explicitly state that the code for the methodology is open-sourced, nor does it provide a link to a code repository. |
| Open Datasets | No | In our first experiment, presented in Figure 1, we consider a linear bandit problem in which the safe action is the zero vector x0 = 0 and the arm sets, At, are 10-dimensional star-convex sets generated by the 10 cyclically shifted versions of the vector v/‖v‖, where v = (0, 1, . . . , 9). For all t, the action set At is the star-convex set defined by this set of actions and the lines emanating from the zero vector. We set θ = v/‖v‖, where v = (0, 1, . . . , 9), and µ = (9, 8, . . . , 0)/‖v‖, the ℓ2-normalized versions of v and (9, . . . , 0). In the experiment presented in Figure 3, we produce random instances of our constrained multi-armed bandit problem. In all the instances, we set the safe arm to have reward and cost 0. We generate different problem instances by sampling the Bernoulli mean rewards and costs of the rest of the arms uniformly at random from the interval [0, 1]. |
| Dataset Splits | No | The paper describes simulated experiments where data instances are generated, rather than using pre-existing datasets with defined train/test/validation splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | In all our experiments, we run a regularized least-squares regression by setting λ = 1. In our first experiment, presented in Figure 1, we consider a linear bandit problem in which the safe action is the zero vector x0 = 0 and the arm sets, At, are 10-dimensional star-convex sets generated by the 10 cyclically shifted versions of the vector v/‖v‖, where v = (0, 1, . . . , 9). For all t, the action set At is the star-convex set defined by this set of actions and the lines emanating from the zero vector. We set θ = v/‖v‖, where v = (0, 1, . . . , 9), and µ = (9, 8, . . . , 0)/‖v‖, the ℓ2-normalized versions of v and (9, . . . , 0). In Figure 1, we plot the regret and cost evolution of LC-LUCB for different threshold values τ, and compare them with those for the Safe-LTS algorithm of Moradipari et al. (2019). The safe action is the zero vector and each plot is an average over 10 runs. |
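The linear-bandit instance described in the Experiment Setup row can be sketched in a few lines. This is a hedged reconstruction, not the authors' code: the helper `cyclic_shifts` and all variable names are our own, and only the quantities named in the table (v = (0, 1, . . . , 9), θ = v/‖v‖, µ = (9, 8, . . . , 0)/‖v‖, the safe action x0 = 0, and the 10 cyclically shifted arms) are taken from the source.

```python
import numpy as np

def cyclic_shifts(v):
    """Return the d cyclic shifts of vector v as rows of a d x d array."""
    return np.stack([np.roll(v, k) for k in range(len(v))])

d = 10
v = np.arange(d, dtype=float)          # v = (0, 1, ..., 9)
theta = v / np.linalg.norm(v)          # reward parameter: l2-normalized v
mu = v[::-1] / np.linalg.norm(v)       # cost parameter: (9, 8, ..., 0)/||v||

# Extreme points of the star-convex arm set A_t: the 10 cyclic shifts
# of v/||v||. The full set also contains the segments joining each of
# these points to the safe action x0 = 0 (zero reward and zero cost).
arms = cyclic_shifts(theta)
x0 = np.zeros(d)

rewards = arms @ theta                 # expected reward of each extreme arm
costs = arms @ mu                      # expected cost of each extreme arm
```

Under this construction the arm equal to θ itself has expected reward ‖θ‖² = 1, the maximum over the extreme points, while the safe action x0 contributes zero reward and zero cost, matching the description in the table.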