Bandits with Mean Bounds
Authors: Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical Evaluations: We compare the regret performance of R-OFUL with the vanilla OFUL algorithm empirically in Figure 2. For this, we sample a random set of 100 arms in R^10 and sample 5, 10, 15 of these in each round. We set θ as the vector 0, 1/10 and normalize it. To generate rewards, for each arm a ∈ A_t, we first sample a Gaussian random variable with mean ⟨a, θ⟩ and variance 0.8, and clip this to be in [0, 1] ([−1, 0]) if ⟨a, θ⟩ > 0 (≤ 0) to form our bounded rewards. The upper and lower bounds on the rewards are generated separately to be away from the mean by a uniform random variable in [0, 0.5], and these are also clipped the same way as the rewards. That is, arms with positive (non-positive) mean reward are forced to have bounds in [0, 1] ([−1, 0]). The subgaussian upper bound for OFUL is provided as 0.5, the worst-case subgaussian factor for any random variable bounded in either [0, 1] or [−1, 0], while R-OFUL computes tighter subgaussian factors based on the mean bounds and uses γ as in Equation 5 to choose arms. We also set the norm bounds on θ to m = 0, M = 1. We average our results over 200 independent runs. We see that R-OFUL consistently achieves much lower regret than the vanilla variant. Further, we also observe that empirically, R-OFUL restricts arms and only chooses between 2-3 arms per round on average. This leads to a 6-6.7× computation speed-up compared to vanilla OFUL. |
| Researcher Affiliation | Collaboration | Nihal Sharma EMAIL The University of Texas at Austin Soumya Basu EMAIL Google Karthikeyan Shanmugam EMAIL Google Deep Mind Sanjay Shakkottai EMAIL The University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1 Restricted-set Optimism in the Face of Uncertainty for Linear bandits with mean bounds ... Algorithm 2 GLobal Under-Explore (GLUE) for MABs with Mean Bounds |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of its own code. |
| Open Datasets | Yes | For this set of experiments, we use the Movielens 1M dataset of Harper & Konstan (2015). |
| Dataset Splits | No | Section 6.3 describes how the Movielens 1M dataset is processed and used (e.g., clustering age, combining occupation attributes, sampling movies), and for linear bandits, it mentions collecting "logs of size 10^5 to approximate the infinite logs". However, it does not specify explicit training/testing/validation splits in terms of percentages or sample counts for machine learning model evaluation. |
| Hardware Specification | No | The paper discusses computational efficiency and speed-ups (e.g., "6-6.7 computation speed up"), but it does not specify any particular hardware (GPU, CPU models, memory, etc.) used for running the experiments or achieving these results. |
| Software Dependencies | No | The paper mentions using the "Soft Impute algorithm from Mazumder et al. (2010)" but does not specify a version number for this or any other software dependency used in the implementation or experimentation. |
| Experiment Setup | Yes | Empirical Evaluations: We compare the regret performance of R-OFUL with the vanilla OFUL algorithm empirically in Figure 2. For this, we sample a random set of 100 arms in R^10 and sample 5, 10, 15 of these in each round. We set θ as the vector 0, 1/10 and normalize it. To generate rewards, for each arm a ∈ A_t, we first sample a Gaussian random variable with mean ⟨a, θ⟩ and variance 0.8, and clip this to be in [0, 1] ([−1, 0]) if ⟨a, θ⟩ > 0 (≤ 0) to form our bounded rewards. The upper and lower bounds on the rewards are generated separately to be away from the mean by a uniform random variable in [0, 0.5], and these are also clipped the same way as the rewards. That is, arms with positive (non-positive) mean reward are forced to have bounds in [0, 1] ([−1, 0]). The subgaussian upper bound for OFUL is provided as 0.5, the worst-case subgaussian factor for any random variable bounded in either [0, 1] or [−1, 0], while R-OFUL computes tighter subgaussian factors based on the mean bounds and uses γ as in Equation 5 to choose arms. We also set the norm bounds on θ to m = 0, M = 1. We average our results over 200 independent runs. ... For this set of experiments [Linear Bandits], we use a synthetic setup: We first fix θ_z, θ_u = (1, 2, 3, 4, 5)^T ∈ R^5. We normalize θ_z to have norm 1, and θ_u to have norm 0.9. For each of the 12 arms, we draw a_{k,z} from a Folded Normal distribution (|X| is folded normal if X is normally distributed) in R^5. We set a_{1,u} to be in a ball of radius 0.1 around θ_u at random, and for all other k, a_{k,u} is set to be a normalized 2-sparse vector in R^5. With these, we generate logs by uniformly sampling θ_u(t) in a ball of radius 0.1 around θ_u independently for each t, then picking the best arm for this θ_u(t). Then we sample a noisy latent reward uniformly at random from a symmetric interval around ⟨θ_u(t), a_{k(θ_u(t)),u}⟩, ensuring that it is in [0, 1]. The reward at each time is given by the sum of this noisy latent reward and ⟨θ_z, a_{k(θ_u(t)),u}⟩. We collect logs of size 10^5 to approximate the infinite logs. |
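The reward- and bound-generation scheme quoted above (sign-dependent clipping of a Gaussian latent reward, with mean bounds offset by a uniform perturbation) can be sketched as follows. This is a minimal illustration, not the authors' code: the variable names, the unit-norm arm sampling, and the random seed are our assumptions; only the dimensions, clipping ranges, and noise parameters come from the quoted setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 100 random unit-norm arms in R^10 and a normalized theta,
# following the dimensions stated in the quoted experiment description.
d, n_arms = 10, 100
arms = rng.standard_normal((n_arms, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)

def clip_by_sign(x, mean):
    # Arms with positive mean reward are clipped to [0, 1];
    # arms with non-positive mean reward are clipped to [-1, 0].
    return np.clip(x, 0.0, 1.0) if mean > 0 else np.clip(x, -1.0, 0.0)

def sample_reward_and_bounds(a):
    mean = a @ theta
    # Gaussian latent reward with variance 0.8, clipped to stay bounded.
    reward = clip_by_sign(rng.normal(mean, np.sqrt(0.8)), mean)
    # Mean bounds offset from the mean by Uniform[0, 0.5] draws,
    # then clipped the same way as the rewards.
    lower = clip_by_sign(mean - rng.uniform(0.0, 0.5), mean)
    upper = clip_by_sign(mean + rng.uniform(0.0, 0.5), mean)
    return reward, lower, upper

rewards, lowers, uppers = zip(*(sample_reward_and_bounds(a) for a in arms))
```

Because the arms and θ are unit-norm, every mean ⟨a, θ⟩ lies in [−1, 1], so the sign-dependent clipping guarantees that each lower bound stays below its upper bound and that all rewards land in [0, 1] or [−1, 0] as described.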