Bandits with Mean Bounds
Authors: Nihal Sharma, Soumya Basu, Karthikeyan Shanmugam, Sanjay Shakkottai
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical Evaluations: We compare the regret performance of R-OFUL with the vanilla OFUL algorithm empirically in Figure 2. For this, we sample a random set of 100 arms in R^10 and sample 5, 10, 15 of these in each round. We set θ as the vector 0, 1/10 and normalize it. To generate rewards, for each arm a ∈ A_t, we first sample a Gaussian random variable with mean ⟨a, θ⟩ and variance 0.8, and clip this to be in [0, 1] ([−1, 0]) if ⟨a, θ⟩ > 0 (≤ 0) to form our bounded rewards. The upper and lower bounds on the rewards are generated separately to be away from the mean by a uniform random variable in [0, 0.5], and these are also clipped the same way as the rewards. That is, arms with positive (non-positive) mean reward are forced to have bounds in [0, 1] ([−1, 0]). The subgaussian upper bound for OFUL is provided as 0.5, the worst-case subgaussian factor for any random variable bounded in either [0, 1] or [−1, 0], while R-OFUL computes tighter subgaussian factors based on the mean bounds and uses γ as in Equation 5 to choose arms. We also set the norm bounds on θ to m = 0, M = 1. We average our results over 200 independent runs. We see that R-OFUL consistently achieves much lower regret than the vanilla variant. Further, we also observe that empirically, R-OFUL restricts arms and only chooses between 2-3 arms per round on average. This leads to a 6-6.7× computation speed-up compared to vanilla OFUL. |
| Researcher Affiliation | Collaboration | Nihal Sharma EMAIL The University of Texas at Austin Soumya Basu EMAIL Google Karthikeyan Shanmugam EMAIL Google Deep Mind Sanjay Shakkottai EMAIL The University of Texas at Austin |
| Pseudocode | Yes | Algorithm 1 Restricted-set Optimism in the Face of Uncertainty for Linear bandits with mean bounds ... Algorithm 2 GLobal Under-Explore (GLUE) for MABs with Mean Bounds |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of its own code. |
| Open Datasets | Yes | For this set of experiments, we use the Movielens 1M dataset of Harper & Konstan (2015). |
| Dataset Splits | No | Section 6.3 describes how the Movielens 1M dataset is processed and used (e.g., clustering age, combining occupation attributes, sampling movies), and for linear bandits, it mentions collecting "logs of size 10^5 to approximate the infinite logs". However, it does not specify explicit training/testing/validation splits in terms of percentages or sample counts for machine learning model evaluation. |
| Hardware Specification | No | The paper discusses computational efficiency and speed-ups (e.g., "6-6.7 computation speed up"), but it does not specify any particular hardware (GPU, CPU models, memory, etc.) used for running the experiments or achieving these results. |
| Software Dependencies | No | The paper mentions using the "Soft Impute algorithm from Mazumder et al. (2010)" but does not specify a version number for this or any other software dependency used in the implementation or experimentation. |
| Experiment Setup | Yes | Empirical Evaluations: We compare the regret performance of R-OFUL with the vanilla OFUL algorithm empirically in Figure 2. For this, we sample a random set of 100 arms in R^10 and sample 5, 10, 15 of these in each round. We set θ as the vector 0, 1/10 and normalize it. To generate rewards, for each arm a ∈ A_t, we first sample a Gaussian random variable with mean ⟨a, θ⟩ and variance 0.8, and clip this to be in [0, 1] ([−1, 0]) if ⟨a, θ⟩ > 0 (≤ 0) to form our bounded rewards. The upper and lower bounds on the rewards are generated separately to be away from the mean by a uniform random variable in [0, 0.5], and these are also clipped the same way as the rewards. That is, arms with positive (non-positive) mean reward are forced to have bounds in [0, 1] ([−1, 0]). The subgaussian upper bound for OFUL is provided as 0.5, the worst-case subgaussian factor for any random variable bounded in either [0, 1] or [−1, 0], while R-OFUL computes tighter subgaussian factors based on the mean bounds and uses γ as in Equation 5 to choose arms. We also set the norm bounds on θ to m = 0, M = 1. We average our results over 200 independent runs. ... For this set of experiments [Linear Bandits], we use a synthetic setup: We first fix θ_z, θ_u = (1, 2, 3, 4, 5)^T ∈ R^5. We normalize θ_z to have norm 1, and θ_u to have norm 0.9. For each of the 12 arms, we draw a_{k,z} from a Folded Normal distribution (|X| is folded normal if X is normally distributed) in R^5. We set a_{1,u} to be in a ball of radius 0.1 around θ_u at random, and for all other k, a_{k,u} is set to be a normalized 2-sparse vector in R^5. With these, we generate logs by uniformly sampling θ_u(t) in a ball of radius 0.1 around θ_u independently for each t, then picking the best arm for this θ_u(t). Then we sample a noisy latent reward uniformly at random from a symmetric interval around ⟨θ_u(t), a_{k(θ_u(t)),u}⟩, ensuring that it is in [0, 1]. The reward at each time is given by the sum of this noisy latent reward and ⟨θ_z, a_{k(θ_u(t)),u}⟩. We collect logs of size 10^5 to approximate the infinite logs. |
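The reward- and bound-generation scheme quoted above (sign-dependent clipping of a Gaussian latent reward, with mean bounds offset by a uniform perturbation) can be sketched as follows. This is a minimal illustration, not the authors' code: the variable names, the unit-norm arm sampling, and the random seed are our assumptions; only the dimensions, clipping ranges, and noise parameters come from the quoted setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: 100 random unit-norm arms in R^10 and a normalized theta,
# following the dimensions stated in the quoted experiment description.
d, n_arms = 10, 100
arms = rng.standard_normal((n_arms, d))
arms /= np.linalg.norm(arms, axis=1, keepdims=True)
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)

def clip_by_sign(x, mean):
    # Arms with positive mean reward are clipped to [0, 1];
    # arms with non-positive mean reward are clipped to [-1, 0].
    return np.clip(x, 0.0, 1.0) if mean > 0 else np.clip(x, -1.0, 0.0)

def sample_reward_and_bounds(a):
    mean = a @ theta
    # Gaussian latent reward with variance 0.8, clipped to stay bounded.
    reward = clip_by_sign(rng.normal(mean, np.sqrt(0.8)), mean)
    # Mean bounds offset from the mean by Uniform[0, 0.5] draws,
    # then clipped the same way as the rewards.
    lower = clip_by_sign(mean - rng.uniform(0.0, 0.5), mean)
    upper = clip_by_sign(mean + rng.uniform(0.0, 0.5), mean)
    return reward, lower, upper

rewards, lowers, uppers = zip(*(sample_reward_and_bounds(a) for a in arms))
```

Because the arms and θ are unit-norm, every mean ⟨a, θ⟩ lies in [−1, 1], so the sign-dependent clipping guarantees that each lower bound stays below its upper bound and that all rewards land in [0, 1] or [−1, 0] as described.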