Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards

Authors: Junya Honda, Akimichi Takemura

JMLR 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we consider a stochastic multiarmed bandit problem. It is known in this problem that the Deterministic Minimum Empirical Divergence (DMED) policy achieves the asymptotic theoretical bound for the model where each reward distribution is supported in a known bounded interval, say [0, 1]. However, the regret bound of DMED is described in an asymptotic form and the performance in finite time has been unknown. We modify this policy and derive a finite-time regret bound for the new policy, Indexed Minimum Empirical Divergence (IMED), by refining large deviation probabilities to a simple non-asymptotic form. Further, the refined analysis reveals that the finite-time regret bound is valid even in the case that the reward is not bounded from below. Therefore, our finite-time result applies to the case that the minimum reward (that is, the maximum loss) is unknown or unbounded. We also present some simulation results which show that IMED considerably improves on DMED and performs competitively with other state-of-the-art policies.
Researcher Affiliation | Academia | Junya Honda (EMAIL), Department of Complexity Science and Engineering, The University of Tokyo, Kashiwa-shi, Chiba 277-8561, Japan; Akimichi Takemura (EMAIL), Department of Mathematical Informatics, The University of Tokyo, Bunkyo-ku, Tokyo 113-8561, Japan
Pseudocode | Yes | Algorithm 1 (IMED Policy). Initialization: pull each arm once. Loop: choose an arm i minimizing I_i(n) = T_i(n) D_inf(F̂_i(n), µ̂*(n); A) + log T_i(n), where the tie-breaking rule is arbitrary.
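For the Bernoulli model, D_inf(F̂_i(n), µ̂*(n); A) reduces to the Bernoulli KL divergence kl(µ̂_i(n), µ̂*(n)), so the index above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names are our own:

```python
import math

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_choice(mu_hat, counts):
    """Return the arm minimizing I_i(n) = T_i(n) * kl(mu_hat_i, mu*) + log T_i(n)."""
    mu_star = max(mu_hat)
    index = [t * bern_kl(m, mu_star) + math.log(t)
             for m, t in zip(mu_hat, counts)]
    return min(range(len(index)), key=index.__getitem__)

# A well-explored inferior arm is skipped; an under-explored one is revisited.
print(imed_choice([0.5, 0.3], [10, 10]))    # current best arm: 0
print(imed_choice([0.5, 0.3], [1000, 5]))   # under-sampled arm: 1
```

Note that the empirically best arm always has index log T_i(n) (its divergence term vanishes), so IMED trades off exploitation against revisiting arms whose divergence-weighted sample count is still small.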
Open Source Code | No | The paper describes the IMED policy and presents simulation results but does not provide any explicit statement of, or link to, an open-source implementation.
Open Datasets | Yes | First, Fig. 1 shows simulation results of IMED, DMED, TS, kl-UCB and kl-UCB+ for a ten-armed bandit with Bernoulli rewards with µ1 = 0.1, µ2 = µ3 = µ4 = 0.05, µ5 = µ6 = µ7 = 0.02, µ8 = µ9 = µ10 = 0.01, which is the same setting as those in Kaufmann et al. (2012b) and Cappé et al. (2013). Next, we consider the case that the time-delay X'_i for some task by the i-th agent follows an exponential distribution with density e^{-x/µ'_i}/µ'_i, x ≥ 0, and the player tries to minimize the cumulative delay. Since we modeled the reward as a random variable in (-∞, 1], we set the reward as X_i = 1 - X'_i, that is, X_i has density e^{-(1-x)/µ'_i}/µ'_i = e^{-(1-x)/(1-µ_i)}/(1-µ_i), x ≤ 1, with expectation µ_i = 1 - µ'_i. Fig. 2 shows simulation results for a 5-armed bandit with µ'_i = 1/5, 1/4, 1/3, 1/2, 1, that is, µ_i = 4/5, 3/4, 2/3, 1/2, 0. We used IMED, DMED, KL-UCB, KL-UCB+ for A, and KL-UCB for the (shifted) exponential distributions, which we refer to as kl-exp-UCB, where the KL divergence is written as D(µ̂_i‖µ) = (1-µ̂_i)/(1-µ) - 1 - log((1-µ̂_i)/(1-µ)). Finally, Figs. 3 and 4 show results of IMED, DMED, KL-UCB and KL-UCB+ for truncated normal distributions on [0, 1] and (-∞, 1], respectively, as examples of multiparameter models.
Dataset Splits | No | The paper describes different bandit problem settings with specified reward distributions and parameters, which implicitly define the data-generation process for the simulations. However, it does not mention traditional training/validation/test dataset splits or their percentages, as typically found in supervised learning tasks. It does state: 'Each plot is an average over 10,000 trials.'
Hardware Specification | No | The paper presents simulation results for various bandit scenarios but does not provide any specific details about the hardware (e.g., CPU, GPU, memory) used to run the simulations.
Software Dependencies | No | The paper describes the IMED algorithm and discusses other bandit policies. While it outlines the mathematical framework and experimental setup (e.g., exploration functions), it does not specify any software, programming languages, or libraries with version numbers used for the implementation or simulations.
Experiment Setup | Yes | In this section we give some simulation results for IMED, DMED, Thompson sampling (TS) and the KL-UCB family. For the KL-UCB family, we use f(n) = log n as an exploration function for (10), since the asymptotic optimality is shown in Burnetas and Katehakis (1996) for some models and it is empirically recommended in Cappé et al. (2013), although the latter paper uses f(n) = log n + c log log n for some c > 0 in the proof of the optimality. The kl-UCB+ and KL-UCB+ (Garivier and Cappé, 2011) are empirical improvements of kl-UCB and KL-UCB, respectively, where f(n) = log n is replaced with f(n) = log(n/T_i(n)). The optimality analysis of these policies has not been given, but a similar version is discussed in Kaufmann (2014, Proposition 2.4) for some models. Each plot is an average over 10,000 trials. In the four figures given below, IMED and KL-UCB+ performed almost the best. Whereas the complexity of IMED is smaller than that of the KL-UCB family as discussed in Sect. 4.1, the regret of IMED was slightly worse than that of KL-UCB+. First, Fig. 1 shows simulation results of IMED, DMED, TS, kl-UCB and kl-UCB+ for a ten-armed bandit with Bernoulli rewards with µ1 = 0.1, µ2 = µ3 = µ4 = 0.05, µ5 = µ6 = µ7 = 0.02, µ8 = µ9 = µ10 = 0.01, which is the same setting as those in Kaufmann et al. (2012b) and Cappé et al. (2013). Next, we consider the case that the time-delay X'_i for some task by the i-th agent follows an exponential distribution with density e^{-x/µ'_i}/µ'_i, x ≥ 0, and the player tries to minimize the cumulative delay. Since we modeled the reward as a random variable in (-∞, 1], we set the reward as X_i = 1 - X'_i, that is, X_i has density e^{-(1-x)/µ'_i}/µ'_i = e^{-(1-x)/(1-µ_i)}/(1-µ_i), x ≤ 1, with expectation µ_i = 1 - µ'_i. Fig. 2 shows simulation results for a 5-armed bandit with µ'_i = 1/5, 1/4, 1/3, 1/2, 1, that is, µ_i = 4/5, 3/4, 2/3, 1/2, 0.
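The ten-armed Bernoulli experiment described above can be reproduced with a short, self-contained sketch. This is our own illustrative re-implementation, not the authors' code: for Bernoulli rewards the IMED divergence term D_inf reduces to the Bernoulli KL divergence, and the horizon, seed and helper names below are our own choices.

```python
import math
import random

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def run_imed(means, horizon, rng):
    """Run IMED on a Bernoulli bandit; return the pull counts per arm."""
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    # Initialization: pull each arm once.
    for i in range(k):
        counts[i] = 1
        sums[i] = float(rng.random() < means[i])
    for _ in range(horizon - k):
        mu_hat = [s / t for s, t in zip(sums, counts)]
        mu_star = max(mu_hat)
        index = [t * bern_kl(m, mu_star) + math.log(t)
                 for m, t in zip(mu_hat, counts)]
        i = min(range(k), key=index.__getitem__)
        counts[i] += 1
        sums[i] += float(rng.random() < means[i])
    return counts

# Setting of Fig. 1 (Kaufmann et al., 2012b; Cappé et al., 2013).
means = [0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01]
counts = run_imed(means, 10_000, random.Random(0))
regret = sum((max(means) - m) * t for m, t in zip(means, counts))
print(counts, regret)
```

The paper averages such runs over 10,000 trials; a single run like this only illustrates that the best arm quickly dominates the pull counts.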
We used IMED, DMED, KL-UCB, KL-UCB+ for A, and KL-UCB for the (shifted) exponential distributions, which we refer to as kl-exp-UCB, where the KL divergence is written as D(µ̂_i‖µ) = (1-µ̂_i)/(1-µ) - 1 - log((1-µ̂_i)/(1-µ)). Finally, Figs. 3 and 4 show results of IMED, DMED, KL-UCB and KL-UCB+ for truncated normal distributions on [0, 1] and (-∞, 1], respectively, as examples of multiparameter models. The cumulative distribution of each reward is given by F(x) = 0 for x < a, and F(x) = (Φ((x - µ'_i)/σ_i) - Φ((a - µ'_i)/σ_i)) / (Φ((1 - µ'_i)/σ_i) - Φ((a - µ'_i)/σ_i)) for a ≤ x < 1, where a = 0 or -∞, and Φ is the cumulative distribution function of the standard normal distribution. We also give results of kl-UCB and TS for the Bernoulli bandit for the setting of Fig. 3 where the reward is bounded. For each experiment we set expectations and variances before truncation as µ'_i = 0.6, 0.5, 0.5, 0.4, 0.4 and σ_i = 0.4, 0.2, 0.4, 0.2, 0.4. The expectation of each arm after truncation is given by µ_i = 0.519, 0.5, 0.5, 0.465, 0.481 for support [0, 1] and µ_i = 0.319, 0.390, 0.265, 0.320, 0.206 for support (-∞, 1].
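The kl-exp-UCB divergence for shifted exponentials on (-∞, 1] with means µ̂_i and µ, D(µ̂_i‖µ) = (1-µ̂_i)/(1-µ) - 1 - log((1-µ̂_i)/(1-µ)), is increasing in µ on [µ̂_i, 1), so the corresponding upper confidence bound can be found by bisection. The bound-solving routine and its tolerance below are our own illustrative choices, not from the paper:

```python
import math

def kl_exp(mu_hat, mu):
    """KL divergence between shifted exponentials on (-inf, 1] with means
    mu_hat and mu (both < 1): r - 1 - log r, where r = (1-mu_hat)/(1-mu)."""
    r = (1 - mu_hat) / (1 - mu)
    return r - 1 - math.log(r)

def kl_exp_ucb(mu_hat, t_i, n, tol=1e-9):
    """Largest mu < 1 with t_i * kl_exp(mu_hat, mu) <= log n, by bisection;
    valid because kl_exp(mu_hat, .) increases from 0 on [mu_hat, 1)."""
    target = math.log(n) / t_i
    lo, hi = mu_hat, 1.0 - 1e-12
    if kl_exp(mu_hat, hi) <= target:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_exp(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, an arm with empirical mean 0.5 after 10 pulls at time n = 100 gets a bound strictly between 0.5 and 1, with T_i D(µ̂_i‖µ) ≈ log n at the returned point.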