Non-Asymptotic Analysis of a New Bandit Algorithm for Semi-Bounded Rewards

Authors: Junya Honda, Akimichi Takemura

JMLR 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we consider a stochastic multiarmed bandit problem. It is known in this problem that the Deterministic Minimum Empirical Divergence (DMED) policy achieves the asymptotic theoretical bound for the model where each reward distribution is supported in a known bounded interval, say [0, 1]. However, the regret bound of DMED is described in an asymptotic form and the performance in finite time has been unknown. We modify this policy and derive a finite-time regret bound for the new policy, Indexed Minimum Empirical Divergence (IMED), by refining large deviation probabilities to a simple non-asymptotic form. Further, the refined analysis reveals that the finite-time regret bound is valid even in the case that the reward is not bounded from below. Therefore, our finite-time result applies to the case that the minimum reward (that is, the maximum loss) is unknown or unbounded. We also present some simulation results which show that IMED considerably improves on DMED and performs competitively with other state-of-the-art policies.
Researcher Affiliation | Academia | Junya Honda (EMAIL), Department of Complexity Science and Engineering, The University of Tokyo, Kashiwa-shi, Chiba 277-8561, Japan; Akimichi Takemura (EMAIL), Department of Mathematical Informatics, The University of Tokyo, Bunkyo-ku, Tokyo 113-8561, Japan
Pseudocode | Yes | Algorithm 1 (IMED Policy). Initialization: pull each arm once. Loop: choose an arm i minimizing I_i(n) = T_i(n) D_inf(F̂_i(n), µ̂*(n); A) + log T_i(n), where the tie-breaking rule is arbitrary.
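For the Bernoulli model, D_inf(F̂_i(n), µ̂*(n); A) reduces to the Bernoulli KL divergence kl(µ̂_i(n), µ̂*(n)), so the index above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function names are our own:

```python
import math

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clamped away from {0, 1}."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def imed_choice(mu_hat, counts):
    """Return the arm minimizing I_i(n) = T_i(n) * kl(mu_hat_i, mu*) + log T_i(n)."""
    mu_star = max(mu_hat)
    index = [t * bern_kl(m, mu_star) + math.log(t)
             for m, t in zip(mu_hat, counts)]
    return min(range(len(index)), key=index.__getitem__)

# A well-explored inferior arm is skipped; an under-explored one is revisited.
print(imed_choice([0.5, 0.3], [10, 10]))    # current best arm: 0
print(imed_choice([0.5, 0.3], [1000, 5]))   # under-sampled arm: 1
```

Note that the empirically best arm always has index log T_i(n) (its divergence term vanishes), so IMED trades off exploitation against revisiting arms whose divergence-weighted sample count is still small.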
Open Source Code | No | The paper describes the IMED policy and presents simulation results but does not provide any explicit statement of, or link to, an open-source implementation.
Open Datasets | Yes | First, Fig. 1 shows simulation results of IMED, DMED, TS, kl-UCB and kl-UCB+ for a ten-armed bandit with Bernoulli rewards with µ1 = 0.1, µ2 = µ3 = µ4 = 0.05, µ5 = µ6 = µ7 = 0.02, µ8 = µ9 = µ10 = 0.01, which is the same setting as those in Kaufmann et al. (2012b) and Cappé et al. (2013). Next, we consider the case that the time-delay X'_i for some task by the i-th agent follows an exponential distribution with density e^{-x/µ'_i}/µ'_i, x ≥ 0, and the player tries to minimize the cumulative delay. Since we modeled the reward as a random variable in (-∞, 1], we set the reward as X_i = 1 - X'_i, that is, X_i has density e^{-(1-x)/µ'_i}/µ'_i = e^{-(1-x)/(1-µ_i)}/(1-µ_i), x ≤ 1, with expectation µ_i = 1 - µ'_i. Fig. 2 shows simulation results for a 5-armed bandit with µ'_i = 1/5, 1/4, 1/3, 1/2, 1, that is, µ_i = 4/5, 3/4, 2/3, 1/2, 0. We used IMED, DMED, KL-UCB, KL-UCB+ for A, and KL-UCB for the (shifted) exponential distributions, which we refer to as kl-exp-UCB, where the KL divergence is written as D(µ̂_i‖µ) = (1-µ̂_i)/(1-µ) - 1 - log((1-µ̂_i)/(1-µ)). Finally, Figs. 3 and 4 show results of IMED, DMED, KL-UCB and KL-UCB+ for truncated normal distributions on [0, 1] and (-∞, 1], respectively, as examples of multiparameter models.
Dataset Splits | No | The paper describes different bandit problem settings with specified reward distributions and parameters, which implicitly define the data-generation process for the simulations. However, it does not mention traditional training/validation/test dataset splits or their percentages, as typically found in supervised learning tasks. It does state: 'Each plot is an average over 10,000 trials.'
Hardware Specification | No | The paper presents simulation results for various bandit scenarios but does not provide any specific details about the hardware (e.g., CPU, GPU, memory) used to run the simulations.
Software Dependencies | No | The paper describes the IMED algorithm and discusses other bandit policies. While it outlines the mathematical framework and experimental setup (e.g., exploration functions), it does not specify any software, programming languages, or libraries with version numbers used for the implementation or simulations.
Experiment Setup | Yes | In this section we give some simulation results for IMED, DMED, Thompson sampling (TS) and the KL-UCB family. For the KL-UCB family, we use f(n) = log n as an exploration function for (10), since the asymptotic optimality is shown in Burnetas and Katehakis (1996) for some models and it is empirically recommended in Cappé et al. (2013), although the latter paper uses f(n) = log n + c log log n for some c > 0 in the proof of the optimality. The kl-UCB+ and KL-UCB+ (Garivier and Cappé, 2011) are empirical improvements of kl-UCB and KL-UCB, respectively, where f(n) = log n is replaced with f(n) = log(n/T_i(n)). The optimality analysis of these policies has not been given, but a similar version is discussed in Kaufmann (2014, Proposition 2.4) for some models. Each plot is an average over 10,000 trials. In the four figures given below, IMED and KL-UCB+ performed almost the best. Whereas the complexity of IMED is smaller than that of the KL-UCB family as discussed in Sect. 4.1, the regret of IMED was slightly worse than that of KL-UCB+. First, Fig. 1 shows simulation results of IMED, DMED, TS, kl-UCB and kl-UCB+ for a ten-armed bandit with Bernoulli rewards with µ1 = 0.1, µ2 = µ3 = µ4 = 0.05, µ5 = µ6 = µ7 = 0.02, µ8 = µ9 = µ10 = 0.01, which is the same setting as those in Kaufmann et al. (2012b) and Cappé et al. (2013). Next, we consider the case that the time-delay X'_i for some task by the i-th agent follows an exponential distribution with density e^{-x/µ'_i}/µ'_i, x ≥ 0, and the player tries to minimize the cumulative delay. Since we modeled the reward as a random variable in (-∞, 1], we set the reward as X_i = 1 - X'_i, that is, X_i has density e^{-(1-x)/µ'_i}/µ'_i = e^{-(1-x)/(1-µ_i)}/(1-µ_i), x ≤ 1, with expectation µ_i = 1 - µ'_i. Fig. 2 shows simulation results for a 5-armed bandit with µ'_i = 1/5, 1/4, 1/3, 1/2, 1, that is, µ_i = 4/5, 3/4, 2/3, 1/2, 0.
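The ten-armed Bernoulli experiment described above can be reproduced with a short, self-contained sketch. This is our own illustrative re-implementation, not the authors' code: for Bernoulli rewards the IMED divergence term D_inf reduces to the Bernoulli KL divergence, and the horizon, seed and helper names below are our own choices.

```python
import math
import random

def bern_kl(p, q, eps=1e-12):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def run_imed(means, horizon, rng):
    """Run IMED on a Bernoulli bandit; return the pull counts per arm."""
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    # Initialization: pull each arm once.
    for i in range(k):
        counts[i] = 1
        sums[i] = float(rng.random() < means[i])
    for _ in range(horizon - k):
        mu_hat = [s / t for s, t in zip(sums, counts)]
        mu_star = max(mu_hat)
        index = [t * bern_kl(m, mu_star) + math.log(t)
                 for m, t in zip(mu_hat, counts)]
        i = min(range(k), key=index.__getitem__)
        counts[i] += 1
        sums[i] += float(rng.random() < means[i])
    return counts

# Setting of Fig. 1 (Kaufmann et al., 2012b; Cappé et al., 2013).
means = [0.1, 0.05, 0.05, 0.05, 0.02, 0.02, 0.02, 0.01, 0.01, 0.01]
counts = run_imed(means, 10_000, random.Random(0))
regret = sum((max(means) - m) * t for m, t in zip(means, counts))
print(counts, regret)
```

The paper averages such runs over 10,000 trials; a single run like this only illustrates that the best arm quickly dominates the pull counts.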
We used IMED, DMED, KL-UCB, KL-UCB+ for A, and KL-UCB for the (shifted) exponential distributions, which we refer to as kl-exp-UCB, where the KL divergence is written as D(µ̂_i‖µ) = (1-µ̂_i)/(1-µ) - 1 - log((1-µ̂_i)/(1-µ)). Finally, Figs. 3 and 4 show results of IMED, DMED, KL-UCB and KL-UCB+ for truncated normal distributions on [0, 1] and (-∞, 1], respectively, as examples of multiparameter models. The cumulative distribution of each reward is given by F(x) = 0 for x < a, and F(x) = (Φ((x - µ'_i)/σ_i) - Φ((a - µ'_i)/σ_i)) / (Φ((1 - µ'_i)/σ_i) - Φ((a - µ'_i)/σ_i)) for a ≤ x < 1, where a = 0 or -∞, and Φ is the cumulative distribution function of the standard normal distribution. We also give results of kl-UCB and TS for the Bernoulli bandit for the setting of Fig. 3 where the reward is bounded. For each experiment we set expectations and variances before truncation as µ'_i = 0.6, 0.5, 0.5, 0.4, 0.4 and σ_i = 0.4, 0.2, 0.4, 0.2, 0.4. The expectation of each arm after truncation is given by µ_i = 0.519, 0.5, 0.5, 0.465, 0.481 for support [0, 1] and µ_i = 0.319, 0.390, 0.265, 0.320, 0.206 for support (-∞, 1].
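The kl-exp-UCB divergence for shifted exponentials on (-∞, 1] with means µ̂_i and µ, D(µ̂_i‖µ) = (1-µ̂_i)/(1-µ) - 1 - log((1-µ̂_i)/(1-µ)), is increasing in µ on [µ̂_i, 1), so the corresponding upper confidence bound can be found by bisection. The bound-solving routine and its tolerance below are our own illustrative choices, not from the paper:

```python
import math

def kl_exp(mu_hat, mu):
    """KL divergence between shifted exponentials on (-inf, 1] with means
    mu_hat and mu (both < 1): r - 1 - log r, where r = (1-mu_hat)/(1-mu)."""
    r = (1 - mu_hat) / (1 - mu)
    return r - 1 - math.log(r)

def kl_exp_ucb(mu_hat, t_i, n, tol=1e-9):
    """Largest mu < 1 with t_i * kl_exp(mu_hat, mu) <= log n, by bisection;
    valid because kl_exp(mu_hat, .) increases from 0 on [mu_hat, 1)."""
    target = math.log(n) / t_i
    lo, hi = mu_hat, 1.0 - 1e-12
    if kl_exp(mu_hat, hi) <= target:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_exp(mu_hat, mid) <= target:
            lo = mid
        else:
            hi = mid
    return lo
```

For example, an arm with empirical mean 0.5 after 10 pulls at time n = 100 gets a bound strictly between 0.5 and 1, with T_i D(µ̂_i‖µ) ≈ log n at the returned point.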