Thompson Sampling For Bandits With Cool-Down Periods

Authors: Jingxuan Zhu, Bin Liu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 Numerical Evaluations: In this section, we evaluate the performance of our proposed algorithms. We consider a 10-armed bandit problem; each arm follows a Bernoulli distribution with a distinct success rate. For all the experiments in this section, we set the cool-down duration upper bound D as 10. The total time T is set to be 100000 and we provide the averaged performance over 50 runs with the corresponding error bar."
Researcher Affiliation | Industry | Both authors, Jingxuan Zhu and Bin Liu, are affiliated with E-Surfing Digital Life Technology Co., Ltd., China Telecom.
Pseudocode | Yes | Algorithm 1 (Known Cool-Down Durations); Algorithm 2 (Unknown Cool-Down Durations). The distribution exploration process is given in Algorithm 3 in Appendix B, and the cool-down exploration process is delineated in Algorithm 4 in Appendix B. Further details and variable updates are provided in Algorithm 6.
Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository.
Open Datasets | No | The paper simulates a "10-armed bandit problem" with Bernoulli distributions, but it does not use or provide access to any specific publicly available dataset.
Dataset Splits | No | The paper describes a simulated bandit problem and does not mention using or splitting any external dataset.
Hardware Specification | No | The paper does not specify any details regarding the hardware used for running the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper does not provide specific software or library names with version numbers used to replicate the experiments.
Experiment Setup | Yes | "For all the experiments in this section, we set the cool-down duration upper bound D as 10. The total time T is set to be 100000 and we provide the averaged performance over 50 runs with the corresponding error bar. Consider the following example: there are three arms with success rates 0.9, 0.8, 0.5 and cool-down durations 2, 0, 0, respectively. Let D = 2. Since l(2) = 0, arm 2 is always active. Let us assume the agent fails to obtain reward 1 during the first 5 cool-down explorations of arm 2, which happens with probability at least 0.2^10. Given T = 10000, the performance results of the agent under Algorithm 2 and that without the decision-making bifurcation are illustrated in Figure 4."
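The experimental setup quoted above (a 10-armed Bernoulli bandit, cool-down bound D = 10, horizon T, performance averaged over repeated runs) can be sketched as a minimal Thompson Sampling simulation with known cool-down durations. This is an illustrative reconstruction, not the paper's Algorithm 1: the function name, the Beta(1, 1) priors, and the rule of sampling only among currently active arms are assumptions made for the sketch.

```python
import random

def thompson_sampling_cooldown(rates, cooldowns, T, seed=0):
    """Sketch: Thompson Sampling when each pulled arm enters a cool-down.

    rates     -- Bernoulli success rate of each arm
    cooldowns -- known cool-down duration of each arm (steps it stays inactive)
    T         -- time horizon
    """
    rng = random.Random(seed)
    K = len(rates)
    # Beta(1, 1) posterior parameters per arm (an assumed prior choice)
    alpha = [1.0] * K
    beta = [1.0] * K
    next_available = [0] * K  # earliest step at which each arm is active again
    total_reward = 0
    for t in range(T):
        active = [k for k in range(K) if next_available[k] <= t]
        if not active:
            continue  # every arm is cooling down: the round yields nothing
        # Draw a posterior sample for each active arm; pull the argmax
        samples = {k: rng.betavariate(alpha[k], beta[k]) for k in active}
        k = max(samples, key=samples.get)
        reward = 1 if rng.random() < rates[k] else 0
        alpha[k] += reward
        beta[k] += 1 - reward
        next_available[k] = t + 1 + cooldowns[k]  # arm k cools down
        total_reward += reward
    return total_reward
```

Averaging `thompson_sampling_cooldown` over many seeds would reproduce the kind of mean-with-error-bar curves described in the setup, though the paper's exact cool-down exploration and decision-making bifurcation steps are not modeled here.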