Thompson Sampling For Bandits With Cool-Down Periods
Authors: Jingxuan Zhu, Bin Liu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Numerical Evaluations): In this section, we evaluate the performance of our proposed algorithms. We consider a 10-armed bandit problem in which each arm follows a Bernoulli distribution with a distinct success rate. For all experiments in this section, we set the cool-down duration upper bound D to 10. The total time T is set to 100000, and we report the average performance over 50 runs with the corresponding error bars. |
| Researcher Affiliation | Industry | Jingxuan Zhu (EMAIL), E-Surfing Digital Life Technology Co., Ltd., China Telecom; Bin Liu* (EMAIL), E-Surfing Digital Life Technology Co., Ltd., China Telecom |
| Pseudocode | Yes | Algorithm 1: Known Cool-Down Durations; Algorithm 2: Unknown Cool-Down Durations. The distribution exploration process is given in Algorithm 3 in Appendix B; the cool-down exploration process is delineated in Algorithm 4 in Appendix B. Further details and variable updates are provided in Algorithm 6. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository. |
| Open Datasets | No | The paper simulates a '10-armed bandit problem' with Bernoulli distributions, but it does not use or provide access to any specific publicly available dataset. |
| Dataset Splits | No | The paper describes a simulated bandit problem and does not mention using or splitting any external dataset. |
| Hardware Specification | No | The paper does not specify any details regarding the hardware used for running the experiments (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | No | The paper does not provide specific software or library names with version numbers used to replicate the experiments. |
| Experiment Setup | Yes | For all the experiments in this section, we set the cool-down duration upper bound D as 10. The total time T is set to 100000, and we provide the average performance over 50 runs with the corresponding error bars. Consider the following example: there are three arms with success rates 0.9, 0.8, and 0.5 and cool-down durations 2, 0, and 0, respectively. Let D = 2. Since l(2) = 0, arm 2 is always active. Assume the agent fails to obtain reward 1 during the first 5 cool-down explorations of arm 2, which happens with probability at least 0.2^10. Given T = 10000, the performance of the agent under Algorithm 2 and that without the decision-making bifurcation is illustrated in Figure 4. |
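Since the paper releases no code, a minimal sketch of the described setup may help a reader reproduce the simulation. The block below implements standard Beta-Bernoulli Thompson Sampling restricted to arms that are not cooling down; the function name, the simplified cool-down handling, and the shortened horizon are assumptions for illustration, not the paper's Algorithm 1.

```python
import random

def thompson_cooldown(success_rates, cooldowns, T, seed=0):
    """Beta-Bernoulli Thompson Sampling over arms that are currently active.

    After arm i is pulled it stays unavailable for cooldowns[i] steps
    (a simplified reading of the known-cool-down-duration setting).
    Returns the total reward collected over T rounds.
    """
    rng = random.Random(seed)
    K = len(success_rates)
    alpha = [1] * K        # Beta posterior: 1 + observed successes
    beta = [1] * K         # Beta posterior: 1 + observed failures
    ready_at = [0] * K     # first round at which each arm is active again
    total_reward = 0
    for t in range(T):
        active = [i for i in range(K) if ready_at[i] <= t]
        if not active:     # every arm is cooling down: skip this round
            continue
        # Sample from each active arm's posterior and pull the argmax.
        samples = {i: rng.betavariate(alpha[i], beta[i]) for i in active}
        i = max(samples, key=samples.get)
        r = 1 if rng.random() < success_rates[i] else 0
        alpha[i] += r
        beta[i] += 1 - r
        ready_at[i] = t + 1 + cooldowns[i]
        total_reward += r
    return total_reward

# Toy version of the paper's 3-arm example (success rates 0.9, 0.8, 0.5;
# cool-down durations 2, 0, 0), with the horizon shortened for illustration.
reward = thompson_cooldown([0.9, 0.8, 0.5], [2, 0, 0], T=1000)
```

The paper's full experiment (10 Bernoulli arms, D = 10, T = 100000, averaged over 50 runs) would repeat such a call over 50 seeds and track per-round regret rather than cumulative reward.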