Thompson Sampling For Bandits With Cool-Down Periods

Authors: Jingxuan Zhu, Bin Liu

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "5 Numerical Evaluations: In this section, we evaluate the performance of our proposed algorithms. We consider a 10-armed bandit problem; each arm follows a Bernoulli distribution with a distinct success rate. For all the experiments in this section, we set the cool-down duration upper bound D as 10. The total time T is set to be 100000 and we provide the averaged performance over 50 runs with the corresponding error bar."
Researcher Affiliation | Industry | Both authors, Jingxuan Zhu and Bin Liu, are affiliated with E-Surfing Digital Life Technology Co., Ltd., China Telecom.
Pseudocode | Yes | Algorithm 1 (Known Cool-Down Durations); Algorithm 2 (Unknown Cool-Down Durations). The distribution exploration process is given in Algorithm 3 in Appendix B, and the cool-down exploration process is delineated in Algorithm 4 in Appendix B. Further details and variable updates are provided in Algorithm 6.
Open Source Code | No | The paper does not provide any explicit statements about the release of source code or links to a code repository.
Open Datasets | No | The paper simulates a "10-armed bandit problem" with Bernoulli distributions, but it does not use or provide access to any specific publicly available dataset.
Dataset Splits | No | The paper describes a simulated bandit problem and does not mention using or splitting any external dataset.
Hardware Specification | No | The paper does not specify any details regarding the hardware used for running the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper does not provide specific software or library names with version numbers used to replicate the experiments.
Experiment Setup | Yes | "For all the experiments in this section, we set the cool-down duration upper bound D as 10. The total time T is set to be 100000 and we provide the averaged performance over 50 runs with the corresponding error bar. Consider the following example: there are three arms with success rates 0.9, 0.8, 0.5 and cool-down durations 2, 0, 0, respectively. Let D = 2. Since l(2) = 0, arm 2 is always active. Let us assume the agent fails to obtain reward 1 during the first 5 cool-down explorations of arm 2, which happens with probability at least 0.2^10. Given T = 10000, the performance results of the agent under Algorithm 2 and that without the decision-making bifurcation are illustrated in Figure 4."
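The experimental setup quoted above (a 10-armed Bernoulli bandit, cool-down bound D = 10, horizon T, performance averaged over repeated runs) can be sketched as a minimal Thompson Sampling simulation with known cool-down durations. This is an illustrative reconstruction, not the paper's Algorithm 1: the function name, the Beta(1, 1) priors, and the rule of sampling only among currently active arms are assumptions made for the sketch.

```python
import random

def thompson_sampling_cooldown(rates, cooldowns, T, seed=0):
    """Sketch: Thompson Sampling when each pulled arm enters a cool-down.

    rates     -- Bernoulli success rate of each arm
    cooldowns -- known cool-down duration of each arm (steps it stays inactive)
    T         -- time horizon
    """
    rng = random.Random(seed)
    K = len(rates)
    # Beta(1, 1) posterior parameters per arm (an assumed prior choice)
    alpha = [1.0] * K
    beta = [1.0] * K
    next_available = [0] * K  # earliest step at which each arm is active again
    total_reward = 0
    for t in range(T):
        active = [k for k in range(K) if next_available[k] <= t]
        if not active:
            continue  # every arm is cooling down: the round yields nothing
        # Draw a posterior sample for each active arm; pull the argmax
        samples = {k: rng.betavariate(alpha[k], beta[k]) for k in active}
        k = max(samples, key=samples.get)
        reward = 1 if rng.random() < rates[k] else 0
        alpha[k] += reward
        beta[k] += 1 - reward
        next_available[k] = t + 1 + cooldowns[k]  # arm k cools down
        total_reward += reward
    return total_reward
```

Averaging `thompson_sampling_cooldown` over many seeds would reproduce the kind of mean-with-error-bar curves described in the setup, though the paper's exact cool-down exploration and decision-making bifurcation steps are not modeled here.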