Satisficing Regret Minimization in Bandits

Authors: Qing Feng, Tianyi Ma, Ruihao Zhu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, in Section 6 we conduct numerical experiments to demonstrate the performance of SELECT for several popular bandit optimization settings. ... Our results reveal that in the realizable case, SELECT does achieve constant satisficing regret."
Researcher Affiliation | Academia | Qing Feng (Cornell University), Tianyi Ma (Shanghai Jiao Tong University), Ruihao Zhu (Cornell University)
Pseudocode | Yes | "The pseudo-code of SELECT is presented in Algorithm 1."
Open Source Code | No | The paper does not provide any explicit statement about releasing source code, and it contains no links to repositories.
Open Datasets | No | The paper describes creating specific problem instances for its experiments (e.g., "an instance of 4 arms, and the expected reward of all arms are {0.6, 0.7, 0.8, 1}" for finite-armed bandits, the concave reward function r(x) = 1 - 16(x - 0.25)^2 for concave bandits, and a two-dimensional Lipschitz bandit on the domain (x, y) ∈ [0, 1]^2). These are custom-generated experimental setups, not references to external, publicly available datasets.
Dataset Splits | No | The paper runs experiments in simulated bandit environments over a time horizon rather than on traditional datasets, so training/validation/test splits are not applicable in the usual sense.
Hardware Specification | Yes | "All experiments are run on a PC with 4-core CPU and 16GB of RAM."
Software Dependencies | No | The paper mentions various algorithms and benchmarks (e.g., Thompson sampling, SAT-UCB, a convex bandit algorithm, a uniformly discretized UCB), but it does not name any specific software libraries, frameworks, or operating systems with version numbers.
Experiment Setup | Yes | "Setup: In this case, we consider an instance of 4 arms, and the expected rewards of the arms are {0.6, 0.7, 0.8, 1}. The length of the time horizon is set to 500 to 5000 with a step size of 500. We conduct experiments for both the realizable case and the non-realizable case. For the realizable case, we set the satisficing level S = 0.93; for the non-realizable case, we set the satisficing level S = 1.5. In both cases, we compare SELECT against Thompson sampling in Agrawal & Goyal (2017) as well as SAT-UCB and SAT-UCB+ in Michel et al. (2023). For SELECT, we use Thompson sampling as the learning oracle. The experiment is repeated 1000 times and we report the average results."
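The custom problem instances quoted in the Open Datasets row are simple enough to write down directly. A minimal sketch in Python; the function names are ours (the paper does not provide reference code), and treating the finite-armed rewards as arm means is an assumption:

```python
# Illustrative encodings of the three custom instances described above.
# All names here are ours; the paper provides no reference implementation.

def finite_armed_means():
    """4-arm instance: expected rewards {0.6, 0.7, 0.8, 1}."""
    return [0.6, 0.7, 0.8, 1.0]

def concave_reward(x):
    """Concave reward r(x) = 1 - 16(x - 0.25)^2, maximized at x = 0.25."""
    return 1 - 16 * (x - 0.25) ** 2

def in_lipschitz_domain(x, y):
    """Two-dimensional Lipschitz bandit domain (x, y) in [0, 1]^2."""
    return 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0
```

Since these instances are generated in closed form, the experiments depend only on random seeds rather than on any external dataset, which is consistent with the "No" result above.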
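The finite-armed setup described in the last row can be sketched end to end. This is a hedged sketch, not the paper's SELECT implementation: it runs plain Thompson sampling (one of the reported baselines) on the 4-arm instance with Bernoulli rewards, and measures satisficing regret as the sum over rounds of max(S - mu_a, 0), a standard form of the quantity. The helper names, the Bernoulli reward model, and the number of repetitions are our assumptions:

```python
# Sketch: Thompson sampling baseline on a 4-arm Bernoulli bandit with means
# {0.6, 0.7, 0.8, 1.0}, measuring satisficing regret at level S = 0.93
# (the realizable case: the best arm's mean exceeds S).
import random

MEANS = [0.6, 0.7, 0.8, 1.0]
S = 0.93  # satisficing level

def thompson_run(horizon, seed=0):
    rng = random.Random(seed)
    alpha = [1] * len(MEANS)  # Beta(1, 1) posterior parameters per arm
    beta = [1] * len(MEANS)
    regret = 0.0
    for _ in range(horizon):
        # Sample a mean estimate from each arm's posterior; pull the argmax.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(len(MEANS))]
        arm = max(range(len(MEANS)), key=lambda i: samples[i])
        reward = 1 if rng.random() < MEANS[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        # Satisficing regret penalizes only pulls of arms with mean below S.
        regret += max(S - MEANS[arm], 0.0)
    return regret

# Horizons 500 to 5000 with a step size of 500, as in the paper's setup;
# we average over 5 seeds here rather than the paper's 1000 repetitions.
for T in range(500, 5001, 500):
    avg = sum(thompson_run(T, seed) for seed in range(5)) / 5
    print(T, round(avg, 2))
```

Reproducing the paper's headline finding (constant satisficing regret in the realizable case) would additionally require implementing SELECT from Algorithm 1 with this Thompson sampling routine as its learning oracle, which the paper specifies only in pseudocode.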