Satisficing Regret Minimization in Bandits
Authors: Qing Feng, Tianyi Ma, Ruihao Zhu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, in Section 6 we conduct numerical experiments to demonstrate the performance of SELECT for several popular bandit optimization settings. ... Our results reveal that in the realizable case, SELECT does achieve constant satisficing regret. |
| Researcher Affiliation | Academia | Qing Feng, Cornell University; Tianyi Ma, Shanghai Jiao Tong University; Ruihao Zhu, Cornell University |
| Pseudocode | Yes | The pseudo-code of SELECT is presented in Algorithm 1. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to repositories. |
| Open Datasets | No | The paper describes creating specific problem instances for experiments (e.g., 'an instance of 4 arms, and the expected rewards of the arms are {0.6, 0.7, 0.8, 1}' for finite-armed bandits, the concave reward function r(x) = 1 - 16(x - 0.25)^2 for concave bandits, and a two-dimensional Lipschitz bandit on the domain (x, y) ∈ [0, 1]^2). These are custom-generated experimental setups, not references to external, publicly available datasets. |
| Dataset Splits | No | The paper describes experiments in simulated bandit environments over a time horizon rather than on fixed datasets, so traditional training/validation/test splits do not apply to this research. |
| Hardware Specification | Yes | All experiments are run on a PC with 4-core CPU and 16GB of RAM. |
| Software Dependencies | No | The paper mentions various algorithms and benchmarks (e.g., Thompson sampling, SAT-UCB, convex bandit algorithm, uniformly discretized UCB), but does not provide specific software names with version numbers for any libraries, frameworks, or operating systems used in implementation. |
| Experiment Setup | Yes | Setup: In this case, we consider an instance of 4 arms, and the expected rewards of the arms are {0.6, 0.7, 0.8, 1}. The length of the time horizon ranges from 500 to 5000 with a step size of 500. We conduct experiments for both the realizable case and the non-realizable case. For the realizable case, we set the satisficing level S = 0.93; for the non-realizable case, we set the satisficing level S = 1.5. In both cases, we compare SELECT against Thompson sampling in Agrawal & Goyal (2017) as well as SAT-UCB and SAT-UCB+ in Michel et al. (2023). For SELECT, we use Thompson sampling as the learning oracle. The experiment is repeated 1000 times and we report the average results. |
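The finite-armed setup quoted above can be reproduced approximately with a short simulation. The sketch below is an assumption-laden illustration, not the paper's implementation: it runs Beta-Bernoulli Thompson sampling (the baseline the paper compares against, and SELECT's learning oracle) on the 4-arm instance, and it assumes satisficing regret accrues as max(0, S − μ_a) for each pulled arm a, with Bernoulli rewards matching the quoted means.

```python
import random

def thompson_sampling_satisficing(means, horizon, satisficing_level, rng):
    """Run Beta-Bernoulli Thompson sampling for `horizon` rounds and
    return cumulative satisficing regret sum_t max(0, S - mu_{a_t}).

    Assumed model: arm i yields Bernoulli(means[i]) rewards; the
    satisficing-regret definition is our assumption for illustration."""
    k = len(means)
    successes = [1] * k  # Beta(1, 1) prior counts per arm
    failures = [1] * k
    regret = 0.0
    for _ in range(horizon):
        # Sample a posterior mean for each arm, play the argmax.
        samples = [rng.betavariate(successes[i], failures[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        # Satisficing regret: shortfall of the played arm's mean below S.
        regret += max(0.0, satisficing_level - means[arm])
    return regret

# Realizable case (S = 0.93 < best mean 1.0) vs. non-realizable (S = 1.5).
means = [0.6, 0.7, 0.8, 1.0]
r_realizable = thompson_sampling_satisficing(means, 500, 0.93, random.Random(0))
r_non_realizable = thompson_sampling_satisficing(means, 500, 1.5, random.Random(0))
```

In the non-realizable case every round incurs at least S − max(μ) = 0.5 regret, so cumulative regret grows linearly in the horizon; in the realizable case regret accrues only while suboptimal arms are still being explored, which is consistent with the paper's observation of constant satisficing regret when S is achievable.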