Regret Bounds for Satisficing in Multi-Armed Bandit Problems
Authors: Thomas Michel, Hossein Hajiabolhassan, Ronald Ortner
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compared Sat-UCB to other bandit algorithms in order to show that the latter keep accumulating S-regret, while Sat-UCB sticks to a satisficing arm after finite time, thus confirming the results of Theorem 2. We also investigated the behavior of Sat-UCB in the not realizable case with different values for the chosen satisfaction level S and did experiments with a slightly modified version Sat-UCB+ introduced below in Section 4.2. |
| Researcher Affiliation | Academia | Thomas Michel, Université Paris-Saclay, ENS Paris-Saclay, France; Hossein Hajiabolhassan, Institute of Human Genetics, Diagnostic and Research Center for Molecular Biomedicine, Medical University of Graz, Austria; Ronald Ortner, Lehrstuhl für Informationstechnologie, Montanuniversität Leoben, Austria |
| Pseudocode | Yes | Algorithm 1: Simple-Sat (Simple Algorithm for Satisficing in the Realizable Case) Algorithm 2: Sat-UCB Scheme for Satisficing in the General Case Algorithm 3: Sat-UCB+ (Experimental Simplification of Sat-UCB) |
| Open Source Code | No | The paper does not contain any explicit statement about providing open-source code, nor does it provide links to any code repositories. |
| Open Datasets | No | The paper describes generating data for experiments using "normally distributed rewards with standard deviation 1" in specific settings. It does not refer to any established public datasets or provide access information for its generated data. |
| Dataset Splits | No | The paper describes a multi-armed bandit problem, which typically involves online learning and simulation rather than static dataset splits. It specifies how reward means are set for arms (e.g., "mean reward of each arm i = 1, 2, . . . , 20 is set to i/20") but does not discuss training, validation, or test splits of a fixed dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to conduct the experiments. |
| Software Dependencies | No | The paper mentions various algorithms (e.g., UCB1, UCBα, UCL) but does not specify any software libraries or programming languages with version numbers used for their implementation. |
| Experiment Setup | Yes | To illustrate the influence of the structure of the underlying bandit problem, we performed experiments in the following two settings, each with 20 arms and normally distributed rewards with standard deviation 1. In Setting 1, the mean reward of each arm i = 1, 2, . . . , 20 is set to i/20. For the satisfaction level we chose 0.8 in the realizable case, resulting in four satisfying arms. For experiments in the not realizable case, S = 1 was chosen. In Setting 2, the mean reward of each arm i is set to √(i/20). For the satisfaction level we chose S = 0.92, so that there are three satisfying arms in the realizable case. This setting is more difficult than Setting 1, as arms are closer to S as well as to each other. The not realizable satisfaction level was set to 1.1. |
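The two experimental settings described in the Experiment Setup row can be sketched in a few lines. This is our own illustrative reconstruction, not the authors' code (which is not released): the function names are hypothetical, and the counting convention for "satisfying" arms (non-strict threshold, mean ≥ S) is an assumption, since the paper does not state it explicitly.

```python
import numpy as np

def setting_means(setting: int, n_arms: int = 20) -> np.ndarray:
    """Mean rewards for the 20-arm settings described in the paper."""
    i = np.arange(1, n_arms + 1)
    if setting == 1:
        return i / 20.0            # Setting 1: mean of arm i is i/20
    return np.sqrt(i / 20.0)       # Setting 2: mean of arm i is sqrt(i/20)

def satisfying_arms(means: np.ndarray, S: float) -> list[int]:
    """1-indexed arms whose mean reaches the satisfaction level S
    (non-strict threshold is our assumption)."""
    return (np.flatnonzero(means >= S) + 1).tolist()

def draw_reward(mean: float, rng: np.random.Generator) -> float:
    """Rewards are normally distributed with standard deviation 1."""
    return rng.normal(loc=mean, scale=1.0)

rng = np.random.default_rng(0)
means1 = setting_means(1)          # 0.05, 0.10, ..., 1.00
means2 = setting_means(2)          # sqrt(1/20), ..., sqrt(20/20) = 1.0
print(satisfying_arms(means1, S=0.8))
print(satisfying_arms(means2, S=0.92))
```

In the not realizable case (S = 1 for Setting 1, S = 1.1 for Setting 2), `satisfying_arms` returns few or no arms, which is exactly the regime in which the paper studies the behavior of Sat-UCB.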