Regret Bounds for Satisficing in Multi-Armed Bandit Problems

Authors: Thomas Michel, Hossein Hajiabolhassan, Ronald Ortner

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. We compared Sat-UCB to other bandit algorithms to show that the latter keep accumulating S-regret, while Sat-UCB sticks to a satisficing arm after finite time, thus confirming the results of Theorem 2. We also investigated the behavior of Sat-UCB in the not realizable case with different values for the chosen satisfaction level S, and ran experiments with a slightly modified version, Sat-UCB+, introduced in Section 4.2.
Researcher Affiliation: Academia. Thomas Michel (EMAIL), Université Paris-Saclay, ENS Paris-Saclay, France; Hossein Hajiabolhassan (EMAIL), Institute of Human Genetics, Diagnostic and Research Center for Molecular Biomedicine, Medical University of Graz, Austria; Ronald Ortner (EMAIL), Lehrstuhl für Informationstechnologie, Montanuniversität Leoben, Austria.
Pseudocode: Yes. Algorithm 1: Simple-Sat (Simple Algorithm for Satisficing in the Realizable Case); Algorithm 2: Sat-UCB Scheme for Satisficing in the General Case; Algorithm 3: Sat-UCB+ (Experimental Simplification of Sat-UCB).
Open Source Code: No. The paper does not contain any explicit statement about providing open-source code, nor does it provide links to any code repositories.
Open Datasets: No. The paper describes generating data for experiments using "normally distributed rewards with standard deviation 1" in specific settings. It does not refer to any established public datasets or provide access information for its generated data.
Dataset Splits: No. The paper describes a multi-armed bandit problem, which typically involves online learning and simulation rather than static dataset splits. It specifies how reward means are set for the arms (e.g., the mean reward of each arm i = 1, 2, ..., 20 is set to i/20) but does not discuss training, validation, or test splits of a fixed dataset.
Hardware Specification: No. The paper does not provide any specific details about the hardware used to conduct the experiments.
Software Dependencies: No. The paper mentions various algorithms (e.g., UCB1, UCBalpha, UCL) but does not specify any software libraries or programming languages with version numbers used for their implementation.
Experiment Setup: Yes. To illustrate the influence of the structure of the underlying bandit problem, the authors performed experiments in the following two settings, each with 20 arms and normally distributed rewards with standard deviation 1. In Setting 1 the mean reward of each arm i = 1, 2, ..., 20 is set to i/20. For the satisfaction level they chose S = 0.8 in the realizable case, resulting in four satisfying arms; for experiments in the not realizable case, S = 1 was chosen. In Setting 2 the mean reward of each arm i is set to √(i/20). For the satisfaction level they chose S = 0.92, so that there are three satisfying arms in the realizable case. This setting is more difficult than Setting 1, as the arms are closer to S as well as to each other. The not realizable satisfaction level was set to 1.1.
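Since no code is released with the paper, the two experimental settings above can be sketched in a few lines. The following is a minimal reconstruction, not the authors' implementation: it builds the arm means of both settings, samples rewards from N(mean, 1), and tracks cumulative S-regret under the usual satisficing definition (assumed here, not quoted from the paper), namely the sum of max(S − μ_{a_t}, 0) over rounds. A standard UCB1 baseline (Auer et al., 2002) stands in for the comparison algorithms that keep exploring and hence keep accumulating S-regret.

```python
import numpy as np

rng = np.random.default_rng(0)

# Setting 1: 20 arms, mean reward of arm i (i = 1..20) is i/20.
means_1 = np.arange(1, 21) / 20.0
S1 = 0.8  # realizable satisfaction level; four arms strictly exceed it

# Setting 2: mean reward of arm i is sqrt(i/20); S = 0.92 in the realizable case.
means_2 = np.sqrt(np.arange(1, 21) / 20.0)
S2 = 0.92

def pull(means, i, rng):
    """Draw one reward from arm i: N(mean_i, 1), as in the paper's setup."""
    return rng.normal(means[i], 1.0)

def s_regret(means, choices, S):
    """Cumulative S-regret, assuming the definition sum_t max(S - mu_{a_t}, 0)."""
    return np.cumsum(np.maximum(S - means[np.asarray(choices)], 0.0))

def ucb1(means, T, rng):
    """Plain UCB1 baseline: pull each arm once, then maximize the index
    mean estimate + sqrt(2 ln t / n_i). It never commits to one arm."""
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    choices = []
    for t in range(T):
        if t < K:
            a = t  # initialization: one pull per arm
        else:
            a = int(np.argmax(sums / counts + np.sqrt(2.0 * np.log(t) / counts)))
        counts[a] += 1
        sums[a] += pull(means, a, rng)
        choices.append(a)
    return choices
```

For example, `s_regret(means_1, ucb1(means_1, 5000, rng), S1)` yields a curve that keeps growing whenever a non-satisfying arm (mean below S) is pulled, which is the qualitative behavior the paper contrasts with Sat-UCB.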