Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment

Authors: Zixian Yang, Xin Liu, Lei Ying

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Simulation results in Section 5 and Appendix D confirm the theoretical results and show that the proposed algorithms have significantly lower regret than the traditional UCB and KL-UCB, and Q-learning-based algorithms.
Researcher Affiliation | Academia | Zixian Yang (Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA); Xin Liu (School of Information Science and Technology, ShanghaiTech University, Shanghai, China); Lei Ying (Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA)
Pseudocode | Yes | Algorithm 1: ULCB Algorithm; Algorithm 2: CONT-ULCB Algorithm
Open Source Code | No | The paper does not state that source code for the proposed ULCB or KL-ULCB algorithms is provided, nor does it link to any code repository. It discusses comparisons with existing algorithms but says nothing about the availability of its own implementation.
Open Datasets | No | The paper presents simulation results and defines the simulation parameters (e.g., M = 2, µ(a1) = 0.9, µ(a2) = 0.8, and the abandonment probabilities). It does not use or provide access to any external, publicly available dataset for its experimental evaluation.
Dataset Splits | No | The paper reports the number of episodes and independent runs ("2 × 10^4 episodes with 10^7 independent runs"), but because the evaluation is simulation-based rather than built on pre-collected data, it involves no traditional training/validation/test splits.
Hardware Specification | No | The paper describes the simulation parameters and results but does not specify any hardware (e.g., GPU or CPU models, memory) used to run the simulations.
Software Dependencies | No | The paper compares against algorithms such as UCB, KL-UCB, Q-learning, and UCBVI and gives some of their parameters (e.g., ε = 0.1 for Q-learning, the learning rate, and the discount factor γ = 1), but it does not name the software libraries or frameworks (e.g., Python, PyTorch, TensorFlow) or their versions used to implement any of the algorithms discussed.
Experiment Setup | Yes | In the simulation, we assume S_{k,1} = 1 for simplicity; that is, the user assumes a class of items is good if the user has not yet seen the items. Let M = 2, µ(a1) = 0.9, and µ(a2) = 0.8. Note that for all the algorithms in the simulation, we do not include the log log terms (i.e., c = 0) in the indices, which are also omitted in (Garivier and Cappé, 2011). We simulated 2 × 10^4 episodes with 10^7 independent runs. We set c1 = 1, c0 = 1 for ULCB and c1 = c0 = 1 for KL-ULCB. We also compare our algorithms with Q-learning (Watkins, 1989) with ε-greedy, Q-learning with UCB (Yang et al., 2021), and UCBVI (Azar et al., 2017). For Q-learning with ε-greedy, at each step we select a random action with probability ε and a greedy action according to the Q table with probability 1 − ε. At each step, we update the Q table using the update formula in (Watkins, 1989) with discount factor γ = 1. For Q-learning with UCB, at each step we select a greedy action according to the estimated Q table and update the Q table with an additional bonus term... For UCBVI, we set the episode length parameter H to the expected episode length under the best policy, and the number of episodes K is set to 1000. The probability parameter δ is set to 0.001. The bonus term is modified to b(s, a) = 7H log(5SAHK/δ) √(1 / min{N(a), N(s, 1), N(s, 0)})... Simulations for the general-state setting can be found in Appendix C.5. Consider the MAB-A problem in the general-state setting. Let the abandonment probability function q(·) be q(s) = 1 − log(c6·s + 1) ... for any s ∈ [0, 1], where c6 is a constant. Let the forgetting factor θ = 0.5 in the simulation. ... Let n = 4 for the discretization of DISC-ULCB and DISC-KL-ULCB.
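To make the ε-greedy Q-learning baseline described above concrete, here is a minimal Python sketch of a two-arm bandit with abandonment (µ(a1) = 0.9, µ(a2) = 0.8, ε = 0.1, γ = 1 as in the setup). The abandonment probabilities, the learning rate, and the stateless Q table are illustrative assumptions, not the paper's values or its exact state formulation:

```python
import random

# Bernoulli reward means for the M = 2 arms (from the setup above).
MU = [0.9, 0.8]
EPSILON = 0.1   # exploration probability (paper: eps = 0.1)
GAMMA = 1.0     # discount factor (paper: gamma = 1)
ALPHA = 0.1     # learning rate -- illustrative placeholder
# Assumed abandonment probability given the last reward (0 = dislike,
# 1 = like); these values are placeholders, not the paper's.
Q_ABANDON = {0: 0.5, 1: 0.1}

def run_episode(q_table, rng):
    """Play until the user abandons; returns the episode's total reward."""
    total = 0
    while True:
        # epsilon-greedy selection over a simplified stateless Q table
        if rng.random() < EPSILON:
            a = rng.randrange(len(MU))
        else:
            a = max(range(len(MU)), key=lambda i: q_table[i])
        r = 1 if rng.random() < MU[a] else 0
        total += r
        abandoned = rng.random() < Q_ABANDON[r]
        # Q-learning target: r if the user leaves (terminal), otherwise
        # r + gamma * max_a' Q(a') since the episode continues.
        target = r if abandoned else r + GAMma_max(q_table)
        q_table[a] += ALPHA * (target - q_table[a])
        if abandoned:
            return total

def GAMma_max(q_table):
    """Discounted bootstrap value max_a' gamma * Q(a')."""
    return GAMMA * max(q_table)

rng = random.Random(0)
q = [0.0, 0.0]
returns = [run_episode(q, rng) for _ in range(2000)]
avg_return = sum(returns) / len(returns)
print(avg_return)
```

With γ = 1 the episodic return is still well defined here because each step terminates with positive probability, so episode lengths are geometric and finite in expectation.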