Exploration, Exploitation, and Engagement in Multi-Armed Bandits with Abandonment
Authors: Zixian Yang, Xin Liu, Lei Ying
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Simulation results in Section 5 and Appendix D confirm our theoretical results and show that our algorithms have significantly lower regret than the traditional UCB, KL-UCB, and Q-learning-based algorithms. |
| Researcher Affiliation | Academia | Zixian Yang EMAIL Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA; Xin Liu EMAIL School of Information Science and Technology, ShanghaiTech University, Shanghai, China; Lei Ying EMAIL Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109, USA |
| Pseudocode | Yes | Algorithm 1 ULCB Algorithm; Algorithm 2 CONT-ULCB Algorithm |
| Open Source Code | No | The paper does not explicitly state that source code for the proposed ULCB or KL-ULCB algorithms is provided, nor does it include any links to code repositories. It discusses comparisons to existing algorithms but not its own implementation availability. |
| Open Datasets | No | The paper presents "Simulation results" and defines the parameters for these simulations (e.g., M=2, µ(a1)=0.9, µ(a2)=0.8, abandonment probabilities). It does not use or provide access to any external, publicly available datasets in the traditional sense for its experimental evaluation. |
| Dataset Splits | No | The paper conducts simulations and mentions the number of episodes and independent runs ("2 * 10^4 episodes with 10^7 independent runs"). However, it does not involve traditional dataset splits (e.g., training, validation, test sets) as it is simulation-based rather than using pre-collected data. |
| Hardware Specification | No | The paper describes the simulation parameters and results but does not specify any hardware (e.g., GPU, CPU models, memory) used to run these simulations. |
| Software Dependencies | No | The paper mentions comparing with various algorithms like UCB, KL-UCB, Q-learning, and UCBVI, and details some of their parameters (e.g., ϵ=0.1 for Q-learning, learning rate, discount factor γ=1). However, it does not specify the software libraries, frameworks (e.g., Python, PyTorch, TensorFlow), or their version numbers used for the implementations of any algorithms discussed in the paper. |
| Experiment Setup | Yes | In the simulation, we assume S_{k,1} = 1 for simplicity. This is to say that the user assumes a class of items is good if the user has not yet seen the items. Let M = 2, µ(a1) = 0.9, and µ(a2) = 0.8. Note that for all the algorithms in the simulation, we do not include the log log terms (i.e., c = 0) in the indices, which are also omitted in (Garivier and Cappé, 2011). We simulated 2 × 10^4 episodes with 10^7 independent runs. We set c1 = 1, c0 = 1 for ULCB and c1 = c0 = 1 for KL-ULCB. We also compare our algorithms with Q-learning (Watkins, 1989) with ϵ-greedy, Q-learning with UCB (Yang et al., 2021), and UCBVI (Azar et al., 2017). For Q-learning with ϵ-greedy, at each step we select a random action with probability ϵ and a greedy action according to the Q table with probability 1 − ϵ. At each step, we update the Q table based on the update formula in (Watkins, 1989) with discount factor γ = 1. For Q-learning with UCB, at each step we select a greedy action according to the estimated Q table. At each step, we update the Q table with an additional bonus term... For UCBVI, we also set the episode length parameter H to be the expected episode length under the best policy, and the number of episodes K is set to 1000. The probability parameter δ is set to 0.001. The bonus term is modified to b(s, a) = 7H log(5SAHK/δ) √(1 / min{N(a), N[s,1], N[s,0]})... Simulations for the general-state setting can be found in Appendix C.5. Consider the MAB-A problem in the general-state setting. Let the abandonment probability function q(·) be q(s) = 1 − log(c6 s + 1) ... for any s ∈ [0, 1], where c6 is a constant. Let the forgetting factor θ = 0.5 in the simulation. ... Let n = 4 for the discretization of DISC-ULCB and DISC-KL-ULCB. |
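
The experiment setup above (M = 2 arms with means µ(a1) = 0.9 and µ(a2) = 0.8, episodes ending when the user abandons) can be sketched as a minimal simulation. This is not the paper's ULCB or KL-ULCB algorithm: the UCB1-style index, the constant abandonment probability `q_abandon`, and the "abandon after a zero reward" rule are illustrative assumptions standing in for the paper's state-dependent abandonment model.

```python
import math
import random


def run_episode(mu, q_abandon, c1=1.0, rng=random):
    """Simulate one episode of a 2-armed bandit with abandonment.

    Assumptions (not from the paper): the user abandons with fixed
    probability q_abandon after each zero reward, and arms are chosen
    by a UCB1-style index: empirical mean + sqrt(c1 * log(t) / n).
    Returns (total reward, episode length).
    """
    counts = [0] * len(mu)   # pulls per arm
    sums = [0.0] * len(mu)   # cumulative reward per arm
    t = 0
    total_reward = 0.0
    while True:
        t += 1
        # Untried arms get an infinite index so each is pulled once.
        idx = [float("inf") if counts[a] == 0
               else sums[a] / counts[a] + math.sqrt(c1 * math.log(t) / counts[a])
               for a in range(len(mu))]
        a = max(range(len(mu)), key=lambda i: idx[i])
        r = 1.0 if rng.random() < mu[a] else 0.0  # Bernoulli reward
        counts[a] += 1
        sums[a] += r
        total_reward += r
        # Illustrative abandonment rule: leave after a bad recommendation.
        if r == 0.0 and rng.random() < q_abandon:
            return total_reward, t


random.seed(0)
mu = [0.9, 0.8]  # arm means from the paper's setup
rewards = [run_episode(mu, q_abandon=0.5)[0] for _ in range(2000)]
print(sum(rewards) / len(rewards))
```

Because every step yields a zero reward with probability at least 0.1, each episode terminates with probability one; the printed value is the average per-episode reward over 2000 episodes, far fewer than the 2 × 10^4 episodes and 10^7 runs reported in the paper.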