DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

Authors: GUOJUN XIONG, Ujwal Dinesha, Debajoy Mukherjee, Jian Li, Srinivas Shakkottai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results further demonstrate the effectiveness of DOPL. In this section, we evaluate the efficacy of DOPL for PREF-RMAB via real-world applications. The experiment code is available at https://github.com/flash-36/DOPL. Baselines. We consider three classes of baselines... Observations. As shown in Figures 1, 2, and 3 (left), DOPL significantly outperforms all considered baselines and reaches close to the oracle."
Researcher Affiliation | Academia | Guojun Xiong¹, Ujwal Dinesha², Debajoy Mukherjee², Jian Li³, Srinivas Shakkottai²; ¹Harvard University, ²Texas A&M University, ³Stony Brook University
Pseudocode | Yes | Algorithm 1: Online Interactions between the DM and the PREF-RMAB Environment; Algorithm 2: DOPL: Direct Online Preference Learning for PREF-RMAB; Algorithm 3: Online Preference Learning for the k-th Episode
Open Source Code | Yes | "The experiment code is available at https://github.com/flash-36/DOPL."
Open Datasets | No | The paper describes a 'Synthetic Environment' (App Marketing) and 'Real-world Environments' (CPAP, ARMMAN) but does not provide links, DOIs, or citations to publicly available datasets. For these environments, it defines system dynamics and transition probabilities rather than providing access to pre-existing public datasets.
Dataset Splits | No | The paper describes simulated environments ('App Marketing', 'CPAP', 'ARMMAN') with defined system dynamics, transition probabilities, and latent rewards. It mentions running 'simulations' and specifies parameters such as '10 arms' or '20 patients', but because this is an online learning setup, it does not describe traditional dataset splits (e.g., training, validation, and test sets) in terms of percentages, sample counts, or specific files.
Hardware Specification | Yes | "Some of the experiments presented in this paper were run on an M1 MacBook Air and some on a compute cluster with dual AMD EPYC 7443 CPUs with 48 cores and 256 GB RAM."
Software Dependencies | No | The paper does not explicitly mention software dependencies with version numbers for its implementation (e.g., a programming language such as Python, or libraries such as PyTorch or TensorFlow, with their respective versions).
Experiment Setup | Yes | "Below we detail the hyperparameters used during the training." Hyperparameters: K (Number of Epochs) = 4000; H (Horizon) = 100; ϵ (Epsilon) = 1×10⁻⁵.
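The reported hyperparameters can be collected into a minimal config sketch. This is an illustrative assumption, not code from the DOPL repository: the key names and the derived step count are ours, while the values (K = 4000, H = 100, ϵ = 1×10⁻⁵) come from the paper's experiment setup.

```python
# Hedged sketch: training hyperparameters reported for DOPL, gathered
# into a plain config dict. Key names are illustrative assumptions.
config = {
    "K": 4000,        # number of epochs (episodes) of online learning
    "H": 100,         # horizon: decision steps per episode
    "epsilon": 1e-5,  # the epsilon constant reported in the setup
}

# Total decision steps across the full run, under these assumptions.
total_steps = config["K"] * config["H"]
print(total_steps)  # prints 400000
```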