DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback
Authors: Guojun Xiong, Ujwal Dinesha, Debajoy Mukherjee, Jian Li, Srinivas Shakkottai
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results further demonstrate the effectiveness of DOPL. In this section, we evaluate the efficacy of DOPL for PREF-RMAB via real-world applications. The experiment code is available at https://github.com/flash-36/DOPL. Baselines. We consider three classes of baselines... Observations. As shown in Figures 1, 2, and 3 (left), DOPL significantly outperforms all considered baselines and reaches close to the oracle. |
| Researcher Affiliation | Academia | Guojun Xiong¹, Ujwal Dinesha², Debajoy Mukherjee², Jian Li³, Srinivas Shakkottai² — ¹Harvard University, ²Texas A&M University, ³Stony Brook University |
| Pseudocode | Yes | Algorithm 1: Online Interactions between the DM and the PREF-RMAB Environment. Algorithm 2: DOPL: Direct Online Preference Learning for PREF-RMAB. Algorithm 3: Online Preference Learning for the k-th Episode. |
| Open Source Code | Yes | The experiment code is available at https://github.com/flash-36/DOPL. |
| Open Datasets | No | The paper describes 'Synthetic Environment' (App Marketing) and 'Real-world Environments' (CPAP, ARMMAN) but does not provide specific links, DOIs, or citations to publicly available datasets. For these environments, it defines system dynamics and transition probabilities rather than providing access to pre-existing public datasets. |
| Dataset Splits | No | The paper describes simulated environments ('App Marketing', 'CPAP', 'ARMMAN') with defined system dynamics, transition probabilities, and latent rewards. It mentions running 'simulations' and specifies parameters like '10 arms' or '20 patients', but it does not describe traditional dataset splits (e.g., training, validation, test sets) in terms of percentages, sample counts, or specific files for these environments, as it is an online learning setup. |
| Hardware Specification | Yes | Some of the experiments presented in this paper were run on an M1 Macbook Air and some on a compute cluster with Dual AMD EPYC 7443 with 48 cores and 256GB RAM. |
| Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers for its implementation (e.g., programming languages like Python, or libraries like PyTorch, TensorFlow, etc., with their respective versions). |
| Experiment Setup | Yes | Below we detail the hyperparameters used during the training. Hyperparameters: K (Number of Epochs) = 4000; H (Horizon) = 100; ϵ (Epsilon) = 1×10⁻⁵. |
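The reported hyperparameters can be captured as a small configuration sketch. This is illustrative only: the dictionary name and keys are assumptions, not taken from the authors' released code, and the ϵ value reflects the reconstructed reading of 1×10⁻⁵ from the paper's flattened table.

```python
# Hypothetical configuration mirroring the hyperparameters reported for DOPL;
# names and structure are illustrative, not from the authors' repository.
dopl_config = {
    "K": 4000,        # number of epochs (episodes of online preference learning)
    "H": 100,         # horizon length per episode
    "epsilon": 1e-5,  # ϵ, reported as 1×10⁻⁵
}

# Total number of decision steps implied by the reported setup.
total_steps = dopl_config["K"] * dopl_config["H"]
print(total_steps)  # 400000
```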