DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

Authors: GUOJUN XIONG, Ujwal Dinesha, Debajoy Mukherjee, Jian Li, Srinivas Shakkottai

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results further demonstrate the effectiveness of DOPL. In this section, we evaluate the efficacy of DOPL for PREF-RMAB via real-world applications. The experiment code is available at https://github.com/flash-36/DOPL. Baselines. We consider three classes of baselines... Observations. As shown in Figures 1, 2, and 3 (left), DOPL significantly outperforms all considered baselines and reaches close to the oracle."
Researcher Affiliation | Academia | Guojun Xiong¹, Ujwal Dinesha², Debajoy Mukherjee², Jian Li³, Srinivas Shakkottai²; ¹Harvard University, ²Texas A&M University, ³Stony Brook University
Pseudocode | Yes | Algorithm 1: Online Interactions between the DM and the PREF-RMAB Environment; Algorithm 2: DOPL: Direct Online Preference Learning for PREF-RMAB; Algorithm 3: Online Preference Learning for the k-th Episode
Open Source Code | Yes | "The experiment code is available at https://github.com/flash-36/DOPL."
Open Datasets | No | The paper describes a 'Synthetic Environment' (App Marketing) and 'Real-world Environments' (CPAP, ARMMAN) but does not provide links, DOIs, or citations to publicly available datasets. For these environments, it defines system dynamics and transition probabilities rather than providing access to pre-existing public datasets.
Dataset Splits | No | The paper describes simulated environments ('App Marketing', 'CPAP', 'ARMMAN') with defined system dynamics, transition probabilities, and latent rewards. It mentions running 'simulations' and specifies parameters such as '10 arms' or '20 patients', but because this is an online learning setup, it does not describe traditional dataset splits (e.g., training, validation, and test sets) in terms of percentages, sample counts, or specific files.
Hardware Specification | Yes | "Some of the experiments presented in this paper were run on an M1 MacBook Air and some on a compute cluster with dual AMD EPYC 7443 CPUs with 48 cores and 256 GB RAM."
Software Dependencies | No | The paper does not explicitly mention software dependencies with version numbers for its implementation (e.g., a programming language such as Python, or libraries such as PyTorch or TensorFlow, with their respective versions).
Experiment Setup | Yes | "Below we detail the hyperparameters used during the training." Hyperparameters: K (Number of Epochs) = 4000; H (Horizon) = 100; ϵ (Epsilon) = 1×10⁻⁵.
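The reported hyperparameters can be collected into a minimal config sketch. This is an illustrative assumption, not code from the DOPL repository: the key names and the derived step count are ours, while the values (K = 4000, H = 100, ϵ = 1×10⁻⁵) come from the paper's experiment setup.

```python
# Hedged sketch: training hyperparameters reported for DOPL, gathered
# into a plain config dict. Key names are illustrative assumptions.
config = {
    "K": 4000,        # number of epochs (episodes) of online learning
    "H": 100,         # horizon: decision steps per episode
    "epsilon": 1e-5,  # the epsilon constant reported in the setup
}

# Total decision steps across the full run, under these assumptions.
total_steps = config["K"] * config["H"]
print(total_steps)  # prints 400000
```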