Learning in complex action spaces without policy gradients
Authors: Arash Tavakoli, Sina Ghiassian, Nemanja Rakićević
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that QMLE can be applied to complex action spaces at a computational cost comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE exhibits strong performance on the DeepMind Control Suite, even when compared to state-of-the-art methods such as DMPO and D4PG. We make our code publicly available. The paper includes a dedicated "Experiments" section (Section 5) with subsections on "Illustrative example", "Benchmarking results", and "Ablation studies". |
| Researcher Affiliation | Industry | All listed authors are affiliated with private companies: "Arash Tavakoli EMAIL Riot Games", "Sina Ghiassian EMAIL Spotify", and "Nemanja Rakićević EMAIL Google DeepMind". |
| Pseudocode | Yes | Algorithm 1 details the training procedure for QMLE. Specifically, our presentation is based on integrating our framework (Section 4) into the deep Q-learning algorithm by Mnih et al. (2015). In line with this, we make use of experience replay and a target network that is only periodically updated with the parameters of the online network. Importantly, we extend the scope of the target network to encompass the arg max predictors in QMLE. |
| Open Source Code | Yes | We make our code publicly available. To support reproducibility, we release the implementation used in our benchmarking experiments at: https://github.com/atavakol/qmle |
| Open Datasets | Yes | In this section, we evaluate QMLE on 18 continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018). |
| Dataset Splits | No | The paper evaluates QMLE on continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018). These are reinforcement learning environments in which agents generate data through interaction with the environment rather than training on a static dataset. The concept of explicit training/validation/test splits therefore does not apply, and no such splits are reported in the paper. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | Our QMLE implementation is based on a DQN script from CleanRL (Huang et al., 2022) and incorporates prioritized experience replay adapted from Stable Baselines (Hill et al., 2018), both available under the permissive MIT license. The paper names these software tools but does not specify version numbers for them. |
| Experiment Setup | Yes | Table 1 provides the hyper-parameters of QMLE in our benchmarking experiments. Parameters listed include m_target, m_greedy, ρ_0, ρ_1, ρ_2, the step sizes α_q and α_argmax, update frequency, batch size, training start size, memory buffer size, target network update frequency, loss function, optimizer, exploration ε, discount factor, time limit, truncation approach, importance sampling exponent, and priority exponent. |
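The pseudocode evidence above describes a DQN-style loop whose target network is extended to also cover the arg max predictors. As a minimal sketch of that periodic-sync pattern only (all names such as `q_params`, `argmax_params`, and `sync_every` are illustrative, not taken from the paper's released code):

```python
import copy

def make_online_network():
    # Toy stand-in for the online network: Q-value parameters plus
    # arg-max predictor parameters, both updated during training.
    return {"q_params": [0.0, 0.0], "argmax_params": [0.0, 0.0]}

def train(num_steps=100, sync_every=25):
    """Run placeholder updates, periodically syncing the target network.

    The key point mirrored from the paper's description: the periodic
    copy covers ALL online parameters, including the arg-max predictors,
    not just the Q-network weights.
    """
    online = make_online_network()
    target = copy.deepcopy(online)  # target starts as a frozen copy
    syncs = 0
    for step in range(1, num_steps + 1):
        # Placeholder "gradient step" on the online network only.
        online["q_params"][0] += 0.01
        online["argmax_params"][0] += 0.01
        # Periodic target update over the full parameter set.
        if step % sync_every == 0:
            target = copy.deepcopy(online)
            syncs += 1
    return online, target, syncs
```

With the defaults above, the target is refreshed every 25 of 100 steps, so the final sync lands on the last step and the target equals the online network at the end; the actual update frequency in the paper is among the Table 1 hyper-parameters.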