Learning in complex action spaces without policy gradients
Authors: Arash Tavakoli, Sina Ghiassian, Nemanja Rakićević
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that QMLE can be applied to complex action spaces at a computational cost comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE exhibits strong performance on the DeepMind Control Suite, even when compared to state-of-the-art methods such as DMPO and D4PG. We make our code publicly available. The paper includes a dedicated "Experiments" section (Section 5) with subsections on "Illustrative example", "Benchmarking results", and "Ablation studies". |
| Researcher Affiliation | Industry | All listed authors are affiliated with private companies: "Arash Tavakoli EMAIL Riot Games", "Sina Ghiassian EMAIL Spotify", and "Nemanja Rakićević EMAIL Google DeepMind". |
| Pseudocode | Yes | Algorithm 1 details the training procedure for QMLE. Specifically, our presentation is based on integrating our framework (Section 4) into the deep Q-learning algorithm by Mnih et al. (2015). In line with this, we make use of experience replay and a target network that is only periodically updated with the parameters of the online network. Importantly, we extend the scope of the target network to encompass the arg max predictors in QMLE. |
| Open Source Code | Yes | We make our code publicly available. To support reproducibility, we release the implementation used in our benchmarking experiments at: https://github.com/atavakol/qmle |
| Open Datasets | Yes | In this section, we evaluate QMLE on 18 continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018). |
| Dataset Splits | No | The paper evaluates QMLE on continuous control tasks from the DeepMind Control Suite (Tassa et al., 2018). These are reinforcement learning environments in which agents generate data through interaction with the environment rather than training on a static dataset. The concept of explicit training/validation/test splits therefore does not apply, and no such splits are reported in the paper. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | Our QMLE implementation is based on a DQN script from CleanRL (Huang et al., 2022) and incorporates prioritized experience replay adapted from Stable Baselines (Hill et al., 2018), both available under the permissive MIT license. The paper names these software tools but does not specify version numbers for them. |
| Experiment Setup | Yes | Table 1 provides the hyper-parameters of QMLE in our benchmarking experiments. Parameters listed include m_target, m_greedy, ρ_0, ρ_1, ρ_2, the step sizes α_q and α_argmax, update frequency, batch size, training start size, memory buffer size, target network update frequency, loss function, optimizer, exploration ε, discount factor, time limit, truncation approach, importance sampling exponent, and priority exponent. |
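The pseudocode evidence above describes a DQN-style loop whose target network is extended to also cover the arg max predictors. As a minimal sketch of that periodic-sync pattern only (all names such as `q_params`, `argmax_params`, and `sync_every` are illustrative, not taken from the paper's released code):

```python
import copy

def make_online_network():
    # Toy stand-in for the online network: Q-value parameters plus
    # arg-max predictor parameters, both updated during training.
    return {"q_params": [0.0, 0.0], "argmax_params": [0.0, 0.0]}

def train(num_steps=100, sync_every=25):
    """Run placeholder updates, periodically syncing the target network.

    The key point mirrored from the paper's description: the periodic
    copy covers ALL online parameters, including the arg-max predictors,
    not just the Q-network weights.
    """
    online = make_online_network()
    target = copy.deepcopy(online)  # target starts as a frozen copy
    syncs = 0
    for step in range(1, num_steps + 1):
        # Placeholder "gradient step" on the online network only.
        online["q_params"][0] += 0.01
        online["argmax_params"][0] += 0.01
        # Periodic target update over the full parameter set.
        if step % sync_every == 0:
            target = copy.deepcopy(online)
            syncs += 1
    return online, target, syncs
```

With the defaults above, the target is refreshed every 25 of 100 steps, so the final sync lands on the last step and the target equals the online network at the end; the actual update frequency in the paper is among the Table 1 hyper-parameters.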