ADDQ: Adaptive distributional double Q-learning

Authors: Leif Döring, Benedikt Wille, Maximilian Birr, Mihail Bîrsan, Martin Slowik

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments are provided for tabular, Atari, and MuJoCo environments.
Researcher Affiliation | Academia | 1 Institute of Mathematics, University of Mannheim, Germany; 2 Department of Mathematics and Computer Science, Freie Universität Berlin, Germany.
Pseudocode | Yes | Algorithm 1: Distributional Q-learning update step; Algorithm 2: ADDQ update step.
Open Source Code | Yes | The code used in our experiments can be found on GitHub: https://github.com/BommeHD/ADDQ.git.
Open Datasets | Yes | Experiments are provided for tabular, Atari, and MuJoCo environments. ... We run experiments on Atari environments from the Arcade Learning Environment (Bellemare et al., 2013) using the Gymnasium API (Towers et al., 2023). ... MuJoCo (Todorov et al., 2012) environments.
Dataset Splits | No | The paper mentions "10 evaluation episodes on 10 evaluation environments without exploration", but does not specify how the environments (e.g., Atari, MuJoCo) were split into training, validation, or test sets.
Hardware Specification | Yes | The experiments were executed on an HPC cluster with NVIDIA Tesla V100 and NVIDIA A100 GPUs.
Software Dependencies | No | The paper references software frameworks such as the Gymnasium API (Towers et al., 2023), RL Baselines3 Zoo (Raffin, 2020), and Stable-Baselines3 (Raffin et al., 2021), but does not provide version numbers for these components, which are crucial for reproducibility.
Experiment Setup | Yes | The C51 algorithm obtained its name from using a categorical representation of return distributions with m = 51 atoms. ... target network which is kept constant and is overwritten from η every e.g. 10000 steps with the parameters from the online network. ... Accordingly, we use twice the batch size for these methods ... step-size schedule α_t(s, a) = 1/T_{s,a}(t), with T_{s,a}(t) the number of visits of (s, a) up to time t, i.e. a 1/n state-action-wise count; exploration: ε-greedy with ε linearly decreasing from 1 to 0.1 in 10000 steps, then constant.
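As an illustrative sketch only (not the authors' code), the tabular schedules quoted above — the 1/n state-action step size α_t(s, a) = 1/T_{s,a}(t) and the ε-greedy exploration with ε decaying linearly from 1 to 0.1 over 10000 steps — could be written as follows; the function and variable names here are assumptions chosen for clarity:

```python
import random

def step_size(visit_count: int) -> float:
    """1/n step size: alpha_t(s, a) = 1 / T_{s,a}(t),
    where visit_count is the number of visits of (s, a) so far."""
    return 1.0 / visit_count

def epsilon(t: int, eps_start: float = 1.0, eps_end: float = 0.1,
            decay_steps: int = 10_000) -> float:
    """Epsilon linearly decreasing from eps_start to eps_end
    over decay_steps environment steps, then held constant."""
    if t >= decay_steps:
        return eps_end
    return eps_start + (eps_end - eps_start) * t / decay_steps

def epsilon_greedy_action(q_values: dict, actions: list, eps: float,
                          rng=random) -> object:
    """With probability eps pick a uniformly random action,
    otherwise the greedy action under the current Q estimates."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])
```

For example, `epsilon(0)` returns 1.0, `epsilon(5_000)` returns 0.55, and `epsilon(20_000)` stays at 0.1; `step_size(4)` returns 0.25 for the fourth visit to a state-action pair.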