Optimizing Return Distributions with Distributional Dynamic Programming

Authors: Bernardo Ávila Pires, Mark Rowland, Diana Borsa, Zhaohan Daniel Guo, Khimya Khetarpal, André Barreto, David Abel, Rémi Munos, Will Dabney

JMLR 2025

Reproducibility assessment. Each entry lists the variable, the result, and the LLM's supporting response:
Research Type: Experimental
"To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we introduce an agent that combines DQN and the core ideas of distributional DP, and empirically evaluate it for solving instances of the applications discussed."
Researcher Affiliation: Industry
"Bernardo Ávila Pires, Google DeepMind, London, UK; Mark Rowland, Google DeepMind; Diana Borsa, Google DeepMind; Zhaohan Daniel Guo, Google DeepMind; Khimya Khetarpal, Google DeepMind; André Barreto, Google DeepMind; David Abel, Google DeepMind; Rémi Munos, FAIR, Meta (work done at Google DeepMind); Will Dabney, Google DeepMind"
Pseudocode: No
"The paper describes algorithms and models like DQN, QR-DQN, and DηN in prose and with architectural diagrams (Figure 1), but it does not present any formal pseudocode blocks or algorithms labeled as such."
Open Source Code: No
"Our experimental infrastructure was built using Python 3, Flax (Heek et al., 2024), Haiku (Hennigan et al., 2020), JAX (Bradbury et al., 2018), and NumPy (Harris et al., 2020). We have used Matplotlib (Hunter, 2007), NumPy (Harris et al., 2020), pandas (Wes McKinney, 2010; pandas development team, 2020) and SciPy (Virtanen et al., 2020) for analyzing and plotting our experimental data."
Open Datasets: Yes
"Atari 2600 (Bellemare et al., 2013) is a popular RL benchmark where several deep RL agents have been evaluated, including DQN (Mnih et al., 2015) and QR-DQN (Dabney et al., 2018)."
Dataset Splits: No
"The paper does not explicitly provide training/test/validation dataset splits with percentages or sample counts. For gridworld experiments, it mentions randomizing the starting c0. For Atari, it specifies episode duration and mentions a 3:7 mixture of online and replay data for training, but not dataset splits."
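The 3:7 online-to-replay training mixture noted above can be sketched as a batch-composition rule. This is a minimal illustration, not code from the paper: the buffer representation, function name, and uniform sampling are all assumptions.

```python
import random

def mixed_batch(online_data, replay_data, batch_size, online_fraction=0.3):
    """Draw a batch mixing online and replay transitions (3:7 by default).

    Illustrative sketch: buffers are plain lists of transitions here,
    and both sources are sampled uniformly without replacement.
    """
    n_online = round(batch_size * online_fraction)
    n_replay = batch_size - n_online
    batch = random.sample(online_data, n_online) + random.sample(replay_data, n_replay)
    random.shuffle(batch)  # interleave the two sources
    return batch
```

With `batch_size=10`, this yields 3 online and 7 replay transitions per batch.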
Hardware Specification: Yes
"In these experiments we trained DηN on an Nvidia V100 GPU... In these experiments we trained DηN in a distributed actor-learner setup (Horgan et al., 2018) using TPUv3 actors and learners."
Software Dependencies: No
"Our experimental infrastructure was built using Python 3, Flax (Heek et al., 2024), Haiku (Hennigan et al., 2020), JAX (Bradbury et al., 2018), and NumPy (Harris et al., 2020). We have used Matplotlib (Hunter, 2007), NumPy (Harris et al., 2020), pandas (Wes McKinney, 2010; pandas development team, 2020) and SciPy (Virtanen et al., 2020) for analyzing and plotting our experimental data."
Experiment Setup: Yes
"Table 7: Training parameters for DηN in the gridworld experiments.
  Batch size: 64
  Trajectory length: 16
  Training duration (environment steps): 2M
  Training duration (learner updates): 2K
  Adam optimizer learning rate: 10^-4
  Target network exponential moving average step size (α): 10^-2
  Discount (γ): 0.997
  ε-greedy parameter: 0.1
  Interval for sampling c0: [-10, 10)
...
Table 9: Training parameters for DηN in the Atari experiments.
  Batch size (global, across 6 learners): 144
  Trajectory length: 19
  Training duration (environment steps): 75M
  Training duration (learner updates): 3.44K
  Adam optimizer learning rate: 10^-4
  Weight decay: 10^-2
  Gradient norm clipping: 10
  Target network exponential moving average step size (α): 10^-2
  Discount (γ): 0.997
  Interval for sampling c0: [-9, 9)"
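Both tables specify a target-network exponential moving average with step size α = 10^-2. The standard update this implies is target ← (1 − α)·target + α·online; the sketch below assumes parameters stored as a flat dict of floats, purely for illustration (the paper's implementation uses JAX/Flax parameter trees).

```python
def ema_target_update(target_params, online_params, alpha=1e-2):
    """Exponential-moving-average target network update.

    Applies target <- (1 - alpha) * target + alpha * online elementwise.
    Parameters are modeled as a flat {name: float} dict for illustration.
    """
    return {
        name: (1 - alpha) * target_params[name] + alpha * online_params[name]
        for name in target_params
    }
```

With α = 10^-2, the target network tracks the online network slowly, which is the usual role of this parameter in DQN-style training.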