Simplifying Deep Temporal Difference Learning
Authors: Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Foerster, Mario Martin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our theoretical results, we evaluated PQN in Baird's counterexample, a challenging domain that is provably divergent for off-policy methods (Baird, 1995). Our results show that PQN can converge where non-regularised variants fail. We provide an extensive empirical evaluation to test the performance of PQN in single-agent and multi-agent settings. Despite its simplicity, our algorithm is competitive in a range of tasks; notably, PQN achieves high performance in just a few hours in many games of the Arcade Learning Environment (ALE) (Bellemare et al., 2013), competes effectively with PPO on the open-ended Craftax task (Matthews et al., 2024a), and stands alongside state-of-the-art Multi-Agent RL (MARL) algorithms, such as MAPPO in Overcooked (Carroll et al., 2019) and Hanabi (Bard et al., 2020) and Qmix in Smax (Rutherford et al., 2023). |
| Researcher Affiliation | Academia | 1 Universitat Politècnica de Catalunya, 2 University of Oxford, 3 Barcelona Supercomputing Center, 4 Institut de Ciències del Mar |
| Pseudocode | Yes | In Algorithm 1 we present PQN with λ-returns, which is a parallelised variant of the approach of Daley & Amato (2019). An exploration policy π_explore (ϵ-greedy for this paper) is rolled out for a short trajectory of length T: (s_i, a_i, r_i, s_{i+1}, …, s_{i+T}). Starting from R^λ_{i+T} = max_{a′} Q_ϕ(s_{i+T}, a′), the targets are computed recursively backward in time from R^λ_{i+T−1} to R^λ_i using R^λ_t = r_t + γ[λ R^λ_{t+1} + (1−λ) max_{a′} Q_ϕ(s_{t+1}, a′)], or R^λ_t = r_t if s_{t+1} is a terminal state. We provide a derivation of our approach in Appendix B.4. Due to the use of λ-returns and minibatches, we require a small buffer of size I × T containing interactions from the current exploration policy. Algorithm 1 (PQN with λ-returns): 1: ϕ ← initialise regularised Q-network parameters; 2: s_0 ∼ P_0, t ← 0; 3: for each episode do; 4: for each i ∈ {0, 1, …, I−1} (in parallel) do; 5: a^i_t ∼ π_explore(s^i_t) (e.g. ϵ-greedy); 6: r^i_t ∼ P_R(s^i_t, a^i_t), s^i_{t+1} ∼ P_S(s^i_t, a^i_t); 7: t ← t + 1; 8: end for; 9: if t mod T = 0 then; 10: calculate R^{λ,i}_{t−1} to R^{λ,i}_{t−T}; 11: for number of epochs do; 12: for number of minibatches do; 13: draw minibatch B of size b ≤ I × T from {t−T, …, t−1} and {0, …, I−1}; 14: ϕ ← ϕ − α_t ∇_ϕ Σ_{(i,t)∈B} (R^{λ,i}_t − Q_ϕ(x^i_t))²; 15: end for; 16: end for; 17: end if; 18: end for |
| Open Source Code | Yes | We open-source our code at: https://github.com/mttga/purejaxql. REPRODUCIBILITY STATEMENT All our experiments can be replicated with the following repository: https://github.com/mttga/purejaxql. |
| Open Datasets | Yes | Our results show that PQN can converge where non-regularised variants fail. We provide an extensive empirical evaluation to test the performance of PQN in single-agent and multi-agent settings. Despite its simplicity, our algorithm is competitive in a range of tasks; notably, PQN achieves high performance in just a few hours in many games of the Arcade Learning Environment (ALE) (Bellemare et al., 2013), competes effectively with PPO on the open-ended Craftax task (Matthews et al., 2024a), and stands alongside state-of-the-art Multi-Agent RL (MARL) algorithms, such as MAPPO in Overcooked (Carroll et al., 2019) and Hanabi (Bard et al., 2020) and Qmix in Smax (Rutherford et al., 2023). |
| Dataset Splits | No | The paper describes using various RL environments (e.g., Arcade Learning Environment, Craftax, Smax, Overcooked, Hanabi) for experiments, often referring to game suites or specific tasks within those environments. It mentions rolling out a greedy-policy in parallel environments for training and evaluation. However, it does not provide explicit numerical training/test/validation dataset splits (e.g., 80/10/10 percentages or sample counts) for any static datasets, as the nature of RL experiments often involves online interaction with environments rather than pre-split datasets. |
| Hardware Specification | Yes | All experimental results are shown as the mean of 10 seeds, except in the Atari Learning Environment (ALE), where we followed the common practice of reporting 3 seeds. They were performed on a single NVIDIA A40 by jit-compiling the entire pipeline with Jax on the GPU, except for the Atari experiments, where the environments run on an AMD 7513 32-Core Processor. |
| Software Dependencies | No | The paper mentions using Jax for jit-compiling the pipeline, Rectified Adam optimizer (Liu et al., 2019), and various environment libraries like Envpool and Jax MARL. However, it does not specify explicit version numbers for Jax, Envpool, Jax MARL, or any other software libraries, which is necessary for full reproducibility. |
| Experiment Setup | Yes | Hyperparameters for all experiments can be found in Appendix E. We used the algorithm proposed in Algorithm 1. All experiments used the Rectified Adam optimiser (Liu et al., 2019). Appendix E contains tables such as 'Table 4: Craftax RNN Hyperparameters', 'Table 5: Atari Hyperparameters', 'Table 6: SMAX Hyperparameters', 'Table 7: Overcooked Hyperparameters', and 'Table 8: Hanabi Hyperparameters', which list specific values for parameters such as NUM_ENVS, NUM_STEPS, EPS_START, EPS_FINISH, EPS_DECAY, NUM_MINIBATCHES, NUM_EPOCHS, LR, MAX_GRAD_NORM, REW_SCALE, GAMMA, and LAMBDA. |
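The backward λ-return recursion quoted in the Pseudocode row can be sketched as a short loop. The following is an illustrative NumPy version for a single environment's trajectory segment, not the paper's JAX implementation; the function name, argument layout, and default γ/λ values are assumptions for the sketch. `q_next_max[t]` holds max_a Q_ϕ(s_{t+1}, a), so `q_next_max[-1]` supplies the bootstrap value R^λ_{i+T} = max_a Q_ϕ(s_{i+T}, a).

```python
import numpy as np

def lambda_returns(rewards, q_next_max, dones, gamma=0.99, lam=0.65):
    """Compute lambda-return targets backward through a length-T segment.

    Mirrors the recursion from Algorithm 1:
        R^lam_t = r_t + gamma * [lam * R^lam_{t+1} + (1 - lam) * max_a Q(s_{t+1}, a)]
    with R^lam_t = r_t when the episode terminates after step t.
    Illustrative sketch only (names and defaults are not from the paper's code).
    """
    T = len(rewards)
    returns = np.zeros(T)
    # Bootstrap from the greedy value of the final state; since
    # lam * v + (1 - lam) * v = v, initialising with q_next_max[-1]
    # makes the t = T-1 step reduce to the standard bootstrap.
    next_return = q_next_max[-1]
    for t in reversed(range(T)):
        if dones[t]:
            # Episode ended: no bootstrapping past the terminal state.
            returns[t] = rewards[t]
        else:
            returns[t] = rewards[t] + gamma * (
                lam * next_return + (1.0 - lam) * q_next_max[t]
            )
        next_return = returns[t]
    return returns
```

With `lam=1` this collapses to a Monte Carlo return bootstrapped at the segment boundary; with `lam=0` each target is the one-step TD target r_t + γ max_a Q(s_{t+1}, a), matching the interpolation the recursion expresses.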