Simplifying Deep Temporal Difference Learning
Authors: Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Foerster, Mario Martin
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate our theoretical results, we evaluated PQN in Baird's counterexample, a challenging domain that is provably divergent for off-policy methods (Baird, 1995). Our results show that PQN can converge where non-regularised variants fail. We provide an extensive empirical evaluation to test the performance of PQN in single-agent and multi-agent settings. Despite its simplicity, our algorithm is competitive in a range of tasks; notably, PQN achieves high performance in just a few hours in many games of the Arcade Learning Environment (ALE) (Bellemare et al., 2013), competes effectively with PPO on the open-ended Craftax task (Matthews et al., 2024a), and stands alongside state-of-the-art Multi-Agent RL (MARL) algorithms, such as MAPPO in Overcooked (Carroll et al., 2019) and Hanabi (Bard et al., 2020) and Qmix in Smax (Rutherford et al., 2023). |
| Researcher Affiliation | Academia | 1 Universitat Politècnica de Catalunya, 2 University of Oxford, 3 Barcelona Supercomputing Center, 4 Institut de Ciències del Mar |
| Pseudocode | Yes | In Algorithm 1 we present PQN with λ-returns, which is a parallelised variant of the approach of Daley & Amato (2019). An exploration policy π_explore (ϵ-greedy for this paper) is rolled out for a short trajectory of length T: (s_i, a_i, r_i, s_{i+1}, …, s_{i+T}). Starting from R^λ_{i+T} = max_{a′} Q_ϕ(s_{i+T}, a′), the targets are computed recursively backward in time from R^λ_{i+T−1} to R^λ_i using R^λ_t = r_t + γ[λ R^λ_{t+1} + (1−λ) max_{a′} Q_ϕ(s_{t+1}, a′)], or R^λ_t = r_t if s_{t+1} is a terminal state. We provide a derivation of our approach in Appendix B.4. Due to the use of λ-returns and minibatches, we require a small buffer of size I × T containing interactions from the current exploration policy. Algorithm 1 (PQN with λ-returns): 1: ϕ ← initialise regularised Q-network parameters; 2: s_0 ∼ P_0, t ← 0; 3: for each episode do; 4: for each i ∈ {0, 1, …, I−1} (in parallel) do; 5: a^i_t ∼ π_explore(s^i_t) (e.g. ϵ-greedy); 6: r^i_t ∼ P_R(s^i_t, a^i_t), s^i_{t+1} ∼ P_S(s^i_t, a^i_t); 7: t ← t + 1; 8: end for; 9: if t mod T = 0 then; 10: calculate R^{λ,i}_{t−1} to R^{λ,i}_{t−T}; 11: for number of epochs do; 12: for number of minibatches do; 13: draw minibatch B of size b ≤ I × T from {t−T, …, t−1} and {0, …, I−1}; 14: ϕ ← ϕ − α_t ∇_ϕ Σ_{(i,t)∈B} (R^{λ,i}_t − Q_ϕ(x^i_t))²; 15: end for; 16: end for; 17: end if; 18: end for |
| Open Source Code | Yes | We open-source our code at: https://github.com/mttga/purejaxql. REPRODUCIBILITY STATEMENT All our experiments can be replicated with the following repository: https://github.com/mttga/purejaxql. |
| Open Datasets | Yes | Our results show that PQN can converge where non-regularised variants fail. We provide an extensive empirical evaluation to test the performance of PQN in single-agent and multi-agent settings. Despite its simplicity, our algorithm is competitive in a range of tasks; notably, PQN achieves high performance in just a few hours in many games of the Arcade Learning Environment (ALE) (Bellemare et al., 2013), competes effectively with PPO on the open-ended Craftax task (Matthews et al., 2024a), and stands alongside state-of-the-art Multi-Agent RL (MARL) algorithms, such as MAPPO in Overcooked (Carroll et al., 2019) and Hanabi (Bard et al., 2020) and Qmix in Smax (Rutherford et al., 2023). |
| Dataset Splits | No | The paper describes using various RL environments (e.g., Arcade Learning Environment, Craftax, Smax, Overcooked, Hanabi) for experiments, often referring to game suites or specific tasks within those environments. It mentions rolling out a greedy-policy in parallel environments for training and evaluation. However, it does not provide explicit numerical training/test/validation dataset splits (e.g., 80/10/10 percentages or sample counts) for any static datasets, as the nature of RL experiments often involves online interaction with environments rather than pre-split datasets. |
| Hardware Specification | Yes | All experimental results are shown as the mean of 10 seeds, except in the Atari Learning Environment (ALE), where we followed the common practice of reporting 3 seeds. They were performed on a single NVIDIA A40 by jit-compiling the entire pipeline with Jax on the GPU, except for the Atari experiments, where the environments run on an AMD 7513 32-Core Processor. |
| Software Dependencies | No | The paper mentions using Jax for jit-compiling the pipeline, Rectified Adam optimizer (Liu et al., 2019), and various environment libraries like Envpool and Jax MARL. However, it does not specify explicit version numbers for Jax, Envpool, Jax MARL, or any other software libraries, which is necessary for full reproducibility. |
| Experiment Setup | Yes | Hyperparameters for all experiments can be found in Appendix E. We used the algorithm proposed in Algorithm 1. All experiments used the Rectified Adam optimiser (Liu et al., 2019). Appendix E contains tables such as 'Table 4: Craftax RNN Hyperparameters', 'Table 5: Atari Hyperparameters', 'Table 6: SMAX Hyperparameters', 'Table 7: Overcooked Hyperparameters', and 'Table 8: Hanabi Hyperparameters', which list specific values for parameters such as NUM_ENVS, NUM_STEPS, EPS_START, EPS_FINISH, EPS_DECAY, NUM_MINIBATCHES, NUM_EPOCHS, LR, MAX_GRAD_NORM, REW_SCALE, GAMMA, and LAMBDA. |
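The backward λ-return recursion quoted in the Pseudocode row can be sketched as a short loop. The following is an illustrative NumPy version for a single environment's trajectory segment, not the paper's JAX implementation; the function name, argument layout, and default γ/λ values are assumptions for the sketch. `q_next_max[t]` holds max_a Q_ϕ(s_{t+1}, a), so `q_next_max[-1]` supplies the bootstrap value R^λ_{i+T} = max_a Q_ϕ(s_{i+T}, a).

```python
import numpy as np

def lambda_returns(rewards, q_next_max, dones, gamma=0.99, lam=0.65):
    """Compute lambda-return targets backward through a length-T segment.

    Mirrors the recursion from Algorithm 1:
        R^lam_t = r_t + gamma * [lam * R^lam_{t+1} + (1 - lam) * max_a Q(s_{t+1}, a)]
    with R^lam_t = r_t when the episode terminates after step t.
    Illustrative sketch only (names and defaults are not from the paper's code).
    """
    T = len(rewards)
    returns = np.zeros(T)
    # Bootstrap from the greedy value of the final state; since
    # lam * v + (1 - lam) * v = v, initialising with q_next_max[-1]
    # makes the t = T-1 step reduce to the standard bootstrap.
    next_return = q_next_max[-1]
    for t in reversed(range(T)):
        if dones[t]:
            # Episode ended: no bootstrapping past the terminal state.
            returns[t] = rewards[t]
        else:
            returns[t] = rewards[t] + gamma * (
                lam * next_return + (1.0 - lam) * q_next_max[t]
            )
        next_return = returns[t]
    return returns
```

With `lam=1` this collapses to a Monte Carlo return bootstrapped at the segment boundary; with `lam=0` each target is the one-step TD target r_t + γ max_a Q(s_{t+1}, a), matching the interpolation the recursion expresses.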