Multi-agent cooperation through learning-aware policy gradients
Authors: Alexander Meulemans, Seijin Kobayashi, Johannes von Oswald, Nino Scherrer, Eric Elmoznino, Blake A Richards, Guillaume Lajoie, Blaise Aguera y Arcas, Joao Sacramento
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The present paper contains two main novel results on learning awareness in general-sum games. First, we introduce a new learning-aware reinforcement learning rule derived as a policy gradient estimator... We then leverage efficient sequence models to condition behavior on long observation histories... Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required... We train a long-context sequence policy π^i(a^{i,b}_l \| h^{i,b}_l; φ^i) with the COALA-PG rule to play the (finite) iterated prisoner's dilemma, see Appendix B. We choose a Hawk recurrent neural network as the policy backbone (De et al., 2024). Hawk models achieve transformer-level performance at scale, but with time and memory costs that grow only linearly with sequence length. This allows efficient processing of the long history context h^{i,b}_l, which contains all actions played by the agents across episodes. Based on the results of the preceding section, we consider a mixed group setting, pitting COALA-PG-trained agents against naive learners as well as other equally capable learning-aware agents. |
| Researcher Affiliation | Industry | Alexander Meulemans1,\*, Seijin Kobayashi1,\*, Johannes von Oswald1, Nino Scherrer1, Eric Elmoznino1,2,3, Blake A. Richards1,2,3,4,5, Guillaume Lajoie1,2,3,4,5, Blaise Agüera y Arcas1, João Sacramento1. 1Google, Paradigms of Intelligence Team, 2Mila Quebec AI Institute, 3Université de Montréal, 4McGill University, 5CIFAR. \*Equal contribution. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Batch Lambda Returns. Input: r_t, discount, v_t, λ, average_future_episodes, normalize_current_episode, inner_episode_length. Output: returns. seq_len ← r_t.shape[1]; batch_size ← r_t.shape[0]; if normalize_current_episode then normalization ← batch_size else normalization ← 1; episode_end ← (range(seq_len) mod inner_episode_length) == (inner_episode_length − 1); acc ← v_t[:, −1]; global_acc ← mean(v_t[:, −1]); for t = seq_len − 1 down to 0: if average_future_episodes and episode_end[t] then acc ← global_acc; acc ← r_t[:, t]/normalization + discount · ((1 − λ) · v_t[:, t] + λ · acc); global_acc ← mean(r_t[:, t] + discount · ((1 − λ) · v_t[:, t] + λ · global_acc)); returns[:, t] ← acc; return returns |
| Open Source Code | No | The results reported in this paper were produced with open-source software. We used the Python programming language together with the Google JAX (Bradbury et al., 2018) framework, and the NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007), Flax (Heek et al., 2024) and Optax (Babuschkin et al., 2020) packages. |
| Open Datasets | Yes | Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required... We analyze the iterated prisoner's dilemma (IPD), a canonical model for studying cooperation among self-interested agents (Rapoport, 1974; Axelrod & Hamilton, 1981)... Finally, we consider Clean Up-lite, a simplified two-player version of the Clean Up game, which is part of the Melting Pot suite of multi-agent environments (Agapiou et al., 2023). |
| Dataset Splits | No | The paper discusses concepts such as 'inner-episodes', 'meta-trajectories', and 'minibatches' in the context of reinforcement learning. It describes how agents learn by generating experience in environments like the Iterated Prisoner's Dilemma and Clean Up-lite, and details the training procedure for meta agents, including how batches of opponent trajectories are sampled and used for updates. However, it does not specify traditional dataset splits (e.g., train/test/validation percentages or sample counts) for a static dataset, which is common in supervised learning. The information provided relates to how training data (experience) is generated and processed in an episodic learning setup, not to pre-defined splits of a fixed dataset. |
| Hardware Specification | No | The paper mentions general concepts like 'scalable architectures based on recurrent sequence policy models' and 'modern sequence models', and refers to collaborators at 'Google', implying use of their infrastructure. However, it does not provide specific details on the hardware used for the experiments, such as GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or specific TPU versions. The text does not contain any concrete hardware specifications. |
| Software Dependencies | No | The results reported in this paper were produced with open-source software. We used the Python programming language together with the Google JAX (Bradbury et al., 2018) framework, and the NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007), Flax (Heek et al., 2024) and Optax (Babuschkin et al., 2020) packages. |
| Experiment Setup | Yes | In all experiments, we first fix the environment hyperparameters. To find suitable hyperparameters for each method, we perform a sweep over reinforcement learning hyperparameters and select the best configuration after averaging over 3 seeds. The final performance and metrics are then computed using 5 fresh seeds. In all our experiments, naive agents update their parameters using the Advantage Actor-Critic (A2C) algorithm, without value bootstrapping, on the batch of length-T trajectories. The hyperparameters for all experiments can be found in Table 8. IPD, Figure 5: We perform 2 experiments in the IPD environment... For both experimental settings, we show the environment hyperparameters in Table 2. All meta agents are trained with PPO and the Adam optimizer. For each method, we sweep hyperparameters over the ranges specified in Table 3. Table 4 shows the resulting hyperparameters for all methods. Cleanup, Figures 6, 7: Likewise, we have the pure shaping (Figure 6) and mixed pool (Figure 7) experiments in the Cleanup-lite environment. For both experimental settings, we show the environment hyperparameters in Table 5. All meta agents are trained with PPO and the Adam optimizer for the pure shaping setting, while using A2C and SGD for the mixed pool setting. For each method, we sweep hyperparameters over the ranges specified in Table 6. Table 7 shows the resulting hyperparameters for PPO for all methods. |
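The Batch Lambda Returns pseudocode quoted in the "Pseudocode" row can be sketched as a runnable function. This is a best-effort reconstruction from the extracted text under stated assumptions, not the authors' implementation (which uses JAX): the function name `batch_lambda_returns` is ours, and we read the garbled `acc ← vt[:, 1]` as bootstrapping from the final-step value `v[:, -1]`.

```python
import numpy as np

def batch_lambda_returns(r, v, discount, lam, inner_episode_length,
                         average_future_episodes=False,
                         normalize_current_episode=False):
    """Backward recursion for lambda-returns over a batch of meta-trajectories.

    r, v: arrays of shape (batch_size, seq_len) with rewards and value
    estimates. With `average_future_episodes`, the per-trajectory
    accumulator is reset to a batch-averaged return at every
    inner-episode boundary, so credit flowing in from future episodes
    is shared across the batch.
    """
    batch_size, seq_len = r.shape
    normalization = batch_size if normalize_current_episode else 1
    # Mark the last step of each inner episode inside the meta-trajectory.
    episode_end = (np.arange(seq_len) % inner_episode_length) == inner_episode_length - 1
    acc = v[:, -1].copy()           # bootstrap from the final value estimate
    global_acc = v[:, -1].mean()    # batch-averaged counterpart
    returns = np.zeros_like(r, dtype=float)
    for t in range(seq_len - 1, -1, -1):
        if average_future_episodes and episode_end[t]:
            acc = np.full(batch_size, global_acc)
        acc = r[:, t] / normalization + discount * ((1 - lam) * v[:, t] + lam * acc)
        global_acc = (r[:, t] + discount * ((1 - lam) * v[:, t] + lam * global_acc)).mean()
        returns[:, t] = acc
    return returns
```

As a sanity check, with `lam=1`, `discount=1`, zero values and unit rewards, the return at step `t` is just the number of remaining steps.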
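The iterated prisoner's dilemma cited in the "Open Datasets" row is simple enough to sketch in a few lines, which makes concrete why cooperation among self-interested agents is hard there. The payoff values below follow the common textbook convention and are not taken from the paper; `play_ipd`, `tit_for_tat`, and `always_defect` are illustrative names.

```python
# Prisoner's dilemma payoffs (row player, column player); illustrative
# values in the standard convention, not the paper's exact numbers.
PAYOFFS = {
    ("C", "C"): (-1, -1),
    ("C", "D"): (-3, 0),
    ("D", "C"): (0, -3),
    ("D", "D"): (-2, -2),
}

def play_ipd(policy_a, policy_b, n_rounds=10):
    """Roll out one iterated game. Each policy maps the opponent's
    previous move (None on the first round) to 'C' or 'D'."""
    prev_a = prev_b = None
    total_a = total_b = 0
    for _ in range(n_rounds):
        a, b = policy_a(prev_b), policy_b(prev_a)
        ra, rb = PAYOFFS[(a, b)]
        total_a += ra
        total_b += rb
        prev_a, prev_b = a, b
    return total_a, total_b

tit_for_tat = lambda prev: "C" if prev in (None, "C") else "D"
always_defect = lambda prev: "D"

print(play_ipd(tit_for_tat, tit_for_tat, 10))    # mutual cooperation: (-10, -10)
print(play_ipd(tit_for_tat, always_defect, 10))  # (-21, -18): defection pays once, then both lose
```

Defection strictly dominates in a single round, yet mutual cooperation yields the higher long-run return, which is the tension the paper's learning-aware agents are shown to resolve.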