Multi-agent cooperation through learning-aware policy gradients
Authors: Alexander Meulemans, Seijin Kobayashi, Johannes von Oswald, Nino Scherrer, Eric Elmoznino, Blake A Richards, Guillaume Lajoie, Blaise Aguera y Arcas, Joao Sacramento
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The present paper contains two main novel results on learning awareness in general-sum games. First, we introduce a new learning-aware reinforcement learning rule derived as a policy gradient estimator... We then leverage efficient sequence models to condition behavior on long observation histories... Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required... We train a long-context sequence policy π^i(a^{i,b}_l \| h^{i,b}_l; φ^i) with the COALA-PG rule to play the (finite) iterated prisoner's dilemma, see Appendix B. We choose a Hawk recurrent neural network as the policy backbone (De et al., 2024). Hawk models achieve transformer-level performance at scale, but with time and memory costs that grow only linearly with sequence length. This allows efficient processing of the long history context h^{i,b}_l, which contains all actions played by the agents across episodes. Based on the results of the preceding section, we consider a mixed group setting, pitting COALA-PG-trained agents against naive learners as well as other equally capable learning-aware agents. |
| Researcher Affiliation | Industry | Alexander Meulemans1,\*, Seijin Kobayashi1,\*, Johannes von Oswald1, Nino Scherrer1, Eric Elmoznino1,2,3, Blake A. Richards1,2,3,4,5, Guillaume Lajoie1,2,3,4,5, Blaise Agüera y Arcas1, João Sacramento1. 1Google, Paradigms of Intelligence Team, 2Mila Quebec AI Institute, 3Université de Montréal, 4McGill University, 5CIFAR. \*Equal contribution. EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: Batch Lambda Returns. Input: r_t, discount, v_t, λ, average_future_episodes, normalize_current_episode, inner_episode_length. Output: returns. seq_len ← r_t.shape[1]; batch_size ← r_t.shape[0]; if normalize_current_episode then normalization ← batch_size else normalization ← 1; episode_end ← (range(seq_len) mod inner_episode_length) == (inner_episode_length − 1); acc ← v_t[:, −1]; global_acc ← mean(v_t[:, −1]); for t = seq_len − 1 down to 0: if average_future_episodes and episode_end[t] then acc ← global_acc; acc ← r_t[:, t]/normalization + discount · ((1 − λ) · v_t[:, t] + λ · acc); global_acc ← mean(r_t[:, t] + discount · ((1 − λ) · v_t[:, t] + λ · global_acc)); returns[:, t] ← acc; return returns |
| Open Source Code | No | The results reported in this paper were produced with open-source software. We used the Python programming language together with the Google JAX (Bradbury et al., 2018) framework, and the NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007), Flax (Heek et al., 2024) and Optax (Babuschkin et al., 2020) packages. |
| Open Datasets | Yes | Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required... We analyze the iterated prisoner's dilemma (IPD), a canonical model for studying cooperation among self-interested agents (Rapoport, 1974; Axelrod & Hamilton, 1981)... Finally, we consider Clean Up-lite, a simplified two-player version of the Clean Up game, which is part of the Melting Pot suite of multi-agent environments (Agapiou et al., 2023). |
| Dataset Splits | No | The paper discusses concepts such as 'inner-episodes', 'meta-trajectories', and 'minibatches' in the context of reinforcement learning. It describes how agents learn by generating experience in environments like the Iterated Prisoner's Dilemma and Clean Up-lite, and details the training procedure for meta agents, including how batches of opponent trajectories are sampled and used for updates. However, it does not specify traditional dataset splits (e.g., train/test/validation percentages or sample counts) for a static dataset, which is common in supervised learning. The information provided relates to how training data (experience) is generated and processed in an episodic learning setup, not to pre-defined splits of a fixed dataset. |
| Hardware Specification | No | The paper mentions general concepts like 'scalable architectures based on recurrent sequence policy models' and 'modern sequence models', and refers to collaborators at 'Google', implying use of their infrastructure. However, it does not provide specific details on the hardware used for the experiments, such as GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or specific TPU versions. The text does not contain any concrete hardware specifications. |
| Software Dependencies | No | The results reported in this paper were produced with open-source software. We used the Python programming language together with the Google JAX (Bradbury et al., 2018) framework, and the NumPy (Harris et al., 2020), Matplotlib (Hunter, 2007), Flax (Heek et al., 2024) and Optax (Babuschkin et al., 2020) packages. |
| Experiment Setup | Yes | In all experiments, we first fix the environment hyperparameters. To find suitable hyperparameters for each method, we perform a sweep over reinforcement learning hyperparameters and select the best configuration after averaging over 3 seeds. The final performance and metrics are then computed using 5 fresh seeds. In all our experiments, naive agents update their parameters using the Advantage Actor-Critic (A2C) algorithm, without value bootstrapping, on the batch of length-T trajectories. The hyperparameters for all experiments can be found in Table 8. IPD, Figure 5: We perform 2 experiments in the IPD environment... For both experimental settings, we show the environment hyperparameters in Table 2. All meta agents are trained with PPO and the Adam optimizer. For each method, we sweep hyperparameters over the ranges specified in Table 3. Table 4 shows the resulting hyperparameters for all methods. Cleanup, Figures 6, 7: Likewise, we have the pure shaping (Figure 6) and mixed pool (Figure 7) experiments in the Cleanup-lite environment. For both experimental settings, we show the environment hyperparameters in Table 5. All meta agents are trained with PPO and the Adam optimizer for the pure shaping setting, while using A2C and SGD for the mixed pool setting. For each method, we sweep hyperparameters over the ranges specified in Table 6. Table 7 shows the resulting hyperparameters for PPO for all methods. |
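The Batch Lambda Returns pseudocode quoted in the "Pseudocode" row can be sketched as a runnable function. This is a best-effort reconstruction from the extracted text under stated assumptions, not the authors' implementation (which uses JAX): the function name `batch_lambda_returns` is ours, and we read the garbled `acc ← vt[:, 1]` as bootstrapping from the final-step value `v[:, -1]`.

```python
import numpy as np

def batch_lambda_returns(r, v, discount, lam, inner_episode_length,
                         average_future_episodes=False,
                         normalize_current_episode=False):
    """Backward recursion for lambda-returns over a batch of meta-trajectories.

    r, v: arrays of shape (batch_size, seq_len) with rewards and value
    estimates. With `average_future_episodes`, the per-trajectory
    accumulator is reset to a batch-averaged return at every
    inner-episode boundary, so credit flowing in from future episodes
    is shared across the batch.
    """
    batch_size, seq_len = r.shape
    normalization = batch_size if normalize_current_episode else 1
    # Mark the last step of each inner episode inside the meta-trajectory.
    episode_end = (np.arange(seq_len) % inner_episode_length) == inner_episode_length - 1
    acc = v[:, -1].copy()           # bootstrap from the final value estimate
    global_acc = v[:, -1].mean()    # batch-averaged counterpart
    returns = np.zeros_like(r, dtype=float)
    for t in range(seq_len - 1, -1, -1):
        if average_future_episodes and episode_end[t]:
            acc = np.full(batch_size, global_acc)
        acc = r[:, t] / normalization + discount * ((1 - lam) * v[:, t] + lam * acc)
        global_acc = (r[:, t] + discount * ((1 - lam) * v[:, t] + lam * global_acc)).mean()
        returns[:, t] = acc
    return returns
```

As a sanity check, with `lam=1`, `discount=1`, zero values and unit rewards, the return at step `t` is just the number of remaining steps.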
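The iterated prisoner's dilemma cited in the "Open Datasets" row is simple enough to sketch in a few lines, which makes concrete why cooperation among self-interested agents is hard there. The payoff values below follow the common textbook convention and are not taken from the paper; `play_ipd`, `tit_for_tat`, and `always_defect` are illustrative names.

```python
# Prisoner's dilemma payoffs (row player, column player); illustrative
# values in the standard convention, not the paper's exact numbers.
PAYOFFS = {
    ("C", "C"): (-1, -1),
    ("C", "D"): (-3, 0),
    ("D", "C"): (0, -3),
    ("D", "D"): (-2, -2),
}

def play_ipd(policy_a, policy_b, n_rounds=10):
    """Roll out one iterated game. Each policy maps the opponent's
    previous move (None on the first round) to 'C' or 'D'."""
    prev_a = prev_b = None
    total_a = total_b = 0
    for _ in range(n_rounds):
        a, b = policy_a(prev_b), policy_b(prev_a)
        ra, rb = PAYOFFS[(a, b)]
        total_a += ra
        total_b += rb
        prev_a, prev_b = a, b
    return total_a, total_b

tit_for_tat = lambda prev: "C" if prev in (None, "C") else "D"
always_defect = lambda prev: "D"

print(play_ipd(tit_for_tat, tit_for_tat, 10))    # mutual cooperation: (-10, -10)
print(play_ipd(tit_for_tat, always_defect, 10))  # (-21, -18): defection pays once, then both lose
```

Defection strictly dominates in a single round, yet mutual cooperation yields the higher long-run return, which is the tension the paper's learning-aware agents are shown to resolve.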