Expected Return Symmetries
Authors: Darius Muglich, Johannes Forkel, Elise van der Pol, Jakob Foerster
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method in four different environments, focusing on how ER symmetries impact zero-shot coordination (ZSC) compared to self-play and other-play with Dec-POMDP symmetries. Specifically, we train independent agent populations that take advantage of ER symmetries and compare their cross-play performance within the population to baseline populations. The goal is to assess whether the use of ER symmetries leads to better coordination between agents than self-play or Dec-POMDP-symmetry-based training. |
| Researcher Affiliation | Collaboration | Darius Muglich University of Oxford EMAIL Johannes Forkel University of Oxford EMAIL Elise van der Pol Microsoft Research AI for Science EMAIL Jakob Foerster University of Oxford EMAIL |
| Pseudocode | Yes | See Algorithm 1 in Appendix E for details. ... This is detailed in Algorithm 2 in Appendix E. Note that in the term E_{o∈O} d(o, ϕ_θ²(o))² we abuse notation and let ϕ_θ map into a continuous extension of O; otherwise this term would be locally constant, with a gradient of zero almost everywhere. We also propose an alternative objective for learning ER symmetries through XP maximization: ϕ_θ* s.t. θ* = arg sup_{θ∈Θ} XP(π_i, ϕ_θ(π_j)) (Eq. 11), where π_i, π_j ∈ Π* are a pair of SP-optimal policies chosen from the fixed training pool. If π_i and π_j belong to the same equivalence class induced by Φ_ER, then by definition there exists an ER symmetry ϕ that maximizes Equation 11 to the self-play optimum value J(π_i). Therefore, for each pair of optimal policies π_i, π_j ∈ Π*, we optimize Equation 11 over ϕ_θ and save the ϕ_θ that attains the highest value of Equation 11. We outline this approach in Algorithm 3 of Appendix E. |
| Open Source Code | Yes | Code for Cat/Dog: https://colab.research.google.com/drive/1enEW6cjnzTbM9sTtHlD2-vowYh9NDlhc?usp=sharing Code for Iterated Three-Lever Game: https://colab.research.google.com/drive/1T9LpOkLDBl9BBkjzUelXKND6U8dOvfXV Code for Hanabi/Overcooked V2: https://github.com/gfppoy/expected-return-symmetries/tree/main |
| Open Datasets | Yes | Overcooked V2 is a recent AI benchmark for ZSC (Gessler et al., 2025), which improves on the cooperative multi-agent benchmark Overcooked (Carroll et al., 2019), by introducing asymmetric information and increased stochasticity, creating more nuanced coordination challenges. ... Hanabi (see Appendix G for details) is a challenging AI benchmark, and has served as the primary test bed for many algorithms designed for zero-shot coordination, ad-hoc teamplay, and other cooperative tasks (Bard et al., 2020; Cui et al., 2021; Nekoei et al., 2021; 2023; Muglich et al., 2022a;b). |
| Dataset Splits | No | The paper evaluates performance in simulated environments (Iterated Three-Lever Game, Cat/Dog Environment, Overcooked V2, Hanabi) where agents learn policies through interaction. The concept of explicit training/test/validation dataset splits as typically used in supervised learning is not directly applicable to these RL environments; instead, agents are trained and then evaluated through cross-play in these environments. The paper does not provide specific dataset split information. |
| Hardware Specification | Yes | Methods for Hanabi and Overcooked V2 were run on A40 and L40 GPUs. |
| Software Dependencies | No | The experiments in Sections 4.3 and 4.4 use the Jax MARL environment and implementations (Rutherford et al., 2023). This mentions a software component but does not specify a version number. Other mentioned methods like IQL, Q-learning, IPPO, PPO are algorithms or general frameworks, not specific software with version numbers. |
| Experiment Setup | Yes | For expected return symmetry discovery in the three-lever game, each ERS agent trains 20 self-play optimal policies using IQL over 10000 episodes, with an ϵ-greedy behaviour policy (ϵ = 0.1) and a learning rate of 0.1. ... For both Hanabi and Overcooked V2, we use PPO and Generalized Advantage Estimation. For Hanabi, we use 4 epochs, 1024 environments per pretrained policy, 128 environment steps per update, 4 minibatches, γ = 0.99, GAE λ = 0.95, CLIP_EPS = 0.2, VF_COEFF = 0.5, MAX_GRAD_NORM = 0.5, a learning rate of 1e-5, and a linear learning-rate annealing schedule. For Overcooked V2, we use 4 epochs, 256 environments, 256 environment steps per update, 64 minibatches, γ = 0.99, GAE λ = 0.95, CLIP_EPS = 0.2, VF_COEFF = 0.5, MAX_GRAD_NORM = 0.25, and a learning rate of 1e-5 with no annealing. |
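The XP-maximization objective quoted in the Pseudocode cell (Equation 11) picks, from a candidate symmetry class, the map ϕ that maximizes cross-play between two self-play-optimal policies. The idea can be sketched in a toy one-shot three-lever game with an identity payoff matrix, where the candidate symmetries are action permutations; this is an illustrative sketch under those assumptions, not the authors' implementation (which optimizes a parameterized ϕ_θ by gradient ascent).

```python
import itertools
import numpy as np

# Toy cooperative payoff: both agents get 1 iff they pick the same lever.
R = np.eye(3)

def xp(pi_i, pi_j):
    """Expected cross-play return of the joint policy (pi_i, pi_j)."""
    return pi_i @ R @ pi_j

# Two self-play-optimal policies that each commit to a different lever,
# so naive cross-play between them fails (XP = 0).
pi_i = np.array([1.0, 0.0, 0.0])
pi_j = np.array([0.0, 1.0, 0.0])

# Analogue of Equation 11: search the candidate symmetry class (here,
# all action permutations) for the map maximizing XP(pi_i, phi(pi_j)).
best_phi, best_val = None, -np.inf
for perm in itertools.permutations(range(3)):
    phi_pi_j = pi_j[list(perm)]  # apply the permutation to pi_j
    val = xp(pi_i, phi_pi_j)
    if val > best_val:
        best_phi, best_val = perm, val

print(best_phi, best_val)  # → (1, 0, 2) 1.0
```

The recovered map relabels lever 1 to lever 0, lifting cross-play from 0 to the self-play optimum of 1, mirroring how an ER symmetry aligns policies from the same equivalence class.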
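The PPO/GAE settings reported in the Experiment Setup cell can be transcribed into a config for reproduction. The dict keys below are illustrative names, not the authors' actual configuration schema; only the values come from the paper's reported hyperparameters.

```python
# Hanabi PPO/GAE hyperparameters as reported in the paper.
hanabi_ppo_config = {
    "num_epochs": 4,
    "num_envs": 1024,       # environments per pretrained policy
    "num_steps": 128,       # environment steps per update
    "num_minibatches": 4,
    "gamma": 0.99,
    "gae_lambda": 0.95,
    "clip_eps": 0.2,
    "vf_coeff": 0.5,
    "max_grad_norm": 0.5,
    "lr": 1e-5,
    "anneal_lr": "linear",  # linear learning-rate annealing schedule
}

# Overcooked V2 differs only in the values below.
overcooked_v2_ppo_config = {
    **hanabi_ppo_config,
    "num_envs": 256,
    "num_steps": 256,
    "num_minibatches": 64,
    "max_grad_norm": 0.25,
    "anneal_lr": None,      # no annealing for Overcooked V2
}
```

Note the per-update batch sizes differ substantially: 1024 × 128 transitions for Hanabi versus 256 × 256 for Overcooked V2, split into 4 and 64 minibatches respectively.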