Convex Markov Games: A New Frontier for Multi-Agent Reinforcement Learning
Authors: Ian Gemp, Andreas Alexander Haupt, Luke Marris, Siqi Liu, Georgios Piliouras
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experiments: We test a variety of nonlinear utilities in several domains. In the first set of (creativity-based) domains, we compare against four baseline algorithms, and discuss and contrast the resulting exploitability and policy profiles. The resulting experiments demonstrate other distinct use cases of cMGs. |
| Researcher Affiliation | Collaboration | ¹Google DeepMind, London, UK; ²College of Computing, MIT, Cambridge, MA, USA. Correspondence to: Ian Gemp <EMAIL>. |
| Pseudocode | Yes | We refer to this approach as projected-gradient loss minimization (PGL), with pseudocode in Algorithm 1. Algorithm 1 (Projected-Gradient Loss Minimization, PGL): 1: Given: initial profile π, temperature schedule τ_t; 2: for t = 0, …, T do; 3: π ← Opt(π, ∇_π L_{τ_t}); 4: end for; 5: Output: π. |
| Open Source Code | No | The paper mentions "Pathfinding: We use OpenSpiel's (Lanctot et al., 2019) pathfinding game with a horizon of 1000: https://openspiel.readthedocs.io/en/latest/games.html." This refers to a third-party framework used, not the authors' own code for their methodology. There is no explicit statement about releasing the code for the work described in this paper, nor a direct link to a repository for their implementation. |
| Open Datasets | Yes | The reference human policies µ_i^ref were derived from experiments where subjects played the iterated prisoner's dilemma, selecting cooperate (C) or defect (D) in each period (Romero & Rosokha, 2023, Table 1, Current, Direct Response). |
| Dataset Splits | No | The paper describes using various game environments and human policies from a cited work. However, it does not specify any training/test/validation splits for any of these. For instance, the 'human policies' are referenced from Romero & Rosokha (2023) but no splits are mentioned for them. The iterated normal-form games and grid worlds are simulated environments where data generation is part of the experimental process, not typically subject to train/test splits in the traditional machine learning sense. |
| Hardware Specification | No | All experiments except pathfinding were run on a single CPU and take about a minute to solve, although exact exploitability reporting via CVXOPT (Diamond & Boyd, 2016) increases runtime approximately 10×; pathfinding used 1 GPU. This statement mentions the use of a 'single CPU' and '1 GPU' but lacks specific details such as model numbers, processor types, or memory specifications, which are necessary for hardware reproducibility. |
| Software Dependencies | No | We minimize L_{τ_t}(π) with Adam; its internal state is not reset after annealing. ... directly minimizing exploitability using a differentiable convex optimization package CVXPYLAYERS in JAX (Agrawal et al., 2019; Bradbury et al., 2018). ... We also compare against the SGAMESOLVER (Eibelshäuser & Poensgen, 2023) package of homotopy methods for Markov games. ... All experiments were run with Adam (Kingma & Ba, 2015). The text mentions several software packages and frameworks (Adam, CVXPYLAYERS, JAX, SGAMESOLVER, CVXOPT, OpenSpiel) along with citations, but none of these mentions include specific version numbers, which is crucial for reproducible software dependencies. |
| Experiment Setup | Yes | Hyperparameters. We minimize Lτt(π) with Adam; its internal state is not reset after annealing. Three types of annealing schedules τt are used for entropy regularization (Appendix E). Each policy πi is initialized to uniform unless otherwise specified. All experiments except pathfinding were run on a single CPU... All algorithms performed best with a learning rate of 0.1. ... We set γ = 0.99 in all domains. Iterated NFGs use the last joint action selected by each player as state, i.e., S = A. ... Table 5 lists out the three types of annealing schedules used in the experiments. Table 6 lists out the learning rate used in each domain as well as the type of annealing schedule used. All experiments were run with Adam (Kingma & Ba, 2015). |
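To make the annealed loop quoted in the Pseudocode and Experiment Setup rows concrete, here is a minimal NumPy sketch of a PGL-style procedure. The names `pgl` and `loss_grad` are illustrative, plain gradient descent on softmax logits stands in for the Adam optimizer the paper uses (whose internal state is carried across annealing stages), and the entropy-regularized linear loss in the usage note below is only a stand-in for the paper's loss L_τ, not the authors' actual objective.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax onto the probability simplex."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def pgl(loss_grad, n_actions, taus, lr=0.1, steps_per_tau=300):
    """Annealed loss minimization in the spirit of Algorithm 1 (hypothetical sketch).

    loss_grad(pi, tau) returns the gradient of the temperature-tau loss
    with respect to the policy pi. Parametrizing pi by logits keeps the
    iterate on the simplex (a softmax reparametrization in place of an
    explicit projection); the paper's optimizer is Adam, not plain GD.
    """
    logits = np.zeros(n_actions)  # uniform initial policy, as in the paper
    for tau in taus:              # annealing schedule tau_t
        for _ in range(steps_per_tau):
            pi = softmax(logits)
            g = loss_grad(pi, tau)
            # chain rule through softmax: J_softmax^T @ g
            J = np.diag(pi) - np.outer(pi, pi)
            logits -= lr * (J @ g)
    return softmax(logits)
```

For example, with a stand-in loss L_τ(π) = c·π − τH(π) (gradient c + τ(log π + 1)), annealing τ toward zero drives π toward the action with the smallest cost c_i, mirroring how the quoted schedules τ_t taper entropy regularization. The learning rate 0.1 matches the value reported in the Experiment Setup row.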