Learning Global Nash Equilibrium in Team Competitive Games with Generalized Fictitious Cross-Play
Authors: Zelai Xu, Chao Yu, Yancheng Liang, Yi Wu, Yu Wang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate GFXP in matrix games and gridworld domains where GFXP achieves the lowest exploitabilities. We further conduct experiments in a challenging football game where GFXP defeats SOTA models with over 94% win rate. |
| Researcher Affiliation | Academia | Zelai Xu, Chao Yu, Yu Wang: Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China; Yi Wu: Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, 100084, China; Yancheng Liang: School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA |
| Pseudocode | Yes | Algorithm 1: Self-Play (SP); Algorithm 2: Policy-Space Response Oracles (PSRO); Algorithm 3: Fictitious Cross-Play (FXP) |
| Open Source Code | No | The paper does not state that the authors' code for GFXP is publicly available, nor does it provide a link to a code repository. It mentions that Tikick's model is released and that PSRO w. BD&RD never released its code or model, but says nothing about releasing the code for the present work. |
| Open Datasets | Yes | Then we use MAPPO (Yu et al., 2021) as an approximate BR oracle and consider a gridworld environment MAgent Battle (Zheng et al., 2018). Finally, with large-scale training, we use GFXP to solve the challenging 11-vs-11 multi-agent full game in the Google Research Football (GRF) (Kurach et al., 2020) environment. |
| Dataset Splits | No | The paper uses simulation environments like MAgent Battle and Google Research Football. While it describes scenarios and game configurations (e.g., 3-vs-3 battle, 11-vs-11 full-game task), it does not provide specific dataset splits (e.g., training/test/validation percentages or counts) for a pre-collected dataset, as data is generated dynamically through simulation. |
| Hardware Specification | Yes | Each algorithm is trained on a 128-core CPU server for 30k steps. All algorithms use a recurrent policy and are trained on a single 4090 GPU for 100M environment frames. |
| Software Dependencies | No | The paper mentions software components like 'SGD optimizer', 'policy gradient', 'MAPPO', 'Adam', and 'PPO' but does not specify their version numbers, which would be needed to reproduce the software environment. |
| Experiment Setup | Yes | For SP and FSP, we simply train the single agent for 30k steps. For PSRO with and without reset, we run 30 iterations and the BR policy in each iteration is trained for 1k steps. For GFXP, we run 15 iterations and the main policy and counter policy in each iteration are both trained for 1k steps. The self-play probability η is set to 0.2 and decays exponentially to 0 with a factor of 0.97. All training hyperparameters for different algorithms and BR learning are the same and listed in Table 4. All training hyperparameters for GFXP in GRF are listed in Table 8. |
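The experiment-setup row quotes a self-play probability η = 0.2 that decays exponentially to 0 with a factor of 0.97. A minimal sketch of that schedule is below; it assumes the decay is applied once per iteration (the paper excerpt does not state the decay unit), and the function name `self_play_prob` is hypothetical.

```python
def self_play_prob(iteration: int, eta0: float = 0.2, decay: float = 0.97) -> float:
    """Exponentially decayed self-play probability.

    eta0 = 0.2 and decay = 0.97 follow the quoted setup; applying the
    decay per iteration is an assumption, not stated in the excerpt.
    """
    return eta0 * decay ** iteration

# The probability shrinks geometrically toward 0 over iterations.
schedule = [self_play_prob(i) for i in range(15)]
```

With these values, η starts at 0.2 and falls to roughly 0.13 by iteration 15, continuing toward 0 as training proceeds.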