Learning Global Nash Equilibrium in Team Competitive Games with Generalized Fictitious Cross-Play

Authors: Zelai Xu, Chao Yu, Yancheng Liang, Yi Wu, Yu Wang

JMLR 2025

Reproducibility assessment (Variable: Result — LLM response):
Research Type: Experimental — "We evaluate GFXP in matrix games and gridworld domains where GFXP achieves the lowest exploitabilities. We further conduct experiments in a challenging football game where GFXP defeats SOTA models with over 94% win rate."
Researcher Affiliation: Academia — Zelai Xu, Chao Yu, and Yu Wang (Department of Electronic Engineering, Tsinghua University, Beijing 100084, China); Yancheng Liang (School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA); Yi Wu (Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China).
Pseudocode: Yes — Algorithm 1: Self-Play (SP); Algorithm 2: Policy-Space Response Oracles (PSRO); Algorithm 3: Fictitious Cross-Play (FXP).
Open Source Code: No — The paper does not state that the authors' code for GFXP is publicly available, nor does it link to a code repository. It mentions that Tikick's model is released and that PSRO w. BD&RD never released code or models, but says nothing about artifacts for the present work.
Open Datasets: Yes — "Then we use MAPPO (Yu et al., 2021) as an approximate BR oracle and consider a gridworld environment MAgent Battle (Zheng et al., 2018). Finally, with large-scale training, we use GFXP to solve the challenging 11-vs-11 multi-agent full game in the Google Research Football (GRF) (Kurach et al., 2020) environment."
Dataset Splits: No — The paper uses simulation environments such as MAgent Battle and Google Research Football. While it describes scenarios and game configurations (e.g., the 3-vs-3 battle and the 11-vs-11 full-game task), it does not provide dataset splits (e.g., training/validation/test percentages or counts) for a pre-collected dataset, since data is generated dynamically through simulation.
Hardware Specification: Yes — "Each algorithm is trained on a 128-core CPU server for 30k steps." "All algorithms use a recurrent policy and are trained on a single 4090 GPU for 100M environment frames."
Software Dependencies: No — The paper mentions software components such as an SGD optimizer, policy gradient, MAPPO, Adam, and PPO, but does not specify their version numbers, which are necessary for reproducible software dependencies.
Experiment Setup: Yes — "For SP and FSP, we simply train the single agent for 30k steps. For PSRO with and without reset, we run 30 iterations and the BR policy in each iteration is trained for 1k steps. For GFXP, we run 15 iterations and the main policy and counter policy in each iteration are both trained for 1k steps. The self-play probability η is set to 0.2 and decays exponentially to 0 with a factor of 0.97. All training hyperparameters for different algorithms and BR learning are the same and listed in Table 4. All training hyperparameters for GFXP in GRF are listed in Table 8."
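The reported schedule is easy to check for compute parity across methods. Below is a minimal sketch (hypothetical function and variable names, not the authors' code) of the GFXP schedule described above: 15 iterations, 1k training steps each for the main policy and the counter policy, with the self-play probability η starting at 0.2 and decaying exponentially by a factor of 0.97 per iteration.

```python
# Hypothetical illustration of the GFXP training schedule described in the
# paper's setup; names like `gfxp_schedule` are assumptions, not the
# authors' API.

def gfxp_schedule(iterations=15, steps_per_policy=1_000,
                  eta_init=0.2, eta_decay=0.97):
    """Yield (iteration, eta, cumulative_steps) for each GFXP iteration."""
    eta = eta_init
    total_steps = 0
    for it in range(iterations):
        # Each iteration trains both the main policy and the counter policy.
        total_steps += 2 * steps_per_policy
        yield it, eta, total_steps
        eta *= eta_decay  # exponential decay of the self-play probability

schedule = list(gfxp_schedule())
```

Note that the cumulative step count after 15 iterations is 30k, matching the budgets quoted for SP (30k steps) and PSRO (30 iterations × 1k steps), consistent with the statement that all algorithms share the same training hyperparameters.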