Solving Zero-Sum Convex Markov Games

Authors: Fivos Kalogiannis, Emmanouil-Vasileios Vlatakis-Gkaragkounis, Ian Gemp, Georgios Piliouras

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Numerical Results. We demonstrate Algorithms 1 and 2 on an iterated version of rock-paper-scissors-dummy in which each player remembers the actions selected in the previous round; hence, the previous joint action constitutes the state of the Markov game. The dummy action is dominated by every other action, so the Nash equilibrium of the stage game is uniform over rock, paper, and scissors with zero mass on dummy. We set the step-sizes τx = τy = 0.1 and vary the regularization coefficient µ to demonstrate its effect in biasing convergence. Figure 1: Exploitability decays towards a small but positive value corresponding to the bias introduced by the regularization coefficient µ. Results are averaged over 100 trials, each running the algorithm from a different randomly initialized policy profile. The leftmost plot reports results for Algorithm 2; the right two report results for Algorithm 1 with Tin = 10 and 100, respectively.
Researcher Affiliation | Collaboration | 1 Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA; 2 Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA, and Archimedes, Athena Research Center, Greece; 3 Google DeepMind, London, UK. Correspondence to: Fivos Kalogiannis <EMAIL>.
Pseudocode | Yes | Algorithm 1 Nest-PG: Nested Policy Gradient
input: (x0, y0), step-sizes τx, τy, regularization coefficient µ > 0
for t = 1 to Tout do
    yt,0 ← yt−1
    for s = 1 to Tin do
        yt,s ← ΠY[yt,s−1 + τy ∇̂y Uµ(xt−1, yt,s−1)]
    end for
    yt ← yt,Tin
    xt ← ΠX[xt−1 − τx ∇̂x U(xt−1, yt)]
end for
pick t⋆ ∈ {1, . . . , T} as the best iterate
output: (xt⋆, yt⋆+1)
Open Source Code | No | The paper does not explicitly state that the code is open-sourced, nor does it link to a code repository in the main text or the supplementary sections mentioned.
Open Datasets | No | We demonstrate Algorithms 1 and 2 on an iterated version of rock-paper-scissors-dummy where each player remembers the actions selected in the previous round. Hence, the previous joint action constitutes the state in the Markov game. The dummy action is dominated by all other actions such that the Nash equilibrium of the stage game is uniform across rock, paper, and scissors with zero mass on dummy.
Dataset Splits | No | The paper's experiments use an iterated version of rock-paper-scissors-dummy, a simulated game environment; no external datasets requiring training/validation/test splits are involved.
Hardware Specification | No | The paper does not report the hardware used to run the numerical experiments.
Software Dependencies | No | The paper does not specify software dependencies or version numbers needed for reproducibility.
Experiment Setup | Yes | We set the step-sizes τx = τy = 0.1 and vary the regularization coefficient µ to demonstrate its effect in biasing convergence.
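As a sanity check on the environment described in the evidence above, the rock-paper-scissors-dummy stage game can be reconstructed and its claimed equilibrium verified. The payoff matrix below is an assumed instantiation (the paper does not print it), chosen so that dummy is dominated and the game is zero-sum:

```python
import numpy as np

# Assumed payoff matrix for rock-paper-scissors-dummy. Rows index the
# x-player's action, columns the y-player's action, entries are the
# y-player's payoff (zero-sum, so the matrix is antisymmetric).
# Actions: 0 = rock, 1 = paper, 2 = scissors, 3 = dummy.
A = np.array([
    [ 0,  1, -1, -1],
    [-1,  0,  1, -1],
    [ 1, -1,  0, -1],
    [ 1,  1,  1,  0],
], dtype=float)

def exploitability(x, y):
    """Sum of both players' best-response gaps at the profile (x, y)."""
    return (A.T @ x).max() - (A @ y).min()

# Dummy (column 3) is dominated for the y-player: rock (column 0) does at
# least as well against every x-action and strictly better against some.
assert np.all(A[:, 0] >= A[:, 3]) and np.any(A[:, 0] > A[:, 3])

# Uniform play over rock/paper/scissors with zero mass on dummy is a Nash
# equilibrium of the stage game: its exploitability is exactly zero.
uniform = np.array([1/3, 1/3, 1/3, 0.0])
assert abs(exploitability(uniform, uniform)) < 1e-12
```

Since the paper runs the *iterated* version (the state is the previous joint action), this only checks the stage-game claims quoted in the table, not the Markov game itself.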
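The nested structure of Algorithm 1 can likewise be sketched on the stage game. Everything below is a hypothetical instantiation consistent with the pseudocode, not the paper's implementation: the payoff matrix, the regularized utility U^µ(x, y) = xᵀAy − (µ/2)‖y‖², and the hyperparameters are assumptions, and the paper runs the iterated Markov version rather than this one-shot game.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    return np.maximum(v - (css[rho] - 1) / (rho + 1.0), 0.0)

# Assumed y-player payoff matrix for rock-paper-scissors-dummy
# (rows = x's action, columns = y's action, dummy = index 3).
A = np.array([
    [ 0,  1, -1, -1],
    [-1,  0,  1, -1],
    [ 1, -1,  0, -1],
    [ 1,  1,  1,  0],
], dtype=float)

def nest_pg(tau_x=0.1, tau_y=0.1, mu=0.05, T_out=500, T_in=50, seed=0):
    """Nest-PG sketch: the inner loop runs projected gradient ascent on the
    mu-regularized utility U_mu(x, y) = x^T A y - (mu/2)||y||^2, the outer
    loop runs projected gradient descent on U(x, y) = x^T A y in x."""
    rng = np.random.default_rng(seed)
    x = project_simplex(rng.random(4))
    y = project_simplex(rng.random(4))
    best = np.inf
    for _ in range(T_out):
        for _ in range(T_in):  # inner loop: approximate regularized best response
            y = project_simplex(y + tau_y * (A.T @ x - mu * y))
        x = project_simplex(x - tau_x * (A @ y))  # outer descent step for x
        gap = (A.T @ x).max() - (A @ y).min()     # exploitability of (x, y)
        best = min(best, gap)                     # track the best iterate
    return x, y, best

x, y, best = nest_pg()
```

Consistent with Figure 1 as described, the regularization biases the solution, so the tracked exploitability plateaus at a small positive value rather than reaching zero, while both players drive the dominated dummy action toward zero mass.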