Global Convergence of Multi-Agent Policy Gradient in Markov Potential Games
Authors: Stefanos Leonardos, Will Overman, Ioannis Panageas, Georgios Piliouras
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS: CONGESTION GAMES; Results. The left panel of Figure 5 shows that the agents learn the expected Nash profile in both states in all runs.; We implemented this environment with N = 4 agents... We used our implementation of the independent policy gradient algorithm with the same parameters as in our experiment from Section 5, specifically we have T = 20, γ = 0.99, and η = 0.0001. The results are shown in Figure 10. |
| Researcher Affiliation | Academia | Stefanos Leonardos, Singapore University of Technology and Design; William Overman, University of California, Irvine; Ioannis Panageas, University of California, Irvine; Georgios Piliouras, Singapore University of Technology and Design |
| Pseudocode | No | The PGA algorithm is given by π_i^(t+1) := P_{Δ(A_i)^S}[π_i^(t) + η ∇_{π_i} V_i^ρ(π^(t))], (PGA); the stochastic variant is given by π_i^(t+1) := P_{Δ(A_i)^S}[π_i^(t) + η ∇̂_{π_i}^(t)]. (PSGA) |
| Open Source Code | Yes | We also uploaded the code that was used to run the experiments (policy gradient algorithm) as supplementary material. |
| Open Datasets | No | We consider an experiment (Figure 4) with N = 8 agents, Ai = 4 facilities (resources or locations) that the agents can select from and S = 2 states: a safe state and a distancing state. |
| Dataset Splits | No | No information about training/validation/test dataset splits is provided, as the paper conducts experiments in a simulated environment rather than on a static dataset. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instances) are mentioned for the experiments. |
| Software Dependencies | No | No specific software dependencies with version numbers are provided in the paper. |
| Experiment Setup | Yes | We perform episodic updates with T = 20 steps. At each iteration, we estimate the policy gradients using the average of mini-batches of size 20. We use γ = 0.99 and a common learning rate η = 0.0001 (larger than the theoretical guarantee, η = (1−γ)³/(2γA_max n) ≈ 1e−08, of Theorem 4.2). |
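The PGA update quoted above (gradient ascent on the value, followed by projection back onto each state's action simplex) can be sketched as follows. This is a minimal illustration, not the authors' released code: the gradient estimate (mini-batches of 20 sampled trajectories per the paper) is assumed to be supplied by the caller, and `project_simplex`/`pga_step` are hypothetical helper names; the sort-based Euclidean simplex projection is a standard construction.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of vector v onto the probability simplex,
    using the standard sort-and-threshold method."""
    u = np.sort(v)[::-1]                       # sort entries in descending order
    css = np.cumsum(u)
    # largest index (0-based) where u_k * (k+1) > css_k - 1
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)     # shared shift that renormalizes
    return np.maximum(v - theta, 0.0)

def pga_step(policy, grad, eta=1e-4):
    """One projected policy-gradient step.
    policy: (num_states, num_actions) array, one distribution per state.
    grad:   estimated policy gradient of the same shape.
    Each agent's per-state distribution is updated by gradient ascent and
    projected back onto the simplex, matching the PGA update rule."""
    updated = policy + eta * grad
    return np.apply_along_axis(project_simplex, 1, updated)
```

With the paper's hyperparameters one would call `pga_step(policy, grad_estimate, eta=1e-4)` once per iteration, with `grad_estimate` averaged over a mini-batch of 20 episodes of length T = 20 discounted at γ = 0.99.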