Pareto Actor-Critic for Equilibrium Selection in Multi-Agent Reinforcement Learning
Authors: Filippos Christianos, Georgios Papoudakis, Stefano V Albrecht
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Pareto-AC in a diverse set of multi-agent games and show that it converges to higher episodic returns compared to seven state-of-the-art MARL algorithms and that it successfully converges to a Pareto-optimal equilibrium in a range of matrix games. |
| Researcher Affiliation | Academia | Filippos Christianos EMAIL, University of Edinburgh; Georgios Papoudakis EMAIL, University of Edinburgh; Stefano V. Albrecht EMAIL, University of Edinburgh |
| Pseudocode | Yes | The pseudocode of Pareto-AC is presented in Algorithm 1. |
| Open Source Code | Yes | Implementation code for Pareto-AC can be found at https://github.com/uoe-agents/epymarl. |
| Open Datasets | Yes | Matrix Games: Three common-reward multi-agent matrix games proposed by Claus & Boutilier (1998): the Climbing game with two and three agents and the Penalty game. Boulder Push: In the Boulder Push game (illustrated in Figure 8a), two agents and a boulder are situated within an 8×8 grid-world. Level-Based Foraging (LBF): In this game, one food item is placed in a 5×5 grid world (Christianos et al., 2020; Papoudakis et al., 2021), as depicted in Figure 8b. To showcase that PACDCG can be used even in tasks with many agents, where Pareto-AC cannot, we also evaluate in two StarCraft Multi-Agent Challenge (SMAC) tasks. |
| Dataset Splits | No | The paper conducts experiments in reinforcement learning environments where data is generated through agent-environment interaction rather than from pre-defined static datasets. Therefore, traditional dataset splits (e.g., train/test/validation percentages or counts) are not applicable or specified. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | Pareto-AC and PACDCG were implemented based on the EPyMARL codebase (Papoudakis et al., 2021). The implementation of PACDCG's critic was based on the official implementation of DCG (Böhmer et al., 2020). The parameters of all networks are optimised using the Adam optimiser (Kingma & Ba, 2015). However, no specific version numbers for the programming language, machine learning frameworks, or any other ancillary software dependencies are provided. |
| Experiment Setup | Yes | Throughout the hyperparameter search, we systematically examined multiple configurations for the training process for both the baseline algorithms and Pareto-AC. Our approach ensured fairness by maintaining a roughly equal number of search configurations for all algorithms under consideration. This included testing hidden dimensions of 64 and 128, learning rates of 0.0003 and 0.0005, considering both Fully Connected (FC) and GRU network architectures, and experimenting with initial entropy coefficients of 0.1, 0.8, 4, and 20, as well as final entropy coefficients of 0.001, 0.01, and 0.02 (entropy only applies to PG algorithms). Tables 3, 4, and 5 provide detailed hyperparameters for each algorithm and environment. |
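The Open Datasets row cites the Climbing game of Claus & Boutilier (1998) as one of the matrix games in which Pareto-AC converges to a Pareto-optimal equilibrium. A minimal sketch of the standard two-agent, common-reward payoff matrix (the exact reward scaling used in the paper is an assumption here):

```python
import numpy as np

# Standard Climbing game payoff matrix (Claus & Boutilier, 1998).
# Rows = agent 1's action, columns = agent 2's action; both agents
# receive the same (common) reward. Values follow the commonly cited
# formulation; the paper's exact scaling is not reproduced in this table.
CLIMBING = np.array([
    [ 11, -30,  0],
    [-30,   7,  6],
    [  0,   0,  5],
])

def joint_return(a1: int, a2: int) -> int:
    """Common reward for the joint action (a1, a2)."""
    return int(CLIMBING[a1, a2])

# Joint action (0, 0) is the Pareto-optimal equilibrium (return 11),
# while (1, 1) is a suboptimal equilibrium (return 7) that risk-averse
# learners often converge to -- the equilibrium-selection problem
# Pareto-AC targets.
```

The large miscoordination penalties (-30) around the optimal joint action are what make this game a standard stress test for equilibrium selection.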
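The hyperparameter search described in the Experiment Setup row can be sketched as a simple grid enumeration. This is an illustrative reconstruction of the stated search space; the dictionary keys are hypothetical, not the authors' config names, and the paper notes entropy coefficients apply only to policy-gradient algorithms:

```python
from itertools import product

# Search space as reported in the Experiment Setup row (names are illustrative).
hidden_dims = [64, 128]
learning_rates = [0.0003, 0.0005]
architectures = ["FC", "GRU"]
entropy_init = [0.1, 0.8, 4, 20]      # initial entropy coefficient (PG algorithms only)
entropy_final = [0.001, 0.01, 0.02]   # final entropy coefficient (PG algorithms only)

# Full cross-product for a policy-gradient algorithm.
configs = [
    {"hidden_dim": h, "lr": lr, "arch": arch, "ent_init": ei, "ent_final": ef}
    for h, lr, arch, ei, ef in product(
        hidden_dims, learning_rates, architectures, entropy_init, entropy_final
    )
]
print(len(configs))  # 2 * 2 * 2 * 4 * 3 = 96 candidate configurations
```

The paper does not state whether the full cross-product was run for every algorithm, only that the number of configurations was kept roughly equal across algorithms; this sketch merely enumerates the reported value ranges.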