Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
Authors: Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson
JMLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the performance of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a challenging set of SMAC scenarios and show that it significantly outperforms existing multi-agent reinforcement learning methods. |
| Researcher Affiliation | Collaboration | Tabish Rashid (University of Oxford), Mikayel Samvelyan (Russian-Armenian University), Christian Schroeder de Witt (University of Oxford), Gregory Farquhar (University of Oxford), Jakob Foerster (Facebook AI Research), Shimon Whiteson (University of Oxford) |
| Pseudocode | Yes | Algorithm 1 QMIX |
| Open Source Code | Yes | 1. Code is available at https://github.com/oxwhirl/smac. To further facilitate research in this field, we also open-source PyMARL, a learning framework that can serve as a starting point for other researchers and includes implementations of several key multi-agent RL algorithms. PyMARL is modular, extensible, built on PyTorch, and serves as a template for dealing with some of the unique challenges of deep multi-agent RL in practice. 5. PyMARL is available at https://github.com/oxwhirl/pymarl. |
| Open Datasets | No | To evaluate QMIX, as well as the growing number of other algorithms recently proposed for multi-agent RL (Foerster et al., 2018; Sunehag et al., 2017), we introduce the StarCraft Multi-Agent Challenge (SMAC). In single-agent RL, standard environments such as the Arcade Learning Environment (Bellemare et al., 2013) and MuJoCo (Plappert et al., 2018) have facilitated rapid progress. While some multi-agent testbeds have emerged... SMAC fills this gap. It is built on the popular real-time strategy game StarCraft II and makes use of the SC2LE environment (Vinyals et al., 2017). |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits. It describes sampling batches of episodes from a replay buffer for training in a reinforcement learning setup, rather than fixed splits from a pre-defined static dataset. |
| Hardware Specification | Yes | Each independent run takes between 8 and 16 hours, depending on the exact scenario, using NVIDIA GeForce GTX 1080 Ti graphics cards. |
| Software Dependencies | No | PyMARL is modular, extensible, built on PyTorch, and serves as a template for dealing with some of the unique challenges of deep multi-agent RL in practice. All neural networks are trained using RMSprop with learning rate 5 × 10⁻⁴. |
| Experiment Setup | Yes | The architecture of all agent networks is a DRQN with a recurrent layer comprised of a GRU with a 64-dimensional hidden state, with a fully-connected layer before and after. Exploration is performed during training using independent ϵ-greedy action selection, where each agent a performs ϵ-greedy action selection over its own Qa. Throughout training, we anneal ϵ linearly from 1.0 to 0.05 over 50k time steps and keep it constant for the rest of the learning. We set γ = 0.99 for all experiments. The replay buffer contains the most recent 5000 episodes. We sample batches of 32 episodes uniformly from the replay buffer, and train on fully unrolled episodes, performing a single gradient descent step after every episode. The target networks are updated after every 200 training episodes. The Double Q-Learning update rule from (Van Hasselt et al., 2016) is used for all Q-Learning variants (IQL, VDN, QMIX and QTRAN). To speed up the learning, we share the parameters of the agent networks across all agents. Because of this, a one-hot encoding of the agent id is concatenated onto each agent's observations. All neural networks are trained using RMSprop with learning rate 5 × 10⁻⁴. The mixing network consists of a single hidden layer of 32 units, utilising an ELU non-linearity. The hypernetworks consist of a feedforward network with a single hidden layer of 64 units with a ReLU non-linearity. |
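The experiment-setup cell describes QMIX's core structural constraint: the mixing network's weights are forced to be non-negative (in the paper, via an absolute value on hypernetwork outputs), which guarantees Q_tot is monotonic in each agent's Q-value. A minimal pure-Python sketch of that monotonic mixing step is below; the function name, the fixed (non-hypernetwork-generated) weights, and the two-agent shape are illustrative assumptions, not the paper's actual PyTorch implementation.

```python
import math

def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Toy QMIX-style mixing: one ELU hidden layer whose weights are forced
    non-negative via abs(), so Q_tot is monotonic in every agent's Q-value.
    agent_qs: per-agent Q-values; w1: hidden x n_agents weights; w2: output weights.
    (Illustrative sketch -- in QMIX, w1/b1/w2/b2 come from hypernetworks
    conditioned on the global state.)"""
    def elu(x):
        return x if x > 0 else math.exp(x) - 1.0
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2
```

Because every weight entering a sum is non-negative and ELU is monotonically increasing, raising any single agent's Q-value can never decrease the mixed Q_tot, which is exactly the monotonicity property the factorisation relies on.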
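The exploration schedule quoted above (ϵ annealed linearly from 1.0 to 0.05 over 50k steps, then held constant) can be sketched as a small helper; the function name and defaults are illustrative, not from the paper's code.

```python
def epsilon(t, eps_start=1.0, eps_end=0.05, anneal_steps=50_000):
    """Linear epsilon annealing as described in the experiment setup:
    decay from eps_start to eps_end over anneal_steps environment steps,
    then hold eps_end constant for the remainder of training."""
    frac = min(t / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Each agent would then take a random action with probability `epsilon(t)` and otherwise act greedily with respect to its own Qa.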