Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

Authors: Tabish Rashid, Mikayel Samvelyan, Christian Schroeder de Witt, Gregory Farquhar, Jakob Foerster, Shimon Whiteson

JMLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the performance of QMIX, we propose the StarCraft Multi-Agent Challenge (SMAC) as a new benchmark for deep multi-agent reinforcement learning. We evaluate QMIX on a challenging set of SMAC scenarios and show that it significantly outperforms existing multi-agent reinforcement learning methods.
Researcher Affiliation | Collaboration | Tabish Rashid (EMAIL), University of Oxford; Mikayel Samvelyan (EMAIL), Russian-Armenian University; Christian Schroeder de Witt (EMAIL), University of Oxford; Gregory Farquhar (EMAIL), University of Oxford; Jakob Foerster (EMAIL), Facebook AI Research; Shimon Whiteson (EMAIL), University of Oxford
Pseudocode | Yes | Algorithm 1: QMIX
Open Source Code | Yes | Code is available at https://github.com/oxwhirl/smac. To further facilitate research in this field, we also open-source PyMARL, a learning framework that can serve as a starting point for other researchers and includes implementations of several key multi-agent RL algorithms. PyMARL is modular, extensible, built on PyTorch, and serves as a template for dealing with some of the unique challenges of deep multi-agent RL in practice. PyMARL is available at https://github.com/oxwhirl/pymarl.
Open Datasets | No | To evaluate QMIX, as well as the growing number of other algorithms recently proposed for multi-agent RL (Foerster et al., 2018; Sunehag et al., 2017), we introduce the StarCraft Multi-Agent Challenge (SMAC). In single-agent RL, standard environments such as the Arcade Learning Environment (Bellemare et al., 2013) and MuJoCo (Plappert et al., 2018) have facilitated rapid progress. While some multi-agent testbeds have emerged... SMAC fills this gap. It is built on the popular real-time strategy game StarCraft II and makes use of the SC2LE environment (Vinyals et al., 2017).
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits. It describes sampling batches of episodes from a replay buffer for training in a reinforcement learning setup, rather than fixed splits from a pre-defined static dataset.
Hardware Specification | Yes | Each independent run takes between 8 and 16 hours, depending on the exact scenario, using Nvidia GeForce GTX 1080 Ti graphics cards.
Software Dependencies | No | PyMARL is modular, extensible, built on PyTorch, and serves as a template for dealing with some of the unique challenges of deep multi-agent RL in practice. All neural networks are trained using RMSprop with learning rate 5 × 10⁻⁴.
Experiment Setup | Yes | The architecture of all agent networks is a DRQN with a recurrent layer comprised of a GRU with a 64-dimensional hidden state, with a fully-connected layer before and after. Exploration is performed during training using independent ϵ-greedy action selection, where each agent a performs ϵ-greedy action selection over its own Qa. Throughout training, we anneal ϵ linearly from 1.0 to 0.05 over 50k time steps and keep it constant for the rest of learning. We set γ = 0.99 for all experiments. The replay buffer contains the most recent 5000 episodes. We sample batches of 32 episodes uniformly from the replay buffer, and train on fully unrolled episodes, performing a single gradient descent step after every episode. The target networks are updated after every 200 training episodes. The Double Q-Learning update rule from Van Hasselt et al. (2016) is used for all Q-Learning variants (IQL, VDN, QMIX and QTRAN). To speed up learning, we share the parameters of the agent networks across all agents. Because of this, a one-hot encoding of the agent id is concatenated onto each agent's observations. All neural networks are trained using RMSprop with learning rate 5 × 10⁻⁴. The mixing network consists of a single hidden layer of 32 units, utilising an ELU non-linearity. The hypernetworks consist of a feedforward network with a single hidden layer of 64 units with a ReLU non-linearity.
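The mixing-network description above can be illustrated with a minimal NumPy sketch. The layer sizes (32-unit mixing hidden layer with ELU, 64-unit hypernetwork hidden layer with ReLU) follow the quoted setup; the agent count, state dimension, and random parameter values are illustrative assumptions, not the paper's learned weights. The key mechanism is that the hypernetworks' outputs pass through an absolute value, which enforces QMIX's monotonicity constraint dQ_tot/dQ_a ≥ 0 for every agent a.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def relu(x):
    return np.maximum(x, 0)

N_AGENTS = 3      # illustrative; scenario-dependent in SMAC
STATE_DIM = 8     # illustrative global-state size
EMBED = 32        # mixing-network hidden layer of 32 units (per the paper)
HYPER_HIDDEN = 64 # hypernetwork hidden layer of 64 units (per the paper)

# Hypernetwork parameters, randomly initialised here for illustration.
W1_h = rng.normal(size=(STATE_DIM, HYPER_HIDDEN)) * 0.1
W1_o = rng.normal(size=(HYPER_HIDDEN, N_AGENTS * EMBED)) * 0.1
W2_h = rng.normal(size=(STATE_DIM, HYPER_HIDDEN)) * 0.1
W2_o = rng.normal(size=(HYPER_HIDDEN, EMBED)) * 0.1
b1_w = rng.normal(size=(STATE_DIM, EMBED)) * 0.1
b2_h = rng.normal(size=(STATE_DIM, EMBED)) * 0.1
b2_o = rng.normal(size=(EMBED,)) * 0.1

def qmix_mix(agent_qs, state):
    """Mix per-agent Qs into Q_tot with state-conditioned weights.

    Taking np.abs of the hypernetwork outputs makes every mixing
    weight non-negative, so Q_tot is monotone in each agent's Q.
    """
    w1 = np.abs(relu(state @ W1_h) @ W1_o).reshape(N_AGENTS, EMBED)
    b1 = state @ b1_w
    hidden = elu(agent_qs @ w1 + b1)        # (EMBED,), ELU mixing layer
    w2 = np.abs(relu(state @ W2_h) @ W2_o)  # (EMBED,), non-negative
    b2 = relu(state @ b2_h) @ b2_o          # scalar bias from the state
    return float(hidden @ w2 + b2)

state = rng.normal(size=STATE_DIM)
qs = np.array([1.0, -0.5, 2.0])
base = qmix_mix(qs, state)
# Raising any single agent's Q can never decrease Q_tot.
bumped = qmix_mix(qs + np.array([0.5, 0.0, 0.0]), state)
assert bumped >= base
```

Because the constraint is only on the sign of the weights, the state can still modulate how much each agent's utility contributes, which is what distinguishes QMIX from the plain sum used by VDN.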