Multi-Agent Advisor Q-Learning

Authors: Sriram Ganapathi Subramanian, Matthew E. Taylor, Kate Larson, Mark Crowley

JAIR 2022

Reproducibility assessment: for each variable, the result and a supporting excerpt from the paper.
Research Type: Experimental
"Furthermore, extensive experiments illustrate that these algorithms: can be used in a variety of environments, have performances that compare favourably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors."
Researcher Affiliation: Academia
Sriram Ganapathi Subramanian, University of Waterloo (200 University Ave W, Waterloo, ON N2L 3G1) and Vector Institute
Matthew E. Taylor, University of Alberta (116 Street and 85 Avenue, Edmonton, AB T6G 2R3) and Alberta Machine Intelligence Institute (Amii)
Kate Larson, University of Waterloo (200 University Ave W, Waterloo, ON N2L 3G1)
Mark Crowley, University of Waterloo (200 University Ave W, Waterloo, ON N2L 3G1)
Pseudocode: Yes
Algorithm 1: ADvising Multiple Intelligent Reinforcement Agents Decision Making (ADMIRAL-DM)
Algorithm 2: ADvising Multiple Intelligent Reinforcement Agents Advisor Evaluation (ADMIRAL-AE)
Algorithm 3: ADMIRAL-DM Neural Network Implementation
Algorithm 4: ADMIRAL-AE Neural Network Implementation
Algorithm 5: ADMIRAL-DM(AC) Neural Network Implementation
Open Source Code: Yes
"The source code for the experiments has been open sourced (Subramanian, 2022)."
Open Datasets: Yes
"We experimentally validate our algorithms, showing their effectiveness in a variety of situations using different testbeds. We also demonstrate superior performance to common baselines previously used in literature. The source code for the experiments has been open sourced (Subramanian, 2022)." The testbeds are the Pommerman environment (Resnick et al., 2018) and two cooperative domains from the Stanford Intelligent Systems Laboratory (SISL) (Gupta et al., 2017), implemented in the PettingZoo environment (Terry et al., 2020).
Dataset Splits: Yes
"We perform 50,000 episodes of training, where the algorithms train against specific opponents. Each episode is a full Pommerman game (lasting a maximum of 800 steps). All the algorithms relying on demonstrations (DQfD, CHAT, ADMIRAL-DM, and ADMIRAL-DM(AC)) use the Advisor 1 considered in Section 5.2. ... After the training phase, the trained algorithms enter a face-off competition of 10,000 games where there is no more training, no further exploration, and additionally ADMIRAL-DM and ADMIRAL-DM(AC) play without any advisor influence. ... The algorithms train for 1000 games in the training phase and then enter an execution phase, where they execute the trained policy for 100 games."
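The two-phase protocol quoted above (a training phase with exploration and advisor influence, then a face-off phase where the frozen policy plays without either) can be sketched as follows. The `env` and `agent` interfaces are hypothetical placeholders for illustration, not the paper's released code.

```python
# Sketch of the two-phase protocol quoted above. `env` and `agent` are
# hypothetical interfaces, not the paper's actual implementation.

def run_protocol(env, agent, train_episodes=50_000, eval_games=10_000,
                 max_steps=800):
    # Training phase: the algorithm trains against specific opponents,
    # with exploration and advisor influence enabled.
    for _ in range(train_episodes):
        obs = env.reset()
        for _ in range(max_steps):
            action = agent.act(obs, explore=True, use_advisor=True)
            obs, reward, done = env.step(action)
            agent.learn(obs, reward, done)
            if done:
                break

    # Face-off phase: no more training, no exploration, and no advisor
    # influence; only the frozen policy is executed.
    wins = 0
    for _ in range(eval_games):
        obs = env.reset()
        won = False
        for _ in range(max_steps):
            action = agent.act(obs, explore=False, use_advisor=False)
            obs, reward, done = env.step(action)
            won = reward > 0
            if done:
                break
        wins += int(won)
    return wins / eval_games
```

The 1000-game training / 100-game execution setup for the SISL domains follows the same shape with different episode counts.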
Hardware Specification: Yes
"The training for all the experiments on the Pommerman domain in Section 5.2 and Section 5.3 was run on a 2-GPU virtual machine with 16 GB GPU memory per GPU. The experiments take an average of 18 hours wall-clock time to complete. We use Nvidia Volta-100 (V100) GPUs for all these experiments. The CPUs use Skylake as the processor microarchitecture. The experiments on Pursuit and Waterworld domains in Section 5.3 were run on a virtual machine having the same configuration containing 2 GPUs. These experiments take an average of 12 hours wall-clock time to complete."
Software Dependencies: No
"The hyperparameters for the baselines were chosen to be the same as those recommended by the respective papers. Some minor modifications were made due to performance and computational efficiency considerations. Regarding the hyperparameters of DQfD, we set 1 × 10^6 as the demo buffer size and perform 50,000 mini-batch updates for pretraining. The replay buffer size is twice the size of the demo buffer. The N-step return weight is 1.0, the supervised loss weight is 1.0, and the L2 regularization weight is 10^-5. The epsilon-greedy exploration is 0.9. The discount factor is 0.99 and the learning rate is 0.002. The pretraining for DQfD comes from a data buffer related to a series of games where two rule-based agents (advisors) compete against each other. All other values are similar to those used in Hester et al. (2018)."
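Collected into a single config, the DQfD hyperparameters quoted above look roughly like this. The key names are illustrative assumptions, not identifiers from the released code:

```python
# DQfD hyperparameters as quoted in the excerpt above. Key names are
# illustrative assumptions, not identifiers from the paper's code.
DEMO_BUFFER_SIZE = int(1e6)  # "1 x 10^6 as the demo buffer size"

dqfd_config = {
    "demo_buffer_size": DEMO_BUFFER_SIZE,
    "replay_buffer_size": 2 * DEMO_BUFFER_SIZE,  # twice the demo buffer
    "pretrain_updates": 50_000,    # mini-batch updates for pretraining
    "n_step_return_weight": 1.0,
    "supervised_loss_weight": 1.0,
    "l2_regularization_weight": 1e-5,
    "epsilon_greedy": 0.9,
    "discount_factor": 0.99,
    "learning_rate": 0.002,
}
```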
Experiment Setup: Yes
"Regarding the hyperparameters of DQfD, we set 1 × 10^6 as the demo buffer size and perform 50,000 mini-batch updates for pretraining. The replay buffer size is twice the size of the demo buffer. The N-step return weight is 1.0, the supervised loss weight is 1.0, and the L2 regularization weight is 10^-5. The epsilon-greedy exploration is 0.9. The discount factor is 0.99 and the learning rate is 0.002.

The CHAT (Wang and Taylor, 2017) implementation uses a neural network for confidence measurement (termed NNHAT in Wang and Taylor (2017)). The learning rate is 0.01, we use a discount factor of 0.9 and a fixed exploration constant (ε-greedy) of 0.9. We use the extra-action variant of HAT (Taylor et al., 2011) in the CHAT implementation, as this gave the best performance in most of our comparative experiments. A neural network is used as the function approximator, as described in Mnih et al. (2015). The target net is replaced every 10 learning iterations. The confidence threshold is set as 0.6 and the default action as action-0. The mini-batch size is 32 and the learning rate is 0.01.

The advisor influence parameter in Algorithm 1 for ADMIRAL-DM and ADMIRAL-DM(AC) starts at a high value of 0.8 at the beginning of training and linearly decays to 0.01 during training.

The DDPG uses a learning rate of 0.001 for the actor and 0.002 for the critic. The discount factor is 0.9. We use the soft replacement strategy with a learning rate of 0.01. The batch size is 32. The PPO implementation also uses the same batch size and actor and critic learning rates."
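The linear decay of the advisor influence parameter (0.8 at the start of training down to 0.01 by the end) can be sketched as a small helper. The function name and signature are assumptions for illustration, not taken from the paper's code:

```python
def advisor_influence(step, total_steps, start=0.8, end=0.01):
    """Linearly decay the advisor influence parameter from `start` at the
    beginning of training to `end` by the final training step.

    Illustrative sketch of the schedule described in the excerpt above;
    the name and signature are not from the paper's released code.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

At step 0 this returns 0.8, at the final step 0.01, and halfway through training it sits at the midpoint, 0.405.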