Multi-Agent Advisor Q-Learning

Authors: Sriram Ganapathi Subramanian, Matthew E. Taylor, Kate Larson, Mark Crowley

JAIR 2022

Reproducibility assessment: for each variable, the result and a supporting excerpt from the paper.
Research Type: Experimental
"Furthermore, extensive experiments illustrate that these algorithms: can be used in a variety of environments, have performances that compare favourably to other related baselines, can scale to large state-action spaces, and are robust to poor advice from advisors."
Researcher Affiliation: Academia
Sriram Ganapathi Subramanian, University of Waterloo (200 University Ave W, Waterloo, ON N2L 3G1) and Vector Institute
Matthew E. Taylor, University of Alberta (116 Street and 85 Avenue, Edmonton, AB T6G 2R3) and Alberta Machine Intelligence Institute (Amii)
Kate Larson, University of Waterloo (200 University Ave W, Waterloo, ON N2L 3G1)
Mark Crowley, University of Waterloo (200 University Ave W, Waterloo, ON N2L 3G1)
Pseudocode: Yes
Algorithm 1: ADvising Multiple Intelligent Reinforcement Agents Decision Making (ADMIRAL-DM)
Algorithm 2: ADvising Multiple Intelligent Reinforcement Agents Advisor Evaluation (ADMIRAL-AE)
Algorithm 3: ADMIRAL-DM Neural Network Implementation
Algorithm 4: ADMIRAL-AE Neural Network Implementation
Algorithm 5: ADMIRAL-DM(AC) Neural Network Implementation
Open Source Code: Yes
"The source code for the experiments has been open sourced (Subramanian, 2022)."
Open Datasets: Yes
"We experimentally validate our algorithms, showing their effectiveness in a variety of situations using different testbeds. We also demonstrate superior performance to common baselines previously used in literature. The source code for the experiments has been open sourced (Subramanian, 2022)." The testbeds are the Pommerman environment (Resnick et al., 2018) and two cooperative domains from the Stanford Intelligent Systems Laboratory (SISL) (Gupta et al., 2017), implemented in the PettingZoo environment (Terry et al., 2020).
Dataset Splits: Yes
"We perform 50,000 episodes of training, where the algorithms train against specific opponents. Each episode is a full Pommerman game (lasting a maximum of 800 steps). All the algorithms relying on demonstrations (DQfD, CHAT, ADMIRAL-DM, and ADMIRAL-DM(AC)) use the Advisor 1 considered in Section 5.2. ... After the training phase, the trained algorithms enter a face-off competition of 10,000 games where there is no more training, no further exploration, and additionally ADMIRAL-DM and ADMIRAL-DM(AC) play without any advisor influence. ... The algorithms train for 1000 games in the training phase and then enter an execution phase, where they execute the trained policy for 100 games."
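The two-phase protocol quoted above (a training phase with exploration and advisor influence, then a face-off phase where the frozen policy plays without either) can be sketched as follows. The `env` and `agent` interfaces are hypothetical placeholders for illustration, not the paper's released code.

```python
# Sketch of the two-phase protocol quoted above. `env` and `agent` are
# hypothetical interfaces, not the paper's actual implementation.

def run_protocol(env, agent, train_episodes=50_000, eval_games=10_000,
                 max_steps=800):
    # Training phase: the algorithm trains against specific opponents,
    # with exploration and advisor influence enabled.
    for _ in range(train_episodes):
        obs = env.reset()
        for _ in range(max_steps):
            action = agent.act(obs, explore=True, use_advisor=True)
            obs, reward, done = env.step(action)
            agent.learn(obs, reward, done)
            if done:
                break

    # Face-off phase: no more training, no exploration, and no advisor
    # influence; only the frozen policy is executed.
    wins = 0
    for _ in range(eval_games):
        obs = env.reset()
        won = False
        for _ in range(max_steps):
            action = agent.act(obs, explore=False, use_advisor=False)
            obs, reward, done = env.step(action)
            won = reward > 0
            if done:
                break
        wins += int(won)
    return wins / eval_games
```

The 1000-game training / 100-game execution setup for the SISL domains follows the same shape with different episode counts.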
Hardware Specification: Yes
"The training for all the experiments on the Pommerman domain in Section 5.2 and Section 5.3 was run on a 2-GPU virtual machine with 16 GB GPU memory per GPU. The experiments take an average of 18 hours wall-clock time to complete. We use Nvidia Volta-100 (V100) GPUs for all these experiments. The CPUs use Skylake as the processor microarchitecture. The experiments on Pursuit and Waterworld domains in Section 5.3 were run on a virtual machine having the same configuration containing 2 GPUs. These experiments take an average of 12 hours wall-clock time to complete."
Software Dependencies: No
"The hyperparameters for the baselines were chosen to be the same as those recommended by the respective papers. Some minor modifications were made due to performance and computational efficiency considerations. Regarding the hyperparameters of DQfD, we set 1 × 10^6 as the demo buffer size and perform 50,000 mini-batch updates for pretraining. The replay buffer size is twice the size of the demo buffer. The N-step return weight is 1.0, the supervised loss weight is 1.0, and the L2 regularization weight is 10^-5. The epsilon-greedy exploration is 0.9. The discount factor is 0.99 and the learning rate is 0.002. The pretraining for DQfD comes from a data buffer related to a series of games where two rule-based agents (advisors) compete against each other. All other values are similar to those used in Hester et al. (2018)."
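Collected into a single config, the DQfD hyperparameters quoted above look roughly like this. The key names are illustrative assumptions, not identifiers from the released code:

```python
# DQfD hyperparameters as quoted in the excerpt above. Key names are
# illustrative assumptions, not identifiers from the paper's code.
DEMO_BUFFER_SIZE = int(1e6)  # "1 x 10^6 as the demo buffer size"

dqfd_config = {
    "demo_buffer_size": DEMO_BUFFER_SIZE,
    "replay_buffer_size": 2 * DEMO_BUFFER_SIZE,  # twice the demo buffer
    "pretrain_updates": 50_000,    # mini-batch updates for pretraining
    "n_step_return_weight": 1.0,
    "supervised_loss_weight": 1.0,
    "l2_regularization_weight": 1e-5,
    "epsilon_greedy": 0.9,
    "discount_factor": 0.99,
    "learning_rate": 0.002,
}
```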
Experiment Setup: Yes
"Regarding the hyperparameters of DQfD, we set 1 × 10^6 as the demo buffer size and perform 50,000 mini-batch updates for pretraining. The replay buffer size is twice the size of the demo buffer. The N-step return weight is 1.0, the supervised loss weight is 1.0, and the L2 regularization weight is 10^-5. The epsilon-greedy exploration is 0.9. The discount factor is 0.99 and the learning rate is 0.002.

The CHAT (Wang and Taylor, 2017) implementation uses a neural network for confidence measurement (termed NNHAT in Wang and Taylor (2017)). The learning rate is 0.01, we use a discount factor of 0.9 and a fixed exploration constant (ε-greedy) of 0.9. We use the extra-action variant of HAT (Taylor et al., 2011) in the CHAT implementation, as this gave the best performance in most of our comparative experiments. A neural network is used as the function approximator, as described in Mnih et al. (2015). The target net is replaced every 10 learning iterations. The confidence threshold is set as 0.6 and the default action as action-0. The mini-batch size is 32 and the learning rate is 0.01.

The advisor influence parameter in Algorithm 1 for ADMIRAL-DM and ADMIRAL-DM(AC) starts at a high value of 0.8 at the beginning of training and linearly decays to 0.01 during training.

The DDPG uses a learning rate of 0.001 for the actor and 0.002 for the critic. The discount factor is 0.9. We use the soft replacement strategy with a learning rate of 0.01. The batch size is 32. The PPO implementation also uses the same batch size and actor and critic learning rates."
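The linear decay of the advisor influence parameter (0.8 at the start of training down to 0.01 by the end) can be sketched as a small helper. The function name and signature are assumptions for illustration, not taken from the paper's code:

```python
def advisor_influence(step, total_steps, start=0.8, end=0.01):
    """Linearly decay the advisor influence parameter from `start` at the
    beginning of training to `end` by the final training step.

    Illustrative sketch of the schedule described in the excerpt above;
    the name and signature are not from the paper's released code.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)
```

At step 0 this returns 0.8, at the final step 0.01, and halfway through training it sits at the midpoint, 0.405.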