Implicit Ensemble Training for Efficient and Robust Multiagent Reinforcement Learning

Authors: Macheng Shen, Jonathan P. How

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate in several competitive multiagent scenarios in the board-game and robotic domains that our new approach improves robustness against unseen adversarial opponents while achieving higher sample efficiency and less computation."
Researcher Affiliation | Academia | "Macheng Shen (EMAIL), Department of Mechanical Engineering, Massachusetts Institute of Technology; Jonathan P. How (EMAIL), Department of Aeronautics and Astronautics, Massachusetts Institute of Technology"
Pseudocode | Yes | Algorithm 1: Implicit Ensemble Training
  Require: number of training episodes M; a set of agents with indices in {1, ..., N}
  1: Initialize network parameters ψ_i, ϕ_i for each agent i ∈ {1, ..., N}
  2: for j = 1 : M do
  3:   for i = 1 : N do
  4:     Sample latent vector z_i ~ N(0, I_{L×L})
  5:   end for
  6:   Rollout, with each agent sampling its action via a_i ~ h_{ϕ_i}(o_i; g_ψ(z_i))
  7:   Update network parameters ψ_i, ϕ_i with policy gradient
  8: end for
  9: return ψ_i, ϕ_i
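The control flow of Algorithm 1 can be sketched in plain Python. This is a minimal illustration only: `policy_stub` is a hypothetical placeholder for the latent-conditioned policy h_ϕ(o; g_ψ(z)), and the policy-gradient update is elided.

```python
import random

L = 10          # latent condition variable dimension (from the paper's hyperparameters)
N_AGENTS = 2    # the paper's scenarios are 2-player
EPISODES = 5    # toy number of training episodes M

def sample_latent(dim):
    # Algorithm 1, line 4: z_i ~ N(0, I_{L x L})
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def policy_stub(obs, latent):
    # Stand-in for a_i ~ h_phi(o_i; g_psi(z_i)); a real implementation
    # would run the shaping network g_psi and the modular policy network.
    return sum(obs) + sum(latent)

def iet_training_loop(episodes=EPISODES):
    actions_per_episode = []
    for _ in range(episodes):
        # One fresh latent per agent per episode induces the implicit ensemble
        latents = [sample_latent(L) for _ in range(N_AGENTS)]
        # Line 6: rollout, each agent conditioned on its own latent
        actions = [policy_stub([0.0], z) for z in latents]
        actions_per_episode.append(actions)
        # Line 7: a policy-gradient update of psi_i, phi_i would happen here
    return actions_per_episode

history = iet_training_loop()
```

The key point the sketch preserves is that the latent z_i is resampled every episode, so a single set of network parameters implicitly represents an ensemble of behaviors.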
Open Source Code | Yes | Code: https://github.com/MachengShen/ImplicitEnsembleTraining
Open Datasets | Yes | "We evaluate our approach on two types of 2-player (which we refer to as the blue agent and the red agent hereafter) multiagent scenarios. Board-game: turn-based games implemented in the PettingZoo multiagent environment (Terry et al., 2020) and the RLCard toolkit (Zha et al., 2019): Connect Four, Leduc Hold'em, and Texas Hold'em (Limit). RoboSchool-Racer: continuous problems modified from the robot racing scenarios in the RoboSchool environment: Ant and Cheetah, where we decompose each robot into front and rear parts and assign opposite rewards to each part." (Environments: https://www.pettingzoo.ml, https://github.com/datamllab/rlcard, https://github.com/openai/roboschool)
Dataset Splits | Yes | "To evaluate the robustness of the learned policies, we adopted an approach similar to (Vinyals et al., 2019; Gleave et al., 2019) by training an independent exploiter agent. Specifically, we launched two concurrent threads, one for training and the other for testing, and repeated the following steps:
1. Train the blue agent and the red agent in the training thread for one training epoch.
2. Copy the blue agent's policy to the testing thread and freeze it.
3. Train the red exploiter agent in the testing thread against the fixed blue agent.
As a result, the red exploiter agent learns to exploit any weakness of the blue policy, and the corresponding reward is an informative indicator of the adversarial robustness of the blue policy. Table 1 shows the average testing reward, the best testing reward, and the average reward gap between training and testing. We compare our IET approach with two baselines: IET with the input latent noise fixed as a zero vector, and a standard ensemble training (SET) baseline with 10 policies. For the IET settings, we ran experiments across 25 random seeds. However, because of the difficulty of balancing the relative strength of the blue and red agents, for some random seeds the blue agent's training reward converges to near -1.0, and so does the testing reward. Since we are interested in the robustness of the learned policy (whether a strong training performance is sufficient to guarantee a strong testing performance), we filter out the seeds that lead to poor training performance (converged training reward lower than -0.5). This left 19 valid seeds for each of the two IET settings, and the results are evaluated only across these valid seeds. For the SET setting, both the training and testing rewards turned out to be insensitive to random seeds, so we report results across 4 random seeds."
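The three-step exploiter protocol above can be outlined as a loop. This is a schematic sketch, not the authors' implementation: `PolicyStub` is a hypothetical placeholder that only counts gradient epochs, and the frozen copy stands in for the snapshot handed to the testing thread.

```python
import copy

class PolicyStub:
    """Hypothetical trainable policy; tracks epoch count for illustration."""
    def __init__(self):
        self.updates = 0

    def train_one_epoch(self):
        # A real policy would run PPO updates here
        self.updates += 1

def exploiter_evaluation(epochs=3):
    blue, red, exploiter = PolicyStub(), PolicyStub(), PolicyStub()
    frozen_snapshots = []
    for _ in range(epochs):
        # Step 1: train blue and red in the training thread for one epoch
        blue.train_one_epoch()
        red.train_one_epoch()
        # Step 2: copy blue's policy to the testing thread and freeze it
        frozen_snapshots.append(copy.deepcopy(blue))
        # Step 3: train the exploiter against the fixed blue snapshot;
        # its reward against the snapshot indicates blue's robustness
        exploiter.train_one_epoch()
    return blue, exploiter, frozen_snapshots

blue, exploiter, snapshots = exploiter_evaluation()
```

Freezing via a deep copy matters: the exploiter probes a fixed blue policy, so its reward reflects weaknesses of that snapshot rather than a moving target.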
Hardware Specification | No | The paper does not specify any particular hardware used for the experiments, only general software frameworks and hyperparameters.
Software Dependencies | No | "We used the RLlib (Liang et al., 2018) implementation of Proximal Policy Optimization (PPO) with a minibatch size of 256 and a learning rate of 5 × 10^-5. We use independent networks for the policy and the value function approximations and set the following hyperparameters for IET: L = 10 for the latent condition variable dimension; H = 64 for the hidden layer dimension in the shaping network; n = 2 and m = 2 for the number of layers and number of modules of the modular network; D = 64 and d = 64 for the embedding and module hidden dimensions. For the other approaches, each of the policy and the value networks consists of two fully-connected layers with 256 hidden units." Although RLlib and PPO are mentioned, specific version numbers are not provided.
Experiment Setup | Yes | "We used the RLlib (Liang et al., 2018) implementation of Proximal Policy Optimization (PPO) with a minibatch size of 256 and a learning rate of 5 × 10^-5. We use independent networks for the policy and the value function approximations and set the following hyperparameters for IET: L = 10 for the latent condition variable dimension; H = 64 for the hidden layer dimension in the shaping network; n = 2 and m = 2 for the number of layers and number of modules of the modular network; D = 64 and d = 64 for the embedding and module hidden dimensions. For the other approaches, each of the policy and the value networks consists of two fully-connected layers with 256 hidden units."
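For reference, the reported hyperparameters can be collected in one place. The dictionary key names below are our own labels (the paper gives only the symbols L, H, n, m, D, d), so this is a convenience sketch rather than an actual RLlib configuration object.

```python
# Hyperparameters reported in the section, gathered into a single dict.
# Key names are illustrative labels for the paper's symbols.
IET_CONFIG = {
    "algorithm": "PPO (RLlib implementation)",
    "minibatch_size": 256,
    "learning_rate": 5e-5,       # 5 x 10^-5
    "latent_dim_L": 10,          # latent condition variable dimension
    "shaping_hidden_H": 64,      # shaping-network hidden layer dimension
    "modular_layers_n": 2,       # number of layers of the modular network
    "modular_modules_m": 2,      # number of modules per layer
    "embedding_dim_D": 64,       # embedding dimension
    "module_hidden_d": 64,       # module hidden dimension
}

# Baseline policy and value networks: two fully-connected layers, 256 units each
BASELINE_HIDDENS = [256, 256]
```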