Implicit Ensemble Training for Efficient and Robust Multiagent Reinforcement Learning

Authors: Macheng Shen, Jonathan P. How

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate in several competitive multiagent scenarios in the board-game and robotic domains that our new approach improves robustness against unseen adversarial opponents while achieving higher sample efficiency and less computation."
Researcher Affiliation | Academia | "Macheng Shen (EMAIL), Department of Mechanical Engineering, Massachusetts Institute of Technology; Jonathan P. How (EMAIL), Department of Aeronautics and Astronautics, Massachusetts Institute of Technology"
Pseudocode | Yes | Algorithm 1: Implicit Ensemble Training
  Require: number of training episodes M; a set of agents with indices in {1, ..., N}
  1: Initialize network parameters ψ_i, ϕ_i for each agent i ∈ {1, ..., N}
  2: for j = 1 : M do
  3:   for i = 1 : N do
  4:     Sample latent vector z_i ~ N(0, I_{L×L})
  5:   end for
  6:   Rollout, with each agent sampling its action via a_i ~ h_{ϕ_i}(o_i; g_ψ(z_i))
  7:   Update network parameters ψ_i, ϕ_i with policy gradient
  8: end for
  9: return ψ_i, ϕ_i
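The control flow of Algorithm 1 can be sketched in plain Python. This is a minimal illustration only: `policy_stub` is a hypothetical placeholder for the latent-conditioned policy h_ϕ(o; g_ψ(z)), and the policy-gradient update is elided.

```python
import random

L = 10          # latent condition variable dimension (from the paper's hyperparameters)
N_AGENTS = 2    # the paper's scenarios are 2-player
EPISODES = 5    # toy number of training episodes M

def sample_latent(dim):
    # Algorithm 1, line 4: z_i ~ N(0, I_{L x L})
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def policy_stub(obs, latent):
    # Stand-in for a_i ~ h_phi(o_i; g_psi(z_i)); a real implementation
    # would run the shaping network g_psi and the modular policy network.
    return sum(obs) + sum(latent)

def iet_training_loop(episodes=EPISODES):
    actions_per_episode = []
    for _ in range(episodes):
        # One fresh latent per agent per episode induces the implicit ensemble
        latents = [sample_latent(L) for _ in range(N_AGENTS)]
        # Line 6: rollout, each agent conditioned on its own latent
        actions = [policy_stub([0.0], z) for z in latents]
        actions_per_episode.append(actions)
        # Line 7: a policy-gradient update of psi_i, phi_i would happen here
    return actions_per_episode

history = iet_training_loop()
```

The key point the sketch preserves is that the latent z_i is resampled every episode, so a single set of network parameters implicitly represents an ensemble of behaviors.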
Open Source Code | Yes | Code: https://github.com/MachengShen/ImplicitEnsembleTraining
Open Datasets | Yes | "We evaluate our approach on two types of 2-player (which we refer to as the blue agent and the red agent hereafter) multiagent scenarios. Board-game: turn-based games implemented in the PettingZoo multiagent environment (Terry et al., 2020) and the RLCard toolkit (Zha et al., 2019): Connect Four, Leduc Hold'em, and Texas Hold'em (Limit). RoboSchool-Racer: continuous problems modified from the robot racing scenarios in the RoboSchool environment: Ant and Cheetah, where we decompose each robot into front and rear parts and assign opposite rewards to each part." (Environments: https://www.pettingzoo.ml, https://github.com/datamllab/rlcard, https://github.com/openai/roboschool)
Dataset Splits | Yes | "To evaluate the robustness of the learned policies, we adopted an approach similar to (Vinyals et al., 2019; Gleave et al., 2019) by training an independent exploiter agent. Specifically, we launched two concurrent threads, one for training and the other for testing, and repeated the following steps:
1. Train the blue agent and the red agent in the training thread for one training epoch.
2. Copy the blue agent's policy to the testing thread and freeze it.
3. Train the red exploiter agent in the testing thread against the fixed blue agent.
As a result, the red exploiter agent learns to exploit any weakness of the blue policy, and the corresponding reward is an informative indicator of the adversarial robustness of the blue policy. Table 1 shows the average testing reward, the best testing reward, and the average reward gap between training and testing. We compare our IET approach with two baselines: IET with the input latent noise fixed as a zero vector, and a standard ensemble training (SET) baseline with 10 policies. For the IET settings, we ran experiments across 25 random seeds. However, because of the difficulty of balancing the relative strength of the blue and red agents, for some random seeds the blue agent's training reward converges to near -1.0, and so does the testing reward. Since we are interested in the robustness of the learned policy (whether a strong training performance is sufficient to guarantee a strong testing performance), we filter out the seeds that lead to poor training performance (converged training reward lower than -0.5). This left 19 valid seeds for each of the two IET settings, and the results are evaluated only across these valid seeds. For the SET setting, both the training and testing rewards turned out to be insensitive to random seeds, so we report results across 4 random seeds."
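The three-step exploiter protocol above can be outlined as a loop. This is a schematic sketch, not the authors' implementation: `PolicyStub` is a hypothetical placeholder that only counts gradient epochs, and the frozen copy stands in for the snapshot handed to the testing thread.

```python
import copy

class PolicyStub:
    """Hypothetical trainable policy; tracks epoch count for illustration."""
    def __init__(self):
        self.updates = 0

    def train_one_epoch(self):
        # A real policy would run PPO updates here
        self.updates += 1

def exploiter_evaluation(epochs=3):
    blue, red, exploiter = PolicyStub(), PolicyStub(), PolicyStub()
    frozen_snapshots = []
    for _ in range(epochs):
        # Step 1: train blue and red in the training thread for one epoch
        blue.train_one_epoch()
        red.train_one_epoch()
        # Step 2: copy blue's policy to the testing thread and freeze it
        frozen_snapshots.append(copy.deepcopy(blue))
        # Step 3: train the exploiter against the fixed blue snapshot;
        # its reward against the snapshot indicates blue's robustness
        exploiter.train_one_epoch()
    return blue, exploiter, frozen_snapshots

blue, exploiter, snapshots = exploiter_evaluation()
```

Freezing via a deep copy matters: the exploiter probes a fixed blue policy, so its reward reflects weaknesses of that snapshot rather than a moving target.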
Hardware Specification | No | The paper does not specify any particular hardware used for the experiments, only general software frameworks and hyperparameters.
Software Dependencies | No | "We used the RLlib (Liang et al., 2018) implementation of Proximal Policy Optimization (PPO) with a minibatch size of 256 and a learning rate of 5 × 10^-5. We use independent networks for the policy and the value function approximations and set the following hyperparameters for IET: L = 10 for the latent condition variable dimension; H = 64 for the hidden layer dimension in the shaping network; n = 2 and m = 2 for the number of layers and number of modules of the modular network; D = 64 and d = 64 for the embedding and module hidden dimensions. For the other approaches, each of the policy and the value networks consists of two fully-connected layers with 256 hidden units." Although RLlib and PPO are mentioned, specific version numbers are not provided.
Experiment Setup | Yes | "We used the RLlib (Liang et al., 2018) implementation of Proximal Policy Optimization (PPO) with a minibatch size of 256 and a learning rate of 5 × 10^-5. We use independent networks for the policy and the value function approximations and set the following hyperparameters for IET: L = 10 for the latent condition variable dimension; H = 64 for the hidden layer dimension in the shaping network; n = 2 and m = 2 for the number of layers and number of modules of the modular network; D = 64 and d = 64 for the embedding and module hidden dimensions. For the other approaches, each of the policy and the value networks consists of two fully-connected layers with 256 hidden units."
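For reference, the reported hyperparameters can be collected in one place. The dictionary key names below are our own labels (the paper gives only the symbols L, H, n, m, D, d), so this is a convenience sketch rather than an actual RLlib configuration object.

```python
# Hyperparameters reported in the section, gathered into a single dict.
# Key names are illustrative labels for the paper's symbols.
IET_CONFIG = {
    "algorithm": "PPO (RLlib implementation)",
    "minibatch_size": 256,
    "learning_rate": 5e-5,       # 5 x 10^-5
    "latent_dim_L": 10,          # latent condition variable dimension
    "shaping_hidden_H": 64,      # shaping-network hidden layer dimension
    "modular_layers_n": 2,       # number of layers of the modular network
    "modular_modules_m": 2,      # number of modules per layer
    "embedding_dim_D": 64,       # embedding dimension
    "module_hidden_d": 64,       # module hidden dimension
}

# Baseline policy and value networks: two fully-connected layers, 256 units each
BASELINE_HIDDENS = [256, 256]
```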