Strategic Knowledge Transfer

Authors: Max Olan Smith, Thomas Anthony, Michael P. Wellman

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental evaluation of these methods on general-sum grid-world games provides evidence about their advantages and limitations in comparison to standard PSRO.
Researcher Affiliation | Collaboration | Max Olan Smith (EMAIL), University of Michigan, Computer Science & Engineering, Ann Arbor, MI 48109-2121, USA; Thomas Anthony (EMAIL), DeepMind, 6 Pancras Square, London N1C 4AG, UK; Michael P. Wellman (EMAIL), University of Michigan, Computer Science & Engineering, Ann Arbor, MI 48109-2121, USA.
Pseudocode | Yes | Algorithm 1: Value Iteration: Q-Mixing; Algorithm 2: Policy-Space Response Oracles (Lanctot et al., 2017); Algorithm 3: Mixed-Oracles; Algorithm 4: Mixed-Opponents
Open Source Code | No | The paper does not contain an explicit statement by the authors about releasing source code for the methods described in this work (e.g., Q-Mixing, Mixed-Oracles, Mixed-Opponents). While it references a third-party open-source implementation of the Gathering environment (https://github.com/HumanCompatibleAI/multi-agent), this is not the authors' own code for their contributions.
Open Datasets | Yes | We first evaluate Q-Mixing on the Running With Scissors (RWS) grid-world game (Vezhnevets et al., 2020; Leibo et al., 2021). ... We compare the methods on two distinct games: RWS and Gathering (Leibo et al., 2021). ... Joel Z. Leibo, Edgar Duéñez-Guzmán, Alexander Sasha Vezhnevets, John P. Agapiou, Peter Sunehag, Raphael Koster, Jayd Matyas, Charles Beattie, Igor Mordatch, and Thore Graepel. Scalable evaluation of multi-agent reinforcement learning with Melting Pot. In 38th International Conference on Machine Learning (ICML), pages 6187–6199, 2021.
Dataset Splits | No | The paper describes training with replay buffers and evaluating performance over 300 simulated episodes per opponent policy, averaged across five random seeds. However, it does not specify traditional training/validation/test splits with explicit percentages or sample counts for a static dataset, as is common in supervised learning contexts.
Hardware Specification | No | The paper does not provide specific hardware details, such as exact GPU/CPU models, processor types, or memory amounts, used for running its experiments. It mentions the use of deep RL but gives no concrete hardware specifications.
Software Dependencies | No | The paper mentions algorithms and architectures such as Double DQN and LSTM, but it does not list software dependencies with version numbers (e.g., the Python version, or a deep learning framework such as PyTorch or TensorFlow with its version) that would be needed to replicate the experiments.
Experiment Setup | Yes | The LSTM has a memory size of 128, and the output is projected through a series of fully-connected layers with sizes [128, 64, 64, 9]. ... To train such a policy, an auxiliary reward of 1 is added each time the agent collects their preferred item.
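The Q-Mixing rule named in the pseudocode row (Algorithm 1) transfers knowledge by combining opponent-specific action values, weighted by a distribution over which opponent is being faced. A minimal sketch, assuming tabular per-opponent Q-values and a fixed opponent distribution; the function and variable names here are illustrative, not taken from the paper's (unreleased) code:

```python
import numpy as np

def q_mixing(q_tables, opponent_probs):
    """Mix per-opponent action values into one set of Q-values.

    q_tables: dict mapping opponent id -> array of shape (n_states, n_actions)
    opponent_probs: dict mapping opponent id -> probability of facing it
    """
    mixed = None
    for opp, q in q_tables.items():
        contribution = opponent_probs[opp] * q
        mixed = contribution if mixed is None else mixed + contribution
    return mixed

# Toy example: two opponents, 2 states x 3 actions each.
q_a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
q_b = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
mixed = q_mixing({"a": q_a, "b": q_b}, {"a": 0.5, "b": 0.5})
greedy_actions = mixed.argmax(axis=1)  # act greedily w.r.t. the mixed values
```

Acting greedily with respect to the mixed values gives a single policy that hedges across the opponent distribution without retraining against each opponent.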
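The architecture quoted in the Experiment Setup row (an LSTM with memory size 128 feeding fully-connected layers of sizes [128, 64, 64, 9]) can be sketched as a PyTorch module. The observation dimension and the mapping of the final 9 outputs to discrete actions are assumptions for illustration, not details from the paper:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Sketch of the quoted setup: LSTM with memory size 128, then
    fully-connected layers of sizes [128, 64, 64, 9].
    obs_dim is illustrative; the paper does not state the input size here."""

    def __init__(self, obs_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, 128, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, 9),  # assumed: 9 discrete actions
        )

    def forward(self, obs_seq, state=None):
        out, state = self.lstm(obs_seq, state)  # out: (batch, seq, 128)
        return self.head(out), state

policy = RecurrentPolicy()
q_values, _ = policy(torch.zeros(1, 5, 64))  # batch of 1, sequence length 5
```

One output head per action is the usual arrangement for a Double DQN-style value network, which the Software Dependencies row notes the paper uses.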