Learning Strategy Representation for Imitation Learning in Multi-Agent Games

Authors: Shiqi Lei, Kanghoon Lee, Linjing Li, Jinkyoo Park

AAAI 2025

Reproducibility Assessment: Variable, Result, LLM Response
Research Type: Experimental. "We demonstrate the effectiveness of STRIL across competitive multi-agent scenarios, including Two-player Pong, Limit Texas Hold'em, and Connect Four. Our approach successfully acquires strategy representations and indicators, thereby identifying dominant trajectories and significantly enhancing existing IL performance across these environments. ... We evaluate our method across three environments to demonstrate the effectiveness of the learned strategy representation in STRIL using estimated indicators. ... In Table 1, we compared the WS of four types of data filtering methods. A hyperparameter search was conducted to identify the appropriate percentile, p, of indicators for each model and environment. Note that all experiments were repeated three times, and the results are reported with error bars."
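The filtering step quoted above, selecting trajectories at or above a percentile p of an estimated indicator, can be sketched as follows. This is an illustrative sketch only: the function name, the toy indicator values, and the choice of p are assumptions, not the paper's implementation.

```python
import numpy as np

def filter_by_indicator(trajectories, indicators, p):
    """Keep trajectories whose indicator is at or above the p-th percentile.

    `trajectories` and `indicators` are parallel sequences; `p` is in [0, 100].
    Illustrative sketch of percentile-based data filtering, not STRIL itself.
    """
    indicators = np.asarray(indicators, dtype=float)
    threshold = np.percentile(indicators, p)  # percentile cutoff on indicator values
    return [traj for traj, ind in zip(trajectories, indicators) if ind >= threshold]

# Toy example: five trajectories with scalar indicators; keep the top half.
trajs = ["t1", "t2", "t3", "t4", "t5"]
inds = [0.1, 0.9, 0.5, 0.7, 0.3]
kept = filter_by_indicator(trajs, inds, 50)
```

In the paper's setting, p itself is tuned by a hyperparameter search per model and environment; the sketch simply fixes it at 50 for concreteness.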
Researcher Affiliation: Collaboration. 1 Institute of Automation, Chinese Academy of Sciences (CASIA); 2 Korea Advanced Institute of Science and Technology (KAIST); 3 Beijing Wenge Technology Co., Ltd.
Pseudocode: No. The paper describes its methods using mathematical equations and diagrams (Figures 1 and 2), but it contains no clearly labeled 'Pseudocode' or 'Algorithm' block, nor structured steps formatted like code.
Open Source Code: No. The paper neither states that the source code is released nor provides a link to a code repository.
Open Datasets: No. The paper states: 'Dataset generation. We employ different methods to create training datasets with diverse demonstrators for the environments.' and 'Behavior models are then selected from multiple intermediate checkpoints to generate the offline data.' While the environments (Two-player Pong, Limit Texas Hold'em, Connect Four) are built on existing platforms (RLCard, PettingZoo), the specific offline datasets generated for the experiments are not stated to be publicly available, and no links or citations for these generated datasets are provided.
Dataset Splits: No. The paper mentions 'We assume that only 5% of the dataset is reward-labeled for EL estimation.' This specifies a reward-labeled subset, but it does not provide training/validation/test splits for the main imitation learning experiments.
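The 5% reward-labeled subset mentioned above can be illustrated with a simple random selection. Note this is a hedged sketch: the paper states only the 5% fraction, and the uniform-random selection scheme, dataset size, and seed here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)  # seed chosen for reproducibility of this sketch

n_trajectories = 1000            # toy dataset size (assumption)
labeled_fraction = 0.05          # 5% reward-labeled, as stated in the paper

# Draw a 5% subset without replacement to serve as the reward-labeled portion.
num_labeled = int(n_trajectories * labeled_fraction)
labeled_idx = rng.choice(n_trajectories, size=num_labeled, replace=False)
```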
Hardware Specification: No. The paper gives no details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies: No. The paper mentions algorithms such as Proximal Policy Optimization (PPO) and Deep Q-Networks (DQN) and frameworks such as RLCard and PettingZoo, but it specifies no version numbers for software libraries, programming languages, or other tools used in the implementation.
Experiment Setup: No. The paper mentions 'A hyperparameter search was conducted to identify the appropriate percentile, p, of indicators for each model and environment.' and 'We set Ngame to 2,000.' It also states 'At the beginning of the training, the strategy representation l, which is a trainable variable, is randomly initialized for each trajectory τ.' and 'We use a two-layer MLP as L.' While these setup details are given, the paper lacks specific hyperparameter values (e.g., learning rate, batch size, number of epochs/iterations, optimizer settings) for the main IL algorithms (BC, IQ-Learn, ILEED) that would be needed for reproduction.
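The two setup details that are stated, a trainable strategy representation l randomly initialized per trajectory τ and a two-layer MLP L, can be sketched as below. All dimensions, the initialization scale, and the ReLU activation are assumptions for illustration; the paper does not report them.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_strategy_representations(num_trajectories, dim):
    """One randomly initialized, trainable representation l per trajectory,
    as the paper describes; the dimension and scale here are assumed."""
    return rng.normal(scale=0.1, size=(num_trajectories, dim))

def two_layer_mlp(l, W1, b1, W2, b2):
    """A two-layer MLP standing in for L; hidden width and ReLU are assumptions."""
    h = np.maximum(0.0, l @ W1 + b1)  # hidden layer with ReLU activation
    return h @ W2 + b2                # linear output layer

# Toy shapes: 4 trajectories, 8-dim representations, 16 hidden units, 1 output.
L_reps = init_strategy_representations(4, 8)
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
out = two_layer_mlp(L_reps, W1, b1, W2, b2)  # shape (4, 1)
```

In training, both the per-trajectory representations and the MLP weights would be updated by gradient descent; this sketch shows only the forward pass.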