Improving Adversarial Training for Two-player Competitive Games via Episodic Reward Engineering

Authors: Siyuan Chen, Fuyuan Zhang, Zhuo Li, Xiongfei Wu, Jianlang Chen, Pengzhan Zhao, Lei Ma, Jianjun Zhao

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we evaluate our method on two-player competitive games in MuJoCo domains (Todorov et al., 2012) and compare it with state-of-the-art adversarial policy training approaches (Gleave et al., 2020; Guo et al., 2021; Wu et al., 2021). Our experimental results show that our method establishes the most promising attack performance and defense difficulty. The comparison of the win rates and non-loss rates between our approach and the baseline approaches is summarized in Figure 2.
Researcher Affiliation | Academia | Siyuan Chen (The University of Tokyo); Fuyuan Zhang (Zhejiang University); Zhuo Li (Kyushu University); Xiongfei Wu (University of Luxembourg); Jianlang Chen (Kyushu University); Pengzhan Zhao (Hebei Normal University); Lei Ma (The University of Tokyo; University of Alberta); Jianjun Zhao (Kyushu University)
Pseudocode | Yes | Algorithm 1: Adversarial policy training with reward revision. Input: A: adversarial agent; E: environment; M: our episodic memory; B: experience storage; O: objective function from the fundamental training method. Parameters: k: pattern length; n: group size; ϵ: revision coefficient. Output: A: a well-trained adversarial agent.
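To make the roles of k, n, and ϵ concrete, the episodic memory M described in Algorithm 1 can be sketched as below. This is a minimal illustration, not the paper's implementation: the class name, the scalar state feature, and the mean-return revision rule are our own assumptions; only the parameter names (pattern length k, group size n, revision coefficient ϵ) come from the report.

```python
from collections import defaultdict, deque

class EpisodicMemory:
    """Hypothetical sketch of the episodic memory M in Algorithm 1.

    A scalar state feature is abstracted into one of n discrete groups;
    statistics are tracked per length-k pattern of consecutive abstract
    states, and a reward revision scaled by epsilon is returned.
    """

    def __init__(self, k=3, n=100, epsilon=0.1):
        self.k, self.n, self.epsilon = k, n, epsilon
        # pattern -> [sum of episode returns, visit count]
        self.stats = defaultdict(lambda: [0.0, 0])
        self.pattern = deque(maxlen=k)  # sliding window of abstract states

    def abstract(self, feature):
        # Map a feature assumed to lie in [0, 1) onto one of n groups.
        return min(int(feature * self.n), self.n - 1)

    def observe(self, feature, episode_return):
        """Update the memory and return the reward revision term."""
        self.pattern.append(self.abstract(feature))
        if len(self.pattern) < self.k:
            return 0.0  # not enough history to form a length-k pattern
        key = tuple(self.pattern)
        s, c = self.stats[key]
        self.stats[key] = [s + episode_return, c + 1]
        mean_return = (s + episode_return) / (c + 1)
        # Revision: bonus proportional to the pattern's historical
        # mean return, scaled by the revision coefficient epsilon.
        return self.epsilon * mean_return
```

In a training loop, the revised reward would then be `r + memory.observe(feature, episode_return)` before being stored in the experience storage B.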
Open Source Code | Yes | The source code is available at https://github.com/alsachai/episodic_reward_engineering.
Open Datasets | Yes | We evaluate our approach using two-player competitive games in MuJoCo simulation environments, demonstrating that our method establishes the most promising attack performance and defense difficulty against the victims among the existing adversarial policy training techniques. In our experiments, we evaluate our method on two-player competitive games in MuJoCo domains (Todorov et al., 2012) and compare it with state-of-the-art adversarial policy training approaches (Gleave et al., 2020; Guo et al., 2021; Wu et al., 2021). To answer Q1, we employ PPO (Schulman et al., 2017) as the fundamental single-agent policy training method to train adversarial agents against well-trained Zoo agents (Bansal et al., 2018).
Dataset Splits | No | The paper does not explicitly provide specific dataset splits for training, validation, or testing. It mentions using '5 seeds per environment' for evaluation, which refers to multiple experimental runs for statistical reliability, not data partitioning for supervised learning.
Hardware Specification | Yes | All experiments were conducted on a system running Ubuntu 20.04.4 with an Intel Xeon E5-1650 v4 CPU, an NVIDIA GeForce RTX 3090, and 128 GB of memory.
Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) as the fundamental policy training method and implementing NECSA (Li et al., 2023). However, it does not provide specific version numbers for these algorithms, nor does it list any programming language or library versions (e.g., Python, PyTorch, TensorFlow, Gym) used in the implementation.
Experiment Setup | Yes | For the hyper-parameters, we set ϵ to 0.1, the pattern length k to 3, and the group size n to 100. Further hyper-parameter analysis is detailed in Appendix A.3. Table 3 (hyper-parameters of episodic memory used in the experiments): Pattern Length k = 3; Group Size n = 100; Epsilon ϵ = 0.1.
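The reported hyper-parameters can be captured in a small configuration object; the values (k = 3, n = 100, ϵ = 0.1) come from the paper's Table 3, while the class and field names are our own illustrative choices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodicMemoryConfig:
    """Hypothetical config mirroring Table 3 of the paper."""
    pattern_length_k: int = 3   # length of the abstract-state pattern
    group_size_n: int = 100     # number of discrete abstraction groups
    epsilon: float = 0.1        # revision coefficient for reward shaping
```

Grouping these in one frozen dataclass keeps the reported settings reproducible and guards against accidental mutation during a training run.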