Improving Adversarial Training for Two-player Competitive Games via Episodic Reward Engineering

Authors: Siyuan Chen, Fuyuan Zhang, Zhuo Li, Xiongfei Wu, Jianlang Chen, Pengzhan Zhao, Lei Ma, Jianjun Zhao

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we evaluate our method on two-player competitive games in MuJoCo domains (Todorov et al., 2012) and compare it with state-of-the-art adversarial policy training approaches (Gleave et al., 2020; Guo et al., 2021; Wu et al., 2021). Our experimental results show that our method establishes the most promising attack performance and defense difficulty. The comparison of the win rates and non-loss rates between our approach and the baseline approaches is summarized in Figure 2.
Researcher Affiliation | Academia | Siyuan Chen (The University of Tokyo); Fuyuan Zhang (Zhejiang University); Zhuo Li (Kyushu University); Xiongfei Wu (University of Luxembourg); Jianlang Chen (Kyushu University); Pengzhan Zhao (Hebei Normal University); Lei Ma (The University of Tokyo; University of Alberta); Jianjun Zhao (Kyushu University)
Pseudocode | Yes | Algorithm 1: Adversarial policy training with reward revision. Input: A: adversarial agent; E: environment; M: our episodic memory; B: experience storage; O: objective function from the fundamental training method. Parameters: k: pattern length; n: group size; ϵ: revision coefficient. Output: A: a well-trained adversarial agent.
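To make the roles of k, n, and ϵ concrete, the episodic memory M described in Algorithm 1 can be sketched as below. This is a minimal illustration, not the paper's implementation: the class name, the scalar state feature, and the mean-return revision rule are our own assumptions; only the parameter names (pattern length k, group size n, revision coefficient ϵ) come from the report.

```python
from collections import defaultdict, deque

class EpisodicMemory:
    """Hypothetical sketch of the episodic memory M in Algorithm 1.

    A scalar state feature is abstracted into one of n discrete groups;
    statistics are tracked per length-k pattern of consecutive abstract
    states, and a reward revision scaled by epsilon is returned.
    """

    def __init__(self, k=3, n=100, epsilon=0.1):
        self.k, self.n, self.epsilon = k, n, epsilon
        # pattern -> [sum of episode returns, visit count]
        self.stats = defaultdict(lambda: [0.0, 0])
        self.pattern = deque(maxlen=k)  # sliding window of abstract states

    def abstract(self, feature):
        # Map a feature assumed to lie in [0, 1) onto one of n groups.
        return min(int(feature * self.n), self.n - 1)

    def observe(self, feature, episode_return):
        """Update the memory and return the reward revision term."""
        self.pattern.append(self.abstract(feature))
        if len(self.pattern) < self.k:
            return 0.0  # not enough history to form a length-k pattern
        key = tuple(self.pattern)
        s, c = self.stats[key]
        self.stats[key] = [s + episode_return, c + 1]
        mean_return = (s + episode_return) / (c + 1)
        # Revision: bonus proportional to the pattern's historical
        # mean return, scaled by the revision coefficient epsilon.
        return self.epsilon * mean_return
```

In a training loop, the revised reward would then be `r + memory.observe(feature, episode_return)` before being stored in the experience storage B.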
Open Source Code | Yes | The source code is available at https://github.com/alsachai/episodic_reward_engineering.
Open Datasets | Yes | We evaluate our approach using two-player competitive games in MuJoCo simulation environments, demonstrating that our method establishes the most promising attack performance and defense difficulty against the victims among the existing adversarial policy training techniques. In our experiments, we evaluate our method on two-player competitive games in MuJoCo domains (Todorov et al., 2012) and compare it with state-of-the-art adversarial policy training approaches (Gleave et al., 2020; Guo et al., 2021; Wu et al., 2021). To answer Q1, we employ PPO (Schulman et al., 2017) as the fundamental single-agent policy training method to train adversarial agents against well-trained Zoo agents (Bansal et al., 2018).
Dataset Splits | No | The paper does not explicitly provide specific dataset splits for training, validation, or testing. It mentions using '5 seeds per environment' for evaluation, which refers to multiple experimental runs for statistical reliability, not data partitioning for supervised learning.
Hardware Specification | Yes | All experiments were conducted on a system running Ubuntu 20.04.4 with an Intel Xeon E5-1650 v4 CPU, an NVIDIA GeForce RTX 3090, and 128 GB of memory.
Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) as the fundamental policy training method and implementing NECSA (Li et al., 2023). However, it does not provide specific version numbers for these algorithms, nor does it list any programming language or library versions (e.g., Python, PyTorch, TensorFlow, Gym) used in the implementation.
Experiment Setup | Yes | For the hyper-parameters, we set ϵ to 0.1, the pattern length k to 3, and the group size n to 100. Further hyper-parameter analysis is detailed in Appendix A.3. Table 3 (hyper-parameters of episodic memory used in the experiments): Pattern Length k = 3; Group Size n = 100; Epsilon ϵ = 0.1.
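The reported hyper-parameters can be captured in a small configuration object; the values (k = 3, n = 100, ϵ = 0.1) come from the paper's Table 3, while the class and field names are our own illustrative choices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EpisodicMemoryConfig:
    """Hypothetical config mirroring Table 3 of the paper."""
    pattern_length_k: int = 3   # length of the abstract-state pattern
    group_size_n: int = 100     # number of discrete abstraction groups
    epsilon: float = 0.1        # revision coefficient for reward shaping
```

Grouping these in one frozen dataclass keeps the reported settings reproducible and guards against accidental mutation during a training run.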