An Open-Ended Learning Framework for Opponent Modeling
Authors: Yuheng Jing, Kai Li, Bingyun Liu, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments in cooperative, competitive, and mixed environments demonstrate that OEOM is an approach-agnostic framework that improves generalizability compared to training against a fixed set of opponents, regardless of OM approaches or testing opponent settings. The results also indicate that our proposed approach generally outperforms existing OM baselines. In this section, Sec. 5.1 presents our experimental environments, baselines, and evaluation protocols. Sec. 5.2 poses a series of questions and provides empirical results to answer them, aiming to analyze the effectiveness of the OEOM framework and the IOM approach. |
| Researcher Affiliation | Collaboration | Yuheng Jing1,2, Kai Li1,2, Bingyun Liu1,2, Haobo Fu6, Qiang Fu6, Junliang Xing5, Jian Cheng1,3,4; 1Institute of Automation, Chinese Academy of Sciences; 2School of Artificial Intelligence, University of Chinese Academy of Sciences; 3School of Future Technology, University of Chinese Academy of Sciences; 4AiRiA; 5Tsinghua University; 6Tencent AI Lab |
| Pseudocode | No | The paper describes the OEOM framework and the IOM approach with textual explanations, equations, and diagrams (Fig. 1 and Fig. 2), but no structured pseudocode blocks or algorithms are explicitly labeled. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide any links to code repositories. |
| Open Datasets | No | The paper mentions experimental environments: Predator Prey (PP), Level-Based Foraging (LBF), and Overcooked (OC). These are multi-agent environments or games used for evaluation, but the paper does not provide concrete access information (links, DOIs, citations with authors/year) for publicly available datasets used within these environments. |
| Dataset Splits | Yes | For each OG method, we construct a Πtrain of size 30. For OEOM, OEOM-TSD, and OEOM-PBT, we set the population size K = 6, the number of iterations m = 25, and the selection ratio ρ = 0.2 to generate 30 opponent policies. For SP, we run 30 different seeds, using the same total training steps as OEOM, to generate 30 opponent policies. For Script, we manually create 30 different hard-coded opponent policies for each environment. For the testing stage, we construct four different Πtest settings: (1) Seen, (2) Unseen-L1, (3) Unseen-L2, and (4) Unseen-L3. Seen uses Πtrain as Πtest, while Unseen-L1 to Unseen-L3 are constructed using three levels of opponent policies that never appeared in Πtrain; the higher the level, the stronger the opponents. We assume the test opponents are unknown and non-stationary, with their true policies unknown to the self-agent, and sample a policy from Πtest every 10 episodes. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as specific GPU or CPU models, or memory details. |
| Software Dependencies | No | The paper states: "Specifically, we use PPO (Schulman et al. 2017) to optimize Eq. (5)." While PPO is mentioned, no version numbers are provided for PPO or any other software libraries, programming languages, or environments. |
| Experiment Setup | Yes | For each OG method, we construct a Πtrain of size 30. For OEOM, OEOM-TSD, and OEOM-PBT, we set the population size K = 6, the number of iterations m = 25, and the selection ratio ρ = 0.2 to generate 30 opponent policies. For SP, we run 30 different seeds, using the same total training steps as OEOM, to generate 30 opponent policies. For Script, we manually create 30 different hard-coded opponent policies for each environment. During the training stage, each OM approach is trained for 30000 steps using the given Πtrain. All OM approaches use the final checkpoint from training to play 900 episodes against the unknown non-stationary opponents, who switch policies a total of 90 times. |
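The evaluation protocol quoted above (900 test episodes against unknown, non-stationary opponents whose policy is resampled from Πtest every 10 episodes, giving 90 policy switches) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the representation of Πtest as a plain list of policy identifiers are assumptions.

```python
import random

def nonstationary_opponent_schedule(pi_test, total_episodes=900,
                                    switch_every=10, seed=0):
    """Sketch of the testing protocol: every `switch_every` episodes,
    the opponent's policy is resampled uniformly from pi_test, so the
    self-agent faces an unknown, non-stationary opponent."""
    rng = random.Random(seed)
    schedule = []
    current = None
    for episode in range(total_episodes):
        if episode % switch_every == 0:
            current = rng.choice(pi_test)  # opponent switches policy
        schedule.append(current)
    return schedule

# With the paper's numbers (900 episodes, a switch every 10 episodes),
# the opponent policy is resampled 900 / 10 = 90 times in total.
schedule = nonstationary_opponent_schedule(pi_test=list(range(30)))
```

The self-agent is evaluated from its final training checkpoint against this schedule; it never observes which entry of Πtest it is currently facing.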