Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement
Authors: Yuheng Jing, Kai Li, Bingyun Liu, Ziwen Zhang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets. [...] 4. Experiments In this section, Sec. 4.1 provides a detailed description of the experimental setup. Sec. 4.2 poses a series of questions and presents empirical results to answer them, aiming to analyze the effectiveness of the TIPR framework. |
| Researcher Affiliation | Collaboration | Yuheng Jing 1 2 Kai Li 1 2 Bingyun Liu 1 2 Ziwen Zhang 1 2 Haobo Fu 3 Qiang Fu 3 Junliang Xing 4 Jian Cheng 1 5 6 [...] 1C2DL, Institute of Automation, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3Tencent AI Lab 4Tsinghua University 5School of Future Technology, University of Chinese Academy of Sciences 6AiRiA. [...] Correspondence to: Kai Li <EMAIL>, Jian Cheng <EMAIL>. |
| Pseudocode | Yes | Our TIPR is designed as a plug-and-play framework to address the suboptimality induced by T for OOM algorithms. The overview of our TIPR framework is illustrated in Fig. 2. We also provide the corresponding pseudocode in Algo. 1. [...] Algorithm 1 Truncated Q-driven Instant Policy Refinement |
| Open Source Code | No | For specific implementation of this environment, we adopt the open-source code of Open Spiel, which is available at https://github.com/deepmind/open_spiel. [...] For specific implementation of this environment, we adopt the open-source code of lb-foraging, which is available at https://github.com/semitable/lb-foraging. [...] For specific implementation of this environment, we adopt the open-source code of Multi-Agent Particle Environment, which is available at https://github.com/openai/multiagent-particle-envs. [...] We implement Prompt-DT's neural architecture directly based on its open-source code, which is available at https://github.com/mxu34/prompt-dt. [...] We implement TAO's neural architecture directly based on its open-source code, which is available at [TAO Code]. The paper mentions adopting/implementing baselines using third-party open-source code, but does not provide specific access to the source code for the proposed TIPR framework or the full experimental setup implemented by the authors themselves. The '[TAO Code]' is a placeholder and not a concrete link. |
| Open Datasets | No | We consider four sparse-reward competitive multi-agent environmental benchmarks. See Sec. C for detailed introductions of these environments. [...] Offline datasets T with varying suboptimality, measured by the Optimal Ratio ρ, are constructed. [...] The paper describes constructing its own offline datasets from various open-source environments, but does not provide a specific link, DOI, or repository for the *actual datasets* generated and used in their experiments. It only refers to the open-source code of the *environments* used to generate these datasets. |
| Dataset Splits | No | Fig. 1 shows the error curves of learning Q on validation datasets. [...] We pretrain all OOM baselines for 3000 steps. The final checkpoints of the pretrained OOM baselines were used to test against Unknown Non-stationary Opponents for 2400 episodes. [...] We set up three types of Πon: 1) Seen: This Πon is equivalent to Πoff, which contains 12 policies selected from the MEP population. 2) Unseen: This Πon contains 8 policies selected from the MEP population that have never appeared in Πoff. 3) Mixed: This Πon is the union of the Seen and Unseen. While the paper mentions validation datasets and distinct Seen/Unseen/Mixed opponent policy sets, it does not specify how the offline datasets were split into training and validation portions (e.g., split ratios or selection procedure). |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | To maximize Truncated Q's ability to recognize opponents through ICL, our neural architecture adopts a causal Transformer (Radford et al., 2019). [...] The backbone of the Truncated Q is mainly implemented based on the causal Transformer, i.e., GPT2 (Radford et al., 2019) model of Hugging Face (Wolf et al., 2020). [...] We adopt the same timestep encoding as in (Chen et al., 2021). [...] We use PPO (Schulman et al., 2017) for varying numbers of steps while keeping opponent policy fixed. [...] Learning rate for AdamW (Loshchilov & Hutter, 2018) optimizer. The paper mentions various software components and libraries like 'GPT2 model of Hugging Face', 'AdamW optimizer', and 'PPO', but it does not specify their version numbers, which are necessary for full reproducibility. |
| Experiment Setup | Yes | See all the hyperparameters in Sec. F. [...] F.1. Hyperparameters for Opponent Policies & Offline Datasets [...] F.2. Hyperparameters for OOM Baselines Pretraining [...] F.3. Hyperparameters for Truncated Q Training [...] F.4. Hyperparameters for Instant Policy Refinement |
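The Software Dependencies gap above (named components without version numbers) is the kind of omission that a small environment manifest would close. Below is a minimal sketch, using only the Python standard library, of recording exact package versions alongside experiment outputs; the package names queried (torch, transformers, numpy) are assumptions based on the components the paper mentions (GPT-2 via Hugging Face, AdamW), not versions the authors reported.

```python
# Minimal sketch: snapshot the Python and package versions used in a run,
# so a reviewer can reconstruct the software environment.
# Package names below are illustrative assumptions, not from the paper.
import importlib.metadata
import json
import platform


def snapshot_environment(packages):
    """Collect Python and installed package versions into a manifest dict."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # Record the absence rather than failing the run.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    manifest = snapshot_environment(["torch", "transformers", "numpy"])
    print(json.dumps(manifest, indent=2))
```

Saving such a manifest (or a `pip freeze` listing) next to the released code and datasets would resolve this reproducibility variable for future audits.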