Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement
Authors: Yuheng Jing, Kai Li, Bingyun Liu, Ziwen Zhang, Haobo Fu, Qiang Fu, Junliang Xing, Jian Cheng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets. [...] 4. Experiments In this section, Sec. 4.1 provides a detailed description of the experimental setup. Sec. 4.2 poses a series of questions and presents empirical results to answer them, aiming to analyze the effectiveness of the TIPR framework. |
| Researcher Affiliation | Collaboration | Yuheng Jing 1 2 Kai Li 1 2 Bingyun Liu 1 2 Ziwen Zhang 1 2 Haobo Fu 3 Qiang Fu 3 Junliang Xing 4 Jian Cheng 1 5 6 [...] 1C2DL, Institute of Automation, Chinese Academy of Sciences 2School of Artificial Intelligence, University of Chinese Academy of Sciences 3Tencent AI Lab 4Tsinghua University 5School of Future Technology, University of Chinese Academy of Sciences 6AiRiA. [...] Correspondence to: Kai Li <EMAIL>, Jian Cheng <EMAIL>. |
| Pseudocode | Yes | Our TIPR is designed as a plug-and-play framework to address the suboptimality induced by T for OOM algorithms. The overview of our TIPR framework is illustrated in Fig. 2. We also provide the corresponding pseudocode in Algo. 1. [...] Algorithm 1 Truncated Q-driven Instant Policy Refinement |
| Open Source Code | No | For specific implementation of this environment, we adopt the open-source code of Open Spiel, which is available at https://github.com/deepmind/open_spiel. [...] For specific implementation of this environment, we adopt the open-source code of lb-foraging, which is available at https://github.com/semitable/lb-foraging. [...] For specific implementation of this environment, we adopt the open-source code of Multi-Agent Particle Environment, which is available at https://github.com/openai/multiagent-particle-envs. [...] We implement Prompt-DT's neural architecture directly based on its open-source code, which is available at https://github.com/mxu34/prompt-dt. [...] We implement TAO's neural architecture directly based on its open-source code, which is available at [TAO Code]. The paper mentions adopting/implementing baselines using third-party open-source code, but does not provide specific access to the source code for the proposed TIPR framework or the full experimental setup implemented by the authors themselves. The '[TAO Code]' is a placeholder and not a concrete link. |
| Open Datasets | No | We consider four sparse-reward competitive multi-agent environmental benchmarks. See Sec. C for detailed introductions of these environments. [...] Offline datasets T with varying suboptimality, measured by the Optimal Ratio ρ, are constructed. [...] The paper describes constructing its own offline datasets from various open-source environments, but does not provide a specific link, DOI, or repository for the *actual datasets* generated and used in their experiments. It only refers to the open-source code of the *environments* used to generate these datasets. |
| Dataset Splits | No | Fig. 1 shows the error curves of learning Q on validation datasets. [...] We pretrain all OOM baselines for 3000 steps. The final checkpoints of the pretrained OOM baselines were used to test against Unknown Non-stationary Opponents for 2400 episodes. [...] We set up three types of Πon: 1) Seen: This Πon is equivalent to Πoff, which contains 12 policies selected from the MEP population. 2) Unseen: This Πon contains 8 policies selected from the MEP population that have never appeared in Πoff. 3) Mixed: This Πon is the union of the Seen and Unseen. While the paper mentions validation datasets and distinct Seen/Unseen/Mixed opponent policy sets, it does not specify how the offline datasets were split into training and validation portions (e.g., split ratios or selection procedure). |
| Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | To maximize Truncated Q's ability to recognize opponents through ICL, our neural architecture adopts a causal Transformer (Radford et al., 2019). [...] The backbone of the Truncated Q is mainly implemented based on the causal Transformer, i.e., GPT2 (Radford et al., 2019) model of Hugging Face (Wolf et al., 2020). [...] We adopt the same timestep encoding as in (Chen et al., 2021). [...] We use PPO (Schulman et al., 2017) for varying numbers of steps while keeping opponent policy fixed. [...] Learning rate for AdamW (Loshchilov & Hutter, 2018) optimizer. The paper mentions various software components and libraries like 'GPT2 model of Hugging Face', 'AdamW optimizer', and 'PPO', but it does not specify their version numbers, which are necessary for full reproducibility. |
| Experiment Setup | Yes | See all the hyperparameters in Sec. F. [...] F.1. Hyperparameters for Opponent Policies & Offline Datasets [...] F.2. Hyperparameters for OOM Baselines Pretraining [...] F.3. Hyperparameters for Truncated Q Training [...] F.4. Hyperparameters for Instant Policy Refinement |
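The Software Dependencies gap above (named components without version numbers) is the kind of omission that a small environment manifest would close. Below is a minimal sketch, using only the Python standard library, of recording exact package versions alongside experiment outputs; the package names queried (torch, transformers, numpy) are assumptions based on the components the paper mentions (GPT-2 via Hugging Face, AdamW), not versions the authors reported.

```python
# Minimal sketch: snapshot the Python and package versions used in a run,
# so a reviewer can reconstruct the software environment.
# Package names below are illustrative assumptions, not from the paper.
import importlib.metadata
import json
import platform


def snapshot_environment(packages):
    """Collect Python and installed package versions into a manifest dict."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # Record the absence rather than failing the run.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    manifest = snapshot_environment(["torch", "transformers", "numpy"])
    print(json.dumps(manifest, indent=2))
```

Saving such a manifest (or a `pip freeze` listing) next to the released code and datasets would resolve this reproducibility variable for future audits.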