Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization
Authors: Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, Xuetao Ding
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods. We evaluate InSPO in the XOR game, Multi-NE game, and Bridge to demonstrate its effectiveness in addressing OOD joint action and local optimum convergence issues. Additionally, we test it on various types of offline datasets in the StarCraft II micromanagement benchmark to showcase its competitiveness with current state-of-the-art offline MARL algorithms. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China 2Pengcheng Laboratory, Shenzhen, China 3Shanghai Innovation Institute, Shanghai, China 4Meituan, Beijing, China |
| Pseudocode | Yes | Algorithm 1: InSPO. Input: offline dataset D, initial policy π0 and Q-function Q0. Output: πK. 1: Compute behavior policy µ by simple Behavior Cloning. 2: for k = 1, …, K do. 3: Compute Qk by Eq. (10). 4: Draw a permutation i1:N of agents at random. 5: for n = 1, …, N do. 6: Update πk^in by Eq. (13). 7: end for. 8: end for. |
| Open Source Code | Yes | Code: https://github.com/kkkaiaiai/InSPO/ |
| Open Datasets | Yes | We evaluate InSPO in the XOR game, M-NE game, Bridge (Fu et al. 2022) and StarCraft II Micromanagement (Xu et al. 2023b). For this experiment, we use two datasets provided by Matsunaga et al. (2023): optimal and mixed. We use four datasets provided by Shao et al. (2023): medium, expert, medium-replay and mixed. |
| Dataset Splits | No | The paper describes types of datasets used (e.g., 'optimal and mixed' for Bridge, 'medium, expert, medium-replay and mixed' for StarCraft II, 'balanced' and 'imbalanced' for M-NE game) and their collection methods, but it does not specify explicit training/validation/test splits with percentages, sample counts, or references to predefined splits for reproducibility. |
| Hardware Specification | No | The main text states: 'See Appendix C for experimental details.' No specific hardware details (like GPU models or CPU types) are provided in the main body of the paper. |
| Software Dependencies | No | The main text states: 'See Appendix C for experimental details.' No specific software libraries or versions are provided in the main body of the paper. |
| Experiment Setup | Yes | Temperature α is used to control the degree of conservatism. Too large an α will result in an overly conservative policy, while too small an α easily causes distribution shift. Thus, to obtain a suitable α, we implement both fixed and auto-tuned α in practice (see Appendix B for details), where the auto-tuned α is adjusted by min_α E_D[α·D_KL(π, µ) − α·D̄_KL], where D̄_KL is the target value. Table 5 gives ablation results for α, which show that the auto-tuned α can find an appropriate α to further improve performance. |
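The Algorithm 1 entry in the Pseudocode row can be sketched as a minimal control-flow skeleton. This is not the paper's implementation: `behavior_cloning`, `compute_q`, and `update_policy` are hypothetical placeholders standing in for the behavior-cloning step, Eq. (10), and Eq. (13), which the table does not reproduce.

```python
import random

def inspo_loop(dataset, num_agents, K, behavior_cloning, compute_q, update_policy):
    """Skeleton of Algorithm 1 (InSPO); all three callbacks are placeholders."""
    mu = behavior_cloning(dataset)        # line 1: behavior policy mu via simple BC
    pi = {}                               # joint policy, one entry per agent
    for k in range(1, K + 1):             # line 2: outer iterations k = 1, ..., K
        q = compute_q(dataset, pi, mu)    # line 3: Q_k via Eq. (10)
        order = random.sample(range(num_agents), num_agents)  # line 4: permutation i_{1:N}
        for agent in order:               # line 5: sequential, one agent at a time
            pi[agent] = update_policy(agent, dataset, q, mu)  # line 6: Eq. (13)
    return pi                             # output: pi_K
```

The random permutation each iteration matches line 4 of the listing: agents are updated sequentially in a freshly drawn order rather than a fixed one.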