Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization

Authors: Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, Xuetao Ding

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods. We evaluate InSPO in the XOR game, Multi-NE game, and Bridge to demonstrate its effectiveness in addressing OOD joint actions and local optimum convergence issues. Additionally, we test it on various types of offline datasets in the StarCraft II micromanagement benchmark to showcase its competitiveness with current state-of-the-art offline MARL algorithms.
Researcher Affiliation | Collaboration | 1 School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China; 2 Pengcheng Laboratory, Shenzhen, China; 3 Shanghai Innovation Institute, Shanghai, China; 4 Meituan, Beijing, China
Pseudocode | Yes |
Algorithm 1: InSPO
Input: Offline dataset D, initial policy π0 and Q-function Q0
Output: πK
1: Compute behavior policy µ by simple Behavior Cloning
2: for k = 1, ..., K do
3:   Compute Qk by Eq. (10)
4:   Draw a permutation i1:N of agents at random
5:   for n = 1, ..., N do
6:     Update πin_k by Eq. (13)
7:   end for
8: end for
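The control flow of Algorithm 1 can be sketched in Python. This is a minimal illustrative skeleton only: `behavior_cloning`, `compute_q`, and `update_agent_policy` are hypothetical placeholders standing in for the paper's Eq. (10) Q-update and Eq. (13) in-sample sequential policy update, not the authors' implementation.

```python
import random

def behavior_cloning(dataset):
    # Placeholder: estimate behavior policy mu from the offline dataset.
    return {"mu": dataset}

def compute_q(dataset, policy):
    # Placeholder for the Q-function update of Eq. (10).
    return {"Q": len(dataset)}

def update_agent_policy(policy, agent, q_fn, mu):
    # Placeholder for the in-sample update of Eq. (13);
    # here it just counts updates per agent.
    policy[agent] = policy.get(agent, 0) + 1
    return policy

def inspo(dataset, n_agents, k_iters, seed=0):
    """Skeleton of Algorithm 1: BC warm-start, then K rounds of
    Q-update followed by sequential per-agent policy updates in a
    freshly drawn random agent order."""
    rng = random.Random(seed)
    mu = behavior_cloning(dataset)          # step 1: behavior cloning
    policy = {}
    for k in range(k_iters):                # steps 2-8
        q_fn = compute_q(dataset, policy)   # step 3: Eq. (10)
        order = rng.sample(range(n_agents), n_agents)  # step 4: permutation
        for agent in order:                 # steps 5-7: sequential updates
            policy = update_agent_policy(policy, agent, q_fn, mu)  # Eq. (13)
    return policy
```

The random permutation per round means no agent's update order is fixed across iterations, which is the "sequential" ingredient the algorithm name refers to.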
Open Source Code | Yes | Code: https://github.com/kkkaiaiai/InSPO/
Open Datasets | Yes | We evaluate InSPO in the XOR game, M-NE game, Bridge (Fu et al. 2022) and StarCraft II Micromanagement (Xu et al. 2023b). For this experiment, we use two datasets provided by Matsunaga et al. (2023): optimal and mixed. We use four datasets provided by Shao et al. (2023): medium, expert, medium-replay and mixed.
Dataset Splits | No | The paper describes the types of datasets used (e.g., 'optimal' and 'mixed' for Bridge; 'medium', 'expert', 'medium-replay' and 'mixed' for StarCraft II; 'balanced' and 'imbalanced' for the M-NE game) and their collection methods, but it does not specify explicit training/validation/test splits with percentages, sample counts, or references to predefined splits for reproducibility.
Hardware Specification | No | The main text states: 'See Appendix C for experimental details.' No specific hardware details (such as GPU models or CPU types) are provided in the main body of the paper.
Software Dependencies | No | The main text states: 'See Appendix C for experimental details.' No specific software libraries or versions are provided in the main body of the paper.
Experiment Setup | Yes | Temperature α is used to control the degree of conservatism. A too-large α results in an overly conservative policy, while a too-small one easily causes distribution shift. Thus, to obtain a suitable α, we implement both fixed and auto-tuned α in practice (see Appendix B for details), where the auto-tuned α is adjusted by min_α E_D[α(D_KL(π, µ) − D̄_KL)], where D̄_KL is the target value. Table 5 gives ablation results for α, showing that the auto-tuned variant can find an appropriate α to further improve performance.
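The auto-tuning of α can be illustrated with a simple dual-style update. This is a sketch under assumptions: `auto_tune_alpha`, `lr`, and `min_alpha` are hypothetical names, and the sign convention (raise α, i.e., become more conservative, when the measured KL exceeds the target) reflects the intended dynamics rather than the paper's exact update rule.

```python
def auto_tune_alpha(alpha, kl_value, target_kl, lr=0.01, min_alpha=1e-6):
    """One illustrative dual-gradient step on the temperature alpha.

    Increase alpha when the policy drifts past the target KL
    (stronger conservatism pull toward the behavior policy mu),
    decrease it when the policy is already close to mu.
    """
    alpha = alpha + lr * (kl_value - target_kl)
    # Keep the temperature strictly positive.
    return max(alpha, min_alpha)
```

In practice such an update would be interleaved with the policy updates of Algorithm 1, with `kl_value` estimated as a minibatch average of D_KL(π, µ) over the offline dataset.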