Offline Multi-Agent Reinforcement Learning via In-Sample Sequential Policy Optimization
Authors: Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, Xuetao Ding
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate the effectiveness of our method compared to current state-of-the-art offline MARL methods. We evaluate InSPO in the XOR game, Multi-NE game, and Bridge to demonstrate its effectiveness in addressing OOD joint action and local optimum convergence issues. Additionally, we test it on various types of offline datasets in the StarCraft II micromanagement benchmark to showcase its competitiveness with current state-of-the-art offline MARL algorithms. |
| Researcher Affiliation | Collaboration | 1School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China 2Pengcheng Laboratory, Shenzhen, China 3Shanghai Innovation Institute, Shanghai, China 4Meituan, Beijing, China |
| Pseudocode | Yes | Algorithm 1: InSPO. Input: offline dataset D, initial policy π0 and Q-function Q0. Output: πK. 1: Compute behavior policy µ by simple Behavior Cloning. 2: for k = 1, …, K do. 3: Compute Qk by Eq. (10). 4: Draw a permutation i1:N of agents at random. 5: for n = 1, …, N do. 6: Update πk^in by Eq. (13). 7: end for. 8: end for. |
| Open Source Code | Yes | Code: https://github.com/kkkaiaiai/InSPO/ |
| Open Datasets | Yes | We evaluate InSPO in the XOR game, M-NE game, Bridge (Fu et al. 2022) and StarCraft II Micromanagement (Xu et al. 2023b). For this experiment, we use two datasets provided by Matsunaga et al. (2023): optimal and mixed. We use four datasets provided by Shao et al. (2023): medium, expert, medium-replay and mixed. |
| Dataset Splits | No | The paper describes types of datasets used (e.g., 'optimal and mixed' for Bridge, 'medium, expert, medium-replay and mixed' for StarCraft II, 'balanced' and 'imbalanced' for M-NE game) and their collection methods, but it does not specify explicit training/validation/test splits with percentages, sample counts, or references to predefined splits for reproducibility. |
| Hardware Specification | No | The main text states: 'See Appendix C for experimental details.' No specific hardware details (like GPU models or CPU types) are provided in the main body of the paper. |
| Software Dependencies | No | The main text states: 'See Appendix C for experimental details.' No specific software libraries or versions are provided in the main body of the paper. |
| Experiment Setup | Yes | Temperature α is used to control the degree of conservatism. Too large an α will result in an overly conservative policy, while too small an α easily causes distribution shift. Thus, to obtain a suitable α, we implement both fixed and auto-tuned α in practice (see Appendix B for details), where the auto-tuned α is adjusted by min_α E_D[α·D_KL(π, µ) − α·D̄_KL], where D̄_KL is the target value. Table 5 gives ablation results for α, which show that the auto-tuned α can find an appropriate α to further improve performance. |
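The Algorithm 1 entry in the Pseudocode row can be sketched as a minimal control-flow skeleton. This is not the paper's implementation: `behavior_cloning`, `compute_q`, and `update_policy` are hypothetical placeholders standing in for the behavior-cloning step, Eq. (10), and Eq. (13), which the table does not reproduce.

```python
import random

def inspo_loop(dataset, num_agents, K, behavior_cloning, compute_q, update_policy):
    """Skeleton of Algorithm 1 (InSPO); all three callbacks are placeholders."""
    mu = behavior_cloning(dataset)        # line 1: behavior policy mu via simple BC
    pi = {}                               # joint policy, one entry per agent
    for k in range(1, K + 1):             # line 2: outer iterations k = 1, ..., K
        q = compute_q(dataset, pi, mu)    # line 3: Q_k via Eq. (10)
        order = random.sample(range(num_agents), num_agents)  # line 4: permutation i_{1:N}
        for agent in order:               # line 5: sequential, one agent at a time
            pi[agent] = update_policy(agent, dataset, q, mu)  # line 6: Eq. (13)
    return pi                             # output: pi_K
```

The random permutation each iteration matches line 4 of the listing: agents are updated sequentially in a freshly drawn order rather than a fixed one.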