Reconstruction-Guided Policy: Enhancing Decision-Making through Agent-Wise State Consistency
Authors: Qifan Liang, Yixiang Shan, Haipeng Liu, Zhengbang Zhu, Ting Long, Weinan Zhang, Yuan Tian
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate RGP, we conduct extensive experiments in discrete and continuous environments; the results demonstrate its effectiveness. We particularly focus on the following research questions: i) How does RGP perform compared with other methods (RQ1)? ii) Can RGP reduce the gap between training and execution (RQ2)? iii) Why can RGP achieve better performance than other methods (RQ3)? iv) Can RGP explore the potential relationships between agents (RQ4)? v) Can RGP adapt to continuous-action environments (RQ5)? vi) How does RGP perform under more challenging partially observable conditions (RQ6)? The results are illustrated in Table 1. |
| Researcher Affiliation | Academia | Qifan Liang1, Yixiang Shan1, Haipeng Liu1, Zhengbang Zhu2, Ting Long1, Weinan Zhang2, Yuan Tian1; 1 Jilin University, 2 Shanghai Jiao Tong University |
| Pseudocode | Yes | Algorithm 1: Training of RGP with value decomposition methods; Algorithm 2: Training of RGP with policy gradient methods |
| Open Source Code | Yes | Our code is publicly available at https://github.com/Muise4/RGP4/tree/main |
| Open Datasets | Yes | Environments. We primarily evaluated RGP on SMAC (Samvelyan et al., 2019) and SMACv2 (Ellis et al., 2024). SMAC is the most widely used discrete multi-agent environment, while SMACv2 introduces stochasticity based on SMAC. We set up the SMACv2 maps with 5 ally agents against 5 enemies. Additionally, to further demonstrate the portability of RGP, we conducted experiments in continuous predator-prey and continuous cooperative navigation scenarios (Lowe et al., 2017). |
| Dataset Splits | No | The paper describes environmental setups and experimental parameters (e.g., number of agents, prey) but does not provide explicit training/test/validation splits for static datasets. The 'data' is generated dynamically through interaction with the simulated environments. |
| Hardware Specification | Yes | Our model was trained on a setup with 4 NVIDIA A40 GPUs, an Intel Gold 5220 CPU, and 504GB of memory, optimized using the Adam optimizer (Kingma & Ba, 2014). |
| Software Dependencies | No | The paper mentions using the "Adam optimizer" and refers to "PyMARL2 (Hu et al., 2021)", but it does not specify version numbers for these or other key software components, which is required for reproducibility. |
| Experiment Setup | Yes | Implementation Details. Our model was trained on a setup with 4 NVIDIA A40 GPUs, an Intel Gold 5220 CPU, and 504GB of memory, optimized using the Adam optimizer (Kingma & Ba, 2014). Due to limited computational resources, we replaced the U-Net used in the original DDPM paper (Ho et al., 2020; Rombach et al., 2022) with an MLP. We set the diffusion timestep to 10 and the number of attention heads to 4. The details of other hyperparameters can be found in Table 4 of Appendix A.2. Appendix A.2 HYPERPARAMETERS DETAIL. Details of RGP's hyperparameters are provided in Table 4. The baselines VDN, QMIX, and QPLEX were implemented with the hyperparameters of PyMARL2 (Hu et al., 2021). HPN-QMIX, CADP, PTDE, and SIDiff were implemented with their optimal hyperparameters, as specified in their respective papers (Jianye et al., 2022; Zhou et al., 2023; Chen et al., 2022; Xu et al., 2024). Table 4: Hyperparameter settings for RGP training (including diffusion timestep, optimizer type, learning rate, batch size, TD lambda, training epochs, buffer size, target update interval, attention heads, attention embedding dim, and agent information mapping dim). |
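The implementation details quoted above (a 10-step diffusion process, an MLP in place of the U-Net denoiser, 4 attention heads) can be sketched as a minimal DDPM-style forward-diffusion setup. This is a hedged illustration, not the authors' code: only `diffusion_timesteps = 10`, `attention_heads = 4`, and the Adam optimizer come from the paper; the noise schedule, tensor shapes, and learning rate are illustrative assumptions.

```python
import numpy as np

# Hyperparameters: the first three values are stated in the paper;
# the learning rate is an assumption (Table 4 lists it but this
# excerpt does not give the value).
CONFIG = {
    "diffusion_timesteps": 10,   # stated in the paper
    "attention_heads": 4,        # stated in the paper
    "optimizer": "Adam",         # stated in the paper
    "learning_rate": 5e-4,       # assumption, not from the excerpt
}

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.2):
    """Linear noise schedule over T steps (a standard DDPM choice)."""
    return np.linspace(beta_start, beta_end, T)

def forward_diffuse(x0, t, alphas_cumprod, rng):
    """Closed-form q(x_t | x_0): sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    eps = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps, eps

T = CONFIG["diffusion_timesteps"]
betas = linear_beta_schedule(T)
alphas_cumprod = np.cumprod(1.0 - betas)  # monotonically decreasing in t

rng = np.random.default_rng(0)
x0 = rng.standard_normal((5, 32))  # e.g. 5 agents, 32-dim state embedding
x_t, eps = forward_diffuse(x0, t=T - 1, alphas_cumprod=alphas_cumprod, rng=rng)
print(x_t.shape)  # (5, 32)
```

The 10-step horizon (versus the hundreds of steps typical for image DDPMs) is consistent with the paper's compute-saving substitution of an MLP for the U-Net.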
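The pseudocode entry names Algorithm 1 (training RGP with value decomposition methods), and Table 4 lists a TD-lambda coefficient among the hyperparameters. As a small self-contained illustration of the TD(λ) target that PyMARL2-style value-decomposition trainers bootstrap against (not the paper's actual code; `td_lambda_target` and its defaults are hypothetical), here is the standard backward recursion:

```python
import numpy as np

def td_lambda_target(rewards, next_values, gamma=0.99, lam=0.8):
    """Backward-recursive TD(lambda) return.

    rewards[t]     -- reward at step t
    next_values[t] -- bootstrap value estimate V(s_{t+1})
    Recursion: G[t] = r[t] + gamma*((1-lam)*V(s_{t+1}) + lam*G[t+1]).
    """
    T = len(rewards)
    G = np.zeros(T)
    G[-1] = rewards[-1] + gamma * next_values[-1]
    for t in range(T - 2, -1, -1):
        G[t] = rewards[t] + gamma * ((1 - lam) * next_values[t]
                                     + lam * G[t + 1])
    return G

# Sanity check: with gamma=1, lam=1, and zero bootstrap values, the
# target reduces to the undiscounted return-to-go.
G = td_lambda_target(np.array([1.0, 1.0, 1.0]), np.zeros(3),
                     gamma=1.0, lam=1.0)
print(G)  # [3. 2. 1.]
```

With λ = 0 this collapses to a one-step TD target, and with λ = 1 to a Monte Carlo return; the paper's reported TD-lambda setting would sit between these extremes.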