Revisiting Cooperative Off-Policy Multi-Agent Reinforcement Learning

Authors: Yueheng Li, Guangming Xie, Zongqing Lu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results demonstrate that these methods effectively mitigate erroneous estimations, yielding substantial performance improvements in challenging benchmarks such as SMAC, SMACv2, and Google Research Football."
Researcher Affiliation | Academia | "¹College of Engineering, ²Institute of Artificial Intelligence, ³School of Computer Science, Peking University. Correspondence to: Guangming Xie <EMAIL>, Zongqing Lu <EMAIL>."
Pseudocode | Yes | Appendix C, Pseudo Code: "The pseudo code of AEQMIX is summarized in Algorithm 1."
Open Source Code | No | The paper states, "Our implementation of VDN, QMIX and QPLEX is based on the pymarl2 (Hu et al., 2023) code base" and "The codebases for FACMAC and MADDPG are adopted from Peng et al. (2021)." This indicates that the authors built upon existing open-source codebases, but there is no explicit statement or link confirming that their specific contributions (e.g., the AEQMIX, AEFACMAC, and AEMADDPG-RAR implementations) are publicly available.
Open Datasets | Yes | "When integrated into existing off-policy MARL methods, these techniques yield substantial performance gains across a variety of challenging tasks, including SMAC (Samvelyan et al., 2019), SMACv2 (Ellis et al., 2023), and Google Research Football (GRF) (Kurach et al., 2020)."
Dataset Splits | No | The paper specifies evaluation maps/scenarios for SMAC ("four maps... one Easy map, one Hard map, and two Super Hard maps") and SMACv2 ("15 maps of SMACv2"). While these define the testing conditions, the paper does not provide explicit numerical dataset splits (e.g., percentages or sample counts) for training, validation, and testing as typically understood for static datasets. The environments are dynamic, and data is generated through agent–environment interaction.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU models, CPU types, or other computing infrastructure used for the experiments.
Software Dependencies | No | The paper mentions using "pymarl2" as a codebase and "Adam" as an optimizer, but does not provide version numbers for these or for any other software dependencies, such as programming languages or libraries like PyTorch/TensorFlow.
Experiment Setup | Yes | Table 1 lists the hyperparameters used for SMAC, SMACv2, and GRF (triplets give SMAC / SMACv2 / GRF values):
  Action Selector: epsilon greedy
  ε start: 1.0
  ε finish: 0.05
  ε Anneal Time: 100000
  Runner: parallel
  Batch Size Run: 8 / 4 / 32
  Buffer Size: 5000
  Batch Size: 128
  Optimizer: Adam
  Target Update Interval: 200
  Mixing Embed Dimension: 32
  Hypernet Embed Dimension: 64
  Learning Rate: 0.001 / 0.001 / 0.0005
  λ: 0.6 / 0.4 / 0.8
  λ (second row as printed): {0.0, 0.4} / {0.0, 0.2} / 0.8
  Ensemble Size: 8 / 8 / 2
  Gamma: 0.99 / 0.99 / 0.999
  RNN Hidden Dim: 64 / 64 / 256
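The reported hyperparameters can be collected into a config sketch for reuse. This is a minimal illustration, not the authors' code: the key names (e.g., `td_lambda`, `batch_size_run`) follow common pymarl2-style conventions but are assumptions, and the ambiguous second λ row of Table 1 is omitted. Values are taken directly from the table.

```python
# Hypothetical config sketch reconstructing Table 1.
# Key names are assumptions; values are as reported per environment.
SHARED = {
    "action_selector": "epsilon_greedy",
    "epsilon_start": 1.0,
    "epsilon_finish": 0.05,
    "epsilon_anneal_time": 100_000,
    "runner": "parallel",
    "buffer_size": 5000,
    "batch_size": 128,
    "optimizer": "adam",
    "target_update_interval": 200,
    "mixing_embed_dim": 32,
    "hypernet_embed_dim": 64,
}

# Environment-specific values (SMAC / SMACv2 / GRF columns of Table 1).
PER_ENV = {
    "smac":   dict(batch_size_run=8,  lr=0.001,  td_lambda=0.6,
                   ensemble_size=8, gamma=0.99,  rnn_hidden_dim=64),
    "smacv2": dict(batch_size_run=4,  lr=0.001,  td_lambda=0.4,
                   ensemble_size=8, gamma=0.99,  rnn_hidden_dim=64),
    "grf":    dict(batch_size_run=32, lr=0.0005, td_lambda=0.8,
                   ensemble_size=2, gamma=0.999, rnn_hidden_dim=256),
}

def make_config(env: str) -> dict:
    """Merge shared hyperparameters with the per-environment overrides."""
    return {**SHARED, **PER_ENV[env]}
```

A reproduction attempt would pass `make_config("smac")` (or the SMACv2/GRF variant) to the training entry point of whichever codebase is used.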