MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment
Authors: Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, Biwei Huang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones. 5 Experiments. Based on the above, our methods include MACCA-OMAR, MACCA-CQL, and MACCA-ICQ. For baselines, we compare with both CTDE and independent-learning paradigm methods: I-CQL (Kumar et al., 2020), conservative Q-learning in the independent paradigm; OMAR (Pan et al., 2022), based on I-CQL but learning better coordinated actions among agents via zeroth-order optimization; MA-ICQ (Yang et al., 2021), implicit constraint Q-learning within the CTDE paradigm; SHAQ (Wang et al., 2022a) and SQDDPG (Wang et al., 2020), credit-assignment variants using the Shapley value, which are state of the art in online multi-agent RL; SHAQ-CQL, which, for a fairer comparison, adopts the architectural framework of SHAQ while using CQL to estimate the agents' Q-values and target Q-values; and QMIX-CQL, conservative Q-learning within the CTDE paradigm, following the QMIX structure to compute Q_tot with a mixing layer, similar to the MA-ICQ framework. We evaluate performance in two environments: Multi-agent Particle Environments (MPE) (Lowe et al., 2017) and StarCraft Micromanagement Challenges (SMAC) (Samvelyan et al., 2019). Through these comparative evaluations, we highlight the relative effectiveness of the MACCA approach. Furthermore, we conduct three ablations to investigate the interpretability and efficiency of our method. 5.1 General Implementation 5.2 Main Results 5.3 Ablation Studies |
| Researcher Affiliation | Academia | Ziyan Wang EMAIL King's College London; Yali Du EMAIL King's College London; Yudi Zhang EMAIL Eindhoven University of Technology; Meng Fang EMAIL University of Liverpool; Biwei Huang EMAIL University of California San Diego |
| Pseudocode | Yes | F Implementations, F.1 Algorithm. Algorithm 1 MACCA: Multi-Agent Causal Credit Assignment. 1: for training step t = 1 to T do; 2: sample trajectories from D, save in minibatch B; 3: for agent i = 1 to N do; 4: update the team reward R_t to r̂^i_t in B (Eq. 6); 5: optimize ψ_m: ψ_m ← ψ_m − α∇_{ψ_m} L_m (Eq. 4); 6: update policy π with minibatch B (Eq. 7, 8, or 9); 7: reset B |
| Open Source Code | No | The paper does not explicitly provide a link to open-source code, nor does it contain a clear, affirmative statement of code release. It states that "The implementation specifics for all the baseline methods and our proposed MACCA are thoroughly outlined in Section 4 and Appendix F," but this refers to documentation, not the code itself. |
| Open Datasets | Yes | We evaluate performance in two environments: Multi-agent Particle Environments (MPE) (Lowe et al., 2017) and StarCraft Micromanagement Challenges (SMAC) (Samvelyan et al., 2019). E Environments Setting: We adopt the open-source implementations of the multi-agent particle environment (Lowe et al., 2017, https://github.com/openai/multiagent-particle-envs) and SMAC (Samvelyan et al., 2019, https://github.com/oxwhirl/smac). |
| Dataset Splits | No | The paper describes different *types* of offline datasets based on their generation policy (Random, Medium Reply, Medium, Expert) but does not provide specific training/test/validation splits for any of these datasets. For example, it doesn't state proportions like "80% for training, 20% for testing" from a given dataset. |
| Hardware Specification | Yes | All experiments were conducted on a heterogeneous computing cluster running Ubuntu Linux. The hardware configuration included a mix of CPU models (Dual Intel Xeon E5-2650, E5-2680 v2, and E5-2690 v3) with a total of 180 CPU cores and 500 GB of system memory. For GPU acceleration, we utilized three NVIDIA A30 GPUs. |
| Software Dependencies | No | The paper mentions the operating system "Ubuntu Linux" and general software components like the "Adam optimizer" and "ReLU activation function", but does not provide specific version numbers for these or other key software dependencies such as programming languages (e.g., Python), machine learning frameworks (e.g., PyTorch, TensorFlow), or GPU acceleration libraries (e.g., CUDA). |
| Experiment Setup | Yes | The common hyperparameters are shown in Table 9. The neural network used in training is initialized from scratch and optimized using the Adam optimizer with a learning rate of 3×10⁻⁴. The policy learning process involves varying initial learning rates based on the specific algorithm, while the hyperparameters for policy learning, including a discount factor of 0.95, are consistent across all tasks. Table 9 (common hyperparameters): steps per update 100; optimizer Adam; batch size 1024; learning rate 3×10⁻⁴; hidden layer dim 64; γ 0.95; evaluation interval 1000; evaluation episodes 10. Table 10 (hyperparameters for OMAR, CQL, and MACCA, listed as OMAR τ / CQL α / MACCA λ1 / MACCA λ2 / MACCA rlr / MACCA h): Expert: 0.9 / 5.0 / 7e-3 / 7e-3 / 5e-2 / 0.1; Medium: 0.7 / 0.5 / 5e-3 / 5e-3 / 5e-2 / 0.1; Medium-Replay: 0.7 / 1.0 / 5e-3 / 7e-3 / 5e-2 / 0.1; Random: 0.99 / 1.0 / 1e-7 / 1e-3 / 5e-2 / 0.1 |
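The control flow of Algorithm 1 quoted in the Pseudocode row can be sketched as a toy training loop. Everything below is illustrative: `ToyRewardModel`, `ToyPolicy`, and the uniform reward split are stand-ins for the paper's learned causal reward decomposition (Eq. 4 and Eq. 6) and backbone policy updates (Eq. 7-9), which are not reproduced here.

```python
import random

class ToyRewardModel:
    """Stand-in for MACCA's reward-decomposition network psi_m (hypothetical)."""
    def redistribute(self, batch, agent_idx):
        # Eq. 6 stand-in: split the team reward R_t into per-agent rewards r^i_t.
        # MACCA learns this split causally; a uniform split is used here only
        # to keep the sketch runnable.
        n = batch["n_agents"]
        batch["agent_rewards"][agent_idx] = [r / n for r in batch["team_rewards"]]

    def update(self, batch):
        # Eq. 4 stand-in: one gradient step on the model loss L_m (omitted).
        pass

class ToyPolicy:
    """Stand-in for an offline RL backbone (OMAR, CQL, or ICQ)."""
    def __init__(self):
        self.updates = 0
    def update(self, batch):
        # Eq. 7, 8, or 9 stand-in, depending on the chosen backbone.
        self.updates += 1

def train_macca(dataset, reward_model, policies, total_steps, batch_size=4):
    """Algorithm 1's loop structure; the learning rules themselves are stubs."""
    for t in range(total_steps):
        # Step 2: sample trajectories from the offline dataset D into minibatch B.
        batch = {
            "team_rewards": random.sample(dataset, batch_size),
            "agent_rewards": {},
            "n_agents": len(policies),
        }
        for i, policy in enumerate(policies):
            reward_model.redistribute(batch, i)  # step 4: R_t -> r^i_t
            reward_model.update(batch)           # step 5: optimize psi_m
            policy.update(batch)                 # step 6: update policy pi
        batch.clear()                            # step 7: reset B
    return policies

random.seed(0)
policies = train_macca(list(range(100)), ToyRewardModel(),
                       [ToyPolicy(), ToyPolicy()], total_steps=3)
print([p.updates for p in policies])  # -> [3, 3]
```

Note the per-agent inner loop: each agent's reward is rewritten before its policy update, which is what lets a single-agent offline backbone (CQL, OMAR, ICQ) be reused unchanged on the decomposed rewards.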