Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank

Authors: Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason Lee, Daniel Jiang, Yonathan Efroni

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical results are complemented by experiments that showcase the potential of critic architectures with low interaction rank in offline MARL, contrasting with commonly used single-agent value decomposition architectures. "Lastly, in Section 6, we empirically corroborate our findings. This shows the potential of using reward architectures with low interaction rank in the offline MARL setting, and the need to go beyond the standard single-agent value decomposition architectures, which have been popularized for MARL (Sunehag et al., 2017; Rashid et al., 2020; Yu et al., 2022)."
Researcher Affiliation | Collaboration | Princeton University (work done at Meta); Meta; Pokee AI (work done at Meta); Princeton University.
Pseudocode | Yes | Algorithm 1: Decentralized Regularized Actor-Critic (DR-AC) ... Algorithm 2: Q-function Estimation
Open Source Code | No | The paper does not contain any explicit statement about releasing its own source code, nor does it provide a link to a code repository. It mentions using the Tianshou library and the TD3+BC objective, but these are third-party tools and not the authors' own code release for this specific work.
Open Datasets | No | The paper does not provide concrete access information (specific link, DOI, repository name, formal citation, or reference to established benchmark datasets) for a publicly available or open dataset. Instead, it describes generating data within a simulated environment: 'We consider the continuous action setting, where ∀i ∈ [N], a^i ∈ [−1, 1]. The underlying reward model is a 2-IR function of the form ∀i ∈ [N], r^i(s, a) = Σ_{j=1}^N a^i a^j / √N + ϵ, where ϵ ∼ Uniform(−σ, σ) and σ > 0. Further, we set the number of agents as N = 50. We collect offline data with the uniform policy and set the number of samples M such that σN/M = 0.1.'
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions generating data and a 'holdout validation dataset' but without details on the split.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It does not mention any hardware at all.
Software Dependencies | No | The paper mentions the 'Tianshou library (Weng et al., 2022)' and using the 'TD3+BC objective (Fujimoto and Gu, 2021)', but does not provide specific version numbers for these or any other ancillary software components. It also mentions 'Optimizer Adam' in a table but without a version.
Experiment Setup | Yes | Additional hyper-parameters related to training are given in Table 2: Critic learning rate: 1e-4; Critic batch size: 64; Patience parameter for critic: 20; Actor learning rate: 1e-3; Actor batch size: 64; Number of epochs: 500; Optimizer: Adam; Policy architecture: MLP, 3 layers, width 128, with ReLU activations; TD3+BC α parameter: 5; Number of trials per experiment: 10.
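The paper's central structural assumption is low interaction rank: a k-IR function over a joint action decomposes into a sum of terms, each depending on at most k agents' actions. A minimal sketch (not the paper's critic implementation; function names are ours) contrasting a 1-IR single-agent value decomposition with a 2-IR pairwise decomposition:

```python
def q_1ir(actions, per_agent_terms):
    """1-IR (single-agent value decomposition): Q(a) = sum_i f_i(a_i)."""
    return sum(f(a) for f, a in zip(per_agent_terms, actions))

def q_2ir(actions, pair_term):
    """2-IR: Q(a) = sum over agent pairs (i, j), i < j, of g(a_i, a_j)."""
    n = len(actions)
    return sum(pair_term(actions[i], actions[j])
               for i in range(n) for j in range(i + 1, n))
```

For example, a quadratic coupling like the paper's synthetic reward (sums of products a^i a^j) is 2-IR but in general has no exact 1-IR decomposition, which is why single-agent value decomposition critics can misfit it.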
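The synthetic data-collection procedure quoted in the Open Datasets row can be sketched as follows. This is a reconstruction under stated assumptions: σ = 0.5 is an illustrative choice (the paper fixes only σ > 0), and M is derived from the stated rule σN/M = 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 50, 0.5                 # 50 agents; sigma = 0.5 is illustrative
M = int(sigma * N / 0.1)           # sample count chosen so sigma * N / M = 0.1

def reward(a):
    """2-IR reward: r^i(s, a) = sum_j a^i a^j / sqrt(N) + eps,
    with eps ~ Uniform(-sigma, sigma) drawn independently per agent."""
    base = a * a.sum() / np.sqrt(N)
    return base + rng.uniform(-sigma, sigma, size=a.shape)

# offline data from the uniform behavior policy: a^i ~ Uniform(-1, 1)
actions = rng.uniform(-1.0, 1.0, size=(M, N))
rewards = np.stack([reward(a) for a in actions])
```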
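The TD3+BC objective cited in the dependency and setup rows augments the deterministic actor's Q-maximization with a behavior-cloning penalty toward the dataset actions (Fujimoto and Gu, 2021). A hedged sketch of the actor loss, written here over precomputed arrays rather than network modules; the function name is ours, and α = 5 matches Table 2 of the paper:

```python
import numpy as np

def td3bc_actor_loss(q_values, pi_actions, data_actions, alpha=5.0):
    """TD3+BC actor loss: maximize Q while regularizing the policy's
    actions toward the dataset actions with an MSE (BC) term.
    lam normalizes the Q term by its mean absolute magnitude."""
    lam = alpha / np.mean(np.abs(q_values))
    return -lam * np.mean(q_values) + np.mean((pi_actions - data_actions) ** 2)
```

The λ normalization keeps the trade-off between the RL and BC terms stable across environments with different reward scales, which is the design rationale given in the original TD3+BC paper.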