Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank

Authors: Wenhao Zhan, Scott Fujimoto, Zheqing Zhu, Jason Lee, Daniel Jiang, Yonathan Efroni

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our theoretical results are complemented by experiments that showcase the potential of critic architectures with low interaction rank in offline MARL, contrasting with commonly used single-agent value decomposition architectures. "Lastly, in Section 6, we empirically corroborate our findings. This shows the potential of using reward architectures with low interaction rank in the offline MARL setting, and the need to go beyond the standard single-agent value decomposition architectures, which have been popularized for MARL (Sunehag et al., 2017; Rashid et al., 2020; Yu et al., 2022)."
Researcher Affiliation | Collaboration | Princeton University (work done at Meta); Meta; Pokee AI (work done at Meta); Princeton University.
Pseudocode | Yes | Algorithm 1: Decentralized Regularized Actor-Critic (DR-AC) ... Algorithm 2: Q-function Estimation
Open Source Code | No | The paper does not contain any explicit statement about releasing its own source code, nor does it provide a link to a code repository. It mentions using the Tianshou library and the TD3+BC objective, but these are third-party tools and not the authors' own code release for this specific work.
Open Datasets | No | The paper does not provide concrete access information (specific link, DOI, repository name, formal citation, or reference to established benchmark datasets) for a publicly available or open dataset. Instead, it describes generating data within a simulated environment: 'We consider the continuous action setting, where ∀i ∈ [N], a^i ∈ [−1, 1]. The underlying reward model is a 2-IR function of the form ∀i ∈ [N], r^i(s, a) = Σ_{j=1}^N a^i a^j / √N + ϵ, where ϵ ∼ Uniform(−σ, σ) and σ > 0. Further, we set the number of agents as N = 50. We collect offline data with the uniform policy and set the number of samples M such that σN/M = 0.1.'
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. It mentions generating data and a 'holdout validation dataset' but without details on the split.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. It does not mention any hardware at all.
Software Dependencies | No | The paper mentions the 'Tianshou library (Weng et al., 2022)' and using the 'TD3+BC objective (Fujimoto and Gu, 2021)', but does not provide specific version numbers for these or any other ancillary software components. It also mentions 'Optimizer Adam' in a table but without a version.
Experiment Setup | Yes | Additional hyper-parameters related to training are given in Table 2: Critic learning rate: 1e-4; Critic batch size: 64; Patience parameter for critic: 20; Actor learning rate: 1e-3; Actor batch size: 64; Number of epochs: 500; Optimizer: Adam; Policy architecture: MLP, 3 layers, width 128, with ReLU activations; TD3+BC α parameter: 5; Number of trials per experiment: 10.
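The paper's central structural assumption is low interaction rank: a k-IR function over a joint action decomposes into a sum of terms, each depending on at most k agents' actions. A minimal sketch (not the paper's critic implementation; function names are ours) contrasting a 1-IR single-agent value decomposition with a 2-IR pairwise decomposition:

```python
def q_1ir(actions, per_agent_terms):
    """1-IR (single-agent value decomposition): Q(a) = sum_i f_i(a_i)."""
    return sum(f(a) for f, a in zip(per_agent_terms, actions))

def q_2ir(actions, pair_term):
    """2-IR: Q(a) = sum over agent pairs (i, j), i < j, of g(a_i, a_j)."""
    n = len(actions)
    return sum(pair_term(actions[i], actions[j])
               for i in range(n) for j in range(i + 1, n))
```

For example, a quadratic coupling like the paper's synthetic reward (sums of products a^i a^j) is 2-IR but in general has no exact 1-IR decomposition, which is why single-agent value decomposition critics can misfit it.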
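The synthetic data-collection procedure quoted in the Open Datasets row can be sketched as follows. This is a reconstruction under stated assumptions: σ = 0.5 is an illustrative choice (the paper fixes only σ > 0), and M is derived from the stated rule σN/M = 0.1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 50, 0.5                 # 50 agents; sigma = 0.5 is illustrative
M = int(sigma * N / 0.1)           # sample count chosen so sigma * N / M = 0.1

def reward(a):
    """2-IR reward: r^i(s, a) = sum_j a^i a^j / sqrt(N) + eps,
    with eps ~ Uniform(-sigma, sigma) drawn independently per agent."""
    base = a * a.sum() / np.sqrt(N)
    return base + rng.uniform(-sigma, sigma, size=a.shape)

# offline data from the uniform behavior policy: a^i ~ Uniform(-1, 1)
actions = rng.uniform(-1.0, 1.0, size=(M, N))
rewards = np.stack([reward(a) for a in actions])
```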
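The TD3+BC objective cited in the dependency and setup rows augments the deterministic actor's Q-maximization with a behavior-cloning penalty toward the dataset actions (Fujimoto and Gu, 2021). A hedged sketch of the actor loss, written here over precomputed arrays rather than network modules; the function name is ours, and α = 5 matches Table 2 of the paper:

```python
import numpy as np

def td3bc_actor_loss(q_values, pi_actions, data_actions, alpha=5.0):
    """TD3+BC actor loss: maximize Q while regularizing the policy's
    actions toward the dataset actions with an MSE (BC) term.
    lam normalizes the Q term by its mean absolute magnitude."""
    lam = alpha / np.mean(np.abs(q_values))
    return -lam * np.mean(q_values) + np.mean((pi_actions - data_actions) ** 2)
```

The λ normalization keeps the trade-off between the RL and BC terms stable across environments with different reward scales, which is the design rationale given in the original TD3+BC paper.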