O-MAPL: Offline Multi-agent Preference Learning

Authors: The Viet Bui, Tien Anh Mai, Thanh Hong Nguyen

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on SMAC and MAMuJoCo benchmarks show that our algorithm outperforms existing methods across various tasks."
Researcher Affiliation | Academia | "1 Singapore Management University, Singapore; 2 University of Oregon, Eugene, Oregon, United States."
Pseudocode | Yes | "Algorithm 1: O-MAPL"
Open Source Code | No | The paper does not contain any explicit statement about releasing code or a link to a repository for the methodology described.
Open Datasets | Yes | "We evaluate the performance of our O-MAPL in different complex MARL environments, including: multi-agent StarCraft II (i.e., SMACv1 (Samvelyan et al., 2019), SMACv2 (Ellis et al., 2022)) and multi-agent MuJoCo (de Witt et al., 2020a) benchmarks."
Dataset Splits | No | "For MAMuJoCo tasks, 1k trajectory pairs were sampled, while for SMAC tasks, 2k trajectory pairs were sampled. The dataset quality varies across poor, medium, and good levels, ensuring comprehensive coverage of different learning stages. To ensure varying quality levels, we created additional datasets for poor and expert levels."
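The pair-sampling procedure quoted above can be sketched as follows. This is a hypothetical helper, not the paper's released code (none exists); the function name and the use of uniform random sampling are assumptions, with only the pair counts (1k for MAMuJoCo, 2k for SMAC) taken from the paper.

```python
import random

def sample_preference_pairs(trajectories, num_pairs, rng=None):
    """Sample pairs of distinct trajectories for preference labeling.

    Illustrative sketch of the dataset construction described in the paper:
    1k trajectory pairs for MAMuJoCo tasks, 2k pairs for SMAC tasks.
    Uniform sampling without replacement within each pair is an assumption.
    """
    rng = rng or random.Random(0)
    pairs = []
    for _ in range(num_pairs):
        # Pick two distinct trajectory indices per pair.
        i, j = rng.sample(range(len(trajectories)), 2)
        pairs.append((trajectories[i], trajectories[j]))
    return pairs

# Example with placeholder trajectory IDs:
smac_pairs = sample_preference_pairs(list(range(100)), 2000)      # SMAC: 2k pairs
mamujoco_pairs = sample_preference_pairs(list(range(100)), 1000)  # MAMuJoCo: 1k pairs
```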
Hardware Specification | Yes | "All experiments were implemented using PyTorch and executed in parallel on a single NVIDIA H100 NVL Tensor Core GPU to ensure computational efficiency."
Software Dependencies | No | "All experiments were implemented using PyTorch and executed in parallel on a single NVIDIA H100 NVL Tensor Core GPU to ensure computational efficiency."
Experiment Setup | Yes | "Table 6 reports hyperparameters used consistently across all experiments: Optimizer: Adam; Learning rate (Q-value and policy networks): 1e-4; Tau (soft update target rate): 0.005; Gamma (discount factor): 0.99; Batch size: 32; Agent hidden dimension: 256; Mixer hidden dimension: 64; Number of seeds: 4; Number of episodes per evaluation step: 32; Number of evaluation steps: 100."
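For reference, the Table 6 hyperparameters can be collected into a single configuration sketch. The key names below are illustrative assumptions (the paper does not release code); only the values come from the reported table.

```python
# Training configuration assembled from Table 6 of the paper.
# Key names are assumptions; values are as reported.
config = {
    "optimizer": "Adam",
    "learning_rate": 1e-4,       # Q-value and policy networks
    "tau": 0.005,                # soft-update rate for target networks
    "gamma": 0.99,               # discount factor
    "batch_size": 32,
    "agent_hidden_dim": 256,
    "mixer_hidden_dim": 64,
    "num_seeds": 4,
    "episodes_per_eval_step": 32,
    "num_eval_steps": 100,
}
```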