Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer
Authors: Yilun Kong, Guozheng Ma, Qi Zhao, Haoyu Wang, Li Shen, Xueqian Wang, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that, by increasing the number of experts, M3DT not only consistently enhances its performance with model expansion at a fixed number of tasks, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance. ... We demonstrate the superior performance of M3DT through rigorous testing on a broad spectrum of task scales, analyze its functionality through extensive ablation studies, and verify its task scalability and parameter scalability. |
| Researcher Affiliation | Academia | (1) Tsinghua University, China; (2) Nanyang Technological University, Singapore; (3) Shenzhen Campus of Sun Yat-sen University. |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or code-like formatted procedures. |
| Open Source Code | Yes | Our code is available at: https://github.com/KongYilun/M3DT |
| Open Datasets | Yes | We consider a total of 160 continuous control tasks from 3 task domains: Meta-World (Yu et al., 2020b), DMControl (Tassa et al., 2018), MuJoCo Locomotion (Todorov et al., 2012). ... For the offline dataset, we follow the works (He et al., 2023; Hu et al., 2024) and utilize their dataset with the near-optimal trajectories... |
| Dataset Splits | No | The paper describes the datasets used (Meta-World, DMControl, Mujoco Locomotion) and how rewards are scaled, but it does not explicitly provide specific training/test/validation splits (e.g., percentages, sample counts, or references to standard splits for its own experiments) for the data used in the main M3DT training. |
| Hardware Specification | Yes | We use NVIDIA GeForce RTX 4090 to train and evaluate each model except HarmoDT-Large, which is trained and evaluated on NVIDIA A100 40G due to its substantial resource requirements. |
| Software Dependencies | No | The paper mentions using 'Optimizer Adam' and references specific methods like 'PromptDT' and 'HarmoDT', but it does not specify version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch), or other key software components used in the implementation. |
| Experiment Setup | Yes | The specific model parameters and hyper-parameters utilized in our training process are outlined in Table 5. ... Number of layers: 6; Number of attention heads: 8; Hidden dimension: 256; Number of experts: [8, 16, 24, 32, 40, 48]; Nonlinearity function: ReLU; Batch size: 16; Prompt length K: 20; Dropout: 0.1; Learning rate: 1.0e-4; Optimizer: Adam; Total rounds: 1e6 (backbone training rounds: 4e5, expert training rounds: 2e5, router training rounds: 4e5). |
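For reimplementation, the Table 5 hyper-parameters reported above can be collected into a single configuration object. This is a hypothetical sketch (key names are illustrative, not taken from the authors' released code); it also checks that the three training stages sum to the stated total of 1e6 rounds:

```python
# Hypothetical hyper-parameter configuration for M3DT, transcribed from
# Table 5 of the paper. Key names are illustrative, not the authors' own.
M3DT_CONFIG = {
    "num_layers": 6,
    "num_attention_heads": 8,
    "hidden_dim": 256,
    "num_experts": [8, 16, 24, 32, 40, 48],  # settings swept in the experiments
    "nonlinearity": "relu",
    "batch_size": 16,
    "prompt_length_K": 20,
    "dropout": 0.1,
    "learning_rate": 1.0e-4,
    "optimizer": "adam",
    # Three-stage training schedule (rounds): backbone, then experts, then router.
    "total_rounds": int(1e6),
    "backbone_training_rounds": int(4e5),
    "expert_training_rounds": int(2e5),
    "router_training_rounds": int(4e5),
}

# Sanity check: the per-stage rounds add up to the reported total.
stage_sum = (
    M3DT_CONFIG["backbone_training_rounds"]
    + M3DT_CONFIG["expert_training_rounds"]
    + M3DT_CONFIG["router_training_rounds"]
)
assert stage_sum == M3DT_CONFIG["total_rounds"]
```

The assertion makes the reported schedule internally consistent: 4e5 + 2e5 + 4e5 = 1e6.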