Mastering Massive Multi-Task Reinforcement Learning via Mixture-of-Expert Decision Transformer
Authors: Yilun Kong, Guozheng Ma, Qi Zhao, Haoyu Wang, Li Shen, Xueqian Wang, Dacheng Tao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that, by increasing the number of experts, M3DT not only consistently enhances its performance with model expansion at a fixed number of tasks, but also exhibits remarkable task scalability, successfully extending to 160 tasks with superior performance. ... We demonstrate the superior performance of M3DT through rigorous testing on a broad spectrum of task scales, analyze its functionality through extensive ablation studies, and verify its task scalability and parameter scalability. |
| Researcher Affiliation | Academia | (1) Tsinghua University, China; (2) Nanyang Technological University, Singapore; (3) Shenzhen Campus of Sun Yat-sen University. |
| Pseudocode | No | The paper describes the methodology in prose and mathematical equations but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks or code-like formatted procedures. |
| Open Source Code | Yes | Our code is available at: https://github.com/KongYilun/M3DT |
| Open Datasets | Yes | We consider a total of 160 continuous control tasks from 3 task domains: Meta-World (Yu et al., 2020b), DMControl (Tassa et al., 2018), MuJoCo Locomotion (Todorov et al., 2012). ... For the offline dataset, we follow the works (He et al., 2023; Hu et al., 2024) and utilize their dataset with the near-optimal trajectories... |
| Dataset Splits | No | The paper describes the datasets used (Meta-World, DMControl, Mujoco Locomotion) and how rewards are scaled, but it does not explicitly provide specific training/test/validation splits (e.g., percentages, sample counts, or references to standard splits for its own experiments) for the data used in the main M3DT training. |
| Hardware Specification | Yes | We use NVIDIA GeForce RTX 4090 to train and evaluate each model except HarmoDT-Large, which is trained and evaluated on NVIDIA A100 40G due to its substantial resource requirements. |
| Software Dependencies | No | The paper mentions using 'Optimizer Adam' and references specific methods like 'PromptDT' and 'HarmoDT', but it does not specify version numbers for any programming languages (e.g., Python), libraries (e.g., PyTorch), or other key software components used in the implementation. |
| Experiment Setup | Yes | The specific model parameters and hyper-parameters utilized in our training process are outlined in Table 5. ... Number of layers: 6; Number of attention heads: 8; Hidden dimension: 256; Number of experts: [8, 16, 24, 32, 40, 48]; Nonlinearity function: ReLU; Batch size: 16; Prompt length K: 20; Dropout: 0.1; Learning rate: 1.0e-4; Optimizer: Adam; Total rounds: 1e6 (backbone training rounds: 4e5, expert training rounds: 2e5, router training rounds: 4e5). |
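For reimplementation, the Table 5 hyper-parameters reported above can be collected into a single configuration object. This is a hypothetical sketch (key names are illustrative, not taken from the authors' released code); it also checks that the three training stages sum to the stated total of 1e6 rounds:

```python
# Hypothetical hyper-parameter configuration for M3DT, transcribed from
# Table 5 of the paper. Key names are illustrative, not the authors' own.
M3DT_CONFIG = {
    "num_layers": 6,
    "num_attention_heads": 8,
    "hidden_dim": 256,
    "num_experts": [8, 16, 24, 32, 40, 48],  # settings swept in the experiments
    "nonlinearity": "relu",
    "batch_size": 16,
    "prompt_length_K": 20,
    "dropout": 0.1,
    "learning_rate": 1.0e-4,
    "optimizer": "adam",
    # Three-stage training schedule (rounds): backbone, then experts, then router.
    "total_rounds": int(1e6),
    "backbone_training_rounds": int(4e5),
    "expert_training_rounds": int(2e5),
    "router_training_rounds": int(4e5),
}

# Sanity check: the per-stage rounds add up to the reported total.
stage_sum = (
    M3DT_CONFIG["backbone_training_rounds"]
    + M3DT_CONFIG["expert_training_rounds"]
    + M3DT_CONFIG["router_training_rounds"]
)
assert stage_sum == M3DT_CONFIG["total_rounds"]
```

The assertion makes the reported schedule internally consistent: 4e5 + 2e5 + 4e5 = 1e6.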