ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing
Authors: Ziteng Wang, Jun Zhu, Jianfei Chen
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE. |
| Researcher Affiliation | Collaboration | Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University EMAIL; EMAIL |
| Pseudocode | No | The paper describes methods using mathematical equations (e.g., Equation 6, 7, 9, 10) and prose. It also includes diagrams like Figure 1 and 2 illustrating concepts. However, it does not contain any structured pseudocode or algorithm blocks with numbered steps formatted like code. |
| Open Source Code | Yes | The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE. |
| Open Datasets | Yes | We train the models on The Pile (Gao et al., 2020), an 800 GB diverse corpus. We evaluate the zero-shot performance of the trained models on the following downstream tasks: ARC (Clark et al., 2018); BoolQ (Clark et al., 2019); HellaSwag (Zellers et al., 2019); LAMBADA (Paperno et al., 2016); PIQA (Bisk et al., 2020); RACE (Lai et al., 2017). |
| Dataset Splits | No | The paper mentions training models for 60k steps on 30B tokens and evaluates validation loss. It also evaluates zero-shot accuracy on various downstream tasks. While these imply the existence of training/validation/test sets, the paper does not explicitly specify the exact proportions, sample counts, or methodology for splitting these datasets (e.g., '80/10/10 split' or specific random seeds for splitting). |
| Hardware Specification | Yes | All models are trained with 8 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions leveraging "Megatron-LM (Shoeybi et al., 2019) as our code base" and adopting "AdamW (Loshchilov, 2017) as the optimizer with β1 = 0.9, β2 = 0.999 with ZeRO optimization (Rajbhandari et al., 2020)" and using a "byte pair encoding (BPE) tokenizer (Sennrich, 2015)". While these are software components or techniques, no specific version numbers for any libraries, frameworks (like PyTorch or TensorFlow), or Python itself are provided. |
| Experiment Setup | Yes | We experiment with the mainstream LLaMA (Touvron et al., 2023) architecture, featuring grouped query attention (GQA) (Ainslie et al., 2023), SwiGLU (Shazeer, 2020) activation function, RoPE (Su et al., 2024) position embedding, and RMSNorm (Zhang & Sennrich, 2019). The context length is set to 1024, and the batch size is 512. We experiment with three different dense backbone sizes as shown in Table 1. For vanilla MoE we adopt a load balancing loss of weight 0.01 following Fedus et al. (2022). For ReMoE we use the adaptive load balancing L1 regularization in Equation 10. All models are trained for 60k steps (~30B tokens)... The learning rate is set to be 5e-4 with a cosine scheduler. |
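The core idea the table refers to — ReLU routing with an adaptive L1 sparsity penalty in place of TopK softmax routing — can be sketched as below. This is a minimal illustrative sketch, not the paper's Megatron-LM implementation: the function names, the multiplicative update rule, and the `target_sparsity` / `rate` parameters are assumptions chosen for clarity, and the paper's exact Equation 10 regularizer may differ in detail.

```python
import numpy as np

def relu_router(x, w, l1_coef):
    """ReLU routing sketch: gate_i = max(0, x . w_i).

    Unlike TopK routing, the set of active experts per token is whichever
    gates are strictly positive, so the router is fully differentiable.
    Returns the gate values and an L1 regularization term that pushes
    gates toward zero (sparsity).
    """
    gates = np.maximum(0.0, x @ w)        # (tokens, n_experts), mostly zeros
    reg = l1_coef * gates.mean()          # L1 of nonnegative gates = mean
    return gates, reg

def adapt_l1(l1_coef, gates, target_sparsity=0.9, rate=1.2):
    """Adapt the L1 coefficient toward a target fraction of zero gates.

    This multiplicative schedule is an assumption for illustration:
    strengthen the penalty when routing is too dense, relax it when
    routing is sparser than the target.
    """
    sparsity = float((gates == 0.0).mean())
    return l1_coef * rate if sparsity < target_sparsity else l1_coef / rate
```

A typical training step would call `relu_router`, add `reg` to the language-modeling loss, then update the coefficient with `adapt_l1` so that compute per token stays near the intended sparsity budget.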