ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing

Authors: Ziteng Wang, Jun Zhu, Jianfei Chen

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that ReMoE consistently outperforms vanilla TopK-routed MoE across various model sizes, expert counts, and levels of granularity. Furthermore, ReMoE exhibits superior scalability with respect to the number of experts, surpassing traditional MoE architectures. The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
Researcher Affiliation | Collaboration | Dept. of Comp. Sci. and Tech., Institute for AI, BNRist Center, THBI Lab, Tsinghua-Bosch Joint ML Center, Tsinghua University; EMAIL; EMAIL
Pseudocode | No | The paper describes its methods using mathematical equations (e.g., Equations 6, 7, 9, and 10) and prose, and illustrates concepts with diagrams such as Figures 1 and 2. However, it does not contain any structured pseudocode or algorithm blocks with numbered, code-formatted steps.
Open Source Code | Yes | The implementation based on Megatron-LM is available at https://github.com/thu-ml/ReMoE.
Open Datasets | Yes | We train the models on The Pile (Gao et al., 2020), an 800 GB diverse corpus. We evaluate the zero-shot performance of the trained models on the following downstream tasks: ARC (Clark et al., 2018); BoolQ (Clark et al., 2019); HellaSwag (Zellers et al., 2019); LAMBADA (Paperno et al., 2016); PIQA (Bisk et al., 2020); RACE (Lai et al., 2017).
Dataset Splits | No | The paper mentions training models for 60k steps on 30B tokens, reports validation loss, and evaluates zero-shot accuracy on various downstream tasks. While these imply the existence of training/validation/test sets, the paper does not explicitly specify the proportions, sample counts, or splitting methodology (e.g., an '80/10/10 split' or specific random seeds for splitting).
Hardware Specification | Yes | All models are trained with 8 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions leveraging "Megatron-LM (Shoeybi et al., 2019) as our code base", adopting "AdamW (Loshchilov, 2017) as the optimizer with β1 = 0.9, β2 = 0.999 with ZeRO optimization (Rajbhandari et al., 2020)", and using a "byte pair encoding (BPE) tokenizer (Sennrich, 2015)". While these are software components or techniques, no specific version numbers are provided for any library, framework (such as PyTorch or TensorFlow), or Python itself.
Experiment Setup | Yes | We experiment with the mainstream LLaMA (Touvron et al., 2023) architecture, featuring grouped query attention (GQA) (Ainslie et al., 2023), the SwiGLU (Shazeer, 2020) activation function, RoPE (Su et al., 2024) position embedding, and RMSNorm (Zhang & Sennrich, 2019). The context length is set to 1024, and the batch size is 512. We experiment with three different dense backbone sizes as shown in Table 1. For vanilla MoE we adopt a load balancing loss of weight 0.01 following Fedus et al. (2022). For ReMoE we use the adaptive load balancing L1 regularization in Equation 10. All models are trained for 60k steps (~30B tokens)... The learning rate is set to 5e-4 with a cosine scheduler.
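To make the routing contrast concrete, the table entries above can be illustrated with a minimal sketch of the two gating schemes. This is not the paper's implementation: it uses NumPy rather than Megatron-LM, omits the adaptive regularization coefficient (ReMoE adapts it during training; a fixed `reg_coeff` is a simplification), and the function names are illustrative only.

```python
import numpy as np

def topk_softmax_routing(logits, k=2):
    """Vanilla MoE routing: softmax over expert logits, keep only the
    top-k gates. The hard top-k selection is discontinuous, which is
    the non-differentiability that ReLU routing is meant to avoid."""
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    gates = np.zeros_like(probs)
    topk = np.argsort(-probs, axis=-1)[..., :k]
    np.put_along_axis(
        gates, topk, np.take_along_axis(probs, topk, axis=-1), axis=-1
    )
    return gates

def relu_routing(logits, reg_coeff=0.01):
    """ReLU routing (ReMoE-style sketch): gates are relu(logits), so the
    mapping is fully differentiable and sparsity arises from the sign of
    the logits. An L1 penalty on the gates (added to the training loss)
    encourages the desired sparsity; fixed reg_coeff is a simplification
    of the paper's adaptive load-balancing L1 regularization."""
    gates = np.maximum(logits, 0.0)
    l1_penalty = reg_coeff * gates.sum()
    return gates, l1_penalty
```

For a token with logits `[2.0, -1.0, 0.5, -0.3]` over four experts, both schemes activate experts 0 and 2, but the ReLU gates vary smoothly with the logits while the top-k gates change discontinuously as experts cross the selection boundary.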