Layerwise Recurrent Router for Mixture-of-Experts

Authors: Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at https://github.com/qiuzh20/RMoE. ... 4 EXPERIMENTS Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity.
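The quoted claim centers on cross-layer information sharing in the router. A minimal sketch of the idea, in pure Python, is below: each layer's router keeps a hidden state that is passed to the next layer's router, and top-4-of-16 gating (the setup quoted later in this report) is applied to the resulting logits. The `tanh` recurrence and all weight shapes here are illustrative stand-ins, not the paper's exact recurrent cell.

```python
import math
import random

random.seed(0)
D, N_EXPERTS, TOP_K = 8, 16, 4   # 16 experts, top-4 gating, as in the quoted setup

def rand_mat(rows, cols):
    # small random weights; 0.02 init std mirrors the quoted configuration
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def recurrent_route(x, h_prev, W_h, U_h, W_gate):
    # Hypothetical recurrence: the router state mixes the token
    # representation with the previous layer's router state, so routing
    # decisions can share information across layers.
    h = [math.tanh(a + b) for a, b in zip(matvec(W_h, x), matvec(U_h, h_prev))]
    probs = softmax(matvec(W_gate, h))                     # gate over 16 experts
    top = sorted(range(N_EXPERTS), key=lambda i: -probs[i])[:TOP_K]
    return top, probs, h                                   # h feeds the next layer

# route one token through 3 stacked MoE layers
x = [random.gauss(0, 1) for _ in range(D)]
h = [0.0] * D
for _ in range(3):
    W_h, U_h, W_gate = rand_mat(D, D), rand_mat(D, D), rand_mat(N_EXPERTS, D)
    top, probs, h = recurrent_route(x, h, W_h, U_h, W_gate)
```

In a layer-independent router, `h_prev` would simply be discarded; the recurrence is the "novel computation stage" the quote describes as orthogonal to other MoE design choices.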
Researcher Affiliation Collaboration 1Alibaba Group, 2University of Edinburgh, 3ICT, Chinese Academy of Sciences, 4Nanjing University, 5INF Technology, 6University of Amsterdam, 7Shanghai AI Lab
Pseudocode No The paper describes the methodology using mathematical equations (e.g., Eq. 2, 3, 4, 5, 6) and prose. While Appendix A.4.2 and A.4.3 contain Python code snippets for analysis (Mutual Information and Expert Similarities), these are not pseudocode or algorithm blocks for the core RMoE methodology itself.
Open Source Code Yes Our code is at https://github.com/qiuzh20/RMoE.
Open Datasets Yes Language Modeling Tasks and Metrics Following (Pham et al., 2024), we first test on two common language modeling tasks: enwik8 (character-level language modeling, with Bits-Per-Character (BPC) as the evaluation metric) and WikiText-103 (word-level language modeling, with Perplexity (PPL) as the evaluation metric). ... All models are trained on Alpaca (Taori et al., 2023) with the same configuration. We use lm-evaluation-harness to evaluate the fine-tuned model. ... Therefore, we only test on tasks (ARC-easy, Hellaswag, PIQA, SciQ, LAMBADA) in lm-evaluation-harness.
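For readers comparing numbers across the two tasks: both BPC and PPL are monotone transforms of the model's average negative log-likelihood, so they can be cross-checked with two one-line conversions. The sketch below assumes the NLL is given in nats.

```python
import math

def bpc_from_nats(nll_nats_per_char):
    # Bits-per-character (enwik8 metric) is cross-entropy per character
    # expressed in bits: divide the nats value by ln 2.
    return nll_nats_per_char / math.log(2)

def ppl_from_nats(nll_nats_per_token):
    # Perplexity (WikiText-103 metric) is the exponentiated per-token NLL.
    return math.exp(nll_nats_per_token)

# ln 2 nats/char corresponds to exactly 1 bit/char
one_bit = bpc_from_nats(math.log(2))
# a per-token NLL of ln 20 nats corresponds to a perplexity of 20
ppl20 = ppl_from_nats(math.log(20.0))
```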
Dataset Splits Yes We employ default train-validation-test splits for each dataset.
Hardware Specification Yes Each task is trained on 2 NVIDIA A100 GPUs for about 20 hours. ... These experiments are conducted using the Megablocks (Gale et al., 2023) on 8 NVIDIA A100 GPUs for about 5 days.
Software Dependencies No The paper mentions using the 'Adam optimizer' and 'AdamW optimizer' but does not specify their version numbers. It also refers to 'Megablocks' and 'lm-evaluation-harness' without providing specific software versions for these tools or any other key libraries/frameworks.
Experiment Setup Yes Our model architecture is modified based on the Llama family (Touvron et al., 2023). Specifically, we use a 24-layer model and top-4 gating from 16 experts per layer... For model architecture, our 24-layer model employs Rotary Embedding for positional encoding, SwiGLU for activation functions, and RMSNorm to enhance the model's efficiency and performance. Other model configuration includes a hidden size of 1280, 20 attention heads, an initialization standard deviation of 0.02, a sequence length of 4096, and a maximum positional embedding length of 4096. All dropout rates are set to 0. For the MoE part, we use 16 experts, with each expert having a feedforward network hidden size of 448, following the fine-grained MoE settings, and each token activating 4 experts. We use a tokenizer with a 96512 vocabulary size... For pre-training configurations, we use a global batch size of 1120, a warmup period of 2000 iterations, a learning rate of 4.2e-4, a minimum learning rate of 4.2e-5, cosine learning rate decay, the Adam optimizer with β1 = 0.9 and β2 = 0.95, a weight decay of 0.1, and gradient clipping at 1.0. ... We use bfloat16 (bf16) precision to accelerate training while maintaining numerical stability. The model is trained for 3 epochs using the AdamW optimizer with a global batch size of 128. We set the learning rate to 2e-5 and do not apply weight decay. A warmup ratio of 0.03 is used to gradually increase the learning rate at the beginning of training, and we utilize a cosine learning rate scheduler to adjust it throughout the training process, promoting smoother convergence.
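The quoted pre-training schedule (2000-iteration warmup to 4.2e-4, cosine decay to a 4.2e-5 floor) can be reproduced with a few lines; this is a standard warmup-plus-cosine rule, not code from the paper's repository, and the total step count `TOTAL` is an assumed value since the quote does not state one.

```python
import math

MAX_LR, MIN_LR = 4.2e-4, 4.2e-5   # peak and floor quoted in the setup
WARMUP, TOTAL = 2000, 100_000     # WARMUP is quoted; TOTAL is assumed

def lr_at(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = min((step - WARMUP) / max(1, TOTAL - WARMUP), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

The same function covers the fine-tuning schedule by swapping in its quoted values (peak 2e-5, warmup ratio 0.03 of total steps); only the constants change.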