Layerwise Recurrent Router for Mixture-of-Experts

Authors: Zihan Qiu, Zeyu Huang, Shuang Cheng, Yizhi Zhou, Zili Wang, Ivan Titov, Jie Fu

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity. Our code is at https://github.com/qiuzh20/RMoE. ... 4 EXPERIMENTS Our analyses attribute RMoE's gains to its effective cross-layer information sharing, which also improves expert selection and diversity.
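The quoted claim centers on cross-layer information sharing in the router. A minimal sketch of the idea, in pure Python, is below: each layer's router keeps a hidden state that is passed to the next layer's router, and top-4-of-16 gating (the setup quoted later in this report) is applied to the resulting logits. The `tanh` recurrence and all weight shapes here are illustrative stand-ins, not the paper's exact recurrent cell.

```python
import math
import random

random.seed(0)
D, N_EXPERTS, TOP_K = 8, 16, 4   # 16 experts, top-4 gating, as in the quoted setup

def rand_mat(rows, cols):
    # small random weights; 0.02 init std mirrors the quoted configuration
    return [[random.gauss(0, 0.02) for _ in range(cols)] for _ in range(rows)]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def recurrent_route(x, h_prev, W_h, U_h, W_gate):
    # Hypothetical recurrence: the router state mixes the token
    # representation with the previous layer's router state, so routing
    # decisions can share information across layers.
    h = [math.tanh(a + b) for a, b in zip(matvec(W_h, x), matvec(U_h, h_prev))]
    probs = softmax(matvec(W_gate, h))                     # gate over 16 experts
    top = sorted(range(N_EXPERTS), key=lambda i: -probs[i])[:TOP_K]
    return top, probs, h                                   # h feeds the next layer

# route one token through 3 stacked MoE layers
x = [random.gauss(0, 1) for _ in range(D)]
h = [0.0] * D
for _ in range(3):
    W_h, U_h, W_gate = rand_mat(D, D), rand_mat(D, D), rand_mat(N_EXPERTS, D)
    top, probs, h = recurrent_route(x, h, W_h, U_h, W_gate)
```

In a layer-independent router, `h_prev` would simply be discarded; the recurrence is the "novel computation stage" the quote describes as orthogonal to other MoE design choices.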
Researcher Affiliation Collaboration 1Alibaba Group, 2University of Edinburgh, 3ICT, Chinese Academy of Sciences, 4Nanjing University, 5INF Technology, 6University of Amsterdam, 7Shanghai AI Lab
Pseudocode No The paper describes the methodology using mathematical equations (e.g., Eq. 2, 3, 4, 5, 6) and prose. While Appendix A.4.2 and A.4.3 contain Python code snippets for analysis (Mutual Information and Expert Similarities), these are not pseudocode or algorithm blocks for the core RMoE methodology itself.
Open Source Code Yes Our code is at https://github.com/qiuzh20/RMoE.
Open Datasets Yes Language Modeling Tasks and Metrics Following (Pham et al., 2024), we first test on two common language modeling tasks: enwik8 (character-level language modeling, with Bits-Per-Character (BPC) as the evaluation metric) and WikiText-103 (word-level language modeling, with Perplexity (PPL) as the evaluation metric). ... All models are trained on Alpaca (Taori et al., 2023) with the same configuration. We use lm-evaluation-harness to evaluate the fine-tuned model. ... Therefore, we only test on tasks (ARC-easy, Hellaswag, PIQA, SciQ, LAMBADA) in lm-evaluation-harness.
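For readers comparing numbers across the two tasks: both BPC and PPL are monotone transforms of the model's average negative log-likelihood, so they can be cross-checked with two one-line conversions. The sketch below assumes the NLL is given in nats.

```python
import math

def bpc_from_nats(nll_nats_per_char):
    # Bits-per-character (enwik8 metric) is cross-entropy per character
    # expressed in bits: divide the nats value by ln 2.
    return nll_nats_per_char / math.log(2)

def ppl_from_nats(nll_nats_per_token):
    # Perplexity (WikiText-103 metric) is the exponentiated per-token NLL.
    return math.exp(nll_nats_per_token)

# ln 2 nats/char corresponds to exactly 1 bit/char
one_bit = bpc_from_nats(math.log(2))
# a per-token NLL of ln 20 nats corresponds to a perplexity of 20
ppl20 = ppl_from_nats(math.log(20.0))
```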
Dataset Splits Yes We employ default train-validation-test splits for each dataset.
Hardware Specification Yes Each task is trained on 2 NVIDIA A100 GPUs for about 20 hours. ... These experiments are conducted using the Megablocks (Gale et al., 2023) on 8 NVIDIA A100 GPUs for about 5 days.
Software Dependencies No The paper mentions using the 'Adam optimizer' and 'AdamW optimizer' but does not specify their version numbers. It also refers to 'Megablocks' and 'lm-evaluation-harness' without providing specific software versions for these tools or any other key libraries/frameworks.
Experiment Setup Yes Our model architecture is modified based on the Llama family (Touvron et al., 2023). Specifically, we use a 24-layer model and top-4 gating from 16 experts per layer... For model architecture, our 24-layer model employs Rotary Embedding for positional encoding, SwiGLU for activation functions, and RMSNorm to enhance the model's efficiency and performance. Other model configuration includes a hidden size of 1280, 20 attention heads, an initialization standard deviation of 0.02, a sequence length of 4096, and a maximum positional embedding length of 4096. All dropout rates are set to 0. For the MoE part, we use 16 experts, with each expert having a feedforward network hidden size of 448, following the fine-grained MoE settings, and each token activating 4 experts. We use a tokenizer with a 96512 vocabulary size... For pre-training configurations, we use a global batch size of 1120, a warmup period of 2000 iterations, a learning rate of 4.2e-4, a minimum learning rate of 4.2e-5, cosine learning rate decay, the Adam optimizer with β1 = 0.9 and β2 = 0.95, a weight decay of 0.1, and gradient clipping at 1.0. ... We use bfloat16 (bf16) precision to accelerate training while maintaining numerical stability. The model is trained for 3 epochs using the AdamW optimizer with a global batch size of 128. We set the learning rate to 2e-5 and do not apply weight decay. A warmup ratio of 0.03 is used to gradually increase the learning rate at the beginning of training, and we utilize a cosine learning rate scheduler to adjust it throughout the training process, promoting smoother convergence.
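The quoted pre-training schedule (2000-iteration warmup to 4.2e-4, cosine decay to a 4.2e-5 floor) can be reproduced with a few lines; this is a standard warmup-plus-cosine rule, not code from the paper's repository, and the total step count `TOTAL` is an assumed value since the quote does not state one.

```python
import math

MAX_LR, MIN_LR = 4.2e-4, 4.2e-5   # peak and floor quoted in the setup
WARMUP, TOTAL = 2000, 100_000     # WARMUP is quoted; TOTAL is assumed

def lr_at(step):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR."""
    if step < WARMUP:
        return MAX_LR * (step + 1) / WARMUP
    progress = min((step - WARMUP) / max(1, TOTAL - WARMUP), 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

The same function covers the fine-tuning schedule by swapping in its quoted values (peak 2e-5, warmup ratio 0.03 of total steps); only the constants change.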