Ada-K Routing: Boosting the Efficiency of MoE-based LLMs

Authors: Zijia Zhao, Longteng Guo, Jie Cheng, Xuange Gao, Hua Huang, Jing Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluations on four popular baseline models demonstrate that our Ada-K routing method significantly outperforms conventional Top-K routing. Compared to Top-K, our method achieves over 25% reduction in FLOPs and more than 20% inference speedup while still improving performance across various benchmarks.
Researcher Affiliation | Academia | 1 Institute of Automation, Chinese Academy of Sciences; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 School of Artificial Intelligence, Beijing Normal University
Pseudocode | No | The paper describes the method using mathematical formulas and textual explanations in Sections 3.2 and 3.3, but no explicitly labeled 'Pseudocode' or 'Algorithm' block is provided.
Open Source Code | No | The code and checkpoints will be released at https://github.com/ivattyue/Ada-K.
Open Datasets | Yes | Following previous works (Touvron et al., 2023b; Le Scao et al., 2023; Li et al., 2023; Black et al., 2022), we employ the lm-evaluation-harness (Gao et al., 2021) to evaluate our model. This tool serves as the backend for the Hugging Face Open LLM Leaderboard (Beeching et al., 2023). Our model is assessed on 6 key benchmarks aligned with Open LLM Leaderboard. [...] These tasks include AI2 Reasoning Challenge (ARC-C) (Clark et al., 2018), HellaSwag (Hella) (Zellers et al., 2019), MMLU (Hendrycks et al., 2020), TruthfulQA (Truth) (Lin et al., 2021), Winogrande (Wino) (Sakaguchi et al., 2021) and GSM8K (GSM) (Cobbe et al., 2021).
Dataset Splits | Yes | Benchmark and Evaluation Details. Following previous works (Touvron et al., 2023b; Le Scao et al., 2023; Li et al., 2023; Black et al., 2022), we employ the lm-evaluation-harness (Gao et al., 2021) to evaluate our model. This tool serves as the backend for the Hugging Face Open LLM Leaderboard (Beeching et al., 2023). Our model is assessed on 6 key benchmarks aligned with Open LLM Leaderboard. [...] Table 10: Details of benchmarks. We follow the setting of Hugging Face Open LLM Leaderboard. Benchmark: ARC-C (Clark et al., 2018); #shots: 25; #Samples: 2.59k; Details: A set of grade-school science questions.
Hardware Specification | Yes | We employ 16 NVIDIA A800 GPUs to train Mixtral 8x22B, whereas each of the other three utilizes 8 NVIDIA A800 GPUs.
Software Dependencies | No | The paper mentions using "AdamW" as an optimizer and "bf16" precision, but it does not specify version numbers for any software libraries, programming languages, or other key software components.
Experiment Setup | Yes | Training Details. We adopt AdamW (Loshchilov & Hutter, 2017) as the optimizer. All baseline models are trained for one epoch using a consistent set of 10k samples. The batch size and learning rate are set to 64 and 1e-3, respectively. We leverage 2 PPO epochs for reinforcement learning. For all four baseline models, we uniformly set λ as 3e-3. [...] Table 11: Additional training details (Configuration | Fine-tuning Warm-Start | PPO): Optimizer: AdamW / AdamW; Base LR: 1e-3 / 1e-3; Precision: bf16 / bf16; Weight Decay: 0.1 / 0.1; Batch Size: 64 / 64; LR Decay Schedule: cosine / constant; Gradient Checkpoint: True / True; Training Epochs: 1 / 1; Max Length: 2048 / 2048; Threshold p: 0.3; Regularization Coef: 3e-3; PPO Epoch: 2.
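The Top-K versus Ada-K contrast summarized in the table can be sketched in a few lines. This is a minimal illustration, not the authors' released implementation: the allocator here is a stand-in for the learned policy network the paper trains with PPO, and the only detail borrowed from the report is the threshold value 0.3 (the "Threshold p" row of Table 11). FLOPs spent in the expert layers scale with the number of activated experts, which is why lowering the average K yields the reported compute savings.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_routing(gate_logits, k):
    """Conventional Top-K: every token activates exactly k experts,
    chosen by the highest gating probabilities."""
    probs = softmax(gate_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return ranked[:k]

def ada_k_routing(gate_logits, allocator_probs, threshold=0.3):
    """Sketch of adaptive-K: a per-token allocator decides how many
    experts this token needs (here: the count of allocator outputs above
    a threshold, with at least one expert always kept). The real
    allocator in the paper is a small policy network trained with PPO."""
    k = max(1, sum(p > threshold for p in allocator_probs))
    return top_k_routing(gate_logits, k)
```

For a batch of tokens, averaging the per-token `k` gives the effective expert load, so an average of 1.5 activated experts versus a fixed Top-2 already corresponds to roughly a 25% cut in expert-layer FLOPs.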
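The evaluation rows above rely on lm-evaluation-harness, which scores multiple-choice benchmarks such as ARC-C by building a few-shot prompt (25 shots per Table 10) and picking the candidate answer the model assigns the highest loglikelihood. The sketch below mimics that flow with toy helpers; the prompt template is illustrative only and is not the harness's actual format.

```python
def build_fewshot_prompt(train_examples, question, n_shots=25):
    """Prepend up to n_shots solved examples before the target question,
    in the spirit of the Open LLM Leaderboard's 25-shot ARC-C setup.
    The exact "Question:/Answer:" template is an assumption for
    illustration, not the harness's real template."""
    shots = train_examples[:n_shots]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

def pick_answer(loglikelihoods):
    """Multiple-choice scoring: the candidate continuation with the
    highest model loglikelihood is taken as the prediction."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])

def accuracy(predictions, golds):
    """Fraction of questions where the prediction matches the gold label."""
    return sum(p == g for p, g in zip(predictions, golds)) / len(golds)
```

In the real harness the loglikelihoods come from the model under test; here they would simply be supplied as floats, one per answer option.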
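Table 11 lists a cosine LR decay schedule for the warm-start fine-tuning stage and a constant schedule for the PPO stage, both from a base LR of 1e-3. A minimal sketch of those two schedules, assuming decay to zero with no warmup or LR floor (the report does not specify either):

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-3, schedule="cosine"):
    """LR schedules matching Table 11: cosine decay for the warm-start
    fine-tuning stage, constant for the PPO stage. Warmup and a minimum
    LR are omitted because the report does not mention them."""
    if schedule == "constant":
        return base_lr
    # Cosine decay from base_lr down to 0 over total_steps.
    progress = min(step / total_steps, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With one training epoch, `total_steps` would be the number of optimizer steps in that epoch (dataset size 10k over batch size 64).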