MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition

Authors: Wei Li, Lujun Li, Hao Gu, You-Liang Huang, Mark G. Lee, Shengjie Sun, Wei Xue, Yike Guo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive experiments on Mixtral, Phi-3.5, DeepSeek, and Qwen2-MoE LLMs show MoE-SVD outperforms other compression methods, achieving a 60% compression ratio and 1.5× faster inference with minimal performance loss.
Researcher Affiliation | Collaboration | 1University of Birmingham 2The Hong Kong University of Science and Technology 3The Hong Kong University of Science and Technology (Guangzhou) 4AISpeech Co., Ltd. Correspondence to: Mark Lee <EMAIL>, Yike Guo <EMAIL>.
Pseudocode | Yes | C. Pseudocode In our experimental implementation, we present a detailed algorithmic procedure for compressing MoE-based large language models using the proposed MoE-SVD method. Algorithm 4 outlines the main steps of this approach. The process begins by collecting scaling matrices through forward hooks during inference, as shown in Algorithm 1 (Step 1). This step is crucial for capturing activation patterns and computing the sensitivity metric for each expert. Subsequently, we perform singular value decomposition (SVD) on the scaled weight matrices, followed by truncation for effective compression, as detailed in Algorithm 2 (Step 2). Our method introduces a V-matrix sharing mechanism, where the most frequently used V-matrix is selected and shared among all experts, as described in Algorithm 3 (Step 3). Additionally, we employ U-matrix trimming by retaining the top-k U-matrices based on expert sampling frequencies to refine the expert functions (Step 4). To ensure numerical stability, we apply the adjustment function provided in Algorithm 5, which modifies matrices to be positive definite when necessary.
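The decomposition-and-sharing pipeline quoted above (activation scaling, SVD truncation, V-matrix sharing, and top-k U-matrix trimming) can be sketched in NumPy. This is a minimal illustration under assumed shapes and an assumed invertible scaling matrix `S`; the helper names and the way routing frequencies are supplied are hypothetical, not the paper's implementation.

```python
import numpy as np

def scaled_svd_truncate(W, S, rank):
    """Step 2: SVD on the activation-scaled weight, then truncate.
    W: (out, in) weight; S: (in, in) scaling matrix (assumed invertible)."""
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    U_r = U[:, :rank] * sigma[:rank]        # fold singular values into U
    V_r = Vt[:rank, :] @ np.linalg.inv(S)   # undo the scaling on the V side
    return U_r, V_r                          # low-rank factors: W ~= U_r @ V_r

def share_and_trim(expert_weights, freqs, S, rank, k=2):
    """Steps 3-4: decompose every expert, share the V-matrix of the most
    frequently routed expert, and keep only the top-k U-matrices."""
    decomposed = [scaled_svd_truncate(W, S, rank) for W in expert_weights]
    shared_V = decomposed[int(np.argmax(freqs))][1]   # Step 3: V-sharing
    top_k = np.argsort(freqs)[::-1][:k]               # Step 4: trim U-matrices
    kept_U = {int(i): decomposed[int(i)][0] for i in top_k}
    return kept_U, shared_V
```

With `rank` equal to the full rank, `U_r @ V_r` reconstructs `W` exactly; compression comes from choosing a smaller rank and from the shared V / trimmed U factors.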
Open Source Code | No | We affirm the solid reproducibility of our results and provide specific code implementations in the appendix. Our main experiments represent average outcomes from multiple repetitions, ensuring reliability. MoE LLMs, being very large models, exhibit relatively small variances in experimental results and evaluations. To further demonstrate the robustness and repeatability of our method, we present detailed results for different initial seeds, showcasing consistent performance across various conditions.
Open Datasets | Yes | We evaluate our method across 10 datasets, encompassing 3 language modeling datasets (WikiText-2 (Merity et al., 2017), PTB (Marcus et al., 1993), and C4 (Raffel et al., 2020)), along with 7 common sense reasoning datasets (OpenbookQA (Mihaylov et al., 2018), WinoGrande (Sakaguchi et al., 2020), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), MathQA (Amini et al., 2019), ARC-e, and ARC-c (Clark et al., 2018)) in a zero-shot setting using the LM-Evaluation-Harness framework (Gao et al., 2023).
Dataset Splits | Yes | For fair comparisons, we followed the same settings as ASVD and SVD-LLM and used 256 random samples from WikiText-2 as calibration data. We focus on compressing the model without retraining the full model parameters.
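The calibration step quoted above amounts to drawing a fixed number of random samples from the corpus. A minimal sketch with the standard library; corpus loading and tokenization are out of scope here, and `sample_calibration` is a hypothetical helper, not the authors' code:

```python
import random

def sample_calibration(corpus_texts, n_samples=256, seed=0):
    """Draw n_samples distinct random calibration samples (the paper
    uses 256 from WikiText-2); seeding keeps the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(corpus_texts, n_samples)
```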
Hardware Specification | Yes | All experiments are conducted on NVIDIA H800 GPUs. [...] We evaluate performance on NVIDIA A100 GPUs, which feature 600 GB/s NVLink bandwidth and 19.5 TFLOPS peak FP32 performance.
Software Dependencies | No | The pseudocode in Appendix C is written in Python using PyTorch, but no specific version numbers for PyTorch or other libraries are provided in the paper.
Experiment Setup | Yes | For fair comparisons, we followed the same settings as ASVD and SVD-LLM and used 256 random samples from WikiText-2 as calibration data. We focus on compressing the model without retraining the full model parameters. [...] To apply selective decomposition, we set a threshold τ based on the desired compression ratio. [...] Furthermore, with U-matrix selection, we typically select k = 2 U-matrices, reducing the parameter count by a factor of N/2 for the U-matrices.
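The parameter arithmetic behind the quoted setup can be made concrete: truncating an m×n weight to rank r stores r(m+n) values instead of mn, so a target compression ratio bounds the rank, and V-sharing plus keeping only k = 2 U-matrices shrinks the per-layer expert cost further. The helpers below are illustrative arithmetic under that counting assumption, not the paper's τ-based selection rule.

```python
def truncation_rank(m, n, keep_ratio):
    """Largest rank r with r*(m+n) <= keep_ratio * m*n stored parameters.
    (Illustrative; the paper selects layers via a threshold tau instead.)"""
    return max(1, int(keep_ratio * m * n / (m + n)))

def shared_expert_params(m, n, rank, k=2):
    """Per-layer expert parameters after V-sharing (one rank-by-n V-matrix
    for all experts) and retaining the top-k rank-trimmed U-matrices."""
    return k * rank * m + rank * n
```

For example, a 4096×14336 expert weight kept at 40% of its parameters admits rank ⌊0.4·4096·14336/(4096+14336)⌋ = 1274 under this counting.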