Delta Decompression for MoE-based LLMs Compression

Authors: Hao Gu, Wei Li, Lujun Li, Zhu Qiyuan, Mark G. Lee, Shengjie Sun, Wei Xue, Yike Guo

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental Extensive experiments highlight the superiority of our approach, with over 13% performance gains over other compressors on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs at 40-60% compression rates. Codes are available at https://github.com/lliai/D2MoE. Our extensive experimental evaluation highlights the exceptional performance of D2-MoE across multiple state-of-the-art MoE language models and a wide range of benchmarks. To provide deeper insights into our method's performance, we also conduct detailed ablation studies on D2-MoE. All experiments are performed on NVIDIA A100 GPUs.
Researcher Affiliation Collaboration 1Hong Kong University of Science and Technology 2University of Birmingham 3AISpeech Co., Ltd. Correspondence to: Mark Lee <EMAIL>, Yike Guo <EMAIL>.
Pseudocode Yes Algorithm 1: D2MoE Layer Implementation; Algorithm 2: Base Weight Merge and Delta SVD Decomposition for D2MoE; Algorithm 3: Semi-Dynamic Pruning on Base Weight for D2MoE; Algorithm 4: Delta Decomposition for MoE Models; Algorithm 5: Truncation-Aware SVD Compression.
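The base-merge-plus-delta-SVD idea behind Algorithms 2 and 4 can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: it merges experts with a plain average (the paper uses a weighted merge) and keeps a fixed-rank SVD of each expert's delta, whereas D2-MoE selects ranks with truncation-aware thresholds. The function names `delta_decompose` and `reconstruct` are illustrative, not from the paper.

```python
import numpy as np

def delta_decompose(expert_weights, rank):
    """Merge MoE expert weights into a shared base and keep a low-rank
    factorization of each expert's delta from that base.

    expert_weights: array of shape (num_experts, d_out, d_in)
    Returns the base matrix and a list of (A, B) factor pairs such that
    expert_i ~= base + A_i @ B_i.
    """
    # Simple average merge; the paper weights experts (e.g. by importance).
    base = np.mean(expert_weights, axis=0)
    deltas = []
    for w in expert_weights:
        u, s, vt = np.linalg.svd(w - base, full_matrices=False)
        # Truncate to `rank`; fold singular values into the left factor.
        deltas.append((u[:, :rank] * s[:rank], vt[:rank]))
    return base, deltas

def reconstruct(base, delta):
    """Rebuild one expert's (approximate) weight from base + low-rank delta."""
    a, b = delta
    return base + a @ b
```

With `rank` equal to the full matrix rank the reconstruction is exact; compression comes from choosing a much smaller rank, so each expert stores only its thin delta factors on top of the shared base.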
Open Source Code Yes Codes are available at https://github.com/lliai/D2MoE.
Open Datasets Yes We evaluate our method across 10 datasets, encompassing 3 language modeling datasets (WikiText-2 (Merity et al., 2017), PTB (Marcus et al., 1993), and C4 (Raffel et al., 2020)), along with 7 common sense reasoning datasets (OpenbookQA (Mihaylov et al., 2018), WinoGrande (Sakaguchi et al., 2020), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), MathQA (Amini et al., 2019), ARC-e, and ARC-c (Clark et al., 2018)) in a zero-shot setting using the LM-Evaluation-Harness framework (Gao et al., 2023).
Dataset Splits No The paper evaluates on well-known benchmark datasets in a zero-shot setting via the LM-Evaluation-Harness framework, but does not explicitly report train/test/validation split percentages or counts for these datasets. It only states that "We use 512 random samples from WikiText-2 as calibration data" for calibration, not for model training/evaluation splits.
Hardware Specification Yes All experiments are performed on NVIDIA A100 GPUs.
Software Dependencies No The paper mentions the use of the LM-Evaluation-Harness framework and provides PyTorch-like pseudocode in the appendix. However, it does not specify version numbers for PyTorch, LM-Evaluation-Harness, or any other software libraries or dependencies.
Experiment Setup Yes For fair comparisons, we use 512 random samples from WikiText-2 as calibration data to conduct all experiments. We focus on compressing the model without retraining the full model parameters.
Table 12. Hyperparameter Settings for D2-MoE Experiments:
- Pruning ratio (Performance): 10% of Base weights
- Pruning ratio (Throughput): 60% of Base weights
- SVD truncation threshold (Performance preserve): 68.05%, 47.34%, 26.62%, 16.26%, 5.93%
- SVD truncation threshold (Throughput preserve): 74.30%, 53.58%, 32.86%, 22.54%, 12.18%
- Static pruning sparsity (Performance): 5% of Base weights
- Dynamic pruning sparsity (Performance): 5% of Base weights
- Static pruning sparsity (Throughput): 30% of Base weights
- Dynamic pruning sparsity (Throughput): 30% of Base weights
- Calibration dataset size: 512 samples
- Batch size: 128
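One common way to turn an SVD truncation threshold like those in Table 12 into a concrete rank is to keep the smallest number of singular values whose squared magnitudes retain a target fraction of total spectral energy. The sketch below shows that generic criterion only; how D2-MoE's truncation-aware thresholds map to ranks per layer is not specified in this summary, and `rank_for_energy` is an illustrative name, not the paper's API.

```python
import numpy as np

def rank_for_energy(singular_values, energy_fraction):
    """Smallest rank whose leading squared singular values retain at
    least `energy_fraction` of the total energy.

    singular_values: 1-D array sorted in descending order (as returned
    by np.linalg.svd); energy_fraction: target in (0, 1].
    """
    energy = np.cumsum(singular_values ** 2)
    energy /= energy[-1]  # normalize cumulative energy to [0, 1]
    # First index where cumulative energy reaches the target, 1-based rank.
    return int(np.searchsorted(energy, energy_fraction) + 1)
```

For example, for singular values (3, 2, 1) the squared energies are 9, 4, 1 out of 14, so a 90% energy target needs the first two components, while a 100% target keeps all three.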