Delta Decompression for MoE-based LLMs Compression
Authors: Hao Gu, Wei Li, Lujun Li, Zhu Qiyuan, Mark G. Lee, Shengjie Sun, Wei Xue, Yike Guo
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments highlight the superiority of our approach, with over 13% performance gains over other compressors on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs at 40-60% compression rates. Codes are available at https://github.com/lliai/D2MoE. Our extensive experimental evaluation highlights the exceptional performance of D2-MoE across multiple state-of-the-art MoE language models and a wide range of benchmarks. To provide deeper insights into our method's performance, we also conduct detailed ablation studies on D2-MoE. All experiments are performed on NVIDIA A100 GPUs. |
| Researcher Affiliation | Collaboration | 1Hong Kong University of Science and Technology, 2University of Birmingham, 3AISpeech Co., Ltd. Correspondence to: Mark Lee <EMAIL>, Yike Guo <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: D2-MoE Layer Implementation; Algorithm 2: Base Weight Merge and Delta SVD Decomposition for D2-MoE; Algorithm 3: Semi-Dynamic Pruning on Base Weight for D2-MoE; Algorithm 4: Delta Decomposition for MoE Models; Algorithm 5: Truncation-Aware SVD Compression |
| Open Source Code | Yes | Codes are available at https://github.com/lliai/D2MoE. |
| Open Datasets | Yes | We evaluate our method across 10 datasets, encompassing 3 language modeling datasets (WikiText-2 (Merity et al., 2017), PTB (Marcus et al., 1993), and C4 (Raffel et al., 2020)), along with 7 common sense reasoning datasets (OpenbookQA (Mihaylov et al., 2018), WinoGrande (Sakaguchi et al., 2020), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), MathQA (Amini et al., 2019), ARC-e, and ARC-c (Clark et al., 2018)) in a zero-shot setting using the LM-Evaluation-Harness framework (Gao et al., 2023). |
| Dataset Splits | No | The paper evaluates on well-known benchmark datasets in a zero-shot setting using the LM-Evaluation-Harness framework, but it does not specify train/validation/test split percentages or counts. It states only that "We use 512 random samples from WikiText-2 as calibration data", which describes calibration, not training/evaluation splits. |
| Hardware Specification | Yes | All experiments are performed on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions the use of the LM-Evaluation-Harness framework and provides PyTorch-like pseudocode in the appendix. However, it does not specify version numbers for PyTorch, LM-Evaluation-Harness, or any other software libraries or dependencies. |
| Experiment Setup | Yes | For fair comparisons, we use 512 random samples from WikiText-2 as calibration data to conduct all experiments. We focus on compressing the model without retraining the full model parameters. Table 12 (Hyperparameter Settings for D2-MoE Experiments): pruning ratio for Performance (10% of base weights); pruning ratio for Throughput (60% of base weights); SVD truncation threshold for the Performance-preserving setting (68.05%, 47.34%, 26.62%, 16.26%, 5.93%); SVD truncation threshold for the Throughput-preserving setting (74.30%, 53.58%, 32.86%, 22.54%, 12.18%); static pruning sparsity for Performance (5% of base weights); dynamic pruning sparsity for Performance (5% of base weights); static pruning sparsity for Throughput (30% of base weights); dynamic pruning sparsity for Throughput (30% of base weights); calibration dataset size (512 samples); batch size (128). |
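The core pipeline named by Algorithms 2 and 5 (base weight merge, delta SVD decomposition, truncation-aware rank selection) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frequency-weighted merge and the cumulative-energy truncation criterion are assumptions for the sketch, and `merge_and_decompose`/`reconstruct` are hypothetical helper names. The `energy_keep` default echoes the paper's 68.05% threshold, though the paper's threshold may be defined differently.

```python
import numpy as np

def merge_and_decompose(expert_weights, freqs, energy_keep=0.6805):
    """Merge expert weights into a shared base weight, then compress each
    expert's delta with truncated SVD (a sketch of Algorithms 2 and 5).

    Assumptions: the base is an activation-frequency-weighted average of
    the experts, and the SVD rank is the smallest one whose cumulative
    singular-value energy reaches `energy_keep`.
    """
    freqs = np.asarray(freqs, dtype=float)
    freqs /= freqs.sum()
    # Base weight shared by all experts.
    base = sum(f * W for f, W in zip(freqs, expert_weights))
    compressed = []
    for W in expert_weights:
        delta = W - base
        U, S, Vt = np.linalg.svd(delta, full_matrices=False)
        # Truncation-aware rank selection by cumulative energy.
        energy = np.cumsum(S**2) / np.sum(S**2)
        rank = int(np.searchsorted(energy, energy_keep)) + 1
        # Store the low-rank factors (U scaled by S, and V^T).
        compressed.append((U[:, :rank] * S[:rank], Vt[:rank]))
    return base, compressed

def reconstruct(base, factor):
    """Recover one expert's weight from the base plus its low-rank delta."""
    US, Vt = factor
    return base + US @ Vt
```

With `energy_keep` near 1.0 the reconstruction is essentially exact; lowering it trades expert fidelity for storage, since each delta is stored as two thin factors instead of a full matrix.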
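The semi-dynamic pruning of Algorithm 3, with its separate static and dynamic sparsity budgets (5%/5% for Performance, 30%/30% for Throughput), can likewise be sketched. The criteria below are assumptions: magnitude pruning for the one-time static pass and input-activation magnitude for the runtime dynamic pass; the paper may use different scoring rules.

```python
import numpy as np

def static_prune(W, sparsity=0.05):
    """One-time static pass: zero the smallest-magnitude entries of the
    base weight. The 5% default mirrors the Performance setting; the
    magnitude criterion is an assumption for illustration."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    # Threshold at the k-th smallest absolute value.
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) > thresh, W, 0.0)

def dynamic_prune_matmul(x, W, sparsity=0.05):
    """Runtime dynamic pass: drop the input activations smallest in
    magnitude, so the matching rows of W contribute nothing to x @ W."""
    k = int(x.size * sparsity)
    if k:
        idx = np.argsort(np.abs(x))[:k]
        x = x.copy()
        x[idx] = 0.0  # pruned activations are skipped
    return x @ W
```

The static mask is computed once per model, while the dynamic mask changes with every input, which is why the two sparsity budgets are reported separately in Table 12.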