MoDeGPT: Modular Decomposition for Large Language Model Compression
Authors: Chi-Heng Lin, Shangqian Gao, James Smith, Abhishek Patel, Shikhar Tuli, Yilin Shen, Hongxia Jin, Yen-Chang Hsu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that MoDeGPT, without relying on backward propagation, consistently matches or surpasses the performance of prior techniques that depend on gradient information, while achieving a 98% reduction in compute costs when compressing a 13B-parameter model. On LLaMA-2/3 and OPT models, MoDeGPT retains 90-95% of zero-shot performance with compression rates of 25-30%. We present a thorough evaluation of MoDeGPT, comparing it against existing methods across key metrics, including perplexity, downstream accuracy, and real-world speed improvements. The paper includes a dedicated "4 EXPERIMENTS" section detailing empirical results, comparisons, and ablation studies across various models and tasks. |
| Researcher Affiliation | Collaboration | Chi-Heng Lin, Samsung Research America; Shangqian Gao, Florida State University; James Seale Smith, Samsung Research America; Abhishek Patel, Samsung Research America; Shikhar Tuli, Samsung Research America; Yilin Shen, Samsung Research America; Hongxia Jin, Samsung Research America; Yen-Chang Hsu, Samsung Research America. The affiliations include both Samsung Research America (industry) and Florida State University (academia). |
| Pseudocode | Yes | Algorithm 1 Type-I compression for MLP by Nyström approximation. Algorithm 2 Type-II compression for key-query matrices by CR decomposition. Algorithm 3 Type-III compression for value-output matrices by SVD. |
| Open Source Code | No | The paper states: "We implemented our models using Hugging Face Transformers (Wolf et al., 2019), with correlation computations in FP64." and "We utilize the Hugging Face generation library (Wolf et al., 2019) to implement our LLM models and adapt the SliceGPT (Ashkboos et al., 2024) GitHub repository for correlation matrix estimations." This indicates the use of existing open-source tools and an adaptation of a third-party repository, but there is no explicit statement from the authors about releasing their own implementation of MoDeGPT. |
| Open Datasets | Yes | Following calibration setups similar to prior studies (Frantar et al., 2022; Ashkboos et al., 2024; Dettmers et al., 2023), we employed the WikiText-2 (Merity et al., 2016) and Alpaca datasets (Taori et al., 2023), each comprising 128 samples of 2048 characters. Zero-shot performance was evaluated using the LM Evaluation Harness (Gao et al., 2021), with task details provided in Appendix B.2. |
| Dataset Splits | Yes | Following calibration setups similar to prior studies (Frantar et al., 2022; Ashkboos et al., 2024; Dettmers et al., 2023), we employed the WikiText-2 (Merity et al., 2016) and Alpaca datasets (Taori et al., 2023), each comprising 128 samples of 2048 characters. We use a calibration set of 128 random samples, each 2048 in length, from the Alpaca dataset, and a recovery fine-tuning set of 8000 samples, each 1024 in length, employing LoRA (Hu et al., 2021). |
| Hardware Specification | Yes | Model compression and performance testing were conducted on a single NVIDIA A100 80GB GPU, except for the 70B model, for which we used 8 A100 GPUs. The throughput benchmarks in Appendix B.16 also mention: "Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50 GHz with 20 cores". |
| Software Dependencies | No | The paper mentions: "We implemented our models using Hugging Face Transformers (Wolf et al., 2019), with correlation computations in FP64." and "We utilize torch.svd and torch.pinv in PyTorch for performing Singular Value Decomposition (SVD) and computing the Moore-Penrose inverse on tensors of dtype FP64." While software packages like Hugging Face Transformers and PyTorch are mentioned, specific version numbers for these libraries are not provided. |
| Experiment Setup | Yes | Unless otherwise specified, the calibration set consists of a random sample of 128 sequences, each of length 2048, from WikiText-2... MLP Module Algorithm 1 requires a ridge leverage score parameter λ. We find that the results are largely insensitive to this parameter; therefore, we simply use λ = 1 across all experiments. We use SliceGPT's hyperparameters for LoRA, except for the learning rate, which is set to 5×10⁻⁵. The other primary hyperparameters used are lora_alpha = 10, lora_r = 32, lora_dropout = 0.05, and batch_size = 3. |
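The Pseudocode row names three decomposition-based algorithms, the third of which compresses value-output matrices via truncated SVD. As a minimal, self-contained illustration of that idea (not the authors' implementation, which uses `torch.svd` in FP64), the sketch below computes a rank-1 SVD approximation in pure Python via power iteration; the function names are hypothetical.

```python
import math
import random

def rank1_svd(A, iters=200, seed=0):
    """Approximate the best rank-1 factorization sigma * u * v^T of a
    matrix A (list of row lists) by power iteration on A^T A.
    Illustrative stdlib-only sketch; real low-rank compression would
    use a full SVD routine and keep the top-k singular triplets."""
    m, n = len(A), len(A[0])
    rng = random.Random(seed)
    v = [rng.random() for _ in range(n)]
    for _ in range(iters):
        # w = A v  (matrix-vector product)
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
        # z = A^T w, then renormalize to keep iterating toward the
        # dominant right singular vector
        z = [sum(A[i][j] * w[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]
    # sigma = ||A v||, u = A v / sigma
    Av = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

def rank1_approx(A):
    """Reconstruct the rank-1 approximation sigma * u * v^T of A."""
    sigma, u, v = rank1_svd(A)
    return [[sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]

# A rank-1 matrix is reproduced (up to numerical error) by its
# rank-1 SVD approximation.
A = [[2.0, 4.0], [1.0, 2.0]]
A1 = rank1_approx(A)
```

Keeping only the top-k singular triplets of a weight matrix in this way is the standard SVD truncation that Algorithm 3 (Type-III compression) builds on.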