DLP: Dynamic Layerwise Pruning in Large Language Models
Authors: Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). |
| Researcher Affiliation | Academia | 1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China. 2. Hong Kong University of Science and Technology, Hong Kong, China. Correspondence to: Bo Cheng <EMAIL>, Jiale Han <EMAIL>. |
| Pseudocode | Yes | The pseudocode of DLP is provided in Algorithm 1. |
| Open Source Code | Yes | We release the code to facilitate future research. The code is available at: https://github.com/ironartisan/DLP. |
| Open Datasets | Yes | Specifically, we measure language modeling performance using the perplexity metric on the WikiText (Merity et al., 2017), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020) validation datasets. |
| Dataset Splits | Yes | Specifically, we measure language modeling performance using the perplexity metric on the WikiText (Merity et al., 2017), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020) validation datasets. For zero-shot evaluation, we assess accuracy on seven commonsense benchmarks from the EleutherAI LM Harness (Gao et al., 2024), including BoolQ (Clark et al., 2019), RTE (Wang et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2020), ARC Easy and Challenge (Boratko et al., 2018), and OpenBookQA (Mihaylov et al., 2018). ... We fine-tune the LLaMA1-7B and LLaMA1-13B models pruned using SparseGPT on the C4 training dataset. |
| Hardware Specification | Yes | In our experimental setup, we utilize four NVIDIA A40 GPUs, each with 48 GB of memory. ... test its end-to-end decoding latency using the DeepSparse (Kurtic et al., 2023) inference engine on an Intel(R) Xeon(R) Gold 6248R CPU equipped with 24 cores. |
| Software Dependencies | No | The paper mentions the 'Hugging Face Transformers library' but does not specify a version number or other key software components with their versions. |
| Experiment Setup | Yes | In Appendix G, we present the hyperparameter configurations for various sparsity levels. ... The quantization bit widths are set to 3, 4, 8, and 16. ... During fine-tuning, the pruning mask remains fixed, and the pretraining autoregressive loss is utilized. |
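The excerpts above describe pruning LLMs to a target sparsity (e.g. 70%) and then fine-tuning with the pruning mask held fixed. As a minimal illustration of that setup, the sketch below implements plain magnitude pruning at a per-layer sparsity level; the scoring function and names here are hypothetical simplifications, not the authors' DLP criterion, which assigns sparsity dynamically per layer.

```python
import numpy as np

def magnitude_prune(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a binary mask that zeroes the `sparsity` fraction of
    smallest-magnitude weights (a simplified stand-in for DLP's
    importance-based layerwise criterion)."""
    k = int(round(sparsity * weight.size))
    if k == 0:
        return np.ones_like(weight)
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    return (np.abs(weight) > threshold).astype(weight.dtype)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))          # stand-in for one layer's weights
mask = magnitude_prune(w, 0.7)             # 70% sparsity, as in the headline result
w_pruned = w * mask                        # mask stays fixed during fine-tuning
```

During the fine-tuning stage the paper describes, gradients for masked-out weights would likewise be multiplied by `mask` each step, so pruned positions remain zero while the surviving weights adapt under the pretraining autoregressive loss.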