DLP: Dynamic Layerwise Pruning in Large Language Models
Authors: Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that DLP effectively preserves model performance at high sparsity levels across multiple LLMs. Specifically, at 70% sparsity, DLP reduces the perplexity of LLaMA2-7B by 7.79 and improves the average accuracy by 2.7% compared to state-of-the-art methods. Moreover, DLP is compatible with various existing LLM compression techniques and can be seamlessly integrated into Parameter-Efficient Fine-Tuning (PEFT). |
| Researcher Affiliation | Academia | 1. State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China. 2. Hong Kong University of Science and Technology, Hong Kong, China. Correspondence to: Bo Cheng <EMAIL>, Jiale Han <EMAIL>. |
| Pseudocode | Yes | The pseudocode of DLP is provided in Algorithm 1. |
| Open Source Code | Yes | We release the code to facilitate future research. The code is available at: https://github.com/ironartisan/DLP. |
| Open Datasets | Yes | Specifically, we measure language modeling performance using the perplexity metric on the WikiText (Merity et al., 2017), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020) validation datasets. |
| Dataset Splits | Yes | Specifically, we measure language modeling performance using the perplexity metric on the WikiText (Merity et al., 2017), PTB (Marcus et al., 1994), and C4 (Raffel et al., 2020) validation datasets. For zero-shot evaluation, we assess accuracy on seven commonsense benchmarks from the EleutherAI LM Harness (Gao et al., 2024), including BoolQ (Clark et al., 2019), RTE (Wang et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2020), ARC Easy and Challenge (Boratko et al., 2018), and OpenBookQA (Mihaylov et al., 2018). ... We fine-tune the LLaMA1-7B and LLaMA1-13B models pruned using SparseGPT on the C4 training dataset. |
| Hardware Specification | Yes | In our experimental setup, we utilize four NVIDIA A40 GPUs, each with 48 GB of memory. ... test its end-to-end decoding latency using the DeepSparse (Kurtic et al., 2023) inference engine on an Intel(R) Xeon(R) Gold 6248R CPU equipped with 24 cores. |
| Software Dependencies | No | The paper mentions the 'Hugging Face Transformers library' but does not specify a version number or other key software components with their versions. |
| Experiment Setup | Yes | In Appendix G, we present the hyperparameter configurations for various sparsity levels. ... The quantization bit widths are set to 3, 4, 8, and 16. ... During fine-tuning, the pruning mask remains fixed, and the pretraining autoregressive loss is utilized. |
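The excerpts above describe pruning LLMs to a target sparsity (e.g. 70%) and then fine-tuning with the pruning mask held fixed. As a minimal illustration of that setup, the sketch below implements plain magnitude pruning at a per-layer sparsity level; the scoring function and names here are hypothetical simplifications, not the authors' DLP criterion, which assigns sparsity dynamically per layer.

```python
import numpy as np

def magnitude_prune(weight: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a binary mask that zeroes the `sparsity` fraction of
    smallest-magnitude weights (a simplified stand-in for DLP's
    importance-based layerwise criterion)."""
    k = int(round(sparsity * weight.size))
    if k == 0:
        return np.ones_like(weight)
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    return (np.abs(weight) > threshold).astype(weight.dtype)

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))          # stand-in for one layer's weights
mask = magnitude_prune(w, 0.7)             # 70% sparsity, as in the headline result
w_pruned = w * mask                        # mask stays fixed during fine-tuning
```

During the fine-tuning stage the paper describes, gradients for masked-out weights would likewise be multiplied by `mask` each step, so pruned positions remain zero while the surviving weights adapt under the pretraining autoregressive loss.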