Let LLM Tell What to Prune and How Much to Prune

Authors: Mingzhe Yang, Sihao Lin, Changlin Li, Xiaojun Chang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple benchmarks and LLM variants demonstrate that our method effectively balances the trade-off between efficiency and performance. ... 4. Experiment 4.1. Experimental Setup 4.2. Pruning Results on LLMs 4.3. Inference Speed 4.4. Baseline Bias Compensation 4.5. Robustness to Calibration Samples 4.6. Dependency on Calibration Dataset 4.7. Ablation Studies
Researcher Affiliation | Academia | 1University of Science and Technology of China, 2RMIT University, 3Stanford University. Correspondence to: Xiaojun Chang <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Block-wise pruning ratio assignment). Input: an original LLM and the target pruning ratio p_target. Output: the pruning ratio for each block.
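The report only names Algorithm 1; as a hedged illustration of what a block-wise ratio assignment can look like, the sketch below allocates larger pruning ratios to less important blocks while keeping the mean ratio at the target. The function name, the inverse-importance weighting, and the clipping bound are assumptions for illustration, not the paper's actual Algorithm 1.

```python
from typing import List

def assign_block_ratios(importance: List[float], p_target: float,
                        max_ratio: float = 0.95) -> List[float]:
    """Toy block-wise ratio assignment: blocks with lower importance
    scores receive larger pruning ratios, and the ratios average to
    p_target (up to the effect of clipping at max_ratio)."""
    n = len(importance)
    total = sum(importance)
    # Inverse-importance weights: a less important block gets pruned more.
    weights = [1.0 - s / total for s in importance]
    # Scale so the mean ratio equals the target before clipping.
    scale = p_target * n / sum(weights)
    return [min(max_ratio, scale * w) for w in weights]
```

With importance scores [4, 3, 2, 1] and a 40% target, the least important block receives the largest per-block ratio while the average stays at 0.4.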
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository. It mentions using existing libraries such as Hugging Face Transformers, but not an implementation of its own.
Open Datasets | Yes | For calibration data, we randomly select 128 samples from the WikiText2 training dataset (Merity et al., 2016). ... Furthermore, to validate the efficacy of our method, we report the zero-shot accuracy on five benchmark datasets: PIQA (Bisk et al., 2020); WinoGrande (Sakaguchi et al., 2021); HellaSwag (Zellers et al., 2019); ARC-e and ARC-c (Clark et al., 2018). ... We use the C4 and WikiText2 training datasets as different calibration datasets.
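The calibration protocol (128 random samples drawn from the WikiText2 training split) can be sketched generically; the seeded sampler below is an illustration, not the paper's code, and `sample_calibration` is a hypothetical helper name.

```python
import random

def sample_calibration(corpus, n_samples=128, seed=0):
    """Draw a fixed-size random calibration set without replacement,
    mirroring the '128 samples from the training set' setup."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(corpus, n_samples)
```

Fixing the seed matters for reproducibility reports like this one: the same 128 documents are selected on every run.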
Dataset Splits | No | For calibration data, we randomly select 128 samples from the WikiText2 training dataset (Merity et al., 2016). ... we utilize the WikiText2 test dataset to evaluate the perplexity of the model after pruning with different pruning methods. ... We report the zero-shot accuracy on five benchmark datasets: PIQA; WinoGrande; HellaSwag; ARC-e and ARC-c. While the calibration data size and evaluation datasets are mentioned, explicit train/validation/test splits (e.g., percentages or sample counts beyond the standard test sets) are not provided for reproducibility.
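For reference, the perplexity metric used on the WikiText2 test set is the exponential of the mean per-token negative log-likelihood; the helper below is a generic sketch of that formula, not the paper's evaluation harness.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    token_nlls: per-token NLL values (natural log) over the test set."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

A model assigning every token probability 1/10 (NLL = ln 10) therefore has perplexity exactly 10.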
Hardware Specification | Yes | All experiments are conducted on NVIDIA A800 GPUs with 80GB memory.
Software Dependencies | No | "all of which are available through the Hugging Face Transformers library (Wolf, 2019)." A specific library is mentioned, but no version number is provided for it.
Experiment Setup | Yes | For calibration data, we randomly select 128 samples from the WikiText2 training dataset (Merity et al., 2016). ... We begin by presenting the WikiText2 performance results for the LLaMA2, LLaMA3, and Vicuna models at three distinct pruning ratios (30%, 40%, and 50%). ... We provide a detailed comparison of the inference speed for generating sequences of length 128 (batch size of 1) under different pruning methods. ... Algorithm 1 Block-wise pruning ratio assignment. Input: an original LLM, target pruning ratio p_target.
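A tokens-per-second measurement for the length-128, batch-size-1 generation setting described above could be taken as sketched below; `generate_fn` is a placeholder for an actual model generation call, and the warm-up/run counts are assumptions, not the paper's protocol.

```python
import time

def measure_tokens_per_second(generate_fn, seq_len=128, warmup=1, runs=3):
    """Time generation of seq_len tokens (batch size 1) and return the
    mean tokens/second over several timed runs."""
    for _ in range(warmup):        # warm-up runs amortize one-off costs
        generate_fn(seq_len)
    speeds = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(seq_len)
        elapsed = time.perf_counter() - start
        speeds.append(seq_len / elapsed)
    return sum(speeds) / len(speeds)
```

Averaging over several runs after a warm-up pass is standard practice for GPU timing, since the first call typically pays kernel-compilation and cache-population costs.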