Rethinking the Value of Training-Free Structured Pruning of LLMs
Authors: Nahush Lele, Arnav Chavan, Aryamaan Thakur, Deepak Gupta
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through an extensive empirical evaluation across a diverse range of tasks, datasets and modalities, we reveal critical limitations in current pruning methods. Our analysis also finds that depth pruning, despite its simplicity, usually outperforms the more granular width pruning approaches in maintaining downstream task performance. Our findings highlight that existing evaluations of pruned LLMs often overstate their effectiveness due to incomplete or limited evaluation tasks, necessitating a critical reassessment of the true value of pruning and emphasizing the need to explore more robust pruning algorithms. |
| Researcher Affiliation | Collaboration | (1) Indian Institute of Technology (ISM), Dhanbad, India; (2) Nyun AI, India; (3) Amazon Lab126; (4) Transmute AI Lab (Texmin Hub) |
| Pseudocode | No | The paper describes the pruning techniques (ShortGPT and FLAP) using mathematical formulas and descriptive text in Sections 6.1 and 6.2, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Key benchmarks include ARC, GSM8k, MMLU, Wikitext-2, HumanEval, IFEval, StereoSet, POPE, TextVQA, and MMMU. These datasets assess linguistic, logical, and multimodal capabilities comprehensively. The details of each of the mentioned tasks are provided in the Appendix. |
| Dataset Splits | No | The paper mentions evaluating models on various benchmarks such as HumanEval, IFEval, and the LEval framework, and for HumanEval states "For each sample in the test set, we generate five completions... and compute the pass@1 metric across the entire dataset to assess performance.". However, it does not provide explicit dataset split information (e.g., percentages, sample counts, or citations defining the splits) for any of the datasets used, beyond implying the use of standard test sets for the benchmarks. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments or run the evaluations. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks like Python, PyTorch, or TensorFlow versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | For each sample in the test set, we generate five completions with a temperature setting of 0.8 and compute the pass@1 metric across the entire dataset to assess performance. We conduct evaluations on the Gemma-2-9B and LLaMA-3-8B models, compressed by 10% and 20% using both depth and width pruning methods. |
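The Experiment Setup row describes sampling five completions per problem and scoring with pass@1. A minimal sketch of how that metric is typically estimated is below, using the standard unbiased pass@k estimator specialized to k=1 (the function and variable names are illustrative, not taken from the paper):

```python
import math

def pass_at_1(n: int, c: int) -> float:
    """Unbiased pass@k estimator (general form: 1 - C(n-c, k)/C(n, k))
    specialized to k=1, where n completions were sampled and c passed.
    For k=1 this reduces to the pass fraction c / n."""
    k = 1
    if n - c < k:
        return 1.0  # every draw of k completions contains a passing one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical example: five completions per problem, as in the paper's
# HumanEval protocol; each inner list holds per-completion pass flags.
results = [
    [True, False, False, True, False],    # 2 of 5 correct
    [False, False, False, False, False],  # 0 of 5 correct
]
per_problem = [pass_at_1(len(r), sum(r)) for r in results]
dataset_pass_at_1 = sum(per_problem) / len(per_problem)
print(round(dataset_pass_at_1, 2))  # -> 0.2
```

Averaging the estimator over all problems gives the dataset-level pass@1 reported in such evaluations; sampling at temperature 0.8 is what makes the five completions per problem differ.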