Rethinking the Value of Training-Free Structured Pruning of LLMs
Authors: Nahush Lele, Arnav Chavan, Aryamaan Thakur, Deepak Gupta
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through an extensive empirical evaluation across a diverse range of tasks, datasets and modalities, we reveal critical limitations in current pruning methods. Our analysis also finds that depth pruning, despite its simplicity, usually outperforms the more granular width pruning approaches in maintaining downstream task performance. Our findings highlight that existing evaluations of pruned LLMs often overstate their effectiveness due to incomplete or limited evaluation tasks, necessitating a critical reassessment of the true value of pruning and emphasizing the need to explore more robust pruning algorithms. |
| Researcher Affiliation | Collaboration | (1) Indian Institute of Technology (ISM), Dhanbad, India; (2) Nyun AI, India; (3) Amazon Lab126; (4) Transmute AI Lab (Texmin Hub) |
| Pseudocode | No | The paper describes the pruning techniques (ShortGPT and FLAP) using mathematical formulas and descriptive text in Sections 6.1 and 6.2, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Key benchmarks include ARC, GSM8k, MMLU, Wikitext-2, HumanEval, IFEval, StereoSet, POPE, TextVQA, and MMMU. These datasets assess linguistic, logical, and multimodal capabilities comprehensively. The details of each of the mentioned tasks are provided in the Appendix. |
| Dataset Splits | No | The paper mentions evaluating models on various benchmarks such as HumanEval, IFEval, and the LEval framework, and for HumanEval states "For each sample in the test set, we generate five completions... and compute the pass@1 metric across the entire dataset to assess performance.". However, it does not provide explicit dataset split information (e.g., percentages, sample counts, or citations defining the splits) for any of the datasets used, beyond implying the use of standard test sets for the benchmarks. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments or run the evaluations. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., programming languages, libraries, or frameworks like Python, PyTorch, or TensorFlow versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | For each sample in the test set, we generate five completions with a temperature setting of 0.8 and compute the pass@1 metric across the entire dataset to assess performance. We conduct evaluations on the Gemma-2-9B and LLaMA-3-8B models, compressed by 10% and 20% using both depth and width pruning methods. |
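The Experiment Setup row describes sampling five completions per problem and scoring with pass@1. A minimal sketch of how that metric is typically estimated is below, using the standard unbiased pass@k estimator specialized to k=1 (the function and variable names are illustrative, not taken from the paper):

```python
import math

def pass_at_1(n: int, c: int) -> float:
    """Unbiased pass@k estimator (general form: 1 - C(n-c, k)/C(n, k))
    specialized to k=1, where n completions were sampled and c passed.
    For k=1 this reduces to the pass fraction c / n."""
    k = 1
    if n - c < k:
        return 1.0  # every draw of k completions contains a passing one
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical example: five completions per problem, as in the paper's
# HumanEval protocol; each inner list holds per-completion pass flags.
results = [
    [True, False, False, True, False],    # 2 of 5 correct
    [False, False, False, False, False],  # 0 of 5 correct
]
per_problem = [pass_at_1(len(r), sum(r)) for r in results]
dataset_pass_at_1 = sum(per_problem) / len(per_problem)
print(round(dataset_pass_at_1, 2))  # -> 0.2
```

Averaging the estimator over all problems gives the dataset-level pass@1 reported in such evaluations; sampling at temperature 0.8 is what makes the five completions per problem differ.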