Let LLM Tell What to Prune and How Much to Prune

Authors: Mingzhe Yang, Sihao Lin, Changlin Li, Xiaojun Chang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on multiple benchmarks and LLM variants demonstrate that our method effectively balances the trade-off between efficiency and performance. ... 4. Experiment 4.1. Experimental Setup 4.2. Pruning Results on LLMs 4.3. Inference Speed 4.4. Baseline Bias Compensation 4.5. Robustness to Calibration Samples 4.6. Dependency on Calibration Dataset 4.7. Ablation Studies
Researcher Affiliation | Academia | 1University of Science and Technology of China, 2RMIT University, 3Stanford University. Correspondence to: Xiaojun Chang <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Block-wise pruning ratio assignment). Input: an original LLM and the target pruning ratio p_target. Output: the pruning ratio for each block.
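The report only names Algorithm 1; as a hedged illustration of what a block-wise ratio assignment can look like, the sketch below allocates larger pruning ratios to less important blocks while keeping the mean ratio at the target. The function name, the inverse-importance weighting, and the clipping bound are assumptions for illustration, not the paper's actual Algorithm 1.

```python
from typing import List

def assign_block_ratios(importance: List[float], p_target: float,
                        max_ratio: float = 0.95) -> List[float]:
    """Toy block-wise ratio assignment: blocks with lower importance
    scores receive larger pruning ratios, and the ratios average to
    p_target (up to the effect of clipping at max_ratio)."""
    n = len(importance)
    total = sum(importance)
    # Inverse-importance weights: a less important block gets pruned more.
    weights = [1.0 - s / total for s in importance]
    # Scale so the mean ratio equals the target before clipping.
    scale = p_target * n / sum(weights)
    return [min(max_ratio, scale * w) for w in weights]
```

With importance scores [4, 3, 2, 1] and a 40% target, the least important block receives the largest per-block ratio while the average stays at 0.4.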
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository. It mentions using existing libraries such as Hugging Face Transformers, but not an implementation of its own.
Open Datasets | Yes | For calibration data, we randomly select 128 samples from the WikiText2 training dataset (Merity et al., 2016). ... Furthermore, to validate the efficacy of our method, we report the zero-shot accuracy on five benchmark datasets: PIQA (Bisk et al., 2020); WinoGrande (Sakaguchi et al., 2021); HellaSwag (Zellers et al., 2019); ARC-e and ARC-c (Clark et al., 2018). ... We use the C4 and WikiText2 training datasets as different calibration datasets.
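The calibration protocol (128 random samples drawn from the WikiText2 training split) can be sketched generically; the seeded sampler below is an illustration, not the paper's code, and `sample_calibration` is a hypothetical helper name.

```python
import random

def sample_calibration(corpus, n_samples=128, seed=0):
    """Draw a fixed-size random calibration set without replacement,
    mirroring the '128 samples from the training set' setup."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(corpus, n_samples)
```

Fixing the seed matters for reproducibility reports like this one: the same 128 documents are selected on every run.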
Dataset Splits | No | For calibration data, we randomly select 128 samples from the WikiText2 training dataset (Merity et al., 2016). ... we utilize the WikiText2 test dataset to evaluate the perplexity of the model after pruning with different pruning methods. ... We report the zero-shot accuracy on five benchmark datasets: PIQA; WinoGrande; HellaSwag; ARC-e and ARC-c. While the calibration data size and evaluation datasets are mentioned, explicit train/validation/test splits (e.g., percentages or sample counts beyond the standard test sets) are not provided for reproducibility.
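For reference, the perplexity metric used on the WikiText2 test set is the exponential of the mean per-token negative log-likelihood; the helper below is a generic sketch of that formula, not the paper's evaluation harness.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    token_nlls: per-token NLL values (natural log) over the test set."""
    return math.exp(sum(token_nlls) / len(token_nlls))
```

A model assigning every token probability 1/10 (NLL = ln 10) therefore has perplexity exactly 10.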
Hardware Specification | Yes | All experiments are conducted on NVIDIA A800 GPUs with 80GB memory.
Software Dependencies | No | "all of which are available through the Hugging Face Transformers library (Wolf, 2019)." A specific library is mentioned, but no version number is provided for it.
Experiment Setup | Yes | For calibration data, we randomly select 128 samples from the WikiText2 training dataset (Merity et al., 2016). ... We begin by presenting the WikiText2 performance results for the LLaMA2, LLaMA3, and Vicuna models at three distinct pruning ratios (30%, 40%, and 50%). ... We provide a detailed comparison of the inference speed for generating sequences of length 128 (batch size of 1) under different pruning methods. ... Algorithm 1 Block-wise pruning ratio assignment. Input: an original LLM, target pruning ratio p_target.
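A tokens-per-second measurement for the length-128, batch-size-1 generation setting described above could be taken as sketched below; `generate_fn` is a placeholder for an actual model generation call, and the warm-up/run counts are assumptions, not the paper's protocol.

```python
import time

def measure_tokens_per_second(generate_fn, seq_len=128, warmup=1, runs=3):
    """Time generation of seq_len tokens (batch size 1) and return the
    mean tokens/second over several timed runs."""
    for _ in range(warmup):        # warm-up runs amortize one-off costs
        generate_fn(seq_len)
    speeds = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(seq_len)
        elapsed = time.perf_counter() - start
        speeds.append(seq_len / elapsed)
    return sum(speeds) / len(speeds)
```

Averaging over several runs after a warm-up pass is standard practice for GPU timing, since the first call typically pays kernel-compilation and cache-population costs.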