Sample-aware Adaptive Structured Pruning for Large Language Models
Authors: Jun Kong, Xinge Ma, Jin Wang, Xuejie Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that the AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model. We conduct extensive experiments on a variety of language benchmarks. The AdaPruner outperforms the existing methods on the LLaMA series models, achieving superior average performance over the LLM-Pruner by 1.37%. The experimental results demonstrate the effectiveness of the proposed AdaPruner. |
| Researcher Affiliation | Academia | Jun Kong, Xinge Ma, Jin Wang*, Xuejie Zhang School of Information Science and Engineering Yunnan University Kunming, China EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using mathematical formulations and textual explanations, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor structured code-like procedures. |
| Open Source Code | Yes | Code https://github.com/JunKong5/AdaPruner |
| Open Datasets | Yes | To evaluate the performance of LLMs before and after pruning, we use the perplexity (PPL) metric on the WikiText2 (Merity et al. 2017) and PTB (Marcus 1993) datasets to measure the language modeling capability. To evaluate the performance of the pruning method comprehensively and intuitively, we evaluate the zero-shot performance on seven commonsense reasoning datasets, including BoolQ (Clark et al. 2019), PIQA (Bisk et al. 2020), HellaSwag (Zellers et al. 2020), WinoGrande (Sakaguchi et al. 2021), ARC (including ARC-easy and ARC-challenge) (Clark et al. 2018), and OpenbookQA (Mihaylov et al. 2018). ...Following LLM-Pruner, we construct the subspace of the calibration data using samples from the BookCorpus dataset, as shown in Figure 2. |
| Dataset Splits | No | The paper mentions using WikiText2, PTB, and seven commonsense reasoning datasets for evaluation, and BookCorpus for calibration data. It also states: '10 samples are selected as the calibration data' for baselines, and 'samples with lengths less than 128 in BookCorpus (Zhu et al. 2015) are eliminated'. However, it does not provide specific train/test/validation percentages, sample counts for its own experimental setup, or detailed splitting methodologies for the primary evaluation datasets, relying implicitly on standard benchmarks without explicit description of the splits used. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or other computing resource specifications used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers, such as programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | To evaluate the effectiveness of AdaPruner, we conduct experiments on the LLaMA-7B (Touvron et al. 2023) and Vicuna-7B (Chiang et al. 2023) models. Meanwhile, samples with lengths less than 128 in BookCorpus (Zhu et al. 2015) are eliminated to narrow the calibration data subspace. The optimization range of the balance coefficients α1 and α2 for coarse and fine-grained importance estimation is set between 0 and 1. The estimation metrics alignment factors are optimized in the range (e5, e6, e7), (e-2, e-3), (1e-4). ... We performed 20% parameter pruning ratio experiments on LLaMA-7B... and 50% pruning ratio experiment on the LLaMA-7B model with fine-tuning... We employ the Low-Rank Adaptation (LoRA) (Hu et al. 2022) to fine-tune the model on the Stanford Alpaca (Taori et al. 2023) dataset. |
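The table above reports perplexity (PPL) on WikiText2 and PTB as the paper's language-modeling metric. For readers checking reproducibility, a minimal sketch of how perplexity is conventionally computed from per-token negative log-likelihoods (the quantity a causal LM's cross-entropy loss returns); the vocabulary size and sequence length below are illustrative, not taken from the paper:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean per-token negative log-likelihood, natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Sanity check: a model that is uniform over a 50,000-token vocabulary
# assigns every token NLL = ln(50000), so its perplexity is exactly 50000.
nlls = [math.log(50000)] * 128
print(round(perplexity(nlls)))  # 50000
```

In practice, evaluation harnesses compute the NLLs by sliding the model over the test corpus and averaging the cross-entropy loss over all predicted tokens before exponentiating.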