Olica: Efficient Structured Pruning of Large Language Models without Retraining

Authors: Jiujun He, Huazhen Lin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks. Table 1. We compare the resource consumption of different pruning methods on the LLaMA-7B model, focusing on the amount of data used, peak GPU memory consumption, and the runtime required for pruning (or retraining). The performance of the pruned model is evaluated based on perplexity (PPL) on the WikiText2 dataset and accuracy averaged across the following datasets: BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA.
Researcher Affiliation | Academia | Center of Statistical Research, School of Statistics and Data Science, and New Cornerstone Science Laboratory, Southwestern University of Finance and Economics, Chengdu, China. Correspondence to: Huazhen Lin <EMAIL>.
Pseudocode | Yes | Algorithm 1: Overview of the proposed Olica
Open Source Code | Yes | Code is available at https://github.com/BetterTMrR/LLM-Olica.
Open Datasets | Yes | We assess the performance of the pruned models on WikiText2 (Merity et al., 2017) with a sequence length of 128 tokens, and on the following downstream tasks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). We employ the lm-eval-harness framework (Gao et al., 2021) to evaluate the pruned model performance on these tasks. We randomly select 256 samples from the Bookcorpus (Zhu et al., 2015) and Alpaca (Taori et al., 2023) datasets... These samples are sampled from the C4 dataset (Raffel et al., 2020).
Dataset Splits | Yes | We assess the performance of the pruned models on WikiText2 (Merity et al., 2017) with a sequence length of 128 tokens, and on the following downstream tasks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). We employ the lm-eval-harness framework (Gao et al., 2021) to evaluate the pruned model performance on these tasks. The version of lm-eval-harness used in this paper is the same as SlimGPT (Ling et al., 2024), which can be found in their supplementary material: https://openreview.net/forum?id=MxF0IKJtKW.
Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA A100 80GB GPU. Inference latency is also notably reduced, from 46.95s at 0% sparsity to 34.28s at 33% sparsity (measured on the WikiText2 test set using a single NVIDIA GeForce RTX 4090).
Software Dependencies | Yes | We employ the lm-eval-harness framework (Gao et al., 2021) to evaluate the pruned model performance on these tasks. The version of lm-eval-harness used in this paper is the same as SlimGPT (Ling et al., 2024), which can be found in their supplementary material: https://openreview.net/forum?id=MxF0IKJtKW.
Experiment Setup | Yes | We randomly select 256 samples from the Bookcorpus (Zhu et al., 2015) and Alpaca (Taori et al., 2023) datasets, each of which is truncated to a sequence length of 128 tokens, as the calibration data. The number of calibrated FFN layers is selected from {6, 12, 16} for models with different parameter sizes. In the linear calibration, we retain the top 3% eigenvectors for low-rank approximation, i.e., r/d = 0.03. Following (Frantar & Alistarh, 2023; Frantar et al., 2023), we set the λ in Eq. (7) as λ = λ0 · Mean(diag(XᵀX)), where λ0 is fixed as 0.5.
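The two numerical ingredients quoted in the setup row can be sketched in a few lines. Below is a hedged NumPy illustration, not the authors' implementation: `ridge_lambda` computes the data-dependent penalty λ = λ0 · Mean(diag(XᵀX)) with λ0 = 0.5, and `top_r_approximation` keeps the top r/d = 0.03 fraction of directions (here via SVD as a stand-in for the paper's eigenvector truncation); all function names, shapes, and the random data are illustrative assumptions.

```python
import numpy as np

def ridge_lambda(X, lambda0=0.5):
    """Penalty scaling lambda = lambda0 * Mean(diag(X^T X)).

    X holds calibration activations (rows = tokens, cols = hidden units),
    so diag(X^T X) is the squared norm of each activation column.
    """
    return lambda0 * np.mean(np.diag(X.T @ X))

def top_r_approximation(W, ratio=0.03):
    """Low-rank approximation keeping the top ceil(ratio * d) directions.

    Illustrative SVD-based stand-in for retaining the top 3% eigenvectors
    (r/d = 0.03) quoted in the setup.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = max(1, int(np.ceil(ratio * min(W.shape))))
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Toy example: a 128-token calibration batch with a (hypothetical) hidden dim 64.
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))
lam = ridge_lambda(X)                                 # scales with activation norms
W_low = top_r_approximation(rng.standard_normal((64, 64)))  # rank ceil(0.03*64) = 2
```

With λ0 = 0.5 the penalty adapts to the calibration data's scale rather than being a fixed constant, which is the point of tying it to diag(XᵀX).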