Olica: Efficient Structured Pruning of Large Language Models without Retraining

Authors: Jiujun He, Huazhen Lin

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed Olica is efficient in terms of data usage, GPU memory, and running time, while delivering superior performance across multiple benchmarks. Table 1. We compare the resource consumption of different pruning methods on the LLaMA-7B model, focusing on the amount of data used, peak GPU memory consumption, and the runtime required for pruning (or retraining). The performance of the pruned model is evaluated based on perplexity (PPL) on the WikiText2 dataset and accuracy averaged across the following datasets: BoolQ, PIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA.
Researcher Affiliation | Academia | Center of Statistical Research, School of Statistics and Data Science, and New Cornerstone Science Laboratory, Southwestern University of Finance and Economics, Chengdu, China. Correspondence to: Huazhen Lin <EMAIL>.
Pseudocode | Yes | Algorithm 1: Overview of the proposed Olica
Open Source Code | Yes | Code is available at https://github.com/BetterTMrR/LLM-Olica.
Open Datasets | Yes | We assess the performance of the pruned models on WikiText2 (Merity et al., 2017) with a sequence length of 128 tokens, and on the following downstream tasks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). We employ the lm-eval-harness framework (Gao et al., 2021) to evaluate the pruned model performance on these tasks. We randomly select 256 samples from the Bookcorpus (Zhu et al., 2015) and Alpaca (Taori et al., 2023) datasets... These samples are sampled from the C4 dataset (Raffel et al., 2020).
Dataset Splits | Yes | We assess the performance of the pruned models on WikiText2 (Merity et al., 2017) with a sequence length of 128 tokens, and on the following downstream tasks: BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC-easy (Clark et al., 2018), ARC-challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). We employ the lm-eval-harness framework (Gao et al., 2021) to evaluate the pruned model performance on these tasks. The version of lm-eval-harness used in this paper is the same as SlimGPT (Ling et al., 2024), which can be found in their supplementary material: https://openreview.net/forum?id=MxF0IKJtKW.
Hardware Specification | Yes | All the experiments are conducted on a single NVIDIA A100 80GB GPU. Inference latency is also notably reduced, from 46.95s at 0% sparsity to 34.28s at 33% sparsity (measured on the WikiText2 test set using a single NVIDIA GeForce RTX 4090).
Software Dependencies | Yes | We employ the lm-eval-harness framework (Gao et al., 2021) to evaluate the pruned model performance on these tasks. The version of lm-eval-harness used in this paper is the same as SlimGPT (Ling et al., 2024), which can be found in their supplementary material: https://openreview.net/forum?id=MxF0IKJtKW.
Experiment Setup | Yes | We randomly select 256 samples from the Bookcorpus (Zhu et al., 2015) and Alpaca (Taori et al., 2023) datasets, each of which is truncated to a sequence length of 128 tokens, as the calibration data. The number of calibrated FFN layers is selected from {6, 12, 16} for models with different parameter sizes. In the linear calibration, we retain the top 3% eigenvectors for low-rank approximation, i.e., r/d = 0.03. Following (Frantar & Alistarh, 2023; Frantar et al., 2023), we set the λ in Eq. (7) as λ = λ0 · Mean(diag(XᵀX)), where λ0 is fixed as 0.5.
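The two numerical ingredients quoted in the setup row can be sketched in a few lines. Below is a hedged NumPy illustration, not the authors' implementation: `ridge_lambda` computes the data-dependent penalty λ = λ0 · Mean(diag(XᵀX)) with λ0 = 0.5, and `top_r_approximation` keeps the top r/d = 0.03 fraction of directions (here via SVD as a stand-in for the paper's eigenvector truncation); all function names, shapes, and the random data are illustrative assumptions.

```python
import numpy as np

def ridge_lambda(X, lambda0=0.5):
    """Penalty scaling lambda = lambda0 * Mean(diag(X^T X)).

    X holds calibration activations (rows = tokens, cols = hidden units),
    so diag(X^T X) is the squared norm of each activation column.
    """
    return lambda0 * np.mean(np.diag(X.T @ X))

def top_r_approximation(W, ratio=0.03):
    """Low-rank approximation keeping the top ceil(ratio * d) directions.

    Illustrative SVD-based stand-in for retaining the top 3% eigenvectors
    (r/d = 0.03) quoted in the setup.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    r = max(1, int(np.ceil(ratio * min(W.shape))))
    return U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# Toy example: a 128-token calibration batch with a (hypothetical) hidden dim 64.
rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))
lam = ridge_lambda(X)                                 # scales with activation norms
W_low = top_r_approximation(rng.standard_normal((64, 64)))  # rank ceil(0.03*64) = 2
```

With λ0 = 0.5 the penalty adapts to the calibration data's scale rather than being a fixed constant, which is the point of tying it to diag(XᵀX).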