Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training

Authors: Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca

TMLR 2025

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — "We test our method over 300 test cases with four LLM families, three sparsity ratios, and ten language tasks (three language modeling and seven zero-shot datasets), showing how it consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off."
Researcher Affiliation: Collaboration — Elia Cunegatti, University of Trento, Italy; Leonardo Lucio Custode, Independent Researcher; Giovanni Iacca, University of Trento, Italy
Pseudocode: Yes — "Figure 3: Left: Overall NeuronAl top-up pruning procedure. Right: Get Best NeuronAl sub-routine used in both block- and row-selection stages."
Open Source Code: Yes — "The code is available at https://github.com/eliacunegatti/NeuroAL."
Open Datasets: Yes — "Language Modeling Datasets: To measure the models' perplexity on language modeling datasets, we use the following three datasets: (1) WikiText2 (Merity et al., 2017), (2) Colossal Clean Common Crawl (C4) (Raffel et al., 2020), and (3) Penn Treebank (PTB). Zero-Shot Tasks: To assess more thoroughly how the different pruning algorithms affect the models' capabilities, we employ the following 7 datasets: (1) Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), (2) WinoGrande (Sakaguchi et al., 2021), (3) BoolQ (Clark et al., 2019), (4) HellaSwag (Zellers et al., 2019), (5) ARC-e (Clark et al., 2018), (6) ARC-c (Clark et al., 2018), (7) OBQA (Mihaylov et al., 2018)"
Dataset Splits: Yes — "For all the pruning algorithms that use calibration data (i.e., multiflow, Wanda, and SparseGPT), we use 128 samples from the C4 dataset, as in (Frantar & Alistarh, 2023; Sun et al., 2023; Yin et al., 2024). ... For both C and Cλ, we use the same seed (0) for the calibration set, i.e., Cλ contains the first 8 elements of C."
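The calibration-set construction quoted above (a seeded sample C of 128 sequences, with Cλ as its first 8 elements) can be sketched as follows. This is a hypothetical illustration using a toy corpus in place of C4; the helper name build_calibration_sets is not from the paper.

```python
import random

def build_calibration_sets(corpus, n_calib=128, n_lambda=8, seed=0):
    """Hypothetical helper: draw a calibration set C of n_calib sequences
    with a fixed seed, and take C_lambda as the first n_lambda elements
    of C, so both sets come from the same seeded draw (seed 0, as quoted)."""
    rng = random.Random(seed)
    C = rng.sample(corpus, n_calib)
    C_lambda = C[:n_lambda]  # C_lambda is a prefix of C
    return C, C_lambda

# Toy stand-in for C4 text sequences.
corpus = [f"doc-{i}" for i in range(1000)]
C, C_lam = build_calibration_sets(corpus)
```

Because Cλ is a prefix of C drawn with the same seed, re-running the sampler reproduces both sets exactly.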
Hardware Specification: Yes — "All the experiments have been run on NVIDIA A100 GPUs, both with 40 and 80 GB. ... The evaluation consists of the end-to-end token generation and has been done over an Intel i9-10980XE CPU using 18 cores."
Software Dependencies: No — The paper mentions an "inference pipeline based on DeepSparse (Neural Magic, 2021) and ONNX Runtime backends" but does not specify version numbers for these or other software libraries.
Experiment Setup: Yes — "For OWL, we set the hyperparameters to the values most commonly used in the original paper, hence M = 5 and λ = 0.08; we do the same for AlphaPruning, setting ϵ = 0.3. ... In the experiments, we set λset = [0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.12, 0.15, 0.20, 0.25] for the block step, while for the row step, we also added 0.0 (in case of no performance improvement). ... For both C and Cλ, we use the same seed (0) for the calibration set"
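The hyperparameter settings quoted above can be collected into a small configuration sketch. This is only an illustrative transcription of the reported values, with assumed variable names (LAMBDA_SET_BLOCK, OWL_PARAMS, etc.) that do not come from the paper's code.

```python
# Sweep of lambda values for the block-selection step, as reported.
LAMBDA_SET_BLOCK = [0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.08,
                    0.09, 0.10, 0.12, 0.15, 0.20, 0.25]

# For the row step, 0.0 is also allowed (used when no performance
# improvement is found).
LAMBDA_SET_ROW = [0.0] + LAMBDA_SET_BLOCK

# Baseline hyperparameters quoted from the setup description.
OWL_PARAMS = {"M": 5, "lambda": 0.08}
ALPHAPRUNING_PARAMS = {"epsilon": 0.3}

# Calibration-set seed shared by C and C_lambda.
CALIBRATION_SEED = 0
```

Listing the grids this way makes it easy to check that the row-step sweep is the block-step sweep plus the extra 0.0 entry.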