Zeroth-Order Adaptive Neuron Alignment Based Pruning without Re-Training
Authors: Elia Cunegatti, Leonardo Lucio Custode, Giovanni Iacca
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our method over 300 test cases with four LLM families, three sparsity ratios, and ten language tasks (three language modeling and seven zero-shot datasets), showing how it consistently outperforms the latest state-of-the-art methods in terms of performance-runtime trade-off. |
| Researcher Affiliation | Collaboration | Elia Cunegatti (University of Trento, Italy); Leonardo Lucio Custode (Independent Researcher); Giovanni Iacca (University of Trento, Italy) |
| Pseudocode | Yes | Figure 3: Left: Overall NeuroAL top-up pruning procedure. Right: Get Best NeuroAL sub-routine used in both block- and row-selection stages. |
| Open Source Code | Yes | The code is available at https://github.com/eliacunegatti/NeuroAL. |
| Open Datasets | Yes | Language Modeling Datasets To measure the models' perplexity on Language Modeling datasets, we use the following three datasets: (1) WikiText2 (Merity et al., 2017), (2) Colossal Clean Common Crawl (C4) (Raffel et al., 2020), and (3) Penn Treebank (PTB). Zero-Shot Tasks To assess more thoroughly how the different pruning algorithms affect the models' capabilities, we employ the following 7 datasets: (1) Recognizing Textual Entailment (RTE) (Dagan et al., 2006; Bar-Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), (2) WinoGrande (Sakaguchi et al., 2021), (3) BoolQ (Clark et al., 2019), (4) HellaSwag (Zellers et al., 2019), (5) ARC-e (Clark et al., 2018), (6) ARC-c (Clark et al., 2018), (7) OBQA (Mihaylov et al., 2018) |
| Dataset Splits | Yes | For all the pruning algorithms that use calibration data (i.e., multiflow, Wanda, and SparseGPT), we use 128 samples from the C4 dataset, as in (Frantar & Alistarh, 2023; Sun et al., 2023; Yin et al., 2024). ... For both C and Cλ, we use the same seed (0) for the calibration set, i.e., Cλ contains the first 8 elements of C. |
| Hardware Specification | Yes | All the experiments have been run on NVIDIA A100 GPUs, both with 40 and 80 GB. ... The evaluation consists of the end-to-end token generation and has been done over an Intel i9-10980XE CPU using 18 cores. |
| Software Dependencies | No | The paper mentions an 'inference pipeline based on DeepSparse (Neural Magic, 2021) and ONNX Runtime backends' but does not specify version numbers for these or other software libraries. |
| Experiment Setup | Yes | For OWL, we set the hyperparameters to the values that are used mostly in the original paper, hence M = 5 and λ = 0.08; we do the same for AlphaPruning, setting ϵ = 0.3. ... In the experiments, we set λset = [0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.12, 0.15, 0.20, 0.25] for the block step, while for the row step, we also added 0.0 (in case of no performance improvement). ... For both C and Cλ, we use the same seed (0) for the calibration set |
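The calibration-set protocol quoted above (128 samples drawn with seed 0, with Cλ taken as the first 8 elements of C, and a fixed λ grid for the block and row steps) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_calibration_sets` and the placeholder corpus are assumptions, standing in for actual C4 sampling.

```python
import random

# Hyperparameter grids quoted from the paper's experiment setup.
LAMBDA_SET = [0.01, 0.02, 0.03, 0.05, 0.06, 0.07, 0.08, 0.09,
              0.1, 0.12, 0.15, 0.20, 0.25]   # block-step candidates
ROW_LAMBDA_SET = [0.0] + LAMBDA_SET           # row step also allows 0.0

def build_calibration_sets(corpus, n_samples=128, n_lambda=8, seed=0):
    """Draw the calibration set C (n_samples items) from `corpus` with a
    fixed seed; C_lambda is the first n_lambda elements of C, matching the
    paper's statement that both sets share seed 0."""
    rng = random.Random(seed)
    c = rng.sample(corpus, n_samples)
    return c, c[:n_lambda]

# Toy usage with placeholder "documents" standing in for C4 samples.
corpus = [f"doc-{i}" for i in range(1000)]
C, C_lambda = build_calibration_sets(corpus)
assert len(C) == 128 and C_lambda == C[:8]
```

Fixing the seed makes the draw reproducible, so re-running the sketch yields the same C, which is the property the review credits the paper with documenting.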