Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Authors: Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results demonstrate the effectiveness of our non-linear pruning approach in maintaining model performance while significantly reducing computational cost, outperforming the current state-of-the-art methods, SparseGPT and Wanda, by a large margin. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices.
Researcher Affiliation | Academia | The University of Hong Kong; University of Wisconsin-Madison; South China University of Technology; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley; University of Pennsylvania.
Pseudocode | Yes | Algorithm 1: Gradient Descent for Pruning Mask (Theorem 4.1).
Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor a link to a code repository for the described methodology.
Open Datasets | Yes | Our experiments on synthetic data clearly align with and support our theoretical analysis (Section 6.1). Furthermore, we evaluate our non-linear pruning method and show that it outperforms SparseGPT and Wanda by a large margin on real data (the C4 dataset, Raffel et al. (2020)) with a real LLM (Llama 3.2-1B, Meta (2024)) in Sections 6.2 and 6.3.
Dataset Splits | Yes | We use 8 sentences from the C4 training set Raffel et al. (2020) as calibration data, and 128 sentences from the C4 validation set to evaluate perplexity.
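Perplexity over the 128 validation sentences would then be computed as the exponential of the mean token-level negative log-likelihood. A minimal NumPy sketch of that metric (the function name and array shapes are assumptions, not the paper's code):

```python
import numpy as np

def perplexity(logits, targets):
    """Perplexity = exp(mean negative log-likelihood of the target tokens).

    logits:  (num_tokens, vocab_size) unnormalized model scores
    targets: (num_tokens,) integer ids of the ground-truth next tokens
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# sanity check: uniform scores over a 10-token vocabulary give perplexity ≈ 10
pp = perplexity(np.zeros((5, 10)), np.array([0, 3, 7, 2, 9]))
print(pp)
```

A uniform model over a vocabulary of size V has perplexity exactly V, which makes the function easy to spot-check before running it on real model outputs.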
Hardware Specification | No | The paper mentions '16GB GPU memory' in the introduction as a general requirement for Llama 3.1, but does not specify the GPU models or other hardware used for the authors' own experiments.
Software Dependencies | No | We implement our method following the pseudocode in Algorithm 1, using NumPy and JAX for acceleration. No version numbers are given for these software components.
Experiment Setup | Yes | In our experiments, the weight matrix dimension d = 64 is kept constant across all figures, and we simulate two datasets of size k = 16 and k = 64. We set the input sequence length n = 128 for experiments (a) and (c) in Figure 2. The pruning ratio ρ = 0.5 is used for experiments (a) and (b) in Figure 2. For our method, the regularization coefficient is λ := λ̃/n, where (abusing notation) λ̃ denotes the same quantity as in Definition 3.2 and λ is the parameter actually controlled in the experiments. λ is set to 0.04 for experiments (b) and (c) in Figure 2 (intuition drawn from experiment (a)). The total number of epochs is T = 100. The step size is η = 0.1/λ, because Theorem 4.1 indicates that η is inversely proportional to λ up to a constant, i.e., η ∝ 1/L ≈ 1/(λ + other terms). For all the experiments, we set λ = 0.05 and η = 0.005, and we set the pruning ratio as 0.5 for both the MLP and the attention layer.
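The reported setup (gradient descent on a pruning mask with d = 64, n = 128, T = 100 epochs, pruning ratio ρ = 0.5, λ = 0.05, η = 0.005) can be sketched in plain NumPy. This is a hypothetical reconstruction, not the paper's Algorithm 1: the least-squares reconstruction objective with an L1 penalty is an assumption, and `learn_pruning_mask` is an illustrative name.

```python
import numpy as np

def learn_pruning_mask(W, X, lam=0.05, eta=0.005, T=100, rho=0.5):
    """Learn a soft mask for weight matrix W on calibration inputs X,
    then binarize it so that a rho fraction of the weights is pruned.

    Assumed objective (not the paper's exact loss):
        (1/2n) * ||X (W * m)^T - X W^T||_F^2 + lam * ||m||_1
    """
    m = np.full(W.shape, 0.5)            # soft mask, initialized inside (0, 1)
    Y = X @ W.T                          # dense reference output
    n = X.shape[0]
    for _ in range(T):
        resid = X @ (W * m).T - Y        # masked output minus dense output
        # gradient of the reconstruction term w.r.t. m, plus L1 subgradient
        grad = (X.T @ resid).T * W / n + lam * np.sign(m)
        m = np.clip(m - eta * grad, 0.0, 1.0)
    # binarize: prune exactly the rho fraction with the smallest mask scores
    k = int(rho * m.size)
    order = np.argsort(m, axis=None)     # ascending mask scores
    mask = np.ones(m.size)
    mask[order[:k]] = 0.0
    return mask.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))            # d = 64 weight matrix
X = rng.normal(size=(128, 64))           # n = 128 calibration inputs
mask = learn_pruning_mask(W, X)
print(mask.mean())                       # fraction of weights kept (0.5)
```

Binarizing via `argsort` rather than a value threshold guarantees the target pruning ratio exactly, even when many mask entries clip to the same value.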