Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix
Authors: Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results demonstrate the effectiveness of our non-linear pruning approach in maintaining model performance while significantly reducing computational costs, outperforming the current state-of-the-art methods, i.e., Sparse GPT and Wanda, by a large margin. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices. |
| Researcher Affiliation | Academia | The University of Hong Kong; University of Wisconsin-Madison; South China University of Technology; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley; University of Pennsylvania. |
| Pseudocode | Yes | Algorithm 1 Gradient Descent for Pruning Mask (Theorem 4.1). |
| Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | Our experiments on synthetic data clearly align with and support our theoretical analysis (Section 6.1). Furthermore, we evaluate our non-linear pruning method and show that it outperforms Sparse GPT and Wanda by a large margin on real data (the C4 dataset, Raffel et al. (2020)) with a real LLM (Llama 3.2-1B, Meta (2024)) in Section 6.2 and Section 6.3. |
| Dataset Splits | Yes | We use 8 sentences in the C4 training dataset Raffel et al. (2020) as calibration data, and we use 128 sentences in the C4 validation dataset Raffel et al. (2020) to evaluate perplexity. |
| Hardware Specification | No | The paper mentions '16GB GPU memory' in the introduction as a general requirement for Llama 3.1, but does not specify the particular GPU models or other hardware used for their own experiments. |
| Software Dependencies | No | We implement our method following the pseudocode in Algorithm 1, using NumPy and JAX for acceleration. No version numbers are specified for these software components. |
| Experiment Setup | Yes | In our experiments, the weight matrix dimension d = 64 is kept constant across all figures, and we simulate two datasets of size k = 16 and k = 64. We set the input sequence length n = 128 for experiments (a) and (c) in Figure 2. The pruning ratio ρ = 0.5 is set for experiments (a) and (b) in Figure 2. For our method, the regularization coefficient λ := λ̃/n, where we abuse notation to let λ̃ denote the same quantity as in Definition 3.2, and λ here is the parameter we actually control in experiments. λ is set to 0.04 for experiments (b) and (c) in Figure 2 (with intuition drawn from experiment (a)). The total number of epochs is set to T = 100. The step size is set to η = 0.1/λ because Theorem 4.1 indicates that η is inversely proportional to λ up to a constant, i.e., η ∝ 1/L ∝ 1/(λ + other terms). For all the experiments, we set λ to 0.05 and η to 0.005. We set the pruning ratio to 0.5 for the MLP and 0.5 for the attention layer... |
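The reported setup (learn a pruning mask by gradient descent with regularization, then prune at ratio ρ) can be sketched as below. This is a hypothetical NumPy illustration, not the paper's Algorithm 1: the attention-reconstruction objective, the continuous mask relaxation in [0, 1], and the finite-difference gradient (a stand-in for the autodiff that JAX would supply) are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    # numerically stable row-wise softmax
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(X, W):
    # n x n attention scores for input X (n x d) and weights W (d x d)
    return softmax(X @ W @ X.T)

def mask_loss(M, X, W, lam):
    # squared reconstruction error of the masked attention map,
    # plus an L1 sparsity penalty (hypothetical surrogate objective)
    diff = attention(X, W * M) - attention(X, W)
    return 0.5 * np.sum(diff ** 2) + lam * np.abs(M).sum()

def learn_mask(X, W, lam=0.05, eta=0.005, T=100, ratio=0.5, eps=1e-5):
    # continuous mask relaxation in [0, 1], initialized fully dense
    M = np.ones_like(W)
    for _ in range(T):
        base = mask_loss(M, X, W, lam)
        g = np.zeros_like(M)
        # finite-difference gradient; autodiff (e.g. jax.grad) would replace this
        it = np.nditer(M, flags=["multi_index"])
        for _ in it:
            idx = it.multi_index
            Mp = M.copy()
            Mp[idx] += eps
            g[idx] = (mask_loss(Mp, X, W, lam) - base) / eps
        M = np.clip(M - eta * g, 0.0, 1.0)
    # binarize: keep the (1 - ratio) fraction of entries with the largest mask values
    k = int(round((1 - ratio) * M.size))
    keep = np.argsort(M, axis=None)[::-1][:k]
    binary = np.zeros(M.size, dtype=W.dtype)
    binary[keep] = 1.0
    return binary.reshape(W.shape)
```

With the values quoted above (λ = 0.05, η = 0.005, T = 100, pruning ratio 0.5), `learn_mask` would return a binary mask zeroing half of W's entries; the paper's actual objective and update rule are given in its Algorithm 1 and Theorem 4.1.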