Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Authors: Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results demonstrate the effectiveness of our non-linear pruning approach in maintaining model performance while significantly reducing computational cost, outperforming the current state-of-the-art methods, SparseGPT and Wanda, by a large margin. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices.
Researcher Affiliation | Academia | The University of Hong Kong; University of Wisconsin-Madison; South China University of Technology; University of Wisconsin-Madison; The Simons Institute for the Theory of Computing at UC Berkeley; University of Pennsylvania.
Pseudocode | Yes | Algorithm 1: Gradient Descent for Pruning Mask (Theorem 4.1).
Open Source Code | No | The paper does not contain an explicit statement about releasing source code, nor a link to a code repository for the described methodology.
Open Datasets | Yes | Our experiments on synthetic data clearly align with and support our theoretical analysis (Section 6.1). Furthermore, we evaluate our non-linear pruning method and show that it outperforms SparseGPT and Wanda by a large margin on real data (the C4 dataset, Raffel et al. (2020)) with a real LLM (Llama 3.2-1B, Meta (2024)) in Sections 6.2 and 6.3.
Dataset Splits | Yes | We use 8 sentences from the C4 training set Raffel et al. (2020) as calibration data, and 128 sentences from the C4 validation set to evaluate perplexity.
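Perplexity over the 128 validation sentences would then be computed as the exponential of the mean token-level negative log-likelihood. A minimal NumPy sketch of that metric (the function name and array shapes are assumptions, not the paper's code):

```python
import numpy as np

def perplexity(logits, targets):
    """Perplexity = exp(mean negative log-likelihood of the target tokens).

    logits:  (num_tokens, vocab_size) unnormalized model scores
    targets: (num_tokens,) integer ids of the ground-truth next tokens
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

# sanity check: uniform scores over a 10-token vocabulary give perplexity ≈ 10
pp = perplexity(np.zeros((5, 10)), np.array([0, 3, 7, 2, 9]))
print(pp)
```

A uniform model over a vocabulary of size V has perplexity exactly V, which makes the function easy to spot-check before running it on real model outputs.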
Hardware Specification | No | The paper mentions '16GB GPU memory' in the introduction as a general requirement for Llama 3.1, but does not specify the GPU models or other hardware used for the authors' own experiments.
Software Dependencies | No | We implement our method following the pseudocode in Algorithm 1, using NumPy and JAX for acceleration. No version numbers are given for these software components.
Experiment Setup | Yes | In our experiments, the weight matrix dimension d = 64 is kept constant across all figures, and we simulate two datasets of size k = 16 and k = 64. We set the input sequence length n = 128 for experiments (a) and (c) in Figure 2. The pruning ratio ρ = 0.5 is used for experiments (a) and (b) in Figure 2. For our method, the regularization coefficient is λ := λ̃/n, where (abusing notation) λ̃ denotes the same quantity as in Definition 3.2 and λ is the parameter actually controlled in the experiments. λ is set to 0.04 for experiments (b) and (c) in Figure 2 (intuition drawn from experiment (a)). The total number of epochs is T = 100. The step size is η = 0.1/λ, because Theorem 4.1 indicates that η is inversely proportional to λ up to a constant, i.e., η ∝ 1/L ≈ 1/(λ + other terms). For all the experiments, we set λ = 0.05 and η = 0.005, and we set the pruning ratio as 0.5 for both the MLP and the attention layer.
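The reported setup (gradient descent on a pruning mask with d = 64, n = 128, T = 100 epochs, pruning ratio ρ = 0.5, λ = 0.05, η = 0.005) can be sketched in plain NumPy. This is a hypothetical reconstruction, not the paper's Algorithm 1: the least-squares reconstruction objective with an L1 penalty is an assumption, and `learn_pruning_mask` is an illustrative name.

```python
import numpy as np

def learn_pruning_mask(W, X, lam=0.05, eta=0.005, T=100, rho=0.5):
    """Learn a soft mask for weight matrix W on calibration inputs X,
    then binarize it so that a rho fraction of the weights is pruned.

    Assumed objective (not the paper's exact loss):
        (1/2n) * ||X (W * m)^T - X W^T||_F^2 + lam * ||m||_1
    """
    m = np.full(W.shape, 0.5)            # soft mask, initialized inside (0, 1)
    Y = X @ W.T                          # dense reference output
    n = X.shape[0]
    for _ in range(T):
        resid = X @ (W * m).T - Y        # masked output minus dense output
        # gradient of the reconstruction term w.r.t. m, plus L1 subgradient
        grad = (X.T @ resid).T * W / n + lam * np.sign(m)
        m = np.clip(m - eta * grad, 0.0, 1.0)
    # binarize: prune exactly the rho fraction with the smallest mask scores
    k = int(rho * m.size)
    order = np.argsort(m, axis=None)     # ascending mask scores
    mask = np.ones(m.size)
    mask[order[:k]] = 0.0
    return mask.reshape(W.shape)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))            # d = 64 weight matrix
X = rng.normal(size=(128, 64))           # n = 128 calibration inputs
mask = learn_pruning_mask(W, X)
print(mask.mean())                       # fraction of weights kept (0.5)
```

Binarizing via `argsort` rather than a value threshold guarantees the target pruning ratio exactly, even when many mask entries clip to the same value.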