A Proximal Operator for Inducing 2:4-Sparsity

Authors: Jonas M. Kübler, Yu-Xiang Wang, Shoham Sabach, Navid Ansari, Matthäus Kleindessner, Kailash Budhathoki, Volkan Cevher, George Karypis

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate our method on toy problems and apply it to pruning entire large language models up to 70B parameters. On models up to 13B we improve over previous state-of-the-art algorithms, whilst on 70B models we match their performance. Our key empirical contribution is that we apply such local gradient descent after masking to previous state-of-the-art methods (Wanda and SparseGPT) and find that we can improve those out of the box. Section 4, titled 'Experiments', further details these empirical evaluations, including 'Toy experiments' and 'Pruning large language models', and presents results in tables like Table 1 showing 'Validation Perplexity'.
Researcher Affiliation | Collaboration | The authors' affiliations include 'Amazon', 'Huawei', 'University of California San Diego, USA', 'Technion Israel Institute of Technology, Haifa, Israel', and 'École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland'. This mix of company and university affiliations indicates a collaborative effort.
Pseudocode | Yes | The paper includes a block titled 'Algorithm 1 Solve Prox with decreasing non-negative input' and another titled 'Algorithm 2 Matrix Prox Pruner 2:4'.
Open Source Code | No | The text does not contain a clear and affirmative statement of code release for the methodology described in this paper, nor does it provide a direct link to a code repository. Mentions of URLs are for review forums or third-party models/data.
Open Datasets | Yes | To prune LLMs, we use 2 million tokens from the C4 dataset (Raffel et al., 2020)... We evaluate the models both on in-distribution validation data from C4, as well as out-of-distribution data from WikiText-2 (Merity et al., 2016).
Dataset Splits | Yes | To prune LLMs, we use 2 million tokens from the C4 dataset (Raffel et al., 2020), which we pack with end-of-sequence tokens and chunk into 1024 sequences of length 2048. We evaluate the models both on in-distribution validation data from C4, as well as out-of-distribution data from WikiText-2 (Merity et al., 2016). In this section, we provide an ablation on the calibration samples for the 3B model, where we consider the same number of samples as in Sun et al. (2024, Table 17).
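The calibration-data preparation quoted above (pack documents with end-of-sequence tokens, then chunk into 1024 sequences of length 2048, roughly 2M tokens) can be sketched as follows. This is a hypothetical illustration, not the paper's code; the function name `pack_and_chunk` and its signature are assumptions.

```python
import numpy as np

def pack_and_chunk(token_ids_per_doc, eos_id, num_seqs=1024, seq_len=2048):
    """Hypothetical sketch: pack tokenized documents into one stream,
    separated by end-of-sequence tokens, then chunk the stream into
    fixed-length calibration sequences (num_seqs x seq_len ~ 2M tokens)."""
    stream = []
    for ids in token_ids_per_doc:
        stream.extend(ids)
        stream.append(eos_id)  # EOS separator between documents
    needed = num_seqs * seq_len
    if len(stream) < needed:
        raise ValueError("not enough calibration tokens")
    # Drop the tail that does not fill a complete sequence.
    return np.array(stream[:needed], dtype=np.int64).reshape(num_seqs, seq_len)
```

Real pipelines would stream and tokenize C4 lazily rather than materializing a Python list, but the pack-then-chunk structure is the same.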
Hardware Specification | Yes | All experiments can run on a single NVIDIA A100 GPU with 40GB of memory, but we used multiple GPUs in parallel to speed up the experimentation.
Software Dependencies | No | The paper mentions software tools and frameworks (e.g., PyTorch, implied by the GPU and machine-learning context) but does not provide specific version numbers for any libraries, frameworks, or programming languages used.
Experiment Setup | Yes | For our experiments, unless otherwise stated, we use λ_k = λ_0 β^k with λ_0 = 0.01 and β = 1.01. Furthermore, we always use the solver Algorithm 2, relying on Conjecture 9. After ending Proximal Gradient, we do 1000 steps of masked gradients according to Equation (5) to minimize the local squared loss. Furthermore, we initialize all methods Wanda-style by transforming W_{i,j} ↦ W_{i,j} H_{j,j}^{1/2}, H_{i,j} ↦ H_{i,j} H_{j,j}^{−1/2} H_{i,i}^{−1/2}. For the 70B models, we observed that this would require more than 2000 iterations of PQ. We thus used the heuristic from Equation (15) and the results of Figure 3 and selected β = 1.005, λ_0 = 1 × 10^{−3} to strike a good balance between performance and runtime.
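Two pieces of the quoted setup translate directly into code: the increasing regularization schedule λ_k = λ_0 β^k and the Wanda-style initialization that scales weight columns by H_{j,j}^{1/2} while normalizing the Hessian to unit diagonal. The sketch below is a hypothetical reading of those two formulas, not the paper's implementation; the function names are assumptions.

```python
import numpy as np

def lam(k, lam0=0.01, beta=1.01):
    """Regularization schedule quoted in the setup: lambda_k = lambda_0 * beta^k."""
    return lam0 * beta ** k

def wanda_style_init(W, H):
    """Hypothetical sketch of the quoted Wanda-style transform:
    W[i,j] <- W[i,j] * H[j,j]^(1/2)
    H[i,j] <- H[i,j] * H[j,j]^(-1/2) * H[i,i]^(-1/2)
    i.e. scale weight columns by the Hessian's diagonal square roots
    and normalize H to have unit diagonal."""
    d = np.sqrt(np.diag(H))          # sqrt of Hessian diagonal
    W_scaled = W * d[None, :]        # column-wise scaling of W
    H_normed = H / np.outer(d, d)    # H now has ones on the diagonal
    return W_scaled, H_normed
```

With λ_0 = 0.01 and β = 1.01, λ_k grows slowly (about 1% per proximal step), which matches the paper's observation that smaller β (e.g. 1.005) needs more iterations to reach full sparsity.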