A Proximal Operator for Inducing 2:4-Sparsity
Authors: Jonas M. Kübler, Yu-Xiang Wang, Shoham Sabach, Navid Ansari, Matthäus Kleindessner, Kailash Budhathoki, Volkan Cevher, George Karypis
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We illustrate our method on toy problems and apply it to pruning entire large language models up to 70B parameters. On models up to 13B we improve over previous state-of-the-art algorithms, whilst on 70B models we match their performance. Our key empirical contribution is that we apply such local gradient descent after masking to previous state-of-the-art methods (Wanda and SparseGPT) and find that we can improve those out of the box. Section 4, titled 'Experiments', further details these empirical evaluations, including 'Toy experiments' and 'Pruning large language models', and presents results in tables like Table 1 showing 'Validation Perplexity'. |
| Researcher Affiliation | Collaboration | The authors' affiliations include 'Amazon', 'Huawei', 'University of California San Diego, USA', 'Technion Israel Institute of Technology, Haifa, Israel', and 'École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland'. This mix of company and university affiliations indicates a collaborative effort. |
| Pseudocode | Yes | The paper includes a block titled 'Algorithm 1 Solve Prox with decreasing non-negative input' and another titled 'Algorithm 2 Matrix Prox Pruner 2:4'. |
| Open Source Code | No | The text does not contain a clear and affirmative statement of code release for the methodology described in this paper, nor does it provide a direct link to a code repository. Mentions of URLs are for review forums or third-party models/data. |
| Open Datasets | Yes | To prune LLMs, we use 2 million tokens from the c4 dataset (Raffel et al., 2020)... We evaluate the models both on in-distribution validation data from c4, as well as out of distribution data from Wiki Text2 (Merity et al., 2016). |
| Dataset Splits | Yes | To prune LLMs, we use 2 million tokens from the c4 dataset (Raffel et al., 2020), which we pack with end-of-sequence tokens and chunk into 1024 sequences of length 2048. We evaluate the models both on in-distribution validation data from c4, as well as out-of-distribution data from WikiText2 (Merity et al., 2016). In this section, we provide an ablation on the calibration samples for the 3B model, where we consider the same number of samples as in Sun et al. (2024, Table 17). |
| Hardware Specification | Yes | All experiments can run on a single NVIDIA A100 GPU with 40GB of memory, but we used multiple GPUs in parallel to speed up the experimentation. |
| Software Dependencies | No | The paper mentions software tools and frameworks (e.g., PyTorch implicitly through mentioning GPUs and machine learning context) but does not provide specific version numbers for any libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | For our experiments, unless otherwise stated, we use λ_k = λ_0·β^k with λ_0 = 0.01 and β = 1.01. Furthermore, we always use the solver Algorithm 2 relying on Conjecture 9. After ending Proximal Gradient, we do 1000 steps of masked gradients according to Equation (5) to minimize the local squared loss. Furthermore, we initialize all methods Wanda-style by transforming W_{i,j} ↦ W_{i,j}·H_{j,j}^{1/2}, H_{i,j} ↦ H_{i,j}·H_{j,j}^{-1/2}·H_{i,i}^{-1/2}. For the 70B models, we observed that this would require more than 2000 iterations of PQ. We thus used the heuristic from Equation (15) and the results of Figure 3 and selected β = 1.005, λ_0 = 1×10⁻³ to strike a good balance between performance and runtime. |
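As a rough illustration of the experiment setup quoted above, the sketch below shows the geometric λ schedule (λ_k = λ_0·β^k) and a Wanda-style rescaling of the weight matrix and Hessian. Function and variable names are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def wanda_rescale(W, H):
    """Wanda-style initialization sketch (hypothetical helper):
    scale weights by the square root of the Hessian diagonal and
    normalize H so its diagonal becomes 1."""
    d = np.sqrt(np.diag(H))            # H_{j,j}^{1/2}
    W_scaled = W * d[None, :]          # W_{i,j} * H_{j,j}^{1/2}
    H_scaled = H / np.outer(d, d)      # H_{i,j} * H_{j,j}^{-1/2} * H_{i,i}^{-1/2}
    return W_scaled, H_scaled

def lambda_schedule(lambda0=0.01, beta=1.01, num_steps=10):
    """Geometric regularization schedule lambda_k = lambda0 * beta**k."""
    return [lambda0 * beta**k for k in range(num_steps)]
```

After this rescaling the Hessian has unit diagonal, so magnitude-based scoring on the rescaled weights reproduces Wanda-style importance; the 70B setting in the quote corresponds to calling `lambda_schedule(1e-3, 1.005, ...)`.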