PROXSPARSE: REGULARIZED LEARNING OF SEMI-STRUCTURED SPARSITY MASKS FOR PRETRAINED LLMS
Authors: Hongyi Liu, Rajarshi Saha, Zhen Jia, Youngsuk Park, Jiaji Huang, Shoham Sabach, Yu-Xiang Wang, George Karypis
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods by a significant margin, demonstrating the effectiveness of our learned approach to semi-structured pruning. We conducted extensive experiments on 7 widely used, high-performance open-source models from four model families: Mistral (Jiang et al., 2023), Qwen (Yang et al., 2024), OpenLLaMA (Geng & Liu, 2023), and Llama (Touvron et al., 2023). |
| Researcher Affiliation | Collaboration | 1Rice University 2Amazon Web Services 3Technion 4UCSD. Correspondence to: Hongyi L. <EMAIL>, Rajarshi S. <EMAIL>, Yu-Xiang W. <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 ProxSparse: Proximal Gradient Descent for End-to-End 2:4-Sparsity Pruning; Algorithm 2 ALM: Alternating Minimization; Algorithm 3 Enum ALM for solving (6). |
| Open Source Code | Yes | Code available here. |
| Open Datasets | Yes | For calibration, we followed Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023) in using the C4 (Raffel et al., 2020) dataset. Zero-shot performance was evaluated with the EleutherAI LM-Eval Harness (Gao et al., 2024) on seven widely used tasks (Liu et al., 2024), while WikiText (Merity et al., 2016) perplexity (PPL) was used as the language-modeling metric, consistent with previous evaluation protocols (Sun et al., 2023; Frantar & Alistarh, 2023). |
| Dataset Splits | No | The experiments use 400 data samples for calibration unless specified otherwise, with consistent sample counts across baselines for fair comparison. |
| Hardware Specification | Yes | Our experiments were done on Nvidia A100 GPUs. We utilize the Nvidia CUTLASS library as the underlying implementation for 2:4 semi-structured sparse operations. |
| Software Dependencies | No | Table 8 presents the configurations and hyperparameters used in our experiments. There are three key hyperparameters for learning an optimal semi-structured mask: sparsity regularization strength (λ1), frozen-weight regularization extent (λ2), and learning rate. Our learning procedure follows standard settings, using AdamW as the optimizer with a warmup ratio of 0.1. (No specific versions for AdamW or CUTLASS are provided.) |
| Experiment Setup | Yes | Table 8 presents the configurations and hyperparameters used in our experiments. There are three key hyperparameters for learning an optimal semi-structured mask: sparsity regularization strength (λ1), frozen-weight regularization extent (λ2), and learning rate. Our learning procedure follows standard settings, using AdamW as the optimizer with a warmup ratio of 0.1. |
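For context on the 2:4 semi-structured sparsity pattern that the paper's pruning targets (supported in hardware via the NVIDIA CUTLASS library), the following is a minimal sketch of the simplest magnitude-based 2:4 mask baseline: within every group of 4 consecutive weights, keep the 2 with largest absolute value. This is a generic baseline of the kind ProxSparse's learned masks are compared against, not the paper's own algorithm; the function name and the toy weight matrix are illustrative assumptions.

```python
import numpy as np

def magnitude_2_4_mask(w: np.ndarray) -> np.ndarray:
    """Magnitude-based 2:4 mask (illustrative baseline, not ProxSparse).

    In every group of 4 consecutive weights along the flattened last
    axis, keep the 2 entries with the largest |w|.
    Assumes w.size is divisible by 4.
    """
    groups = w.reshape(-1, 4)                        # (num_groups, 4)
    # column indices of the two largest-magnitude entries in each group
    keep = np.argsort(np.abs(groups), axis=1)[:, 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(w.shape)

# Toy example: two groups of 4 weights
w = np.array([[0.1, -2.0, 0.5, 3.0, -1.0, 0.2, 0.4, -0.3]])
mask = magnitude_2_4_mask(w)
# Exactly 2 of every 4 weights survive: 4 nonzeros out of 8
print(mask.sum())        # 4
print(w * mask)
```

A learned approach such as the paper's proximal-gradient formulation instead optimizes which 2-of-4 pattern to keep end-to-end, rather than committing to the per-group magnitude heuristic shown here.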