PROXSPARSE: REGULARIZED LEARNING OF SEMI-STRUCTURED SPARSITY MASKS FOR PRETRAINED LLMS

Authors: Hongyi Liu, Rajarshi Saha, Zhen Jia, Youngsuk Park, Jiaji Huang, Shoham Sabach, Yu-Xiang Wang, George Karypis

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning. We conducted extensive experiments on 7 widely used high-performance open-source models from four model families: Mistral (Jiang et al., 2023), Qwen (Yang et al., 2024), OpenLLaMA (Geng & Liu, 2023), and Llama (Touvron et al., 2023).
Researcher Affiliation | Collaboration | 1Rice University, 2Amazon Web Services, 3Technion, 4UCSD. Correspondence to: Hongyi L. <EMAIL>, Rajarshi S. <EMAIL>, Yu-Xiang W. <EMAIL>.
Pseudocode | Yes | Algorithm 1 (ProxSparse: Proximal Gradient Descent for End-to-End 2:4-Sparsity Pruning), Algorithm 2 (ALM: Alternating Minimization), Algorithm 3 (Enum ALM for solving (6)).
Open Source Code | Yes | Code available here.
Open Datasets | Yes | For calibration, we followed Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023) in using the C4 (Raffel et al., 2020) dataset. Zero-shot performance was evaluated with the EleutherAI LM-Eval Harness (Gao et al., 2024) on seven widely used tasks (Liu et al., 2024), while WikiText (Merity et al., 2016) perplexity (PPL) was used as the language modeling metric, consistent with previous evaluation protocols (Sun et al., 2023; Frantar & Alistarh, 2023).
Dataset Splits | No | The experiments use 400 calibration samples unless otherwise specified, with consistent sample counts across baselines for fair comparison.
Hardware Specification | Yes | Our experiments were done on NVIDIA A100 GPUs. We utilize the NVIDIA CUTLASS library as the underlying implementation for 2:4 semi-structured sparse operations.
Software Dependencies | No | Our learning procedure follows standard settings, using AdamW as the optimizer with a warmup ratio of 0.1. (No specific versions for AdamW or CUTLASS are provided.)
Experiment Setup | Yes | Table 8 presents the configurations and hyperparameters used in our experiments. There are three key hyperparameters for learning an optimal semi-structured mask: sparsity regularization strength (λ1), frozen-weight regularization extent (λ2), and learning rate. Our learning procedure follows standard settings, using AdamW as the optimizer with a warmup ratio of 0.1.
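The algorithms listed above all target the 2:4 semi-structured pattern: at most two nonzeros in every contiguous group of four weights, the pattern that NVIDIA sparse tensor cores accelerate. As a point of reference, the simple magnitude-based 2:4 mask (the hand-crafted baseline that learned approaches like ProxSparse improve upon) can be sketched as follows; the function name and pure-Python formulation are illustrative, not from the paper:

```python
def apply_24_mask(weights):
    """Keep the two largest-magnitude entries in each group of four (2:4 sparsity).

    Minimal magnitude-based illustration of the 2:4 pattern; ProxSparse
    instead learns the mask end-to-end via proximal gradient descent.
    """
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

pruned = apply_24_mask([0.1, -0.5, 0.3, 0.02, 1.0, -0.2, 0.05, 0.4])
# each group of four retains exactly two nonzeros
```

Magnitude selection is applied per group rather than globally, which is what distinguishes semi-structured (2:4) pruning from unstructured magnitude pruning at the same overall sparsity level.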
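The warmup ratio of 0.1 in the setup corresponds to a linear learning-rate ramp over the first 10% of optimizer steps. A minimal sketch of such a schedule, assuming a constant rate after warmup (the quoted text does not state the post-warmup schedule):

```python
def warmup_lr(step, total_steps, base_lr, warmup_ratio=0.1):
    """Linear learning-rate warmup over the first `warmup_ratio` of training.

    Sketch of the warmup described in the setup (AdamW, warmup ratio 0.1);
    the constant post-warmup schedule is an assumption, not from the paper.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # ramp linearly from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    return base_lr  # constant after warmup (assumed)
```

For example, with `total_steps=100` and `base_lr=1e-3`, the rate grows from 1e-4 at step 0 to 1e-3 by step 9 and stays there. In practice this is usually delegated to the training framework's scheduler rather than hand-rolled.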