R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
Authors: Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Zhangyang Wang, Steven Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity, resulting in a significant 43% end-to-end efficiency improvement with customized kernels. |
| Researcher Affiliation | Collaboration | 1. The University of Texas at Austin; 2. Meta AI |
| Pseudocode | Yes | Algorithm 1 Search Algorithm for Sparsification Recipe |
| Open Source Code | Yes | The code is available at https://github.com/VITA-Group/R-Sparse. |
| Open Datasets | Yes | We assess the models on several popular tasks, including eight common-sense reasoning tasks: Winogrande (WG) (Sakaguchi et al., 2021), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), OpenBookQA (OBQA) (Mihaylov et al., 2018b), HellaSwag (HS) (Zellers et al., 2019), BoolQ (Clark et al., 2019), and ARC (ARC-Easy and ARC-Challenge) (Clark et al., 2018b). Evaluations are conducted using the lm-evaluation-harness framework (Gao et al., 2021). Additionally, we report results on text summarization tasks using XSUM (Narayan et al., 2018), as well as language modeling tasks on the validation set of WikiText-2 (Merity et al., 2016). |
| Dataset Splits | Yes | We collected the distribution of S for Llama-2-7B (Touvron et al., 2023) using 16 training samples from the C4 dataset (Dodge et al., 2021), each containing 4096 tokens. ... language modeling tasks on the validation set of WikiText-2 (Merity et al., 2016). ... For the uniform approach, we set ρ = 0.95 uniformly across all layers, based on a grid search using 16 training samples from the C4 dataset. |
| Hardware Specification | Yes | The overhead of the search process is minimal, taking approximately one hour on a single A6000 GPU for the Llama-2-7B model. ... All experiments are conducted on a single NVIDIA A6000 GPU without offloading. |
| Software Dependencies | No | Our implementation is based on the Hugging Face library with the FP32 precision data format. All experiments are conducted on a single NVIDIA A6000 GPU without offloading. We applied a uniform 50% sparsity to R-Sparse, achieving comparable performance as shown in Section 4.2, and utilized a customized Triton kernel to reduce data transfer between on-chip SRAM and HBM. ... Note that we use GPTQ (Frantar et al., 2022) for weight quantization with a group size of 128, which provides matching performance to the full baseline. |
| Experiment Setup | Yes | The population size is set to 32, with both the mutation rate pm and the crossover rate pc set to 0.5, and the total number of generations is 5. ... For the uniform approach, we set ρ = 0.95 uniformly across all layers, based on a grid search using 16 training samples from the C4 dataset. ... We applied a uniform 50% sparsity to R-Sparse |
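The search hyperparameters reported above (population 32, mutation rate pm = 0.5, crossover rate pc = 0.5, 5 generations, Algorithm 1 "Search Algorithm for Sparsification Recipe") can be sketched as a small evolutionary loop over per-layer sparsity ratios. This is a minimal illustrative sketch, not the paper's implementation: the candidate encoding, the `CHOICES` grid, and the `fitness` placeholder are all assumptions; the real fitness would evaluate model quality (e.g. perplexity on C4 samples) under the target average sparsity.

```python
import random

# Hyperparameters quoted in the table.
POP_SIZE, P_M, P_C, GENERATIONS = 32, 0.5, 0.5, 5
NUM_LAYERS = 32                        # e.g. Llama-2-7B (assumed)
CHOICES = [0.3, 0.4, 0.5, 0.6, 0.7]    # per-layer sparsity options (assumed)


def fitness(recipe):
    # Placeholder objective: real fitness would measure model quality on
    # calibration data. Here we simply prefer recipes whose mean sparsity
    # is close to the 50% model-level target.
    return -abs(sum(recipe) / len(recipe) - 0.5)


def mutate(recipe):
    # With probability P_M, resample each layer's sparsity ratio.
    return [random.choice(CHOICES) if random.random() < P_M else r for r in recipe]


def crossover(a, b):
    # With probability P_C, take the gene from parent a, else from parent b.
    return [x if random.random() < P_C else y for x, y in zip(a, b)]


def search():
    pop = [[random.choice(CHOICES) for _ in range(NUM_LAYERS)]
           for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: POP_SIZE // 2]          # keep the fitter half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)


best = search()
```

With a trivial fitness like this the loop converges in a few generations, which is consistent with the paper's claim that the search overhead is small (about one hour on a single A6000 for Llama-2-7B, where each evaluation is a real model pass).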