R-Sparse: Rank-Aware Activation Sparsity for Efficient LLM Inference
Authors: Zhenyu Zhang, Zechun Liu, Yuandong Tian, Harshit Khaitan, Zhangyang Wang, Steven Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on Llama-2/3 and Mistral models across ten diverse tasks demonstrate that R-Sparse achieves comparable performance at 50% model-level sparsity, resulting in a significant 43% end-to-end efficiency improvement with customized kernels. |
| Researcher Affiliation | Collaboration | 1. The University of Texas at Austin; 2. Meta AI |
| Pseudocode | Yes | Algorithm 1 Search Algorithm for Sparsification Recipe |
| Open Source Code | Yes | The code is available at https://github.com/VITA-Group/R-Sparse. |
| Open Datasets | Yes | We assess the models on several popular tasks, including eight common-sense reasoning tasks: Winogrande (WG) (Sakaguchi et al., 2021), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), OpenBookQA (OBQA) (Mihaylov et al., 2018b), HellaSwag (HS) (Zellers et al., 2019), BoolQ (Clark et al., 2019), and ARC (ARC-Easy and ARC-Challenge) (Clark et al., 2018b). Evaluations are conducted using the lm-evaluation-harness framework (Gao et al., 2021). Additionally, we report results on text summarization tasks using XSUM (Narayan et al., 2018), as well as language modeling tasks on the validation set of WikiText-2 (Merity et al., 2016). |
| Dataset Splits | Yes | We collected the distribution of S for Llama-2-7B (Touvron et al., 2023) using 16 training samples from the C4 dataset (Dodge et al., 2021), each containing 4096 tokens. ... language modeling tasks on the validation set of WikiText-2 (Merity et al., 2016). ... For the uniform approach, we set ρ = 0.95 uniformly across all layers, based on a grid search using 16 training samples from the C4 dataset. |
| Hardware Specification | Yes | The overhead of the search process is minimal, taking approximately one hour on a single A6000 GPU for the Llama-2-7B model. ... All experiments are conducted on a single NVIDIA A6000 GPU without offloading. |
| Software Dependencies | No | Our implementation is based on the Hugging Face library with the FP32 precision data format. All experiments are conducted on a single NVIDIA A6000 GPU without offloading. We applied a uniform 50% sparsity to R-Sparse, achieving comparable performance as shown in Section 4.2, and utilized a customized Triton kernel to reduce data transfer between on-chip SRAM and HBM. ... Note that we use GPTQ (Frantar et al., 2022) for weight quantization with a group size of 128, which provides matching performance to the full baseline. |
| Experiment Setup | Yes | The population size is set to 32, with both the mutation rate pm and the crossover rate pc set to 0.5, and the total number of generations is 5. ... For the uniform approach, we set ρ = 0.95 uniformly across all layers, based on a grid search using 16 training samples from the C4 dataset. ... We applied a uniform 50% sparsity to R-Sparse |
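The search hyperparameters reported above (population 32, mutation rate pm = 0.5, crossover rate pc = 0.5, 5 generations, Algorithm 1 "Search Algorithm for Sparsification Recipe") can be sketched as a small evolutionary loop over per-layer sparsity ratios. This is a minimal illustrative sketch, not the paper's implementation: the candidate encoding, the `CHOICES` grid, and the `fitness` placeholder are all assumptions; the real fitness would evaluate model quality (e.g. perplexity on C4 samples) under the target average sparsity.

```python
import random

# Hyperparameters quoted in the table.
POP_SIZE, P_M, P_C, GENERATIONS = 32, 0.5, 0.5, 5
NUM_LAYERS = 32                        # e.g. Llama-2-7B (assumed)
CHOICES = [0.3, 0.4, 0.5, 0.6, 0.7]    # per-layer sparsity options (assumed)


def fitness(recipe):
    # Placeholder objective: real fitness would measure model quality on
    # calibration data. Here we simply prefer recipes whose mean sparsity
    # is close to the 50% model-level target.
    return -abs(sum(recipe) / len(recipe) - 0.5)


def mutate(recipe):
    # With probability P_M, resample each layer's sparsity ratio.
    return [random.choice(CHOICES) if random.random() < P_M else r for r in recipe]


def crossover(a, b):
    # With probability P_C, take the gene from parent a, else from parent b.
    return [x if random.random() < P_C else y for x, y in zip(a, b)]


def search():
    pop = [[random.choice(CHOICES) for _ in range(NUM_LAYERS)]
           for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: POP_SIZE // 2]          # keep the fitter half
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)


best = search()
```

With a trivial fitness like this the loop converges in a few generations, which is consistent with the paper's claim that the search overhead is small (about one hour on a single A6000 for Llama-2-7B, where each evaluation is a real model pass).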