LaRoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation
Authors: Kai Liu, Bowen Xu, Shaoyu Wu, Xin Chen, Hao Zhou, Yongliang Tao, Lulu Hu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments with leading LLMs and demonstrate that LaRoSA is effective and robust across different types, sizes, and sparsity levels. LaRoSA presents minimal performance degradation while providing consistent wall-clock time speed-up. Specifically, for LLaMA2-7B at 40% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30× wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54%, while surpassing TEAL by 1.77% and CATS by 17.14%. |
| Researcher Affiliation | Industry | Alibaba Group. Correspondence to: Kai Liu <EMAIL>, Bowen Xu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1: Grid Search for Optimal Sparsity Coefficients |
| Open Source Code | No | The paper does not provide an explicit statement or a link to its own source code for the methodology described. It mentions using 'Hugging Face Open-R1 repository' for evaluation but not for releasing its own implementation. |
| Open Datasets | Yes | We use the WikiText2 train set (Merity et al., 2016) as calibration dataset for LaRoSA and other reproducible works. We conduct experiments on complex tasks such as MATH500 (Lightman et al., 2024), GPQA-Diamond (Rein et al., 2024), and AIME 24 (AIME, 2025) |
| Dataset Splits | Yes | We use the WikiText2 train set (Merity et al., 2016) as calibration dataset for LaRoSA and other reproducible works. All models are evaluated on the same 128 random samples with a 2048-token context length. |
| Hardware Specification | Yes | The computation of Q is performed on 8×80GB A100 GPUs, taking approximately 12 minutes to complete for the LLaMA3-70B model. Experiments are conducted on NVIDIA A100 GPUs. (Table 3 also lists 'A100' and 'H20' under the 'GPU' column). |
| Software Dependencies | No | The paper mentions software components like 'Triton-based kernel', 'DejaVu (Liu et al., 2023)', 'TEAL (Liu et al., 2024a)', 'lm-evaluation-harness (Gao et al., 2023)', and 'Hugging Face Open-R1 repository (Face, 2025)' but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We randomly select 16 sequences with a sequence length of 2048 tokens to compute the rotation matrices Q for LaRoSA and empirical distributions for CATS and TEAL. For the sparsity coefficient α, we employ Grid Search to find the optimal hyperparameter for each model, as shown in Appendix B Algorithm 1. The optimal α for each activation type of models is presented in Appendix B Table 11. We collected ten samples, each consisting of 128 tokens, from various test datasets and generated new sequences with lengths ranging from 128 to 2048 tokens. Tensor parallelism (TP2) is set for LLaMA3-70B and Qwen2.5-72B. |
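The rows above describe computing per-layer rotation matrices Q and sparsifying activations. A minimal NumPy sketch of the general rotate-then-sparsify idea follows; the function name and the top-k magnitude rule are illustrative assumptions for this summary, not the paper's exact formulation:

```python
import numpy as np

def rotated_sparse_matvec(W, Q, x, sparsity=0.4):
    """Apply weight W to activation x via a rotated, sparsified basis.

    Q is orthogonal, so W @ x == (W @ Q) @ (Q.T @ x) when nothing is
    dropped; sparsification zeroes the smallest-magnitude rotated entries.
    """
    z = Q.T @ x
    # Number of entries to keep at the requested sparsity level.
    k = int(round(len(z) * (1.0 - sparsity)))
    # Indices of the k largest-magnitude rotated activations.
    keep = np.argpartition(np.abs(z), len(z) - k)[len(z) - k:]
    z_sparse = np.zeros_like(z)
    z_sparse[keep] = z[keep]
    # W @ Q can be fused offline, so only the sparse matvec runs online.
    return (W @ Q) @ z_sparse
```

At `sparsity=0.0` this reduces exactly to the dense product `W @ x`, which is why such methods can trade a small perplexity gap for wall-clock speed-up as sparsity rises.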
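The paper's Algorithm 1 (not reproduced in this summary) grid-searches the sparsity coefficient α per model. A minimal sketch under the assumption that the search minimizes calibration loss, with a hypothetical `eval_loss` callback standing in for evaluating the sparsified model on the calibration set:

```python
def grid_search_alpha(candidates, eval_loss):
    """Return the candidate alpha with the lowest calibration loss.

    eval_loss is a hypothetical callback that applies the sparsity
    coefficient to the model and returns a scalar loss (e.g.,
    perplexity on the WikiText2 calibration set).
    """
    best_alpha, best_loss = None, float("inf")
    for alpha in candidates:
        loss = eval_loss(alpha)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha
```

The actual algorithm may constrain the search so that the per-layer sparsity levels meet an overall target; that detail is not recoverable from this summary.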