Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Authors: Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, Maosong Sun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To address the three issues in the Introduction section, we conduct extensive experiments, training, evaluating, and analyzing models ranging from 0.1B to 1.2B parameters."
Researcher Affiliation | Collaboration | "1 Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; 2 Huawei Noah's Ark Lab, China."
Pseudocode | Yes | "Algorithm 1: Find the CETT hyper-parameter for CETT-PPL-p% sparsity"
Open Source Code | Yes | "The codes and checkpoints are available at https://github.com/thunlp/SparsingLaw/."
Open Datasets | Yes | "The pre-training data is a mixture of various corpora, including a cleaned version of Common Crawl, Dolma (Soldaini et al., 2024), C4 (Raffel et al., 2020), the Pile (Gao et al., 2020), the Stack (Kocetkov et al., 2022), StarCoder (Li et al., 2023), and other collected raw corpora."
Dataset Splits | Yes | "We introduce a tiny validation dataset, which shares the same distribution as the pre-training data. We conduct deduplication to eliminate any intersections between validation and pre-training data. ... For the measurement of sparsity, to eliminate the impact of stochastic factors (especially the sparsity fluctuations during the early stage), we employ a sparsity stabilizing strategy (see Appendix E). ... the task-specific performance is evaluated on checkpoints after the decay stage."
Hardware Specification | Yes | "Both frameworks are compiled with CUDA enabled and run on the same machine with 104 CPUs and 1 NVIDIA A800 GPU."
Software Dependencies | No | The paper mentions "CUDA enabled" but does not specify a version number for CUDA or for any other software dependency, such as Python, PyTorch, or the specific builds of PowerInfer and llama.cpp.
Experiment Setup | Yes | "We employ the following pre-training hyper-parameters across all settings: peak learning rate lr = 0.01, β1 = 0.9, β2 = 0.95, weight decay = 0.1. The batch size depends on the parameter scale, as presented in Table 3."
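The Pseudocode row refers to the paper's Algorithm 1, which finds a CETT hyper-parameter such that the resulting sparsity keeps perplexity degradation within p%. The paper's exact procedure is not reproduced here; the sketch below only illustrates the general idea with a bisection search, assuming (hypothetically) that perplexity is monotonically non-decreasing in the CETT value. The names `evaluate_ppl`, `base_ppl`, and the search bounds are our own placeholders, not the paper's API.

```python
def find_cett(evaluate_ppl, base_ppl, p, lo=0.0, hi=1.0, tol=1e-4):
    """Bisection sketch: find the largest CETT value whose perplexity
    stays within a p% increase over the dense baseline.

    evaluate_ppl: callable mapping a candidate CETT value to validation PPL
                  (hypothetical; assumed monotonically non-decreasing).
    base_ppl:     perplexity of the dense (non-sparsified) model.
    p:            allowed relative PPL increase, in percent.
    """
    target = base_ppl * (1.0 + p / 100.0)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if evaluate_ppl(mid) <= target:
            lo = mid  # still within budget: try a larger (sparser) CETT
        else:
            hi = mid  # over budget: shrink CETT
    return lo
```

With a toy monotone surrogate such as `lambda c: 10.0 * (1 + c)` and `p = 5`, the search converges near `c = 0.05`, the point where PPL hits the 5% budget.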
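The Experiment Setup row reports the hyper-parameters shared across all settings. A minimal sketch of collecting them into keyword arguments for a standard AdamW-style optimizer follows; the dict name and helper function are ours for illustration, and batch size is deliberately omitted because the paper says it varies with parameter scale (its Table 3).

```python
# Shared pre-training hyper-parameters reported in the paper.
PRETRAIN_HPARAMS = {
    "lr": 0.01,            # peak learning rate
    "betas": (0.9, 0.95),  # Adam beta1 / beta2
    "weight_decay": 0.1,
}

def optimizer_kwargs(hparams=PRETRAIN_HPARAMS):
    """Return a copy of the reported values shaped as keyword arguments
    accepted by common AdamW implementations (e.g. torch.optim.AdamW)."""
    return dict(hparams)
```

For example, `torch.optim.AdamW(model.parameters(), **optimizer_kwargs())` would apply these settings in a PyTorch training loop.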