Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
Authors: Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, Maosong Sun
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To address the three issues in the Introduction section, we conduct extensive experiments, training, evaluating, and analyzing the models ranging from 0.1B to 1.2B. |
| Researcher Affiliation | Collaboration | 1Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China 2Huawei Noah's Ark Lab, China. |
| Pseudocode | Yes | Algorithm 1 Find the CETT hyper-parameter for CETT-PPL-p% sparsity |
| Open Source Code | Yes | The codes and checkpoints are available at https://github.com/thunlp/SparsingLaw/. |
| Open Datasets | Yes | The pre-training data is a mixture of various corpora, including a cleaned version of Common Crawl, Dolma (Soldaini et al., 2024), C4 (Raffel et al., 2020), Pile (Gao et al., 2020), the Stack (Kocetkov et al., 2022), StarCoder (Li et al., 2023), and other collected raw corpora. |
| Dataset Splits | Yes | We introduce a tiny validation dataset, which shares the same distribution as the pre-training data. We conduct deduplication to eliminate any intersections between validation and pre-training data. ... For the measurement of sparsity, to eliminate the impact of stochastic factors (especially the sparsity fluctuations during the early stage), we employ a sparsity stabilizing strategy (see Appendix E). ... the task-specific performance is evaluated on checkpoints after the decay stage. |
| Hardware Specification | Yes | Both frameworks are compiled with CUDA enabled and run on the same machine with 104 CPUs and 1 NVIDIA A800 GPU. |
| Software Dependencies | No | The paper mentions 'CUDA enabled' but does not specify a version number for CUDA or for any other software dependency, such as Python, PyTorch, or the specific versions of PowerInfer and llama.cpp used for the speed comparison. |
| Experiment Setup | Yes | We employ the following pre-training hyper-parameters across all settings: peak learning rate lr = 0.01, β1 = 0.9, β2 = 0.95, weight decay = 0.1. The batch size depends on the parameter scale, as presented in Table 3. |
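The CETT-PPL-p% sparsity metric behind Algorithm 1 thresholds small neuron activations and measures CETT, the ratio of the output error introduced by zeroing the "tail" activations to the norm of the full FFN output; the algorithm searches for the threshold at which CETT hits a target value tied to a p% perplexity increase. Below is a minimal NumPy sketch of that search, assuming a single linear down-projection per layer and that CETT grows monotonically with the threshold so bisection applies; the function names (`cett`, `find_threshold`) and the toy dimensions are illustrative, not taken from the paper's released code.

```python
import numpy as np

def cett(acts, w_out, eps):
    """CETT for one token: norm of the FFN output contributed by
    'tail' activations (|a| < eps) divided by the full output norm."""
    full = acts @ w_out
    tail = np.where(np.abs(acts) < eps, acts, 0.0) @ w_out
    return np.linalg.norm(tail) / np.linalg.norm(full)

def find_threshold(acts, w_out, target_cett, iters=50):
    """Bisect the truncation threshold eps so that cett(eps)
    approximately equals target_cett (cett rises from 0 toward 1
    as eps sweeps from 0 to the largest activation magnitude)."""
    lo, hi = 0.0, float(np.abs(acts).max())
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cett(acts, w_out, mid) < target_cett:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy example: random activations and down-projection weights.
rng = np.random.default_rng(0)
acts = rng.standard_normal(4096)            # one token's FFN activations
w_out = rng.standard_normal((4096, 1024))   # down-projection matrix
eps = find_threshold(acts, w_out, target_cett=0.2)
sparsity = float(np.mean(np.abs(acts) < eps))  # fraction of prunable neurons
```

In the paper's setting the target CETT itself is tuned (Algorithm 1) so that the resulting perplexity on the validation set rises by at most p% over the dense model; the sketch above only covers the inner threshold search for a fixed CETT target.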