Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Authors: Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Xiaojun Meng, Liqun Deng, Jiansheng Wei, Zhiyuan Liu, Maosong Sun

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To address the three issues in the Introduction section, we conduct extensive experiments, training, evaluating, and analyzing models ranging from 0.1B to 1.2B parameters."
Researcher Affiliation | Collaboration | "1 Dept. of Comp. Sci. & Tech., Institute for AI, Tsinghua University, Beijing, China; 2 Huawei Noah's Ark Lab, China."
Pseudocode | Yes | "Algorithm 1: Find the CETT hyper-parameter for CETT-PPL-p% sparsity"
Open Source Code | Yes | "The codes and checkpoints are available at https://github.com/thunlp/SparsingLaw/."
Open Datasets | Yes | "The pre-training data is a mixture of various corpora, including a cleaned version of Common Crawl, Dolma (Soldaini et al., 2024), C4 (Raffel et al., 2020), the Pile (Gao et al., 2020), the Stack (Kocetkov et al., 2022), StarCoder (Li et al., 2023), and other collected raw corpora."
Dataset Splits | Yes | "We introduce a tiny validation dataset, which shares the same distribution as the pre-training data. We conduct deduplication to eliminate any intersections between validation and pre-training data. ... For the measurement of sparsity, to eliminate the impact of stochastic factors (especially the sparsity fluctuations during the early stage), we employ a sparsity stabilizing strategy (see Appendix E). ... the task-specific performance is evaluated on checkpoints after the decay stage."
Hardware Specification | Yes | "Both frameworks are compiled with CUDA enabled and run on the same machine with 104 CPUs and 1 NVIDIA A800 GPU."
Software Dependencies | No | The paper mentions "CUDA enabled" but does not specify a version number for CUDA or for any other software dependency, such as Python, PyTorch, or the specific builds of PowerInfer and llama.cpp.
Experiment Setup | Yes | "We employ the following pre-training hyper-parameters across all settings: peak learning rate lr = 0.01, β1 = 0.9, β2 = 0.95, weight decay = 0.1. The batch size depends on the parameter scale, as presented in Table 3."
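The Pseudocode row refers to the paper's Algorithm 1, which finds a CETT hyper-parameter such that the resulting sparsity keeps perplexity degradation within p%. The paper's exact procedure is not reproduced here; the sketch below only illustrates the general idea with a bisection search, assuming (hypothetically) that perplexity is monotonically non-decreasing in the CETT value. The names `evaluate_ppl`, `base_ppl`, and the search bounds are our own placeholders, not the paper's API.

```python
def find_cett(evaluate_ppl, base_ppl, p, lo=0.0, hi=1.0, tol=1e-4):
    """Bisection sketch: find the largest CETT value whose perplexity
    stays within a p% increase over the dense baseline.

    evaluate_ppl: callable mapping a candidate CETT value to validation PPL
                  (hypothetical; assumed monotonically non-decreasing).
    base_ppl:     perplexity of the dense (non-sparsified) model.
    p:            allowed relative PPL increase, in percent.
    """
    target = base_ppl * (1.0 + p / 100.0)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if evaluate_ppl(mid) <= target:
            lo = mid  # still within budget: try a larger (sparser) CETT
        else:
            hi = mid  # over budget: shrink CETT
    return lo
```

With a toy monotone surrogate such as `lambda c: 10.0 * (1 + c)` and `p = 5`, the search converges near `c = 0.05`, the point where PPL hits the 5% budget.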
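The Experiment Setup row reports the hyper-parameters shared across all settings. A minimal sketch of collecting them into keyword arguments for a standard AdamW-style optimizer follows; the dict name and helper function are ours for illustration, and batch size is deliberately omitted because the paper says it varies with parameter scale (its Table 3).

```python
# Shared pre-training hyper-parameters reported in the paper.
PRETRAIN_HPARAMS = {
    "lr": 0.01,            # peak learning rate
    "betas": (0.9, 0.95),  # Adam beta1 / beta2
    "weight_decay": 0.1,
}

def optimizer_kwargs(hparams=PRETRAIN_HPARAMS):
    """Return a copy of the reported values shaped as keyword arguments
    accepted by common AdamW implementations (e.g. torch.optim.AdamW)."""
    return dict(hparams)
```

For example, `torch.optim.AdamW(model.parameters(), **optimizer_kwargs())` would apply these settings in a PyTorch training loop.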