Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models

Authors: Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical evaluations on the LLaMA2, Mistral, and Gemma model families demonstrate that DaSS not only achieves superior perplexity and accuracy compared to SparseGPT and Wanda in achieving hardware-friendly N:M sparsity patterns but also maintains the computational efficiency of Wanda. We perform extensive experiments on LLaMA2 (Touvron et al., 2023), Gemma (Team et al., 2024), and Mistral (Jiang et al., 2023) to evaluate DaSS across various tasks, from language modeling to 5 commonsense reasoning tasks.
Researcher Affiliation | Academia | Zhiyu Guo (EMAIL), Nara Institute of Science and Technology; Hidetaka Kamigaito (EMAIL), Nara Institute of Science and Technology; Taro Watanabe (EMAIL), Nara Institute of Science and Technology
Pseudocode | No | The paper describes methods in narrative form, without structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/guozhiyu/glu_dass
Open Datasets | Yes | We used the same calibration dataset as SparseGPT and Wanda in their model pruning processes, consisting of 128 sequences of 2048 tokens each, randomly selected from the first shard of the C4 dataset (Raffel et al., 2020). For perplexity evaluation, we use the validation set of WikiText2 (Merity et al., 2017).
Dataset Splits | Yes | We used the same calibration dataset as SparseGPT and Wanda in their model pruning processes, consisting of 128 sequences of 2048 tokens each, randomly selected from the first shard of the C4 dataset (Raffel et al., 2020). For perplexity evaluation, we use the validation set of WikiText2 (Merity et al., 2017).
Hardware Specification | Yes | We use a single A6000 48GB GPU to prune the 7B and 13B models, and 8 A100 40GB GPUs to prune the larger 70B model. The speed is tested on an Intel Xeon Platinum 8160 CPU with 24 cores.
Software Dependencies | No | The paper mentions PyTorch, the CUTLASS library, the cuSPARSELt library, Hugging Face Transformers (Wolf et al., 2019), and Lm-Evaluation-Harness (Gao et al., 2021), but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We apply a uniform sparsity ratio across all the pruned layers and evaluate three sparsity types: unstructured sparsity, and semi-structured sparsities of 4:8 and 2:4. We used the same calibration dataset as SparseGPT and Wanda in their model pruning processes, consisting of 128 sequences of 2048 tokens each, randomly selected from the first shard of the C4 dataset (Raffel et al., 2020). We set the context size for perplexity evaluation to 2048 for all the models. We empirically find that α = 0.5 is a well-balanced point for different models and datasets in our preliminary studies.
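The N:M semi-structured sparsity patterns mentioned in the setup (2:4 and 4:8) keep exactly N nonzero weights in every group of M consecutive weights. The sketch below illustrates this masking with plain magnitude scoring; it is a minimal assumption-laden example, not the paper's DaSS method, which uses a dependency-aware importance metric rather than weight magnitude alone.

```python
import numpy as np

def nm_prune(weights: np.ndarray, n: int = 2, m: int = 4) -> np.ndarray:
    """Zero out the (m - n) smallest-magnitude entries in every group of
    m consecutive weights along each row, yielding an N:M sparse matrix.

    Illustrative sketch only: scores here are plain |w|, whereas methods
    such as Wanda, SparseGPT, or the paper's DaSS use calibration-based
    importance scores in place of raw magnitude.
    """
    assert weights.size % m == 0, "total number of weights must divide by m"
    w = weights.reshape(-1, m)                      # groups of m weights
    keep = np.argsort(np.abs(w), axis=1)[:, -n:]    # n largest per group
    mask = np.zeros_like(w, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (w * mask).reshape(weights.shape)

# Toy 2x4 weight matrix: each row is one group of 4; 2:4 keeps 2 per group.
W = np.arange(1.0, 9.0).reshape(2, 4)
print(nm_prune(W))  # [[0. 0. 3. 4.], [0. 0. 7. 8.]]
```

Hardware such as NVIDIA Ampere GPUs (via libraries like cuSPARSELt, cited in the report) can exploit exactly this 2:4 pattern for accelerated sparse matrix multiplication, which is why the evaluation targets these patterns rather than unstructured sparsity alone.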