Sparsified State-Space Models are Efficient Highway Networks

Authors: Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that Simba, obtained by sparsifying Mamba without any fine-tuning, significantly outperforms Mamba using the same number of FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences.
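The row above describes training-free sparsification: Simba is obtained by pruning tokens from a pretrained Mamba, with the surviving tokens acting as "highways" for long-range information flow. A minimal sketch of layer-wise token pruning is below; the scoring rule and function names are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def sparsify_layer(hidden, scores, keep_ratio):
    """Keep the top-`keep_ratio` fraction of tokens by importance score.

    hidden: (seq_len, d_model) array of token states at one layer.
    scores: (seq_len,) per-token importance (hypothetical scoring rule;
            the actual Simba criterion may differ).
    Returns the retained states and their original positions, so that
    kept tokens preserve their sequence order.
    """
    seq_len = hidden.shape[0]
    n_keep = max(1, int(round(seq_len * keep_ratio)))
    # Take the n_keep highest-scoring tokens, then restore original order.
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return hidden[keep], keep
```

Applying this per layer with a decreasing keep ratio yields the FLOPS savings quoted above, since deeper layers process progressively fewer tokens.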
Researcher Affiliation | Academia | Woomin Song, Korea Advanced Institute of Science & Technology (KAIST); Jihoon Tack, KAIST; Sangwoo Mo, University of Michigan, Ann Arbor; Seunghyuk Oh, KAIST; Jinwoo Shin, KAIST
Pseudocode | No | The paper describes the proposed method using mathematical formulas and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/woominsong/Simba.
Open Datasets | Yes | We also demonstrate the language modeling ability of Simba by measuring perplexity on the PG-19 dataset (Rae et al., 2019) across different context lengths. Simba consistently achieves better FLOPS-accuracy curves on 6 NLP benchmarks, including Lambada (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC-Challenge (Clark et al., 2018), ARC-Easy (Clark et al., 2018), and WinoGrande (Sakaguchi et al., 2021). We perform a simple fine-tuning experiment, further training the Mamba-370m model with the MiniPile dataset (Kaddour, 2023), which is a subset of the pre-training dataset (the Pile (Gao et al., 2020)).
Dataset Splits | No | The paper describes evaluation methodologies using few-shot prompts (e.g., "10-shot prompts for HellaSwag") and measuring perplexity on sampled snippets, but it does not specify explicit training, validation, or test dataset splits in terms of percentages, sample counts, or references to predefined splits for the datasets used for model training or fine-tuning.
Hardware Specification | Yes | All experiments are done on RTX-3090 or RTX-2080 GPUs. All measurements are taken on 8 RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions using a "PyTorch implementation of Mamba's parallel scan operation (Torres-Leguet, 2024)" and the "AdamW optimizer" but does not provide specific version numbers for PyTorch, Python, CUDA, or other key software libraries.
Experiment Setup | Yes | For our method, we implement a linear pruning schedule that preserves 10% of the tokens at the final layer unless specified otherwise. We use the AdamW optimizer with a learning rate of 5e-5. We schedule the learning rate with a linear warmup for 10% of the total training steps and cosine learning rate decay for the remaining steps. We train the model for 400 steps on the MiniPile dataset (Kaddour, 2023). We randomly select the pruning ratio between 0% and 90% for each sample.
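The quoted setup fixes the key hyperparameters (10% tokens kept at the final layer, lr 5e-5, 10% linear warmup, cosine decay, 400 steps). A minimal sketch of both schedules follows; the function names and the per-layer interpolation formula are assumptions consistent with, but not quoted from, the paper.

```python
import math

def keep_ratio(layer, num_layers, final_keep=0.10):
    """Linear pruning schedule: keep all tokens at layer 0,
    decreasing linearly to `final_keep` (10%) at the last layer."""
    if num_layers == 1:
        return final_keep
    return 1.0 - (1.0 - final_keep) * layer / (num_layers - 1)

def lr_at(step, total_steps=400, base_lr=5e-5, warmup_frac=0.10):
    """Linear warmup for `warmup_frac` of steps, then cosine decay to 0."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        # Ramp linearly from base_lr/warmup up to base_lr.
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With the quoted settings, warmup spans the first 40 of 400 steps; the learning rate peaks at 5e-5 at step 40 and decays toward zero by step 400.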