Sparsified State-Space Models are Efficient Highway Networks

Authors: Woomin Song, Jihoon Tack, Sangwoo Mo, Seunghyuk Oh, Jinwoo Shin

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that Simba, obtained by sparsifying Mamba without any fine-tuning, significantly outperforms Mamba using the same number of FLOPS in various natural language tasks. Moreover, we illustrate the effect of highways, showing that Simba not only enhances efficiency but also improves the information flow across long sequences.
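The row above describes training-free sparsification: Simba is obtained by pruning tokens from a pretrained Mamba, with the surviving tokens acting as "highways" for long-range information flow. A minimal sketch of layer-wise token pruning is below; the scoring rule and function names are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def sparsify_layer(hidden, scores, keep_ratio):
    """Keep the top-`keep_ratio` fraction of tokens by importance score.

    hidden: (seq_len, d_model) array of token states at one layer.
    scores: (seq_len,) per-token importance (hypothetical scoring rule;
            the actual Simba criterion may differ).
    Returns the retained states and their original positions, so that
    kept tokens preserve their sequence order.
    """
    seq_len = hidden.shape[0]
    n_keep = max(1, int(round(seq_len * keep_ratio)))
    # Take the n_keep highest-scoring tokens, then restore original order.
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return hidden[keep], keep
```

Applying this per layer with a decreasing keep ratio yields the FLOPS savings quoted above, since deeper layers process progressively fewer tokens.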
Researcher Affiliation | Academia | Woomin Song, Korea Advanced Institute of Science & Technology (KAIST); Jihoon Tack, KAIST; Sangwoo Mo, University of Michigan, Ann Arbor; Seunghyuk Oh, KAIST; Jinwoo Shin, KAIST
Pseudocode | No | The paper describes the proposed method using mathematical formulas and textual descriptions, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/woominsong/Simba.
Open Datasets | Yes | We also demonstrate the language modeling ability of Simba by measuring perplexity on the PG-19 dataset (Rae et al., 2019) across different context lengths. Simba consistently achieves better FLOPS-accuracy curves on 6 NLP benchmarks, including Lambada (Paperno et al., 2016), HellaSwag (Zellers et al., 2019), PIQA (Bisk et al., 2020), ARC-Challenge (Clark et al., 2018), ARC-Easy (Clark et al., 2018), and WinoGrande (Sakaguchi et al., 2021). We perform a simple fine-tuning experiment, further training the Mamba-370m model with the MiniPile dataset (Kaddour, 2023), which is a subset of the pre-training dataset (the Pile (Gao et al., 2020)).
Dataset Splits | No | The paper describes evaluation methodologies using few-shot prompts (e.g., "10-shot prompts for HellaSwag") and measuring perplexity on sampled snippets, but it does not specify explicit training, validation, or test dataset splits in terms of percentages, sample counts, or references to predefined splits for the datasets used for model training or fine-tuning.
Hardware Specification | Yes | All experiments are done on RTX-3090 or RTX-2080 GPUs. All measurements are taken on 8 RTX 2080 Ti GPUs.
Software Dependencies | No | The paper mentions using a "PyTorch implementation of Mamba's parallel scan operation (Torres-Leguet, 2024)" and the "AdamW optimizer" but does not provide specific version numbers for PyTorch, Python, CUDA, or other key software libraries.
Experiment Setup | Yes | For our method, we implement a linear pruning schedule that preserves 10% of the tokens at the final layer unless specified otherwise. We use the AdamW optimizer with a learning rate of 5e-5. We schedule the learning rate with a linear warmup for 10% of the total training steps and cosine learning rate decay for the remaining steps. We train the model for 400 steps on the MiniPile dataset (Kaddour, 2023). We randomly select the pruning ratio between 0% and 90% for each sample.
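The quoted setup fixes the key hyperparameters (10% tokens kept at the final layer, lr 5e-5, 10% linear warmup, cosine decay, 400 steps). A minimal sketch of both schedules follows; the function names and the per-layer interpolation formula are assumptions consistent with, but not quoted from, the paper.

```python
import math

def keep_ratio(layer, num_layers, final_keep=0.10):
    """Linear pruning schedule: keep all tokens at layer 0,
    decreasing linearly to `final_keep` (10%) at the last layer."""
    if num_layers == 1:
        return final_keep
    return 1.0 - (1.0 - final_keep) * layer / (num_layers - 1)

def lr_at(step, total_steps=400, base_lr=5e-5, warmup_frac=0.10):
    """Linear warmup for `warmup_frac` of steps, then cosine decay to 0."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        # Ramp linearly from base_lr/warmup up to base_lr.
        return base_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With the quoted settings, warmup spans the first 40 of 400 steps; the learning rate peaks at 5e-5 at step 40 and decays toward zero by step 400.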