Parameter-Efficient Fine-Tuning of State Space Models

Authors: Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, Kangwook Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark six widely used PEFT methods across three categories on diverse tasks, including natural language understanding, generation, and computer vision. We evaluate these methods on both SSM-based models (i.e., Mamba) and a hybrid model (i.e., Jamba (Lieber et al., 2025)). Our results show that LoRA consistently outperforms all other PEFT methods on both SSM-based and hybrid models. Through extensive experiments, we demonstrate that integrating SDT into SSM-based models, combined with applying LoRA to their linear projection matrices, achieves state-of-the-art fine-tuning performance.
Researcher Affiliation | Collaboration | 1Furiosa AI, 2Seoul National University, 3University of Wisconsin-Madison.
Pseudocode | Yes | The resulting dimension selection approach is outlined in the pseudo-code (Alg. 1), which corresponds to the update scheme illustrated in Fig. 1.
Open Source Code | Yes | The roadmap of our paper is illustrated in Fig. 1. Our code is available at https://github.com/furiosa-ai/ssm-peft.
Open Datasets | Yes | We use six datasets spanning different domains: GLUE for natural language understanding (Wang et al., 2019), DART for RDF-to-text generation (Nan et al., 2021), SAMSum (Gliwa et al., 2019) for summarization, Spider for text-to-SQL generation (Yu et al., 2018), and two vision datasets, CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015).
Dataset Splits | Yes | The dataset characteristics, including our train, validation and test set sizes, sequence lengths, and number of epochs, are summarized in Table 5.
Hardware Specification | Yes | All experiments were carried out on a single H100 GPU, and the reported metrics represent averages across the four simulations.
Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers for replication.
Experiment Setup | Yes | We fine-tune pretrained Mamba and Jamba models with AdamW with a linear learning rate decay schedule. For LoRA we set rank to 8, alpha to 8, and dropout to 0.1 for all experiments.
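The reported setup (LoRA rank 8, alpha 8, dropout 0.1, AdamW with linear learning-rate decay) can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' code: the names `lora_config` and `linear_decay` are hypothetical, and the base learning rate is an assumed placeholder since the paper excerpt does not quote one.

```python
# Hyperparameters quoted in the paper's experiment setup.
# With alpha == rank, the standard LoRA scaling factor alpha / rank is 1.0.
lora_config = {
    "r": 8,             # LoRA rank
    "lora_alpha": 8,    # scaling numerator; effective scale = alpha / r
    "lora_dropout": 0.1,
}

def linear_decay(step: int, total_steps: int, base_lr: float) -> float:
    """Linear learning-rate decay from base_lr down to 0 over total_steps,
    as typically paired with AdamW fine-tuning."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# Example: with an assumed base LR of 1e-3, the LR halves at the midpoint.
scale = lora_config["lora_alpha"] / lora_config["r"]  # 1.0
midpoint_lr = linear_decay(50, 100, 1e-3)             # 5e-4
```

In practice this corresponds to constructing a LoRA adapter config and a linear scheduler in whatever training framework is used; only the hyperparameter values above are taken from the paper.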