Parameter-Efficient Fine-Tuning of State Space Models

Authors: Kevin Galim, Wonjun Kang, Yuchen Zeng, Hyung Il Koo, Kangwook Lee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We benchmark six widely used PEFT methods across three categories on diverse tasks, including natural language understanding, generation, and computer vision. We evaluate these methods on both SSM-based models (i.e., Mamba) and a hybrid model (i.e., Jamba (Lieber et al., 2025)). Our results show that LoRA consistently outperforms all other PEFT methods on both SSM-based and hybrid models. Through extensive experiments, we demonstrate that integrating SDT into SSM-based models, combined with applying LoRA to their linear projection matrices, achieves state-of-the-art fine-tuning performance.
Researcher Affiliation | Collaboration | 1Furiosa AI, 2Seoul National University, 3University of Wisconsin-Madison.
Pseudocode | Yes | The resulting dimension selection approach is outlined in the pseudo-code (Alg. 1), which corresponds to the update scheme illustrated in Fig. 1.
Open Source Code | Yes | The roadmap of our paper is illustrated in Fig. 1. Our code is available at https://github.com/furiosa-ai/ssm-peft.
Open Datasets | Yes | We use six datasets spanning different domains: GLUE for natural language understanding (Wang et al., 2019), DART for RDF-to-text generation (Nan et al., 2021), SAMSum (Gliwa et al., 2019) for summarization, Spider for text-to-SQL generation (Yu et al., 2018), and two vision datasets, CIFAR-10 (Krizhevsky et al., 2009) and CelebA (Liu et al., 2015).
Dataset Splits | Yes | The dataset characteristics, including our train, validation and test set sizes, sequence lengths, and number of epochs, are summarized in Table 5.
Hardware Specification | Yes | All experiments were carried out on a single H100 GPU, and the reported metrics represent averages across the four simulations.
Software Dependencies | No | The paper does not explicitly mention specific software dependencies with version numbers for replication.
Experiment Setup | Yes | We fine-tune pretrained Mamba and Jamba models with AdamW with a linear learning rate decay schedule. For LoRA we set rank to 8, alpha to 8, and dropout to 0.1 for all experiments.
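The reported setup (LoRA rank 8, alpha 8, dropout 0.1, AdamW with linear learning-rate decay) can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' code: the names `lora_config` and `linear_decay` are hypothetical, and the base learning rate is an assumed placeholder since the paper excerpt does not quote one.

```python
# Hyperparameters quoted in the paper's experiment setup.
# With alpha == rank, the standard LoRA scaling factor alpha / rank is 1.0.
lora_config = {
    "r": 8,             # LoRA rank
    "lora_alpha": 8,    # scaling numerator; effective scale = alpha / r
    "lora_dropout": 0.1,
}

def linear_decay(step: int, total_steps: int, base_lr: float) -> float:
    """Linear learning-rate decay from base_lr down to 0 over total_steps,
    as typically paired with AdamW fine-tuning."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

# Example: with an assumed base LR of 1e-3, the LR halves at the midpoint.
scale = lora_config["lora_alpha"] / lora_config["r"]  # 1.0
midpoint_lr = linear_decay(50, 100, 1e-3)             # 5e-4
```

In practice this corresponds to constructing a LoRA adapter config and a linear scheduler in whatever training framework is used; only the hyperparameter values above are taken from the paper.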