MambaPEFT: Exploring Parameter-Efficient Fine-Tuning for Mamba
Authors: Masakazu Yoshimura, Teruaki Hayashi, Yota Maeda
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments indicate that PEFT performs more effectively for Mamba than Transformers. Lastly, we demonstrate how to effectively combine multiple PEFT methods and provide a framework that outperforms previous works. ... In the experiments, we benchmarked Mamba using PEFT methods, including seven main methods and a total of 20 derived variations (see Figure 1). ... We conduct our evaluation on the VTAB-1k image classification dataset (Zhai et al., 2019). ... In addition to the image tasks, we evaluate our method on language tasks using the vanilla Mamba Gu & Dao (2023). |
| Researcher Affiliation | Industry | Masakazu Yoshimura, Teruaki Hayashi & Yota Maeda, Sony Group Corporation, Japan |
| Pseudocode | Yes | The detailed algorithm is provided in Appendix B. ... Algorithm 1 Hybrid PEFT Search Algorithm |
| Open Source Code | Yes | The source code is available at: https://github.com/sony/mambapeft. |
| Open Datasets | Yes | We conduct our evaluation on the VTAB-1k image classification dataset (Zhai et al., 2019). ... We adopt pre-trained weights trained with ImageNet-1k (Deng et al., 2009) using the DeiT (Touvron et al., 2021) training framework in all models. ... We experiment with a commonsense reasoning task, following the setup and dataset of Hu et al. (2023). |
| Dataset Splits | Yes | For each task, 1000 images are used for training. ... This experiment uses 170k datasets, in contrast to the 1k used for VTAB-1k. ... We used official test data of VTAB-1k as training data and vice versa. ... Each model is fine-tuned with about 140,000 data for three epochs with a batch size of 16. |
| Hardware Specification | Yes | By processing five tasks in parallel on one A100 GPU, one trial can be completed in around 20 minutes, with minimal dependency on the type and size of the applied PEFT methods. |
| Software Dependencies | No | The paper mentions using the "AdamW optimizer" and "Optuna" for hyperparameter optimization but does not provide specific version numbers for these or other software libraries (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We follow the setup of Jie & Deng (2023) in our experiments, using the AdamW optimizer (Loshchilov & Hutter, 2017) and training the model for 100 epochs. The learning rate is set to 1e-3, with a cosine scheduler and a warmup period of 10 epochs. A weight decay with 1e-4 magnitude is applied. We do not perform data augmentation. ... Each model is fine-tuned with about 140,000 data for three epochs with a batch size of 16. A linear learning rate scheduler is used with a warmup period of 100 iterations. ... The learning rate configurations for language tasks are shown in Table 8. |
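The image-task schedule quoted above (base learning rate 1e-3, linear warmup for 10 epochs, cosine decay over 100 epochs) can be written out explicitly. This is a minimal sketch, assuming a per-epoch schedule with linear warmup followed by cosine annealing to zero; the paper does not specify the exact boundary handling or minimum learning rate, so those details are assumptions here.

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=10, total_epochs=100):
    """Learning rate for a given epoch under linear warmup + cosine decay.

    Assumed convention: warmup ramps linearly from base_lr/warmup_epochs
    at epoch 0 up to base_lr, then cosine-annealing decays toward zero
    by the final epoch.
    """
    if epoch < warmup_epochs:
        # Linear warmup over the first `warmup_epochs` epochs.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, the schedule reaches its peak of 1e-3 at the end of warmup and falls to half the base rate at the midpoint of the cosine phase. In a PyTorch training loop this would typically be realized with `torch.optim.AdamW(..., weight_decay=1e-4)` and a warmup-plus-cosine scheduler.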