DeciMamba: Exploring the Length Extrapolation Potential of Mamba
Authors: Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja Giryes
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are significantly longer than the ones seen during training, while enjoying faster inference. |
| Researcher Affiliation | Collaboration | 1Tel Aviv University, 2Google Research |
| Pseudocode | Yes | Algorithm 1 Decimated SSM |
| Open Source Code | Yes | https://github.com/assafbk/DeciMamba... First, we provide the source code used for the key experiments. |
| Open Datasets | Yes | Passkey Retrieval task... WikiText (Merity et al., 2016)... SQuAD v2 (Rajpurkar et al., 2018)... PG-19 dataset... The Pile dataset (Gao et al., 2020). |
| Dataset Splits | Yes | Each model is trained for 5 epochs... In each epoch the models train over 6144 sequences of length 2K... We train each model with data from SQuAD v2... Our training samples have the following form: Ndocs <Document>; <Answer>... During evaluation we use the same setting but vary the value of Ndocs... During training we sample a single window from each example and train on it (for the extrapolating models the window length is 2K; for the lower-bound models the window length is equal to the context length trained on). During evaluation, for each example we evaluate 10 windows with a maximal constant stride. We evaluate only the last 100 labels in each window, which represent the extrapolation abilities of the model at sequence lengths in the range of [ctx_len − 100, ctx_len], providing an approximation to the model's performance at the wanted ctx_len. |
| Hardware Specification | Yes | We benchmark both DeciMamba and Mamba with an NVIDIA RTX A6000 GPU |
| Software Dependencies | No | The paper mentions the "AdamW optimizer (Kingma & Ba, 2017)", but this is an algorithm, not a software dependency with a specific version number. It also states "Our code is based on the official Mamba implementation" with a GitHub link, but no specific software versions (e.g., Python, PyTorch, or CUDA versions) are provided. |
| Experiment Setup | Yes | Each model is trained for 5 epochs with a learning rate of 1e-4, gradient clipping of 1, batch size of 32 (used batch accumulation) and AdamW optimizer (Kingma & Ba, 2017) with weight decay of 0.1... We train for two epochs (1500 steps in each), use a learning rate of 1e-4, gradient clipping of 1, batch size of 64 (used batch accumulation), and AdamW optimizer with weight decay of 0.1... We train each model on a total of 100M tokens with a learning rate of 1e-4, gradient clipping of 1, batch size of 250 (used batch accumulation) and AdamW optimizer with weight decay of 0.1. |
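The Dataset Splits row quotes a training-sample format of the form "Ndocs <Document>; <Answer>", where Ndocs documents are concatenated before the answer. A minimal sketch of how such a sample might be assembled is below; the function name, the whitespace joiner, and the exact delimiters are assumptions for illustration, not taken from the paper's released code.

```python
def build_sample(documents, answer, n_docs):
    """Illustrative construction of a multi-document training sample:
    n_docs documents concatenated, followed by '; <answer>'.
    The '; ' separator is an assumption based on the quoted format."""
    if n_docs > len(documents):
        raise ValueError("not enough documents for requested n_docs")
    docs = " ".join(documents[:n_docs])
    return f"{docs}; {answer}"
```

During evaluation the same construction would be reused with a varying `n_docs`, which is how the quoted protocol scales the effective context length.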
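The evaluation protocol quoted under Dataset Splits (10 windows per example with a maximal constant stride, scoring only the last 100 labels of each window) can be sketched as follows. This is a hedged reconstruction of the described procedure, not the authors' implementation; all names are illustrative.

```python
def make_eval_windows(seq_len, window_len, num_windows=10):
    """Start offsets for `num_windows` windows of `window_len` tokens,
    spaced by the maximal constant stride that still fits in the sequence."""
    if window_len >= seq_len:
        return [0]
    stride = (seq_len - window_len) // (num_windows - 1)
    return [i * stride for i in range(num_windows)]

def last_k_label_positions(start, window_len, k=100):
    """Absolute positions of the last `k` labels in a window. Scoring only
    these positions approximates performance at context lengths in
    [window_len - k, window_len], as described in the quoted protocol."""
    return list(range(start + window_len - k, start + window_len))
```

For a 10K-token example evaluated with 2K windows, this yields 10 evenly strided windows whose final 100 positions probe the model near the full 2K context length.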