DeciMamba: Exploring the Length Extrapolation Potential of Mamba

Authors: Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja Giryes

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths significantly longer than those seen during training, while enjoying faster inference.
Researcher Affiliation | Collaboration | Tel Aviv University; Google Research
Pseudocode | Yes | Algorithm 1: Decimated SSM
Open Source Code | Yes | https://github.com/assafbk/DeciMamba... "First, we provide the source code used for the key experiments."
Open Datasets | Yes | Passkey Retrieval task... WikiText (Merity et al., 2016)... SQuAD v2 (Rajpurkar et al., 2018)... PG-19 dataset... The Pile dataset (Gao et al., 2020).
Dataset Splits | Yes | Each model is trained for 5 epochs... In each epoch the models train over 6144 sequences of length 2K... We train each model with data from SQuAD v2... Our training samples have the following form: Ndocs <Document>; <Answer>... During evaluation we use the same setting but vary the value of Ndocs... During training we sample a single window from each example and train on it (for the extrapolating models the window length is 2K; for the lower-bound models the window length is equal to the context length trained on). During evaluation, for each example we evaluate 10 windows with a maximal constant stride. We evaluate only the last 100 labels in each window, which represent the extrapolation abilities of the model at sequence lengths in the range of [ctx_len − 100, ctx_len], providing an approximation to the model's performance at the wanted ctx_len.
Hardware Specification | Yes | We benchmark both DeciMamba and Mamba with an NVIDIA RTX A6000 GPU.
Software Dependencies | No | The paper mentions the "AdamW optimizer (Kingma & Ba, 2017)", but this is an algorithm, not a software dependency with a specific version number. It also states "Our code is based on the official Mamba implementation" with a GitHub link, but no specific software versions (e.g., Python, PyTorch, CUDA) are provided.
Experiment Setup | Yes | Each model is trained for 5 epochs with a learning rate of 1e-4, gradient clipping of 1, a batch size of 32 (using batch accumulation), and the AdamW optimizer (Kingma & Ba, 2017) with a weight decay of 0.1... We train for two epochs (1500 steps each), use a learning rate of 1e-4, gradient clipping of 1, a batch size of 64 (using batch accumulation), and the AdamW optimizer with a weight decay of 0.1... We train each model on a total of 100M tokens with a learning rate of 1e-4, gradient clipping of 1, a batch size of 250 (using batch accumulation), and the AdamW optimizer with a weight decay of 0.1.
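The table above only names Algorithm 1 (Decimated SSM). As a rough illustration of what decimating a sequence between layers could look like, the following minimal sketch keeps the top-scoring tokens in sequence order. The function name, the keep ratio, and the caller-supplied importance scores (standing in for whatever per-token signal the paper's algorithm actually uses) are all assumptions, not the paper's implementation.

```python
def decimate(hidden_states, scores, keep_ratio=0.5):
    """Keep the tokens with the highest importance scores.

    hidden_states: per-token values (length L), here any Python objects.
    scores: per-token importance (length L). Assumption: arbitrary floats
            supplied by the caller, not the paper's actual criterion.
    Returns the surviving tokens in their original sequence order.
    """
    L = len(hidden_states)
    k = max(1, int(L * keep_ratio))
    # indices of the k highest-scoring tokens, restored to sequence order
    keep = sorted(sorted(range(L), key=lambda i: scores[i], reverse=True)[:k])
    return [hidden_states[i] for i in keep]

tokens = ["t0", "t1", "t2", "t3"]
scores = [0.9, 0.1, 0.8, 0.2]
print(decimate(tokens, scores, keep_ratio=0.5))  # ['t0', 't2']
```

Restoring sequence order after selection matters: an SSM processes tokens recurrently, so the survivors must stay in their original positions relative to each other.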
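The Dataset Splits row describes evaluating 10 windows per example "with a maximal constant stride". A sketch of one plausible reading, where the windows are spread evenly so the last one ends at the end of the sequence, is below; the function name and this interpretation of "maximal constant stride" are assumptions.

```python
def window_starts(seq_len, window_len, num_windows=10):
    """Start offsets for evaluation windows with a maximal constant stride.

    Assumption: "maximal constant stride" means the windows are spaced
    evenly so that the last window ends exactly at the sequence end.
    """
    if seq_len < window_len:
        raise ValueError("sequence shorter than the evaluation window")
    if num_windows == 1:
        return [0]
    stride = (seq_len - window_len) // (num_windows - 1)
    return [i * stride for i in range(num_windows)]

# e.g. a 20K-token sequence evaluated with 2K-token windows
starts = window_starts(seq_len=20_000, window_len=2_000, num_windows=10)
print(starts[0], starts[-1])  # 0 18000
```

Scoring only the last 100 labels of each such window then measures the model at positions in [ctx_len − 100, ctx_len], as the row above explains.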
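The Experiment Setup row gives a 100M-token budget with an effective batch of 250 via batch accumulation. Assuming 2K-token sequences as elsewhere in the report (an assumption; the paper may use a different length for this run), the implied number of optimizer steps works out as follows:

```python
def num_optimizer_steps(token_budget, batch_size, seq_len):
    """Optimizer steps needed to consume a token budget.

    Assumption: each example contributes seq_len tokens, and batch
    accumulation makes the effective batch equal to batch_size.
    """
    tokens_per_step = batch_size * seq_len
    return token_budget // tokens_per_step

# 100M-token run from the setup above, assuming 2K-token (2048) sequences
print(num_optimizer_steps(100_000_000, batch_size=250, seq_len=2_048))  # 195
```

With batch accumulation, each of these steps would itself be split into several smaller forward/backward passes whose gradients are summed before the single AdamW update.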