DeciMamba: Exploring the Length Extrapolation Potential of Mamba

Authors: Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, Raja Giryes

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths significantly longer than those seen during training, while enjoying faster inference.
Researcher Affiliation | Collaboration | Tel Aviv University; Google Research
Pseudocode | Yes | Algorithm 1: Decimated SSM
Open Source Code | Yes | https://github.com/assafbk/DeciMamba... "First, we provide the source code used for the key experiments."
Open Datasets | Yes | Passkey Retrieval task... WikiText (Merity et al., 2016)... SQuAD v2 (Rajpurkar et al., 2018)... PG-19 dataset... The Pile dataset (Gao et al., 2020).
Dataset Splits | Yes | Each model is trained for 5 epochs... In each epoch the models train over 6144 sequences of length 2K... We train each model with data from SQuAD v2... Our training samples have the following form: Ndocs <Document>; <Answer>... During evaluation we use the same setting but vary the value of Ndocs... During training we sample a single window from each example and train on it (for the extrapolating models the window length is 2K; for the lower-bound models the window length is equal to the context length trained on). During evaluation, for each example we evaluate 10 windows with a maximal constant stride. We evaluate only the last 100 labels in each window, which represent the extrapolation abilities of the model at sequence lengths in the range of [ctx_len − 100, ctx_len], providing an approximation to the model's performance at the wanted ctx_len.
Hardware Specification | Yes | We benchmark both DeciMamba and Mamba with an NVIDIA RTX A6000 GPU.
Software Dependencies | No | The paper mentions the "AdamW optimizer (Kingma & Ba, 2017)", but this is an algorithm, not a software dependency with a specific version number. It also states "Our code is based on the official Mamba implementation" with a GitHub link, but no specific software versions (e.g., Python, PyTorch, CUDA) are provided.
Experiment Setup | Yes | Each model is trained for 5 epochs with a learning rate of 1e-4, gradient clipping of 1, a batch size of 32 (using batch accumulation), and the AdamW optimizer (Kingma & Ba, 2017) with a weight decay of 0.1... We train for two epochs (1500 steps each), use a learning rate of 1e-4, gradient clipping of 1, a batch size of 64 (using batch accumulation), and the AdamW optimizer with a weight decay of 0.1... We train each model on a total of 100M tokens with a learning rate of 1e-4, gradient clipping of 1, a batch size of 250 (using batch accumulation), and the AdamW optimizer with a weight decay of 0.1.
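The table above only names Algorithm 1 (Decimated SSM). As a rough illustration of what decimating a sequence between layers could look like, the following minimal sketch keeps the top-scoring tokens in sequence order. The function name, the keep ratio, and the caller-supplied importance scores (standing in for whatever per-token signal the paper's algorithm actually uses) are all assumptions, not the paper's implementation.

```python
def decimate(hidden_states, scores, keep_ratio=0.5):
    """Keep the tokens with the highest importance scores.

    hidden_states: per-token values (length L), here any Python objects.
    scores: per-token importance (length L). Assumption: arbitrary floats
            supplied by the caller, not the paper's actual criterion.
    Returns the surviving tokens in their original sequence order.
    """
    L = len(hidden_states)
    k = max(1, int(L * keep_ratio))
    # indices of the k highest-scoring tokens, restored to sequence order
    keep = sorted(sorted(range(L), key=lambda i: scores[i], reverse=True)[:k])
    return [hidden_states[i] for i in keep]

tokens = ["t0", "t1", "t2", "t3"]
scores = [0.9, 0.1, 0.8, 0.2]
print(decimate(tokens, scores, keep_ratio=0.5))  # ['t0', 't2']
```

Restoring sequence order after selection matters: an SSM processes tokens recurrently, so the survivors must stay in their original positions relative to each other.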
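The Dataset Splits row describes evaluating 10 windows per example "with a maximal constant stride". A sketch of one plausible reading, where the windows are spread evenly so the last one ends at the end of the sequence, is below; the function name and this interpretation of "maximal constant stride" are assumptions.

```python
def window_starts(seq_len, window_len, num_windows=10):
    """Start offsets for evaluation windows with a maximal constant stride.

    Assumption: "maximal constant stride" means the windows are spaced
    evenly so that the last window ends exactly at the sequence end.
    """
    if seq_len < window_len:
        raise ValueError("sequence shorter than the evaluation window")
    if num_windows == 1:
        return [0]
    stride = (seq_len - window_len) // (num_windows - 1)
    return [i * stride for i in range(num_windows)]

# e.g. a 20K-token sequence evaluated with 2K-token windows
starts = window_starts(seq_len=20_000, window_len=2_000, num_windows=10)
print(starts[0], starts[-1])  # 0 18000
```

Scoring only the last 100 labels of each such window then measures the model at positions in [ctx_len − 100, ctx_len], as the row above explains.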
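The Experiment Setup row gives a 100M-token budget with an effective batch of 250 via batch accumulation. Assuming 2K-token sequences as elsewhere in the report (an assumption; the paper may use a different length for this run), the implied number of optimizer steps works out as follows:

```python
def num_optimizer_steps(token_budget, batch_size, seq_len):
    """Optimizer steps needed to consume a token budget.

    Assumption: each example contributes seq_len tokens, and batch
    accumulation makes the effective batch equal to batch_size.
    """
    tokens_per_step = batch_size * seq_len
    return token_budget // tokens_per_step

# 100M-token run from the setup above, assuming 2K-token (2048) sequences
print(num_optimizer_steps(100_000_000, batch_size=250, seq_len=2_048))  # 195
```

With batch accumulation, each of these steps would itself be split into several smaller forward/backward passes whose gradients are summed before the single AdamW update.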