Understanding and Improving Length Generalization in Recurrent Models
Authors: Ricardo Buitrago, Albert Gu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we provide comprehensive empirical and theoretical analysis to support the unexplored states hypothesis, which posits that models fail to length generalize when, during training, they are only exposed to a limited subset of the distribution of all attainable states (i.e. states that would be attained if the recurrence were applied to long sequences). Furthermore, we investigate simple training interventions that aim to increase the coverage of the states that the model is trained on, e.g. by initializing the state with Gaussian noise or with the final state of a different input sequence. With only 500 post-training steps (~0.1% of the pretraining budget), these interventions enable length generalization for sequences that are orders of magnitude longer than the training context (e.g. 2k → 128k) and show improved performance in long context tasks, thus presenting a simple and efficient way to enable robust length generalization in general recurrent models. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Cartesia AI. Correspondence to: Ricardo Buitrago Ruiz <EMAIL>. |
| Pseudocode | Yes | I. State Passing PyTorch Pseudocode |
| Open Source Code | No | The paper does not explicitly state that the authors' implementation code is open-source, nor does it provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | Position-wise perplexity as a function of token position on the Pile validation dataset (Gao et al., 2020) for the official Mamba-1 and Mamba-2 checkpoints trained with context T = 2048, as well as for Gated Linear Attention (GLA) models trained with context T = 512. BABILong (Kuratov et al., 2024) is a challenging benchmark which tests both the common sense understanding of a model as well as its ability to capture long range dependencies in text. We finetune the official Mamba-2 checkpoints on the passkey retrieval task using the same procedure as section B.1 |
| Dataset Splits | Yes | For the experiments in Section 3, we train on the Pile (Gao et al., 2020) with the EleutherAI/gpt-neox-20b tokenizer (Black et al., 2022). Position-wise perplexity is reported as a function of token position on the Pile validation dataset (Gao et al., 2020). In the finetuned setting, all the models are finetuned on BABILong without State Passing nor TBTT, thus the benefits of having a State Passing or TBTT finetuned checkpoint are not lost when finetuning again for this task. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions software components like the "EleutherAI/gpt-neox-20b tokenizer", "Adam optimizer", "RMSNorm", and implicitly "PyTorch" (from the pseudocode), but does not specify their version numbers, which is required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | For the learning rate, we use cosine scheduling with warmup over the first 10% of training steps, a peak learning rate given by Table 2, and a decay to 1e-5. The gradients are clipped to 1.0 and no dropout is used. Additionally, we also follow the improved training recipe of Grattafiori et al. (2024), with an Adam optimizer with β1 = 0.9 and β2 = 0.95, weight decay scheduling with a peak of 0.01, RMSNorm (Zhang & Sennrich, 2019) instead of Layer Norm, and no linear biases. For Mamba-1 and Mamba-2 we use a training context of 2048 and for GLA we use a context of 512. Table 2 lists model configurations and training hyperparameters (columns: Architecture, Params, n_layers, d_model, n_heads / d_head, d_state, Learning Rate, Batch Size); the learning rates follow the values of previous works (Brown et al., 2020; Biderman et al., 2023). |
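The State Passing intervention quoted above (initializing the recurrence state with Gaussian noise or with the final state of a different input sequence) can be illustrated with a toy sketch. This is not the authors' implementation: the scalar linear recurrence, the function names, and the 50/50 noise-vs-passed-state mixing ratio are all illustrative assumptions standing in for a real recurrent layer such as Mamba or GLA.

```python
import random

def toy_recurrence(xs, h0=0.0, decay=0.9):
    """Minimal scalar linear recurrence h_t = decay * h_{t-1} + x_t.
    Stands in for a recurrent layer's state update; returns the final state."""
    h = h0
    for x in xs:
        h = decay * h + x
    return h

def state_passing_batch(sequences, noise_std=1.0, rng=None):
    """Sketch of the State Passing idea: instead of always starting from the
    zero state, initialize each sequence's state with the final state of a
    *different* sequence, or with Gaussian noise, so that training covers
    states that would be attained on much longer inputs."""
    rng = rng or random.Random(0)
    # Final states from an ordinary zero-state pass over each sequence.
    finals = [toy_recurrence(seq) for seq in sequences]
    # Rotate so each sequence receives another sequence's final state.
    shuffled = finals[1:] + finals[:1]
    outputs = []
    for seq, h0 in zip(sequences, shuffled):
        if rng.random() < 0.5:               # hypothetical mixing ratio
            h0 = rng.gauss(0.0, noise_std)   # Gaussian-noise initialization
        outputs.append(toy_recurrence(seq, h0=h0))
    return outputs
```

In a real model the state is a tensor per layer and the passed-in state comes from another batch element's forward pass, but the coverage argument is the same: the model is trained from non-zero initial states it would otherwise only see on long sequences.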
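The learning-rate recipe in the Experiment Setup row (linear warmup over the first 10% of steps, cosine decay from a Table 2 peak down to 1e-5) can be written out directly. The function below is a hedged sketch: the linear warmup shape and the exact step bookkeeping are assumptions, since the paper only states the schedule's endpoints and warmup fraction.

```python
import math

def lr_at_step(step, total_steps, peak_lr, min_lr=1e-5, warmup_frac=0.10):
    """Cosine LR schedule with linear warmup, matching the quoted recipe:
    warmup over the first 10% of steps to peak_lr (from the paper's Table 2),
    then cosine decay down to min_lr = 1e-5."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear warmup from ~0 to peak_lr (warmup shape is an assumption).
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

In practice this would be wrapped in a PyTorch `LambdaLR` scheduler around the Adam optimizer with β1 = 0.9, β2 = 0.95, and gradient clipping at 1.0, as the setup describes.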