Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
Authors: Riccardo Grazzi, Julien Siems, Arber Zela, Jörg Franke, Frank Hutter, Massimiliano Pontil
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments confirm that extending the eigenvalue range of Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. We also show that state-tracking-enabled LRNNs can be pretrained stably and efficiently at scale (1.3B parameters), achieving competitive performance on language modeling and showing promise on code and math tasks. ... In this work, we showed the substantial impact of extending the eigenvalue range of state-transition matrices in LRNNs from [0, 1] to [−1, 1]. This modification provably enhances LRNN expressivity in state-tracking tasks, without adding overhead in training or inference. While Mamba successfully solves the parity problem, its diagonal matrix structure limits further gains. In contrast, DeltaNet, thanks to its non-diagonal state-transition matrices which enable simultaneous token and channel mixing, excels across a broader spectrum of tasks. Our results underscore the critical role of non-diagonal state-transition matrices in augmenting state-tracking capabilities, highlighting a promising direction for future LRNN advancements. |
| Researcher Affiliation | Academia | Riccardo Grazzi, Julien Siems, Arber Zela, Jörg K.H. Franke, Frank Hutter, Massimiliano Pontil (equal contribution); CSML, Istituto Italiano di Tecnologia; University of Freiburg; ELLIS Institute Tübingen; AI Centre, University College London |
| Pseudocode | No | The paper describes methods and theoretical analyses using mathematical notation, and provides code snippets in the appendix, but it does not contain explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code for part of our experiments is available at https://github.com/automl/unlocking_state_tracking |
| Open Datasets | Yes | We chose FineWeb rather than FineWeb-Edu since it contains more code. We aligned our training pipeline with Yang et al. (2024b); see Appendix E.3.1 for details. ... To test this hypothesis, we evaluate the perplexity of these models in a length extrapolation setup using various datasets: CodeParrot (Tunstall et al., 2022) for coding, Math-Hard (Hendrycks et al., 2021) for mathematics, TriviaQA (Joshi et al., 2017), and SlimPajama (Soboleva et al., 2023). |
| Dataset Splits | Yes | Like Beck et al. (2024), we trained each model with sequence lengths ranging from 3 to 40 and evaluated on lengths from 40 to 256, to understand the length generalization capabilities. ... For each setup, we randomly sample 1.6M examples for training and 40K examples of length 500 to construct the train and test dataset. ... The train and validation datasets are kept the same across runs. |
| Hardware Specification | Yes | The 1.3B parameter DeltaNet models are trained on 32 Nvidia A100s using a per-device batch size of 6 and 5 gradient accumulation steps for 50,000 steps. The 340M parameter DeltaNet models and the 370M parameter Mamba models are trained using a training batch size of 16 and 200,000 steps on 16 Nvidia A100s. |
| Software Dependencies | No | We use the training pipeline which is part of the flash-linear-attention library (flame) (Yang & Zhang, 2024) and which in turn is based on Hugging Face accelerate (Gugger et al., 2022). ... For optimization, we use AdamW (Loshchilov & Hutter, 2019)... The paper mentions several software components like 'flash-linear-attention library (flame)', 'Hugging Face accelerate', 'AdamW', 'PyTorch' (in Appendix E.4), but specific version numbers for these components are not provided. |
| Experiment Setup | Yes | For parity, all models contain 2 blocks (layers), with 4 heads for the xLSTM and DeltaNet models. We set the embedding and head dimensions to 128. For Mamba and DeltaNet, we also enable the 1-D depthwise-separable convolution layer with kernel size equal to 4... For modular arithmetic, we increase the number of layers to 3 and use a gradient clipping norm of 1.0 for Transformer, Mamba, and DeltaNet... We train each model using AdamW (Loshchilov & Hutter, 2019) without gradient clipping, using 4 different learning rates (1e-2, 1e-3, 5e-4, 1e-4), with 3 different seeds each. ... We use a batch size of 1024 (except for mLSTM, where we use 512...) and a cosine annealing learning rate schedule (Loshchilov & Hutter, 2017) (minimum learning rate: 1e-6) after 10% warm-up steps. The weight decay is set to 0.1 during training. We train on every task for 100k steps in total. |
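The paper's central claim, that moving the eigenvalue range of the state-transition matrices from [0, 1] to [−1, 1] unlocks parity, can be illustrated with a minimal scalar sketch. This toy recurrence is an assumption for illustration only (it is not the actual Mamba or DeltaNet parameterization): with an input-dependent eigenvalue λ_t = 1 − 2x_t ∈ [−1, 1], the hidden state flips sign on every 1-bit, so the sign of the final state encodes the parity of the input; restricting λ_t to [0, 1] can only shrink the state toward zero and destroys that information.

```python
def parity_lrnn(bits, eig_range="negative"):
    """Scalar linear RNN h_t = lambda_t * h_{t-1} with an input-dependent
    eigenvalue (toy illustration, not the paper's parameterization).

    eig_range="negative":    lambda_t = 1 - 2*x_t in [-1, 1]; the state
                             flips sign on each 1-bit, so sign(h_T)
                             tracks the parity of the ones seen so far.
    eig_range="nonnegative": lambda_t = 1 - x_t in [0, 1]; the state can
                             only shrink (here: collapse to 0 on the
                             first 1-bit), so parity is lost.
    """
    h = 1.0
    for x in bits:
        lam = 1.0 - 2.0 * x if eig_range == "negative" else 1.0 - x
        h = lam * h
    return h


bits = [1, 0, 1, 1, 0]          # three ones -> odd parity
h = parity_lrnn(bits, "negative")
parity = 0 if h > 0 else 1      # read parity off the sign of the state
```

With the [−1, 1] range the final state is (−1)^(number of ones), so `parity` above is 1; with the [0, 1] range the state is 0 for any input containing a 1-bit, which is why a diagonal LRNN with nonnegative eigenvalues cannot solve parity regardless of sequence length.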