Longhorn: State Space Models are Amortized Online Learners
Authors: Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference. The code is provided at https://github.com/Cranial-XIX/Longhorn. We validate Longhorn's performance through the following experiments: 1) We compare Longhorn against other SSMs on the multi-query associative recall benchmark... 2) Using the OpenWebText dataset..., we assess Longhorn's performance on language modeling... 3) We train a 1.3B language model on the SlimPajama dataset... 4) We additionally apply Longhorn to the vision domain and compare it against Vision Mamba (Vim). |
| Researcher Affiliation | Collaboration | The University of Texas at Austin, Meta, Helixon, Sony AI |
| Pseudocode | Yes | Algorithm 1: Longhorn's Single-layer SSM Recurrence (Inference Time) |
| Open Source Code | Yes | The code is provided at https://github.com/Cranial-XIX/Longhorn. |
| Open Datasets | Yes | Using the OpenWebText dataset (Gokaslan & Cohen, 2019), we assess Longhorn's performance on language modeling... We train a 1.3B language model on the SlimPajama dataset (Soboleva et al., 2023)... we conduct experiments on the ImageNet (Deng et al., 2009) classification task. |
| Dataset Splits | No | The paper mentions using a "disjoint validation set from SlimPajama dataset" and varying context lengths for evaluation (T ∈ {2048, 4096, 8192, 16384, 32768}), but it does not specify explicit train/validation/test split percentages, sample counts for each split, or reference a standard predefined split with a citation for all experiments. While it details training tokens (100B tokens for SlimPajama), it lacks the specific partitioning information required for full reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It mentions a "CUDA kernel" in Algorithm 1, which implies GPU usage, but no specific hardware model is identified. |
| Software Dependencies | No | The paper mentions the AdamW optimizer and PyTorch in a footnote (referencing adapted code from the nanoGPT repository). However, it does not provide specific version numbers for these or any other software libraries or frameworks, which is required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | For the large-scale language modeling task... training a 1.3B parameter model on the SlimPajama (Soboleva et al., 2023) dataset with 100B tokens and a batch size of 2M. We used the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01, cosine learning rate decay (peak: 3e-4, final: 3e-5), and gradient clipping of 1.0. Each experiment individually searches for the best learning rate from {1e-4, 4.6e-4, 2.2e-3, 1e-2}. Table 6 in Appendix D provides training details on OpenWebText, including parameters, number of layers, model dimension, number of heads, training steps, learning rate, batch size, and tokens for the 125M and 350M models. |
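The Pseudocode row above refers to Longhorn's single-layer SSM recurrence (Algorithm 1). The paper frames the state update as the closed-form solution of an online regression objective; the NumPy sketch below illustrates that closed form. The variable names (`S`, `k`, `v`, `beta`) and the dense (non-diagonal) matrix form are our assumptions for illustration, not the paper's exact notation or its CUDA kernel.

```python
import numpy as np

def longhorn_recurrence_step(S, k, v, beta):
    """One inference-time step of a Longhorn-style SSM state update.

    Sketch of the closed-form solution to an online regression objective of
    the form  min_S ||S - S_prev||^2 + beta * ||S k - v||^2,  which is how
    the paper motivates its recurrence. S has shape (d_v, d_k); k and v are
    the key and value vectors at the current step.
    """
    # Implicit step size from the closed-form (Sherman-Morrison) update.
    eps = beta / (1.0 + beta * (k @ k))
    # Decay the previous state along the key direction, then write in the
    # new key-value association.
    return S - eps * np.outer(S @ k, k) + eps * np.outer(v, k)

# Usage: with a large beta, the updated state stores the association exactly,
# i.e. S_new @ k is (approximately) v.
S = np.zeros((3, 2))
k = np.array([1.0, 2.0])
v = np.array([1.0, 2.0, 3.0])
S_new = longhorn_recurrence_step(S, k, v, beta=1e8)
```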
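The Experiment Setup row reports cosine learning-rate decay from a peak of 3e-4 to a final value of 3e-5. A minimal sketch of such a schedule follows; `total_steps` is a hypothetical parameter (the paper does not report a step count here), and any warmup phase the training runs may have used is omitted.

```python
import math

def cosine_lr(step, total_steps, peak=3e-4, final=3e-5):
    """Cosine learning-rate decay from `peak` at step 0 to `final` at
    `total_steps`, matching the peak/final values reported in the paper.
    Warmup is intentionally omitted from this sketch."""
    progress = min(step / total_steps, 1.0)
    return final + 0.5 * (peak - final) * (1.0 + math.cos(math.pi * progress))

# Usage: the schedule starts at the peak and decays smoothly to the final value.
lr_start = cosine_lr(0, 1000)     # 3e-4
lr_end = cosine_lr(1000, 1000)    # 3e-5
```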