Longhorn: State Space Models are Amortized Online Learners

Authors: Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference. The code is provided at https://github.com/Cranial-XIX/Longhorn. We validate Longhorn's performance through the following experiments: 1) We compare Longhorn against other SSMs on the multi-query associative recall benchmark... 2) Using the OpenWebText dataset..., we assess Longhorn's performance on language modeling... 3) We train a 1.3B language model on the SlimPajama dataset... 4) We additionally apply Longhorn to the vision domain and compare it against Vision Mamba (ViM).
Researcher Affiliation | Collaboration | The University of Texas at Austin; Meta; Helixon; Sony AI.
Pseudocode | Yes | Algorithm 1: Longhorn's Single-layer SSM Recurrence (Inference Time)
Open Source Code | Yes | The code is provided at https://github.com/Cranial-XIX/Longhorn.
Open Datasets | Yes | Using the OpenWebText dataset (Gokaslan & Cohen, 2019), we assess Longhorn's performance on language modeling... We train a 1.3B language model on the SlimPajama dataset (Soboleva et al., 2023)... we conduct experiments on the ImageNet (Deng et al., 2009) classification task.
Dataset Splits | No | The paper mentions using a "disjoint validation set" from the SlimPajama dataset and varying context lengths for evaluation (T ∈ {2048, 4096, 8192, 16384, 32768}), but it does not specify explicit train/validation/test split percentages, sample counts for each split, or reference a standard predefined split with a citation for all experiments. While it details training tokens (100B tokens for SlimPajama), it lacks the specific partitioning information required for full reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It mentions a "CUDA kernel" in Algorithm 1, which implies GPU usage, but no specific hardware model is identified.
Software Dependencies | No | The paper mentions the AdamW optimizer and PyTorch in a footnote (referencing code adapted from the nanoGPT repository). However, it does not provide specific version numbers for these or any other software libraries or frameworks, which is required for a reproducible description of software dependencies.
Experiment Setup | Yes | For the large-scale language modeling task... training a 1.3B parameter model on the SlimPajama (Soboleva et al., 2023) dataset with 100B tokens and a batch size of 2M. We used the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01, cosine learning rate decay (peak: 3e-4, final: 3e-5), and gradient clipping of 1.0. Each experiment individually searches for the best learning rate from {1e-4, 4.6e-4, 2.2e-3, 1e-2}. Table 6 in Appendix D provides training details on OpenWebText, including parameters, number of layers, model dimension, number of heads, training steps, learning rate, batch size, and tokens for the 125M and 350M models.
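The schedule reported in this row (cosine decay from a peak of 3e-4 to a final value of 3e-5, with a per-experiment learning-rate sweep) can be sketched as below. This is a minimal illustration, not the paper's training code: `cosine_lr`, `total_steps`, and the grid constant names are assumptions introduced here, and the actual step count would follow from the 100B-token budget and 2M batch size.

```python
import math

# Peak and final learning rates as reported in the experiment setup.
PEAK_LR, FINAL_LR = 3e-4, 3e-5

# Per-experiment sweep grid for the peak learning rate, as reported.
LR_GRID = [1e-4, 4.6e-4, 2.2e-3, 1e-2]

def cosine_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step` under cosine decay from PEAK_LR to FINAL_LR.

    Progress is clamped to 1.0 so the rate stays at FINAL_LR past the
    end of training. `total_steps` is illustrative here.
    """
    progress = min(step / total_steps, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would pair with `torch.optim.AdamW(..., weight_decay=0.01)` and `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` to match the weight decay and gradient clipping values quoted above.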