Longhorn: State Space Models are Amortized Online Learners

Authors: Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, Qiang Liu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that Longhorn outperforms state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks, language modeling, and vision tasks. Specifically, Longhorn achieves a 1.8x improvement in sample efficiency compared to Mamba, and can extrapolate over contexts that are up to 16x longer during inference. The code is provided at https://github.com/Cranial-XIX/Longhorn. We validate Longhorn's performance through the following experiments: 1) We compare Longhorn against other SSMs on the multi-query associative recall benchmark... 2) Using the OpenWebText dataset..., we assess Longhorn's performance on language modeling... 3) We train a 1.3B language model on the SlimPajama dataset... 4) We additionally apply Longhorn to the vision domain and compare it against Vision Mamba (ViM).
Researcher Affiliation | Collaboration | The University of Texas at Austin; Meta; Helixon; Sony AI.
Pseudocode | Yes | Algorithm 1: Longhorn's Single-layer SSM Recurrence (Inference Time)
Open Source Code | Yes | The code is provided at https://github.com/Cranial-XIX/Longhorn.
Open Datasets | Yes | Using the OpenWebText dataset (Gokaslan & Cohen, 2019), we assess Longhorn's performance on language modeling... We train a 1.3B language model on the SlimPajama dataset (Soboleva et al., 2023)... we conduct experiments on the ImageNet (Deng et al., 2009) classification task.
Dataset Splits | No | The paper mentions using a "disjoint validation set" from the SlimPajama dataset and varying context lengths for evaluation (T ∈ {2048, 4096, 8192, 16384, 32768}), but it does not specify explicit train/validation/test split percentages, sample counts for each split, or reference a standard predefined split with a citation for all experiments. While it details training tokens (100B tokens for SlimPajama), it lacks the specific partitioning information required for full reproducibility.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. It mentions a "CUDA kernel" in Algorithm 1, which implies GPU usage, but no specific hardware model is identified.
Software Dependencies | No | The paper mentions the AdamW optimizer and PyTorch in a footnote (referencing code adapted from the nanoGPT repository). However, it does not provide specific version numbers for these or any other software libraries or frameworks, which is required for a reproducible description of software dependencies.
Experiment Setup | Yes | For the large-scale language modeling task... training a 1.3B parameter model on the SlimPajama (Soboleva et al., 2023) dataset with 100B tokens and a batch size of 2M. We used the AdamW optimizer (Loshchilov & Hutter, 2017) with a weight decay of 0.01, cosine learning rate decay (peak: 3e-4, final: 3e-5), and gradient clipping of 1.0. Each experiment individually searches for the best learning rate from {1e-4, 4.6e-4, 2.2e-3, 1e-2}. Table 6 in Appendix D provides training details on OpenWebText, including parameters, number of layers, model dimension, number of heads, training steps, learning rate, batch size, and tokens for the 125M and 350M models.
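The schedule reported in this row (cosine decay from a peak of 3e-4 to a final value of 3e-5, with a per-experiment learning-rate sweep) can be sketched as below. This is a minimal illustration, not the paper's training code: `cosine_lr`, `total_steps`, and the grid constant names are assumptions introduced here, and the actual step count would follow from the 100B-token budget and 2M batch size.

```python
import math

# Peak and final learning rates as reported in the experiment setup.
PEAK_LR, FINAL_LR = 3e-4, 3e-5

# Per-experiment sweep grid for the peak learning rate, as reported.
LR_GRID = [1e-4, 4.6e-4, 2.2e-3, 1e-2]

def cosine_lr(step: int, total_steps: int) -> float:
    """Learning rate at `step` under cosine decay from PEAK_LR to FINAL_LR.

    Progress is clamped to 1.0 so the rate stays at FINAL_LR past the
    end of training. `total_steps` is illustrative here.
    """
    progress = min(step / total_steps, 1.0)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would pair with `torch.optim.AdamW(..., weight_decay=0.01)` and `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` to match the weight decay and gradient clipping values quoted above.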