The Role of Sparsity for Length Generalization in LLMs
Authors: Noah Golowich, Samy Jelassi, David Brandfonbrener, Sham M. Kakade, Eran Malach
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is a sparse dependency structure of each token on the previous ones. Inspired by our theory, we introduce Predictive Position Coupling, which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which position coupling can successfully be applied to achieve length generalization. |
| Researcher Affiliation | Academia | 1MIT EECS 2Harvard University 3Kempner Institute at Harvard University. Correspondence to: Noah Golowich <EMAIL>, Samy Jelassi <EMAIL>, David Brandfonbrener <EMAIL>, Sham M. Kakade <EMAIL>, Eran Malach <EMAIL>. |
| Pseudocode | No | The paper contains theoretical models and definitions (e.g., Definition 3.3 for Sparse functional attention class) and describes methods conceptually, but it does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper mentions using the OLMo codebase as a third-party tool but does not provide any statement or link for the release of their own implementation code. For example: "We trained our models using the OLMo codebase (Groeneveld et al., 2024) on the C4 dataset (Raffel et al., 2019)." |
| Open Datasets | Yes | We trained our models using the OLMo codebase (Groeneveld et al., 2024) on the C4 dataset (Raffel et al., 2019). |
| Dataset Splits | Yes | For each value of Ktrain ∈ {4, 6, 8, 10, 12}, we train a transformer ĥ_Ktrain to predict the last token of samples drawn from D^sp_{ℓ,k}, where ℓ is sampled uniformly subject to the length of the sample satisfying 2ℓ ∈ [20, 50] and k ∼ Unif([Ktrain]). We then evaluate the performance of each of the trained transformers ĥ_Ktrain on samples drawn from D_{ℓ,ktest} of length 2ℓ ∈ [20, 500] and with sparsities ktest ∈ {4, 6, 8, 10, 12, 14, 16}. |
| Hardware Specification | No | The paper describes the model architecture and training parameters but does not specify any hardware details like GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as "GPT-NeoX (decoder-only) transformer", "rotary positional embeddings (RoPE)", "PoSE", "AdamW optimizer", and "OLMo codebase" but does not provide specific version numbers for any of these to ensure reproducibility. |
| Experiment Setup | Yes | Our model is based off of the GPT-NeoX (decoder-only) transformer (Andonian et al., 2023), and uses rotary positional embeddings (RoPE). To ensure nontrivial length generalization performance, we combined RoPE with PoSE (Appendix B.3). Full training and evaluation details may be found in Appendix E. We used the AdamW optimizer with learning rate 5 × 10⁻⁵ and weight decay parameter equal to 0.1; moreover, the experiments use a linear learning rate scheduler with 300 warmup steps. Tables 1, 2, 3, and 4 detail specific hyperparameters such as training/testing lengths, batch sizes, number of steps, and model dimensions. |
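The train/eval sampling scheme quoted in the Dataset Splits row can be sketched as follows. This is a hedged illustration only: the distribution D^sp_{ℓ,k} itself is defined in the paper and not reproduced here, and the assumption that k is drawn from {1, …, Ktrain} (reading Unif([Ktrain]) as uniform over that integer range) is ours.

```python
import random

def sample_train_params(k_train):
    """Draw (l, k) for training: 2l uniform over even lengths in [20, 50],
    k ~ Unif({1, ..., k_train}). Interpretation of Unif([Ktrain]) is assumed."""
    two_l = random.choice(range(20, 51, 2))  # sample length 2l in [20, 50]
    k = random.randint(1, k_train)           # sparsity k in {1, ..., k_train}
    return two_l // 2, k

def sample_eval_params(k_test_choices=(4, 6, 8, 10, 12, 14, 16)):
    """Draw (l, k_test) for evaluation: 2l uniform in [20, 500],
    k_test from the sparsity grid quoted in the review."""
    two_l = random.choice(range(20, 501, 2))
    k_test = random.choice(k_test_choices)
    return two_l // 2, k_test
```

A driver would draw parameters this way, then materialize a sample from the paper's D^sp_{ℓ,k} construction for each (ℓ, k) pair.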
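The learning-rate schedule quoted in the Experiment Setup row (linear scheduler, 300 warmup steps, base rate 5 × 10⁻⁵) can be sketched as a plain function of the step index. Two caveats: the paper does not state the decay target or the total step count here (those are per-task, in its Tables 1–4), so the linear-decay-to-zero shape and the `total_steps` value below are assumptions; weight decay (0.1) is an optimizer parameter and does not appear in the schedule.

```python
def linear_warmup_lr(step, base_lr=5e-5, warmup_steps=300, total_steps=10_000):
    """Linear warmup to base_lr over warmup_steps, then linear decay to 0.
    total_steps=10_000 is a placeholder; per-task step counts are in the
    paper's hyperparameter tables. Decay-to-zero is an assumed convention."""
    if step < warmup_steps:
        # ramp up: step 0 gets a small nonzero rate, step warmup_steps-1 gets base_lr
        return base_lr * (step + 1) / warmup_steps
    # ramp down linearly from base_lr to 0 over the remaining steps
    frac = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, frac)
```

With a framework optimizer, this function would be supplied as the multiplier-style schedule alongside AdamW configured with lr=5e-5 and weight_decay=0.1.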