Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Authors: Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval at 16M context length, which is 1000× the training length. To fairly compare GCA with other attention mechanisms, we pre-train all models from scratch and evaluate them on tasks such as long-range language modeling, summarization, and the needle-in-a-haystack (NIAH) tests. The results demonstrate that DRT significantly outperforms all baselines with comparable pre-training costs and much lower inference costs.
Researcher Affiliation | Collaboration | 1Ant Group 2ShanghaiTech University 3Fudan University. Correspondence to: Wei Wu <EMAIL>, Kewei Tu <EMAIL>.
Pseudocode | Yes | It is worth noting that GCA is easy to integrate with FlashAttention-2, as detailed in the pseudocode in Appendix B. Algorithm 1: FLASHGCA Forward Pass; Algorithm 2: FLASHGCA Backward Pass.
Open Source Code | Yes | The code is released at https://github.com/ant-research/long-context-modeling
Open Datasets | Yes | PG19. PG19 (Rae et al., 2020) is a language modeling benchmark widely used to evaluate long-range text understanding capabilities of models. ArXiv-math. ArXiv-math is a corpus consisting of mathematical papers from arXiv... We use the preprocessed corpus from (Azerbayev et al., 2023).
Dataset Splits | No | The paper mentions evaluating models on valid/test splits for perplexity (Table 1) and fine-tuning on synthetic data (Appendix D), but does not specify exact percentages, sample counts, or detailed methodologies for creating these splits for reproduction beyond general context lengths.
Hardware Specification | Yes | We used mixed-precision training with bfloat16 on 8 Nvidia A100 GPUs.
Software Dependencies | Yes | Our base LM is based on the implementation of TinyLlama (Zhang et al., 2024) combined with FlashAttention-2 (Dao, 2024), enabling ALiBi (Press et al., 2022) and sliding window attention (Child et al., 2019). We implement hardware-aware GCA based on Triton (Tillet et al., 2019). Training utilizes the AdamW optimizer (Loshchilov & Hutter, 2019). We used the GPT-2 (Radford et al., 2019) tokenizer.
Experiment Setup | Yes | Training utilizes the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95, and a weight decay factor of 0.001. We used a base learning rate of 2×10⁻³ for all our experiments with a warmup stage covering 2% of the whole training run, and applied a cosine scheduler with a final learning rate of 4×10⁻⁴. We train all models with an effective batch size of 2¹⁹ tokens for 60K steps, resulting in a total training budget of 32.2 billion tokens.
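The learning-rate schedule described in this row (linear warmup over the first 2% of steps to 2×10⁻³, then cosine decay to 4×10⁻⁴ over 60K total steps) can be sketched as a small helper. This is a minimal illustration of the stated hyperparameters only; the function name and the linear-warmup shape are assumptions, not taken from the paper's released code:

```python
import math

def lr_at_step(step, total_steps=60_000, base_lr=2e-3,
               final_lr=4e-4, warmup_frac=0.02):
    """Hypothetical sketch of the reported schedule: linear warmup for
    the first `warmup_frac` of training to `base_lr`, then cosine decay
    down to `final_lr` at `total_steps`."""
    warmup_steps = int(total_steps * warmup_frac)  # 1,200 steps here
    if step < warmup_steps:
        # linear warmup from 0 up to the base learning rate
        return base_lr * step / warmup_steps
    # cosine decay from base_lr to final_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the schedule reaches the base rate of 2×10⁻³, and at step 60,000 it bottoms out at the final rate of 4×10⁻⁴.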