Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling

Authors: Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval at 16M context length, which is 1000× the training length. To fairly compare GCA with other attention mechanisms, we pre-train all models from scratch and evaluate them on tasks such as long-range language modeling, summarization, and the needle-in-a-haystack (NIAH) tests. The results demonstrate that DRT significantly outperforms all baselines with comparable pre-training costs and much lower inference costs.
Researcher Affiliation | Collaboration | 1Ant Group 2ShanghaiTech University 3Fudan University. Correspondence to: Wei Wu <EMAIL>, Kewei Tu <EMAIL>.
Pseudocode | Yes | It is worth noting that GCA is easy to integrate with FlashAttention-2, as detailed in the pseudocode in Appendix B. Algorithm 1: FLASHGCA Forward Pass; Algorithm 2: FLASHGCA Backward Pass.
Open Source Code | Yes | The code is released at https://github.com/ant-research/long-context-modeling
Open Datasets | Yes | PG19. PG19 (Rae et al., 2020) is a language modeling benchmark widely used to evaluate long-range text understanding capabilities of models. ArXiv-math. ArXiv-math is a corpus consisting of mathematical papers from arXiv... We use the preprocessed corpus from (Azerbayev et al., 2023).
Dataset Splits | No | The paper mentions evaluating models on valid/test splits for perplexity (Table 1) and fine-tuning on synthetic data (Appendix D), but does not specify exact percentages, sample counts, or detailed methodologies for creating these splits for reproduction beyond general context lengths.
Hardware Specification | Yes | We used mixed-precision training with bfloat16 on 8 Nvidia A100 GPUs.
Software Dependencies | Yes | Our base LM is based on the implementation of TinyLlama (Zhang et al., 2024) combined with FlashAttention-2 (Dao, 2024), enabling ALiBi (Press et al., 2022) and sliding window attention (Child et al., 2019). We implement hardware-aware GCA based on Triton (Tillet et al., 2019). Training utilizes the AdamW optimizer (Loshchilov & Hutter, 2019). We used the GPT-2 (Radford et al., 2019) tokenizer.
Experiment Setup | Yes | Training utilizes the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9, β2 = 0.95, and a weight decay factor of 0.001. We used a base learning rate of 2×10⁻³ for all our experiments with a warmup stage covering 2% of the whole training run, and applied a cosine scheduler with a final learning rate of 4×10⁻⁴. We train all models with an effective batch size of 2¹⁹ tokens for 60K steps, resulting in a total training budget of 32.2 billion tokens.
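The learning-rate schedule described in this row (linear warmup over the first 2% of steps to 2×10⁻³, then cosine decay to 4×10⁻⁴ over 60K total steps) can be sketched as a small helper. This is a minimal illustration of the stated hyperparameters only; the function name and the linear-warmup shape are assumptions, not taken from the paper's released code:

```python
import math

def lr_at_step(step, total_steps=60_000, base_lr=2e-3,
               final_lr=4e-4, warmup_frac=0.02):
    """Hypothetical sketch of the reported schedule: linear warmup for
    the first `warmup_frac` of training to `base_lr`, then cosine decay
    down to `final_lr` at `total_steps`."""
    warmup_steps = int(total_steps * warmup_frac)  # 1,200 steps here
    if step < warmup_steps:
        # linear warmup from 0 up to the base learning rate
        return base_lr * step / warmup_steps
    # cosine decay from base_lr to final_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the schedule reaches the base rate of 2×10⁻³, and at step 60,000 it bottoms out at the final rate of 4×10⁻⁴.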