Efficient Length-Generalizable Attention via Causal Retrieval for Long-Context Language Modeling
Authors: Xiang Hu, Zhihao Teng, Jun Zhao, Wei Wu, Kewei Tu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that GCA-based models achieve near-perfect accuracy in passkey retrieval for 16M context lengths, which is 1000× the training length. To fairly compare GCA with other attention mechanisms, we pre-train all models from scratch and evaluate them on tasks such as long-range language modeling, summarization, and the needle-in-a-haystack (NIAH) tests. The results demonstrate that DRT significantly outperforms all baselines with comparable pre-training costs and much lower inference costs. |
| Researcher Affiliation | Collaboration | ¹Ant Group, ²ShanghaiTech University, ³Fudan University. Correspondence to: Wei Wu <EMAIL>, Kewei Tu <EMAIL>. |
| Pseudocode | Yes | It is worth noting that GCA is easy to integrate with Flash Attention-2, as detailed in the pseudo-code in Appendix B. Algorithm 1: FLASHGCA Forward Pass; Algorithm 2: FLASHGCA Backward Pass |
| Open Source Code | Yes | The code is released at https://github.com/ant-research/long-context-modeling |
| Open Datasets | Yes | PG19. PG19 (Rae et al., 2020) is a language modeling benchmark widely used to evaluate long-range text understanding capabilities of models. ArXiv-math. ArXiv-math is a corpus consisting of mathematical papers from arXiv... We use the preprocessed corpus from Azerbayev et al. (2023). |
| Dataset Splits | No | The paper mentions evaluating models on 'valid test' splits for perplexity (Table 1) and fine-tuning on 'synthetic data' (Appendix D), but does not specify exact percentages, sample counts, or detailed methodologies for creating these splits for reproduction beyond general context lengths. |
| Hardware Specification | Yes | We used mixed-precision training with bfloat16 on 8 Nvidia A100 GPUs. |
| Software Dependencies | Yes | Our base LM is based on the implementation of TinyLlama (Zhang et al., 2024) combined with Flash Attention-2 (Dao, 2024), enabling ALiBi (Press et al., 2022) and sliding window attention (Child et al., 2019). We implement hardware-aware GCA based on Triton (Tillet et al., 2019). Training utilizes the AdamW optimizer (Loshchilov & Hutter, 2019). We used GPT-2's (Radford et al., 2019) tokenizer. |
| Experiment Setup | Yes | Training utilizes the AdamW optimizer (Loshchilov & Hutter, 2019) with β1 = 0.9 and β2 = 0.95, and a weight decay factor of 0.001. We used a base learning rate of 2×10⁻³ for all our experiments with a warmup stage that was 2% of the whole training and applied a cosine scheduler with final learning rate being 4×10⁻⁴. We train all models with an effective batch size of 2¹⁹ tokens for 60K steps, resulting in a total training budget of 32.2 billion tokens. |
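The learning-rate schedule in the experiment setup (base rate 2×10⁻³, 2% linear warmup, cosine decay to 4×10⁻⁴ over 60K steps) can be sketched as a small helper. This is a minimal reconstruction from the stated hyperparameters; the exact warmup shape (linear from zero) is an assumption, as the paper's quoted text does not specify it.

```python
import math

def learning_rate(step, total_steps=60_000, base_lr=2e-3,
                  final_lr=4e-4, warmup_frac=0.02):
    """Cosine schedule with linear warmup, per the reported hyperparameters.

    Warmup covers 2% of training (1,200 of 60,000 steps); the shape of the
    warmup ramp is assumed linear, which is the common convention.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # linear warmup from (near) zero up to the base learning rate
        return base_lr * (step + 1) / warmup_steps
    # cosine decay from base_lr down to final_lr over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

At the end of warmup the rate equals the base 2×10⁻³, and at the final step it reaches the reported 4×10⁻⁴.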