When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by over 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks. |
| Researcher Affiliation | Collaboration | Haonan Wang EMAIL National University of Singapore; Qian Liu EMAIL Sea AI Lab, Singapore |
| Pseudocode | No | The paper describes methods through textual descriptions and illustrative figures (e.g., Figure 2 illustrating attention paradigms), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | AnchorContext: The implementation of AnchorAttention supports several popular models, using FlashAttention2 and FlexAttention, and is available at https://github.com/haonan3/AnchorContext. |
| Open Datasets | Yes | We use the SlimPajama dataset (Soboleva et al., 2023) for long-context training, an open-source replication of the LLaMA pretraining data mixture (Touvron et al., 2023). |
| Dataset Splits | No | The paper describes how it samples tokens for training and uses established benchmarks (RULER, LongBench, MMLU, HellaSwag), but does not specify explicit train/validation/test splits with percentages or sample counts for its own generated datasets. |
| Hardware Specification | Yes | All models are trained on 8 NVIDIA A100 GPUs. |
| Software Dependencies | Yes | The flexibility of our AnchorContext approach allows for effortless adoption, enabling researchers to incorporate it without substantial modifications to their codebase... it provides two computational engine options: FlexAttention (which will be natively supported in PyTorch 2.5.0) and FlashAttention. |
| Experiment Setup | Yes | Our training hyperparameters are primarily based on (Zhang, 2023). All models are trained on 8 NVIDIA A100 GPUs. We set the learning rate to 2e-5 and use the Adam W optimizer with weight decay of 0.1, β1 = 0.9, and β2 = 0.95. Each model is trained for 2000 steps, which corresponds to approximately 1 epoch over the 2 billion token dataset. The batch size is set to 8, equating to 0.5 million tokens per batch for 64K context and 1 million tokens for 128K context lengths. |
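The experiment-setup row above can be sketched as a plain configuration dict. This is a minimal, hypothetical rendering of the hyperparameters quoted from the paper, not the released training code; the key names are illustrative, and the arithmetic simply sanity-checks the stated token budgets (batch size 8 at 64K context ≈ 0.5M tokens per batch, at 128K ≈ 1M).

```python
# Hedged sketch of the reported training setup (names are illustrative,
# not taken from the AnchorContext repository).
config = {
    "optimizer": "AdamW",
    "learning_rate": 2e-5,
    "weight_decay": 0.1,
    "betas": (0.9, 0.95),   # (beta1, beta2)
    "steps": 2000,          # ~1 epoch over the ~2B-token dataset
    "batch_size": 8,        # sequences per batch
}

# Sanity-check the stated per-batch token counts: batch_size * context_length.
tokens_per_batch_64k = config["batch_size"] * 64 * 1024    # 524288  (~0.5M)
tokens_per_batch_128k = config["batch_size"] * 128 * 1024  # 1048576 (~1M)
print(tokens_per_batch_64k, tokens_per_batch_128k)
```

The arithmetic matches the table: 8 sequences of 64K tokens give roughly half a million tokens per batch, and doubling the context length to 128K doubles the per-batch token count.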