When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by more than 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks.
Researcher Affiliation | Collaboration | Haonan Wang (EMAIL), National University of Singapore; Qian Liu (EMAIL), Sea AI Lab, Singapore
Pseudocode | No | The paper describes methods through textual descriptions and illustrative figures (e.g., Figure 2 illustrating attention paradigms), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | AnchorContext: the implementation of AnchorAttention supports several popular models using FlashAttention-2 and FlexAttention, and is available at https://github.com/haonan3/AnchorContext.
Open Datasets | Yes | We use the SlimPajama dataset (Soboleva et al., 2023) for long-context training, an open-source replication of the LLaMA pretraining data mixture (Touvron et al., 2023).
Dataset Splits | No | The paper describes how it samples tokens for training and uses established benchmarks (RULER, LongBench, MMLU, HellaSwag), but does not specify explicit train/validation/test splits with percentages or sample counts for its own generated datasets.
Hardware Specification | Yes | All models are trained on 8 NVIDIA A100 GPUs.
Software Dependencies | Yes | The flexibility of our AnchorContext approach allows for effortless adoption, enabling researchers to incorporate it without substantial modifications to their codebase... it provides two computational engine options: FlexAttention (which will be natively supported in PyTorch 2.5.0) and FlashAttention.
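The two-engine design described above amounts to a backend choice gated on the PyTorch version. The following is a minimal, purely illustrative sketch of such a selector; the helper function and return strings are hypothetical, not taken from the AnchorContext codebase:

```python
# Hypothetical sketch of the two-engine choice described in the review.
# "flex" requires PyTorch >= 2.5.0 (where FlexAttention is natively
# supported); "flash" refers to the FlashAttention kernel.
# Function name and return values are illustrative only.

def choose_attention_engine(torch_version: str, prefer: str = "flex") -> str:
    major, minor = (int(x) for x in torch_version.split(".")[:2])
    flex_available = (major, minor) >= (2, 5)
    if prefer == "flex" and flex_available:
        return "flex_attention"
    return "flash_attention"

print(choose_attention_engine("2.5.0"))  # flex_attention
print(choose_attention_engine("2.3.1"))  # falls back to flash_attention
```

A real implementation would dispatch to the corresponding kernel rather than return a string, but the version gate captures why the paper notes PyTorch 2.5.0 specifically.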
Experiment Setup | Yes | Our training hyperparameters are primarily based on (Zhang, 2023). All models are trained on 8 NVIDIA A100 GPUs. We set the learning rate to 2e-5 and use the AdamW optimizer with weight decay of 0.1, β1 = 0.9, and β2 = 0.95. Each model is trained for 2000 steps, which corresponds to approximately 1 epoch over the 2 billion token dataset. The batch size is set to 8, equating to 0.5 million tokens per batch for 64K context and 1 million tokens for 128K context lengths.
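The per-batch token counts quoted above follow directly from batch size × context length. A small arithmetic check (all numbers taken from the quoted setup; nothing here is from the actual training code):

```python
# Verify the batch-token arithmetic stated in the experiment setup:
# batch size 8, contexts of 64K and 128K tokens, 2000 training steps.
batch_size = 8
steps = 2000

for context_len in (64 * 1024, 128 * 1024):
    tokens_per_batch = batch_size * context_len
    total_tokens = tokens_per_batch * steps
    print(f"{context_len // 1024}K context: "
          f"{tokens_per_batch / 1e6:.1f}M tokens/batch, "
          f"{total_tokens / 1e9:.1f}B tokens over {steps} steps")
```

Note that at 128K context, 2000 steps covers roughly 2.1B tokens, which matches the paper's "approximately 1 epoch over the 2 billion token dataset"; at 64K context the same step count covers about half that.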