When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training

Authors: Haonan Wang, Qian Liu, Chao Du, Tongyao Zhu, Cunxiao Du, Kenji Kawaguchi, Tianyu Pang

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on three types of LLMs demonstrate that AnchorAttention significantly improves long-context performance and reduces training time by more than 50% compared to standard full attention mechanisms, while preserving the original LLM's capabilities on general tasks.
Researcher Affiliation | Collaboration | Haonan Wang (EMAIL), National University of Singapore; Qian Liu (EMAIL), Sea AI Lab, Singapore
Pseudocode | No | The paper describes methods through textual descriptions and illustrative figures (e.g., Figure 2 illustrating attention paradigms), but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | AnchorContext: the implementation of AnchorAttention supports several popular models using FlashAttention-2 and FlexAttention, and is available at https://github.com/haonan3/AnchorContext.
Open Datasets | Yes | We use the SlimPajama dataset (Soboleva et al., 2023) for long-context training, an open-source replication of the LLaMA pretraining data mixture (Touvron et al., 2023).
Dataset Splits | No | The paper describes how it samples tokens for training and uses established benchmarks (RULER, LongBench, MMLU, HellaSwag), but does not specify explicit train/validation/test splits with percentages or sample counts for its own generated datasets.
Hardware Specification | Yes | All models are trained on 8 NVIDIA A100 GPUs.
Software Dependencies | Yes | The flexibility of our AnchorContext approach allows for effortless adoption, enabling researchers to incorporate it without substantial modifications to their codebase... it provides two computational engine options: FlexAttention (which will be natively supported in PyTorch 2.5.0) and FlashAttention.
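The two-engine design described above amounts to a backend choice gated on the PyTorch version. The following is a minimal, purely illustrative sketch of such a selector; the helper function and return strings are hypothetical, not taken from the AnchorContext codebase:

```python
# Hypothetical sketch of the two-engine choice described in the review.
# "flex" requires PyTorch >= 2.5.0 (where FlexAttention is natively
# supported); "flash" refers to the FlashAttention kernel.
# Function name and return values are illustrative only.

def choose_attention_engine(torch_version: str, prefer: str = "flex") -> str:
    major, minor = (int(x) for x in torch_version.split(".")[:2])
    flex_available = (major, minor) >= (2, 5)
    if prefer == "flex" and flex_available:
        return "flex_attention"
    return "flash_attention"

print(choose_attention_engine("2.5.0"))  # flex_attention
print(choose_attention_engine("2.3.1"))  # falls back to flash_attention
```

A real implementation would dispatch to the corresponding kernel rather than return a string, but the version gate captures why the paper notes PyTorch 2.5.0 specifically.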
Experiment Setup | Yes | Our training hyperparameters are primarily based on (Zhang, 2023). All models are trained on 8 NVIDIA A100 GPUs. We set the learning rate to 2e-5 and use the AdamW optimizer with weight decay of 0.1, β1 = 0.9, and β2 = 0.95. Each model is trained for 2000 steps, which corresponds to approximately 1 epoch over the 2 billion token dataset. The batch size is set to 8, equating to 0.5 million tokens per batch for 64K context and 1 million tokens for 128K context lengths.
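The per-batch token counts quoted above follow directly from batch size × context length. A small arithmetic check (all numbers taken from the quoted setup; nothing here is from the actual training code):

```python
# Verify the batch-token arithmetic stated in the experiment setup:
# batch size 8, contexts of 64K and 128K tokens, 2000 training steps.
batch_size = 8
steps = 2000

for context_len in (64 * 1024, 128 * 1024):
    tokens_per_batch = batch_size * context_len
    total_tokens = tokens_per_batch * steps
    print(f"{context_len // 1024}K context: "
          f"{tokens_per_batch / 1e6:.1f}M tokens/batch, "
          f"{total_tokens / 1e9:.1f}B tokens over {steps} steps")
```

Note that at 128K context, 2000 steps covers roughly 2.1B tokens, which matches the paper's "approximately 1 epoch over the 2 billion token dataset"; at 64K context the same step count covers about half that.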