LongRoPE2: Near-Lossless LLM Context Window Scaling

Authors: Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2."
Researcher Affiliation | Collaboration | 1 Microsoft, 2 Shanghai Jiao Tong University, 3 Zhejiang University. Siyuan Wang and Gaokai Zhang did this work during their internships at MSRA. Correspondence to: Li Lyna Zhang <EMAIL>.
Pseudocode | Yes | "Figure 10. The pseudocode for mixed context window training and inference."
Open Source Code | No | The paper mentions using 'FlashAttention-2 (Dao, 2023)' and 'nnScaler (Lin et al., 2024)' but does not provide access to source code for the methodology described in this paper.
Open Datasets | Yes | "We randomly sample 10 books from the PG19 validation set. ... we sample 4.5B, 2.5B, and 2B tokens from RedPajama-v1 (Computer, 2023), RedPajama-v2 (Weber et al., 2024), and StarCoder (Li et al., 2023), covering 8k-200k sequence lengths. For short context windows, we sample 1B tokens from Fineweb-Edu (Lozhkov et al., 2024)."
Dataset Splits | No | The paper specifies token amounts used for training from various datasets (RedPajama-v1, RedPajama-v2, StarCoder, Fineweb-Edu) and mentions using the 'PG19 validation set'. However, it does not provide specific details on train/test/validation splits (e.g., percentages or exact counts) for its experiments on these or other evaluation benchmarks like RULER, LOFT, etc.
Hardware Specification | Yes | "we extend the two models to 128k context window and mid-train on 64 A100 GPUs using a 10B-token dataset. ... we measured the KV cache recomputation time on 4×80GB A100 GPUs (using vLLM 0.7.3)"
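To see why KV cache recomputation at a 128k context window is expensive enough to measure on 80GB A100s, a back-of-the-envelope cache-size calculation helps. The sketch below assumes LLaMA3-8B's architecture (32 layers, 8 KV heads, head dimension 128, fp16 weights); these figures come from the public model configuration, not from this paper.

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence (assumed LLaMA3-8B-like config, fp16).

    The leading 2 accounts for separate key and value tensors per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = kv_cache_bytes(128 * 1024) / 2**30
print(f"{gib:.0f} GiB")  # → 16 GiB for a single 128k-token sequence
```

At 16 GiB per 128k-token sequence, a single 80GB A100 holds only a handful of such caches alongside the model weights, which is why recomputation cost is worth benchmarking.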
Software Dependencies | Yes | "To accelerate training and inference, we use FlashAttention-2 (Dao, 2023)... we utilize nnScaler (Lin et al., 2024), an efficient distributed training system for long-context LLMs... we measured the KV cache recomputation time on 4×80GB A100 GPUs (using vLLM 0.7.3)"
Experiment Setup | Yes | "We train for 1 epoch with a global batch size of 64. The initial learning rate is 2e-5 with a cosine learning rate scheduler. For the rescaling factor search, we set a population size of P = 64, evolution iterations of 40, and a mutation probability p = 0.3."
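The quoted setup gives only the search hyperparameters (P = 64, 40 iterations, p = 0.3) for the evolutionary rescaling-factor search. A minimal generic sketch of such a search is below; the initialization range, uniform crossover, halving selection, and toy fitness function are all assumptions for illustration, not LongRoPE2's actual objective (which scores candidate RoPE rescaling factors by long-context perplexity).

```python
import random

def evolutionary_search(fitness, dim, pop_size=64, iters=40, p_mut=0.3, seed=0):
    """Generic evolutionary search over per-dimension rescaling factors.

    `fitness` is lower-is-better (e.g., perplexity of a model using the
    candidate factors); here the caller supplies a placeholder objective.
    """
    rng = random.Random(seed)
    # Initialize candidates with factors drawn uniformly from [1, 64] (assumed range).
    pop = [[rng.uniform(1.0, 64.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(iters):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]                # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            for i in range(dim):                      # mutate each gene with prob p_mut
                if rng.random() < p_mut:
                    child[i] = rng.uniform(1.0, 64.0)
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

# Toy objective: prefer all factors near 32 (a stand-in for perplexity).
best = evolutionary_search(lambda x: sum((f - 32.0) ** 2 for f in x), dim=4)
```

With the paper's settings (64 candidates, 40 generations, 30% per-gene mutation), the loop evaluates at most 64 × 40 fitness calls, which is why a cheap proxy objective matters when each evaluation involves a forward pass over long sequences.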