LongRoPE2: Near-Lossless LLM Context Window Scaling
Authors: Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. |
| Researcher Affiliation | Collaboration | 1Microsoft 2Shanghai Jiao Tong University 3Zhejiang University; Siyuan Wang and Gaokai Zhang did this work during the internship at MSRA. Correspondence to: Li Lyna Zhang <EMAIL>. |
| Pseudocode | Yes | Figure 10. The pseudocode for mixed context window training and inference. |
| Open Source Code | No | The paper mentions using 'FlashAttention-2 (Dao, 2023)' and 'nnScaler (Lin et al., 2024)' but does not provide specific access to source code for the methodology described in this paper. |
| Open Datasets | Yes | We randomly sample 10 books from the PG19 validation set. ... we sample 4.5B, 2.5B, and 2B tokens from RedPajama-v1 (Computer, 2023), RedPajama-v2 (Weber et al., 2024), and StarCoder (Li et al., 2023), covering 8k-200k sequence lengths. For short context windows, we sample 1B tokens from FineWeb-Edu (Lozhkov et al., 2024). |
| Dataset Splits | No | The paper specifies token amounts used for training from various datasets (RedPajama-v1, RedPajama-v2, StarCoder, FineWeb-Edu) and mentions using the 'PG19 validation set'. However, it does not provide specific details on train/test/validation splits (e.g., percentages or exact counts) for its experiments on these or other evaluation benchmarks like RULER, LOFT, etc. |
| Hardware Specification | Yes | we extend the two models to 128k context window and mid-train on 64 A100 GPUs using a 10B-token dataset. ... we measured the KV cache recomputation time on 4×80GB A100 GPUs (using vLLM 0.7.3) |
| Software Dependencies | Yes | To accelerate training and inference, we use FlashAttention-2 (Dao, 2023)... we utilize nnScaler (Lin et al., 2024), an efficient distributed training system for long-context LLMs... we measured the KV cache recomputation time on 4×80GB A100 GPUs (using vLLM 0.7.3) |
| Experiment Setup | Yes | We train for 1 epoch with a global batch size of 64. The initial learning rate is 2e-5 with a cosine learning rate scheduler. For the rescaling factor search, we set a population size of P = 64, evolution iterations of 40, and a mutation probability p = 0.3. |
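The evolutionary-search hyperparameters quoted above (P = 64, 40 iterations, p = 0.3) can be illustrated with a minimal sketch. This is not the paper's implementation: the real search optimizes RoPE rescaling factors against long-context perplexity, while the objective, factor bounds, dimensionality, and crossover scheme below are all hypothetical stand-ins chosen only to show the loop structure.

```python
import random

# Hyperparameters reported in the paper's search setup.
POP_SIZE = 64          # population size P
ITERATIONS = 40        # number of evolution iterations
MUTATION_PROB = 0.3    # mutation probability p
NUM_DIMS = 8           # number of rescaling factors (illustrative, not from the paper)

def fitness(candidate):
    # Stand-in objective: the real search minimizes long-context perplexity;
    # here we minimize distance to an arbitrary target vector instead.
    target = [2.0] * NUM_DIMS
    return -sum((c - t) ** 2 for c, t in zip(candidate, target))

def mutate(candidate):
    # Each factor is perturbed with probability p = 0.3.
    return [c * random.uniform(0.9, 1.1) if random.random() < MUTATION_PROB else c
            for c in candidate]

def crossover(a, b):
    # Uniform crossover between two parent candidates.
    return [random.choice(pair) for pair in zip(a, b)]

def evolutionary_search(seed=0):
    random.seed(seed)
    # Initialize P candidates with factors in an assumed range [1, 4].
    population = [[random.uniform(1.0, 4.0) for _ in range(NUM_DIMS)]
                  for _ in range(POP_SIZE)]
    for _ in range(ITERATIONS):
        # Select the top quarter as parents, refill the rest with offspring.
        population.sort(key=fitness, reverse=True)
        parents = population[:POP_SIZE // 4]
        children = []
        while len(parents) + len(children) < POP_SIZE:
            a, b = random.sample(parents, 2)
            children.append(mutate(crossover(a, b)))
        population = parents + children
    return max(population, key=fitness)

best = evolutionary_search()
```

Under these assumptions the loop converges toward the stand-in target; swapping `fitness` for a perplexity evaluation of a model with the candidate rescaling factors would recover the shape of the search the paper describes.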