LongRoPE2: Near-Lossless LLM Context Window Scaling

Authors: Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2."
Researcher Affiliation | Collaboration | 1 Microsoft, 2 Shanghai Jiao Tong University, 3 Zhejiang University. Siyuan Wang and Gaokai Zhang did this work during their internships at MSRA. Correspondence to: Li Lyna Zhang <EMAIL>.
Pseudocode | Yes | "Figure 10. The pseudocode for mixed context window training and inference."
Open Source Code | No | The paper mentions using 'FlashAttention-2 (Dao, 2023)' and 'nnScaler (Lin et al., 2024)' but does not provide access to source code for the methodology described in this paper.
Open Datasets | Yes | "We randomly sample 10 books from the PG19 validation set. ... we sample 4.5B, 2.5B, and 2B tokens from RedPajama-v1 (Computer, 2023), RedPajama-v2 (Weber et al., 2024), and StarCoder (Li et al., 2023), covering 8k-200k sequence lengths. For short context windows, we sample 1B tokens from Fineweb-Edu (Lozhkov et al., 2024)."
Dataset Splits | No | The paper specifies token amounts used for training from various datasets (RedPajama-v1, RedPajama-v2, StarCoder, Fineweb-Edu) and mentions using the 'PG19 validation set'. However, it does not provide specific details on train/test/validation splits (e.g., percentages or exact counts) for its experiments on these or other evaluation benchmarks like RULER, LOFT, etc.
Hardware Specification | Yes | "we extend the two models to 128k context window and mid-train on 64 A100 GPUs using a 10B-token dataset. ... we measured the KV cache recomputation time on 4×80GB A100 GPUs (using vLLM 0.7.3)"
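To see why KV cache recomputation at a 128k context window is expensive enough to measure on 80GB A100s, a back-of-the-envelope cache-size calculation helps. The sketch below assumes LLaMA3-8B's architecture (32 layers, 8 KV heads, head dimension 128, fp16 weights); these figures come from the public model configuration, not from this paper.

```python
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache for one sequence (assumed LLaMA3-8B-like config, fp16).

    The leading 2 accounts for separate key and value tensors per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = kv_cache_bytes(128 * 1024) / 2**30
print(f"{gib:.0f} GiB")  # → 16 GiB for a single 128k-token sequence
```

At 16 GiB per 128k-token sequence, a single 80GB A100 holds only a handful of such caches alongside the model weights, which is why recomputation cost is worth benchmarking.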
Software Dependencies | Yes | "To accelerate training and inference, we use FlashAttention-2 (Dao, 2023)... we utilize nnScaler (Lin et al., 2024), an efficient distributed training system for long-context LLMs... we measured the KV cache recomputation time on 4×80GB A100 GPUs (using vLLM 0.7.3)"
Experiment Setup | Yes | "We train for 1 epoch with a global batch size of 64. The initial learning rate is 2e-5 with a cosine learning rate scheduler. For the rescaling factor search, we set a population size of P = 64, evolution iterations of 40, and a mutation probability p = 0.3."
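The quoted setup gives only the search hyperparameters (P = 64, 40 iterations, p = 0.3) for the evolutionary rescaling-factor search. A minimal generic sketch of such a search is below; the initialization range, uniform crossover, halving selection, and toy fitness function are all assumptions for illustration, not LongRoPE2's actual objective (which scores candidate RoPE rescaling factors by long-context perplexity).

```python
import random

def evolutionary_search(fitness, dim, pop_size=64, iters=40, p_mut=0.3, seed=0):
    """Generic evolutionary search over per-dimension rescaling factors.

    `fitness` is lower-is-better (e.g., perplexity of a model using the
    candidate factors); here the caller supplies a placeholder objective.
    """
    rng = random.Random(seed)
    # Initialize candidates with factors drawn uniformly from [1, 64] (assumed range).
    pop = [[rng.uniform(1.0, 64.0) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(iters):
        pop.sort(key=fitness)
        parents = pop[: pop_size // 2]                # keep the fitter half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(a, b)]  # uniform crossover
            for i in range(dim):                      # mutate each gene with prob p_mut
                if rng.random() < p_mut:
                    child[i] = rng.uniform(1.0, 64.0)
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

# Toy objective: prefer all factors near 32 (a stand-in for perplexity).
best = evolutionary_search(lambda x: sum((f - 32.0) ** 2 for f in x), dim=4)
```

With the paper's settings (64 candidates, 40 generations, 30% per-gene mutation), the loop evaluates at most 64 × 40 fitness calls, which is why a cheap proxy objective matters when each evaluation involves a forward pass over long sequences.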