NExtLong: Toward Effective Long-Context Training without Long Documents

Authors: Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models, which are trained on non-synthetic long documents."
Researcher Affiliation | Collaboration | 1. Institute of Information Engineering, Chinese Academy of Sciences; 2. School of Cyber Security, University of Chinese Academy of Sciences; 3. Tsinghua University; 4. Xiaohongshu Inc.
Pseudocode | Yes | "An overview of the approach is shown in Figure 2, and the corresponding pseudocode is presented in Appendix D.2. ... Algorithm 1 Negative Document Extension (NExtLong)"
Open Source Code | Yes | "Our code is available in https://github.com/caskcsg/longcontext/tree/main/NExtLong."
Open Datasets | Yes | "Datasets. We select two commonly used pretraining datasets composed entirely of short documents (refer to Appendix B.2 for document length distribution): Cosmopedia v2 (Ben Allal et al., 2024) and FineWeb-Edu (Lozhkov et al., 2024)."
Dataset Splits | No | The paper uses Cosmopedia v2 (Ben Allal et al., 2024) and FineWeb-Edu (Lozhkov et al., 2024) as pretraining datasets. For evaluation, it uses the HELMET (Yen et al., 2024b) and RULER (Hsieh et al., 2024) benchmarks. While these benchmarks have their own splits, the paper does not specify train/validation/test splits for the synthesized long-context data used to train the models.
Hardware Specification | Yes | Table 8 (128K model training configuration): GPU type H100, 64 GPUs. Table 9 (512K model training configuration): GPU type H100, 128 GPUs.
Software Dependencies | No | "We fine-tune the Meta-Llama-3-8B-base (Meta, 2024) model using a batch size of 4M tokens for 1000 steps with the open-source framework GPT-NeoX." No specific version number for GPT-NeoX or other software libraries is provided.
Experiment Setup | Yes | Table 8 (128K model training configuration): initial model Meta-Llama-3-8B (base model), rotary-emb-base 200,000,000, β1 0.9, β2 0.95, lr 4e-5, precision bfloat16, gradient-clipping 1.0, weight-decay 0.1, lr-decay-style cosine, train-iters 1000, warmup-iters 200, seq-length 131072. Table 9 (512K model training configuration): initial model Llama-3-8B-ProLong-64k-Base, rotary-emb-base 128,000,000, β1 0.9, β2 0.95, lr 1e-5, precision bfloat16, gradient-clipping 1.0, weight-decay 0.1, lr-decay-style cosine, train-iters 500, warmup-iters 50, seq-length 524288.
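To make the reported hyperparameters easier to scan, the 128K configuration from Table 8 can be written out as a plain Python dictionary. This is an illustrative sketch only: the key names below are invented for readability and are not the actual GPT-NeoX config keys used by the authors.

```python
# Sketch of the 128K training configuration from Table 8.
# Key names are illustrative, not the authors' actual config schema.
TRAIN_CONFIG_128K = {
    "initial_model": "Meta-Llama-3-8B (base)",
    "rotary_emb_base": 200_000_000,
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "lr": 4e-5,
    "precision": "bfloat16",
    "gradient_clipping": 1.0,
    "weight_decay": 0.1,
    "lr_decay_style": "cosine",
    "train_iters": 1000,
    "warmup_iters": 200,
    "seq_length": 131072,  # 128K-token context window
}

# The paper reports a 4M-token batch size; at a 131072-token sequence
# length that implies 32 sequences per global batch, and roughly
# 4.2B tokens seen over the 1000 training steps.
batch_tokens = 4 * 1024 * 1024
sequences_per_batch = batch_tokens // TRAIN_CONFIG_128K["seq_length"]
total_tokens = batch_tokens * TRAIN_CONFIG_128K["train_iters"]
print(sequences_per_batch)  # 32
print(total_tokens)         # 4194304000
```

The 512K configuration (Table 9) differs mainly in the initial model (Llama-3-8B-ProLong-64k-Base), a lower learning rate (1e-5), a rotary-emb-base of 128,000,000, and a 524288-token sequence length over 500 steps.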