NExtLong: Toward Effective Long-Context Training without Long Documents

Authors: Chaochen Gao, Xing Wu, Zijia Lin, Debing Zhang, Songlin Hu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments demonstrate that NExtLong achieves significant performance improvements on the HELMET and RULER benchmarks compared to existing long-context synthesis approaches and leading models, which are trained on non-synthetic long documents."
Researcher Affiliation | Collaboration | 1. Institute of Information Engineering, Chinese Academy of Sciences; 2. School of Cyber Security, University of Chinese Academy of Sciences; 3. Tsinghua University; 4. Xiaohongshu Inc.
Pseudocode | Yes | "An overview of the approach is shown in Figure 2, and the corresponding pseudocode is presented in Appendix D.2. ... Algorithm 1 Negative Document Extension (NExtLong)"
Open Source Code | Yes | "Our code is available in https://github.com/caskcsg/longcontext/tree/main/NExtLong."
Open Datasets | Yes | "Datasets. We select two commonly used pretraining datasets composed entirely of short documents (refer to Appendix B.2 for document length distribution): Cosmopedia v2 (Ben Allal et al., 2024) and FineWeb-Edu (Lozhkov et al., 2024)."
Dataset Splits | No | The paper uses Cosmopedia v2 (Ben Allal et al., 2024) and FineWeb-Edu (Lozhkov et al., 2024) as pretraining datasets. For evaluation, it uses the HELMET (Yen et al., 2024b) and RULER (Hsieh et al., 2024) benchmarks. While these benchmarks have their own splits, the paper does not specify train/validation/test splits for the synthesized long-context data used to train the models.
Hardware Specification | Yes | Table 8 (128K model training configuration): GPU type H100, 64 GPUs. Table 9 (512K model training configuration): GPU type H100, 128 GPUs.
Software Dependencies | No | "We fine-tune the Meta-Llama-3-8B-base (Meta, 2024) model using a batch size of 4M tokens for 1000 steps with the open-source framework GPT-NeoX." No specific version number for GPT-NeoX or other software libraries is provided.
Experiment Setup | Yes | Table 8 (128K model training configuration): initial model Meta-Llama-3-8B (base model), rotary-emb-base 200,000,000, β1 0.9, β2 0.95, lr 4e-5, precision bfloat16, gradient-clipping 1.0, weight-decay 0.1, lr-decay-style cosine, train-iters 1000, warmup-iters 200, seq-length 131072. Table 9 (512K model training configuration): initial model Llama-3-8B-ProLong-64k-Base, rotary-emb-base 128,000,000, β1 0.9, β2 0.95, lr 1e-5, precision bfloat16, gradient-clipping 1.0, weight-decay 0.1, lr-decay-style cosine, train-iters 500, warmup-iters 50, seq-length 524288.
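To make the reported hyperparameters easier to scan, the 128K configuration from Table 8 can be written out as a plain Python dictionary. This is an illustrative sketch only: the key names below are invented for readability and are not the actual GPT-NeoX config keys used by the authors.

```python
# Sketch of the 128K training configuration from Table 8.
# Key names are illustrative, not the authors' actual config schema.
TRAIN_CONFIG_128K = {
    "initial_model": "Meta-Llama-3-8B (base)",
    "rotary_emb_base": 200_000_000,
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "lr": 4e-5,
    "precision": "bfloat16",
    "gradient_clipping": 1.0,
    "weight_decay": 0.1,
    "lr_decay_style": "cosine",
    "train_iters": 1000,
    "warmup_iters": 200,
    "seq_length": 131072,  # 128K-token context window
}

# The paper reports a 4M-token batch size; at a 131072-token sequence
# length that implies 32 sequences per global batch, and roughly
# 4.2B tokens seen over the 1000 training steps.
batch_tokens = 4 * 1024 * 1024
sequences_per_batch = batch_tokens // TRAIN_CONFIG_128K["seq_length"]
total_tokens = batch_tokens * TRAIN_CONFIG_128K["train_iters"]
print(sequences_per_batch)  # 32
print(total_tokens)         # 4194304000
```

The 512K configuration (Table 9) differs mainly in the initial model (Llama-3-8B-ProLong-64k-Base), a lower learning rate (1e-5), a rotary-emb-base of 128,000,000, and a 524288-token sequence length over 500 steps.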