Efficient Long Context Fine-tuning with Chunk Flow
Authors: Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that, compared with Megatron-LM, Chunk Flow can be up to 4.53x faster in long-context fine-tuning of LLMs. |
| Researcher Affiliation | Collaboration | 1Alibaba Group 2School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences 3State Key Lab of Processors, Institute of Computing Technology, CAS. |
| Pseudocode | Yes | Algorithm 1: Chunk Construction Algorithm; Algorithm 2: Chunk Scheduling Algorithm |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository, an explicit statement of code release, or mention of code in supplementary materials. |
| Open Datasets | Yes | Table 1 shows sequence length distribution statistics for the LMSys Chat1M dataset. Over 99% of the sequences in the dataset are shorter than 4K tokens, while the longest sequence extends to approximately 300K tokens. This pronounced disparity results in an extremely long-tailed distribution of sequence lengths, posing significant challenges for efficient training and resource utilization. Notably, this distribution pattern has also been observed by Meta (Meta, 2024), as well as in the authors' in-house proprietary training dataset, which is specifically collected for fine-tuning LLMs with context lengths over 256K. |
| Dataset Splits | No | The paper mentions using 'LMSys Chat1M' and an 'evaluation dataset' but does not specify how these datasets are split into training, validation, or test sets with percentages or sample counts. |
| Hardware Specification | Yes | All experiments are conducted using Alibaba Cloud ml.gu7ef.8xlarge-gu100 instances (Alibaba, 2025a), with a global batch size of 256 and a micro-batch size of 1. |
| Software Dependencies | No | The paper mentions using Megatron-LM as a baseline and states that Chunk Flow is built on top of it, but does not provide specific version numbers for Megatron-LM or any other software dependencies. |
| Experiment Setup | Yes | All experiments are conducted using Alibaba Cloud ml.gu7ef.8xlarge-gu100 instances (Alibaba, 2025a), with a global batch size of 256 and a micro-batch size of 1. The configurations used for training different-sized models in Megatron-LM with various context lengths are shown in Table 3. These configurations achieve the best performance in Megatron-LM while ensuring no OOM errors occur. |