Efficient Long Context Fine-tuning with Chunk Flow
Authors: Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that, compared with Megatron-LM, Chunk Flow can be up to 4.53x faster in long-context fine-tuning of LLMs. |
| Researcher Affiliation | Collaboration | 1Alibaba Group 2School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences 3State Key Lab of Processors, Institute of Computing Technology, CAS. |
| Pseudocode | Yes | Algorithm 1: Chunk Construction Algorithm; Algorithm 2: Chunk Scheduling Algorithm |
| Open Source Code | No | The paper does not provide a direct link to a source-code repository, an explicit statement of code release, or mention of code in supplementary materials. |
| Open Datasets | Yes | Table 1 shows sequence length distribution statistics for the LMSys Chat1M dataset. Over 99% of the sequences in the dataset are shorter than 4K tokens, while the longest sequence extends to approximately 300K tokens. This pronounced disparity results in an extremely long-tailed distribution of sequence lengths, posing significant challenges for efficient training and resource utilization. Notably, this distribution pattern has also been observed by Meta (Meta, 2024), as well as in the authors' in-house proprietary training dataset, which is specifically collected for fine-tuning LLMs with context lengths over 256K. |
| Dataset Splits | No | The paper mentions using 'LMSys Chat1M' and an 'evaluation dataset' but does not specify how these datasets are split into training, validation, or test sets with percentages or sample counts. |
| Hardware Specification | Yes | All experiments are conducted using Alibaba Cloud ml.gu7ef.8xlarge-gu100 instances (Alibaba, 2025a), with a global batch size of 256 and a micro-batch size of 1. |
| Software Dependencies | No | The paper mentions using Megatron-LM as a baseline and states that Chunk Flow is built on top of it, but does not provide specific version numbers for Megatron-LM or any other software dependencies. |
| Experiment Setup | Yes | All experiments are conducted using Alibaba Cloud ml.gu7ef.8xlarge-gu100 instances (Alibaba, 2025a), with a global batch size of 256 and a micro-batch size of 1. The configurations used for training different-sized models in Megatron-LM with various context lengths are shown in Table 3. These configurations achieve the best performance in Megatron-LM while ensuring no OOM errors occur. |