Scaling Long Context Training Data by Long-Distance Referrals

Authors: Yonghao Zhuang, Lanxiang Hu, Longfei Yun, Souvik Kundu, Zhengzhong Liu, Eric P. Xing, Hao Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study the core attribute of high-quality data for long-context training, and provide a data pipeline, Long Pack, to scale such data. Our experiments demonstrate that Long Pack is highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset using just 0.5% of documents as roots.
Researcher Affiliation | Collaboration | Yonghao Zhuang¹, Lanxiang Hu², Longfei Yun², Souvik Kundu³, Zhengzhong Liu⁴, Eric P. Xing¹﹐⁴, Hao Zhang²﹐⁵ (¹Carnegie Mellon University, ²University of California San Diego, ³Intel, ⁴MBZUAI, ⁵Snowflake)
Pseudocode | Yes | Appendix D ("Pseudo-code of our data pipeline") provides Algorithm 1, the Long Pack data processing pipeline.
Open Source Code | No | The paper mentions using "Huggingface Transformers for model implementation, and Huggingface Accelerate for training" in Section 5.1, but does not provide code or a repository link for its own methodology.
Open Datasets | Yes | On the other hand, directly sampling long documents from the pretraining dataset faces data shortage issues. For example, in Figure 5a, we analyzed RefinedWeb (Penedo et al., 2023), a popular pretraining dataset with 960 million documents and 600 billion tokens. ... We use the RefinedWeb (Penedo et al., 2023) dataset in all experiments. ... For example, RedPajama (Weber et al., 2024), a dataset reproducing the Llama training dataset, has 87% from code, and 4.8% from GitHub, which is also available on the Internet. These pages are regularly collected by automated crawlers like Common Crawl and the Wayback Machine (Crawl, 2024; Archive, 2024). ... We use GLM-4-9B (Zeng et al., 2024) as the base pretrained model. ... The RULER benchmark has 13 tasks in 4 categories. We further divide the retrieval category into two subcategories, single-pos retrieval and multi-pos retrieval. Below we briefly explain each category, as well as the tasks in the category: ... SQuAD (Rajpurkar et al., 2016) (qa1) and HotpotQA (Yang et al., 2018) (qa2).
Dataset Splits | Yes | Documents are categorized into several groups based on their length, and we conduct referral density analysis across the groups. We set 5 groups of document length intervals: (0-4K), (4K-8K), (8K-16K), (16K-32K), (32K-64K). For each interval, we sample a group of documents within the length interval. Each group has roughly 1 billion tokens. ... For each task, it samples 500 times and uses the pass rate as the score.
Hardware Specification | Yes | All experiments run on a single node with 8 H100 80GB GPUs.
Software Dependencies | No | The paper mentions "Huggingface Transformers" and "Huggingface Accelerate" for model implementation and training, and "spaCy" and "coreferee" for phrase extraction and coreference resolution, but no version numbers are provided for any of these dependencies.
Experiment Setup | Yes | All experiments share the same hyper-parameters, listed in Table 3 ("Training hyper-parameters in all our experiments"): learning rate 2e-5; batch size 64; mixed precision bf16; optimizer AdamW; weight decay 0.
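The Pseudocode row names Algorithm 1, the Long Pack data processing pipeline, but the review quotes only its title. As a rough illustration of what a referral-based packing step could look like, the sketch below starts from a root document, follows referral links breadth-first, and concatenates referred documents until a target sequence length is reached. All names (`pack_long_document`, `referrals`, `doc_tokens`, `target_len`) and the traversal order are assumptions for illustration, not the paper's actual Algorithm 1.

```python
from collections import deque

def pack_long_document(root_id, referrals, doc_tokens, target_len=65536):
    """Sketch of a referral-based packing step (hypothetical, not the
    paper's Algorithm 1).

    referrals:  dict mapping doc_id -> list of referred doc_ids
    doc_tokens: dict mapping doc_id -> list of tokens
    """
    packed = []          # tokens of the packed long sequence
    seen = {root_id}     # avoid packing the same document twice
    queue = deque([root_id])
    while queue and len(packed) < target_len:
        doc = queue.popleft()
        packed.extend(doc_tokens[doc])
        # Enqueue documents this one refers to, breadth-first.
        for ref in referrals.get(doc, []):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return packed[:target_len]
```

Because each root document fans out through its referral graph, a small fraction of roots can yield a large packed corpus, which is consistent with the paper's claim of scaling to a full pretraining dataset from 0.5% of documents as roots.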
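The Dataset Splits row describes grouping documents into five length intervals, (0-4K) through (32K-64K), for the referral-density analysis. A minimal sketch of that bucketing, assuming token counts are already known per document (the helper name `bucket_documents` is illustrative, not from the paper's code):

```python
from collections import defaultdict

# The five length intervals (in tokens) from the paper's analysis:
# (0-4K), (4K-8K), (8K-16K), (16K-32K), (32K-64K).
INTERVALS = [(0, 4096), (4096, 8192), (8192, 16384),
             (16384, 32768), (32768, 65536)]

def bucket_documents(doc_lengths):
    """Group documents by token length into the interval buckets above.

    doc_lengths: dict mapping doc_id -> token count
    Returns a dict mapping (lo, hi) -> list of doc_ids with lo < count <= hi.
    """
    groups = defaultdict(list)
    for doc_id, n_tokens in doc_lengths.items():
        for lo, hi in INTERVALS:
            if lo < n_tokens <= hi:
                groups[(lo, hi)].append(doc_id)
                break  # each document falls into exactly one bucket
    return groups
```

Per the paper, documents would then be sampled from each bucket until the group holds roughly 1 billion tokens; that sampling step is omitted here.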
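The Experiment Setup row lists the paper's shared Table 3 hyper-parameters. Collected into a single config for reference (the dict keys are illustrative; the paper does not publish its training code):

```python
# Table 3 hyper-parameters, shared across all the paper's experiments.
TRAIN_CONFIG = {
    "learning_rate": 2e-5,
    "batch_size": 64,
    "mixed_precision": "bf16",
    "optimizer": "AdamW",
    "weight_decay": 0.0,
}
```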