Scaling Long Context Training Data by Long-Distance Referrals

Authors: Yonghao Zhuang, Lanxiang Hu, Longfei Yun, Souvik Kundu, Zhengzhong Liu, Eric P. Xing, Hao Zhang

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study the core attribute of high-quality data for long-context training, and provide a data pipeline, Long Pack, to scale such data. Our experiments demonstrate that Long Pack is highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset using just 0.5% of documents as roots.
Researcher Affiliation | Collaboration | Yonghao Zhuang¹, Lanxiang Hu², Longfei Yun², Souvik Kundu³, Zhengzhong Liu⁴, Eric P. Xing¹﹐⁴, Hao Zhang²﹐⁵ (¹Carnegie Mellon University, ²University of California San Diego, ³Intel, ⁴MBZUAI, ⁵Snowflake)
Pseudocode | Yes | Appendix D ("Pseudo-code of our data pipeline") provides Algorithm 1, the Long Pack data processing pipeline.
Open Source Code | No | The paper mentions using "Huggingface Transformers for model implementation, and Huggingface Accelerate for training" in Section 5.1, but does not provide code or a repository link for its own methodology.
Open Datasets | Yes | On the other hand, directly sampling long documents from the pretraining dataset faces data shortage issues. For example, in Figure 5a, we analyzed RefinedWeb (Penedo et al., 2023), a popular pretraining dataset with 960 million documents and 600 billion tokens. ... We use the RefinedWeb (Penedo et al., 2023) dataset in all experiments. ... For example, RedPajama (Weber et al., 2024), a dataset reproducing the Llama training dataset, has 87% from code, and 4.8% from GitHub, which is also available on the Internet. These pages are regularly collected by automated crawlers like Common Crawl and the Wayback Machine (Crawl, 2024; Archive, 2024). ... We use GLM-4-9B (Zeng et al., 2024) as the base pretrained model. ... The RULER benchmark has 13 tasks in 4 categories. We further divide the retrieval category into two subcategories, single-pos retrieval and multi-pos retrieval. Below we briefly explain each category, as well as the tasks in the category: ... SQuAD (Rajpurkar et al., 2016) (qa1) and HotpotQA (Yang et al., 2018) (qa2).
Dataset Splits | Yes | Documents are categorized into several groups based on their length, and we conduct referral density analysis across the groups. We set 5 groups of document length intervals: (0-4K), (4K-8K), (8K-16K), (16K-32K), (32K-64K). For each interval, we sample a group of documents within the length interval. Each group has roughly 1 billion tokens. ... For each task, it samples 500 times and uses the pass rate as the score.
Hardware Specification | Yes | All experiments run on a single node with 8 H100 80GB GPUs.
Software Dependencies | No | The paper mentions "Huggingface Transformers" and "Huggingface Accelerate" for model implementation and training, and "spaCy" and "coreferee" for phrase extraction and coreference resolution, but no version numbers are provided for any of these dependencies.
Experiment Setup | Yes | All experiments share the same hyper-parameters, listed in Table 3 ("Training hyper-parameters in all our experiments"): learning rate 2e-5; batch size 64; mixed precision bf16; optimizer AdamW; weight decay 0.
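The Pseudocode row names Algorithm 1, the Long Pack data processing pipeline, but the review quotes only its title. As a rough illustration of what a referral-based packing step could look like, the sketch below starts from a root document, follows referral links breadth-first, and concatenates referred documents until a target sequence length is reached. All names (`pack_long_document`, `referrals`, `doc_tokens`, `target_len`) and the traversal order are assumptions for illustration, not the paper's actual Algorithm 1.

```python
from collections import deque

def pack_long_document(root_id, referrals, doc_tokens, target_len=65536):
    """Sketch of a referral-based packing step (hypothetical, not the
    paper's Algorithm 1).

    referrals:  dict mapping doc_id -> list of referred doc_ids
    doc_tokens: dict mapping doc_id -> list of tokens
    """
    packed = []          # tokens of the packed long sequence
    seen = {root_id}     # avoid packing the same document twice
    queue = deque([root_id])
    while queue and len(packed) < target_len:
        doc = queue.popleft()
        packed.extend(doc_tokens[doc])
        # Enqueue documents this one refers to, breadth-first.
        for ref in referrals.get(doc, []):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return packed[:target_len]
```

Because each root document fans out through its referral graph, a small fraction of roots can yield a large packed corpus, which is consistent with the paper's claim of scaling to a full pretraining dataset from 0.5% of documents as roots.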
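The Dataset Splits row describes grouping documents into five length intervals, (0-4K) through (32K-64K), for the referral-density analysis. A minimal sketch of that bucketing, assuming token counts are already known per document (the helper name `bucket_documents` is illustrative, not from the paper's code):

```python
from collections import defaultdict

# The five length intervals (in tokens) from the paper's analysis:
# (0-4K), (4K-8K), (8K-16K), (16K-32K), (32K-64K).
INTERVALS = [(0, 4096), (4096, 8192), (8192, 16384),
             (16384, 32768), (32768, 65536)]

def bucket_documents(doc_lengths):
    """Group documents by token length into the interval buckets above.

    doc_lengths: dict mapping doc_id -> token count
    Returns a dict mapping (lo, hi) -> list of doc_ids with lo < count <= hi.
    """
    groups = defaultdict(list)
    for doc_id, n_tokens in doc_lengths.items():
        for lo, hi in INTERVALS:
            if lo < n_tokens <= hi:
                groups[(lo, hi)].append(doc_id)
                break  # each document falls into exactly one bucket
    return groups
```

Per the paper, documents would then be sampled from each bucket until the group holds roughly 1 billion tokens; that sampling step is omitted here.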
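The Experiment Setup row lists the paper's shared Table 3 hyper-parameters. Collected into a single config for reference (the dict keys are illustrative; the paper does not publish its training code):

```python
# Table 3 hyper-parameters, shared across all the paper's experiments.
TRAIN_CONFIG = {
    "learning_rate": 2e-5,
    "batch_size": 64,
    "mixed_precision": "bf16",
    "optimizer": "AdamW",
    "weight_decay": 0.0,
}
```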