Combatting Dimensional Collapse in LLM Pre-Training Data via Submodular File Selection

Authors: Ziqing Fan, Siyuan Du, Shengchao Hu, Pingjie Wang, Li Shen, Ya Zhang, Dacheng Tao, Yanfeng Wang

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Empirically, we establish a benchmark and conduct extensive experiments on the TinyLlama architecture with models from 120M to 1.1B parameters. Evaluating across nine tasks from the Harness framework, DiSF demonstrates a significant improvement in overall performance. Specifically, DiSF saves 98.5% of the 590M training files in SlimPajama, outperforming full-data pre-training within a 50B training budget, and achieving about 1.5x training efficiency and 5x data efficiency.
Researcher Affiliation Academia 1Shanghai Jiao Tong University, China; 2Shanghai AI Laboratory, China; 3Fudan University, China; 4Shenzhen Campus of Sun Yat-sen University, China; 5Nanyang Technological University, Singapore. zqfan EMAIL, EMAIL, EMAIL, EMAIL, EMAIL, ya EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Selection procedure of DiSF. Input: (D, b, S, M); V ← ∅. Divide D into batches b_i of scale b. for i = 1, ..., |D|/b do: randomly select x' ∈ b_i and set U_i ← {x'}; while |U_i| < b|S|/|D| do: b_i ← b_i \ {x'}; x' = argmax_{x ∈ b_i} F_M^DiSF(U_i ∪ {x}); U_i ← U_i ∪ {x'}; end while; V ← V ∪ U_i; end for. Output: V
Open Source Code Yes Source code is available at: https://github.com/MediaBrain-SJTU/DiSF.
Open Datasets Yes Following many prior works (Touvron et al., 2023a; Zhang et al., 2024a; Wettig et al., 2024; Xie et al., 2023a), we employ SlimPajama (Touvron et al., 2023a; Computer, 2023) as the text corpus, which is specifically curated for pre-training LLMs. The original RedPajama corpus, an open-source research project, was designed to replicate the pre-training data of Llama (Touvron et al., 2023a) and contains over 1.2 trillion tokens.
Dataset Splits No The paper mentions using "590M training files of SlimPajama" for pre-training, but does not provide explicit training, validation, and test splits for the SlimPajama dataset itself. Evaluation is performed on separate benchmarks (Harness).
Hardware Specification Yes Thanks to FlashAttention (Dao et al., 2022) and LitGPT (Lightning AI, 2023), all experiments can be conducted on NVIDIA GeForce RTX 4090 GPUs with 24GB memory, which is feasible for general academic research. All experiments and selection are implemented in PyTorch (Paszke et al., 2019) on platforms with 8 GPUs and 64 CPUs.
Software Dependencies No The paper mentions using PyTorch (Paszke et al., 2019), FlashAttention (Dao et al., 2022), LitGPT (Lightning AI, 2023), and AdamW (Loshchilov & Hutter, 2019), but does not provide specific version numbers for these software dependencies.
Experiment Setup Yes We follow all settings in TinyLlama (Zhang et al., 2024a). The optimizer is AdamW (Loshchilov & Hutter, 2019), setting parameters β1 at 0.9 and β2 at 0.95. We adopt the cosine learning rate schedule with a maximum learning rate of 4e-4 and a minimum of 4e-5, a batch size of 2M tokens, a weight decay of 0.1, and a gradient clipping threshold of 1. The training budgets are 10B and 50B tokens, with a 1.5% selection budget of SlimPajama's training files.
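The greedy batch-wise selection in Algorithm 1 can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: `utility` is a placeholder for the paper's submodular score F_M^DiSF (computed from model M's features), and `ratio` stands for the selection budget |S|/|D|.

```python
import random

def disf_select(D, b, ratio, utility):
    """Sketch of Algorithm 1: greedy batch-wise submodular selection.

    D: list of files; b: batch scale; ratio: selection budget |S|/|D|;
    utility: callable scoring a candidate set (placeholder for F_M^DiSF).
    """
    V = []
    per_batch = max(1, int(b * ratio))  # b * |S| / |D| files kept per batch
    for i in range(0, len(D), b):
        batch = list(D[i:i + b])
        # seed the selected set U_i with a random file from the batch
        x = random.choice(batch)
        U = [x]
        batch.remove(x)
        while len(U) < per_batch and batch:
            # greedily add the file maximizing the utility of U ∪ {x}
            x = max(batch, key=lambda c: utility(U + [c]))
            batch.remove(x)
            U.append(x)
        V.extend(U)  # V ← V ∪ U_i
    return V
```

Processing the corpus batch by batch keeps each greedy step's candidate pool small, which is what makes the selection tractable at the 590M-file scale reported above.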
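The cosine learning-rate schedule from the experiment setup (max 4e-4, min 4e-5) can be sketched as a plain function; the step count is illustrative, since the paper specifies budgets in tokens rather than steps.

```python
import math

def cosine_lr(step, total_steps, max_lr=4e-4, min_lr=4e-5):
    """Cosine annealing from max_lr (step 0) down to min_lr (final step)."""
    # cosine factor decays smoothly from 1 to 0 over the run
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return min_lr + (max_lr - min_lr) * cos
```

With a 2M-token batch size, a 10B-token budget corresponds to roughly 5,000 such steps and the 50B budget to roughly 25,000.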