Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that models trained on PROX-refined data consistently outperform other baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM).
Researcher Affiliation | Collaboration | ¹Shanghai Jiao Tong University, ²Generative AI Research Lab (GAIR), ³Sea AI Lab, ⁴Shanghai Artificial Intelligence Laboratory. Correspondence to: Pengfei Liu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Document Chunk Splitting Algorithm
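The paper's Algorithm 1 itself is not reproduced in this report. As a rough illustration of what a document chunk splitting routine typically looks like, the sketch below greedily packs lines into chunks under a token budget; the function name, the line-boundary rule, and the whitespace word count used as a token proxy are all assumptions, not the paper's actual algorithm.

```python
def split_document(doc: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack lines into chunks under a token budget.

    Whitespace word count is a crude token proxy (an assumption;
    the paper's Algorithm 1 may use different budgets and boundaries).
    """
    chunks, current, used = [], [], 0
    for line in doc.splitlines():
        cost = len(line.split())
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks

# 20 lines of 10 words each, packed 5 lines per 50-token chunk.
doc = "\n".join(f"line {i} " + "w " * 8 for i in range(20))
print(len(split_document(doc, max_tokens=50)))  # 4
```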
Open Source Code | No | The paper mentions using third-party open-source codebases like LitGPT, TinyLlama, llama-factory, and vLLM. However, it does not provide any explicit statement or link for the authors' own implementation code for the PROX methodology described in this paper.
Open Datasets | Yes | For the general domain, we begin with RedPajama-V2 (Together, 2023), a preprocessed large-scale dataset... We further apply PROX on the C4 corpus (Raffel et al., 2020)... and the recent high-quality datasets including FineWeb (as well as FineWeb-Edu) (Penedo et al., 2024a) and DCLM (Li et al., 2024). For specific domain experiments, we use OpenWebMath (Paster et al., 2024)...
Dataset Splits | Yes | Finally, we use LLAMA-3-70B-INSTRUCT to annotate 51K examples, splitting off 5K for validation.
Hardware Specification | Yes | Such 2-stage synthesis requires approximately 192 A100 GPU hours for processing 60B tokens of data.
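As a back-of-envelope check on the reported figures, 192 A100 GPU hours over 60B tokens implies a processing rate of roughly 312.5M tokens per GPU hour:

```python
# Throughput implied by the paper's reported cost:
# 192 A100 GPU hours to process 60B tokens.
gpu_hours = 192
tokens = 60e9

tokens_per_gpu_hour = tokens / gpu_hours
print(f"{tokens_per_gpu_hour:.3e} tokens per A100 GPU hour")  # 3.125e+08
```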
Software Dependencies | No | The paper mentions using LitGPT (AI, 2023), TinyLlama (Zhang et al., 2024b), FlashAttention (Dao, 2024), llama-factory (Zheng et al., 2024), and vLLM (Kwon et al., 2023) but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We apply full-parameter supervised fine-tuning on our base models: we train on the whole seed dataset for 3 to 5 epochs, with a batch size of 64 and a cosine learning rate scheduler (lr from 1e-5 to 1e-6)... Table 10: Training hyper-parameters of all base models.
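The cosine schedule decaying from 1e-5 to 1e-6 can be written out as below; the function name and step counts are illustrative assumptions, and the exact warmup or per-step details may differ from the paper's Table 10.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-5, lr_min=1e-6):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 1_000
print(f"{cosine_lr(0, total):.1e}")      # 1.0e-05 at the start
print(f"{cosine_lr(total, total):.1e}")  # 1.0e-06 at the end
```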