Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale
Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that models trained on PROX-refined data consistently outperform other baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). |
| Researcher Affiliation | Collaboration | ¹Shanghai Jiao Tong University, ²Generative AI Research Lab (GAIR), ³Sea AI Lab, ⁴Shanghai Artificial Intelligence Laboratory. Correspondence to: Pengfei Liu <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Document Chunk Splitting Algorithm |
| Open Source Code | No | The paper mentions using third-party open-source codebases such as LitGPT, TinyLlama, LLaMA-Factory, and vLLM. However, it provides no explicit statement or link to the authors' own implementation of the PROX methodology described in the paper. |
| Open Datasets | Yes | For the general domain, we begin with RedPajama-V2 (Together, 2023), a preprocessed large-scale dataset... We further apply PROX on the C4 corpus (Raffel et al., 2020)... and the recent high-quality datasets including FineWeb (as well as FineWeb-Edu) (Penedo et al., 2024a) and DCLM (Li et al., 2024). For specific-domain experiments, we use OpenWebMath (Paster et al., 2024)... |
| Dataset Splits | Yes | Finally, we use LLAMA-3-70B-INSTRUCT to annotate 51K examples, splitting off 5K for validation. |
| Hardware Specification | Yes | Such 2-stage synthesis requires approximately 192 A100 GPU hours for processing 60B tokens of data. |
| Software Dependencies | No | The paper mentions using LitGPT (AI, 2023), TinyLlama (Zhang et al., 2024b), FlashAttention (Dao, 2024), LLaMA-Factory (Zheng et al., 2024), and vLLM (Kwon et al., 2023), but does not provide specific version numbers for any of these software dependencies. |
| Experiment Setup | Yes | We apply full-parameter supervised fine-tuning on our base models: we train on the whole seed dataset for 3 to 5 epochs, with a batch size of 64 and a cosine learning-rate scheduler (lr from 1e-5 to 1e-6)... Table 10: Training hyper-parameters of all base models. |
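The cosine learning-rate schedule quoted above (decaying from 1e-5 to 1e-6) can be sketched as a small helper. This is a minimal illustration of a standard cosine decay, not the authors' code; the function name and the assumption of no warmup phase are ours.

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 1e-5, lr_min: float = 1e-6) -> float:
    """Cosine decay from lr_max to lr_min over total_steps (no warmup assumed)."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# The schedule starts at lr_max and ends at lr_min:
print(cosine_lr(0, 1000))     # first step: lr_max
print(cosine_lr(1000, 1000))  # last step: lr_min
```

Framework schedulers (e.g. a cosine annealing schedule in PyTorch) implement the same curve; the closed form above just makes the reported 1e-5 to 1e-6 range explicit.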