Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Authors: Fan Zhou, Zengzhi Wang, Qian Liu, Junlong Li, Pengfei Liu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that models trained on PROX-refined data consistently outperform other baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM).
Researcher Affiliation | Collaboration | ¹Shanghai Jiao Tong University, ²Generative AI Research Lab (GAIR), ³Sea AI Lab, ⁴Shanghai Artificial Intelligence Laboratory. Correspondence to: Pengfei Liu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Document Chunk Splitting Algorithm
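The paper's Algorithm 1 itself is not reproduced in this report. As a rough illustration of what a document chunk splitting routine typically looks like, the sketch below greedily packs lines into chunks under a token budget; the function name, the line-boundary rule, and the whitespace word count used as a token proxy are all assumptions, not the paper's actual algorithm.

```python
def split_document(doc: str, max_tokens: int = 512) -> list[str]:
    """Greedily pack lines into chunks under a token budget.

    Whitespace word count is a crude token proxy (an assumption;
    the paper's Algorithm 1 may use different budgets and boundaries).
    """
    chunks, current, used = [], [], 0
    for line in doc.splitlines():
        cost = len(line.split())
        if current and used + cost > max_tokens:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks

# 20 lines of 10 words each, packed 5 lines per 50-token chunk.
doc = "\n".join(f"line {i} " + "w " * 8 for i in range(20))
print(len(split_document(doc, max_tokens=50)))  # 4
```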
Open Source Code | No | The paper mentions using third-party open-source codebases like LitGPT, TinyLlama, llama-factory, and vLLM. However, it does not provide any explicit statement or link for the authors' own implementation code for the PROX methodology described in this paper.
Open Datasets | Yes | For the general domain, we begin with RedPajama-V2 (Together, 2023), a preprocessed large-scale dataset... We further apply PROX on the C4 corpus (Raffel et al., 2020)... and the recent high-quality datasets including FineWeb (as well as FineWeb-Edu) (Penedo et al., 2024a) and DCLM (Li et al., 2024). For specific domain experiments, we use OpenWebMath (Paster et al., 2024)...
Dataset Splits | Yes | Finally, we use LLAMA-3-70B-INSTRUCT to annotate 51K examples, splitting off 5K for validation.
Hardware Specification | Yes | Such 2-stage synthesis requires approximately 192 A100 GPU hours for processing 60B tokens of data.
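As a back-of-envelope check on the reported figures, 192 A100 GPU hours over 60B tokens implies a processing rate of roughly 312.5M tokens per GPU hour:

```python
# Throughput implied by the paper's reported cost:
# 192 A100 GPU hours to process 60B tokens.
gpu_hours = 192
tokens = 60e9

tokens_per_gpu_hour = tokens / gpu_hours
print(f"{tokens_per_gpu_hour:.3e} tokens per A100 GPU hour")  # 3.125e+08
```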
Software Dependencies | No | The paper mentions using LitGPT (AI, 2023), TinyLlama (Zhang et al., 2024b), FlashAttention (Dao, 2024), llama-factory (Zheng et al., 2024), and vLLM (Kwon et al., 2023) but does not provide specific version numbers for any of these software dependencies.
Experiment Setup | Yes | We apply full-parameter supervised fine-tuning on our base models: we train on the whole seed dataset for 3 to 5 epochs, with a batch size of 64 and a cosine learning rate scheduler (lr from 1e-5 to 1e-6)... Table 10: Training hyper-parameters of all base models.
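The cosine schedule decaying from 1e-5 to 1e-6 can be written out as below; the function name and step counts are illustrative assumptions, and the exact warmup or per-step details may differ from the paper's Table 10.

```python
import math

def cosine_lr(step, total_steps, lr_max=1e-5, lr_min=1e-6):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

total = 1_000
print(f"{cosine_lr(0, total):.1e}")      # 1.0e-05 at the start
print(f"{cosine_lr(total, total):.1e}")  # 1.0e-06 at the end
```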