Synthetic continued pretraining

Authors: Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, Tatsunori Hashimoto

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our main experiments (Section 5), we use EntiGraph to generate 455M synthetic tokens from 1.3M real tokens using GPT-4 (OpenAI et al., 2024). Then, we continually pretrain Llama 3 8B (Dubey et al., 2024) on the synthetic tokens and evaluate its QA accuracy on the QuALITY questions. We observe log-linear scaling in the accuracy as synthetic token count increases, up to 455M (Section 4.2).
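The log-linear scaling observation means QA accuracy grows roughly linearly in the logarithm of the synthetic token count, i.e. accuracy ≈ a + b·log(tokens). A minimal sketch of fitting such a trend by least squares, using made-up illustrative (token count, accuracy) pairs rather than the paper's actual measurements:

```python
import math

# Hypothetical (synthetic_token_count, QA_accuracy) pairs for illustration
# only -- NOT the paper's reported numbers.
points = [(1e6, 0.40), (1e7, 0.45), (1e8, 0.50), (4.55e8, 0.53)]

# Ordinary least-squares fit of: accuracy = a + b * log(tokens).
xs = [math.log(tokens) for tokens, _ in points]
ys = [acc for _, acc in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar  # b > 0 indicates the log-linear improvement trend
```

A positive fitted slope `b` is what "log-linear scaling" asserts: each multiplicative increase in synthetic tokens buys a roughly constant additive accuracy gain.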
Researcher Affiliation | Academia | Zitong Yang (Department of Statistics, Stanford University); Neil Band (Department of Computer Science, Stanford University); Shuangping Li (Department of Statistics, Stanford University); Emmanuel Candès (Department of Statistics, Stanford University); Tatsunori Hashimoto (Department of Computer Science, Stanford University)
Pseudocode | No | The paper describes the EntiGraph method in Section 2.2 and illustrates it in Figure 1, but it does not present a formal pseudocode block or algorithm.
Open Source Code | Yes | Code: https://github.com/ZitongYang/Synthetic_Continued_Pretraining.git
Open Datasets | Yes | Our corpus and test queries are based on the QuALITY (Pang et al., 2022) long-document comprehension benchmark. We release the 455M-token EntiGraph corpus at https://huggingface.co/datasets/zitongyang/entigraph-quality-corpus
Dataset Splits | No | The paper mentions a "QuALITY QA validation split" used for hyperparameter tuning in Appendix F.3, and defines the test set Qtest as the "10-20 multiple choice questions accompanying each article in QuALITY". However, it does not provide explicit percentages, sample counts, or citations for standard training/validation/test splits of the primary corpora (the QuALITY source documents Dsource or the generated synthetic corpus), so the data partitioning used for model training cannot be reproduced from the paper alone.
Hardware Specification | Yes | All the continued pretraining experiments are performed on a single node with 8 H100 GPUs.
Software Dependencies | No | The paper mentions PyTorch FSDP and cites the corresponding paper, but does not give a specific PyTorch version. It mentions FAISS without a version number. It does identify "Cohere rerank-english-v3.0" by version and "OpenAI text-embedding-3-large" by model name, but version numbers for core software such as PyTorch and FAISS are missing.
Experiment Setup | Yes | In all experiments, we continue pretraining the Llama 3 8B Base model with a context length of 2048 and a batch size of 16. We apply a linear learning rate warmup for 5% of total steps, followed by a cosine decay with peak learning rate 5e-6. Raw continued pretraining details... The selected hyperparameter configuration uses 4 epochs and a 0.1 replay rate. Instruction tuning details... We apply a linear learning rate warmup followed by a cosine decay to 0 with peak learning rate 5e-6, and train the model for 1 epoch with a batch size of 512 and a context window of 2048.
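The learning rate schedule described above (linear warmup for 5% of total steps, then cosine decay toward 0 from a peak of 5e-6) can be sketched as a simple step-to-rate function. This is a minimal illustration of that schedule, not the authors' training code; the function name and signature are assumptions:

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_frac=0.05):
    """Learning rate at a given 0-indexed step: linear warmup to peak_lr
    over the first warmup_frac of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup: ramp from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 total steps the warmup occupies the first 50 steps, the rate peaks at 5e-6 at step 49, and it decays smoothly to near 0 by step 999. In practice the same shape is available off the shelf, e.g. via Hugging Face's `get_cosine_schedule_with_warmup`.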