Synthetic continued pretraining

Authors: Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candes, Tatsunori Hashimoto

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our main experiments (Section 5), we use EntiGraph to generate 455M synthetic tokens from 1.3M real tokens using GPT-4 (OpenAI et al., 2024). Then, we continually pretrain Llama 3 8B (Dubey et al., 2024) on the synthetic tokens and evaluate its QA accuracy on the QuALITY questions. We observe log-linear scaling in the accuracy as synthetic token count increases, up to 455M (Section 4.2).
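The log-linear scaling observation means QA accuracy grows roughly linearly in the logarithm of the synthetic token count, i.e. accuracy ≈ a + b·log(tokens). A minimal sketch of fitting such a trend by least squares, using made-up illustrative (token count, accuracy) pairs rather than the paper's actual measurements:

```python
import math

# Hypothetical (synthetic_token_count, QA_accuracy) pairs for illustration
# only -- NOT the paper's reported numbers.
points = [(1e6, 0.40), (1e7, 0.45), (1e8, 0.50), (4.55e8, 0.53)]

# Ordinary least-squares fit of: accuracy = a + b * log(tokens).
xs = [math.log(tokens) for tokens, _ in points]
ys = [acc for _, acc in points]
n = len(points)
xbar, ybar = sum(xs) / n, sum(ys) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar  # b > 0 indicates the log-linear improvement trend
```

A positive fitted slope `b` is what "log-linear scaling" asserts: each multiplicative increase in synthetic tokens buys a roughly constant additive accuracy gain.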
Researcher Affiliation | Academia | Zitong Yang (Department of Statistics, Stanford University); Neil Band (Department of Computer Science, Stanford University); Shuangping Li (Department of Statistics, Stanford University); Emmanuel Candès (Department of Statistics, Stanford University); Tatsunori Hashimoto (Department of Computer Science, Stanford University)
Pseudocode | No | The paper describes the EntiGraph method in Section 2.2 and illustrates it in Figure 1, but it does not present a formal pseudocode block or algorithm.
Open Source Code | Yes | Code: https://github.com/ZitongYang/Synthetic_Continued_Pretraining.git
Open Datasets | Yes | Our corpus and test queries are based on the QuALITY (Pang et al., 2022) long-document comprehension benchmark. We release the 455M-token EntiGraph corpus at https://huggingface.co/datasets/zitongyang/entigraph-quality-corpus
Dataset Splits | No | The paper mentions a "QuALITY QA validation split" used for hyperparameter tuning in Appendix F.3, and defines the test set Qtest as the "10-20 multiple choice questions accompanying each article in QuALITY". However, it does not provide explicit percentages, sample counts, or citations for standard training/validation/test splits of the primary corpora (the QuALITY source documents Dsource or the generated synthetic corpus), so the data partitioning used for model training cannot be reproduced from the paper alone.
Hardware Specification | Yes | All the continued pretraining experiments are performed on a single node with 8 H100 GPUs.
Software Dependencies | No | The paper mentions PyTorch FSDP and cites the corresponding paper, but does not give a specific PyTorch version. It mentions FAISS without a version number. It does identify "Cohere rerank-english-v3.0" by version and "OpenAI text-embedding-3-large" by model name, but version numbers for core software such as PyTorch and FAISS are missing.
Experiment Setup | Yes | In all experiments, we continue pretraining the Llama 3 8B Base model with a context length of 2048 and a batch size of 16. We apply a linear learning rate warmup for 5% of total steps, followed by a cosine decay with peak learning rate 5e-6. Raw continued pretraining details... The selected hyperparameter configuration uses 4 epochs and a 0.1 replay rate. Instruction tuning details... We apply a linear learning rate warmup followed by a cosine decay to 0 with peak learning rate 5e-6, and train the model for 1 epoch with a batch size of 512 and a context window of 2048.
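The learning rate schedule described above (linear warmup for 5% of total steps, then cosine decay toward 0 from a peak of 5e-6) can be sketched as a simple step-to-rate function. This is a minimal illustration of that schedule, not the authors' training code; the function name and signature are assumptions:

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_frac=0.05):
    """Learning rate at a given 0-indexed step: linear warmup to peak_lr
    over the first warmup_frac of steps, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear warmup: ramp from peak_lr/warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

For example, with 1000 total steps the warmup occupies the first 50 steps, the rate peaks at 5e-6 at step 49, and it decays smoothly to near 0 by step 999. In practice the same shape is available off the shelf, e.g. via Hugging Face's `get_cosine_schedule_with_warmup`.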