Learning Dynamics in Continual Pre-Training for Large Language Models
Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our scaling law holds across various CPT datasets and hyper-parameters. Our main experiments employ LLaMA-like models (Dubey et al., 2024) with 106M to 1.7B non-embedding parameters. |
| Researcher Affiliation | Collaboration | ¹School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; ²State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; ³Ritzz AI. EMAIL. Correspondence to: Howe Tissue (project lead) <EMAIL>. |
| Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations, but does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include any links to a code repository for the methodology described. |
| Open Datasets | Yes | We use FineWeb (Penedo et al., 2024) as D_pt and Knowledge-Pile (Fei et al., 2024) as D_cpt. Additionally mentioned: Pile of Law (Henderson* et al., 2022). |
| Dataset Splits | No | The paper mentions using "validation losses on corresponding domains" for the D_pt and D_cpt datasets, but does not provide specific details on how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or explicit predefined splits). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. It mentions 'LLaMA-like models' with parameter counts, but this refers to the model architecture, not the computational hardware. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, the LLaMA-3 tokenizer, and the SciPy library, but it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | Table 1. Experimental settings adopted in this work. Model Size: 106M, 594M, 1720M. Peak LR: 2e-4. PT Batch Size (Tokens): 4M. CPT Batch Size (Tokens): 4M. PT Sequence Length: 4096. CPT Sequence Length: 4096. β1, β2 in AdamW: 0.9, 0.95. Weight Decay: 0.1. Gradient Clip: 1.0. |
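To make the reported settings easy to reuse, the Table 1 values can be collected into a single configuration object. This is a minimal sketch, not the authors' code: the function name `make_config` and the dict layout are hypothetical, and "4M tokens" is taken literally as 4,000,000 since the paper does not state whether the batch size is decimal or binary.

```python
def make_config(model_size: str) -> dict:
    """Hyper-parameters from Table 1, shared by the PT and CPT phases.

    `model_size` is the non-embedding parameter count as reported
    in the paper (106M, 594M, or 1720M).
    """
    assert model_size in {"106M", "594M", "1720M"}
    return {
        "model_size": model_size,
        "peak_lr": 2e-4,                  # peak learning rate
        "batch_size_tokens": 4_000_000,   # "4M" tokens; exact base not specified
        "sequence_length": 4096,          # same for PT and CPT
        "adamw_betas": (0.9, 0.95),       # beta1, beta2 in AdamW
        "weight_decay": 0.1,
        "gradient_clip": 1.0,
    }

cfg = make_config("594M")
```

Because the paper uses identical batch sizes and sequence lengths for pre-training and continual pre-training, one config suffices for both phases.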