Learning Dynamics in Continual Pre-Training for Large Language Models
Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our scaling law holds across various CPT datasets and hyper-parameters. Our main experiments employ LLaMA-like models (Dubey et al., 2024) with 106M to 1.7B non-embedding parameters. |
| Researcher Affiliation | Collaboration | ¹School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China; ²State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; ³Ritzz AI. EMAIL. Correspondence to: Howe Tissue (project lead) <EMAIL>. |
| Pseudocode | No | The paper describes its methodology using mathematical equations and textual explanations, but does not include any clearly labeled pseudocode blocks or algorithms. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing its own source code, nor does it include any links to a code repository for the methodology described. |
| Open Datasets | Yes | We use FineWeb (Penedo et al., 2024) as D_pt and Knowledge-Pile (Fei et al., 2024) as D_cpt. Additionally mentioned: Pile of Law (Henderson* et al., 2022). |
| Dataset Splits | No | The paper mentions using "validation losses on corresponding domains" for the D_pt and D_cpt datasets, but does not provide specific details on how these datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or explicit predefined splits). |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. It mentions 'LLaMA-like models' with parameter counts, but this refers to the model architecture, not the computational hardware. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer, the LLaMA-3 tokenizer, and the SciPy library, but it does not specify version numbers for any of these software components. |
| Experiment Setup | Yes | Table 1. Experimental settings adopted in this work. Model Size: 106M, 594M, 1720M. Peak LR: 2e-4. PT Batch Size (Tokens): 4M. CPT Batch Size (Tokens): 4M. PT Sequence Length: 4096. CPT Sequence Length: 4096. β1, β2 in AdamW: 0.9, 0.95. Weight Decay: 0.1. Gradient Clip: 1.0. |
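To make the reported settings easy to reuse, the Table 1 values can be collected into a single configuration object. This is a minimal sketch, not the authors' code: the function name `make_config` and the dict layout are hypothetical, and "4M tokens" is taken literally as 4,000,000 since the paper does not state whether the batch size is decimal or binary.

```python
def make_config(model_size: str) -> dict:
    """Hyper-parameters from Table 1, shared by the PT and CPT phases.

    `model_size` is the non-embedding parameter count as reported
    in the paper (106M, 594M, or 1720M).
    """
    assert model_size in {"106M", "594M", "1720M"}
    return {
        "model_size": model_size,
        "peak_lr": 2e-4,                  # peak learning rate
        "batch_size_tokens": 4_000_000,   # "4M" tokens; exact base not specified
        "sequence_length": 4096,          # same for PT and CPT
        "adamw_betas": (0.9, 0.95),       # beta1, beta2 in AdamW
        "weight_decay": 0.1,
        "gradient_clip": 1.0,
    }

cfg = make_config("594M")
```

Because the paper uses identical batch sizes and sequence lengths for pre-training and continual pre-training, one config suffices for both phases.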