Investigating Continual Pretraining in Large Language Models: Insights and Implications

Authors: Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive benchmark by pretraining LLMs across a wide range of domains and evaluating their performance throughout the learning process. Unlike prior works that focus on a narrow set of domains, our study leverages the Massively Multi-Domain Dataset (M2D2) (Reid et al., 2022), which spans 236 hierarchically organized domains from Wikipedia and Semantic Scholar, enabling a detailed investigation of forgetting and knowledge transfer in a large-scale setting.
Researcher Affiliation | Collaboration | Çağatay Yıldız (EMAIL), University of Tübingen; Nishaanth Kanna Ravichandran (EMAIL), Cohere for AI Community; Nitin Sharma (EMAIL), University of Tübingen; Matthias Bethge (EMAIL), Tübingen AI Center, University of Tübingen; Beyza Ermis (EMAIL), Cohere for AI
Pseudocode | No | The paper describes the training process and evaluation pipeline in narrative text. It does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not state that source code is released, nor does it link to a code repository.
Open Datasets | Yes | Our experiments are conducted on the M2D2 dataset (Reid et al., 2022), which is an extensive and finely categorized corpus specifically designed for exploring domain adaptation in language models. It comprises 8.5 billion tokens and covers 236 domains, sourced from Wikipedia and the Semantic Scholar (S2ORC) database (Lo et al., 2019). ... To show the cross-domain similarity, we first computed the task embedding by using Sentence-BERT (Reimers & Gurevych, 2019) with 10K samples from each domain and 50K samples from OpenWebText (Gokaslan & Cohen, 2019), an open-source reproduction of the GPT-2 training dataset (Radford et al., 2019).
Dataset Splits | Yes | Each domain in the M2D2 dataset is split into train, validation, and test sets with no data leakage, as outlined in Reid et al. (2022). Each validation and test set includes over 1 million tokens, allowing accurate evaluations within specific domains.
Hardware Specification | Yes | We trained the models with Adam optimizer (Kingma & Ba, 2015) with a batch size of 16 sequences on NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions several tools and frameworks, such as the Adam optimizer, DeepSpeed, and Sentence-BERT, but does not specify their version numbers.
Experiment Setup | Yes | We trained the models with Adam optimizer (Kingma & Ba, 2015) with a batch size of 16 sequences on NVIDIA A100 GPUs. We used DeepSpeed (Rasley et al., 2020) with auto configuration, which assigns a dropout rate of 0.2 and automatic learning-rate selection. ... In our experiments, we used a fixed learning rate of 5e-5.
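The benchmark protocol summarized above (sequential pretraining over M2D2 domains, evaluating on every domain's held-out set after each stage to track forgetting and transfer) can be sketched as follows. Everything here is illustrative: `train_on`, `perplexity`, the domain names, and the numeric values are toy stand-ins, not the paper's implementation.

```python
def train_on(model, domain):
    """Toy stand-in: a 'model' is just the ordered list of domains it has seen."""
    return model + [domain]

def perplexity(model, domain):
    """Toy stand-in: lowest on the most recently trained domain, partially
    degraded on earlier domains (mimicking forgetting), worst on unseen ones."""
    if domain not in model:
        return 100.0
    return 10.0 if model[-1] == domain else 50.0

def continual_pretrain(domains):
    model, history = [], []
    for domain in domains:
        model = train_on(model, domain)
        # After each stage, evaluate on every domain to measure backward
        # transfer (forgetting) and forward transfer, as in the benchmark.
        history.append({d: perplexity(model, d) for d in domains})
    return history

stages = continual_pretrain(["wiki/Culture", "wiki/STEM", "s2orc/CS"])
```

After the final stage, the toy perplexities show the characteristic pattern the paper studies: the current domain is fit best, while earlier domains have degraded relative to right after their own training stage.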
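The cross-domain similarity analysis quoted under "Open Datasets" reduces to cosine similarity between per-domain embeddings. A minimal sketch, assuming made-up 3-dimensional vectors in place of the real Sentence-BERT embeddings (the domain names and values are illustrative only):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Made-up mean embeddings per domain; in the paper these would come from
# Sentence-BERT over 10K samples per domain.
domain_embeddings = {
    "wiki/Culture": [0.9, 0.1, 0.2],
    "wiki/STEM": [0.2, 0.8, 0.3],
    "s2orc/CS": [0.1, 0.9, 0.4],
}

# Pairwise similarity matrix across all domains.
similarity = {
    (a, b): cosine(ua, ub)
    for a, ua in domain_embeddings.items()
    for b, ub in domain_embeddings.items()
}
```

In this toy data, `s2orc/CS` comes out closer to `wiki/STEM` than to `wiki/Culture`, which mirrors the kind of cross-domain structure the similarity analysis is meant to expose.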