Investigating Continual Pretraining in Large Language Models: Insights and Implications
Authors: Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct an extensive benchmark by pretraining LLMs across a wide range of domains and evaluating their performance throughout the learning process. Unlike prior works that focus on a narrow set of domains, our study leverages the Massively Multi-Domain Dataset (M2D2) (Reid et al., 2022), which spans 236 hierarchically organized domains from Wikipedia and Semantic Scholar, enabling a detailed investigation of forgetting and knowledge transfer in a large-scale setting. |
| Researcher Affiliation | Collaboration | Çağatay Yıldız (University of Tübingen); Nishaanth Kanna Ravichandran (Cohere for AI Community); Nitin Sharma (University of Tübingen); Matthias Bethge (Tübingen AI Center, University of Tübingen); Beyza Ermis (Cohere for AI) |
| Pseudocode | No | The paper describes the training process and evaluation pipeline in narrative text. It does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not state that source code is released, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Our experiments are conducted on the M2D2 dataset (Reid et al., 2022), which is an extensive and finely categorized corpus specifically designed for exploring domain adaptation in language models. It comprises 8.5 billion tokens and covers 236 domains, sourced from Wikipedia and the Semantic Scholar (S2ORC) database (Lo et al., 2019). ... To show the cross-domain similarity, we first computed the task embedding by using Sentence-BERT (Reimers & Gurevych, 2019) with 10K samples from each domain and 50K samples from Open Web Text (Gokaslan & Cohen, 2019), an open-source reproduction of the GPT-2 training dataset (Radford et al., 2019). |
| Dataset Splits | Yes | Each domain in the M2D2 dataset is split into train, validation, and test sets with no data leakage, as outlined in Reid et al. (2022). Each validation and test set includes over 1 million tokens, allowing accurate evaluations within specific domains. |
| Hardware Specification | Yes | We trained the models with Adam optimizer (Kingma & Ba, 2015) with a batch size of 16 sequences on NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions several tools and frameworks, such as the Adam optimizer, DeepSpeed, and Sentence-BERT, but does not specify their version numbers. |
| Experiment Setup | Yes | We trained the models with the Adam optimizer (Kingma & Ba, 2015) with a batch size of 16 sequences on NVIDIA A100 GPUs. We used DeepSpeed (Rasley et al., 2020) with auto configuration, which assigns a dropout rate of 0.2 and automatic learning-rate selection. ... In our experiments, we used a fixed learning rate of 5e-5. |
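The cross-domain similarity described in the Open Datasets row (embed samples per domain, then compare domains) can be sketched as follows. This is a hedged illustration, not the paper's code: it mean-pools per-domain embeddings into centroids and computes pairwise cosine similarities, using random vectors as a stand-in for the actual Sentence-BERT sample embeddings.

```python
import numpy as np

def cross_domain_similarity(domain_embeddings):
    """Mean-pool each domain's sample embeddings into a centroid,
    then return the pairwise cosine-similarity matrix of centroids."""
    centroids = np.stack([e.mean(axis=0) for e in domain_embeddings])
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return unit @ unit.T  # shape: (n_domains, n_domains)

# Toy stand-in: random 384-d vectors in place of Sentence-BERT outputs
# (the paper uses 10K samples per M2D2 domain; 100 here for brevity).
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(100, 384)) for _ in range(3)]
sim = cross_domain_similarity(embeddings)
```

Each diagonal entry is 1 by construction (a domain is perfectly similar to itself); off-diagonal entries rank how close two domains' average representations are.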
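For reference, the Adam update rule (Kingma & Ba, 2015) used in the Experiment Setup row can be written out directly. This is a minimal NumPy sketch of a single Adam step with the paper's fixed learning rate of 5e-5 and the standard default hyperparameters; it is not the authors' training code, which relied on DeepSpeed.

```python
import numpy as np

def adam_step(param, grad, m, v, t,
              lr=5e-5, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: biased first/second moment estimates,
    bias correction, then the scaled parameter step."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# One step on a toy parameter vector: with unit gradients, the first
# update moves each parameter by approximately -lr.
p = np.zeros(4)
p, m, v = adam_step(p, np.ones(4), np.zeros(4), np.zeros(4), t=1)
```

On the first step the bias-corrected moments reduce to the raw gradient and its square, so the update magnitude is roughly the learning rate itself, independent of gradient scale.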