Investigating Continual Pretraining in Large Language Models: Insights and Implications

Authors: Çağatay Yıldız, Nishaanth Kanna Ravichandran, Nitin Sharma, Matthias Bethge, Beyza Ermis

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct an extensive benchmark by pretraining LLMs across a wide range of domains and evaluating their performance throughout the learning process. Unlike prior works that focus on a narrow set of domains, our study leverages the Massively Multi-Domain Dataset (M2D2) (Reid et al., 2022), which spans 236 hierarchically organized domains from Wikipedia and Semantic Scholar, enabling a detailed investigation of forgetting and knowledge transfer in a large-scale setting.
Researcher Affiliation | Collaboration | Çağatay Yıldız (EMAIL), University of Tübingen; Nishaanth Kanna Ravichandran (EMAIL), Cohere for AI Community; Nitin Sharma (EMAIL), University of Tübingen; Matthias Bethge (EMAIL), Tübingen AI Center, University of Tübingen; Beyza Ermis (EMAIL), Cohere for AI
Pseudocode | No | The paper describes the training process and evaluation pipeline in narrative text. It does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not state that source code is released, nor does it link to a code repository.
Open Datasets | Yes | Our experiments are conducted on the M2D2 dataset (Reid et al., 2022), which is an extensive and finely categorized corpus specifically designed for exploring domain adaptation in language models. It comprises 8.5 billion tokens and covers 236 domains, sourced from Wikipedia and the Semantic Scholar (S2ORC) database (Lo et al., 2019). ... To show the cross-domain similarity, we first computed the task embedding by using Sentence-BERT (Reimers & Gurevych, 2019) with 10K samples from each domain and 50K samples from OpenWebText (Gokaslan & Cohen, 2019), an open-source reproduction of the GPT-2 training dataset (Radford et al., 2019).
Dataset Splits | Yes | Each domain in the M2D2 dataset is split into train, validation, and test sets with no data leakage, as outlined in Reid et al. (2022). Each validation and test set includes over 1 million tokens, allowing accurate evaluations within specific domains.
Hardware Specification | Yes | We trained the models with Adam optimizer (Kingma & Ba, 2015) with a batch size of 16 sequences on NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions several tools and frameworks, such as the Adam optimizer, DeepSpeed, and Sentence-BERT, but does not specify their version numbers.
Experiment Setup | Yes | We trained the models with Adam optimizer (Kingma & Ba, 2015) with a batch size of 16 sequences on NVIDIA A100 GPUs. We used DeepSpeed (Rasley et al., 2020) with auto configuration, which assigns a dropout rate of 0.2 and automatic learning-rate selection. ... In our experiments, we used a fixed learning rate of 5e-5.
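The benchmark protocol summarized above (sequential pretraining over M2D2 domains, evaluating on every domain's held-out set after each stage to track forgetting and transfer) can be sketched as follows. Everything here is illustrative: `train_on`, `perplexity`, the domain names, and the numeric values are toy stand-ins, not the paper's implementation.

```python
def train_on(model, domain):
    """Toy stand-in: a 'model' is just the ordered list of domains it has seen."""
    return model + [domain]

def perplexity(model, domain):
    """Toy stand-in: lowest on the most recently trained domain, partially
    degraded on earlier domains (mimicking forgetting), worst on unseen ones."""
    if domain not in model:
        return 100.0
    return 10.0 if model[-1] == domain else 50.0

def continual_pretrain(domains):
    model, history = [], []
    for domain in domains:
        model = train_on(model, domain)
        # After each stage, evaluate on every domain to measure backward
        # transfer (forgetting) and forward transfer, as in the benchmark.
        history.append({d: perplexity(model, d) for d in domains})
    return history

stages = continual_pretrain(["wiki/Culture", "wiki/STEM", "s2orc/CS"])
```

After the final stage, the toy perplexities show the characteristic pattern the paper studies: the current domain is fit best, while earlier domains have degraded relative to right after their own training stage.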
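The cross-domain similarity analysis quoted under "Open Datasets" reduces to cosine similarity between per-domain embeddings. A minimal sketch, assuming made-up 3-dimensional vectors in place of the real Sentence-BERT embeddings (the domain names and values are illustrative only):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Made-up mean embeddings per domain; in the paper these would come from
# Sentence-BERT over 10K samples per domain.
domain_embeddings = {
    "wiki/Culture": [0.9, 0.1, 0.2],
    "wiki/STEM": [0.2, 0.8, 0.3],
    "s2orc/CS": [0.1, 0.9, 0.4],
}

# Pairwise similarity matrix across all domains.
similarity = {
    (a, b): cosine(ua, ub)
    for a, ua in domain_embeddings.items()
    for b, ub in domain_embeddings.items()
}
```

In this toy data, `s2orc/CS` comes out closer to `wiki/STEM` than to `wiki/Culture`, which mirrors the kind of cross-domain structure the similarity analysis is meant to expose.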