Overtrained Language Models Are Harder to Fine-Tune
Authors: Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through both theory and experiments, we uncover a phenomenon we term catastrophic overtraining, where longer pre-training harms final model performance after instruction tuning or other forms of post-training (Figure 1). As shown in Section 2, extensive empirical evaluations demonstrate the prevalence of this phenomenon in existing models. For instance, we show that the OLMo-1B model (Groeneveld et al., 2024a), pre-trained on 3T tokens and post-trained on Anthropic-HH (Bai et al., 2022), performs 3% worse on Alpaca Eval (Li et al., 2023b) and 2% worse on ARC (Clark et al., 2018) compared to an intermediate checkpoint trained on just 2.3T tokens (Figure 2). To understand why catastrophic overtraining occurs, we turn to carefully controlled experiments (Section 3). |
| Researcher Affiliation | Academia | 1Carnegie Mellon University 2Stanford University 3Harvard University 4Princeton University. Correspondence to: Jacob Mitchell Springer <EMAIL>. |
| Pseudocode | No | The paper includes a theoretical analysis section (Section 4) with mathematical derivations and theorems, but it does not contain any structured pseudocode blocks or algorithms. |
| Open Source Code | No | The paper mentions using the OLMo codebase for pre-training controlled experiments (Appendix D.1: "For our controlled experiments, we pre-train models using the OLMo codebase (Groeneveld et al., 2024b)."). However, it does not provide an explicit statement about releasing the code for the methodology presented in this paper, nor does it provide a direct link to a code repository for their specific contributions. |
| Open Datasets | Yes | For our pre-trained models, we use checkpoints from three base models: OLMo-1B (Groeneveld et al., 2024b), OLMo-2-7B (OLMo et al., 2024), and LLM360-Amber (Liu et al., 2023b). For instruction tuning, we use the following datasets: Anthropic-HH (Bai et al., 2022) and TULU (Wang et al., 2023). We use the LLaVA visual instruction tuning framework to train multimodal models. The LLaVA framework involves two stages: first, fine-tuning an adapter between a vision model and a pre-trained language model, and then fine-tuning the entire model to follow instructions in the presence of images. Alpaca Eval (Li et al., 2023b). Generalist evaluations: these tasks cover reasoning (ARC Challenge and ARC Easy (Clark et al., 2018)), commonsense (PIQA (Bisk et al., 2020), Winogrande (Sakaguchi et al., 2021)), natural language inference (BoolQ (Clark et al., 2019), COPA, SCIQ), and sentence completion (HellaSwag). For our controlled experiments, we fine-tune the pre-trained models on a series of downstream tasks of two types: classification and language modeling. These ten datasets are, for classification: SUBJ (Pang & Lee, 2004), BoolQ (Clark et al., 2019), MR (Conneau & Kiela, 2018), CR (Conneau & Kiela, 2018), RTE (Dagan et al., 2005), TREC (Voorhees & Tice, 2000), English Tweet sentiment (Maggie et al., 2020), SIQA (Sap et al., 2019); and for language modeling: GSM8k (Cobbe et al., 2021), Starcoder-Python (Li et al., 2023a). |
| Dataset Splits | Yes | For tuning, we use a heldout validation set from each dataset, but report scores on a separate heldout test set. In order to compute the perplexity for classification tasks, we compute a score for each class by measuring the length-normalized likelihood of the class, and then report the perplexity over the classes. For generative tasks, we use the standard language modeling loss. |
| Hardware Specification | Yes | We train with 8x A100 GPUs. |
| Software Dependencies | No | The paper mentions using the OLMo codebase (Groeneveld et al., 2024b), muP parameterization (Yang et al., 2022), and the AdamW optimizer. However, it does not specify version numbers for any of these software components or libraries. |
| Experiment Setup | Yes | We fine-tune with two different common post-training paradigms: instruction tuning and multimodal tuning. For instruction tuning, we use the following datasets: Anthropic-HH (Bai et al., 2022) and TULU (Wang et al., 2023). We use the LLaVA visual instruction tuning framework to train multimodal models. When fine-tuning for instruction tuning, we use the standard SFT training algorithm with the hyperparameters shown in Table 2. In this table, we also present the hyperparameters we use with the LLaVA framework, using the defaults for all non-specified hyperparameters. (Table 2 includes batch size, learning rates, learning rate schedule, warmup steps, optimizer, and weight decay for the various datasets.) For our controlled experiments, we pre-train models using the OLMo codebase (Groeneveld et al., 2024b). We use muP parameterization for all of our experiments (Yang et al., 2022). We train three different model classes: OLMo-15M, OLMo-30M, and OLMo-90M, with 15M, 30M, and 90M non-embedding parameters, respectively. We use the pre-training hyperparameters shown in Table 3. (Table 3 includes layers, heads, number of unique tokens, hidden dimensions, inner MLP dimensions, max context length, activation type, attention dropout, residual dropout, embedding dropout, optimizer, learning rate, beta1, beta2, learning rate scheduler, warmup steps, weight decay, and batch size.) For each model, we train for a token budget in {4B, 8B, 16B, 32B, 64B, 128B} using the pre-tokenized C4 web data. We use the fine-tuning hyperparameters shown in Table 4. (Table 4 includes learning rate, batch size, learning rate scheduler, optimizer, weight decay, warmup steps, and epochs.) |
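The Dataset Splits row describes computing a score for each class via the length-normalized likelihood of the class string, then reporting perplexity over the classes. One plausible reading of that recipe can be sketched as below; the function name, the softmax over normalized scores, and reporting the gold class's reciprocal probability are assumptions of this sketch, not the authors' released code:

```python
import math

def class_perplexity(class_token_logprobs, gold_class):
    """Perplexity over classes from length-normalized class likelihoods.

    class_token_logprobs: one list per class, holding the model's
    token log-probabilities for that class's verbalized label.
    gold_class: index of the correct class.
    """
    # Length-normalize: mean log-probability per token of each class string.
    scores = [sum(lps) / len(lps) for lps in class_token_logprobs]
    # Softmax the normalized scores into a distribution over classes.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    probs = [e / sum(exps) for e in exps]
    # Perplexity of the gold class under that distribution (1 / p_gold).
    return 1.0 / probs[gold_class]
```

Length normalization matters because class labels of different token lengths would otherwise receive systematically different raw likelihoods.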
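Tables 2-4 parameterize training through a learning-rate scheduler plus warmup steps. As an illustration of how those two hyperparameters combine, here is a generic linear-warmup-then-cosine-decay schedule; the specific schedule shapes and values used in the paper live in its tables, so everything here (function name, cosine decay, example values) is illustrative rather than the authors' exact recipe:

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero.

    A common pattern for the (warmup steps, scheduler) pair that
    pre-training and fine-tuning tables typically specify.
    """
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine-decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```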