Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection

Authors: Louis Béthune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, Pierre Ablin

ICML 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. ... The main takeaway of our work is that for each target domain considered in this paper, the pretraining loss after finetuning can be predicted accurately from (i) the model scale, (ii) the amount of target data available and (iii) the fraction of pretraining data injected in the finetuning data mixture. ... Section 3. Experiments |
| Researcher Affiliation | Industry | Apple. Correspondence to: Louis Béthune <l EMAIL>, Pierre Ablin <p EMAIL>. |
| Pseudocode | No | The paper describes methods and equations for scaling laws and training procedures but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that source code for the methodology is released, nor does it provide a direct link to a code repository for this work. It mentions a codebase that "Awni Hannun kickstarted" in the acknowledgments, but not as the released code for the paper. |
| Open Datasets | Yes | Datasets. We use RedPajama V2 (Weber et al., 2024) as the pretraining set. We use several domains from The Pile (Gao et al., 2020) as finetuning sets, covering all 5 categories (academic, internet, prose, dialogue and misc). ... We rely on the OpenHermes dataset (Teknium, 2023), with special [INST] and [/INST] tokens to delimit the instruction part from the prediction part. |
| Dataset Splits | No | The paper mentions using a 'validation version of the target dataset' and observes 'U-curves on the validation loss' during finetuning. However, it does not specify exact percentages, sample counts, or an explicit methodology for how the training, validation, or test splits were created for any of the datasets used. |
| Hardware Specification | Yes | The biggest model fits on a single A100 80GB GPU without sharding, with parameter replication across GPUs to handle large batch sizes. 1 GPU is used for Tiny and Small, 4 for Medium, 8 for Large and XL. |
| Software Dependencies | No | The paper mentions using GPT-2 style transformers, the SentencePiece tokenizer (Kudo, 2018), Adam, and AdamW, and notes issues with the weight decay implementations in PyTorch and Optax. However, it does not provide version numbers for any of these software components or libraries, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Pretraining. We pretrain each model on RedPajama V2 using standard hyperparameters and a total count of 100 tokens per parameter. The learning rate follows a linear warmup for 0.5% of the total iterations and then follows a cosine scheduling until the end of the training, with a terminal value that is one-hundredth of the maximum value. We use AdamW with a weight decay of 0.1, which yielded better results than without weight decay (Figure 17). A gradient clipping of 5 was found sufficient to stabilize training across all model scales. Finetuning. We then finetune the model on several domains from the Pile, using a varying amount of target data and different fractions of injected pretraining data. We perform finetuning for 12K steps, which is sufficient to observe a U-curve on the validation loss in every configuration tested. The learning rate is equal to 1/30 times the peak pretraining LR, which was reached at about 90% of the pretraining stage. Empirically, we observe in the ablations of Figure 5 that this rule of thumb is sufficient to ensure both overfitting well within 12K steps and stable training at all model scales and all mixtures p. |
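
The quantities quoted in the Experiment Setup row can be written out as a short Python sketch. This is an illustration under stated assumptions, not the authors' code, which the summary notes is not released. The function names, the warmup starting point, the rounding of the warmup length, and the per-batch split in `split_batch` are all hypothetical choices. Only the constants (0.5% warmup, cosine decay to one-hundredth of the peak, 100 tokens per parameter, finetuning LR at 1/30 of the peak, injected fraction p) come from the quoted text.

```python
import math

WARMUP_FRACTION = 0.005     # linear warmup over 0.5% of total iterations
FINAL_LR_RATIO = 0.01       # terminal LR is one-hundredth of the peak LR
TOKENS_PER_PARAM = 100      # pretraining budget: 100 tokens per parameter
FINETUNE_LR_RATIO = 1 / 30  # finetuning LR relative to peak pretraining LR


def pretraining_lr(step, total_steps, peak_lr):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    # Assumption: the warmup length is rounded to a whole number of steps.
    warmup_steps = max(1, round(WARMUP_FRACTION * total_steps))
    if step < warmup_steps:
        # Assumption: warmup rises linearly and hits peak_lr on its last step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    floor = FINAL_LR_RATIO * peak_lr
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * progress))


def finetuning_lr(peak_pretraining_lr):
    """Constant finetuning LR: 1/30 of the peak pretraining LR."""
    return peak_pretraining_lr * FINETUNE_LR_RATIO


def token_budget(n_params):
    """Number of pretraining tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


def split_batch(batch_size, p):
    """Hypothetical helper: examples per batch drawn from each source.

    Assumption: a fraction p of every finetuning batch comes from the
    pretraining set; the summary does not say how the mixing is implemented.
    """
    n_pretrain = round(p * batch_size)
    return n_pretrain, batch_size - n_pretrain
```

For example, `pretraining_lr(step, total_steps=10_000, peak_lr=3e-3)` can be evaluated over `range(10_000)` to plot the schedule, and `finetuning_lr(3e-3)` gives the constant rate that would be used for the 12K finetuning steps under these assumptions.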