Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining

Authors: Daouda Sow, Herbert Woisetschläger, Saikiran Bulusu, Shiqiang Wang, Hans-Arno Jacobsen, Yingbin Liang

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our approach across a spectrum of tasks, from pretraining 7B and 1.4B parameter LLMs to smaller-scale language models and linear regression problems, demonstrating that our loss-based reweighting approach can lead to faster convergence and significantly improved performance. We conduct extensive experiments that corroborate our claims and support our theoretical findings. Notably, we demonstrate that the advantages of downweighting low-loss samples are observed across a spectrum of problem scales and complexities: (i) in complex, large-scale problems such as LLM pretraining with parameter counts ranging from 124M to 7B, our approach leads to notably improved performance and faster convergence; (ii) in simple, small-scale problems like linear regression, the results highlight the fundamental nature of our findings. |
| Researcher Affiliation | Collaboration | Daouda A. Sow (The Ohio State University), Herbert Woisetschläger (Technical University of Munich), Saikiran Bulusu (The Ohio State University), Shiqiang Wang (IBM Research), Hans-Arno Jacobsen (University of Toronto), Yingbin Liang (The Ohio State University) |
| Pseudocode | Yes | Algorithm 1: Fully Online Instance Reweighting |
| Open Source Code | Yes | Our codebase for the GPT2 experiments is publicly available: https://github.com/sowmaster/Sample-Level-Loss-Reweighting-ICLR-2025 |
| Open Datasets | Yes | We train on the SlimPajama (Soboleva et al., 2023) corpus, which includes seven diverse domains: Common Crawl (CC), C4, GitHub, Stack Exchange, Book, ArXiv, and Wikipedia. In the first step, we pretrain the three GPT2 models on all seven domains, where we compare our sample-level reweighting methods (LinUpper, Quadratic, Extremes) against the uniform averaging baseline in which each sample contributes equally. [...] We use the FineWeb 15T dataset to train the larger Llama-1.4B and Llama-7B models. |
| Dataset Splits | No | The paper mentions evaluating |
| Hardware Specification | No | The paper mentions training models and using |
| Software Dependencies | No | The paper mentions using |
| Experiment Setup | Yes | Table 7: Training Hyperparameters for our benchmark evaluations |
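To make the "Fully Online Instance Reweighting" idea concrete, here is a minimal Python sketch of loss-based sample reweighting within a batch. The scheme names `lin_upper` and `quadratic` echo the methods compared in the paper, but the specific weighting formulas below are illustrative assumptions, not the authors' exact update rules from Algorithm 1.

```python
def reweight(losses, scheme="lin_upper"):
    """Turn per-sample losses into batch weights that downweight low-loss samples.

    NOTE: these formulas are illustrative guesses keyed to relative loss,
    not the paper's exact LinUpper/Quadratic rules.
    """
    lo = min(losses)
    if scheme == "lin_upper":
        raw = [l - lo for l in losses]          # weight grows linearly with loss
    elif scheme == "quadratic":
        raw = [(l - lo) ** 2 for l in losses]   # emphasize high-loss samples more
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    if total == 0:                              # all losses equal -> uniform weights
        return [1.0 / len(losses)] * len(losses)
    return [r / total for r in raw]             # weights form a distribution


def weighted_batch_loss(losses, scheme="lin_upper"):
    """Weighted mean over the batch, replacing the uniform-averaging baseline."""
    w = reweight(losses, scheme)
    return sum(wi * li for wi, li in zip(w, losses))
```

For a batch with per-sample losses `[1.0, 2.0, 3.0]`, the uniform baseline averages to 2.0, while the reweighted loss shifts mass toward the high-loss samples, which is the effect the paper attributes to faster convergence.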