Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining
Authors: Daouda Sow, Herbert Woisetschläger, Saikiran Bulusu, Shiqiang Wang, Hans-Arno Jacobsen, Yingbin Liang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our approach across a spectrum of tasks, from pretraining 7B and 1.4B parameter LLMs to smaller-scale language models and linear regression problems, demonstrating that our loss-based reweighting approach can lead to faster convergence and significantly improved performance. We conduct extensive experiments that corroborate our claims and support our theoretical findings. Notably, we demonstrate that the advantages of downweighting low-loss samples are observed across a spectrum of problem scales and complexities: (i) in complex, large-scale problems such as LLM pretraining with parameter counts ranging from 124M to 7B, our approach leads to notably improved performance and faster convergence; (ii) in simple, small-scale problems like linear regression, the results highlight the fundamental nature of our findings. |
| Researcher Affiliation | Collaboration | Daouda A. Sow The Ohio State University EMAIL Herbert Woisetschläger Technical University of Munich EMAIL Saikiran Bulusu The Ohio State University EMAIL Shiqiang Wang IBM Research EMAIL Hans-Arno Jacobsen University of Toronto EMAIL Yingbin Liang The Ohio State University EMAIL |
| Pseudocode | Yes | Algorithm 1 Fully Online Instance Reweighting |
| Open Source Code | Yes | Our codebase for the GPT2 experiments is publicly available: https://github.com/sowmaster/Sample-Level-Loss-Reweighting-ICLR-2025. |
| Open Datasets | Yes | We train on the SlimPajama (Soboleva et al., 2023) corpus that includes seven diverse domains: Common Crawl (CC), C4, GitHub, Stack Exchange, Book, Arxiv, and Wikipedia. In the first step, we pretrain the three GPT2 models on all seven domains where we compare our sample-level reweighting methods (Lin Upper, Quadratic, Extremes) against the uniform averaging baseline in which each sample contributes equally. [...] We use the FineWeb 15T dataset to train the larger Llama-1.4B and Llama-7B models. |
| Dataset Splits | No | The paper mentions evaluating [...] |
| Hardware Specification | No | The paper mentions training models and using [...] |
| Software Dependencies | No | The paper mentions using [...] |
| Experiment Setup | Yes | Table 7: Training Hyperparameters for our benchmark evaluations |
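The paper's core idea, per the quoted evidence above, is to reweight individual samples within a batch based on their current loss, downweighting low-loss (already-learned) samples. The sketch below is an illustrative, dependency-free rendering of that idea using a temperature-controlled softmax over per-sample losses; it is not the paper's Algorithm 1, and the function name, signature, and weighting scheme are assumptions for exposition only.

```python
import math

def reweighted_loss(per_sample_losses, temperature=1.0):
    """Illustrative loss-based sample reweighting (not the paper's exact method).

    Assigns each sample a softmax weight proportional to exp(loss / temperature),
    so high-loss samples are upweighted and low-loss samples are downweighted.
    Returns the weighted total loss and the per-sample weights (which sum to 1).
    """
    # Scale by temperature; smaller temperature sharpens the weighting.
    scaled = [loss / temperature for loss in per_sample_losses]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    weights = [e / z for e in exps]
    weighted = sum(w * loss for w, loss in zip(weights, per_sample_losses))
    return weighted, weights
```

With losses `[1.0, 2.0, 3.0]`, the weighted loss exceeds the uniform mean of 2.0 because the high-loss sample dominates; raising `temperature` pushes the weights back toward uniform.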