Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining

Authors: Daouda Sow, Herbert Woisetschläger, Saikiran Bulusu, Shiqiang Wang, Hans-Arno Jacobsen, Yingbin Liang

ICLR 2025

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically validate our approach across a spectrum of tasks, from pretraining 7B and 1.4B parameter LLMs to smaller-scale language models and linear regression problems, demonstrating that our loss-based reweighting approach can lead to faster convergence and significantly improved performance. We conduct extensive experiments that corroborate our claims and support our theoretical findings. Notably, we demonstrate that the advantages of downweighting low-loss samples are observed across a spectrum of problem scales and complexities: (i) in complex, large-scale problems such as LLM pretraining with parameter counts ranging from 124M to 7B, our approach leads to notably improved performance and faster convergence; (ii) in simple, small-scale problems like linear regression, the results highlight the fundamental nature of our findings. |
| Researcher Affiliation | Collaboration | Daouda A. Sow (The Ohio State University), Herbert Woisetschläger (Technical University of Munich), Saikiran Bulusu (The Ohio State University), Shiqiang Wang (IBM Research), Hans-Arno Jacobsen (University of Toronto), Yingbin Liang (The Ohio State University) |
| Pseudocode | Yes | Algorithm 1: Fully Online Instance Reweighting |
| Open Source Code | Yes | Our codebase for the GPT2 experiments is publicly available: https://github.com/sowmaster/Sample-Level-Loss-Reweighting-ICLR-2025 |
| Open Datasets | Yes | We train on the SlimPajama (Soboleva et al., 2023) corpus, which includes seven diverse domains: Common Crawl (CC), C4, GitHub, Stack Exchange, Book, ArXiv, and Wikipedia. In the first step, we pretrain the three GPT2 models on all seven domains, where we compare our sample-level reweighting methods (LinUpper, Quadratic, Extremes) against the uniform averaging baseline in which each sample contributes equally. [...] We use the FineWeb 15T dataset to train the larger Llama-1.4B and Llama-7B models. |
| Dataset Splits | No | The paper mentions evaluating |
| Hardware Specification | No | The paper mentions training models and using |
| Software Dependencies | No | The paper mentions using |
| Experiment Setup | Yes | Table 7: Training Hyperparameters for our benchmark evaluations |
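To make the "Fully Online Instance Reweighting" idea concrete, here is a minimal Python sketch of loss-based sample reweighting within a batch. The scheme names `lin_upper` and `quadratic` echo the methods compared in the paper, but the specific weighting formulas below are illustrative assumptions, not the authors' exact update rules from Algorithm 1.

```python
def reweight(losses, scheme="lin_upper"):
    """Turn per-sample losses into batch weights that downweight low-loss samples.

    NOTE: these formulas are illustrative guesses keyed to relative loss,
    not the paper's exact LinUpper/Quadratic rules.
    """
    lo = min(losses)
    if scheme == "lin_upper":
        raw = [l - lo for l in losses]          # weight grows linearly with loss
    elif scheme == "quadratic":
        raw = [(l - lo) ** 2 for l in losses]   # emphasize high-loss samples more
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw)
    if total == 0:                              # all losses equal -> uniform weights
        return [1.0 / len(losses)] * len(losses)
    return [r / total for r in raw]             # weights form a distribution


def weighted_batch_loss(losses, scheme="lin_upper"):
    """Weighted mean over the batch, replacing the uniform-averaging baseline."""
    w = reweight(losses, scheme)
    return sum(wi * li for wi, li in zip(w, losses))
```

For a batch with per-sample losses `[1.0, 2.0, 3.0]`, the uniform baseline averages to 2.0, while the reweighted loss shifts mass toward the high-loss samples, which is the effect the paper attributes to faster convergence.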