Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models

Authors: Yuan Li, Zhengzhong Liu, Eric Xing

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average."

Researcher Affiliation | Academia | ¹Carnegie Mellon University, ²Mohamed bin Zayed University of Artificial Intelligence.

Pseudocode | Yes | "In summary, we integrate the aforementioned components and present the algorithm for determining the optimal data weights in Algorithm 1."

Open Source Code | No | The paper does not explicitly provide a link to source code, a statement that code is released, or any mention of code in supplementary materials for the methodology described.

Open Datasets | Yes | "We consider a scenario with three distinct domains: general instruction following (IF) (sampled from Infinite-Instruct (BAAI, 2024)), math (sampled from Open Math Instruct-2 (Toshniwal et al., 2024)), and code (sampled from Open Coder (Huang et al., 2024))... We further examine our method by re-weighting two popular SFT collections: Tulu3 (Lambert et al., 2024) and Orca (Mukherjee et al., 2023)."

Dataset Splits | No | The paper mentions using "held-out validation data" for evaluation and perturbing data sizes for parameter estimation, but it does not provide specific percentages or absolute counts for the training/validation/test splits used in the main experiments. For example, for Orca it states "We used 300M tokens of the original dataset" without specifying splits.

Hardware Specification | Yes | "All experiments are conducted on NVIDIA H100 GPUs."

Software Dependencies | No | The paper mentions using the LM-Evaluation Harness but does not specify a version number. No other software dependencies with version numbers (e.g., Python, PyTorch) are provided.

Experiment Setup | Yes | "We use a cosine learning rate scheduler, set the batch size to 256 for training all models, and use a sequence length of 4096 tokens. For perturbation experiments, we train the model for 3 epochs and select the model with the lowest validation loss. The maximum training steps are determined by the data budget as follows: 200 steps for a 5M budget, 400 steps for 20M, and 2,500 steps for 200M. For the Tulu3 and Orca experiments, we set the maximum training steps to 6,000. The learning rate for each model is shown in Table 6."
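The reported setup above can be collected into a small configuration helper. This is a hedged sketch only: the field names and the `sft_config` function are illustrative and not the authors' actual config schema; the values are taken directly from the quoted text.

```python
def sft_config(data_budget_tokens):
    """Return the hyperparameters the paper reports for a given data budget.

    Hypothetical helper; values quoted from the paper's setup section.
    """
    # Reported max-step schedule per data budget (5M / 20M / 200M tokens).
    max_steps_by_budget = {
        5_000_000: 200,
        20_000_000: 400,
        200_000_000: 2_500,
    }
    return {
        "lr_scheduler": "cosine",   # cosine learning-rate schedule
        "batch_size": 256,          # same batch size for all models
        "sequence_length": 4096,    # tokens per sequence
        "epochs": 3,                # perturbation experiments train 3 epochs
        "max_steps": max_steps_by_budget[data_budget_tokens],
    }

print(sft_config(20_000_000)["max_steps"])  # 400 steps for a 20M-token budget
```

The Tulu3 and Orca runs instead use a fixed 6,000 maximum steps, and the per-model learning rates are listed in the paper's Table 6, so neither is encoded here.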