Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models
Authors: Yuan Li, Zhengzhong Liu, Eric Xing
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average. |
| Researcher Affiliation | Academia | 1Carnegie Mellon University 2Mohamed bin Zayed University of Artificial Intelligence. |
| Pseudocode | Yes | In summary, we integrate the aforementioned components and present the algorithm for determining the optimal data weights in Algorithm 1. |
| Open Source Code | No | The paper does not explicitly provide a link to source code, a statement that code is released, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | We consider a scenario with three distinct domains: general instruction following (IF) (sampled from Infinite-Instruct (BAAI, 2024)), math (sampled from Open Math Instruct-2 (Toshniwal et al., 2024)), and code (sampled from Open Coder (Huang et al., 2024))... We further examine our method by re-weighting two popular SFT collections: Tulu3 (Lambert et al., 2024) and Orca (Mukherjee et al., 2023). |
| Dataset Splits | No | The paper mentions using a 'held-out validation data' for evaluation and perturbing data sizes for parameter estimation, but it does not provide specific percentages or absolute counts for the general training/validation/test splits of the datasets used for the main experiments. For example, for Orca, it states 'We used 300M tokens of the original dataset' without specifying splits. |
| Hardware Specification | Yes | All experiments are conducted on NVIDIA H100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'LM-Evaluation Harness' but does not specify a version number. No other specific software dependencies with version numbers (e.g., Python, PyTorch) are provided. |
| Experiment Setup | Yes | We use a cosine learning rate scheduler, set the batch size to 256 for training all models, and use a sequence length of 4096 tokens. For perturbation experiments, we train the model for 3 epochs and select the model with the lowest validation loss. The maximum training steps are determined by the data budget as follows: 200 steps for a 5M budget, 400 steps for 20M, and 2,500 steps for 200M. For the Tulu3 and Orca experiments, we set the maximum training steps to 6,000. The learning rate for each model is shown in Table 6. |
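The reported setup can be summarized as a small configuration sketch. This is a hedged reconstruction, not the authors' code: the scheduler, batch size, sequence length, epoch count, and per-budget step limits are taken from the quoted text, while `make_train_config` and the default `learning_rate` value are hypothetical placeholders (the actual per-model rates are in the paper's Table 6).

```python
# Hedged sketch of the paper's SFT training setup.
# Budget-to-step mapping is quoted from the text; the learning-rate
# default is a placeholder (real values are in the paper's Table 6).

MAX_STEPS_BY_BUDGET = {
    "5M": 200,       # 5M-token data budget
    "20M": 400,      # 20M-token data budget
    "200M": 2500,    # 200M-token data budget
    "tulu3_orca": 6000,  # Tulu3 / Orca re-weighting experiments
}

def make_train_config(data_budget: str, learning_rate: float = 2e-5) -> dict:
    """Assemble one SFT run's config for a given data budget.

    `learning_rate` is a hypothetical default; the paper gives
    per-model rates in its Table 6.
    """
    return {
        "lr_scheduler": "cosine",   # cosine learning-rate schedule
        "batch_size": 256,          # batch size for all models
        "seq_len": 4096,            # sequence length in tokens
        "num_epochs": 3,            # perturbation runs: best val-loss checkpoint kept
        "max_steps": MAX_STEPS_BY_BUDGET[data_budget],
        "learning_rate": learning_rate,
    }

print(make_train_config("20M")["max_steps"])  # 400
```

This makes the budget-dependent step limits explicit and machine-checkable, which is the part of the setup most likely to be mis-copied in a reproduction.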