Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models

Authors: Yuan Li, Zhengzhong Liu, Eric Xing

ICML 2025

Reproducibility Variable | Result | LLM Response

Research Type | Experimental | "By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66% higher than the best domain loss from grid search on average."

Researcher Affiliation | Academia | ¹Carnegie Mellon University, ²Mohamed bin Zayed University of Artificial Intelligence.

Pseudocode | Yes | "In summary, we integrate the aforementioned components and present the algorithm for determining the optimal data weights in Algorithm 1."

Open Source Code | No | The paper does not explicitly provide a link to source code, a statement that code is released, or any mention of code in supplementary materials for the methodology described.

Open Datasets | Yes | "We consider a scenario with three distinct domains: general instruction following (IF) (sampled from Infinite-Instruct (BAAI, 2024)), math (sampled from Open Math Instruct-2 (Toshniwal et al., 2024)), and code (sampled from Open Coder (Huang et al., 2024))... We further examine our method by re-weighting two popular SFT collections: Tulu3 (Lambert et al., 2024) and Orca (Mukherjee et al., 2023)."

Dataset Splits | No | The paper mentions using "held-out validation data" for evaluation and perturbing data sizes for parameter estimation, but it does not provide specific percentages or absolute counts for the training/validation/test splits used in the main experiments. For example, for Orca it states "We used 300M tokens of the original dataset" without specifying splits.

Hardware Specification | Yes | "All experiments are conducted on NVIDIA H100 GPUs."

Software Dependencies | No | The paper mentions using the LM-Evaluation Harness but does not specify a version number. No other software dependencies with version numbers (e.g., Python, PyTorch) are provided.

Experiment Setup | Yes | "We use a cosine learning rate scheduler, set the batch size to 256 for training all models, and use a sequence length of 4096 tokens. For perturbation experiments, we train the model for 3 epochs and select the model with the lowest validation loss. The maximum training steps are determined by the data budget as follows: 200 steps for a 5M budget, 400 steps for 20M, and 2,500 steps for 200M. For the Tulu3 and Orca experiments, we set the maximum training steps to 6,000. The learning rate for each model is shown in Table 6."
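The reported setup above can be collected into a small configuration helper. This is a hedged sketch only: the field names and the `sft_config` function are illustrative and not the authors' actual config schema; the values are taken directly from the quoted text.

```python
def sft_config(data_budget_tokens):
    """Return the hyperparameters the paper reports for a given data budget.

    Hypothetical helper; values quoted from the paper's setup section.
    """
    # Reported max-step schedule per data budget (5M / 20M / 200M tokens).
    max_steps_by_budget = {
        5_000_000: 200,
        20_000_000: 400,
        200_000_000: 2_500,
    }
    return {
        "lr_scheduler": "cosine",   # cosine learning-rate schedule
        "batch_size": 256,          # same batch size for all models
        "sequence_length": 4096,    # tokens per sequence
        "epochs": 3,                # perturbation experiments train 3 epochs
        "max_steps": max_steps_by_budget[data_budget_tokens],
    }

print(sft_config(20_000_000)["max_steps"])  # 400 steps for a 20M-token budget
```

The Tulu3 and Orca runs instead use a fixed 6,000 maximum steps, and the per-model learning rates are listed in the paper's Table 6, so neither is encoded here.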