Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Authors: Jiasheng Ye, Peiju Liu, Tianxiang Sun, Jun Zhan, Yunhua Zhou, Xipeng Qiu

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results verify that the method effectively optimizes the training mixture of a 1B model trained for 100B tokens on RedPajama, reaching performance comparable to a model trained for 48% more steps on the default mixture. The experiments verify the reliability of the data mixing laws and prediction pipeline, showing their effectiveness in optimizing model performance, balancing model capabilities, and guiding the design of the data schedule.
Researcher Affiliation | Collaboration | Jiasheng Ye1, Peiju Liu1, Tianxiang Sun1, Jun Zhan1, Yunhua Zhou2, Xipeng Qiu1 (1Fudan University, 2Shanghai AI Laboratory)
Pseudocode | Yes | Algorithm 1: A pipeline to predict losses of different mixture proportions on large models trained on massive data through small-scale training. Algorithm 2: Sampling mixture proportions for fitting mixing laws.
Open Source Code | Yes | Code and data are available at: https://github.com/yegcjs/mixinglaws.
Open Datasets | Yes | "We train 70M and 160M language models on the mixture of the GitHub and Pile-CC subsets of the Pile dataset (Gao et al., 2020)... We train our models on the mixture of RedPajama and validate on the validation set of the Pile to mimic the scenario where validation data are collected separately from the training data."
Dataset Splits | No | The paper mentions using subsets of datasets (e.g., the "Pile-CC subset from the Pile dataset") and validating on specific validation sets (e.g., the "validation set of GitHub and Pile-CC", the "validation set of the Pile"). It also discusses proportions of data mixtures for training (e.g., "five different mixture proportions, which are {0.25, 0.375, 0.5, 0.625, 0.75} for Github"). However, it does not provide explicit percentages or counts for how the raw datasets themselves (such as the Pile or RedPajama) are split into training, validation, and test sets for reproduction, nor does it cite specific predefined splits with details.
Hardware Specification | Yes | "For the costs of our experiments, it takes around 3.5/8/16/21 hours to train a 70M/160M/305M/410M model for 30B tokens on 8 A100 GPUs on our infrastructure."
Software Dependencies | No | The paper mentions using the "Pythia suite (Biderman et al., 2023) as our model architectures" and LBFGS for fitting, but does not provide specific version numbers for these or any other software components used in the experiments.
Experiment Setup | Yes | "In all our experiments, we train the model with a batch size of 1M tokens and a maximum learning rate of 1e-4. We warm up the learning rate for 2000 steps and decay it to 0.1 of the maximum at the last training step with a cosine decay schedule. For continual pretraining, we initialize the models with the 20k-step checkpoint of the Pythia 70M model and do not apply a learning rate warmup."
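The schedule quoted in the Experiment Setup row (linear warmup for 2000 steps, then cosine decay to 0.1 of the maximum at the final step) can be sketched as follows. This is a minimal illustration, not the authors' code: the `learning_rate` helper is hypothetical, and the 30,000-step horizon is an assumption (30B tokens at 1M tokens per batch).

```python
import math

def learning_rate(step, total_steps, max_lr=1e-4, warmup_steps=2000, final_ratio=0.1):
    """Linear warmup to max_lr over warmup_steps, then cosine decay so the
    rate reaches final_ratio * max_lr exactly at the last training step."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    # Cosine decay: cosine goes 1 -> 0 as progress goes 0 -> 1.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    min_lr = final_ratio * max_lr
    return min_lr + (max_lr - min_lr) * cosine

# At the end of warmup the rate is max_lr; at the last step it is 0.1 * max_lr.
print(learning_rate(2000, 30_000))   # 1e-4
print(learning_rate(30_000, 30_000)) # 1e-5
```

Note that per the same row, the continual-pretraining runs (initialized from the 20k-step Pythia 70M checkpoint) skip the warmup phase entirely.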