RegMix: Data Mixture as Regression for Language Model Pre-training

Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically validate REGMIX, we train 512 models with 1M parameters for 1B tokens to fit the regression model and predict the best data mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000× larger and 25× longer), which we find performs best among 64 candidate 1B parameter models with other mixtures. Furthermore, REGMIX consistently outperforms human selection in experiments involving models up to 7B parameters trained on 100B tokens, while matching or exceeding DoReMi using just 10% of the computational resources.
Researcher Affiliation | Collaboration | Qian Liu (Sea AI Lab), Xiaosen Zheng (SMU), Niklas Muennighoff (Contextual AI; Stanford University), Guangtao Zeng (SUTD), Longxu Dou (Sea AI Lab), Tianyu Pang (Sea AI Lab), Jing Jiang (SMU), Min Lin (Sea AI Lab)
Pseudocode | Yes | Appendix H: PSEUDOCODE OF REGMIX; Algorithm 1: REGMIX: Data Mixture as Regression
Open Source Code | Yes | Our code is available at https://github.com/sail-sg/regmix.
Open Datasets | Yes | We conduct our experiments using the domains of the Pile dataset (Gao et al., 2021) depicted in Table 1. Due to copyright concerns, we utilize the 17 subsets available on Hugging Face that do not violate copyright issues. We consider both linear and LightGBM regression models, where the target y is set to be the validation loss of the Pile-CC domain.
Dataset Splits | No | The regression model is fitted using the training artifacts of 512 1M models with 1B tokens, and evaluated on 256 unseen data mixtures for 1M and 60M models (each trained with 1B tokens) and 64 unseen data mixtures for 1B models (each trained with 25B tokens).
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions LightGBM (Ke et al., 2017), lm-eval-harness (Gao et al., 2023; Biderman et al., 2024), and the GPT-NeoX tokenizer (Black et al., 2022), but does not provide specific version numbers for these software components or for other key libraries/frameworks such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | For models with 1M and 60M parameters, we set the training iterations as 1000 and the batch size as 1M tokens, which means the training budget is 1B tokens. Similarly, we train the larger model with 1B parameters for 25000 training iterations with the same batch size, thus consuming 25B tokens in total. We set the learning rate as 4e-4 and use the cosine learning rate scheduler. For linear regression, we employ 5-fold cross-validation with ridge regression to determine the optimal ℓ2 regularization weight from the set [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]. For LightGBM, we manually set the number of iterations to 1000 and the learning rate to 1e-2, leaving all other hyperparameters at their default values.
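The regression step the report quotes (fit a model on data mixture → validation loss pairs from small proxy runs, then rank unseen mixtures by predicted loss) can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' code: the mixtures, the linear "loss" function, and all variable names are made up, and only the linear-regression branch with the paper's ℓ2 grid is shown.

```python
# Minimal sketch of the REGMIX procedure described above: fit a regression
# on (data mixture -> validation loss) pairs from small proxy runs, then
# rank many unseen candidate mixtures by predicted loss.
# Synthetic stand-in data throughout; names are illustrative only.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_domains = 17        # the 17 Pile subsets used in the paper
n_proxy_runs = 512    # 1M-parameter proxy models, each trained on 1B tokens

# Stand-in for the proxy-run artifacts: mixtures drawn from a Dirichlet,
# with a made-up linear "validation loss" as the regression target.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_runs)
losses = mixtures @ rng.normal(size=n_domains) \
    + rng.normal(scale=0.01, size=n_proxy_runs)

# 5-fold cross-validated ridge regression over the l2 grid from the paper.
alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]
reg = RidgeCV(alphas=alphas, cv=5).fit(mixtures, losses)

# Score a large pool of unseen candidate mixtures and keep the one with
# the lowest predicted loss; this mixture would then drive the large-scale
# (1B-parameter, 25B-token) training run.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best_mixture = candidates[np.argmin(reg.predict(candidates))]
```

For the LightGBM branch quoted in the setup, the regressor would be swapped for a gradient-boosted model configured with 1000 iterations and a 1e-2 learning rate (e.g. `lightgbm.LGBMRegressor(n_estimators=1000, learning_rate=1e-2)`), leaving other hyperparameters at their defaults.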