RegMix: Data Mixture as Regression for Language Model Pre-training

Authors: Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To empirically validate REGMIX, we train 512 models with 1M parameters for 1B tokens to fit the regression model and predict the best data mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000× larger and 25× longer), which we find performs best among 64 candidate 1B parameter models with other mixtures. Furthermore, REGMIX consistently outperforms human selection in experiments involving models up to 7B parameters trained on 100B tokens, while matching or exceeding DoReMi using just 10% of the computational resources.
Researcher Affiliation | Collaboration | Qian Liu (Sea AI Lab), Xiaosen Zheng (SMU), Niklas Muennighoff (Contextual AI; Stanford University), Guangtao Zeng (SUTD), Longxu Dou (Sea AI Lab), Tianyu Pang (Sea AI Lab), Jing Jiang (SMU), Min Lin (Sea AI Lab)
Pseudocode | Yes | Appendix H: PSEUDOCODE OF REGMIX; Algorithm 1: REGMIX: Data Mixture as Regression
Open Source Code | Yes | Our code is available at https://github.com/sail-sg/regmix.
Open Datasets | Yes | We conduct our experiments using the domains of the Pile dataset (Gao et al., 2021) depicted in Table 1. Due to copyright concerns, we utilize the 17 subsets available on Hugging Face that do not violate copyright issues. We consider both linear and LightGBM regression models, where the target y is set to be the validation loss of the Pile-CC domain.
Dataset Splits | No | The regression model is fitted using the training artifacts of 512 1M models with 1B tokens, and evaluated on 256 unseen data mixtures for 1M and 60M models (each trained with 1B tokens) and 64 unseen data mixtures for 1B models (each trained with 25B tokens).
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions LightGBM (Ke et al., 2017), lm-eval-harness (Gao et al., 2023; Biderman et al., 2024), and the GPT-NeoX tokenizer (Black et al., 2022), but does not provide specific version numbers for these software components or for other key libraries/frameworks such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | For models with 1M and 60M parameters, we set the training iterations as 1000 and the batch size as 1M tokens, which means the training budget is 1B tokens. Similarly, we train the larger model with 1B parameters for 25000 training iterations with the same batch size, thus consuming 25B tokens in total. We set the learning rate as 4e-4 and use the cosine learning rate scheduler. For linear regression, we employ 5-fold cross-validation with ridge regression to determine the optimal ℓ2 regularization weight from the set [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]. For LightGBM, we manually set the number of iterations to 1000 and the learning rate to 1e-2, leaving all other hyperparameters at their default values.
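The regression step the report quotes (fit a model on data mixture → validation loss pairs from small proxy runs, then rank unseen mixtures by predicted loss) can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' code: the mixtures, the linear "loss" function, and all variable names are made up, and only the linear-regression branch with the paper's ℓ2 grid is shown.

```python
# Minimal sketch of the REGMIX procedure described above: fit a regression
# on (data mixture -> validation loss) pairs from small proxy runs, then
# rank many unseen candidate mixtures by predicted loss.
# Synthetic stand-in data throughout; names are illustrative only.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(0)
n_domains = 17        # the 17 Pile subsets used in the paper
n_proxy_runs = 512    # 1M-parameter proxy models, each trained on 1B tokens

# Stand-in for the proxy-run artifacts: mixtures drawn from a Dirichlet,
# with a made-up linear "validation loss" as the regression target.
mixtures = rng.dirichlet(np.ones(n_domains), size=n_proxy_runs)
losses = mixtures @ rng.normal(size=n_domains) \
    + rng.normal(scale=0.01, size=n_proxy_runs)

# 5-fold cross-validated ridge regression over the l2 grid from the paper.
alphas = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3]
reg = RidgeCV(alphas=alphas, cv=5).fit(mixtures, losses)

# Score a large pool of unseen candidate mixtures and keep the one with
# the lowest predicted loss; this mixture would then drive the large-scale
# (1B-parameter, 25B-token) training run.
candidates = rng.dirichlet(np.ones(n_domains), size=100_000)
best_mixture = candidates[np.argmin(reg.predict(candidates))]
```

For the LightGBM branch quoted in the setup, the regressor would be swapped for a gradient-boosted model configured with 1000 iterations and a 1e-2 learning rate (e.g. `lightgbm.LGBMRegressor(n_estimators=1000, learning_rate=1e-2)`), leaving other hyperparameters at their defaults.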