ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Authors: Xu Ouyang, Shengzhuang Chen, Michael Arthur Leopold Pearce, Thomas Hartvigsen, Jonathan Richard Schwarz

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of baselines, resulting in speed-ups of over 500% in determining the best data mixture on our largest experiments.
Researcher Affiliation | Collaboration | Shengzhuang Chen (Thomson Reuters Foundational Research & Imperial College London); Xu Ouyang (University of Virginia); Michael Arthur Leopold Pearce (Graphcore); Thomas Hartvigsen (University of Virginia & Thomson Reuters Foundational Research); Jonathan Richard Schwarz (Thomson Reuters Foundational Research & Imperial College London)
Pseudocode | No | The paper describes the steps of the ADMIRE-BayesOpt method in text and provides a flowchart in Figure 1. It also describes optimization steps in Appendix B. However, there are no clearly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code | No | The paper does not explicitly state that the source code for the methodology described is publicly available, nor does it provide a direct link to a code repository. The mention of 'ADMIRE-BayesOpt Code' in proximity to the abstract lacks an accompanying link or explicit release statement. The paper states, 'We implement our method using the popular open-source Bayesian optimization library BoTorch (Balandat et al., 2020)', which refers to a third-party library they used, not their own code.
Open Datasets | Yes | We therefore construct and release an open dataset ADMIRE IFT Runs containing full fine-tuning and evaluation runs for 460 state-of-the-art LLMs... The resultant ADMIRE IFT Runs dataset represents a significant contribution to the research community, providing public access to 460 trained checkpoints across 256 diverse data mixtures... Apart from the ADMIRE IFT Runs explained in section 5, we conduct experiments on open-sourced benchmark datasets from RegMix, which comprises 256 pre-training and evaluation results across three different model scales (1M, 60M and 1B parameters) on the Pile dataset (Gao et al., 2020)... using the Qwen 2.5 (Yang et al., 2024b) family of pre-trained models. To facilitate further research, we further contribute the ADMIRE-BayesOpt collection, which includes all training artifacts for over 460 IFT runs for 0.5b, 3b, and 7b Qwen 2.5 models, each being trained on 200k examples from the Tülu 3 dataset.
Dataset Splits | No | The paper mentions limiting each data mixture to 200,000 training samples and evaluating on a 'Tülu 3 development set' for in-distribution evaluation, and also on unseen (out-of-distribution) datasets. However, it does not explicitly provide the specific percentages, sample counts, or methodology for splitting the core datasets (like The Pile or Tülu 3) into training, validation, and test sets that would be required for reproduction. It defers to the 'established evaluation protocol from the original Tülu 3 work' for some details.
Hardware Specification | Yes | Overall, ADMIRE IFT Runs was constructed for a total of 13,119 GPU hours on nvidia-a100-80gb GPUs.
Software Dependencies | No | The paper mentions using 'BoTorch (Balandat et al., 2020)' and adhering to the 'open-instruct training pipeline' but does not specify version numbers for these or any other key software dependencies required to replicate the experiments.
Experiment Setup | Yes | To maintain practical relevance and avoid overfitting to specific SFT data mixtures, we limit each data mixture to 200,000 training samples... All post-training experiments strictly adhere to the open-instruct training pipeline and hyperparameters established in the original Tülu 3 project... Specifically, we use the SingleTaskMultiFidelityGP model and the qLogExpectedImprovement acquisition function. For experiments involving a single model size, the training data contains a single fidelity level. The acquisition function is optimized using optimize_acqf_discrete over the training set, from which the point with the highest acquisition value is selected. We set the number of restarts to 10 and the number of raw samples to 1024... For the continuous variable π constrained to the probability simplex, we employ projected gradient ascent... Gradient ascent step: compute an unconstrained update π^(k+1) = π^(k) + η · ∂α_t(π^(k), m)/∂π, where η is the learning rate.
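The projected gradient ascent quoted above (unconstrained gradient step, then projection back onto the probability simplex) can be sketched as follows. This is a minimal illustration, not the authors' code: the quadratic objective and its gradient are stand-ins for the GP acquisition function α_t, and `project_to_simplex` assumes the standard sorting-based Euclidean projection.

```python
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    via the standard sorting-based algorithm."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.max(j[u + (1.0 - css) / j > 0])  # largest index kept positive
    theta = (1.0 - css[rho - 1]) / rho
    return np.maximum(v + theta, 0.0)

def projected_gradient_ascent(grad, pi0, eta=0.1, steps=200):
    """Maximize an objective over the simplex: take the unconstrained
    step pi^(k+1) = pi^(k) + eta * grad(pi^(k)), then re-project."""
    pi = project_to_simplex(np.asarray(pi0, dtype=float))
    for _ in range(steps):
        pi = pi + eta * grad(pi)      # unconstrained gradient ascent step
        pi = project_to_simplex(pi)   # re-impose the simplex constraint
    return pi

# Illustrative concave objective alpha(pi) = -||pi - target||^2, whose
# maximizer over the simplex is `target` itself. This objective is an
# assumption for demonstration; the paper optimizes an acquisition function.
target = np.array([0.5, 0.3, 0.2])
grad = lambda pi: -2.0 * (pi - target)
pi_star = projected_gradient_ascent(grad, np.ones(3) / 3)
```

The returned mixture weights remain nonnegative and sum to one at every iterate, which is the point of the projection step.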