ADMIRE-BayesOpt: Accelerated Data MIxture RE-weighting for Language Models with Bayesian Optimization

Authors: Xu Ouyang, Shengzhuang Chen, Michael Arthur Leopold Pearce, Thomas Hartvigsen, Jonathan Richard Schwarz

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present results for pre-training and instruction finetuning across models ranging from 1 million to 7 billion parameters, varying from simple architectures to state-of-the-art models and benchmarks spanning dozens of datasets. We demonstrate consistently strong results relative to a wide range of baselines, resulting in speed-ups of over 500% in determining the best data mixture on our largest experiments.
Researcher Affiliation | Collaboration | Shengzhuang Chen (Thomson Reuters Foundational Research & Imperial College London); Xu Ouyang (University of Virginia); Michael Arthur Leopold Pearce (Graphcore); Thomas Hartvigsen (University of Virginia & Thomson Reuters Foundational Research); Jonathan Richard Schwarz (Thomson Reuters Foundational Research & Imperial College London)
Pseudocode | No | The paper describes the steps of the ADMIRE-BayesOpt method in text and provides a flowchart in Figure 1. It also describes optimization steps in Appendix B. However, there are no clearly labeled pseudocode or algorithm blocks with structured, code-like formatting.
Open Source Code | No | The paper does not explicitly state that the source code for the methodology described is publicly available, nor does it provide a direct link to a code repository. The mention of 'ADMIRE-BayesOpt Code' in proximity to the abstract lacks an accompanying link or explicit release statement. The paper states, 'We implement our method using the popular open-source Bayesian optimization library BoTorch (Balandat et al., 2020)', which refers to a third-party library they used, not their own code.
Open Datasets | Yes | We therefore construct and release an open dataset ADMIRE IFT Runs containing full fine-tuning and evaluation runs for 460 state-of-the-art LLMs... The resultant ADMIRE IFT Runs dataset represents a significant contribution to the research community, providing public access to 460 trained checkpoints across 256 diverse data mixtures... Apart from the ADMIRE IFT Runs explained in section 5, we conduct experiments on open-sourced benchmark datasets from RegMix, which comprises 256 pre-training and evaluation results across three different model scales (1M, 60M and 1B parameters) on the Pile dataset (Gao et al., 2020)... using the Qwen 2.5 (Yang et al., 2024b) family of pre-trained models. To facilitate further research, we further contribute the ADMIRE-BayesOpt collection, which includes all training artifacts for over 460 IFT runs for 0.5b, 3b, and 7b Qwen 2.5 models, each being trained on 200k examples from the Tülu 3 dataset.
Dataset Splits | No | The paper mentions limiting each data mixture to 200,000 training samples and evaluating on a 'Tülu 3 development set' for in-distribution evaluation, and also on unseen (out-of-distribution) datasets. However, it does not explicitly provide the specific percentages, sample counts, or methodology for splitting the core datasets (like The Pile or Tülu 3) into training, validation, and test sets that would be required for reproduction. It defers to the 'established evaluation protocol from the original Tülu 3 work' for some details.
Hardware Specification | Yes | Overall, ADMIRE IFT Runs was constructed for a total of 13,119 GPU hours on nvidia-a100-80gb GPUs.
Software Dependencies | No | The paper mentions using 'BoTorch (Balandat et al., 2020)' and adhering to the 'open-instruct training pipeline' but does not specify version numbers for these or any other key software dependencies required to replicate the experiments.
Experiment Setup | Yes | To maintain practical relevance and avoid overfitting to specific SFT data mixtures, we limit each data mixture to 200,000 training samples... All post-training experiments strictly adhere to the open-instruct training pipeline and hyperparameters established in the original Tülu 3 project... Specifically, we use the SingleTaskMultiFidelityGP model and the qLogExpectedImprovement acquisition function. For experiments involving a single model size, the training data contains a single fidelity level. The acquisition function is optimized using optimize_acqf_discrete over the training set, from which the point with the highest acquisition value is selected. We set the number of restarts to 10 and the number of raw samples to 1024... For the continuous variable π constrained to the probability simplex, we employ projected gradient ascent... Gradient ascent step: compute an unconstrained update π^(k+1) = π^(k) + η · ∂α_t(π^(k), m)/∂π, where η is the learning rate.
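The projected gradient ascent quoted above (unconstrained gradient step, then projection back onto the probability simplex) can be sketched as follows. This is a minimal illustration, not the authors' code: the quadratic objective and its gradient are stand-ins for the GP acquisition function α_t, and `project_to_simplex` assumes the standard sorting-based Euclidean projection.

```python
import numpy as np

def project_to_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}
    via the standard sorting-based algorithm."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u)
    j = np.arange(1, len(v) + 1)
    rho = np.max(j[u + (1.0 - css) / j > 0])  # largest index kept positive
    theta = (1.0 - css[rho - 1]) / rho
    return np.maximum(v + theta, 0.0)

def projected_gradient_ascent(grad, pi0, eta=0.1, steps=200):
    """Maximize an objective over the simplex: take the unconstrained
    step pi^(k+1) = pi^(k) + eta * grad(pi^(k)), then re-project."""
    pi = project_to_simplex(np.asarray(pi0, dtype=float))
    for _ in range(steps):
        pi = pi + eta * grad(pi)      # unconstrained gradient ascent step
        pi = project_to_simplex(pi)   # re-impose the simplex constraint
    return pi

# Illustrative concave objective alpha(pi) = -||pi - target||^2, whose
# maximizer over the simplex is `target` itself. This objective is an
# assumption for demonstration; the paper optimizes an acquisition function.
target = np.array([0.5, 0.3, 0.2])
grad = lambda pi: -2.0 * (pi - target)
pi_star = projected_gradient_ascent(grad, np.ones(3) / 3)
```

The returned mixture weights remain nonnegative and sum to one at every iterate, which is the point of the projection step.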