Maximizing Intermediate Checkpoint Value in LLM Pretraining with Bayesian Optimization
Authors: Deyuan Liu, Zecheng Wang, Bingning Wang, Weipeng Chen, Chunshan Li, Zhiying Tu, Dianhui Chu, Dianbo Sui
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through various experiments, we demonstrate that: (1) Our proposed methodology exhibits the capacity to augment pretraining... (2) Our proposed methodology, despite requiring a given held-out dataset, still demonstrates robust generalization capabilities across diverse domains, a pivotal aspect in pretraining. We conducted comprehensive pilot experiments to address two fundamental research questions... We evaluated the merged checkpoints using C-Eval (Huang et al., 2023), a rigorous benchmark... Table 1 presents the comprehensive results of our merging experiments... Our analysis reveals that merges involving adjacent checkpoints in the pretraining trajectory consistently led to substantial performance improvements... Section 5. Ablation Study |
| Researcher Affiliation | Collaboration | 1Harbin Institute of Technology, 2Tencent WeChat, 3Independent Researcher. Correspondence to: Dianbo Sui <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Checkpoint Merging via Bayesian Optimization. 1: Input: initial checkpoints Θ_{t−1}, Θ_t, validation dataset D, search bounds [α, 1], number of iterations N. 2: Evaluate initial merging weights λ_t^{(i)} (e.g., λ_t^{(1)} = α, λ_t^{(2)} = 1) and collect observations O = {(λ_t^{(i)}, f(λ_t^{(i)}))}_{i=1}^{k_0}. 3: for k = k_0 + 1 to N do 4: Fit a Gaussian Process (GP) to the current observations O. 5: Select the next merging weight λ_t^{(k)} = argmax_{λ_t ∈ [α,1]} A(λ_t) using the acquisition function A. 6: Merge checkpoints: Θ̃_t^{(k)} = λ_t^{(k)} Θ_t + (1 − λ_t^{(k)}) Θ_{t−1}. 7: Evaluate the performance f(λ_t^{(k)}) of Θ̃_t^{(k)} on the validation dataset D. 8: Update the observations: O = O ∪ {(λ_t^{(k)}, f(λ_t^{(k)}))}. 9: end for 10: Output: optimal merging weight λ_t* = argmax_{λ_t} {f(λ_t) \| (λ_t, f(λ_t)) ∈ O} |
| Open Source Code | No | The paper does not provide any explicit statements about code release, nor does it include links to a code repository. There is no mention of code in supplementary materials. |
| Open Datasets | Yes | For models, we follow Baichuan2 (Yang et al., 2023a) 7B, DeepSeek (DeepSeek-AI et al., 2024) 7B, and Pythia (Biderman et al., 2023) models ranging from 70M to 6.9B parameters. For benchmarks, we evaluate on C-Eval (Huang et al., 2023), CMMLU (Li et al., 2023a), MMLU (Hendrycks et al., 2020), GSM8K (Cobbe et al., 2021), PIQA (Bisk et al., 2020), WinoGrande (Sakaguchi et al., 2021), SciQ (Welbl et al., 2017), and ARC-Easy (Clark et al., 2018). |
| Dataset Splits | No | The paper mentions few-shot evaluation settings (e.g., C-Eval(5-shot), GSM8K(4-shot)) and fractions of a validation set (e.g., '1/4, 1/2, 3/4, full' of C-Eval validation data in Section 5.1). However, it does not provide explicit training/test/validation splits for the main datasets used to train the LLMs themselves, nor for the overall experimental data beyond these specific contexts. |
| Hardware Specification | No | The paper mentions "training the LLaMA2 70B model with 2T tokens necessitates 1,720,320 GPU hours (Touvron et al., 2023)" as a background reference to resource demands, but it does not specify the hardware (e.g., specific GPU models or CPU types) used for its own experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the implementation of the proposed methodology. |
| Experiment Setup | Yes | To systematically investigate these research questions, we selected eleven representative checkpoints from the Baichuan2 model (Yang et al., 2023a), spanning a comprehensive range from 200B to 2640B tokens during pretraining. We evaluated the merged checkpoints using C-Eval (Huang et al., 2023), a rigorous benchmark encompassing 52 subjects across four difficulty levels... All merging experiments employed the greedy soup strategy (Wortsman et al., 2022), where checkpoints are combined sequentially, with each checkpoint added only if it demonstrates measurable improvement in accuracy on a held-out development set. Formally, the pairwise merging process is defined as: Θ̃_t = λ_t Θ_t + (1 − λ_t) Θ_{t−1} (2), where λ_t ∈ [α, 1] and α ∈ (0, 1) serves as a lower bound to constrain the search space. We consider three acquisition functions: Expected Improvement (EI), Probability of Improvement (PI), and Upper Confidence Bound (UCB). These are formally defined as: ... To dynamically select the most promising acquisition function, we employ the GP-Hedge strategy... The overall procedure is summarized in Algorithm 1. |
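The pairwise merge in Eq. (2) and Algorithm 1 is a per-parameter linear interpolation of two checkpoints, Θ̃_t = λ_t Θ_t + (1 − λ_t) Θ_{t−1}. A minimal sketch of that step, using toy two-tensor "checkpoints" (the function name, parameter names, and values are illustrative, not from the paper):

```python
import numpy as np

def merge_checkpoints(theta_t, theta_prev, lam):
    """Pairwise merge: theta_tilde = lam * theta_t + (1 - lam) * theta_prev.

    Each checkpoint is a flat dict mapping parameter names to arrays;
    the interpolation is applied element-wise to every parameter tensor.
    """
    assert theta_t.keys() == theta_prev.keys()
    return {name: lam * theta_t[name] + (1.0 - lam) * theta_prev[name]
            for name in theta_t}

# Toy two-parameter "checkpoints" standing in for real model weights.
theta_prev = {"w": np.array([0.0, 0.0]), "b": np.array([1.0])}
theta_t    = {"w": np.array([1.0, 2.0]), "b": np.array([3.0])}

merged = merge_checkpoints(theta_t, theta_prev, lam=0.75)
# merged["w"] -> [0.75, 1.5], merged["b"] -> [2.5]
```

In Algorithm 1 this merge is re-run at each BO iteration with the candidate weight λ_t^{(k)} proposed by the acquisition function, and the merged model is then scored on the validation set D.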
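The setup row names three acquisition functions (EI, PI, UCB) whose formal definitions are elided in the quote. Their standard closed forms under a GP posterior with mean μ, standard deviation σ at a candidate weight, and incumbent best value f* can be sketched as below; the exploration hyperparameters ξ and κ are conventional defaults, not values reported in the paper:

```python
import math

def _phi(z):
    """Standard normal pdf."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI = (mu - f* - xi) * Phi(z) + sigma * phi(z), z = (mu - f* - xi) / sigma."""
    if sigma == 0.0:
        return 0.0
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * _Phi(z) + sigma * _phi(z)

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    """PI = Phi((mu - f* - xi) / sigma)."""
    if sigma == 0.0:
        return float(mu > f_best + xi)
    return _Phi((mu - f_best - xi) / sigma)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB = mu + kappa * sigma (maximization convention)."""
    return mu + kappa * sigma
```

In Algorithm 1, whichever acquisition is active is maximized over λ_t ∈ [α, 1] (in 1-D this can be done on a dense grid) to pick the next merging weight to evaluate.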
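The GP-Hedge strategy mentioned in the setup row keeps a cumulative gain for each acquisition function and, at every iteration, samples which one to trust with probability proportional to exp(η · gain), an exponential-weights (Hedge) rule. A minimal sketch of that selection step, assuming a hypothetical `gp_hedge_select` helper and an illustrative η (neither is specified in the paper):

```python
import math
import random

def gp_hedge_select(gains, eta=1.0, rng=random):
    """Sample an acquisition-function index with probability
    proportional to exp(eta * gains[i]) — the Hedge rule used by GP-Hedge.
    """
    weights = [math.exp(eta * g) for g in gains]
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r <= acc:
            return i
    return len(weights) - 1

# gains[i] accumulates the GP posterior mean observed at the points each
# acquisition (e.g., EI, PI, UCB) proposed in earlier iterations.
choice = gp_hedge_select([0.0, 0.0, 0.0])  # uniform when gains are equal
```

After the chosen acquisition's candidate is evaluated, every acquisition's gain is updated with the posterior mean at its own proposed point, so better-performing acquisitions are sampled more often as the run proceeds.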