A Bayesian Model Selection Criterion for Selecting Pretraining Checkpoints

Authors: Michael Munn, Susan Wei

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Furthermore, we provide empirical evidence that the criterion reliably correlates with improved fine-tuning performance, offering a principled approach to predicting model adaptability. We experimentally confirm (Section 6), using varied datasets and architectures, that lower pretraining free energy not only enhances downstream adaptability (Figure 2 and Figure 3) but also exhibits a stronger correlation with adaptability compared to other pretraining metrics (Table 1). The goal of our experiments is to evaluate how well the pretraining WBIC, which estimates the pretraining free energy as described in Section 5.2, correlates with downstream performance.
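For context on the quoted claim, the WBIC (Watanabe, 2013) estimates the Bayesian free energy via a single posterior expectation of the training loss at a tempered inverse temperature. A sketch in standard notation (symbols are the usual ones from the WBIC literature, not defined in this excerpt):

```latex
\mathrm{WBIC} \;=\; \mathbb{E}^{\beta}_{w}\!\left[\, n L_n(w) \,\right],
\qquad \beta = \frac{1}{\log n},
```

where \(L_n(w)\) is the average negative log-likelihood over the \(n\) pretraining examples and \(\mathbb{E}^{\beta}_{w}\) denotes expectation over the posterior tempered at \(\beta\).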
Researcher Affiliation Collaboration ¹Google Research, New York, USA; ²Dept. of Econometrics and Business Statistics, Monash University, Melbourne, Australia. Correspondence to: Michael Munn <EMAIL>.
Pseudocode No The paper describes methods and procedures in narrative text and mathematical formulations but does not contain a distinct section or figure explicitly labeled "Pseudocode" or "Algorithm", nor structured step-by-step procedures formatted like code.
Open Source Code No The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository in the main text or appendices.
Open Datasets Yes We use the CIFAR-FS dataset (Bertinetto et al., 2019), derived from CIFAR-100, where the 100 classes are divided into 64 classes for meta-training, 16 classes for meta-validation, and 20 classes for meta-testing. We pretrain a VGG-16 (Simonyan, 2014) on the mini-Imagenet meta-training dataset (Dhillon et al., 2019) using SGD with cross-entropy loss.
Dataset Splits Yes We use the CIFAR-FS dataset (Bertinetto et al., 2019), derived from CIFAR-100, where the 100 classes are divided into 64 classes for meta-training, 16 classes for meta-validation, and 20 classes for meta-testing. When fine-tuning on the full CIFAR-FS meta-test dataset, we use all 20 meta-test classes and all 600 examples in each class. We then create an 80/20 train/test split. A single few-shot task is created by randomly sampling 5 classes and 5 examples per class from the meta-test dataset, creating a dataset with 25 total training examples.
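The sampling procedure quoted above is simple enough to sketch. The function and variable names below are illustrative (the paper released no code); the block only mirrors the stated protocol: 5 classes, 5 examples per class, 25 training examples per task.

```python
import random

def sample_few_shot_task(meta_test, n_way=5, k_shot=5, seed=0):
    """Sample one few-shot task: n_way classes, k_shot examples each.

    `meta_test` maps class label -> list of examples, mirroring the
    CIFAR-FS meta-test pool of 20 classes x 600 examples.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(meta_test), n_way)
    return {c: rng.sample(meta_test[c], k_shot) for c in classes}

# Toy stand-in for the meta-test pool (20 classes, 600 examples each).
meta_test = {c: [f"img_{c}_{i}" for i in range(600)] for c in range(20)}
task = sample_few_shot_task(meta_test)
```

The separate 80/20 train/test split described for full fine-tuning would be applied per class over the 600 examples (480 train / 120 test), independently of this few-shot sampling.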
Hardware Specification No The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. It only describes models and datasets used.
Software Dependencies No The paper mentions models like ResNet-18 and VGG-16 and optimizers like SGD, but it does not specify any software dependencies with version numbers, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup Yes We explore ranges of hyperparameter values for the learning rate, batch size, and momentum. Learning rate. For experiments that vary the learning rate in Figure 2 (top row), for each learning rate value in {0.01, 0.05, 0.1, 0.2} we run SGD without momentum with a fixed batch size of 512 for 50,000 iterations. The hyperparameter settings for pretraining WBIC computation are provided in Appendix B.1. That is, we use step size ϵ = 2×10⁻⁷, a chain length of 3,000 iterations, batch size of 2,048, γ = 1.0, and β = 1/log n where n is the size of the pretraining dataset. We use SGD with an L2 regularization rate of 0.01 and with a fixed learning rate of 0.0001 for the model backbone and a fixed learning rate of 0.01 for the model head. We fine-tune for 100 steps using a batch size of 128.
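The settings quoted above can be collected into a single configuration, which also makes the β = 1/log n dependence on pretraining-set size explicit. This is a hypothetical sketch (the paper released no code); all names are illustrative and only the numeric values come from the quoted text.

```python
import math

def wbic_sampling_config(n_pretrain):
    """WBIC-computation settings quoted from Appendix B.1."""
    return {
        "step_size": 2e-7,                    # ϵ
        "chain_length": 3_000,                # iterations per chain
        "batch_size": 2_048,
        "gamma": 1.0,                         # γ
        "beta": 1.0 / math.log(n_pretrain),   # β = 1/log n
    }

# Fine-tuning settings quoted above: SGD with L2 regularization and
# separate learning rates for backbone and head.
finetune_config = {
    "optimizer": "SGD",
    "l2_reg": 0.01,
    "lr_backbone": 1e-4,
    "lr_head": 1e-2,
    "steps": 100,
    "batch_size": 128,
}

cfg = wbic_sampling_config(50_000)  # e.g., a 50k-example pretraining set
```

Note that β shrinks slowly as n grows, so the tempered posterior stays close to (but flatter than) the ordinary posterior for realistic dataset sizes.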