Bayesian Neural Scaling Law Extrapolation with Prior-Data Fitted Networks

Authors: Dongwoo Lee, Dong Bok Lee, Steven Adriaensen, Juho Lee, Sung Ju Hwang, Frank Hutter, Seon Joo Kim, Hae Beom Lee

ICML 2025

Reproducibility Variable: Result — LLM Response
Research Type: Experimental — "We validate the effectiveness of our approach on real-world neural scaling laws, comparing it against both the existing point estimation methods and Bayesian approaches. Our method demonstrates superior performance, particularly in data-limited scenarios such as Bayesian active learning, underscoring its potential for reliable, uncertainty-aware extrapolation in practical applications. ... We empirically validate the efficacy and efficiency of our approach ... on an extensive set of datasets ... We also show that our NSL-PFN reliably predicts chaotic behaviors of real-world neural scaling laws even with a few observations at a small scale."
Researcher Affiliation: Collaboration — ¹Yonsei University, ²KAIST, ³University of Freiburg, ⁴DeepAuto, ⁵Korea University.
Pseudocode: No — The paper describes its methods using mathematical notation and descriptive text, but it contains no explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor any structured, code-like procedural steps.
Open Source Code: Yes — "The code and models are available at https://github.com/DongWooLee-Eli/nslpfn."
Open Datasets: Yes — "Datasets. Following the previous work (Alabdulmohsin et al., 2022; Caballero et al., 2022), we first validate our NSL-PFN on the popular benchmark datasets (Alabdulmohsin et al., 2022) consisting of scaling curves evaluated on various tasks in both the image and natural language domains. The image classification (IC) dataset includes 72 scaling curves, each of which evaluates few-shot prediction performance of various neural network architectures w.r.t. the number of training datapoints. Specifically, many popular architectures such as BiT (Kolesnikov et al., 2020), MiX (Tolstikhin et al., 2021), and ViT (Alexey, 2020) are evaluated on various image classification datasets such as ImageNet (Russakovsky et al., 2015), CIFAR100 (Krizhevsky et al., 2009), Birds (Welinder et al., 2010), and Caltech101 (Fei-Fei et al., 2004). The natural language processing (NLP) dataset contains 20 scaling curves, each of which evaluates the performance of various Transformer architectures (Bansal et al., 2022; Thoppilan et al., 2022) w.r.t. the number of training datapoints, on neural machine translation (NMT), language modeling (LM), and BIG-Bench (BB; bench authors, 2023). We further consider the nanoGPT-Bench dataset (Nano; Kadra et al., 2023), consisting of 24 scaling curves obtained by training nanoGPT models of varying sizes on the OpenWebText dataset (Gokaslan et al., 2019). We also use ColPret, a recently released huge dataset containing more than 1,000 curves (Choshen et al., 2024)... Lastly, we consider the double descent (DD) dataset (Nakkiran et al., 2021), consisting of 16 curves exhibiting double descent behavior."
Dataset Splits: Yes — "After sampling a complete scaling curve D, we need to decide a cutoff position M to split D into the context C = {(x_i, y_i)}_{i=1}^{M} and the target T = {(x_i, y_i)}_{i=M+1}^{M+N} for training. ... In Fig. 3, we further test the robustness of prediction against varying context set size, i.e., the number of observations in each curve, or the cutoff. ... In this experiment, starting from four observations, we iteratively select the next unseen point with a specific criterion."
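The context/target split quoted above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the minimum context size of four (taken from the active-learning experiment), and the uniform sampling of the cutoff M are assumptions of this sketch.

```python
import numpy as np

def split_curve(xs, ys, rng, min_context=4):
    """Split a complete scaling curve D into a context set C (first M points)
    and a target set T (remaining N points) at a sampled cutoff position M.

    C = {(x_i, y_i)}_{i=1}^{M},  T = {(x_i, y_i)}_{i=M+1}^{M+N}
    """
    total = len(xs)
    m = rng.integers(min_context, total)  # cutoff M, leaving >= 1 target point
    context = list(zip(xs[:m], ys[:m]))
    target = list(zip(xs[m:], ys[m:]))
    return context, target

# Toy scaling curve: error decays as a power law of the training set size.
rng = np.random.default_rng(0)
xs = np.logspace(1, 4, 20)        # e.g. numbers of training datapoints
ys = 1.0 / np.sqrt(xs) + 0.05     # illustrative power-law-like error curve
C, T = split_curve(xs, ys, rng)
assert len(C) >= 4 and len(C) + len(T) == len(xs)
```

At training time, a PFN-style model conditions on C and is scored on how well it predicts the held-out target points in T.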
Hardware Specification: Yes — "The training time is roughly 2.6 hours on NVIDIA A100-SXM4-80GB."
Software Dependencies: No — The paper mentions several software components, such as 'Transformer (Vaswani, 2017)', 'Adam optimizer (Kingma, 2014)', 'EMCEE (Foreman-Mackey et al., 2013)', 'Bayesian Ridge', and 'GPyTorch'. However, it does not provide version numbers for any of these dependencies, which are necessary for a reproducible description.
Experiment Setup: Yes — "The architecture hyperparameters are set as follows: nlayers=12, nheads=4, and nhidden=512. We train our model on 1.6M synthetic examples sampled from our prior for 100K iterations, i.e., the mini-batch size is set to 16, with the Adam optimizer (Kingma, 2014). The learning rate is set to 0.00002 with cosine annealing, and the warm-up phase spans the first 25K iterations. To determine the size of D (i.e., M + N), we uniformly sample from the log space within the range [50, 500]."
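The reported setup can be collected into a small sketch. The hyperparameter values are quoted from the paper; the config container and the log-uniform sampling helper are illustrative assumptions, not the released implementation.

```python
import math
import random

# Hyperparameters as reported in the paper (container format is illustrative).
config = {
    "nlayers": 12, "nheads": 4, "nhidden": 512,
    "batch_size": 16, "iterations": 100_000,   # 16 * 100K = 1.6M examples
    "lr": 2e-5, "schedule": "cosine", "warmup_iters": 25_000,
}

def sample_curve_size(low=50, high=500, rng=random):
    """Sample the curve size M + N uniformly in log space over [low, high]."""
    return round(math.exp(rng.uniform(math.log(low), math.log(high))))

sizes = [sample_curve_size() for _ in range(1000)]
assert all(50 <= s <= 500 for s in sizes)
```

Log-uniform sampling spreads curve sizes evenly across orders of magnitude, so short and long curves are seen comparably often during training.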