Position: The Future of Bayesian Prediction Is Prior-Fitted

Authors: Samuel Müller, Arik Reuter, Noah Hollmann, David Rügamer, Frank Hutter

ICML 2025

Reproducibility assessment. Each entry lists the variable, the result, and the supporting excerpt from the LLM response.
Research Type: Experimental. Evidence: "In this position paper, we explore their potential and directions to address their current limitations." ... "For technical details of the training procedure for the experiments in this position paper, we refer to Appendix A." ... "Appendix A. Experimental Setup" ... "We show the average standard deviation of 100 datasets sampled from our prior, each normalized by its final standard deviation." ... "We show some example trajectories in the Appendix in Figure 4."
Researcher Affiliation: Collaboration. Evidence: "1 University of Freiburg, Freiburg, Germany; 2 Meta, New York (work done at University of Freiburg); 3 LMU Munich, Munich, Germany; 4 Prior Labs; 5 Munich Center for Machine Learning (MCML), Munich, Germany; 6 ELLIS Institute Tübingen, Tübingen, Germany."
Pseudocode: Yes. Evidence: "Algorithm 1: Latent Prior and Sampling Functions."
Open Source Code: No. The paper does not explicitly state that source code for the described methodology will be released, nor does it link to a code repository.
Open Datasets: No. Evidence: "In this work, we argue that training on vast quantities of synthetic data is ideally suited to utilize the rapidly increasing amount of available computational resources for neural network pre-training in these areas." ... "For the analysis of the Martingale property in Section 5, we utilized the standard GP proposed by Müller et al. (2022) with an RBF kernel... Inputs were uniformly sampled between 0 and 1." ... "To illustrate the PFN architecture's limitation in counting duplicated samples, we trained a PFN on random coin flips in Section 6.5."
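The Open Datasets entry notes that the Martingale analysis used synthetic data drawn from a standard GP prior with an RBF kernel and inputs sampled uniformly on [0, 1]. A minimal sketch of such a sampler is given below; the lengthscale and jitter values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def rbf_kernel(x, lengthscale=0.1):
    # Squared-exponential (RBF) kernel matrix for 1-D inputs.
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp_dataset(n_points, lengthscale=0.1, jitter=1e-4, rng=None):
    # Draw inputs uniformly on [0, 1] and a function sample from the GP prior.
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 1.0, size=n_points)
    K = rbf_kernel(x, lengthscale) + jitter * np.eye(n_points)
    # Sample f ~ N(0, K) via the Cholesky factor of K.
    y = np.linalg.cholesky(K) @ rng.standard_normal(n_points)
    return x, y

x, y = sample_gp_dataset(50, rng=0)
```

In the PFN setting, many such datasets would be drawn from the prior and used as pre-training examples.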
Dataset Splits: No. Evidence: "Training set sizes were uniformly sampled from 1 to 100." This describes the sizes of the generated datasets, not explicit train/validation/test splits for a fixed dataset.
Hardware Specification: No. Evidence: "We are grateful for the computational resources that were available for this research. Specifically, we acknowledge support by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 39/963-1 FUGG (bwForCluster NEMO)."
Software Dependencies: No. Evidence: "A promising first step could be incorporating zero attention (add_zero_attn in PyTorch; Paszke et al., 2019)." ... "a highly standardized procedure that could even be compiled to ONNX (developers, 2021)." The paper names software but does not specify version numbers for the key libraries or frameworks used in its experiments.
Experiment Setup: Yes. Evidence: "A grid search identified the model with the best final training loss. We searched across 4 and 8 layers, batch sizes of 32 and 64, Adam learning rates of 0.0001, 0.0003, and 0.001, embedding sizes of 128, 256, and 512, and step counts of 100,000, 200,000, and 400,000."
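The grid described in the Experiment Setup entry can be enumerated directly. The sketch below only illustrates the size and shape of that search space; the dictionary keys are illustrative names, not identifiers from the authors' code.

```python
from itertools import product

# Hyperparameter grid as described in the paper's Appendix A.
grid = {
    "layers": [4, 8],
    "batch_size": [32, 64],
    "lr": [0.0001, 0.0003, 0.001],
    "embedding_size": [128, 256, 512],
    "steps": [100_000, 200_000, 400_000],
}

# Cartesian product over all axes: 2 * 2 * 3 * 3 * 3 = 108 configurations.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))
```

Each configuration would be trained and the one with the best final training loss selected, per the quoted setup.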