Are you using test log-likelihood correctly?

Authors: Sameer Deshpande, Soumya Ghosh, Tin D. Nguyen, Tamara Broderick

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error. Our examples demonstrate that test log-likelihood is not always a good proxy for posterior approximation error. They further demonstrate that forecast evaluations based on test log-likelihood may not agree with forecast evaluations based on root mean squared error."
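Claim (ii) above can be illustrated numerically: a forecast with the better (smaller) RMSE can still have the worse (smaller) test log-likelihood when its predictive variance is badly miscalibrated. A minimal sketch under Gaussian predictive distributions (all numbers illustrative, not taken from the paper):

```python
import math

def gaussian_tll(y, mu, var):
    """Average log-density of test points y under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (yi - mu) ** 2 / (2 * var) for yi in y) / len(y)

def rmse(y, mu):
    """Root mean squared error of a point forecast mu."""
    return math.sqrt(sum((yi - mu) ** 2 for yi in y) / len(y))

# Forecaster A: mean exactly right, but a badly overconfident (tiny)
# predictive variance. Forecaster B: slightly worse mean, honest variance.
y_test = [-2.0, -1.0, 0.0, 1.0, 2.0]
tll_a, rmse_a = gaussian_tll(y_test, 0.0, 0.01), rmse(y_test, 0.0)
tll_b, rmse_b = gaussian_tll(y_test, 0.5, 2.0), rmse(y_test, 0.5)

print(rmse_a < rmse_b)  # True: A wins on RMSE
print(tll_b > tll_a)    # True: but B wins on test log-likelihood
```

The two metrics rank the forecasters in opposite orders, which is exactly the kind of disagreement the paper's examples exhibit.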
Researcher Affiliation | Collaboration | Sameer K. Deshpande (University of Wisconsin–Madison); Soumya Ghosh (MIT-IBM Watson AI Lab, IBM Research); Tin D. Nguyen (MIT-IBM Watson AI Lab, Massachusetts Institute of Technology); Tamara Broderick (MIT-IBM Watson AI Lab, Massachusetts Institute of Technology)
Pseudocode | No | The paper describes theoretical concepts and analyzes experimental results without providing any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions "Periodic Matern32 in https://github.com/SheffieldML/GPy", but this is a third-party library used by the authors, not code released for their own methodology. No explicit statement about releasing their own code, and no link to a code repository for this work, is provided.
Open Datasets | No | The paper describes generating synthetic datasets for its examples from specified mathematical models (e.g., D_100 = {(x_n, y_n)}_{n=1}^{100} drawn from the heteroscedastic model x_n ~ N(0, 1), y_n | x_n ~ N(x_n, 1 + log(1 + exp(x_n)))). It does not provide access information for publicly available or open datasets.
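The quoted heteroscedastic model is fully specified, so the synthetic data can be re-generated even though no dataset is released. A minimal sketch (function name and seed are our own choices, not the paper's):

```python
import numpy as np

def sample_heteroscedastic(n=100, seed=0):
    """Draw (x_n, y_n) pairs from the model quoted above:
    x ~ N(0, 1),  y | x ~ N(x, 1 + log(1 + exp(x))),
    where the second argument of N is a variance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    var = 1.0 + np.log1p(np.exp(x))   # noise variance grows with x
    y = rng.normal(loc=x, scale=np.sqrt(var))
    return x, y

x, y = sample_heteroscedastic(100)
```

Note `np.log1p(np.exp(x))` is the softplus term log(1 + exp(x)) from the model; the added 1.0 keeps the variance bounded away from zero.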
Dataset Splits | Yes | "We assume we have access to training and testing data such that all data points are independently and identically distributed (i.i.d.) from an unknown probability distribution P. Let D = {y_n}_{n=1}^{N} denote the training data. ... Practitioners commonly assess how well their model predicts out-of-sample using a held-out set of testing data D* = {y*_n}_{n=1}^{N*} ... test log-likelihood computed on 10^4 test set observations ... Using a test set of size N* = 395,000, we observed TLL(D*; Π̂) = −1.420 < −1.389 = TLL(D*; Π)."
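The TLL quantity quoted in this row is conventionally the average log posterior-predictive density of the held-out points. A minimal sketch under a Gaussian-predictive assumption with Monte Carlo posterior draws (array names and shapes are illustrative, not the paper's notation):

```python
import numpy as np

def gaussian_logpdf(y, mu, var):
    """Elementwise log N(y | mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (y - mu) ** 2 / (2 * var)

def tll(y_test, draw_mu, draw_var):
    """Monte-Carlo test log-likelihood:
    (1/N*) sum_n log[ (1/S) sum_s p(y*_n | theta_s) ],
    where draw_mu, draw_var (shape (S,)) are per-draw predictive params.
    """
    # (S, N*) matrix: log-density of each test point under each draw
    lp = gaussian_logpdf(y_test[None, :],
                         draw_mu[:, None], draw_var[:, None])
    S = lp.shape[0]
    # log-mean-exp over draws (stable), then average over test points
    lme = np.logaddexp.reduce(lp, axis=0) - np.log(S)
    return lme.mean()

# Illustrative usage: 50 identical draws, each predicting N(0, 1)
y_star = np.array([0.0, 1.0, -1.0])
val = tll(y_star, np.zeros(50), np.ones(50))
```

With identical draws the log-mean-exp collapses to a single Gaussian log-density, which makes the sketch easy to sanity-check by hand.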
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or machine specifications) used to run its experiments.
Software Dependencies | No | The paper mentions GPy ("Periodic Matern32 in https://github.com/SheffieldML/GPy") but does not specify a version number for it or for any other software dependency used in the experiments.
Experiment Setup | Yes | "We vary this constant value over the set {10^-3, 10^-2, 10^-1, 1, 10}. We use this heuristic and run SWAG for a thousand epochs, annealing the learning rate down to a different constant value after 750 epochs. ... First consider the case where we employ a periodic kernel, constrain the noise nugget σ^2 to 1.6, and fit all other hyper-parameters by maximizing the marginal likelihood."
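The quoted SWAG setup (constant learning rate for the first 750 of 1,000 epochs, then annealed down to a constant value taken from the grid) can be sketched as a schedule function. The linear-decay form of the anneal is our assumption; the excerpt only says the rate is annealed "down to a different constant value":

```python
def swag_lr(epoch, base_lr, final_lr, total_epochs=1000, anneal_start=750):
    """Learning-rate schedule matching the quoted setup: hold base_lr
    until anneal_start, then decay (linearly, by assumption) to
    final_lr by total_epochs, holding final_lr thereafter."""
    if epoch < anneal_start:
        return base_lr
    t = min((epoch - anneal_start) / (total_epochs - anneal_start), 1.0)
    return base_lr + t * (final_lr - base_lr)

# The final constant value is varied over the grid quoted above.
final_lr_grid = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
```

One would call `swag_lr(epoch, base_lr, final_lr)` once per epoch and pass the result to the optimizer; `base_lr` itself is not stated in the excerpt.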