Are you using test log-likelihood correctly?
Authors: Sameer Deshpande, Soumya Ghosh, Tin D. Nguyen, Tamara Broderick
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error. Our examples demonstrate that test log-likelihood is not always a good proxy for posterior approximation error. They further demonstrate that forecast evaluations based on test log-likelihood may not agree with forecast evaluations based on root mean squared error. |
| Researcher Affiliation | Collaboration | Sameer K. Deshpande (University of Wisconsin–Madison); Soumya Ghosh (MIT-IBM Watson AI Lab, IBM Research); Tin D. Nguyen (MIT-IBM Watson AI Lab, Massachusetts Institute of Technology); Tamara Broderick (MIT-IBM Watson AI Lab, Massachusetts Institute of Technology) |
| Pseudocode | No | The paper describes theoretical concepts and analyzes experimental results without providing any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions 'Periodic Matern32 in https://github.com/SheffieldML/GPy', but this is a third-party library the authors used, not code released for their own methodology. The paper makes no explicit statement about releasing its own code and provides no link to a code repository for the described work. |
| Open Datasets | No | The paper describes generating synthetic datasets for its examples from specified mathematical models (e.g., D_100 = {(x_n, y_n)}_{n=1}^{100} drawn from the heteroscedastic model x_n ~ N(0, 1), y_n \| x_n ~ N(x_n, 1 + log(1 + exp(x_n)))). It does not provide access information for publicly available or open datasets. |
| Dataset Splits | Yes | We assume we have access to training and testing data such that all data points are independently and identically distributed (i.i.d.) from an unknown probability distribution P. Let D = {y_n}_{n=1}^{N} denote the training data. ... Practitioners commonly assess how well their model predicts out-of-sample using a held-out set of testing data D* = {y*_n}_{n=1}^{N*}... test log-likelihood computed on 10^4 test set observations... Using a test set of size N* = 395,000, we observed TLL(D*; Π) = −1.420 < −1.389 = TLL(D*; Π̂). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'GPy' as a tool ('Periodic Matern32 in https://github.com/SheffieldML/GPy') but does not specify a version number for it or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | We vary this constant value over the set {10^-3, 10^-2, 10^-1, 1, 10}. We use this heuristic and run SWAG for a thousand epochs, annealing the learning rate down to a different constant value after 750 epochs. ... First consider the case where we employ a periodic kernel, constrain the noise nugget σ2 to 1.6, and fit all other hyper-parameters by maximizing the marginal likelihood. |
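The 'Open Datasets' and 'Dataset Splits' rows quote a heteroscedastic generating process and the test log-likelihood (TLL) computed on held-out data. A minimal sketch of both, assuming NumPy; the function names and the choice of a Gaussian predictive matching the true conditional are illustrative, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_heteroscedastic(n):
    """Draw pairs per the quoted model: x_n ~ N(0, 1), y_n | x_n ~ N(x_n, 1 + log(1 + exp(x_n)))."""
    x = rng.standard_normal(n)
    var = 1.0 + np.log1p(np.exp(x))      # conditional variance from the quoted model
    y = x + np.sqrt(var) * rng.standard_normal(n)
    return x, y

def gaussian_tll(y, mean, var):
    """Average held-out log density under a Gaussian predictive (one common TLL convention)."""
    return np.mean(-0.5 * np.log(2.0 * np.pi * var) - (y - mean) ** 2 / (2.0 * var))

# 10^4 test observations, matching the test-set size quoted above.
x_star, y_star = draw_heteroscedastic(10_000)
tll = gaussian_tll(y_star, x_star, 1.0 + np.log1p(np.exp(x_star)))
```

Here the predictive mean and variance are taken from the true data-generating process, so `tll` is simply the held-out average log density under the oracle predictive; any fitted model would plug in its own predictive mean and variance instead.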
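The 'Research Type' row quotes the paper's claim that TLL-based and RMSE-based forecast comparisons can disagree. A toy illustration of how that happens (the two Gaussian predictives and all numbers here are invented for illustration, not one of the paper's examples): model A has the optimal mean but a badly inflated variance, model B a biased mean with well-calibrated variance.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(100_000)          # held-out draws from the true N(0, 1)

def tll(y, mean, var):
    # average held-out Gaussian log density
    return np.mean(-0.5 * np.log(2.0 * np.pi * var) - (y - mean) ** 2 / (2.0 * var))

def rmse(y, mean):
    return np.sqrt(np.mean((y - mean) ** 2))

# Model A: optimal mean, inflated variance -> best RMSE, poor TLL.
# Model B: biased mean, correct variance  -> worse RMSE, higher TLL.
tll_a, rmse_a = tll(y, 0.0, 100.0), rmse(y, 0.0)
tll_b, rmse_b = tll(y, 0.5, 1.0), rmse(y, 0.5)
assert tll_b > tll_a and rmse_b > rmse_a  # the two metrics rank the models oppositely
```

RMSE depends only on the predictive mean, while TLL also rewards calibrated predictive uncertainty, which is exactly why the two rankings can diverge.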
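The 'Experiment Setup' row describes running SWAG for a thousand epochs with the learning rate annealed down to a different constant value after epoch 750. One possible schedule, sketched under the assumption of a linear anneal (the paper does not specify the annealing shape, and the function name and defaults are illustrative):

```python
def swag_lr(epoch, lr_high, lr_low, decay_start=750, decay_end=1000):
    """Constant high rate, then anneal to a lower constant (linear shape is an assumption)."""
    if epoch < decay_start:
        return lr_high                    # constant phase, epochs [0, 750)
    if epoch >= decay_end:
        return lr_low                     # settled at the new constant value
    frac = (epoch - decay_start) / (decay_end - decay_start)
    return lr_high + frac * (lr_low - lr_high)
```

For the grid described above, `lr_low` would be swept over the quoted set {10^-3, 10^-2, 10^-1, 1, 10}, one SWAG run per value.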