Are you using test log-likelihood correctly?

Authors: Sameer Deshpande, Soumya Ghosh, Tin D. Nguyen, Tamara Broderick

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present simple examples demonstrating how comparisons based on test log-likelihood can contradict comparisons according to other objectives. Specifically, our examples show that (i) approximate Bayesian inference algorithms that attain higher test log-likelihoods need not also yield more accurate posterior approximations and (ii) conclusions about forecast accuracy based on test log-likelihood comparisons may not agree with conclusions based on root mean squared error. Our examples demonstrate that test log-likelihood is not always a good proxy for posterior approximation error. They further demonstrate that forecast evaluations based on test log-likelihood may not agree with forecast evaluations based on root mean squared error."
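Claim (ii) above can be illustrated numerically: a forecast with the better (smaller) RMSE can still have the worse (smaller) test log-likelihood when its predictive variance is badly miscalibrated. A minimal sketch under Gaussian predictive distributions (all numbers illustrative, not taken from the paper):

```python
import math

def gaussian_tll(y, mu, var):
    """Average log-density of test points y under N(mu, var)."""
    return sum(-0.5 * math.log(2 * math.pi * var)
               - (yi - mu) ** 2 / (2 * var) for yi in y) / len(y)

def rmse(y, mu):
    """Root mean squared error of a point forecast mu."""
    return math.sqrt(sum((yi - mu) ** 2 for yi in y) / len(y))

# Forecaster A: mean exactly right, but a badly overconfident (tiny)
# predictive variance. Forecaster B: slightly worse mean, honest variance.
y_test = [-2.0, -1.0, 0.0, 1.0, 2.0]
tll_a, rmse_a = gaussian_tll(y_test, 0.0, 0.01), rmse(y_test, 0.0)
tll_b, rmse_b = gaussian_tll(y_test, 0.5, 2.0), rmse(y_test, 0.5)

print(rmse_a < rmse_b)  # True: A wins on RMSE
print(tll_b > tll_a)    # True: but B wins on test log-likelihood
```

The two metrics rank the forecasters in opposite orders, which is exactly the kind of disagreement the paper's examples exhibit.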
Researcher Affiliation | Collaboration | Sameer K. Deshpande (University of Wisconsin–Madison); Soumya Ghosh (MIT-IBM Watson AI Lab, IBM Research); Tin D. Nguyen (MIT-IBM Watson AI Lab, Massachusetts Institute of Technology); Tamara Broderick (MIT-IBM Watson AI Lab, Massachusetts Institute of Technology)
Pseudocode | No | The paper describes theoretical concepts and analyzes experimental results without providing any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions "Periodic Matern32 in https://github.com/SheffieldML/GPy", but this is a third-party library used by the authors, not code released for their own methodology. No explicit statement about releasing their own code, and no link to a code repository for this work, is provided.
Open Datasets | No | The paper describes generating synthetic datasets for its examples from specified mathematical models (e.g., D_100 = {(x_n, y_n)}_{n=1}^{100} drawn from the heteroscedastic model x_n ~ N(0, 1), y_n | x_n ~ N(x_n, 1 + log(1 + exp(x_n)))). It does not provide access information for publicly available or open datasets.
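The quoted heteroscedastic model is fully specified, so the synthetic data can be re-generated even though no dataset is released. A minimal sketch (function name and seed are our own choices, not the paper's):

```python
import numpy as np

def sample_heteroscedastic(n=100, seed=0):
    """Draw (x_n, y_n) pairs from the model quoted above:
    x ~ N(0, 1),  y | x ~ N(x, 1 + log(1 + exp(x))),
    where the second argument of N is a variance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    var = 1.0 + np.log1p(np.exp(x))   # noise variance grows with x
    y = rng.normal(loc=x, scale=np.sqrt(var))
    return x, y

x, y = sample_heteroscedastic(100)
```

Note `np.log1p(np.exp(x))` is the softplus term log(1 + exp(x)) from the model; the added 1.0 keeps the variance bounded away from zero.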
Dataset Splits | Yes | "We assume we have access to training and testing data such that all data points are independently and identically distributed (i.i.d.) from an unknown probability distribution P. Let D = {y_n}_{n=1}^{N} denote the training data. ... Practitioners commonly assess how well their model predicts out-of-sample using a held-out set of testing data D* = {y*_n}_{n=1}^{N*} ... test log-likelihood computed on 10^4 test set observations ... Using a test set of size N* = 395,000, we observed TLL(D*; Π̂) = −1.420 < −1.389 = TLL(D*; Π)."
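The TLL quantity quoted in this row is conventionally the average log posterior-predictive density of the held-out points. A minimal sketch under a Gaussian-predictive assumption with Monte Carlo posterior draws (array names and shapes are illustrative, not the paper's notation):

```python
import numpy as np

def gaussian_logpdf(y, mu, var):
    """Elementwise log N(y | mu, var)."""
    return -0.5 * np.log(2 * np.pi * var) - (y - mu) ** 2 / (2 * var)

def tll(y_test, draw_mu, draw_var):
    """Monte-Carlo test log-likelihood:
    (1/N*) sum_n log[ (1/S) sum_s p(y*_n | theta_s) ],
    where draw_mu, draw_var (shape (S,)) are per-draw predictive params.
    """
    # (S, N*) matrix: log-density of each test point under each draw
    lp = gaussian_logpdf(y_test[None, :],
                         draw_mu[:, None], draw_var[:, None])
    S = lp.shape[0]
    # log-mean-exp over draws (stable), then average over test points
    lme = np.logaddexp.reduce(lp, axis=0) - np.log(S)
    return lme.mean()

# Illustrative usage: 50 identical draws, each predicting N(0, 1)
y_star = np.array([0.0, 1.0, -1.0])
val = tll(y_star, np.zeros(50), np.ones(50))
```

With identical draws the log-mean-exp collapses to a single Gaussian log-density, which makes the sketch easy to sanity-check by hand.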
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or machine specifications) used to run its experiments.
Software Dependencies | No | The paper mentions GPy ("Periodic Matern32 in https://github.com/SheffieldML/GPy") but does not specify a version number for it or for any other software dependency used in the experiments.
Experiment Setup | Yes | "We vary this constant value over the set {10^-3, 10^-2, 10^-1, 1, 10}. We use this heuristic and run SWAG for a thousand epochs, annealing the learning rate down to a different constant value after 750 epochs. ... First consider the case where we employ a periodic kernel, constrain the noise nugget σ^2 to 1.6, and fit all other hyper-parameters by maximizing the marginal likelihood."
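The quoted SWAG setup (constant learning rate for the first 750 of 1,000 epochs, then annealed down to a constant value taken from the grid) can be sketched as a schedule function. The linear-decay form of the anneal is our assumption; the excerpt only says the rate is annealed "down to a different constant value":

```python
def swag_lr(epoch, base_lr, final_lr, total_epochs=1000, anneal_start=750):
    """Learning-rate schedule matching the quoted setup: hold base_lr
    until anneal_start, then decay (linearly, by assumption) to
    final_lr by total_epochs, holding final_lr thereafter."""
    if epoch < anneal_start:
        return base_lr
    t = min((epoch - anneal_start) / (total_epochs - anneal_start), 1.0)
    return base_lr + t * (final_lr - base_lr)

# The final constant value is varied over the grid quoted above.
final_lr_grid = [1e-3, 1e-2, 1e-1, 1.0, 10.0]
```

One would call `swag_lr(epoch, base_lr, final_lr)` once per epoch and pass the result to the optimizer; `base_lr` itself is not stated in the excerpt.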