Understanding the difficulties of posterior predictive estimation
Authors: Abhinav Agrawal, Justin Domke
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main contribution is a theoretical analysis demonstrating that even with exact inference, SNR can decay rapidly with an increase in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of test data relative to training data. Through several examples, we empirically verify these claims and show that these factors indeed lead to poor SNR and unreliable PPD estimates (sometimes, estimates are off by hundreds of nats even with a million samples). |
| Researcher Affiliation | Academia | 1Manning College of Information and Computer Sciences, University of Massachusetts, Amherst, MA, USA. Correspondence to: Abhinav Agrawal <EMAIL>. |
| Pseudocode | Yes | Figure 7 provides the pseudocode. Learned-IS(D, K): w ← Optimize(IW-ELBO); z_k ∼ q_w for k ∈ {1, …, K}; return (1/K) Σ_{k=1}^{K} p(D \| z_k)/q_w(z_k). |
| Open Source Code | No | The paper states 'All our code is implemented in JAX (Bradbury et al., 2018)' and 'While we implement our own inference schemes for this paper, we expect the results to be similar if we use the aforementioned libraries.' However, it does not provide an explicit statement of code release or a link to a code repository for the methodology described in this paper. |
| Open Datasets | Yes | Figure 1 shows log PPD_q estimates for a user-preference model on the MovieLens-25M dataset (Harper & Konstan, 2015), with approximate posterior q_D produced from variational inference (VI) with either a Gaussian or flow-based family (see section 5.4 for setup). |
| Dataset Splits | Yes | We used a train-test split such that, for each user, one-tenth of the ratings are in the test set. This gives us 18M ratings for training (and 2M ratings for testing). |
| Hardware Specification | Yes | All our code is implemented in JAX (Bradbury et al., 2018) and run on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions using 'JAX (Bradbury et al., 2018)', 'ADAM (Kingma & Ba, 2015)', and the 'DReG gradient (Tucker et al., 2019)', but does not provide version numbers for these software dependencies. |
| Experiment Setup | Yes | To learn the variational parameters, we optimize the standard ELBO using ADAM (Kingma & Ba, 2015) with a learning rate of 0.001 for 10,000 iterations. For each iteration, we use a batch of 16 samples for estimating the DReG gradient (Tucker et al., 2019). For LIS: 'optimize IW-ELBO_M using ADAM (Kingma & Ba, 2015) with a learning rate of 0.001 for 1000 iterations. For each iteration, we use a single sample of the DReG estimator. We set M = 16 for all our experiments.' |
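The SNR-decay claim quoted under "Research Type" can be illustrated with a minimal NumPy sketch. This is our own toy construction, not the paper's experiments: a conjugate Gaussian model where the exact posterior and the true log PPD are available in closed form, so the naive Monte Carlo estimator can be checked directly. Even with exact posterior samples, the estimate degrades badly when the latent dimension grows and the test data is mismatched with the training data.

```python
import numpy as np

# Toy conjugate model (our assumption, not the paper's): prior z ~ N(0, I_d),
# observation x | z ~ N(z, I_d). Given one training point x_train, the exact
# posterior is N(x_train / 2, I/2) and the true predictive is N(x_train / 2, 1.5 I).

def true_log_ppd(x_train, x_test):
    d = x_train.size
    mu, var = x_train / 2.0, 1.5
    return -0.5 * d * np.log(2 * np.pi * var) - np.sum((x_test - mu) ** 2) / (2 * var)

def mc_log_ppd(x_train, x_test, K, rng):
    d = x_train.size
    # Draw K samples from the EXACT posterior (no inference error at all).
    z = rng.normal(x_train / 2.0, np.sqrt(0.5), size=(K, d))
    # log p(x_test | z_k) for each sample.
    log_lik = -0.5 * d * np.log(2 * np.pi) - 0.5 * np.sum((x_test - z) ** 2, axis=1)
    # log of the Monte Carlo average, computed stably (log-mean-exp).
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))

rng = np.random.default_rng(0)
# Matched test data, low dimension: the estimate is accurate.
xtr, xte = np.zeros(2), np.zeros(2)
print(true_log_ppd(xtr, xte), mc_log_ppd(xtr, xte, 100_000, rng))
# Mismatched test data, d = 100: the estimate underestimates the truth by
# a large margin even with 100,000 exact posterior samples.
xtr, xte = np.zeros(100), 3.0 * np.ones(100)
print(true_log_ppd(xtr, xte), mc_log_ppd(xtr, xte, 100_000, rng))
```

The failure mode is the one the paper describes: the integrand p(D\*|z) is dominated by a region of z-space the posterior rarely visits, so the log-mean-exp estimate collapses toward the largest sampled likelihood and is biased low.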
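At evaluation time, a Learned-IS-style scheme reduces to a self-normalized importance-sampling (SNIS) estimate of log PPD under a fitted proposal. A minimal sketch of that final step, assuming a generic log-density interface (all function names are ours): we skip the IW-ELBO fitting stage and instead plug in the exact posterior of a toy conjugate Gaussian model (prior N(0, I), likelihood N(z, I)) as the proposal, which makes the weights constant and the estimator easy to validate against the closed-form predictive.

```python
import numpy as np

def logsumexp(a):
    m = a.max()
    return m + np.log(np.sum(np.exp(a - m)))

def snis_log_ppd(log_joint_train, log_lik_test, log_q, z):
    """Self-normalized IS estimate of log PPD from proposal samples z: (K, d)."""
    log_w = log_joint_train(z) - log_q(z)              # unnormalized posterior weights
    return logsumexp(log_w + log_lik_test(z)) - logsumexp(log_w)

# Toy conjugate model (our construction): z ~ N(0, I), x | z ~ N(z, I).
log_normal = lambda x, mu, var: (-0.5 * x.shape[-1] * np.log(2 * np.pi * var)
                                 - np.sum((x - mu) ** 2, axis=-1) / (2 * var))

d = 2
x_train, x_test = np.ones(d), np.ones(d)
log_joint = lambda z: log_normal(z, 0.0, 1.0) + log_normal(x_train, z, 1.0)
log_q = lambda z: log_normal(z, x_train / 2.0, 0.5)    # exact posterior as proposal
log_lik = lambda z: log_normal(x_test, z, 1.0)

rng = np.random.default_rng(1)
z = rng.normal(x_train / 2.0, np.sqrt(0.5), size=(50_000, d))  # z_k ~ q
print(snis_log_ppd(log_joint, log_lik, log_q, z))
print(log_normal(x_test, x_train / 2.0, 1.5))          # closed-form predictive N(x_train/2, 1.5 I)
```

With the exact posterior as proposal the weights are constant and SNIS matches the truth; with a learned q the same estimator inherits whatever SNR the weights allow, which is exactly what the paper's analysis studies.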