The Polynomial Stein Discrepancy for Assessing Moment Convergence
Authors: Narayan Srinivasan, Matthew Sutton, Christopher Drovandi, Leah F South
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section demonstrates the performance of PSD on the current benchmark examples from Liu et al. (2016), Chwialkowski et al. (2016), Jitkrittum et al. (2017) and Huggins & Mackey (2018). The proposed PSD is compared to existing methods on the basis of runtime, power in goodness-of-fit testing and performance as a sample quality measure. The simulations are run using the settings and implementations provided by the respective authors, with the exception that we sample from Q for all asymptotic methods since sampling from P is rarely feasible in practice. Goodness-of-fit testing results for PSD in the main paper are with the bootstrap test, which we recommend in general. Results for the PSD asymptotic test with samples from Q are shown in Appendix C.1. Following Jitkrittum et al. (2017) and Huggins & Mackey (2018), our bootstrap implementations for KSD and PSD use V-statistics with Rademacher resampling. The performance is similar to the bootstrap described in Liu et al. (2016) and in Section 3.2. Code to reproduce these results is available at https://github.com/Nars98/PSD. This code builds on existing code (Huggins, 2018; Jitkrittum, 2019) by adding PSD as a new method. All experiments were run on a high performance computing cluster, using a single core for each individual hypothesis test. Further empirical investigations are available in Appendix C. |
| Researcher Affiliation | Academia | 1 School of Mathematical Sciences, Queensland University of Technology (QUT), Brisbane, Australia 2 QUT Centre for Data Science, Brisbane, Australia 3 School of Mathematics and Physics, University of Queensland, Brisbane, Australia. Correspondence to: Narayan Srinivasan <EMAIL>, Leah South <EMAIL>. |
| Pseudocode | No | The paper describes the Polynomial Stein Discrepancy (PSD) and its associated goodness-of-fit tests using mathematical formulations and prose. It does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps formatted like code. |
| Open Source Code | Yes | Code to reproduce these results is available at https://github.com/Nars98/PSD. This code builds on existing code (Huggins, 2018; Jitkrittum, 2019) by adding PSD as a new method. |
| Open Datasets | Yes | We begin by considering P = N(0_d, I_d) and assessing the performance for a variety of Q and d using statistical tests with significance level α = 0.05. We investigate four cases: (a) type I error rate: Q = N(0_d, I_d); (b) statistical power for misspecified variance: Q = N(0_d, Σ), where Σ_ij = 0 for i ≠ j, Σ_11 = 1.7 and Σ_ii = 1 for i = 2, ..., d; (c) statistical power for misspecified kurtosis: Q = T(0, 5), a standard multivariate Student-t distribution with 5 degrees of freedom; and (d) statistical power for misspecified kurtosis: q(x) = ∏_{t=1}^{d} Lap(x_t | 0, 1/2), the product of d independent Laplace distributions. The target P is the non-normalized density of a restricted Boltzmann machine (RBM); the samples Q_n are obtained from the same RBM perturbed by independent Gaussian noise. We simulate data with 100 observations and d = 20 variables, of which three are non-zero. This example is a logistic regression with 10^4 observations and d = 5 variables using Gaussian priors with a standard deviation of 10. We conduct an empirical study with the target P set to the two-dimensional Rosenbrock function. |
| Dataset Splits | No | The paper primarily uses simulated data from described distributions or models for its experiments (e.g., Gaussian, Student-t, Laplace, Restricted Boltzmann Machine, Logistic Regression data). It specifies sample sizes (e.g., 'n = 1000', 'n = 2000', 'n = 10000') and the number of bootstrap samples ('m = 500') for its statistical tests, but it does not define or refer to conventional training, validation, or test dataset splits typically used in machine learning. |
| Hardware Specification | No | All experiments were run on a high performance computing cluster, using a single core for each individual hypothesis test. |
| Software Dependencies | No | The paper states that the code builds on existing code (Huggins, 2018; Jitkrittum, 2019) and is available on GitHub. However, it does not specify any particular software versions, such as programming languages, libraries, or frameworks (e.g., Python, PyTorch, TensorFlow, CUDA versions) that are essential for reproducibility. |
| Experiment Setup | Yes | We use m = 500 bootstrap samples to estimate the rejection threshold for the PSD and KSD tests. We set P = N(0_d, I_d) and assess the performance for a variety of Q and d using statistical tests with significance level α = 0.05. We use n = 1000 except for the multivariate t, which uses n = 2000. For IMQ KSD, we use the recommended IMQ kernel with c = 1 and β = 0.5. For FSSD, we set the number of test locations to 10. For RFSD, we use the L1 IMQ base kernel and fix the number of features to 10. We compare the step size selection made by PSD to that of RFSD and IMQ KSD when n = 10000 samples are obtained using SGLD. We simulate data with 100 observations and d = 20 variables, of which three are non-zero. This example is a logistic regression with 10^4 observations and d = 5 variables using Gaussian priors with a standard deviation of 10. |
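The bootstrap procedure the review quotes (V-statistics with Rademacher resampling, m = 500 resamples to estimate the rejection threshold) can be sketched generically. The paper's PSD features are not reproduced here; as a stand-in, the sketch uses the well-known Langevin Stein kernel for a one-dimensional standard normal target with an RBF base kernel, so only the wild-bootstrap mechanics are illustrative of the quoted setup. The bandwidth `h` and all sample sizes below are assumptions, not values from the paper.

```python
import numpy as np

def stein_gram_gaussian(x, h=1.0):
    """Stein kernel Gram matrix for target P = N(0, 1) with an RBF base kernel.

    Stand-in for the paper's Stein features: the score of N(0, 1) is s(x) = -x,
    and k_p(x, y) = d_x d_y k + s(x) d_y k + s(y) d_x k + s(x) s(y) k.
    """
    d = x[:, None] - x[None, :]                 # pairwise differences x_i - x_j
    k = np.exp(-d**2 / (2 * h**2))              # RBF base kernel
    return k * (1.0 / h**2 - d**2 / h**4 - d**2 / h**2 + np.outer(x, x))

def wild_bootstrap_test(x, m=500, alpha=0.05, rng=None):
    """V-statistic goodness-of-fit test with Rademacher (wild) resampling."""
    rng = np.random.default_rng(rng)
    n = len(x)
    H = stein_gram_gaussian(x)
    stat = H.mean()                             # V-statistic: (1/n^2) sum_ij H_ij
    # Rademacher multiplier bootstrap approximates the null distribution
    eps = rng.choice([-1.0, 1.0], size=(m, n))
    boot = np.einsum('bi,ij,bj->b', eps, H, eps) / n**2
    pval = float(np.mean(boot >= stat))
    return stat, pval, pval < alpha
```

A grossly misspecified sample (e.g. a mean shift of 2) yields a much larger V-statistic than a well-specified one and is rejected at level α = 0.05.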
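The four benchmark alternatives (a)-(d) quoted under "Open Datasets" are all easy to simulate, which is why the review marks them open. A minimal sketch of the samplers, assuming numpy and illustrative choices of n and d (the multivariate t is built via the standard normal/chi-squared construction):

```python
import numpy as np

def sample_q(case, n, d, rng):
    """Draw n samples from the benchmark alternatives to P = N(0_d, I_d)."""
    if case == 'a':   # null case: Q = P
        return rng.standard_normal((n, d))
    if case == 'b':   # misspecified variance: Sigma_11 = 1.7, rest identity
        x = rng.standard_normal((n, d))
        x[:, 0] *= np.sqrt(1.7)
        return x
    if case == 'c':   # misspecified kurtosis: multivariate t, 5 degrees of freedom
        g = rng.chisquare(df=5, size=n)
        return rng.standard_normal((n, d)) / np.sqrt(g / 5)[:, None]
    if case == 'd':   # misspecified kurtosis: product of d Laplace(0, 1/2) margins
        return rng.laplace(loc=0.0, scale=0.5, size=(n, d))
    raise ValueError(f"unknown case: {case}")
```

Marginal variances distinguish the cases: 1 for (a), 1.7 in the first coordinate for (b), 5/3 for (c) (t with ν = 5 has variance ν/(ν − 2)), and 0.5 for (d) (Laplace with scale b has variance 2b²).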