reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Nearest Neighbor Dirichlet Mixtures

Authors: Shounak Chattopadhyay, Antik Chakraborty, David B. Dunson

JMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Desirable asymptotic properties are shown, and the methods are evaluated in simulation studies and applied to a motivating data set in the context of classification. Section 4 contains simulation experiments comparing NN-DM with a rich variety of competitors in univariate and multivariate examples, including an assessment of UQ performance. Section 5 contains a real data application, and Section 6 a discussion.
Researcher Affiliation	Academia	Shounak Chattopadhyay, Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA; Antik Chakraborty, Department of Statistics, Purdue University, West Lafayette, IN 47907, USA; David B. Dunson, Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA
Pseudocode	Yes	Algorithm 1: Nearest neighbor-Dirichlet mixture algorithm to obtain Monte Carlo samples from the pseudo-posterior of f(x) with Gaussian kernel and normal-inverse Wishart prior. Algorithm 2: Leave-one-out cross-validation for choosing the hyperparameter $\delta_0^2$ in nearest neighbor-Dirichlet mixture method. Algorithm 3: Nearest neighbor-Dirichlet mixture algorithm to obtain Monte Carlo samples from the pseudo-posterior of f(x) with Gaussian kernel and normal-inverse gamma prior.
Open Source Code	Yes	R package NNDM available at https://github.com/shounakchattopadhyay/NN-DM was used for the numerical experiments.
Open Datasets	Yes	We consider 10 choices of $f_0$ from the R package benchden (Mildenberger and Weinert, 2012); ... The high time resolution universe survey data (Keith et al., 2010) contain information on sampled pulsar stars. ... The data are publicly available from the University of California at Irvine machine learning repository.
Dataset Splits	Yes	In our experiments, we set $n_t = 500$ and $R = 20$. We create a test data set of 200 stars, among which 23 are pulsar stars. The training size is then varied from 300 to 1800 in increments of 300, each time adding 300 training points by randomly sampling from the entire data leaving out the initial test set.
Hardware Specification	Yes	With all the simulations carried out on an M1 MacBook Pro with 16 GB of RAM.
Software Dependencies	Yes	All simulations were carried out using the R programming language (R Core Team, 2018). For Dirichlet process mixture models, we collect 2,000 Markov chain Monte Carlo (MCMC) samples after discarding a burn-in of 3,000 samples using the dirichletprocess package (J. Ross and Markwick, 2019)... R package version 0.3.1.
Experiment Setup	Yes	In our experiments, we set $n_t = 500$ and $R = 20$. We set $n = 200, 500$ with $k_n = n^{1/3} + 1$. ... The prior hyperparameter choices for the proposed method are $\mu_0 = 0$, $\nu_0 = 0.001$, $\gamma_0 = 1$; $\delta_0^2$ is chosen via the cross-validation method of Section 2.3. For the multivariate cases, we consider $n = 200$ and $1000$. The number of neighbors is set to $k = 10$ and the dimension $p$ is chosen from $\{2, 3, 4, 6\}$. The hyperparameters for the nearest neighbor-Dirichlet mixture are chosen as $\mu_0 = 0_p$, $\nu_0 = 0.001$, $\gamma_0 = p$, and $\Psi_0 = \{(\gamma_0 - p + 1)\delta_0^2\} I_p = \delta_0^2 I_p$, where the optimal $\delta_0^2$ is chosen via cross-validation as described in Section 2.3. We implement the DP-MC with base measure $\mathrm{NIW}_p(0_p, 0.01, p, I_p)$ and a Gamma(2, 4) prior on the concentration parameter as in West (1992). For the NN-DM, we take $k = 8$ in the univariate case and $k = 5$ in the bivariate case, $\alpha = 0.001$, and other hyperparameters chosen as before.
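
For readers who want to mirror the quoted setup, the following is a minimal Python sketch of two pieces of it: the nested training sets from the Dataset Splits row (a fixed test set of 200, then training sizes 300 to 1800 in steps of 300) and the scale matrix from the Experiment Setup row. It is not taken from the NNDM R package. The function names, the seed, and the total sample size of 5000 are hypothetical placeholders, the test set is drawn purely at random (so it does not reproduce the reported count of 23 pulsars), and the scale matrix is read as (γ0 − p + 1) δ0² Ip, which is an assumed reading of the quoted formula that makes it equal δ0² Ip when γ0 = p.

```python
import random


def nested_training_sets(n_total, n_test=200, step=300, max_train=1800, seed=0):
    """Hold out a fixed test set, then grow the training set `step` points at a time.

    Hypothetical helper for illustration; not part of the NNDM package.
    """
    if n_total < n_test + max_train:
        raise ValueError("not enough observations for the requested split")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    test_idx = indices[:n_test]
    pool = indices[n_test:]  # shuffled, and disjoint from the test set
    # Successive prefixes of the shuffled pool: each round adds `step`
    # fresh, randomly chosen points to the previous training set.
    train_sets = [pool[:size] for size in range(step, max_train + 1, step)]
    return test_idx, train_sets


def niw_scale_factor(gamma0, p, delta0_sq):
    """Scalar c with Psi_0 = c * I_p, under the assumed reading
    Psi_0 = (gamma_0 - p + 1) * delta_0^2 * I_p."""
    return (gamma0 - p + 1) * delta0_sq


if __name__ == "__main__":
    # 5000 is a placeholder for the number of observations in the data set.
    test_idx, train_sets = nested_training_sets(n_total=5000)
    print(len(test_idx), [len(t) for t in train_sets])
    # With gamma_0 = p the factor reduces to delta_0^2.
    print(niw_scale_factor(gamma0=4, p=4, delta0_sq=0.5))
```

Taking prefixes of a single shuffled pool is one simple way to guarantee that each larger training set contains the previous one and never touches the test set; whether the authors drew the additional points this way is not stated in the quoted text.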