Bayesian Distance Clustering

Authors: Leo L. Duan, David B. Dunson

JMLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "A simulation study is included to assess performance relative to competitors, and we apply the approach to clustering of brain genome expression data." Keywords: Distance-based clustering, Mixture model, Model-based clustering, Model misspecification, Pairwise distance matrix, Partial likelihood
Researcher Affiliation | Academia | Leo L. Duan, Department of Statistics, University of Florida, Gainesville, FL 32611, USA; David B. Dunson, Department of Statistical Science, Duke University, Durham, NC 27708, USA
Pseudocode | Yes | Algorithm 1 gives the pseudocode of the No-U-Turn Hamiltonian Monte Carlo sampler for Bayesian distance clustering.
Open Source Code | No | No explicit statement or link open-sourcing the code described in this paper is provided. The paper mentions the "hamiltorch package (Cobb and Jalaian, 2020)", which is a third-party tool, not the authors' own implementation of their methodology.
Open Datasets | Yes | "To assess the performance, we use the MNIST data of hand-written digits of 0–9, with each image having p = 28 × 28 pixels."
Dataset Splits | No | For the MNIST data: "In each experiment, we take n = 10,000 random samples to fit the clustering models, among which each digit has approximately 1000 samples, and we repeat experiments 10 times." No train/test/validation splits or random seeds are given for reproducing the exact sample partitioning. For the brain data: "We take the mid-coronal section of 41 × 58 voxels. Excluding the empty ones outside the brain, they have a sample size of n = 1781," which describes the data selection but not experimental splits.
Hardware Specification | Yes | "To provide some running time, using a quad-core i7 CPU, at n = 1000, the HMC algorithm takes about 20 minutes for running 10,000 iterations."
Software Dependencies | No | The paper mentions the "BFGS optimization algorithm (implemented in the PyTorch package)" and the "No-U-Turn Sampler (NUTS-HMC) algorithm (Hoffman and Gelman, 2014) implemented in the hamiltorch package (Cobb and Jalaian, 2020)", but gives no version numbers for PyTorch or hamiltorch.
Experiment Setup | Yes | "To favor small values for the mode while accommodating a moderate degree of uncertainty, we use a Gamma prior αh ~ Gamma(1.5, 1.0). For conjugacy, we choose an inverse-gamma prior for σh with E(σh) = βσ, σh ~ Inverse-Gamma(2, βσ), βσ = 1. In this article, we use t = 0.1 as a balance between the approximation accuracy and the numeric stability of the algorithm. To run the HMC sampler, we use the No-U-Turn Sampler (NUTS-HMC) algorithm (...) implemented in the hamiltorch package (Cobb and Jalaian, 2020), which also automatically tunes the other two working parameters ε and L. For clustering, we use an over-fitted mixture with k = 20 and small Dirichlet concentration parameter α = 1/20."
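The Pseudocode row above notes that the paper's Algorithm 1 is a NUTS-HMC sampler. As a rough, self-contained illustration of the Hamiltonian Monte Carlo family that algorithm belongs to (not the authors' algorithm, and not NUTS itself, which additionally auto-tunes the step size ε and path length L), a minimal leapfrog HMC sampler for a standard-normal target can be sketched as:

```python
import math
import random

def hmc_sample(log_prob_grad, theta0, n_samples=1000, eps=0.1, L=20, seed=0):
    """Minimal 1-D HMC sampler, for illustration only.

    log_prob_grad(theta) must return (log p(theta), d/dtheta log p(theta)).
    """
    rng = random.Random(seed)
    theta = theta0
    samples = []
    for _ in range(n_samples):
        # Draw an auxiliary momentum and simulate Hamiltonian dynamics.
        p = rng.gauss(0.0, 1.0)
        theta_new, p_new = theta, p
        lp, g = log_prob_grad(theta_new)
        for _ in range(L):  # leapfrog integration: half/full/half steps
            p_new += 0.5 * eps * g
            theta_new += eps * p_new
            lp_new, g = log_prob_grad(theta_new)
            p_new += 0.5 * eps * g
        # Metropolis correction for the discretisation error.
        log_accept = (lp_new - 0.5 * p_new ** 2) - (lp - 0.5 * p ** 2)
        if math.log(rng.random()) < log_accept:
            theta = theta_new
        samples.append(theta)
    return samples

# Target: standard normal, log p(x) = -x^2/2 up to a constant.
draws = hmc_sample(lambda x: (-0.5 * x * x, -x), theta0=0.0)
```

The draws should concentrate around the target's mean 0 with variance near 1; NUTS replaces the fixed `eps` and `L` with automatically tuned values, as the hamiltorch package used in the paper does.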
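The Dataset Splits row points out that no random seeds are reported for the repeated n = 10,000 MNIST subsamples. A generic, seeded subsampling scheme (a sketch of how such partitioning could be made reproducible, not the authors' code; `base_seed` is an arbitrary choice) might look like:

```python
import random

def draw_subsamples(n_total=70000, n=10000, repeats=10, base_seed=2021):
    """Draw `repeats` reproducible subsamples of size n from n_total items.

    One seed per repetition makes each of the 10 experiments replayable.
    """
    runs = []
    for r in range(repeats):
        rng = random.Random(base_seed + r)  # fixed seed for repetition r
        runs.append(sorted(rng.sample(range(n_total), n)))
    return runs

runs = draw_subsamples()
```

Recording `base_seed` (or the per-run index lists themselves) would be enough to reproduce the exact sample partitioning the report finds missing.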