Bayesian Sparse Gaussian Mixture Model for Clustering in High Dimensions
Authors: Dapeng Yao, Fangzheng Xie, Yanxun Xu
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The validity and usefulness of the proposed method are demonstrated through simulation studies and the analysis of a real-world single-cell RNA sequencing data set. Keywords: Clustering, High dimensions, Minimax estimation, Posterior contraction, Single-cell sequencing. 4. Simulation Studies 5. Single-cell Sequencing Data Analysis |
| Researcher Affiliation | Academia | Dapeng Yao EMAIL Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, U.S.A. Fangzheng Xie EMAIL Department of Statistics, Indiana University, Bloomington, Indiana, U.S.A. Yanxun Xu EMAIL Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, U.S.A. |
| Pseudocode | Yes | Algorithm 1 The Gibbs sampler. Require: initialization of C, {µ_c : c ∈ C}, ξ, {φ_c : c ∈ C}. 1: for b = 1 to B do; 2: for i = 1 to n do; 4: if z_i ≠ z_l for all l ≠ i then; 5: remove µ_{z_i}; 6: end if; 7: sample φ_{t+1} ∼ p_φ(φ_{t+1}); 8: sample µ_{t+1} ∼ p_{µ\|ξ,φ}(µ_{t+1} \| ξ, φ_{t+1}); 9: for k = 1 to t do; 10: m_k ← (n_{−i,k} + α) p(Y_i \| µ_c), where n_{−i,k} is the size of cluster k in C^{−i}; 11: end for; 12: V_n(t) ← (t!/n!) (Γ(αt)/n^{αt−1}) p_K(t); 13: V_n(t+1) ← ((t+1)!/n!) (Γ(α(t+1))/n^{α(t+1)−1}) p_K(t+1); 14: m_{t+1} ← α (V_n(t+1)/V_n(t)) p(Y_i \| µ_{t+1}); 15: sample z_i ∼ Categorical(m_1/Σ_{k=1}^{t+1} m_k, …, m_{t+1}/Σ_{k=1}^{t+1} m_k); 16: end for; 17: for c = 1 to \|C\| do; 18: for j = 1 to p do; 19: sample (µ_c)_j ∼ N(·, (n_c + λ²_{ξ_j}(φ_c)_j)^{−1}); 20: end for; 21: end for; 22: for c = 1 to \|C\| do; 23: for j = 1 to p do; 24: sample (φ_c)_j ∼ GiG(0.5, (µ_c)²_j λ²_{ξ_j}, 1); 25: end for; 26: end for; 27: for j = 1 to p do; 28: θ* ← θ ∏_{c∈C} λ₁ exp(−λ₁²(µ_c)²_j(φ_c)_j/2) / [θ ∏_{c∈C} λ₁ exp(−λ₁²(µ_c)²_j(φ_c)_j/2) + (1−θ) ∏_{c∈C} λ₀ exp(−λ₀²(µ_c)²_j(φ_c)_j/2)]; 29: sample ξ_j ∼ Bernoulli(θ*); 30: end for; 31: sample θ ∼ Beta(1 + Σ_{j=1}^p ξ_j, β_θ + p − Σ_{j=1}^p ξ_j); 32: end for |
| Open Source Code | Yes | The R code can be found at https://github.com/YanxunXu/High_Dim_Clustering. |
| Open Datasets | Yes | We evaluate the proposed Bayesian sparse Gaussian mixture model using a benchmark scRNA-seq data set (Darmanis et al., 2015), which is available at the data repository Gene Expression Omnibus (GSE67835, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835). |
| Dataset Splits | No | The paper describes data preprocessing for the single-cell RNA sequencing data ("After excluding hybrid cells and filtering out lowly expressed genes") and simulation scenarios, but does not provide specific details on how datasets were split into training, validation, or test sets for experimental evaluation. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running the experiments or simulations. |
| Software Dependencies | Yes | The proposed Bayesian method, Mclust, PCA-KM, and SKM are performed under R with version 4.2.1 and CHIME is performed under Matlab with version 9.11 (R2021b). |
| Experiment Setup | Yes | The proposed Bayesian method can successfully recover the simulated true number of clusters. Specifically, when K = 3, the proposed method identifies 3 clusters in 85 out of 100 replicates for s = 6 and in 98 replicates for s = 12; when K = 5, the proposed method identifies 5 clusters in 83 out of 100 replicates for s = 6 and in 98 replicates for s = 12. In contrast, all four competitors underestimate the number of clusters. In particular, when K = 3, the estimated number of clusters under each of the four competitors equals 2 in all 100 simulation replicates. When K = 5, PCA-KM, SKM, Mclust, and CHIME correctly estimate the number of clusters in only 6, 0, 4, and 3 out of 100 replicates for s = 6, and in 8, 0, 9, and 3 out of 100 replicates for s = 12. Figure 1 and Appendix Figure A1 plot the simulated true cluster memberships and the estimated clustering results under the proposed Bayesian method and the four competitors for one randomly selected simulation replicate with K = 3, s = 6, and K = 5, s = 6, respectively. The four competitors cannot reliably distinguish clusters with a certain degree of overlap, e.g., the green and blue clusters in the upper left panel of Figure 1, while the proposed Bayesian method can successfully separate them. For PCA-KM and SKM, the number of clusters is chosen via the Silhouette method (Rousseeuw, 1987), with K ranging from 2 to 10. For Mclust and CHIME, the number of clusters is estimated via the Bayesian information criterion (BIC). In each simulation, posterior inference is computed using the developed MCMC sampler with 1000 burn-in iterations and another 4000 iterations for post-burn-in samples. The upper bound on the number of clusters is set to Kmax = 20. The hyperparameters κ, λ0, and λ1 in the spike-and-slab LASSO prior are set to 0.1, 100, and 1, respectively, and λ in the truncated Poisson prior for K is set to 2. The estimated number of clusters and cluster assignments under the proposed Bayesian method are reported based on the posterior mode of the z_i's from post-burn-in MCMC samples. |
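The spike-and-slab indicator updates quoted in the pseudocode row (algorithm lines 27–31) can be sketched as below. This is a minimal illustration, not the authors' R implementation: the function name, the `beta_theta` default, and the exact exp(−λ²(µ_c)²_j(φ_c)_j/2) form of the per-feature factors are our assumptions from the extracted pseudocode.

```python
import numpy as np

def update_indicators(mu, phi, theta, lam0, lam1, rng, beta_theta=1.0):
    """One Gibbs pass over the feature-inclusion indicators xi_j.

    mu, phi : (n_clusters, p) arrays of cluster means and local scales.
    theta   : current inclusion probability.
    lam1, lam0 : slab and spike scales of the spike-and-slab LASSO prior.
    Names and the beta_theta default are illustrative, not the paper's code.
    """
    n_clusters, p = mu.shape
    # log of prod_c lam * exp(-lam^2 * mu_cj^2 * phi_cj / 2) under the
    # slab (lam1) and spike (lam0) scales, kept on the log scale for stability.
    log_slab = np.sum(np.log(lam1) - 0.5 * lam1**2 * mu**2 * phi, axis=0)
    log_spike = np.sum(np.log(lam0) - 0.5 * lam0**2 * mu**2 * phi, axis=0)
    # Posterior inclusion probability theta*_j via a stable log-sum-exp.
    log_num = np.log(theta) + log_slab
    log_den = np.logaddexp(log_num, np.log1p(-theta) + log_spike)
    theta_star = np.exp(log_num - log_den)
    # Line 29: xi_j ~ Bernoulli(theta*_j).
    xi = rng.random(p) < theta_star
    # Line 31: conjugate update theta ~ Beta(1 + sum xi, beta_theta + p - sum xi).
    theta_new = rng.beta(1.0 + xi.sum(), beta_theta + p - xi.sum())
    return xi, theta_star, theta_new
```

With a small λ₁ (weak slab shrinkage) and a large λ₀ (strong spike shrinkage), features whose cluster means are far from zero get inclusion probabilities near one, which is the separating behavior the paper relies on in high dimensions.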