Bayesian Sparse Gaussian Mixture Model for Clustering in High Dimensions
Authors: Dapeng Yao, Fangzheng Xie, Yanxun Xu
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The validity and usefulness of the proposed method are demonstrated through simulation studies and the analysis of a real-world single-cell RNA sequencing data set. Keywords: Clustering, High dimensions, Minimax estimation, Posterior contraction, Single-cell sequencing. 4. Simulation Studies 5. Single-cell Sequencing Data Analysis |
| Researcher Affiliation | Academia | Dapeng Yao EMAIL Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, U.S.A. Fangzheng Xie EMAIL Department of Statistics, Indiana University, Bloomington, Indiana, U.S.A. Yanxun Xu EMAIL Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, Maryland, U.S.A. |
| Pseudocode | Yes | Algorithm 1 The Gibbs sampler. Require: initialization of C, {µ_c : c ∈ C}, ξ, {φ_c : c ∈ C}. 1: for b = 1 to B do; 2: for i = 1 to n do; 4: if z_i ≠ z_l for all l ≠ i then; 5: remove µ_{z_i}; 6: end if; 7: sample φ_{t+1} ∼ p_φ(φ_{t+1}); 8: sample µ_{t+1} ∼ p_{µ\|ξ,φ}(µ_{t+1} \| ξ, φ_{t+1}); 9: for k = 1 to t do; 10: m_k ← (n_{−i,k} + α) p(Y_i \| µ_c), where n_{−i,k} is the size of cluster k in C^{−i}; 11: end for; 12: V_n(t) ← (t!/n!) (Γ(αt)/n^{αt−1}) p_K(t); 13: V_n(t+1) ← ((t+1)!/n!) (Γ(α(t+1))/n^{α(t+1)−1}) p_K(t+1); 14: m_{t+1} ← α (V_n(t+1)/V_n(t)) p(Y_i \| µ_{t+1}); 15: sample z_i ∼ Categorical(m_1/Σ_{k=1}^{t+1} m_k, …, m_{t+1}/Σ_{k=1}^{t+1} m_k); 16: end for; 17: for c = 1 to \|C\| do; 18: for j = 1 to p do; 19: sample (µ_c)_j ∼ N(·, (n_c + λ²_{ξ_j}(φ_c)_j)^{−1}); 20: end for; 21: end for; 22: for c = 1 to \|C\| do; 23: for j = 1 to p do; 24: sample (φ_c)_j ∼ GiG(0.5, (µ_c)²_j λ²_{ξ_j}, 1); 25: end for; 26: end for; 27: for j = 1 to p do; 28: θ* ← θ ∏_{c∈C} λ₁ exp(−λ₁²(µ_c)²_j(φ_c)_j/2) / [θ ∏_{c∈C} λ₁ exp(−λ₁²(µ_c)²_j(φ_c)_j/2) + (1−θ) ∏_{c∈C} λ₀ exp(−λ₀²(µ_c)²_j(φ_c)_j/2)]; 29: sample ξ_j ∼ Bernoulli(θ*); 30: end for; 31: sample θ ∼ Beta(1 + Σ_{j=1}^p ξ_j, β_θ + p − Σ_{j=1}^p ξ_j); 32: end for |
| Open Source Code | Yes | The R code can be found at https://github.com/YanxunXu/High_Dim_Clustering. |
| Open Datasets | Yes | We evaluate the proposed Bayesian sparse Gaussian mixture model using a benchmark scRNA-seq data set (Darmanis et al., 2015), which is available at the data repository Gene Expression Omnibus (GSE67835, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE67835). |
| Dataset Splits | No | The paper describes data preprocessing for the single-cell RNA sequencing data ("After excluding hybrid cells and filtering out lowly expressed genes") and simulation scenarios, but does not provide specific details on how datasets were split into training, validation, or test sets for experimental evaluation. |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., CPU, GPU models, memory) used for running the experiments or simulations. |
| Software Dependencies | Yes | The proposed Bayesian method, Mclust, PCA-KM, and SKM are performed under R with version 4.2.1 and CHIME is performed under Matlab with version 9.11 (R2021b). |
| Experiment Setup | Yes | The proposed Bayesian method can successfully recover the simulated true number of clusters. Specifically, when K = 3, the proposed method identifies 3 clusters in 85 out of 100 replicates for s = 6 and in 98 replicates for s = 12; when K = 5, the proposed method identifies 5 clusters in 83 out of 100 replicates for s = 6 and in 98 replicates for s = 12. In contrast, all four competitors underestimate the number of clusters. In particular, when K = 3, the estimated number of clusters under each of the four competitors equals 2 in all 100 simulation replicates. When K = 5, PCA-KM, SKM, Mclust, and CHIME correctly estimate the number of clusters in only 6, 0, 4, and 3 out of 100 replicates for s = 6, and in 8, 0, 9, and 3 out of 100 replicates for s = 12. Figure 1 and Appendix Figure A1 plot the simulated true cluster memberships and the estimated clustering results under the proposed Bayesian method and the four competitors for one randomly selected simulation replicate with K = 3, s = 6, and K = 5, s = 6, respectively. The four competitors cannot reliably distinguish clusters with a certain degree of overlap, e.g., the green and blue clusters in the upper left panel of Figure 1, while the proposed Bayesian method can successfully separate them. For PCA-KM and SKM, the number of clusters is chosen via the Silhouette method (Rousseeuw, 1987), with K ranging from 2 to 10. For Mclust and CHIME, the number of clusters is estimated via the Bayesian information criterion (BIC). In each simulation, posterior inference is computed using the developed MCMC sampler with 1000 burn-in iterations and another 4000 iterations for post-burn-in samples. The upper bound on the number of clusters is set to Kmax = 20. The hyperparameters κ, λ0, and λ1 in the spike-and-slab LASSO prior are set to 0.1, 100, and 1, respectively, and λ in the truncated Poisson prior for K is set to 2. The estimated number of clusters and cluster assignments under the proposed Bayesian method are reported based on the posterior mode of the z_i's from post-burn-in MCMC samples. |
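The spike-and-slab indicator updates quoted in the pseudocode row (algorithm lines 27–31) can be sketched as below. This is a minimal illustration, not the authors' R implementation: the function name, the `beta_theta` default, and the exact exp(−λ²(µ_c)²_j(φ_c)_j/2) form of the per-feature factors are our assumptions from the extracted pseudocode.

```python
import numpy as np

def update_indicators(mu, phi, theta, lam0, lam1, rng, beta_theta=1.0):
    """One Gibbs pass over the feature-inclusion indicators xi_j.

    mu, phi : (n_clusters, p) arrays of cluster means and local scales.
    theta   : current inclusion probability.
    lam1, lam0 : slab and spike scales of the spike-and-slab LASSO prior.
    Names and the beta_theta default are illustrative, not the paper's code.
    """
    n_clusters, p = mu.shape
    # log of prod_c lam * exp(-lam^2 * mu_cj^2 * phi_cj / 2) under the
    # slab (lam1) and spike (lam0) scales, kept on the log scale for stability.
    log_slab = np.sum(np.log(lam1) - 0.5 * lam1**2 * mu**2 * phi, axis=0)
    log_spike = np.sum(np.log(lam0) - 0.5 * lam0**2 * mu**2 * phi, axis=0)
    # Posterior inclusion probability theta*_j via a stable log-sum-exp.
    log_num = np.log(theta) + log_slab
    log_den = np.logaddexp(log_num, np.log1p(-theta) + log_spike)
    theta_star = np.exp(log_num - log_den)
    # Line 29: xi_j ~ Bernoulli(theta*_j).
    xi = rng.random(p) < theta_star
    # Line 31: conjugate update theta ~ Beta(1 + sum xi, beta_theta + p - sum xi).
    theta_new = rng.beta(1.0 + xi.sum(), beta_theta + p - xi.sum())
    return xi, theta_star, theta_new
```

With a small λ₁ (weak slab shrinkage) and a large λ₀ (strong spike shrinkage), features whose cluster means are far from zero get inclusion probabilities near one, which is the separating behavior the paper relies on in high dimensions.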