Graphical Dirichlet Process for Clustering Non-Exchangeable Grouped Data
Authors: Arhit Chakrabarti, Yang Ni, Ellen Ruth A. Morris, Michael L. Salinas, Robert S. Chapkin, Bani K. Mallick
JMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop an efficient posterior inference algorithm and illustrate our model with simulations and a real grouped single-cell data set. |
| Researcher Affiliation | Academia | Arhit Chakrabarti, Department of Statistics, Texas A&M University, College Station, TX 77843-3143, USA; Yang Ni, Department of Statistics, CPRIT Single Cell Data Science Core, Texas A&M University, College Station, TX 77843-3143, USA; Ellen Ruth A. Morris, Department of Nutrition, Program in Integrative Nutrition & Complex Diseases (current address: Texas A&M Veterinary Medical Diagnostic Laboratory), Texas A&M University, College Station, TX 77843-4471, USA; Michael L. Salinas, Department of Nutrition, Program in Integrative Nutrition & Complex Diseases, CPRIT Single Cell Data Science Core, Texas A&M University, College Station, TX 77843-2253, USA; Robert S. Chapkin, Department of Nutrition, Program in Integrative Nutrition & Complex Diseases, CPRIT Single Cell Data Science Core, Texas A&M University, College Station, TX 77843-2253, USA; Bani K. Mallick, Department of Statistics, Texas A&M University, College Station, TX 77843-3143, USA |
| Pseudocode | No | The paper describes the |
| Open Source Code | Yes | The source codes used for the analysis, including those for simulations and real data, can be found in the repository https://github.com/Arhit-Chakrabarti/GDPSamp. |
| Open Datasets | No | Our motivating application is a single-cell RNA-sequencing (scRNA-seq) study that aimed to investigate intestinal stem cell differentiation processes in mice with colorectal cancer. ... For illustration, we randomly sampled 100 cells from each of the eight groups. The scRNA-seq data were pre-processed following standard procedure as outlined by Hao et al. (2021) using the R package Seurat. |
| Dataset Splits | No | The paper describes sample sizes for simulation and random sampling for the real data, but does not provide specific train/test/validation splits. For simulations, Table 1 specifies 'Sample sizes Groups' for 'small', 'moderate', 'large', and 'unbalanced' scenarios. For real data, it states: 'For illustration, we randomly sampled 100 cells from each of the eight groups.' However, this is data preparation and not a split for model evaluation. |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions 'R package Seurat' and 'SALTSampler (Director et al., 2017) for which the implementation is publicly available as an R package' but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | In our Gibbs sampler, the truncation level of the finite mixture model was set to L = 10, and the base measure for GDP, G0, was specified as the normal-inverse-Wishart distribution, NIW(0, 0.01, I2, 2). Upon the completion of the Gibbs sampler, the clusters were estimated by using the least squares criterion (Dahl, 2006), and they were compared with the true cluster labels for evaluation. We considered various sample sizes in each group, which are summarized in Table 1. In all cases, we ran 15,000 iterations of our Gibbs sampler and after discarding the first 5,000 samples as burn-in, we retained every 10th iteration of posterior samples. For the real data, "We considered the truncation level, L = 30, and the same base probability measure, G0, as in the simulations... We ran four parallel chains of the Gibbs sampler for 50,000 iterations. To monitor the convergence of the sampler, we drew the traceplots of the log-likelihood for each of the four chains, after discarding the initial 35,000 samples and thinning the samples by a factor of 15..." |
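
The burn-in, thinning, and least squares steps quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code (their implementation is the R repository linked above). The function names `retained_iterations` and `least_squares_partition` are hypothetical, and the convention that the first retained draw falls `thin` iterations after burn-in is an assumption. The second function follows the general idea of the least squares criterion (Dahl, 2006): among the sampled partitions, pick the one whose co-clustering matrix is closest to the posterior average.

```python
import numpy as np


def retained_iterations(n_iter, burn_in, thin):
    """Return the 1-based iteration numbers kept after burn-in and thinning.

    Assumption: the first kept draw is `thin` iterations after burn-in.
    """
    return np.arange(burn_in + thin, n_iter + 1, thin)


def least_squares_partition(label_draws):
    """Pick the sampled partition closest to the posterior similarity matrix.

    label_draws: array-like of shape (S, n), one row of cluster labels per
    retained MCMC draw. Builds an (S, n, n) array, so it is only meant for
    small illustrative inputs.
    """
    draws = np.asarray(label_draws)
    # Co-clustering indicators: co[s, i, j] = 1 if i and j share a cluster.
    co = (draws[:, :, None] == draws[:, None, :]).astype(float)
    # Posterior similarity matrix: average co-clustering over draws.
    psm = co.mean(axis=0)
    # Squared distance between each draw's indicators and the average.
    losses = ((co - psm) ** 2).sum(axis=(1, 2))
    best = int(np.argmin(losses))
    return draws[best], psm


if __name__ == "__main__":
    # Simulation settings quoted above: 15,000 iterations, 5,000 burn-in, thin 10.
    print(len(retained_iterations(15_000, 5_000, 10)))
    # Real-data settings: 50,000 iterations, 35,000 discarded, thin 15 (per chain).
    print(len(retained_iterations(50_000, 35_000, 15)))

    toy_draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 1]]
    estimate, psm = least_squares_partition(toy_draws)
    print(estimate)
```

Under the stated thinning convention, both settings leave 1,000 retained draws (per chain for the real data). The partition comparison depends only on which items share a cluster, so relabelled draws such as `[0, 0, 1, 1]` and `[1, 1, 0, 0]` are treated as the same clustering.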