reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Graphical Dirichlet Process for Clustering Non-Exchangeable Grouped Data

Authors: Arhit Chakrabarti, Yang Ni, Ellen Ruth A. Morris, Michael L. Salinas, Robert S. Chapkin, Bani K. Mallick

JMLR 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We develop an efficient posterior inference algorithm and illustrate our model with simulations and a real grouped single-cell data set.
Researcher Affiliation	Academia	Arhit Chakrabarti, EMAIL, Department of Statistics, Texas A&M University, College Station, TX 77843-3143, USA; Yang Ni, EMAIL, Department of Statistics, CPRIT Single Cell Data Science Core, Texas A&M University, College Station, TX 77843-3143, USA; Ellen Ruth A. Morris, EMAIL, Department of Nutrition, Program in Integrative Nutrition & Complex Diseases, Current address: Texas A&M Veterinary Medical Diagnostic Laboratory, Texas A&M University, College Station, TX 77843-4471, USA; Michael L. Salinas, EMAIL, Department of Nutrition, Program in Integrative Nutrition & Complex Diseases, CPRIT Single Cell Data Science Core, Texas A&M University, College Station, TX 77843-2253, USA; Robert S. Chapkin, EMAIL, Department of Nutrition, Program in Integrative Nutrition & Complex Diseases, CPRIT Single Cell Data Science Core, Texas A&M University, College Station, TX 77843-2253, USA; Bani K. Mallick, EMAIL, Department of Statistics, Texas A&M University, College Station, TX 77843-3143, USA
Pseudocode	No	The paper describes the
Open Source Code	Yes	The source codes used for the analysis, including those for simulations and real data, can be found in the repository https://github.com/Arhit-Chakrabarti/GDPSamp.
Open Datasets	No	Our motivating application is a single-cell RNA-sequencing (scRNA-seq) study that aimed to investigate intestinal stem cell differentiation processes in mice with colorectal cancer. ... For illustration, we randomly sampled 100 cells from each of the eight groups. The scRNA-seq data were pre-processed following standard procedure as outlined by Hao et al. (2021) using the R package Seurat.
Dataset Splits	No	The paper describes sample sizes for simulation and random sampling for the real data, but does not provide specific train/test/validation splits. For simulations, Table 1 specifies 'Sample sizes Groups' for 'small', 'moderate', 'large', and 'unbalanced' scenarios. For real data, it states: 'For illustration, we randomly sampled 100 cells from each of the eight groups.' However, this is data preparation and not a split for model evaluation.
Hardware Specification	No	The paper does not explicitly mention any specific hardware used for running its experiments, such as GPU models, CPU models, or cloud computing instance types.
Software Dependencies	No	The paper mentions 'R package Seurat' and 'SALTSampler (Director et al., 2017) for which the implementation is publicly available as an R package' but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup	Yes	In our Gibbs sampler, the truncation level of the finite mixture model was set to L = 10, and the base measure for GDP, G0, was specified as the normal-inverse-Wishart distribution, NIW(0, 0.01, I2, 2). Upon the completion of the Gibbs sampler, the clusters were estimated by using the least squares criterion (Dahl, 2006), and they were compared with the true cluster labels for evaluation. We considered various sample sizes in each group, which are summarized in Table 1. In all cases, we ran 15,000 iterations of our Gibbs sampler and after discarding the first 5,000 samples as burn-in, we retained every 10th iteration of posterior samples. For the real data, "We considered the truncation level, L = 30, and the same base probability measure, G0, as in the simulations... We ran four parallel chains of the Gibbs sampler for 50,000 iterations. To monitor the convergence of the sampler, we drew the traceplots of the log-likelihood for each of the four chains, after discarding the initial 35,000 samples and thinning the samples by a factor of 15..."
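
The burn-in and thinning settings quoted in the Experiment Setup row imply a fixed number of retained posterior draws, and the least squares criterion (Dahl, 2006) then picks one of those draws as the cluster estimate. The Python sketch below illustrates both steps under stated assumptions. It is not taken from the authors' repository: the function names `retained_iterations` and `least_squares_clustering` are hypothetical, and the exact thinning offset (first kept draw is the `thin`-th iteration after burn-in) is an assumption, since the quoted text does not specify it.

```python
def retained_iterations(n_iter, burn_in, thin):
    """Return the 1-based iteration numbers kept after burn-in and thinning.

    Assumption: the first kept draw is the `thin`-th iteration after burn-in.
    """
    return list(range(burn_in + thin, n_iter + 1, thin))


def least_squares_clustering(draws):
    """Pick the sampled partition closest to the posterior similarity matrix.

    `draws` is a list of label lists, one per retained iteration, all of the
    same length. This follows the general idea of Dahl (2006): average the
    co-clustering indicators over draws, then choose the draw whose own
    co-clustering matrix has the smallest squared distance to that average.
    """
    n_draws = len(draws)
    n_obs = len(draws[0])

    # Posterior similarity matrix: share of draws in which i and j co-cluster.
    psm = [
        [sum(d[i] == d[j] for d in draws) / n_draws for j in range(n_obs)]
        for i in range(n_obs)
    ]

    def squared_loss(labels):
        # Squared distance between this draw's 0/1 co-clustering matrix and psm.
        return sum(
            ((labels[i] == labels[j]) - psm[i][j]) ** 2
            for i in range(n_obs)
            for j in range(n_obs)
        )

    best = min(range(n_draws), key=lambda k: squared_loss(draws[k]))
    return draws[best]


# Settings quoted above: simulations, then real data (per chain).
sim_kept = retained_iterations(n_iter=15_000, burn_in=5_000, thin=10)
real_kept = retained_iterations(n_iter=50_000, burn_in=35_000, thin=15)
print(len(sim_kept), len(real_kept))

# Toy label draws (hypothetical, for illustration only).
toy_draws = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
print(least_squares_clustering(toy_draws))
```

Under this offset assumption, both configurations keep the same number of draws per chain, so the simulation and real-data summaries rest on posterior samples of equal size. The clustering step compares partitions rather than raw labels, which is why the first two toy draws count as the same clustering despite swapped label values.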