Bayesian Data Selection
Authors: Eli N. Weinstein, Jeffrey W. Miller
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation. Keywords: Bayesian nonparametrics, Bayesian theory, consistency, misspecification, Stein discrepancy. We provide first-of-a-kind empirical data selection analyses with two models that are frequently used in single-cell RNA sequencing analysis. |
| Researcher Affiliation | Academia | Eli N. Weinstein EMAIL Data Science Institute Columbia University New York, NY 10027, USA. Jeffrey W. Miller EMAIL Department of Biostatistics Harvard T.H. Chan School of Public Health Boston, MA 02115, USA |
| Pseudocode | No | The paper describes methods and theoretical derivations but does not include any explicitly labeled pseudocode or algorithm blocks. Procedural steps are described in paragraph form or mathematical equations. |
| Open Source Code | Yes | Code is available at https://github.com/EWeinstein/data-selection. |
| Open Datasets | Yes | We downloaded two publicly available data sets. The first data set was from human peripheral blood mononuclear cells (PBMCs), available at: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k. ... The second was taken from a dissociated extranodal marginal zone B-cell tumor, specifically a mucosa-associated lymphoid tissue (MALT) tumor: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/malt_10k_protein_v3. ... In addition to the two data sets in D.5, we also explored a data set of E18 mouse neurons: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/neuron_10k_v3. |
| Dataset Splits | Yes | We performed leave-one-out data selection, comparing the foreground space X_F0 = X to foreground spaces X_Fj for j ∈ {1, . . . , d}, which exclude the jth dimension of the data. ... We subsampled each data set to 200 genes (selected randomly from among the 2000 most highly expressed) and 2000 cells (selected randomly) for computational tractability... |
| Hardware Specification | No | The paper mentions "GPU-accelerated stochastic variational inference" in Section 8 but does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for the experiments. |
| Software Dependencies | No | The paper mentions using "pymanopt (Townsend et al., 2016)" and "Pyro by defining a new distribution with log probability given by the negative NKSD (Bingham et al., 2019)", as well as the "Adam optimizer (Kingma and Ba, 2015)". However, specific version numbers for these software packages are not provided in the text. |
| Experiment Setup | Yes | We set α = 0.1 in the following experiments, and we use pymanopt (Townsend et al., 2016) to optimize U over the Stiefel manifold (Section D). ... We set T = 0.05 in the SVC, based on the calibration procedure in Section A.1 (Section D.3). We use the Pitman-Yor mixture model expression for the background model dimension (Equation 3), with α = 0.5, ν = 1, and D = 0.2. ... The number of latent components k was set to 3, based on the procedure of Minka (2000). ... We place a standard normal prior on each entry of Hj and a Laplace prior on each entry of Jjj with scale 0.1... We use the factored IMQ kernel for the NKSD, with β = 0.5 and c = 1. ... At each optimization step, the expectation ... is estimated using a minibatch of 200 randomly selected datapoints... We interleave updates to the variational approximation and to φ, using the Adam optimizer with step size 0.01 for each. |
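The leave-one-out data selection described under "Dataset Splits" compares a score on the full data against scores on data sets with each dimension removed. A minimal sketch of that loop structure, with `score_fn` standing in as a hypothetical stand-in for the paper's SVC computation (not the actual criterion):

```python
import numpy as np

def leave_one_out_selection(X, score_fn):
    """Compare the full foreground space X_F0 = X against each leave-one-out
    space X_Fj (the data with column j removed), for j in {0, ..., d-1}.

    `score_fn` is a hypothetical placeholder mapping a data matrix to a
    scalar criterion value; the paper uses the Stein volume criterion (SVC).
    """
    d = X.shape[1]
    full_score = score_fn(X)
    # Score each foreground space that excludes one dimension of the data.
    loo_scores = {j: score_fn(np.delete(X, j, axis=1)) for j in range(d)}
    return full_score, loo_scores
```

Dimensions whose removal changes the score markedly are the ones flagged by data selection; the real criterion and its calibration (T = 0.05) are defined in the paper, not here.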
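The experiment setup fixes the kernel hyperparameters for the NKSD at β = 0.5 and c = 1. As background, the inverse multiquadric (IMQ) kernel underlying that choice has the standard form k(x, y) = (c² + ‖x − y‖²)^(−β); a minimal sketch (the paper uses a factored variant of this kernel, which is not reproduced here):

```python
import numpy as np

def imq_kernel(x, y, c=1.0, beta=0.5):
    """Inverse multiquadric kernel: (c**2 + ||x - y||**2) ** (-beta).

    Defaults match the hyperparameters reported in the experiment setup
    (beta = 0.5, c = 1).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (c**2 + np.sum((x - y) ** 2)) ** (-beta)
```

For example, with the default settings, identical inputs give k(x, x) = 1, and the kernel decays toward 0 as ‖x − y‖ grows.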