Bayesian Data Selection
Authors: Eli N. Weinstein, Jeffrey W. Miller
JMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply the SVC to the analysis of single-cell RNA sequencing data sets using probabilistic principal components analysis and a spin glass model of gene regulation. Keywords: Bayesian nonparametrics, Bayesian theory, consistency, misspecification, Stein discrepancy. We provide first-of-a-kind empirical data selection analyses with two models that are frequently used in single-cell RNA sequencing analysis. |
| Researcher Affiliation | Academia | Eli N. Weinstein EMAIL Data Science Institute Columbia University New York, NY 10027, USA. Jeffrey W. Miller EMAIL Department of Biostatistics Harvard T.H. Chan School of Public Health Boston, MA 02115, USA |
| Pseudocode | No | The paper describes methods and theoretical derivations but does not include any explicitly labeled pseudocode or algorithm blocks. Procedural steps are described in paragraph form or mathematical equations. |
| Open Source Code | Yes | Code is available at https://github.com/EWeinstein/data-selection. |
| Open Datasets | Yes | We downloaded two publicly available data sets. The first data set was from human peripheral blood mononuclear cells (PBMCs), available at: https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k. ... The second was taken from a dissociated extranodal marginal zone B-cell tumor, specifically a mucosa-associated lymphoid tissue (MALT) tumor: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/malt_10k_protein_v3. ... In addition to the two data sets in D.5, we also explored a data set of E18 mouse neurons: https://support.10xgenomics.com/single-cell-gene-expression/datasets/3.0.0/neuron_10k_v3. |
| Dataset Splits | Yes | We performed leave-one-out data selection, comparing the foreground space X_F0 = X to foreground spaces X_Fj for j ∈ {1, . . . , d}, which exclude the jth dimension of the data. ... We subsampled each data set to 200 genes (selected randomly from among the 2000 most highly expressed) and 2000 cells (selected randomly) for computational tractability... |
| Hardware Specification | No | The paper mentions "GPU-accelerated stochastic variational inference" in Section 8 but does not provide specific hardware details such as GPU models, CPU types, or memory configurations used for the experiments. |
| Software Dependencies | No | The paper mentions using "pymanopt (Townsend et al., 2016)" and "Pyro by defining a new distribution with log probability given by the negative NKSD (Bingham et al., 2019)", as well as the "Adam optimizer (Kingma and Ba, 2015)". However, specific version numbers for these software packages are not provided in the text. |
| Experiment Setup | Yes | We set α = 0.1 in the following experiments, and we use pymanopt (Townsend et al., 2016) to optimize U over the Stiefel manifold (Section D). ... We set T = 0.05 in the SVC, based on the calibration procedure in Section A.1 (Section D.3). We use the Pitman-Yor mixture model expression for the background model dimension (Equation 3), with α = 0.5, ν = 1, and D = 0.2. ... The number of latent components k was set to 3, based on the procedure of Minka (2000). ... We place a standard normal prior on each entry of Hj and a Laplace prior on each entry of Jjj with scale 0.1... We use the factored IMQ kernel for the NKSD, with β = 0.5 and c = 1. ... At each optimization step, the expectation ... is estimated using a minibatch of 200 randomly selected datapoints... We interleave updates to the variational approximation and to φ, using the Adam optimizer with step size 0.01 for each. |
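The leave-one-out data selection described under "Dataset Splits" compares a score on the full data against scores on data sets with each dimension removed. A minimal sketch of that loop structure, with `score_fn` standing in as a hypothetical stand-in for the paper's SVC computation (not the actual criterion):

```python
import numpy as np

def leave_one_out_selection(X, score_fn):
    """Compare the full foreground space X_F0 = X against each leave-one-out
    space X_Fj (the data with column j removed), for j in {0, ..., d-1}.

    `score_fn` is a hypothetical placeholder mapping a data matrix to a
    scalar criterion value; the paper uses the Stein volume criterion (SVC).
    """
    d = X.shape[1]
    full_score = score_fn(X)
    # Score each foreground space that excludes one dimension of the data.
    loo_scores = {j: score_fn(np.delete(X, j, axis=1)) for j in range(d)}
    return full_score, loo_scores
```

Dimensions whose removal changes the score markedly are the ones flagged by data selection; the real criterion and its calibration (T = 0.05) are defined in the paper, not here.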
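The experiment setup fixes the kernel hyperparameters for the NKSD at β = 0.5 and c = 1. As background, the inverse multiquadric (IMQ) kernel underlying that choice has the standard form k(x, y) = (c² + ‖x − y‖²)^(−β); a minimal sketch (the paper uses a factored variant of this kernel, which is not reproduced here):

```python
import numpy as np

def imq_kernel(x, y, c=1.0, beta=0.5):
    """Inverse multiquadric kernel: (c**2 + ||x - y||**2) ** (-beta).

    Defaults match the hyperparameters reported in the experiment setup
    (beta = 0.5, c = 1).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (c**2 + np.sum((x - y) ** 2)) ** (-beta)
```

For example, with the default settings, identical inputs give k(x, x) = 1, and the kernel decays toward 0 as ‖x − y‖ grows.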