Shared Subspace Models for Multi-Group Covariance Estimation

Authors: Alexander M. Franks, Peter Hoff

JMLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 4 we investigate the behavior of this class of models in simulation and demonstrate how the shared subspace assumption is widely applicable, even when there is little similarity in the covariance matrices across groups. In Section 5 we use an asymptotic approximation to describe how shared subspace inference reduces bias when both p and n are large. Finally, in Section 6 we demonstrate the utility of a shared subspace model in an analysis of gene expression data from juvenile leukemia patients.
Researcher Affiliation | Academia | Alexander M. Franks, EMAIL, Department of Statistics and Applied Probability, University of California, Santa Barbara, Santa Barbara, CA 93106, USA; Peter Hoff, EMAIL, Department of Statistical Science, Duke University, Durham, NC 27708, USA
Pseudocode | Yes | Algorithm 1: Shared Subspace EM Algorithm; Algorithm 2: Gibbs Sampler for Projected Data Covariance Matrices
Open Source Code | Yes | A repository for the replication code is available on GitHub (Franks, 2016).
Open Datasets | Yes | We demonstrate the utility of the shared subspace covariance estimator for exploring differences in the covariability of gene expression levels in young adults with different subtypes of pediatric acute lymphoblastic leukemia (ALL) (Yeoh et al., 2002).
Dataset Splits | No | The paper reports the sample sizes of the seven groups, n = (15, 27, 64, 20, 43, 79, 79), but does not provide train/test/validation splits or a splitting methodology needed for reproduction in a machine learning context.
Hardware Specification | Yes | Together, the run time for the full empirical Bayes procedure (both algorithms) took less than 10 minutes on a 2017 MacBook Pro.
Software Dependencies | No | The paper mentions "Amelia, a software package for missing value imputation" and refers to GitHub for replication code, but does not specify version numbers for any software components used in the experimental setup.
Experiment Setup | Yes | In this simulation, we generate K = 5 groups of data from the shared subspace spiked covariance model with p = 20000 features, a shared subspace dimension of s = r = 2, σ_k² = 1, and n_k = 100. We fix the first eigenvalue of Ψ_k from each group to λ_1 = 1000 and vary λ_2. ... We apply the rank selection criteria discussed in Section 4.1 and proposed by Gavish and Donoho (2014) to the pooled expression data ... This procedure yields s = 45 dimensions. We run Algorithm 1 to estimate the shared subspace, and then use Bayesian inference (Algorithm 2) to identify differences between groups on the inferred subspace.
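
The simulation setup quoted in the Experiment Setup row can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' replication code. It assumes the shared subspace spiked covariance model has the form Σ_k = V Ψ_k Vᵀ + σ_k² I, with V a p × s matrix of orthonormal columns shared by all groups, and it assumes each group draws its own random eigenvectors for Ψ_k, which the quoted text does not specify. The function names `simulate_groups` and `gavish_donoho_rank` are hypothetical. The rank rule uses the polynomial approximation for the unknown-noise hard threshold from Gavish and Donoho (2014); whether the paper applies exactly this variant is also an assumption.

```python
import numpy as np


def random_orthonormal(p, s, rng):
    """Draw a p x s matrix with orthonormal columns via a QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((p, s)))
    return q


def simulate_groups(K=5, p=20000, n_k=100, eigvals=(1000.0, 100.0),
                    sigma2=1.0, seed=0):
    """Simulate K groups from Sigma_k = V Psi_k V^T + sigma2 * I (assumed form)."""
    rng = np.random.default_rng(seed)
    s = len(eigvals)
    V = random_orthonormal(p, s, rng)  # subspace shared by all groups
    groups = []
    for _ in range(K):
        # Assumption: each group has its own random eigenvectors inside the subspace.
        O_k = random_orthonormal(s, s, rng)
        # Latent scores with covariance Psi_k = O_k diag(eigvals) O_k^T.
        Z = (rng.standard_normal((n_k, s)) * np.sqrt(eigvals)) @ O_k.T
        noise = np.sqrt(sigma2) * rng.standard_normal((n_k, p))
        groups.append(Z @ V.T + noise)
    return V, groups


def gavish_donoho_rank(Y):
    """Count singular values above the Gavish-Donoho hard threshold
    for an unknown noise level (median-based approximation)."""
    n, p = Y.shape
    beta = min(n, p) / max(n, p)  # aspect ratio in (0, 1]
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    sv = np.linalg.svd(Y, compute_uv=False)
    return int(np.sum(sv > omega * np.median(sv)))


if __name__ == "__main__":
    # A smaller p than the quoted p = 20000 keeps this demo fast.
    V, groups = simulate_groups(p=2000)
    pooled = np.vstack(groups)  # pool all groups before selecting the rank
    print("estimated shared subspace dimension:", gavish_donoho_rank(pooled))
```

Pooling the groups before thresholding mirrors the quoted description of applying the rank selection to the pooled data. The second eigenvalue in `eigvals` plays the role of λ_2, which the simulation varies.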