reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Shared Subspace Models for Multi-Group Covariance Estimation

Authors: Alexander M. Franks, Peter Hoff

JMLR 2019 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In Section 4 we investigate the behavior of this class of models in simulation and demonstrate how the shared subspace assumption is widely applicable, even when there is little similarity in the covariance matrices across groups. In Section 5 we use an asymptotic approximation to describe how shared subspace inference reduces bias when both p and n are large. Finally, in Section 6 we demonstrate the utility of a shared subspace model in an analysis of gene expression data from juvenile leukemia patients.
Researcher Affiliation	Academia	Alexander M. Franks EMAIL Department of Statistics and Applied Probability University of California, Santa Barbara Santa Barbara, CA 93106, USA Peter Hoff EMAIL Department of Statistical Science Duke University Durham, NC 27708, USA
Pseudocode	Yes	Algorithm 1: Shared Subspace EM Algorithm; Algorithm 2: Gibbs Sampler for Projected Data Covariance Matrices
Open Source Code	Yes	A repository for the replication code is available on GitHub (Franks, 2016).
Open Datasets	Yes	We demonstrate the utility of the shared subspace covariance estimator for exploring differences in the covariability of gene expression levels in young adults with different subtypes of pediatric acute lymphoblastic leukemia (ALL) (Yeoh et al., 2002).
Dataset Splits	No	The paper mentions sample sizes for different groups: n = (15, 27, 64, 20, 43, 79, 79) but does not provide specific train/test/validation dataset splits or methodology needed for reproduction in a machine learning context.
Hardware Specification	Yes	Together, the run time for the full empirical Bayes procedure (both algorithms) took less than 10 minutes on a 2017 Macbook Pro.
Software Dependencies	No	The paper mentions "Amelia, a software package for missing value imputation" and refers to "GitHub" for replication code, but does not specify version numbers for any software components used in the experimental setup.
Experiment Setup	Yes	In this simulation, we generate K = 5 groups of data from the shared subspace spiked covariance model with p = 20000 features, a shared subspace dimension of s = r = 2, σ_k^2 = 1, and n_k = 100. We fix the first eigenvalue of Ψ_k from each group to λ_1 = 1000 and vary λ_2. ... We apply the rank selection criteria discussed in Section 4.1 and proposed by Gavish and Donoho (2014) to the pooled expression data ... This procedure yields s = 45 dimensions. We run Algorithm 1 to estimate the shared subspace, and then use Bayesian inference (Algorithm 2) to identify differences between groups on the inferred subspace.
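
The simulation described in the Experiment Setup row can be sketched roughly as follows. This is a minimal illustration, not the authors' replication code. It assumes the shared subspace spiked covariance has the form Σ_k = V Ψ_k Vᵀ + σ_k^2 I, with V a p × s orthonormal basis and Ψ_k = O_k diag(λ_1, λ_2) O_kᵀ for a group-specific rotation O_k. The function name `simulate_shared_subspace`, the random choice of V and O_k, and the default λ_2 = 10 are all hypothetical; the excerpt only says that λ_2 is varied.

```python
import numpy as np

def simulate_shared_subspace(K=5, p=20000, s=2, n_k=100,
                             lam=(1000.0, 10.0), sigma2=1.0, seed=0):
    """Draw K groups of n_k observations with covariance
    V Psi_k V^T + sigma2 * I (assumed model form; see lead-in)."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)

    # Shared p x s orthonormal basis: Q factor of a Gaussian matrix.
    V, _ = np.linalg.qr(rng.standard_normal((p, s)))

    groups = []
    for _ in range(K):
        # Group-specific rotation inside the shared subspace (assumption).
        O, _ = np.linalg.qr(rng.standard_normal((s, s)))
        # Square-root factor of Psi_k = O diag(lam) O^T, i.e. O diag(sqrt(lam)).
        psi_sqrt = O * np.sqrt(lam)

        Z = rng.standard_normal((n_k, s))   # low-dimensional signal
        E = rng.standard_normal((n_k, p))   # isotropic noise
        # Rows of Y have covariance V Psi_k V^T + sigma2 * I.
        Y = Z @ psi_sqrt.T @ V.T + np.sqrt(sigma2) * E
        groups.append(Y)
    return V, groups
```

Calling `simulate_shared_subspace(p=200)` gives a quick small-scale run; the defaults mirror the K, p, s, σ_k^2, n_k, and λ_1 values quoted in the table. Sampling through the s-dimensional factor avoids ever forming the p × p covariance matrix, which matters at p = 20000.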