Empirical Bayes Matrix Factorization
Authors: Wei Wang, Matthew Stephens
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. |
| Researcher Affiliation | Academia | Wei Wang EMAIL Department of Statistics University of Chicago Chicago, IL, USA Matthew Stephens EMAIL Department of Statistics and Department of Human Genetics University of Chicago Chicago, IL, USA |
| Pseudocode | Yes | Algorithm 1 Alternating Optimization for EBMF (rank 1) Algorithm 2 Streamlined Alternating Optimization for EBMF (rank 1) Algorithm 3 Single-factor update for EBMF (rank K) Algorithm 4 Greedy Algorithm for EBMF Algorithm 5 Backfitting algorithm for EBMF (rank K) |
| Open Source Code | Yes | Software implementing our approach is available at https://github.com/stephenslab/flashr. |
| Open Datasets | Yes | MovieLens 100K data, an (incomplete) 943 × 1682 matrix of user-movie ratings (integers from 1 to 5) (Harper and Konstan, 2016). GTEx eQTL summary data, a 16,069 × 44 matrix of Z scores computed testing association of genetic variants (rows) with gene expression in different human tissues (columns). These data come from the Genotype Tissue Expression (GTEx) project (Consortium et al., 2015). Brain Tumor data, a 43 × 356 matrix of gene expression measurements on 4 different types of brain tumor (included in the denoiseR package, Josse et al., 2018). Presidential address data, a 13 × 836 matrix of word counts from the inaugural addresses of 13 US presidents (1940-2009) (also included in the denoiseR package, Josse et al., 2018). Breast cancer data, a 251 × 226 matrix of gene expression measurements from Carvalho et al. (2008), which were used as an example in the paper introducing NBSFA (Knowles and Ghahramani, 2011). |
| Dataset Splits | Yes | We applied each method to all 5 data sets, using 10-fold OCV (Appendix B) to mask data points for imputation, repeated 20 times (with different random number seeds) for each data set. Generic k-fold CV involves randomly dividing the data matrix into k parts and then, for each part, training methods on the other k-1 parts before assessing error on that part, as in Algorithm 6. The novel part of OCV is in how to choose the hold-out pattern. We randomly divide the columns and rows into k sets, put these sets into k orthogonal parts, and then take all Y_ij with the chosen column and row indices as the hold-out Y^(i). |
| Hardware Specification | Yes | running our current implementation of the greedy algorithm on the GTEx data (a 16,000 by 44 matrix) takes about 140s (wall time) for G = PN and 650s for G = SN (on a 2015 MacBook Air with a 2.2 GHz Intel Core i7 processor and 8 GB RAM). |
| Software Dependencies | No | We have implemented Algorithms 2, 4 and 5 in an R package, flash ("factors and loadings via adaptive shrinkage"). One source of functions for solving the EBNM problem is the adaptive shrinkage (ashr) package... We have also implemented functions to solve the EBNM problem for additional choices of G in the package ebnm (https://github.com/stephenslab/ebnm). |
| Experiment Setup | Yes | Here, we use the softImpute function from the package softImpute (Mazumder et al., 2010), with penalty parameter λ = 0, which essentially performs SVD when Y is completely observed, but can also deal with missing values in Y. We simulated using three different levels of sparsity on the loadings, using π0 = 0.9, 0.3, 0. (We set the noise precision τ = 1, 1/16, 1/25 in these three cases to make each problem not too easy and not too hard.) All of the Bayesian methods (flash, SFA, SFAmix and NBSFA) are self-tuning, at least to some extent, and we applied them here with default values. The softImpute method has a single tuning parameter (λ, which controls the nuclear norm penalty), and we chose this penalty by orthogonal cross-validation (OCV; Appendix B). The PMD method can use two tuning parameters (one for l and one for f)... We used OCV to tune parameters in both cases, referring to the methods as PMD.cv2 (2 tuning parameters) and PMD.cv1 (1 tuning parameter). |
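The orthogonal hold-out pattern described under "Dataset Splits" can be sketched in a few lines. This is a hedged illustration, not the paper's flash/flashr implementation: the function name `ocv_folds`, the seed handling, and the use of `(row group + column group) mod k` to form the k orthogonal parts are assumptions about one reasonable reading of the OCV description (each fold masks cells so that every row and every column contributes hold-out entries, unlike masking whole rows or columns).

```python
import numpy as np

def ocv_folds(n_rows, n_cols, k, seed=0):
    """Sketch of an orthogonal cross-validation (OCV) hold-out pattern.

    Rows and columns are each randomly assigned to k groups; fold j then
    holds out every cell whose row group r and column group c satisfy
    (r + c) % k == j. The k masks are disjoint, cover the whole matrix,
    and each mask touches every row and every column (a Latin-square-like
    pattern), which is the "orthogonal" property.
    """
    rng = np.random.default_rng(seed)
    row_grp = rng.permutation(np.arange(n_rows) % k)  # random row groups
    col_grp = rng.permutation(np.arange(n_cols) % k)  # random column groups
    return [
        (row_grp[:, None] + col_grp[None, :]) % k == j  # boolean hold-out mask
        for j in range(k)
    ]

# Usage: each cell is held out in exactly one of the k folds.
folds = ocv_folds(12, 8, k=4)
counts = sum(f.astype(int) for f in folds)
assert counts.min() == counts.max() == 1
```

For imputation accuracy, each fold's mask would be applied to `Y` (setting masked entries to missing), the method fit on the remaining entries, and the error evaluated on the masked cells, as in the paper's Algorithm 6.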