Empirical Bayes Matrix Factorization
Authors: Wei Wang, Matthew Stephens
JMLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the benefits of EBMF with sparse priors through both numerical comparisons with competing methods and through analysis of data from the GTEx (Genotype Tissue Expression) project on genetic associations across 44 human tissues. In numerical comparisons EBMF often provides more accurate inferences than other methods. In the GTEx data, EBMF identifies interpretable structure that agrees with known relationships among human tissues. |
| Researcher Affiliation | Academia | Wei Wang EMAIL Department of Statistics University of Chicago Chicago, IL, USA Matthew Stephens EMAIL Department of Statistics and Department of Human Genetics University of Chicago Chicago, IL, USA |
| Pseudocode | Yes | Algorithm 1 Alternating Optimization for EBMF (rank 1) Algorithm 2 Streamlined Alternating Optimization for EBMF (rank 1) Algorithm 3 Single-factor update for EBMF (rank K) Algorithm 4 Greedy Algorithm for EBMF Algorithm 5 Backfitting algorithm for EBMF (rank K) |
| Open Source Code | Yes | Software implementing our approach is available at https://github.com/stephenslab/flashr. |
| Open Datasets | Yes | MovieLens 100K data, an (incomplete) 943 × 1682 matrix of user-movie ratings (integers from 1 to 5) (Harper and Konstan, 2016). GTEx eQTL summary data, a 16,069 × 44 matrix of Z scores computed testing association of genetic variants (rows) with gene expression in different human tissues (columns). These data come from the Genotype Tissue Expression (GTEx) project (Consortium et al., 2015). Brain Tumor data, a 43 × 356 matrix of gene expression measurements on 4 different types of brain tumor (included in the denoiseR package, Josse et al., 2018). Presidential address data, a 13 × 836 matrix of word counts from the inaugural addresses of 13 US presidents (1940-2009) (also included in the denoiseR package, Josse et al., 2018). Breast cancer data, a 251 × 226 matrix of gene expression measurements from Carvalho et al. (2008), which were used as an example in the paper introducing NBSFA (Knowles and Ghahramani, 2011). |
| Dataset Splits | Yes | We applied each method to all 5 data sets, using 10-fold OCV (Appendix B) to mask data points for imputation, repeated 20 times (with different random number seeds) for each data set. Generic k-fold CV involves randomly dividing the data matrix into k parts and then, for each part, training methods on the other k-1 parts before assessing error on that part, as in Algorithm 6. The novel part of OCV is in how to choose the hold-out pattern. We randomly divide the columns and rows into k sets, put these sets into k orthogonal parts, and then take all Y_ij with the chosen column and row indices as the hold-out Y^(i). |
| Hardware Specification | Yes | running our current implementation of the greedy algorithm on the GTEx data (a 16,000 by 44 matrix) takes about 140s (wall time) for G = PN and 650s for G = SN (on a 2015 MacBook Air with a 2.2 GHz Intel Core i7 processor and 8 GB RAM). |
| Software Dependencies | No | We have implemented Algorithms 2, 4 and 5 in an R package, flash ("factors and loadings via adaptive shrinkage"). One source of functions for solving the EBNM problem is the adaptive shrinkage (ashr) package... We have also implemented functions to solve the EBNM problem for additional choices of G in the package ebnm (https://github.com/stephenslab/ebnm). |
| Experiment Setup | Yes | Here, we use the softImpute function from the package softImpute (Mazumder et al., 2010), with penalty parameter λ = 0, which essentially performs SVD when Y is completely observed, but can also deal with missing values in Y. We simulated using three different levels of sparsity on the loadings, using π0 = 0.9, 0.3, 0. (We set the noise precision τ = 1, 1/16, 1/25 in these three cases to make each problem not too easy and not too hard.) All of the Bayesian methods (flash, SFA, SFAmix and NBSFA) are self-tuning, at least to some extent, and we applied them here with default values. The softImpute method has a single tuning parameter (λ, which controls the nuclear norm penalty), and we chose this penalty by orthogonal cross-validation (OCV; Appendix B). The PMD method can use two tuning parameters (one for l and one for f)... We used OCV to tune parameters in both cases, referring to the methods as PMD.cv2 (2 tuning parameters) and PMD.cv1 (1 tuning parameter). |
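The orthogonal hold-out pattern described under "Dataset Splits" can be sketched in a few lines. This is a hedged illustration, not the paper's flash/flashr implementation: the function name `ocv_folds`, the seed handling, and the use of `(row group + column group) mod k` to form the k orthogonal parts are assumptions about one reasonable reading of the OCV description (each fold masks cells so that every row and every column contributes hold-out entries, unlike masking whole rows or columns).

```python
import numpy as np

def ocv_folds(n_rows, n_cols, k, seed=0):
    """Sketch of an orthogonal cross-validation (OCV) hold-out pattern.

    Rows and columns are each randomly assigned to k groups; fold j then
    holds out every cell whose row group r and column group c satisfy
    (r + c) % k == j. The k masks are disjoint, cover the whole matrix,
    and each mask touches every row and every column (a Latin-square-like
    pattern), which is the "orthogonal" property.
    """
    rng = np.random.default_rng(seed)
    row_grp = rng.permutation(np.arange(n_rows) % k)  # random row groups
    col_grp = rng.permutation(np.arange(n_cols) % k)  # random column groups
    return [
        (row_grp[:, None] + col_grp[None, :]) % k == j  # boolean hold-out mask
        for j in range(k)
    ]

# Usage: each cell is held out in exactly one of the k folds.
folds = ocv_folds(12, 8, k=4)
counts = sum(f.astype(int) for f in folds)
assert counts.min() == counts.max() == 1
```

For imputation accuracy, each fold's mask would be applied to `Y` (setting masked entries to missing), the method fit on the remaining entries, and the error evaluated on the masked cells, as in the paper's Algorithm 6.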