Adaptive Randomized Dimension Reduction on Massive Data

Authors: Gregory Darnell, Stoyan Georgiev, Sayan Mukherjee, Barbara E Engelhardt

JMLR 2017

Reproducibility Variable Result LLM Response
Research Type: Experimental — These statistical and computational advantages are highlighted in our experiments on simulated data and large-scale genomic studies. We demonstrate on simulated and real data examples that the randomized estimators provide a computationally efficient solution, and, furthermore, often improve statistical accuracy of the predictions.
Researcher Affiliation: Collaboration — Gregory Darnell (EMAIL), Lewis-Sigler Institute, Princeton University, Princeton, NJ 08544, USA; Stoyan Georgiev (EMAIL), Google, Palo Alto, CA 94043, USA; Sayan Mukherjee (EMAIL), Departments of Statistical Science, Mathematics, and Computer Science, Duke University, Durham, NC 27708, USA; Barbara E Engelhardt (EMAIL), Department of Computer Science and Center for Statistics and Machine Learning, Princeton University, Princeton, NJ 08540, USA.
Pseudocode: Yes — 2.4.1 ALGORITHM FOR ARSVD: Given an upper bound on the target rank dmax and the number of necessary power iterations tmax (...), the algorithm proceeds in two stages: (1) estimate a basis for the range of Xd, (2) project the data onto this basis and apply SVD.

Algorithm: Adaptive Randomized SVD(X, tmax, dmax, α)
(1) Find an orthonormal basis for the range of X:
    (i) Set the number of working directions: ℓ = dmax + α;
    (ii) Generate a random matrix Ω ∈ R^{n×ℓ} with Ωij iid N(0, 1);
    (iii) Construct blocks F(t) = X Xᵀ F(t−1), with F(0) = Ω, for t ∈ {1, . . . , tmax};
    (iv) Select the optimal block t* ∈ {1, . . . , tmax} and rank estimate d* ∈ {1, . . . , dmax} using the stability criterion and Bi-Cross-Validation stated in Section 2.4.3;
    (v) Compute a basis for the selected block: F(t*) = QR, with Q ∈ R^{n×ℓ}, QᵀQ = I.
(2) Project the data onto the range basis and compute the SVD:
    (i) Project onto the basis: B = Xᵀ Q ∈ R^{p×ℓ};
    (ii) Factorize: B = U Σ Wᵀ, where Σ = diag(σ1, . . . , σℓ);
    (iii) Compute the rank-d* approximation X̂d* = Ud* Σd* Vd*ᵀ, where Ud* = Q (W1 | . . . | Wd*) ∈ R^{n×d*}, Σd* = diag(σ1, . . . , σd*) ∈ R^{d*×d*}, and Vd* = (U1 | . . . | Ud*) ∈ R^{p×d*}.
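The two-stage procedure quoted above can be sketched in NumPy. This is a minimal illustration, not the released arsvd package's API: the stability/Bi-Cross-Validation selection of (t*, d*) is out of scope here, so the sketch simply runs a fixed number of power iterations and truncates at dmax; all function and argument names are illustrative.

```python
import numpy as np

def arsvd_fixed(X, t_max=10, d_max=50, alpha=10, seed=0):
    """Randomized SVD following the two-stage structure above, with fixed
    t_max and d_max in place of the paper's stability/BCV selection of
    (t*, d*). Names and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ell = d_max + alpha                    # working directions: l = dmax + alpha
    F = rng.standard_normal((n, ell))      # F(0) = Omega, Omega_ij ~ N(0, 1)
    for _ in range(t_max):                 # power iterations: F(t) = X X^T F(t-1)
        F = X @ (X.T @ F)
        F, _ = np.linalg.qr(F)             # re-orthonormalize for numerical stability
    Q = F                                  # orthonormal basis for the range of X
    B = X.T @ Q                            # B in R^{p x l}
    U, s, Wt = np.linalg.svd(B, full_matrices=False)  # B = U Sigma W^T
    d = d_max
    # Rank-d factors of X ~= (Q W_d) Sigma_d U_d^T
    left = Q @ Wt[:d].T                    # n x d left singular vectors
    return left, s[:d], U[:, :d]

# Tiny usage check on an exactly low-rank matrix
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 200))
L, s, R = arsvd_fixed(X, t_max=5, d_max=5, alpha=10)
err = np.linalg.norm(X - (L * s) @ R.T) / np.linalg.norm(X)
```

Because the test matrix has exact rank 5 and ℓ = 15 working directions, the relative reconstruction error `err` is at machine-precision level.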
Open Source Code: Yes — The software for ARSVD is available at https://github.com/gdarnell/arsvd.
Open Datasets: Yes — We applied our LMM with ARSVD to a large genomic data set to illustrate that we can achieve considerable computational efficiency without loss in accuracy. In particular, we applied our method to the Wellcome Trust Case Control Consortium (WTCCC) data (Consortium, 2007).
Dataset Splits: No — The data we consider consist of a case-control study of 4,684 individuals. The cases are individuals with Crohn's disease, and the features are 478,765 genetic variants across the 22 autosomal chromosomes in the genome. ... In order to most accurately control for the test statistic computed in a LMM and to achieve maximal statistical power, it is suggested that a covariance matrix is constructed once per chromosome (22 times), holding out the test chromosome and concatenating the remaining chromosomes (Yang et al., 2014). This describes a data-processing strategy for specific per-chromosome analyses, not general train/test/validation splits. For simulated data, the generation process is described, but no explicit splits are mentioned.
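The leave-one-chromosome-out covariance construction described in that row can be sketched as follows. This is an assumed illustration of the strategy attributed to Yang et al. (2014), not the paper's code; the function name and the per-chromosome genotype layout are hypothetical.

```python
import numpy as np

def loco_kinship(geno_by_chrom):
    """For each test chromosome c, build an individuals-by-individuals
    covariance (kinship) matrix from the concatenated genotypes of all
    *other* chromosomes, i.e. leave-one-chromosome-out.
    `geno_by_chrom` maps chromosome id -> (individuals x variants) array."""
    kinships = {}
    for test_chrom in geno_by_chrom:
        # Concatenate variants from every chromosome except the test one
        G = np.hstack([g for c, g in geno_by_chrom.items() if c != test_chrom])
        # Standardize each variant (column); guard against zero variance
        G = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)
        kinships[test_chrom] = G @ G.T / G.shape[1]   # n x n covariance
    return kinships

# Toy usage: 3 chromosomes, 10 individuals, 20 variants each
rng = np.random.default_rng(0)
toy = {c: rng.integers(0, 3, size=(10, 20)).astype(float) for c in (1, 2, 3)}
K = loco_kinship(toy)
```

Each entry of `K` is a symmetric n-by-n matrix built without the corresponding test chromosome, so association tests on that chromosome are not confounded by its own variants.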
Hardware Specification: No — No specific hardware details (such as CPU/GPU models, processors, or memory) are provided in the paper; it reports computational speed and runtimes, but not the hardware on which they were achieved.
Software Dependencies: No — The software we implemented for the results in this paper, as well as our methodology, is based on EMMAX (Kang et al., 2010). Our LMM-with-ARSVD procedure uses EMMAX (pylmm) to solve the LMM. We compare the runtime ratio of our RSVD with a state-of-the-art low-rank approximation algorithm, the blocked Lanczos method implemented in the CRAN package irlba (Baglama and Reichel, 2006). The two methods we compared to are EMMAX with the addition of ARSVD, and GEMMA (Zhou et al., 2013); GEMMA is executed with the -lmm option to most closely approximate the analysis that pylmm performs. While these software packages are mentioned, specific version numbers are not provided for any of them.
Experiment Setup: Yes — In all our simulations, we set the oversampling parameter in ARSVD to α = 10. Given an upper bound on the target rank dmax and the number of necessary power iterations tmax (tmax ∈ {5, . . . , 10} is sufficient in most cases). We set the initial rank upper-bound estimate to 2d and used the stability-based method (Section 2.4.3) to estimate both the optimal t* and the corresponding d*. For the RSVD procedure we set the rank parameter to 100. We use α = 10 as a suggested default, and a small number (B = 5) of independent Gaussian random projections Ω(b) ∈ R^{n×dmax}.
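The stability-based selection is only summarized in this report (the exact criterion is in Section 2.4.3 of the paper). The sketch below shows one plausible reading of the repeated-projection idea: estimate singular values under B independent Gaussian projections Ω(b) and look for settings where the estimates agree across repetitions. The coefficient-of-variation measure and all names here are illustrative assumptions, not the paper's actual rule.

```python
import numpy as np

def singular_value_stability(X, t_max=10, d_max=20, alpha=10, B=5, seed=0):
    """Illustrative stability check (not the paper's exact criterion):
    repeat the randomized range-finding with B independent Gaussian
    projections Omega(b); for each power-iteration count t, report the
    coefficient of variation of each singular-value estimate across the
    B repetitions. Small values indicate stable estimates."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    ell = d_max + alpha
    estimates = np.zeros((t_max, B, d_max))
    F = [rng.standard_normal((n, ell)) for _ in range(B)]   # B copies of Omega(b)
    for t in range(t_max):
        for b in range(B):
            F[b] = X @ (X.T @ F[b])          # one power iteration
            Q, _ = np.linalg.qr(F[b])        # orthonormal basis for this block
            F[b] = Q
            s = np.linalg.svd(X.T @ Q, compute_uv=False)
            estimates[t, b] = s[:d_max]
    cv = estimates.std(axis=1) / (estimates.mean(axis=1) + 1e-12)
    return cv   # shape (t_max, d_max); row t, column d-1

# Toy usage on a rank-6 matrix
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 100))
cv = singular_value_stability(X, t_max=3, d_max=5, alpha=10, B=3)
```

A selection rule would then pick a (t, d) region where `cv` is uniformly small; ranks beyond the true signal rank typically show much larger variability across the B projections.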