Adaptive Randomized Dimension Reduction on Massive Data

Authors: Gregory Darnell, Stoyan Georgiev, Sayan Mukherjee, Barbara E Engelhardt

JMLR 2017

Reproducibility Variable Result LLM Response
Research Type: Experimental — These statistical and computational advantages are highlighted in our experiments on simulated data and large-scale genomic studies. We demonstrate on simulated and real data examples that the randomized estimators provide a computationally efficient solution, and, furthermore, often improve statistical accuracy of the predictions.
Researcher Affiliation: Collaboration — Gregory Darnell (EMAIL), Lewis-Sigler Institute, Princeton University, Princeton, NJ 08544, USA; Stoyan Georgiev (EMAIL), Google, Palo Alto, CA 94043, USA; Sayan Mukherjee (EMAIL), Departments of Statistical Science, Mathematics, and Computer Science, Duke University, Durham, NC 27708, USA; Barbara E Engelhardt (EMAIL), Department of Computer Science and Center for Statistics and Machine Learning, Princeton University, Princeton, NJ 08540, USA.
Pseudocode: Yes — 2.4.1 ALGORITHM FOR ARSVD: Given an upper bound on the target rank dmax and the number of necessary power iterations tmax (...), the algorithm proceeds in two stages: (1) estimate a basis for the range of Xd, (2) project the data onto this basis and apply SVD.

Algorithm: Adaptive Randomized SVD(X, tmax, dmax, α)
(1) Find an orthonormal basis for the range of X:
    (i) Set the number of working directions: ℓ = dmax + α;
    (ii) Generate a random matrix Ω ∈ R^{n×ℓ} with Ωij iid N(0, 1);
    (iii) Construct blocks F(t) = X Xᵀ F(t−1), with F(0) = Ω, for t ∈ {1, . . . , tmax};
    (iv) Select the optimal block t* ∈ {1, . . . , tmax} and rank estimate d* ∈ {1, . . . , dmax} using the stability criterion and Bi-Cross-Validation stated in Section 2.4.3;
    (v) Compute a basis for the selected block: F(t*) = QR, with Q ∈ R^{n×ℓ}, QᵀQ = I.
(2) Project the data onto the range basis and compute the SVD:
    (i) Project onto the basis: B = Xᵀ Q ∈ R^{p×ℓ};
    (ii) Factorize: B = U Σ Wᵀ, where Σ = diag(σ1, . . . , σℓ);
    (iii) Compute the rank-d* approximation X̂d* = Ud* Σd* Vd*ᵀ, where Ud* = Q (W1 | . . . | Wd*) ∈ R^{n×d*}, Σd* = diag(σ1, . . . , σd*) ∈ R^{d*×d*}, and Vd* = (U1 | . . . | Ud*) ∈ R^{p×d*}.
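The two-stage procedure quoted above can be sketched in NumPy. This is a minimal illustration, not the released arsvd package's API: the stability/Bi-Cross-Validation selection of (t*, d*) is out of scope here, so the sketch simply runs a fixed number of power iterations and truncates at dmax; all function and argument names are illustrative.

```python
import numpy as np

def arsvd_fixed(X, t_max=10, d_max=50, alpha=10, seed=0):
    """Randomized SVD following the two-stage structure above, with fixed
    t_max and d_max in place of the paper's stability/BCV selection of
    (t*, d*). Names and defaults are illustrative."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    ell = d_max + alpha                    # working directions: l = dmax + alpha
    F = rng.standard_normal((n, ell))      # F(0) = Omega, Omega_ij ~ N(0, 1)
    for _ in range(t_max):                 # power iterations: F(t) = X X^T F(t-1)
        F = X @ (X.T @ F)
        F, _ = np.linalg.qr(F)             # re-orthonormalize for numerical stability
    Q = F                                  # orthonormal basis for the range of X
    B = X.T @ Q                            # B in R^{p x l}
    U, s, Wt = np.linalg.svd(B, full_matrices=False)  # B = U Sigma W^T
    d = d_max
    # Rank-d factors of X ~= (Q W_d) Sigma_d U_d^T
    left = Q @ Wt[:d].T                    # n x d left singular vectors
    return left, s[:d], U[:, :d]

# Tiny usage check on an exactly low-rank matrix
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 200))
L, s, R = arsvd_fixed(X, t_max=5, d_max=5, alpha=10)
err = np.linalg.norm(X - (L * s) @ R.T) / np.linalg.norm(X)
```

Because the test matrix has exact rank 5 and ℓ = 15 working directions, the relative reconstruction error `err` is at machine-precision level.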
Open Source Code: Yes — The software for ARSVD is available at https://github.com/gdarnell/arsvd.
Open Datasets: Yes — We applied our LMM with ARSVD to a large genomic data set to illustrate that we can achieve considerable computational efficiency without loss in accuracy. In particular, we applied our method to the Wellcome Trust Case Control Consortium (WTCCC) data (Consortium, 2007).
Dataset Splits: No — The data we consider consist of a case-control study of 4,684 individuals. The cases are individuals with Crohn's disease, and the features are 478,765 genetic variants across the 22 autosomal chromosomes in the genome. ... In order to most accurately control for the test statistic computed in a LMM and to achieve maximal statistical power, it is suggested that a covariance matrix is constructed once per chromosome (22 times), holding out the test chromosome and concatenating the remaining chromosomes (Yang et al., 2014). This describes a data-processing strategy for specific per-chromosome analyses, not general train/test/validation splits. For simulated data, the generation process is described, but no explicit splits are mentioned.
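The leave-one-chromosome-out covariance construction described in that row can be sketched as follows. This is an assumed illustration of the strategy attributed to Yang et al. (2014), not the paper's code; the function name and the per-chromosome genotype layout are hypothetical.

```python
import numpy as np

def loco_kinship(geno_by_chrom):
    """For each test chromosome c, build an individuals-by-individuals
    covariance (kinship) matrix from the concatenated genotypes of all
    *other* chromosomes, i.e. leave-one-chromosome-out.
    `geno_by_chrom` maps chromosome id -> (individuals x variants) array."""
    kinships = {}
    for test_chrom in geno_by_chrom:
        # Concatenate variants from every chromosome except the test one
        G = np.hstack([g for c, g in geno_by_chrom.items() if c != test_chrom])
        # Standardize each variant (column); guard against zero variance
        G = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-12)
        kinships[test_chrom] = G @ G.T / G.shape[1]   # n x n covariance
    return kinships

# Toy usage: 3 chromosomes, 10 individuals, 20 variants each
rng = np.random.default_rng(0)
toy = {c: rng.integers(0, 3, size=(10, 20)).astype(float) for c in (1, 2, 3)}
K = loco_kinship(toy)
```

Each entry of `K` is a symmetric n-by-n matrix built without the corresponding test chromosome, so association tests on that chromosome are not confounded by its own variants.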
Hardware Specification: No — No specific hardware details (such as CPU/GPU models, processors, or memory) are provided in the paper; it reports computational speed and runtimes, but not the hardware on which they were achieved.
Software Dependencies: No — The software we implemented for the results in this paper, as well as our methodology, is based on EMMAX (Kang et al., 2010). Our LMM-with-ARSVD procedure uses EMMAX (pylmm) to solve the LMM. We compare the runtime ratio of our RSVD with a state-of-the-art low-rank approximation algorithm, the blocked Lanczos method implemented in the CRAN package irlba (Baglama and Reichel, 2006). The two methods we compared to are EMMAX with the addition of ARSVD, and GEMMA (Zhou et al., 2013); GEMMA is executed with the -lmm option to most closely approximate the analysis that pylmm performs. While these software packages are mentioned, specific version numbers are not provided for any of them.
Experiment Setup: Yes — In all our simulations, we set the oversampling parameter in ARSVD to α = 10. Given an upper bound on the target rank dmax and the number of necessary power iterations tmax (tmax ∈ {5, . . . , 10} is sufficient in most cases). We set the initial rank upper-bound estimate to 2d and used the stability-based method (Section 2.4.3) to estimate both the optimal t* and the corresponding d*. For the RSVD procedure we set the rank parameter to 100. We use α = 10 as a suggested default, and a small number (B = 5) of independent Gaussian random projections Ω(b) ∈ R^{n×dmax}.
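The stability-based selection is only summarized in this report (the exact criterion is in Section 2.4.3 of the paper). The sketch below shows one plausible reading of the repeated-projection idea: estimate singular values under B independent Gaussian projections Ω(b) and look for settings where the estimates agree across repetitions. The coefficient-of-variation measure and all names here are illustrative assumptions, not the paper's actual rule.

```python
import numpy as np

def singular_value_stability(X, t_max=10, d_max=20, alpha=10, B=5, seed=0):
    """Illustrative stability check (not the paper's exact criterion):
    repeat the randomized range-finding with B independent Gaussian
    projections Omega(b); for each power-iteration count t, report the
    coefficient of variation of each singular-value estimate across the
    B repetitions. Small values indicate stable estimates."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    ell = d_max + alpha
    estimates = np.zeros((t_max, B, d_max))
    F = [rng.standard_normal((n, ell)) for _ in range(B)]   # B copies of Omega(b)
    for t in range(t_max):
        for b in range(B):
            F[b] = X @ (X.T @ F[b])          # one power iteration
            Q, _ = np.linalg.qr(F[b])        # orthonormal basis for this block
            F[b] = Q
            s = np.linalg.svd(X.T @ Q, compute_uv=False)
            estimates[t, b] = s[:d_max]
    cv = estimates.std(axis=1) / (estimates.mean(axis=1) + 1e-12)
    return cv   # shape (t_max, d_max); row t, column d-1

# Toy usage on a rank-6 matrix
rng = np.random.default_rng(2)
X = rng.standard_normal((40, 6)) @ rng.standard_normal((6, 100))
cv = singular_value_stability(X, t_max=3, d_max=5, alpha=10, B=3)
```

A selection rule would then pick a (t, d) region where `cv` is uniformly small; ranks beyond the true signal rank typically show much larger variability across the B projections.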