reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Recovering PCA and Sparse PCA via Hybrid-(l1,l2) Sparse Sampling of Data Elements

Authors: Abhisek Kundu, Petros Drineas, Malik Magdon-Ismail

JMLR 2017 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimental results on synthetic, image, text, biological, and financial data show that not only are we able to recover PCA and sparse PCA from incomplete data, but we can speed up such computations significantly using our sparse sketch.
Researcher Affiliation	Collaboration	Abhisek Kundu EMAIL Intel Parallel Computing Labs Intel Tech (I) Pvt Ltd, Devarabeesanhalli, Outer Ring Road Bangalore, 560103, India; Petros Drineas EMAIL Computer Science Purdue University West Lafayette, IN 47907, USA; Malik Magdon-Ismail EMAIL Computer Science Rensselaer Polytechnic Institute Troy, NY 12180, USA
Pseudocode	Yes	Algorithm 1: Element-wise Matrix Sparsification; Algorithm 2: Approximation of PCA from Data Samples; Algorithm 3: One-pass hybrid sampling; Algorithm 4: Estimating α from Samples; Appendix F: SELECT-s Algorithm
Open Source Code	No	The paper does not explicitly provide a link to open-source code or state that code is made available. The license is for the paper itself, not necessarily the implementation.
Open Datasets	Yes	Tech TC Datasets: (Gabrilovich and Markovitch 2004) ... Digit Data: (Hull 1994) ... Gene Expression Data: We use GSE10072 gene expression data for lung cancer from NCBI Gene Expression Omnibus database.
Dataset Splits	No	The paper mentions using various datasets but does not provide specific training/test/validation splits, percentages, or sample counts for reproduction.
Hardware Specification	No	The paper mentions computational time and performance comparisons using MATLAB functions but does not specify the hardware (CPU, GPU, memory, etc.) on which the experiments were run.
Software Dependencies	No	The paper mentions using "MATLAB function svds(A,k)" and "Spasm toolbox of Sjöstrand et al. (2012)" but does not provide specific version numbers for MATLAB or the Spasm toolbox, which would be necessary for reproducibility.
Experiment Setup	Yes	Table 1 summarizes α for various data sets. Achlioptas et al. (2013a) argued that, for rs0 > rs1, ℓ1 sampling is better than ℓ2 (even with truncation). Our results on α in Table 1 reproduce this condition (α = 1 implies ℓ1). Moreover, our method can derive the right blend of ℓ1 and ℓ2 sampling even when the above condition fails. In this sense, we generalize the results of Achlioptas et al. (2013a).
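
The role of α in the Experiment Setup row can be illustrated with a minimal sketch. It assumes the blended element-wise sampling probability takes the form p_ij = α·|A_ij|/‖A‖₁ + (1−α)·A_ij²/‖A‖_F², so that α = 1 gives pure ℓ1 sampling and α = 0 gives pure ℓ2 sampling. This formula, the function names hybrid_probabilities and sparse_sketch, and the rescaling step are illustrative assumptions; they are not quoted from this page and are not the authors' MATLAB implementation.

```python
import random


def hybrid_probabilities(A, alpha):
    """Blend l1 and l2 element-wise sampling probabilities (assumed form).

    A is a dense matrix given as a list of rows; alpha lies in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1 = sum(abs(x) for row in A for x in row)   # entrywise l1 norm
    l2 = sum(x * x for row in A for x in row)    # squared Frobenius norm
    if l1 == 0:
        raise ValueError("matrix has no nonzero entries")
    return [
        [alpha * abs(x) / l1 + (1.0 - alpha) * x * x / l2 for x in row]
        for row in A
    ]


def sparse_sketch(A, alpha, s, seed=0):
    """Draw s entries i.i.d. with replacement and rescale each draw.

    Each sampled entry contributes A_ij / (s * p_ij), so the expected
    value of the sketch equals A entry by entry.
    """
    rng = random.Random(seed)
    P = hybrid_probabilities(A, alpha)
    # Zero entries have probability zero under both norms, so skip them.
    cells = [(i, j) for i, row in enumerate(A)
             for j, x in enumerate(row) if x != 0]
    weights = [P[i][j] for i, j in cells]
    sketch = {}
    for i, j in rng.choices(cells, weights=weights, k=s):
        sketch[(i, j)] = sketch.get((i, j), 0.0) + A[i][j] / (s * P[i][j])
    return sketch
```

For example, sparse_sketch([[1, -2], [0, 3]], alpha=0.5, s=10) returns a dictionary keyed by (row, column) that contains only positions where the input is nonzero. Larger α shifts sampling mass toward small-magnitude entries relative to pure ℓ2 sampling.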