Improved Distributed Principal Component Analysis
Authors: Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, David Woodruff
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. |
| Researcher Affiliation | Collaboration | Maria-Florina Balcan, School of Computer Science, Carnegie Mellon University (EMAIL); Vandana Kanchanapally, School of Computer Science, Georgia Institute of Technology (EMAIL); Yingyu Liang, Department of Computer Science, Princeton University (EMAIL); David Woodruff, Almaden Research Center, IBM Research (EMAIL) |
| Pseudocode | Yes | Algorithm 1 (Distributed k-means clustering); Algorithm 2 (Fast Distributed PCA for l2-Error Fitting) |
| Open Source Code | No | The paper does not provide any statement or link indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | We choose the following real world datasets from UCI repository [1] for our experiments. For low rank approximation and k-means clustering, we choose two medium size datasets, NewsGroups (18774 × 61188) and MNIST (70000 × 784), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) (300000 × 102660) and PubMed abstracts (BOWpubmed) (8200000 × 141043). We use r = 10 for rank-r approximation and k = 10 for k-means clustering. For PCR, we use MNIST and further choose YearPredictionMSD (515345 × 90), CTslices (53500 × 386), and a large dataset MNIST8m (800000 × 784). |
| Dataset Splits | No | The paper mentions datasets used but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper discusses the number of nodes (servers) in the distributed setting (e.g., 's = 25 for medium-size datasets, and s = 100 for the larger ones') but does not provide specific details about the hardware used for the experiments (e.g., CPU/GPU models, memory). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | The number of nodes is s = 25 for medium-size datasets, and s = 100 for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is distributed to the nodes with probability proportional to the node's weight, chosen from the power law with parameter α = 2. For each projection dimension, we first construct the projected data using distributed PCA... For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported. |
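The weighted partition in the setup row can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the report only says points go to nodes "with probability proportional to the node's weight chosen from the power law with parameter α = 2", so the exact weight recipe below (weight of node i proportional to i^(-α)) is an assumption.

```python
import numpy as np

# Hypothetical sketch of the weighted data partition from the
# experiment setup. Assumption: node i gets weight i^(-alpha),
# alpha = 2; the paper does not spell out the exact recipe.
rng = np.random.default_rng(0)

def partition_points(n_points, s, alpha=2.0):
    """Assign each of n_points to one of s nodes, with probability
    proportional to a power-law node weight."""
    weights = np.arange(1, s + 1, dtype=float) ** (-alpha)
    probs = weights / weights.sum()
    return rng.choice(s, size=n_points, p=probs)

# MNIST-sized data over s = 25 nodes, as in the medium-size runs.
assignments = partition_points(70000, 25)
```

With α = 2 the first node receives the bulk of the points, so the partition is heavily skewed rather than balanced, which is presumably the point of the stress test.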
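The "distributed PCA" step referenced in the setup (and the pseudocode row's Algorithm 2) follows the standard communicate-a-summary pattern. The sketch below is a generic version of that pattern, not the paper's exact Algorithm 2: each node sends a rank-t local SVD summary to a coordinator, which runs PCA on the stacked summaries; the names `local_summary` and `distributed_pca` are my own.

```python
import numpy as np

def local_summary(A_i, t):
    # Each node computes a rank-t summary of its local block A_i
    # (rows = points): the top-t right singular vectors scaled by
    # their singular values. Only this t x d matrix is communicated.
    U, S, Vt = np.linalg.svd(A_i, full_matrices=False)
    return S[:t, None] * Vt[:t]

def distributed_pca(local_blocks, t, r):
    # The coordinator stacks the s summaries (an s*t x d matrix)
    # and takes its top-r right singular vectors as the global
    # rank-r projection.
    stacked = np.vstack([local_summary(A_i, t) for A_i in local_blocks])
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:r]

# Toy check: 3 nodes, 100 points each in 20 dimensions, rank-10 target
# (mirroring the r = 10 choice in the experiments).
blocks = [np.random.default_rng(i).standard_normal((100, 20)) for i in range(3)]
V = distributed_pca(blocks, t=15, r=10)
```

Communication per node is t·d numbers instead of the full local block, which is where the orders-of-magnitude savings claimed in the abstract come from.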