Improved Distributed Principal Component Analysis
Authors: Yingyu Liang, Maria-Florina Balcan, Vandana Kanchanapally, David Woodruff
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality. |
| Researcher Affiliation | Collaboration | Maria-Florina Balcan, School of Computer Science, Carnegie Mellon University (EMAIL); Vandana Kanchanapally, School of Computer Science, Georgia Institute of Technology (EMAIL); Yingyu Liang, Department of Computer Science, Princeton University (EMAIL); David Woodruff, Almaden Research Center, IBM Research (EMAIL) |
| Pseudocode | Yes | Algorithm 1 (Distributed k-means clustering); Algorithm 2 (Fast Distributed PCA for l2-Error Fitting) |
| Open Source Code | No | The paper does not provide any statement or link indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | We choose the following real world datasets from UCI repository [1] for our experiments. For low rank approximation and k-means clustering, we choose two medium size datasets, NewsGroups (18774 × 61188) and MNIST (70000 × 784), and two large-scale Bag-of-Words datasets: NYTimes news articles (BOWnytimes) (300000 × 102660) and PubMed abstracts (BOWpubmed) (8200000 × 141043). We use r = 10 for rank-r approximation and k = 10 for k-means clustering. For PCR, we use MNIST and further choose YearPredictionMSD (515345 × 90), CTslices (53500 × 386), and a large dataset MNIST8m (800000 × 784). |
| Dataset Splits | No | The paper mentions datasets used but does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper discusses the number of nodes (servers) in the distributed setting (e.g., 's = 25 for medium-size datasets, and s = 100 for the larger ones') but does not provide specific details about the hardware used for the experiments (e.g., CPU/GPU models, memory). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers needed to replicate the experiments. |
| Experiment Setup | Yes | The number of nodes is s = 25 for medium-size datasets, and s = 100 for the larger ones. We distribute the data over the nodes using a weighted partition, where each point is distributed to the nodes with probability proportional to the node's weight, chosen from the power law with parameter α = 2. For each projection dimension, we first construct the projected data using distributed PCA... For each projection dimension and each algorithm with randomness, the average ratio over 5 runs is reported. |
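The weighted partition in the setup row can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the report only says points go to nodes "with probability proportional to the node's weight chosen from the power law with parameter α = 2", so the exact weight recipe below (weight of node i proportional to i^(-α)) is an assumption.

```python
import numpy as np

# Hypothetical sketch of the weighted data partition from the
# experiment setup. Assumption: node i gets weight i^(-alpha),
# alpha = 2; the paper does not spell out the exact recipe.
rng = np.random.default_rng(0)

def partition_points(n_points, s, alpha=2.0):
    """Assign each of n_points to one of s nodes, with probability
    proportional to a power-law node weight."""
    weights = np.arange(1, s + 1, dtype=float) ** (-alpha)
    probs = weights / weights.sum()
    return rng.choice(s, size=n_points, p=probs)

# MNIST-sized data over s = 25 nodes, as in the medium-size runs.
assignments = partition_points(70000, 25)
```

With α = 2 the first node receives the bulk of the points, so the partition is heavily skewed rather than balanced, which is presumably the point of the stress test.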
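The "distributed PCA" step referenced in the setup (and the pseudocode row's Algorithm 2) follows the standard communicate-a-summary pattern. The sketch below is a generic version of that pattern, not the paper's exact Algorithm 2: each node sends a rank-t local SVD summary to a coordinator, which runs PCA on the stacked summaries; the names `local_summary` and `distributed_pca` are my own.

```python
import numpy as np

def local_summary(A_i, t):
    # Each node computes a rank-t summary of its local block A_i
    # (rows = points): the top-t right singular vectors scaled by
    # their singular values. Only this t x d matrix is communicated.
    U, S, Vt = np.linalg.svd(A_i, full_matrices=False)
    return S[:t, None] * Vt[:t]

def distributed_pca(local_blocks, t, r):
    # The coordinator stacks the s summaries (an s*t x d matrix)
    # and takes its top-r right singular vectors as the global
    # rank-r projection.
    stacked = np.vstack([local_summary(A_i, t) for A_i in local_blocks])
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    return Vt[:r]

# Toy check: 3 nodes, 100 points each in 20 dimensions, rank-10 target
# (mirroring the r = 10 choice in the experiments).
blocks = [np.random.default_rng(i).standard_normal((100, 20)) for i in range(3)]
V = distributed_pca(blocks, t=15, r=10)
```

Communication per node is t·d numbers instead of the full local block, which is where the orders-of-magnitude savings claimed in the abstract come from.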