Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Dimensionality Reduction of Massive Sparse Datasets Using Coresets

Authors: Dan Feldman, Mikhail Volkov, Daniela Rus

NeurIPS 2016 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We implemented our algorithm to obtain a low-rank approximation for the term-document matrix of Wikipedia with provable error bounds. Since our streaming algorithm is also embarrassingly parallel we run it on Amazon Cloud, and receive a significantly better running time and accuracy compared to existing heuristics (e.g. Hadoop/Map Reduce) that yield non-sparse solutions."
Researcher Affiliation | Academia | Dan Feldman (University of Haifa, Haifa, Israel), Mikhail Volkov (CSAIL, MIT, Cambridge, MA, USA), Daniela Rus (CSAIL, MIT, Cambridge, MA, USA)
Pseudocode | Yes | "Fig. 1a contains the pseudocode for Algorithm 1. [...] Fig. 5 contains the pseudocode for Algorithm 2."
Open Source Code | Yes | "Our project codebase is open-sourced and can be found here: http://people.csail.mit.edu/mikhail/NIPS2016."
Open Datasets | Yes | "For our large-scale grand challenge experiment, we apply our algorithm for computing Latent Semantic Analysis (LSA) on the entire English Wikipedia."
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits, percentages, or a detailed splitting methodology.
Hardware Specification | No | The paper mentions running on "Amazon Cloud" and that existing SVD implementations "crash on a regular laptop" but does not provide specific hardware details (e.g., CPU/GPU models, memory sizes).
Software Dependencies | No | "The coreset construction algorithm described in Section 5 was implemented in MATLAB. We make use of the redsvd package [12] to improve performance, but it is not required to run the system." (No version numbers are given for MATLAB or redsvd.)
Experiment Setup | Yes | "We specify a nominal error of ε = 0.5, which is a theoretical upper bound for N = 2k/ε iterations, and show that the coreset error remains bounded. [...] For our synthetic data experiments, we used a moderate size sparse input of (5000 × 1000) to evaluate the relationship between the error ε and the number of iterations of the algorithm N."
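The experiment-setup row quotes the paper's bound of N = 2k/ε iterations for a nominal error ε. A minimal sketch of that relationship, assuming only the formula as stated (the function name and the rounding to a whole number of iterations are ours, not the paper's):

```python
import math

def iteration_bound(k: int, eps: float) -> int:
    """Theoretical upper bound N = 2k / eps on the number of
    iterations, per the paper's stated experiment setup.
    Rounded up since N counts discrete iterations (our choice)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.ceil(2 * k / eps)

# With the paper's nominal error eps = 0.5 and, say, a rank-10
# projection, the bound works out to 2 * 10 / 0.5 = 40 iterations.
print(iteration_bound(10, 0.5))  # -> 40
```

Note the inverse relationship: halving ε doubles the iteration bound, which is what the synthetic (5000 × 1000) experiments quoted above are probing.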