Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Dimensionality Reduction of Massive Sparse Datasets Using Coresets

Authors: Dan Feldman, Mikhail Volkov, Daniela Rus

NeurIPS 2016 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We implemented our algorithm to obtain a low-rank approximation for the term-document matrix of Wikipedia with provable error bounds. Since our streaming algorithm is also embarrassingly parallel we run it on Amazon Cloud, and receive a significantly better running time and accuracy compared to existing heuristics (e.g. Hadoop/Map Reduce) that yield non-sparse solutions."
Researcher Affiliation | Academia | Dan Feldman (University of Haifa, Haifa, Israel), Mikhail Volkov (CSAIL, MIT, Cambridge, MA, USA), Daniela Rus (CSAIL, MIT, Cambridge, MA, USA)
Pseudocode | Yes | "Fig. 1a contains the pseudocode for Algorithm 1. [...] Fig. 5 contains the pseudocode for Algorithm 2."
Open Source Code | Yes | "Our project codebase is open-sourced and can be found here: http://people.csail.mit.edu/mikhail/NIPS2016."
Open Datasets | Yes | "For our large-scale grand challenge experiment, we apply our algorithm for computing Latent Semantic Analysis (LSA) on the entire English Wikipedia."
Dataset Splits | No | The paper does not specify explicit train/validation/test dataset splits, percentages, or a detailed splitting methodology.
Hardware Specification | No | The paper mentions running on "Amazon Cloud" and that existing SVD implementations "crash on a regular laptop" but does not provide specific hardware details (e.g., CPU/GPU models, memory sizes).
Software Dependencies | No | "The coreset construction algorithm described in Section 5 was implemented in MATLAB. We make use of the redsvd package [12] to improve performance, but it is not required to run the system." (No version numbers are given for MATLAB or redsvd.)
Experiment Setup | Yes | "We specify a nominal error of ε = 0.5, which is a theoretical upper bound for N = 2k/ε iterations, and show that the coreset error remains bounded. [...] For our synthetic data experiments, we used a moderate size sparse input of (5000 × 1000) to evaluate the relationship between the error ε and the number of iterations of the algorithm N."
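The experiment-setup row quotes the paper's bound of N = 2k/ε iterations for a nominal error ε. A minimal sketch of that relationship, assuming only the formula as stated (the function name and the rounding to a whole number of iterations are ours, not the paper's):

```python
import math

def iteration_bound(k: int, eps: float) -> int:
    """Theoretical upper bound N = 2k / eps on the number of
    iterations, per the paper's stated experiment setup.
    Rounded up since N counts discrete iterations (our choice)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.ceil(2 * k / eps)

# With the paper's nominal error eps = 0.5 and, say, a rank-10
# projection, the bound works out to 2 * 10 / 0.5 = 40 iterations.
print(iteration_bound(10, 0.5))  # -> 40
```

Note the inverse relationship: halving ε doubles the iteration bound, which is what the synthetic (5000 × 1000) experiments quoted above are probing.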