Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds
Authors: Shusen Wang, Alex Gittens, Michael W. Mahoney
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on the 8.1 million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel k-means clustering with Nyström approximation. ... The superior performance of the latter approach is empirically verified. |
| Researcher Affiliation | Academia | Shusen Wang EMAIL Department of Computer Science Stevens Institute of Technology Hoboken, NJ 07030, USA Alex Gittens EMAIL Computer Science Department Rensselaer Polytechnic Institute Troy, NY 12180, USA Michael W. Mahoney EMAIL International Computer Science Institute and Department of Statistics University of California at Berkeley Berkeley, CA 94720, USA |
| Pseudocode | Yes | Algorithm 1: Approximate Kernel k-Means using the Power Method. Algorithm 2: Approximate Kernel k-Means Clustering using Nyström Approximation. Algorithm 3: Distributed Approximate Kernel k-Means Clustering using Nyström Approximation. |
| Open Source Code | Yes | This implementation is available at https://github.com/wangshusen/Spark Kernel KMeans.git. |
| Open Datasets | Yes | Empirical evaluations on the 8.1 million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel k-means clustering with Nyström approximation. ... We used three classification data sets, described in Table 3. The data sets used are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. ... MNIST (LeCun et al., 1998): 60,000 instances, 780 features, 10 classes; Mushrooms (Frank and Asuncion, 2010): 8,124 instances, 112 features, 2 classes; Pen Digits (Frank and Asuncion, 2010): 7,494 instances, 16 features, 10 classes. |
| Dataset Splits | No | The paper uses datasets like MNIST8M, Mushrooms, and Pen Digits, stating their sizes and number of features/clusters. However, for these clustering tasks, it does not specify explicit training/validation/test splits as would be typical for supervised learning, nor does it refer to predefined splits for reproduction purposes. The experiments appear to use the full datasets for clustering and evaluation. |
| Hardware Specification | Yes | We ran our experiments on Cori Phase I, a NERSC supercomputer, located at Lawrence Berkeley National Laboratory. Cori Phase I is a Cray XC40 system with 1632 compute nodes, each of which has two 2.3GHz 16-core Haswell processors and 128GB of DRAM. |
| Software Dependencies | No | We implemented Algorithm 3 in the Apache Spark framework (Zaharia et al., 2010, 2012), using the Scala API. We computed the Nyström approximation using the matrix operations provided by Spark, and invoked the MLlib library for machine learning in Spark (Meng et al., 2016) to perform the dimensionality reduction and linear k-means clustering steps. While Spark and MLlib are mentioned, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For the linear k-means clustering, we set the maximum number of iterations to 100. We chose the RBF kernel width parameter according to (5) with β = 1.0. |
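The core idea the paper evaluates, approximating the RBF kernel matrix via Nyström sampling and then running linear k-means on the resulting features, can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' distributed Spark/Scala implementation; the function names, the landmark count `s`, and the uniform sampling scheme are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """RBF (Gaussian) kernel matrix between rows of X and rows of Y."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_features(X, s, sigma, rng):
    """Nyström feature map: returns F such that F @ F.T approximates
    the full n-by-n RBF kernel matrix, using s sampled landmark points."""
    idx = rng.choice(len(X), size=s, replace=False)  # uniform landmark sampling
    C = rbf_kernel(X, X[idx], sigma)                 # n x s cross-kernel block
    W = C[idx]                                       # s x s landmark kernel block
    vals, vecs = np.linalg.eigh(W)                   # eigendecompose W
    keep = vals > 1e-10                              # drop numerically zero modes
    # F = C @ V @ diag(1/sqrt(lambda)), so F F^T = C W^+ C^T (the Nyström approx.)
    return C @ vecs[:, keep] / np.sqrt(vals[keep])
```

The returned features `F` can then be fed to any linear k-means solver (in the paper, MLlib's implementation with at most 100 iterations), avoiding the O(n^2) cost of forming the full kernel matrix.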