Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds
Authors: Shusen Wang, Alex Gittens, Michael W. Mahoney
JMLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluations on the 8.1 million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel k-means clustering with Nyström approximation. ... The superior performance of the latter approach is empirically verified. |
| Researcher Affiliation | Academia | Shusen Wang EMAIL Department of Computer Science Stevens Institute of Technology Hoboken, NJ 07030, USA Alex Gittens EMAIL Computer Science Department Rensselaer Polytechnic Institute Troy, NY 12180, USA Michael W. Mahoney EMAIL International Computer Science Institute and Department of Statistics University of California at Berkeley Berkeley, CA 94720, USA |
| Pseudocode | Yes | Algorithm 1: Approximate Kernel k-Means using the Power Method. Algorithm 2: Approximate Kernel k-Means Clustering using Nyström Approximation. Algorithm 3: Distributed Approximate Kernel k-Means Clustering using Nyström Approximation. |
| Open Source Code | Yes | This implementation is available at https://github.com/wangshusen/Spark Kernel KMeans.git. |
| Open Datasets | Yes | Empirical evaluations on the 8.1 million instance MNIST8M dataset demonstrate the scalability and usefulness of kernel k-means clustering with Nyström approximation. ... We used three classification data sets, described in Table 3. The data sets used are available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. ... MNIST (LeCun et al., 1998): 60,000 instances, 780 features, 10 classes; Mushrooms (Frank and Asuncion, 2010): 8,124 instances, 112 features, 2 classes; Pen Digits (Frank and Asuncion, 2010): 7,494 instances, 16 features, 10 classes. |
| Dataset Splits | No | The paper uses datasets like MNIST8M, Mushrooms, and Pen Digits, stating their sizes and number of features/clusters. However, for these clustering tasks, it does not specify explicit training/validation/test splits as would be typical for supervised learning, nor does it refer to predefined splits for reproduction purposes. The experiments appear to use the full datasets for clustering and evaluation. |
| Hardware Specification | Yes | We ran our experiments on Cori Phase I, a NERSC supercomputer, located at Lawrence Berkeley National Laboratory. Cori Phase I is a Cray XC40 system with 1632 compute nodes, each of which has two 2.3GHz 16-core Haswell processors and 128GB of DRAM. |
| Software Dependencies | No | We implemented Algorithm 3 in the Apache Spark framework (Zaharia et al., 2010, 2012), using the Scala API. We computed the Nyström approximation using the matrix operations provided by Spark, and invoked the MLlib library for machine learning in Spark (Meng et al., 2016) to perform the dimensionality reduction and linear k-means clustering steps. While Spark and MLlib are mentioned, specific version numbers for these software components are not provided. |
| Experiment Setup | Yes | For the linear k-means clustering, we set the maximum number of iterations to 100. We chose the RBF kernel width parameter according to (5) with β = 1.0. |
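The core idea the paper evaluates, approximating the RBF kernel matrix via Nyström sampling and then running linear k-means on the resulting features, can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' distributed Spark/Scala implementation; the function names, the landmark count `s`, and the uniform sampling scheme are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    """RBF (Gaussian) kernel matrix between rows of X and rows of Y."""
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def nystrom_features(X, s, sigma, rng):
    """Nyström feature map: returns F such that F @ F.T approximates
    the full n-by-n RBF kernel matrix, using s sampled landmark points."""
    idx = rng.choice(len(X), size=s, replace=False)  # uniform landmark sampling
    C = rbf_kernel(X, X[idx], sigma)                 # n x s cross-kernel block
    W = C[idx]                                       # s x s landmark kernel block
    vals, vecs = np.linalg.eigh(W)                   # eigendecompose W
    keep = vals > 1e-10                              # drop numerically zero modes
    # F = C @ V @ diag(1/sqrt(lambda)), so F F^T = C W^+ C^T (the Nyström approx.)
    return C @ vecs[:, keep] / np.sqrt(vals[keep])
```

The returned features `F` can then be fed to any linear k-means solver (in the paper, MLlib's implementation with at most 100 iterations), avoiding the O(n^2) cost of forming the full kernel matrix.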