Coreset Spectral Clustering

Authors: Ben Jourdan, Gregory Schwartzman, Peter Macgregor, He Sun

ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We perform three experiments to compare our coreset algorithm against the naive method, and secondly, to compare our Coreset Spectral Clustering algorithm to coreset kernel k-means and the sklearn implementation of spectral clustering. Experiments were performed on a dual Intel Xeon E52690 system with 384GB of RAM. Our implementation was written in Rust using Faer (El Kazdadi)3.
Researcher Affiliation Academia 1University of Edinburgh, UK 2Japan Advanced Institute of Science and Technology (JAIST), Japan 3University of St Andrews, UK EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 D2-Sampling(X) (Jiang et al., 2024) Algorithm 2 FASTD2-SAMPLING Algorithm 3 CORESET SPECTRAL CLUSTERING Algorithm 4 Constructing an ε-coreset for kernel k-means on dataset X with kernel K (Jiang et al., 2024) Algorithm 5 Importance-Sampling(X, ε) (Jiang et al., 2024) Algorithm 6 CONSTRUCTT Algorithm 7 REPAIR(x,T) Algorithm 8 SAMPLEPOINT
Open Source Code Yes A user friendly python wrapper is available here: github.com/Ben Jourdan/coreset-sc.
Open Datasets Yes We test these algorithms on the following three large graph datasets from the Suite Sparse Matrix Collection (Davis & Hu, 2011). Friendster: A snapshot of the social media network with 65M nodes, 1.8B edges, and 5000 labeled overlapping communities. We measure each algorithm s running time and Adjusted Rand Index (ARI) (Rand, 1971) on nearest neighbour graphs of the following datasets: MNIST: A labelled collection of 70,000 28x28 pixel images of handwritten digits (Lecun et al., 1998). Pen Digits: 10,992 labeled instances of 2D pen movements covering the digits 0-9, each represented with 16 numerical features. (Anguita et al., 2013). HAR: A collection of 10,299 labeled smartphone sensor readings capturing six human movements, each represented as 561 numerical features (Anguita et al., 2013). Letter: A labelled collection of 20,000 instances of handwritten alphabetic english characters, represented as 16 numerical features (Frey & Slate, 1991).
Dataset Splits No The paper does not explicitly provide training/test/validation dataset splits. It describes the characteristics of the datasets used for clustering (e.g., number of nearest neighbors for graph construction, stochastic block model parameters) but not specific data partitioning strategies for evaluation.
Hardware Specification Yes Experiments were performed on a dual Intel Xeon E52690 system with 384GB of RAM.
Software Dependencies No Our implementation was written in Rust using Faer (El Kazdadi)3. We compare against the sklearn (Pedregosa et al., 2011) implementation of Spectral Clustering (SC). The paper mentions software names but does not specify their version numbers.
Experiment Setup Yes We measure the time it takes each method to construct a 100K point coreset for the weighted kernel k-means problem on K and W while varying the number of clusters passed to the D2-sampling routines. Following Jiang et al. (2024), we only do one iteration of importance sampling in Algorithm 4. Figure 3 shows the results for the HAR dataset, and demonstrates that the coreset methods run much faster than SC and both variants of CSC perform as well as SC using a coreset less than 5% the size of the original dataset, in a fraction of a second. We sample graphs where the number of nodes in each cluster is 1000, the probability of an edge between two nodes in the same cluster is 0.5 and the probability of an edge between two nodes from different clusters is 0.001/k. The coreset size is set to be 1% the size of the input graph, implying that each cluster has an average of 10 nodes in the coreset graph.