Coreset Spectral Clustering
Authors: Ben Jourdan, Gregory Schwartzman, Peter Macgregor, He Sun
ICLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform three experiments to compare our coreset algorithm against the naive method, and secondly, to compare our Coreset Spectral Clustering algorithm to coreset kernel k-means and the sklearn implementation of spectral clustering. Experiments were performed on a dual Intel Xeon E52690 system with 384GB of RAM. Our implementation was written in Rust using Faer (El Kazdadi)3. |
| Researcher Affiliation | Academia | 1University of Edinburgh, UK 2Japan Advanced Institute of Science and Technology (JAIST), Japan 3University of St Andrews, UK EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 D2-Sampling(X) (Jiang et al., 2024) Algorithm 2 FASTD2-SAMPLING Algorithm 3 CORESET SPECTRAL CLUSTERING Algorithm 4 Constructing an ε-coreset for kernel k-means on dataset X with kernel K (Jiang et al., 2024) Algorithm 5 Importance-Sampling(X, ε) (Jiang et al., 2024) Algorithm 6 CONSTRUCTT Algorithm 7 REPAIR(x,T) Algorithm 8 SAMPLEPOINT |
| Open Source Code | Yes | A user friendly python wrapper is available here: github.com/Ben Jourdan/coreset-sc. |
| Open Datasets | Yes | We test these algorithms on the following three large graph datasets from the Suite Sparse Matrix Collection (Davis & Hu, 2011). Friendster: A snapshot of the social media network with 65M nodes, 1.8B edges, and 5000 labeled overlapping communities. We measure each algorithm s running time and Adjusted Rand Index (ARI) (Rand, 1971) on nearest neighbour graphs of the following datasets: MNIST: A labelled collection of 70,000 28x28 pixel images of handwritten digits (Lecun et al., 1998). Pen Digits: 10,992 labeled instances of 2D pen movements covering the digits 0-9, each represented with 16 numerical features. (Anguita et al., 2013). HAR: A collection of 10,299 labeled smartphone sensor readings capturing six human movements, each represented as 561 numerical features (Anguita et al., 2013). Letter: A labelled collection of 20,000 instances of handwritten alphabetic english characters, represented as 16 numerical features (Frey & Slate, 1991). |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits. It describes the characteristics of the datasets used for clustering (e.g., number of nearest neighbors for graph construction, stochastic block model parameters) but not specific data partitioning strategies for evaluation. |
| Hardware Specification | Yes | Experiments were performed on a dual Intel Xeon E52690 system with 384GB of RAM. |
| Software Dependencies | No | Our implementation was written in Rust using Faer (El Kazdadi)3. We compare against the sklearn (Pedregosa et al., 2011) implementation of Spectral Clustering (SC). The paper mentions software names but does not specify their version numbers. |
| Experiment Setup | Yes | We measure the time it takes each method to construct a 100K point coreset for the weighted kernel k-means problem on K and W while varying the number of clusters passed to the D2-sampling routines. Following Jiang et al. (2024), we only do one iteration of importance sampling in Algorithm 4. Figure 3 shows the results for the HAR dataset, and demonstrates that the coreset methods run much faster than SC and both variants of CSC perform as well as SC using a coreset less than 5% the size of the original dataset, in a fraction of a second. We sample graphs where the number of nodes in each cluster is 1000, the probability of an edge between two nodes in the same cluster is 0.5 and the probability of an edge between two nodes from different clusters is 0.001/k. The coreset size is set to be 1% the size of the input graph, implying that each cluster has an average of 10 nodes in the coreset graph. |