reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Coreset Spectral Clustering

Authors: Ben Jourdan, Gregory Schwartzman, Peter Macgregor, He Sun

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We perform three experiments to compare our coreset algorithm against the naive method, and secondly, to compare our Coreset Spectral Clustering algorithm to coreset kernel k-means and the sklearn implementation of spectral clustering. Experiments were performed on a dual Intel Xeon E52690 system with 384GB of RAM. Our implementation was written in Rust using Faer (El Kazdadi)3.
Researcher Affiliation	Academia	1University of Edinburgh, UK 2Japan Advanced Institute of Science and Technology (JAIST), Japan 3University of St Andrews, UK EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 D2-Sampling(X) (Jiang et al., 2024) Algorithm 2 FASTD2-SAMPLING Algorithm 3 CORESET SPECTRAL CLUSTERING Algorithm 4 Constructing an ε-coreset for kernel k-means on dataset X with kernel K (Jiang et al., 2024) Algorithm 5 Importance-Sampling(X, ε) (Jiang et al., 2024) Algorithm 6 CONSTRUCTT Algorithm 7 REPAIR(x,T) Algorithm 8 SAMPLEPOINT
Open Source Code	Yes	A user friendly python wrapper is available here: github.com/Ben Jourdan/coreset-sc.
Open Datasets	Yes	We test these algorithms on the following three large graph datasets from the Suite Sparse Matrix Collection (Davis & Hu, 2011). Friendster: A snapshot of the social media network with 65M nodes, 1.8B edges, and 5000 labeled overlapping communities. We measure each algorithm s running time and Adjusted Rand Index (ARI) (Rand, 1971) on nearest neighbour graphs of the following datasets: MNIST: A labelled collection of 70,000 28x28 pixel images of handwritten digits (Lecun et al., 1998). Pen Digits: 10,992 labeled instances of 2D pen movements covering the digits 0-9, each represented with 16 numerical features. (Anguita et al., 2013). HAR: A collection of 10,299 labeled smartphone sensor readings capturing six human movements, each represented as 561 numerical features (Anguita et al., 2013). Letter: A labelled collection of 20,000 instances of handwritten alphabetic english characters, represented as 16 numerical features (Frey & Slate, 1991).
Dataset Splits	No	The paper does not explicitly provide training/test/validation dataset splits. It describes the characteristics of the datasets used for clustering (e.g., number of nearest neighbors for graph construction, stochastic block model parameters) but not specific data partitioning strategies for evaluation.
Hardware Specification	Yes	Experiments were performed on a dual Intel Xeon E52690 system with 384GB of RAM.
Software Dependencies	No	Our implementation was written in Rust using Faer (El Kazdadi)3. We compare against the sklearn (Pedregosa et al., 2011) implementation of Spectral Clustering (SC). The paper mentions software names but does not specify their version numbers.
Experiment Setup	Yes	We measure the time it takes each method to construct a 100K point coreset for the weighted kernel k-means problem on K and W while varying the number of clusters passed to the D2-sampling routines. Following Jiang et al. (2024), we only do one iteration of importance sampling in Algorithm 4. Figure 3 shows the results for the HAR dataset, and demonstrates that the coreset methods run much faster than SC and both variants of CSC perform as well as SC using a coreset less than 5% the size of the original dataset, in a fraction of a second. We sample graphs where the number of nodes in each cluster is 1000, the probability of an edge between two nodes in the same cluster is 0.5 and the probability of an edge between two nodes from different clusters is 0.001/k. The coreset size is set to be 1% the size of the input graph, implying that each cluster has an average of 10 nodes in the coreset graph.