Dynamic Similarity Graph Construction with Kernel Density Estimation

Authors: Steinar Laenen, Peter Macgregor, He Sun

ICML 2025

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental
    "Our algorithms are experimentally compared against several baseline algorithms on 8 datasets, and these experiments confirm the sub-linear update time proven in theory. These experiments further demonstrate that our dynamic KDE algorithm scales better to large datasets than several baselines, including the fast static KDE algorithm of Charikar et al. (2020). Moreover, our dynamic similarity graph construction algorithm runs faster than the fully-connected and k-nearest neighbour similarity graph baselines, and produces comparable clustering results when applying spectral clustering."
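For concreteness, the quantity the paper's dynamic algorithm maintains is the kernel density of each query point. A minimal sketch of the exact (static, brute-force) version is below; the normalisation and the exponent convention (1/σ² rather than 1/(2σ²)) are assumptions, not necessarily the paper's exact definition:

```python
import numpy as np

def gaussian_kde(queries, data, sigma):
    """Exact Gaussian kernel density: mu_q = (1/n) * sum_x exp(-||q - x||^2 / sigma^2)."""
    # Pairwise squared Euclidean distances, shape (num_queries, n).
    d2 = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma**2).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = gaussian_kde(X[:5], X, sigma=1.0)
```

This brute-force evaluation costs O(n) per query; the paper's contribution is maintaining these estimates in sub-linear time per update.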
Researcher Affiliation: Academia
    "1 University of Edinburgh, United Kingdom; 2 University of St Andrews, United Kingdom. Correspondence to: He Sun <EMAIL>."
Pseudocode: Yes
    "Algorithm 1 DYNAMICKDE(X, Q, ε) ... Algorithm 10 Tree Update Procedures for Constructing an Approximate Similarity Graph"
Open Source Code: Yes
    "Our code can be downloaded from https://github.com/SteinarLaenen/Dynamic-Similarity-Graph-Construction-with-Kernel-Density-Estimation."
Open Datasets: Yes
    "We evaluate the algorithms on a variety of real-world and synthetic data, and we summarise their properties in Table 3 in Section D. The datasets cover a variety of domains, including synthetic data (blobs (Pedregosa et al., 2011)), images (mnist (LeCun et al., 1998), aloi (Geusebroek et al., 2005)), image embeddings (cifar10 (He et al., 2016; Krizhevsky, 2009)), word embeddings (glove (Pennington et al., 2014)), and mixed numerical datasets (msd (Bertin-Mahieux et al., 2011), covtype (Blackard & Dean, 1999), and census (Meek et al., 1990))."
Dataset Splits: Yes
    "We split the data points into chunks of size 1,000 for aloi, msd, and covtype, and size 10,000 for glove and census. Then, we add one chunk at a time to the set of data points X and the set of query points Q. ... We split the datasets into chunks of 1,000 and add each chunk to the dynamically constructed similarity graph, adding one complete ground-truth cluster at a time."
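The chunked insertion protocol described above can be sketched as a small harness. `X_all` is synthetic stand-in data and the dynamic KDE/graph update step itself is omitted; only the streaming structure is shown:

```python
import numpy as np

def chunks(data, size):
    """Yield successive fixed-size chunks of the dataset."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

# Stand-in for a real dataset such as aloi (chunk size 1,000 per the paper).
X_all = np.random.default_rng(1).normal(size=(5000, 8))
X, Q = [], []
for chunk in chunks(X_all, size=1000):
    X.extend(chunk)  # grow the set of data points
    Q.extend(chunk)  # the query set grows alongside the data
```

At each iteration the dynamic algorithm would update its estimates for the enlarged X and Q, which is where the sub-linear update time is measured.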
Hardware Specification: Yes
    "All experiments are performed on a computer server with 64 AMD EPYC 7302 16-Core Processors and 500 GB of RAM."
Software Dependencies: No
    The paper does not explicitly state software dependencies with version numbers. Libraries are mentioned implicitly through dataset citations (e.g., scikit-learn for blobs), but no versions are given.
Experiment Setup: Yes
    "For each dataset, we set the parameter σ of the Gaussian kernel such that the average kernel density satisfies n⁻¹ µ_q ≈ 0.01 (Karppa et al., 2022). ... We split the data points into chunks of size 1,000 for aloi, msd, and covtype, and size 10,000 for glove and census. ... For the dynamic similarity graph algorithm, we compare against two baseline algorithms: 1. FULLYCONNECTED: the fully-connected similarity graph with the Gaussian kernel; 2. KNN: the k-nearest neighbour graph, for k = 20."
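Hedged sketches of the two baselines named above. The weight conventions (zero diagonal, 0/1 k-NN edges, symmetrisation by union) are common defaults and are assumptions here, not necessarily the paper's exact construction:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def fully_connected(X, sigma):
    """FULLYCONNECTED: Gaussian-kernel weights between every pair of points."""
    W = np.exp(-squareform(pdist(X, "sqeuclidean")) / sigma**2)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def knn_graph(X, k=20):
    """KNN: connect u and v if either is among the other's k nearest neighbours."""
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbours
    W = np.zeros_like(d)
    W[np.repeat(np.arange(len(X)), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)            # symmetrise by union
```

Spectral clustering is then applied to the resulting weighted adjacency matrix in each case, which is how the paper compares clustering quality across graph constructions.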