Dynamic Similarity Graph Construction with Kernel Density Estimation

Authors: Steinar Laenen, Peter Macgregor, He Sun

ICML 2025

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental
    "Our algorithms are experimentally compared against several baseline algorithms on 8 datasets, and these experiments confirm the sub-linear update time proven in theory. These experiments further demonstrate that our dynamic KDE algorithm scales better to large datasets than several baselines, including the fast static KDE algorithm of Charikar et al. (2020). Moreover, our dynamic similarity graph construction algorithm runs faster than the fully-connected and k-nearest neighbour similarity graph baselines, and produces comparable clustering results when applying spectral clustering."
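For concreteness, the quantity the paper's dynamic algorithm maintains is the kernel density of each query point. A minimal sketch of the exact (static, brute-force) version is below; the normalisation and the exponent convention (1/σ² rather than 1/(2σ²)) are assumptions, not necessarily the paper's exact definition:

```python
import numpy as np

def gaussian_kde(queries, data, sigma):
    """Exact Gaussian kernel density: mu_q = (1/n) * sum_x exp(-||q - x||^2 / sigma^2)."""
    # Pairwise squared Euclidean distances, shape (num_queries, n).
    d2 = ((queries[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma**2).mean(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu = gaussian_kde(X[:5], X, sigma=1.0)
```

This brute-force evaluation costs O(n) per query; the paper's contribution is maintaining these estimates in sub-linear time per update.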
Researcher Affiliation: Academia
    "1 University of Edinburgh, United Kingdom; 2 University of St Andrews, United Kingdom. Correspondence to: He Sun <EMAIL>."
Pseudocode: Yes
    "Algorithm 1 DYNAMICKDE(X, Q, ε) ... Algorithm 10 Tree Update Procedures for Constructing an Approximate Similarity Graph"
Open Source Code: Yes
    "Our code can be downloaded from https://github.com/SteinarLaenen/Dynamic-Similarity-Graph-Construction-with-Kernel-Density-Estimation."
Open Datasets: Yes
    "We evaluate the algorithms on a variety of real-world and synthetic data, and we summarise their properties in Table 3 in Section D. The datasets cover a variety of domains, including synthetic data (blobs (Pedregosa et al., 2011)), images (mnist (LeCun et al., 1998), aloi (Geusebroek et al., 2005)), image embeddings (cifar10 (He et al., 2016; Krizhevsky, 2009)), word embeddings (glove (Pennington et al., 2014)), and mixed numerical datasets (msd (Bertin-Mahieux et al., 2011), covtype (Blackard & Dean, 1999), and census (Meek et al., 1990))."
Dataset Splits: Yes
    "We split the data points into chunks of size 1,000 for aloi, msd, and covtype, and size 10,000 for glove and census. Then, we add one chunk at a time to the set of data points X and the set of query points Q. ... We split the datasets into chunks of 1,000 and add each chunk to the dynamically constructed similarity graph, adding one complete ground-truth cluster at a time."
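The chunked insertion protocol described above can be sketched as a small harness. `X_all` is synthetic stand-in data and the dynamic KDE/graph update step itself is omitted; only the streaming structure is shown:

```python
import numpy as np

def chunks(data, size):
    """Yield successive fixed-size chunks of the dataset."""
    for start in range(0, len(data), size):
        yield data[start:start + size]

# Stand-in for a real dataset such as aloi (chunk size 1,000 per the paper).
X_all = np.random.default_rng(1).normal(size=(5000, 8))
X, Q = [], []
for chunk in chunks(X_all, size=1000):
    X.extend(chunk)  # grow the set of data points
    Q.extend(chunk)  # the query set grows alongside the data
```

At each iteration the dynamic algorithm would update its estimates for the enlarged X and Q, which is where the sub-linear update time is measured.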
Hardware Specification: Yes
    "All experiments are performed on a computer server with 64 AMD EPYC 7302 16-Core Processors and 500 GB of RAM."
Software Dependencies: No
    The paper does not explicitly state software dependencies with version numbers. Libraries are mentioned implicitly through dataset citations (e.g., scikit-learn for blobs), but no versions are given.
Experiment Setup: Yes
    "For each dataset, we set the parameter σ of the Gaussian kernel such that the average kernel density satisfies n⁻¹ µ_q ≈ 0.01 (Karppa et al., 2022). ... We split the data points into chunks of size 1,000 for aloi, msd, and covtype, and size 10,000 for glove and census. ... For the dynamic similarity graph algorithm, we compare against two baseline algorithms: 1. FULLYCONNECTED: the fully-connected similarity graph with the Gaussian kernel; 2. KNN: the k-nearest neighbour graph, for k = 20."
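Hedged sketches of the two baselines named above. The weight conventions (zero diagonal, 0/1 k-NN edges, symmetrisation by union) are common defaults and are assumptions here, not necessarily the paper's exact construction:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def fully_connected(X, sigma):
    """FULLYCONNECTED: Gaussian-kernel weights between every pair of points."""
    W = np.exp(-squareform(pdist(X, "sqeuclidean")) / sigma**2)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def knn_graph(X, k=20):
    """KNN: connect u and v if either is among the other's k nearest neighbours."""
    d = squareform(pdist(X))
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    nbrs = np.argsort(d, axis=1)[:, :k]  # indices of the k nearest neighbours
    W = np.zeros_like(d)
    W[np.repeat(np.arange(len(X)), k), nbrs.ravel()] = 1.0
    return np.maximum(W, W.T)            # symmetrise by union
```

Spectral clustering is then applied to the resulting weighted adjacency matrix in each case, which is how the paper compares clustering quality across graph constructions.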