Adaptive Clustering Using Kernel Density Estimators

Authors: Ingo Steinwart, Bharath K. Sriperumbudur, Philipp Thomann

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we illustrate the behavior of our generic KDE-based clustering algorithm on a few artificial data sets for which the ground-truth clustering can be computed. In addition, we compare its performance to k-means and hierarchical clustering."
Researcher Affiliation | Collaboration | Ingo Steinwart, University of Stuttgart, Department of Mathematics, D-70569 Stuttgart, Germany; Bharath K. Sriperumbudur, Pennsylvania State University, Department of Statistics, University Park, PA 16802, USA; Philipp Thomann, D ONE Solutions AG, Sihlfeldstrasse 58, 8003 Zürich, Switzerland
Pseudocode | Yes | Algorithm 1: Clustering with the help of a generic level set estimator. Algorithm 2: Estimating the split-tree with the help of a generic level set estimator.
Open Source Code | No | The paper does not provide concrete access to source code. It describes implementation details in the 'Algorithms' subsection of Section 7 but does not provide a link or an explicit release statement.
Open Datasets | Yes | "The first distribution, see Figure 9, is a mixture of 15 Gaussian distributions and a uniform background distribution. [...] Since this distribution was inspired by the S2 data set of Fränti and Virmajoki (2006), we will call it S2 in the following."
Dataset Splits | No | "For each of the six cluster problems described above and the following sample sizes n ∈ {2500, 3000, 3500, 4200, 5000, 6000, 7000, 8200, 10000, 14000, 20000}, we generated 100 data sets. In addition, we also computed the true densities of the 6 distributions on a 1000 × 1000 grid of [0, 1]^2 to find a high-resolution approximation of the ground-truth clustering."
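The data-generation protocol quoted above (100 synthetic data sets per sample size, drawn from a Gaussian mixture plus uniform background) can be sketched as follows. This is a minimal illustration only: the mixture means, weights, and variances below are hypothetical placeholders, since the paper does not reproduce the exact S2-like parameters in this excerpt.

```python
import random

def sample_s2_like(n, n_centers=15, background=0.05, std=0.02, seed=0):
    """Draw n points in [0, 1]^2 from a mixture of n_centers Gaussians
    plus a uniform background component. All distribution parameters
    here are illustrative, not the paper's actual values."""
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(n_centers)]
    points = []
    for _ in range(n):
        if rng.random() < background:
            # uniform background point on [0, 1]^2
            points.append((rng.random(), rng.random()))
        else:
            # point from a randomly chosen Gaussian component
            cx, cy = rng.choice(centers)
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points

# The paper uses 100 replicates per sample size; restricted here for brevity.
sample_sizes = [2500, 3000, 3500, 4200, 5000, 6000,
                7000, 8200, 10000, 14000, 20000]
datasets = {n: [sample_s2_like(n, seed=r) for r in range(3)]
            for n in sample_sizes[:2]}
```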
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | "Besides our methods, we also considered k-means and hierarchical clustering. To this end, we used the functions kmeans, kmeans++, and hclust of R."
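The paper runs the comparison with R's built-in kmeans (using nstart for random restarts) and hclust. As a language-neutral sketch of what that baseline computes, here is a minimal Lloyd's k-means with random restarts in plain Python; it is an illustrative stand-in, not the authors' code.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, n_start=10, n_iter=50, seed=0):
    """Minimal Lloyd's k-means with random restarts, mimicking the
    behavior of R's kmeans(..., nstart=...). Returns the best
    (centers, within-cluster sum of squares) over all restarts."""
    rng = random.Random(seed)
    best_centers, best_inertia = None, float("inf")
    for _ in range(n_start):
        centers = rng.sample(points, k)
        for _ in range(n_iter):
            # assignment step: each point joins its nearest center
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: dist2(p, centers[i]))
                clusters[j].append(p)
            # update step: recompute centers (reseed empty clusters)
            centers = [mean(c) if c else rng.choice(points)
                       for c in clusters]
        inertia = sum(min(dist2(p, c) for c in centers) for p in points)
        if inertia < best_inertia:
            best_centers, best_inertia = centers, inertia
    return best_centers, best_inertia
```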
Experiment Setup | Yes | "We considered 500 geometrically spaced candidate values of δ between c(ln(n)/n)^{1/d} and c(ln n)^{-1/d}, where in the experiments, the factor c was determined by an estimate of the median mutual distance between the samples of the considered data set. [...] Moreover, we considered both a plain moving-window kernel and the Epanechnikov kernel, where in both cases the underlying norm was the Euclidean distance. Since both kernels have bounded support, we simply chose σ := δ, see (24), and ε := 3 (h_{D,δ} n^{-1} δ^{-d})^{1/2} for each candidate value δ. [...] Finally, we decided to focus on thickness guarantees with the most natural choice γ := 1, [...] we choose τ := (2 + ϵ)δ with ϵ = 0.00001, where we note that our theoretical findings actually hold true for each value τ > 2δ [...]. In addition, kmeans was repeated with 100 random initializations using the parameter nstart = 100."
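The δ grid described in the setup above can be sketched as follows: 500 geometrically spaced candidates between c(ln(n)/n)^{1/d} and c(ln n)^{-1/d}, with the scale c taken from the median pairwise distance of the sample. Note one simplification: the paper says c comes from an estimate of that median, while this sketch computes it exactly over all pairs.

```python
import math
import statistics
from itertools import combinations

def delta_candidates(points, d=2, m=500):
    """Return m geometrically spaced bandwidth candidates between
    c*(ln(n)/n)**(1/d) and c*(ln n)**(-1/d), where c is the median
    pairwise Euclidean distance of the sample."""
    n = len(points)
    c = statistics.median(
        math.dist(p, q) for p, q in combinations(points, 2))
    lo = c * (math.log(n) / n) ** (1 / d)
    hi = c * math.log(n) ** (-1 / d)
    # geometric spacing: constant ratio between consecutive candidates
    ratio = (hi / lo) ** (1 / (m - 1))
    return [lo * ratio ** i for i in range(m)]
```

For the paper's sample sizes (n ≥ 2500) an exact all-pairs median is expensive, which is presumably why the authors use an estimate; subsampling the pairs would serve the same purpose here.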