Adaptive Clustering Using Kernel Density Estimators

Authors: Ingo Steinwart, Bharath K. Sriperumbudur, Philipp Thomann

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we illustrate the behavior of our generic KDE-based clustering algorithm on a few artificial data sets for which the ground-truth clustering can be computed. In addition, we compare its performance to k-means and hierarchical clustering."
Researcher Affiliation | Collaboration | Ingo Steinwart, University of Stuttgart, Department of Mathematics, D-70569 Stuttgart, Germany; Bharath K. Sriperumbudur, Pennsylvania State University, Department of Statistics, University Park, PA 16802, USA; Philipp Thomann, D ONE Solutions AG, Sihlfeldstrasse 58, 8003 Zürich, Switzerland
Pseudocode | Yes | Algorithm 1: Clustering with the help of a generic level set estimator. Algorithm 2: Estimating the split-tree with the help of a generic level set estimator.
Open Source Code | No | The paper does not provide concrete access to source code. It describes implementation details in the 'Algorithms' subsection of Section 7 but does not provide a link or an explicit release statement.
Open Datasets | Yes | "The first distribution, see Figure 9, is a mixture of 15 Gaussian distributions and a uniform background distribution. [...] Since this distribution was inspired by the S2 data set of Fränti and Virmajoki (2006), we will call it S2 in the following."
Dataset Splits | No | "For each of the six cluster problems described above and the following sample sizes n ∈ {2500, 3000, 3500, 4200, 5000, 6000, 7000, 8200, 10000, 14000, 20000}, we generated 100 data sets. In addition, we also computed the true densities of the 6 distributions on a 1000 × 1000 grid of [0, 1]^2 to find a high-resolution approximation of the ground-truth clustering."
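The data-generation protocol quoted above (100 synthetic data sets per sample size, drawn from a Gaussian mixture plus uniform background) can be sketched as follows. This is a minimal illustration only: the mixture means, weights, and variances below are hypothetical placeholders, since the paper does not reproduce the exact S2-like parameters in this excerpt.

```python
import random

def sample_s2_like(n, n_centers=15, background=0.05, std=0.02, seed=0):
    """Draw n points in [0, 1]^2 from a mixture of n_centers Gaussians
    plus a uniform background component. All distribution parameters
    here are illustrative, not the paper's actual values."""
    rng = random.Random(seed)
    centers = [(rng.random(), rng.random()) for _ in range(n_centers)]
    points = []
    for _ in range(n):
        if rng.random() < background:
            # uniform background point on [0, 1]^2
            points.append((rng.random(), rng.random()))
        else:
            # point from a randomly chosen Gaussian component
            cx, cy = rng.choice(centers)
            points.append((rng.gauss(cx, std), rng.gauss(cy, std)))
    return points

# The paper uses 100 replicates per sample size; restricted here for brevity.
sample_sizes = [2500, 3000, 3500, 4200, 5000, 6000,
                7000, 8200, 10000, 14000, 20000]
datasets = {n: [sample_s2_like(n, seed=r) for r in range(3)]
            for n in sample_sizes[:2]}
```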
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | "Besides our methods, we also considered k-means and hierarchical clustering. To this end, we used the functions kmeans, kmeans++, and hclust of R."
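The paper runs the comparison with R's built-in kmeans (using nstart for random restarts) and hclust. As a language-neutral sketch of what that baseline computes, here is a minimal Lloyd's k-means with random restarts in plain Python; it is an illustrative stand-in, not the authors' code.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(pts) for xs in zip(*pts))

def kmeans(points, k, n_start=10, n_iter=50, seed=0):
    """Minimal Lloyd's k-means with random restarts, mimicking the
    behavior of R's kmeans(..., nstart=...). Returns the best
    (centers, within-cluster sum of squares) over all restarts."""
    rng = random.Random(seed)
    best_centers, best_inertia = None, float("inf")
    for _ in range(n_start):
        centers = rng.sample(points, k)
        for _ in range(n_iter):
            # assignment step: each point joins its nearest center
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda i: dist2(p, centers[i]))
                clusters[j].append(p)
            # update step: recompute centers (reseed empty clusters)
            centers = [mean(c) if c else rng.choice(points)
                       for c in clusters]
        inertia = sum(min(dist2(p, c) for c in centers) for p in points)
        if inertia < best_inertia:
            best_centers, best_inertia = centers, inertia
    return best_centers, best_inertia
```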
Experiment Setup | Yes | "We considered 500 geometrically spaced candidate values of δ between c(ln(n)/n)^{1/d} and c(ln n)^{-1/d}, where in the experiments, the factor c was determined by an estimate of the median mutual distance between the samples of the considered data set. [...] Moreover, we considered both a plain moving-window kernel and the Epanechnikov kernel, where in both cases the underlying norm was the Euclidean distance. Since both kernels have bounded support, we simply chose σ := δ, see (24), and ε := 3 (h_{D,δ} n^{-1} δ^{-d})^{1/2} for each candidate value δ. [...] Finally, we decided to focus on thickness guarantees with the most natural choice γ := 1, [...] we choose τ := (2 + ϵ)δ with ϵ = 0.00001, where we note that our theoretical findings actually hold true for each value τ > 2δ [...]. In addition, kmeans was repeated with 100 random initializations using the parameter nstart = 100."
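The δ grid described in the setup above can be sketched as follows: 500 geometrically spaced candidates between c(ln(n)/n)^{1/d} and c(ln n)^{-1/d}, with the scale c taken from the median pairwise distance of the sample. Note one simplification: the paper says c comes from an estimate of that median, while this sketch computes it exactly over all pairs.

```python
import math
import statistics
from itertools import combinations

def delta_candidates(points, d=2, m=500):
    """Return m geometrically spaced bandwidth candidates between
    c*(ln(n)/n)**(1/d) and c*(ln n)**(-1/d), where c is the median
    pairwise Euclidean distance of the sample."""
    n = len(points)
    c = statistics.median(
        math.dist(p, q) for p, q in combinations(points, 2))
    lo = c * (math.log(n) / n) ** (1 / d)
    hi = c * math.log(n) ** (-1 / d)
    # geometric spacing: constant ratio between consecutive candidates
    ratio = (hi / lo) ** (1 / (m - 1))
    return [lo * ratio ** i for i in range(m)]
```

For the paper's sample sizes (n ≥ 2500) an exact all-pairs median is expensive, which is presumably why the authors use an estimate; subsampling the pairs would serve the same purpose here.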