Stable and Consistent Density-Based Clustering via Multiparameter Persistence

Authors: Alexander Rolle, Luis Scoccola

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate Persistable on benchmark data sets, showing that it identifies multi-scale cluster structure in data. Keywords: density-based clustering, topological data analysis, hierarchical clustering, multiparameter persistent homology, interleaving distance, vineyard
Researcher Affiliation | Academia | Alexander Rolle EMAIL Department of Mathematics, Technical University of Munich, Boltzmannstraße 3, 85748 Garching, Germany. Luis Scoccola EMAIL Mathematical Institute, University of Oxford, Woodstock Road, Oxford OX2 6GG, United Kingdom
Pseudocode | Yes | Algorithm 1 Compute the barcode of the HC induced by a finite filtered graph... Algorithm 2 Exhaustive persistence-based flattening of the HC induced by a finite filtered graph
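The core idea behind Algorithm 1, computing the barcode of the hierarchical clustering induced by a finite filtered graph, can be sketched with a union-find structure and the elder rule. This is a hedged illustration, not the paper's pseudocode: the function name `barcode` and the input encoding (per-vertex birth times plus filtration-valued edges) are our assumptions.

```python
# Hedged sketch of Algorithm 1's idea: components of a filtered graph are
# born at their earliest vertex and die (elder rule) when merged into an
# older component. Input encoding is an assumption, not the paper's.

def barcode(vertex_births, edges):
    """vertex_births: {v: birth}; edges: [(filtration, u, v)].
    Returns a sorted list of (birth, death) bars; death may be inf."""
    parent = {v: v for v in vertex_births}
    birth = dict(vertex_births)  # birth of the oldest vertex per component

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    bars = []
    for t, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge creates a cycle, no merge of components
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru  # ensure ru is the older component
        bars.append((birth[rv], t))  # the younger component dies at t
        parent[rv] = ru
    roots = {find(v) for v in vertex_births}
    bars.extend((birth[r], float("inf")) for r in roots)  # surviving components
    return sorted(bars)
```

For example, two vertices born at 0.0 and one born at 1.0, joined by edges at filtration 2.0 and 3.0, yield one infinite bar plus two finite bars recording the merges.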
Open Source Code | Yes | In another publication (Scoccola and Rolle, 2023), we described the implementation of Persistable. ... See the Persistable software repository (link available in Scoccola and Rolle 2023) for code that replicates all the examples in this section, as well as for further evaluations of Persistable on benchmark data sets.
Open Datasets | Yes | We consider a data set consisting of approximately 560 000 rideshare pickup locations in the New York City area from April, 2014. The data set is the result of a Freedom of Information request by the website FiveThirtyEight (2015). ... We consider a data set concerning the fatty acid composition of 572 samples of olive oil. ... The data set is due to Forina et al. (1983), and we obtained it from the supplementary materials of Stuetzle and Nugent (2010).
Dataset Splits | No | The paper discusses applying clustering algorithms to datasets and evaluating the results against ground-truth labels (e.g., adjusted Rand index), but it does not specify explicit training, validation, or test splits for any model or experiment. It mentions using a subsample for approximation, but this is not a dataset split for reproducibility purposes.
Hardware Specification | Yes | Using a subsample of 30 000 data points, we are able to compute clusterings of the complete Rideshare data set in a matter of seconds, using approximately 200 MB of RAM, using a laptop with an Intel(R) Core(TM) i5 CPU (4 cores, 1.6 GHz) and 8 GB RAM, running GNU/Linux.
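The subsampling step quoted above (30 000 of roughly 560 000 points) is straightforward to reproduce; the sketch below only illustrates drawing such a subsample with NumPy and uses synthetic stand-in coordinates, since Persistable's own API is not reproduced here.

```python
import numpy as np

# Hedged sketch: draw a 30,000-point subsample from a ~560,000-point data
# set, as done for the Rideshare data. The data here is a synthetic
# stand-in for the pickup coordinates, not the actual data set.
rng = np.random.default_rng(0)
data = rng.standard_normal((560_000, 2))
idx = rng.choice(len(data), size=30_000, replace=False)  # without replacement
subsample = data[idx]
```

The subsample is then what the clustering pipeline actually consumes, which is what keeps the memory footprint around the reported 200 MB.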
Software Dependencies | Yes | The RIVET Developers. RIVET. 1.1.0, 2020. URL https://github.com/rivetTDA/rivet/. ... hdbscan clustering library (McInnes et al., 2017).
Experiment Setup | Yes | To apply the persistence-based flattening, one chooses the number of clusters in the output, guided by the barcode of the HC. ... PF(H, n) = leaves(H^τ), where τ = (Pr(H)(n+1) + Pr(H)(n))/2. ... The memory usage of HDBSCAN scales with n·k, where n is the number of data points and k is the density threshold parameter min_samples.
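The threshold τ in the flattening formula is a midpoint between consecutive prominences: cutting there keeps exactly the n most prominent bars as clusters. The sketch below illustrates this choice; the function name and the indexing convention (prominences sorted in decreasing order) are our assumptions, not the paper's notation.

```python
# Hedged sketch of the persistence-based flattening threshold: given the
# prominences of the bars in the HC's barcode, the cut level tau yielding
# n clusters is the midpoint between the n-th and (n+1)-th largest
# prominence. Indexing convention is an assumption.

def flattening_threshold(prominences, n):
    pr = sorted(prominences, reverse=True)
    return (pr[n - 1] + pr[n]) / 2  # midpoint between n-th and (n+1)-th

tau = flattening_threshold([5.0, 3.0, 1.0, 0.5], 2)
# exactly the two most prominent bars (5.0 and 3.0) exceed tau
```

Choosing a midpoint (rather than, say, the n-th prominence itself) makes the cut robust: small perturbations of the prominences do not change which bars survive.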