Stable and Consistent Density-Based Clustering via Multiparameter Persistence

Authors: Alexander Rolle, Luis Scoccola

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate Persistable on benchmark data sets, showing that it identifies multi-scale cluster structure in data. Keywords: density-based clustering, topological data analysis, hierarchical clustering, multiparameter persistent homology, interleaving distance, vineyard
Researcher Affiliation | Academia | Alexander Rolle EMAIL Department of Mathematics, Technical University of Munich, Boltzmannstraße 3, 85748 Garching, Germany. Luis Scoccola EMAIL Mathematical Institute, University of Oxford, Woodstock Road, Oxford OX2 6GG, United Kingdom
Pseudocode | Yes | Algorithm 1 Compute the barcode of the HC induced by a finite filtered graph... Algorithm 2 Exhaustive persistence-based flattening of the HC induced by a finite filtered graph
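The core idea behind Algorithm 1, computing the barcode of the hierarchical clustering induced by a finite filtered graph, can be sketched with a union-find structure and the elder rule. This is a hedged illustration, not the paper's pseudocode: the function name `barcode` and the input encoding (per-vertex birth times plus filtration-valued edges) are our assumptions.

```python
# Hedged sketch of Algorithm 1's idea: components of a filtered graph are
# born at their earliest vertex and die (elder rule) when merged into an
# older component. Input encoding is an assumption, not the paper's.

def barcode(vertex_births, edges):
    """vertex_births: {v: birth}; edges: [(filtration, u, v)].
    Returns a sorted list of (birth, death) bars; death may be inf."""
    parent = {v: v for v in vertex_births}
    birth = dict(vertex_births)  # birth of the oldest vertex per component

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    bars = []
    for t, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge creates a cycle, no merge of components
        if birth[ru] > birth[rv]:
            ru, rv = rv, ru  # ensure ru is the older component
        bars.append((birth[rv], t))  # the younger component dies at t
        parent[rv] = ru
    roots = {find(v) for v in vertex_births}
    bars.extend((birth[r], float("inf")) for r in roots)  # surviving components
    return sorted(bars)
```

For example, two vertices born at 0.0 and one born at 1.0, joined by edges at filtration 2.0 and 3.0, yield one infinite bar plus two finite bars recording the merges.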
Open Source Code | Yes | In another publication (Scoccola and Rolle, 2023), we described the implementation of Persistable. ... See the Persistable software repository (link available in Scoccola and Rolle 2023) for code that replicates all the examples in this section, as well as for further evaluations of Persistable on benchmark data sets.
Open Datasets | Yes | We consider a data set consisting of approximately 560 000 rideshare pickup locations in the New York City area from April, 2014. The data set is the result of a Freedom of Information request by the website FiveThirtyEight (2015). ... We consider a data set concerning the fatty acid composition of 572 samples of olive oil. ... The data set is due to Forina et al. (1983), and we obtained it from the supplementary materials of Stuetzle and Nugent (2010).
Dataset Splits | No | The paper discusses applying clustering algorithms to datasets and evaluating the results against ground-truth labels (e.g., adjusted Rand index), but it does not specify explicit training, validation, or test splits for any model or experiment. It mentions using a subsample for approximation, but this is not a dataset split for reproducibility purposes.
Hardware Specification | Yes | Using a subsample of 30 000 data points, we are able to compute clusterings of the complete Rideshare data set in a matter of seconds, using approximately 200 MB of RAM, using a laptop with an Intel(R) Core(TM) i5 CPU (4 cores, 1.6 GHz) and 8 GB RAM, running GNU/Linux.
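The subsampling step quoted above (30 000 of roughly 560 000 points) is straightforward to reproduce; the sketch below only illustrates drawing such a subsample with NumPy and uses synthetic stand-in coordinates, since Persistable's own API is not reproduced here.

```python
import numpy as np

# Hedged sketch: draw a 30,000-point subsample from a ~560,000-point data
# set, as done for the Rideshare data. The data here is a synthetic
# stand-in for the pickup coordinates, not the actual data set.
rng = np.random.default_rng(0)
data = rng.standard_normal((560_000, 2))
idx = rng.choice(len(data), size=30_000, replace=False)  # without replacement
subsample = data[idx]
```

The subsample is then what the clustering pipeline actually consumes, which is what keeps the memory footprint around the reported 200 MB.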
Software Dependencies | Yes | The RIVET Developers. RIVET. 1.1.0, 2020. URL https://github.com/rivetTDA/rivet/. ... hdbscan clustering library (McInnes et al., 2017).
Experiment Setup | Yes | To apply the persistence-based flattening, one chooses the number of clusters in the output, guided by the barcode of the HC. ... PF(H, n) = leaves(H^τ), where τ = (Pr(H)(n+1) + Pr(H)(n))/2. ... The memory usage of HDBSCAN scales with n·k, where n is the number of data points and k is the density threshold parameter min_samples.
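The threshold τ in the flattening formula is a midpoint between consecutive prominences: cutting there keeps exactly the n most prominent bars as clusters. The sketch below illustrates this choice; the function name and the indexing convention (prominences sorted in decreasing order) are our assumptions, not the paper's notation.

```python
# Hedged sketch of the persistence-based flattening threshold: given the
# prominences of the bars in the HC's barcode, the cut level tau yielding
# n clusters is the midpoint between the n-th and (n+1)-th largest
# prominence. Indexing convention is an assumption.

def flattening_threshold(prominences, n):
    pr = sorted(prominences, reverse=True)
    return (pr[n - 1] + pr[n]) / 2  # midpoint between n-th and (n+1)-th

tau = flattening_threshold([5.0, 3.0, 1.0, 0.5], 2)
# exactly the two most prominent bars (5.0 and 3.0) exceed tau
```

Choosing a midpoint (rather than, say, the n-th prominence itself) makes the cut robust: small perturbations of the prominences do not change which bars survive.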