A Scalable Approach for Mapper via Efficient Spatial Search
Authors: Luca Simi
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we introduce a novel, more scalable method for constructing open covers for Mapper, leveraging techniques from computational geometry. Our approach significantly enhances efficiency, improving Mapper s performance for large high-dimensional data. We will present theoretical insights into our method and demonstrate its effectiveness through experimental evaluations on well-known datasets, showcasing substantial improvements in visualization quality and computational performance. We implemented our method in a new Python library called tda-mapper, which is freely available at https://github.com/lucasimi/tda-mapper-python, providing a powerful tool for TDA practitioners and researchers. [...] In this section, we report the results obtained from comparing giotto-tda (v0.6.2) (Tauzin et al., 2021), Kepler Mapper (v2.1.0) (van Veen et al., 2019) and tda-mapper (v0.9.0) (Simi, 2024), focusing on both running time performance and the complexity of the generated graphs. |
| Researcher Affiliation | Academia | Luca Simi EMAIL. This research was conducted independently and did not receive any external funding or institutional support. The findings presented here are not associated with the author s professional duties or affiliations. Although no explicit institutional affiliation (university or company) is provided, and the work is stated as independent, the nature of the research and its publication venue (Transactions on Machine Learning Research) align with academic contributions. Therefore, it is classified as Academia (0) as the closest fit among the given categories. |
| Pseudocode | Yes | Algorithm 1 Greedy construction of an ϵ-net. Algorithm 2 Naive construction of the standard cubical cover. Algorithm 3 Algorithm for build_vptree(Y, d). Algorithm 4 Algorithm for range_query(T, q, ϵ). Algorithm 5 Construction of ball cover via vp-trees. Algorithm 6 Construction of the standard cubical cover via vp-trees. Algorithm 7 Construction of proximity-net. Algorithm 8 Construction of ϵ-net via proximity-net and vp-trees. Algorithm 9 Construction of cubical cover via proximity-net and vp-trees. |
| Open Source Code | Yes | We implemented our method in a new Python library called tda-mapper, which is freely available at https://github.com/lucasimi/tda-mapper-python, providing a powerful tool for TDA practitioners and researchers. Additionally, we introduce our open-source library, tda-mapper (Simi, 2024), available at https://github.com/lucasimi/tda-mapper-python. |
| Open Datasets | Yes | To ensure the reliability of our benchmarks, we used well-known datasets publicly available at the UCI Machine Learning Repository (Dua & Graff, 2023): (a) the Digits dataset (Alpaydin & Kaynak, 1998), with 1797 instances and 64 features; (b) the MNIST dataset (Le Cun et al., 1998), with 70000 instances and 784 features; (c) the Cifar-10 dataset (Krizhevsky et al., 2009), with 60000 instances and 1024 features; and (d) the Fashion-MNIST dataset (Xiao et al., 2017) with 70000 instances and 784 features. |
| Dataset Splits | No | The paper describes experiments for comparing Mapper graph construction and performance across different implementations using full datasets. It does not provide specific training/test/validation splits for these datasets, as the experiments do not involve traditional model training and evaluation using such splits. For instance, it mentions 'For each dataset we ran Mapper using overlap fraction p ranging in the set {0.125, 0.25, 0.5} and using n = 10 as the number of intervals on each feature.' and 'Specifically, we opted for a clustering algorithm that assigns all data points to a single cluster.' No explicit dataset splitting for reproduction of model training is given. |
| Hardware Specification | Yes | Our experiments were conducted on Debian 12 using Python 3.11, using a PC equipped with a Ryzen 7 5700G CPU with 2x16GB DDR34 2133Mhz, in dual channel configuration. |
| Software Dependencies | No | The paper states 'Our experiments were conducted on Debian 12 using Python 3.11'. It also lists 'networkx (Hagberg et al., 2008), numpy (Harris et al., 2020), matplotlib (Hunter, 2007), and plotly (Plotly Technologies Inc., 2015)' as dependencies and 'sklearn (Pedregosa et al., 2011)' for testing, but it does not provide specific version numbers for these libraries. Only Python's version is specified. |
| Experiment Setup | Yes | For each dataset we ran Mapper using overlap fraction p ranging in the set {0.125, 0.25, 0.5} and using n = 10 as the number of intervals on each feature. This choice is arbitrary, but was enough to get informative Mapper graphs, especially with low-dimensional lenses, with every dataset we used. [...] Specifically, we opted for a clustering algorithm that assigns all data points to a single cluster. [...] The Mapper algorithm was configured using PCA with four principal components as the lens, a cubical cover with five intervals and 50% overlap, and KMeans clustering with two clusters was applied to the pullback of each open set. |