Euler Characteristic Tools for Topological Data Analysis

Authors: Olympio Hacquard, Vadim Lebovici

JMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this article, we show that Euler characteristic profiles and their hybrid transforms are informative and highly efficient topological descriptors. Throughout the paper, we use classical methods based on persistence diagrams as a baseline for our descriptors. After introducing the necessary notions in Section 2, we provide heuristics on how to choose the kernel of hybrid transforms and give many examples of the type of topological and geometric behaviour Euler curves and their integral transforms can capture from data in Section 3. Most importantly, our main contributions are the following: We demonstrate that Euler profiles achieve state-of-the-art accuracy in supervised classification and regression tasks when coupled with a random forest or gradient boosting classifier (Sections 4.1, 4.2 and 4.4) at a very low computational cost (Section 4.5). We demonstrate that hybrid transforms act as highly efficient information compressors and typically require a much smaller resolution than Euler profiles to reach a similar performance. They can also outperform Euler profiles in unsupervised classification tasks and in supervised tasks when plugged into a linear classifier (Figure 7 and Sections 4.1 to 4.3). In Section 4.3, we illustrate their ability to capture fine-grained information on a real-world data set. Finally, Section 7 is devoted to the proofs of the results stated in Sections 5 and 6.
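As a concrete illustration of the simplest descriptor discussed above, an Euler characteristic curve can be computed directly from a filtered simplicial complex: each simplex contributes (-1)^dim once its filtration value is reached. The following is a minimal sketch, not the authors' eulearning implementation:

```python
# Minimal sketch (not the paper's code): sampling an Euler characteristic
# curve (ECC) of a filtered simplicial complex on a grid of filtration values.
import numpy as np

def euler_characteristic_curve(simplices, grid):
    """simplices: list of (filtration_value, dimension) pairs.
    grid: 1-D array of filtration values at which to sample the ECC."""
    ecc = np.zeros(len(grid), dtype=int)
    for value, dim in simplices:
        # A simplex is present at every grid point >= its filtration value
        # and contributes (-1)^dim to the Euler characteristic from then on.
        ecc[grid >= value] += (-1) ** dim
    return ecc

# Filtration of a triangle: 3 vertices enter at t=0, 3 edges at t=1,
# and the 2-cell at t=2.
simplices = [(0.0, 0), (0.0, 0), (0.0, 0),
             (1.0, 1), (1.0, 1), (1.0, 1),
             (2.0, 2)]
grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
print(euler_characteristic_curve(simplices, grid))  # -> [3 3 0 0 1]
```

The curve moves from chi = 3 (three isolated points) to chi = 0 (a circle) to chi = 1 (a filled triangle), which is the kind of shape signal the descriptors build on.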
Researcher Affiliation | Academia | Olympio Hacquard (EMAIL), Laboratoire de Mathématiques d'Orsay, Université Paris-Saclay, CNRS, Inria, 91400 Orsay, France; Vadim Lebovici (EMAIL), Mathematical Institute, University of Oxford, Oxford, United Kingdom
Pseudocode | No | The paper discusses algorithms and their complexity in Section 3.1 titled "Algorithm," but it does not present any structured pseudocode or algorithm blocks. It describes the general procedure for computation but lacks a formalized algorithm presentation.
Open Source Code | Yes | A Python implementation of our algorithms is freely available online on our GitHub repository: https://github.com/vadimlebovici/eulearning.
Open Datasets | Yes | The ORBIT5K data set is often used as a standard benchmark for classification methods in topological data analysis (Adams et al., 2017; Carrière et al., 2020; Reinauer et al., 2021). The Sydney urban objects recognition data set consists of 3D point clouds of everyday urban road objects scanned with a LIDAR (De Deuge et al., 2013). We have applied our method to the supervised classification of graph data. [...] The first four methods are state-of-the-art classification methods on graphs that use kernels or neural networks. We report the scores from the original papers, Tran et al. (2019); Zhang et al. (2018); Verma and Zhang (2017); Xu et al. (2019). Perslay (Carrière et al., 2020) and Atol (Royer et al., 2021) are topological methods that transform the graphs into persistence diagrams using HKS functions.
Dataset Splits | Yes | We observe 101 samples from the unit disk of the space with curvature [−2, −1.96, ..., 1.96, 2] and validate our model on a testing set of 100 point clouds sampled from the disk of the space with random curvature drawn uniformly in [−2, 2]. We generate 700 training and 300 testing orbits for each class. Note that the methods SV, FGSD, and GIN do not average ten times and rather consider a single 10-fold sample, which can slightly boost their accuracies.
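For context, ORBIT5K orbits are conventionally generated (following Adams et al., 2017) by the recurrence x_{n+1} = x_n + r·y_n(1−y_n) mod 1, y_{n+1} = y_n + r·x_{n+1}(1−x_{n+1}) mod 1, with one class per parameter r in {2.5, 3.5, 4.0, 4.1, 4.3}. A hedged sketch of the generation and the 70/30 split quoted above, scaled down for brevity (the paper uses 700 training and 300 testing orbits per class):

```python
# Illustrative sketch of ORBIT5K-style orbit generation and a 70/30 split.
# The recurrence follows the standard ORBIT5K construction (Adams et al.,
# 2017); counts are reduced here for brevity.
import numpy as np

def generate_orbit(r, n_points=1000, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    x, y = rng.random(), rng.random()
    orbit = np.empty((n_points, 2))
    for i in range(n_points):
        x = (x + r * y * (1 - y)) % 1.0
        y = (y + r * x * (1 - x)) % 1.0
        orbit[i] = x, y
    return orbit

rng = np.random.default_rng(0)
params = [2.5, 3.5, 4.0, 4.1, 4.3]  # the five ORBIT5K classes
# 10 orbits per class here; the paper generates 1000 (700 train + 300 test).
orbits = {r: [generate_orbit(r, 100, rng) for _ in range(10)] for r in params}
train = {r: os[:7] for r, os in orbits.items()}  # 70% training
test = {r: os[7:] for r, os in orbits.items()}   # 30% testing
```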
Hardware Specification | Yes | Our timing experiments have been run on a workstation with an Intel(R) Core(TM) i7-4770 CPU (8 cores, 3.4GHz) and 8 GB RAM, running GNU/Linux (Ubuntu 20.04.1).
Software Dependencies | No | The paper mentions several software components like the Gudhi library (Rouvreau, 2015), the MMA package (Loiseaux et al., 2022b), and the XGBoost classifier (Chen and Guestrin, 2016), but it does not specify any version numbers for these software components. It only mentions the operating system as "GNU/Linux (Ubuntu 20.04.1)".
Experiment Setup | Yes | All descriptors have a resolution of 900 (hence 30 × 30 for two-parameter ones) and were classified using the XGBoost classifier (Chen and Guestrin, 2016). We select the hyperparameters of our descriptors by cross-validation: For the ECC, the quantiles (see Implementation in Section 3.1) are selected in {(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)}. For the ECS, the quantiles are selected in the same set as for the ECC for both parameters. For the HT1, the range is selected in {[0, 50], [0, 100], [0, 500], [0, 1000]} and the primitive kernel κ in {s ↦ exp(−s^4), s ↦ s^4 exp(−s^4), s ↦ s^8 exp(−s^8)}. For the HT2, the primitive kernel and the range for the first parameter are the same as for the HT1, and the range for the second parameter is selected in {[0, 50], [0, 80], [0, 100], [0, 500]}.
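The quoted setup selects descriptor hyperparameters (such as the ECC quantile pair) by cross-validation before classification. A hedged sketch of that selection loop, with scikit-learn's GradientBoostingClassifier standing in for XGBoost and a toy placeholder descriptor standing in for the actual ECC (the real quantiles parameterize the filtration-value grid used when computing the curve):

```python
# Hedged sketch (not the authors' pipeline): cross-validated selection of a
# quantile-pair hyperparameter. The "descriptor" below is a toy placeholder:
# for each sample, the fraction of raw values below each point of a
# resolution-30 grid spanning the chosen quantile range.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
raw = rng.normal(size=(60, 50))       # stand-in for raw filtration values
y = rng.integers(0, 2, size=60)       # stand-in labels

def descriptor(raw, quantiles):
    lo, hi = np.quantile(raw, quantiles)
    grid = np.linspace(lo, hi, 30)    # resolution-30 sampling grid
    return (raw[:, None, :] <= grid[None, :, None]).mean(axis=2)

# Candidate quantile pairs from the quoted setup; pick the one with the best
# mean cross-validated accuracy.
best = max(
    [(0.1, 0.9), (0.2, 0.8), (0.3, 0.7)],
    key=lambda q: cross_val_score(
        GradientBoostingClassifier(n_estimators=50, random_state=0),
        descriptor(raw, q), y, cv=3).mean(),
)
print("selected quantiles:", best)
```

With random stand-in data the winning pair is arbitrary; the point is only the shape of the search: recompute the descriptor per candidate, score by cross-validation, keep the best.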