Nearest-Neighbour-Induced Isolation Similarity and Its Impact on Density-Based Clustering
Authors: Xiaoyu Qin, Kai Ming Ting, Ye Zhu, Vincent CS Lee
Pages: 4755-4762
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The impact of Isolation Similarity on density-based clustering is studied here. We show for the first time that the clustering performance of the classic density-based clustering algorithm DBSCAN can be significantly uplifted to surpass that of the recent density-peak clustering algorithm DP. This is achieved by simply replacing the distance measure with the proposed nearest-neighbour-induced Isolation Similarity in DBSCAN, leaving the rest of the procedure unchanged. A new type of clusters called mass-connected clusters is formally defined. We show that DBSCAN, which detects density-connected clusters, becomes one which detects mass-connected clusters, when the distance measure is replaced with the proposed similarity. We also provide the condition under which mass-connected clusters can be detected, while density-connected clusters cannot. |
| Researcher Affiliation | Academia | Xiaoyu Qin, Monash University, Victoria, Australia 3800; Kai Ming Ting, Federation University, Victoria, Australia 3842; Ye Zhu, Deakin University, Victoria, Australia 3125; Vincent CS Lee, Monash University, Victoria, Australia 3800 |
| Pseudocode | No | N/A |
| Open Source Code | Yes | All algorithms used in our experiments are implemented in Matlab (the source code with demo can be obtained from https://github.com/cswords/anne-dbscan-demo). |
| Open Datasets | Yes | The artificial datasets are from http://cs.uef.fi/sipu/datasets/ (Gionis, Mannila, and Tsaparas 2007; Zahn 1971; Chang and Yeung 2008; Jain and Law 2005) except that the hard distribution dataset is from https://sourceforge.net/p/density-ratio/ (Zhu, Ting, and Carman 2016), 5 high-dimensional data are from http://featureselection.asu.edu/datasets.php (Li et al. 2016), and the rest of the datasets are from http://archive.ics.uci.edu/ml (Dheeru and Karra Taniskidou 2017). |
| Dataset Splits | No | We compared all clustering results in terms of the best F1 score (Rijsbergen 1979) that is obtained from a search of the algorithm’s parameter. We search each parameter within a reasonable range. |
| Hardware Specification | Yes | The experiments ran on a machine having CPU: i5-8600k 4.30GHz processor, 8GB RAM; and GPU: GTX Titan X with 3072 1075MHz CUDA (Owens et al. 2008) cores & 12GB graphic memory. |
| Software Dependencies | No | All algorithms used in our experiments are implemented in Matlab (the source code with demo can be obtained from https://github.com/cswords/anne-dbscan-demo). We produced the GPU accelerated versions of all implementations. |
| Experiment Setup | Yes | The ranges used for all algorithms/dissimilarities are provided in Table 2. |
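
The table reports no pseudocode for the paper, so the following is only a rough sketch of how a nearest-neighbour-induced isolation similarity might be estimated. It is not the authors' Matlab implementation. It assumes that each of `t` random subsamples of `psi` points partitions the space into nearest-neighbour cells, and that the similarity of two points is the fraction of partitions in which they share a cell. The function name, the parameter names `psi` and `t`, and the toy data are all illustrative assumptions.

```python
import math
import random


def isolation_similarity(points, psi=2, t=200, seed=0):
    """Estimate pairwise similarity as the fraction of random
    nearest-neighbour partitions in which two points share a cell.

    Assumption: each partition is built from `psi` points sampled
    without replacement; this is a sketch, not the paper's code.
    """
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    for _ in range(t):
        # Sample psi points to act as cell centres for this partition.
        centres = rng.sample(range(n), psi)
        # Each point belongs to the cell of its nearest sampled centre.
        cells = [
            min(centres, key=lambda c: math.dist(p, points[c]))
            for p in points
        ]
        # Count the pairs that fall into the same cell.
        for i in range(n):
            for j in range(i, n):
                if cells[i] == cells[j]:
                    counts[i][j] += 1
                    if i != j:
                        counts[j][i] += 1
    return [[c / t for c in row] for row in counts]


# Toy data (hypothetical): two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]
sim = isolation_similarity(points, psi=2, t=200, seed=1)
# Convert to a dissimilarity matrix for a distance-based clusterer.
dissim = [[1.0 - s for s in row] for row in sim]
```

Under these assumptions, `dissim` could be passed to any DBSCAN implementation that accepts a precomputed dissimilarity matrix, which mirrors the paper's idea of replacing the distance measure while leaving the rest of the procedure unchanged.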