Generalization Performance of Ensemble Clustering: From Theory to Algorithm

Authors: Xu Zhang, Haoye Qiu, Weixuan Liang, Hui Liu, Junhui Hou, Yuheng Jia

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By extensive experimental validation, we confirm the validity of our theoretical assertions and demonstrate that the proposed algorithm surpasses other state-of-the-art methods significantly in terms of performance.
Researcher Affiliation | Academia | (1) School of Computer Science and Engineering, Southeast University, Nanjing 210096, China; (2) College of Computer Science and Technology, National University of Defense Technology, Changsha, China; (3) School of Computing Information Sciences, Saint Francis University, Hong Kong, China; (4) Department of Computer Science, City University of Hong Kong, Hong Kong, China; (5) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China. Correspondence to: Yuheng Jia <EMAIL>.
Pseudocode | Yes | The pseudocode for this algorithm is provided in Appendix C (Algorithm 1).
Open Source Code | Yes | The code is available at https://github.com/xuz2019/GPEC.
Open Datasets | Yes | We evaluated our method on 10 datasets against the comparison methods CEAM (Zhou et al., 2024), CEs2L, CEs2Q (Li et al., 2019), LWEA (Huang et al., 2018), NWCA (Zhang et al., 2024), ECCMS (Jia et al., 2024), MKKM (Bang et al., 2018), SMKKM (Liu, 2023), and SEC (Liu et al., 2017). Due to space limitations, detailed descriptions of the datasets and comparison methods are provided in Appendix E.1 and E.2. From Appendix E.1 (Details of Datasets): In the comparative experiments in Section 6.1, we used 10 benchmark datasets including images, DNA, sensor information, etc. We have summarized the feature information of the datasets in Table 3, and the detailed information is as follows: 1. Phishing Websites: The dataset consists of a collection of legitimate and phishing website instances. ... (http://archive.ics.uci.edu/dataset/327/phishing+websites) 2. Rice: A total of 3810 images of rice grains were captured from two species: Cammeo and Osmancik rice. ... (http://archive.ics.uci.edu/dataset/545/rice+cammeo+and+osmancik)
Dataset Splits | No | The paper does not provide explicit training/test/validation dataset splits. It only states: "For each dataset, we repeat the experiments 20 times and compute the average performance. The true number of clustering class is chosen as k for each dataset."
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | For each dataset, we repeat the experiments 20 times and compute the average performance. The true number of clustering class is chosen as k for each dataset. From Appendix E.4 (Hyper-parameter Analysis): In this paper, we have only one hyper-parameter, α, which serves as the threshold for extracting high-confidence elements. Fig. 4 shows the performance of our model under different α settings. It can be seen that our method is quite robust across most datasets, and the optimal hyper-parameter is generally between 0.1 and 0.3.
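The evaluation protocol quoted under Experiment Setup (20 repetitions per dataset, scores averaged, k set to the true number of classes) can be illustrated with a minimal sketch. This is not the authors' code: the function names `rand_index`, `evaluate`, and `random_clusterer` are hypothetical, the Rand index is an assumed stand-in for whichever metrics the paper reports, and the random clusterer is only a placeholder for a real ensemble clustering method.

```python
import random
from itertools import combinations
from statistics import mean


def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two labelings agree.

    Assumption: the Rand index stands in for the paper's actual metrics,
    which this summary does not list.
    """
    pairs = list(combinations(range(len(labels_true)), 2))
    agree = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def evaluate(cluster_fn, data, labels_true, n_runs=20):
    """Average the score of cluster_fn over n_runs seeded repetitions."""
    # k is set to the true number of classes, as stated in the paper.
    k = len(set(labels_true))
    scores = []
    for seed in range(n_runs):
        labels_pred = cluster_fn(data, k, seed)
        scores.append(rand_index(labels_true, labels_pred))
    return mean(scores)


def random_clusterer(data, k, seed):
    """Hypothetical placeholder: assigns each point to a random cluster."""
    rng = random.Random(seed)
    return [rng.randrange(k) for _ in data]


if __name__ == "__main__":
    data = [0.1, 0.2, 0.3, 5.1, 5.2, 5.3]
    labels_true = [0, 0, 0, 1, 1, 1]
    print(evaluate(random_clusterer, data, labels_true))
```

Any method with the signature `cluster_fn(data, k, seed)` can be passed to `evaluate`; using the run index as the seed keeps the 20 repetitions reproducible while still varying the initialization.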