Hierarchical Overlapping Clustering on Graphs: Cost Function, Algorithm and Scalability

Authors: Yicheng Pan, Renjie Chen, Pengyu Long, Bingchen Fan

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results demonstrate that this speed-up algorithm outperforms all baseline methods significantly in both effectiveness (across synthetic and real datasets) and scalability. ... In this section, we verify by experiments the effectiveness and scalability of the speed-up version of Algorithm 2, which demonstrates the validity of our cost function as well.
Researcher Affiliation Academia 1State Key Laboratory of Complex & Critical Software Environment, Beihang University. Correspondence to: Yicheng Pan <EMAIL>.
Pseudocode Yes Algorithm 1 Algorithm for 2-OC Algorithm 2 Algorithm for k-HOC Algorithm 3 Speed-up algorithm for 2-OC
Open Source Code Yes For the source codes and datasets, please refer to https://github.com/Hardict/HOC.
Open Datasets Yes For synthetic datasets, we employ the overlapping stochastic block model (OSBM)... For real datasets, we adopt Amazon, Youtube, and DBLP (Yang & Leskovec, 2012b) that are provided by http://snap.stanford.edu/data and are also used in (Orecchia et al., 2022). To show intuitively that our algorithm is able to find out the blurred overlapping area of datasets, we run our 2-OC algorithm on the MNIST dataset (Le Cun et al., 1998)
Dataset Splits No The paper uses synthetic datasets based on OSBM with specific generation parameters (p1, p2, p3) and real-world datasets (Amazon, Youtube, DBLP, MNIST). It describes how the OSBM is modified and parameters for cluster generation, and for MNIST it specifies selecting pairs of labels and constructing a k-nearest neighbor graph. However, it does not provide explicit train/test/validation splits for any of the datasets used in its experiments.
Hardware Specification Yes Our experiments were performed on a personal computer equipped with a 2.3GHz quad-core Intel i5 processor with 8GB memory.
Software Dependencies No The paper does not specify version numbers for any software dependencies (e.g., programming languages, libraries, frameworks) used to implement their proposed methods. It only mentions the HIPR implementation for a baseline method without specifying its version for their own work.
Experiment Setup Yes In our experiment, we set ℓ= 32. ... In the fist row, p1 = 10 -3, p2 = 5 10 -3, p3 = 0.5. In the second row, p1 = 10 -4, p2 = 2 10 -4, p3 = 5 10 -3. Regard to the last two columns of NMI results, level 1 is the first level that contains the two high-level clusters, and level 2 is the second one that contains the four low-level clusters. Each point is calculated on average over 5 trials, and error bar indicates standard deviation.