Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Unified Framework for Spectral Clustering in Sparse Graphs

Authors: Lorenzo Dall'Amico, Romain Couillet, Nicolas Tremblay

JMLR 2021 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed algorithm is extensively tested on synthetic DC-SBM graphs (for which the performance is measured in terms of overlap between the estimated and ground-truth labels) as well as on real networks (for which the performance is measured in terms of modularity and DC-SBM log-likelihood). In both cases our algorithm outperforms or is on par with the standard competing spectral techniques for sparse graphs.
Researcher Affiliation | Academia | Lorenzo Dall'Amico (EMAIL), Romain Couillet (EMAIL), Nicolas Tremblay (EMAIL). GIPSA-lab, Université Grenoble Alpes, CNRS, Grenoble INP, 11 Rue des Mathématiques, 38400 Saint-Martin-d'Hères, France; Laboratoire d'Informatique de Grenoble (LIG), Université Grenoble Alpes, 700 Av. Centrale, 38401 Saint-Martin-d'Hères, France
Pseudocode | Yes | Algorithm 0: A typical spectral clustering algorithm on a graph G with k classes... Algorithm 1: Community Detection with the Bethe-Hessian... Algorithm 2: Community Detection in sparse and heterogeneous graphs... Subroutine 1: estimate_number_of_classes... Subroutine 2: compute_ζ
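The listed pseudocode centers on spectral clustering with the Bethe-Hessian matrix. As a rough illustration of the general recipe (not the paper's exact Algorithms 1–2), the sketch below builds the standard Bethe-Hessian H(r) = (r² − 1)Iₙ + D − rA and embeds nodes using the eigenvectors attached to its smallest eigenvalues, with the common heuristic r = √(mean degree); all function names are ours, and a real implementation would use sparse matrices.

```python
import numpy as np

def bethe_hessian(A, r):
    """Bethe-Hessian matrix H(r) = (r^2 - 1) I + D - r A."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))  # diagonal degree matrix
    return (r ** 2 - 1.0) * np.eye(n) + D - r * A

def bethe_hessian_embedding(A, k):
    """Node embedding from the k eigenvectors of H(r) with the smallest
    eigenvalues, using the common heuristic r = sqrt(mean degree)."""
    r = np.sqrt(A.sum() / A.shape[0])
    _, eigvecs = np.linalg.eigh(bethe_hessian(A, r))  # eigenvalues ascending
    return eigvecs[:, :k]

# Toy usage: two triangles joined by a single bridge edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
X = bethe_hessian_embedding(A, k=2)  # rows of X would then feed k-means
```

For sparse graphs at the scale reported in the paper, the dense `eigh` call would be replaced by a sparse eigensolver such as `scipy.sparse.linalg.eigsh`.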
Open Source Code | Yes | A Python implementation of the codes needed to reproduce the results of this article can be found at lorenzodallamico.github.io/codes/: all data, experiments and running times refer to this implementation. On top of the Python codes, we developed a more efficient Julia implementation of the proposed Algorithm 2, available in the CoDeBetHe.jl package, which gains a factor of 5–10× in computational time over the Python implementation.
Open Datasets | Yes | Table 1 compares the results of different clustering algorithms on 15 real-world networks of increasing size. ... on some large-size Lancichinetti-Fortunato-Radicchi networks (Lancichinetti et al., 2008) that notoriously constitute a benchmark to evaluate clustering performance on synthetic graphs with ground-truth labels that resemble real-world networks. ... real-world networks (Leskovec and Krevl, 2014; Zachary, 1977; Lusseau et al., 2003; Girvan and Newman, 2002; Adamic and Glance, 2005).
Dataset Splits | No | The parameters of the LFR are n = 50,000, dmax = 15 log(n), d̂ = 5, τ1 = 3, τ2 = 1, min_class_size = n/8, max_class_size = n/8. Variances are taken over 8 runs of k-means. ... For all networks, the number of communities, when not available, is estimated through k̂_L (see Subroutine 1) and then the same value is used for all competing techniques ... community detection is performed only on the largest connected component of the graph, and n, c, Φ, k refer to the characteristics of this dominant connected component. Given the embedding X, 5 iterations of k-means are run and the best partition ℓ̂ (in terms of k-means cost) is kept, and the scores M, L are computed. To take the stochasticity of k-means into account, this step is repeated 8 times and Table 1 reports the results for M, L as mean ± standard deviation.
Hardware Specification | Yes | The laptop's RAM is 7.7 GB, with an Intel Core i7-6600U CPU @ 2.6 GHz × 4
Software Dependencies | No | A Python implementation of the codes needed to reproduce the results of this article can be found at lorenzodallamico.github.io/codes/: all data, experiments and running times refer to this implementation. On top of the Python codes, we developed a more efficient Julia implementation of the proposed Algorithm 2, available in the CoDeBetHe.jl package, which gains a factor of 5–10× in computational time over the Python implementation. ... We indeed observed that when the number of communities estimated by both algorithms is similar, they both produce node partitions with a similar modularity. This is however no longer the case when the estimated k are more distinct (e.g., on GNutella P2P). While we recommend our estimation method of k due to its interpretability (which we discussed all along the article), we acknowledge that the Louvain algorithm often provides very competitive outcomes, at a smaller computational cost but at the expense of theoretical guarantees. Similar considerations can be made in relation to the more recently proposed Leiden method (Traag et al., 2019).
Experiment Setup | Yes | For this simulation, n = 50,000, k = 2, θ = 1_n if not otherwise specified. (Left) spectrum of the matrix Lsym in the: A) dense regime (cin/n = 0.06, cout/n = 0.02); B) sparse regime (cin = 6, cout = 2). (Right) first eigenvector x1 vs. second eigenvector x2 of Lsym in the dense regime for: C) θ = 1_n; D) θ_i ∼ [U(3, 10)]^3. ... The parameters of the LFR are n = 50,000, dmax = 15 log(n), d̂ = 5, τ1 = 3, τ2 = 1, min_class_size = n/8, max_class_size = n/8. Variances are taken over 8 runs of k-means. ... Given the embedding X, 5 iterations of k-means are run and the best partition ℓ̂ (in terms of k-means cost) is kept, and the scores M, L are computed. To take the stochasticity of k-means into account, this step is repeated 8 times.
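The quoted evaluation loop (best partition out of several k-means runs, repeated 8 times, with modularity M reported as mean ± standard deviation) can be sketched as below. The plain-NumPy Lloyd's k-means and the Newman modularity are minimal stand-ins for illustration, not the paper's implementation, and all names are ours.

```python
import numpy as np

def lloyd_kmeans(X, k, n_init=5, max_iter=100, rng=None):
    """Keep the best of n_init Lloyd's runs (lowest within-cluster cost)."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_labels, best_cost = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = dists.argmin(axis=1)
            new_centers = np.array(
                [X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                 for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        cost = ((X - centers[labels]) ** 2).sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_labels

def modularity(A, labels):
    """Newman modularity of the partition `labels` on adjacency matrix A."""
    two_m = A.sum()
    deg = A.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    return ((A - np.outer(deg, deg) / two_m) * same).sum() / two_m

def repeated_modularity(A, X, k, repeats=8):
    """Repeat the best-of-5 k-means step 8 times; report mean and std of M."""
    rng = np.random.default_rng(42)
    scores = [modularity(A, lloyd_kmeans(X, k, rng=rng)) for _ in range(repeats)]
    return float(np.mean(scores)), float(np.std(scores))

# Toy usage: two triangles joined by one bridge edge, with a hand-made
# 2-D embedding X that cleanly separates the two communities.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
mean_M, std_M = repeated_modularity(A, X, k=2)
```

The 8-fold repetition only captures the randomness of k-means initialization; the embedding itself is deterministic, which is why the tables can report a mean ± standard deviation per network.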