Randomized Spectral Co-Clustering for Large-Scale Directed Networks

Authors: Xiao Guo, Yixuan Qiu, Hai Zhang, Xiangyu Chang

JMLR 2023 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Numerically, we design and conduct simulations to support our theoretical results and test the efficiency of the algorithms on real networks with up to millions of nodes. A publicly available R package Rand Clust is developed for better usability and reproducibility of the proposed methods.
Researcher Affiliation Academia Xiao Guo EMAIL Center for Modern Statistics School of Mathematics Northwest University, Xi an, China Yixuan Qiu EMAIL School of Statistics and Management Shanghai University of Finance and Economics, Shanghai, China Hai Zhang EMAIL Center for Modern Statistics School of Mathematics Northwest University, Xi an, China Xiangyu Chang EMAIL Center for Intelligent Decision-Making and Machine Learning School of Management Xi an Jiaotong University, Xi an, China
Pseudocode Yes Algorithm 1 Spectral co-clustering with k-means Input: Adjacency matrix A Rn n of a directed network, number of row clusters Ky, and number of column clusters Kz (Ky Kz). 1: Compute the partial SVD of A, with left and right singular vectors U Rn Ky and V Rn Ky. 2: Run k-means on U with Ky target clusters and on V with Kz clusters. 3: Output the co-clustering results. Algorithm 2 Spectral co-clustering with spherical k-means Input: Adjacency matrix A Rn n of a directed network, number of row clusters Ky, and number of column clusters Kz (Ky Kz). 1: Compute the partial SVD of A, with left and right singular vectors U Rn Ky and V Rn Ky. 2: Construct U and V , whose rows are normalized rows of U and V , respectively. The zero rows are remained the same. 3: Run k-means on U with Ky target clusters and on V with Kz clusters. 4: Output the co-clustering results.
Open Source Code Yes A publicly available R package Rand Clust is developed for better usability and reproducibility of the proposed methods. ... 1. https://github.com/Xiao Guo-stat/Rand Clust
Open Datasets Yes We consider two directed networks, one is the statisticians citation network (Ji and Jin, 2016) and the other is the European email network (Yin et al., 2017). ... Table 3: A summary of the five real large-scale networks. Data No. of nodes No. of edges Target rank Epinions social network (Richardson et al., 2003) 75,877 508,836 3 Slashdot social network (Leskovec et al., 2009) 77,360 905,468 5 Berkeley-Stanford web network (Leskovec et al., 2009) 654,782 7,499,425 4 Wikipedia top categories network(Yin et al., 2017) 1,791,489 28,511,807 5 Wikipedia talk network (Leskovec et al., 2010) 2,388,953 5,018,445 3
Dataset Splits No The paper describes simulation model setups and evaluates algorithms on existing real-world networks without mentioning specific training, testing, or validation splits. It focuses on misclustering rates and approximation errors on the full networks or simulated data.
Hardware Specification Yes A machine with Intel Core i9-9900K CPU 3.60GHz, 32GB memory, and 64-bit WS operating-system, and R version 4.2.2 is used for all computations.
Software Dependencies Yes A machine with Intel Core i9-9900K CPU 3.60GHz, 32GB memory, and 64-bit WS operating-system, and R version 4.2.2 is used for all computations. ... We use the R package irlba to compute the singular vector iteratively after the sampling step. ... svds in R package RSpectra (Qiu and Mei, 2019) and the augmented implicitly restarted Lanczos bidiagonalization algorithm (Baglama and Reichel, 2005) (irlba in R package irlba (Baglama et al., 2019)).
Experiment Setup Yes In the random projection scheme, the oversampling parameter is 10, the power parameter is 2, and the test matrices are generated with i.i.d. standard Gaussian entries. In the random sampling scheme, the sampling rate is 0.7. ... For RP-SCC (RP-Ss CC), the oversampling parameter is 10, the power parameter is 2... For RS-SCC (RS-Ss CC), the sampling rate is 0.8... For svds and irlba, the tolerance parameter is set to be 10 5. For RB-Krylov, the power parameter is also 2... For RP, the power parameter is 1 and the oversampling parameter is 5... For RS, the sampling parameter is 0.7... For RB-Krylov, the power parameter is 1. For rsvd, the power and oversampling parameters are both the same with those in RP.