Adjacency Search Embeddings
Authors: Meher Chaitanya, Kshitijaa Jaglan, Ulrik Brandes
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approaches, when used in conjunction with the skip-gram model, are more effective than other shallow embedding techniques on tasks such as link prediction and node classification. By incorporating our mechanisms as a preprocessing technique, we show substantial improvements in node classification performance across GNNs such as GCN, GraphSAGE, and GATv2 on both attributed and non-attributed networks. Empirical Performance: Our proposed algorithms significantly enhance link prediction accuracy when integrated with the skip-gram model, outperforming the best results of other shallow embedding techniques by an average of 6.97%. Furthermore, our approach serves as an effective local neighborhood sampling technique for message aggregation in GNNs, leading to improved performance across various architectures, including GCN, GraphSAGE, and GATv2, for both attributed and non-attributed graphs. |
| Researcher Affiliation | Academia | Meher Chaitanya EMAIL ETH Zürich; Kshitijaa Jaglan EMAIL University of Zürich; Ulrik Brandes EMAIL ETH Zürich |
| Pseudocode | Yes | Algorithm 1 MAS-Embeddings; Algorithm 2 TAS-Embeddings; Algorithm 3 Delayed-BFS; Algorithm 4 TAS-Sampling |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It only mentions a link to the open review forum and a PyTorch documentation link, which are not direct releases of the authors' implementation. |
| Open Datasets | Yes | We utilize six datasets to evaluate the performance of our approaches against other shallow embedding methods. The statistics of these six graph datasets employed for evaluation are outlined in Appendix B, Table 8. 1. Facebook Leskovec & Mcauley (2012): (...) 2. Arxiv Leskovec et al. (2007): (...) 3. Email Yin et al. (2017): (...) 4. Cora Sen et al. (2008): (...) 5. Pubmed Namata et al. (2012): (...) 6. Citeseer Sen et al. (2008): (...) 7. Blog Zafarani & Liu (2009): (...). We also used the standard train-test splits on the ogbn-products Hu et al. (2020) dataset. |
| Dataset Splits | Yes | For training both Maximum Adjacency Search (MAS) and Threshold-based Adjacency Search (TAS) embeddings, we first obtain a sub-graph with 80% randomly selected edges from each dataset and generate node embeddings by training MAS and TAS on these sub-graphs. A subset of the training edges (10%) is used for validation to select the best operator. The best logistic regression classifier obtained in this process is then used for testing on the remaining 20% of the data. We use an equal-sized sample of negative edges as positive edges for training. Accuracy for link prediction is evaluated using the AUC (area under the ROC curve) score with 5-fold cross-validation Baeza-Yates (1999). For node classification, the process involves training logistic classifiers with 80% of the labeled nodes, including a subset for validation, while the remaining 20% serves as a test set. This procedure is repeated with 5 random seed initializations. We used the GraphSAGE, GCN, and GATv2 models, trained and tested with the same data splits as in previous studies Kipf & Welling (2016a); Brody et al. (2021); Cui et al. (2022), namely 20 randomly selected samples per class for training with a validation set of 500 samples. |
| Hardware Specification | Yes | Training times were obtained using a 16-core processor, running TAS on 12 threads; all algorithms were implemented using gensim. |
| Software Dependencies | No | The paper mentions that "all algorithms were implemented using gensim" and refers to the "Skip Gram-Word2Vec model," but it does not provide specific version numbers for these software components or any other libraries. |
| Experiment Setup | Yes | Parameters for MAS and TAS: Number of Neighbors (k): In our experiments, we vary k from 1 to 10. For graphs with higher average degrees (such as Facebook and Email), we vary the number of neighbors (k) from 1 to 20. Number of Permutations (p): In our experiments, p is a small constant, varied from 1 to 8. Walk Length (l): We vary this parameter from 10 to 40 in steps of 5. Window Size (w): Chosen from the set {5, 10, 15, 20}. Dimensions (d): For experimental purposes, we set d = 128. Threshold (θ): For our empirical analysis, the threshold parameter θ ∈ Z+ is varied over the range θ ∈ [1, 8]. For node classification, the process involves training logistic classifiers with 80% of the labeled nodes, including a subset for validation, while the remaining 20% serves as a test set. This procedure is repeated with 5 random seed initializations. |
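The Pseudocode row lists Algorithm 1, MAS-Embeddings, which builds on maximum adjacency search. For readers unfamiliar with the traversal, the classic MAS ordering (repeatedly visit the unvisited node with the most edges into the already-visited set) can be sketched as below. The `mas_order` function and the toy graph are illustrative assumptions only; the paper's Algorithm 1 combines this kind of ordering with skip-gram training in ways not reproduced here.

```python
def mas_order(adj, start):
    """Classic maximum adjacency search: visit nodes in order of
    maximum total adjacency to the set of already-visited nodes."""
    visited = [start]
    # connectivity[v] = number of edges from v into the visited set
    connectivity = {v: 0 for v in adj if v != start}
    for u in adj[start]:
        if u in connectivity:
            connectivity[u] += 1
    while connectivity:
        # pick the unvisited node most strongly attached to the visited set
        nxt = max(connectivity, key=connectivity.get)
        del connectivity[nxt]
        visited.append(nxt)
        for u in adj[nxt]:
            if u in connectivity:
                connectivity[u] += 1
    return visited

# Toy graph: a triangle {a, b, c} plus a pendant node d attached to c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(mas_order(adj, "a"))  # ties are broken arbitrarily
```

Starting from `a`, the triangle is exhausted before the weakly attached pendant `d` is reached, which is the locality property that makes such orderings attractive for neighborhood sampling.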
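The evaluation protocol described under Dataset Splits (hold out 20% of edges, sample an equal number of negative edges, score candidate edges, report AUC) can be sketched in miniature. The common-neighbour score and the random toy graph below are hypothetical stand-ins for the learned MAS/TAS embeddings and the logistic-regression classifier; only the split-and-AUC skeleton follows the paper.

```python
import random

def auc(pos_scores, neg_scores):
    """AUC = probability a random positive outscores a random negative
    (ties count as half), computed exhaustively over all pairs."""
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
    return wins / len(pairs)

random.seed(0)
nodes = list(range(30))
edges = {(i, j) for i in nodes for j in nodes if i < j and random.random() < 0.2}

test = set(random.sample(sorted(edges), len(edges) // 5))  # 20% held out
train = edges - test                                       # 80% for "training"
neg = set()
while len(neg) < len(test):                                # equal-sized negatives
    i, j = sorted(random.sample(nodes, 2))
    if (i, j) not in edges:
        neg.add((i, j))

# Build adjacency from training edges only, so test edges stay unseen.
adj = {v: set() for v in nodes}
for i, j in train:
    adj[i].add(j)
    adj[j].add(i)

# Hypothetical scorer: common neighbours in the training graph.
score = lambda e: len(adj[e[0]] & adj[e[1]])
print(round(auc([score(e) for e in test], [score(e) for e in neg]), 3))
```

The paper additionally averages over 5 folds and selects the best edge operator on a 10% validation slice; both refinements drop out of this single-split sketch.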