Calibrating and Improving Graph Contrastive Learning
Authors: Kaili Ma, Garry Yang, Han Yang, Yongqiang Chen, James Cheng
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide both theoretical and empirical results to demonstrate the effectiveness of Contrast-Reg in enhancing the generalizability of the Graph Neural Network (GNN) model and improving the performance of graph contrastive algorithms with different similarity definitions and encoder backbones across various downstream tasks. Furthermore, we design experiments to examine the empirical performance of Contrast-Reg... We begin by introducing the experimental settings in Section 6.1. Section 6.2 presents the main results across various downstream tasks. |
| Researcher Affiliation | Academia | Kaili Ma* EMAIL, Department of Computer Science and Engineering, The Chinese University of Hong Kong; Garry Yang* EMAIL, Department of Computer Science and Engineering, The Chinese University of Hong Kong; Han Yang EMAIL, Department of Computer Science and Engineering, The Chinese University of Hong Kong; Yongqiang Chen EMAIL, Department of Computer Science and Engineering, The Chinese University of Hong Kong; James Cheng EMAIL, Department of Computer Science and Engineering, The Chinese University of Hong Kong |
| Pseudocode | Yes | Algorithm 1: Graph Contrastive Learning Framework; Algorithm 2: ML (Parameter: parameters of an (additional) GNN layer g); Algorithm 3: LC (Hyperparameters: R, the curriculum update epochs, and k, the number of candidate positive samples for each seed node); Algorithm 4: GCA (Hyperparameter: two stochastic augmentation function sets T and T) |
| Open Source Code | No | Our codes and datasets will be made available. |
| Open Datasets | Yes | The datasets we employ encompass citation networks, web graphs, co-purchase networks, and social networks. Comprehensive statistics for these datasets can be found in Appendix D. For Cora, Citeseer, Pubmed, ogbn-arxiv, ogbn-products, and Reddit, we adhere to the standard dataset splits and conduct 10 different runs with fixed random seeds ranging from 0 to 9. For Computers, Photo, and Wiki, we randomly divide the train/validation/test sets, allocating 20/30/all remaining nodes per class, in accordance with the recommendations in the previous literature (Shchur et al., 2018). ...Dataset statistics: The detailed dataset statistics are shown in Table 9.<br>Dataset (Node # / Edge # / Feature # / Class #):<br>Cora (Yang et al., 2016): 2,708 / 5,429 / 1,433 / 7<br>Citeseer (Yang et al., 2016): 3,327 / 4,732 / 3,703 / 6<br>Pubmed (Yang et al., 2016): 19,717 / 44,338 / 500 / 3<br>ogbn-arxiv (Hu et al., 2020a): 169,343 / 1,166,243 / 128 / 40<br>Wiki (Yang et al., 2015): 2,405 / 17,981 / 4,973 / 3<br>Computers (Shchur et al., 2018): 13,381 / 245,778 / 767 / 10<br>Photo (Shchur et al., 2018): 7,487 / 119,043 / 745 / 8<br>ogbn-products (Hu et al., 2020a): 2,449,029 / 61,859,140 / 100 / 47<br>Reddit (Hamilton et al., 2017): 232,965 / 114,615,892 / 602 / 41 |
| Dataset Splits | Yes | For Cora, Citeseer, Pubmed, ogbn-arxiv, ogbn-products, and Reddit, we adhere to the standard dataset splits and conduct 10 different runs with fixed random seeds ranging from 0 to 9. For Computers, Photo, and Wiki, we randomly divide the train/validation/test sets, allocating 20/30/all remaining nodes per class, in accordance with the recommendations in the previous literature (Shchur et al., 2018). ...In order to circumvent the data linkage issue in link prediction, we employ an inductive setting for graph representation learning. We randomly extract induced subgraphs (comprising 85% of the edges) from each original graph for training both the representation learning model and the link predictor, while reserving the remaining edges for validation and testing (10% for the test edge set and 5% for the validation edge set). ...For the Reddit dataset, we naturally partition the data by time, pretraining the models using the first 20 days. We generate an induced subgraph based on the pretraining nodes and divide the remaining data into three parts: the first part produces a new subgraph for fine-tuning the pre-trained model and training the classifier, while the second and third parts are designated for validation and testing. For the ogbn-products dataset, we split the data according to node ID, pretraining the models using a subgraph generated by the initial 70% of the nodes. The data splitting scheme for the remaining data mirrors that of the Reddit dataset. |
| Hardware Specification | Yes | The experiments are conducted on Linux servers installed with an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 256GB RAM and 8 NVIDIA 2080Ti GPUs. |
| Software Dependencies | Yes | Our models, as well as the DGI, GMI and GCN baselines, were implemented in PyTorch Geometric (Fey & Lenssen, 2019) version 1.4.3, DGL (Wang et al., 2019) version 0.5.1 with CUDA version 10.2, scikit-learn version 0.23.1 and Python 3.6. |
| Experiment Setup | Yes | For full-batch training, we used a 1-layer GCN as the encoder with prelu activation; for mini-batch training, we used a 3-layer GCN with prelu activation. We conducted a grid search over learning rates (1e-2, 5e-3, 3e-3, 1e-3, 5e-4, 3e-4, 1e-4) and curriculum settings (including learning rate decay and curriculum rounds) on the full-batch version. For mini-batch training, we used 1e-3 or 5e-4 as the learning rate; 10,10,15 or 10,10,25 as the fanouts; and 1024 or 512 as the batch size. The hyperparameter configurations can be found in Appendix D. |
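The per-class split quoted above (20 train / 30 validation / all remaining test nodes per class, with fixed random seeds) can be sketched as follows. Since the paper's code is not released, the function name and NumPy-based implementation are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def per_class_split(labels, n_train=20, n_val=30, seed=0):
    """Randomly assign node indices per class: n_train to train,
    n_val to validation, and all remaining nodes to test
    (the split scheme described for Computers, Photo, and Wiki)."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        # Indices of all nodes with label c, in random order.
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])
    return np.array(train), np.array(val), np.array(test)
```

Repeating this with seeds 0 through 9 would mirror the "10 different runs with fixed random seeds" protocol the excerpt describes.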