reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Online GNN Evaluation Under Test-time Graph Distribution Shifts

Authors: Xin Zheng, Dongjin Song, Qingsong Wen, Bo Du, Shirui Pan

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments on real-world test graphs under diverse graph distribution shifts could verify the effectiveness of the proposed method, revealing its strong correlation with ground-truth test errors on various well-trained GNN models.
Researcher Affiliation	Collaboration	Xin Zheng, Monash University, Melbourne, Australia; Dongjin Song, University of Connecticut, Storrs, USA; Qingsong Wen, Squirrel AI, Bellevue, USA; Bo Du, Wuhan University, Wuhan, China; Shirui Pan, Griffith University, Queensland, Australia
Pseudocode	Yes	Algorithm 1 Learning Behavior Discrepancy (LEBED) Score Computation.
Open Source Code	Yes	Code is available at https://github.com/Amanda-Zheng/LEBED
Open Datasets	Yes	We perform experiments on six real-world graph datasets with diverse graph data distribution shifts containing: node feature shifts (Wu et al., 2022; Jin et al., 2023b), domain shifts (Wu et al., 2020), temporal shifts (Wu et al., 2022). Detailed statistics of all these datasets are listed in Table A1 in Appendix B.
Dataset Splits	Yes	For all training graphs and validation graphs, we follow the process procedures and splits in works (Wu et al., 2022) and (Wu et al., 2020).
Hardware Specification	Yes	The running time comparison on Citationv2 in seconds is shown in Fig. 3 with a single GeForce RTX 3080 GPU and 200 iterations for w/ Dstru.
Software Dependencies	No	In our experiments, we use PyTorch Geometric library (Fey & Lenssen, 2019) and four GeForce RTX 3080 GPUs for all implementations. However, specific version numbers for the software dependencies are not provided.
Experiment Setup	Yes	More details of these well-trained GNN models, including architectures, training hyper-parameters, and ground-truth test error distributions, are provided in Appendix D. We report the correlation between the proposed LEBED and the ground-truth test errors under unseen and unlabeled test graphs with distribution shifts, using R² and rank correlation Spearman's ρ, where R² ranges [0, 1], representing the degree of linear fit between two variables. The closer it is to 1, the higher the linear correlation. Spearman's ρ ranges [-1, 1], representing the monotonic correlation between two variables with 1 indicating the positive correlation and -1 indicating the negative correlation.
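
The two correlation measures named in the Experiment Setup row can be illustrated with a minimal sketch. It is not taken from the paper or its repository: the function names and the sample values are hypothetical, and R² is assumed here to be the coefficient of determination of a simple linear fit, which equals the squared Pearson correlation. Spearman's ρ is computed as the Pearson correlation of the ranks, with tied values given their average rank.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def r_squared(x, y):
    """R^2 of a simple linear fit of y on x (assumption: squared Pearson r)."""
    return pearson(x, y) ** 2


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical values for illustration only, not results from the paper.
scores = [0.10, 0.25, 0.40, 0.55, 0.90]       # e.g. one score per test graph
test_errors = [0.12, 0.20, 0.35, 0.50, 0.95]  # e.g. ground-truth test errors

if __name__ == "__main__":
    print("R^2:", r_squared(scores, test_errors))
    print("Spearman rho:", spearman_rho(scores, test_errors))
```

Because the hypothetical scores and errors increase together, Spearman's ρ reaches its maximum for this sample, while R² stays below 1 unless the points lie exactly on a line.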