Subgraph Aggregation for Out-of-Distribution Generalization on Graphs
Authors: Bowen Liu, Haoyang Li, Shuning Wang, Shuo Nie, Shanghang Zhang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on both synthetic and real-world datasets demonstrate that SuGAr outperforms state-of-the-art methods, achieving up to a 24% improvement in OOD generalization on graphs. We conducted experiments on 15 datasets to rigorously evaluate the effectiveness of SuGAr, encompassing both synthetic and real-world datasets that exhibit various distribution shifts. The results not only demonstrate the efficacy of SuGAr in learning multiple subgraphs but also highlight its superiority in single-subgraph learning when only one causal subgraph is present. Remarkably, SuGAr demonstrates improvements across multiple graph datasets. Main Results (RQ1). To address RQ1, we conducted a comparative analysis of SuGAr against a range of baseline methods. The performance of SuGAr, relative to current SOTA methods, is presented in Tables 1 and 2. Ablation Studies (RQ3). To assess the importance of diversity injection, we designed the following variants and conducted experiments on three challenging datasets: SPMotif-0.9, EC50-Scaffold, and EC50-Size: (1) D: Removed the diversity regularizer, retaining only the sampler to learn diverse subgraphs. (2) S: Removed the sampler, retaining only the diversity regularizer to learn diverse subgraphs. (3) A: Retained both the sampler and the diversity regularizer. (4) N: Removed all components of the proposed method. |
| Researcher Affiliation | Academia | 1 State Key Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; 2 Harbin Institute of Technology; 3 Weill Cornell Medicine, Cornell University. EMAIL, EMAIL, EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology in prose and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an unambiguous statement of code release, a link to a code repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | Datasets. We utilized the SPMotif datasets from DIR (Wu et al. 2022c), which involve artificial structural shifts and graph size shifts. Additionally, we developed SUMotif based on SPMotif to verify whether SuGAr generalizes effectively when multiple critical subgraphs are present. To evaluate SuGAr in real-world scenarios with more complex distribution shifts, we employed DrugOOD (Ji et al. 2022) from AI-aided Drug Discovery with Assay, Scaffold, and Size splits, the Colored MNIST dataset injected with attribute shifts, and Graph-SST (Socher et al. 2013) with degree biases, following the methodology of CIGA (Chen et al. 2022). More dataset details are shown in the Appendix. |
| Dataset Splits | No | The paper mentions evaluating models based on "validation performance" and refers to a "protocol" for DrugOOD from (Ji et al. 2022), but it does not explicitly state the specific train/validation/test split percentages, sample counts, or detailed splitting methodology in the main text. It states "More dataset details are shown in the Appendix," but these details are not in the main paper. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory amounts) used for conducting the experiments. |
| Software Dependencies | No | The paper does not specify any software or library names with version numbers that would be necessary to replicate the experiments. |
| Experiment Setup | Yes | All ensemble and WA methods, as well as SuGAr, employed 10 base models. We apply the Greedy selection strategy for all WA methods, as it consistently outperformed the Uniform strategy across all datasets. Both methods followed a shared initialization and mild hyperparameter search setup as described in (Rame et al. 2022). |
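The Greedy selection strategy for weight averaging (WA) mentioned in the experiment setup can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the dict-of-floats representation of model weights, and the toy validation function below are illustrative assumptions, following the general greedy-soup recipe from the weight-averaging literature (candidates are ranked by individual validation score, then added to the running average only if they do not hurt validation performance).

```python
# Hedged sketch of greedy weight-averaging selection (assumed recipe, not the
# paper's code). A "model" here is a dict mapping parameter names to floats.

def average_weights(weight_dicts):
    """Uniformly average a list of parameter dictionaries."""
    keys = weight_dicts[0].keys()
    return {k: sum(w[k] for w in weight_dicts) / len(weight_dicts) for k in keys}

def greedy_weight_averaging(models, validate):
    """Greedily grow the averaged model: rank candidates by individual
    validation score, then keep each one only if adding it does not
    degrade the validation score of the running average."""
    ranked = sorted(models, key=validate, reverse=True)
    soup = [ranked[0]]
    best = validate(average_weights(soup))
    for candidate in ranked[1:]:
        trial = validate(average_weights(soup + [candidate]))
        if trial >= best:
            soup.append(candidate)
            best = trial
    return average_weights(soup), best

# Toy usage: validation score peaks when the single parameter "w" equals 1.0.
models = [{"w": 0.0}, {"w": 1.0}, {"w": 2.0}]
validate = lambda m: -abs(m["w"] - 1.0)
averaged, score = greedy_weight_averaging(models, validate)
# Here the greedy pass keeps only the best candidate, since averaging in
# either of the others would lower the validation score.
```

With 10 base models, as in the setup above, the Uniform strategy would simply average all 10, while the Greedy strategy typically retains a subset, which is consistent with the reported advantage of Greedy over Uniform.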