AutoGFM: Automated Graph Foundation Model with Adaptive Architecture Customization

Authors: Haibo Chen, Xin Wang, Zeyang Zhang, Haoyang Li, Ling Feng, Wenwu Zhu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate that AutoGFM outperforms baselines, achieving state-of-the-art performance. The contributions of this paper are summarized as follows: We conduct extensive experiments on eight datasets to demonstrate the superiority of our method over state-of-the-art baselines.
Researcher Affiliation | Academia | Department of Computer Science and Technology, BNRIST, Tsinghua University, Beijing, China. Correspondence to: Xin Wang <EMAIL>, Wenwu Zhu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Training pipeline for AutoGFM
Open Source Code | No | The paper mentions using and reproducing results from other methods' publicly available code, but does not provide a statement or link for the code of the AutoGFM methodology itself.
Open Datasets | Yes | Datasets: We employ datasets with diverse domains and tasks. For node-level tasks, we utilize citation networks (Cora, PubMed, and Arxiv) and the web link network (WikiCS). For edge-level tasks, we utilize knowledge graphs (WN18RR, FB15K237). For graph-level tasks, we utilize molecular datasets (HIV, PCBA, and ChEMBL). Following (Liu et al., 2023a), we use the textual encoder to unify the node features from different domains. Open graph benchmark: Datasets for machine learning on graphs. Advances in Neural Information Processing Systems, 33:22118–22133, 2020.
Dataset Splits | Yes | Dataset splitting: We adopt the same splitting strategy as (Liu et al., 2023a; Wang et al., 2024b). For Cora and PubMed, we select 20 labeled nodes per class for training and utilize a predefined set of 10 splits with different random seeds to compute the average performance. For WikiCS, we report the average accuracy over 20 distinct training splits, each generated with 20 different random seeds; in each split, 5% of the nodes from each class are used for training. For Arxiv, HIV, and PCBA, we employ the official dataset splits and conduct experiments 10 times using different random seeds to determine the average accuracy. The FB15K237 dataset consists of 272,115 edges in the training set, 17,535 in the validation set, and 20,466 in the test set; for WN18RR, the corresponding numbers are 86,835, 3,034, and 3,134, respectively. Each experiment is repeated 10 times with different random seeds, and the final results are reported as the average accuracy.
Hardware Specification | Yes | GPU: NVIDIA A100-SXM4-40GB and NVIDIA A100-SXM4-80GB. CPU: Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz.
Software Dependencies | Yes | Software: Python 3.9, CUDA 12.2, PyTorch (Paszke et al., 2019) 1.13.1.
Experiment Setup | Yes | We evaluate different GNN architectures and GNAS methods based on GFT (Wang et al., 2024b), following the default hyperparameters of GFT to maintain consistency. To ensure a fair comparison, we set the dimensionality of all methods to 768, use the same search space and operations (GCN, GIN, GAT, GraphSAGE, GraphConv), and fix the number of layers to 2. For our method, we explore hyperparameters λ, β ∈ {1e-1, 1e-2, 1e-3, 1e-4} and empirically select λ and β. The learning rate of the disentangled contrastive graph encoder is set to 5e-3, and the learning rate of the architecture predictor is set to 3e-2. The dimensionality of both the graph encoder and the supernet is 768. Each experiment is conducted 10 times, and we report the average performance along with standard deviations.
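The reported setup (768-dimensional encoder and supernet, a 2-layer search space over five GNN operations, a grid over λ and β, and results averaged over 10 seeded runs) can be sketched as a minimal configuration-and-aggregation loop. This is an illustrative sketch only: `run_experiment` is a hypothetical stand-in returning placeholder scores, not the paper's training code.

```python
import statistics

# Hyperparameters as reported in the paper's experiment setup.
CONFIG = {
    "hidden_dim": 768,               # dimensionality of encoder and supernet
    "num_layers": 2,                 # fixed number of GNN layers
    "operations": ["GCN", "GIN", "GAT", "GraphSAGE", "GraphConv"],
    "lambda_grid": [1e-1, 1e-2, 1e-3, 1e-4],  # grid explored for λ
    "beta_grid": [1e-1, 1e-2, 1e-3, 1e-4],    # grid explored for β
    "lr_encoder": 5e-3,              # disentangled contrastive graph encoder
    "lr_predictor": 3e-2,            # architecture predictor
    "num_runs": 10,                  # repeats with different random seeds
}

def run_experiment(seed: int) -> float:
    """Hypothetical stand-in for one seeded run; returns test accuracy."""
    # A real run would train with CONFIG under this seed and evaluate
    # on the dataset's test split; here we return placeholder scores.
    return 0.80 + 0.001 * (seed % 3)

def averaged_result(num_runs: int) -> tuple:
    """Mean and standard deviation over seeded runs, as the paper reports."""
    scores = [run_experiment(seed) for seed in range(num_runs)]
    return statistics.mean(scores), statistics.stdev(scores)

mean_acc, std_acc = averaged_result(CONFIG["num_runs"])
print(f"accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
```

The same seed-then-average pattern applies to every dataset in the report (10 runs for Cora, PubMed, Arxiv, HIV, PCBA, and the knowledge graphs; 20 splits for WikiCS), so only the split source and run count change per dataset.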