Griffin: Towards a Graph-Centric Relational Database Foundation Model

Authors: Yanbo Wang, Xiyuan Wang, Quan Gan, Minjie Wang, Qibin Yang, David Wipf, Muhan Zhang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Evaluated on large-scale, heterogeneous, and temporal graphs extracted from RDBs across various domains (spanning over 150 million nodes), Griffin achieves performance superior or comparable to individually trained models, excels in low-data scenarios, and shows strong transferability to new datasets and tasks when pretrained on similar and diverse data, highlighting its potential as a universally applicable foundation model for RDBs.
Researcher Affiliation | Collaboration | (1) Institute for Artificial Intelligence, Peking University; (2) Amazon Web Services. Correspondence to: Muhan Zhang <EMAIL>.
Pseudocode | No | The paper describes the model design and training pipeline in detail using textual descriptions, mathematical formulas, and figures. However, there are no explicitly labeled pseudocode or algorithm blocks presenting structured steps.
Open Source Code | Yes | Code available at github.com/yanxwb/Griffin.
Open Datasets | Yes | We sourced large-scale temporal RDBs from two leading benchmarks, 4DBInfer (Wang et al., 2024) and RelBench (Robinson et al., 2024), covering a wide range of domains, scales, and tasks. A total of 24 tasks were selected for SFT and downstream evaluation. Single-table datasets: over 200 datasets were curated from TP-BERTa (Yan et al., 2024) and CARTE (Kim et al., 2024) on Hugging Face.
Dataset Splits | Yes | To ensure robustness, each task was evaluated across five different random seeds for split selection. Limited-sample SFT: fine-tuning with a restricted subset of 4096 samples.
Hardware Specification | Yes | The experiments were conducted on an AWS g6.48x instance, ensuring sufficient computational resources for large-scale graph-based training.
Software Dependencies | No | The paper mentions using a pre-trained text encoder (Nussbaum et al., 2024) and states that the sentence embedding model was based on Nomic embeddings, but it does not provide specific version numbers for any software libraries or dependencies used in their implementation.
Experiment Setup | Yes | For optimization and training, we employed the AdamW optimizer with a learning rate of 3e-4 and an L2-norm regularization of 2e-4. A batch size of 256 was used for all training runs. Early stopping was applied with a patience of 10 epochs to prevent overfitting, ensuring stable convergence. No additional learning rate scheduler or gradient clipping was used. The model architecture was designed with a hidden dimension of 512, maintaining consistency between different components. The sentence embedding model was based on Nomic embeddings, truncated to 512 dimensions. The cross-attention module included 8 attention heads and a dropout rate of 0.1, allowing for effective feature extraction while preventing overfitting. SiLU was chosen as the activation function across all layers. For graph construction and sampling, we adopted a 4-layer message-passing neural network (MPNN) with 2-layer uniform sampling on temporal neighbors. The fanout was set to 20 per layer to ensure a balanced trade-off between computational efficiency and capturing structural information. Additionally, reversed edges were incorporated into the sampled subgraph to improve relational modeling.
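The reported hyperparameters can be collected into a single configuration sketch. This is illustrative only: the key names below are assumptions, not the authors' actual config schema, and the paper does not publish a config file.

```python
# Hedged sketch: hyperparameters reported in the Experiment Setup row,
# gathered into one dict. Key names are hypothetical.
GRIFFIN_SFT_CONFIG = {
    # Optimization
    "optimizer": "AdamW",
    "learning_rate": 3e-4,
    "weight_decay": 2e-4,            # L2-norm regularization
    "batch_size": 256,
    "early_stopping_patience": 10,   # epochs
    "lr_scheduler": None,            # no scheduler used
    "gradient_clipping": None,       # no clipping used

    # Architecture
    "hidden_dim": 512,
    "text_embedding": "nomic",       # Nomic embeddings, truncated
    "text_embedding_dim": 512,
    "cross_attention_heads": 8,
    "dropout": 0.1,
    "activation": "SiLU",

    # Graph construction and sampling
    "mpnn_layers": 4,
    "sampling_layers": 2,            # uniform sampling on temporal neighbors
    "fanout_per_layer": 20,
    "add_reversed_edges": True,
}

def max_sampled_neighbors(layers: int, fanout: int) -> int:
    """Upper bound on sampled nodes per seed node: fanout + fanout^2 + ... + fanout^layers."""
    return sum(fanout ** k for k in range(1, layers + 1))

# With 2-layer sampling and a fanout of 20, each seed node pulls in
# at most 20 + 400 = 420 neighbors.
```

The helper makes the cost of the stated sampling choice concrete: the 2-layer, fanout-20 setting bounds each seed node's subgraph at 420 sampled neighbors, which is what keeps the trade-off between efficiency and structural coverage tractable at the paper's 150M-node scale.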