RelGNN: Composite Message Passing for Relational Deep Learning

Authors: Tianlang Chen, Charilaos Kanatsoulis, Jure Leskovec

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. RELGNN is evaluated on all 30 real-world predictive tasks in RELBENCH (Fey et al., 2024) and achieves state-of-the-art performance on the vast majority of them, with improvements of up to 25%. RELBENCH spans seven diverse relational databases covering e-commerce, social networks, sports, and medical platforms, and its tasks are cast as entity classification, entity regression, and recommendation.
Researcher Affiliation: Academia. Computer Science Department, Stanford University. Correspondence to: Tianlang Chen <EMAIL>.
Pseudocode: No. The paper describes its message-passing mechanisms with mathematical equations (Equations 1-7) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code: No. The paper neither contains an explicit statement about releasing the source code for RELGNN nor provides a link to a code repository.
Open Datasets: Yes. RELGNN is evaluated on RELBENCH (Robinson et al., 2024), a public benchmark designed for predictive tasks over relational databases using GNNs. RELBENCH offers a diverse collection of real-world relational databases and realistic predictive tasks. The benchmark covers seven datasets, each carefully processed from real-world sources across domains such as e-commerce, social networks, medical records, Q&A platforms, and sports. The datasets vary significantly in size, with differences in the number of rows, columns, and tables, making the benchmark a challenging and comprehensive testbed for RDL model evaluation. Appendix A.1 of the paper provides descriptions and detailed statistics for each dataset.
Dataset Splits: Yes. The data is split temporally: models are trained on data from earlier time periods and tested on data from future periods. To attach target labels, each task defines a training table that links entities of interest to their target labels and timestamps via foreign keys, enabling automatic supervision from historical data while ensuring temporal consistency during training. The tasks vary significantly in the number of entities per train/validation/test split and in the proportion of test entities encountered during training. Appendix A.2 describes each task in detail, and Table 4 provides per-dataset and per-task statistics, including '#Rows of training table', '#Unique Train Entities', 'Validation Entities', and 'Test Entities'.
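The temporal split described above can be sketched in a few lines. This is a minimal illustration with hypothetical rows and cutoff dates, not the benchmark's actual code:

```python
from datetime import datetime

# Hypothetical rows of a training table: (entity_id, timestamp, label).
rows = [
    (1, datetime(2020, 1, 5), 0),
    (2, datetime(2021, 3, 1), 1),
    (3, datetime(2022, 6, 9), 0),
    (4, datetime(2023, 2, 2), 1),
]

# Temporal split: train on earlier periods, validate and test on later ones,
# so no future information leaks into training.
val_cutoff = datetime(2021, 1, 1)
test_cutoff = datetime(2022, 1, 1)

train = [r for r in rows if r[1] < val_cutoff]
val = [r for r in rows if val_cutoff <= r[1] < test_cutoff]
test = [r for r in rows if r[1] >= test_cutoff]
```

Because the split is by timestamp rather than by random shuffling, evaluation mimics deployment: predictions about the future are scored using only models fit on the past.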
Hardware Specification: No. The paper does not specify the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud computing resources).
Software Dependencies: No. The paper mentions software components such as PyTorch Frame, LightGBM, GraphSAGE, GAT, and GIN, but does not provide the specific versions of these dependencies used in the experiments.
Experiment Setup: Yes. For entity classification, the prediction head is a multi-layer perceptron (MLP) applied to the GNN-generated node embeddings, trained with binary cross-entropy loss; entity regression models are trained with L1 loss. All results are averaged over five different seeds. For recommendation, the two-tower GNN computes pairwise scores as the inner product of source and target node embeddings and is trained with the Bayesian Personalized Ranking loss (Rendle et al., 2012), while ID-GNN scores each target entity by applying an MLP to its embedding within the subgraph sampled around the source entity and is trained with binary cross-entropy loss.
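The two-tower scoring and BPR objective described above can be sketched as follows. This is a minimal NumPy illustration under assumed embedding shapes; the function name and dimensions are not taken from the paper's code:

```python
import numpy as np

def bpr_loss(src, pos, neg):
    """Bayesian Personalized Ranking loss for a two-tower scorer.

    Pairwise scores are inner products of source and target embeddings;
    the loss pushes the positive target to outscore the negative one.
    """
    pos_score = np.sum(src * pos, axis=-1)  # <source, positive target>
    neg_score = np.sum(src * neg, axis=-1)  # <source, negative target>
    # -log sigmoid(pos - neg), averaged over the batch
    return float(np.mean(np.log1p(np.exp(-(pos_score - neg_score)))))

rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8))
loss_aligned = bpr_loss(src, src, -src)  # positives aligned with source
loss_flipped = bpr_loss(src, -src, src)  # positives anti-aligned
```

Aligned positive pairs yield a lower loss than anti-aligned ones, which is exactly the ranking behavior the objective is meant to enforce.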