Position: Graph Learning Will Lose Relevance Due To Poor Benchmarks

Authors: Maya Bechler-Speicher, Ben Finkelshtein, Fabrizio Frasca, Luis Müller, Jan Tönshoff, Antoine Siraudin, Viktor Zaverkin, Michael M. Bronstein, Mathias Niepert, Bryan Perozzi, Mikhail Galkin, Christopher Morris

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Here, we support the claims made in the previous four sections with empirical evidence. Recently, Deac et al. (2022) and Wilson et al. (2024) proposed a message-passing scheme in which, during every odd layer, the original graph is disregarded in favor of propagating information through a fixed-structure expander graph, specifically a Cayley graph. Ablation studies in Wilson et al. (2024) on multiple TUDataset benchmarks showed that using the Cayley graph exclusively, without incorporating the original graph at any layer, sometimes improved performance. This finding is striking, as the Cayley graph does not inherently encode task-relevant information. These results align with the observations of Bechler-Speicher et al. (2024), who showed that making graphs more regular consistently improved performance. To further substantiate these findings, we replicate these experiments on the OGB graph-level benchmarks, strengthening the evidence for these observations.
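The alternating scheme described above can be sketched in a few lines. This is a minimal, hypothetical illustration (plain NumPy, dense adjacency matrices, randomly chosen layer widths), not the authors' implementation: even layers propagate over the original graph, odd layers over a fixed expander adjacency standing in for the Cayley graph.

```python
import numpy as np

def normalize(adj):
    """Symmetrically normalize an adjacency matrix with added self-loops."""
    a = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]

def alternating_mp(x, adj_graph, adj_expander, weights):
    """Hypothetical sketch of the alternating scheme: propagate over the
    expander adjacency on odd layers, over the original graph otherwise."""
    h = x
    for layer, w in enumerate(weights):
        a = adj_expander if layer % 2 == 1 else adj_graph
        h = np.maximum(normalize(a) @ h @ w, 0.0)  # ReLU(A_hat H W)
    return h
```

Using `adj_expander` at every layer instead of alternating would reproduce the "Cayley graph only" ablation condition discussed in the text.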
Researcher Affiliation Collaboration 1Meta, 2University of Oxford, 3Technion – Israel Institute of Technology, 4RWTH Aachen University, 5NEC Laboratories Europe, 6AITHYRA, 7University of Stuttgart, 8Google Research. Correspondence to: Maya Bechler-Speicher <EMAIL>, Luis Müller <EMAIL>.
Pseudocode No The paper describes methodologies and architectures in natural language and mathematical equations (e.g., Section B.3.1 MODEL ARCHITECTURES) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured, code-like procedural steps.
Open Source Code Yes Our code is available at https://github.com/benfinkelshtein/PP-Benchmarks.
Open Datasets Yes For instance, popular benchmarks frequently feature two-dimensional molecular graphs (Hu et al., 2020a; Morris et al., 2020), neglecting critical three-dimensional geometric structures. Additionally, many studies report state-of-the-art results on (synthetic) datasets like ZINC (Dwivedi et al., 2022b), which lack sufficient (real-world) justification for their graph-based approach, further complicating their utility.
Dataset Splits Yes While older publications typically used stratified 10-fold cross-validation as an evaluation protocol, newer results are often based on repeated random 80/10/10 splits, which tend to be noisier. This difference explains the performance variance to some degree but does not account for the sharp drop in the reported accuracy of more recent publications. Instead, many recent works seem to run experiments with suboptimal hyperparameter choices, resulting in a significant loss in performance for the compared models. For example, Barbero et al. (2024) configure the training to last only 100 epochs, which is too short to allow for model convergence on a dataset as small as ENZYMES. ... Since the original validation split of PCQM4MV2 is used to compare models in the literature, we create a separate holdout set by sampling 10K graphs uniformly at random from the training data and use this set for model selection and hyperparameter tuning.
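The holdout construction described above amounts to an index-level split. A minimal sketch, assuming the dataset is addressed by integer indices (the function name and seed are illustrative, not from the paper):

```python
import numpy as np

def make_holdout(num_train_graphs, holdout_size, seed=0):
    """Sample a holdout set uniformly at random from the training indices,
    leaving the official validation split untouched for literature
    comparison (hypothetical sketch)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_train_graphs)
    holdout_idx = perm[:holdout_size]   # e.g. 10K graphs for model selection
    train_idx = perm[holdout_size:]     # remaining graphs for training
    return train_idx, holdout_idx
```

Keeping the official validation split out of the tuning loop, as the quoted passage argues, avoids selecting hyperparameters on the very set used to rank models in the literature.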
Hardware Specification Yes Each training run uses a single Nvidia H100 GPU and lasts approximately 8 hours. In total, hyperparameter tuning consumed less than 200 H100 hours of computing.
Software Dependencies No The paper describes the model architectures and experimental procedures in detail, but it does not specify any particular software libraries or tools with their version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be necessary for replication.
Experiment Setup Yes For each model among GraphConv, GIN, and GAT, we tuned the learning rate in {10⁻³, 5·10⁻³}, the number of layers in {3, 5}, dropout in {0, 0.3}, the hidden dimension in {32, 64}, and the batch size in {16, 32}, used early stopping with a patience of 50 steps on the validation loss, and applied sum pooling. We used ReLU activations and the cross-entropy loss.
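The search space quoted above is a full Cartesian grid. A short sketch of how such a grid can be enumerated (the dictionary keys are illustrative names, not identifiers from the released code):

```python
from itertools import product

# Hypothetical grid mirroring the reported search space.
GRID = {
    "lr": [1e-3, 5e-3],
    "num_layers": [3, 5],
    "dropout": [0.0, 0.3],
    "hidden_dim": [32, 64],
    "batch_size": [16, 32],
}

def configurations(grid):
    """Yield every hyperparameter combination in the grid as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

With two choices per knob, this yields 2⁵ = 32 configurations per model, each then trained with early stopping as described.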