Where Did the Gap Go? Reassessing the Long-Range Graph Benchmark

Authors: Jan Tönshoff, Martin Ritzert, Eran Rosenbluth, Martin Grohe

TMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we carefully reevaluate multiple MPGNN baselines as well as the Graph Transformer GPS (Rampášek et al., 2022) on LRGB. Through a rigorous empirical analysis, we demonstrate that the reported performance gap is overestimated due to suboptimal hyperparameter choices. It is noteworthy that across multiple datasets the performance gap completely vanishes after basic hyperparameter optimization. In addition, we discuss the impact of lacking feature normalization for LRGB's vision datasets and highlight a spurious implementation of LRGB's link prediction metric. The principal aim of our paper is to establish a higher standard of empirical rigor within the graph machine learning community.
Researcher Affiliation | Academia | Jan Tönshoff (toenshoff@informatik.rwth-aachen.de), RWTH Aachen University; Martin Ritzert (EMAIL), Georg-August-Universität Göttingen; Eran Rosenbluth (EMAIL), RWTH Aachen University; Martin Grohe (EMAIL), RWTH Aachen University
Pseudocode | No | The paper describes its methods and experiments in detail but contains no clearly labeled pseudocode or algorithm blocks; all procedural descriptions are given in paragraph form.
Open Source Code | Yes | Our contribution is three-fold: First, we show that the three MPGNN baselines GCN, GINE, and GatedGCN all profit massively from further hyperparameter tuning, reducing and even closing the gap to graph transformers on multiple datasets. ... Source code: https://github.com/toenshoff/LRGB
Open Datasets | Yes | The recent Long-Range Graph Benchmark (LRGB, Dwivedi et al. 2022) introduced a set of graph learning tasks strongly dependent on long-range interaction between vertices. ... The Long-Range Graph Benchmark (LRGB) has been introduced by Dwivedi et al. (2022) as a collection of five datasets: Peptides-func and Peptides-struct are graph-level classification and regression tasks, respectively. ... PascalVOC-SP and COCO-SP model semantic image segmentation as a node-classification task on superpixel graphs. PCQM-Contact is a link prediction task on molecular graphs.
Dataset Splits | Yes | Table 1a provides the results obtained on the test splits of Peptides-func and Peptides-struct. ... For the final evaluation runs we average results across four different random seeds as specified by the LRGB dataset.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, or memory) used to run the experiments. It focuses on the experimental setup and results but omits hardware specifications.
Software Dependencies | No | The paper mentions using an "AdamW optimizer" and "GELU (Hendrycks & Gimpel, 2016) as our default activation function" but does not specify version numbers for programming languages, machine learning frameworks (e.g., PyTorch, TensorFlow), or other key software libraries.
Experiment Setup | Yes | We tune the main hyperparameters (such as depth, dropout rate, ...) in pre-defined ranges while strictly adhering to the official 500k parameter budget. The exact hyperparameter ranges and all final configurations are provided in Appendix A.1. In particular, we looked at networks with 6 to 10 layers, varied the number of layers in the prediction head from 1 to 3 (which turned out to be very relevant), and also considered the dropout and learning rate of the network. ... Overall, we tried to incorporate the most important hyperparameters, which we selected to be dropout, model depth, prediction head depth, learning rate, and the used positional or structural encoding. For GPS we additionally evaluated the internal MPGNN (but only between GCN and GatedGCN) and whether to use BatchNorm or LayerNorm. Thus, our hyperparameters and ranges were as follows: dropout [0, 0.1, 0.2], default 0.1; depth [6, 8, 10], default 8. ... learning rate [0.001, 0.0005, 0.0001], default 0.001; head depth [1, 2, 3], default 2; encoding [none, LapPE, RWSE], default none; internal MPGNN [GCN, GatedGCN], default GatedGCN (only for GPS); normalization [BatchNorm, LayerNorm], default BatchNorm (only for GPS).
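The hyperparameter ranges quoted in the Experiment Setup row can be sketched as a Cartesian product. This is an illustrative reconstruction only: the dictionary keys are hypothetical names, not identifiers from the paper's code, and the full-product sizes computed below are an upper bound on the search (the paper's actual tuning protocol and final configurations are in its Appendix A.1).

```python
from itertools import product

# Hyperparameter ranges as reported in the paper (key names are illustrative).
GRID = {
    "dropout": [0.0, 0.1, 0.2],           # default 0.1
    "depth": [6, 8, 10],                  # default 8
    "learning_rate": [1e-3, 5e-4, 1e-4],  # default 1e-3
    "head_depth": [1, 2, 3],              # default 2
    "encoding": ["none", "LapPE", "RWSE"],  # default "none"
}

# Two extra axes tuned only for the GPS graph transformer.
GPS_EXTRA = {
    "internal_mpgnn": ["GCN", "GatedGCN"],        # default "GatedGCN"
    "normalization": ["BatchNorm", "LayerNorm"],  # default "BatchNorm"
}

def configurations(grid):
    """Yield every configuration in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Full-product sizes; the paper does not claim an exhaustive grid search,
# so these are upper bounds on the number of distinct runs per model.
n_mpgnn = sum(1 for _ in configurations(GRID))                 # 3^5 = 243
n_gps = sum(1 for _ in configurations({**GRID, **GPS_EXTRA}))  # 243 * 4 = 972
```

Enumerating the grid this way makes the 500k-parameter budget check easy to bolt on: each yielded configuration can be instantiated, its parameter count measured, and over-budget settings skipped before training.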