GETS: Ensemble Temperature Scaling for Calibration in Graph Neural Networks

Authors: Dingyi Zhuang, Chonghe Jiang, Yunhan Zheng, Shenhao Wang, Jinhua Zhao

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our method outperforms state-of-the-art calibration techniques, reducing expected calibration error (ECE) by 25% across 10 GNN benchmark datasets. Additionally, GETS is computationally efficient, scalable, and capable of selecting effective input combinations for improved calibration performance. The implementation is available at https://github.com/ZhuangDingyi/GETS/. Evidence sections: 5 Experiments; 5.1 Experimental Setup; 5.2 Confidence Calibration Evaluation; 5.3 Time Complexity; 5.4 Expert Selection; 5.5 Ablation Studies.
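For reference on the ECE metric cited in the row above, here is a minimal NumPy sketch of expected calibration error using equal-width confidence bins. The binning scheme and bin count are common defaults, not taken from the paper, and `expected_calibration_error` is a name chosen here for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE: bin predictions by confidence, then take the weighted average of
    |empirical accuracy - mean confidence| over the bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            weight = in_bin.mean()  # fraction of samples in this bin
            ece += weight * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece

# Toy example: two bins, each slightly miscalibrated.
conf = np.array([0.9, 0.9, 0.6, 0.6])
pred = np.array([1, 1, 0, 0])
true = np.array([1, 1, 0, 1])
print(expected_calibration_error(conf, pred, true))  # ≈ 0.1
```

A perfectly calibrated model (confidence equal to empirical accuracy in every bin) yields an ECE of 0, which is the quantity the paper reports reducing by 25%.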
Researcher Affiliation Academia Dingyi Zhuang, Massachusetts Institute of Technology; Chonghe Jiang, The Chinese University of Hong Kong; Yunhan Zheng, Singapore-MIT Alliance for Research and Technology (SMART); Shenhao Wang, University of Florida; Jinhua Zhao, Massachusetts Institute of Technology
Pseudocode No The paper describes methods and equations but does not include any explicitly labeled pseudocode blocks or algorithms in a structured format.
Open Source Code Yes The implementation is available at https://github.com/ZhuangDingyi/GETS/.
Open Datasets Yes We include the 10 commonly used graph classification networks for a thorough evaluation. The data summary is given in Table 1; refer to Appendix A.2 for their sources. ... We evaluated our method on several widely used benchmark datasets, all accessible via the Deep Graph Library (DGL). These datasets encompass a variety of graph types and complexities, allowing us to assess the robustness and generalizability of our calibration approach. ... Citation Networks (Cora, Citeseer, Pubmed, Cora-Full): In these datasets (Sen et al., 2008; McCallum et al., 2000; Giles et al., 1998)
Dataset Splits Yes The train-val-test split is 20-10-70 (Hsu et al., 2022; Tang et al., 2024); note that uncertainty calibration models are trained on the validation set, which is also referred to as the calibration set. We randomly generate 10 different splits of training, validation, and testing inputs and run the models 10 times on different splits.
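The 20-10-70 random-split protocol quoted above can be sketched as follows; this is an illustrative reconstruction, not the authors' code, and the function name and seed handling are assumptions.

```python
import numpy as np

def random_split(num_nodes, train_frac=0.2, val_frac=0.1, seed=0):
    """Random 20-10-70 train/val/test node split. The validation set doubles
    as the calibration set on which the calibration model is trained."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_nodes)
    n_train = int(train_frac * num_nodes)
    n_val = int(val_frac * num_nodes)
    return (perm[:n_train],
            perm[n_train:n_train + n_val],
            perm[n_train + n_val:])

# 10 independent splits, as in the paper's evaluation protocol
# (2708 is the node count of Cora, used here only as an example).
splits = [random_split(2708, seed=s) for s in range(10)]
```

Running the model once per split and averaging over the 10 runs gives the reported results.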
Hardware Specification Yes All our experiments are implemented on a machine with Ubuntu 22.04, with 2 AMD EPYC 9754 128-Core Processors, 1TB RAM, and 10 NVIDIA L40S 48GB GPUs.
Software Dependencies No The paper mentions using 'torch.nn.Embedding' and 'Deep Graph Library (DGL)', but does not specify version numbers for these or other software components like Python, PyTorch, or CUDA.
Experiment Setup Yes For the base GNN classification model (i.e., the uncalibrated model), we follow the architecture and parameter setup outlined by Kipf & Welling (2016); Veličković et al. (2017); Xu et al. (2018), with modifications to achieve optimal performance. Specifically, we use a two-layer GCN, GAT, or GIN model and tune the hidden dimension from the set {16, 32, 64}. We experiment with dropout rates ranging from 0.5 to 1, and we do not apply any additional normalization. During training, we use a learning rate of 1e-2. We tune the weight decay parameter to prevent overfitting and consider adding early stopping with patience of 50 epochs. The model is trained for a maximum of 200 epochs to ensure convergence. The specifics are summarized in Table 4. ... For all experiments, the pre-trained GNN classifiers are frozen, and the predicted logits z from the validation set are fed into our calibration model as inputs. ... Table 5: Summary of GETS Parameters Across Datasets (Hidden Dim, Dropout, Num Layers, Learning Rate, Weight Decay are specified for each dataset).
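The early-stopping regime described above (patience of 50 epochs, maximum of 200) can be sketched framework-agnostically; the function name is hypothetical and `step_fn` stands in for one epoch of training plus validation, which the paper does not spell out at this level.

```python
def train_with_early_stopping(step_fn, max_epochs=200, patience=50):
    """Run up to max_epochs; stop once validation loss has not improved
    for `patience` consecutive epochs.

    step_fn(epoch) -> validation loss after training for that epoch.
    Returns (best validation loss, epoch at which it occurred)."""
    best, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        val_loss = step_fn(epoch)
        if val_loss < best:
            best, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break  # patience exhausted
    return best, best_epoch
```

In the paper's setup this loop would wrap each (hidden dimension, dropout, weight decay) configuration drawn from the grid described above, with the learning rate fixed at 1e-2.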