Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

A Recipe for Causal Graph Regression: Confounding Effects Revisited

Authors: Yujia Yin, Tianyi Qu, Zihao Wang, Yifan Chen

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments on graph OOD benchmarks validate the efficacy of our proposals for CGR." Section 5 (Experiments): "In this section, we evaluate the prediction performance and OOD generalization ability of our method. We comprehensively compare our method with existing models to demonstrate the superior generalization ability of our method on regression tasks. We briefly introduce the dataset, baselines, and experimental settings here." Table 1: "OOD generalization performance on GOOD-ZINC dataset, with boldface being the best and underline being the runner-up." Figure 3: "Ablation study on confounder predictive power (left) and causal intervention methods (right) for OOD generalization on GOOD-Motif."
Researcher Affiliation | Collaboration | "1Hong Kong Baptist University 2SF Tech 3Zhejiang University 4Hong Kong University of Science and Technology. Correspondence to: Tianyi Qu <EMAIL>, Yifan Chen <EMAIL>."
Pseudocode | No | The paper describes methods using text and mathematical formulations (e.g., Section 4, Equations 1-17), but it does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "The model implementation and the code are provided on https://github.com/causal-graph/CGR."
Open Datasets | Yes | "GOOD-ZINC. GOOD-ZINC is a regression task in the GOOD benchmark (Gui et al., 2022), which aims to test the out-of-distribution performance of real-world molecular property regression datasets from the ZINC database (Gómez-Bombarelli et al., 2018). ReactionOOD S-OOD. In addition to the GOOD benchmark, we also used three S-OOD datasets in the ReactionOOD benchmark (Wang et al., 2023), namely Cycloaddition (Stuyver et al., 2023), E2&SN2 (von Rudorff et al., 2020), and RDB7 (Spiekermann et al., 2022)."
Dataset Splits | Yes | Table 4 presents the number of graphs/nodes in different dataset splits for the GOOD-ZINC dataset. The dataset is analyzed under three types of distribution shifts: covariate, concept, and no shift. Each row reports the number of graphs/nodes in the training, in-distribution (ID) validation, ID test, out-of-distribution (OOD) validation, and OOD test sets. The no-shift scenario serves as a baseline with no distributional difference between training and test sets.
Hardware Specification | No | No specific hardware details (GPU/CPU models, processors, or memory amounts) are mentioned in the main text or supplementary materials. The paper discusses model architecture and training parameters, but not the underlying hardware.
Software Dependencies | No | The paper mentions using "a three-layer GIN as the backbone model" in Section A.4 (Experimental Settings), but it does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup | Yes | "We use a three-layer GIN as the backbone model, with 300 hidden dimensions, which is consistently applied in both OURS and baseline models. The model is trained for 300 epochs, with the learning rate adjusted using the cosine annealing strategy. The initial learning rate is set to 0.001, with a minimum value of 1e-8. For the OURS model, all tunable hyperparameters in the loss function L are set to 0.5."
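The quoted setup gives the schedule's endpoints (initial rate 0.001, minimum 1e-8, 300 epochs) but not its formula. A minimal sketch, assuming the standard cosine annealing rule (the same one implemented by PyTorch's `CosineAnnealingLR`; the function name and per-epoch stepping are illustrative, not from the paper):

```python
import math

def cosine_annealed_lr(epoch: int, total_epochs: int = 300,
                       lr_max: float = 1e-3, lr_min: float = 1e-8) -> float:
    """Standard cosine annealing: decay lr_max -> lr_min over total_epochs.

    lr(t) = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * t / T))
    """
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total_epochs))

# Endpoints match the reported settings:
# epoch 0   -> 0.001 (initial learning rate)
# epoch 300 -> 1e-8  (minimum learning rate)
```

Under these assumptions the schedule starts at exactly 0.001, passes roughly the midpoint of the range at epoch 150, and bottoms out at 1e-8 after epoch 300.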