WILTing Trees: Interpreting the Distance Between MPNN Embeddings

Authors: Masahiro Negishi, Thomas Gärtner, Pascal Welke

ICML 2025

Reproducibility assessment: each variable is listed with its result and the supporting LLM response as evidence.
Research Type: Experimental
Evidence: "Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain." Section 6, Experiments: "In this section, we confirm that our proposed d_WILT can successfully approximate d_MPNN. Then, we show that the distribution of learned edge weights of WILT is skewed towards 0, and a large part of them can be removed with L1 regularization. Finally, we investigate the WL colors that influence d_MPNN most. Due to space limitations, we report results only for a selection of MPNNs and datasets. Code is available online, and experimental settings and additional results are in Appendix E."
Researcher Affiliation: Academia
Evidence: "TU Wien, Vienna, Austria; Lancaster University Leipzig, Leipzig, Germany. Correspondence to: Masahiro Negishi <EMAIL>."
Pseudocode: Yes
Evidence: Algorithm 1, "Optimizing edge weights of WILT".
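To make the pseudocode claim concrete, here is a minimal illustrative sketch (not the paper's Algorithm 1, whose exact form is not reproduced in this report) of the kind of optimization it describes: fitting non-negative edge weights of a weighted tree distance to target MPNN distances, with an L1 penalty applied via a proximal (soft-thresholding) step so that many weights are driven to exactly 0. The least-squares objective, the distance form, and all parameter names here are assumptions for illustration only.

```python
def fit_edge_weights(feature_diffs, targets, lr=0.01, lam=0.1, epochs=500):
    """Illustrative sketch only, not the paper's Algorithm 1.

    feature_diffs[i][e]: per-edge feature difference for graph pair i
    targets[i]: target distance (e.g., an MPNN embedding distance) for pair i
    Fits weights w so that sum_e w[e] * feature_diffs[i][e] ~ targets[i],
    with an L1 penalty that sparsifies w.
    """
    n_pairs, n_edges = len(targets), len(feature_diffs[0])
    w = [1.0] * n_edges
    for _ in range(epochs):
        # gradient of the mean squared error over all pairs
        grad = [0.0] * n_edges
        for diffs, t in zip(feature_diffs, targets):
            err = sum(w[e] * diffs[e] for e in range(n_edges)) - t
            for e in range(n_edges):
                grad[e] += 2.0 * err * diffs[e]
        for e in range(n_edges):
            w[e] -= lr * grad[e] / n_pairs
            # proximal step for the L1 penalty: soft-threshold toward 0,
            # clipping to keep weights non-negative
            w[e] = max(w[e] - lr * lam, 0.0)
    return w
```

On synthetic pairs where only the first edge feature matters, the second weight is shrunk to exactly 0 by the soft-thresholding step, which mirrors the paper's observation that learned edge weights are skewed towards 0.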
Open Source Code: Yes
Evidence: "Code is available online, and experimental settings and additional results are in Appendix E." "The code to run our experiments is available at https://github.com/masahiro-negishi/wilt."
Open Datasets: Yes
Evidence: "We conduct experiments on three different datasets: Mutagenicity and ENZYMES (Morris et al., 2020), and Lipophilicity (Wu et al., 2018). We chose these datasets to represent binary classification, multiclass classification, and regression tasks, respectively." "Next, we offer additional experimental results on non-molecular datasets: IMDB-BINARY and COLLAB (obtained from Morris et al., 2020)."
Dataset Splits: Yes
Evidence: "In each setting, we split the dataset into Dtrain, Deval, and Dtest (8:1:1). We train the model for 100 epochs and record the performance on Deval after each epoch."
Hardware Specification: No
Evidence: The paper does not describe the hardware used to run its experiments (e.g., GPU models or CPU types). It mentions training models but gives no hardware specifications.
Software Dependencies: No
Evidence: The paper mentions using the Adam optimizer and training "GCNs with mean or sum pooling", but it does not specify version numbers for software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or programming languages, which are important for reproducibility.
Experiment Setup: Yes
Evidence: "For each model architecture, we vary the number of message passing layers (1, 2, 3, 4), the embedding dimensions (32, 64, 128), and the graph pooling methods (mean, sum). This results in a total of 2 × 4 × 3 × 2 = 48 different MPNNs for each dataset. In each setting, we split the dataset into Dtrain, Deval, and Dtest (8:1:1). We train the model for 100 epochs and record the performance on Deval after each epoch. We set the batch size to 32, and use the Adam optimizer with a learning rate of 10^-3. ALI_k(d_MPNN, d_func) and the performance metric (accuracy for Mutagenicity and ENZYMES, RMSE for Lipophilicity) are calculated with the model at the epoch that performed best on Deval."
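The 2 × 4 × 3 × 2 = 48 configuration grid can be enumerated directly; a minimal sketch is below. The leading factor of 2 corresponds to the two model architectures; the quoted evidence names GCNs explicitly, but the second architecture name used here is an assumption for illustration.

```python
from itertools import product

# Hyperparameter grid quoted from the paper's experiment setup.
architectures = ["GCN", "GIN"]  # assumption: two MPNN architectures; only GCN is named above
n_layers = [1, 2, 3, 4]         # message passing layers
dims = [32, 64, 128]            # embedding dimensions
poolings = ["mean", "sum"]      # graph pooling methods

# Cartesian product: 2 * 4 * 3 * 2 = 48 MPNN configurations per dataset
configs = list(product(architectures, n_layers, dims, poolings))
```

Each of the 48 configurations is then trained for 100 epochs with batch size 32 and Adam at learning rate 10^-3, per the quoted setup.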