WILTing Trees: Interpreting the Distance Between MPNN Embeddings

Authors: Masahiro Negishi, Thomas Gärtner, Pascal Welke

ICML 2025

Reproducibility assessment: each variable is listed with its result and the supporting LLM response as evidence.
Research Type: Experimental
Evidence: "Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain." Section 6, Experiments: "In this section, we confirm that our proposed d_WILT can successfully approximate d_MPNN. Then, we show that the distribution of learned edge weights of WILT is skewed towards 0, and a large part of them can be removed with L1 regularization. Finally, we investigate the WL colors that influence d_MPNN most. Due to space limitations, we report results only for a selection of MPNNs and datasets. Code is available online, and experimental settings and additional results are in Appendix E."
Researcher Affiliation: Academia
Evidence: "TU Wien, Vienna, Austria; Lancaster University Leipzig, Leipzig, Germany. Correspondence to: Masahiro Negishi <EMAIL>."
Pseudocode: Yes
Evidence: Algorithm 1, "Optimizing edge weights of WILT".
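To make the pseudocode claim concrete, here is a minimal illustrative sketch (not the paper's Algorithm 1, whose exact form is not reproduced in this report) of the kind of optimization it describes: fitting non-negative edge weights of a weighted tree distance to target MPNN distances, with an L1 penalty applied via a proximal (soft-thresholding) step so that many weights are driven to exactly 0. The least-squares objective, the distance form, and all parameter names here are assumptions for illustration only.

```python
def fit_edge_weights(feature_diffs, targets, lr=0.01, lam=0.1, epochs=500):
    """Illustrative sketch only, not the paper's Algorithm 1.

    feature_diffs[i][e]: per-edge feature difference for graph pair i
    targets[i]: target distance (e.g., an MPNN embedding distance) for pair i
    Fits weights w so that sum_e w[e] * feature_diffs[i][e] ~ targets[i],
    with an L1 penalty that sparsifies w.
    """
    n_pairs, n_edges = len(targets), len(feature_diffs[0])
    w = [1.0] * n_edges
    for _ in range(epochs):
        # gradient of the mean squared error over all pairs
        grad = [0.0] * n_edges
        for diffs, t in zip(feature_diffs, targets):
            err = sum(w[e] * diffs[e] for e in range(n_edges)) - t
            for e in range(n_edges):
                grad[e] += 2.0 * err * diffs[e]
        for e in range(n_edges):
            w[e] -= lr * grad[e] / n_pairs
            # proximal step for the L1 penalty: soft-threshold toward 0,
            # clipping to keep weights non-negative
            w[e] = max(w[e] - lr * lam, 0.0)
    return w
```

On synthetic pairs where only the first edge feature matters, the second weight is shrunk to exactly 0 by the soft-thresholding step, which mirrors the paper's observation that learned edge weights are skewed towards 0.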
Open Source Code: Yes
Evidence: "Code is available online, and experimental settings and additional results are in Appendix E." "The code to run our experiments is available at https://github.com/masahiro-negishi/wilt."
Open Datasets: Yes
Evidence: "We conduct experiments on three different datasets: Mutagenicity and ENZYMES (Morris et al., 2020), and Lipophilicity (Wu et al., 2018). We chose these datasets to represent binary classification, multiclass classification, and regression tasks, respectively." "Next, we offer additional experimental results on non-molecular datasets: IMDB-BINARY and COLLAB (obtained from Morris et al., 2020)."
Dataset Splits: Yes
Evidence: "In each setting, we split the dataset into Dtrain, Deval, and Dtest (8:1:1). We train the model for 100 epochs and record the performance on Deval after each epoch."
Hardware Specification: No
Evidence: The paper does not describe the hardware used to run its experiments (e.g., GPU models or CPU types). It mentions training models but gives no hardware specifications.
Software Dependencies: No
Evidence: The paper mentions using the Adam optimizer and training "GCNs with mean or sum pooling", but it does not specify version numbers for software libraries (e.g., PyTorch, TensorFlow, scikit-learn) or programming languages, which are important for reproducibility.
Experiment Setup: Yes
Evidence: "For each model architecture, we vary the number of message passing layers (1, 2, 3, 4), the embedding dimensions (32, 64, 128), and the graph pooling methods (mean, sum). This results in a total of 2 × 4 × 3 × 2 = 48 different MPNNs for each dataset. In each setting, we split the dataset into Dtrain, Deval, and Dtest (8:1:1). We train the model for 100 epochs and record the performance on Deval after each epoch. We set the batch size to 32, and use the Adam optimizer with a learning rate of 10^-3. ALI_k(d_MPNN, d_func) and the performance metric (accuracy for Mutagenicity and ENZYMES, RMSE for Lipophilicity) are calculated with the model at the epoch that performed best on Deval."
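The 2 × 4 × 3 × 2 = 48 configuration grid can be enumerated directly; a minimal sketch is below. The leading factor of 2 corresponds to the two model architectures; the quoted evidence names GCNs explicitly, but the second architecture name used here is an assumption for illustration.

```python
from itertools import product

# Hyperparameter grid quoted from the paper's experiment setup.
architectures = ["GCN", "GIN"]  # assumption: two MPNN architectures; only GCN is named above
n_layers = [1, 2, 3, 4]         # message passing layers
dims = [32, 64, 128]            # embedding dimensions
poolings = ["mean", "sum"]      # graph pooling methods

# Cartesian product: 2 * 4 * 3 * 2 = 48 MPNN configurations per dataset
configs = list(product(architectures, n_layers, dims, poolings))
```

Each of the 48 configurations is then trained for 100 epochs with batch size 32 and Adam at learning rate 10^-3, per the quoted setup.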