Solving the Tree Containment Problem Using Graph Neural Networks
Authors: Arkadiy Dushatskiy, Esther Julien, Leen Stougie, Leo van Iersel
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose to solve it approximately using Graph Neural Networks. Our algorithm demonstrates an accuracy of over 95% in solving the tree containment problem on instances with up to 100 leaves. Experiments: Our datasets consist of network-tree pairs (N, T) along with corresponding labels indicating tree containment... We compare the performance of Combine-GNN with the aforementioned baselines. Furthermore, we study the scalability of Combine-GNN in terms of the training dataset size and increasing size of the test instances (in the inductive learning setup). Finally, we specifically study how the different design choices of the GNN used in our algorithm affect the performance. The results of statistical significance testing are provided in Appendix, Table 3. |
| Researcher Affiliation | Academia | Arkadiy Dushatskiy EMAIL, Centrum Wiskunde & Informatica, Amsterdam, the Netherlands; Esther Julien EMAIL, Delft University of Technology, Delft, the Netherlands; Leen Stougie EMAIL, Centrum Wiskunde & Informatica and Vrije Universiteit Amsterdam, Amsterdam, the Netherlands; Leo van Iersel EMAIL, Delft University of Technology, Delft, the Netherlands |
| Pseudocode | No | The paper describes the architecture of the GNN using mathematical notation for message passing, concatenation, and readout operations, but it does not provide a clearly labeled 'Pseudocode' or 'Algorithm' block with structured, step-by-step instructions. For example, it provides: 'm_i^{k,←} = AGGREGATE^{k,←}({(x_j^{k-1}, x_i^{k-1}) : (j, i) ∈ E}); m_i^{k,→} = AGGREGATE^{k,→}({(x_j^{k-1}, x_i^{k-1}) : (i, j) ∈ E}); x_i^k = COMBINE^k(x_i^{k-1}, m_i^{k,←}, m_i^{k,→})' and 'x_i^{final} = CONCATENATE([x_i^0, x_i^1, ..., x_i^L])'. |
| Open Source Code | Yes | The code and used tree containment instances are available at https://github.com/ArkadiyD/PhyloGNN. |
| Open Datasets | Yes | The code and used tree containment instances are available at https://github.com/ArkadiyD/PhyloGNN. For the real-world experiments, we use the dataset of gene trees from bacterial and archaeal genomes (proposed in Beiko (2011), binarized in van Iersel et al. (2022)). |
| Dataset Splits | Yes | In each setup (defined by the number of leaves in the training and validation/test instances), we have 10000 training samples (network-tree pairs), 1000 samples used as validation dataset, and 1000 for the test dataset. |
| Hardware Specification | Yes | We use a system with an Intel(R) Xeon(R) Silver 4110 CPU and Nvidia GeForce RTX 2080 Ti GPUs (for this experiment, only one GPU is utilized). |
| Software Dependencies | No | The paper mentions several algorithms and optimizers used, such as the 'AdamW optimizer (Loshchilov & Hutter, 2017)', 'XGBoost model (Chen & Guestrin, 2016)', 'GCN (Kipf & Welling, 2017)', 'GAT (Veličković et al., 2018)', and 'GIN (Xu et al., 2019)'. However, it does not specify the version numbers of any software libraries or frameworks used for their implementation (e.g., PyTorch, TensorFlow, scikit-learn versions). |
| Experiment Setup | Yes | We train GNNs using the AdamW optimizer (Loshchilov & Hutter, 2017) with the cosine annealing learning rate schedule; the number of training steps is 5000; batch size is 200 (similar parameters were used in Rossi et al. (2023)). We performed a grid search to find the best performing architecture and the corresponding training hyperparameters for Combine-GNN. The tuned hyperparameters and their ranges are the following (the best values are underlined): initial learning rate: {10^-4, 10^-3, 10^-2}; weight decay: {0, 10^-4, 10^-3}; dropout: {0.0, 0.2}; number of GNN layers: {3, 5}; size of node embeddings: {32, 64, 128}; type of GNN operation: {GCN, GIN, GAT}; readout operation: {mean, max, sum}. |
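The message-passing scheme quoted in the Pseudocode row can be sketched in plain numpy. This is a minimal illustrative sketch, not the authors' implementation: sum-aggregation and a random linear map with ReLU stand in for the paper's learned AGGREGATE and COMBINE operators, and the layer width is assumed fixed.

```python
import numpy as np

def gnn_layer(x, edges, rng):
    """One bidirectional message-passing step: each node aggregates
    messages over incoming and outgoing edges separately, then combines
    them with its previous embedding. Sum-aggregation and a random
    linear COMBINE are illustrative assumptions."""
    n, d = x.shape
    m_in = np.zeros((n, d))   # m^{k,<-}: messages along edges (j, i) in E
    m_out = np.zeros((n, d))  # m^{k,->}: messages along edges (i, j) in E
    for (j, i) in edges:
        m_in[i] += x[j]       # AGGREGATE over incoming edges of i
        m_out[j] += x[i]      # AGGREGATE over outgoing edges of j
    W = rng.standard_normal((3 * d, d)) / np.sqrt(3 * d)
    # COMBINE: concatenate [x, m_in, m_out], apply a linear map + ReLU
    return np.maximum(np.concatenate([x, m_in, m_out], axis=1) @ W, 0.0)

def forward(x0, edges, num_layers=3, seed=0):
    """Stack layers and concatenate all intermediate embeddings,
    mirroring x_i^final = CONCATENATE([x_i^0, ..., x_i^L])."""
    rng = np.random.default_rng(seed)
    xs = [x0]
    for _ in range(num_layers):
        xs.append(gnn_layer(xs[-1], edges, rng))
    return np.concatenate(xs, axis=1)  # per-node final embedding
```

Keeping the incoming and outgoing aggregations separate is what lets the model exploit edge directions in the rooted network and tree, which an undirected scheme would discard.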
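The hyperparameter grid quoted in the Experiment Setup row can be enumerated exhaustively with the standard library. The grid values below are taken from the quoted passage; `train_and_validate` is a hypothetical stand-in for the actual training-and-validation pipeline, which the paper does not specify at this level of detail.

```python
from itertools import product

# Tuned hyperparameters and ranges as quoted from the paper.
GRID = {
    "lr": [1e-4, 1e-3, 1e-2],
    "weight_decay": [0.0, 1e-4, 1e-3],
    "dropout": [0.0, 0.2],
    "num_layers": [3, 5],
    "embedding_size": [32, 64, 128],
    "gnn_op": ["GCN", "GIN", "GAT"],
    "readout": ["mean", "max", "sum"],
}

def grid_search(train_and_validate):
    """Evaluate every configuration in the grid and return the one with
    the highest validation score (ties broken by iteration order).
    `train_and_validate` maps a config dict to a scalar score."""
    best_cfg, best_score = None, float("-inf")
    for values in product(*GRID.values()):
        cfg = dict(zip(GRID.keys(), values))
        score = train_and_validate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score
```

With these ranges the grid contains 3 x 3 x 2 x 2 x 3 x 3 x 3 = 972 configurations, so a full sweep is feasible but each configuration must be cheap to train.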