Learning Condensed Graph via Differentiable Atom Mapping for Reaction Yield Prediction

Authors: Ankit Ghosh, Gargee Kashyap, Sarthak Mittal, Nupur Jain, Raghavan B Sunoj, Abir De

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments show that YIELDNET can predict the yield more accurately than the baselines. Furthermore, the model is trained only under the distant supervision of yield values, without requiring fine-grained supervision of atom mapping. (...) Our experimental evaluation across multiple datasets shows that YIELDNET is able to outperform several baselines by a significant margin.
Researcher Affiliation Academia (1) Department of Chemistry, IIT Bombay; (2) Department of Computer Science and Engineering, IIT Bombay. Correspondence to: Ankit Ghosh <EMAIL>.
Pseudocode No The paper describes the model architecture and mathematical formulations for GNNθ (Section C.1) and Input Differentiable GNNψ (Section C.2) using equations (22)-(35) and (36)-(45) respectively, but these are descriptions of computational steps rather than clearly structured pseudocode or algorithm blocks.
Open Source Code Yes Our code is available in https://github.com/ankitthreo/YieldNet.git.
Open Datasets Yes We carry out our experiments using eight datasets. They include (1) GP dataset, which is derived from Gas-phase Isomerization reactions (Grambow et al., 2020b); five datasets derived from the catalytic asymmetric N, S-acetal formation reaction (Zahrt et al., 2019), viz., (2) NS1, (3) NS2, (4) NS3, (5) NS4, (6) NS5; one dataset on (7) the Suzuki coupling reaction (SC); and another dataset on (8) Deoxyfluorination (Nielsen et al., 2018) (DF). (...) The Gas-Phase reaction datasets (Grambow et al., 2020b) are licensed under CC-BY 4.0. The USPTO dataset used here comes under the MIT License. The original USPTO dataset (Schwaller et al., 2021b) by Lowe (2017) comes under the CC0 1.0 License. The NS (Zahrt et al., 2019) and DF (Nielsen et al., 2018) datasets used here are available in (Singh & Sunoj, 2022).
Dataset Splits Yes We partitioned the datasets into 70% training, 10% validation, and 20% test folds. We generated ten random splits using different random seeds.
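The splitting protocol quoted above (70% train / 10% validation / 20% test, repeated over ten random seeds) can be sketched as follows; this is a minimal illustration, not the paper's implementation, and the function name and dataset size are hypothetical:

```python
import random

def make_splits(n_samples, seed, frac_train=0.7, frac_val=0.1):
    """Shuffle sample indices with a fixed seed and carve out
    70/10/20 train/validation/test folds (remainder goes to test)."""
    rng = random.Random(seed)
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_train = round(frac_train * n_samples)
    n_val = round(frac_val * n_samples)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test

# Ten random splits from ten different seeds, as in the protocol above.
splits = [make_splits(1000, seed) for seed in range(10)]
```

Each seed yields disjoint folds that together cover the full dataset, so test performance can be averaged across the ten splits.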
Hardware Specification Yes All the models are trained on an NVIDIA A100 80GB GPU. All the models are fully based on PyTorch (Paszke et al., 2019). We run on an Ubuntu 20.04.6 LTS machine with 2TB RAM, a 64-bit CPU, and an AMD EPYC 7742 64-Core Processor.
Software Dependencies No The paper mentions software such as PyTorch (Paszke et al., 2019) and NetworkX (Hagberg et al., 2008), and the operating system Ubuntu 20.04.6 LTS, but does not provide specific version numbers for the key software libraries and dependencies used to implement the methodology, other than the OS version.
Experiment Setup Yes We kept the batch size b the same across all models: b = 50 for the GP datasets and b = 8 for the rest. We train each model for 100 epochs and evaluate performance on the test dataset, selecting the epoch with the lowest validation MAE. We use the Adam optimizer for each model, with a Noam learning rate schedule with 2 warmup epochs, initial and final learning rates of 10^-4, and a maximum learning rate of 10^-3. Additionally, we keep the regularizer parameter ρ fixed at a value of 0.1 in our model. (...) 1. GNNθ. We set the dimension of node embeddings d = 20 (Eq. 30) and the dimension of edge embeddings D = 20 (Eq. 31). 2. Align. We set the temperature λ = 0.1 in Eq. (6), the Gumbel noise factor to 1.0, and the number of Sinkhorn iterations to T = 10. 3. Input Differentiable GNNψ. It shares parameters with GNNθ except ψ0 and ψ4 (Appendix C). We set dH = 20n, where n is the number of steps. 4. Transformerϕ1. Since dH = 20n, dH/n = 20 is the input and output dimension of Transformerϕ1. We set the number of heads to 5 and the feedforward dimension to 2048.
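The Align hyperparameters quoted above (temperature λ = 0.1, Gumbel noise factor 1.0, T = 10 Sinkhorn iterations) correspond to the standard Gumbel-Sinkhorn construction of a soft permutation matrix from a pairwise score matrix. A minimal pure-Python sketch of that construction follows; it is a generic illustration under those hyperparameters, not YIELDNET's actual atom-mapping code, and the function name and score matrix are hypothetical:

```python
import math
import random

def _logsumexp(vals):
    """Numerically stable log(sum(exp(v)))."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def gumbel_sinkhorn(scores, lam=0.1, noise_factor=1.0, n_iters=10, seed=0):
    """Turn an n x n score matrix into a soft permutation: add Gumbel
    noise, divide by the temperature lam, then alternate row/column
    normalization in log space for n_iters Sinkhorn iterations."""
    rng = random.Random(seed)
    n = len(scores)
    # Gumbel(0, 1) samples: -log(-log(u)) for u ~ Uniform(0, 1).
    log_alpha = [
        [(scores[i][j] + noise_factor
          * -math.log(-math.log(max(rng.random(), 1e-12)))) / lam
         for j in range(n)]
        for i in range(n)
    ]
    for _ in range(n_iters):
        # Row normalization: each row sums to 1 in probability space.
        for i in range(n):
            lse = _logsumexp(log_alpha[i])
            log_alpha[i] = [v - lse for v in log_alpha[i]]
        # Column normalization: each column sums to 1.
        for j in range(n):
            lse = _logsumexp([log_alpha[i][j] for i in range(n)])
            for i in range(n):
                log_alpha[i][j] -= lse
    return [[math.exp(v) for v in row] for row in log_alpha]
```

As the temperature lam shrinks, the output approaches a hard permutation matrix; the noise and normalization steps keep the mapping differentiable with respect to the scores, which is what allows training under distant supervision of yields alone.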