SynFlowNet: Design of Diverse and Novel Molecules with Synthesis Constraints
Authors: Miruna Cretu, Charles Harris, Ilia Igashov, Arne Schneuing, Marwin Segler, Bruno Correia, Julien Roy, Emmanuel Bengio, Pietro Lio
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach using synthetic accessibility scores and an independent retrosynthesis tool to assess the synthesizability of our compounds, and motivate the choice of GFlowNets through considerable improvement in sample diversity compared to baselines. Additionally, we identify challenges with reaction encodings that can complicate traversal of the MDP in the backward direction. To address this, we introduce various strategies for learning the GFlowNet backward policy and thus demonstrate how additional constraints can be integrated into the GFlowNet MDP framework. This approach enables our model to successfully identify synthesis pathways for previously unseen molecules. |
| Researcher Affiliation | Collaboration | (1) University of Cambridge, (2) EPFL, (3) Microsoft Research, (4) Valence Labs |
| Pseudocode | Yes | In that setting, we (1) generate trajectories using P_F, (2) update P_F according to the trajectory balance objective in Equation 1 and (3) update P_B using these same trajectories according to Equation 2 (see Algorithm 1 in Appendix A.4). ... (see Algorithm 2, Appendix A.4). |
| Open Source Code | Yes | Source code is available at https://github.com/mirunacrt/synflownet. |
| Open Datasets | Yes | We use commercially available building blocks (BBs) from Enamine, which are small fragments of molecules prepared in bulk to be readily synthesised into candidate molecules. Reaction templates are obtained from two publicly available template libraries (Button et al., 2019; Hartenfeller M, 2012). ... We use a subset of Enamine reactions available from Swanson et al. (2024) which produces 93.9% of the REAL space. ... We also employ two oracle functions from the PMO Gao et al. (2022a) benchmark, which provide machine learning proxies fit to experimental data to predict the bioactivities against their corresponding disease targets. The two targets we use here are GSK3β Li et al. (2018) and dopamine receptor D2 (DRD2) (Olivecrona et al., 2017a). ... For this, we select 70,000 random ChEMBL molecules from Zdrazil et al. (2023) and run AiZynthFinder retrosynthesis (Genheden et al., 2020) to decompose the molecules into building blocks. |
| Dataset Splits | No | The paper mentions training proxy models using external datasets, e.g., the sEH proxy model was trained on '300,000 randomly generated molecules', and GSK3β and DRD2 use 'oracle functions from the PMO Gao et al. (2022a) benchmark'. However, it does not explicitly describe the train/test/validation splits for these datasets within the paper itself for its own experiments. For its own methodology (GFlowNet), it operates in an online fashion generating trajectories, and mentions 'on-policy (train) and off-policy (test) molecules' for evaluating the backward policy, but this refers to generated samples rather than predefined splits of an external dataset with specific percentages or counts. |
| Hardware Specification | Yes | model.forward() corresponds to the average time required to perform the matrix operations (PyTorch back-end, single NVIDIA H100 GPU) during the forward pass of the model |
| Software Dependencies | Yes | We accomplish this using the new GPU-accelerated Vina-GPU 2.1 docking algorithm (Tang et al., 2023; Alhossary et al., 2015). ... To compute clusters of building blocks for Figures 6E-F, we used BitBIRCH (Jung et al., 2024), a recent adaptation of the Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) algorithm (Zhang et al., 1996), recently proposed for efficient clustering of large molecular libraries. ... we performed a sequence-based search using MMSeqs2 (Steinegger & Söding, 2017) across the Protein Data Bank (PDB). |
| Experiment Setup | Yes | Table A.1: Hyperparameters used in our SynFlowNet training pipelines. Batch size: 64; Number of GNN layers: 4; GNN node embedding size: 128; Graph transformer heads: 2; Learning rate (P_F): 10^-4; Learning rate (P_B): 10^-4; Learning rate (Z): 10^-3 |
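
The Pseudocode and Experiment Setup rows refer to a trajectory balance objective (Equation 1) and to the Table A.1 hyperparameters. As a rough illustration only, the sketch below computes the standard trajectory balance residual for a single trajectory in plain Python. It is not taken from the SynFlowNet repository: the function name, argument names, and dictionary keys are hypothetical, the exact form of the paper's Equation 1 is assumed to match the usual trajectory balance loss, and the negative exponents on the learning rates are an assumption about the extracted table.

```python
import math

# Values transcribed from Table A.1. The key names are hypothetical, and the
# negative exponents on the learning rates are an assumption (the extracted
# text shows only "10 4" and "10 3").
HYPERPARAMS = {
    "batch_size": 64,
    "num_gnn_layers": 4,
    "gnn_node_embedding_size": 128,
    "graph_transformer_heads": 2,
    "lr_forward_policy": 1e-4,   # P_F
    "lr_backward_policy": 1e-4,  # P_B
    "lr_log_z": 1e-3,            # Z
}


def trajectory_balance_loss(log_z, log_pf_steps, log_pb_steps, reward):
    """Squared trajectory balance residual for one trajectory.

    log_z        -- current estimate of log Z (log partition function)
    log_pf_steps -- log P_F of each forward step along the trajectory
    log_pb_steps -- log P_B of each corresponding backward step
    reward       -- strictly positive reward of the terminal molecule
    """
    if len(log_pf_steps) != len(log_pb_steps):
        raise ValueError("forward and backward step counts must match")
    if reward <= 0:
        raise ValueError("reward must be strictly positive")
    residual = (
        log_z + sum(log_pf_steps) - math.log(reward) - sum(log_pb_steps)
    )
    return residual ** 2


if __name__ == "__main__":
    # A perfectly balanced toy trajectory: Z * P_F(tau) == R(x) * P_B(tau).
    loss = trajectory_balance_loss(
        log_z=math.log(2.0),
        log_pf_steps=[math.log(0.5), math.log(0.5)],
        log_pb_steps=[0.0, 0.0],
        reward=0.5,
    )
    print(f"toy trajectory balance loss: {loss:.3e}")
```

In an actual training loop this quantity would be averaged over a batch of sampled trajectories and minimised with respect to the forward policy parameters and log Z; how the backward policy is updated (Equation 2) is not reproduced here because the excerpt does not state it.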