Multi-domain Distribution Learning for De Novo Drug Design
Authors: Arne Schneuing, Ilia Igashov, Adrian Dobbelstein, Thomas Castiglione, Michael Bronstein, Bruno Correia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on our observations, we focus our evaluation primarily on the distribution learning capabilities of the proposed generative model, comparing it to established baselines. To this end, we assess molecular properties and structural features using distance functions between distributions derived from generated samples and training data points, and demonstrate that DRUGFLOW molecules closely match the data distribution across a broad range of metrics. Metrics We compute Jensen-Shannon divergences for the categorical distributions of atom types, bond types and ring systems (Walters, 2022; 2021). We use the Wasserstein-1 distance for the bond length distributions of the three most common bond types (C–C, C–N and C=C), the three most common bond angles (C–C=C, C–C–C and C–C–O) as well as the number of rotatable bonds per molecule. We also apply the Wasserstein distance to computational scores relevant to applications in medicinal chemistry: Quantitative Estimate of Drug-likeness (QED) (Bickerton et al., 2012), Synthetic Accessibility (SA) (Ertl & Schuffenhauer, 2009) and lipophilicity (log P) (Wildman & Crippen, 1999). Dataset & Baselines We use the CrossDocked dataset (Francoeur et al., 2020) with 100,000 protein-ligand pairs for training and 100 proteins for testing, following previous works (Luo et al., 2021; Peng et al., 2022). The data split was done by 30% sequence identity using MMseqs2 (Steinegger & Söding, 2017). Ligands that do not pass all PoseBusters (Buttenschoen et al., 2024) filters were removed from the training set. We compare DRUGFLOW with an autoregressive method, POCKET2MOL (Peng et al., 2022), and two diffusion-based methods, TARGETDIFF (Guan et al., 2023a) and DIFFSBDD (Schneuing et al., 2022). We generated 100 samples for each test set target with DRUGFLOW and selected only molecules that passed the RDKit validity filter. |
| Researcher Affiliation | Collaboration | Arne Schneuing1, Ilia Igashov1, Adrian W. Dobbelstein1, Thomas Castiglione2, Michael Bronstein3,4 & Bruno Correia1 — 1École Polytechnique Fédérale de Lausanne, 2VantAI, Inc., 3University of Oxford, 4Aithyra |
| Pseudocode | No | The paper includes a 'Method overview' in Figure 1, which is a schematic diagram, and detailed mathematical descriptions of the generative framework in Appendix A.1. However, it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT All methodological details are described in Section 2 and Appendix A. Source code is available at https://github.com/LPDI-EPFL/DrugFlow. |
| Open Datasets | Yes | We use the CrossDocked dataset (Francoeur et al., 2020) with 100,000 protein-ligand pairs for training and 100 proteins for testing, following previous works (Luo et al., 2021; Peng et al., 2022). ... For each test set protein, we randomly selected and docked 100 molecules from the 2.4M compounds in the ChEMBL database (release 34). |
| Dataset Splits | Yes | We use the CrossDocked dataset (Francoeur et al., 2020) with 100,000 protein-ligand pairs for training and 100 proteins for testing, following previous works (Luo et al., 2021; Peng et al., 2022). The data split was done by 30% sequence identity using MMseqs2 (Steinegger & Söding, 2017). |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models, CPU types, or other computing infrastructure used for the experiments. |
| Software Dependencies | No | The paper mentions tools like 'RDKit validity filter', 'MMseqs2', and 'Rosetta repack protocol' but does not specify their version numbers. No other software or library dependencies with version numbers are provided. |
| Experiment Setup | Yes | Hyperparameters Important model hyperparameters are summarized in Table 4. Table 4 parameters: training epochs = 600, virtual nodes Nmax = 10, sampling steps = 500, OOD λ = 10, scheduler k = 3, preference alignment β = 100, λcoord = 1, λatom = 0.5, λbond = 0.5, λw = 1, λl = 0.2. |
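The two distance functions named in the evaluation (Jensen-Shannon divergence for categorical distributions such as atom and bond types, Wasserstein-1 distance for continuous features such as bond lengths and property scores) can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names are ours, and the DrugFlow codebase may compute these metrics differently (e.g. via SciPy with unequal sample sizes).

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two categorical
    distributions given as aligned probability lists. Ranges from 0
    (identical) to 1 (disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1(xs, ys):
    """Wasserstein-1 (earth mover's) distance between two 1-D empirical
    samples of equal size: the mean absolute difference of the sorted
    samples (the optimal 1-D coupling matches order statistics)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)
```

For example, `js_divergence([1.0, 0.0], [0.0, 1.0])` is 1.0 (fully disjoint atom-type distributions), and `wasserstein1` applied to generated vs. reference QED scores gives the property-distribution gap reported in the table.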