Multi-domain Distribution Learning for De Novo Drug Design
Authors: Arne Schneuing, Ilia Igashov, Adrian Dobbelstein, Thomas Castiglione, Michael Bronstein, Bruno Correia
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on our observations, we focus our evaluation primarily on the distribution learning capabilities of the proposed generative model, comparing it to established baselines. To this end, we assess molecular properties and structural features using distance functions between distributions derived from generated samples and training data points, and demonstrate that DRUGFLOW molecules closely match the data distribution across a broad range of metrics. Metrics We compute Jensen-Shannon divergences for the categorical distributions of atom types, bond types and ring systems (Walters, 2022; 2021). We use the Wasserstein-1 distance for the bond length distributions of the three most common bond types (C–C, C–N and C=C), the three most common bond angles (C–C=C, C–C–C and C–C–O) as well as the number of rotatable bonds per molecule. We also apply the Wasserstein distance to computational scores relevant to applications in medicinal chemistry: Quantitative Estimate of Drug-likeness (QED) (Bickerton et al., 2012), Synthetic Accessibility (SA) (Ertl & Schuffenhauer, 2009) and lipophilicity (log P) (Wildman & Crippen, 1999). Dataset & Baselines We use the CrossDocked dataset (Francoeur et al., 2020) with 100,000 protein-ligand pairs for training and 100 proteins for testing, following previous works (Luo et al., 2021; Peng et al., 2022). The data split was done by 30% sequence identity using MMseqs2 (Steinegger & Söding, 2017). Ligands that do not pass all PoseBusters (Buttenschoen et al., 2024) filters were removed from the training set. We compare DRUGFLOW with an autoregressive method, POCKET2MOL (Peng et al., 2022), and two diffusion-based methods, TARGETDIFF (Guan et al., 2023a) and DIFFSBDD (Schneuing et al., 2022). We generated 100 samples for each test set target with DRUGFLOW and selected only molecules that passed the RDKit validity filter. |
| Researcher Affiliation | Collaboration | Arne Schneuing1, Ilia Igashov1, Adrian W. Dobbelstein1, Thomas Castiglione2, Michael Bronstein3,4 & Bruno Correia1 — 1École Polytechnique Fédérale de Lausanne, 2VantAI, Inc., 3University of Oxford, 4Aithyra |
| Pseudocode | No | The paper includes a 'Method overview' in Figure 1, which is a schematic diagram, and detailed mathematical descriptions of the generative framework in Appendix A.1. However, it does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | REPRODUCIBILITY STATEMENT All methodological details are described in Section 2 and Appendix A. Source code is available at https://github.com/LPDI-EPFL/DrugFlow. |
| Open Datasets | Yes | We use the CrossDocked dataset (Francoeur et al., 2020) with 100,000 protein-ligand pairs for training and 100 proteins for testing, following previous works (Luo et al., 2021; Peng et al., 2022). ... For each test set protein, we randomly selected and docked 100 molecules from the 2.4M compounds in the ChEMBL database (release 34). |
| Dataset Splits | Yes | We use the CrossDocked dataset (Francoeur et al., 2020) with 100,000 protein-ligand pairs for training and 100 proteins for testing, following previous works (Luo et al., 2021; Peng et al., 2022). The data split was done by 30% sequence identity using MMseqs2 (Steinegger & Söding, 2017). |
| Hardware Specification | No | The paper does not explicitly mention any specific hardware details such as GPU models, CPU types, or other computing infrastructure used for the experiments. |
| Software Dependencies | No | The paper mentions tools like 'RDKit validity filter', 'MMseqs2', and 'Rosetta repack protocol' but does not specify their version numbers. No other software or library dependencies with version numbers are provided. |
| Experiment Setup | Yes | Hyperparameters Important model hyperparameters are summarized in Table 4. Table 4 parameters: training epochs = 600, virtual nodes Nmax = 10, sampling steps = 500, OOD λ = 10, scheduler k = 3, preference alignment β = 100, λcoord = 1, λatom = 0.5, λbond = 0.5, λw = 1, λl = 0.2. |
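The two distance functions named in the evaluation (Jensen-Shannon divergence for categorical distributions such as atom and bond types, Wasserstein-1 distance for continuous features such as bond lengths and property scores) can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names are ours, and the DrugFlow codebase may compute these metrics differently (e.g. via SciPy with unequal sample sizes).

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two categorical
    distributions given as aligned probability lists. Ranges from 0
    (identical) to 1 (disjoint support)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Kullback-Leibler divergence; terms with a_i = 0 contribute 0.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def wasserstein1(xs, ys):
    """Wasserstein-1 (earth mover's) distance between two 1-D empirical
    samples of equal size: the mean absolute difference of the sorted
    samples (the optimal 1-D coupling matches order statistics)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)
```

For example, `js_divergence([1.0, 0.0], [0.0, 1.0])` is 1.0 (fully disjoint atom-type distributions), and `wasserstein1` applied to generated vs. reference QED scores gives the property-distribution gap reported in the table.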