PharmacoMatch: Efficient 3D Pharmacophore Screening via Neural Subgraph Matching

Authors: Daniel Rose, Oliver Wieder, Thomas Seidel, Thierry Langer

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive investigations of the learned representations and evaluate PharmacoMatch as a pre-screening tool in a zero-shot setting. We demonstrate significantly shorter runtimes and comparable performance metrics to existing solutions, providing a promising speed-up for screening very large datasets.
Researcher Affiliation | Academia | 1 Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, 1090 Vienna, Austria; 2 Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department of Pharmaceutical Sciences, University of Vienna, 1090 Vienna, Austria. Email: EMAIL
Pseudocode | No | The paper describes mathematical formulations for message passing (Equations 5 and 6) and illustrates the workflow in Figure 3, but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block with structured steps for a method or procedure.
Open Source Code | Yes | The source code of this project can be found under the following link: https://github.com/molinfo-vienna/PharmacoMatch.
Open Datasets | Yes | Unlabeled data for contrastive training: To span the pharmaceutical compound space, we download a set of drug-like molecules sourced from the ChEMBL database (Davies et al., 2015; Zdrazil et al., 2023) website in the form of Simplified Molecular Input Line Entry System (SMILES) strings (Weininger, 1988) and curate an unlabeled dataset using the open-source Chemical Data Processing Toolkit (CDPKit) (Seidel, 2024) (see Appendix A.1 for details). We perform experiments on the DUD-E benchmark dataset (Mysinger et al., 2012), which is commonly used to evaluate the performance of molecular docking and structure-based screening. For our pre-screening experiment, we use the DEKOIS2.0 (Bauer et al., 2013) dataset, which contains 80 targets, each with 40 actives and 1,200 decoys, as well as the LIT-PCBA (Tran-Nguyen et al., 2020) dataset, consisting of 15 target sets with 7,761 confirmed actives and 382,674 inactive compounds. Training and test data can be downloaded here: https://doi.org/10.6084/m9.figshare.27061081.
Dataset Splits | Yes | Unlabeled data was split into training and validation sets with a 98:2 ratio.
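For illustration, a 98:2 shuffled split as reported above could be reproduced with a sketch like the following (the dataset size, seed, and function name are assumptions for the example, not details from the paper):

```python
import random

def split_indices(n_samples, train_frac=0.98, seed=42):
    """Shuffle sample indices and split them into train/validation subsets."""
    indices = list(range(n_samples))
    rng = random.Random(seed)       # fixed seed for a reproducible split
    rng.shuffle(indices)
    n_train = int(n_samples * train_frac)
    return indices[:n_train], indices[n_train:]

train_idx, val_idx = split_indices(100_000)
print(len(train_idx), len(val_idx))  # 98000 2000
```

Splitting on shuffled indices (rather than on the raw records) keeps the same partition reusable across preprocessing runs.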
Hardware Specification | Yes | Alignment is performed in parallel on an AMD EPYC 7713 64-core processor with 128 threads, while pharmacophore embedding and matching are run on an NVIDIA GeForce RTX 3090, with both devices having comparable purchase prices and release dates. Training was performed on a single NVIDIA GeForce RTX 3090 graphics card with 24 GB of GDDR6X memory.
Software Dependencies | Yes | The GNN was implemented in Python 3.10 with PyTorch (v2.0.1) and the PyTorch Geometric library (v2.3.1) (Fey & Lenssen, 2019). Both the model and the dataset were implemented within the PyTorch Lightning framework (v2.1.0) (Falcon & The PyTorch Lightning team, 2019). Model training was monitored with TensorBoard (v2.13.0). CDPKit (v1.1.1) was employed for chemical data processing.
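The reported versions could be pinned in a requirements file along these lines; note that the exact PyPI package names (in particular for CDPKit) are assumptions for this sketch, not taken from the paper:

```text
# requirements.txt sketch, versions as reported above
torch==2.0.1
torch-geometric==2.3.1
pytorch-lightning==2.1.0
tensorboard==2.13.0
CDPKit==1.1.1
```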
Experiment Setup | Yes | Our GNN encoder model is implemented with three convolutional layers with an output dimension of 64. The MLP has a depth of three dense layers with a hidden dimension of 1024 and an output dimension of 512. The final model was trained for 500 epochs using an Adam (Kingma, 2014) optimizer with a learning rate of 10^-3. The margin of the best performing model was set to α = 100. The default tolerance radius r_T in CDPKit's pharmacophore screening is set to 1.5 Å, and we use the same value for the node displacement during model training to ensure consistency with the alignment algorithm in subsequent evaluations. We design a curriculum learning strategy for learning on pharmacophore graphs, detailed in Appendix A.5, along with details on model training and hyperparameter optimization.

Table 3: Hyperparameters of the best performing encoder model
- batch size: 256
- dropout convolution block: 0.2
- dropout projection block: 0.2
- max. epochs: 500
- hidden dimension convolution block: 64
- hidden dimension projection block: 1024
- output dimension convolution block: 1024
- output dimension projection block: 512
- learning rate optimizer: 0.001
- margin for negative pairs: 100.0
- number of convolution blocks: 3
- depth of the projector MLP: 3
- edge attributes dimension: 5
- sampling sphere radius positive pairs: 1.5
- sampling surface radius negative pairs: 1.5
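The projection block described above (three dense layers, hidden dimension 1024, output dimension 512, dropout 0.2) can be sketched in plain PyTorch. This is a hedged illustration only: the paper's encoder uses graph convolutions from PyTorch Geometric, which are omitted here; the input dimension of 1024 follows the convolution-block output dimension in Table 3, and the class and attribute names are assumptions.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Sketch of the projection block from Table 3:
    three dense layers, hidden dim 1024, output dim 512, dropout 0.2."""
    def __init__(self, in_dim=1024, hidden_dim=1024, out_dim=512, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden_dim, out_dim),  # embedding used for matching
        )

    def forward(self, x):
        return self.net(x)

mlp = ProjectionMLP()
emb = mlp(torch.randn(8, 1024))  # batch of 8 pooled graph representations
print(emb.shape)  # torch.Size([8, 512])
```

In the paper's setup this block would sit on top of the pooled output of the three graph-convolution layers, producing the 512-dimensional embeddings used for neural subgraph matching.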