Redefining the Task of Bioactivity Prediction

Authors: Yanwen Huang, Bowen Gao, Yinjun Jia, Hongbo Ma, Wei-Ying Ma, Ya-Qin Zhang, Yanyan Lan

ICLR 2025

Reproducibility variables, each with its result and the supporting excerpt from the paper:
Research Type: Experimental. "Experimental results demonstrate that this new task provides a more challenging and meaningful benchmark for training and evaluating bioactivity prediction models, ultimately offering a more robust assessment of model performance. Dataset and Code are available at: https://github.com/bowen-gao/SIU."
Researcher Affiliation: Academia. "(1) Institute for AI Industry Research (AIR), Tsinghua University; (2) Department of Pharmaceutical Science, Peking University; (3) Department of Computer Science and Technology, Tsinghua University; (4) Beijing Academy of Artificial Intelligence (BAAI)"
Pseudocode: No. The paper describes the methodology in detail using prose and figures, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. "Dataset and Code are available at: https://github.com/bowen-gao/SIU."
Open Datasets: Yes. "To address these issues, we redefine the bioactivity prediction task by introducing the SIU dataset, a million-scale Structural small molecule-protein Interaction dataset for Unbiased bioactivity prediction, which is 50 times larger than the widely used PDBbind. The bioactivity labels in SIU are derived from wet experiments and organized by label types, ensuring greater accuracy and comparability. The complexes in SIU are constructed using a majority vote from three commonly used docking software programs, enhancing their reliability. Additionally, the structure of SIU allows for multiple small molecules to be associated with each protein pocket, enabling the redefinition of evaluation metrics like Pearson and Spearman correlations across different small molecules targeting the same protein pocket. Experimental results demonstrate that this new task provides a more challenging and meaningful benchmark for training and evaluating bioactivity prediction models, ultimately offering a more robust assessment of model performance. Dataset and Code are available at: https://github.com/bowen-gao/SIU."
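The per-pocket evaluation described above (Pearson and Spearman computed across the small molecules that target the same pocket, then averaged over pockets) can be sketched as follows. The function name and record layout are illustrative assumptions, not the SIU codebase's API:

```python
# Sketch of a per-pocket correlation metric: correlations are computed across
# the different small molecules bound to the SAME protein pocket, then averaged
# over pockets. Names here are illustrative, not taken from the SIU repository.
from collections import defaultdict
from scipy.stats import pearsonr, spearmanr

def per_pocket_correlations(records):
    """records: iterable of (pocket_id, predicted_activity, true_activity)."""
    by_pocket = defaultdict(list)
    for pocket_id, pred, true in records:
        by_pocket[pocket_id].append((pred, true))

    pearsons, spearmans = [], []
    for pairs in by_pocket.values():
        if len(pairs) < 2:  # a correlation is undefined for a single ligand
            continue
        preds, trues = zip(*pairs)
        pearsons.append(pearsonr(preds, trues)[0])
        spearmans.append(spearmanr(preds, trues)[0])

    # Average the per-pocket correlations so every pocket counts equally,
    # regardless of how many ligands it has.
    return sum(pearsons) / len(pearsons), sum(spearmans) / len(spearmans)
```

Averaging per pocket prevents a few ligand-rich targets from dominating the score, which is the point of redefining the metric around shared pockets.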
Dataset Splits: Yes. "To ensure robust evaluation and flexibility, the dataset includes multiple predefined splitting strategies. These include sequence identity filters at thresholds of 90%, 60%, and 30%, as well as a combined sequence identity and structural similarity filter. A manually curated test set focuses on biologically meaningful tasks by incorporating representative protein targets across diverse classes, offering insights into the generalizability of predictions for key biochemically relevant targets. Additionally, bioactivity prediction models can be assessed using a 10-fold cross-validation framework, providing a reliable and unbiased approach for diverse training and testing scenarios (details in Appendix A.3)... For both versions 0.9 and 0.6, we have 21,528 data pairs allocated for testing. Specifically, version 0.9 includes 1,250,807 data pairs for training and validation, while version 0.6 includes 386,330 data pairs for these purposes."
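A sequence-identity split of this kind can be sketched as below. The identity function here is a toy position-match score for illustration only; real pipelines compute identity with dedicated tools such as MMseqs2 or CD-HIT, and the helper names are assumptions, not the paper's code:

```python
# Minimal sketch of a sequence-identity-based train/test filter: training pairs
# whose protein is too similar to any test protein are dropped. The identity
# function is a toy stand-in for a proper alignment-based tool.
def seq_identity(a: str, b: str) -> float:
    """Toy identity: fraction of matching positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def filter_train_set(train_pairs, test_seqs, threshold=0.9):
    """Keep (protein_seq, ligand, label) pairs below the identity threshold
    against every test protein; threshold=0.9 mirrors the 90% filter."""
    return [
        (seq, ligand, label)
        for seq, ligand, label in train_pairs
        if all(seq_identity(seq, t) < threshold for t in test_seqs)
    ]
```

Lower thresholds (0.6, 0.3) discard more near-duplicate proteins, which is why the 60% version of the dataset has far fewer training pairs than the 90% version.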
Hardware Specification: Yes. "For GNN Model, we use the same model in Atom3D (Townshend et al., 2021). We train the model using one NVIDIA A100 GPU... For Uni-Mol model... We use four NVIDIA A100 GPUs to train the model... For ProFSA model... We use four NVIDIA A100 GPUs to train the model."
Software Dependencies: No. The paper mentions using 'Glide LigPrep' for initial 3D conformations and docking software programs such as 'Vina' and 'GOLD', but it does not provide specific version numbers for these or for any other software libraries or frameworks used for model training.
Experiment Setup: Yes. "For GNN Model... The batch size is 256, the max number of epochs is 20, the optimizer is Adam, the learning rate is 1e-3. For 3D-CNN Model... The batch size is 256, the max number of epochs is 20, the optimizer is Adam, the learning rate is 1e-4. For Uni-Mol model... The batch size is 384, the max number of epochs is 50, the optimizer is Adam, the learning rate is 1e-4. For ProFSA model... The batch size is 384, the max number of epochs is 50, the optimizer is Adam, the learning rate is 1e-4."
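The four reported configurations can be collected into one table, with a helper that builds the matching Adam optimizer. This is a minimal PyTorch sketch under the assumption of a generic `nn.Module`; the model classes themselves are placeholders, not the paper's implementations:

```python
# Hyperparameters as reported in the paper's experiment setup; all four
# baselines use the Adam optimizer and differ only in batch size, epoch
# budget, and learning rate.
import torch

CONFIGS = {
    "GNN":     {"batch_size": 256, "max_epochs": 20, "lr": 1e-3},
    "3D-CNN":  {"batch_size": 256, "max_epochs": 20, "lr": 1e-4},
    "Uni-Mol": {"batch_size": 384, "max_epochs": 50, "lr": 1e-4},
    "ProFSA":  {"batch_size": 384, "max_epochs": 50, "lr": 1e-4},
}

def make_optimizer(model: torch.nn.Module, name: str) -> torch.optim.Adam:
    """Build the Adam optimizer for the named baseline's learning rate."""
    return torch.optim.Adam(model.parameters(), lr=CONFIGS[name]["lr"])
```

For example, `make_optimizer(model, "GNN")` yields Adam with lr=1e-3, while the other three baselines train at lr=1e-4.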