DrugHash: Hashing Based Contrastive Learning for Virtual Screening

Authors: Jin Han, Yun Hong, Wu-Jun Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that DrugHash can outperform existing methods to achieve state-of-the-art accuracy, with at least a 32× reduction in memory cost and a 4.6× improvement in speed.
Researcher Affiliation | Academia | Jin Han (1*), Yun Hong (2*), Wu-Jun Li (1); (1) National Key Laboratory for Novel Software Technology, School of Computer Science, Nanjing University; (2) Kuang Yaming Honors School, Nanjing University. EMAIL, EMAIL
Pseudocode | No | The paper describes the proposed method, training, and inference steps using mathematical equations and textual descriptions, but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the methodology described.
Open Datasets | Yes | To train DrugHash, we adopt the same training dataset as DrugCLIP, which is the PDBBind dataset (Wang et al. 2005) augmented by HomoAug (Gao et al. 2023). To benchmark the VS performance of different methods, we adopt two evaluation datasets, DUD-E (Mysinger et al. 2012) and LIT-PCBA (Tran-Nguyen, Jacquemard, and Rognan 2020). To evaluate the memory and time cost of different VS methods, we adopt the ZINC database (Irwin et al. 2020) and the Enamine REAL database (Shivanyuk et al. 2007). The CASF-2016 dataset (Su et al. 2018) is used as the validation set.
Dataset Splits | Yes | To train DrugHash, we adopt the same training dataset as DrugCLIP, which is the PDBBind dataset (Wang et al. 2005) augmented by HomoAug (Gao et al. 2023). To benchmark the VS performance of different methods, we adopt two evaluation datasets, DUD-E (Mysinger et al. 2012) and LIT-PCBA (Tran-Nguyen, Jacquemard, and Rognan 2020). The CASF-2016 dataset (Su et al. 2018) is used as the validation set to select the best number of epochs.
Hardware Specification | Yes | Our model is trained on NVIDIA RTX A6000 GPUs, and each model is trained for up to 200 epochs. The time test is run on Intel Xeon Gold 6240R CPUs.
Software Dependencies | No | The paper mentions Faiss (Douze et al. 2024) and general deep learning concepts but does not specify version numbers for key software libraries or frameworks (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | In our implementation, we set the hyperparameter λ to 0.2. The temperature coefficient τ is set to 0.07. Each time, we sample 48 protein-molecule pairs for contrastive learning. The code length of the output binary hash codes is 128. Our model is trained on NVIDIA RTX A6000 GPUs, and each model is trained for up to 200 epochs. The model is trained with five random seeds and we report the average results. We utilize gradient accumulation, performing gradient backpropagation every four steps on a single GPU card, which is equivalent to using four cards for distributed training.
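The training recipe quoted in the Experiment Setup row (τ = 0.07, λ = 0.2, batches of 48 protein-molecule pairs, 128-dimensional codes) can be sketched as a symmetric InfoNCE-style contrastive loss plus a λ-weighted quantization term. This is an illustrative reconstruction, not the authors' released code: the exact form of the hashing term is not stated in the excerpts, so `contrastive_hash_loss` and its quantization penalty are assumptions.

```python
import numpy as np

def contrastive_hash_loss(z_prot, z_mol, tau=0.07, lam=0.2):
    """Illustrative loss for a batch of matched protein-molecule pairs:
    symmetric InfoNCE with temperature tau, plus a lam-weighted
    quantization penalty pushing embeddings toward binary {-1, +1} codes.
    A sketch of the setup described in the paper, not its actual code."""
    # L2-normalize embeddings, shape (batch, code_length)
    zp = z_prot / np.linalg.norm(z_prot, axis=1, keepdims=True)
    zm = z_mol / np.linalg.norm(z_mol, axis=1, keepdims=True)
    logits = zp @ zm.T / tau            # pairwise similarity matrix
    diag = np.arange(len(zp))           # matched pairs lie on the diagonal

    def ce(l):
        # cross-entropy of each row against its diagonal entry
        l = l - l.max(axis=1, keepdims=True)          # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[diag, diag].mean()

    # symmetric: protein->molecule and molecule->protein directions
    nce = 0.5 * (ce(logits) + ce(logits.T))
    # assumed quantization penalty: distance of embeddings from +/-1 codes
    quant = ((np.abs(zp) - 1) ** 2).mean() + ((np.abs(zm) - 1) ** 2).mean()
    return nce + lam * quant
```

With the paper's reported batch size of 48, each protein embedding is contrasted against 47 in-batch negative molecules (and vice versa), which is what makes the gradient-accumulation detail (4 steps ≈ 4 cards) matter for the effective number of negatives.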
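The memory and speed claims in the table rest on replacing float embeddings with 128-bit binary codes searched by Hamming distance (the paper cites Faiss for this in practice). A 128-bit code occupies 16 bytes versus 512 bytes for a 128-dimensional float32 vector, which is exactly the 32× memory reduction quoted above. The sketch below shows the idea in plain NumPy; `pack_codes` and `hamming_topk` are illustrative names, not functions from the paper or from Faiss.

```python
import numpy as np

# 8-bit popcount lookup table for fast Hamming distance
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack_codes(signs):
    """Pack an (n, 128) array of real-valued embeddings into (n, 16)
    uint8 binary codes by taking the sign of each dimension."""
    bits = (signs > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def hamming_topk(query, db, k=5):
    """Return indices and distances of the k database codes closest to
    `query` in Hamming distance. query: (16,) uint8; db: (n, 16) uint8."""
    dists = POPCOUNT[np.bitwise_xor(db, query)].sum(axis=1)
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]
```

XOR-plus-popcount over packed bytes is why hashing-based screening scales to billion-molecule libraries such as ZINC and Enamine REAL: the whole database fits in far less memory, and each comparison is a handful of integer operations instead of a 128-dimensional float dot product.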