Retrieval Augmented Zero-Shot Enzyme Generation for Specified Substrate

Authors: Jiahe Du, Kaixiong Zhou, Xinyu Hong, Zhaozhuo Xu, Jinbo Xu, Xiao Huang

ICML 2025

Reproducibility Variable Result LLM Response
Research Type: Experimental. We evaluate our model on enzyme design tasks with diverse real-world substrates and show that it outperforms existing protein generation methods in catalytic capability, foldability, and docking accuracy. Additionally, we define the zero-shot substrate-specified enzyme generation task and introduce a dataset with evaluation benchmarks.
Researcher Affiliation: Collaboration. (1) Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, China; (2) Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, NC, USA; (3) Beijing Molecule Mind Co., Ltd., Beijing, China; (4) Department of Computer Science, Stevens Institute of Technology, Hoboken, NJ, USA.
Pseudocode: No. The paper describes its methods in prose and equations (e.g., Section 3, 'Substrate-Specified Enzyme Generator', and its subsections) but does not include explicit pseudocode or algorithm blocks.
Open Source Code: No. The paper neither states that its source code is released nor provides a link to a code repository.
Open Datasets: Yes. We provide a substrate-enzyme relationship dataset extracted from the RHEA database (https://www.rhea-db.org) to better evaluate model performance on the substrate-specified enzyme generation task. Statistics of the dataset are shown in Table 1. The two rules in Sec. 2 are strictly followed to avoid data overlap.
Dataset Splits: Yes. All substrate-enzyme pairs (m, x) are split into D for training, Dvalid for validation, and Dtest for testing. To avoid data leakage, two rules are enforced for any two pairs (m1, x1) and (m2, x2) drawn from different subsets: 1. Molecules from different subsets must not be the same, i.e., m1 ≠ m2; 2. Any two protein sequences from different subsets, i.e., x1 and x2, must not overlap by more than 30% (sequence identity must not exceed 30%). This split forms a zero-shot setting. Take the target molecule TCP as an example: TCP is in Dtest, and the model G generates an enzyme for TCP. G has never been trained on TCP, because TCP is not in D, and G has never seen proteins similar to TCP's ground-truth enzymes, because those enzymes appear only in Dtest and every protein in D differs from them by at least 70% (below 30% identity).
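The two split rules above can be sketched as a pairwise check. This is a minimal illustration, not the paper's actual pipeline: the `violates_split` helper and the position-wise identity proxy are assumptions for demonstration (production splits typically compute identity with alignment-based tools such as MMseqs2 or CD-HIT), and the example substrate/sequence strings are hypothetical.

```python
def seq_identity(a: str, b: str) -> float:
    """Crude position-wise identity proxy (no alignment).
    Real pipelines use alignment-based clustering tools instead."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def violates_split(pair1, pair2, max_identity=0.30) -> bool:
    """Check the two zero-shot rules for (substrate, enzyme) pairs
    placed in DIFFERENT subsets:
      Rule 1: the substrate molecules must differ (m1 != m2);
      Rule 2: the enzyme sequences must share <= 30% identity."""
    m1, x1 = pair1
    m2, x2 = pair2
    if m1 == m2:                             # Rule 1 violated
        return True
    if seq_identity(x1, x2) > max_identity:  # Rule 2 violated
        return True
    return False

# Hypothetical (substrate SMILES, enzyme sequence) pairs:
train_pair = ("CCO", "MKTAYIAK")
test_pair = ("CCN", "GGGGGGGG")
print(violates_split(train_pair, test_pair))  # unrelated pair passes
```
A real implementation would cluster all sequences at the 30% identity threshold first and assign whole clusters to a single subset, which guarantees Rule 2 globally rather than checking pairs one at a time.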
Hardware Specification: Yes. All experiments are conducted with 200 GB of memory, two Intel Xeon Gold 6426Y CPUs, and four NVIDIA 4090D GPUs with 24 GB of memory each.
Software Dependencies: No. The paper mentions various tools and models used for evaluation and baselines (e.g., ClustalW, UniKP, ESMFold, RDKit, NeuralPLexer, AutoDock Vina), some with versions implied by citations, but does not provide specific version numbers for the key software components (such as Python, PyTorch, or other libraries) used to implement the SENZ model itself.
Experiment Setup: No. The paper describes the model architecture, training methodology, and loss functions (Lr, Lg), but does not provide specific experimental setup details such as hyperparameter values (e.g., learning rate, batch size, number of epochs), optimizer settings, or training schedules for the SENZ model.