Retrieval Augmented Diffusion Model for Structure-informed Antibody Design and Optimization
Authors: Zichen Wang, Yaokun Ji, Jianing Tian, Shuangjia Zheng
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments demonstrate that our method achieves state-of-the-art performance in multiple antibody inverse folding and optimization tasks, offering a new perspective on biomolecular generative models. To evaluate the performance of our model's generation, we utilize two tasks: antibody CDR sequence inverse folding (Section 5.1) and antibody optimization based on sequence design (Section 5.2), to compare with the baselines. Additionally, we conducted ablation experiments and further analysis to demonstrate the effectiveness of the retrieval-augmented method (Section 5.3). |
| Researcher Affiliation | Academia | (1) Global Institute of Future Technology, Shanghai Jiao Tong University; (2) School of Software & Microelectronics, Peking University |
| Pseudocode | Yes | Algorithm 1: Structural Retrieval Algorithm Overview; Algorithm 2: Training Procedure of RADAb; Algorithm 3: Sampling Procedure of RADAb |
| Open Source Code | Yes | Reproducibility Statement: The code is available at https://github.com/GENTEL-lab/RADAb |
| Open Datasets | Yes | To fully exploit the protein structure space, we first compiled a database of CDR-like fragments from the non-redundant Protein Data Bank (PDB) (Berman et al., 2000). The dataset for training the model is obtained from SAbDab and our established CDR-like fragments dataset. Following previous work (Luo et al., 2022), we first eliminated structures with a resolution worse than 4 Å and removed antibodies that target non-protein antigens. The Chothia scheme (Chothia & Lesk, 1987), as implemented in ANARCI (Dunbar & Deane, 2016), is used to renumber antibody residues. |
| Dataset Splits | Yes | We clustered the SAbDab dataset based on 50% sequence similarity in the CDR-H3 region, and chose 50 PDB files comprising 63 antibody-antigen complex structures as the test set. To ensure distinct training and test sets, we removed structures from the training set that were part of the same clusters as those in the test set. |
| Hardware Specification | Yes | All experiments are run on a single RTX 4090 GPU with 24 GB of memory. |
| Software Dependencies | No | Our model was developed and executed within the PyTorch framework. |
| Experiment Setup | Yes | For training, we chose the Adam optimizer with a learning rate of 0.0001, weight decay of 0.0, and momentum parameters beta1 and beta2 set to 0.9 and 0.999, respectively. To dynamically adjust the learning rate, we employed a plateau-based learning rate scheduler: when the validation loss plateaued, the learning rate was reduced by a factor of 0.8, with a minimum learning rate of 5e-6. The scheduler's patience was set to 10 epochs. The batch size is 8 during training. We design 8 samples for each CDR in the test set. Because the CDR-H3 region is highly variable and specific, and is considered the most critical part in determining antigen-antibody binding, we conducted separate training for the sequence design of this region, adding and removing noise only for the CDR-H3 region in each training iteration, for a total of 100,000 iterations. The other five regions, being more conserved, were trained together for a total of 250,000 iterations (approximately equivalent to 50,000 iterations per region). The number of time steps in the reverse generation process is set to 100. |
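
The Dataset Splits row describes removing from the training set every structure that shares a CDR-H3 cluster with a test structure. The snippet below is a minimal sketch of that filtering step, not the authors' code: the function name `cluster_aware_split`, the `cluster_of` mapping, and the toy IDs are all hypothetical, and the clustering itself (50% CDR-H3 sequence similarity) is assumed to have been done beforehand by an external tool.

```python
def cluster_aware_split(cluster_of, test_ids):
    """Return (train_ids, test_ids) such that no cluster spans both sets.

    cluster_of: dict mapping structure ID -> cluster ID (hypothetical input,
                e.g. the result of clustering CDR-H3 sequences).
    test_ids:   iterable of structure IDs selected as the test set.
    """
    test_ids = set(test_ids)
    # Clusters that contain at least one test structure.
    test_clusters = {cluster_of[s] for s in test_ids}
    # Keep only structures that are not in the test set and whose
    # cluster has no test member.
    train_ids = sorted(
        s for s, c in cluster_of.items()
        if s not in test_ids and c not in test_clusters
    )
    return train_ids, sorted(test_ids)


# Toy example with made-up IDs: "ab2" shares cluster 0 with the test
# structure "ab1", so it is dropped from the training set.
cluster_of = {"ab1": 0, "ab2": 0, "ab3": 1, "ab4": 2, "ab5": 2}
train_ids, test_ids = cluster_aware_split(cluster_of, ["ab1"])
print(train_ids)  # ['ab3', 'ab4', 'ab5']
```

Filtering by cluster rather than by exact ID is what keeps near-duplicate CDR-H3 sequences from leaking between the training and test sets.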