Decoupled Sequence and Structure Generation for Realistic Antibody Design

Authors: Nayoung Kim, Minsu Kim, Sungsoo Ahn, Jinkyoo Park

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that our sequence-structure decoupling approach improves performance in various antibody design experiments, while our training algorithm effectively prevents excessive token repetitions. Notably, our approach establishes a Pareto frontier over other non-autoregressive antibody design models, indicating optimal trade-offs between high sequence modeling capacity and low token repetition. Additionally, we demonstrate that our training algorithm can be generalized to protein design.
Researcher Affiliation | Academia | Nayoung Kim (EMAIL), Korea Advanced Institute of Science and Technology (KAIST); Minsu Kim (EMAIL), KAIST; Sungsoo Ahn (EMAIL), Pohang University of Science and Technology (POSTECH); Jinkyoo Park (EMAIL), KAIST.
Pseudocode | Yes | Algorithm 1 (Training sequence design model). Require: antibody sequence dataset D = {(s, c)}, antibody sequence generator pθ(s|c). ... Algorithm 2 (Training structure prediction model). Require: a trained antibody sequence generator pθ(s|c), a sequence-to-structure model pϕ(x|s, c), and antibody sequence-structure dataset D = {(s, x, c)}. ... Algorithm 3 (ITA algorithm for antibody affinity optimization). Input: SKEMPI V2.0 antibody-antigen complex dataset D, pre-trained pθ(s, x|c), top-k candidates to maintain.
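The two-stage decoupling in Algorithms 1 and 2 can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: the "models" are trivial stand-ins, and all function names and data shapes are assumptions. The key point it shows is the ordering — the structure model is trained on sequences sampled from the already-trained sequence generator rather than on ground-truth sequences.

```python
import random

# Hypothetical stand-ins for the paper's models: pθ(s|c) generates a CDR
# sequence given context c; pϕ(x|s, c) predicts structure from a sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_sequence_model(dataset):
    """Stage 1 (cf. Algorithm 1): fit the sequence generator on (s, c) pairs.
    Here the 'model' is just amino-acid unigram counts, as a placeholder."""
    counts = {}
    for s, _c in dataset:
        for aa in s:
            counts[aa] = counts.get(aa, 0) + 1
    return counts

def sample_sequence(model, length, rng):
    """Draw a sequence of the given length from the toy generator."""
    aas = list(model) or list(AMINO_ACIDS)
    weights = [model.get(aa, 1) for aa in aas]
    return "".join(rng.choices(aas, weights=weights, k=length))

def train_structure_model(seq_model, dataset, rng):
    """Stage 2 (cf. Algorithm 2): train pϕ(x|s, c) on sequences *sampled*
    from the trained generator, decoupling structure learning from the
    ground-truth sequences. The 'training' stub just records the pairs."""
    seen = []
    for s, _x, c in dataset:
        s_hat = sample_sequence(seq_model, len(s), rng)  # sampled, not ground truth
        seen.append((s_hat, c))
    return seen

rng = random.Random(0)
seq_data = [("ARDY", "ctx1"), ("GSSW", "ctx2")]
seq_model = train_sequence_model(seq_data)
struct_data = [("ARDY", "x1", "ctx1"), ("GSSW", "x2", "ctx2")]
pairs = train_structure_model(seq_model, struct_data, rng)
print(len(pairs))  # one sampled sequence per training complex
```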
Open Source Code | No | Our implementation is built upon https://github.com/facebookresearch/esm, https://github.com/BytedProtein/ByProt/tree/main, https://github.com/wengong-jin/RefineGNN, and https://github.com/THUNLP-MT/MEAN/tree/main.
Open Datasets | Yes | SAbDab benchmark... Structural Antibody Database (SAbDab) (Dunbar et al., 2014)... RAbD benchmark... Rosetta Antibody Design (RAbD) dataset... SKEMPI V2.0 dataset (Jankauskaitė et al., 2019)... CATH 4.2 and CATH 4.3 benchmarks.
Dataset Splits | Yes | We then split the complexes into train, validation, and test sets according to the CDR clusterings. Specifically, we use MMseqs2 (Steinegger & Söding, 2017) to assign antibodies with CDR sequence identity above 40% to the same cluster, where the sequence identity is computed with the BLOSUM62 substitution matrix (Henikoff & Henikoff, 1992). Then we conduct a 10-fold cross-validation by splitting the clusters into a ratio of 8:1:1 for train/valid/test sets, respectively. Detailed statistics of the 10-fold dataset splits are provided in Appendix E.
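The cluster-level 8:1:1 split described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' split code: it takes the MMseqs2 clustering as given (a mapping from cluster id to antibody ids) and assigns whole clusters, never individual antibodies, to each partition, so no two antibodies above the 40% CDR identity threshold end up in different splits.

```python
import random

def split_clusters(clusters, ratios=(8, 1, 1), seed=0):
    """Assign whole CDR clusters to train/valid/test in an 8:1:1 ratio.
    `clusters` maps cluster id -> list of antibody ids (hypothetical format).
    Splitting at the cluster level prevents sequence-identity leakage."""
    rng = random.Random(seed)
    ids = list(clusters)
    rng.shuffle(ids)
    total = sum(ratios)
    n_train = round(len(ids) * ratios[0] / total)
    n_valid = round(len(ids) * ratios[1] / total)
    train_ids = ids[:n_train]
    valid_ids = ids[n_train:n_train + n_valid]
    test_ids = ids[n_train + n_valid:]
    expand = lambda cids: [ab for cid in cids for ab in clusters[cid]]
    return expand(train_ids), expand(valid_ids), expand(test_ids)

# Toy example: 10 clusters of 2 antibodies each -> 16 / 2 / 2 antibodies.
clusters = {f"c{i}": [f"ab{i}a", f"ab{i}b"] for i in range(10)}
train, valid, test = split_clusters(clusters)
print(len(train), len(valid), len(test))  # 16 2 2
```

Repeating this with ten different seeds (or a rotation over the cluster folds) gives the paper's 10-fold cross-validation.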
Hardware Specification | Yes | All models were trained on a machine with 48 CPU cores and 8 NVIDIA GeForce RTX 3090 GPUs; 1-3 GPUs were used for each experiment.
Software Dependencies | No | The paper mentions 'ESM2-650M (Lin et al., 2023)' as a protein language model and the optimizers Adam and AdamW, but it does not specify version numbers for any software libraries or dependencies, such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We set hyperparameter α = 0.2 in our sequence objective and use rank 2 for the weights Wq, Wk, Wv, and Wo in the multi-head attention module for LoRA fine-tuning... we limit the batch size to a maximum token length of 6000 and train for 30 epochs. Optimizer Adam, learning rate 0.001 (Table 11); optimizer AdamW, learning rate 0.001 (Table 12).
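The reported hyperparameters can be consolidated into a single configuration sketch. The key names below are hypothetical (the paper does not publish a config file); only the values come from the reported setup.

```python
# Hypothetical config consolidating the paper's reported hyperparameters.
# Key names are illustrative assumptions; values are from the report.
config = {
    "sequence_objective_alpha": 0.2,      # α in the sequence objective
    "lora": {                             # LoRA fine-tuning of attention
        "rank": 2,
        "target_weights": ["Wq", "Wk", "Wv", "Wo"],
    },
    "max_tokens_per_batch": 6000,         # batch size cap by token length
    "epochs": 30,
    "optimizer": "Adam",                  # AdamW with the same lr in Table 12
    "learning_rate": 1e-3,
}
print(config["lora"]["rank"], config["learning_rate"])  # 2 0.001
```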