Decoupled Sequence and Structure Generation for Realistic Antibody Design

Authors: Nayoung Kim, Minsu Kim, Sungsoo Ahn, Jinkyoo Park

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that our sequence-structure decoupling approach improves performance in various antibody design experiments, while our training algorithm effectively prevents excessive token repetitions. Notably, our approach establishes a Pareto frontier over other non-autoregressive antibody design models, indicating optimal trade-offs between high sequence modeling capacity and low token repetition. Additionally, we demonstrate that our training algorithm can be generalized to protein design.
Researcher Affiliation | Academia | Nayoung Kim (EMAIL), Korea Advanced Institute of Science and Technology (KAIST); Minsu Kim (EMAIL), KAIST; Sungsoo Ahn (EMAIL), Pohang University of Science and Technology (POSTECH); Jinkyoo Park (EMAIL), KAIST.
Pseudocode | Yes | Algorithm 1 (Training sequence design model). Require: antibody sequence dataset D = {(s, c)}, antibody sequence generator pθ(s|c). ... Algorithm 2 (Training structure prediction model). Require: a trained antibody sequence generator pθ(s|c), a sequence-to-structure model pϕ(x|s, c), and antibody sequence-structure dataset D = {(s, x, c)}. ... Algorithm 3 (ITA algorithm for antibody affinity optimization). Input: SKEMPI V2.0 antibody-antigen complex dataset D, pre-trained pθ(s, x|c), top-k candidates to maintain.
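The two-stage decoupling in Algorithms 1 and 2 can be sketched as follows. This is a minimal illustrative mock-up, not the authors' implementation: the "models" are trivial stand-ins, and all function names and data shapes are assumptions. The key point it shows is the ordering — the structure model is trained on sequences sampled from the already-trained sequence generator rather than on ground-truth sequences.

```python
import random

# Hypothetical stand-ins for the paper's models: pθ(s|c) generates a CDR
# sequence given context c; pϕ(x|s, c) predicts structure from a sequence.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_sequence_model(dataset):
    """Stage 1 (cf. Algorithm 1): fit the sequence generator on (s, c) pairs.
    Here the 'model' is just amino-acid unigram counts, as a placeholder."""
    counts = {}
    for s, _c in dataset:
        for aa in s:
            counts[aa] = counts.get(aa, 0) + 1
    return counts

def sample_sequence(model, length, rng):
    """Draw a sequence of the given length from the toy generator."""
    aas = list(model) or list(AMINO_ACIDS)
    weights = [model.get(aa, 1) for aa in aas]
    return "".join(rng.choices(aas, weights=weights, k=length))

def train_structure_model(seq_model, dataset, rng):
    """Stage 2 (cf. Algorithm 2): train pϕ(x|s, c) on sequences *sampled*
    from the trained generator, decoupling structure learning from the
    ground-truth sequences. The 'training' stub just records the pairs."""
    seen = []
    for s, _x, c in dataset:
        s_hat = sample_sequence(seq_model, len(s), rng)  # sampled, not ground truth
        seen.append((s_hat, c))
    return seen

rng = random.Random(0)
seq_data = [("ARDY", "ctx1"), ("GSSW", "ctx2")]
seq_model = train_sequence_model(seq_data)
struct_data = [("ARDY", "x1", "ctx1"), ("GSSW", "x2", "ctx2")]
pairs = train_structure_model(seq_model, struct_data, rng)
print(len(pairs))  # one sampled sequence per training complex
```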
Open Source Code | No | Our implementation is built upon https://github.com/facebookresearch/esm, https://github.com/BytedProtein/ByProt/tree/main, https://github.com/wengong-jin/RefineGNN, and https://github.com/THUNLP-MT/MEAN/tree/main.
Open Datasets | Yes | SAbDab benchmark... Structural Antibody Database (SAbDab) (Dunbar et al., 2014)... RAbD benchmark... Rosetta Antibody Design (RAbD) dataset... SKEMPI V2.0 dataset (Jankauskaitė et al., 2019)... CATH 4.2 and CATH 4.3 benchmarks.
Dataset Splits | Yes | We then split the complexes into train, validation, and test sets according to the CDR clusterings. Specifically, we use MMseqs2 (Steinegger & Söding, 2017) to assign antibodies with CDR sequence identity above 40% to the same cluster, where the sequence identity is computed with the BLOSUM62 substitution matrix (Henikoff & Henikoff, 1992). Then we conduct a 10-fold cross-validation by splitting the clusters into a ratio of 8:1:1 for train/valid/test sets, respectively. Detailed statistics of the 10-fold dataset splits are provided in Appendix E.
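The cluster-level 8:1:1 split described above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' split code: it takes the MMseqs2 clustering as given (a mapping from cluster id to antibody ids) and assigns whole clusters, never individual antibodies, to each partition, so no two antibodies above the 40% CDR identity threshold end up in different splits.

```python
import random

def split_clusters(clusters, ratios=(8, 1, 1), seed=0):
    """Assign whole CDR clusters to train/valid/test in an 8:1:1 ratio.
    `clusters` maps cluster id -> list of antibody ids (hypothetical format).
    Splitting at the cluster level prevents sequence-identity leakage."""
    rng = random.Random(seed)
    ids = list(clusters)
    rng.shuffle(ids)
    total = sum(ratios)
    n_train = round(len(ids) * ratios[0] / total)
    n_valid = round(len(ids) * ratios[1] / total)
    train_ids = ids[:n_train]
    valid_ids = ids[n_train:n_train + n_valid]
    test_ids = ids[n_train + n_valid:]
    expand = lambda cids: [ab for cid in cids for ab in clusters[cid]]
    return expand(train_ids), expand(valid_ids), expand(test_ids)

# Toy example: 10 clusters of 2 antibodies each -> 16 / 2 / 2 antibodies.
clusters = {f"c{i}": [f"ab{i}a", f"ab{i}b"] for i in range(10)}
train, valid, test = split_clusters(clusters)
print(len(train), len(valid), len(test))  # 16 2 2
```

Repeating this with ten different seeds (or a rotation over the cluster folds) gives the paper's 10-fold cross-validation.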
Hardware Specification | Yes | All models were trained on a machine with 48 CPU cores and 8 NVIDIA GeForce RTX 3090 GPUs; 1-3 GPUs were used for each experiment.
Software Dependencies | No | The paper mentions 'ESM2-650M (Lin et al., 2023)' as a protein language model and the optimizers Adam and AdamW, but it does not specify version numbers for any software libraries or dependencies, such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We set hyperparameter α = 0.2 in our sequence objective and use rank 2 for the weights Wq, Wk, Wv, and Wo in the multi-head attention module for LoRA fine-tuning... we limit the batch size to a maximum token length of 6000 and train for 30 epochs. Optimizer Adam, learning rate 0.001 (Table 11); optimizer AdamW, learning rate 0.001 (Table 12).
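The reported hyperparameters can be consolidated into a single configuration sketch. The key names below are hypothetical (the paper does not publish a config file); only the values come from the reported setup.

```python
# Hypothetical config consolidating the paper's reported hyperparameters.
# Key names are illustrative assumptions; values are from the report.
config = {
    "sequence_objective_alpha": 0.2,      # α in the sequence objective
    "lora": {                             # LoRA fine-tuning of attention
        "rank": 2,
        "target_weights": ["Wq", "Wk", "Wv", "Wo"],
    },
    "max_tokens_per_batch": 6000,         # batch size cap by token length
    "epochs": 30,
    "optimizer": "Adam",                  # AdamW with the same lr in Table 12
    "learning_rate": 1e-3,
}
print(config["lora"]["rank"], config["learning_rate"])  # 2 0.001
```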