Controllable Protein Sequence Generation with LLM Preference Optimization

Authors: Xiangyu Liu, Yi Liu, Silei Chen, Wei Hu

AAAI 2025

Reproducibility assessment. Each item below lists the variable, the extracted result, and the supporting LLM response:
Research Type: Experimental. "Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation."
Researcher Affiliation: Academia. 1 State Key Laboratory for Novel Software Technology, Nanjing University, China; 2 Medical School, Nanjing University, China; 3 National Institute of Healthcare Data Science, Nanjing University, China.
Pseudocode: No. The paper describes the methodology using textual explanations and mathematical equations, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code: Yes. "Datasets and source code are available at https://github.com/nju-websoft/CtrlProt."
Open Datasets: Yes. "Dataset construction. We extract protein sequences with Gene Ontology (GO) terms from the UniProtKB database and corresponding structures from the AlphaFold protein structure database. ..." Footnotes: 1 https://current.geneontology.org/annotations/; 2 https://www.uniprot.org/; 3 https://alphafold.ebi.ac.uk/
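The GO annotation files linked above are distributed in the tab-separated GAF format. A minimal sketch of pulling (protein ID, GO term) pairs from such a file, assuming the standard GAF 2.x column layout (column 2 is the DB object ID, column 5 is the GO term); the helper name and sample record are illustrative, not from the paper:

```python
import csv
import io


def parse_gaf(text):
    """Extract (protein_id, go_term) pairs from GAF-format annotation text.

    Assumes the standard GAF 2.x tab-separated layout: comment lines
    start with '!', column 2 is the DB object ID (e.g. a UniProtKB
    accession), and column 5 is the GO term ID.
    """
    pairs = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("!"):
            continue  # skip header/comment lines
        pairs.append((row[1], row[4]))
    return pairs


# Tiny illustrative record (fields beyond the GO column truncated for brevity).
sample = "!gaf-version: 2.2\nUniProtKB\tP12345\tGENE1\t\tGO:0003677\tPMID:1\tIDA\t\tF\n"
print(parse_gaf(sample))  # [('P12345', 'GO:0003677')]
```

In practice one would then join these accessions against the UniProtKB sequences and AlphaFold structures before building per-attribute training sets.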
Dataset Splits: Yes. "Each attribute contains 10k protein sequences for training. For each attribute, we extract 100k sequences from UniProtKB as the evaluation set, excluding the training set to ensure no data leakage."
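The leakage exclusion described above (evaluation sequences drawn from UniProtKB but disjoint from the training set) amounts to set subtraction at its simplest; a minimal sketch, with illustrative names and exact-match filtering only:

```python
def build_eval_set(candidates, train_seqs, target_size):
    """Select up to target_size evaluation sequences, skipping any
    sequence that already appears in the training set."""
    train = set(train_seqs)
    eval_set = []
    for seq in candidates:
        if seq in train:
            continue  # exact-match overlap with training data: exclude
        eval_set.append(seq)
        if len(eval_set) == target_size:
            break
    return eval_set


train = ["MKT", "MAA"]
pool = ["MKT", "MSV", "MAA", "MGL", "MPP"]
print(build_eval_set(pool, train, 3))  # ['MSV', 'MGL', 'MPP']
```

A real pipeline might additionally filter near-duplicates by sequence identity rather than exact match; the paper does not state whether such a threshold was used.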
Hardware Specification: Yes. "All training and generation are conducted on a single A800 GPU."
Software Dependencies: No. "We use ProteinMPNN as the structural encoder and ESMFold (Lin et al. 2023b) for structure prediction, both with default parameters. The Rosetta score is calculated using the weight configuration of ref2015 (Park et al. 2016)." While the Rosetta weight configuration is specified (ref2015), other key software such as ProteinMPNN, ESMFold, ProtGPT2, and ESM-2 is mentioned without version numbers.
Experiment Setup: Yes. "For prefix-tuning, we fine-tune ProtGPT2 with the following settings: batch size (16), learning rate (1e-4), prefix token number (100). For preference optimization, we use 5k pairs for each attribute and set the learning rate (5e-5), β (0.1), and α (0.05). The maximum generation length is 400."
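The β hyperparameter above suggests a DPO-style preference objective. Assuming the common Direct Preference Optimization formulation (the paper's α-weighted extra term is CtrlProt-specific and omitted here), the per-pair loss with the reported β = 0.1 can be sketched in pure Python; all log-probability values below are illustrative:

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    (logp_w - ref_logp_w) - (logp_l - ref_logp_l).

    logp_w / logp_l are the policy's sequence log-probabilities for the
    preferred and rejected sequences; ref_* are the frozen reference
    model's. beta scales how strongly the policy is pushed away from the
    reference. Uses -log sigmoid(x) = log(1 + exp(-x)).
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return math.log(1.0 + math.exp(-beta * margin))


# Policy favors the preferred sequence more than the reference does -> low loss.
print(round(dpo_loss(-10.0, -30.0, -12.0, -25.0), 4))  # 0.4032
```

In the paper's setup this loss would be averaged over the 5k preference pairs per attribute and optimized with learning rate 5e-5, with an additional α = 0.05 regularization term whose exact form is specific to CtrlProt.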