Controllable Protein Sequence Generation with LLM Preference Optimization
Authors: Xiangyu Liu, Yi Liu, Silei Chen, Wei Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation. |
| Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Nanjing University, China; Medical School, Nanjing University, China; National Institute of Healthcare Data Science, Nanjing University, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Datasets and source code are available at https://github.com/nju-websoft/CtrlProt. |
| Open Datasets | Yes | Dataset construction. We extract protein sequences with Gene Ontology (GO) terms (https://current.geneontology.org/annotations/) from the UniProtKB database (https://www.uniprot.org/) and corresponding structures from the AlphaFold protein structure database (https://alphafold.ebi.ac.uk/). |
| Dataset Splits | Yes | Each attribute contains 10k protein sequences for training. For each attribute, we extract 100k sequences from UniProtKB as the evaluation set, excluding the training set to ensure no data leakage. |
| Hardware Specification | Yes | All training and generation are conducted on a single A800 GPU. |
| Software Dependencies | No | We use ProteinMPNN as the structural encoder and ESMFold (Lin et al. 2023b) for structure prediction, both with default parameters. The Rosetta score is calculated using the weight configuration of ref2015 (Park et al. 2016). While a specific configuration is named for Rosetta (ref2015), other key software such as ProteinMPNN, ESMFold, ProtGPT2, and ESM-2 is mentioned without version numbers. |
| Experiment Setup | Yes | For prefix-tuning, we finetune ProtGPT2 with the following settings: batch size (16), learning rate (1e-4), prefix token number (100). For preference optimization, we use 5k pairs per attribute and set the learning rate (5e-5), β (0.1), and α (0.05). The maximum generation length is 400. |
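To make the preference-optimization hyperparameters above concrete, here is a minimal, per-pair sketch of a DPO-style preference loss using the reported β = 0.1 and α = 0.05. The role of α here (weighting an auxiliary log-likelihood term on the preferred sequence) is an assumption for illustration; the paper's exact objective may differ, and `preference_loss` is a hypothetical helper, not the authors' code.

```python
import math

def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp,
                    beta=0.1, alpha=0.05):
    """DPO-style preference loss for a single preference pair (sketch).

    Arguments are summed log-probabilities of the preferred (chosen) and
    dispreferred (rejected) sequences under the tuned policy and a frozen
    reference model. beta follows the reported setting; alpha is applied
    here as an illustrative weight on an auxiliary likelihood term.
    """
    def logsigmoid(x):
        # Numerically stable log(sigmoid(x)).
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    # Core DPO term: reward margin of the policy relative to the reference.
    dpo_term = -logsigmoid(beta * (pi_logratio - ref_logratio))
    # Assumed auxiliary term: keep likelihood of the preferred sequence high.
    reg_term = -alpha * policy_chosen_logp
    return dpo_term + reg_term
```

With these settings, a policy that assigns a larger log-probability margin to the preferred sequence than the reference model does incurs a lower loss.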