CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models
Authors: Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thoroughly evaluate CFP-GEN across various tasks, including functional sequence generation, functional protein inverse folding, and multi-objective protein design. CFP-GEN achieves exceptional functional performance, e.g., improving ESM3 by 30% in F1-score, as demonstrated by leading function predictors. Additionally, it improves the Amino Acid Recovery (AAR) of DPLM by 9% in inverse folding. Notably, CFP-GEN demonstrates a remarkable success rate in designing multi-functional proteins (e.g., enzymes exhibiting multiple catalytic activities). |
| Researcher Affiliation | Academia | 1Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia 2Center of Excellence for Smart Health (KCSH), KAUST 3Center of Excellence for Generative AI, KAUST. Correspondence to: Xin Gao <EMAIL>. |
| Pseudocode | No | The paper describes the methodology of CFP-GEN in detail in Section 3, including its components like AGFM and RCFE, and the mathematical formulations for the diffusion process and optimization. However, it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data available at https://github.com/yinjunbo/cfpgen. |
| Open Datasets | Yes | To collect high-quality data for training CFP-GEN, we employ expert-curated functional annotations from Swiss-Prot (UniProtKB) (Consortium, 2019), InterPro (Hunter et al., 2009), and CARE (Yang et al., 2024) databases. Additionally, the PDB (Berman et al., 2000) and AFDB (Jumper et al., 2021) databases were exploited to provide backbone atomic coordinates, ensuring structural constraints are incorporated into the dataset. |
| Dataset Splits | Yes | For the general protein dataset... To evaluate GO function, we construct a validation set by selecting 30 sequences per GO/IPR label, resulting in a subset of 8,309 sequences. The training dataset is then formed by holding out these sequences to ensure an unbiased evaluation. For IPR function assignment, we further perform a 10-fold uniform downsampling of the GO validation set, yielding 831 sequences... To enable rigorous evaluation, we construct a validation set by sampling 30 sequences per EC label, resulting in a high-quality evaluation set of 16,187 sequences, while the remaining data is allocated to the training set. |
| Hardware Specification | Yes | The batch size is set to 1 million tokens, and training is conducted on 8 NVIDIA A100 GPUs for around 72 hours of each stage. |
| Software Dependencies | No | The paper mentions the use of the AdamW optimizer and refers to DPLM as a base model. However, it does not specify version numbers for any programming languages, libraries, or other software components used for implementation. |
| Experiment Setup | Yes | The batch size is set to 1 million tokens, and training is conducted on 8 NVIDIA A100 GPUs for around 72 hours of each stage. The AdamW optimizer is employed with a maximum learning rate of 0.00004. During inference, we allow the model to perform 100 sampling steps, following the DPLM conditional generation, with sequence length varying from 200 to 400. |
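The reported setup (1M-token batches, AdamW with a 4e-5 peak learning rate, 100 sampling steps, lengths 200–400) can be captured as a small configuration sketch. This is purely illustrative: the names `Config` and `unmask_schedule` are hypothetical and not taken from the CFP-GEN codebase, and the even-unmasking schedule is a common discrete-diffusion decoding heuristic, not necessarily the exact DPLM schedule.

```python
# Hypothetical sketch of the reported training/inference configuration.
# All names here are illustrative, not from the CFP-GEN repository.
from dataclasses import dataclass

@dataclass
class Config:
    batch_tokens: int = 1_000_000  # ~1M tokens per batch (paper)
    lr: float = 4e-5               # AdamW maximum learning rate (paper)
    sampling_steps: int = 100      # diffusion sampling steps at inference (paper)
    min_len: int = 200             # generated sequence length range (paper)
    max_len: int = 400

def unmask_schedule(seq_len: int, steps: int) -> list[int]:
    """Spread token unmasking evenly across the sampling steps.

    A common heuristic for discrete diffusion decoding: reveal
    roughly seq_len / steps masked positions per step.
    """
    per_step = [seq_len // steps] * steps
    for i in range(seq_len % steps):  # distribute the remainder
        per_step[i] += 1
    return per_step

cfg = Config()
schedule = unmask_schedule(300, cfg.sampling_steps)
print(sum(schedule), len(schedule))  # 300 tokens revealed over 100 steps
```

With a 300-residue sequence and 100 steps, the schedule reveals 3 tokens per step; odd remainders are absorbed by the earliest steps so every position is unmasked exactly once.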