CFP-Gen: Combinatorial Functional Protein Generation via Diffusion Language Models
Authors: Junbo Yin, Chao Zha, Wenjia He, Chencheng Xu, Xin Gao
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We thoroughly evaluate CFP-GEN across various tasks, including functional sequence generation, functional protein inverse folding, and multi-objective protein design. CFP-GEN achieves exceptional functional performance, e.g., improving ESM3 by 30% in F1-score, as demonstrated by leading function predictors. Additionally, it improves the Amino Acid Recovery (AAR) of DPLM by 9% in inverse folding. Notably, CFP-GEN demonstrates a remarkable success rate in designing multi-functional proteins (e.g., enzymes exhibiting multiple catalytic activities). |
| Researcher Affiliation | Academia | 1Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia 2Center of Excellence for Smart Health (KCSH), KAUST 3Center of Excellence for Generative AI, KAUST. Correspondence to: Xin Gao <EMAIL>. |
| Pseudocode | No | The paper describes the methodology of CFP-GEN in detail in Section 3, including its components like AGFM and RCFE, and the mathematical formulations for the diffusion process and optimization. However, it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data available at https://github.com/yinjunbo/cfpgen. |
| Open Datasets | Yes | To collect high-quality data for training CFP-GEN, we employ expert-curated functional annotations from Swiss-Prot (UniProtKB) (Consortium, 2019), InterPro (Hunter et al., 2009), and CARE (Yang et al., 2024) databases. Additionally, the PDB (Berman et al., 2000) and AFDB (Jumper et al., 2021) databases were exploited to provide backbone atomic coordinates, ensuring structural constraints are incorporated into the dataset. |
| Dataset Splits | Yes | For the general protein dataset... To evaluate GO function, we construct a validation set by selecting 30 sequences per GO/IPR label, resulting in a subset of 8,309 sequences. The training dataset is then formed by holding out these sequences to ensure an unbiased evaluation. For IPR function assignment, we further perform a 10-fold uniform downsampling of the GO validation set, yielding 831 sequences... To enable rigorous evaluation, we construct a validation set by sampling 30 sequences per EC label, resulting in a high-quality evaluation set of 16,187 sequences, while the remaining data is allocated to the training set. |
| Hardware Specification | Yes | The batch size is set to 1 million tokens, and training is conducted on 8 NVIDIA A100 GPUs for around 72 hours of each stage. |
| Software Dependencies | No | The paper mentions the use of the AdamW optimizer and refers to DPLM as a base model. However, it does not specify version numbers for any programming languages, libraries, or other software components used for implementation. |
| Experiment Setup | Yes | The batch size is set to 1 million tokens, and training is conducted on 8 NVIDIA A100 GPUs for around 72 hours of each stage. The AdamW optimizer is employed with a maximum learning rate of 0.00004. During inference, we allow the model to perform 100 sampling steps, following the DPLM conditional generation, with sequence length varying from 200 to 400. |
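The reported setup (1M-token batches, AdamW with a 4e-5 peak learning rate, 100 sampling steps, lengths 200–400) can be captured as a small configuration sketch. This is purely illustrative: the names `Config` and `unmask_schedule` are hypothetical and not taken from the CFP-GEN codebase, and the even-unmasking schedule is a common discrete-diffusion decoding heuristic, not necessarily the exact DPLM schedule.

```python
# Hypothetical sketch of the reported training/inference configuration.
# All names here are illustrative, not from the CFP-GEN repository.
from dataclasses import dataclass

@dataclass
class Config:
    batch_tokens: int = 1_000_000  # ~1M tokens per batch (paper)
    lr: float = 4e-5               # AdamW maximum learning rate (paper)
    sampling_steps: int = 100      # diffusion sampling steps at inference (paper)
    min_len: int = 200             # generated sequence length range (paper)
    max_len: int = 400

def unmask_schedule(seq_len: int, steps: int) -> list[int]:
    """Spread token unmasking evenly across the sampling steps.

    A common heuristic for discrete diffusion decoding: reveal
    roughly seq_len / steps masked positions per step.
    """
    per_step = [seq_len // steps] * steps
    for i in range(seq_len % steps):  # distribute the remainder
        per_step[i] += 1
    return per_step

cfg = Config()
schedule = unmask_schedule(300, cfg.sampling_steps)
print(sum(schedule), len(schedule))  # 300 tokens revealed over 100 steps
```

With a 300-residue sequence and 100 steps, the schedule reveals 3 tokens per step; odd remainders are absorbed by the earliest steps so every position is unmasked exactly once.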