Controllable Protein Sequence Generation with LLM Preference Optimization
Authors: Xiangyu Liu, Yi Liu, Silei Chen, Wei Hu
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that CtrlProt can meet functionality and structural stability requirements effectively, achieving state-of-the-art performance in both single-attribute and multi-attribute protein sequence generation. |
| Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Nanjing University, China; Medical School, Nanjing University, China; National Institute of Healthcare Data Science, Nanjing University, China |
| Pseudocode | No | The paper describes the methodology using textual explanations and mathematical equations, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Datasets and source code are available at https://github.com/nju-websoft/CtrlProt. |
| Open Datasets | Yes | Dataset construction. We extract protein sequences with Gene Ontology (GO) terms (https://current.geneontology.org/annotations/) from the UniProtKB database (https://www.uniprot.org/) and corresponding structures from the AlphaFold protein structure database (https://alphafold.ebi.ac.uk/). |
| Dataset Splits | Yes | Each attribute contains 10k protein sequences for training. For each attribute, we extract 100k sequences from UniProtKB as the evaluation set, excluding the training set to ensure no data leakage. |
| Hardware Specification | Yes | All training and generation are conducted on a single A800 GPU. |
| Software Dependencies | No | We use ProteinMPNN as the structural encoder and ESMFold (Lin et al. 2023b) for structure prediction, both with default parameters. The Rosetta score is calculated using the weight configuration of ref2015 (Park et al. 2016). While a specific configuration is named for Rosetta (ref2015), other key software such as ProteinMPNN, ESMFold, ProtGPT2, and ESM-2 is mentioned without version numbers. |
| Experiment Setup | Yes | For prefix-tuning, we finetune ProtGPT2 with the following settings: batch size (16), learning rate (1e-4), prefix token number (100). For preference optimization, we use 5k pairs per attribute and set the learning rate (5e-5), β (0.1), and α (0.05). The maximum generation length is 400. |
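To make the preference-optimization hyperparameters above concrete, here is a minimal, per-pair sketch of a DPO-style preference loss using the reported β = 0.1 and α = 0.05. The role of α here (weighting an auxiliary log-likelihood term on the preferred sequence) is an assumption for illustration; the paper's exact objective may differ, and `preference_loss` is a hypothetical helper, not the authors' code.

```python
import math

def preference_loss(policy_chosen_logp, policy_rejected_logp,
                    ref_chosen_logp, ref_rejected_logp,
                    beta=0.1, alpha=0.05):
    """DPO-style preference loss for a single preference pair (sketch).

    Arguments are summed log-probabilities of the preferred (chosen) and
    dispreferred (rejected) sequences under the tuned policy and a frozen
    reference model. beta follows the reported setting; alpha is applied
    here as an illustrative weight on an auxiliary likelihood term.
    """
    def logsigmoid(x):
        # Numerically stable log(sigmoid(x)).
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    # Core DPO term: reward margin of the policy relative to the reference.
    dpo_term = -logsigmoid(beta * (pi_logratio - ref_logratio))
    # Assumed auxiliary term: keep likelihood of the preferred sequence high.
    reg_term = -alpha * policy_chosen_logp
    return dpo_term + reg_term
```

With these settings, a policy that assigns a larger log-probability margin to the preferred sequence than the reference model does incurs a lower loss.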