Steering Protein Language Models

Authors: Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments on lysozyme-like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into both auto-encoding and autoregressive PLMs without requiring additional training.
Researcher Affiliation | Industry | Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao (Tencent AI Lab). Correspondence to: Long-Kai Huang <EMAIL>, Jianhua Yao <>.
Pseudocode | Yes | Algorithm 1: Activation Steering based Protein Optimization (ASPO)
1: Input: protein sequence x, positive protein sequence set P, negative set N, steering strength α, layer ℓ for relatedness score computation, number of mutation sites per round T, and number of rounds R
2: Compute steering vectors {v_l} for all layers l = 1, 2, ..., L using Equation (3)
3: for r = 1 to R do
4:   Compute token representations h_ℓ^k for all tokens k = 1, 2, ..., K at layer ℓ
5:   Compute the relatedness scores s_k for all tokens using Equation (4)
6:   Obtain the set I_T of token indices with the T lowest scores in {s_ℓ^k}
7:   Mask tokens at positions in I_T
8:   Predict new amino acids at positions in I_T using activation steering (Equation (1)) with steering vectors {v_l}
9: end for
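The ASPO loop above can be sketched end to end in a few dozen lines. This is a minimal toy sketch, not the paper's implementation: `forward`, `predict_masked`, the mean-difference form of Equation (3), and the cosine-similarity form of Equation (4) are all assumptions standing in for the real PLM and the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, K = 4, 8, 20  # layers, hidden dim, sequence length (toy sizes)
AA = "ACDEFGHIKLMNPQRSTVWY"
embed = rng.normal(size=(len(AA), D))  # hypothetical per-residue embeddings

def extract_steering_vectors(pos_acts, neg_acts):
    """Assumed form of Equation (3): per-layer mean difference between
    positive-set and negative-set activations."""
    return [pos_acts[l].mean(axis=0) - neg_acts[l].mean(axis=0)
            for l in range(L)]

def relatedness_scores(h_l, v_l):
    """Assumed form of Equation (4): cosine similarity of each token
    representation at layer ℓ with that layer's steering vector."""
    h = h_l / np.linalg.norm(h_l, axis=1, keepdims=True)
    return h @ (v_l / np.linalg.norm(v_l))  # shape (K,)

def forward(seq):
    """Toy stand-in for the PLM forward pass: same reprs at every layer."""
    h = np.stack([embed[AA.index(a)] for a in seq])
    return [h for _ in range(L)]

def predict_masked(seq, idx, vs, alpha):
    """Toy stand-in for steered masked prediction (Equation (1))."""
    out = list(seq)
    for k in idx:
        steered = embed + alpha * vs[-1]          # shift logits toward v
        out[k] = AA[int(np.argmax(steered @ vs[-1]))]
    return "".join(out)

def aspo(seq, pos_acts, neg_acts, alpha=1.0, ell=2, T=4, R=8):
    """Algorithm 1: each round, mask the T least-related positions
    and repredict them under activation steering."""
    vs = extract_steering_vectors(pos_acts, neg_acts)
    for _ in range(R):
        h_ell = forward(seq)[ell]                 # (K, D) reprs at layer ℓ
        s = relatedness_scores(h_ell, vs[ell])    # (K,) scores s_k
        worst = np.argsort(s)[:T]                 # I_T: T lowest scores
        seq = predict_masked(seq, worst, vs, alpha)
    return seq

# Toy positive/negative activation sets (100 sequences each, per the paper).
pos_acts = [rng.normal(1.0, 0.1, size=(100, D)) for _ in range(L)]
neg_acts = [rng.normal(0.0, 0.1, size=(100, D)) for _ in range(L)]
seq = "".join(rng.choice(list(AA), size=K))
new_seq = aspo(seq, pos_acts, neg_acts, T=4, R=8)
```

In a real run, `forward` and `predict_masked` would call the underlying PLM; the loop structure and the mask-lowest-scores selection are the parts carried over from Algorithm 1.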
Open Source Code | Yes | Code is available on GitHub: https://github.com/Long-Kai/Steering-PLMs
Open Datasets | Yes | Data: To construct the positive and negative sets for steering vector extraction, we first predict thermostability or solubility for all lysozyme-like proteins in the UniRef50 dataset using property-specific predictors. For thermostability, we use data from the Meltome Atlas (Jarzab et al., 2020), which provides melting temperatures for 48,000 proteins across 13 species (archaea to humans), with values ranging from 30°C to 90°C. For solubility, we use the preprocessed dataset of (Khurana et al., 2018), containing 28,972 soluble and 40,448 insoluble proteins. The data is split 90%/10% for training and validation. For benchmarking, we use an independent test set from (Chang et al., 2014), which includes 1,000 soluble and 1,001 insoluble proteins. For GFP brightness, we adopt the same data split as (Kirjner et al., 2023) and randomly select 100 sequences of easy difficulty as the positive set and 100 sequences of hard difficulty as the negative set.
Dataset Splits | Yes | The dataset is split into 90% for training and 10% for testing. To reduce redundancy, we ensure a maximum sequence identity of 90% within the training set. Furthermore, any training sequence with 30% identity to a test sequence is removed, preventing information leakage and ensuring a fair evaluation. The final dataset contains 24,817 proteins for training and 3,134 for testing.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) are provided in the paper, which focuses on the models, methods, and experimental results without detailing the computational infrastructure used.
Software Dependencies | No | We estimate the dissimilarity in a set-wise manner using MMseqs2 (Steinegger & Söding, 2017). For AR-PLMs, we use LoRA (Hu et al., 2022) on all layers with rank 4 and alpha 16. While MMseqs2 and LoRA are mentioned, specific version numbers for these or any other software dependencies are not provided.
Experiment Setup | Yes | Hyper-parameter settings: We fix the positive and negative set sizes for steering vector extraction at 100 and set α = 1.0 by default. For AE-PLMs, we fine-tune only the last layer. For AR-PLMs, we use LoRA (Hu et al., 2022) on all layers with rank 4 and alpha 16. For protein-optimization-specific hyperparameters, we set the number of optimization rounds R = 8 and the number of mutation sites per round T = 4 for the thermostability experiments, and R = 4 and T = 2 for the solubility and GFP brightness experiments.
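The steering strength α = 1.0 scales Equation (1)'s intervention on the hidden states. The additive form below is an assumption based on standard activation steering (the paper's exact Equation (1) is not quoted here); the shapes and values are toy placeholders.

```python
import numpy as np

def steer(h_l, v_l, alpha=1.0):
    """Assumed form of Equation (1): shift every token representation
    at layer l by alpha times that layer's steering vector v_l."""
    return h_l + alpha * v_l        # broadcasts (K, D) + (D,) -> (K, D)

h = np.zeros((5, 3))                # 5 tokens, hidden dim 3 (toy)
v = np.array([1.0, -1.0, 0.5])      # toy steering vector
h_steered = steer(h, v, alpha=1.0)  # each row shifted by v
```

With α = 0 the model is unchanged; larger α pushes every token representation further toward the positive-set direction, which is why α is the main knob the paper fixes per experiment.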