Steering Protein Language Models

Authors: Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through comprehensive experiments on lysozyme-like sequence generation and optimization, we demonstrate that our methods can be seamlessly integrated into both auto-encoding and autoregressive PLMs without requiring additional training.
Researcher Affiliation | Industry | Long-Kai Huang, Rongyi Zhu, Bing He, Jianhua Yao (Tencent AI Lab). Correspondence to: Long-Kai Huang <EMAIL>, Jianhua Yao <>.
Pseudocode | Yes | Algorithm 1: Activation Steering based Protein Optimization (ASPO)
1: Input: protein sequence x, positive protein sequence set P, negative set N, steering strength α, layer ℓ for relatedness score computation, number of mutation sites per round T, and number of rounds R
2: Compute steering vectors {v_l} for all layers l = 1, 2, ..., L using Equation (3)
3: for r = 1 to R do
4:   Compute token representations h_ℓ^k for all tokens k = 1, 2, ..., K at layer ℓ
5:   Compute the relatedness scores s_k for all tokens using Equation (4)
6:   Obtain the set I_T of token indices with the T lowest scores in {s_ℓ^k}
7:   Mask tokens at positions in I_T
8:   Predict new amino acids at positions in I_T using activation steering (Equation (1)) with steering vectors {v_l}
9: end for
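The ASPO loop above can be sketched end to end in a few dozen lines. This is a minimal toy sketch, not the paper's implementation: `forward`, `predict_masked`, the mean-difference form of Equation (3), and the cosine-similarity form of Equation (4) are all assumptions standing in for the real PLM and the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D, K = 4, 8, 20  # layers, hidden dim, sequence length (toy sizes)
AA = "ACDEFGHIKLMNPQRSTVWY"
embed = rng.normal(size=(len(AA), D))  # hypothetical per-residue embeddings

def extract_steering_vectors(pos_acts, neg_acts):
    """Assumed form of Equation (3): per-layer mean difference between
    positive-set and negative-set activations."""
    return [pos_acts[l].mean(axis=0) - neg_acts[l].mean(axis=0)
            for l in range(L)]

def relatedness_scores(h_l, v_l):
    """Assumed form of Equation (4): cosine similarity of each token
    representation at layer ℓ with that layer's steering vector."""
    h = h_l / np.linalg.norm(h_l, axis=1, keepdims=True)
    return h @ (v_l / np.linalg.norm(v_l))  # shape (K,)

def forward(seq):
    """Toy stand-in for the PLM forward pass: same reprs at every layer."""
    h = np.stack([embed[AA.index(a)] for a in seq])
    return [h for _ in range(L)]

def predict_masked(seq, idx, vs, alpha):
    """Toy stand-in for steered masked prediction (Equation (1))."""
    out = list(seq)
    for k in idx:
        steered = embed + alpha * vs[-1]          # shift logits toward v
        out[k] = AA[int(np.argmax(steered @ vs[-1]))]
    return "".join(out)

def aspo(seq, pos_acts, neg_acts, alpha=1.0, ell=2, T=4, R=8):
    """Algorithm 1: each round, mask the T least-related positions
    and repredict them under activation steering."""
    vs = extract_steering_vectors(pos_acts, neg_acts)
    for _ in range(R):
        h_ell = forward(seq)[ell]                 # (K, D) reprs at layer ℓ
        s = relatedness_scores(h_ell, vs[ell])    # (K,) scores s_k
        worst = np.argsort(s)[:T]                 # I_T: T lowest scores
        seq = predict_masked(seq, worst, vs, alpha)
    return seq

# Toy positive/negative activation sets (100 sequences each, per the paper).
pos_acts = [rng.normal(1.0, 0.1, size=(100, D)) for _ in range(L)]
neg_acts = [rng.normal(0.0, 0.1, size=(100, D)) for _ in range(L)]
seq = "".join(rng.choice(list(AA), size=K))
new_seq = aspo(seq, pos_acts, neg_acts, T=4, R=8)
```

In a real run, `forward` and `predict_masked` would call the underlying PLM; the loop structure and the mask-lowest-scores selection are the parts carried over from Algorithm 1.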
Open Source Code | Yes | Code is available on GitHub: https://github.com/Long-Kai/Steering-PLMs
Open Datasets | Yes | Data: To construct the positive and negative sets for steering vector extraction, we first predict thermostability or solubility for all lysozyme-like proteins in the UniRef50 dataset using property-specific predictors. For thermostability, we use data from the Meltome Atlas (Jarzab et al., 2020), which provides melting temperatures for 48,000 proteins across 13 species (archaea to humans), with values ranging from 30°C to 90°C. For solubility, we use the preprocessed dataset of (Khurana et al., 2018), containing 28,972 soluble and 40,448 insoluble proteins. The data is split 90%/10% for training and validation. For benchmarking, we use an independent test set from (Chang et al., 2014), which includes 1,000 soluble and 1,001 insoluble proteins. For GFP brightness, we adopt the same data split as (Kirjner et al., 2023) and randomly select 100 sequences of easy difficulty as the positive set and 100 sequences of hard difficulty as the negative set.
Dataset Splits | Yes | The dataset is split into 90% for training and 10% for testing. To reduce redundancy, we ensure a maximum sequence identity of 90% within the training set. Furthermore, any training sequence with 30% identity to a test sequence is removed, preventing information leakage and ensuring a fair evaluation. The final dataset contains 24,817 proteins for training and 3,134 for testing.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts) are provided in the paper, which focuses on the models, methods, and experimental results without detailing the computational infrastructure used.
Software Dependencies | No | We estimate the dissimilarity in a set-wise manner using MMseqs2 (Steinegger & Söding, 2017). For AR-PLMs, we use LoRA (Hu et al., 2022) on all layers with rank 4 and alpha 16. While MMseqs2 and LoRA are mentioned, specific version numbers for these or any other software dependencies are not provided.
Experiment Setup | Yes | Hyper-parameter settings: We fix the positive and negative set sizes for steering vector extraction at 100 and set α = 1.0 by default. For AE-PLMs, we fine-tune only the last layer. For AR-PLMs, we use LoRA (Hu et al., 2022) on all layers with rank 4 and alpha 16. For protein-optimization-specific hyperparameters, we set the number of optimization rounds R = 8 and the number of mutation sites per round T = 4 for the thermostability experiments, and R = 4 and T = 2 for the solubility and GFP brightness experiments.
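The steering strength α = 1.0 scales Equation (1)'s intervention on the hidden states. The additive form below is an assumption based on standard activation steering (the paper's exact Equation (1) is not quoted here); the shapes and values are toy placeholders.

```python
import numpy as np

def steer(h_l, v_l, alpha=1.0):
    """Assumed form of Equation (1): shift every token representation
    at layer l by alpha times that layer's steering vector v_l."""
    return h_l + alpha * v_l        # broadcasts (K, D) + (D,) -> (K, D)

h = np.zeros((5, 3))                # 5 tokens, hidden dim 3 (toy)
v = np.array([1.0, -1.0, 0.5])      # toy steering vector
h_steered = steer(h, v, alpha=1.0)  # each row shifted by v
```

With α = 0 the model is unchanged; larger α pushes every token representation further toward the positive-set direction, which is why α is the main knob the paper fixes per experiment.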