Concept Bottleneck Language Models For Protein Design
Authors: Aya Ismail, Tuomas Oikarinen, Amy Wang, Julius Adebayo, Samuel Stanton, Hector Corrada Bravo, Kyunghyun Cho, Nathan Frey
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. We compare CB-pLM to various conditional pLMs of the same size, all trained on the same dataset, across more than 80 single- and multi-property control in silico experiments. Our model demonstrates 3× better control in terms of change in concept magnitude and a 16% improvement in control accuracy compared to other architectures. Additionally, we benchmark CB-pLM against state-of-the-art (SOTA) protein design models explicitly trained to optimize a single concept. Remarkably, our general-purpose CB-pLM, trained to learn over 700 concepts, delivers results comparable to SOTA models while maintaining the naturalness of the protein. |
| Researcher Affiliation | Collaboration | ¹Genentech; ²Prescient Design; ³University of California San Diego; ⁴Guide Labs; ⁵Department of Computer Science, New York University; ⁶Center for Data Science, New York University |
| Pseudocode | Yes | Algorithm 1 Single Concept intervention procedure |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for their methodology, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | We combined sequences from UniRef50 (Suzek et al., 2015) and SWISS-PROT (Bairoch & Apweiler, 2000), removing duplicates. ... To assess performance, we used 10,000 randomly sampled antibodies from the publicly available Mason dataset (Mason et al., 2021). |
| Dataset Splits | Yes | Using a validation dataset, we selected 10,000 sequences with the lowest and highest concept values for positive and negative interventions. ... To test generation control, we mask 5% of the input sequence (up to 25 amino acids) and intervene on the concept value. ... During training, the mask percentage is 25% of the sequence. |
| Hardware Specification | No | The paper mentions aspects of training efficiency and distributed backends but does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | Biopython (Cock et al., 2009) was used to extract biophysical and bioinformatics sequence-level concepts. ... We employed AdamW, mixed-precision training, and gradient clipping for stability. ... To evaluate the naturalness of designs generated by various models, we folded the designs using ABodyBuilder2 (Abanades et al., 2023) and analyzed different protein surface properties with the Therapeutic Antibody Profiler (TAP) (Raybould & Deane, 2022). |
| Experiment Setup | Yes | MODEL PARAMETERS AND CONFIGURATIONS AT DIFFERENT SCALES (four model sizes, smallest to largest): Number of layers: 10/27/33/26; Embedding dim: 408/768/1280/2560; Attention heads: 12/12/20/40; Concept embedding dim: 2 for all; Learning rate: 0.001/0.0001/0.0001/0.0001; Clip norm: 0.5 for all; Precision: 16/16/16/bf16; Warmup steps: 3000/10000/30000/30000; Effective batch size: 512/1024/1024/1024; Distributed backend: ddp/ddp/ddp/deepspeed stage 1. Training: During training, the mask percentage is 25% of the sequence. We then truncate all sequences to a maximum length of 512. ... We use Rotary Position Embedding. |
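The Experiment Setup row above describes the paper's masked-language-model pre-training inputs: sequences truncated to a maximum length of 512 tokens with 25% of positions masked. A minimal sketch of that preprocessing step follows; the function name, mask token symbol, and uniform-sampling strategy are assumptions for illustration, not details taken from the paper.

```python
import random

# Assumed constants matching the setup described in the table:
# truncation length 512 and a 25% masking rate. The mask token
# symbol is a placeholder, not the paper's actual vocabulary.
MAX_LEN = 512
MASK_RATE = 0.25
MASK_TOKEN = "<mask>"


def prepare_masked_input(sequence, rng=random):
    """Truncate an amino-acid sequence to MAX_LEN and mask 25% of positions.

    Returns the masked token list and a dict mapping masked positions
    to their original residues (the targets for the masked-LM loss).
    """
    tokens = list(sequence[:MAX_LEN])                 # truncate to 512
    n_mask = max(1, int(len(tokens) * MASK_RATE))     # 25% of positions
    positions = rng.sample(range(len(tokens)), n_mask)
    targets = {i: tokens[i] for i in positions}       # labels for the loss
    for i in positions:
        tokens[i] = MASK_TOKEN
    return tokens, targets


# Example with a short protein fragment and a fixed seed for repeatability.
tokens, targets = prepare_masked_input(
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", random.Random(0)
)
```

Uniform random masking is the simplest choice consistent with the stated 25% rate; the paper may use a different masking scheme (e.g. span masking or BERT-style token replacement), which this sketch does not attempt to reproduce.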