Concept Bottleneck Language Models For Protein Design
Authors: Aya Ismail, Tuomas Oikarinen, Amy Wang, Julius Adebayo, Samuel Stanton, Hector Corrada Bravo, Kyunghyun Cho, Nathan Frey
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our models achieve pre-training perplexity and downstream task performance comparable to traditional masked protein language models, demonstrating that interpretability does not compromise performance. We compare CB-pLM to various conditional pLMs of the same size, all trained on the same dataset, across more than 80 single- and multi-property control in silico experiments. Our model demonstrates 3× better control in terms of change in concept magnitude and a 16% improvement in control accuracy compared to other architectures. Additionally, we benchmark CB-pLM against state-of-the-art (SOTA) protein design models explicitly trained to optimize a single concept. Remarkably, our general-purpose CB-pLM, trained to learn over 700 concepts, delivers results comparable to SOTA models while maintaining the naturalness of the protein. |
| Researcher Affiliation | Collaboration | ¹Genentech; ²Prescient Design; ³University of California San Diego; ⁴Guide Labs; ⁵Department of Computer Science, New York University; ⁶Center for Data Science, New York University |
| Pseudocode | Yes | Algorithm 1 Single Concept intervention procedure |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for their methodology, nor does it include a direct link to a code repository. |
| Open Datasets | Yes | We combined sequences from UniRef50 (Suzek et al., 2015) and SWISS-PROT (Bairoch & Apweiler, 2000), removing duplicates. ... To assess performance, we used 10,000 randomly sampled antibodies from the publicly available Mason dataset (Mason et al., 2021). |
| Dataset Splits | Yes | Using a validation dataset, we selected 10,000 sequences with the lowest and highest concept values for positive and negative interventions. ... To test generation control, we mask 5% of the input sequence (up to 25 amino acids) and intervene on the concept value. ... During training, the mask percentage is 25% of the sequence. |
| Hardware Specification | No | The paper mentions aspects of training efficiency and distributed backends but does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | Biopython (Cock et al., 2009) was used to extract biophysical and bioinformatics sequence-level concepts. ... We employed AdamW, mixed-precision training, and gradient clipping for stability. ... To evaluate the naturalness of designs generated by various models, we folded the designs using ABodyBuilder2 (Abanades et al., 2023) and analyzed different protein surface properties with the Therapeutic Antibody Profiler (TAP) (Raybould & Deane, 2022). |
| Experiment Setup | Yes | MODEL PARAMETERS AND CONFIGURATIONS AT DIFFERENT SCALES (four model sizes, smallest to largest): Number of layers: 10/27/33/26; Embedding dim: 408/768/1280/2560; Attention heads: 12/12/20/40; Concept embedding dim: 2 for all; Learning rate: 0.001/0.0001/0.0001/0.0001; Clip norm: 0.5 for all; Precision: 16/16/16/bf16; Warmup steps: 3000/10000/30000/30000; Effective batch size: 512/1024/1024/1024; Distributed backend: ddp/ddp/ddp/deepspeed stage 1. Training: During training, the mask percentage is 25% of the sequence. We then truncate all sequences to a maximum length of 512. ... We use Rotary Position Embedding. |
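The Experiment Setup row above describes the paper's masked-language-model pre-training inputs: sequences truncated to a maximum length of 512 tokens with 25% of positions masked. A minimal sketch of that preprocessing step follows; the function name, mask token symbol, and uniform-sampling strategy are assumptions for illustration, not details taken from the paper.

```python
import random

# Assumed constants matching the setup described in the table:
# truncation length 512 and a 25% masking rate. The mask token
# symbol is a placeholder, not the paper's actual vocabulary.
MAX_LEN = 512
MASK_RATE = 0.25
MASK_TOKEN = "<mask>"


def prepare_masked_input(sequence, rng=random):
    """Truncate an amino-acid sequence to MAX_LEN and mask 25% of positions.

    Returns the masked token list and a dict mapping masked positions
    to their original residues (the targets for the masked-LM loss).
    """
    tokens = list(sequence[:MAX_LEN])                 # truncate to 512
    n_mask = max(1, int(len(tokens) * MASK_RATE))     # 25% of positions
    positions = rng.sample(range(len(tokens)), n_mask)
    targets = {i: tokens[i] for i in positions}       # labels for the loss
    for i in positions:
        tokens[i] = MASK_TOKEN
    return tokens, targets


# Example with a short protein fragment and a fixed seed for repeatability.
tokens, targets = prepare_masked_input(
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", random.Random(0)
)
```

Uniform random masking is the simplest choice consistent with the stated 25% rate; the paper may use a different masking scheme (e.g. span masking or BERT-style token replacement), which this sketch does not attempt to reproduce.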