Diffusion on Language Model Encodings for Protein Sequence Generation

Authors: Viacheslav Meshchaninov, Pavel Strashnov, Andrey Shevtsov, Fedor Nikolaev, Nikita Ivanisenko, Olga Kardymon, Dmitry Vetrov

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality, and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete-diffusion, and flow-matching language models. Section 3 is titled "Experiments" and contains subsections such as "Evaluation Metrics", "Denoiser Component Analysis", and "Comparison Across Generative Paradigms".
Researcher Affiliation | Collaboration | (1) Constructor University, Bremen, Germany; (2) AIRI, Moscow, Russia. Correspondence to: Viacheslav Meshchaninov <EMAIL>, Pavel Strashnov <EMAIL>, Andrey Shevtsov <EMAIL>.
Pseudocode | No | The paper describes methods and architectures in prose and via diagrams (e.g., Figure 1, Figure 9), but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is released on GitHub.
Open Datasets | Yes | SwissProt is a dataset that contains a high-quality, manually annotated subset of the UniProt (Consortium, 2020) database. Another dataset we use is AFDBv4-90 from Durairaj et al. (2023), a subset of the UniRef50 database.
Dataset Splits | No | During inference, we first sample the target sequence length from the training data distribution to ensure realistic protein lengths. We finetune DiMA on the CATH S40 non-redundant dataset (~27k proteins) and evaluate performance on a hold-out set of 100 structures. For each structure, we generate 10 proteins and assess their similarity to the target fold using the TM-score.
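The length-sampling step described in this row — drawing each target sequence length from the empirical distribution of training-set lengths — can be sketched as follows. This is an illustrative sketch, not the paper's released code; the function name and toy data are ours.

```python
import random
from collections import Counter

def sample_lengths(training_lengths, n, rng=None):
    """Draw n target sequence lengths from the empirical length
    distribution of the training set (frequencies act as weights)."""
    rng = rng or random.Random()
    counts = Counter(training_lengths)
    lengths = list(counts.keys())
    weights = list(counts.values())
    return rng.choices(lengths, weights=weights, k=n)

# Toy example: sampled lengths come only from values seen in training,
# with more frequent lengths drawn more often.
train = [120, 120, 250, 250, 250, 400]
sampled = sample_lengths(train, 1000, rng=random.Random(0))
assert set(sampled) <= {120, 250, 400}
```

Sampling lengths this way keeps generated proteins within the realistic length range of the training corpus, rather than fixing one length for all samples.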
Hardware Specification | Yes | The experiments were conducted using 4 A100 80GB GPUs.
Software Dependencies | No | The paper mentions several software tools and models, such as "ESM-2", "CHEAP", "SaProt", "RFDiffusion", "ProteinMPNN", and "InterProScan", but does not specify their version numbers. For example, it does not state "Python 3.8, PyTorch 1.9, and CUDA 11.1" or similar specific versioning for the ancillary software used in the experiments.
Experiment Setup | Yes | All models were trained with a batch size of 512 and a learning rate of 1e-4 to convergence. We clip the gradient norm to 2 and use a linear warmup schedule for the first 5000 iterations. We also use an EMA with decay 0.9999. Our diffusion model employs a transformer architecture with 12 layers, 16 attention heads, and a hidden size of 320.
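The optimization details in this row (linear warmup to 1e-4 over 5000 steps, gradient-norm clipping at 2, EMA with decay 0.9999) can be illustrated with a minimal, framework-agnostic sketch. The function names are ours, not from the released code; a real implementation would use the training framework's built-in scheduler, clipping, and EMA utilities.

```python
BASE_LR = 1e-4
WARMUP_STEPS = 5000
EMA_DECAY = 0.9999

def warmup_lr(step):
    """Linear warmup from 0 to BASE_LR over the first WARMUP_STEPS
    iterations, then constant BASE_LR."""
    return BASE_LR * min(1.0, step / WARMUP_STEPS)

def clip_grad_norm(grads, max_norm=2.0):
    """Rescale a flat list of gradient values so their L2 norm
    does not exceed max_norm."""
    norm = sum(g * g for g in grads) ** 0.5
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads

def ema_update(ema_value, new_value, decay=EMA_DECAY):
    """One exponential-moving-average step for a single parameter."""
    return decay * ema_value + (1.0 - decay) * new_value

# Warmup: halfway through warmup, LR is half of BASE_LR.
assert warmup_lr(2500) == 5e-5
assert warmup_lr(5000) == BASE_LR
assert warmup_lr(10000) == BASE_LR

# Clipping: a gradient of norm 5 is rescaled to norm 2.
clipped = clip_grad_norm([3.0, 4.0])
assert abs(sum(g * g for g in clipped) ** 0.5 - 2.0) < 1e-9
```

With a decay of 0.9999, the EMA weights track roughly the last ~10,000 steps of training, which smooths out per-batch noise in the final evaluated model.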