PepTune: De Novo Generation of Therapeutic Peptides with Multi-Objective-Guided Discrete Diffusion

Authors: Sophia Tang, Yinuo Zhang, Pranam Chatterjee

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Experiments: PepMDLM generates diverse chemically-modified and cyclic peptides. Our optimized unconditional MDLM (PepMDLM) shows increased uniqueness and diversity with lower SNN compared to HELM-GPT (Xu et al., 2024), an autoregressive generator of macrocyclic peptides, demonstrating our capability to comprehensively search the subspace of valid peptide SMILES (Table 7). Furthermore, PepMDLM generates valid peptides with a higher average nAA frequency than experimentally-validated peptide SMILES for membrane permeability and binding affinity (Figure 12), demonstrating its ability to design de novo peptides with cyclic and nAA modifications, expanding the search space of therapeutic peptides well beyond any generative model trained on canonical amino acid representations. In addition, the fraction of valid peptides consistently reaches 100% after only 20 iterations of the MCTS search algorithm, demonstrating the effectiveness of backpropagating the classifier-based rewards.
Researcher Affiliation | Academia | 1 Department of Computer and Information Science, University of Pennsylvania; 2 Center of Computational Biology, Duke-NUS Medical School; 3 Department of Bioengineering, University of Pennsylvania. Correspondence to: Pranam Chatterjee <EMAIL>.
Pseudocode | Yes | J. Algorithms: Algorithm 1 outlines the training algorithm for PepMDLM, our bond-dependent masked discrete diffusion model for unconditional peptide SMILES generation. Algorithms 2-6 describe PepTune, our MCTS-guided peptide SMILES generator. Algorithms 7 and 8 describe the bond mask function and the peptide sequence decoder, which can also act as a validity filter.
Open Source Code | Yes | Our peptide filtering, analysis, and visualization tool, SMILES2PEPTIDE, is freely available on Hugging Face: https://huggingface.co/spaces/ChatterjeeLab/SMILES2PEPTIDE. The PepTune codebase is freely accessible to the academic community via a non-commercial license at https://huggingface.co/ChatterjeeLab/PepTune.
Open Datasets | Yes | To train the unconditional masked diffusion language model generator, we collected 11 million peptide SMILES consisting of 7,451 sequences from the CycPeptMPDB database (Li et al., 2023a), 825,632 unique peptides from SmProt (Li et al., 2021), and approximately 10 million modified peptides generated with CycloPs (Duffy et al., 2011; Feller & Wilke, 2024), which consist of 90% canonical amino acids, 10% unnatural amino acids from SwissSidechain (Gfeller et al., 2012), 10% dextro-chiral alpha carbons, 20% N-methylated amine backbone atoms, and 10% PEGylated peptides. ... The dataset contains 34,853 experimentally validated peptide SMILES, consisting of 22,040 SMILES sequences obtained from the ChEMBL database (Mendez et al., 2018) and 7,451 sequences from the CycPeptMPDB database (Li et al., 2023a).
Dataset Splits | Yes | We split our data by k-means clustering into 1,000 groups of sequences with similar chemical properties based on their Morgan fingerprints (Rogers & Hahn, 2010), a bit-vector representation of the full peptide sequence where each bit encodes a feature relating to the SMILES atom types, connectivity, and bonding environment. The final dataset was split 0.8/0.2 by cluster, maintaining similar diversity of the SMILES strings across splits. ... Data was randomly shuffled and split into a 0.8/0.1/0.1 ratio for train, validation, and test.
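The cluster-level split described above (whole clusters assigned to train or test, so chemically similar sequences never straddle the boundary) can be sketched in plain Python. This is a minimal sketch: `cluster_split` is a hypothetical helper, and in the actual pipeline the cluster labels would come from k-means over RDKit Morgan fingerprints.

```python
import random

def cluster_split(cluster_labels, train_frac=0.8, seed=0):
    """Assign whole clusters to train/test so that similar sequences
    (same cluster) never appear on both sides of the split.

    cluster_labels: per-item cluster id, e.g. from k-means over
    Morgan fingerprints (hypothetical upstream step, not shown here).
    Returns (train_indices, test_indices).
    """
    clusters = sorted(set(cluster_labels))
    rng = random.Random(seed)
    rng.shuffle(clusters)  # randomize which clusters go to train
    n_train = int(round(train_frac * len(clusters)))
    train_clusters = set(clusters[:n_train])
    train_idx = [i for i, c in enumerate(cluster_labels) if c in train_clusters]
    test_idx = [i for i, c in enumerate(cluster_labels) if c not in train_clusters]
    return train_idx, test_idx

# Toy example: 10 items in 5 clusters of 2; with train_frac=0.8,
# 4 of the 5 clusters (8 items) land in train, 1 cluster in test.
labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train, test = cluster_split(labels, train_frac=0.8, seed=0)
```

Splitting by cluster rather than by individual sequence gives a more honest estimate of generalization, since near-duplicate peptides cannot leak from train into test.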
Hardware Specification | Yes | The model used to generate the validation results in this manuscript was trained on our in-house 8 NVIDIA A6000 GPUs (50 GB memory) for 1,600 GPU hours using the AdamW optimizer with a learning rate of 0.0003 and a weight decay of 0.075.
Software Dependencies | Yes | For valid generated peptide SMILES with non-dominated scores across objectives, we used AutoDock Vina (Eberhardt et al., 2021) (v1.1.2) for in silico docking of the peptide binders to their target proteins (Appendix 11) to confirm binding affinity. Targets were preprocessed with MGLTools (Morris et al., 2009) (v1.5.7), and the conformations of the SMILES were optimized with ETKDG from RDKit (Eberhardt et al., 2021; Wang et al., 2020). The final results were visualized in PyMOL (Schrödinger, LLC, 2015) (v3.1). ... For cell membrane permeability, we trained an XGBoost (Chen & Guestrin, 2016) boosted-tree regression model on PeptideCLM (Feller & Wilke, 2024) embeddings, which returns the predicted PAMPA permeability score (log P) given a peptide SMILES sequence.
Experiment Setup | Yes | The model used to generate the validation results in this manuscript was trained on our in-house 8 NVIDIA A6000 GPUs (50 GB memory) for 1,600 GPU hours using the AdamW optimizer with a learning rate of 0.0003 and a weight decay of 0.075. After training for 8 epochs on 11 million peptide SMILES (Appendix C.1), we achieved a train loss of 0.832 and a validation loss of 0.880.
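For reference, the decoupled weight decay that distinguishes AdamW from plain Adam can be illustrated with a single-parameter update step. This is a minimal sketch using the learning rate (3e-4) and weight decay (0.075) quoted above; the beta and epsilon values are the standard defaults, not stated in the excerpt.

```python
import math

def adamw_step(w, g, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.075):
    """One AdamW update for a scalar parameter w with gradient g.

    Unlike Adam with L2 regularization, the weight-decay term is
    decoupled: it is applied directly to w and never enters the
    moment estimates m and v.
    """
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction (step t >= 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v

# One step on a toy parameter: both the gradient term and the
# decay term pull w slightly below its starting value of 1.0.
w, m, v = adamw_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
```

In a real training loop this update runs per parameter tensor; the sketch only makes the decoupling explicit.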