ReNovo: Retrieval-Based De Novo Mass Spectrometry Peptide Sequencing

Authors: Shaorong Chen, Jun Xia, Jingbo Zhou, Lecheng Zhang, Zhangyang Gao, Bozhen Hu, Cheng Tan, Wenjie Du, Stan Z Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental A series of experiments have confirmed that ReNovo outperforms state-of-the-art models across multiple widely-used datasets, incurring only minor storage and time consumption, representing a significant advancement in proteomics. Supplementary materials include the code. (...) In this study, we introduce a novel Retrieval-based De Novo peptide sequencing methodology, termed ReNovo, which draws inspiration from database search methods.
Researcher Affiliation Academia Zhejiang University; School of Engineering, Westlake University, Hangzhou, China; University of Science and Technology of China
Pseudocode No The paper describes the methodology in detail using textual descriptions and figures, but it does not include explicitly labeled pseudocode blocks or algorithms.
Open Source Code Yes Supplementary materials include the code.
Open Datasets Yes For evaluation, we utilize three representative datasets: Seven-species Dataset (Tran et al., 2017), Nine-species Dataset (Tran et al., 2017), and HC-PT Dataset (Eloff et al., 2023).
Dataset Splits Yes To simulate real-world scenarios requiring novel peptide identification, we utilized a leave-one-out approach, akin to prior de novo peptide sequencing studies. For example, all the models were trained on data from six species and subsequently tested on the remaining species in the Seven-species Dataset. The same applies to the other datasets. (...) The detailed statistics of the datasets are shown in Table 6.
Hardware Specification Yes When evaluating running times, we ensured consistent experimental setups across all models including: the use of an Nvidia A100 GPU (80GB), setting the batch size to 32, and calculating the average time by dividing the total time by the number of steps.
Software Dependencies No The paper mentions the use of 'transformer architecture (Vaswani et al., 2017)' and references various models like 'DeepNovo', 'PointNovo', 'Casanovo', 'InstaNovo', 'HelixNovo', and 'AdaNovo', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes In order to encode MS2 data $s = \{s_i\}_{i=1}^{M}$ into feature vectors $\{E_i\}_{i=1}^{M}$, we follow previous methods (Yilmaz et al., 2022; Xia et al., 2024), treating each peak $s_i = (m_i, I_i)$ as a word in an MS2 sentence $s$. The peak embedding $E_i$ is obtained by separately encoding its m/z value $m_i$ and intensity value $I_i$ into $E_i^m$ and $E_i^I$, then combining them through summation. Formally, (...) where $d$ denotes the feature dimension, $W \in \mathbb{R}^{d \times 1}$ represents a trainable linear layer, and $N_1$ and $N_2$ are user-defined scalars that can be set to any value. Specifically, we set $d = 512$, $N_1 = m_{\max}/m_{\min}$ and $N_2 = m_{\min}/2\pi$, where $m_{\max} = 10{,}000$ and $m_{\min} = 0.001$ in our work. (...) When evaluating running times, we ensured consistent experimental setups across all models including: the use of an Nvidia A100 GPU (80GB), setting the batch size to 32, and calculating the average time by dividing the total time by the number of steps. (...) We conducted sensitivity analysis experiments on the Seven-species Dataset to investigate the impact of the parameter $K$ on ReNovo's performance (...) We conducted sensitivity analysis on the Seven-species Dataset to investigate the impact of the parameter $T$ on ReNovo's performance.
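The peak-embedding scheme quoted above (a fixed sinusoidal encoding of the m/z value plus a trainable linear projection of the intensity, summed per peak) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `peak_embedding` and the random stand-in for the trainable weights `W` are assumptions; only the constants ($d = 512$, $N_1 = m_{\max}/m_{\min}$, $N_2 = m_{\min}/2\pi$, $m_{\max} = 10{,}000$, $m_{\min} = 0.001$) come from the source.

```python
import numpy as np

def peak_embedding(mz, intensity, d=512, m_max=10_000.0, m_min=0.001, W=None):
    """Sketch of one peak embedding E_i = E_i^m + E_i^I.

    E_i^m: sinusoidal encoding of the m/z value, with wavelengths spaced
           geometrically using N1 = m_max/m_min and N2 = m_min/(2*pi).
    E_i^I: the scalar intensity passed through a linear layer W (here a
           fixed random matrix stands in for the learned weights).
    """
    n1 = m_max / m_min          # N1 in the paper's notation
    n2 = m_min / (2 * np.pi)    # N2 in the paper's notation
    j = np.arange(d // 2)
    scale = n2 * n1 ** (2 * j / d)  # geometric progression of wavelengths
    e_m = np.empty(d)
    e_m[0::2] = np.sin(mz / scale)  # even dimensions: sine
    e_m[1::2] = np.cos(mz / scale)  # odd dimensions: cosine
    if W is None:
        rng = np.random.default_rng(0)
        W = rng.standard_normal((1, d)) / np.sqrt(d)
    e_i = intensity * W[0]          # linear layer applied to the intensity
    return e_m + e_i

# Embed one hypothetical peak (m/z = 500.3, intensity = 0.8) into R^512.
vec = peak_embedding(500.3, 0.8)
```

In a real model, `W` would be a trainable `nn.Linear(1, d)` and the whole computation would be vectorized over all $M$ peaks of a spectrum at once.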