ReNovo: Retrieval-Based De Novo Mass Spectrometry Peptide Sequencing

Authors: Shaorong Chen, Jun Xia, Jingbo Zhou, Lecheng Zhang, Zhangyang Gao, Bozhen Hu, Cheng Tan, Wenjie Du, Stan Z Li

ICLR 2025

Reproducibility Variable Result LLM Response
Research Type Experimental A series of experiments have confirmed that ReNovo outperforms state-of-the-art models across multiple widely-used datasets, incurring only minor storage and time consumption, representing a significant advancement in proteomics. Supplementary materials include the code. (...) In this study, we introduce a novel Retrieval-based De Novo peptide sequencing methodology, termed ReNovo, which draws inspiration from database search methods.
Researcher Affiliation Academia Zhejiang University; School of Engineering, Westlake University, Hangzhou, China; University of Science and Technology of China
Pseudocode No The paper describes the methodology in detail using textual descriptions and figures, but it does not include explicitly labeled pseudocode blocks or algorithms.
Open Source Code Yes Supplementary materials include the code.
Open Datasets Yes For evaluation, we utilize three representative datasets: Seven-species Dataset (Tran et al., 2017), Nine-species Dataset (Tran et al., 2017), and HC-PT Dataset (Eloff et al., 2023).
Dataset Splits Yes To simulate real-world scenarios requiring novel peptide identification, we utilized a leave-one-out approach, akin to prior de novo peptide sequencing studies. For example, all the models were trained on data from six species and subsequently tested on the remaining species in the Seven-species Dataset. The same applies to the other datasets. (...) The detailed statistics of the datasets are shown in Table 6.
Hardware Specification Yes When evaluating running times, we ensured consistent experimental setups across all models including: the use of an Nvidia A100 GPU (80GB), setting the batch size to 32, and calculating the average time by dividing the total time by the number of steps.
Software Dependencies No The paper mentions the use of 'transformer architecture (Vaswani et al., 2017)' and references various models like 'DeepNovo', 'PointNovo', 'Casanovo', 'InstaNovo', 'HelixNovo', and 'AdaNovo', but it does not specify software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup Yes In order to encode MS2 data $s = \{s_i\}_{i=1}^{M}$ into feature vectors $\{E_i\}_{i=1}^{M}$, we follow previous methods (Yilmaz et al., 2022; Xia et al., 2024), treating each peak $s_i = (m_i, I_i)$ as a word in an MS2 sentence $s$. The peak embedding $E_i$ is obtained by separately encoding its m/z value $m_i$ and intensity value $I_i$ into $E_i^m$ and $E_i^I$, then combining them through summation. Formally, (...) where $d$ denotes the feature dimension, $W \in \mathbb{R}^{d \times 1}$ represents a trainable linear layer, and $N_1$ and $N_2$ are user-defined scalars that can be set to any value. Specifically, we set $d = 512$, $N_1 = m_{\max}/m_{\min}$ and $N_2 = m_{\min}/2\pi$, where $m_{\max} = 10{,}000$ and $m_{\min} = 0.001$ in our work. (...) When evaluating running times, we ensured consistent experimental setups across all models including: the use of an Nvidia A100 GPU (80GB), setting the batch size to 32, and calculating the average time by dividing the total time by the number of steps. (...) We conducted sensitivity analysis experiments on the Seven-species Dataset to investigate the impact of the parameter $K$ on ReNovo's performance (...) We conducted sensitivity analysis on the Seven-species Dataset to investigate the impact of the parameter $T$ on ReNovo's performance.
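The peak-embedding scheme quoted above (a fixed sinusoidal encoding of the m/z value plus a trainable linear projection of the intensity, summed per peak) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `peak_embedding` and the random stand-in for the trainable weights `W` are assumptions; only the constants ($d = 512$, $N_1 = m_{\max}/m_{\min}$, $N_2 = m_{\min}/2\pi$, $m_{\max} = 10{,}000$, $m_{\min} = 0.001$) come from the source.

```python
import numpy as np

def peak_embedding(mz, intensity, d=512, m_max=10_000.0, m_min=0.001, W=None):
    """Sketch of one peak embedding E_i = E_i^m + E_i^I.

    E_i^m: sinusoidal encoding of the m/z value, with wavelengths spaced
           geometrically using N1 = m_max/m_min and N2 = m_min/(2*pi).
    E_i^I: the scalar intensity passed through a linear layer W (here a
           fixed random matrix stands in for the learned weights).
    """
    n1 = m_max / m_min          # N1 in the paper's notation
    n2 = m_min / (2 * np.pi)    # N2 in the paper's notation
    j = np.arange(d // 2)
    scale = n2 * n1 ** (2 * j / d)  # geometric progression of wavelengths
    e_m = np.empty(d)
    e_m[0::2] = np.sin(mz / scale)  # even dimensions: sine
    e_m[1::2] = np.cos(mz / scale)  # odd dimensions: cosine
    if W is None:
        rng = np.random.default_rng(0)
        W = rng.standard_normal((1, d)) / np.sqrt(d)
    e_i = intensity * W[0]          # linear layer applied to the intensity
    return e_m + e_i

# Embed one hypothetical peak (m/z = 500.3, intensity = 0.8) into R^512.
vec = peak_embedding(500.3, 0.8)
```

In a real model, `W` would be a trainable `nn.Linear(1, d)` and the whole computation would be vectorized over all $M$ peaks of a spectrum at once.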