MeToken: Uniform Micro-environment Token Boosts Post-Translational Modification Prediction

Authors: Cheng Tan, Zhenxiao Cao, Zhangyang Gao, Lirong Wu, Siyuan Li, Yufei Huang, Jun Xia, Bozhen Hu, Stan Z. Li

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to validate the effectiveness and generalizability of MeToken across multiple datasets, demonstrating its superior performance in accurately identifying PTM types. The results underscore the importance of incorporating structural data and highlight MeToken's potential in facilitating accurate and comprehensive PTM predictions, which could significantly impact proteomics research.
Researcher Affiliation | Academia | 1 Zhejiang University, Hangzhou, China; 2 AI Lab, Research Center for Industries of the Future, Westlake University, Hangzhou, China; 3 Xi'an Jiaotong University, China
Pseudocode | No | The paper describes methods in prose and mathematical equations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | We constructed a large-scale dataset by integrating dbPTM (Li et al., 2022a), the most extensive sequence-based PTM dataset available, with structural data obtained from the Protein Data Bank (PDB) (Berman et al., 2000) and the AlphaFold database (Varadi et al., 2022; 2024). ... To assess the generalizability, we used the pre-trained models on the large-scale dataset to directly test on the PTMint (Hong et al., 2023) and qPTM (Yu et al., 2023a) datasets.
Dataset Splits | Yes | We utilized MMseqs2 (Steinegger & Söding, 2017) to cluster the data based on sequence similarity with a threshold of 40% and grouped the data into clusters, which were then allocated to the training, validation, or test set.
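The split strategy quoted above assigns whole clusters, not individual sequences, to each partition, so no two sequences above the 40% similarity threshold straddle train and test. A minimal sketch of that cluster-level allocation; the 80/10/10 ratios, the `cluster_split` helper, and the item-to-cluster mapping format are illustrative assumptions, not details from the paper:

```python
import random

def cluster_split(item_clusters, ratios=(0.8, 0.1, 0.1), seed=0):
    """Allocate items to train/val/test by whole cluster, so sequences in
    the same similarity cluster (e.g., from MMseqs2 clustering output)
    always land in the same split."""
    clusters = sorted(set(item_clusters.values()))
    random.Random(seed).shuffle(clusters)          # randomize cluster order
    n_train = int(ratios[0] * len(clusters))
    n_val = int(ratios[1] * len(clusters))
    split_of = {}
    for i, c in enumerate(clusters):
        if i < n_train:
            split_of[c] = "train"
        elif i < n_train + n_val:
            split_of[c] = "val"
        else:
            split_of[c] = "test"
    return {item: split_of[c] for item, c in item_clusters.items()}

# toy example: items "a" and "b" share cluster 1, so they must co-locate
splits = cluster_split({"a": 1, "b": 1, "c": 2, "d": 3, "e": 4, "f": 5,
                        "g": 6, "h": 7, "i": 8, "j": 9})
```

This group-aware split is what prevents similarity leakage between training and evaluation sets.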
Hardware Specification | No | The paper does not provide specific details about the hardware used to run its experiments.
Software Dependencies | No | The paper mentions software like MMseqs2, PiGNN, and ESM2, but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | In our model, we implement a temperature-scaled vector quantization mechanism that introduces a temperature parameter, τ_v, to modulate the quantization process. ... Initially set at 1, τ_v is gradually reduced towards zero during training. ... L_codebook = L_recon + α·L_u, where α is set as 0.1 empirically, balancing the reconstruction loss and the uniform loss. ... The predictor network is trained using the cross-entropy loss... Following Dauparas et al. (2022), we introduced Gaussian noise with a mean of zero and a standard deviation of 0.0005 to the atomic coordinates.
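As a rough illustration of the setup excerpted above, here is a NumPy sketch of temperature-scaled vector quantization with a codebook-uniformity penalty and the coordinate-noise augmentation. The soft-assignment form, the KL-based uniform loss, and all function names are assumptions made for this sketch; only α = 0.1, the τ annealing from 1 toward 0, and the 0.0005 noise standard deviation come from the paper's description:

```python
import numpy as np

def temperature_vq(z, codebook, tau):
    """Soft nearest-codeword assignment; hardens to argmax as tau -> 0.
    z: (n, d) latent vectors, codebook: (K, d) codewords."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, K) squared distances
    logits = -d2 / max(tau, 1e-8)
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                           # soft assignment weights
    return w @ codebook, w                                      # quantized latents, weights

def uniform_loss(w):
    """KL(average codebook usage || uniform): zero iff usage is uniform."""
    usage = w.mean(axis=0)
    return float((usage * np.log(usage * usage.shape[0] + 1e-12)).sum())

rng = np.random.default_rng(0)
z = rng.normal(size=(32, 8))
codebook = rng.normal(size=(16, 8))

z_q, w = temperature_vq(z, codebook, tau=1.0)      # tau annealed from 1 toward 0 in training
alpha = 0.1                                        # empirical weight from the paper
loss = float(((z - z_q) ** 2).mean()) + alpha * uniform_loss(w)

# data augmentation: zero-mean Gaussian noise on atomic coordinates
coords = rng.normal(size=(100, 3))
noisy_coords = coords + rng.normal(0.0, 0.0005, size=coords.shape)
```

Annealing τ toward zero makes the soft mixture converge to a hard nearest-codeword lookup, while the uniformity term discourages codebook collapse onto a few codewords.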