FoldToken: Learning Protein Language via Vector Quantization and Beyond

Authors: Zhangyang Gao, Cheng Tan, Jue Wang, Yufei Huang, Lirong Wu, Stan Z. Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments to answer the following questions: Reconstruction (Q1): Do the proposed methods outperform baselines on reconstruction quality? Backbone Inpainting (Q2): How does the learned protein language perform in the generation task? Ablation (Q3): What are the key factors contributing to the effectiveness of Soft CVQ?"
Researcher Affiliation | Academia | "Zhangyang Gao 1,2*, Cheng Tan 1,2*, Jue Wang 1,2, Yufei Huang 1,2, Lirong Wu 1,2, Stan Z. Li 1 (1 Westlake University, 2 Zhejiang University)"
Pseudocode | No | The paper describes the methodology in detail using mathematical formulations and textual explanations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository; it only mentions utilizing pretrained checkpoints of baseline models.
Open Datasets | Yes | "cAF2DB for VQ Pretraining: We employ cAF2DB, a clustered version of the AlphaFold UniProt v3 database, for VQ pretraining. The original AF2DB contains a large number of structures (214,684,311), which presents computational challenges. To overcome this, we utilize cAF2DB (Barrio-Hernandez et al. 2023), consisting of 2.27M structural clusters. CATH4.3 for Backbone Inpainting: For the task of backbone inpainting, we employ CATH4.3."
Dataset Splits | Yes | "To construct the train, validation, and test sets, we utilize the CAT code to randomly partition proteins in a 95:2:3 ratio, ensuring no overlap in CAT codes among the training, validation, and test sets. This results in a train set with 30,290 samples, a validation set with 638 samples, and a test set with 957 samples."
Hardware Specification | No | The paper mentions "BF16 precision training in DeepSpeed" but does not specify any hardware components such as GPU or CPU models.
Software Dependencies | No | The paper mentions DeepSpeed, the OneCycle scheduler, and the AdamW optimizer, but does not provide version numbers for these or any other software libraries/frameworks.
Experiment Setup | Yes | "With BF16 precision training in DeepSpeed, the model is trained for 15 epochs using the OneCycle scheduler and AdamW optimizer. The batch size is set to 128, the learning rate is 0.0001, and the padding length is 512. [...] The model contains 15 transformer layers with a 480-dimensional hidden state and 20 attention heads. We train the model for up to 20k steps using the OneCycle scheduler and AdamW optimizer. The batch size is 128, the learning rate is 0.0005, and the padding length is 512."
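The Dataset Splits row describes a group-aware partition: proteins are split 95:2:3 at the level of CAT codes, so no code is shared across train, validation, and test. A minimal sketch of that procedure, assuming hypothetical CAT-code strings and function names (none of this is from the paper's code, which is not released):

```python
# Hypothetical sketch of a group-aware 95:2:3 split keyed on CAT codes.
# Proteins sharing a CAT code always land in the same split, so the
# train/val/test sets have no CAT-code overlap.
import random

def split_by_cat(cat_codes, ratios=(0.95, 0.02, 0.03), seed=0):
    """Return per-split protein indices, partitioned at the CAT-code level."""
    codes = sorted(set(cat_codes))
    random.Random(seed).shuffle(codes)
    n_train = int(ratios[0] * len(codes))
    n_val = int(ratios[1] * len(codes))
    train_codes = set(codes[:n_train])
    val_codes = set(codes[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for idx, code in enumerate(cat_codes):
        if code in train_codes:
            splits["train"].append(idx)
        elif code in val_codes:
            splits["val"].append(idx)
        else:
            splits["test"].append(idx)
    return splits

# Toy example: 300 proteins spanning 100 hypothetical CAT codes.
codes = [f"1.10.8.{k}" for k in range(100)] * 3
splits = split_by_cat(codes)
```

Because the ratio is applied to codes rather than proteins, the sample counts per split (30,290 / 638 / 957 in the paper) need not match 95:2:3 exactly when cluster sizes vary.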
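The schedule quoted in the Experiment Setup row (OneCycle with AdamW) would typically be realized in PyTorch via `torch.optim.AdamW` plus `torch.optim.lr_scheduler.OneCycleLR`. As a dependency-free illustration, here is a sketch of the OneCycle learning-rate curve under the 20k-step configuration (peak lr 0.0005); the `pct_start`, `div_factor`, and `final_div_factor` values are PyTorch's defaults, not hyperparameters stated in the paper:

```python
# Sketch of a cosine-annealed OneCycle LR curve: ramp initial_lr -> max_lr
# over the first pct_start fraction of steps, then anneal down to min_lr.
# Default shape parameters mirror torch.optim.lr_scheduler.OneCycleLR;
# max_lr and total_steps follow the paper's 20k-step run.
import math

def one_cycle_lr(step, total_steps, max_lr=5e-4, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    initial_lr = max_lr / div_factor          # lr at step 0
    min_lr = initial_lr / final_div_factor    # lr at the final step
    warmup_steps = int(pct_start * total_steps)
    if step < warmup_steps:
        # cosine ramp up from initial_lr to max_lr
        t = step / max(1, warmup_steps)
        return initial_lr + (max_lr - initial_lr) * (1 - math.cos(math.pi * t)) / 2
    # cosine anneal down from max_lr to min_lr
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (max_lr - min_lr) * (1 + math.cos(math.pi * t)) / 2

# The peak learning rate is reached 30% of the way through the run.
peak = one_cycle_lr(6000, 20000)
```

The same function with `max_lr=1e-4` would sketch the 15-epoch VQ-pretraining schedule, since only the peak learning rate differs between the two quoted configurations.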