FoldToken: Learning Protein Language via Vector Quantization and Beyond

Authors: Zhangyang Gao, Cheng Tan, Jue Wang, Yufei Huang, Lirong Wu, Stan Z. Li

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct experiments to answer the following questions: Reconstruction (Q1): Do the proposed methods outperform baselines on reconstruction quality? Backbone Inpainting (Q2): How does the learned protein language perform in the generation task? Ablation (Q3): What are the key factors contributing to the effectiveness of Soft CVQ?"
Researcher Affiliation | Academia | "Zhangyang Gao 1,2*, Cheng Tan 1,2*, Jue Wang 1,2, Yufei Huang 1,2, Lirong Wu 1,2, Stan Z. Li 1 (1 Westlake University, 2 Zhejiang University)"
Pseudocode | No | The paper describes the methodology in detail using mathematical formulations and textual explanations, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code or a link to a code repository; it only mentions utilizing pretrained checkpoints of baseline models.
Open Datasets | Yes | "cAF2DB for VQ Pretraining: We employ cAF2DB, a clustered version of the AlphaFold UniProt v3 database, for VQ pretraining. The original AF2DB contains a large number of structures (214,684,311), which presents computational challenges. To overcome this, we utilize cAF2DB (Barrio-Hernandez et al. 2023), consisting of 2.27M structural clusters. CATH4.3 for Backbone Inpainting: For the task of backbone inpainting, we employ CATH4.3."
Dataset Splits | Yes | "To construct the train, validation, and test sets, we utilize the CAT code to randomly partition proteins in a 95:2:3 ratio, ensuring no overlap in CAT codes among the training, validation, and test sets. This results in a train set with 30,290 samples, a validation set with 638 samples, and a test set with 957 samples."
Hardware Specification | No | The paper mentions "BF16 precision training in DeepSpeed" but does not specify any hardware components such as GPU or CPU models.
Software Dependencies | No | The paper mentions DeepSpeed, the OneCycle scheduler, and the AdamW optimizer, but does not provide version numbers for these or any other software libraries/frameworks.
Experiment Setup | Yes | "With BF16 precision training in DeepSpeed, the model is trained for 15 epochs using the OneCycle scheduler and AdamW optimizer. The batch size is set to 128, the learning rate is 0.0001, and the padding length is 512. [...] The model contains 15 transformer layers with a 480-dimensional hidden state and 20 attention heads. We train the model for up to 20k steps using the OneCycle scheduler and AdamW optimizer. The batch size is 128, the learning rate is 0.0005, and the padding length is 512."
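The Dataset Splits row describes a group-aware partition: proteins are split 95:2:3 at the level of CAT codes, so no code is shared across train, validation, and test. A minimal sketch of that procedure, assuming hypothetical CAT-code strings and function names (none of this is from the paper's code, which is not released):

```python
# Hypothetical sketch of a group-aware 95:2:3 split keyed on CAT codes.
# Proteins sharing a CAT code always land in the same split, so the
# train/val/test sets have no CAT-code overlap.
import random

def split_by_cat(cat_codes, ratios=(0.95, 0.02, 0.03), seed=0):
    """Return per-split protein indices, partitioned at the CAT-code level."""
    codes = sorted(set(cat_codes))
    random.Random(seed).shuffle(codes)
    n_train = int(ratios[0] * len(codes))
    n_val = int(ratios[1] * len(codes))
    train_codes = set(codes[:n_train])
    val_codes = set(codes[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for idx, code in enumerate(cat_codes):
        if code in train_codes:
            splits["train"].append(idx)
        elif code in val_codes:
            splits["val"].append(idx)
        else:
            splits["test"].append(idx)
    return splits

# Toy example: 300 proteins spanning 100 hypothetical CAT codes.
codes = [f"1.10.8.{k}" for k in range(100)] * 3
splits = split_by_cat(codes)
```

Because the ratio is applied to codes rather than proteins, the sample counts per split (30,290 / 638 / 957 in the paper) need not match 95:2:3 exactly when cluster sizes vary.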
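The schedule quoted in the Experiment Setup row (OneCycle with AdamW) would typically be realized in PyTorch via `torch.optim.AdamW` plus `torch.optim.lr_scheduler.OneCycleLR`. As a dependency-free illustration, here is a sketch of the OneCycle learning-rate curve under the 20k-step configuration (peak lr 0.0005); the `pct_start`, `div_factor`, and `final_div_factor` values are PyTorch's defaults, not hyperparameters stated in the paper:

```python
# Sketch of a cosine-annealed OneCycle LR curve: ramp initial_lr -> max_lr
# over the first pct_start fraction of steps, then anneal down to min_lr.
# Default shape parameters mirror torch.optim.lr_scheduler.OneCycleLR;
# max_lr and total_steps follow the paper's 20k-step run.
import math

def one_cycle_lr(step, total_steps, max_lr=5e-4, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    initial_lr = max_lr / div_factor          # lr at step 0
    min_lr = initial_lr / final_div_factor    # lr at the final step
    warmup_steps = int(pct_start * total_steps)
    if step < warmup_steps:
        # cosine ramp up from initial_lr to max_lr
        t = step / max(1, warmup_steps)
        return initial_lr + (max_lr - initial_lr) * (1 - math.cos(math.pi * t)) / 2
    # cosine anneal down from max_lr to min_lr
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + (max_lr - min_lr) * (1 + math.cos(math.pi * t)) / 2

# The peak learning rate is reached 30% of the way through the run.
peak = one_cycle_lr(6000, 20000)
```

The same function with `max_lr=1e-4` would sketch the 15-epoch VQ-pretraining schedule, since only the peak learning rate differs between the two quoted configurations.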