FoldToken: Learning Protein Language via Vector Quantization and Beyond
Authors: Zhangyang Gao, Cheng Tan, Jue Wang, Yufei Huang, Lirong Wu, Stan Z. Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experiments: We conduct experiments to answer the following questions: Reconstruction (Q1): Do the proposed methods outperform baselines on reconstruction quality? Backbone Inpainting (Q2): How does the learned protein language perform in the generation task? Ablation (Q3): What are the key factors contributing to the effectiveness of Soft CVQ?" |
| Researcher Affiliation | Academia | Zhangyang Gao1,2*, Cheng Tan1,2*, Jue Wang1,2, Yufei Huang1,2, Lirong Wu1,2, Stan Z. Li1 — 1Westlake University, 2Zhejiang University |
| Pseudocode | No | The paper describes the methodology in detail using mathematical formulations and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code or providing links to a code repository. The text only mentions utilizing pretrained checkpoints of baseline models. |
| Open Datasets | Yes | "cAF2DB for VQ pretraining: We employ cAF2DB, a clustered version of the AlphaFold UniProt v3 database, for VQ pretraining. The original AF2DB contains a large number of structures (214,684,311), which presents computational challenges. To overcome this, we utilize cAF2DB (Barrio-Hernandez et al. 2023), consisting of 2.27M structural clusters. CATH4.3 for backbone inpainting: For the task of backbone inpainting, we employ CATH4.3." |
| Dataset Splits | Yes | To construct the train, validation, and test sets, we utilize the CAT code to randomly partition proteins in a ratio of 95:2:3, ensuring no overlap in the training, validation, and test sets on the CAT code. This results in a train set with 30,290 samples, a validation set with 638 samples, and a test set with 957 samples. |
| Hardware Specification | No | The paper mentions "BF16 precision training in DeepSpeed" but does not specify any particular hardware components such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions "DeepSpeed", the "OneCycle scheduler", and the "AdamW optimizer", but does not provide version numbers for these or any other software libraries/frameworks. |
| Experiment Setup | Yes | "With BF16 precision training in DeepSpeed, the model is trained for 15 epochs using the OneCycle scheduler and AdamW optimizer. The batch size is set to 128, the learning rate is 0.0001, and the padding length is 512. [...] The model contains 15 transformer layers with 480 hidden dimensions and 20 attention heads. We train the model up to 20k steps using the OneCycle scheduler and AdamW optimizer. The batch size is 128, the learning rate is 0.0005, and the padding length is 512." |
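The CATH4.3 split described above (partitioning by CAT code in a 95:2:3 ratio so that no CAT code is shared across train, validation, and test) can be sketched as follows. This is an illustrative reconstruction, not the authors' released code: the function name `split_by_cat_code`, the seed, and the toy data are all assumptions; only the grouping-by-CAT-code idea and the 95:2:3 ratio come from the paper's description.

```python
import random

def split_by_cat_code(proteins, ratios=(0.95, 0.02, 0.03), seed=0):
    """Partition (protein_id, cat_code) pairs into train/val/test so that
    no CAT code appears in more than one split (hypothetical helper)."""
    # Group protein IDs by their CAT code.
    by_code = {}
    for pid, cat in proteins:
        by_code.setdefault(cat, []).append(pid)
    # Shuffle the CAT codes, then cut the code list at the 95:2:3 marks.
    codes = sorted(by_code)
    random.Random(seed).shuffle(codes)
    n = len(codes)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    parts = (codes[:n_train],
             codes[n_train:n_train + n_val],
             codes[n_train + n_val:])
    # Expand each code partition back into protein IDs.
    return tuple([pid for c in part for pid in by_code[c]] for part in parts)

# Toy usage with made-up (protein_id, CAT_code) pairs:
toy = [(f"p{i}", f"cat{i % 20}") for i in range(100)]
train, val, test = split_by_cat_code(toy)
assert len(train) + len(val) + len(test) == len(toy)
```

Splitting on CAT codes rather than individual proteins is what guarantees the "no overlap on the CAT code" property the report quotes; a per-protein random split would leak structurally similar folds between sets.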