nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder

Authors: Maksim Kuznetsov, Airat Valiev, Alex Aliper, Daniil Polykovskiy, Elena Tutubalina, Rim Shayakhmetov, Zulfat Miftahutdinov

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate superior or comparable performance to LM baselines and state-of-the-art diffusion approaches across six spatial molecular generation tasks. We evaluate the quality of the nach0-pc model across several established spatial molecular generation tasks: (i) 3D molecular structure generation: spatial molecular distribution learning and conformation generation; (ii) molecular completion: linker design and scaffold decoration; (iii) shape-conditioned generation; (iv) pocket-conditioned generation.
Researcher Affiliation | Industry | 1Insilico Medicine Canada Inc., 2Insilico Medicine AI Ltd. *Corresponding author: EMAIL
Pseudocode | Yes | Algorithm 1: Point Cloud Encoder
Open Source Code | No | The paper does not state that source code for the described methodology is publicly available, nor does it link to a code repository. It mentions using the existing T5 architecture and the nach0 model, but not a public implementation of nach0-pc.
Open Datasets | Yes | Our work adopts the small-molecule ZINC (Irwin et al. 2020), MOSES (Polykovskiy et al. 2020), and GEOM-Drugs (Axelrod and Gómez-Bombarelli 2022) datasets, as well as the CrossDocked2020 (Francoeur et al. 2020) dataset, which includes pocket-ligand pairs.
Dataset Splits | Yes | When tasks use the same dataset, we use the same dataset split to avoid any potential data leakage. We utilize the same train/validation/test splits as the conformation generation task from the Torsional Diffusion (Jing et al. 2022) paper and retrain baselines if they were trained on another split.
Hardware Specification | Yes | The model was trained on two NVIDIA A6000 GPUs. The total training and evaluation time for our model was 164.5 hours, resulting in an estimated emission of 20.73 kg CO2eq. For training and evaluating the MolDiff and EDM models, we utilized an NVIDIA A4000.
Software Dependencies | No | The paper mentions using the RDKit and Open Babel tools and relies on the T5 architecture and the nach0 model, but it does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | The pre-training and finetuning stages used the following hyperparameters: a batch size of 64 for both stages, a learning rate of 1e-4, a weight decay of 0.01, and a cosine schedule. Both the pre-training and fine-tuning stages lasted 100,000 steps.
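The split-reuse policy in the Dataset Splits row (every task sharing a dataset sees the same train/validation/test membership) can be sketched as a deterministic hash-based assignment. This is an illustrative sketch only: the function name, the 80/10/10 ratios, and the use of a string molecule identifier are assumptions, and the paper itself reuses the Torsional Diffusion splits directly rather than re-deriving them.

```python
import hashlib

def assign_split(molecule_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Deterministically map a molecule identifier to a split.

    Because the assignment depends only on the identifier, every task
    that shares the dataset sees an identical split, which removes one
    source of train/test leakage between tasks.
    """
    # Hash the identifier to a stable pseudo-uniform fraction in [0, 1).
    digest = int(hashlib.sha256(molecule_id.encode()).hexdigest(), 16)
    frac = (digest % 10**6) / 10**6
    if frac < ratios[0]:
        return "train"
    if frac < ratios[0] + ratios[1]:
        return "valid"
    return "test"
```

Because the mapping is a pure function of the identifier, it can be recomputed independently by each task's data loader without sharing any split files.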
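The 20.73 kg CO2eq figure in the Hardware Specification row is consistent with a standard energy-times-carbon-intensity estimate. In the sketch below, the 164.5 hours and the two A6000 GPUs come from the paper, while the 300 W per-GPU draw and the 0.21 kg CO2eq/kWh grid intensity are assumed values chosen only to illustrate the calculation.

```python
# Back-of-envelope check of the reported emissions estimate.
hours = 164.5            # total training + evaluation time (from the paper)
num_gpus = 2             # two NVIDIA A6000 GPUs (from the paper)
power_kw = 0.300         # assumption: ~300 W board power per A6000
carbon_intensity = 0.21  # assumption: kg CO2eq per kWh, grid-dependent

energy_kwh = hours * num_gpus * power_kw      # ~98.7 kWh
emissions_kg = energy_kwh * carbon_intensity  # ~20.7 kg CO2eq
print(f"{emissions_kg:.2f} kg CO2eq")
```

Under these assumptions the estimate lands within rounding distance of the reported value, which suggests the paper used a similar energy-based methodology.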
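The hyperparameters in the Experiment Setup row can be captured in a small config plus a cosine-decay learning-rate function. The optional warmup argument is an assumption for illustration; the paper only states "a cosine schedule".

```python
import math

# Hyperparameters reported for nach0-pc pre-training and finetuning.
CONFIG = {
    "batch_size": 64,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "total_steps": 100_000,
}

def cosine_lr(step: int, total_steps: int, base_lr: float,
              warmup_steps: int = 0) -> float:
    """Cosine-annealed learning rate, decaying from base_lr to 0.

    warmup_steps is an assumed extension; with the default of 0 it
    reduces to plain cosine decay over total_steps.
    """
    if warmup_steps and step < warmup_steps:
        # Linear warmup from 0 to base_lr (assumption, not from the paper).
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr(0, 100_000, 1e-4)` returns the full 1e-4, the rate halves to 5e-5 at step 50,000, and it decays to 0 at step 100,000.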