nach0-pc: Multi-task Language Model with Molecular Point Cloud Encoder

Authors: Maksim Kuznetsov, Airat Valiev, Alex Aliper, Daniil Polykovskiy, Elena Tutubalina, Rim Shayakhmetov, Zulfat Miftahutdinov

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments demonstrate superior or comparable performance to LM baselines and state-of-the-art diffusion approaches across six spatial molecular generation tasks. We evaluate the quality of the nach0-pc model across several established spatial molecular generation tasks: (i) 3D molecular structure generation: spatial molecular distribution learning and conformation generation; (ii) molecular completion: linker design and scaffold decoration; (iii) shape-conditioned generation; (iv) pocket-conditioned generation.
Researcher Affiliation | Industry | 1Insilico Medicine Canada Inc., 2Insilico Medicine AI Ltd. *Corresponding author: EMAIL
Pseudocode | Yes | Algorithm 1: Point Cloud Encoder
Open Source Code | No | The paper does not state that source code for the described methodology is publicly available, nor does it link to a code repository. It mentions using the existing T5 architecture and the nach0 model, but not a public implementation of nach0-pc.
Open Datasets | Yes | Our work adopts the small-molecule ZINC (Irwin et al. 2020), MOSES (Polykovskiy et al. 2020), and GEOM-Drugs (Axelrod and Gómez-Bombarelli 2022) datasets, as well as the CrossDocked2020 (Francoeur et al. 2020) dataset, which includes pocket-ligand pairs.
Dataset Splits | Yes | When tasks use the same dataset, we use the same dataset split to avoid any potential data leakage. We utilize the same train/validation/test splits as the conformation generation task from the Torsional Diffusion (Jing et al. 2022) paper and retrain baselines if they were trained on another split.
Hardware Specification | Yes | The model was trained on two NVIDIA A6000 GPUs. The total training and evaluation time for our model was 164.5 hours, resulting in an estimated emission of 20.73 kg CO2eq. For training and evaluating the MolDiff and EDM models, we utilized an NVIDIA A4000.
Software Dependencies | No | The paper mentions using the RDKit and Open Babel tools and relies on the T5 architecture and the nach0 model, but it does not specify version numbers for any software dependencies such as programming languages, libraries, or frameworks used for implementation.
Experiment Setup | Yes | The pre-training and finetuning stages used the following hyperparameters: a batch size of 64 for both stages, a learning rate of 1e-4, a weight decay of 0.01, and a cosine schedule. Both the pre-training and fine-tuning stages lasted 100,000 steps.
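The split-reuse policy in the Dataset Splits row (every task sharing a dataset sees the same train/validation/test membership) can be sketched as a deterministic hash-based assignment. This is an illustrative sketch only: the function name, the 80/10/10 ratios, and the use of a string molecule identifier are assumptions, and the paper itself reuses the Torsional Diffusion splits directly rather than re-deriving them.

```python
import hashlib

def assign_split(molecule_id: str, ratios=(0.8, 0.1, 0.1)) -> str:
    """Deterministically map a molecule identifier to a split.

    Because the assignment depends only on the identifier, every task
    that shares the dataset sees an identical split, which removes one
    source of train/test leakage between tasks.
    """
    # Hash the identifier to a stable pseudo-uniform fraction in [0, 1).
    digest = int(hashlib.sha256(molecule_id.encode()).hexdigest(), 16)
    frac = (digest % 10**6) / 10**6
    if frac < ratios[0]:
        return "train"
    if frac < ratios[0] + ratios[1]:
        return "valid"
    return "test"
```

Because the mapping is a pure function of the identifier, it can be recomputed independently by each task's data loader without sharing any split files.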
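The 20.73 kg CO2eq figure in the Hardware Specification row is consistent with a standard energy-times-carbon-intensity estimate. In the sketch below, the 164.5 hours and the two A6000 GPUs come from the paper, while the 300 W per-GPU draw and the 0.21 kg CO2eq/kWh grid intensity are assumed values chosen only to illustrate the calculation.

```python
# Back-of-envelope check of the reported emissions estimate.
hours = 164.5            # total training + evaluation time (from the paper)
num_gpus = 2             # two NVIDIA A6000 GPUs (from the paper)
power_kw = 0.300         # assumption: ~300 W board power per A6000
carbon_intensity = 0.21  # assumption: kg CO2eq per kWh, grid-dependent

energy_kwh = hours * num_gpus * power_kw      # ~98.7 kWh
emissions_kg = energy_kwh * carbon_intensity  # ~20.7 kg CO2eq
print(f"{emissions_kg:.2f} kg CO2eq")
```

Under these assumptions the estimate lands within rounding distance of the reported value, which suggests the paper used a similar energy-based methodology.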
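The hyperparameters in the Experiment Setup row can be captured in a small config plus a cosine-decay learning-rate function. The optional warmup argument is an assumption for illustration; the paper only states "a cosine schedule".

```python
import math

# Hyperparameters reported for nach0-pc pre-training and finetuning.
CONFIG = {
    "batch_size": 64,
    "learning_rate": 1e-4,
    "weight_decay": 0.01,
    "total_steps": 100_000,
}

def cosine_lr(step: int, total_steps: int, base_lr: float,
              warmup_steps: int = 0) -> float:
    """Cosine-annealed learning rate, decaying from base_lr to 0.

    warmup_steps is an assumed extension; with the default of 0 it
    reduces to plain cosine decay over total_steps.
    """
    if warmup_steps and step < warmup_steps:
        # Linear warmup from 0 to base_lr (assumption, not from the paper).
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

For example, `cosine_lr(0, 100_000, 1e-4)` returns the full 1e-4, the rate halves to 5e-5 at step 50,000, and it decays to 0 at step 100,000.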