Geometry Informed Tokenization of Molecules for Language Model Generation

Authors: Xiner Li, Limei Wang, Youzhi Luo, Carl Edwards, Shurui Gui, Yuchao Lin, Heng Ji, Shuiwang Ji

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that, when coupled with Geo2Seq, various LMs excel in molecular geometry generation, especially in controlled generation tasks. We show that different LMs with Geo2Seq can reliably produce valid and diverse 3D molecules and outperform strong diffusion-based baselines by a large margin in conditional generation. In this section, we evaluate the method of generating 3D molecules in the form of our proposed Geo2Seq representations by LLMs.
Researcher Affiliation | Academia | *Equal contribution. 1Texas A&M University, 2University of Illinois Urbana-Champaign. Correspondence to: Shuiwang Ji <EMAIL>.
Pseudocode | No | The paper describes methods verbally and uses mathematical formulations but does not present any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code has been released as part of the AIRS library (https://github.com/divelab/AIRS/).
Open Datasets | Yes | We adopt two datasets, QM9 (Ramakrishnan et al., 2014) and GEOM-DRUGS (Axelrod & Gomez-Bombarelli, 2022), to evaluate performances in the random generation task.
Dataset Splits | Yes | Following Anderson et al. (2019), we split the dataset into train, validation, and test sets with 100k, 18k, and 12k samples, respectively. The GEOM-DRUGS dataset consists of over 450k large molecules with 37 million DFT-calculated 3D structures. Molecules in GEOM-DRUGS have up to 181 atoms, with 44.2 atoms on average. We follow Hoogeboom et al. (2022) in selecting the 30 3D structures with the lowest energies per molecule for model training.
Hardware Specification | Yes | All experiments on the QM9 dataset are conducted using a single NVIDIA A6000 GPU. Experiments on the GEOM-DRUGS dataset are deployed on 4 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions using GPT and Mamba models, the AdamW optimizer, and RDKit, but does not provide specific version numbers for any of these software components or programming languages/libraries.
Experiment Setup | Yes | On the QM9 dataset, we set the training batch size to 32 and the base learning rate to 0.0004, and train a 12-layer GPT model and a 26-layer Mamba model. On the GEOM-DRUGS dataset, we use the same batch size and base learning rate, and train a 14-layer GPT model and a 28-layer Mamba model. During training, we use the AdamW (Loshchilov & Hutter, 2019) optimizer with the commonly used linear-warmup, cosine-decay learning-rate schedule: the learning rate first increases linearly from zero to the base rate of 0.0004 over the first 10% of total training tokens, then decays to 0.00004 following the cosine schedule. The tokenization of real numbers uses a precision of two and three decimal places for the QM9 and GEOM-DRUGS datasets, respectively. In the controllable generation experiment (Section 5.2), we train 16-layer Mamba models for 200 epochs; all other hyperparameters and settings are the same as in the random generation experiment. Based on data statistics, we set the context length to 512 for the QM9 dataset and 744 for the GEOM-DRUGS dataset throughout the experiments.
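The learning-rate schedule quoted in the Experiment Setup row (linear warmup to 0.0004 over the first 10% of training tokens, then cosine decay to 0.00004) can be sketched as below. This is a minimal illustration of the stated schedule only; the function name, token-based step counting, and defaults are assumptions, not taken from the paper's released code.

```python
import math

def lr_at(tokens_seen: int, total_tokens: int,
          base_lr: float = 4e-4, final_lr: float = 4e-5) -> float:
    """Learning rate after `tokens_seen` of `total_tokens` training tokens."""
    warmup = 0.1 * total_tokens  # first 10% of tokens: linear warmup
    if tokens_seen < warmup:
        return base_lr * tokens_seen / warmup
    # remaining 90%: cosine decay from base_lr down to final_lr
    progress = (tokens_seen - warmup) / (total_tokens - warmup)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))
```

At the 10% mark the rate peaks at the base value of 0.0004 and decays smoothly to 0.00004 by the end of training.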