Enhancing Molecular Conformer Generation via Fragment-Augmented Diffusion Pretraining

Authors: Xiaozhuang Song, Yuzhao Tu, Tianshu Yu

TMLR 2025

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Comprehensive benchmarks show FragDiff's superior performance, especially in data-scarce scenarios. Notably, it achieves 12.2–13.4% performance improvement on molecules 3× beyond training scale through pretraining on fragments. ... Comprehensive empirical evaluations of fragmentation pretraining on two distinct diffusion frameworks, GeoDiff and TorDiff, demonstrate consistent improvements across multiple datasets and settings, particularly in data-scarce regimes... |
| Researcher Affiliation | Academia | Xiaozhuang Song EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen; Yuzhao Tu EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen; Tianshu Yu EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | Algorithm 1 Graph-based Molecular Fragmentation. ... Algorithm 2 FragDiff Training and Inference |
| Open Source Code | Yes | The code is available at https://github.com/ShawnKS/fragdiff. |
| Open Datasets | Yes | We utilize three subsets GEOM-QM9, GEOM-DRUGS, and GEOM-XL from the GEOM dataset (Axelrod & Gomez-Bombarelli, 2022) |
| Dataset Splits | Yes | The datasets were randomly divided into training, validation, and test sets with sizes as follows: for GEOM-DRUGS, there are 243,473 training samples, 30,433 validation samples, and 1,000 test samples; for GEOM-QM9, there are 106,586 training samples, 13,323 validation samples, and 1,000 test samples. Since GEOM-XL is used solely for testing, its test set includes all 102 molecules from the MoleculeNet dataset that contain at least 100 atoms. |
| Hardware Specification | Yes | We trained the Torsional Diffusion models on NVIDIA RTX A100 GPUs for 250 epochs using the Adam optimizer for GEOM-DRUGS and GEOM-QM9. |
| Software Dependencies | No | The paper mentions 'MMFF94s force field implemented in RDKit' and 'PSI4 toolkit' but does not provide specific version numbers for these software components or for the programming language/libraries used for implementation. |
| Experiment Setup | Yes | We trained the Torsional Diffusion models on NVIDIA RTX A100 GPUs for 250 epochs using the Adam optimizer for GEOM-DRUGS and GEOM-QM9. The primary hyperparameters were optimized using the validation set, resulting in the following configurations: an initial learning rate of 0.001, a learning rate scheduler with a patience of 20 epochs, 4 network layers, a second-order maximum representation, a cutoff radius r_max of 10 Å, and the inclusion of batch normalization. ... The results reported for FragDiff-T utilize 20 reverse diffusion steps... The minimum fragment size z was set to 10 for both GEOM-DRUGS and GEOM-XL, while no such limit was applied in the GEOM-QM9 experiments. The maximum fragmentation edge number κ is set to 5 for all datasets. |
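
For readers who want the reported setup in machine-readable form, the sketch below collects the split sizes and hyperparameters quoted above into plain Python dataclasses and derives the split fractions from the sample counts. All class, field, and variable names (`SplitSizes`, `TrainConfig`, `SPLITS`, `MIN_FRAGMENT_SIZE`, `summarize`) are hypothetical and chosen for illustration only; they are not taken from the authors' repository, and the values are only those stated in the quoted text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class SplitSizes:
    """Sample counts for one dataset, as quoted in the Dataset Splits row."""
    train: int
    val: int
    test: int

    @property
    def total(self) -> int:
        return self.train + self.val + self.test

    def fractions(self) -> Tuple[float, float, float]:
        """Return (train, val, test) shares of the total sample count."""
        t = self.total
        return (self.train / t, self.val / t, self.test / t)


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the Experiment Setup row (names are assumptions)."""
    epochs: int = 250
    optimizer: str = "adam"
    initial_lr: float = 1e-3
    scheduler_patience: int = 20          # epochs
    num_layers: int = 4
    max_representation_order: int = 2     # "second-order maximum representation"
    cutoff_radius_angstrom: float = 10.0  # r_max
    batch_norm: bool = True
    reverse_diffusion_steps: int = 20     # reported for FragDiff-T
    max_fragmentation_edges: int = 5      # kappa, same for all datasets


# Split sizes as reported; GEOM-XL is test-only (102 molecules) and is omitted here.
SPLITS: Dict[str, SplitSizes] = {
    "GEOM-DRUGS": SplitSizes(train=243_473, val=30_433, test=1_000),
    "GEOM-QM9": SplitSizes(train=106_586, val=13_323, test=1_000),
}

# Minimum fragment size z per dataset; None means no limit was applied.
MIN_FRAGMENT_SIZE: Dict[str, Optional[int]] = {
    "GEOM-DRUGS": 10,
    "GEOM-XL": 10,
    "GEOM-QM9": None,
}


def summarize() -> List[str]:
    """Format one line per dataset with its total size and split fractions."""
    lines = []
    for name, split in SPLITS.items():
        tr, va, te = split.fractions()
        lines.append(
            f"{name}: total={split.total}, "
            f"train={tr:.3f}, val={va:.3f}, test={te:.3f}"
        )
    return lines


if __name__ == "__main__":
    print(TrainConfig())
    for line in summarize():
        print(line)
```

Because the test sets are fixed at 1,000 samples, the derived fractions are not a round 80/10/10 split; computing them from the counts avoids assuming a ratio the quoted text does not state.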