Enhancing Molecular Conformer Generation via Fragment-Augmented Diffusion Pretraining
Authors: Xiaozhuang Song, Yuzhao Tu, Tianshu Yu
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive benchmarks show FragDiff's superior performance, especially in data-scarce scenarios. Notably, it achieves 12.2–13.4% performance improvement on molecules 3× beyond training scale through pretraining on fragments. ... Comprehensive empirical evaluations of fragmentation pretraining on two distinct diffusion frameworks, GeoDiff and TorDiff, demonstrate consistent improvements across multiple datasets and settings, particularly in data-scarce regimes... |
| Researcher Affiliation | Academia | Xiaozhuang Song EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen; Yuzhao Tu EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen; Tianshu Yu EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen |
| Pseudocode | Yes | Algorithm 1 Graph-based Molecular Fragmentation. ... Algorithm 2 FragDiff Training and Inference |
| Open Source Code | Yes | The code is available at https://github.com/ShawnKS/fragdiff. |
| Open Datasets | Yes | We utilize three subsets GEOM-QM9, GEOM-DRUGS, and GEOM-XL from the GEOM dataset (Axelrod & Gomez-Bombarelli, 2022) |
| Dataset Splits | Yes | The datasets were randomly divided into training, validation, and test sets with sizes as follows: for GEOM-DRUGS, there are 243,473 training samples, 30,433 validation samples, and 1,000 test samples; for GEOM-QM9, there are 106,586 training samples, 13,323 validation samples, and 1,000 test samples. Since GEOM-XL is used solely for testing, its test set includes all 102 molecules from the MoleculeNet dataset that contain at least 100 atoms. |
| Hardware Specification | Yes | We trained the Torsional Diffusion models on NVIDIA RTX A100 GPUs for 250 epochs using the Adam optimizer for GEOM-DRUGS and GEOM-QM9. |
| Software Dependencies | No | The paper mentions 'MMFF94s force field implemented in RDKit' and 'PSI4 toolkit' but does not provide specific version numbers for these software components or for the programming language/libraries used for implementation. |
| Experiment Setup | Yes | We trained the Torsional Diffusion models on NVIDIA RTX A100 GPUs for 250 epochs using the Adam optimizer for GEOM-DRUGS and GEOM-QM9. The primary hyperparameters were optimized using the validation set, resulting in the following configurations: an initial learning rate of 0.001, a learning rate scheduler with a patience of 20 epochs, 4 network layers, a second-order maximum representation, a cutoff radius r_max of 10 Å, and the inclusion of batch normalization. ... The results reported for FragDiff-T utilize 20 reverse diffusion steps... The minimum fragment size z was set to 10 for both GEOM-DRUGS and GEOM-XL, while no such limit was applied in the GEOM-QM9 experiments. The maximum fragmentation edge number κ is set to 5 for all datasets. |
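
The reported setup and split sizes can be collected into a small configuration object, which may help when trying to reproduce the runs. The sketch below only restates the numbers quoted in the table; every class, field, and function name is hypothetical and is not taken from the authors' repository, and it does not attempt to reproduce the training code itself.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TorDiffSetup:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    epochs: int = 250
    optimizer: str = "Adam"
    initial_lr: float = 1e-3
    scheduler_patience: int = 20        # epochs
    num_layers: int = 4
    max_representation_order: int = 2   # "second-order maximum representation"
    cutoff_radius_angstrom: float = 10.0
    batch_norm: bool = True
    reverse_diffusion_steps: int = 20   # reported for FragDiff-T
    max_fragmentation_edges: int = 5    # kappa, same for all datasets

# Minimum fragment size z per dataset; None means no limit was applied.
MIN_FRAGMENT_SIZE = {"GEOM-DRUGS": 10, "GEOM-XL": 10, "GEOM-QM9": None}

# Split sizes as quoted in the Dataset Splits row; GEOM-XL is test-only.
SPLITS = {
    "GEOM-DRUGS": {"train": 243_473, "val": 30_433, "test": 1_000},
    "GEOM-QM9": {"train": 106_586, "val": 13_323, "test": 1_000},
    "GEOM-XL": {"train": 0, "val": 0, "test": 102},
}

def split_fractions(dataset: str) -> dict:
    """Return each split's share of the dataset's total sample count."""
    sizes = SPLITS[dataset]
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}

if __name__ == "__main__":
    print(TorDiffSetup())
    for name in SPLITS:
        print(name, split_fractions(name))
```

Keeping the values in a frozen dataclass makes accidental edits during experimentation fail loudly, which is one reasonable design choice among several.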