reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Enhancing Molecular Conformer Generation via Fragment-Augmented Diffusion Pretraining

Authors: Xiaozhuang Song, Yuzhao Tu, Tianshu Yu

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Comprehensive benchmarks show FragDiff's superior performance, especially in data-scarce scenarios. Notably, it achieves 12.2–13.4% performance improvement on molecules 3× beyond training scale through pretraining on fragments. ... Comprehensive empirical evaluations of fragmentation pretraining on two distinct diffusion frameworks, GeoDiff and TorDiff, demonstrate consistent improvements across multiple datasets and settings, particularly in data-scarce regimes...
Researcher Affiliation	Academia	Xiaozhuang Song EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen; Yuzhao Tu EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen; Tianshu Yu EMAIL School of Data Science The Chinese University of Hong Kong, Shenzhen
Pseudocode	Yes	Algorithm 1 Graph-based Molecular Fragmentation. ... Algorithm 2 FragDiff Training and Inference
Open Source Code	Yes	The code is available at https://github.com/ShawnKS/fragdiff.
Open Datasets	Yes	We utilize three subsets GEOM-QM9, GEOM-DRUGS, and GEOM-XL from the GEOM dataset (Axelrod & Gomez-Bombarelli, 2022)
Dataset Splits	Yes	The datasets were randomly divided into training, validation, and test sets with sizes as follows: for GEOM-DRUGS, there are 243,473 training samples, 30,433 validation samples, and 1,000 test samples; for GEOM-QM9, there are 106,586 training samples, 13,323 validation samples, and 1,000 test samples. Since GEOM-XL is used solely for testing, its test set includes all 102 molecules from the MoleculeNet dataset that contain at least 100 atoms.
Hardware Specification	Yes	We trained the Torsional Diffusion models on NVIDIA RTX A100 GPUs for 250 epochs using the Adam optimizer for GEOM-DRUGS and GEOM-QM9.
Software Dependencies	No	The paper mentions 'MMFF94s force field implemented in RDKit' and 'PSI4 toolkit' but does not provide specific version numbers for these software components or for the programming language/libraries used for implementation.
Experiment Setup	Yes	We trained the Torsional Diffusion models on NVIDIA RTX A100 GPUs for 250 epochs using the Adam optimizer for GEOM-DRUGS and GEOM-QM9. The primary hyperparameters were optimized using the validation set, resulting in the following configurations: an initial learning rate of 0.001, a learning rate scheduler with a patience of 20 epochs, 4 network layers, a second-order maximum representation, a cutoff radius r_max of 10 Å, and the inclusion of batch normalization. ... The results reported for FragDiff-T utilize 20 reverse diffusion steps... The minimum fragment size z was set to 10 for both GEOM-DRUGS and GEOM-XL, while no such limit was applied in the GEOM-QM9 experiments. The maximum fragmentation edge number κ is set to 5 for all datasets.
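
For readers who want to re-enter the reported setup by hand, the values quoted in the Dataset Splits and Experiment Setup rows can be collected into a small configuration record. This is only a sketch: the class name, field names, and the `split_fractions` helper are hypothetical and are not taken from the authors' repository; only the numeric values come from the table above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters as quoted in the table (field names are hypothetical)."""
    epochs: int = 250
    optimizer: str = "Adam"
    initial_lr: float = 1e-3
    scheduler_patience_epochs: int = 20
    num_layers: int = 4
    max_representation_order: int = 2
    cutoff_radius_angstrom: float = 10.0   # r_max
    batch_norm: bool = True
    reverse_diffusion_steps: int = 20      # reported for FragDiff-T
    max_fragmentation_edges: int = 5       # kappa, same for all datasets
    # Minimum fragment size z per dataset; None means no limit was applied.
    min_fragment_size: Dict[str, Optional[int]] = field(
        default_factory=lambda: {
            "GEOM-DRUGS": 10,
            "GEOM-XL": 10,
            "GEOM-QM9": None,
        }
    )


# (train, validation, test) sample counts as quoted in the Dataset Splits row.
REPORTED_SPLITS: Dict[str, Tuple[int, int, int]] = {
    "GEOM-DRUGS": (243_473, 30_433, 1_000),
    "GEOM-QM9": (106_586, 13_323, 1_000),
}


def split_fractions(dataset: str) -> Tuple[float, float, float]:
    """Return the train/validation/test shares implied by the quoted counts."""
    train, val, test = REPORTED_SPLITS[dataset]
    total = train + val + test
    return train / total, val / total, test / total


if __name__ == "__main__":
    setup = ReportedSetup()
    print(setup)
    for name in REPORTED_SPLITS:
        tr, va, te = split_fractions(name)
        print(f"{name}: train={tr:.3f} val={va:.3f} test={te:.3f}")
```

The software versions are deliberately left out of the record, since the Software Dependencies row notes that the paper does not state them.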