Enhancing the Scalability and Applicability of Kohn-Sham Hamiltonians for Molecular Systems
Authors: Yunyang Li, Zaishuo Xia, Lin Huang, Xinran Wei, Samuel Harshe, Han Yang, Erpai Luo, Zun Wang, Jia Zhang, Chang Liu, Bin Shao, Mark Gerstein
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this study, we generate a substantially larger training set (PubChemQH) than used previously and use it to create a scalable model for DFT calculations with physical accuracy. For our model, we introduce a loss function derived from physical principles, which we call Wavefunction Alignment Loss (WALoss). WALoss involves performing a basis change on the predicted Hamiltonian to align it with the observed one; the resulting differences can then serve as a surrogate for orbital-energy differences, allowing models to make better predictions for molecular orbitals and total energies than previously possible. WALoss also substantially accelerates self-consistent-field (SCF) DFT calculations. Here, we show it achieves a 1347-fold reduction in total energy prediction error and an 18% speed-up in SCF calculations. These substantial improvements set new benchmarks for accurate and applicable predictions in larger molecular systems. |
| Researcher Affiliation | Collaboration | Yale University, UC Davis, MSR AI4Science |
| Pseudocode | Yes | Algorithm 1: Simultaneous reduction of a matrix pair (H, S). Require: ground-truth Hamiltonian matrix H and overlap matrix S. Ensure: diagonal matrix ϵ and matrix C such that C^T S C = I and C^T H C = ϵ. 1: Compute the Cholesky decomposition S = G G^T. 2: Define M = G^{-1} H G^{-T}. 3: Apply the symmetric QR algorithm to find the Schur form Q^T M Q = ϵ. 4: Compute C = G^{-T} Q. |
| Open Source Code | No | The paper does not explicitly state that the authors are releasing their source code for the methodology described (WANet, WALoss), nor does it provide a link to a code repository. It only mentions using official implementations for other baseline models. |
| Open Datasets | Yes | In our study, we investigated the scalability of Hamiltonian learning by utilizing a CUDA-accelerated SCF implementation (Ju et al., 2024) to perform computational quantum chemistry calculations, thereby generating the PubChemQH dataset. We began with geometries from the PubChemQC dataset of Nakata & Maeda (2023), selecting only molecules with a molecular weight above 400. The QH9 dataset is a comprehensive quantum chemistry resource designed to support the development and evaluation of machine learning models for predicting quantum Hamiltonian matrices. Built upon the QM9 dataset, QH9 contains Hamiltonian matrices for 130,831 stable molecular geometries. |
| Dataset Splits | Yes | For the PubChemQH dataset, we used an 80/10/10 train/validation/test split, resulting in 40,257 training molecules, 5,032 validation molecules, and 5,032 test molecules. |
| Hardware Specification | Yes | Generating this comprehensive dataset represents a substantial computational effort, requiring approximately one month of continuous processing using 128 NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions software components like "CUDA-accelerated SCF implementation" and "Adam optimizer" and using "pyscf" without providing specific version numbers for these or other key software dependencies like Python or PyTorch. |
| Experiment Setup | Yes | We trained all models for a maximum of 300,000 steps with a batch size of 8, using early stopping with a patience of 1,000 steps. WANet converges at 278,391 steps, QHNet at 258,267 steps, and PhiSNet at 123,170 steps. All models used the Adam optimizer with a learning rate of 0.001 for PubChemQH, along with a polynomial learning rate scheduler with 1,000 warmup steps. We used gradient clipping at 1.0 and a radius cutoff of 5 Å. For QHNet and PhiSNet, we used the official implementations of these two models. Table 12: Hyperparameter settings for the experimental study using the PubChemQH and QH9 datasets. Table 13: The training details for the HOMO, LUMO, and GAP predictions. |
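Algorithm 1 quoted above is a standard simultaneous reduction of a symmetric/symmetric-positive-definite pair, i.e. a solve of the generalized eigenproblem defined by (H, S). A minimal NumPy sketch, using `numpy.linalg.eigh` as a stand-in for the symmetric QR step (which computes the same Schur form up to eigenvalue ordering):

```python
import numpy as np

def simultaneous_reduction(H, S):
    """Find eps, C with C^T S C = I and C^T H C = diag(eps).

    H: symmetric (ground-truth Hamiltonian), S: symmetric positive definite (overlap).
    """
    G = np.linalg.cholesky(S)        # step 1: S = G G^T (G lower triangular)
    Ginv = np.linalg.inv(G)
    M = Ginv @ H @ Ginv.T            # step 2: M = G^{-1} H G^{-T}, symmetric
    eps, Q = np.linalg.eigh(M)       # step 3: Schur form Q^T M Q = diag(eps)
    C = Ginv.T @ Q                   # step 4: C = G^{-T} Q
    return eps, C
```

In production one would use `scipy.linalg.eigh(H, S)` (or triangular solves instead of `np.linalg.inv`), which performs the same reduction internally via LAPACK.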
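The abstract describes WALoss as a basis change on the predicted Hamiltonian whose residuals serve as a surrogate for orbital-energy differences. The paper's exact loss is not reproduced in this report; the following is one plausible, hedged reading in which the transformation C obtained from the ground-truth pair (H, S) is applied to the predicted Hamiltonian, and the diagonal residual against the true orbital energies is penalized. The function name and the mean-squared form are assumptions, not the authors' definition:

```python
import numpy as np
from scipy.linalg import eigh

def waloss_surrogate(H_pred, H_true, S):
    """Hedged sketch of a WALoss-style surrogate (loss form assumed).

    Solve the ground-truth generalized eigenproblem H_true C = S C diag(eps),
    then express H_pred in that aligned basis; its diagonal approximates the
    predicted orbital energies, so the diagonal residual stands in for the
    orbital-energy error.
    """
    eps_true, C = eigh(H_true, S)     # C^T S C = I, C^T H_true C = diag(eps_true)
    H_pred_rot = C.T @ H_pred @ C     # predicted Hamiltonian in the aligned basis
    return np.mean((np.diag(H_pred_rot) - eps_true) ** 2)
```

By construction this surrogate vanishes when the predicted and ground-truth Hamiltonians coincide, and it grows with deviations that shift the predicted orbital energies.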