SimXRD-4M: Big Simulated X-ray Diffraction Data and Crystal Symmetry Classification Benchmark
Authors: Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, Tong-Yi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark 21 sequence models in both in-library and out-of-library scenarios and analyze the impact of class imbalance in long-tailed crystal label distributions. Remarkably, we find that: (1) current neural networks struggle with classifying low-frequency crystals, particularly in out-of-library situations; (2) models trained on SimXRD can generalize to real experimental data. |
| Researcher Affiliation | Academia | 1Guangzhou Municipal Key Laboratory of Materials Informatics, Advanced Materials Thrust, The Hong Kong University of Science and Technology (Guangzhou) 2Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou) 3Sustainable Energy and Environment Thrust, The Hong Kong University of Science and Technology (Guangzhou) 4The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes the XRD simulation method and experimental procedures in natural language and mathematical equations, but it does not contain clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Additionally, we have made the SimXRD database, simulation code, benchmark models, evaluation process and tutorial notebooks into a repository: https://github.com/Bin-Cao/SimXRD. |
| Open Datasets | Yes | To address this, we introduce SimXRD-4M, the largest open-source simulated XRD pattern dataset to date, aimed at accelerating the development of crystallographic informatics. ... Additionally, we have made the SimXRD database, simulation code, benchmark models, evaluation process and tutorial notebooks into a repository: https://github.com/Bin-Cao/SimXRD. |
| Dataset Splits | Yes | For in-library classification, a fundamental task in crystallography, the dataset is randomly split according to the types of simulated environments, resulting in 119,569 × 30 training instances, 119,569 × 1 validation instances, and 119,569 × 2 testing instances. ... Under out-of-library settings, the training and testing XRD patterns are generated from non-overlapping crystals. This setup yields 83,698 × 33 training instances, 11,957 × 33 validation instances, and 23,914 × 33 testing instances. |
| Hardware Specification | Yes | All models are implemented using the PyTorch (Paszke et al., 2019) library and trained on a GeForce RTX 3090 GPU. |
| Software Dependencies | No | All models are implemented using the PyTorch (Paszke et al., 2019) library and trained on a GeForce RTX 3090 GPU. The paper mentions PyTorch but does not provide a specific version number. |
| Experiment Setup | Yes | We use the following hyper-parameters across all experiments: batch size of 128 and learning rate of 2.5 × 10⁻⁴. All models are trained for 50 epochs with an early stopping patience of 3. We use the Cross-Entropy function to measure the difference between predictions and the ground truth. |
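The training recipe in the last row (cross-entropy loss, up to 50 epochs, early stopping with patience 3) can be sketched in framework-agnostic Python. This is a minimal illustration of the stopping rule and loss function the paper names, not the authors' implementation; the helper names and the sample loss sequence below are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample: the negative log of the
    probability the model assigns to the true class."""
    return -math.log(probs[label])

def early_stop_epoch(val_losses, patience=3):
    """Return the 1-indexed epoch at which training halts: when the
    validation loss has failed to improve for `patience` consecutive
    epochs, or after the final epoch otherwise."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Hypothetical run: validation loss improves, then plateaus for 3 epochs.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67, 0.5]
print(early_stop_epoch(losses, patience=3))  # stops at epoch 6
```

With patience 3, the late improvement at epoch 7 is never reached, which is exactly the trade-off early stopping makes to cap the 50-epoch budget.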