SimXRD-4M: Big Simulated X-ray Diffraction Data and Crystal Symmetry Classification Benchmark
Authors: Bin Cao, Yang Liu, Zinan Zheng, Ruifeng Tan, Jia Li, Tong-Yi Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark 21 sequence models in both in-library and out-of-library scenarios and analyze the impact of class imbalance in long-tailed crystal label distributions. Remarkably, we find that: (1) current neural networks struggle with classifying low-frequency crystals, particularly in out-of-library situations; (2) models trained on SimXRD can generalize to real experimental data. |
| Researcher Affiliation | Academia | 1Guangzhou Municipal Key Laboratory of Materials Informatics, Advanced Materials Thrust, The Hong Kong University of Science and Technology (Guangzhou) 2Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou) 3Sustainable Energy and Environment Thrust, The Hong Kong University of Science and Technology (Guangzhou) 4The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper describes the XRD simulation method and experimental procedures in natural language and mathematical equations, but it does not contain clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Additionally, we have made the SimXRD database, simulation code, benchmark models, evaluation process and tutorial notebooks into a repository: https://github.com/Bin-Cao/SimXRD. |
| Open Datasets | Yes | To address this, we introduce SimXRD-4M, the largest open-source simulated XRD pattern dataset to date, aimed at accelerating the development of crystallographic informatics. ... Additionally, we have made the SimXRD database, simulation code, benchmark models, evaluation process and tutorial notebooks into a repository: https://github.com/Bin-Cao/SimXRD. |
| Dataset Splits | Yes | For in-library classification, a fundamental task in crystallography, the dataset is randomly split according to the types of simulated environments, resulting in 119,569 × 30 training instances, 119,569 × 1 validation instances, and 119,569 × 2 testing instances. ... Under out-of-library settings, the training and testing XRD patterns are generated from non-overlapping crystals. This setup yields 83,698 × 33 training instances, 11,957 × 33 validation instances, and 23,914 × 33 testing instances. |
| Hardware Specification | Yes | All models are implemented using the PyTorch (Paszke et al., 2019) library and trained on a GeForce RTX 3090 GPU. |
| Software Dependencies | No | All models are implemented using the PyTorch (Paszke et al., 2019) library and trained on a GeForce RTX 3090 GPU. The paper mentions PyTorch but does not provide a specific version number. |
| Experiment Setup | Yes | We use the following hyper-parameters across all experiments: batch size of 128 and learning rate of 2.5 × 10⁻⁴. All models are trained for 50 epochs with an early stopping patience of 3. We use the Cross-Entropy function to measure the difference between predictions and the ground truth. |
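The training recipe in the last row (cross-entropy loss, up to 50 epochs, early stopping with patience 3) can be sketched in framework-agnostic Python. This is a minimal illustration of the stopping rule and loss function the paper names, not the authors' implementation; the helper names and the sample loss sequence below are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy loss for one sample: the negative log of the
    probability the model assigns to the true class."""
    return -math.log(probs[label])

def early_stop_epoch(val_losses, patience=3):
    """Return the 1-indexed epoch at which training halts: when the
    validation loss has failed to improve for `patience` consecutive
    epochs, or after the final epoch otherwise."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses)

# Hypothetical run: validation loss improves, then plateaus for 3 epochs.
losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67, 0.5]
print(early_stop_epoch(losses, patience=3))  # stops at epoch 6
```

With patience 3, the late improvement at epoch 7 is never reached, which is exactly the trade-off early stopping makes to cap the 50-epoch budget.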