Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems

Authors: Maksim Zhdanov, Max Welling, Jan-Willem van de Meent

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate Erwin's effectiveness across multiple domains, including cosmology, molecular dynamics, PDE solving, and particle fluid dynamics, where it consistently outperforms baseline methods both in accuracy and computational efficiency.
Researcher Affiliation | Collaboration | ¹AMLab, University of Amsterdam; ²CuspAI. Correspondence to: Maksim Zhdanov <EMAIL>.
Pseudocode | Yes | To highlight the simplicity of our method, we provide the pseudocode:
    # coarsening ball tree
    x = rearrange([x, rel_pos], "(n 2l) d -> n (2l d)") @ Wc
    pos = reduce(pos, "(n 2l) d -> n d", "mean")
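The quoted pseudocode pools pairs of sibling leaves in the ball tree into their parent node. A minimal NumPy sketch of that coarsening step, mirroring the einops patterns in comments (shapes and the names `Wc` / `rel_pos` are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def coarsen(x, pos, Wc):
    """Hedged sketch of ball-tree coarsening; siblings stored contiguously.

    x:   (2n, d)        node features
    pos: (2n, p)        node positions
    Wc:  (2*(d+p), k)   coarsening projection (shape is an assumption)
    """
    two_n, d = x.shape
    n = two_n // 2
    # parent position = mean of its two children  ("(n 2) p -> n p", mean)
    parent_pos = pos.reshape(n, 2, -1).mean(axis=1)
    # position of each child relative to its parent
    rel_pos = pos - np.repeat(parent_pos, 2, axis=0)
    # concat features with relative positions, merge sibling pairs, project
    # ("(n 2) d -> n (2 d)" in einops notation)
    h = np.concatenate([x, rel_pos], axis=-1)   # (2n, d+p)
    x_coarse = h.reshape(n, -1) @ Wc            # (n, k)
    return x_coarse, parent_pos
```

Each coarsening step halves the number of nodes while widening the feature that the projection consumes, which is what lets the hierarchy trade resolution for receptive field.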
Open Source Code | Yes | The code is available at https://github.com/maxxxzdn/erwin.
Open Datasets | Yes | To demonstrate our model's ability to capture long-range interactions, we use the cosmology benchmark (Balla et al., 2024), which consists of large-scale point clouds representing potential galaxy distributions. The molecular dynamics dataset consists of single-chain coarse-grained polymers (Webb et al., 2020; Fu et al., 2022) simulated using MD. We benchmark on multiple datasets taken from Li et al. (2023a). Additionally, we evaluate our model on airflow pressure modeling (Umetani & Bickel, 2018; Alkin et al., 2024a). We use EAGLE (Janny et al., 2023), a large-scale benchmark of unsteady fluid dynamics.
Dataset Splits | Yes | Dataset splits followed the original benchmarks:
    - Cosmology: training set varied from 64 to 8192 examples, with validation and test sets of 512 examples each
    - Molecular dynamics: 100 short trajectories for training, 40 long trajectories for testing
    - PDE benchmarks: 1000 training / 200 test examples (except Plasticity: 900/80)
    - ShapeNet-Car: 700 training / 189 test examples
    - EAGLE: 1184 trajectories with an 80%/10%/10% split
Hardware Specification | Yes | All experiments were conducted on a single NVIDIA RTX A6000 GPU with 48GB memory and 16 AMD EPYC 7543 CPUs.
Software Dependencies | Yes | Erwin and all baselines except those for cosmology were implemented in PyTorch 2.6.
Experiment Setup | Yes | All models were trained using the AdamW optimizer (Loshchilov & Hutter, 2019) with weight decay 10^-5. The learning rate was tuned in the range 10^-4 to 10^-3 to minimize loss on the respective validation sets, with cosine decay to 10^-7. Gradient clipping by norm with value 1.0 was applied across all experiments. Early stopping was used only for ShapeNet-Car and molecular dynamics tasks, while all other models were trained until convergence. In every experiment, we normalize inputs to the model. Hyperparameter optimization was performed using grid search with single trials.
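The described optimization recipe (AdamW with weight decay 10^-5, cosine decay to 10^-7, gradient-norm clipping at 1.0) can be sketched in PyTorch as below; the model, step count, and the peak learning rate of 3e-4 are placeholders, not values from the paper, and the real training loop lives in the authors' repository:

```python
import torch

# Placeholder model; Erwin itself would be instantiated here.
model = torch.nn.Linear(16, 1)

# AdamW with weight decay 1e-5, as described in the setup.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-5)

# Cosine decay of the learning rate down to 1e-7 over training.
total_steps = 10_000
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_steps, eta_min=1e-7)

def train_step(batch, target):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), target)
    loss.backward()
    # Gradient clipping by norm with value 1.0.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```

Input normalization and early stopping would sit around this loop in the data pipeline and the outer epoch logic, respectively.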