PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations

Authors: Benjamin Holzschuh, Qiang Liu, Georg Kohl, Nils Thuerey

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We perform a detailed ablation study on accuracy-compute tradeoffs when scaling and modifying PDE-Transformer. We evaluate the performance of PDE-Transformer for autoregressive prediction with T_p = 1 preceding snapshot. Our experiments are divided into two parts: first, we compare PDE-Transformer to other SOTA transformer architectures on a large pre-training set of different PDEs, focusing on accuracy, training time, and required compute. We motivate our design choices for PDE-Transformer in an ablation study. Second, we finetune the pre-trained network on three different challenging downstream tasks involving new boundary conditions, different resolutions, physical channels, and domain sizes, showcasing its generalization capabilities to out-of-distribution data.
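The autoregressive setting quoted above (T_p = 1 preceding snapshot) amounts to a simple rollout loop in which each prediction is fed back as the next input. The sketch below illustrates that pattern only; the model and tensor shapes are placeholders, not the paper's implementation:

```python
import torch

def autoregressive_rollout(model, u0, n_steps):
    """Roll out a one-step surrogate autoregressively.

    With a single preceding snapshot (T_p = 1), each prediction is
    conditioned only on the model's previous output.
    """
    states = [u0]
    with torch.no_grad():
        for _ in range(n_steps):
            states.append(model(states[-1]))
    return torch.stack(states)  # shape: (n_steps + 1, *u0.shape)
```

Rollout error compounds under this scheme, which is why autoregressive prediction is a common stress test for PDE surrogates.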
Researcher Affiliation | Academia | School of Computation, Information and Technology, Technical University of Munich, Germany. Correspondence to: Benjamin Holzschuh <EMAIL>.
Pseudocode | Yes | Algorithm 1: EMA Gradient Clip.
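The paper's Algorithm 1 is not reproduced in this excerpt. A common form of EMA-based gradient clipping tracks a running average of the global gradient norm and clips to a multiple of it; the sketch below follows that general recipe, with illustrative hyperparameters (`beta`, `factor`) that are not taken from the paper:

```python
import torch

def ema_grad_clip(params, state, beta=0.99, factor=2.0):
    """Clip the global gradient norm to `factor` times its running EMA.

    `state["ema"]` carries the EMA across optimizer steps; the first call
    initializes it from the observed norm, so no clipping happens then.
    """
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.norm(torch.stack([g.norm() for g in grads])).item()
    if state.get("ema") is None:
        state["ema"] = total_norm
    max_norm = factor * state["ema"]
    if total_norm > max_norm:
        # rescale all gradients in place so their global norm equals max_norm
        scale = max_norm / (total_norm + 1e-6)
        for g in grads:
            g.mul_(scale)
        total_norm = max_norm
    # update the EMA with the (possibly clipped) norm
    state["ema"] = beta * state["ema"] + (1 - beta) * total_norm
    return total_norm
```

Compared to a fixed clipping threshold, this adapts to the typical gradient magnitude of the run while still suppressing rare spikes.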
Open Source Code | Yes | Our source code is available at https://github.com/tum-pbs/pde-transformer.
Open Datasets | Yes | The datasets are based on APEBench (Koehler et al., 2024), and described in detail in Appendix C. [...] All the simulations are from the Well dataset (Ohana et al., 2024).
Dataset Splits | Yes | For each type of PDE, we consider 600 trajectories of 30 simulation steps each, which are randomly split into a fixed training, validation and test set. [...] Validation Set: random 15% split of all sequences from s ∈ [0, 500[ [...] Test Set: all sequences from s ∈ [500, 600[ [...] For each provided dataset, a predefined data split in training, validation, and test set already exists in the Well dataset. We randomly select 42, 8, and 10 trajectories from the corresponding split data for training, validation, and testing, respectively.
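The split scheme described in that excerpt (trajectories 500-599 held out for testing, a random 15% of the first 500 for validation) takes only a few lines to reproduce. The function below is an illustrative reconstruction; the actual seed and sampling code are not given in the excerpt:

```python
import random

def split_trajectories(n_total=600, test_start=500, val_frac=0.15, seed=0):
    """Trajectories [test_start, n_total) form the test set; val_frac of
    [0, test_start) are sampled at random for validation; the remainder
    are training data. The seed and sampler here are assumptions."""
    rng = random.Random(seed)
    test = list(range(test_start, n_total))
    pool = list(range(test_start))
    val = set(rng.sample(pool, int(val_frac * test_start)))
    train = [i for i in pool if i not in val]
    return train, sorted(val), test
```

Splitting at the trajectory level (rather than per snapshot) avoids leakage between consecutive states of the same simulation.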
Hardware Specification | Yes | Table 1. Training S configurations for 100 epochs on 4x H100 GPUs. [...] The training time for 100 epochs is reported on 4x H100 GPUs.
Software Dependencies | No | We employ the Distributed Data Parallel (DDP) strategy, as supported by PyTorch Lightning (Falcon & The PyTorch Lightning team, 2019), to train models across multiple GPUs.
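A minimal Lightning `Trainer` configuration matching the quoted setup might look as follows; only the DDP strategy, the 4-GPU count, and the 100-epoch budget come from the paper's excerpts, everything else is an assumption:

```python
import pytorch_lightning as pl

# Sketch of a multi-GPU training launch; a LightningModule and
# DataLoaders (not shown) would be passed to trainer.fit(...).
trainer = pl.Trainer(
    accelerator="gpu",
    devices=4,        # 4x H100 as reported
    strategy="ddp",   # Distributed Data Parallel
    max_epochs=100,   # pre-training epoch count from Table 4
)
```

Under DDP, Lightning spawns one process per device and averages gradients across them, so the effective batch size is the per-device batch size times the device count.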
Experiment Setup | Yes | Table 4. Major hyperparameters for training: effective batch size 256 (pre-training) and 256 (downstream tasks); learning rate 4.00 × 10^-5 (pre-training), and 1.00 × 10^-4 (Active matter & RBC) or 4.00 × 10^-5 (Shear flow) for downstream tasks; optimizer AdamW for both; epochs 100 (pre-training) and 2000 (downstream tasks).
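The optimizer settings from Table 4 translate directly into PyTorch. In this sketch the model is a stand-in, and weight decay is left at AdamW's default because the excerpt does not state it:

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder for PDE-Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=4.00e-5)  # pre-training LR
# Downstream finetuning uses lr=1.00e-4 (Active matter & RBC)
# or lr=4.00e-5 (Shear flow), per Table 4.
# The effective batch size of 256 could come from a per-GPU batch of 64
# on 4 GPUs, or from gradient accumulation (an assumption, not stated).
```

A higher finetuning learning rate for some downstream tasks than for pre-training is unusual but consistent with those tasks being strongly out-of-distribution.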