EWMoE: An Effective Model for Global Weather Forecasting with Mixture-of-Experts

Authors: Lihao Gan, Xin Man, Chenghong Zhang, Jie Shao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct our evaluation on the ERA5 dataset using only two years of training data. Extensive experiments demonstrate that EWMoE outperforms current models such as FourCastNet and ClimaX at all forecast lead times, achieving competitive performance compared with the state-of-the-art models Pangu-Weather and GraphCast in evaluation metrics such as Anomaly Correlation Coefficient (ACC) and Root Mean Square Error (RMSE). Additionally, ablation studies indicate that applying the MoE architecture to weather forecasting offers significant advantages in improving accuracy and resource efficiency.
Researcher Affiliation | Academia | 1 University of Electronic Science and Technology of China, Chengdu, China; 2 Sichuan Artificial Intelligence Research Institute, Yibin, China; 3 Institute of Plateau Meteorology, China Meteorological Administration, Chengdu, China. EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes methods and equations but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our implementation code is available at https://github.com/technomii/EWMoE.
Open Datasets | Yes | ERA5 (Hersbach et al. 2020) is a publicly available atmospheric reanalysis dataset produced by the European Centre for Medium-Range Weather Forecasts (ECMWF).
Dataset Splits | Yes | In addition, to demonstrate the effectiveness of our model in the case of limited data and computing resources, we use two years of data for training (2015 and 2016), one year for validation (2017), and one year for testing (2018).
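The year-based split quoted above can be sketched as a simple partition over sample timestamps. This is a minimal illustration; the function and variable names are hypothetical and not taken from the authors' code.

```python
from datetime import datetime

# Hypothetical helper illustrating the split described in the paper:
# 2015-2016 for training, 2017 for validation, 2018 for testing.
def split_by_year(timestamps):
    splits = {"train": [], "val": [], "test": []}
    for ts in timestamps:
        if ts.year in (2015, 2016):
            splits["train"].append(ts)
        elif ts.year == 2017:
            splits["val"].append(ts)
        elif ts.year == 2018:
            splits["test"].append(ts)
    return splits

# Two dummy samples per year, 2015 through 2018:
samples = [datetime(y, m, 1) for y in range(2015, 2019) for m in (1, 7)]
counts = {k: len(v) for k, v in split_by_year(samples).items()}
print(counts)  # {'train': 4, 'val': 2, 'test': 2}
```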
Hardware Specification | Yes | The training of EWMoE was completed in under 9 days on 2 NVIDIA 3090 GPUs.
Software Dependencies | No | The paper mentions using the AdamW optimizer but does not specify versions for any key software libraries or frameworks (e.g., PyTorch or TensorFlow).
Experiment Setup | Yes | Each input sample from the ERA5 dataset can be represented as an image with 20 channels. We set the patch size as 8×8, and the EWMoE model consists of encoders with depth=6, dim=768 and decoders with depth=6, dim=512. Each encoder has a MoE layer, and each MoE layer consists of 20 independent experts. Specifically, in the gating network of each MoE layer, we use top-2 routing to select the top-2 ranked experts for forward propagation of training data. We employ the AdamW optimizer with two momentum parameters β1=0.9 and β2=0.95, and set the weight decay to 0.05.
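The top-2 routing described in the setup above can be sketched as a softmax gate that keeps only the two highest-scoring of the 20 experts. This is a minimal pure-Python illustration, not the authors' implementation; all names are hypothetical.

```python
import math

NUM_EXPERTS = 20  # each MoE layer in EWMoE has 20 independent experts

def top2_route(gate_logits):
    """Sketch of top-2 routing: softmax over the gate logits,
    keep the two highest-scoring experts, renormalise their weights."""
    m = max(gate_logits)
    exps = [math.exp(g - m) for g in gate_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # indices of the two highest-probability experts
    top2 = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:2]
    # renormalise the two selected weights so they sum to 1
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]  # (expert index, weight)

# A token whose gate happens to favour experts 3 and 7:
logits = [0.0] * NUM_EXPERTS
logits[3], logits[7] = 2.0, 1.5
routing = top2_route(logits)
print(routing)
```

Only the two selected experts run a forward pass for this token, which is the source of the resource efficiency the ablation studies report.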