- ¹Forschungszentrum Jülich, Jülich Supercomputing Centre (JSC), Jülich, Germany (wael.almikaeel.95@gmail.com)
- *A full list of authors appears at the end of the abstract
Foundation models have shown strong potential for data-driven weather and climate forecasting by supporting multiple tasks with limited task-specific engineering. This approach depends on architectures that extract maximum value from large, heterogeneous datasets. WeatherGenerator follows this paradigm, learning from diverse observational and reanalysis sources to encode a latent representation of atmospheric dynamics. In this work, we examine the impact of integrating Mixture of Experts (MoE) layers, originally developed for large language models, into WeatherGenerator, and assess how MoE can best be incorporated within its decoder architecture.
The motivation behind MoE is straightforward: during training, a router learns to assign tokens to specialized experts, allowing different parts of the decoder to focus on distinct spatial regions or physical regimes. We build on this idea by introducing spatially aware routing, in which geographic context is provided to the router, and by evaluating loss-aware routing strategies that favor the experts that minimize local prediction errors.
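To illustrate the routing idea, the following is a minimal PyTorch-style sketch of a top-k MoE feed-forward block in which per-token geographic features are concatenated to the token representation before the routing logits are computed. All names (`SpatialTopKRouter`, `geo_feats`, and so on) are hypothetical and do not reflect the actual WeatherGenerator codebase; the loss-aware variant, which would additionally bias routing toward experts with lower local prediction error, is not shown.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTopKRouter(nn.Module):
    """Illustrative top-k MoE router conditioned on geographic context.

    Per-token geographic features (e.g., sin/cos-encoded lat/lon) are
    concatenated to the token features before computing routing logits,
    so the router can specialize experts by region. Names are hypothetical.
    """
    def __init__(self, d_model: int, d_geo: int, n_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model + d_geo, n_experts)
        self.k = k

    def forward(self, tokens: torch.Tensor, geo_feats: torch.Tensor):
        # tokens: (batch, seq, d_model); geo_feats: (batch, seq, d_geo)
        logits = self.gate(torch.cat([tokens, geo_feats], dim=-1))
        weights, idx = torch.topk(logits, self.k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)               # normalize over chosen experts
        return weights, idx

class MoEFeedForward(nn.Module):
    """Sparse feed-forward block: each token is processed by its top-k experts."""
    def __init__(self, d_model: int, d_geo: int, d_ff: int, n_experts: int, k: int = 2):
        super().__init__()
        self.router = SpatialTopKRouter(d_model, d_geo, n_experts, k)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, tokens: torch.Tensor, geo_feats: torch.Tensor):
        weights, idx = self.router(tokens, geo_feats)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for slot in range(idx.shape[-1]):
                mask = idx[..., slot] == e  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(tokens[mask])
        return out
```

Load-balancing auxiliary losses, expert capacity limits, and the loss-aware routing bias are omitted here for brevity.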
We evaluate four MoE decoder configurations, defined by whether spatial context and loss-aware routing are used, and compare them against the baseline model. Experiments are conducted on ERA5 reanalysis data, with performance measured by global RMSE and MAE for the wind components (u, v), temperature (2t, t850), geopotential height (z500), and specific humidity over three 6-hour autoregressive forecast steps.
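For concreteness, a minimal sketch of the scoring step follows, assuming NumPy arrays on a regular latitude-longitude grid. The abstract does not state whether the reported global scores use area weighting, so the cosine-latitude weighting shown here is a common convention rather than the study's exact procedure.

```python
import numpy as np

def global_scores(forecast: np.ndarray, truth: np.ndarray):
    """Area-weighted global RMSE and MAE for one variable at one lead time.

    `forecast` and `truth` are (lat, lon) fields on a regular grid spanning
    pole to pole; cos(latitude) weights approximate grid-cell area.
    """
    err = forecast - truth
    lat = np.linspace(-90.0, 90.0, err.shape[0])
    weights = np.broadcast_to(np.cos(np.deg2rad(lat))[:, None], err.shape)
    rmse = float(np.sqrt(np.average(err**2, weights=weights)))
    mae = float(np.average(np.abs(err), weights=weights))
    return rmse, mae

# Hypothetical usage: score each autoregressive step against ERA5 truth.
# for step, (pred, obs) in enumerate(zip(z500_forecasts, z500_era5), start=1):
#     rmse, mae = global_scores(pred, obs)
#     print(f"z500 step {step}: RMSE={rmse:.2f}, MAE={mae:.2f}")
```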
Across experiments, MoE architectures consistently improve performance for thermodynamic and large-scale variables. In particular, z500 RMSE is reduced by 26–31% at the first forecast step, with spatially aware routing performing best. Near-surface temperature shows a 7% improvement in RMSE and an 11% improvement in MAE when spatial and loss-aware routing are combined. These improvements appear within the first few training epochs, indicating efficient use of the available data. For the wind components, by contrast, the MoE variants show limited or slightly negative effects at the second and third forecast steps, where the baseline performs comparably or better, especially at the final step.
These preliminary results indicate that MoE provides variable-dependent benefits, with notable gains for slowly varying, large-scale thermodynamic fields but less impact on highly dynamic momentum variables. Ongoing work will assess performance over longer forecast horizons, across different climatic regions, and when training with multiple datasets from different sources.
Jehangir Awan, Sebastian Buschow, Peter Düben, Simon Grasse, Moritz Hauschulz, Till Hauer, Sebastian Hickman, Timothy Hunter, Matthias Karlbauer, Javad Kasravi, Enxhi Kreshpa, Julian Kuehnert, Michael Langguth, Christian Lessig, Ilaria Luise, Savvas Melidonis, Simone Norberti, Kacper Nowak, Sorcha Owens, Ankit Patnala, Yura Perugachi Diaz, Julius Polz, Konstantin Rushchanskii, Martin Schultz, Asma Semcheddine, Michael Tarnawa, Kerem Tezcan, Sindhu Vasireddy, Jifeng Wang, Florentine Weber, Sophie Xhonneux
How to cite: Almikaeel, W. and the WeatherGenerator Team: Mixture of Experts with Spatial Routing in a Weather Foundation Model: Early Results from WeatherGenerator, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-14545, https://doi.org/10.5194/egusphere-egu26-14545, 2026.