- Karlsruhe Institute of Technology (KIT), Chair for AI in Climate and Environmental Sciences, Institute of Theoretical Informatics (ITI), Karlsruhe, Germany (mozhgan.amjadi@kit.edu)
Data-driven weather prediction models have demonstrated remarkable skill, yet their ability to maintain a physically consistent three-dimensional atmospheric structure under out-of-distribution (OOD) conditions remains poorly understood. If OOD performance criteria could be met even approximately, AI models would open up entirely new possibilities for generating large AI weather ensembles under future climate scenarios, for example when initialized from climate model simulations (Rackow et al., 2024). This study conducts a multi-scale diagnostic evaluation of four state-of-the-art models: NeuralGCM (a deterministic hybrid model), GraphCast (a deterministic graph neural network model), AIFS (a deterministic transformer-based model), and GenCast (a generative, diffusion-based ensemble model). Each model is initialized across three distinct climate states: 1955 (cold) and 2023 (neutral), both sourced from ERA5 reanalysis, and 2049 (warm), simulated by the nextGEMS climate model (Segura et al., 2025).
Over lead times of 1–10 days, we find no detectable resolution dependence in NeuralGCM's global skill, though the 1.4° configuration minimizes mean drift. A dominant spatial signature emerges across all models: a robust land–ocean contrast in which oceans show smaller biases and slower decay of the Anomaly Correlation Coefficient (ACC). Cross-hemispheric skill comparisons reveal that this contrast drives a significant asymmetry in error characteristics. In the 2049 warm scenario, the land-heavy Northern Hemisphere (NH, 39% land coverage) is the primary site of GraphCast's systematic "cool-drift" toward its training distribution, which peaks during boreal summer (JJA). In contrast, the generative GenCast model develops a pronounced warm bias localized in the ocean-dominated Southern Hemisphere (SH, about 20% land coverage).
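As an illustration of the land–ocean comparison, the following is a minimal sketch of a latitude-weighted ACC restricted to a regional mask. It assumes the common uncentered form of the ACC on a regular latitude–longitude grid; the function name `anomaly_correlation`, the synthetic fields, and the random land mask are hypothetical and are not taken from the study's code or data.

```python
import numpy as np


def anomaly_correlation(forecast, truth, climatology, lat_deg, mask=None):
    """Latitude-weighted, uncentered ACC for 2D fields shaped (lat, lon).

    mask: optional boolean (lat, lon) array selecting e.g. land or ocean cells.
    """
    # cos(latitude) approximates relative grid-cell area on a regular grid.
    weights = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones(forecast.shape)
    if mask is not None:
        weights = weights * mask  # zero out cells outside the region

    f_anom = forecast - climatology
    t_anom = truth - climatology

    numerator = np.sum(weights * f_anom * t_anom)
    denominator = np.sqrt(
        np.sum(weights * f_anom**2) * np.sum(weights * t_anom**2)
    )
    return numerator / denominator


# Synthetic demonstration on a coarse 10-degree grid (random data, not model output).
rng = np.random.default_rng(0)
lat = np.linspace(-85.0, 85.0, 18)
clim = rng.normal(280.0, 10.0, size=(18, 36))
truth = clim + rng.normal(0.0, 2.0, size=(18, 36))
forecast = truth + rng.normal(0.0, 1.0, size=(18, 36))
land = rng.random((18, 36)) < 0.3  # hypothetical land mask

acc_land = anomaly_correlation(forecast, truth, clim, lat, mask=land)
acc_ocean = anomaly_correlation(forecast, truth, clim, lat, mask=~land)
```

Evaluating the same function at each lead time, once with the land mask and once with its complement, would give the two ACC decay curves that such a contrast compares.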
For all three climate states, we further evaluate model performance across the entire troposphere and, where available, the stratosphere. While all four models maintain high explained variance in the present-day mid-troposphere, performance degrades non-linearly under OOD forcing elsewhere, particularly in the stratosphere (pressures below 200 hPa) and the boundary layer (pressures above 900 hPa). Latitudinal cross-sections of the R² score reveal that this degradation is most severe at polar latitudes. Notably, in the 2049 scenario, GenCast exhibits a near-total collapse of skill by day 10, whereas NeuralGCM and GraphCast maintain localized predictive skill within the tropical troposphere.
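A latitude–pressure cross-section of the R² score can be sketched as follows. This assumes the standard coefficient of determination, pooled over forecast samples and longitude at each level and latitude; the array layout `(sample, level, lat, lon)` and the function name `r2_cross_section` are illustrative assumptions, and the data are synthetic.

```python
import numpy as np


def r2_cross_section(forecast, truth):
    """R^2 per (level, lat) for arrays shaped (sample, level, lat, lon)."""
    reduce_axes = (0, 3)  # pool over samples and longitude
    ss_res = np.sum((truth - forecast) ** 2, axis=reduce_axes)
    truth_mean = truth.mean(axis=reduce_axes, keepdims=True)
    ss_tot = np.sum((truth - truth_mean) ** 2, axis=reduce_axes)
    return 1.0 - ss_res / ss_tot


# Synthetic demonstration (random data, not model output).
rng = np.random.default_rng(1)
truth = rng.normal(size=(20, 5, 18, 36))
forecast = truth + 0.5 * rng.normal(size=truth.shape)
r2 = r2_cross_section(forecast, truth)  # shape (level, lat) = (5, 18)
```

Under this definition, a value near 1 indicates that the forecast explains most of the variance at that level and latitude, while a value at or below 0 indicates no skill beyond the sample mean.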
The architecture dependence of these simulated ensembles is confirmed by projecting day-10 drifts onto inter-climate "fingerprints" (T_2049 − T_2023 and T_1955 − T_2023). While AIFS and NeuralGCM show superior stability, GraphCast exhibits a systematic "cool-drift" toward its training climatology, and GenCast develops a distinct warm ocean drift. Beyond evaluating skill in surface variables, our results underline the need to assess data-driven models comprehensively with vertical, hemispheric, and seasonal diagnostics when they are applied to climate science scenarios, with implications for future AI model development.
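One way to read the fingerprint projection is as an area-weighted least-squares coefficient of the drift field on the fingerprint field. The sketch below assumes that interpretation; the function name `fingerprint_projection`, the cos-latitude weighting, and the synthetic temperature fields are assumptions for illustration rather than the study's implementation.

```python
import numpy as np


def fingerprint_projection(drift, fingerprint, lat_deg):
    """Area-weighted projection coefficient of drift onto fingerprint.

    Both fields are shaped (lat, lon). With fingerprint = T_2049 - T_2023,
    a negative coefficient means the drift points back toward the 2023 state.
    """
    weights = np.cos(np.deg2rad(lat_deg))[:, None]
    return np.sum(weights * drift * fingerprint) / np.sum(weights * fingerprint**2)


# Synthetic demonstration (random data, not model output).
rng = np.random.default_rng(2)
lat = np.linspace(-85.0, 85.0, 18)
t_2023 = rng.normal(285.0, 10.0, size=(18, 36))
t_2049 = t_2023 + rng.normal(1.5, 0.5, size=(18, 36))
fingerprint = t_2049 - t_2023

# Hypothetical day-10 drift that undoes 40% of the warming signal, plus noise.
drift = -0.4 * fingerprint + rng.normal(0.0, 0.05, size=(18, 36))
coeff = fingerprint_projection(drift, fingerprint, lat)
```

In this reading, a coefficient near 0 corresponds to a stable model, a negative coefficient on T_2049 − T_2023 to a cool drift toward the present-day climate, and a positive one to a warm drift.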
References:
Rackow, T., et al. (2024). Robustness of AI-based weather forecasts in a changing climate. arXiv preprint arXiv:2409.18529. https://doi.org/10.48550/arXiv.2409.18529
Segura, H., et al. (2025). nextGEMS: entering the era of kilometer-scale Earth system modeling. Geosci. Model Dev., 18, 7735–7761. https://doi.org/10.5194/gmd-18-7735-2025
How to cite: Amiramjadi, M., Roth, C., and Nowack, P.: Architectural Sensitivity of AI Weather Prediction Models to 3D Structural and Seasonal Climate Forcing, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-18038, https://doi.org/10.5194/egusphere-egu26-18038, 2026.