- 1 Department of Meteorology and Geophysics, University of Vienna, Vienna, Austria (maximilian.meindl@univie.ac.at)
- 2 Research Unit Sustainability and Climate Risk, Center for Earth System Research and Sustainability (CEN), University of Hamburg, Hamburg, Germany
The emergence of global km-scale climate models challenges traditional model evaluation approaches, which typically rely on long climatological averages. The substantial computational costs and enormous data volumes associated with km-scale simulations often constrain simulation length, limiting the availability of long-term averages. As a result, conventional analysis methods become less practical and less informative when assessing short, high-frequency model output that is potentially dominated by internal variability. At the same time, recent advances in machine learning (ML), particularly in deep neural networks, offer new and innovative ways to efficiently extract information from large climate datasets. Building on this progress, we present an ML-based framework for evaluating climate models on a regional scale over short periods, focusing on daily near-surface air temperature fields over Europe.
We train a convolutional neural network (CNN) to distinguish spatial temperature fields from a large set of climate models. We employ 28 regional simulations from EURO-CORDEX and two global km-scale models from nextGEMS and Destination Earth. Beyond the classification based on climate model simulations, the pre-trained CNN is applied to observation-based test datasets. This setup allows us to build towards an evaluation metric: the model to which observation-based samples are most frequently assigned might be considered the most similar to the observed climate. Despite the regional focus of EURO-CORDEX, observation-based samples are most frequently classified as the global km-scale model IFS-FESOM. This suggests that this global km-scale model may capture regional temperature patterns more accurately than the regional climate model simulations. Although our results are consistent with traditional metrics in identifying IFS-FESOM as the best-performing model, they also indicate that CNN-based evaluation provides additional information about the similarity between models and observations.
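The assignment-frequency idea described above can be illustrated with a minimal sketch. It assumes the classifier returns one row of class probabilities per observation-based sample; the function name `assignment_frequencies`, the model labels, and the toy probabilities are hypothetical and are not taken from the study.

```python
import numpy as np

def assignment_frequencies(probabilities, model_names):
    """Return the fraction of samples assigned to each model class.

    Each sample is assigned to the class with the highest probability
    (argmax). This is an illustrative assumption about how the
    classification outcome is turned into a frequency.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    predicted = probabilities.argmax(axis=1)
    counts = np.bincount(predicted, minlength=len(model_names))
    return dict(zip(model_names, counts / counts.sum()))

# Hypothetical toy data: 4 observation-based samples, 3 candidate models.
model_names = ["model_a", "model_b", "model_c"]
obs_probabilities = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
    [0.1, 0.8, 0.1],
])

frequencies = assignment_frequencies(obs_probabilities, model_names)
# The most frequently assigned class is read as the model most similar
# to the observation-based data.
closest_model = max(frequencies, key=frequencies.get)
print(frequencies, closest_model)
```

Using hard argmax assignments keeps the measure easy to interpret; averaging the class probabilities instead would be an alternative design choice that retains information about classifier confidence.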
To better understand which spatial features influence the CNN’s classification of observation-based samples, we apply explainable artificial intelligence (XAI) methods, specifically layer-wise relevance propagation (LRP), to the classification outcomes. The resulting relevance patterns indicate that static features such as orography and coastlines, as well as relevance hotspots potentially linked to regions of dynamic variability, play a dominant role in the classification. This highlights that the CNN is sensitive to physically meaningful structures that define model-specific spatial fingerprints.
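To make the relevance-propagation step concrete, the following is a minimal sketch of the LRP epsilon rule. It assumes a tiny, untrained, bias-free dense network with random weights acting on a flattened 4x4 toy field, not the CNN used in the study; the helper `lrp_epsilon_dense` and all array names are hypothetical.

```python
import numpy as np

def lrp_epsilon_dense(activations, weights, relevance_out, eps=1e-9):
    """Propagate relevance back through one bias-free dense layer
    using the LRP epsilon rule."""
    z = activations @ weights                        # layer pre-activations
    stabiliser = eps * np.where(z >= 0, 1.0, -1.0)   # avoids division by zero
    s = relevance_out / (z + stabiliser)
    return activations * (weights @ s)

rng = np.random.default_rng(0)
field = rng.normal(size=(4, 4))   # hypothetical toy temperature anomaly field
w1 = rng.normal(size=(16, 8))     # hypothetical untrained weights, layer 1
w2 = rng.normal(size=(8, 3))      # hypothetical weights for 3 model classes

# Forward pass: flatten, ReLU hidden layer, linear output logits.
x = field.ravel()
hidden = np.maximum(x @ w1, 0.0)
logits = hidden @ w2
target = int(logits.argmax())

# Backward pass: start with the logit of the predicted class only,
# then redistribute it layer by layer down to the input grid cells.
relevance_logits = np.zeros_like(logits)
relevance_logits[target] = logits[target]
relevance_hidden = lrp_epsilon_dense(hidden, w2, relevance_logits)
relevance_map = lrp_epsilon_dense(x, w1, relevance_hidden).reshape(field.shape)

print(relevance_map.round(3))
```

Because the toy network has no biases and `eps` is very small, the relevance summed over all grid cells approximately equals the logit of the predicted class. This conservation property is what allows a relevance map to be read as a per-grid-cell decomposition of the classification score.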
Using our ML-based framework, we show that a CNN can robustly distinguish between climate models at regional spatial scales and over short time periods, and can identify the model closest to observations. More broadly, we demonstrate that ML, combined with XAI, offers a scalable and physically interpretable approach for evaluating high-resolution climate models, thereby complementing established evaluation frameworks.
How to cite: Meindl, M., Kornblueh, M., Brunner, L., and Voigt, A.: Using Explainable AI to uncover physically meaningful features in km-scale climate models on a regional scale, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-9427, https://doi.org/10.5194/egusphere-egu26-9427, 2026.