Interpretable Swin Transformer–Based Downscaling of PM2.5 Air Pollution Field

Marcos Martínez-Roig; Francisco Granell-Haro; Kevin Monsalvez-Pozo; Nuria P. Plaza-Martin; Victor Galván Fraile; Paul Ramacher Martin Otto; Johannes Bieser; Johannes Flemming; Paula Harder; Miha Razinger; Cesar Azorin-Molina

doi:https://doi.org/10.5194/egusphere-egu26-19262

Session AS5.2

EGU26-19262, updated on 14 Mar 2026

EGU General Assembly 2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Marcos Martínez-Roig¹, Francisco Granell-Haro¹, Kevin Monsalvez-Pozo², Nuria P. Plaza-Martin¹, Victor Galván Fraile³, Paul Ramacher Martin Otto⁴, Johannes Bieser⁴, Johannes Flemming⁵, Paula Harder⁵, Miha Razinger⁵, and Cesar Azorin-Molina¹

¹Consejo Superior de Investigaciones Científicas (CSIC), Ecology and Global Change, Madrid, Spain (marcos.martinez.roig@csic.es)
²Universitat de València, Image Processing Laboratory (IPL), Paterna, València, Spain
³Universidad Complutense de Madrid, Department of Physics of the Earth and Astrophysics, Ciudad Universitaria, ZIP code 28040 Madrid, Spain
⁴Helmholtz-Zentrum Hereon, Geesthacht, Germany
⁵European Centre for Medium-Range Weather Forecasts (ECMWF)

Fine particulate matter (PM2.5) is one of the most harmful air pollutants, posing severe risks to human health and contributing significantly to premature mortality worldwide. Accurate high-resolution monitoring and forecasting of PM2.5 are therefore essential for air quality management, public health assessment, and the design of effective mitigation policies. However, operational atmospheric composition models such as the Copernicus Atmosphere Monitoring Service (CAMS) provide global fields at relatively coarse spatial resolution (~40 km), limiting their ability to represent local-scale pollution patterns driven by complex interactions between emissions, meteorology, and topography. Higher-resolution regional CAMS products (~10 km) partly address this limitation but are computationally expensive and are restricted to specific geographical domains, mainly Europe. As a result, high-resolution information remains difficult to obtain consistently at the global scale.

In this work, we present a deep learning–based super-resolution approach to downscale PM2.5 concentration fields from 40 km to 10 km resolution, bridging the gap between global model outputs and regional-scale applications. The proposed approach is based on a SwinFIR architecture, a hierarchical Vision Transformer that leverages shifted-window self-attention to efficiently capture multiscale spatial dependencies. The model ingests multiple low-resolution dynamic variables from CAMS, including PM2.5, 2-meter temperature (T2M), 10-meter wind components (U10, V10), 2-meter dewpoint temperature (D2M), and boundary layer height (BLH), providing both chemical and meteorological context. In addition, high-resolution static data, such as orography and population, are introduced through a secondary branch, enabling the model to condition the super-resolution process on fine-scale geographical features that strongly influence pollutant distributions. The output consists of high-resolution (10 km) PM2.5 fields. Model performance is evaluated using both a held-out test period and independent ground-based PM2.5 observations from the European Environment Agency.
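As an illustration of the shifted-window idea mentioned above, the following minimal sketch assigns each cell of a small grid to an attention window, first with a regular partition and then with a cyclically shifted one. It is not the authors' implementation: the function name `window_ids`, the 8 x 8 grid, the window size of 4, and the shift of 2 are all assumptions chosen for readability.

```python
def window_ids(height, width, window, shift=0):
    """Return a grid holding the attention-window index of every cell.

    With shift > 0 the grid is cyclically rolled before partitioning, which
    is how Swin-style layers alternate between regular and shifted windows.
    (Real implementations also mask cells that wrap around the edge; that
    detail is omitted here.)
    """
    if height % window or width % window:
        raise ValueError("grid size must be a multiple of the window size")
    windows_per_row = width // window
    ids = []
    for r in range(height):
        row = []
        for c in range(width):
            rolled_r = (r - shift) % height  # position after the cyclic roll
            rolled_c = (c - shift) % width
            row.append((rolled_r // window) * windows_per_row + rolled_c // window)
        ids.append(row)
    return ids


# Hypothetical toy grid: 8 x 8 cells, 4 x 4 windows, shift of half a window.
regular = window_ids(8, 8, window=4)
shifted = window_ids(8, 8, window=4, shift=2)

# Cells (3, 3) and (4, 4) are neighbours separated by a window boundary.
same_window_regular = regular[3][3] == regular[4][4]
same_window_shifted = shifted[3][3] == shifted[4][4]
print(same_window_regular, same_window_shifted)
```

Two neighbouring cells that fall in different windows under the regular partition share a window after the shift, so alternating the two layouts lets information cross window boundaries without computing attention over the whole field.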

Results show that the model effectively reconstructs fine-scale PM2.5 structures and reduces biases present in the global forecasts. Verification against ground-based observations indicates that the model achieves performance comparable to high-resolution CAMS Europe regional forecasts. The proposed SwinFIR model consistently outperforms a carefully optimized state-of-the-art U-Net baseline across multiple evaluation criteria, including error metrics, spatial correlation, and structural consistency. These improvements reflect the ability of self-attention mechanisms to capture long-range spatial interactions that are difficult to model using purely convolutional approaches.
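The abstract does not list the exact error metrics used. As a hedged sketch of what such a comparison could involve, the snippet below computes three common choices (root-mean-square error, mean bias, and Pearson correlation) for a predicted field against a reference, both flattened to lists. The function names and the sample values are illustrative assumptions, not results from the study.

```python
import math


def rmse(pred, ref):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def mean_bias(pred, ref):
    """Mean of (prediction - reference); positive means overestimation."""
    return sum(p - r for p, r in zip(pred, ref)) / len(ref)


def pearson(pred, ref):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    mean_p = sum(pred) / len(pred)
    mean_r = sum(ref) / len(ref)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_r = sum((r - mean_r) ** 2 for r in ref)
    return cov / math.sqrt(var_p * var_r)


# Made-up PM2.5 values (µg/m³) for four grid cells, for illustration only.
predicted = [12.0, 15.0, 9.0, 20.0]
reference = [10.0, 16.0, 8.0, 22.0]

scores = {
    "rmse": rmse(predicted, reference),
    "bias": mean_bias(predicted, reference),
    "corr": pearson(predicted, reference),
}
print(scores)
```

Reporting bias alongside RMSE is useful here because errors of opposite sign can cancel in the bias while still contributing to the RMSE.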

Beyond predictive performance, we also focus on interpretability. Feature importance analyses quantify the relative contribution of each input variable, demonstrating that the static inputs play a key role in the downscaling process. Attention maps further reveal that the model focuses on physically meaningful features, including high-concentration peaks and regions of strong wind, indicating physically consistent behavior.
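The abstract does not specify which feature importance method was used. One widely used option is permutation importance: shuffle a single input variable across samples and measure how much the error grows. The sketch below shows the idea on a toy model; `toy_model`, the two-feature samples, and the seed are hypothetical stand-ins, not the study's model or data.

```python
import math
import random


def rmse(pred, ref):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def permutation_importance(model, samples, targets, seed=0):
    """Return the RMSE increase caused by shuffling each feature in turn."""
    rng = random.Random(seed)
    baseline = rmse([model(s) for s in samples], targets)
    scores = []
    for j in range(len(samples[0])):
        column = [s[j] for s in samples]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [s[:j] + [v] + s[j + 1:] for s, v in zip(samples, column)]
        scores.append(rmse([model(s) for s in permuted], targets) - baseline)
    return scores


def toy_model(sample):
    """Hypothetical model that depends on feature 0 and ignores feature 1."""
    return 2.0 * sample[0]


samples = [[float(i), float(i % 3)] for i in range(20)]
targets = [toy_model(s) for s in samples]
scores = permutation_importance(toy_model, samples, targets)
print(scores)
```

A feature the model ignores receives an importance of zero, while a feature it relies on produces a clear error increase. For gridded inputs, the same procedure could be applied per input channel rather than per column.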

Finally, transferability was assessed by applying the model to North America, a region unseen during training. Evaluation against AirNow observations shows reasonable generalization performance, while highlighting the need for further research to improve robustness and extrapolation to unseen regions.
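Comparing a gridded field with station observations requires matching each station to a grid location. The abstract does not describe how this was done; a simple approach, sketched below, is to take the nearest cell of a regular latitude-longitude grid. The grid origin, the 0.1 degree spacing (roughly 10 km in latitude), and the station coordinates are assumptions for illustration.

```python
def nearest_cell(lat, lon, lat0, lon0, step):
    """Return (row, col) of the regular-grid cell centre closest to a point.

    lat0 and lon0 are the coordinates of cell (0, 0); step is the grid
    spacing in degrees, assumed equal in both directions.
    """
    row = round((lat - lat0) / step)
    col = round((lon - lon0) / step)
    return row, col


# Hypothetical grid starting at 30N, 10W with 0.1 degree spacing,
# and a hypothetical station at 40.42N, 3.70W.
station_cell = nearest_cell(40.42, -3.70, lat0=30.0, lon0=-10.0, step=0.1)
print(station_cell)
```

The predicted value at `station_cell` can then be paired with the station measurement for verification. Bilinear interpolation between surrounding cells is a common alternative when the grid is coarse relative to local gradients.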

Overall, this study demonstrates the potential of Transformer-based architectures for data-driven downscaling of atmospheric composition fields, providing both improved accuracy and enhanced physical interpretability. The proposed framework offers a promising tool for high-resolution air quality applications based on global model outputs.

How to cite: Martínez-Roig, M., Granell-Haro, F., Monsalvez-Pozo, K., Plaza-Martin, N. P., Galván Fraile, V., Martin Otto, P. R., Bieser, J., Flemming, J., Harder, P., Razinger, M., and Azorin-Molina, C.: Interpretable Swin Transformer–Based Downscaling of PM2.5 Air Pollution Field, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-19262, https://doi.org/10.5194/egusphere-egu26-19262, 2026.