EGU26-7697, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-7697
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Wednesday, 06 May, 14:35–14:45 (CEST)
Room -2.33
Gap-Aware Transformer-Based Foundation Model Pretraining for Spatiotemporal Earth Observation Data
Charly Zimmer1, Josefine Umlauft2, Guido Kraemer1, David Montero1, and Miguel D Mahecha1
  • 1Leipzig University, Institute for Earth System Science and Remote Sensing, Leipzig, Germany (charly.zimmer@uni-leipzig.de)
  • 2Leipzig University, ScaDS.AI, Leipzig, Germany (josefine.umlauft@uni-leipzig.de)

Earth observation datasets, especially those derived from remote sensing, are often characterized by significant data gaps. The pretraining of Geospatial Foundation Models, however, requires largely complete samples, leading to highly selective sampling strategies that discard large parts of the original observations. The problem is exacerbated in spatiotemporal data, where these restrictions apply to entire time series. Systems like Prithvi-EO-2.0 tolerate very small gap regions that can be addressed with interpolation during preprocessing, but a strategy for integrating samples with substantial gap areas (>20% of the sample) into pretraining has yet to be established. We introduce an architecture that builds upon the random masking strategies of popular MAE-style architectures by additionally force-masking patches that contain gaps. Doing so requires a BERT-style masking scheme in which masked patches are encoded instead of being removed from the sequence. Custom loss functions account for the gaps in both the targets and the masked patches. While the resulting encoder-only architecture does not benefit from the reduced computational complexity of MAE-style masking, we mitigate this effect by using factorized space-time attention in a Video Vision Transformer (ViViT) backbone, yielding a simple, lightweight model that scales easily. We demonstrate the potential of the architecture by performing spatiotemporal representation learning in a multivariate setup involving global Land Surface Temperature (LST) observations. The model is embedded in a framework that provides customizable sampling strategies for large-scale Earth observation datasets, including control over parameters such as the maximum gap ratio per sample, the sampling strides, and the variables included from shared-grid datasets such as Earth System Data Cubes (ESDC). This flexibility enables the generation of training datasets with millions of samples, exposing the full volume of information stored in Earth observation data to Geospatial Foundation Models.
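
The combination of forced masking of gap patches with a gap-aware reconstruction loss could look roughly as follows. This is a minimal sketch, not the authors' implementation: the abstract does not spell out the exact formulation, and names such as build_patch_masks, gap_aware_loss, and mask_ratio are illustrative assumptions.

```python
# Illustrative sketch only (PyTorch). Assumes samples are already split into N patches
# of P pixels each, with a boolean validity mask marking observed pixels.
import torch

def build_patch_masks(valid, mask_ratio=0.75):
    """valid: (B, N, P) bool, True where a pixel inside a patch is observed.
    Returns a (B, N) bool mask of patches to replace with a learned mask token
    (BERT-style: masked patches stay in the sequence instead of being dropped)."""
    gap_patch = ~valid.all(dim=-1)                          # force-mask any patch containing gaps
    rand_patch = torch.rand(valid.shape[:2], device=valid.device) < mask_ratio
    return gap_patch | rand_patch                           # union of forced and random masking

def gap_aware_loss(pred, target, patch_mask, valid):
    """Reconstruction MSE over masked patches, ignoring gap pixels in the target."""
    weight = (patch_mask.unsqueeze(-1) & valid).float()     # supervise only observed pixels
    se = (pred - target) ** 2
    return (se * weight).sum() / weight.sum().clamp(min=1.0)
```

In this reading, the forced masking guarantees that no gap pixel ever enters the encoder as real input, while the weighted loss prevents unobserved target pixels from contributing gradient signal.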
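The sampling framework could be illustrated along the following lines. This is a sketch under assumptions: xarray-based access to a shared-grid cube, NaN-coded gaps, and the parameter names size, stride, and max_gap_ratio are not taken from the abstract, which only states that such quantities are configurable.

```python
# Hypothetical minicube sampling over an Earth System Data Cube with dimensions
# (time, lat, lon) shared by all variables; gaps are assumed to be encoded as NaN.
import numpy as np
import xarray as xr

def sample_minicubes(cube, variables, size=(16, 32, 32), stride=(8, 16, 16), max_gap_ratio=0.2):
    """Yield spatiotemporal minicubes whose overall gap fraction stays below the threshold."""
    data = cube[variables].to_array("variable")             # (variable, time, lat, lon)
    nt, ny, nx = (data.sizes[d] for d in ("time", "lat", "lon"))
    for t in range(0, nt - size[0] + 1, stride[0]):
        for y in range(0, ny - size[1] + 1, stride[1]):
            for x in range(0, nx - size[2] + 1, stride[2]):
                sample = data.isel(time=slice(t, t + size[0]),
                                   lat=slice(y, y + size[1]),
                                   lon=slice(x, x + size[2]))
                gap_ratio = float(np.isnan(sample.values).mean())
                if gap_ratio <= max_gap_ratio:
                    yield sample
```

Raising max_gap_ratio in such a scheme is what lets the gap-aware model see far more of the archive than a pipeline restricted to near-complete samples.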

How to cite: Zimmer, C., Umlauft, J., Kraemer, G., Montero, D., and Mahecha, M. D.: Gap-Aware Transformer-Based Foundation Model Pretraining for Spatiotemporal Earth Observation Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7697, https://doi.org/10.5194/egusphere-egu26-7697, 2026.