EGU26-7697, updated on 14 Mar 2026
https://doi.org/10.5194/egusphere-egu26-7697
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Wednesday, 06 May, 14:35–14:45 (CEST)
Room -2.33
Gap-Aware Transformer-Based Foundation Model Pretraining for Spatiotemporal Earth Observation Data
Charly Zimmer1, Josefine Umlauft2, Guido Kraemer1, David Montero1, and Miguel D Mahecha1
  • 1Leipzig University, Institute for Earth System Science and Remote Sensing, Leipzig, Germany (charly.zimmer@uni-leipzig.de)
  • 2Leipzig University, ScaDS.AI, Leipzig, Germany (josefine.umlauft@uni-leipzig.de)

Earth observation datasets, especially those derived from remote sensing, are often characterized by significant data gaps. The pretraining of Geospatial Foundation Models, however, requires largely complete samples, leading to highly selective sampling strategies that discard large parts of the original observations. The problem is exacerbated in spatiotemporal data, where these restrictions apply to entire time series. Systems like Prithvi-EO-2.0 tolerate very small gap regions that can be addressed with interpolation during preprocessing, but a strategy for integrating samples with substantial gap areas (>20% of the sample) into pretraining has yet to be established. We introduce an architecture that builds upon the random masking strategies of popular MAE-style architectures by additionally force-masking patches that contain gaps. Doing so requires a BERT-style masking scheme in which masked patches are encoded instead of being removed from the sequence. Custom loss functions account for the gaps in both the targets and the masked patches. While the resulting encoder-only architecture does not benefit from the reduced computational complexity of MAE-style masking, we mitigate this effect by using factorized space-time attention in a Video Vision Transformer (ViViT) backbone, yielding a simple, lightweight model that scales easily. We demonstrate the potential of the architecture by performing spatiotemporal representation learning in a multivariate setup involving global Land Surface Temperature (LST) observations. The model is embedded in a framework that provides customizable sampling strategies for large-scale Earth observation datasets, including control over parameters such as the maximum gap ratio per sample, the sampling strides, and the variables included from shared-grid datasets such as Earth System Data Cubes (ESDC). This flexibility enables the generation of training datasets with millions of samples, exposing the full volume of information stored in Earth observation data to Geospatial Foundation Models.
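
The combination of forced masking of gap patches with a gap-aware reconstruction loss could look roughly as follows. This is a minimal sketch, not the authors' implementation: the abstract does not spell out the exact formulation, and names such as build_patch_masks, gap_aware_loss, and mask_ratio are illustrative assumptions.

```python
# Illustrative sketch only (PyTorch). Assumes samples are already split into N patches
# of P pixels each, with a boolean validity mask marking observed pixels.
import torch

def build_patch_masks(valid, mask_ratio=0.75):
    """valid: (B, N, P) bool, True where a pixel inside a patch is observed.
    Returns a (B, N) bool mask of patches to replace with a learned mask token
    (BERT-style: masked patches stay in the sequence instead of being dropped)."""
    gap_patch = ~valid.all(dim=-1)                          # force-mask any patch containing gaps
    rand_patch = torch.rand(valid.shape[:2], device=valid.device) < mask_ratio
    return gap_patch | rand_patch                           # union of forced and random masking

def gap_aware_loss(pred, target, patch_mask, valid):
    """Reconstruction MSE over masked patches, ignoring gap pixels in the target."""
    weight = (patch_mask.unsqueeze(-1) & valid).float()     # supervise only observed pixels
    se = (pred - target) ** 2
    return (se * weight).sum() / weight.sum().clamp(min=1.0)
```

In this reading, the forced masking guarantees that no gap pixel ever enters the encoder as real input, while the weighted loss prevents unobserved target pixels from contributing gradient signal.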
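The sampling framework could be illustrated along the following lines. This is a sketch under assumptions: xarray-based access to a shared-grid cube, NaN-coded gaps, and the parameter names size, stride, and max_gap_ratio are not taken from the abstract, which only states that such quantities are configurable.

```python
# Hypothetical minicube sampling over an Earth System Data Cube with dimensions
# (time, lat, lon) shared by all variables; gaps are assumed to be encoded as NaN.
import numpy as np
import xarray as xr

def sample_minicubes(cube, variables, size=(16, 32, 32), stride=(8, 16, 16), max_gap_ratio=0.2):
    """Yield spatiotemporal minicubes whose overall gap fraction stays below the threshold."""
    data = cube[variables].to_array("variable")             # (variable, time, lat, lon)
    nt, ny, nx = (data.sizes[d] for d in ("time", "lat", "lon"))
    for t in range(0, nt - size[0] + 1, stride[0]):
        for y in range(0, ny - size[1] + 1, stride[1]):
            for x in range(0, nx - size[2] + 1, stride[2]):
                sample = data.isel(time=slice(t, t + size[0]),
                                   lat=slice(y, y + size[1]),
                                   lon=slice(x, x + size[2]))
                gap_ratio = float(np.isnan(sample.values).mean())
                if gap_ratio <= max_gap_ratio:
                    yield sample
```

Raising max_gap_ratio in such a scheme is what lets the gap-aware model see far more of the archive than a pipeline restricted to near-complete samples.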

How to cite: Zimmer, C., Umlauft, J., Kraemer, G., Montero, D., and Mahecha, M. D.: Gap-Aware Transformer-Based Foundation Model Pretraining for Spatiotemporal Earth Observation Data, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-7697, https://doi.org/10.5194/egusphere-egu26-7697, 2026.