ESSI1.10 | Pre-training Strategies for Geospatial Foundation Models
Convener: Conrad Albrecht | Co-conveners: Thomas Brunschwiler, Gabriele Cavallaro

Earth Observation Foundation Models (EO-FMs) are large-scale AI models pre-trained on proxy tasks with self-supervised learning to generate versatile data representations, referred to as embeddings. Pre-training of EO-FMs is designed to encode the unique spatio-temporal characteristics of remote sensing data and climate model predictions into these embeddings. After pre-training, the EO-FM encoder can be adapted to a wide range of downstream applications by fine-tuning it with a decoder head. This approach enables high accuracy and generalization across different regions, even when only limited labeled data are available.
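To make the encoder-plus-decoder-head adaptation concrete, here is a minimal PyTorch-style sketch; the encoder interface, embedding dimension, and convolutional head are hypothetical placeholders rather than the reference design of any particular EO-FM.

import torch.nn as nn

class DownstreamModel(nn.Module):
    # Pre-trained EO-FM encoder plus a task-specific decoder head for fine-tuning.
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int,
                 freeze_encoder: bool = True):
        super().__init__()
        self.encoder = encoder                              # pre-trained, produces embedding maps
        if freeze_encoder:                                  # frozen-encoder variant of fine-tuning
            for p in self.encoder.parameters():
                p.requires_grad = False
        self.head = nn.Conv2d(embed_dim, num_classes, 1)    # lightweight decoder head

    def forward(self, x):
        features = self.encoder(x)                          # (B, embed_dim, H, W) embeddings
        return self.head(features)                          # per-pixel logits, e.g. land-cover classes

Training such a model only updates the head (and optionally the encoder), which is what allows strong performance with few labels.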

The most critical and computationally intensive phase in the development of EO-FMs is the pre-training stage. This phase involves several key components:
1. data sampling strategies to ensure diverse and representative datasets,
2. exploration of various proxy tasks—such as spatial-spectral-temporal masking, diffusion-based methods, and contrastive learning—to enhance the model's ability to learn the structure of complex geospatial data (a masking-based example is sketched after this list),
3. robust measurement and extrapolation of scaling laws: finding the optimal balance between model size, data size, and compute budget (a commonly used functional form is sketched after this list),
4. scalable model training: leveraging distributed systems (supercomputers, cloud environments, etc.) and advanced parallelism techniques for efficient pre-training.
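As an illustration of the masking-based proxy tasks in item 2, the following minimal sketch shows one pre-training step of masked-patch reconstruction; the encoder and decoder modules, the tokenized patch layout, and the practice of zeroing out masked tokens (rather than dropping them from the encoder input, as a full masked autoencoder would) are simplifying assumptions for illustration, not a prescribed recipe.

import torch

def masked_reconstruction_loss(patches, encoder, decoder, mask_ratio=0.75):
    # patches: (B, N, D) tensor of tokenized image patches (spatial, spectral, or temporal tokens).
    # encoder/decoder: hypothetical modules mapping (B, N, D) -> (B, N, D).
    B, N, D = patches.shape
    mask = torch.rand(B, N, device=patches.device) < mask_ratio    # True = masked token
    visible = patches * (~mask).unsqueeze(-1).float()              # zero out masked tokens (simplification)
    recon = decoder(encoder(visible))                              # reconstruct all tokens
    loss = ((recon - patches) ** 2).mean(dim=-1)[mask].mean()      # error on masked tokens only
    return loss

For item 3, one commonly used functional form is the Chinchilla-style parametric fit L(N, D) = E + A / N^alpha + B / D^beta, where N is the number of model parameters and D the amount of pre-training data; the coefficients below are illustrative placeholders that would in practice be fitted to losses measured over a grid of pre-training runs.

import numpy as np

def scaling_law(N, D, E, A, alpha, B, beta):
    # Parametric loss surface L(N, D); coefficients come from fitting measured pre-training losses.
    return E + A * np.power(N, -alpha) + B * np.power(D, -beta)

# Extrapolate the expected loss of a larger run (all numbers are hypothetical placeholders).
print(scaling_law(1e9, 1e11, E=2.0, A=150.0, alpha=0.3, B=400.0, beta=0.3))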

Evaluation of pre-trained EO-FMs is critically linked to benchmarking their embeddings. Developing generic frameworks that test the universal character of Earth observation data embeddings remains an open direction of research.
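A simple and widely used way to benchmark such embeddings is a linear probe on top of the frozen encoder; the sketch below assumes the embeddings and downstream labels are already available as NumPy arrays and is only one of many possible evaluation protocols.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def linear_probe_accuracy(embeddings: np.ndarray, labels: np.ndarray) -> float:
    # embeddings: (num_samples, embed_dim) from the frozen pre-trained encoder;
    # labels: classes of a downstream task. Both are assumed to exist already.
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

Higher probe accuracy with the encoder kept frozen indicates more universal embeddings, independent of any particular fine-tuning setup.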

This session will focus on sharing insights from extensive pre-training experiments, including ablation studies that examine the impact of different pre-training strategies on model performance. We also invite contributions offering practical guidance for training EO-FMs on large-scale GPU clusters, and we would like to discuss scalable training strategies for multi-modal and multi-domain EO-FMs. Our aim is to foster a deeper understanding of how to optimize the pre-training process so that the resulting embeddings effectively serve a broad range of geospatial applications.
