EGU25-18029, updated on 15 Mar 2025
https://doi.org/10.5194/egusphere-egu25-18029
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Monday, 28 Apr, 08:45–08:55 (CEST)
Room -2.92
GeoDINO: A Vision Foundation Model for Earth Observation Leveraging DINO Architecture and Sentinel-2 Multi-Spectral Data
Riccardo Musto1, Giancarlo Paoletti1, Nikolaos Dionelis2, Simone Sarti1, Fabio Di Matteo1, Jente Bosmans2, Peter Naylor2, Giacomo Donato Cascarano3, Casper Fibaek2, and Nicolas Longépé2
  • 1Leonardo Labs (Leonardo S.p.A.), Italy
  • 2European Space Agency (ESA), ESRIN, Φ-lab, Italy
  • 3e-GEOS S.p.A., Via Tiburtina 965, Rome, 00156, Italy

Foundation Models are emerging as a transformative paradigm in Earth observation, offering powerful solutions to the challenges of processing and understanding satellite imagery at scale. The scarcity of large-scale labeled datasets and the technical challenges of annotating the vast volumes of data collected by satellites pose significant barriers to achieving high accuracy in many important downstream tasks. Furthermore, the dynamic nature of Earth adds complexity, as labels tied to a specific geographical region at a particular moment in time are insufficient to capture the evolving characteristics of the environment. Self-supervised learning techniques have emerged as a promising solution, enabling models to learn rich representations from unlabeled data while requiring minimal supervised fine-tuning for specific applications.
In this work, we present GeoDINO, a novel foundation model that adapts the DINO self-supervised learning architecture to handle multi-spectral Sentinel-2 data. While the original DINO framework has shown remarkable success in computer vision tasks through its teacher-student architecture and self-distillation approach, we extend it significantly for Earth observation applications. Our key innovation lies in the addition of multiple supervised auxiliary tasks: after the encoder generates representations, we attach specialized MLPs designed to predict various geospatial attributes including climate zones, permanent water bodies and geographical coordinates. Both the teacher and student networks are trained to predict these auxiliary labels, with the teacher network being updated through Exponential Moving Average (EMA) of the student's weights. This modification enables our model to learn not only from the self-supervised distillation process but also from the rich spatial and temporal information inherent in satellite imagery.
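The training signal described above can be illustrated with a minimal sketch. It is not the GeoDINO implementation: it assumes a DINO-style cross-entropy between a centered, sharpened teacher output and the student output, adds a weighted sum of auxiliary losses, and refreshes the teacher as an EMA of the student. The function names, the `aux_weight` parameter, the use of mean squared error for coordinates, and the temperature and momentum values (chosen here as illustrative defaults) are all assumptions. The abstract does not specify how the teacher's auxiliary heads are optimized, so the sketch applies the auxiliary losses to the student only.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits, center,
                      student_temp=0.1, teacher_temp=0.04):
    """DINO-style cross-entropy: the centered, sharpened teacher
    distribution is the target for the student distribution."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def classification_loss(logits, label):
    """Cross-entropy for a categorical auxiliary label, e.g. a climate zone
    index (hypothetical encoding)."""
    return -log_softmax(logits, 1.0)[label]


def regression_loss(predicted, target):
    """Mean squared error, e.g. for geographical coordinates (assumed here;
    the abstract does not state how coordinates are encoded)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def total_loss(distill, aux_losses, aux_weight=1.0):
    """Self-distillation loss plus a weighted sum of auxiliary losses.
    `aux_weight` is a hypothetical hyperparameter."""
    return distill + aux_weight * sum(aux_losses)


def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher weights become an exponential moving average of the
    student weights; no gradient flows through the teacher."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a training loop of this shape, `total_loss` would be backpropagated through the student only, and `ema_update` would be called once per optimizer step to refresh the teacher.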
We are currently training GeoDINO on MajorTOM, a comprehensive Sentinel-2 dataset comprising 23 TB of Core-S2L2A data, exploiting the Leonardo Davinci-1 Supercomputer. Furthermore, to validate our approach, we are also training the model on FastTOM and TinyTOM, two subsets of MajorTOM. Finally, the model will be evaluated within the PhilEO Bench framework to assess its performance on different tasks, including land cover classification, change detection, and building density estimation. Looking ahead, we plan to transition to the DINOv2 architecture to further enhance our model's capabilities. Through this research, we aim to demonstrate how self-supervised learning techniques, when properly adapted for Earth observation data, can address the fundamental challenges of data scarcity and temporal dynamics in remote sensing applications. The development of GeoDINO represents a step toward more efficient and adaptable Earth observation systems that can leverage the vast amounts of available satellite data while minimizing the need for extensive labeled datasets.
References:
[1] M. Caron, et al., “Emerging Properties in Self-Supervised Vision Transformers,” arXiv:2104.14294, 2021.
[2] C. Fibaek, et al., “PhilEO Bench: Evaluating Geo-Spatial Foundation Models,” in Proceedings IGARSS, 2024.
[3] N. Dionelis, et al., “Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI,” arXiv:2406.18295, 2024.
[4] N. Dionelis and N. Longépé, “Fine-Tuning Foundation Models with Confidence Assessment for enhanced Semantic segmentation,” 2024.
[5] A. Francis and M. Czerkawski, “MajorTOM: Expandable Datasets for Earth Observation,” IGARSS, 2024.
[6] B. Le Saux, et al., “The PhilEO Geospatial Foundation Model Suite,” EGU, 2024.

How to cite: Musto, R., Paoletti, G., Dionelis, N., Sarti, S., Di Matteo, F., Bosmans, J., Naylor, P., Donato Cascarano, G., Fibaek, C., and Longépé, N.: GeoDINO: A Vision Foundation Model for Earth Observation Leveraging DINO Architecture and Sentinel-2 Multi-Spectral Data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-18029, https://doi.org/10.5194/egusphere-egu25-18029, 2025.