ESSI1.1 | AI Foundation Models for Earth, Space and Planetary Sciences
Convener: Rahul Ramachandran | Co-conveners: Valentine Anantharaj, Tsengdar Lee, Aman Gupta, Takuya Kurihana
| Thu, 18 Apr, 16:15–18:00 (CEST)
Room 0.94/95
Posters on site
| Attendance Thu, 18 Apr, 10:45–12:30 (CEST) | Display Thu, 18 Apr, 08:30–12:30
Hall X2
Posters virtual
| Attendance Thu, 18 Apr, 14:00–15:45 (CEST) | Display Thu, 18 Apr, 08:30–18:00
vHall X2
Foundation Models (FM) represent the next frontier in Artificial Intelligence (AI). These generalized AI models are designed not just for specific tasks but for a plethora of downstream applications. Trained on any sequence data through self-supervised methods, FMs eliminate the need for extensive labeled datasets. Leveraging the power of transformer architectures, which utilize self-attention mechanisms, FMs can capture intricate relationships in data across space and time. Their emergent properties, derived from the data, make them invaluable tools for scientific research. When fine-tuned, FMs outperform traditional models, both in efficiency and accuracy, paving the way for rapid development of diverse applications. FMs, with their ability to synthesize vast amounts of data and discern intricate patterns, can revolutionize our understanding of and response to challenging global problems, such as monitoring and mitigating the impacts of climate change and other natural hazards.

The session will discuss advances, early results and best practices related to the preparation and provisioning of curated data, construction and evaluation of model architectures, scaling properties and computational characteristics of model pretraining, use cases and fine-tuning for downstream applications, and MLOps for the deployment of models for research and applications. The session also encourages discussion on broad community involvement toward the development of open foundation models for science that are accessible to all.

Orals: Thu, 18 Apr | Room 0.94/95

Chairpersons: Aman Gupta, Takuya Kurihana, Rahul Ramachandran
FM for Science Applications
On-site presentation
Manil Maskey, Rahul Ramachandran, Tsengdar Lee, Kevin Murphy, Sujit Roy, Muthukumaran Ramasubramanian, Iksha Gurung, and Raghu Ganti

Foundation models (FMs) mark a significant shift in AI: large-scale machine learning models pre-trained on wide-ranging datasets. These models act as flexible starting points, ready to be fine-tuned for various specialized tasks. Distinct from traditional models designed for narrow objectives, foundation models apply their broad pre-training to learn patterns across data, enhancing their adaptability and efficiency in diverse domains. This approach minimizes the need for extensive, task-specific labeled datasets and prolonged training periods. A single foundation model can be tailored to many scientific applications, often outperforming traditional models on some tasks even when labeled data is scarce.


Addressing a broad array of complex scientific challenges using AI FMs requires interdisciplinary teams from various groups and organizations. No single research group or institution can independently muster the necessary resources or expertise to construct useful AI FMs. Thus, collaborative efforts are essential, combining diverse skills, resources, and viewpoints to create more comprehensive solutions. The right blend of domain-specific expertise and a broad understanding of various AI subfields is crucial to ensure the versatility and adaptability of foundation models. Moreover, the scientific community must develop a wide array of use cases, labeled datasets, and benchmarks to evaluate these models effectively across different scenarios if they are to be accepted and widely utilized within science.


Building foundation models for science demands fostering collaboration among a diverse spectrum of research groups to ensure this broad range of perspectives. This strategy should include stakeholders such as individual researchers, academic and government institutions, and tech companies. Embedding this collaboration within the principles of open science is therefore vital. Open science calls for transparent research, open sharing of findings, promoting reproducibility by making methodologies and data accessible, and providing tools researchers can freely use, modify, and distribute. Encouraging community collaboration in model pre-training and development leads to more robust and functional FMs. Guaranteeing open access to datasets, models, and fine-tuning code enables researchers to validate findings and build upon previous work, thus reducing redundancy in data collection and cultivating a culture of shared knowledge and progress.

How to cite: Maskey, M., Ramachandran, R., Lee, T., Murphy, K., Roy, S., Ramasubramanian, M., Gurung, I., and Ganti, R.: Foundation Models for Science: Potential, Challenges, and the Path Forward, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-3202, 2024.

On-site presentation
Ilaria Luise, Christian Lessig, Martin Schultz, and Michael Langguth

The atmosphere affects humans in a multitude of ways, from loss of lives due to adverse weather to long-term social and economic impacts. Very recently, AI-based models have shown tremendous potential for reducing the computational costs of numerical weather prediction. However, they lack the versatility of conventional models. The team has therefore recently introduced AtmoRep, a first probabilistic foundation model of atmospheric dynamics for multi-purpose applications [Lessig 2023]. Through large-scale representation learning, AtmoRep encapsulates a general description of atmospheric dynamics based on the ERA5 reanalysis. Following the principles of in-context learning from natural language processing, adapted here to Earth system science, domain applications such as forecasting and downscaling can be performed without any task-specific training. The model has therefore been applied as the backbone for several tasks, from weather forecasting to downscaling, spatio-temporal interpolation and data-driven precipitation forecasting. After fine-tuning, AtmoRep achieves skill competitive with Pangu-Weather [Bi 2023] for short-term forecasting and substantially exceeds the AI-based competitor [Stengel 2021] for downscaling.


The model has been conceived as a flexible stack of Transformers, one per field, coupled through cross-attention to ensure a plug-and-play architecture and allow the dynamical integration of new fields without retraining from scratch. The main innovation is a newly developed statistical loss, which generalises the concept of cross-entropy from classification problems. The model is therefore fully probabilistic, and each application comes with a well-calibrated set of ensemble members whose spread correlates with the variability of the system, as demonstrated, e.g., for forecasting by inspecting the CRPS score or the error-to-spread ratios (see [Lessig 2023]).
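The calibration check mentioned above relies on the CRPS. For a finite ensemble there is a standard closed form; the sketch below is an illustrative stand-alone implementation, not AtmoRep's own code.

```python
def ensemble_crps(members, obs):
    """Continuous Ranked Probability Score for a finite ensemble:
    CRPS = mean_i |x_i - y| - (1/2) * mean_{i,j} |x_i - x_j|.
    Lower is better; 0 for a perfect deterministic forecast."""
    m = len(members)
    term1 = sum(abs(x - obs) for x in members) / m
    term2 = sum(abs(xi - xj) for xi in members for xj in members) / (2 * m * m)
    return term1 - term2

# A two-member ensemble bracketing the observation:
print(ensemble_crps([0.0, 1.0], 0.5))  # 0.25
```

The second term rewards ensemble spread, so a well-calibrated probabilistic model scores better than an equally accurate but overconfident one.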


In addition, the flexible nature of the model allows fine-tuning on different data types. To demonstrate this, AtmoRep's precipitation forecasting has been fine-tuned on real radar data, using the Radklim dataset as a proxy for accurate total precipitation rates. With Radklim as ground truth, diagnostic scores such as the RMSE and the FBI (Frequency Bias Indicator) indicate unequivocally that after fine-tuning the AtmoRep model outperforms ERA5, in terms of accuracy in both spatial coverage and intensity.


In terms of future plans, we are currently working to extend the model to longer lead times, up to medium-range forecasting. Furthermore, we are integrating the downscaling and forecasting steps using the CERRA 5 km resolution reanalysis over Europe, so as to achieve multi-resolution coarse-to-fine predictions beyond quarter-degree resolution in the next few months.

AtmoRep represents a step forward in the direction of building solid and skilful multi-purpose approaches and the present work is, in our opinion, only a first step towards the possibilities that are enabled by the methodology.


[Lessig 2023] Lessig et al.: AtmoRep: A stochastic model of atmosphere dynamics using large scale representation learning. arXiv:2308.13280, 2023.

[Bi 2023] K. Bi et al., “Accurate medium-range global weather forecasting with 3d neural networks,” Nature, 2023.

[Stengel 2021] K. Stengel et al., “Adversarial super-resolution of climatological wind and solar data,” Proceedings of the National Academy of Sciences, vol. 117, 2020.

How to cite: Luise, I., Lessig, C., Schultz, M., and Langguth, M.: AtmoRep: large scale representation learning for atmospheric dynamics, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-1651, 2024.

Discussion: FM for Science
On-site presentation
Michael Smith, Luke Fleming, and James Geach

We introduce EarthPT -- an Earth Observation (EO) pretrained transformer. EarthPT is a 700 million parameter decoding transformer foundation model trained in an autoregressive self-supervised manner and developed specifically with EO use cases in mind. We demonstrate that EarthPT is an effective forecaster that can accurately predict future pixel-level surface reflectances across the 400-2300 nm range. For example, forecasts of the evolution of the Normalised Difference Vegetation Index (NDVI) have a typical error of approximately 0.05 (over a natural range of -1 to 1) at the pixel level over a five-month test horizon, out-performing simple phase-folded models based on historical averaging. We also demonstrate that embeddings learnt by EarthPT hold semantically meaningful information and could be exploited for downstream tasks such as highly granular, dynamic land use classification. Excitingly, we note that the abundance of EO data provides us with -- in theory -- quadrillions of training tokens. Therefore, if we assume that EarthPT follows neural scaling laws akin to those derived for Large Language Models (LLMs), there is currently no data-imposed limit to scaling EarthPT and other similar 'Large Observation Models'.
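For readers unfamiliar with the two quantities compared above, the sketch below shows the standard NDVI definition and a naive phase-folded (monthly climatology) baseline of the kind EarthPT is benchmarked against; the function names and data layout are illustrative, not from the EarthPT codebase.

```python
def ndvi(nir, red):
    """Normalised Difference Vegetation Index from surface reflectances;
    ranges from -1 to 1, with high values indicating dense vegetation."""
    return (nir - red) / (nir + red)

def phase_folded_baseline(history, month):
    """Naive seasonal forecast: average all past NDVI values observed in
    the same calendar month. `history` maps month (1-12) -> list of values."""
    vals = history[month]
    return sum(vals) / len(vals)

print(ndvi(0.5, 0.1))  # ~0.667: strong vegetation signal
print(phase_folded_baseline({6: [0.6, 0.8]}, 6))  # ~0.7
```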

EarthPT is available under the MIT licence.

How to cite: Smith, M., Fleming, L., and Geach, J.: EarthPT: a foundation model for Earth Observation, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-1760, 2024.

On-site presentation
Anna Jungbluth, Matt Allen, Francisco Dorr, Joseph Gallego-Mejia, Laura Martínez-Ferrer, Freddie Kalaitzis, and Raúl Ramos-Pollán

Satellite-based Earth Observation (EO) is crucial for monitoring land changes and natural hazards on a global scale. In addition to optical imagery, synthetic aperture radar (SAR) technology has proven indispensable, since radar pulses can penetrate clouds and detect millimeter changes on the ground surface. While SAR polarimetry data is easily available (e.g. via Google Earth Engine), interferometric products are harder to obtain due to complex pre-processing requirements. 

In general, using the information contained in EO data (both optical and SAR) for specific downstream tasks often requires specialized analysis pipelines that are not easily accessible to the scientific community. In the context of applying machine learning to EO, self-supervised learning (SSL) - in which models learn features from data without being provided with explicit labels - offers great potential to fully leverage the wealth and complexity of the available data.

In this work, we apply self-supervised learning techniques to create pre-trained models that can leverage the features learned from unlabelled EO data for a variety of downstream tasks. More specifically, we pre-train our models on optical imagery (Sentinel-2) or SAR data (Sentinel-1), and fine-tune our models to predict local events (e.g. fires, floods) and annual land characteristics (e.g. vegetation percentage, land cover, biomass). We compare a number of state-of-the-art SSL techniques (MAE1, DINO2, VICReg3, CLIP4) that have shown great performance on standard image or text based tasks. By adapting these models to our use case, we demonstrate the potential of SSL for EO, and show that self-supervised pre-training strongly reduces the requirement for labels.
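The pre-train/fine-tune split described above can be sketched minimally: a frozen encoder supplies features, and only a small head is trained on the scarce labels. Everything below is a hypothetical stand-in (a fixed random projection plays the role of the SSL-pretrained backbone; the toy labels mimic a binary event map), not the authors' pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pretrained encoder (hypothetical: a fixed random
# projection; in practice this would be the MAE/DINO/VICReg/CLIP backbone).
W_enc = rng.normal(size=(16, 4))
def encode(x):
    return np.tanh(x @ W_enc)

# Toy labelled downstream data (e.g. "flood" vs "no flood" patches).
X = rng.normal(size=(200, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Fine-tune only a logistic-regression head on the frozen features.
Z = encode(X)
w, b = np.zeros(Z.shape[1]), 0.0

def loss():
    p = 1 / (1 + np.exp(-(Z @ w + b)))
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

l0 = loss()
for _ in range(500):
    p = 1 / (1 + np.exp(-(Z @ w + b)))
    g = p - y                       # gradient of the cross-entropy w.r.t. logits
    w -= 0.1 * Z.T @ g / len(y)
    b -= 0.1 * g.mean()
print(l0, loss())  # head-only training reduces the downstream loss
```

The point of the sketch is the division of labour: the expensive representation (here `W_enc`) is learned once without labels, and each downstream task only fits the cheap head.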

In addition to the pre-trained models, we provide global benchmarking datasets of EO input data and associated downstream tasks ready for use in machine learning pipelines. Our data contains 25+ TB of co-registered and aligned tiles, covering South America, the US, Europe, and Asia. By comparing how well our pre-trained models perform on unseen data (both regionally and temporally), we investigate the generalizability of SSL techniques for EO research. With this, our work provides a first step towards creating EO foundation models that can predict anything, anywhere on Earth.


1. He, K. et al. Masked Autoencoders Are Scalable Vision Learners. (2021).

2. Caron, M. et al. Emerging Properties in Self-Supervised Vision Transformers. (2021).

3. Bardes, A., Ponce, J. & LeCun, Y. VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. (2021).

4. Radford, A. et al. Learning Transferable Visual Models From Natural Language Supervision. (2021).

How to cite: Jungbluth, A., Allen, M., Dorr, F., Gallego-Mejia, J., Martínez-Ferrer, L., Kalaitzis, F., and Ramos-Pollán, R.: Towards Foundation Models for Earth Observation; Benchmarking Datasets and Performance on Diverse Downstream Tasks, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-11514, 2024.

On-site presentation
Bertrand Le Saux, Casper Fibaek, Luke Camilleri, Andreas Luyts, Nikolaos Dionelis, Giacomo Donato Cascarano, Leonardo Bagaglini, and Giorgio Pasquali

Foundation Models (FMs) are the latest big advancement in AI that build upon Deep Learning. They have the ability to analyse large volumes of unlabeled Earth Observation (EO) data by learning at scale, identifying complex patterns and trends that may be difficult or even impossible to detect through traditional methods. These models can then be used as a base to create powerful applications that automatically identify, classify, and analyse features in EO data, unlocking the full potential of AI in EO like never before, providing a paradigm shift in the field.

The field of geospatial FMs is blooming with milestones such as Seasonal Contrast (SeCo) [1] or Prithvi [2]. We present the PhilEO Suite: a dataset (the PhilEO Globe), a series of models (the PhilEO Pillars), and an evaluation testbed (the PhilEO Bench).

In particular, the PhilEO Bench [3] is a novel framework to evaluate the performance of the numerous EO FM propositions on a unified set of downstream tasks. There is now a need to assess them with respect to their expected qualities in terms of generalisation, universality, label efficiency, and the ease with which specialised models can be derived. The PhilEO Bench comprises a fair testbed, independent of external factors, and a novel 400 GB global, stratified Sentinel-2 dataset containing labels for three downstream tasks: building density estimation, road segmentation, and land cover classification.



[1] Oscar Manas, et al., “Seasonal Contrast: Unsupervised pre-training from uncurated remote sensing data,” in Proc. ICCV, 2021.

[2] Johannes Jakubik, Sujit Roy, et al., “Foundation Models for Generalist Geospatial Artificial Intelligence,” arxiv:2310.18660, 2023.

[3] Casper Fibaek, Luke Camilleri, Andreas Luyts, Nikolaos Dionelis, and Bertrand Le Saux, “PhilEO Bench: Evaluating Geo-Spatial Foundation Models,” arXiv:2401.04464, 2024.

How to cite: Le Saux, B., Fibaek, C., Camilleri, L., Luyts, A., Dionelis, N., Cascarano, G. D., Bagaglini, L., and Pasquali, G.: The PhilEO Geospatial Foundation Model Suite, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-17934, 2024.

On-site presentation
Lin He, Yi Lin, and Yufei Song

Remote sensing image scene classification annotates semantic categories for image areas covering multiple land cover types, reflecting the spatial aggregation of relevant social resources among feature objects; it is one of the more challenging remote sensing interpretation tasks, as it requires algorithms to understand the images. Extracting scene semantic information from images with deep neural networks is currently an active research direction. Compared to other algorithms, deep neural networks better capture semantic information in images, achieving the higher classification accuracy needed in applications such as urban planning. In recent years, multi-modal models, typified by image-text models, have achieved satisfactory performance on downstream tasks. The introduction of "multi-modal" methods in remote sensing research should not be limited to the use of multi-source data; more important are the encoding of diverse data and the deep features extracted from huge amounts of data. Therefore, in this paper, based on an image-text matching model, we establish a multi-modal scene classification model (Fig. 1) for high spatial resolution aerial images, in which image features dominate and text facilitates the representation of image features. The algorithm first employs self-supervised learning of the visual model to align the expression domain of image features obtained from training on natural images with that of our particular dataset, which helps improve the feature extraction of the visual model on aerial survey images. The features generated by the pre-trained image encoder and the text encoder are further aligned, and some parameters of the image encoder are iteratively updated during training. A classifier at the end of the model implements the scene classification task.
Experiments show that, compared to single visual models, our algorithm significantly improves scene categorization of aerial survey images. The model obtained precision and recall above 90% on the test split of the 27-category high spatial resolution aerial survey image dataset we built (Fig. 2).

Fig 1. Diagram of the proposed model structure. Blue boxes are associated with the image, green boxes with the text, and red boxes with both image and text.

Fig 2. Samples in our high spatial resolution aerial survey images dataset.

How to cite: He, L., Lin, Y., and Song, Y.: A multi-modal high spatial resolution aerial imagery scene classification model with visual enhancement, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-6107, 2024.

On-site presentation
Martin Reinhardt, Karin Mora, Gunnar Brandt, Tejas Morbagal Harish, David Montero, Chaonan Ji, Teja Kattenborn, Francesco Martinuzzi, Clemens Mosig, and Miguel D. Mahecha

Terrestrial surface processes exhibit distinctive spectral signatures captured by optical satellites. Despite the development of over two hundred spectral indices (SIs), current studies often narrow their focus to individual SIs, overlooking the broader context of land surface processes. This project seeks to understand the holistic features of Sentinel-2 based SIs and their relationships with human impact and overall land surface dynamics. To address this, we propose an AI-driven approach that synthesises SIs derived from Sentinel data through dimension reduction, yielding interpretable latent variables describing the system comprehensively. Our goals are to (i) reduce the number of SIs and (ii) compute a few latent variables representing spatio-temporal dynamics, which culminate in a Feature Data Cube. This fully descriptive cube reduces computational costs, facilitating diverse applications. We plan to demonstrate its efficacy in land cover classification, standing deadwood detection, and terrestrial gross primary production estimation. The presentation outlines the project's implementation strategy, confronts methodological challenges, and extends an invitation to the remote sensing and machine learning community to collaborate on pressing environmental challenges. The project DeepFeatures is funded by ESA’s AI4Science activity. Website: 

How to cite: Reinhardt, M., Mora, K., Brandt, G., Morbagal Harish, T., Montero, D., Ji, C., Kattenborn, T., Martinuzzi, F., Mosig, C., and Mahecha, M. D.: DeepFeatures: Remote sensing beyond spectral indices, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-18852, 2024.

On-site presentation
Mohanad Albughdadi, Vasileios Baousis, Tolga Kaprol, Armagan Karatosun, and Claudio Pisa

In the realm of remote sensing, where labeled datasets are scarce, leveraging pre-trained models via transfer learning offers a compelling solution. This study investigates the efficacy of the Segment Anything Model (SAM), a foundational computer vision model, in the domain of optical remote sensing tasks, specifically focusing on image classification and semantic segmentation.

The scarcity of labeled data in remote sensing poses a significant challenge for machine learning development. Transfer learning, a technique utilizing pre-trained models like SAM, circumvents this challenge by leveraging existing data from related domains. SAM, developed and trained by Meta AI, serves as a foundational model for prompt-based image segmentation. It was trained on over 1 billion masks from 11 million images, giving it robust zero-shot and few-shot capabilities. SAM's architecture comprises an image encoder, a prompt encoder, and a mask decoder, all geared towards swift and accurate segmentation for various prompts, ensuring real-time interactivity and handling of ambiguity.

Two distinct use cases leveraging SAM-based models in the domain of optical remote sensing are presented, representing two critical tasks: image classification and semantic segmentation. Through comprehensive analysis and comparative assessments, various model architectures are examined, including linear and convolutional classifiers, SAM-based adaptations, and UNet for semantic segmentation. Experiments contrast model performance across different dataset splits and varying training data sizes. The SAM-based models place a linear, a convolutional, or a ViT-decoder classifier on top of the SAM encoder.

Use Case 1: Image Classification with EuroSAT Dataset

The EuroSAT dataset, comprising 27,000 labeled image patches from Sentinel-2 satellite images across ten distinct land cover classes, serves as the testing ground for image classification tasks. SAM-ViT models consistently demonstrate high accuracy, ranging between 89% and 93% across training datasets of various sizes. These models outperform baseline approaches, exhibiting resilience even with limited training data. This use case highlights SAM-ViT's effectiveness in accurately categorizing land cover classes despite data limitations.

Use Case 2: Semantic Segmentation with Road Dataset

In the semantic segmentation domain, the study focuses on the Road dataset, evaluating SAM-based models, particularly SAM-CONV, against the benchmark UNet model. SAM-CONV showcases remarkable superiority, achieving F1-scores and Dice coefficients exceeding 0.84 and 0.82, respectively. Its exceptional performance in pixel-level labeling emphasizes its robustness in delineating roads from surrounding environments, surpassing established benchmarks and demonstrating its applicability in fine-grained analysis.
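For reference, the two segmentation metrics quoted above can be computed from a binary prediction mask as follows; note that on a single binary mask Dice and F1 are the same formula, so the slightly different aggregate values in the abstract presumably arise from how scores are averaged over images. The sketch is illustrative, not the study's evaluation code.

```python
def dice_binary(pred, truth):
    """Dice coefficient for flat binary masks (lists of 0/1).
    Dice = 2*TP / (2*TP + FP + FN), identical to F1 for binary masks."""
    tp = sum(p * t for p, t in zip(pred, truth))
    fp = sum(p * (1 - t) for p, t in zip(pred, truth))
    fn = sum((1 - p) * t for p, t in zip(pred, truth))
    return 2 * tp / (2 * tp + fp + fn)

# One true positive, one false positive, one false negative:
print(dice_binary([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```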

In conclusion, SAM-driven transfer learning methods hold promise for robust remote sensing analysis. SAM-ViT excels in image classification, while SAM-CONV demonstrates superiority in semantic segmentation, paving the way for their practical use in real-world remote sensing applications despite limited labeled data availability.

How to cite: Albughdadi, M., Baousis, V., Kaprol, T., Karatosun, A., and Pisa, C.: Exploring Transfer Learning Using Segment Anything Model in Optical Remote Sensing, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-5769, 2024.


Posters on site: Thu, 18 Apr, 10:45–12:30 | Hall X2

Display time: Thu, 18 Apr 08:30–Thu, 18 Apr 12:30
Chairpersons: Valentine Anantharaj, Tsengdar Lee
Valentine Anantharaj, Takuya Kurihana, Gabriele Padovani, Ankur Kumar, Aristeidis Tsaris, Udaysankar Nair, Sandro Fiore, and Ian Foster

Pretraining a foundation model using MODIS observations of the earth’s atmosphere 

The earth and atmospheric sciences research community has an unprecedented opportunity to exploit the vast amount of data available from earth observation (EO) satellites and earth system models (ESM). Smaller and cheaper satellites with reduced operational costs have made a variety of EO data affordable, and technological advances have made the data accessible to a wide range of stakeholders, especially the scientific community (EY, 2023). The NASA ESDS program alone is expected to host 320 PB of data by 2030 (NASA ESDS, 2023). The ascent and application of artificial intelligence foundation models (FM) can be attributed to the availability of large volumes of curated data, accessibility to extensive compute resources and the maturity of deep learning architectures, especially the transformer (Bommasani et al., 2021). 

Developing a foundation model involves pretraining a suitable deep learning architecture with large amounts of data, often via self-supervised learning (SSL) methods. The pretrained models can then be adapted to downstream tasks via fine-tuning, requiring less data than task-specific models. Large language models (LLM) are likely the most common type of foundation model encountered by the general public. Vision transformers (ViT) adapt the transformer architecture underlying LLMs to image and image-like data (Dosovitskiy et al., 2020), such as EO data and ESM simulation output. We are in the process of pretraining a ViT model for the earth’s atmosphere using a select few bands of 1-km Level-1B MODIS radiances and brightness temperatures, MOD021KM and MYD021KM from the NASA Terra and Aqua satellites respectively. We are using 200 million image chips of size 128x128 pixels. We are pretraining two ViT models of 100 million and 400 million parameters respectively. The pretrained models will be finetuned for cloud classification and evaluated against AICCA (Kurihana et al., 2022). We will discuss our experiences involving data and computing, and present preliminary results.
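Preparing pretraining data of the kind described above amounts to tiling each granule into fixed-size chips. The sketch below shows the idea with a hypothetical 5-band array; it is illustrative only and not the authors' MODIS pipeline.

```python
import numpy as np

def chip(image, size=128):
    """Split a (H, W, C) scene into non-overlapping size x size chips,
    discarding partial chips at the right/bottom edges."""
    chips = [image[i:i + size, j:j + size]
             for i in range(0, image.shape[0] - size + 1, size)
             for j in range(0, image.shape[1] - size + 1, size)]
    return np.stack(chips)

scene = np.zeros((300, 300, 5), dtype=np.float32)  # hypothetical 5-band granule
print(chip(scene).shape)  # (4, 128, 128, 5): 2 x 2 chips of 128 x 128
```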



Bommasani R, Hudson DA, Adeli E, Altman R, Arora S, et al: On the opportunities and risks of foundation models. CoRR abs/2108.07258., 2021. 

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S. and Uszkoreit, J.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

Ernst & Young (EY): How can the vantage of space give you strategic advantage on Earth?, 2023. Accessed 10 January 2024.

Kurihana, T., Moyer, E. J., and Foster, I. T.: AICCA: AI-Driven Cloud Classification Atlas. Remote Sensing, 14(22), 5690, 2022.

NASA MODIS: MODIS - Level 1B Calibrated Radiances. DOI: 10.5067/MODIS/MOD021KM.061 and DOI: 10.5067/MODIS/MYD021KM.061

NASA ESDS: Earthdata Cloud Evolution Accessed 10 January 2024.

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I: Attention is all you need. Adv Neural Inf Process Syst 30, 2017.

How to cite: Anantharaj, V., Kurihana, T., Padovani, G., Kumar, A., Tsaris, A., Nair, U., Fiore, S., and Foster, I.: Pretraining a foundation model using MODIS observations of the earth’s atmosphere, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-22461, 2024.

Michael Langguth, Christian Lessig, Martin Schultz, and Ilaria Luise

In recent years, deep neural networks (DNNs) for enhancing the resolution of meteorological data, known as statistical downscaling, have surpassed previously developed classical statistical methods on several validation metrics. The prevailing approach for DNN downscaling is to train deep learning models in an end-to-end manner. However, foundation models trained on very large datasets in a self-supervised way have proven to provide new SOTA results for various applications in natural language processing and computer vision.

To investigate the benefit of foundation models in Earth science applications, we deploy the large-scale representation model for atmospheric dynamics AtmoRep (Lessig et al., 2023) for statistical downscaling of the 2 m temperature over Central Europe. AtmoRep has been trained on almost 40 years of ERA5 data, from 1979 to 2017, and has shown promising skill in several intrinsic and downstream applications. By extending AtmoRep’s encoder-decoder with a tail network for downscaling, we super-resolve the coarse-grained 2 m temperature field from ERA5 data (Δx = 25 km) to attain the high spatial resolution (Δx = 6 km) of the COSMO REA6 dataset. Different coupling approaches between the core and tail networks (e.g. with and without fine-tuning the core model) are tested and analyzed in terms of accuracy and computational efficiency. Preliminary results show that downscaling with a task-specific extension of the foundation model AtmoRep can improve the downscaled product in terms of standard evaluation metrics such as the RMSE compared to a task-specific deep learning model. However, deficiencies in the spatial variability of the downscaled product are also revealed, highlighting the need for future work to focus especially on target data that exhibit a high degree of spatial variability and intrinsic uncertainty, such as precipitation.
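The two failure modes discussed above — pointwise error versus lost spatial variability — can be separated with two simple diagnostics. The sketch below (illustrative metric choices, not the study's evaluation code) computes the RMSE and a variability ratio; a ratio well below 1 signals the over-smoothing the abstract notes.

```python
import math

def rmse(pred, ref):
    """Root-mean-square error between flattened fields."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))

def variability_ratio(pred, ref):
    """Ratio of spatial standard deviations (prediction / reference);
    values well below 1 indicate an over-smoothed downscaled field."""
    def std(v):
        m = sum(v) / len(v)
        return math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return std(pred) / std(ref)

pred = [10.0, 10.5, 11.0, 11.5]   # smooth prediction (hypothetical values)
ref  = [9.0, 10.0, 12.0, 12.0]    # rougher reference field
print(rmse(pred, ref), variability_ratio(pred, ref))
```

A model can score well on RMSE while still failing the variability check, which is exactly why both are worth reporting for fields like precipitation.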

How to cite: Langguth, M., Lessig, C., Schultz, M., and Luise, I.: Downscaling with the foundation model AtmoRep, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-18331, 2024.

Iraklis Giannakis, Anshuman Bhardwaj, Lydia Sam, and Georgis Leontidis

Impact craters, resulting from the collision of meteorites, asteroids, or comets with planetary surfaces, manifest as circular-elliptical depressions with diverse sizes and shapes influenced by various factors. These morphological features play a crucial role in planetary exploration, offering insights into the geological composition and structure of celestial bodies. Beyond their scientific importance, craters may also hold valuable natural resources, such as frozen water in the Moon's permanently shadowed craters. Furthermore, understanding craters’ spatial distribution is pivotal for terrain-relative navigation and for selecting future landing sites.

Manual crater mapping through visual inspection is an impractical and laborious process, often unattainable for large-scale investigations. Moreover, manual crater mapping is susceptible to human errors and biases, leading to potential disagreements of up to 40%. In order to tackle these issues, semi-automatic crater detection algorithms (CDA) have been developed to mitigate human biases, and to enable large-scale and real-time crater detection and mapping.

The majority of CDAs are based on machine learning (ML) and data-driven methods. ML-based CDAs are trained in a supervised manner using specific, manually labelled datasets. As a result, existing ML-based CDAs are constrained to the data types present in their training data. This makes current ML-based CDAs brittle and impractical, since applying an ML scheme to a different type of data requires acquiring and labelling a new training set and then training a new ML scheme, or fine-tuning an existing one.

In this study, we describe a universal approach [1] for crater identification based on the Segment Anything Model (SAM), a foundational computer vision and image segmentation model developed by Meta [2]. SAM was trained with over 1 billion masks and is capable of segmenting various data types (e.g., photos, DEM, spectra, gravity) from different celestial bodies (e.g., Moon, Mars) and measurement setups. The segmentation output undergoes further classification into crater and non-crater based on geometric indices assessing circular and elliptical attributes of the investigated mask. The proposed framework proves effective across datasets from various planetary bodies and measurement configurations. The outcomes of this study underline the potential of foundational segmentation models in planetary science. Foundational models tuned for planetary data can provide universal classifiers, contributing towards an automatic scheme for identifying, detecting and mapping various morphological and geological targets on different celestial bodies.
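One plausible form of the geometric index used to filter SAM masks into crater/non-crater is the isoperimetric quotient, sketched below; the paper's exact indices may differ, so treat this as an illustration of the filtering idea rather than the authors' implementation.

```python
import math

def circularity(area, perimeter):
    """Isoperimetric quotient 4*pi*A / P^2: 1 for a perfect circle,
    progressively smaller for elongated or irregular outlines.
    A mask could be kept as a crater candidate if this exceeds a threshold."""
    return 4 * math.pi * area / perimeter ** 2

r = 5.0
print(circularity(math.pi * r ** 2, 2 * math.pi * r))  # ~1.0 for a circle
print(circularity(25.0, 20.0))                          # ~0.785 for a square
```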



[1] Giannakis, I., Bhardwaj, A., Sam, L., and Leontidis, G. (2024). A Flexible Deep Learning Crater Detection Scheme Using Segment Anything Model (SAM). Icarus.

[2] Kirillov, A., et al. (2023). Segment Anything. arXiv:2304.02643.

How to cite: Giannakis, I., Bhardwaj, A., Sam, L., and Leontidis, G.: Segment Anything Model (SAM) for Automatic Crater Detection, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-21146, 2024.

Filippo Bocchino, Germana Sergi, Roberta Ravanelli, and Mattia Crespi

Water reservoirs play a crucial role in the supply of freshwater, agricultural irrigation, hydroelectric power generation, and various industrial applications. However, their existence is increasingly threatened by water stress, due to growing water demand, water pollution, and impacts of climate change, including intensified and prolonged droughts. To address this challenge, a sustainable management of water resources is essential, relying on continuous and accurate monitoring of water reservoirs. Modern Earth Observation technologies offer an effective, frequent, and cost-efficient means for monitoring water basins. 

This study focuses on evaluating the potential of the Segment Anything Model (SAM) network (Kirillov et al., 2023), released by Meta AI in April 2023, for segmenting water reservoirs in satellite images. SAM aims to serve as a foundational segmentation model capable of generalising its segmentation abilities in a zero-shot manner across diverse tasks. Unlike traditional supervised learning, zero-shot learning enables a model to recognize objects or features it has never seen during training. Notably, SAM's application to satellite imagery, a type of imagery on which it was not specifically trained, poses a unique challenge.

In this work, SAM was applied to Sentinel-2 multispectral imagery using a "prompt click" approach, where a water-class pixel was pre-selected for each input image. Google Earth Engine facilitated the temporal aggregation of Sentinel-2 images over the period of interest (01/01/2019 to 31/12/2019), creating four RGB median images, one for each three-month period. SAM was then applied independently to each of these four sub-periods.
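The quarterly median compositing can be illustrated with a toy stand-in for the Earth Engine median reducer; the scenes, dates, and pixel values below are invented for demonstration.

```python
from datetime import date
from statistics import median

def quarterly_medians(scenes):
    """scenes: list of (acquisition_date, 2-D image) pairs.
    Returns {quarter: per-pixel median composite} for one year of data,
    mimicking a median reduction over three-month image stacks."""
    buckets = {}
    for d, img in scenes:
        q = (d.month - 1) // 3 + 1          # 1..4
        buckets.setdefault(q, []).append(img)
    composites = {}
    for q, imgs in buckets.items():
        h, w = len(imgs[0]), len(imgs[0][0])
        composites[q] = [[median(img[i][j] for img in imgs)
                          for j in range(w)] for i in range(h)]
    return composites

# three Q1 scenes (one anomalously bright, e.g. cloud-affected) and one Q3 scene
scenes = [
    (date(2019, 1, 10), [[10, 10], [10, 10]]),
    (date(2019, 2, 20), [[12, 12], [12, 12]]),
    (date(2019, 3, 5),  [[90, 90], [90, 90]]),
    (date(2019, 7, 1),  [[20, 20], [20, 20]]),
]
comps = quarterly_medians(scenes)
```

The per-pixel median suppresses the outlier scene in Q1, which is why median composites are a common way to obtain cloud-reduced quarterly images.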

Validation was carried out in the Genoa port area to minimise the influence of temporal water level variations, which in turn produce water area changes. Indeed, the use of a port area made it possible to consider a single reference mask for the different sub-periods analysed, greatly simplifying the validation procedure. 

The validation phase revealed SAM’s superior performance in coastlines with regular shapes and undisturbed water (Fig. 1 and Tab. 1), while port areas, characterised by irregular shapes, higher activity and turbidity, yielded less satisfactory results (Fig. 2 and Tab. 2). 

In conclusion, this study highlighted SAM's limitations, primarily related to the specific nature of satellite images, which differ greatly from its training data. In particular, SAM was trained on three-band (R, G, B), 8-bit images: the former prevented the use of all 13 bands of the Sentinel-2 multispectral imagery, while the latter required reducing the radiometric resolution of the Sentinel-2 images from 16 bit to 8 bit, both resulting in information loss. Despite these limitations, SAM demonstrated effective segmentation capabilities, especially in simpler and less disturbed coastal areas, comparable to water segmentation algorithms based on spectral indices. Future improvements could be achieved through fine-tuning on satellite images and applying SAM to high-resolution ones.
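The radiometric reduction and its information loss can be sketched as follows; the 0–10000 input range and the clipping behaviour are assumptions based on typical Sentinel-2 Level-2A reflectance scaling, not details from the study.

```python
def to_8bit(band16, lo=0, hi=10000):
    """Linearly rescale 16-bit band values into the 0-255 range that an
    RGB model such as SAM expects; values outside [lo, hi] are clipped.
    Quantising ~10000 levels into 256 necessarily discards information."""
    out = []
    for v in band16:
        v = min(max(v, lo), hi)             # clip to the assumed valid range
        out.append(round((v - lo) / (hi - lo) * 255))
    return out

# nearby 16-bit values collapse onto the same 8-bit level
low, high = to_8bit([0, 10000]), None
a, b = to_8bit([5000]), to_8bit([5010])
```

Here two reflectance values 10 counts apart become indistinguishable after conversion, which is the kind of radiometric information loss the study describes.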

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., et al. (2023). Segment Anything. arXiv preprint arXiv:2304.02643.

How to cite: Bocchino, F., Sergi, G., Ravanelli, R., and Crespi, M.: Preliminary analysis of the potentialities of the Segment Anything Model (SAM) in the segmentation of Sentinel-2 imagery for water reservoir monitoring, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-11145, 2024.

Posters virtual: Thu, 18 Apr, 14:00–15:45 | vHall X2

Display time: Thu, 18 Apr 08:30–Thu, 18 Apr 18:00
Chairpersons: Rahul Ramachandran, Takuya Kurihana
Ali J. Ghandour, Hasan Moughnieh, Mohammad Hasan Zahweh, Hasan Nasrallah, Mustafa Shukor, Cristiano Nattero, and Paolo Campanella

Foundation models have demonstrated impressive proficiency across multiple domains, including language, vision, and multi-modal applications, establishing new standards for efficiency and adaptability. The core strength of localization-based foundation models is their ability to precisely recognize and locate a diverse set of objects in wide-area scenes. This precision is particularly vital in the Remote Sensing (RS) field. The multimodality of these models becomes pivotal in RS, as they can process and interpret complex data, allowing for more comprehensive aerial and satellite image analysis.

Multimodality has emerged as a crucial and dynamic area in recent AI developments, with diverse applications such as image captioning and visual question answering. Among tasks closer to traditional vision, Visual Grounding (VG) stands out, involving the localization of objects based on textual descriptions. Unlike conventional approaches that train models on predefined, fixed lists of objects, VG allows a model to locate any entity in an image based on diverse textual descriptions, enabling open-vocabulary predictions. Despite notable efforts in developing powerful VG models for general benchmarks, the transfer of these models to the remote sensing context remains underexplored.

This paper addresses this gap by examining the task of visual grounding for remote sensing. Our initial exploration reveals that applying general pretrained foundation models to RS yields suboptimal performance. Recognizing these limitations, our work systematically investigates various parameter-efficient tuning techniques to fine-tune these models for RS visual grounding applications. The insights and methodologies presented in this paper provide valuable guidance for researchers seeking to adapt pretrained models to the RS domain efficiently, a significant stride toward enhancing the applicability of visual grounding in remote sensing scenarios.
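One widely used family of parameter-efficient tuning techniques adds a trainable low-rank update to frozen weights (LoRA-style adapters). The toy layer below, in plain Python with no deep-learning framework, is a generic illustration of that idea, not the specific method evaluated in the paper.

```python
import random

class LoRALinear:
    """Frozen weight matrix W plus a trainable low-rank update A @ B.
    Only A and B are updated during fine-tuning, so the number of
    trainable parameters is a small fraction of the total."""
    def __init__(self, d_in, d_out, rank=2):
        rnd = random.Random(0)
        # frozen pretrained weight (random here as a stand-in)
        self.W = [[rnd.gauss(0, 0.1) for _ in range(d_in)] for _ in range(d_out)]
        # trainable: A is zero-initialised so the update starts as a no-op
        self.A = [[0.0] * rank for _ in range(d_out)]
        self.B = [[rnd.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]

    def trainable_parameters(self):
        return sum(len(r) for r in self.A) + sum(len(r) for r in self.B)

    def total_parameters(self):
        return self.trainable_parameters() + sum(len(r) for r in self.W)

    def forward(self, x):
        # y = (W + A @ B) x, computed without materialising A @ B
        bx = [sum(b * v for b, v in zip(row, x)) for row in self.B]
        return [sum(w * v for w, v in zip(wr, x)) +
                sum(a * v for a, v in zip(ar, bx))
                for wr, ar in zip(self.W, self.A)]

layer = LoRALinear(d_in=64, d_out=64, rank=2)
```

For a 64x64 layer with rank 2, the adapter holds 256 trainable parameters against 4096 frozen ones, which is the kind of parameter budget that makes adapting large pretrained models to a new domain such as RS affordable.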

How to cite: Ghandour, A. J., Moughnieh, H., Zahweh, M. H., Nasrallah, H., Shukor, M., Nattero, C., and Campanella, P.: Efficient adaptation of Foundation Models for Visual Grounding Remote Sensing task, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-10914, 2024.