ESSI1.9 | GeoML-Ops: Frameworks & methods for automated geospatial machine-learning at scale on hybrid systems
EDI Poster session
Convener: Thomas Brunschwiler | Co-conveners: Conrad AlbrechtECSECS, Campbell Watson, Grega Milcinski, Anca Anghelea
Posters on site | Attendance Tue, 25 Apr, 14:00–15:45 (CEST) | Hall X4 | Tue, 14:00
Machine learning (ML) applied to Earth observation (EO) data offers a rich source from which to distill insights about our planet and societal activities. Typically, such investigations run as scientific research projects or as industrial proof-of-concept studies with significant manual interaction. In practice, the corresponding solutions operate on a local or regional scale, considering individual events or limited time periods. Advancing platform technologies and adhering to Open Science principles to enable scalable and reproducible workflows of high complexity are key to driving innovation in EO science and applications.

In our session, presenters discuss the design of platforms and methods to develop and scale up end-to-end repeatable, reusable and/or reproducible ML-model workflows based on multi-modal EO data into global and real-time services. These methods support the organization of input data, efficient model training, continuous evaluation and testing, and deployment for federated operations on hybrid compute systems.

In particular, the following five topics will be addressed:
1. big geospatial data hubs for efficient preparation of analysis-ready data and features,
2. large-scale ML training on high-performance computing and cloud infrastructure,
3. frameworks for ML-operations at global scale considering complex workflows and hybrid systems,
4. reusability and reproducibility of complex EO-based workflows across platforms, as well as
5. big geospatial data and GeoML-model federation to reach maximal scale by efficient data sharing and model training & inference across institutions around the globe.

We aim not only to provide insights into frameworks and methods, but also to discuss the challenges faced en route from research experiments to a successfully integrated, real-time, global service.

Posters on site: Tue, 25 Apr, 14:00–15:45 | Hall X4

Chairpersons: Anca Anghelea, Conrad Albrecht, Thomas Brunschwiler
X4.215 | EGU23-3441
Michiaki Tatsubori, Daiki Kimura, Takao Moriyama, Naomi Simumba, and Tatsuya Ishikawa

While deep machine-learning approaches are becoming pervasive in remote sensing and in modeling the Earth, the sheer size of satellite data remains a constant pain point for scientists implementing such experimental software. We present a programming model for geospatial machine learning based on TorchGeo and PyTorch, which are becoming the de facto standards for geospatial deep learning in Python. TorchGeo is open-sourced and designed to make it simple for remote sensing experts to explore machine-learning solutions. Our objective is to allow machine-learning programs using TorchGeo to scale across proprietary high-performance computing (HPC) and multicloud HPC resources, directly from one's notebook. One of the key technologies specifically needed in geospatial machine learning is the smart integration of peta-scale data services with data-distributed parallel frameworks. We implement such a platform as part of the IBM Research Geospatial Discovery Network (GDN) and experiment with segmentation tasks such as flood detection from satellite data to demonstrate its scalability.
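
To give a flavour of the TorchGeo programming model the abstract builds on (this is a generic TorchGeo usage sketch, not the GDN platform itself), the following snippet samples co-registered image/mask patches for a segmentation task; the directory paths, patch size and batch size are assumptions:

```python
# Minimal TorchGeo sketch: stream co-registered image/mask patches for a
# segmentation task. Paths, patch size and batch size are illustrative only.
from torch.utils.data import DataLoader
from torchgeo.datasets import RasterDataset, stack_samples
from torchgeo.samplers import RandomGeoSampler

images = RasterDataset("data/sentinel2/")    # hypothetical directory of GeoTIFFs
masks = RasterDataset("data/flood_masks/")   # hypothetical rasterized flood labels
masks.is_image = False                       # treat values as labels, not imagery

dataset = images & masks                     # spatial intersection of both layers
sampler = RandomGeoSampler(dataset, size=256, length=1000)
loader = DataLoader(dataset, sampler=sampler, collate_fn=stack_samples, batch_size=8)

for batch in loader:
    x, y = batch["image"], batch["mask"]     # tensors ready for a PyTorch model
    break
```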

How to cite: Tatsubori, M., Kimura, D., Moriyama, T., Simumba, N., and Ishikawa, T.: A Programming Model for Geospatial Machine-Learning with Scalability in Hybrid Multiclouds, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3441, https://doi.org/10.5194/egusphere-egu23-3441, 2023.

X4.216 | EGU23-4160
Grega Milcinski and Primoz Kolaric

Every experiment starts with the data, which needs to be fine-tuned for the specific use-case. We call this "analysis ready data (ARD)". In some cases, for the sake of reusability and comparability, the specifications for ARD are well defined. In many other cases, however, the procedures are not yet mature enough to support standardisation. In the Earth Observation (EO) field this is especially true, as the whole community is moving from (semi-)manually analysing individual scenes, dating from the time when hardly any data were available, to processing time series, now that Landsat and Sentinel have made this possible. We now even face the problem that there is simply too much data, with PBs of open and commercial imagery readily available. With the data distributed across different places (Copernicus Data Access Service for Sentinel, AWS for Landsat), the challenge is further magnified. A machine learning (ML) approach can address the challenge of sifting through the data, but ML also requires data that are pre-processed for the purpose and made available where the ML is running. It is therefore essential to have a facility that can generate ARD customised to the requirements of a specific analysis.

Sentinel Hub (SH) is a satellite imagery processing service capable of on-the-fly gridding, re-projection, re-scaling, mosaicking, compositing, orthorectification and other actions required either for integration in web applications, where pictures are mostly served, or in ML and similar analysis processes, where pixel values and statistics are essential. SH works with the original satellite data files and does not require replication or pre-processing. It uses cloud infrastructure and innovative methods to efficiently process and distribute data in a matter of seconds. Sentinel Hub gives access to a rich collection of satellite data, including the full set of Sentinel satellites, Landsat collections, commercial VHR collections and other complementary collections. It also provides the ability for users to onboard their own data in one of the standardised formats. Furthermore, data located on different clouds can be fused together in one single process, benefiting from the variability and volume of different sensors.

There are two main capabilities that make SH especially fit for the purpose of generating on-demand ARD. The first is support for user-provided processing scripts, which are recipes for what should happen with the sensor data (band composites, indices, even simple neural networks combining available data). The second is a set of processing orchestration options. There is a Process API for immediate access to pixel values. The Statistical API is optimised for time-series analysis; it aggregates data over a specific area of interest and provides configurable statistics through time. And then there are asynchronous siblings of these services, fine-tuned for large-scale processing - if one wants to prepare ML features for an entire continent or get time series for millions of agricultural parcels.
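
As a hedged illustration of the user-provided processing scripts and the Process API described above, the sketch below uses the open-source sentinelhub-py client to request a small NDVI composite; the bounding box, time interval and credentials are assumptions, and argument names may vary slightly between client versions:

```python
# Sketch: request an NDVI raster via the Sentinel Hub Process API using
# sentinelhub-py. Area, dates and credentials are placeholders.
from sentinelhub import (SHConfig, BBox, CRS, DataCollection, MimeType,
                         SentinelHubRequest, bbox_to_dimensions)

config = SHConfig()  # assumes client credentials are already configured

evalscript = """
//VERSION=3
function setup() {
  return {input: ["B04", "B08"], output: {bands: 1, sampleType: "FLOAT32"}};
}
function evaluatePixel(s) {
  return [(s.B08 - s.B04) / (s.B08 + s.B04)];
}
"""

bbox = BBox((13.35, 45.95, 13.45, 46.05), crs=CRS.WGS84)  # illustrative AOI
request = SentinelHubRequest(
    evalscript=evalscript,
    input_data=[SentinelHubRequest.input_data(
        data_collection=DataCollection.SENTINEL2_L2A,
        time_interval=("2023-06-01", "2023-06-30"),
    )],
    responses=[SentinelHubRequest.output_response("default", MimeType.TIFF)],
    bbox=bbox,
    size=bbox_to_dimensions(bbox, resolution=10),
    config=config,
)
ndvi = request.get_data()[0]  # numpy array, ready for further analysis or ML
```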

We will present the technology behind the scenes that makes this processing possible, as well as several use cases showing how one can efficiently make use of the service in ML.

How to cite: Milcinski, G. and Kolaric, P.: Sentinel Hub - federated on-demand ARD generation, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-4160, https://doi.org/10.5194/egusphere-egu23-4160, 2023.

X4.217 | EGU23-11709
Karolina Sarna, Johannes Hiekkasaari, and Joni Taajamo

Fast response to natural catastrophe events is crucial in our fast-changing world. Creating comprehensible solutions based on Earth Observation (EO) and geospatial data is complex and requires combining multiple data sources and maintaining a large set of configuration parameters.

In this talk we discuss the application of a microservices architecture to tackle some of the issues inherent in building products based on EO and geospatial data. We will present how decomposing sophisticated algorithms into small services helps with continuous delivery, scaling and deployment of large, complex applications that can be reused for various products. This architecture enables reproducibility of analysis, which is a crucial component for applying machine learning and automation to any EO-based product. We will also address the additional complexity of creating a distributed system as well as the strong dependency on data consistency and availability.
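
As a generic, hedged illustration of the microservice decomposition discussed above (not the authors' actual services), a single EO processing step can be wrapped behind a small HTTP interface; the endpoint name, payload fields and index computation are assumptions:

```python
# Minimal sketch of one EO processing step exposed as a microservice with FastAPI.
# Field names and the index computation are illustrative only.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="ndvi-service")

class Patch(BaseModel):
    red: list[float]   # red-band reflectances for a pixel window
    nir: list[float]   # near-infrared reflectances, same length

@app.post("/ndvi")
def compute_ndvi(patch: Patch) -> dict:
    """Return per-pixel NDVI so downstream services can stay small and stateless."""
    ndvi = [
        (n - r) / (n + r) if (n + r) != 0 else 0.0
        for r, n in zip(patch.red, patch.nir)
    ]
    return {"ndvi": ndvi}

# Run locally with: uvicorn service:app --reload
```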

How to cite: Sarna, K., Hiekkasaari, J., and Taajamo, J.: Microservice architecture to enable fast assessment of the NatCat events based on EO and geospatial data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-11709, https://doi.org/10.5194/egusphere-egu23-11709, 2023.

X4.218 | EGU23-13852
Ioannis Tsoukalas, Panagiotis Kossieris, Luca Brocca, Silvia Barbetta, Hamidreza Mosaffa, and Christos Makropoulos

A key variable of Earth observation (EO) systems is precipitation, as indicated by the wide spectrum of applications in which it is involved (e.g., water resources and early warning systems for flood/drought events). During the last decade, the EO community has put significant research effort into the development of satellite-based precipitation products (SPPs); however, their deployment in real-world applications has not yet reached its full potential, despite their ever-growing availability, spatiotemporal coverage and resolution. This may be associated with the reluctance of end-users to employ SPPs, either because they worry about the uncertainty and biases inherent in SPPs or because of the existence of multiple SPPs, whose performance fluctuates across the globe, making it difficult to select the most appropriate SPP (a sort of choice paradox). To address this issue, this work targets the development of an explainable machine-learning approach capable of integrating multiple satellite-based precipitation (P) and soil moisture (SM) products into a single precipitation product. The aim, in principle, is to create a new dataset that optimally combines the properties of each individual satellite dataset (used as predictors), better matching the ground-based observations (used as predictand, i.e., the reference dataset). The proposed approach is showcased via a benchmark dataset consisting of 1009 cells/locations around the world (Europe, USA, Australia and India), highlighting its robustness as well as an application capability that is independent of specific climatic regimes and local peculiarities.
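
A minimal, hedged sketch of the general idea of merging several satellite precipitation and soil-moisture predictors against gauge observations with an interpretable tree-ensemble regressor follows; the abstract does not specify the algorithm, so the model choice, synthetic data and feature roles below are assumptions, not the authors' method:

```python
# Sketch: merge multiple satellite P/SM products into one precipitation estimate
# per cell, using gauge data as the target. Synthetic data stands in for the
# real products; the regressor choice is an assumption.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([
    rng.gamma(2.0, 2.0, n),   # stand-in for SPP no. 1
    rng.gamma(2.0, 2.0, n),   # stand-in for SPP no. 2
    rng.uniform(0, 1, n),     # stand-in for satellite soil moisture
])
y = 0.5 * X[:, 0] + 0.4 * X[:, 1] + 2.0 * X[:, 2] + rng.normal(0, 0.3, n)  # "gauge"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Explainability: which product drives the merged estimate at this location?
imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
print("R^2:", round(model.score(X_te, y_te), 3))
print("importances:", np.round(imp.importances_mean, 3))
```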

How to cite: Tsoukalas, I., Kossieris, P., Brocca, L., Barbetta, S., Mosaffa, H., and Makropoulos, C.: Can machine learning help us to create improved and trustworthy satellite-based precipitation products?, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13852, https://doi.org/10.5194/egusphere-egu23-13852, 2023.

X4.219 | EGU23-16802
Matej Batič, Žiga Lukšič, and Grega Milcinski

Analysing EO data is a complex process, and solutions often require custom-tailored algorithms. On top of that, most problems in the EO domain come with an additional challenge: how can the solution be applied at a large scale?

Within the H2020 project Global Earth Monitor (GEM) we have updated and extended eo-learn with additional functionalities that allow for new approaches to scalable and cost-effective Earth Observation data processing. We have tied it to Sentinel Hub's unified main data interface (Process API), to the Data Cube processing engine for constructing analysis-ready, adjustable data cubes using the Batch Process API, and, finally, to the Statistical API and Batch Statistical API to streamline access to spatio-temporally aggregated satellite data.

As part of the GEM processing framework, we have built eo-grow, which facilitates the extraction of valuable information from satellite imagery. eo-grow tackles the issue of scalability by coordinating clusters to run EO workflows over large areas using Ray. At the same time, the framework provides reproducibility and traceability of experiments through schemed input configurations and their validation.

In eo-grow, a workflow-based solution is wrapped into a pipeline object, which takes care of parametrization, logging, storage, multi-processing, data management and more. The pipeline object is configured via a well-defined schema, allowing straightforward experimentation and scaling up: moving to a larger area of interest, running on a different time interval, or tweaking any other pipeline parameter becomes just a matter of updating the (JSON) configuration, which additionally serves as a record of the experiment.
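
To illustrate the idea of schema-validated pipeline configurations doubling as experiment records, here is a generic pydantic (v2) sketch; this is not the actual eo-grow schema, and all field names are assumptions:

```python
# Generic sketch of a schema-validated pipeline configuration (pydantic v2).
# Field names are hypothetical; see the eo-grow docs for its real schema.
import json
from pydantic import BaseModel, Field

class PipelineConfig(BaseModel):
    pipeline: str                     # dotted path to the pipeline class
    area_of_interest: str             # e.g. a geometry file or named region
    time_interval: tuple[str, str]    # ISO dates
    resolution: float = Field(gt=0)   # metres per pixel
    workers: int = Field(default=4, ge=1)

raw = json.loads("""
{
  "pipeline": "my_project.pipelines.DownloadPipeline",
  "area_of_interest": "slovenia.geojson",
  "time_interval": ["2022-04-01", "2022-10-01"],
  "resolution": 10
}
""")

config = PipelineConfig(**raw)           # validation errors surface here
print(config.model_dump_json(indent=2))  # the validated config is the experiment record
```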

The eo-grow library has been publicly released on GitHub: https://github.com/sentinel-hub/eo-grow. The documentation available in the repository provides an overview of eo-grow's general structure and core objects, as well as instructions for installation and for using eo-grow via the command-line interface. An additional repository, https://github.com/sentinel-hub/eo-grow-examples, showcases eo-grow on a few use cases.

In the presentation we will introduce the framework and showcase its usability on concrete examples. We will illustrate how eo-grow is used in large-scale research experiments, explain its role in reproducibility, and show how the no-code approach and code reuse facilitate putting workflows into production.

How to cite: Batič, M., Lukšič, Ž., and Milcinski, G.: eo-grow - Earth Observation framework for scaled-up processing in Python, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-16802, https://doi.org/10.5194/egusphere-egu23-16802, 2023.

X4.220 | EGU23-14353
Frank de Morsier and Julien Rebetez

The key to driving innovation in EO science and applications (boosting geospatial mass adoption and in turn 'geo-enabling' companies, researchers and institutions) is moving away from complex, inefficient and expensive workflows and making fundamental changes in ML practices. This is where geospatial MLOps, and platforms such as Picterra, play a crucial role: cloud-native, shared platforms offer user-friendly and efficient interfaces and a smart toolkit, paired with auto-scaling infrastructure and state-of-the-art deep-learning architectures. They allow users to create and operate geospatial ML models at scale, enabling organizations to complete geospatial ML projects faster than ever before.

MLOps platforms systemize the process of building and training experimental machine learning models as well as translating them into production. This workflow efficiency empowers teams working with massive datasets, and allows organizations to leverage data analytics for decision-making and building better customer experiences.

Achieving productivity and speed requires streamlining and automating processes, as well as building reusable assets that can be managed closely for quality and risk. When significant model drift is detected, the ability to retrain and redeploy ML models in an automated fashion is crucial to ensure business continuity.

A shared platform, managed infrastructure and an integrable architecture result in streamlined pipelines and straightforward integration. This agility reduces the time to value and frees up time to serve more use cases, leading to increased value to the business. Companies implementing geospatial MLOps can speed up model training times, dramatically improve accuracy, and go from an idea to a live solution in just days – without increasing headcount or technical debt. Over time, they will also collect a library of strategic ML assets that will enable them to act on timely data - fast.

Using Picterra as a prime example of a geospatial ML platform built with MLOps processes at its core, we will dive into how it facilitates the key steps of ML workflows, including:

  • Direct access to a diverse range of satellite imagery sources via the platform, e.g. Sentinel-1/2, PlanetScope, open aerial imagery campaigns, and ingestion of WMS/XYZ server streams.
  • Compatibility with any geospatial imagery source (e.g. optical, SAR, hyperspectral, thermal infrared, etc.) and the possibility to connect to cloud data storage or upload directly via the web interface, besides the above-mentioned image servers.
  • An unequalled MLOps interface for prototyping the extraction of new information from imagery around any custom-defined use case, e.g. biodiversity monitoring, crop mapping and classification, asset management and many more. Trained models are directly served and made available for inference at large scale.
  • An extensive toolset for explainable and interpretable AI (for example, dataset exploration), bringing robustness and efficiency to the creation of geospatial machine-learning models.
  • Fast turnaround time in creating and validating machine-learning models, saving time and resources thanks to the auto-scaling infrastructure leveraging Kubernetes and an intuitive interface for fast prototyping.
  • A unique set of advanced GIS pre-/post-processing tools to manage imagery and the extracted geospatial outputs.
  • A complete API interface and Python library to further integrate with existing workflows or software (e.g. ESRI ArcGIS, Safe FME, etc.); a hedged usage sketch follows this list.
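
As an illustration of driving such a platform programmatically, the sketch below follows the pattern of the public picterra-python client examples; the exact method names and arguments are an assumption and may differ between client versions, and the detector ID and file paths are placeholders:

```python
# Hedged sketch modelled on the public picterra-python client examples;
# method names/arguments may differ by version. IDs and paths are placeholders.
from picterra import APIClient

client = APIClient()  # assumed to read PICTERRA_API_KEY from the environment

# Upload an image, run an already-trained detector on it, download the results.
raster_id = client.upload_raster("field_survey.tif", name="field survey (demo)")
detector_id = "00000000-0000-0000-0000-000000000000"  # placeholder detector UUID
result_id = client.run_detector(detector_id, raster_id)
client.download_result_to_file(result_id, "detections.geojson")
print("wrote detections.geojson")
```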

How to cite: de Morsier, F. and Rebetez, J.: MLOps in practice: how to scale your geospatial practice with cloud-based shared MLOps platform, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14353, https://doi.org/10.5194/egusphere-egu23-14353, 2023.

X4.221 | EGU23-3501 | ECS
Zhitong Xiong and Xiao Xiang Zhu

Earth observation (EO) data are critical for monitoring the state of planet Earth and can be helpful for various real-world applications [1]. Although numerous benchmark datasets have been released, there is no unified platform for developing and fairly comparing deep-learning models on EO data [2]. For deep-learning methods, the backbone networks, hyper-parameters and training details are influential factors when comparing performance. However, existing works usually neglect these details and even evaluate performance with different training/validation/test dataset splits. This makes it difficult to compare different algorithms fairly and reliably. In this study, we introduce the EarthNets platform, an open deep-learning platform for remote sensing and Earth observation. The platform is based on PyTorch [3] and TorchData. It comprises about ten different libraries covering different tasks in remote sensing. Among them, Dataset4EO is designed as a standard and easy-to-use data-loading library, which can be used alone or together with higher-level libraries such as RSI-Classification (for image classification), RSI-Detection (for object detection), RSI-Segmentation (for semantic segmentation), and so on. Two factors are considered in the design of the EarthNets platform. The first is the decoupling of dataset loading from high-level EO tasks: as there are more than 400 RS datasets with different data modalities, research domains and download links, efficient preparation of analysis-ready data can greatly accelerate research for the whole community. The other factor is bringing advances in machine learning to EO by providing new deep-learning models. The EarthNets platform provides a fair and consistent evaluation of deep-learning methods on remote sensing and Earth observation data [4]. It also helps bring together the remote sensing community and the larger machine-learning community. The platform and dataset collections are publicly available at https://earthnets.github.io.
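
To illustrate the dataset-loading/task decoupling the platform is built around, here is a generic TorchData-style sketch; it is not the Dataset4EO API, and the file layout and label parsing are hypothetical:

```python
# Generic sketch of decoupled data loading with TorchData datapipes.
# This is NOT the Dataset4EO API; file layout and label parsing are hypothetical.
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import FileLister, FileOpener

def parse_sample(item):
    """Turn (path, stream) into (bytes, label); label comes from the folder name."""
    path, stream = item
    label = path.split("/")[-2]          # hypothetical class-per-folder layout
    return stream.read(), label

dp = FileLister("data/eurosat/", recursive=True).filter(lambda p: p.endswith(".tif"))
dp = FileOpener(dp, mode="rb").map(parse_sample).shuffle()

# The same datapipe can now feed classification, detection or segmentation code.
loader = DataLoader(dp, batch_size=16, collate_fn=lambda batch: batch)
```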

[1] Zhu, Xiao Xiang, et al. "Deep learning in remote sensing: A comprehensive review and list of resources." IEEE Geoscience and Remote Sensing Magazine 5.4 (2017): 8-36.

[2] Long, Yang, et al. "On creating benchmark dataset for aerial image interpretation: Reviews, guidances, and million-aid." IEEE Journal of selected topics in applied earth observations and remote sensing 14 (2021): 4205-4230.

[3] Paszke, Adam, et al. "Pytorch: An imperative style, high-performance deep learning library." Advances in neural information processing systems 32 (2019).

[4] Xiong, Zhitong, et al. "EarthNets: Empowering AI in Earth observation." arXiv preprint arXiv:2210.04936 (2022).

How to cite: Xiong, Z. and Zhu, X. X.: EarthNets: An Open Deep Learning Platform for Earth Observation, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3501, https://doi.org/10.5194/egusphere-egu23-3501, 2023.

X4.222 | EGU23-7233
Blair Edwards, Paolo Fraccaro, Nikola Stoyanov, Anne Jones, Junaid Butt, Julian Kuehnert, Andrew Taylor, and Bhargav Garikipati

Understanding and quantifying the risk of the physical impacts of climate change, and their subsequent consequences, is of crucial importance in a changing climate, for both businesses and society more widely. Historically, modelling workflows to assess such impacts have been bespoke and constrained by the data they can consume, the compute infrastructure, the expertise required to run them and the specific ways they are configured. Here we present a cloud-native modelling framework for running geospatial models in a flexible, scalable, configurable and user-friendly manner. This enables models (physical or ML/AI) to be rapidly onboarded and composed into workflows. These workflows can be flexible, dynamic and extendable, running for historical events or as forecast ensembles, with varying data inputs, or extended to model impacts in the real world (for example on infrastructure and populations). The framework supports the streamlined training and deployment of AI models, which can be seamlessly integrated with physical models to create hybrid workflows. We demonstrate the application and features of the framework for the examples of flooding and wildfire.
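
As a purely hypothetical sketch of the hybrid-workflow idea described above (not the authors' framework; all names and numbers are illustrative), composing a physical hazard model with an ML impact model behind a common step interface could look like this:

```python
# Hypothetical sketch of composing physical and ML steps into one workflow.
# The framework in the abstract is not public; names here are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]   # each step maps a context dict to an updated one

def physical_flood_model(ctx: dict) -> dict:
    # Stand-in for a hydraulic model: depth scales with rainfall over a threshold.
    ctx["flood_depth_m"] = max(0.0, (ctx["rainfall_mm"] - 50.0) / 100.0)
    return ctx

def ml_impact_model(ctx: dict) -> dict:
    # Stand-in for a trained ML model estimating impacted buildings from depth.
    ctx["buildings_impacted"] = int(ctx["flood_depth_m"] * ctx["building_count"] * 0.1)
    return ctx

def run_workflow(steps: list[Step], ctx: dict) -> dict:
    for step in steps:
        ctx = step.run(ctx)       # a real framework would add logging, retries, scaling
    return ctx

workflow = [Step("hazard", physical_flood_model), Step("impact", ml_impact_model)]
print(run_workflow(workflow, {"rainfall_mm": 120.0, "building_count": 5000}))
```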

How to cite: Edwards, B., Fraccaro, P., Stoyanov, N., Jones, A., Butt, J., Kuehnert, J., Taylor, A., and Garikipati, B.: A flexible, scalable, cloud-native framework for geospatial modelling, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7233, https://doi.org/10.5194/egusphere-egu23-7233, 2023.

X4.223 | EGU23-13481
H. Gijs J. Van den Dool and Deepali Bidwai

In many parts of the world, reforestation is an ongoing activity, but due to deforestation processes (e.g. changes in soil conditions, agricultural expansion, and infrastructure expansion such as urbanisation or road building), the success rate of replanting is far from certain; it is therefore essential to:

  • have a good idea of the pre-planting conditions at the location,
  • monitor the growth,
  • improve the growing conditions whenever possible, and
  • adapt the site selection criteria

With the proposed method it is not possible to change the site selection of already planted locations, but it is possible to monitor the selected locations and check under which conditions the trees grow best.

Several data sources are identified to predict plant health and stress, first to establish a baseline and then to project from this baseline into the future (short and mid-term). We compute the main vegetation index (NDVI) from the high-resolution image data provided by Planet (through the NICFI Basemaps for Tropical Forest Monitoring program). Historical NDVI values are obtained from Sentinel-2 (and potentially Landsat) data at lower resolutions. Environmental conditions are added to the stress index by extracting the relevant meteorological parameters from the ERA5 database (temperature and precipitation) to compute drought indices (e.g. KBDI/SPI/SPEI) and water availability (AWC) with the dominant soil type, supplemented with supporting indices from the satellite data (e.g. NDWI/SAVI/EVI-2).
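
The vegetation and water indices mentioned above follow standard formulas; a small numpy sketch is given below, where the band arrays are placeholders standing in for co-registered surface-reflectance rasters:

```python
# Standard spectral indices from reflectance bands (NDVI, NDWI, SAVI, EVI-2).
# Band arrays are placeholders; any co-registered reflectance rasters will do.
import numpy as np

rng = np.random.default_rng(0)
red = rng.uniform(0.02, 0.3, (256, 256))     # stand-in for the red band
nir = rng.uniform(0.2, 0.6, (256, 256))      # stand-in for the near-infrared band
green = rng.uniform(0.02, 0.3, (256, 256))   # stand-in for the green band

ndvi = (nir - red) / (nir + red)
ndwi = (green - nir) / (green + nir)                  # McFeeters NDWI
L = 0.5
savi = (1 + L) * (nir - red) / (nir + red + L)        # soil-adjusted vegetation index
evi2 = 2.5 * (nir - red) / (nir + 2.4 * red + 1.0)    # two-band EVI

print("mean NDVI:", float(ndvi.mean()))
```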

For reforestation projects it is vital to monitor the impact of environmental parameters on plant health and stress. To assist with forest maintenance at the sites, we built time-series models for temperature, precipitation and various vegetation indices to create a baseline for site-specific growing conditions. Deep learning (DL) models, such as semantic segmentation based on Convolutional Neural Networks (CNNs), can be built on top of this, using transfer learning to extract features from models pre-trained on large (global) datasets. The model can not only predict tree health but can also be used to predict growing conditions in the near future by flagging potential dry periods before they happen.
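
A hedged sketch of the transfer-learning setup suggested above, using the open-source segmentation_models_pytorch package; the library choice, channel count and class count are assumptions, not the authors' implementation:

```python
# Sketch: a U-Net with an ImageNet-pretrained encoder, fine-tuned on small
# site-specific data. Library choice and shapes are assumptions.
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",  # transfer learning: reuse features from a large dataset
    in_channels=4,               # e.g. R, G, B, NIR
    classes=2,                   # healthy vs. stressed canopy (illustrative)
)

# Optionally freeze the pretrained encoder and train only the decoder at first.
for p in model.encoder.parameters():
    p.requires_grad = False

x = torch.randn(2, 4, 256, 256)  # placeholder image batch
logits = model(x)                # shape: (2, 2, 256, 256)
print(logits.shape)
```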

The high-resolution remotely sensed products are available in the (sub)tropical zone [30N - 30S], while the lower-resolution products and the ERA5 data have global coverage. The test sites in this study are example sites, but the developed method can be applied to any reforestation monitoring project. The result of the analysis is a near-term growth indicator, which can be used to adjust the growing conditions at a site, as well as to assist with site selection for new reforestation projects (based on the established baseline and predictions).

The next step, after validation, is to create a dashboard where the user can select any location (within the data domain) and construct the baseline and prediction, based on available information.

How to cite: Van den Dool, H. G. J. and Bidwai, D.: Satellite data as a predictor for monitoring tree health and stress in reforestation projects, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13481, https://doi.org/10.5194/egusphere-egu23-13481, 2023.

X4.224 | EGU23-17046 | ECS
Rodrigo Pardo Meza, Jorge-Arnulfo Quiané-Ruiz, Begüm Demir, and Volker Markl

Wayang AgoraEO Plugin: The Framework for Scalable EO Workflows

Currently, Earth Observation (EO) platforms provide datasets, algorithms, and processing capabilities. Nevertheless, each platform proposes its own exclusive habitat in which to discover, process, and run EO elements. We recently proposed AgoraEO [2], a decentralized, open, and unified ecosystem where users can find EO elements, compose cross-platform EO pipelines, and execute them efficiently. With this ambition of supporting cross-platform federated analytics, AgoraEO relies on Apache Wayang [1] as its main analytical processing platform. Within AgoraEO, we are developing and enabling Apache Wayang with EO features, exposing the internals of BigEarthNet [2] to the Earth Observation community. Here we present our Wayang AgoraEO plugin, which follows the BigEarthNet workflow to achieve all its benefits in a scalable and parameterizable (reusable) way. The Wayang AgoraEO plugin empowers users to create EO workflows using any EO platform in a simple way: with operators and an intuitive API that follows the behaviour of the EO platforms it exploits. The execution of sub-tasks is controlled but isolated in whichever data processing system is required, in tandem with the rest of the platform. In addition, one can fetch datasets from several independent sources. By design, Apache Wayang works as a declarative framework for ML: users specify ML tasks at a high level, using the most convenient API to write a workflow (Java/Scala, Python, and Postgres are supported). Wayang then models an ML task as a mathematical optimization problem and uses its gradient-descent-based optimizer to invoke the appropriate physical algorithms and system configurations to execute a given ML task, thereby decoupling the user specification of ML tasks from their execution. We believe the Wayang AgoraEO plugin can be a game changer in the tedious task of implementing and deploying EO workflows within today's EO platforms: it makes it easy to reuse resources and to share them. Likewise, it is an easily extensible solution that can incorporate new operators covering new EO platforms and tasks. As a result, this solution can be a great leap in the democratization of EO technologies, contributing to their integration, scalability, and access to high-performance computing.
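
As a purely hypothetical Python sketch of the declarative, cross-platform workflow style described above (these names are illustrative and are not the Apache Wayang or AgoraEO API), specifying a task once and letting a small planner pick the execution backend could look like this:

```python
# Hypothetical illustration of declarative task specification with pluggable
# execution backends; NOT the Apache Wayang / AgoraEO API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DeclarativeTask:
    source: str                                   # e.g. a dataset URI on some platform
    ops: list[Callable[[list], list]] = field(default_factory=list)

    def map(self, fn):
        self.ops.append(lambda data: [fn(x) for x in data])
        return self

    def filter(self, pred):
        self.ops.append(lambda data: [x for x in data if pred(x)])
        return self

def execute(task: DeclarativeTask, backends: dict[str, Callable[[str], list]]) -> list:
    """A toy 'optimizer': pick the backend registered for the source and run the ops."""
    platform = task.source.split("://")[0]
    data = backends[platform](task.source)        # fetch from the chosen platform
    for op in task.ops:
        data = op(data)
    return data

# Two fake platforms returning toy scene metadata.
backends = {
    "hubA": lambda uri: [{"cloud": 5}, {"cloud": 60}],
    "hubB": lambda uri: [{"cloud": 12}],
}
task = DeclarativeTask("hubA://sentinel2/tiles").filter(lambda s: s["cloud"] < 20)
print(execute(task, backends))
```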

References

[1] S. Kruse, Z. Kaoudi, J. -A. Quiane-Ruiz, S. Chawla, F. Naumann and B. Contreras-Rojas, "Optimizing Cross-Platform Data Movement," IEEE 35th International Conference on Data Engineering, 2019, pp. 1642-1645.

[2] A. Wall, B. Deiseroth, E. Tzirita Zacharatou, J-A, Quiané-Ruiz, B. Demir, V. Markl, "AGORA-EO: A Unified Ecosystem for Earth Observation - A Vision For Boosting EO Data Literacy," Big Data from Space Conference, 2021.

How to cite: Pardo Meza, R., Quiané-Ruiz, J.-A., Demir, B., and Markl, V.: Wayang AgoraEO Plugin: The Framework for Scalable EO Workflows, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-17046, https://doi.org/10.5194/egusphere-egu23-17046, 2023.