ESSI2.8 | HPC and cloud infrastructures in support of Earth Observation, Earth Modeling and community-driven Geoscience approach PANGEO
Co-organized by CL5/GI1/OS5
Convener: Vasileios Baousis | Co-conveners: Tina Odaka, Umberto Modigliani, Anne Fouilloux, Alejandro Coca-Castro
Orals | Mon, 24 Apr, 08:30–12:30 (CEST) | Room 0.16
Posters on site | Attendance Mon, 24 Apr, 16:15–18:00 (CEST) | Hall X4
Cloud computing has emerged as the dominant paradigm, supporting practically all industrial applications and a significant number of academic and research projects. Since its introduction in the early 2010s and its widespread adoption thereafter, migration to the cloud has been a considerable undertaking for many organisations and companies. Processing big data close to where they are physically stored is an ideal use case for cloud technologies and cloud storage, which offer the necessary infrastructure and tools, especially when the cloud infrastructure is offered together with HPC resources.
Pangeo (pangeo.io) is a global community of researchers and developers that tackle big geoscience data challenges in a collaborative manner using HPC and Cloud infrastructure.
This session's aim is threefold:
(1) to focus on Cloud/Fog/Edge computing use cases and identify the current status of, and the steps towards, wider cloud computing adoption in Earth Observation and Earth Modeling;
(2) to motivate researchers who are using or developing in the Pangeo ecosystem to share their work with a broader community that can benefit from these new tools;
(3) to contribute to the Pangeo community by identifying potential new applications for the Pangeo ecosystem, whose core packages include Xarray, Iris, Dask, Jupyter, Zarr, Kerchunk and Intake.
We encourage contributions describing all kinds of Cloud/Fog/Edge computing efforts in Earth Observation and Earth Modeling domains, such as:
- Cloud Applications, Infrastructure and Platforms (IaaS, PaaS, SaaS and XaaS).
- Cloud federations and cross domain integration
- Service-Oriented Architecture in Cloud Computing
- Cloud Storage, File Systems, Big Data storage and Management.
- Networks within Cloud systems, the Storage Area, and to the outside
- Fog and Edge Computing
- Operational systems on the cloud.
- Data lakes and warehouses on the cloud.
- Cloud computing and HPC convergence in EO data processing.
We also welcome presentations using at least one of Pangeo's core packages in any of the following domains:
- Atmosphere, Ocean and Land Models
- Satellite Observations
- Machine Learning
- And other related applications
We welcome contributions on the above themes whose scientific results are presented in other EGU sessions, but which here focus on the research, data management, software and/or infrastructure aspects. For instance, you can showcase your implementation through live executable notebooks.

Orals: Mon, 24 Apr | Room 0.16

Chairpersons: Vasileios Baousis, Umberto Modigliani, Stathes Hadjiefthymiades
08:30–08:35
08:35–08:45
|
EGU23-5807
|
On-site presentation
Arnau Folch, Josep DelaPuente, Antonio Costa, Benedikt Halldórson, Jose Gracia, Piero Lanucara, Michael Bader, Alice-Agnes Gabriel, Jorge Macías, Finn Lovholt, Vadim Montellier, Alexandre Fournier, Erwan Raffin, Thomas Zwinger, Clea Denamiel, Boris Kaus, and Laetitia le Pourhiet

The second phase (2023-2026) of the Center of Excellence for Exascale in Solid Earth (ChEESE-2P), funded by HORIZON-EUROHPC-JU-2021-COE-01 under Grant Agreement No 101093038, will prepare 11 European flagship codes from different geoscience domains (computational seismology, magnetohydrodynamics, physical volcanology, tsunamis, geodynamics, and glacier hazards). The codes will be optimised in terms of performance on different types of accelerators, scalability, containerisation, and continuous deployment and portability across tier-0/tier-1 European systems as well as on novel hardware architectures emerging from the EuroHPC Pilots (EuPEX/OpenSequana and EuPilot/RISC-V), by co-designing with mini-apps. Flagship codes and workflows will be combined to form a new generation of 9 Pilot Demonstrators (PDs) and 15 related Simulation Cases (SCs) representing capability and capacity computational challenges selected based on their scientific importance, social relevance, or urgency. The SCs will produce relevant EOSC-enabled datasets and enable services on aspects of geohazards like urgent computing, early warning forecasting, hazard assessment, or fostering an emergency access mode in EuroHPC systems for geohazardous events, including access policy recommendations. Finally, ChEESE-2P will liaise, align, and synergise with other domain-specific European projects on digital twins and longer-term mission-like initiatives such as Destination Earth.

How to cite: Folch, A., DelaPuente, J., Costa, A., Halldórson, B., Gracia, J., Lanucara, P., Bader, M., Gabriel, A.-A., Macías, J., Lovholt, F., Montellier, V., Fournier, A., Raffin, E., Zwinger, T., Denamiel, C., Kaus, B., and le Pourhiet, L.: The EuroHPC Center of Excellence for Exascale in Solid Earth, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-5807, https://doi.org/10.5194/egusphere-egu23-5807, 2023.

08:45–08:55
|
EGU23-11117
|
On-site presentation
Olaf Stein, Abhiraj Bishnoi, Luis Kornblueh, Lars Hoffmann, Norbert Eicker, Estela Suarez, and Catrin I. Meyer

Significant progress has been made in recent years in developing km-scale versions of global Earth System Models (ESM), combining the chance of replacing uncertain model parameterizations by direct treatment with an improved representation of orographic and land surface features (Schär et al., 2020; Hohenegger et al., 2022). However, adapting climate codes to new hardware while keeping them performance portable remains a major issue. Given the long development cycles, the varying maturity of ESM modules and their large code bases, it is not expected that all code parts can be brought to the same level of exascale readiness in the near future. Instead, short-term model adaptation strategies need to focus on software capabilities as well as hardware availability. Moreover, energy efficiency is of growing importance for both supercomputer providers and the scientific projects employing climate simulations.

Here, we present results from first simulations of the coupled atmosphere-ocean modelling system ICON-v2.6.6-rc on the supercomputing system JUWELS at the Jülich Supercomputing Centre (JSC) with a global resolution of 5 km, using significant parts of the HPC system. While the atmosphere part of ICON (ICON-A) is capable of running on GPUs, model I/O currently performs better on a CPU cluster and the ocean module (ICON-O) has not yet been ported to modern accelerators. Thus, we make use of the modular supercomputing architecture (MSA) of JUWELS and its novel batch job options for the coupled ICON model, with ICON-A running on the NVIDIA A100 GPUs of JUWELS Booster, while ICON-O and the model I/O run simultaneously on the CPUs of the JUWELS Cluster partition. As expected, ICON performance is limited by ICON-A. We therefore chose the performance-optimal Booster-node configuration for ICON-A, also considering memory requirements (84 nodes), and adapted the ICON-O configuration to achieve minimum waiting times for simultaneous time-step execution and data exchange (63 cluster nodes). We compared runtime and energy efficiency to cluster-only simulations (on up to 760 cluster nodes) and found only small improvements in runtime for the MSA case, but energy consumption is already reduced by 26% without further improvements in vector length applied with ICON. When switching to even higher ICON resolutions, cluster-only simulations no longer fit on most current HPC systems, and upcoming exascale systems will rely to a large extent on GPU acceleration. Exploiting MSA capabilities is thus an important step towards performance-portable and energy-efficient use of km-scale climate models.

References:

Hohenegger et al., ICON-Sapphire: simulating the components of the Earth System and their interactions at kilometer and subkilometer scales, https://doi.org/10.5194/gmd-2022-171, in review, 2022.

Schär et al., Kilometer-Scale Climate Models: Prospects and Challenges, https://doi.org/10.1175/BAMS-D-18-0167.1, 2020.

 

How to cite: Stein, O., Bishnoi, A., Kornblueh, L., Hoffmann, L., Eicker, N., Suarez, E., and Meyer, C. I.: Modeling the Earth System on Modular Supercomputing Architectures: coupled atmosphere-ocean simulations with ICON, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-11117, https://doi.org/10.5194/egusphere-egu23-11117, 2023.

08:55–09:05
|
EGU23-12539
|
On-site presentation
Roberto Cuccu, Vasileios Baousis, Umberto Modigliani, Charalampos Kominos, Xavier Abellan, and Roope Tervo

The European Centre for Medium-Range Weather Forecasts (ECMWF) and the European Organisation for the Exploitation of Meteorological Satellites (EUMETSAT) have worked together to offer their Member States a new paradigm for accessing and consuming weather data and services. The "European Weather Cloud (EWC)" (https://www.europeanweather.cloud/) concluded its pilot phase and is expected to become operational during the first months of 2023.

This initiative aims to offer a community cloud infrastructure on which Member and Co-operating States of both organizations can create on-demand virtual compute (including GPUs) and storage resources to gain easy and high-throughput access to ECMWF's Numerical Weather Prediction (NWP) data and EUMETSAT's satellite data in a timely and configurable fashion. Moreover, one of the main goals is to involve more National Meteorological Services in jointly forming a federation of clouds/data offered by their Member States, for the maximum benefit of the European Meteorological Infrastructure (EMI). During the pilot phase of the project, both organizations have jointly hosted user and technical workshops to actively engage with the meteorological community and align the evolution of the EWC with their operational goals and needs.

During its pilot phase, the EWC hosted several use cases, mostly aimed at users in the developers' own organisations. The broad categories of these use cases are:

  • Web services to explore hosted datasets
  • Data processing applications
  • Platforms to support the training of machine learning models on archive datasets
  • Workshops and training courses (e.g., ICON model training, ECMWF training, etc.)
  • Research in collaboration with external partners
  • World Meteorological Organization (WMO) support with pilots and PoC.

Some examples of the use cases currently developed at the EWC are:

  • The German weather service DWD is already feeding maps generated by a server it deployed on the cloud into its public GeoPortal service.
  • A joint EUMETSAT and ECMWF use case assesses bias correction schemes for the assimilation of radiance data based on several satellite data time series.
  • The Royal Netherlands Meteorological Institute (KNMI) hosts a climate explorer web application based on KNMI climate explorer data and ECMWF weather and climate reanalyses.
  • The Royal Meteorological Institute of Belgium prepares ECMWF forecast data for use in a local atmospheric dispersion model.
  • NordSat, a collaboration of northern European countries, is developing and testing imagery generation tools in preparation for the Meteosat Third Generation (MTG) satellite products.
  • The UK Met Office runs the DataProximateCompute use case, which distributes compute workloads close to the data, with automatic creation and disposal of Dask clusters, as well as the data-plane VPN network, on demand and in heterogeneous cloud environments.

In this presentation, the status of the project, the offered services and how these are accessed by the end users along with examples of the existing use cases will be analysed. The plans, next steps for the evolution of the EWC and its relationship with other projects and initiatives (like DestinE) will conclude the presentation.

How to cite: Cuccu, R., Baousis, V., Modigliani, U., Kominos, C., Abellan, X., and Tervo, R.: European Weather Cloud: A community cloud tailored for big Earth modelling and EO data processing, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-12539, https://doi.org/10.5194/egusphere-egu23-12539, 2023.

09:05–09:15
|
EGU23-12785
|
On-site presentation
Neil Massey, Jack Leland, and Bryan Lawrence

Managing huge volumes of data is a problem now, and will only become worse with the advent of exascale computing and next-generation observational systems. An important recognition is that data need to be more easily migrated between storage tiers. Here we present a new solution, the Near-Line Data Store (NLDS), for managing data migration between user-facing storage systems and tape by using an object storage cache. NLDS builds on lessons learned from previous experience developing the ESIWACE-funded Joint Data Migration App (JDMA) and deploying it at the Centre for Environmental Data Analysis (CEDA).
 
CEDA currently has over 50 PB of data stored on a range of disk-based storage systems. These systems are chosen based on cost, power usage and network accessibility, and include three different types of POSIX disk and object storage. Tens of PB of additional data are also stored on tape. Each of these systems has different workflows, interfaces and latencies, causing difficulties for users.

NLDS, developed with ESIWACE2 and other funding, is a multi-tiered storage solution using object storage as a front end to a tape library. Users interact with NLDS via an HTTP API, with a Python library and command-line client provided to support both programmatic and interactive use. Files transferred to NLDS are first written to the object storage, and a backup is made to tape. When the object storage approaches capacity, a set of policies is interrogated to determine which files will be removed from it. Upon retrieving a file, NLDS may have to first transfer the file from tape back to the object storage, if it has been deleted by the policies. This implements a multi-tier system of hot (disk), warm (object storage) and cold (tape) storage behind a single interface. While systems like this are not novel, NLDS is open source, designed for ease of redeployment elsewhere, and for use from both local storage and remote sites.

NLDS is based around a microservice architecture, with a message exchange brokering communication between the microservices, the HTTP API and the storage solutions.  The system is deployed via Kubernetes, with each microservice in its own Docker container, allowing the number of services to be scaled up or down, depending on the current load of NLDS.  This provides a scalable, power efficient system while ensuring that no messages between microservices are lost.  OAuth is used to authenticate and authorise users via a pluggable authentication layer. The use of object storage as the front end to the tape allows both local and remote cloud-based services to access the data, via a URL, so long as the user has the required credentials. 

NLDS is a scalable solution for storing very large data volumes for many users, with a user-friendly front end that is easily accessed via cloud computing. This talk will detail the architecture and discuss how the design meets the identified use cases.

How to cite: Massey, N., Leland, J., and Lawrence, B.: A Scalable Near Line Storage Solution for Very Big Data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-12785, https://doi.org/10.5194/egusphere-egu23-12785, 2023.

09:15–09:25
|
EGU23-17029
|
On-site presentation
Campbell Watson, Hendrik Hamann, Kommy Weldemariam, Thomas Brunschwiler, Blair Edwards, Anne Jones, and Johannes Schmude

The ballooning volume and complexity of geospatial data is one of the main inhibitors for advancements in climate & sustainability research. Oftentimes, researchers need to create bespoke and time-consuming workflows to harmonize datasets, build/deploy AI and simulation models, and perform statistical analysis. It is increasingly evident that these workflows and the underlying infrastructure are failing to scale and exploit the massive amounts of data (Peta and Exa-scale) which reside across multiple data centers and continents. While there have been attempts to consolidate relevant geospatial data and tooling into single cloud infrastructures, we argue that the future of climate & sustainability research relies on networked/federated systems. Here we present recent progress towards multi-cloud technologies that can scale federated geospatial discovery and modeling services across a network of nodes. We demonstrate how the system architecture and associated tooling can simplify the discovery and modeling process in multi-cloud environments via examples of federated analytics for AI-based flood detection and efficient data dissemination inspired by AI foundation models.

How to cite: Watson, C., Hamann, H., Weldemariam, K., Brunschwiler, T., Edwards, B., Jones, A., and Schmude, J.: Establishing a Geospatial Discovery Network with efficient discovery and modeling services in multi-cloud environments, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-17029, https://doi.org/10.5194/egusphere-egu23-17029, 2023.

09:25–09:35
|
EGU23-12851
|
Virtual presentation
|
Fabien Castel and Emma Rizzi

Tackling complex environmental issues requires accessing and processing a wide range of voluminous data. Copernicus satellite data are a very complete and valuable source for many Earth science domains, in particular thanks to the Core Services (Land, Atmosphere, Marine, …). For almost five years now, the Copernicus DIAS platforms have provided broad access to the Core Service products through the cloud. Among them, the Wekeo platform, operated by EUMETSAT, Mercator Ocean, ECMWF and the EEA, provides wide access to Copernicus Core Service data.

However, Copernicus data needs an additional layer of processing and preparation to be presented and understood by the general public and decision makers. Murmuration has developed data processing pipelines to produce environmental indicators from Copernicus data constituting powerful tools to put environmental issues at the centre of decision-making processes.

Throughout its use, limitations of the DIAS platforms were observed. Firstly, the cloud service offerings are basic in comparison to those of the market leaders (such as AWS and GCP); in particular, there is no built-in solution for automating and managing data processing pipelines, which must be set up at the user's expense. Secondly, the cost of resources is higher than the market price; limiting the activities on a DIAS to edge data processing, and relying on a cheaper offering for applications that do not require direct access to raw Copernicus data, is a cost-effective choice. Finally, the performance and reliability requirements for accessing the data can sometimes not be met when relying on a single DIAS platform; implementing a multi-DIAS approach ensures backup data sources. This raises the question of the automation and orchestration of such a multi-cloud system.

We propose an approach combining the wide data offer of the DIAS platforms, the automation features provided by the Prefect platform and the usage of efficient cloud technologies to build a repository of environmental indicators. Prefect is a hybrid orchestration platform dedicated to automation of data processing flows. It does not host any data processing flow itself and rather connects in a cloud-agnostic way to any cloud environment, where periodic and triggered flow executions can be scheduled. Prefect centrally controls flows that run on different cloud environments through a single platform.

The technologies leveraged to build the system make it possible to efficiently produce and disseminate the environmental indicators: firstly, containerisation and clustering (using Docker and Kubernetes) to manage processing resources; secondly, object storage combined with cloud-native access (the Zarr data format); and finally, the Python scientific software stack (including pandas, scikit-learn, etc.) complemented by the powerful Xarray library. Data processing pipelines provide a path from the NetCDF Copernicus Core Services products to cloud-native Zarr products. The Zarr format allows windowed read/write operations, avoiding unnecessary data transfers. This efficient data access allows fast data dissemination services following well-established OGC standards to be plugged into the data repository, feeding interactive dashboards for decision makers. The cycle is complete, from the Copernicus satellite data to an environmentally aware field decision.
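The pattern described above, converting NetCDF products to chunked Zarr stores and then reading only the window of interest, can be illustrated with a minimal Xarray sketch; the file names, variable name and slice bounds below are purely illustrative and do not correspond to the authors' pipelines.

```python
import xarray as xr

# Convert a NetCDF Core Service product to a chunked, cloud-native Zarr store.
ds = xr.open_dataset("core_service_product.nc", chunks={"time": 24})
ds.to_zarr("core_service_product.zarr", mode="w")

# Later, read back only a spatial/temporal window: Zarr fetches just the
# chunks overlapping the selection, avoiding unnecessary data transfers.
ds_z = xr.open_zarr("core_service_product.zarr")
window = ds_z["indicator"].sel(
    latitude=slice(40, 50), longitude=slice(-10, 5)
).isel(time=-1)
window.load()
```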

How to cite: Castel, F. and Rizzi, E.: From the Copernicus satellite data to an environmentally aware field decision, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-12851, https://doi.org/10.5194/egusphere-egu23-12851, 2023.

09:35–09:45
|
EGU23-10697
|
Highlight
|
On-site presentation
Dom Heinzeller, Maryam Abdi-Oskouei, Stephen Herbener, Eric Lingerfelt, Yannick Trémolet, and Tom Auligné

The Joint Effort for Data assimilation Integration (JEDI) is an innovative data assimilation system for Earth system prediction, spearheaded by the Joint Center for Satellite Data Assimilation (JCSDA) and slated for implementation in major operational modeling systems across the globe in the coming years. Funded as an inter-agency development by NOAA, NASA, the U.S. Navy and Air Force, and with contributions from the UK Met Office, JEDI must operate on a wide range of computing platforms. The recent move towards cloud computing puts portability, adaptability and performance across systems, from dedicated High Performance Computing systems to commercial clouds and workstations, in the critical path for the success of JEDI.

JEDI is a highly complex application that relies on a large number of third-party software packages to build and run. These packages can include I/O libraries, workflow engines, Python modules for data manipulation and plotting, several ECMWF libraries for complex arithmetics and grid manipulations, and forecast models such as the Unified Forecast System (UFS), the Goddard Earth Observing System (GEOS), the Modular Ocean Model (MOM6), the Model for Prediction across Scales (MPAS), the Navy Environmental Prediction sysTem Utilizing the NUMA corE (NEPTUNE), and the Met Office Unified Model (UM).

With more than 100 contributors and rapid code development it is critical to perform thorough automated testing, from basic unit tests to comprehensive end-to-end-tests. This presentation summarizes recent efforts to leverage cloud computing environments for research, development, and near real-time applications of JEDI, as well as for developing a Continuous Integration/Continuous Delivery (CI/CD) pipeline. These efforts rest on a newly developed software stack called spack-stack, a joint effort of JCSDA, the NOAA Environmental Modeling Center (EMC) and the U.S. Earth Prediction Innovation Center (EPIC). Automatic testing in JEDI is implemented with modern software development tools such as GitHub, Docker containers, various Amazon Web Services (AWS), and CodeCov for testing and evaluation of code performance. End-to-end testing is realized in JCSDA’s newly developed Skylab Earth system data assimilation application, which combines JEDI with the Research Repository for Data and Diagnostics (R2D2) and the Experiments and Workflow Orchestration Kit (EWOK), and which leverages the AWS Elastic Compute Cloud (EC2) for testing, research, development and production.

How to cite: Heinzeller, D., Abdi-Oskouei, M., Herbener, S., Lingerfelt, E., Trémolet, Y., and Auligné, T.: The Joint Effort for Data Assimilation Integration (JEDI): A unified data assimilation framework for Earth system prediction supported by NOAA, NASA, U.S. Navy, U.S. Air Force, and UK Met Office, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-10697, https://doi.org/10.5194/egusphere-egu23-10697, 2023.

09:45–09:55
|
EGU23-3738
|
On-site presentation
Maoyi Huang

The National Oceanic and Atmospheric Administration (NOAA) established the Earth Prediction Innovation Center (EPIC) to be the catalyst for community research and modeling focused on informing and accelerating advances in the nation's operational NWP forecast modeling systems. The Unified Forecast System (UFS) is a community-based, coupled, comprehensive Earth modeling system. The UFS numerical applications span local to global domains and predictive time scales from sub-hourly analyses to seasonal predictions. It is designed to support the Weather Enterprise and to be the source system for NOAA's operational numerical weather prediction applications. EPIC applies an open-innovation and open-development framework that embraces open-source code repositories integrated with automated Continuous Integration/Continuous Deployment (CI/CD) pipelines on cloud and on-prem HPC systems. EPIC also supports UFS public releases, tutorials and training opportunities (e.g., student workshops, hackathons, and code sprints), and advanced user support via a virtual community portal (epic.noaa.gov). This framework allows community developers to track the status of their contributions, and facilitates rapid incorporation of innovation by implementing consistent, transparent, standardized and community-driven validation and verification tests. In this presentation, I will demonstrate capabilities of the EPIC framework using the UFS Short-Range Weather (SRW) Application as an example, in the following aspects:

  • Public releases of a cloud-ready UFS SRW application with a scalable container, following a modernized continuous-release paradigm
  • Test cases for challenging forecast environments, released with datasets
  • Training and tutorials for users and developers
  • Baselines for benchmarking skill and computational performance on cloud HPC systems, and
  • An automated CI/CD pipeline to enable seamless transition to operations

How to cite: Huang, M.: An Open-innovation and Open-development Framework for the Unified Forecast System Powered by the Earth Prediction Innovation Center, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3738, https://doi.org/10.5194/egusphere-egu23-3738, 2023.

09:55–10:05
|
EGU23-3639
|
Virtual presentation
Armagan Karatosun, Michael Grant, Vasileios Baousis, Duncan McGregor, Richard Care, John Nolan, and Roope Tervo

Although utilizing the cloud infrastructure for big data processing algorithms is increasingly common, the challenges of utilizing cloud infrastructures efficiently and effectively are often underestimated. This is especially true in multi-cloud scenarios where data are available only on a subset of the participating clouds. In this study, we have iteratively developed a solution enabling efficient access to ECMWF’s Numerical Weather Prediction (NWP) and EUMETSAT’s satellite data on the European Weather Cloud [1], in combination with UK Met Office assets in Amazon Web Services (AWS), in order to provide a common template for multi-cloud processing solutions in meteorological application development and operations in Europe.  

Dask [2] was chosen as the computing framework due to its widespread use in the meteorological community, its ability to automatically spread processing, and its flexibility in changing how workloads are distributed across physical or virtualized infrastructures while maintaining scalability. However, the techniques used here are generally applicable to other frameworks. The primary limitation in using Dask is that all nodes should be able to intercommunicate freely, which is a serious limitation when nodes are distributed over multiple clouds. Although it is possible to route between multiple cloud environments over the Internet, this introduces considerable administrative work (firewalls, security) as well as networking complexities (e.g., due to extensive use of potentially-clashing private IP ranges and NAT in clouds, or cost for public IPs). Virtual Private Networks (VPNs) can hide these issues, but many use a hub-and-spokes model, meaning that communications between workers pass through a central hub. By use of a mesh network VPN (WireGuard) between clusters using IPv6 private addressing, all these difficulties can be avoided, in addition to providing a simplified network addressing scheme with extremely high scalability. Another challenge was to ensure the Dask worker nodes were aware of data locality, both in terms of placing work near data and in terms of minimizing transfers. Here, the UK Met Office’s work on labeling resource pools (in this case, data) and linking scheduling decisions to labels was the key. 

In summary, by adapting Dask's concept of resourcing [3] into resource pools [4], building an automated start-up process, and effectively utilizing self-configuring IPv6 VPN mesh networks, we managed to provide a "cloud-native" transient model where all resources can be easily created and disposed of as needed. The resulting "throwaway" multi-cloud Dask framework is able to efficiently place processing on workers proximate to the data while minimizing the necessary data traffic between clouds, thus achieving results more quickly and more cheaply than naïve implementations, and with a simple, automated setup suitable for meteorological developers. The technical basis of this work was published on the Dask blog [5] but is covered more holistically here, particularly regarding the application side and the challenges of developing cloud-native applications that can effectively utilize modern multi-cloud environments, with future applicability to distributed (e.g., Kubernetes) and serverless computing models.
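The resource-pool scheduling referenced above [3, 4] builds on Dask's standard worker-resources mechanism. The sketch below shows only that underlying mechanism, not the dask-worker-pools package itself; the scheduler address, pool labels, file names and task bodies are illustrative.

```python
# Workers in each cloud are started with a label marking their pool, e.g.:
#   dask-worker tcp://scheduler:8786 --resources "POOL_EWC=1"
#   dask-worker tcp://scheduler:8786 --resources "POOL_AWS=1"
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # scheduler reachable over the VPN mesh

def load_and_reduce(path):
    # Read data only reachable from the EWC side and return a small summary.
    return {"path": path, "mean": 0.0}  # placeholder computation

def combine(parts):
    # Merge the small per-file summaries.
    return parts

# Pin data-proximate tasks to workers in the cloud hosting the data ...
partials = [
    client.submit(load_and_reduce, p, resources={"POOL_EWC": 1})
    for p in ["obs_a.nc", "obs_b.nc"]
]
# ... and run the lightweight combination step on the AWS side.
summary = client.submit(combine, partials, resources={"POOL_AWS": 1})
print(summary.result())
```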

References: 

[1] https://www.europeanweather.cloud 
[2] https://www.dask.org 
[3] https://distributed.dask.org/en/stable/resources.html
[4] https://github.com/gjoseph92/dask-worker-pools  
[5] https://blog.dask.org/2022/07/19/dask-multi-cloud  

How to cite: Karatosun, A., Grant, M., Baousis, V., McGregor, D., Care, R., Nolan, J., and Tervo, R.: Data Proximate Computation; Multi-cloud approach on European Weather Cloud and Amazon Web Services , EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3639, https://doi.org/10.5194/egusphere-egu23-3639, 2023.

10:05–10:15
|
EGU23-4298
|
Virtual presentation
Stamatia Rizou, Vaggelis Marinakis, Gema Hernández Moral, Carmen Sánchez-Guevara, Luis Javier Sánchez-Aparicio, Ioannis Brilakis, Vasileios Baousis, Tijs Maes, Vassileios Tsetsos, Marco Boaria, Piotr Dymarski, Michail Bourmpos, Petra Pergar, and Inga Brieze

BUILDSPACE aims to couple terrestrial data from buildings (collected by IoT platforms, BIM solutions and others) with aerial imaging from drones equipped with thermal cameras and location-annotated data from satellite services (i.e., EGNSS and Copernicus) to deliver innovative services for building and urban stakeholders and to support informed decision making towards energy-efficient buildings and climate-resilient cities. The platform will allow integration of these heterogeneous data and will offer services at building scale, enabling the generation of high-fidelity multi-modal digital twins, and at city scale, providing decision support services for energy demand prediction, urban heat and urban flood analysis. The services will enable the identification of environmental hotspots that increase pressure on local city ecosystems and raise the probability of natural disasters (such as flooding), and will issue alerts and recommendations for action to local governments and regions (such as support for policies on building renovation in specific vulnerable areas). BUILDSPACE services will be validated and assessed in four European cities with different climate profiles. The digital twin services at building level will be tested during the construction of a new building in Poland, and the city services validating the link to the digital twins of buildings will be tested in three cities (Piraeus, Riga, Ljubljana) across the EU. BUILDSPACE will create a set of replication guidelines and blueprints for the adoption of the proposed applications in building resilient cities at large.

How to cite: Rizou, S., Marinakis, V., Hernández Moral, G., Sánchez-Guevara, C., Sánchez-Aparicio, L. J., Brilakis, I., Baousis, V., Maes, T., Tsetsos, V., Boaria, M., Dymarski, P., Bourmpos, M., Pergar, P., and Brieze, I.: BUILDSPACE: Enabling Innovative Space-driven Services for Energy Efficient Buildings and Climate Resilient Cities, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-4298, https://doi.org/10.5194/egusphere-egu23-4298, 2023.

Coffee break
Chairpersons: Tina Odaka, Anne Fouilloux, Alejandro Coca-Castro
10:45–10:50
10:50–11:00
|
EGU23-14547
|
On-site presentation
Marine Vernet, Erwan Bodere, Jérôme Detoc, Christelle Pierkot, Alessandro Rizzo, and Thierry Carval

Earth observation and modelling is a major challenge for research and a necessity for environmental and socio-economic applications. It requires voluminous and heterogeneous data from distributed and domain-dependent data sources, managed separately by various national and European infrastructures.

In a context of unprecedented data wealth and growth, new challenges emerge to enable inter-comparison, inter-calibration and comprehensive studies and uses of earth system and environmental data.

To this end, the FAIR-EASE project aims to provide integrated and interoperable services through the European Open Science Cloud to facilitate the discovery, access and analysis of large volumes of heterogeneous data from distributed sources and from different domains and disciplines of Earth system science.

This presentation will explain how the Pangeo stack will be used within FAIR-EASE to improve data access, interpolation and analysis, and will also explore its integration with existing services (e.g. Galaxy) and the underlying IT infrastructure to serve multidisciplinary research uses.

How to cite: Vernet, M., Bodere, E., Detoc, J., Pierkot, C., Rizzo, A., and Carval, T.: PANGEO multidisciplinary test case for Earth and Environment Big data analysis in FAIR-EASE Infra-EOSC project, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14547, https://doi.org/10.5194/egusphere-egu23-14547, 2023.

11:00–11:10
|
EGU23-6960
|
On-site presentation
Pranav Chandramouli, Francesco Nattino, Meiert Grootes, Ou Ku, Fakhereh Alidoost, and Yifat Dzigan

Remote-sensing (RS) and Earth observation (EO) data have become crucial in areas ranging from science to policy, with their use expanding beyond the 'usual' fields of the geosciences to encompass the 'green' life sciences, agriculture, and even the social sciences. Within this context, the RS-DAT project has developed and made available a readily deployable framework enabling researchers to scale their analysis of EO and RS data on HPC systems and associated storage resources. Building on and expanding the established tool stack of the Pangeo community, the framework integrates tools to access, retrieve, explore, and process geospatial data, addressing common needs identified in the EO domain. On the computing side, RS-DAT leverages Jupyter (Python), which provides users with a web-based interface to access (remote) computational resources, and Dask, which enables analyses and workflows to be scaled to large computing systems. Both Jupyter and Dask are well-established tools in the Pangeo community and can be deployed in several ways and on different infrastructures. RS-DAT provides an easy-to-use deployment framework for two targets: the generic case of SLURM-based HPC systems (for example, the Dutch systems Snellius/Spider), which offer flexibility in computational resources; and the special case of an Ansible-based cloud-computing infrastructure (SURF Research Cloud, SRC), which is more straightforward for the user but less flexible. Both frameworks enable the easy scale-up of workflows on HPC systems to access, manipulate and process large-scale datasets as commonly found in EO. On the data access and storage side, RS-DAT integrates two Python packages, STAC2dCache and dCacheFS, which were developed to facilitate data retrieval from online STAC catalogs (STAC2dCache) and its storage on the HPC system or local mass storage, specifically dCache. This ensures efficient computation for large-scale analyses, where data retrieval and handling can cause significant bottlenecks. User-defined input/output to the Zarr file format is also supported within the framework. We present an application of the tools to the calculation of leaf-spring indices for North America using the Daymet dataset at 1 km resolution for 42 years (~940 GiB, completed in under 5 hours using 60 cores on the Dutch supercomputing system) and look forward to ongoing work integrating both deployment targets in the Dutch HPC ecosystem.

How to cite: Chandramouli, P., Nattino, F., Grootes, M., Ku, O., Alidoost, F., and Dzigan, Y.: Remote Sensing Deployable Analysis environmenT, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-6960, https://doi.org/10.5194/egusphere-egu23-6960, 2023.

11:10–11:20
|
EGU23-16117
|
ECS
|
On-site presentation
Ezequiel Cimadevilla, Maialen Iturbide, and Antonio S. Cofiño

The ESGF Virtual Aggregation (EVA) is a new data workflow approach that aims to advance the sharing and reuse of scientific climate data stored in the Earth System Grid Federation (ESGF). The ESGF is a global infrastructure and network of internationally distributed research centers that together work as a federated data archive, supporting the distribution of global climate model simulations of the past, current and future climate. The ESGF provides modeling groups with nodes for publishing and archiving their model outputs to make them accessible to the climate community at any time. The standardization of the model output in a specified format, and the collection, archival and access of the model output through the ESGF data replication centers, have facilitated multi-model analyses. ESGF has thus established itself as the most relevant distributed data archive for climate data, hosting the data for international projects such as CMIP and CORDEX. As of 2022 it includes more than 30 PB of data distributed across research institutes all around the globe, and it is the reference archive for the Assessment Reports (AR) on Climate Change produced by the Intergovernmental Panel on Climate Change (IPCC). However, explosive data growth has confronted the climate community with a scientific scalability issue. Conceived as a distributed data store, the ESGF infrastructure is designed to keep file sizes manageable for both sysadmins and end users. However, use cases in scientific research often involve calculations on datasets spanning multiple variables, the whole time period and multiple model ensembles. In this sense, the ESGF Virtual Aggregation extends the federation capabilities beyond file search and download by providing out-of-the-box remote climate data analysis capabilities over analysis-ready, virtually aggregated climate datasets, on top of the existing software stack of the federation. In this work we show an analysis that serves as a test case for the viability of the data workflow and provides the basis for discussions on the future of the ESGF infrastructure, contributing to the debate on the set of reliable core services upon which the federation should be built.
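Since the virtual aggregations sit on top of the existing ESGF software stack, a user-side sketch of the intended access pattern might look as follows; the endpoint URL and variable name are illustrative, not an actual EVA dataset.

```python
import xarray as xr

# Open a virtually aggregated dataset remotely (e.g. via OPeNDAP on an ESGF
# THREDDS server) instead of downloading and concatenating individual files.
url = "https://esgf.example.org/thredds/dodsC/eva/cmip6/tas_historical_r1i1p1f1"
ds = xr.open_dataset(url, chunks={"time": 120})   # lazy, chunked remote access

# A typical multi-decade calculation touching the whole time axis.
tas_series = ds["tas"].mean(dim=("lat", "lon"))   # unweighted mean, for illustration
tas_series.sel(time=slice("1950", "2000")).compute()
```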

Acknowledgements

This work has been developed with support from IS-ENES3, which is funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No 824084.

This work has also been developed with support from CORDyS (PID2020-116595RB-I00), funded by MCIN/AEI/10.13039/501100011033.

How to cite: Cimadevilla, E., Iturbide, M., and Cofiño, A. S.: Virtual aggregations to improve scientific ETL and data analysis for datasets from the Earth System Grid Federation, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-16117, https://doi.org/10.5194/egusphere-egu23-16117, 2023.

11:20–11:30
|
EGU23-15964
|
ECS
|
On-site presentation
Floris Calkoen, Fedor Baart, Etiënne Kras, and Arjen Luijendijk

The coastal community widely anticipates that in the coming years data-driven studies will make essential contributions to bringing about long-term coastal adaptation and mitigation strategies at continental scale. This view is also supported by CoCliCo, a Horizon 2020 project in which coastal data form the fundamental building block for an open web portal that aims to improve decision making on coastal risk management and adaptation. The promise of data has likely been triggered by several coastal analyses that showed how the coastal zone can be monitored at unprecedented spatial scales using geospatial cloud platforms. However, we note that when analyses become more complex, i.e., require specific algorithms, pre- and post-processing, or include data that are not hosted by the cloud provider, the cloud-native processing workflows are often broken, which makes analyses at continental scale impractical.

We believe that the next generation of data-driven coastal models that target continental scales can only be built when: 1) processing workflows are scalable; 2) computations are run in proximity to the data; 3) data are available in cloud-optimized formats; 4) and, data are described following standardized metadata specifications. In this study, we introduce these practices to the coastal research community by showcasing the advantages of cloud-native workflows by two case studies.

In the first example we map building footprints in areas prone to coastal flooding and estimate the assets at risk. For this analysis we chunk a coastal flood-risk map into several tiles and incorporate those into a coastal SpatioTemporal Asset Catalog (STAC). The second example benchmarks instantaneous shoreline mapping using cloud-native workflows against conventional methods. With data-proximate computing, processing time is reduced from the order of hours to seconds per shoreline km, which means that a highly-specialized coastal mapping expedition can be upscaled from regional to global level.

The analyses mostly rely on "core packages" from the Pangeo project, with some additional support for scalable geospatial data analysis and cloud I/O, although they can essentially be run on a standard Planetary Computer Python instance. We publish our code, including self-explanatory Jupyter notebooks, at https://github.com/floriscalkoen/egu2023.
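As an illustration of the kind of cloud-native STAC workflow described above, the sketch below searches a public STAC API and builds a lazy, Dask-backed image stack; the collection, bounding box and band maths are illustrative, and this is not the authors' published code.

```python
import planetary_computer
import pystac_client
import stackstac

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[4.0, 52.0, 4.5, 52.5],                 # a stretch of coast, illustrative
    datetime="2022-06-01/2022-08-31",
    query={"eo:cloud_cover": {"lt": 20}},
).item_collection()

# Lazily stack the scenes into a chunked (time, band, y, x) DataArray.
da = stackstac.stack(items, assets=["B03", "B08"], resolution=10)

# Simple water index as a shoreline proxy; evaluated in parallel on .compute().
ndwi = (da.sel(band="B03") - da.sel(band="B08")) / (da.sel(band="B03") + da.sel(band="B08"))
composite = ndwi.median(dim="time")
```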

To conclude, we foresee that in next years several coastal data products are going to be published, of which some may be considered "big data". To incorporate these data products into the next generation of coastal models, it is urgently required to agree upon protocols for coastal data stewardship. With this study we do not only want to show the advantages of scalable coastal data analysis; we mostly want to encourage the coastal research community to adopt FAIR data management principles and workflows in an era of exponential data growth.

How to cite: Calkoen, F., Baart, F., Kras, E., and Luijendijk, A.: A novel data ecosystem for coastal analyses, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-15964, https://doi.org/10.5194/egusphere-egu23-15964, 2023.

11:30–11:40
|
EGU23-17494
|
On-site presentation
Thierry Carval, Erwan Bodere, Julien Meillon, Mathiew Woillez, Jean Francois Le Roux, Justus Magin, and Tina Odaka

We are experimenting with hybrid access from Cloud and HPC environments, using the Pangeo platform to exploit a data lake in an HPC infrastructure, DATARMOR. DATARMOR is an HPC infrastructure hosting ODATIS services (https://www.odatis-ocean.fr), situated at the "Pôle de Calcul et de Données pour la Mer" at IFREMER. Its parallel file system has disk space dedicated to shared data, called "dataref". Users of DATARMOR can access these data, and some of them are catalogued by the Sextant service (https://sextant.ifremer.fr/Ressources/Liste-des-catalogues-thematiques/Datarmor-Donnees-de-reference) and are open and accessible from the internet, without duplicating the data.

In the cloud environment, the ability to access files in a parallel manner is essential for improving the speed of calculations. The Zarr format (https://zarr.readthedocs.io) enables parallel access to datasets, as it consists of numerous chunked "object data" files and a few "metadata" files. Although it is split across many objects, it is simple to use, since a whole collection of data stored in Zarr format is accessible through one access point.

For HPC centers, the numerous "object data" files create a lot of metadata on parallel file systems, slowing down data access. Recent progress in the development of Kerchunk (https://fsspec.github.io/kerchunk/), which recognizes the chunks in a file (e.g. NetCDF/HDF5) as Zarr chunks and can expose a series of files as one Zarr dataset, solves these technical difficulties in our Pangeo use cases at DATARMOR. Thanks to Kerchunk and Intake (https://intake.readthedocs.io/), it is now possible to use different sets of data stored on DATARMOR in an efficient and simple manner.

We are further experimenting with this workflow, using the same use cases on the PANGEO-EOSC cloud. We make use of the same data stored in the data lake at DATARMOR, accessed through ODATIS via Kerchunk references and Intake catalogs, without duplicating the source data. In the presentation we will share our recent experiences from these experiments.
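A minimal sketch of the Kerchunk pattern described above is shown below; the file paths are illustrative and this is not the DATARMOR catalogue itself.

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Scan each NetCDF/HDF5 file once and record the byte ranges of its chunks.
paths = ["dataref/model_2021.nc", "dataref/model_2022.nc"]   # illustrative
refs = []
for p in paths:
    with fsspec.open(p, "rb") as f:
        refs.append(SingleHdf5ToZarr(f, p).translate())

# Combine the per-file references into one virtual Zarr dataset along time.
combined = MultiZarrToZarr(refs, concat_dims=["time"]).translate()

# Open the whole collection through a single access point, without copying data.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    backend_kwargs={"consolidated": False, "storage_options": {"fo": combined}},
    chunks={},
)
```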

How to cite: Carval, T., Bodere, E., Meillon, J., Woillez, M., Le Roux, J. F., Magin, J., and Odaka, T.: Enabling simple access to a data lake both from HPC and Cloud using Kerchunk and Intake, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-17494, https://doi.org/10.5194/egusphere-egu23-17494, 2023.

11:40–11:50
|
EGU23-14515
|
ECS
|
On-site presentation
|
Fabian Wachsmann

In this showcase, we present how Intake and its plugin Intake-ESM are utilized at DKRZ to provide highly FAIR data collections from different projects, stored on different types of storage in different formats.

The Intake plugin Intake-ESM allows users not only to find the data of interest, but also to load them as analysis-ready-like Xarray datasets. We utilize this tool to give users access to many of the data collections available at our institution from one single access point, the main DKRZ Intake catalog at www.dkrz.de/s/intake. The functionality of this package works independently of data standards and formats and therefore enables fully metadata-driven data access, including data processing. Intake-ESM catalogs increase the FAIRness of the data collections in all aspects, but especially in terms of Accessibility and Interoperability.

Starting with a collection for DKRZ's CMIP6 Data Pool, DKRZ now hosts catalogs for more than 10 PB of data on different local storage systems. The Intake-ESM package has been well integrated into ESM data provisioning workflows:

  • Early sharing and accessibility: the co-developed in-house ICON model generates an Intake-ESM catalog on each run.
  • Uptake by other technologies: e.g., Intake-ESM catalogs serve as templates for the more advanced DKRZ STAC catalogs.
  • Access to all storage types: the tools used for writing data to the local institutional cloud allow users to create Intake-ESM catalogs for the written data.
  • Data archiving: catalogs for projects in the archive can be created from its metadata database.

For future activities, we plan to make use of new functionalities like the support for kerchunked data and the derived variable registry.

The DKRZ data management team develops and maintains local services around intake-esm for a positive user experience. In this showcase, we will present excerpts of seminars, workflows and integrations.
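The access pattern described above can be sketched as follows; the top-level catalog URL is the one given earlier in the abstract, while the sub-catalog name and search facets are illustrative rather than an exact DKRZ query.

```python
import intake

# Open the main DKRZ Intake catalog (single access point, see above).
main_cat = intake.open_catalog("https://www.dkrz.de/s/intake")

# Pick one of the nested Intake-ESM datastores (entry name is illustrative).
esm_col = main_cat["dkrz_cmip6_disk"]

# Faceted, metadata-driven search, then load the hits as Xarray datasets.
subset = esm_col.search(
    experiment_id="historical",
    table_id="Amon",
    variable_id="tas",
)
dsets = subset.to_dataset_dict()   # dict of analysis-ready-like xarray.Dataset objects
```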

How to cite: Wachsmann, F.: Intaking DKRZ ESM data collections, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14515, https://doi.org/10.5194/egusphere-egu23-14515, 2023.

11:50–12:00
|
EGU23-6768
|
ECS
|
On-site presentation
Ou Ku, Francesco Nattino, Meiert Grootes, Pranav Chandramouli, and Freek van Leijen

Satellite-based Interferometric Synthetic Aperture Radar (InSAR) plays a significant role in numerous surface-motion monitoring applications, e.g. civil-infrastructure stability, hydrocarbon extraction, etc. InSAR monitoring is based on a coregistered stack of Single Look Complex (SLC) SAR images. Due to the long temporal coverage, broad spatial coverage and high spatio-temporal resolution of an SLC SAR stack, handling it efficiently is a common challenge within the community. To meet this need, we present SarXarray: an open-source Xarray extension for SLC SAR stack processing. SarXarray provides a Python interface to read and write a coregistered stack of SLC SAR data, with basic SAR processing functions. It utilizes Xarray's support for labelled multi-dimensional datasets to stress the space-time character of an SLC SAR stack. It also leverages Dask to perform lazy evaluation of the operations. SarXarray can be integrated into existing Python workflows in a flexible way. We provide a case study of creating a SAR mean reflectivity map to demonstrate the functionality of SarXarray.
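To illustrate the underlying data model, the sketch below builds a labelled, Dask-backed complex SLC stack with plain Xarray and computes a mean reflectivity map; it uses generic Xarray/Dask calls only and is not SarXarray's own API.

```python
import dask.array as da
import numpy as np
import xarray as xr

# A synthetic coregistered SLC stack: complex values over (time, azimuth, range),
# chunked so that operations are evaluated lazily, chunk by chunk.
ntime, ny, nx = 30, 2000, 2000
slc = xr.DataArray(
    da.random.random((ntime, ny, nx), chunks=(1, 500, 500))
    + 1j * da.random.random((ntime, ny, nx), chunks=(1, 500, 500)),
    dims=("time", "azimuth", "range"),
    name="slc",
)

# Mean reflectivity map: temporal average of the SLC amplitude.
mean_reflectivity = np.abs(slc).mean(dim="time")
result = mean_reflectivity.compute()   # triggers the parallel computation
```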

How to cite: Ku, O., Nattino, F., Grootes, M., Chandramouli, P., and van Leijen, F.: SarXarray: an Xarray extension for SLC SAR data processing, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-6768, https://doi.org/10.5194/egusphere-egu23-6768, 2023.

12:00–12:10
|
EGU23-14507
|
On-site presentation
Marco Mancini, Mirko Stojiljkovic, and Jakub Walczak

geokube is a Python package for data analysis and visualisation in geoscience that provides high-level abstractions in terms of both a Data Model, inspired by the Climate and Forecast (CF) and Unidata Common Data Models, and an Application Programming Interface (API), inspired by xarray. Key features of geokube are the capabilities to: (i) perform georeferenced axis-based indexing on data structures and specialised geospatial operations according to different types of geoscientific datasets such as structured grids, point observations, profiles, etc. (e.g. extracting a bounding box or a multipolygon of variable values defined on a rotated-pole grid); (ii) perform operations on variables that are either instantaneous or defined over intervals; and (iii) convert to/from xarray data structures and read/write CF-compliant netCDF datasets.

How to cite: Mancini, M., Stojiljkovic, M., and Walczak, J.: geokube: A Python Package for Data Analysis and Visualization in Geoscience, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14507, https://doi.org/10.5194/egusphere-egu23-14507, 2023.

12:10–12:20
|
EGU23-14774
|
ECS
|
On-site presentation
Martí Bosch

Observational meteorological data are central to understanding atmospheric processes, and are thus a key requirement for the calibration and validation of atmospheric and numerical weather prediction models. While recent decades have seen the development of well-known platforms that make satellite data easily accessible, observational meteorological data mostly remain scattered across the sites of regional and national meteorological services, each potentially offering different variables, temporal coverage and data formats.

In order to overcome these shortcomings, we propose meteostations-geopy, a Pythonic library to access data from meteorological stations. The central objective is to provide a common interface to retrieve observational meteorological data, therefore reducing the amount of time required to process and wrangle the data. The library interacts with APIs from different weather services, handling authentication if needed and transforming the requested information into geopandas data frames of geolocated and timestamped observations that are homogeneously structured independently of the provider. 

The project is currently at an early development stage, with support for only two providers. Current and future work is organized along three interrelated axes, namely the integration of further providers, the implementation of native support for distributed data structures, and the organization of the library into the Intake technical structure, with provider-specific drivers, catalogs, metadata sharing and plugin packages.

How to cite: Bosch, M.: meteostations-geopy: a Pythonic interface to access data from meteorological stations, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14774, https://doi.org/10.5194/egusphere-egu23-14774, 2023.

12:20–12:30
|
EGU23-7825
|
On-site presentation
Fabian Gans and Felix Cremer
With the amount of high-resolution Earth observation data available, it is no longer feasible to do all analysis on local computers or even local cluster systems. To achieve high performance for out-of-memory datasets we develop the YAXArrays.jl package in the Julia programming language. YAXArrays.jl provides both an abstraction over chunked n-dimensional arrays with labelled axes and efficient multi-threaded and multi-process computation on these arrays.
In this contribution we present the lessons we learned from scaling an analysis of high-resolution Sentinel-1 time series data. By bringing a Sentinel-1 change detection use case, previously performed on a small local area of interest, to a whole region, we test the ease and performance of distributed computing on the European Open Science Cloud (EOSC) in Julia.

How to cite: Gans, F. and Cremer, F.: Scaling up a Sentinel 1 change detection pipeline using the Julia programming language, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7825, https://doi.org/10.5194/egusphere-egu23-7825, 2023.

Posters on site: Mon, 24 Apr, 16:15–18:00 | Hall X4

Chairpersons: Vasileios Baousis, Tina Odaka
X4.148
|
EGU23-2579
|
Jean Iaquinta and Anne Fouilloux

In most places on the planet vegetation thrives: this is known as the "greening Earth". However, certain regions, especially in the Arctic, exhibit a browning trend. This phenomenon is well known but not yet fully understood, and grasping its impact on local ecosystems requires the involvement of scientists from different disciplines, including the social sciences and humanities, as well as local populations. Here we focus on the Troms and Finnmark counties in northern Norway to assess the extent of the problem, any link with local environmental conditions, and the potential impacts.

We have chosen to adopt an open and collaborative process and to take advantage of the services offered by RELIANCE on the European Open Science Cloud (EOSC). RELIANCE delivers a suite of innovative and interconnected services that extend the capabilities of EOSC to support the management of the research lifecycle within Earth science communities and among Copernicus users. The RELIANCE project has delivered three complementary technologies: Research Objects (ROs), Data Cubes and AI-based Text Mining. RoHub is a Research Object management platform that implements these three technologies and enables researchers to collaboratively manage, share and preserve their research work.

We will show how we use these technologies, along with EGI Notebooks, to work openly and to share an executable Jupyter Notebook that is fully reproducible and reusable. We use a number of Python libraries from the Pangeo software stack, such as Xarray, Dask and Zarr. Our Jupyter Notebook is bundled with its computational environment, datacubes and related bibliographic resources in an executable Research Object. We believe that this approach can significantly speed up the research process and drive it towards more exploitable results.

Up to now, we have used indices derived from satellite data (in particular Sentinel-2) to assess how the vegetation cover in the Troms and Finnmark counties has changed. To go further, we are investigating how to relate this information to relevant local parameters obtained from meteorological reanalyses (ERA5 and ERA5-Land from ECMWF). This should provide a good basis for training and testing an Artificial Intelligence algorithm, with the objective of assessing whether it is possible to "predict" what is likely to happen in the near future to certain types of vegetation, such as mosses and lichens, which are essential for local populations and animals.
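A minimal sketch of the kind of vegetation-index analysis described above, using the Pangeo stack, is given below; the Zarr store and band names are hypothetical and this is not the notebook shared in the Research Object.

```python
import xarray as xr

# Open a (hypothetical) Sentinel-2 reflectance datacube for the study area.
ds = xr.open_zarr("troms_finnmark_sentinel2.zarr")

# NDVI from red (B04) and near-infrared (B08) reflectance.
ndvi = (ds["B08"] - ds["B04"]) / (ds["B08"] + ds["B04"])

# Growing-season means per year and a per-pixel linear trend:
# negative slopes indicate browning.
summer = ndvi.sel(time=ndvi.time.dt.month.isin([6, 7, 8]))
yearly = summer.groupby("time.year").mean("time")
slope = yearly.polyfit(dim="year", deg=1)["polyfit_coefficients"].sel(degree=1)
browning = slope.where(slope < 0)
```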

How to cite: Iaquinta, J. and Fouilloux, A.: Using FAIR and Open Science practices to better understand vegetation browning in Troms and Finnmark (Norway), EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-2579, https://doi.org/10.5194/egusphere-egu23-2579, 2023.

X4.149
|
EGU23-6857
|
ECS
|
Donatello Elia, Sonia Scardigno, Alessandro D'Anca, Gabriele Accarino, Jorge Ejarque, Francesco Immorlano, Daniele Peano, Enrico Scoccimarro, Rosa M. Badia, and Giovanni Aloisio

Typical end-to-end Earth System Modelling (ESM) workflows rely on several steps, including data pre-processing, numerical simulation, output post-processing, as well as data analytics and visualization. The approaches currently available for implementing scientific workflows in the climate context do not properly integrate the entire set of components into a single workflow in a transparent manner. The increasing use of High Performance Data Analytics (HPDA) and Machine Learning (ML) in climate applications further exacerbates these issues. A more integrated approach would make it possible to support next-generation ESM and to improve workflows in terms of execution and energy consumption.

Moreover, a seamless integration of components for HPDA and ML into the ESM workflow will open the door to novel applications and support larger-scale pre- and post-processing. However, these components typically have different deployment requirements, spanning from HPC (for ESM simulation) to Cloud computing (for HPDA and ML). It is paramount to provide scientists with solutions capable of hiding the technical details of the underlying infrastructure and improving workflow portability.

In the context of the eFlows4HPC project, we are exploring the use of innovative workflow solutions integrating approaches from HPC, HPDA and ML to support end-to-end ESM simulations and post-processing, with a focus on extreme events analysis (e.g., heat waves and tropical cyclones). In particular, the envisioned solution exploits PyCOMPSs for the management of parallel pipelines, task orchestration and synchronization, PyOphidia for climate data analytics, and ML frameworks (e.g., TensorFlow) for data-driven event detection models. This contribution presents the approaches being explored in the frame of the project to address the convergence of HPC, Big Data and ML into a single end-to-end ESM workflow.
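
As a rough illustration of the orchestration pattern described above, the sketch below expresses two dependent steps as PyCOMPSs tasks; the task bodies, names and file names are hypothetical and do not reproduce the project's actual pipelines.

```python
# Minimal PyCOMPSs sketch (hypothetical tasks): a post-processing step feeding a
# data-driven detection step; launched with the runcompss command.
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on


@task(returns=1)
def postprocess(simulation_output):
    # Placeholder: e.g. compute derived fields with PyOphidia or Xarray.
    return {"tas_max": 312.5, "source": simulation_output}


@task(returns=1)
def detect_extremes(fields):
    # Placeholder: e.g. apply a trained TensorFlow model to the fields.
    return fields["tas_max"] > 310.0


if __name__ == "__main__":
    # Tasks are submitted asynchronously; PyCOMPSs resolves the dependency graph.
    fields = postprocess("esm_run_001.nc")
    flag = detect_extremes(fields)
    print("Heat wave detected:", compss_wait_on(flag))
```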

How to cite: Elia, D., Scardigno, S., D'Anca, A., Accarino, G., Ejarque, J., Immorlano, F., Peano, D., Scoccimarro, E., Badia, R. M., and Aloisio, G.: Convergence of HPC, Big Data and Machine Learning for Earth System workflows, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-6857, https://doi.org/10.5194/egusphere-egu23-6857, 2023.

X4.150
|
EGU23-8756
|
Anne Fouilloux, Pier Lorenzo Marasco, Tina Odaka, Ruth Mottram, Paul Zieger, Michael Schulz, Alejandro Coca-Castro, Jean Iaquinta, and Guillaume Eynard Bontemps

The ever-increasing number of scientific datasets made available by authoritative data providers (NASA, Copernicus, etc.) and by the scientific community opens new possibilities for advancing the state of the art in many areas of the natural sciences. As a result, researchers, innovators, companies and citizens need to acquire computational and data analysis skills to optimally exploit these datasets. Several educational programs offer basic courses to students, and initiatives such as “The Carpentries” (https://carpentries.org/) complement this offering and also reach out to established researchers to fill the skill gap, thereby empowering them to perform their own data analysis. However, most researchers find it challenging to go beyond these training sessions and face difficulties when trying to apply their newly acquired knowledge to their own research projects. In this regard, hackathons have proven to be an efficient way to support researchers in becoming competent practitioners, but organising good hackathons is difficult and time consuming. In addition, the need for large amounts of computational and storage resources during training and hackathons requires a flexible solution.

Here, we propose an approach in which researchers work on realistic, large and complex data analysis problems similar to, or directly part of, their research work. Researchers access an infrastructure deployed on the European Open Science Cloud (EOSC) that supports intensive data analysis (large compute and storage resources). EOSC is a European Commission initiative for providing a federated and open multi-disciplinary environment where data, tools and services can be shared, published, found and re-used. We used Jupyter Book to deliver a collection of FAIR training materials for data analysis relying on Pangeo EOSC deployments as the primary computing platform. The training material (https://pangeo-data.github.io/foss4g-2022/intro.html, https://pangeo-data.github.io/clivar-2022/intro.html, https://pangeo-data.github.io/escience-2022/intro.html) is customised (different datasets with similar analyses) for different target communities, and participants are taught the usage of Xarray and Dask and, more generally, how to efficiently access and analyse large online datasets. The training is complemented by group work in which attendees work on larger-scale scientific datasets: the classroom is split into several groups, and each group works on different scientific questions and may use different datasets.

The Pangeo (http://pangeo.io) ecosystem is not always new to all attendees, but applying Xarray (http://xarray.pydata.org) and Dask (https://www.dask.org/) to actual scientific “mini-projects” is often a stumbling block for many researchers. With this approach, attendees have the opportunity to ask questions, collaborate with other researchers as well as Research Software Engineers, and apply Open Science practices without the burden of trying and failing alone. We find the direct involvement of scientific computing research engineers in the training to be crucial for the success of the hackathon approach. Feedback from attendees shows that it provides a solid foundation for big data geoscience and helps them quickly become competent practitioners. It also gives infrastructure providers and EOSC useful feedback on the current and future needs of researchers for making their research FAIR and open.
In this presentation, we will provide examples of achievements from attendees and present the feedback EOSC providers have received.
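
To give a concrete flavour of the pattern taught during these events, the sketch below shows the typical "open a large online dataset and compute in parallel" workflow with Xarray and Dask; the dataset URL and variable name are assumptions.

```python
# Minimal sketch (hypothetical dataset URL and variable name): start a Dask cluster,
# open a chunked online dataset lazily, and aggregate it in parallel.
import xarray as xr
from dask.distributed import Client, LocalCluster

# On Pangeo@EOSC a Dask Gateway cluster would typically be used instead;
# a LocalCluster keeps this example self-contained.
client = Client(LocalCluster(n_workers=4))

# Lazily open a chunked Zarr store over the network.
ds = xr.open_zarr("https://example.org/climate-data.zarr")

# Operations build a Dask task graph; .compute() runs it on the cluster.
monthly_mean = ds["tas"].resample(time="1MS").mean().compute()
print(monthly_mean)
```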

How to cite: Fouilloux, A., Marasco, P. L., Odaka, T., Mottram, R., Zieger, P., Schulz, M., Coca-Castro, A., Iaquinta, J., and Eynard Bontemps, G.: Pangeo framework for training: experience with FOSS4G, the CLIVAR bootcamp and the eScience course, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8756, https://doi.org/10.5194/egusphere-egu23-8756, 2023.

X4.151
|
EGU23-13768
|
ECS
Alejandro Coca-Castro, Anne Fouilloux, J. Scott Hosking, and Environmental Data Science Book community

Making assets in scientific research Findable, Accessible, Interoperable and Reusable (FAIR) is still overwhelming for many scientists. When considered as an afterthought, FAIR research is indeed challenging, and we argue that its implementation is far easier when considered at an early stage, with a focus on improving researchers' day-to-day work practices. One key aspect is to bundle all the research artefacts in a FAIR Research Object (RO) using RoHub (https://reliance.rohub.org/), a Research Object management platform that enables researchers to collaboratively manage, share and preserve their research work (data, software, workflows, models, presentations, videos, articles, etc.). RoHub implements the full RO model and paradigm: resources associated with a particular research work are aggregated into a single FAIR digital object, and metadata relevant for understanding and interpreting the content are represented as semantic metadata that are both user and machine readable. This approach provides the technical basis for implementing FAIR executable notebooks: the data and the computational environment can be “linked” to one or several FAIR notebooks that can then be executed via the EGI Binder Service with scalable compute and storage capabilities.

However, the need to define clear practices for writing and publishing FAIR notebooks that can be reused to build new research has quickly arisen. This is where a community of practice is required. The Environmental Data Science Book (or EDS Book) is a pan-European, community-driven resource hosted on GitHub and powered by Jupyter Book. EDS Book provides practical guidelines and templates that help translate research outputs into curated, interactive, shareable and reproducible executable notebooks. The quality of the FAIR notebooks is ensured by a collaborative and transparent reviewing process supported by GitHub-related technologies. This approach provides immediate benefits for those who adopt it and can feed fruitful discussions to better define a reward system that would benefit science and scientific communities. All the resources needed for understanding and executing a notebook are gathered into an executable Research Object in RoHub. To date, the community has successfully published ten FAIR notebooks covering a wide range of topics in environmental data science. The notebooks use open-source Python libraries, e.g. intake, iris, xarray and hvplot, for fetching, processing and interactively visualising environmental research data. While these notebooks are currently Python-based, EDS Book supports other programming languages such as R and Julia, and we aim to engage with similar computational notebook communities to improve research practices in environmental science.
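
As an illustration of the notebook style promoted by EDS Book, the sketch below uses intake and hvplot to fetch and interactively visualise a dataset; the catalog URL, entry and variable names are hypothetical.

```python
# Minimal sketch (hypothetical catalog, entry and variable names): fetch data through
# an intake catalog and plot it interactively with hvplot in a Jupyter notebook.
import intake
import hvplot.xarray  # registers the .hvplot accessor on xarray objects

# Open a catalog describing the datasets used by a notebook.
catalog = intake.open_catalog("https://example.org/eds-book-catalog.yml")

# Load one (xarray-backed) catalog entry lazily as a Dask-backed Dataset.
ds = catalog["land_surface_temperature"].to_dask()

# Interactive map of one variable, rendered directly in the notebook.
plot = ds["lst"].hvplot.quadmesh(x="longitude", y="latitude", rasterize=True)
```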

How to cite: Coca-Castro, A., Fouilloux, A., Hosking, J. S., and community, E. D. S. B.: FAIR Notebooks: opportunities and challenges for the geoscience community, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13768, https://doi.org/10.5194/egusphere-egu23-13768, 2023.

X4.152
|
EGU23-9095
Guillaume Eynard-Bontemps, Jean Iaquinta, Sebastian Luna-Valero, Miguel Caballer, Frederic Paul, Anne Fouilloux, Benjamin Ragan-Kelley, Pier Lorenzo Marasco, and Tina Odaka

Research projects heavily rely on the exchange and processing of data, and in this context Pangeo (https://pangeo.io/), a worldwide community of scientists and developers, strives to facilitate the deployment of ready-to-use, community-driven platforms for big data geoscience. The European Open Science Cloud (EOSC) is the main initiative in Europe for providing a federated and open multi-disciplinary environment where European researchers, innovators, companies and citizens can share, publish, find and re-use data, tools and services for research, innovation and educational purposes. While a number of services based on Jupyter Notebooks were already available, no public Pangeo deployments providing fast access to large amounts of data and compute resources were accessible on EOSC. Most existing cloud-based Pangeo deployments are USA-based, and members of the Pangeo community in Europe did not have a shared platform where scientists or technologists could exchange know-how. Pangeo therefore teamed up with two EOSC projects, namely EGI-ACE (https://www.egi.eu/project/egi-ace/) and C-SCALE (https://c-scale.eu/), to demonstrate how to deploy and use Pangeo on EOSC and to emphasise the benefits for the European community.

The Pangeo Europe Community, together with EGI, deployed a DaskHub, composed of Dask Gateway (https://gateway.dask.org/) and JupyterHub (https://jupyter.org/hub), with a Kubernetes cluster backend on EOSC, using the infrastructure of the EGI Federation (https://www.egi.eu/egi-federation/). The Pangeo EOSC JupyterHub deployment makes use of 1) the EGI Check-In, to enable user registration and thereby authenticated and authorised access to the Pangeo JupyterHub portal and to the underlying distributed compute infrastructure; and 2) the EGI Cloud Compute and the cloud-based EGI Online Storage, to distribute the computational tasks to a scalable compute platform and to store intermediate results produced by the user jobs.
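
From a user's perspective, requesting compute resources on such a deployment typically looks like the sketch below, which uses Dask Gateway from within JupyterHub; the gateway address is a placeholder, since the actual deployment preconfigures it.

```python
# Minimal sketch (placeholder address): request Dask workers through Dask Gateway
# from a JupyterHub session on a DaskHub-style deployment.
from dask_gateway import Gateway

gateway = Gateway("https://dask-gateway.example.eosc.eu")

# Start a cluster, let it scale between 2 and 20 workers, and get a client.
cluster = gateway.new_cluster()
cluster.adapt(minimum=2, maximum=20)
client = cluster.get_client()
print(client)
```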

To facilitate future Pangeo deployments on top of a wide range of cloud providers (AWS, Google Cloud, Microsoft Azure, EGI Cloud Computing, OpenNebula, OpenStack, and more), the Pangeo EOSC JupyterHub deployment is now possible through the Infrastructure Manager (IM) Dashboard (https://im.egi.eu/im-dashboard/login). All the computing and storage resources are currently supplied by CESNET (https://www.cesnet.cz/?lang=en) in the frame of the EGI-ACE project (https://im.egi.eu/). Several deployments have been made to serve the geoscience community, both for teaching and for research work. To date, more than 100 researchers have been trained on Pangeo@EOSC deployments and more are expected to join, in particular with easy access to large amounts of Copernicus data through a recent collaboration established with the C-SCALE project. In this presentation, we will provide details on the different deployments, explain how to get access to the JupyterHub deployments and, more generally, how to contribute to Pangeo@EOSC.

How to cite: Eynard-Bontemps, G., Iaquinta, J., Luna-Valero, S., Caballer, M., Paul, F., Fouilloux, A., Ragan-Kelley, B., Marasco, P. L., and Odaka, T.: Pangeo@EOSC: deployment of PANGEO ecosystem on the European Open Science Cloud, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-9095, https://doi.org/10.5194/egusphere-egu23-9095, 2023.

X4.153
|
EGU23-8096
|
ECS
Justus Magin and Tina Odaka

In order to make the use of a collection of datasets – for example, scenes from a SAR satellite – more efficient, it is important to be able to search for the datasets relevant to a specific application. In particular, one might want to search for a specific period in time or a given spatial extent, or to perform searches over multiple collections together.

For SAR data or data obtained from optical satellites, Spatio-Temporal Asset Catalogs (STAC) have become increasingly popular in the past few years. Defined in JSON and backed by databases with geospatial extensions, STAC servers (endpoints) have the advantage of being efficient, language-agnostic and of following a standardized API.

Just like satellite scenes, in-situ data is growing in size very quickly and would thus benefit from being catalogued. However, the sequential nature of in-situ data and its sparse distribution in space make it difficult to fit into STAC's standard model.

In this session, we present an experimental STAC extension that defines the most common properties of in-situ data, as identified from Argo float and biologging data.
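
As an example of the kind of query such a catalog enables, the sketch below performs a spatio-temporal search against a STAC API with pystac-client; the endpoint URL and collection name are assumptions.

```python
# Minimal sketch (hypothetical endpoint and collection): search a STAC API by
# bounding box and time range with pystac-client.
from pystac_client import Client

catalog = Client.open("https://stac.example.org")

search = catalog.search(
    collections=["argo-profiles"],
    bbox=[-30.0, 40.0, 0.0, 60.0],          # lon/lat bounding box
    datetime="2022-01-01/2022-12-31",
)

for item in search.items():
    print(item.id, item.datetime)
```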

How to cite: Magin, J. and Odaka, T.: Spatio-Temporal Asset Catalog (STAC) for in-situ data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8096, https://doi.org/10.5194/egusphere-egu23-8096, 2023.

X4.154
|
EGU23-16895
Toward Developing an African Earth Open Portal of Remote Sensing Data Processing for Environmental Monitoring Applications
(withdrawn)
Badr-Eddine Boudriki Semlali, Touria Benmira, Bouchra Aithssaine, and Abdelghani Chehbouni
X4.155
|
EGU23-13082
Lifting the Fog into Space: Complementing Data Centers with Satellite On-Board Datacube Processing
(withdrawn)
Dimitar Mishev and Peter Baumann