ESSI2.9 | Seamless transitioning between HPC and cloud in support of Earth Observation, Earth Modeling and community-driven Geoscience approach PANGEO
EDI
Co-organized by AS5/CL5/GI1/OS4
Convener: Tina Odaka | Co-conveners: Vasileios Baousis, Stathes Hadjiefthymiades, Anne Fouilloux, Alejandro Coca-Castro, Pier Lorenzo Marasco, Guillaume Eynard-Bontemps
Orals
| Tue, 16 Apr, 08:30–10:15 (CEST)
 
Room 0.51
Posters on site
| Attendance Tue, 16 Apr, 16:15–18:00 (CEST) | Display Tue, 16 Apr, 14:00–18:00
 
Hall X3
Cloud computing has emerged as the dominant paradigm, supporting practically all industrial applications and a significant number of academic and research projects. Since its inception and subsequent widespread adoption, migration to the cloud has presented a substantial challenge for numerous organisations and enterprises. Leveraging cloud technologies to process big data close to where they are physically stored is an ideal use case: cloud resources provide the requisite infrastructure and tools, especially when accompanied by high-performance computing (HPC) capabilities.

Pangeo (pangeo.io) is a global community of researchers and developers who tackle big geoscience data challenges in a collaborative manner using HPC and Cloud infrastructure. This session's aim is threefold:
(1) to focus on Cloud/Fog/Edge computing use cases and identify the status of, and the steps towards, wider cloud computing adoption in Earth Observation and Earth System Modelling;
(2) to motivate researchers who are using or developing in the Pangeo ecosystem to share their endeavours with a broader community that can benefit from these new tools;
(3) to contribute to the Pangeo community in terms of potential new applications for the Pangeo ecosystem, whose core packages include Xarray, Iris, Dask, Jupyter, Zarr, Kerchunk and Intake.

We warmly welcome contributions that detail various Cloud computing initiatives within the domains of Earth Observation and Earth System Modelling, including but not limited to:
- Cloud federations, scalability and interoperability initiatives across different domains, multi-provenance data, security, privacy and green and sustainable computing.
- Cloud applications, infrastructure and platforms (IaaS, PaaS, SaaS and XaaS).
- Cloud-native AI/ML frameworks and tools for processing data.
- Operational systems on the cloud.
- Cloud computing and HPC convergence and workload unification for EO data processing.

We also welcome presentations using at least one of Pangeo’s core packages in any of the following domains:
- Atmosphere, Ocean and Land Models
- Satellite Observations
- Machine Learning
- And other related applications

We also welcome contributions on the above themes whose scientific results may be presented in other EGU sessions but which focus here on the research, data management, software and/or infrastructure aspects. For instance, you can showcase your implementation through live executable notebooks.

Orals: Tue, 16 Apr | Room 0.51

Chairpersons: Vasileios Baousis, Tina Odaka, Anne Fouilloux
08:30–08:32
08:32–08:42
|
EGU24-1857
|
On-site presentation
Armagan Karatosun, Claudio Pisa, Tolga Kaprol, Vasileios Baousis, and Mohanad Albughdadi

The EO4EU project aims to make Earth Observation (EO) data easier to access and use for environmental, government and business forecasts and operations.

To reach this goal, the EO4EU Platform will soon be made officially available, leveraging existing EO data sources such as DestinE, GEOSS, INSPIRE, Copernicus and Galileo, and offering advanced tools and services, based also on machine learning techniques, to help users find, access and handle the data they are interested in. The EO4EU Platform relies on a combination of a multi-cloud computing infrastructure coupled with pre-exascale high-performance computing facilities to manage demanding processing workloads.

The EO4EU multi-cloud infrastructure is composed of IaaS resources hosted on the WEkEO and CINECA Ada clouds, on top of which run a set of Kubernetes clusters dedicated to different workloads (e.g. cluster management tools, observability, or specific applications such as an inference server). To automate the deployment and management of these clusters, minimising both manual effort and human error, we have devised an Infrastructure-as-Code (IaC) architecture based on the Terraform, Rancher and Ansible technologies.

We believe that the proposed IaC architecture, based on open-source components and extensively documented and tested in the field, can be successfully replicated by other EO initiatives leveraging cloud infrastructures.

How to cite: Karatosun, A., Pisa, C., Kaprol, T., Baousis, V., and Albughdadi, M.: A Replicable Multi-Cloud Automation Architecture for Earth Observation, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-1857, https://doi.org/10.5194/egusphere-egu24-1857, 2024.

08:42–08:52
|
EGU24-9795
|
ECS
|
On-site presentation
Cathal O'Brien, Armagan Karatosun, Adrian Hill, Paul Cresswell, Michael Sleigh, and Ioan Hadade

The IFS (Integrated Forecast System) is a global numerical weather prediction system maintained by the European Centre for Medium-Range Weather Forecasts (ECMWF). Traditionally, ECMWF’s high-performance computing facility (HPCF) is responsible for operationally supporting the IFS cycles. However, with the emergence of new cloud technologies, initiatives such as Destination Earth (DestinE), and the growth of OpenIFS users within Europe and around the globe, the need to run the IFS outside of ECMWF's computing facilities becomes more evident. For such use cases, IFSTestsuite allows the complete IFS system and its dependencies (e.g. ecCodes) to be built and tested outside of ECMWF's HPCF, and it is designed to be self-contained, eliminating the need for externally installed tools such as MARS or ecCodes. Although users still need to perform multiple steps, and software availability and versions depend on the host operating system, this indicates the potential for a more generic and broader approach.

Containerization might provide the much-needed portability and disposable environments to trigger new cycles with the desired compiler versions, or even with different compilers. In addition, pre-built container images can be executed on any platform, provided a container runtime that adheres to Open Container Initiative (OCI) standards, such as Singularity or Docker, is installed on the target system. Another benefit of container images is image layering, which can significantly reduce image build times. Lastly, despite their differences, both Singularity and Docker adhere to the OCI standards, and converting one container image to the other is straightforward. However, despite these clear advantages, there are several crucial design choices to keep in mind. Notably, the available hardware and software stacks vary greatly across different HPC systems. When performance is important, this heterogeneous landscape limits the portability of containers: the libraries and drivers inside the container must be specially selected with regard to the hardware and software stack of a specific host system to maximize performance on that system. If this is done correctly, the performance of containerized HPC applications can match that of native applications. We demonstrate this process with a hybrid containerization strategy in which compatible MPI stacks and drivers are built inside the containers. The binding of host libraries into containers is also used on systems where proprietary software cannot be rebuilt inside the container.

In this study we present a containerized solution that balances portability and efficient performance, with examples of containerizing the IFS on a variety of systems: cloud systems with a generic x86-64 architecture, such as the European Weather Cloud (EWC) and Microsoft Azure, and EuroHPC systems such as Leonardo and LUMI. We also provide container image recipes for OpenIFS.

How to cite: O'Brien, C., Karatosun, A., Hill, A., Cresswell, P., Sleigh, M., and Hadade, I.: Unifying HPC and Cloud Systems; A Containerized Approach for the Integrated Forecast System (IFS), EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-9795, https://doi.org/10.5194/egusphere-egu24-9795, 2024.

08:52–09:02
|
EGU24-12669
|
ECS
|
On-site presentation
Antonis Troumpoukis, Iraklis Klampanos, and Vangelis Karkaletsis

The European AI-on-Demand Platform (AIoD, http://aiod.eu) is a vital resource for leveraging and boosting the European AI research landscape towards economic growth and societal advancement across Europe. Following European values such as openness, transparency, and trustworthiness in developing and using AI technologies, the AIoD platform aims to become the main one-stop shop for exchanging and building AI resources and applications within the European AI innovation ecosystem. The primary goal of the DIGITAL-EUROPE CSA initiative DeployAI (DIGITAL-2022-CLOUD-AI-B-03, 01/2024-12/2027) is to build, deploy, and launch a fully operational AIoD platform, promoting trustworthy, ethical, and transparent European AI solutions for industry, with a focus on SMEs and the public sector.

Building on open-source and trusted software, DeployAI will provide a number of technological assets, such as a comprehensive and trustworthy AI resource catalogue and marketplace offering responsible AI resources and tools, workflow composition and execution systems for prototyping and user-friendly creation of novel services, and responsible foundational models and services to foster dependable innovation. In addition, building upon the results of the ICT-49 AI4Copernicus project [1], which provided a bridge between the AIoD platform, the Copernicus ecosystem and the DIAS platforms, DeployAI will integrate impactful Earth Observation AI services into the AIoD platform. These will include (but are not limited to) satellite imagery preprocessing, land usage classification, crop type identification, super-resolution, and weather forecasting.

Furthermore, DeployAI will allow the rapid prototyping of AI applications and their deployment to a variety of Cloud/Edge/HPC infrastructures. The project will focus on establishing a cohesive interaction framework that integrates with European Data Spaces and Gaia-X initiatives, HPC systems with an emphasis on the EuroHPC context, and the European Open Science Cloud. Interfaces to European initiatives and industrial AI-capable cloud platforms will be further implemented to enable interoperability. This capability enables the execution of Earth Observation applications not only within the context of a DIAS/DAS but also within several other compute systems. This level of interoperability enhances the adaptability and accessibility of AI applications, fostering a collaborative environment where geoscientific workflows can be seamlessly executed across diverse computational infrastructures and made available to a wide audience of innovators.

[1] A. Troumpoukis et al., "Bridging the European Earth-Observation and AI Communities for Data-Intensive Innovation", 2023 IEEE Ninth International Conference on Big Data Computing Service and Applications (BigDataService), Athens, Greece, 2023, pp. 9-16, doi:10.1109/BigDataService58306.2023.00008.

This work has received funding from the European Union’s Digital Europe Programme (DIGITAL) under grant agreement No 101146490.

How to cite: Troumpoukis, A., Klampanos, I., and Karkaletsis, V.: DeployAI to Deliver Interoperability of Cloud and HPC Resources for Earth Observation in the Context of the European AI-on-Demand Platform, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-12669, https://doi.org/10.5194/egusphere-egu24-12669, 2024.

09:02–09:12
|
EGU24-12410
|
On-site presentation
Iraklis Klampanos, Antonis Ganios, and Antonis Troumpoukis

Interoperability and reproducibility are critical aspects of scientific computation. The data analysis platform Reana [1], developed by CERN, enhances the interoperability and reproducibility of scientific analyses by allowing researchers to describe, execute, and share their analyses. This is achieved via the execution of standardised scientific workflows, such as CWL, within reusable containers. Moreover, it allows execution to span different types of resources, such as Cloud and HPC. 

In this session we will present ongoing work to enhance Reana’s Workflows-as-a-Service (WaaS) functionality and also support Workflow registration and discoverability. Building upon the design goals and principles of the DARE platform [2], this work aims to enhance Reana by enabling users to register and discover available workflows within the system. In addition, we will present the integration of Data Provenance based on the W3C PROV-O standard [3] allowing the tracking and recording of data lineage in a systematic and dependable way across resource types. 

In summary, key aspects of this ongoing work include:

  • Workflows-as-a-Service (WaaS): Extending Reana's service-oriented mode of operation, allowing users to register, discover, access, execute, and manage workflows by name or ID, via APIs, therefore enhancing the platform's accessibility and usability.
  • Data Provenance based on W3C PROV-O: Implementing support for recording and visualising data lineage information in compliance with the W3C PROV-O standard. This ensures transparency and traceability of data processing steps, aiding in reproducibility and understanding of scientific analyses.

This work aims to broaden Reana's functionality, aligning with best practices for reproducible and transparent scientific research. We aim to make use of the enhanced Reana-based system on the European AI-on-demand platform [4], currently under development, to address the requirements of AI innovators and researchers when studying and executing large-scale AI-infused workflows.

References: 

[1] Simko et al., (2019). Reana: A system for reusable research data analyses. EPJ Web Conf., 214:06034, https://doi.org/10.1051/epjconf/201921406034

[2] Klampanos et al., (2020). DARE Platform: a Developer-Friendly and Self-Optimising Workflows-as-a-Service Framework for e-Science on the Cloud. Journal of Open Source Software, 5(54), 2664, https://doi.org/10.21105/joss.02664

[3] PROV-O: The PROV Ontology: https://www.w3.org/TR/prov-o/ (viewed 9 Jan 2024)

[4] The European AI-on-Demand platform: https://aiod.eu (viewed 9 Jan 2024)

This work has received funding from the European Union’s Horizon Europe research and innovation programme under Grant Agreement No 101070000.

How to cite: Klampanos, I., Ganios, A., and Troumpoukis, A.: Towards Enhancing WaaS and Data Provenance over Reana, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-12410, https://doi.org/10.5194/egusphere-egu24-12410, 2024.

09:12–09:14
09:14–09:24
|
EGU24-9156
|
ECS
|
On-site presentation
Michael Tso, Michael Hollaway, Faiza Samreen, Iain Walmsley, Matthew Fry, John Watkins, and Gordon Blair

In environmental science, scientists and practitioners are increasingly facing the need to create data-driven solutions to the environment's grand challenges, often needing to use data from disparate sources and advanced analytical methods, as well as drawing expertise from collaborative and cross-disciplinary teams [1]. Virtual labs allow scientists to collaboratively explore large or heterogeneous datasets, develop and share methods, and communicate their results to stakeholders and decision-makers. 

DataLabs [2] has been developed as a cloud-based collaborative platform to tackle these challenges and promote open, collaborative, interdisciplinary geo-environmental sciences. It allows users to share notebooks (e.g. JupyterLab, R Studio, and most recently VS Code), datasets and computational environments and promote transparency and end-to-end reasoning of model uncertainty. It supports FAIR access to data and digital assets by providing shared data stores and discovery functionality of datasets and assets hosted on the platform’s asset catalogue. Its tailorable design allows it to be adaptable to different challenges and applications. It is also an excellent platform for large collaborative teams to work on outputs together [3] as well as communicating results to stakeholders by allowing easy prototyping and publishing of web applications (e.g. Shiny, Panel, Voila). It is currently deployed on JASMIN [4] and is part of the UK NERC Environmental data service [5]. 

There are a growing number of use cases and requirements for DataLabs, and it is going to play a central part in several planned digital research infrastructure (DRI) initiatives. Future development needs of the platform to further its vision include a more intuitive onboarding experience, easier access to key datasets at source, better connectivity to other cloud platforms, and better use of workflow tools. DataLabs shares many of the features (e.g. heavy use of PANGEO core packages) and design principles of PANGEO. We would be interested in exploring commonalities and differences, sharing best practices, and growing the community of practice in this increasingly important area.

[1]  Blair, G.S., Henrys, P., Leeson, A., Watkins, J., Eastoe, E., Jarvis, S., Young, P.J., 2019. Data Science of the Natural Environment: A Research Roadmap. Front. Environ. Sci. 7. https://doi.org/10.3389/fenvs.2019.00121  

[2] Hollaway, M.J., Dean, G., Blair, G.S., Brown, M., Henrys, P.A., Watkins, J., 2020. Tackling the Challenges of 21st-Century Open Science and Beyond: A Data Science Lab Approach. Patterns 1, 100103. https://doi.org/10.1016/j.patter.2020.100103 

[3] https://eds.ukri.org/news/impacts/datalabs-streamlines-workflow-assessing-state-nature-uk  

[4] https://jasmin.ac.uk/  

[5] https://eds.ukri.org/news/impacts/datalabs-digital-collaborative-platform-tackling-environmental-science-challenges  

How to cite: Tso, M., Hollaway, M., Samreen, F., Walmsley, I., Fry, M., Watkins, J., and Blair, G.: DataLabs: development of a cloud collaborative platform for open interdisciplinary geo-environmental sciences, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-9156, https://doi.org/10.5194/egusphere-egu24-9156, 2024.

09:24–09:34
|
EGU24-20779
|
On-site presentation
John Clyne, Brian Rose, Orhan Eroglu, James Munroe, Ryan May, Drew Camron, Julia Kent, Amelia Snyder, Kevin Tyle, Maxwell Grover, and Robert Ford

Project Pythia is the educational arm of the Pangeo community, and provides a growing collection of community driven and developed training resources that help geoscientists navigate the Pangeo ecosystem, and the myriad complex technologies essential for today’s Big Data science challenges. Project Pythia began in 2020 with the support of a U.S. NSF EarthCube award. Much of the initial effort focused on Pythia Foundations: a collection of Jupyter Notebooks that covered essential topics such as Python language basics; managing projects with GitHub; authoring and using “binderized” Jupyter Notebooks; and many of Pangeo’s core packages such as Xarray, Pandas, and Matplotlib. Building upon Foundations, the Pythia community turned its attention toward creating Pythia Cookbooks: exemplar collections of recipes for transforming raw ingredients (publicly available, cloud-hosted data) into scientifically useful results. Built from Jupyter Notebooks, Cookbooks are explicitly tied to reproducible computational environments and supported by a rich infrastructure enabling collaborative authoring and automated health-checking – essential tools in the struggle against the widespread notebook obsolescence problem.

 

Open-access, cloud-based Cookbooks are a democratizing force for growing the capacity of current and future geoscientists to practice open science within the rapidly evolving open science ecosystem. In this talk we outline our vision of a sustainable, inclusive open geoscience community enabled by Cookbooks. With further support from the NSF, the Pythia community will accelerate the development and broad buy-in of these resources, demonstrating highly scalable versions of common analysis workflows on high-value datasets across the geosciences. Infrastructure will be deployed for performant data-proximate Cookbook authoring, testing, and use, on both commercial and public cloud platforms. Content and community will expand through annual workshops, outreach, and classroom use, with recruitment targeting under-served communities. Priorities will be guided by an independent steering board; sustainability will be achieved by nurturing a vibrant, inclusive community backed by automation that lowers barriers to participation.

How to cite: Clyne, J., Rose, B., Eroglu, O., Munroe, J., May, R., Camron, D., Kent, J., Snyder, A., Tyle, K., Grover, M., and Ford, R.: Project Pythia: Building an Inclusive Geoscience Community with Cookbooks, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-20779, https://doi.org/10.5194/egusphere-egu24-20779, 2024.

09:34–09:44
|
EGU24-18256
|
On-site presentation
Florian Ziemen, Tobias Kölling, and Lukas Kluft

With the transition to global, km-scale simulations, model outputs have grown in size, and efficient ways of accessing data have become more important than ever. This implies that the data storage has to be optimized for efficient read access to small sub-sets of the data, and multiple resolutions of the same data need to be provided for efficient analysis on coarse as well as fine-grained scales.

In this high-level overview presentation, we present an approach based on datasets. Each dataset represents a coherent subset of a model output (e.g. all model variables stored at daily resolution). Aiming for a minimum number of datasets leads us to enforce consistency in the model output and thus eases analysis. Each dataset is served to the user as one zarr store, independent of the actual file layout on disks or other storage media. Multiple datasets are grouped in catalogs for findability.

By serving the data via https, we can implement a middle layer between the user and the storage systems, allowing us to combine different storage backends behind a unifying frontend. At the same time, this approach allows us to largely build the system on existing technologies such as web servers and caches, and to efficiently serve data to users outside the compute center where the data is stored.
The approach we present is currently under development in the BMBF project WarmWorld with contributions by the H2020 project nextGEMS, and we expect it to be useful for many other projects as well.
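As an illustration of this access pattern (not the actual WarmWorld endpoints), the sketch below opens such a dataset as a Zarr store over HTTPS with Xarray and reads only a small subset; the URL, variable and dimension names are hypothetical placeholders.

    import xarray as xr

    # Hypothetical catalog entry pointing at one dataset served as a Zarr store over HTTPS.
    url = "https://data.example.org/model-output/daily.zarr"

    # Open lazily; only the chunks touched by the selection below are requested from the server.
    ds = xr.open_dataset(url, engine="zarr", chunks={})

    # Read a small spatio-temporal subset without downloading the whole dataset.
    subset = ds["tas"].sel(time="2030-01-15").isel(cell=slice(0, 1000)).load()
    print(subset)

Because the server exposes one logical Zarr store per dataset, the same client code works regardless of which storage backend actually holds the chunks.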

How to cite: Ziemen, F., Kölling, T., and Kluft, L.: Data access for km-scale resolution models, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-18256, https://doi.org/10.5194/egusphere-egu24-18256, 2024.

09:44–09:54
|
EGU24-7765
|
On-site presentation
Bernhard Raml, Raphael Quast, Martin Schobben, Christoph Reimer, and Wolfgang Wagner

In remote sensing applications, the ability to efficiently fit models to vast amounts of observational data is vital for deriving high-quality data products, as well as accelerating research and development. Addressing this challenge, we developed a high-performance non-linear Trust Region Reflectance solver specialised for datacubes by integrating Python's interoperability with C++ and Dask's distributed computing capabilities. Our solution achieves high throughput both locally and potentially on any Dask-compatible backend, such as EODC's Dask Gateway. The Dask framework takes care of chunking the datacube, and streaming each chunk efficiently to available workers where our specialised solver is applied. Introducing Dask for distributed computing enables our algorithm to run on different compatible backends. This approach not only broadens operational flexibility, but also allows us to focus on enhancing the algorithm's efficiency, free from concerns about concurrency. This enabled us to implement a highly efficient solver in C++, which is optimised to run on a single core but still utilises all available resources effectively. For the heavy lifting, such as performing singular value decompositions and matrix operations, we rely on Eigen, a powerful open-source C++ library specialised in linear algebra. To describe the spatial reference and other auxiliary data associated with our datacube, we employ the Xarray framework. Importantly, Xarray integrates seamlessly with Dask. Finally, to ensure robustness and extensibility of our framework, we applied state-of-the-art software engineering practices, including Continuous Integration and Test-Driven Development. In our work we demonstrate the significant performance gains achievable by effectively utilising available open-source frameworks, and adhering to best engineering practices. This is exemplified by our practical workflow demonstration to fit a soil moisture estimation model.
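As a rough illustration of this fitting pattern (not the authors' C++ solver), the sketch below applies SciPy's Trust Region Reflective least-squares routine pixel by pixel to a small chunked Xarray datacube via apply_ufunc, letting Dask stream the chunks to workers; the model, array shapes and chunk sizes are invented for the example.

    import numpy as np
    import xarray as xr
    from scipy.optimize import least_squares

    def fit_pixel(y, t):
        """Fit y(t) = a * exp(-b * t) for one pixel with a Trust Region Reflective solver."""
        res = least_squares(lambda p: p[0] * np.exp(-p[1] * t) - y, x0=[1.0, 0.1], method="trf")
        return res.x  # fitted (a, b)

    # Toy datacube: a short time series per pixel, chunked in space.
    t = np.linspace(0.0, 1.0, 20)
    cube = xr.DataArray(np.random.rand(20, 40, 40), dims=("time", "y", "x")).chunk({"y": 20, "x": 20})

    params = xr.apply_ufunc(
        fit_pixel, cube, kwargs={"t": t},
        input_core_dims=[["time"]], output_core_dims=[["param"]],
        vectorize=True, dask="parallelized", output_dtypes=[float],
        dask_gufunc_kwargs={"output_sizes": {"param": 2}},
    )
    params.compute()  # one independent fit per pixel, parallelised over chunks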

How to cite: Raml, B., Quast, R., Schobben, M., Reimer, C., and Wagner, W.: Unleashing the power of Dask with a high-throughput Trust Region Reflectance solver for raster datacubes, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-7765, https://doi.org/10.5194/egusphere-egu24-7765, 2024.

09:54–10:04
|
EGU24-20909
|
ECS
|
Virtual presentation
Orhan Eroglu, Hongyu Chen, Philip Chmielowiec, John Clyne, Corrine DeCiampa, Cecile Hannay, Robert Jacob, Rajeev Jain, Richard Loft, Brian Medeiros, Lantao Sun, Paul Ullrich, and Colin Zarzycki

The arrival of kilometer-scale climate and global weather models presents substantial challenges for the analysis and visualization of the resulting data, not only because of their tremendous size but also because of the employment of unstructured grids upon which the governing equations of state are solved. Few Open Source analysis and visualization software tools exist that are capable of operating directly on unstructured grid data. Those that do exist are not comprehensive in the capabilities they offer, do not scale adequately, or both. Recognizing this gap in much-needed capability, Project Raijin - funded by an NSF EarthCube award - and the DOE SEATS project, launched a collaborative effort to develop an open source Python package called UXarray. 

UXarray extends the widely used Xarray package, providing support for operating directly (without regridding) on unstructured grid model outputs found in the Earth System Sciences, such as CAM-SE, MPAS, SCRIP, UGRID, and in the future, ICON. Much like Xarray, UXarray provides fundamental analysis and visualization operators, upon which more specialized, domain-specific capabilities can be layered. This talk will present an overview of the current capabilities of UXarray, provide a roadmap for near term future development, and will describe how the Pangeo community can contribute to this on-going effort.
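As a brief illustration of the intended usage (the file paths and variable name are placeholders, not a specific model output), UXarray is designed to open a grid definition alongside the data that lives on it:

    import uxarray as ux

    # Hypothetical MPAS output: a grid definition file plus a data file on that grid.
    uxds = ux.open_dataset("mpas_grid.nc", "mpas_output.nc")

    print(uxds.uxgrid)               # grid topology: nodes, edges, faces
    print(uxds["relativeHumidity"])  # behaves like an Xarray DataArray, no regridding needed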

How to cite: Eroglu, O., Chen, H., Chmielowiec, P., Clyne, J., DeCiampa, C., Hannay, C., Jacob, R., Jain, R., Loft, R., Medeiros, B., Sun, L., Ullrich, P., and Zarzycki, C.: UXarray: Extensions to Xarray to support unstructured grids, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-20909, https://doi.org/10.5194/egusphere-egu24-20909, 2024.

10:04–10:14
|
EGU24-15416
|
ECS
|
On-site presentation
Alexander Kmoch, Benoît Bovy, Justus Magin, Ryan Abernathey, Peter Strobl, Alejandro Coca-Castro, Anne Fouilloux, Daniel Loos, and Tina Odaka

Traditional geospatial representations of the globe on a 2-dimensional plane often introduce distortions in area, distance, and angles. Discrete Global Grid Systems (DGGS) mitigate these distortions and introduce a hierarchical structure of global grids. Defined by ISO standards, DGGSs serve as spatial reference systems facilitating data cube construction, enabling integration and aggregation of multi-resolution data sources. Various tessellation schemes such as hexagons and triangles cater to different needs - equal area, optimal neighborhoods, congruent parent-child relationships, ease of use, or vector field representation in modeling flows.

The fusion of Discrete Global Grid Systems (DGGS) and Datacubes represents a promising synergy for integrated handling of planetary-scale data.

The recent Pangeo community initiative at the ESA BiDS'23 conference has led to significant advancements in supporting Discrete Global Grid Systems (DGGS) within the widely used Xarray package. This collaboration resulted in the development of the Xarray extension XDGGS (https://github.com/xarray-contrib/xdggs). The aim of xdggs is to provide a unified, high-level, and user-friendly API that simplifies working with various DGGS types and their respective backend libraries, seamlessly integrating with Xarray and the Pangeo scientific computing ecosystem. Executable notebooks demonstrating the use of the xdggs package are also developed to showcase its capabilities.
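As a sketch of the intended usage pattern, based on the examples in the xdggs repository (the input file and cell-id coordinate are invented, and the API may still evolve):

    import xarray as xr
    import xdggs

    # Hypothetical dataset whose "cell_ids" coordinate holds DGGS (e.g. HEALPix or H3) cell
    # indices, with the grid parameters stored in the coordinate attributes.
    ds = xr.open_dataset("healpix_gridded.nc")

    # Decode the DGGS metadata so that grid-aware operations become available
    # through the .dggs accessor.
    ds = ds.pipe(xdggs.decode)
    centers = ds.dggs.cell_centers()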

This development represents a significant step forward, though continuous efforts are necessary to broaden the accessibility of DGGS for scientific and operational applications, especially in handling gridded data such as global climate and ocean modeling, satellite imagery, raster data, and maps.

Keywords: Discrete Global Grid Systems, Xarray Extension, Geospatial Data Integration, Earth Observation, Data Cube, Scientific Collaboration

How to cite: Kmoch, A., Bovy, B., Magin, J., Abernathey, R., Strobl, P., Coca-Castro, A., Fouilloux, A., Loos, D., and Odaka, T.: XDGGS: Xarray Extension for Discrete Global Grid Systems (DGGS), EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-15416, https://doi.org/10.5194/egusphere-egu24-15416, 2024.

10:14–10:15

Posters on site: Tue, 16 Apr, 16:15–18:00 | Hall X3

Display time: Tue, 16 Apr, 14:00–Tue, 16 Apr, 18:00
Chairpersons: Vasileios Baousis, Tina Odaka, Anne Fouilloux
X3.1
|
EGU24-7790
Advanced Front-End Development for Customer-Facing Services in a Hybrid Cloud and High-Performance Computing Cluster for Earth Observation Data Processing
(withdrawn)
Marios Sophocleous, Stephane Kemgang, Christos Nicolaides, and Sozos Karageorgiou
X3.2
|
EGU24-15872
Francesco Nattino, Meiert W. Grootes, Pranav Chandramouli, Ou Ku, Fakhereh Alidoost, and Yifat Dzigan

The Pangeo software stack includes powerful tools that have the potential to revolutionize the way in which research on big (geo)data is conducted. A few of the aspects that make them very attractive to researchers are the ease of use of the Jupyter web-based interface, the level of integration of the tools with the Dask distributed computing library, and the possibility to seamlessly move from local deployments to large-scale infrastructures. 

The Pangeo community and project Pythia are playing a key role in providing training resources and examples that showcase what is possible with these tools. These are essential to guide interested researchers with clear end goals but also to provide inspiration for new applications. 

However, configuring and setting up a Pangeo-like deployment is not always straightforward. Scientists whose primary focus is domain-specific often do not have the time to spend solving issues that are mostly ICT in nature. In this contribution, we share our experience in providing support to researchers in running use cases backed by deployments based on Jupyter and Dask at the SURF supercomputing center in the Netherlands, in what we call the Remote Sensing Deployment Analysis environmenT (RS-DAT) project. 

Despite the popularity of cloud-based deployments, which are justified by the enormous data availability at various public cloud providers, we discuss the role that HPC infrastructure still plays for researchers, due to the ease of access via merit-based allocation grants and the requirements of integration with pre-existing workflows. We present the solution that we have identified to seamlessly access datasets from the SURF dCache massive storage system, we stress how installation and deployment scripts can facilitate adoption and re-use, and we finally highlight how technical research-support staff such as Research Software Engineers can be key in bridging researchers and HPC centers. 
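As a minimal sketch of this kind of deployment (queue name, resources and paths are placeholders, not the actual RS-DAT or SURF configuration), dask-jobqueue lets a Jupyter session start Dask workers as SLURM batch jobs:

    from dask.distributed import Client
    from dask_jobqueue import SLURMCluster

    # Each Dask worker runs as a SLURM job on the HPC system (placeholder resources).
    cluster = SLURMCluster(
        queue="normal",
        cores=16,
        memory="64GiB",
        walltime="02:00:00",
        local_directory="$TMPDIR",
    )
    cluster.scale(jobs=4)       # request four worker jobs
    client = Client(cluster)    # notebook computations now run on those workers
    print(client.dashboard_link)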

How to cite: Nattino, F., Grootes, M. W., Chandramouli, P., Ku, O., Alidoost, F., and Dzigan, Y.: Deploying Pangeo on HPC: our experience with the Remote Sensing Deployment Analysis environmenT on SURF infrastructure, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-15872, https://doi.org/10.5194/egusphere-egu24-15872, 2024.

X3.3
|
EGU24-6216
Thierry Carval, Marie Jossé, and Jérôme Detoc

The Earth System is a complex and dynamic system that encompasses the interactions between the atmosphere, oceans, land, and biosphere. Understanding and analyzing data from the Earth System Model (ESM) is essential, for example to predict and mitigate the impacts of climate change.

Today, collaborative efforts among scientists across diverse fields are increasingly urgent. The FAIR-EASE project aims to build an interdomain digital architecture for integrated and collaborative use of environmental data. Galaxy is a main component of this architecture and will be used by several domains of study chosen by FAIR-EASE.

Galaxy, an open-source web platform, provides users with an easy and FAIR tool to access and handle multidisciplinary environmental data. By design, Galaxy manages data analyses by sharing and publishing all involved items like inputs, results, workflows, and visualisations, ensuring reproducibility by capturing the necessary information to repeat and understand data analyses.

A Pangeo environment is therefore highly relevant for use alongside Earth System data and processing tools to create cross-domain analyses. The good news is that a Pangeo environment is accessible on Galaxy. It can be used as a JupyterLab instance and allows users to manage their NetCDF data in a Pangeo environment with notebooks. Multiple tutorials are available on the Galaxy Training Network to learn how to use Pangeo.

The Galaxy Training Network significantly contributes to enhancing the accessibility and reusability of tools and workflows. The Galaxy Training platform hosts an extensive collection of tutorials. These tutorials serve as valuable resources for individuals seeking to learn how to navigate Galaxy, employ specific functionalities like Interactive Tools or how to execute workflows for specific analyses.

In summary, Pangeo in Galaxy provides Pangeo users with an up-to-date data analysis platform that ensures reproducibility and combines training and tools.

On the Earth System side, a first step was the creation of a Galaxy instance dedicated to Earth System studies (earth-system.usegalaxy.eu), with dedicated data, models, processing, visualisations and tutorials. It will make Earth System modelling more accessible to researchers in different fields.

In this Galaxy subdomain we chose to include the Pangeo tools. Our hope is to implement cross-domain workflows spanning climate and Earth System sciences.

During this session our aim is to present how you can use the Pangeo environment from the Galaxy Earth System subdomain.

How to cite: Carval, T., Jossé, M., and Detoc, J.: Pangeo environment in Galaxy Earth System supported by Fair-Ease, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-6216, https://doi.org/10.5194/egusphere-egu24-6216, 2024.

X3.4
|
EGU24-15366
|
ECS
|
Iason Sotiropoulos, Athos Papanikolaou, Odysseas Sekkas, Anastasios Polydoros, Vassileios Tsetsos, Claudio Pisa, and Stamatia Rizou

BUILDSPACE aims to combine terrestrial data from buildings collected by IoT devices with aerial imaging from drones equipped with thermal cameras and location-annotated data from satellite services (i.e., EGNSS and Copernicus) to deliver innovative services: at building scale, enabling the generation of high-fidelity multi-modal digital twins, and at city scale, providing decision support services for energy demand prediction, urban heat and urban flood analysis. A pivotal element and the foundational support of the BUILDSPACE ecosystem is the Core Platform, which plays a crucial role in facilitating seamless data exchange, secure and scalable data storage, and streamlined access to data from three Copernicus services, namely Land, Atmosphere, and Climate Change. The platform's underlying technology is robust, incorporating two key components: OIDC for user authentication and group authorization over the data, and a REST API to handle various file operations. OIDC stands for OpenID Connect, a standard protocol that enables secure user authentication and allows for effective management of user groups and their access permissions. On the other hand, the platform employs a REST API for seamless handling of file-related tasks, including uploading, downloading, and sharing. This combination ensures efficient and secure data exchange within the system. Additionally, the use of an S3-compatible file system ensures secure and scalable file storage, while a separate metadata storage system enhances data organization and accessibility. Currently deployed on a Kubernetes cluster, this platform offers numerous advantages, including enhanced scalability, efficient resource management, and simplified deployment processes. The implementation of the Core Platform has led to a current focus on integrating APIs from Copernicus services into the Core Platform's API. This ongoing effort aims to enhance the platform's capabilities by seamlessly incorporating external data, enriching the overall functionality and utility of the project.
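Since the storage is S3-compatible, any standard S3 client can in principle talk to it; the sketch below uses boto3 with a custom endpoint. The endpoint, bucket, object keys and credential handling are placeholders and do not describe the actual BUILDSPACE API, where tokens would be obtained through the OIDC login.

    import boto3

    # Any S3-compatible object store can be addressed by overriding the endpoint URL.
    s3 = boto3.client(
        "s3",
        endpoint_url="https://storage.buildspace.example",   # placeholder endpoint
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.upload_file("sensor_readings.nc", "buildings", "site-42/sensor_readings.nc")
    listing = s3.list_objects_v2(Bucket="buildings", Prefix="site-42/")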

How to cite: Sotiropoulos, I., Papanikolaou, A., Sekkas, O., Polydoros, A., Tsetsos, V., Pisa, C., and Rizou, S.: Enabling seamless integration of Copernicus and in-situ data, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-15366, https://doi.org/10.5194/egusphere-egu24-15366, 2024.

X3.5
|
EGU24-18585
|
ECS
Justus Magin

The ability to search a collection of datasets is an important factor for the usefulness of the data. By organizing the metadata into catalogs, we can enable dataset discovery, look up file locations and avoid access to the data files before the actual computation. Spatio-Temporal Asset Catalogs (STAC) is an increasingly popular language-agnostic specification with a vibrant ecosystem of tools for geospatial data catalogs, and is tailored for raster data like satellite imagery. It allows for searches using a variety of patterns, including the spatial and temporal extent.

In-situ data are heterogeneous and would benefit from being catalogued, and from the associated ecosystem of tools. However, due to the strict separation between the spatial and temporal dimensions in STAC, the time-varying nature of in-situ data is not optimally captured. While this is not an issue for approximately stationary sensors like tide gauges, moorings, weather stations, and high-frequency radars (see https://doi.org/10.5194/egusphere-egu23-8096), it becomes troublesome for moving sensors, especially if the sensor moves at high speed, covers large distances, or if the dataset contains a long time series.

To resolve this, we extend the STAC specification by replacing the GeoJSON geometry with the JSON encoding of the OGC Moving Features standard.
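A minimal pystac sketch of such an item for a moving platform is given below, using the standard start/end datetime fields; the geometry here is an ordinary GeoJSON LineString placeholder, whereas the extension proposed above would replace it with an OGC Moving Features JSON encoding (identifiers and coordinates are invented).

    import pystac

    # Placeholder track of a moving sensor (e.g. a glider) as a plain LineString.
    geometry = {"type": "LineString",
                "coordinates": [[-4.5, 48.3], [-4.2, 48.6], [-3.9, 48.9]]}
    bbox = [-4.5, 48.3, -3.9, 48.9]

    item = pystac.Item(
        id="glider-mission-001",
        geometry=geometry,
        bbox=bbox,
        datetime=None,  # a temporal *range* is used instead of a single timestamp
        properties={
            "start_datetime": "2023-06-01T00:00:00Z",
            "end_datetime": "2023-06-10T00:00:00Z",
        },
    )
    print(item.to_dict()["properties"])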

How to cite: Magin, J.: STAC catalogs for time-varying in-situ data, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-18585, https://doi.org/10.5194/egusphere-egu24-18585, 2024.

X3.6
|
EGU24-9781
Flavien Gouillon, Cédric Pénard, Xavier Delaunay, and Florian Wery

Owing to the increasing number of satellites and advancements in sensor resolutions, the volume of scientific data is experiencing rapid growth. NetCDF (Network Common Data Form) stands as the community standard for storing such data, necessitating the development of efficient solutions for file storage and manipulation in this format.

Object storage, emerging with cloud infrastructures, offers potential solutions for data storage and parallel access challenges. However, NetCDF may not fully harness this technology without appropriate adjustments and fine-tuning. To optimize computing and storage resource utilization, evaluating NetCDF performance on cloud infrastructures is essential. Additionally, exploring how cloud-developed software solutions contribute to enhanced overall performance for scientific data is crucial.

Offering multiple file versions with data split into chunks tailored for each use case incurs significant storage costs. Thus, we investigate methods to read portions of compressed chunks, creating virtual sub-chunks that can be read independently. A novel approach involves indexing data within NetCDF chunks compressed with deflate, enabling extraction of smaller data portions without reading the entire chunk.

This feature is very valuable in use cases such as pixel drilling or extracting small amounts of data from large files with sizable chunks. It also saves reading time, particularly in scenarios of poor network connection, such as those encountered onboard research vessels.

We conduct performance assessments of several libraries across a range of use cases to provide recommendations for the most suitable and efficient library for reading NetCDF data in different situations.

Our tests involved accessing remote NetCDF datasets (two files from the SWOT mission) available on the network via a lighttpd server and an s3 server. Additionally, simulations of degraded Internet connections, featuring high latency, packet loss, and limited bandwidth, are also performed.

We evaluate the performance of four Python libraries (the netCDF4 library, Xarray, h5py, and our chunk indexing library) for reading dataset portions through fsspec or s3fs. A comparison of reading performance using the netCDF, zarr, and nczarr data formats is also conducted on an S3 server.

Preliminary findings indicate that the h5py library is the most efficient, while Xarray exhibits poor performance in reading NetCDF files. Furthermore, the NetCDF format demonstrates reasonably good performance on an S3 server, albeit lower than the zarr or nczarr formats. However, the considerable effort required to convert petabytes of archived NetCDF files and to adapt numerous software libraries, for a performance improvement within the same order of magnitude, raises questions about the practicality of such endeavours; the benefit is thus highly dependent on the use case.
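To make the comparison concrete, the sketch below shows two of the benchmarked access routes, reading a small portion of a remote NetCDF-4/HDF5 file over HTTP through fsspec, once with h5py and once with Xarray; the URL, variable and dimension names are placeholders rather than the actual SWOT files used in the study.

    import fsspec
    import h5py
    import xarray as xr

    url = "https://data.example.org/swot/pass_001.nc"   # placeholder remote NetCDF-4 file

    # Route 1: h5py on top of an fsspec HTTP file object; only the requested bytes are fetched.
    with fsspec.open(url, mode="rb") as f:
        with h5py.File(f, "r") as h5:
            pixel = h5["ssh_karin"][100:101, 200:201]

    # Route 2: Xarray with the h5netcdf engine over the same kind of file object.
    with fsspec.open(url, mode="rb") as f:
        ds = xr.open_dataset(f, engine="h5netcdf")
        subset = ds["ssh_karin"].isel(num_lines=slice(100, 110)).load()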

How to cite: Gouillon, F., Pénard, C., Delaunay, X., and Wery, F.: Optimizing NetCDF performance for cloud computing : exploring a new chunking strategy, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-9781, https://doi.org/10.5194/egusphere-egu24-9781, 2024.

X3.7
|
EGU24-17150
|
ECS
Janos Zimmermann, Florian Ziemen, and Tobias Kölling

Climate models produce vast amounts of output data. In the nextGEMS project, we have run the ICON model at 5 km resolution for 5 years, producing about 750 TB of output data from one simulation. To ease analysis, the data is stored at multiple temporal and spatial resolutions. The dataset is now analyzed by more than a hundred scientists on the DKRZ Levante system. As disk space is limited, it is crucial to know which parts of this dataset are accessed frequently and need to be kept on disk, and which parts can be moved to the tape archive and fetched only on request.

By storing the output as zarr files with many small files for the individual data chunks, and logging file access times, we obtained a detailed view of more than half a year of access to the nextGEMS dataset, even down to the regional level for a given variable and time step. The evaluation of these access patterns offers the possibility to optimize various aspects such as caching, chunking, and archiving. Furthermore, it provides valuable information for designing future output configurations.

In this poster, we present the observed access patterns and discuss their implications for our chunking and archiving strategy. Leveraging an interactive visualization tool, we explore and compare access patterns, distinguishing frequently accessed subsets, sparsely accessed variables, and preferred resolutions. We furthermore provide information on how we analyzed the data access to enable other users to follow our approach.
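A simple sketch of the kind of bookkeeping described above is shown below: it walks a Zarr store on disk and records the latest access time per variable from the chunk files' metadata. The path is a placeholder, this is not the actual DKRZ tooling, and it assumes the filesystem records access times (a noatime mount would defeat it).

    import os
    from collections import defaultdict
    from pathlib import Path

    store = Path("/work/project/output.zarr")   # placeholder path to a chunked Zarr store

    # Latest access time per variable (top-level array in the store), read from chunk files.
    last_access = defaultdict(float)
    for chunk in store.rglob("*"):
        if chunk.is_file() and not chunk.name.startswith("."):
            variable = chunk.relative_to(store).parts[0]
            last_access[variable] = max(last_access[variable], os.stat(chunk).st_atime)

    for variable, atime in sorted(last_access.items(), key=lambda kv: kv[1], reverse=True):
        print(variable, atime)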

How to cite: Zimmermann, J., Ziemen, F., and Kölling, T.: Data access patterns of km-scale resolution models, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-17150, https://doi.org/10.5194/egusphere-egu24-17150, 2024.

X3.8
|
EGU24-17111
Fabian Wachsmann

We introduce eerie.cloud (eerie.cloud.dkrz.de), a data server for efficient access to prominent climate data sets stored on disk at the German Climate Computing Center (DKRZ). We show how we “cloudify” data from two projects, EERIE and ERA5, and how one can benefit from it. 

The European Eddy-rich Earth System Model (EERIE) project aims to develop state-of-the-art high-resolution Earth System Models (ESM) that are able to resolve ocean mesoscale processes. These models are then used to perform simulations over centennial scales and make their output available for the global community. At present, the total volume of the EERIE data set exceeds 0.5 PB and is rapidly growing, posing challenges for data management.
ERA5 is the fifth generation ECMWF global atmospheric reanalysis. It is widely used as forcing data for climate model simulations, for model evaluation or for the analysis of climate trends. DKRZ maintains a 1.6 PB subset of ERA5 data at its native resolution.

We use Xpublish to set up the data server. Xpublish is a Python package and a plugin for Pangeo's central analysis package Xarray. Its main feature is to provide ESM output by mapping any input data to virtual zarr datasets. Users can retrieve these datasets as if they were cloud-native and cloud-optimized.

eerie.cloud features

  • Parallel access to data subsets on chunk-level
  • Interfaces to make the data more FAIR
    • User friendly content overviews with displays of xarray-like dataset representations
    • Simple browsing and loading data with an intake catalog
  • On-the-fly server-side computation 
    • Register simple xarray routines for generating customized variables
    • Compression for speeding up downloads
  • Generation of interactive geographical plots, including animations

Eerie.cloud is a solution to make EERIE data more usable by a wider community.
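A minimal sketch of the underlying "cloudify" pattern is given below: an on-disk dataset is opened lazily with Xarray and exposed by Xpublish as a virtual Zarr endpoint over HTTP. The file name, dataset identifier and port are placeholders; the actual eerie.cloud deployment adds the cataloguing, server-side computation, compression and plotting features listed above.

    import xarray as xr
    import xpublish

    # Open an existing (e.g. NetCDF) dataset lazily and expose it over HTTP.
    ds = xr.open_dataset("eerie_ocean_monthly.nc", chunks={})   # placeholder file

    rest = xpublish.Rest({"eerie-ocean-monthly": ds})
    rest.serve(host="0.0.0.0", port=9000)

    # A client can then read it as if it were a cloud-native Zarr store, e.g.:
    #   xr.open_dataset("http://localhost:9000/datasets/eerie-ocean-monthly/zarr",
    #                   engine="zarr", chunks={})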

How to cite: Wachsmann, F.: Cloudifying Earth System Model Output, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-17111, https://doi.org/10.5194/egusphere-egu24-17111, 2024.

X3.9
|
EGU24-8058
Lars Buntemeyer

The Earth System Grid Federation (ESGF) data nodes are usually the first address for accessing climate model datasets from WCRP-CMIP activities. It is currently hosting different datasets in several projects, e.g., CMIP6, CORDEX, Input4MIPs or Obs4MIPs. Datasets are usually hosted on different data nodes all over the world while data access is managed by any of the ESGF web portals through a web-based GUI or the ESGF Search RESTful API. The ESGF data nodes provide different access methods, e.g., https, OPeNDAP or Globus. 

Beyond ESGF, the Pangeo / ESGF Cloud Data Working Group coordinates efforts related to storing and cataloging CMIP data in the cloud, e.g., in the Google Cloud and in the Amazon Web Services Simple Storage Service (S3), where a large part of the WCRP-CMIP6 ensemble of global climate simulations is now available in analysis-ready cloud-optimized (ARCO) zarr format. The availability in the cloud has significantly lowered the barrier for users with limited resources and no access to an HPC environment to work with CMIP6 datasets, and at the same time increases the chance for reproducibility and reusability of scientific results.

Following the Pangeo strategy, we have adapted parts of the Pangeo Forge software stack for publishing our regional climate model datasets from the EURO-CORDEX initiative on AWS S3 cloud storage. The main tools involved are Xarray, Dask, Zarr, Intake and the ETL tools of pangeo-forge-recipes. Thanks to similar metadata conventions compared to the global CMIP6 datasets, the workflows require only minor adaptations. In this talk, we will show the strategy and workflow, implemented and orchestrated in GitHub Actions workflows, as well as a demonstration of how to access EURO-CORDEX datasets in the cloud.
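For comparison, the sketch below shows the established access pattern for the ARCO CMIP6 holdings on Google Cloud via intake-esm; access to the EURO-CORDEX datasets published on AWS S3 follows the same pattern with a different catalog URL (the search facets here are purely illustrative).

    import intake

    # Public intake-esm catalog of the analysis-ready CMIP6 data on Google Cloud Storage.
    cat = intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")

    subset = cat.search(
        source_id="MPI-ESM1-2-HR",
        experiment_id="historical",
        table_id="Amon",
        variable_id="tas",
        member_id="r1i1p1f1",
    )
    dsets = subset.to_dataset_dict()   # lazily opens the matching Zarr stores
    for name, ds in dsets.items():
        print(name, dict(ds.sizes))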

How to cite: Buntemeyer, L.: Beyond ESGF – Bringing regional climate model datasets to the cloud on AWS S3 using the Pangeo Forge ETL framework, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-8058, https://doi.org/10.5194/egusphere-egu24-8058, 2024.

X3.10
|
EGU24-10741
Daniel Wiesmann, Tina Odaka, Anne Fouilloux, Emmanuelle Autret, Mathieu Woillez, and Benjamin Ragan-Kelley

We present our approach of leveraging the Pangeo software stack for developing the Global Fish Tracking System (GFTS). The GFTS project tackles the challenge of accurately modelling fish movement in the ocean based on biologging data with a primary focus on Sea Bass. Modelling fish movements is essential to better understand migration strategies and site fidelity, which are critical aspects for fish stock management policy and marine life conservation efforts.

Estimating fish movements is a highly compute intensive process. It involves matching pressure and temperature data from in-situ biologging sensors with high resolution ocean temperature simulations over long time periods. The Pangeo software stack provides an ideal environment for this kind of modelling. While the primary target platform of the GFTS project is the new Destination Earth Service Platform (DESP), relying on the Pangeo ecosystem ensures that the GFTS project is a robust and portable solution that can be re-deployed on different infrastructure. 

One of the distinctive features of the GFTS project is its advanced data management approach, synergizing with the capabilities of Pangeo. Diverse datasets, including climate change adaptation digital twin data, sea temperature observations, bathymetry, and biologging in-situ data from tagged fish, are seamlessly integrated within the Pangeo environment. A dedicated software called pangeo-fish has been developed to streamline this complex modelling process. The technical framework of the GFTS project includes Pangeo core packages such as Xarray and Dask, which facilitate scalable computations.

Pangeo's added value in data management becomes apparent in its capability to optimise data access and enhance performance. The concept of "data visitation" is central to this approach. By strategically deploying Dask clusters close to the data sources, the GFTS project aims to significantly improve performance of fish track modelling when compared to traditional approaches. This optimised data access ensures that end-users can efficiently interact with large datasets, leading to more streamlined and efficient analyses.
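A sketch of this "data visitation" idea with dask-gateway, which is commonly used to start Dask clusters inside Pangeo-style JupyterHub deployments, is shown below; the gateway address and scaling limits are placeholders, not the actual GFTS or DESP configuration.

    from dask_gateway import Gateway

    # Connect to a Dask Gateway running next to the reference datasets (placeholder address).
    gateway = Gateway("https://gateway.example.org")
    cluster = gateway.new_cluster()
    cluster.adapt(minimum=2, maximum=20)   # scale workers with the fish-track workload
    client = cluster.get_client()

    # Computations submitted through `client` now run close to the data,
    # and only small results travel back to the user's session.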

The cloud-based delivery of the GFTS project aligns with the overarching goal of Pangeo. In addition, the GFTS includes the development of a custom interactive Decision Support Tool (DST). The DST empowers non-technical users with an intuitive interface for better understanding the results of the GFTS project, leading to more informed decision-making. The integration with Pangeo and providing intuitive access to the GFTS data is not merely a technicality; it is a commitment to FAIR (Findable, Accessible, Interoperable and Reusable), TRUST (Transparency, Responsibility, User focus, Sustainability and Technology) and open science principles. 

In short, the GFTS project, within the Pangeo ecosystem, exemplifies how advanced data management, coupled with the optimization of data access through "data visitation," can significantly enhance the performance and usability of geoscience tools. This collaborative and innovative approach not only benefits the immediate goals of the GFTS project but contributes to the evolving landscape of community-driven geoscience initiatives.

How to cite: Wiesmann, D., Odaka, T., Fouilloux, A., Autret, E., Woillez, M., and Ragan-Kelley, B.: Harnessing the Pangeo ecosystem for delivering the cloud-based Global Fish Tracking System, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-10741, https://doi.org/10.5194/egusphere-egu24-10741, 2024.

X3.11
|
EGU24-8343
|
ECS
|
Highlight
Vanessa Stöckl, Björn Grüning, Anne Fouilloux, Jean Iaquinta, and Alejandro Coca-Castro

This work highlights the integration of IceNet (https://doi.org/10.1038/s41467-021-25257-4), a cutting-edge sea ice forecasting system leveraging numerous Python packages from the Pangeo ecosystem, into the Galaxy platform, an open-source tool designed for FAIR (Findable, Accessible, Interoperable, and Reusable) data analysis. Aligned with the Pangeo ecosystem's broader objectives, and carried out in the frame of the EuroScienceGateway project (https://eurosciencegateway.eu), this initiative embraces a collaborative approach to tackle significant geoscience data challenges. The primary aim is to democratise access to IceNet's capabilities by converting a Jupyter Notebook, published in the Environmental Data Science book (www.edsbook.org), into Galaxy Tools and crafting a reusable workflow executable through a Graphical User Interface or standardised APIs. IceNet is meant to predict Arctic sea ice concentration up to six months in advance, and it outperforms previous systems. This integration establishes a fully reproducible workflow, enabling scientists with diverse computational expertise to automate sea ice predictions. The IceNet workflow is hosted on the European Galaxy Server (https://climate.usegalaxy.eu), along with the related tools, ensuring accessibility for a wide community of researchers. With the urgency of accurate predictions amid global warming's impact on Arctic sea ice, this work addresses challenges faced by scientists, particularly those with limited programming experience. The transparent, accessible, and reproducible pipeline for Arctic sea ice forecasting aligns with Open Science principles. The integration of IceNet into Galaxy enhances accessibility to advanced climate science tools, allowing for automated predictions that contribute to early and precise identification of potential damages from sea ice loss. This initiative mirrors the overarching goals of the Pangeo community, advancing transparent, accessible, and reproducible research. The Galaxy-based pipeline presented serves as a testament to collaborative efforts within the Pangeo community, breaking down barriers related to computational literacy and empowering a diverse range of scientists to contribute to climate science research. The integration of IceNet into Galaxy not only provides a valuable tool for seasonal sea ice predictions but also exemplifies the potential for broad interdisciplinary collaboration within the Pangeo ecosystem.

How to cite: Stöckl, V., Grüning, B., Fouilloux, A., Iaquinta, J., and Coca-Castro, A.: Implementation of a reproducible pipeline for producing seasonal Arctic sea ice forecasts, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-8343, https://doi.org/10.5194/egusphere-egu24-8343, 2024.