ESSI3.3 | Scalable and FAIR Workflow Approaches in Earth System Science: Addressing Data and Computational Challenges
EDI
Co-organized by CR6/GI2/HS13/NP4/TS9
Convener: Karsten Peters-von Gehlen | Co-conveners: Miguel Castrillo (ECS), Ivonne Anders, Donatello Elia (ECS), Manuel Giménez de Castro Marciani (ECS)
Orals
| Tue, 29 Apr, 14:00–15:45 (CEST)
 
Room -2.92
Posters on site
| Attendance Tue, 29 Apr, 10:45–12:30 (CEST) | Display Tue, 29 Apr, 08:30–12:30
 
Hall X4
Posters virtual
| Attendance Tue, 29 Apr, 14:00–15:45 (CEST) | Display Tue, 29 Apr, 14:00–18:00
 
vPoster spot 4
Performing research in Earth System Science is increasingly challenged by the escalating volumes and complexity of data, requiring sophisticated workflow methodologies for efficient processing and data reuse. The complexity of computational systems, such as distributed and high-performance heterogeneous computing environments, further increases the need for advanced orchestration capabilities to perform and reproduce simulations effectively. Along the same lines, the emergence and integration of data-driven models alongside traditional compute-driven ones introduce additional challenges in terms of workflow management. This session delves into the latest advances in workflow concepts and techniques essential to addressing these challenges, taking into account the different aspects linked with High-Performance Computing (HPC), Data Processing and Analytics, and Artificial Intelligence (AI).

In the session, we will explore the importance of the FAIR (Findability, Accessibility, Interoperability, and Reusability) principles and provenance in ensuring data accessibility, transparency, and trustworthiness. We will also examine the balance between reproducibility and security, addressing potential workflow vulnerabilities while preserving research integrity.

Attention will be given to workflows in federated infrastructures and their role in scalable data analysis. We will discuss cutting-edge techniques for modeling and data analysis, highlighting how these workflows can manage otherwise unmanageable data volumes and complexities, as well as best practices and progress from various initiatives and challenging use cases (e.g., Digital Twins of the Earth and the Ocean).

We will gain insights into FAIR Digital Objects, (meta)data standards, linked-data approaches, virtual research environments, and Open Science principles. The aim is to improve data management practices in a data-intensive world.
On these topics, we invite contributions from researchers illustrating their approach to scalable workflows, as well as from data and computational experts presenting current approaches offered and developed by IT infrastructure providers that enable cutting-edge research in Earth System Science.

Orals: Tue, 29 Apr | Room -2.92

The oral presentations are given in a hybrid format supported by a Zoom meeting featuring on-site and virtual presentations. The button to access the Zoom meeting appears just before the time block starts.
Chairpersons: Karsten Peters-von Gehlen, Manuel Giménez de Castro Marciani, Donatello Elia
14:00–14:05
14:05–14:25
|
EGU25-7056
|
solicited
|
On-site presentation
Valeriu Predoi and Bouwe Andela

ESMValTool is a software tool for analyzing data produced by Earth System Models (ESMs) in a reliable and reproducible way. It provides a large and diverse collection of “recipes” that reproduce standard as well as state-of-the-art analyses. ESMValTool can be used for tasks ranging from monitoring continuously running ESM simulations to analysis for scientific publications such as the IPCC reports, including reproducing results from previously published scientific articles as well as allowing scientists to produce new analysis results. To make ESMValTool a user-friendly community tool suitable for doing open science, it adheres to the FAIR principles for research software:

  • Findable - it is published in community registries, such as https://research-software-directory.org/software/esmvaltool.
  • Accessible - it can be installed from Python package community distribution channels such as conda-forge, and the open-source code is available on Zenodo with a DOI, and on GitHub.
  • Interoperable - it is based on standards: it works with data that follows the CF Conventions and the Coupled Model Intercomparison Project (CMIP) Data Request, its reusable recipes are written in YAML, and provenance is recorded in the W3C PROV format. It supports diagnostics written in a number of programming languages, with Python and R being best supported, and its source code follows the standards and best practices of the respective languages.
  • Reusable - it provides a well-documented recipe format and Python API that allow reusing previous analyses and building new analyses with previously developed components. The software can be installed from conda-forge and DockerHub and can be tailored by installing from source from GitHub.

In terms of input data, ESMValTool integrates well with the Earth System Grid Federation (ESGF) infrastructure: it can find, download and access data from across the federation, and has access to large pools of observational datasets.

ESMValTool is built around two key scientific software qualities: scalability and user friendliness. For scalability, ESMValTool is built on top of the Dask library to allow scalable and distributed computing, and it also uses parallelism at a higher level in the stack, so that jobs can be distributed on any standard High Performance Computing (HPC) facility. An important aspect of user friendliness is reliability, and our main strategy to ensure reliability and reproducibility is a modular, integrated, and tested design, which is reflected at various levels of the tool. We try to separate commonly used functionality from “one off” code, and make sure that commonly used functionality is covered by unit and integration tests, while we rely on regression testing for everything else. We also use comprehensive end-to-end testing for all our “recipes” before we release new versions. Our testing infrastructure ranges from basic unit tests to tools that smartly handle various file formats and use image comparison algorithms to compare figures. This greatly reduces the need for ‘human testing’, allowing for built-in robustness through modularity and a testing strategy tailored to match the technical skills of its contributors.
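
For readers who want to try this, a minimal sketch of launching a recipe and locating its provenance records is shown below; it assumes a working ESMValTool installation and uses the documented `esmvaltool run` entry point, while the exact output layout may differ between versions and configurations.

```python
# Illustrative sketch (assumes ESMValTool is installed and configured, and
# that the bundled example recipe name is unchanged): launch a recipe via the
# documented CLI entry point, then look for the W3C PROV provenance records
# written alongside the products.
import subprocess
from pathlib import Path

subprocess.run(["esmvaltool", "run", "examples/recipe_python.yml"], check=True)

# Output directories are created under the configured output_dir (see the
# ESMValTool user configuration); provenance is typically stored as
# *provenance.xml files next to the generated figures and data.
output_root = Path("~/esmvaltool_output").expanduser()  # adjust to your config
for prov_file in output_root.rglob("*provenance.xml"):
    print("provenance record:", prov_file)
```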

How to cite: Predoi, V. and Andela, B.: Reliable and reproducible Earth System Model data analysis with ESMValTool, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7056, https://doi.org/10.5194/egusphere-egu25-7056, 2025.

14:25–14:35
|
EGU25-8114
|
ECS
|
On-site presentation
Willem Tromp, Dirk Eilander, Hessel Winsemius, Tjalling De Jong, Brendan Dalmijn, Hans Gehrels, and Bjorn Backeberg

Flood risk assessments are increasingly guiding urban developments to safeguard against flooding. These assessments, consisting mainly of hazard and risk maps, rely on an interconnected chain of climate, hydrological, hydraulic, and impact models, which are increasingly run interactively to support scenario modelling and decision-making in digital twins. To maintain interoperability, transparency, and reusability of this chain and the assessments themselves, using a workflow manager to manage the inter-model dependencies is a natural fit. However, composing and maintaining workflows is a non-trivial, time-consuming task, and they often have to be refactored for new workflow engines or when changing compute environments, even if the workflow conceptually remains unchanged. These issues are particularly relevant in the development of digital twins for climate adaptation, where flood risk assessments serve as input to indicate high-risk areas. The complex model chain underpinning such digital twins can benefit greatly from transparent workflows that can be easily reused across different contexts.

To address these challenges, we developed the HydroFlows Python framework for composing and maintaining flood risk assessment workflows by leveraging common patterns identified across different workflows. The framework lets users choose from the many steps available in the library, or define workflow steps themselves, and combine these into complete workflows that are validated on the fly. Available workflow steps include building, running, and postprocessing of models. Execution of the workflow is handled by one of the workflow managers to which our workflow description can be exported, such as Snakemake or tools with CWL support. This flexibility allows users to easily scale their workflows to different compute environments whenever the computational requirements demand it.
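
As a purely illustrative sketch of this composition-and-export pattern, the snippet below uses invented class and method names that do not reflect the actual HydroFlows API; it only shows the idea of validating steps on the fly and delegating execution to a workflow engine.

```python
# Hypothetical sketch: compose steps, validate dependencies on the fly, and
# export to a workflow-engine description. Names (Workflow, Step, add_step,
# to_snakemake) are illustrative, not the HydroFlows API.
from dataclasses import dataclass, field


@dataclass
class Step:
    name: str
    inputs: dict
    outputs: dict


@dataclass
class Workflow:
    steps: list = field(default_factory=list)

    def add_step(self, step: Step):
        # Validate on the fly: every intermediate input must be produced by
        # an earlier step before the new step is accepted.
        produced = {o for s in self.steps for o in s.outputs.values()}
        missing = [v for v in step.inputs.values()
                   if str(v).startswith("result:") and v not in produced]
        if missing:
            raise ValueError(f"unresolved inputs for {step.name}: {missing}")
        self.steps.append(step)

    def to_snakemake(self) -> str:
        # Export the description so execution and scaling are delegated to
        # the chosen engine (here rendered as a Snakefile-like string).
        rules = []
        for s in self.steps:
            rules.append(
                f"rule {s.name}:\n"
                f"    input: {list(s.inputs.values())}\n"
                f"    output: {list(s.outputs.values())}\n"
                f"    shell: 'run-{s.name} {{input}} {{output}}'\n"
            )
        return "\n".join(rules)


wf = Workflow()
wf.add_step(Step("build_flood_model", {"region": "region.geojson"},
                 {"model": "result:flood_model"}))
wf.add_step(Step("run_flood_model", {"model": "result:flood_model"},
                 {"flood_map": "result:flood_depth.tif"}))
print(wf.to_snakemake())
```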

We demonstrate the flexibility of the HydroFlows framework by highlighting how it can be used to create complex workflows needed for digital twins supporting climate adaptation. HydroFlows not only enhances the flexibility and portability of the digital twin modelling workflows but also facilitates the integration of digital twin tooling and advanced computing and processing solutions to support interactive flood risk assessments in federated compute and data environments.

How to cite: Tromp, W., Eilander, D., Winsemius, H., De Jong, T., Dalmijn, B., Gehrels, H., and Backeberg, B.: Flexible and scalable workflow framework HydroFlows for compound flood risk assessment and adaptation modelling, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-8114, https://doi.org/10.5194/egusphere-egu25-8114, 2025.

14:35–14:45
|
EGU25-10981
|
On-site presentation
Fabrizio Antonio, Gabriele Padovani, Ludovica Sacco, Carolina Sopranzetti, Marco Robol, Konstantinos Zefkilis, Nicola Marchioro, and Sandro Fiore

Scientific workflows and provenance are two sides of the same coin. While the former addresses the coordinated execution of multiple tasks over a set of computational resources, the latter relates to the historical record of data from its original sources. As experiments rapidly evolve towards complex end-to-end workflows, handling provenance at different levels of granularity and during the entire analytics workflow lifecycle is key to managing lineage information related to large-scale experiments in a flexible way, as well as to enabling reproducibility scenarios, thus playing a relevant role in Open Science.

The contribution highlights the importance of tracking multi-level provenance metadata in complex, AI-based scientific workflows as a way to foster documentation of data and experiments in a standardized format, strengthen interpretability, trustworthiness and authenticity of the results, facilitate performance diagnosis and troubleshooting activities, and advance provenance exploration. More specifically, the contribution introduces yProv, a joint research effort between CMCC and University of Trento targeting multi-level provenance management in complex, AI-based scientific workflows. The yProv project provides a rich software ecosystem consisting of a web service (yProv service) to store and manage provenance documents compliant with the W3C PROV family of standards, two libraries to track provenance in scientific workflows at different levels of granularity with a focus on AI models training (yProv4WFs and yProv4ML), and a data science tool for provenance inspection, navigation, visualization, and analysis (yProv Explorer). Activity on trustworthy provenance with yProv is also ongoing to fully address end-to-end provenance management requirements.
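
To illustrate the kind of W3C PROV lineage record referred to above, here is a minimal sketch using the general-purpose Python `prov` package; it is not the yProv4WFs/yProv4ML API, and the namespace and identifiers are placeholders.

```python
# Minimal W3C PROV sketch with the `prov` package: one input dataset, one
# training activity, one generated model, and the responsible agent.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "https://example.org/climate-workflow#")  # placeholder

dataset = doc.entity("ex:era5_subset", {"ex:format": "netCDF"})
model = doc.entity("ex:trained_model_v1")
training = doc.activity("ex:training_run_42")
scientist = doc.agent("ex:researcher")

doc.used(training, dataset)                 # the run consumed the dataset
doc.wasGeneratedBy(model, training)         # and produced the model
doc.wasAssociatedWith(training, scientist)  # under a researcher's control

print(doc.get_provn())                      # human-readable PROV-N
doc.serialize("provenance.json")            # PROV-JSON, e.g. for a service
```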

The contribution will cover the presentation of the yProv software ecosystem and use cases from the interTwin (https://www.intertwin.eu/) and ClimateEurope2 (https://climateurope2.eu/) European projects as well as from the ICSC National Center on HPC, Big Data and Quantum Computing targeting Digital Twins for extreme weather & climate events and data-driven/data-intensive workflows for climate change. 

How to cite: Antonio, F., Padovani, G., Sacco, L., Sopranzetti, C., Robol, M., Zefkilis, K., Marchioro, N., and Fiore, S.: yProv: a Software Ecosystem for Multi-level Provenance Management and Exploration in Climate Workflows, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-10981, https://doi.org/10.5194/egusphere-egu25-10981, 2025.

14:45–14:55
|
EGU25-5593
|
On-site presentation
Pratichhya Sharma

In an era of unprecedented availability of Earth Observation (EO) data, the Copernicus Data Space Ecosystem (CDSE) emerges as a vital platform to bridge the gap between data accessibility and actionable insights. With petabytes of freely accessible satellite data at our fingertips and multiple operational data processing platforms in place, many of the foundational challenges of accessing and processing sensor data have been addressed. Yet, the widespread adoption of EO-based applications remains below expectations. The challenge lies in the effective extraction of relevant information from the data. While numerous R&D projects demonstrate the possibilities of EO, their results are often neither repeatable nor reusable, primarily due to prototype-level implementations and overly tailored, non-standardized workflows.  

CDSE tackles these barriers by adopting common standards and patterns, most notably through openEO, an interface designed to standardize EO workflow execution across platforms. openEO enables the development of reusable workflows that are scalable and transferable, paving the way for systematic and objective monitoring of the planet. CDSE has already integrated openEO as a core processing interface, and further advancements are underway, including the integration of Sentinel Hub to support openEO. This integration will enhance instantaneous visualization, synchronous API requests, and batch processing, as well as support openEO process graphs within the Copernicus Browser, bringing the simplicity and speed of Sentinel Hub’s synchronous engine to the openEO ecosystem.  
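
A minimal sketch of such a reusable openEO workflow on CDSE, using the openEO Python client, is shown below; the collection identifier and endpoint follow the public CDSE documentation, while the spatial and temporal extents are placeholders.

```python
# Sketch of a small, transferable openEO process graph on CDSE. The same
# graph can be run synchronously for small requests or as a batch job for
# larger ones; only the execution call changes, not the workflow.
import openeo

connection = openeo.connect("openeo.dataspace.copernicus.eu").authenticate_oidc()

cube = connection.load_collection(
    "SENTINEL2_L2A",
    spatial_extent={"west": 16.1, "south": 48.1, "east": 16.5, "north": 48.3},
    temporal_extent=["2024-06-01", "2024-06-30"],
    bands=["B04", "B08"],
)

ndvi = cube.ndvi(nir="B08", red="B04")   # band math expressed as openEO processes
ndvi.download("ndvi.nc")                 # synchronous execution for small extents
# ndvi.execute_batch("ndvi.nc")          # batch-job alternative for larger extents
```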

CDSE’s openEO capabilities are already validated through large-scale operational projects such as ESA WorldCereal and Copernicus Global Land Cover and Tropical Forestry Mapping and Monitoring Service (LCFM), which leverage its robust, scalable, and reliable infrastructure. Additionally, the openEO Algorithm Plaza fosters collaboration by enabling the easy sharing and reuse of processing workflows, while the Bring Your Own Data feature allows users to integrate their datasets into the ecosystem, promoting data interoperability and collaborative advancements.  

CDSE is embracing a federated approach, allowing additional data or service providers to become part of the ecosystem. This inclusivity ensures a growing network of interoperable services while maintaining technical and operational stability—a cornerstone for broad adoption and long-term sustainability.  

By addressing the need for operational and reusable workflows with openEO and related initiatives, CDSE is not only advancing the technical landscape of EO but also fostering a culture of repeatable, scalable, and impactful science. Through this session, we aim to spark a discussion on how to make EO applications more accessible, reusable, and impactful for the global community.

How to cite: Sharma, P.: How openEO standardizes workflows for scalable and reusable EO data analysis, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-5593, https://doi.org/10.5194/egusphere-egu25-5593, 2025.

14:55–15:00
15:00–15:10
|
EGU25-1511
|
ECS
|
On-site presentation
Aina Gaya-Àvila, Bruno de Paula Kinoshita, Stella V. Paronuzzi Ticco, Oriol Tintó Prims, and Miguel Castrillo

In this work, we explored the deployment and execution of the NEMO ocean model using Singularity containers within the EDITO Model Lab, which implements the European Digital Twin of the Ocean. The Auto-NEMO workflow, a fork of Auto-EC-Earth used to run NEMO experiments with the NEMO Community reference code, was adapted to run simulations using containers. The use of a Singularity container ensures consistent execution by packaging all dependencies, making it easier to deploy the model across various HPC systems.

The containerized approach was tested on multiple HPC platforms, including MareNostrum5 and LUMI, to evaluate scaling performance. Our tests compared the use of mpich and openmp libraries, providing insights into how communication strategies impact the computational performance of the model in containerized setups. In addition, the runs are orchestrated by a workflow manager, in this case Autosubmit, deployed in a cloud infrastructure in EDITO-Infra, making the entire solution (workflow manager and workflow itself) portable end-to-end. The benefits of portability and reproducibility make containers an attractive solution for streamlining workflows in diverse computational environments.

A comparison between containerized and non-containerized runs highlights the trade-offs involved. Direct execution may provide slightly better performance in some cases, but the containerized approach greatly reduces setup complexity. These findings demonstrate the potential of containerization to enhance efficiency and accessibility in large-scale ocean modeling efforts.

How to cite: Gaya-Àvila, A., de Paula Kinoshita, B., Paronuzzi Ticco, S. V., Tintó Prims, O., and Castrillo, M.: A workflow for cloud-based and HPC simulations with the NEMO ocean model using containers, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-1511, https://doi.org/10.5194/egusphere-egu25-1511, 2025.

15:10–15:20
|
EGU25-6201
|
On-site presentation
Marco Salvi, Rossana Paciello, Valerio Vinciarelli, Kety Giuliacci, Daniele Bailo, Pablo Orviz, Keith Jeffery, Manuela Volpe, Roberto Tonini, and Alejandra Guerrero

The increasing complexity and volume of data in Solid Earth Science necessitate robust solutions for workflow representation, sharing, and reproducibility. Within the DT-GEO (https://dtgeo.eu/) project, we addressed the challenge of creating interoperable and discoverable representations of computational workflows to facilitate data reuse and collaboration. Leveraging the EPOS Platform (https://www.epos-eu.org/), a multidisciplinary research infrastructure focused on Solid Earth Science, we aimed to expose workflows, datasets, and software to the community while adhering to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. While the EPOS-DCAT-AP (https://github.com/epos-eu/EPOS-DCAT-AP) model, already used in EPOS, can effectively represent datasets and software, it lacks direct support for computational workflows, necessitating the adoption of alternative standards.

To overcome this limitation, we employed the Common Workflow Language (CWL, https://www.commonwl.org/) to describe workflows, capturing their structure, software, datasets, and dependencies. The developed CWL representations are "abstract", focusing on general workflow structures while omitting execution-specific details to prioritize interoperability. To package these workflows along with metadata, we utilized Workflow Run Crate, an extension of the RO-Crate (https://www.researchobject.org/ro-crate/) standard. Together, these technologies enable workflows to become self-contained entities, simplifying sharing and reuse.
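
As a hedged illustration of this packaging step, the sketch below uses the ro-crate-py library to register an abstract CWL description as the main workflow of an RO-Crate; the file names and properties are placeholders, and this is not the DT-GEO code itself.

```python
# Sketch: package an abstract CWL workflow plus metadata as an RO-Crate
# using ro-crate-py. File paths are placeholders and must exist locally.
from rocrate.rocrate import ROCrate

crate = ROCrate()

# Register the (abstract) CWL description as the crate's main workflow.
workflow = crate.add_workflow("hazard_workflow.cwl", main=True, lang="cwl")
workflow["name"] = "Abstract hazard assessment workflow"

# Attach an example input dataset description and basic licensing metadata.
crate.add_file("inputs/bathymetry.nc",
               properties={"encodingFormat": "application/x-netcdf"})
crate.root_dataset["license"] = "https://creativecommons.org/licenses/by/4.0/"

crate.write("hazard_workflow_crate")  # a self-contained, shareable package
```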

This approach not only aligns with community standards but also benefits from a mature ecosystem of tools and libraries, ensuring seamless integration and widespread applicability. Initial implementations within the DT-GEO project serve as a model for adoption in related initiatives such as Geo-INQUIRE (https://www.geo-inquire.eu/), where similar methodologies are being used to share workflows derived from the Simulation Data Lake (SDL) infrastructure. These implementations pave the way for broader integration within the EPOS Platform, enhancing access to advanced workflows across disciplines.

Our contribution highlights the value of adopting standardized tools and methodologies for workflow management in Solid Earth Science, showcasing how CWL and RO-Crate streamline interoperability and foster collaboration. These advances address challenges in data and computational management, contributing to the scalable FAIR workflows essential for tackling the complexities of Solid Earth Science. Moving forward, the integration of these standards across projects like DT-GEO and Geo-INQUIRE will further enhance the EPOS Platform's capabilities, offering a unified gateway to reproducible, secure, and trustworthy workflows that meet the evolving needs of the scientific community.

How to cite: Salvi, M., Paciello, R., Vinciarelli, V., Giuliacci, K., Bailo, D., Orviz, P., Jeffery, K., Volpe, M., Tonini, R., and Guerrero, A.: Advancing Computational Workflow Sharing in Earth Science: Insights from DT-GEO and Geo-INQUIRE, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-6201, https://doi.org/10.5194/egusphere-egu25-6201, 2025.

15:20–15:30
|
EGU25-9791
|
On-site presentation
Zhiyi Zhu and Min Chen

Geo-simulation experiments (GSEs) are experiments allowing the simulation and exploration of Earth’s surface (such as hydrological, geomorphological, atmospheric, biological, and social processes and their interactions) with the usage of geo-analysis models (hereafter called ‘models’). Computational processes represent the steps in GSEs where researchers employ these models to analyze data by computer, encompassing a suite of actions carried out by researchers. These processes form the crux of GSEs, as GSEs are ultimately implemented through the execution of computational processes. Recent advancements in computer technology have facilitated sharing models online to promote resource accessibility and the rebuilding of environmental dependencies, the lack of which are two fundamental barriers to reproduction. In particular, the trend of encapsulating models as web services online is gaining traction. While such service-oriented strategies aid in the reproduction of computational processes, they often ignore the association and interaction among researchers’ actions regarding the usage of sequential resources (model-service resources and data resources); documenting these actions can help clarify the exact order and details of resource usage. Inspired by these strategies, this study explores the organization of computational processes, which can be extracted as a collection of action nodes and related logical links (node-link ensembles). The action nodes are the abstraction of the interactions between participant entities and resource elements (i.e., model-service resource elements and data resource elements), while logical links represent the logical relationships between action nodes. In addition, the representation of actions, the formation of documentation, and the reimplementation of documentation are interconnected stages in this approach. Specifically, the accurate representation of actions facilitates the correct performance of these actions; therefore, the operation of actions can be documented in a standard way, which is crucial for the successful reproduction of computational processes based on standardized documentation. A prototype system is designed to demonstrate the feasibility and practicality of the proposed approach. By employing this pragmatic approach, researchers can share their computational processes in a structured and open format, allowing peer scientists to re-execute operations with initial resources and reimplement the initial computational processes of GSEs via the open web.
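
A purely hypothetical sketch of such a node-link documentation structure (all identifiers invented, not taken from the study) might look as follows.

```python
# Hypothetical node-link ensemble: action nodes record which model-service
# and data resources a participant used, and logical links record the order
# and relationships between those actions.
computational_process = {
    "action_nodes": [
        {"id": "a1", "participant": "researcher_01", "action": "select-data",
         "resource": {"type": "data", "id": "rainfall_obs_2020"}},
        {"id": "a2", "participant": "researcher_01", "action": "invoke-service",
         "resource": {"type": "model-service",
                      "id": "hydrological-model-service/v2",
                      "inputs": ["rainfall_obs_2020"]}},
    ],
    "logical_links": [
        {"from": "a1", "to": "a2", "relation": "provides-input"},
    ],
}

# Replaying the documentation in link order re-executes this linear example.
order = [link["from"] for link in computational_process["logical_links"]]
order.append(computational_process["logical_links"][-1]["to"])
print("replay order:", order)
```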

How to cite: Zhu, Z. and Chen, M.: Reproducing computational processes in service-based geo-simulation experiments, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-9791, https://doi.org/10.5194/egusphere-egu25-9791, 2025.

15:30–15:40
|
EGU25-13604
|
ECS
|
On-site presentation
Kasra Keshavarz, Alain Pietroniro, Darri Eythorsson, Mohamed Ismaiel Ahmed, Paul Coderre, Wouter Knoben, Martyn Clark, and Shervan Gharari

High-resolution and high-complexity process-based hydrological models play a pivotal role in advancing our understanding and prediction of water cycle dynamics, particularly in ungauged basins and under nonstationary climate conditions. However, the configuration, application, and evaluation of these models are often hindered by the intricate and inconsistent nature of a priori information available in various datasets, necessitating extensive preprocessing steps. These challenges can limit the reproducibility, applicability, and accessibility of such models for the broader scientific user community. To address these challenges, we introduce our generalized Model-Agnostic Framework (MAF), aimed at simplifying the configuration and application of data-intensive process-based hydrological models. Through a systematic investigation of commonly used models and their configuration procedures, we provide workflows designed to streamline the setup process for this category of hydrological models. Building on earlier efforts, this framework adheres to the principle of separating model-agnostic and model-specific tasks in the setup procedure of such models. The model-agnostic workflows focus on both dynamic datasets (e.g., meteorological data) and static datasets (e.g., land-use maps), while the model-specific components feed preprocessed, relevant data to the hydrological models of interest. Our initial prototypes of MAF include recipes for various static and dynamic datasets as well as tailored model-specific workflows for the MESH, SUMMA, and HYPE process-based modelling frameworks. We demonstrate the effectiveness of these novel workflows in reducing configuration complexity and enhancing the reproducibility of process-based hydrological models through test applications in high-performance computing environments. The framework automates numerous manual tasks, significantly saving time and enabling continuity in research efforts. Moreover, by minimizing human error and enhancing reproducibility, this research has fostered collaboration with several Canadian government entities, leveraging sophisticated process-based models to address complex environmental challenges.

How to cite: Keshavarz, K., Pietroniro, A., Eythorsson, D., Ahmed, M. I., Coderre, P., Knoben, W., Clark, M., and Gharari, S.: Streamlining configurations of process-based models through extensible and free workflows, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-13604, https://doi.org/10.5194/egusphere-egu25-13604, 2025.

15:40–15:45

Posters on site: Tue, 29 Apr, 10:45–12:30 | Hall X4

The posters scheduled for on-site presentation are only visible in the poster hall in Vienna. If authors uploaded their presentation files, these files are linked from the abstracts below.
Display time: Tue, 29 Apr, 08:30–12:30
Chairpersons: Ivonne Anders, Miguel Castrillo, Karsten Peters-von Gehlen
X4.10
|
EGU25-8305
|
ECS
Alejandro Garcia Lopez, Leo Arriola Meikle, Gilbert Montane Pinto, Miguel Castrillo, Bruno de Paula Kinoshita, Eric Ferrer Escuin, and Aina Gaya Avila

Climate simulations require complex workflows that often integrate multiple components and different configurations per experiment, typically involving high-performance computing resources. The exhaustive testing required for these workflows can be time- and resource-consuming, presenting significant challenges in terms of computational cost and human effort. However, robust Continuous Integration (CI) testing ensures the reliability and reproducibility of such complex workflows by validating the codebase and ensuring the integrity of all the components used when performing climate simulations. Additionally, CI testing facilitates both major and minor releases, enhancing the efficiency of the development lifecycle.

To address these challenges, we present our Testing Suite software, designed to automate the setup, configuration, and execution of integration tests using Autosubmit, a workflow manager developed at the BSC. Autosubmit is typically used for climate modelling experiments, as well as atmospheric composition ones, and constitutes the backbone of several operational systems and Digital Twin initiatives. The Testing Suite software allows Autosubmit commands to be executed in batches and the responses from the workflow manager to be handled in a structured manner. By streamlining this process, it minimizes the effort required for exhaustive testing while ensuring reliability.

Beyond integration testing, the Testing Suite offers advanced capabilities for scientific result verification. By automatically comparing output data bit by bit, it swiftly detects regressions during test execution. Additionally, it provides CPMIP performance metrics, offering insights into the efficiency of the workflows.
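
A generic sketch of such a bit-for-bit regression check (not the Testing Suite source code) could compare reference and test outputs via checksums, as below.

```python
# Generic bit-for-bit regression check: any bitwise difference between a
# reference run and a test run is flagged via file checksums.
import hashlib
from pathlib import Path


def checksum(path: Path) -> str:
    """Return the SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_runs(reference_dir: str, test_dir: str) -> list:
    """List output files whose contents differ bit by bit between two runs."""
    reference, test = Path(reference_dir), Path(test_dir)
    mismatches = []
    for ref_file in sorted(reference.rglob("*.nc")):
        candidate = test / ref_file.relative_to(reference)
        if not candidate.exists() or checksum(ref_file) != checksum(candidate):
            mismatches.append(str(ref_file.relative_to(reference)))
    return mismatches


# An empty list means the new code reproduces the reference output bitwise.
print(compare_runs("outputs/reference_run", "outputs/test_run"))
```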

As a result, the Testing Suite plays an important role in quality assurance, particularly during releases, where extensive testing ensures the workflow meets required functionality and performance standards across different configurations. These integration tests act as a checkpoint, validating the stability and robustness of the software before release. They also identify stable points in the main codebase, enabling developers to create new branches with confidence. This approach minimizes compatibility issues and facilitates a smoother development process.

In conclusion, the Testing Suite is a crucial part of the development lifecycle for climate simulations. It mitigates risks, ensures stability, and fosters innovation, all while maintaining a robust and reliable foundation for scientific research and development.

How to cite: Garcia Lopez, A., Arriola Meikle, L., Montane Pinto, G., Castrillo, M., de Paula Kinoshita, B., Ferrer Escuin, E., and Gaya Avila, A.: Enabling reliable workflow development with an advanced Testing Suite, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-8305, https://doi.org/10.5194/egusphere-egu25-8305, 2025.

X4.11
|
EGU25-8621
|
ECS
Eric Ferrer, Gilbert Montane, Miguel Castrillo, and Alejandro Garcia

The European community Earth system model EC-Earth is based on several interoperable climate components simulating different processes of the Earth system. This makes it a complex model that requires multiple input data sources for its various model components, which can be run in parallel with multiple configurations and resolutions, demanding different computational resources in each case.

The EC-Earth software contains a minimum set of scripts to manage the compilation and execution of the simulations, but these are not enough to perform all the tasks that experiments demand, nor to guarantee the traceability and reproducibility of the entire workflow in a high-productivity scientific environment. To that end, the Auto-EC-Earth software has been developed at the Earth Sciences department of the Barcelona Supercomputing Center (BSC-ES), relying on Autosubmit, a workflow manager also developed at BSC-ES.

We take advantage of the automation provided by the workflow manager, which allows us to configure, manage, orchestrate and share experiments with different configurations and target platforms. The workflow manager allows the user to split the run into different tasks that are executed on different local and remote machines, such as the HPC platform where the simulation needs to be performed. This is achieved through a seamless integration between Autosubmit, the EC-Earth tools, and the different machines where the scripts run, all without any user input required after the initial setup and launch of the experiment, thanks to the workflow developments. Autosubmit also ensures traceability of the actual runs and keeps all the data required for different kinds of experiments separated and well documented.

However, running the main part of the simulation is a cooperative task between the Autosubmit workflow manager and the different tools used for each model version. The Auto-EC-Earth workflow has evolved to adapt as closely as possible to the EC-Earth model scripts that support the model runs. In EC-Earth 4, ScriptEngine is used to manage the run; it has been fully integrated into the Auto-EC-Earth 4 workflow and is used to set up the environment, while Autosubmit still manages the submission of jobs to the HPC and the dependencies between them.

Auto-EC-Earth is a great example of a workflow system that has been developed and used throughout the years: it is well established within the BSC-ES, used in multiple production cases such as several CMIP exercises, and serves as a reference for newer ESM workflows like the one developed in the Destination Earth project. It has also allowed the BSC-ES to collaborate with the EC-Earth community through the testing of new releases of the model.

How to cite: Ferrer, E., Montane, G., Castrillo, M., and Garcia, A.: Auto-EC-Earth: An automatic workflow to manage climate modelling experiments using Autosubmit, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-8621, https://doi.org/10.5194/egusphere-egu25-8621, 2025.

X4.12
|
EGU25-18040
|
ECS
Francesco Carere, Francesca Mele, Italo Epicoco, Mario Adani, Paolo Oddo, Eric Jansen, Andrea Cipollone, and Ali Aydogdu

Numerical reproducibility is a crucial yet often overlooked challenge in ensuring the credibility of computational results and the validity of Earth system models. In large-scale, massively parallel simulations, achieving numerical reproducibility is complicated by factors such as heterogeneous HPC architectures, floating point intricacies, complex hardware/software dependencies, and the non-deterministic nature of parallel execution.

This work addresses the challenges of debugging and ensuring bitwise reproducibility (BR) in parallel simulations, specifically for the MPI-parallelised OceanVar data assimilation model. We explore methods for detecting and resolving BR-related bugs, focusing on an automated debugging process. Mature tools to automate this process are currently lacking for bugs caused by MPI parallelisation, making automatic BR verification in scientific workflows involving such codebases a time-consuming challenge.

However, BR is sometimes considered unrealistic in workflows involving heterogeneous computing architectures. As an alternative, statistical reproducibility (SR) has been proposed and explored by various research groups in the Earth system modelling community, for which automated tools have been developed. For example, the scientific workflow of CESM supports automatic verification of SR using the CESM-ECT framework/PyCECT software; if SR verification fails, a root-cause analysis tool, CESM-RUANDA, exists, albeit currently not fully functional. We explore SR as an alternative and complementary approach to BR, focusing on its potential to support numerical reproducibility in workflows involving heterogeneous computing architectures.
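
The conceptual difference between BR and a (deliberately simplified) statistical check can be sketched as follows; this is illustrative only and is not OceanVar or PyCECT code.

```python
# Conceptual sketch: bitwise reproducibility (BR) demands identical values,
# while a simplified statistical proxy tolerates round-off-level differences,
# e.g. from a different MPI decomposition or summation order. Frameworks such
# as CESM-ECT use ensemble-based tests rather than this simple tolerance.
import numpy as np


def bitwise_reproducible(run_a: np.ndarray, run_b: np.ndarray) -> bool:
    # BR: every value must match exactly, down to the last bit.
    return run_a.shape == run_b.shape and np.array_equal(run_a, run_b)


def statistically_consistent(run_a: np.ndarray, run_b: np.ndarray,
                             rtol: float = 1e-12) -> bool:
    # Simplified SR proxy: agreement within a tight relative tolerance.
    return np.allclose(run_a, run_b, rtol=rtol, atol=0.0)


rng = np.random.default_rng(0)
field = rng.normal(size=(64, 64))
perturbed = field * (1.0 + 1e-15)        # mimic a different summation order

print(bitwise_reproducible(field, perturbed))        # False
print(statistically_consistent(field, perturbed))    # True
```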

How to cite: Carere, F., Mele, F., Epicoco, I., Adani, M., Oddo, P., Jansen, E., Cipollone, A., and Aydogdu, A.: Workflows for numerical reproducibility in the OceanVar data assimilation model, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-18040, https://doi.org/10.5194/egusphere-egu25-18040, 2025.

X4.13
|
EGU25-4466
|
ECS
Francesc Roura-Adserias, Aina Gaya-Avila, Leo Arriola i Meikle, Iker Gonzalez-Yeregi, Bruno De Paula Kinoshita, Jaan Tollander de Balsch, and Miguel Castrillo

The Climate Adaptation Digital Twin (ClimateDT), a contract (DE_340) inside the Destination Earth (DestinE) flagship initiative from the European Commission, is a highly collaborative project where climate models are executed in an operational manner on different EuroHPC platforms. The workflow software supporting such executions, called the ClimateDT Workflow, contains a model component and an applications component. The applications can be seen as elements that consume the data provided by the climate models. They aim to provide climate information to sectors that are critically affected by climate change, such as renewable energy or wildfire management, among others. This workflow relies on the Autosubmit workflow manager and is executed on different EuroHPC platforms that are part of the contract.

There are six lightweight applications that are run in this workflow, in parallel to the model and in a streaming fashion. Setting up and maintaining an environment for these applications for each EuroHPC platform (plus the development environments) is a time-consuming and cumbersome task. These machines are shared by multiple users, have different operating systems and libraries, some do not have internet access for all users on their login nodes, and there are different rules to install and maintain software on each machine.

To overcome these difficulties, all the application dependencies required by the workflow are encapsulated beforehand in a Singularity container, so that porting to the different platforms becomes merely a matter of path binding on each platform. Because Singularity containers do not require administrator permissions to run, anyone with access to the project can execute the desired application either on the EuroHPC machines or in their local development environment.
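
A hedged illustration of the path-binding step, driven from Python, is shown below; the image name, host paths and application entry point are placeholders rather than the actual ClimateDT configuration.

```python
# Illustrative sketch: run a containerized application with per-platform bind
# mounts. Only the host-side paths change between HPC systems; no
# administrator rights are needed to execute the container.
import subprocess

image = "climatedt_apps.sif"                 # pre-built Singularity image (placeholder)
bindings = [
    "/scratch/project/data:/data",           # model output available on the HPC
    "/scratch/project/results:/results",     # where the application writes
]

command = ["singularity", "exec"]
for bind in bindings:
    command += ["--bind", bind]              # host_path:container_path
command += [image, "python", "/opt/apps/run_energy_app.py", "--input", "/data"]

subprocess.run(command, check=True)
```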

This work shows the structure of the ClimateDT workflow and how it uses Singularity containers, how they contribute not only to portability but also to traceability and provenance, and finally the benefits and issues found during its implementation. We believe that the successful use of containers in this climate workflow, where applications run in parallel to the climate models in a streaming fashion and where the complete workflow runs on different HPC platforms, presents a good reference for other projects and workflows that must be platform-agnostic and that require agile portability of their components.

How to cite: Roura-Adserias, F., Gaya-Avila, A., Arriola i Meikle, L., Gonzalez-Yeregi, I., De Paula Kinoshita, B., Tollander de Balsch, J., and Castrillo, M.: ClimateDT Workflow: A containerized climate workflow, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-4466, https://doi.org/10.5194/egusphere-egu25-4466, 2025.

X4.14
|
EGU25-21553
Stella Valentina Paronuzzi Ticco, Simon Lyobard, Mathis Bertin, Quentin Gaudel, Jérôme Gasperi, and Alain Arnaud

The EDITO platform serves as the foundational framework for building the European Digital Twin of the Ocean, seamlessly integrating oceanographic data, processes and services on a single and comprehensive platform. The platform provides scalable computing resources interconnected with EuroHPC supercomputing centers. We have developed a mechanism that allows users to remotely execute functions (processes) on HPCs and store the resulting output at the location of their choice (e.g. EDITO personal storage, third-party S3 buckets, etc.). This output can then be leveraged as input for subsequent processes, fostering a streamlined and interconnected workflow. Our presentation will delve into the technical details of achieving such an integration between cloud and HPC systems.

How to cite: Paronuzzi Ticco, S. V., Lyobard, S., Bertin, M., Gaudel, Q., Gasperi, J., and Arnaud, A.: European Digital Twin of the Ocean: the integration with EuroHPC platforms, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-21553, https://doi.org/10.5194/egusphere-egu25-21553, 2025.

X4.15
|
EGU25-6544
Bryan N. Lawrence, David Hassell, Grenville Lister, Predoi Valeriu, Scott Davidson, Mark Goddard, Matt Pryor, Stig Telfer, Konstantinos Chasapis, and Jean-Thomas Acquaviva

Active storage (also known as computational storage) has been a concept often proposed but not often delivered. The idea is that there is a lot of under-utilised compute power in modern storage systems, and this could be utilised to carry out some parts of data analysis workflows. Such a facility would reduce the cost of moving data, and make distributed data analysis much more efficient.

For storage to be able to handle compute, either an entire compute stack has to be migrated to the storage (with all the problems around security and dependencies) or the storage has to offer suitable compute interfaces. Here we take the second approach, borrowing the concept of providing system reduction operations in the MPI interface of HPC systems, to define and implement a reduction interface for the complex layout of HDF5 (and NetCDF4) data.

We demonstrate a near-production quality deployment of the technology (PyActiveStorage) fronting JASMIN object storage, and describe how we have built a POSIX prototype. The first provides compute “near” the storage, the second is truly “in” the storage. The performance with the object store is such that for some tasks distributed workflows based on reduction operations on HDF5 data can be competitive with local workflow speeds, a result which has significant implications for avoiding expensive copies of data and unnecessary data movement. As a byproduct of this work, we have also upgraded a pre-existing pure python HDF5 reader to support lazy access, which opens up threadsafe read operations on suitable HDF5 and NetCDF4 data.

To our knowledge, there has been no previous practical demonstration of active storage for scientific data held in HDF5 files. While we have developed this technology with application in distributed weather and climate workflows, we believe it will find utility in a wide range of scientific workflows.

How to cite: Lawrence, B. N., Hassell, D., Lister, G., Valeriu, P., Davidson, S., Goddard, M., Pryor, M., Telfer, S., Chasapis, K., and Acquaviva, J.-T.: PyActiveStorage:  Efficient distributed data analysis using Active Storage for HDF5/NetCDF4, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-6544, https://doi.org/10.5194/egusphere-egu25-6544, 2025.

X4.16
|
EGU25-9175
Roc Salvador Andreazini, Xavier Yepes Arbós, Stella Valentina Paronuzzi Ticco, Oriol Tintó Prims, and Mario Acosta Cobos

Earth system models (ESMs) are essential to understand and predict climate variability and change. However, their complexity and the computational demands of high-resolution simulations often lead to performance bottlenecks that can impede research progress. Identifying and resolving these inefficiencies typically requires significant expertise and manual effort, posing challenges for both climate scientists and High Performance Computing (HPC) engineers.

We propose automating performance profiling as a solution to help researchers concentrate on improving and optimizing their models without the complexities of manual profiling. The Automatic Performance Profiling (APP) tool brings this solution to life by streamlining the generation of detailed performance reports for climate models.

The tool's reports range from high-level performance metrics, such as Simulated Years Per Day (SYPD), to low-level metrics, such as PAPI counters and MPI communication statistics. This dual-level reporting makes the tool accessible to a wide range of users, from climate scientists seeking a general understanding of model efficiency to HPC experts requiring granular insights for advanced optimizations.
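
As a minimal illustration of the high-level metric mentioned above, SYPD relates simulated time to elapsed wall-clock time; the numbers below are purely illustrative, not APP output.

```python
# Simulated Years Per Day (SYPD): simulated years completed per wall-clock day.
def sypd(simulated_years: float, wallclock_hours: float) -> float:
    """Simulated years per wall-clock day."""
    return simulated_years / (wallclock_hours / 24.0)


# e.g. one simulated year completed in 4.8 wall-clock hours -> 5.0 SYPD
print(f"{sypd(simulated_years=1.0, wallclock_hours=4.8):.1f} SYPD")
```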

Seamlessly integrated with Autosubmit, the workflow manager developed at the Barcelona Supercomputing Center (BSC), APP ensures compatibility with complex climate modelling workflows. By automating the collection and reporting of key metrics, APP reduces the effort and expertise needed for performance profiling, empowering users to enhance the scalability and efficiency of their climate models.

APP currently supports multiple models, including the EC-Earth4 climate model and the NEMO ocean model, and is compatible with different HPC systems, such as Marenostrum 5 and ECMWF’s supercomputer. Furthermore, the modular design of the tool allows adding new models and HPC platforms easily.

How to cite: Salvador Andreazini, R., Yepes Arbós, X., Paronuzzi Ticco, S. V., Tintó Prims, O., and Acosta Cobos, M.: Enhancing Earth system models efficiency: Leveraging the Automatic Performance Profiling tool, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-9175, https://doi.org/10.5194/egusphere-egu25-9175, 2025.

X4.17
|
EGU25-4355
|
ECS
Iker Gonzalez-Yeregi, Pierre-Antoine Bretonnière, Aina Gaya-Avila, and Francesc Roura-Adserias

The Climate Adaptation Digital Twin (ClimateDT) is a contract under the Destination Earth initiative (DestinE) that aims to develop a digital twin to support climate change adaptation. This is achieved by running high-resolution simulations with different climate models on the various EuroHPC platforms. In addition to the climate models, applications that consume data from the models are also developed under the contract. A common workflow is used to execute the whole pipeline, from launching the models to data consumption by the applications, in a user-friendly and automated way.

One of the challenges of this complex workflow is handling the different outputs that each of the climate models initially offered. Each model works with its own grid, vertical levels, and variable set. These differences in format make it very complicated for applications to consume and compare data coming from different models in an automated and timely manner. This issue is resolved by introducing the concept of the Generic State Vector (GSV), which defines a common output portfolio for all models to ensure a homogeneous output between models. The conversion from the model's native output to the GSV happens before the data is written on the HPC and is automated in the workflow, allowing transparent access to the data by changing only the name of the model in the call.

Data in the GSV format can be read using a newly designed, dedicated Python tool: the GSV Interface. This tool links the model part of the workflow with the applications part, enabling everything to run in a single complex workflow (end-to-end workflow). The GSV Interface makes it possible to read data that has been previously converted to GSV, adding proper metadata. It also offers extra features such as interpolation to regular grids and area selection. All the workflow components that read data from the models rely on the GSV Interface. In addition, the GSV Interface can also be used to transparently retrieve and process data from the public Destination Earth Service Platform.
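
A hypothetical usage sketch of such a request is shown below; the class and argument names are invented to illustrate the access pattern described above and are not the actual GSV Interface API.

```python
# Hypothetical access pattern: a retriever object takes a request describing
# model, variable, period, optional regular grid and area, so switching
# models only means changing the model name in the call.
from datetime import date

# from gsv import GSVRetriever   # assumed import, shown for illustration only


class GSVRetriever:              # stand-in so this sketch is self-contained
    def request(self, **query):
        print("would retrieve:", query)
        return None


gsv = GSVRetriever()
data = gsv.request(
    model="model_A",                    # switching models = changing this name
    variable="2t",
    start=date(2030, 1, 1),
    end=date(2030, 1, 31),
    grid="0.25/0.25",                   # optional interpolation to a regular grid
    area=[60, -10, 35, 30],             # optional area selection (N/W/S/E)
)
```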

How to cite: Gonzalez-Yeregi, I., Bretonnière, P.-A., Gaya-Avila, A., and Roura-Adserias, F.: Generic State Vector: streaming and accessing high resolution climate data from models to end users, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-4355, https://doi.org/10.5194/egusphere-egu25-4355, 2025.

X4.18
|
EGU25-18890
Klaus Getzlaff and Markus Scheinert

One of today's challenges is effective access to scientific data, either within research groups or across different institutions, to increase the reusability of the data and therefore their value. While large operational modeling and service centers have enabled query and access to data via common web services, this is often not the case for smaller institutions or individual research groups. In particular, maintaining the infrastructure and keeping the workflows simple, so that the data and their provenance remain available and accessible, are common challenges for scientists and data management.

At GEOMAR, several data steward positions support research data management (RDM) for specific disciplines and formats. They are also connected across centres to work on common standards, e.g. the netCDF standard working group in the Helmholtz Earth and Environment DataHUB.

Here we will present the institutional approach to research data management for numerical simulations in Earth system science. The data handling, especially the possibilities for data sharing, publication and access, which are in today's focus, is realized by using persistent identifier handles in combination with a modern HTTP web server index solution and a THREDDS server allowing remote access using standardized protocols such as OPeNDAP and WMS. Cross-linking this into the central institutional metadata and publication repositories enables reuse of the data by scientists from different research groups and backgrounds. In addition to the pure data handling, the documentation of the numerical simulation experiments is of similar importance to allow reusability or reproducibility and to provide the data, and this will be addressed too.
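
As a generic example of the standardized remote access mentioned above (placeholder URL and variable name, assuming xarray with a netCDF/OPeNDAP-capable backend):

```python
# Open a dataset served by a THREDDS server via OPeNDAP; only metadata is
# read until values are actually requested, so subsets can be loaded remotely.
import xarray as xr

url = "https://thredds.example.org/thredds/dodsC/experiments/exp01/ocean_monthly.nc"
ds = xr.open_dataset(url)                      # lazy: metadata only

# Subset remotely and load just the slice of interest ("sst" is illustrative).
sst_2020 = ds["sst"].sel(time=slice("2020-01-01", "2020-12-31")).load()
print(sst_2020.mean().values)
```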

How to cite: Getzlaff, K. and Scheinert, M.: Research data management for numerical simulations in Earth-System Science, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-18890, https://doi.org/10.5194/egusphere-egu25-18890, 2025.

X4.19
|
EGU25-7070
Chandra Taposeea-Fisher, Garin Smith, Ewelina Dobrowolska, Daniele Giomo, Francesco Barchetta, Stephan Meißl, and Dean Summers

The Open Science and Innovation Vision included in ESA’s EO Science Strategy (2024) addresses 8 key elements: 1) openness of research data; 2) open-source scientific code; 3) open-access papers with data and code; 4) standards-based publication and discovery of scientific experiments; 5) scientific workflows reproducible on various infrastructures; 6) access to education on open science; 7) community practice of open science; and 8) EO business models built on open-source. EarthCODE (https://earthcode.esa.int) is a strategic ESA EO initiative to support the implementation of this vision.

EarthCODE (Earth Science Collaborative Open Development Environment) will form part of the next generation of cloud-based geospatial services, aiming towards an integrated, cloud-based, user-centric development environment for the European Space Agency’s (ESA) Earth science activities. EarthCODE looks to maximise long-term visibility, reuse and reproducibility of the research outputs of such projects by leveraging FAIR and open science principles, thus fostering a sustainable scientific process. EarthCODE proposes a flexible and scalable architecture developed with interoperable open-source blocks, with a long-term vision evolving by incrementally integrating industrially provided services from a portfolio of the Network of Resources. Additionally, EarthCODE is a utilisation domain of EOEPCA+, contributing to the development and evolution of Open Standards and protocols, enabling internationally interoperable solutions.

EarthCODE will provide an Integrated Development Platform, giving developers the tools needed to develop high-quality workflows, allowing experiments to be executed in the cloud and reproduced end-to-end by other scientists. EarthCODE is built around existing open-source solutions, building blocks and platforms, such as the Open Science Catalogue, EOxHub and EOEPCA. It has additionally begun to integrate platform services from DeepESDL, Euro Data Cube, Polar TEP and the openEO federation on CDSE platforms, with more being added annually through ESA best practices. With its federated approach, EarthCODE will facilitate processing on other platforms, i.e. DeepESDL, ESA Euro Data Cube, openEO Cloud/openEO Platform and AIOPEN/AI4DTE.

The roadmap for the portal includes the initial portal release by the end of 2024, followed by the capability to publish experiments in Q1 2025 (including development, publishing, finding and related community engagement), and a further release by mid-2025 adding reproducibility capabilities around accessibility and execution functionalities.

Collaboration and federation are at the heart of EarthCODE. As EarthCODE evolves, we expect to provide solutions allowing federation of data and processing. EarthCODE has the ambition to deliver a model for a Collaborative Open Development Environment for Earth system science, where researchers can leverage the power of the wide range of EO platform services available to conduct their science, while also making use of FAIR Open Science tools to manage data, code and documentation, create end-to-end reproducible workflows on platforms, and have the opportunity to discover, use, reuse, modify and build upon the research of others in a fair and safe way. Overall, EarthCODE aims to enable the elements of the EO Open Science and Innovation vision, including open data, open-source code, linked data/code, open-access documentation, end-to-end reproducible workflows, open-science resources, open-science tools, and a healthy community applying all these elements in their practice.

How to cite: Taposeea-Fisher, C., Smith, G., Dobrowolska, E., Giomo, D., Barchetta, F., Meißl, S., and Summers, D.: EarthCODE - a FAIR and Open Environment for collaborative research in Earth System Science , EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7070, https://doi.org/10.5194/egusphere-egu25-7070, 2025.

X4.20
|
EGU25-19655
Ivette Serral, Vitalii Kriukov, Lucy Bastin, Riyad Rahman, and Joan Masó

In the era of declining biodiversity, global climate change and transformations in land use, terrestrial habitat connectivity is one of the key parameters of ecosystem management. In this regard, land-use/land-cover (LULC) dynamics are crucial to detect the spatiotemporal trends in connectivity of focal endangered species and to predict the effects on biodiversity of planned or proposed LULC changes.

Apart from the LULC derivatives of remote sensing, connectivity analysis and scenario modelling can also benefit from citizen science datasets, such as OpenStreetMap and GBIF species occurrence data cubes, in which aggregated data can be perceived as a cube with three dimensions - taxonomic, temporal and geographic. The synthetic LULC datasets covering Catalonia every 5 years (1987-2022) were enriched via the Data4Land harmonisation tool, which harnesses OpenStreetMap (through the Overpass Turbo API) and the World Database on Protected Areas. Two well-known tools, Graphab and MiraMon GIS&RS (using the Terrestrial Connectivity Index module - ICT), were used to create the overarching dataset on terrestrial habitat connectivity in Catalonia (2012-2022) for target species and broad land-cover categories such as forests. Significant declining trends in forest habitat connectivity are observed for the Barcelona metropolitan area, whereas the opposite holds for the Pyrenees mountain corridor and protected areas. According to the local case study on the connectivity of the Mediterranean turtle in the Albera Natural Park, the general positive trend was affected by massive fires in 2012.

To ensure replicable results, a pipeline to create reliable metadata in accordance with the FAIR principles, especially data lineage, is being developed, as well as a high-performance computing pipeline for Graphab.

How to cite: Serral, I., Kriukov, V., Bastin, L., Rahman, R., and Masó, J.: Multi-faceted habitat connectivity: how to orchestrate remote sensing with citizen science data?, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-19655, https://doi.org/10.5194/egusphere-egu25-19655, 2025.

X4.21
|
EGU25-2142
|
ECS
Albert Puiggros, Miguel Castrillo, Bruno de Paula Kinoshita, Pierre-Antoine Bretonniere, and Victòria Agudetse

Ensuring robust data provenance is paramount for advancing transparency, traceability, and reproducibility in climate research. This work presents the integration of FAIR (Findable, Accessible, Interoperable, and Reusable) principles into the workflow management ecosystem through provenance integration in Autosubmit, a workflow manager developed at the Barcelona Supercomputing Center (BSC), and SUNSET (SUbseasoNal to decadal climate forecast post-processing and asSEssmenT suite), an R-based verification workflow also developed at the BSC.

Autosubmit supports the generation of data provenance information based on RO-Crate, facilitating the creation of machine-actionable digital objects that encapsulate detailed metadata about its executions. Autosubmit integrates persistent identifiers (PIDs) and schema.org annotations, making provenance records more accessible and actionable for both humans and machines.  However, the provenance metadata provided by Autosubmit through RO-Crate focuses on the workflow process and does not encapsulate the details of the data transformation processes. This is where SUNSET plays a complementary role. SUNSET’s approach for provenance information is based on the METACLIP (METAdata for CLImate Products) ontologies. METACLIP offers a semantic approach for describing climate products and their provenance. This framework enables SUNSET to provide specific, high-resolution  provenance metadata for its operations, improving transparency and compliance with FAIR principles. The generated files provide detailed information about each transformation the data has undergone, as well as additional details about the data's state, location, structure, and associated source code, all represented in a tree-like structure.

The main contribution of this work is the generation of a comprehensive provenance object by integrating these tools. SUNSET uses Autosubmit to parallelize its data processing tasks, with Autosubmit managing SUNSET jobs. As part of this process, an RO-Crate describing the overall execution is automatically generated. This object encapsulates detailed provenance metadata for each individual job within the workflow, using METACLIP's semantic framework to represent each SUNSET execution process. Certain schema.org entities are introduced to link the RO-Crate created by Autosubmit with the provenance details generated by SUNSET, as sketched below. This integrated approach provides a unified hierarchical provenance record that spans both the workflow management system and the individual job executions, ensuring that provenance objects are automatically generated for each experiment conducted.
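
A minimal sketch of how such a linked provenance object might be assembled is shown below; the entity identifiers, file names, and linking structure are hypothetical illustrations and do not reproduce the exact schema used by Autosubmit and SUNSET.

    # Minimal sketch of a workflow-level RO-Crate that references a per-job
    # METACLIP provenance file. All identifiers and file names are
    # hypothetical; this is not the Autosubmit/SUNSET implementation.
    import json

    crate = {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {
                "@id": "./",
                "@type": "Dataset",
                "name": "Workflow experiment a001 (illustrative)",
                "hasPart": [{"@id": "provenance/job_verification_metaclip.json"}],
            },
            {
                # One job of the workflow; its detailed data transformations
                # live in the METACLIP document referenced by this File entity.
                "@id": "provenance/job_verification_metaclip.json",
                "@type": "File",
                "name": "METACLIP provenance for the verification job",
                "encodingFormat": "application/json",
            },
        ],
    }

    with open("ro-crate-metadata.json", "w") as fh:
        json.dump(crate, fh, indent=2)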

This work demonstrates the practical application of FAIR principles in climate research by advancing provenance tracking within complex workflows. It represents an initial step towards obtaining and sharing metadata about the provenance of the data products that a workflow delivers. The integration of RO-Crate and METACLIP not only enhances the reproducibility of climate data products but also fosters greater confidence in their reliability. To our knowledge, this is the first effort in the climate domain to combine different provenance formats into a single object, aiming to obtain a complete provenance graph with all the metadata.

How to cite: Puiggros, A., Castrillo, M., de Paula Kinoshita, B., Bretonniere, P.-A., and Agudetse, V.: Enhancing Data Provenance in Workflow Management: Integrating FAIR Principles into Autosubmit and SUNSET, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-2142, https://doi.org/10.5194/egusphere-egu25-2142, 2025.

X4.22
|
EGU25-6216
Carlos Zuleta Salmon, Mirko Mälicke, and Alexander Dölich

The CAMELS-PLUS initiative is revolutionizing the way hydrological and Earth System Science (ESS) data are processed, shared, and utilized by enhancing the widely used CAMELS-DE dataset. While Germany boasts one of the richest hydrological datasets globally, CAMELS-DE has faced challenges due to its reliance on fragmented, manual workflows, which are error-prone and hinder collaboration. CAMELS-PLUS introduces a groundbreaking solution: a standardized framework for containerized scientific tools that embed rich metadata, ensuring provenance, reusability, and seamless integration across diverse scientific domains.

A key innovation of CAMELS-PLUS lies in its ability to bridge the gap between disciplines by implementing a fully containerized pipeline for dataset pre-processing. This approach allows researchers in meteorology, forestry, and other ESS subdomains to contribute to and extend CAMELS-DE easily, without the complexity of navigating storage systems or inconsistent workflows. The initiative's metadata schema, implemented as YAML files with JSON-based tool parameterization, enables tools to "speak the same language," ensuring they are interoperable and aligned with the FAIR principles.
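
The sketch below illustrates what such tool metadata could look like, written from Python as a YAML file with JSON-style parameter typing; all field names and values are hypothetical and do not reproduce the actual CAMELS-PLUS schema.

    # Minimal sketch in the spirit of the CAMELS-PLUS approach: a YAML
    # description of a containerized pre-processing tool with JSON-typed
    # parameters. All field names, values, and the container image are
    # hypothetical illustrations, not the project's actual schema.
    import yaml  # requires PyYAML

    tool_metadata = {
        "tool": {
            "name": "precip-aggregator",
            "title": "Aggregate gridded precipitation to catchments",
            "version": "0.1.0",
            "container": "ghcr.io/example/precip-aggregator:0.1.0",  # hypothetical image
            "parameters": {  # JSON-style parameter typing
                "start_date": {"type": "string", "format": "date"},
                "end_date": {"type": "string", "format": "date"},
                "aggregation": {"type": "string", "enum": ["mean", "sum"]},
            },
            "data": {
                "inputs": ["gridded precipitation (NetCDF)"],
                "outputs": ["catchment time series (CSV)"],
            },
        }
    }

    with open("tool.yml", "w") as fh:
        yaml.safe_dump(tool_metadata, fh, sort_keys=False)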

Key Deliverables:

  • Updated CAMELS-DE Dataset: Incorporates new precipitation sources and enhanced metadata for seamless integration with the NFDI4Earth Knowledge Hub.
  • Standardized Scientific Containers: A community-adopted specification for containerized tools, promoting accessibility and reusability across disciplines.
  • Interactive Community Engagement: Extensions to camels-de.org, transforming it into a hub for exploring workflows and fostering interdisciplinary collaboration.

What makes CAMELS-PLUS particularly compelling is its potential to democratize access to cutting-edge hydrological datasets. By enabling non-specialists to contribute and utilize CAMELS-DE through intuitive, containerized workflows, the initiative reduces barriers to entry and accelerates innovation in data-driven hydrology and beyond. This project not only sets a new standard for dataset management in ESS but also creates a replicable model for tackling similar challenges across other scientific domains. CAMELS-PLUS is poised to inspire transformative changes in how large-sample datasets are curated, shared, and advanced for global scientific impact.

How to cite: Zuleta Salmon, C., Mälicke, M., and Dölich, A.: CAMELS-PLUS: Enhancing Hydrological Data Through FAIR Innovations., EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-6216, https://doi.org/10.5194/egusphere-egu25-6216, 2025.

X4.23
|
EGU25-11937
Jean Dumoulin, Thibaud Toullier, Nathanael Gey, and Mathias Malandain

Abstract

Efficient and secure dataset management is a critical component of collaborative research projects, where diverse data types, sharing requirements, and compliance regulations converge. This work presents a dataset management tool, DAM2 (Data and Model Monitoring), developed within the European BRIGHTER project [1], funded by the Chips Joint Undertaking (Chips JU), to address these challenges. It provides a robust and adaptable solution for handling private and public ground-based measurement datasets throughout the project lifecycle. These datasets combine infrared images (e.g. multispectral ones) with visible images, local weather measurements, labeled data, etc.

The tool is designed to ensure rights management, enabling selective data sharing among authorized partners based on predefined permissions. It incorporates secure access controls to safeguard sensitive data and meets GDPR (General Data Protection Regulation) requirements to guarantee compliance with European privacy standards. For public datasets, the tool integrates with Zenodo, an open-access repository, to support long-term storage and accessibility, aligning with the principles of open science. Key technical features include the use of an open-source, S3-compatible object storage server (MinIO [2]), providing the scalability to manage large volumes of data. Additionally, the use of the Zarr [3] data format behind the scenes offers significant advantages for this cloud-based data management tool, including efficient storage of large datasets through chunking and compression, fast parallel read and write operations, and compatibility with a wide range of data analysis tools. The tool adheres to FAIR (Findable, Accessible, Interoperable, Reusable) principles, storing metadata alongside datasets to enhance usability and interoperability.
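
As a rough illustration of this storage pattern, the sketch below writes a chunked stack of infrared frames to an S3-compatible MinIO bucket using Zarr (zarr-python v2-style API) and s3fs; the endpoint, credentials, bucket name, and array layout are placeholders rather than DAM2's actual configuration.

    # Minimal sketch: writing a chunked, compressed array to an S3-compatible
    # MinIO bucket with Zarr via s3fs. Endpoint, credentials, bucket name, and
    # array layout are placeholders, not DAM2's actual configuration.
    import numpy as np
    import s3fs
    import zarr

    fs = s3fs.S3FileSystem(
        key="ACCESS_KEY",        # placeholder credentials
        secret="SECRET_KEY",
        client_kwargs={"endpoint_url": "http://localhost:9000"},  # MinIO endpoint
    )
    store = s3fs.S3Map(root="demo-bucket/infrared_sequence.zarr", s3=fs)

    # A stack of 16-bit infrared frames, chunked per frame so individual
    # images can be read and written in parallel.
    frames = zarr.open(
        store, mode="w", shape=(100, 512, 640), chunks=(1, 512, 640), dtype="u2"
    )
    frames[0] = np.random.randint(0, 2**16, size=(512, 640), dtype="u2")
    print(frames.info)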

Developed as an open-source platform, the tool promotes transparency and collaboration while providing a complete and well-documented API for seamless integration with other systems. A user-friendly interface ensures accessibility for stakeholders with varying technical expertise, while the tool remains flexible to accommodate additional file formats as required. The development process incorporates insights from relevant COFREND (French Confederation for Non-Destructive Testing) working groups to ensure alignment with broader initiatives in data management, interoperability, and durability.

This paper addresses the design and study of the platform and its development. Initial operational functionalities are demonstrated through the manipulation of the first BRIGHTER datasets and datasets from other research projects.

In conclusion, DAM2 is a comprehensive solution for managing diverse datasets in collaborative projects, balancing security, compliance, and accessibility. It provides a foundation for efficient, compliant, and interoperable data handling while supporting the principles of open science and FAIR data management.

Perspectives include expanding interoperability with additional repositories, incorporating advanced analytics and visualization features, and integrating AI-driven automation.

Acknowledgments

The authors would like to acknowledge the BRIGHTER Horizon Europe project. BRIGHTER has received funding from the Chips Joint Undertaking (JU) under grant agreement No 101096985. The JU receives support from the European Union's Horizon Europe research and innovation program and from France, Belgium, Portugal, Spain, and Turkey.

References

[1] BRIGHTER project website. https://project-brighter.eu/, accessed January 2025.

[2] MinIO, Inc. MinIO S3 Compatible Storage for AI. https://min.io/, accessed January 2025.

[3] Zarr. https://zarr.dev/, accessed January 2025.

How to cite: Dumoulin, J., Toullier, T., Gey, N., and Malandain, M.: DAM2 — A Scalable and Compliant Solution for Managing enriched Infrared images as FAIR Research Data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-11937, https://doi.org/10.5194/egusphere-egu25-11937, 2025.

Posters virtual: Tue, 29 Apr, 14:00–15:45 | vPoster spot 4

The posters scheduled for virtual presentation are visible in Gather.Town. Attendees are asked to meet the authors during the scheduled attendance time for live video chats. If authors uploaded their presentation files, these files are also linked from the abstracts below. The button to access Gather.Town appears just before the time block starts. Onsite attendees can also visit the virtual poster sessions at the vPoster spots (equal to PICO spots).
Display time: Tue, 29 Apr, 08:30–18:00
Chairpersons: Filippo Accomando, Andrea Vitale

EGU25-7004 | Posters virtual | VPS19

Industrial High Performance Computing Scalable and FAIR Workflow Opportunities for EO Operations Processing, Operations, and Archiving 

Caroline Ball, Mark Chang, James Cruise, Camille de Valk, and Venkatesh Kannan
Tue, 29 Apr, 14:00–15:45 (CEST) | vP4.22
The computational demands of Sentinel data processing, archiving, and dissemination require scalable, efficient, and innovative solutions. While cloud computing-based services currently address these needs, integrating High-Performance Computing (HPC) systems into specific workflows could unlock a new level of industrial-scale capabilities. These include reduced processing times, faster data turnaround, and lower CO2 emissions. Leveraging HPC as a service allows for optimized data storage and access, enabling long-term strategies that prioritize essential data products and enhance operational efficiency.
Next-generation Quantum Computing (QC) holds the potential to redefine Earth Observation (EO) workflows by offering breakthroughs in solving complex optimization problems. As an operational service, QC could deliver significant cost and energy savings, provided that workflows can be seamlessly adapted to quantum-compatible infrastructures.
This presentation focuses on the evolution of HPC and QC technologies from research-driven concepts to industrial solutions, highlighting their maturity and applicability as services. We will explore the tangible benefits, associated costs, and pathways to operationalize these technologies for Level-0 to Level-2 data processing, operations, and archiving in support of current and future Sentinel missions. We examine, at a high level, how artificial intelligence (AI) can provide a solution to hybrid HPC-QC challenges for EO data processing.

How to cite: Ball, C., Chang, M., Cruise, J., de Valk, C., and Kannan, V.: Industrial High Performance Computing Scalable and FAIR Workflow Opportunities for EO Operations Processing, Operations, and Archiving, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-7004, https://doi.org/10.5194/egusphere-egu25-7004, 2025.