ESSI2.2 | Advanced strategies for workflow development, execution and reproducibility of results in Earth System Science
Convener: Miguel Castrillo (ECS) | Co-conveners: Karsten Peters-von Gehlen, Valentine Anantharaj, Donatello Elia (ECS), Yolanda Becerra, Christine Kirkpatrick, Ivonne Anders
Orals | Thu, 18 Apr, 14:00–15:45 (CEST) | Room G2
Posters on site | Attendance Thu, 18 Apr, 10:45–12:30 (CEST) | Display Thu, 18 Apr, 08:30–12:30 | Hall X2
Posters virtual | Attendance Thu, 18 Apr, 14:00–15:45 (CEST) | Display Thu, 18 Apr, 08:30–18:00 | vHall X2
Workflow methodologies and systems are fundamental tools for scientific experimentation, especially when complex computational systems such as distributed or high-performance computing are required: they improve scientific productivity and help meet criteria essential for the reproducibility and provenance of results.

Recent advances and upcoming developments in Earth System Science (ESS) face the challenge of having to i) efficiently handle close-to-exascale data volumes and ii) provide methods to make the information content readily accessible and usable by both scientists and downstream communities.

Concurrently, awareness of the importance of the reproducibility and replicability of research results has increased considerably in recent years. Reproducibility refers to the possibility of independently arriving at the same scientific conclusions. Replicability, or replication, is achieved if the execution of a scientific workflow arrives at the same result as before.

A sensible orchestration of these two aspects requires seamless workflow tools employed at compute and data infrastructures which also enable the capture of the required provenance information to, in an extreme case, rerun large simulations and analysis routines to provide trust in model fidelity, data integrity and decision-making processes. Here, reproducibility, or even replicability, and dedication to Open Science and the FAIR data principles are key. Further, this enables communities of practice to establish best practices in applying future-proof workflows among a critical mass of users, thereby facilitating adoption.

This session discusses the latest advances in workflow techniques for ESS in a two-tiered organizational structure, focusing on:

- sharing use cases, best practices and progress from various initiatives that improve different aspects of these technologies, such as eFlows4HPC (Enabling dynamic and Intelligent workflows in the future EuroHPC ecosystem), Climate Digital Twin (Destination Earth), or EDITO (European Digital Twin Ocean) Model-Lab;

- current approaches, concepts and developments in the area of reproducible workflows in ESS, such as requirements for reproducibility and replicability including provenance tracking; technological and methodological components required for data reusability and future-proof research workflows; FAIR Digital Objects (FDOs); (meta)data standards, linked-data approaches, virtual research environments and Open Science principles.

Orals: Thu, 18 Apr | Room G2

Chairpersons: Miguel Castrillo, Karsten Peters-von Gehlen
14:00–14:05
Advanced Workflow Strategies in High-Performance Computing for Earth Sciences
14:05–14:15 | EGU24-11859 | ECS | On-site presentation
Willem Tromp, Hessel Winsemius, Dirk Eilander, Albrecht Weerts, and Björn Backeberg

Modelling compound flood events, assessing their impacts, and evaluating mitigation and adaptation measures are in increasing demand from local authorities and stakeholders to support their decision making. Additionally, the severity of extreme events driving compound flooding, including storms and heavy rainfall, is projected to increase under climate change. To support local communities in flood risk management, complex modelling systems involving multiple cross-disciplinary models need to be orchestrated in order to effectively and efficiently run a wide range of what-if scenarios or historical events to understand the drivers and impacts of compound floods. The large volume and variety of data needed to configure the necessary models and simulate events strain the reproducibility of modelling frameworks, while the number of events and scenarios demands increasingly powerful computing resources. Here we present a solution to these challenges using automated workflows, leveraging the Common Workflow Language standard. The presented workflows update a base model configuration for a user-specified event or scenario and automatically rerun multiple defined scenarios. The models are executed in containers and dispatched using the StreamFlow workflow manager designed for hybrid computing infrastructures. This solution offers a single, uniform interface for configuring all models involved in the model train, while also offering a single interface for running the model chain locally or on high-performance computing infrastructures. This allows researchers to leverage data and computing resources more efficiently and provides them with a larger and more accurate range of compound flood events to support local authorities and stakeholders in their decision making.
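As an illustration of the pattern described above (update a base configuration per event, then dispatch each scenario), a minimal Python sketch is given below; the workflow file, job keys and scenario names are hypothetical, and the CWL reference runner cwltool stands in here for the StreamFlow dispatch used by the authors.

```python
"""Minimal sketch (not the authors' code) of the pattern described above:
derive a per-event CWL job file from a base configuration, then run the
workflow for each scenario. File names and keys are hypothetical."""
import json
import subprocess
from pathlib import Path

BASE_CONFIG = Path("base_job.json")                        # hypothetical CWL job file for the model train
SCENARIOS = ["historical_2013", "slr_050cm", "slr_100cm"]  # hypothetical event/scenario labels

def job_for_scenario(scenario: str) -> Path:
    """Write a per-scenario CWL job file derived from the base configuration."""
    job = json.loads(BASE_CONFIG.read_text())
    job["event_id"] = scenario                             # hypothetical workflow input
    out = Path(f"job_{scenario}.json")
    out.write_text(json.dumps(job, indent=2))
    return out

for scenario in SCENARIOS:
    job_file = job_for_scenario(scenario)
    # The abstract dispatches containerized models with StreamFlow on hybrid
    # infrastructures; the CWL reference runner keeps this sketch self-contained.
    subprocess.run(["cwltool", "compound_flood.cwl", str(job_file)], check=True)
```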

How to cite: Tromp, W., Winsemius, H., Eilander, D., Weerts, A., and Backeberg, B.: Workflow composition for compound flooding events and adaptation measures, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-11859, https://doi.org/10.5194/egusphere-egu24-11859, 2024.

14:15–14:25 | EGU24-17141 | ECS | On-site presentation
Tobias Kölling and Lukas Kluft

Global kilometer-scale climate models generate vast quantities of simulation output, presenting significant challenges in using this wealth of data effectively. We approach this challenge from the output user perspective, craft useful dataset requirements and show how this also simplifies data handling and reduces the “time to plot”.

First, this strategy involves the creation of a consolidated and analysis-ready n-dimensional dataset, directly from the running model. A dataset forms a consistent set of data, which as a whole aims to inform a user about what and what not to expect from the output of a given model run. This is notably distinct from multiple independent datasets or messages, which can’t convey the overall structure and are often (slightly, but annoyingly) inconsistent. Thus, the amount of user surprise can be reduced by reducing the number of output datasets of a model run towards 1. This of course requires synthesizing relevant information from diverse model outputs, but in turn streamlines the accessibility and usability of climate simulation data.

With the goal of the user-perspective of a single large dataset established, we need to ensure that access to this dataset is swift and ergonomic. At the kilometer-scale, horizontal grids of global models outgrow the size of computer screens and the capacity of the human eye, changing viable dataset usage patterns: we can either observe a coarser version of the data globally or high resolution data locally, but not both. We make use of two concepts to cope with this fact: output hierarchies and multidimensional chunking.

By accumulating data in both temporal and spatial dimensions, while keeping the dataset structure, users can seamlessly switch between resolutions, reducing the computational burden during post-processing at the large-scale perspective. In addition, by splitting high-resolution data into compact spatiotemporal chunks, regional subsets can be extracted quickly as well. While adding the hierarchy adds a small amount of extra data to already tight disk space quotas, a good chunk design and state-of-the-art compression techniques reduce storage requirements without adding access time overhead. On top of that, the approach creates an opportunity for hierarchical storage systems: only those regions and resolutions which are actively worked on have to reside in “hot” storage.

In summary, our collaborative efforts bring together diverse existing strategies to revolutionize the output and post-processing landscape of global kilometer-scale climate models. By creating a single analysis-ready dataset, pre-computing hierarchies, employing spatial chunking, and utilizing advanced compression techniques, we aim to address challenges associated with managing and extracting meaningful insights from these vast simulations. This innovative approach enhances the efficiency of many real-life applications, which is a necessity for analysing multi-decadal kilometer-scale model output.
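A minimal sketch of the chunking and compression ideas described above, using xarray and Zarr on a toy dataset (not the authors' tooling; variable names, chunk sizes and file paths are illustrative only):

```python
"""Illustrative sketch (not the authors' pipeline): write one consolidated,
analysis-ready dataset with explicit spatiotemporal chunks and compression,
plus a coarser member of the output hierarchy. Uses zarr v2 conventions."""
import numpy as np
import pandas as pd
import xarray as xr
from numcodecs import Blosc

# Toy stand-in for model output: (time, cell) fields on an unstructured grid.
ds = xr.Dataset(
    {"tas": (("time", "cell"), np.random.rand(48, 12_288).astype("float32"))},
    coords={"time": pd.date_range("2020-01-01", periods=48, freq="h")},
)

# Compact spatiotemporal chunks: a day of hourly steps times a block of cells,
# so global-coarse and regional-high-resolution access patterns both stay cheap.
encoding = {
    "tas": {
        "chunks": (24, 1024),
        "compressor": Blosc(cname="zstd", clevel=3, shuffle=Blosc.SHUFFLE),
    }
}
ds.to_zarr("model_output.zarr", mode="w", encoding=encoding)

# A temporally aggregated hierarchy level, kept in the same dataset structure.
ds.resample(time="1D").mean().to_zarr("model_output_daily.zarr", mode="w")
```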

How to cite: Kölling, T. and Kluft, L.: Building useful datasets for Earth System Model output, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-17141, https://doi.org/10.5194/egusphere-egu24-17141, 2024.

14:25–14:35 | EGU24-20358 | On-site presentation
Miguel Andrés-Martínez, Nadine Wieters, Paul Gierz, Jan Streffing, Sebastian Wahl, Joakim Kjellsson, and Bernadette Fritzsch

In the last decades, the operation, maintenance and administration of Earth System Models (ESMs) have become substantially more complex due to the increasing number of available models, coupling approaches and versions, and the need for tuning for different scales and configurations. Another factor contributing to the complexity of operation is the requirement to run the models on different High Performance Computing (HPC) platforms. In this context, configuration tools, workflow managers and ESM-oriented scripting tools have become essential for administrating, distributing and operating ESMs across research groups, institutions and members of international projects, while still ensuring simulation reproducibility.

ESM-Tools is an open-source software infrastructure and configuration tool that tackles these challenges associated with the operation of ESMs. ESM-Tools enables seamlessly building and running ESMs across different HPCs in a reproducible manner. Most importantly, it is used by model developers to distribute standard simulation configurations, so that the user can effortlessly run these predefined simulations while retaining the flexibility to modify only the parameters that align with their specific needs. This lowers the technical threshold for new model users and makes the ESMs more accessible.

The source code consists of an HPC- and model-agnostic Python back-end and a set of model- and HPC-specific configuration YAML files. In this way, adding a new model, coupled model or HPC is just a matter of writing new configuration YAML files. The configuration files are highly modularized, which allows for their reuse in new setups (e.g. new components are added, while some existing component configurations are reused). Configuration conflicts between the different files are resolved hierarchically according to their configuration category, giving priority to model- and simulation-specific configurations. ESM-Tools also provides basic workflow-management capabilities which allow for plugging in preprocessing and postprocessing tasks and running offline coupled models. The tasks of the ESM-Tools workflow can be reorganized, new tasks can be included, and single tasks can be executed independently, allowing for its integration into more advanced workflow management software if required.
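The hierarchical resolution of configuration categories can be illustrated with a small Python sketch; this shows the general idea only, not ESM-Tools' actual implementation, and all keys and values below are hypothetical.

```python
"""Illustrative sketch of hierarchical configuration resolution in the spirit
of the approach described above (not ESM-Tools' actual code). Later categories
override earlier ones, so simulation-specific settings win over defaults."""
from copy import deepcopy

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge 'override' into 'base'; override values take priority."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Hypothetical configuration categories, ordered by increasing priority.
machine_cfg = {"launcher": "srun", "account": "projectXY"}
model_cfg = {"model": "fesom2", "resolution": "CORE2", "launcher": "mpirun"}
simulation_cfg = {"resolution": "ROSSBY4.2", "nyears": 50}

resolved = {}
for cfg in (machine_cfg, model_cfg, simulation_cfg):
    resolved = deep_merge(resolved, cfg)

print(resolved)
# {'launcher': 'mpirun', 'account': 'projectXY', 'model': 'fesom2',
#  'resolution': 'ROSSBY4.2', 'nyears': 50}
```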

Among other coupled Earth System Models, ESM-Tools is currently used to manage and distribute the OpenIFS-based Climate Models AWI-CM3 (FESOM2 + OpenIFS, developed at AWI) and FOCI-OpenIFS (NEMO4 + OpenIFS43r3, developed at GEOMAR, running ORCA05 and ORCA12 in coupled mode with OASIS3-MCT5.0), as well as the AWI-ESM family of models (ECHAM6 + FESOM2). HPCs supported include those of the DKRZ (Hamburg, Germany), Jülich Supercomputing Center (Jülich, Germany), HLRN (Berlin and Göttingen, Germany), and the IBS Center for Climate Physics (Busan, South Korea), with plans to support LUMI (Kajaani, Finland) and desktop distributions (for educational purposes).

In this contribution we will introduce ESM-Tools and the design choices behind its architecture. Additionally, we will discuss the advantages of such a modular system, address the usability and maintainability challenges resulting from these design choices, and present our mitigation strategies.

How to cite: Andrés-Martínez, M., Wieters, N., Gierz, P., Streffing, J., Wahl, S., Kjellsson, J., and Fritzsch, B.: ESM-Tools - A modular infrastructure for Earth System Modelling, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-20358, https://doi.org/10.5194/egusphere-egu24-20358, 2024.

14:35–14:45 | EGU24-20642 | solicited | On-site presentation
Bruno De Paula Kinoshita, Daniel Beltran Mora, Manuel G. Marciani, and Luiggi Tenorio Ku

In this talk we present current work in Autosubmit to track workflow provenance using the community maintained open standard RO-Crate. Autosubmit is an experiment and workflow manager designed to conduct climate experiments in different platforms (local, HPC, cloud), and is part of different Earth Digital Twin initiatives (Destination Earth Climate Digital Twin, and the European Digital Twin of the Ocean).

Workflow managers have a central role in receiving user input, processing it with local and remote jobs that run on different platforms and that generate output data. RO-Crate enables tracking of workflow prospective (what should happen, e.g. workflow configuration, Slurm job settings) and retrospective (what happened, e.g. log files, performance indicators) provenance. By adopting an open standard that is used by other workflow managers (e.g. Galaxy, COMPSs, Streamflow, WfExS, Sapporo, and Autosubmit) and tools (e.g. Workflow Hub, runcrate) from various domains we show that it not only improves data provenance in Autosubmit, but also interoperability with other workflow managers and tools.

We also describe recent work to integrate RO-Crate with METACLIP, a language-independent framework for climate product provenance that was used in the IPCC Atlas. METACLIP uses ontologies such as PROV to track the provenance of climate products. We describe how that relates to RO-Crate, and how we are integrating both via JSON-LD.
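For orientation, a hand-rolled, minimal example of RO-Crate metadata expressed as JSON-LD is sketched below; it is not Autosubmit's actual provenance output, and the experiment name and file names are hypothetical.

```python
"""Minimal sketch of an ro-crate-metadata.json file written as JSON-LD
(hand-rolled for illustration; not Autosubmit's actual provenance output)."""
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        },
        {
            "@id": "./",
            "@type": "Dataset",
            "name": "Hypothetical Autosubmit experiment a000",
            "hasPart": [{"@id": "conf/minimal.yml"}, {"@id": "logs/a000_run.log"}],
        },
        # Prospective provenance: what should happen (workflow configuration).
        {"@id": "conf/minimal.yml", "@type": "File", "description": "Workflow configuration"},
        # Retrospective provenance: what happened (run log).
        {"@id": "logs/a000_run.log", "@type": "File", "description": "Job log"},
    ],
}

with open("ro-crate-metadata.json", "w") as fh:
    json.dump(crate, fh, indent=2)
```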

How to cite: De Paula Kinoshita, B., Beltran Mora, D., G. Marciani, M., and Tenorio Ku, L.: Workflow Provenance with RO-Crate in Autosubmit, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-20642, https://doi.org/10.5194/egusphere-egu24-20642, 2024.

Enabling reproducibility of results in Earth System Science through improved workflows
14:45–15:05 | EGU24-19486 | solicited | Highlight | On-site presentation
Sean R. Wilkinson

The FAIR Principles, originally introduced as guiding principles for scientific data management and stewardship, also apply abstractly to other digital objects such as research software and scientific workflows. When introduced to the FAIR principles, most scientists can see that the concepts behind the FAIR principles — namely, to make digital objects Findable, Accessible, Interoperable, and Reusable — will improve the quality of research artifacts. It is less common, however, that scientists immediately recognize the ways in which incorporating FAIR methods into their research can enable them to tackle problems of greater size and complexity. In short, focusing on making artifacts that are reusable in the FAIR sense makes those artifacts reusable by humans as well as machines, thus enabling computational workflows that handle scaling issues automatically and someday even self-assemble. Here, we will discuss recent community developments in FAIR computational workflows and how they can impact the earth sciences now and in the future.

How to cite: Wilkinson, S. R.: FAIR Workflows and Methods for Scaling, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-19486, https://doi.org/10.5194/egusphere-egu24-19486, 2024.

15:05–15:15 | EGU24-5115 | On-site presentation
Alice Fremand, Julien Bodart, Tom Jordan, Peter Fretwell, and Alexander Tate

In the last 50 years, the British Antarctic Survey (BAS, https://www.bas.ac.uk/) has been a key player in acquiring airborne magnetic, gravity and radio-echo sounding data in Antarctica. These data have been central to many studies of the past, present and future evolution of the Antarctic Ice Sheet but until recently they were not accessible to the community.

In the last three years, the UK Polar Data Centre (https://www.bas.ac.uk/data/uk-pdc/) has made considerable efforts to standardise these datasets to comply with the FAIR (Findable, Accessible, Interoperable and Reusable) data principles and develop the Polar Airborne Geophysics Data Portal (https://www.bas.ac.uk/project/nagdp/). Workflows from collection to publication have been updated, data formats standardised, and Jupyter Notebooks created to improve reuse and comply with the needs of the scientific community [1].

Following this experience and to promote open access, the UK Polar Data Centre led the management of 60 years of international Antarctic ice thickness data through the Bedmap3 project (https://www.bas.ac.uk/project/bedmap/), an international project supported by the Scientific Committee on Antarctic Research (SCAR). This time, it is 80+ million points of ice thickness, ice surface and bed elevation from 270+ surveys collected from 50+ international partners that have been standardised and assimilated in the Bedmap data portal (https://bedmap.scar.org/) [2].

Today, airborne data are acquired using new types of platforms including uncrewed aerial systems (UAV) adding new challenges and opportunities to set up new standards and data management practices.

As part of this presentation, we will present the different workflows and data management practices that we are developing to make Antarctic science open and FAIR.

[1] Frémand, A. C., Bodart, J. A., Jordan, T. A., Ferraccioli, F., Robinson, C., Corr, H. F. J., Peat, H. J., Bingham, R. G., and Vaughan, D. G.: British Antarctic Survey's aerogeophysical data: releasing 25 years of airborne gravity, magnetic, and radar datasets over Antarctica, Earth Syst. Sci. Data, 14, 3379–3410, https://doi.org/10.5194/essd-14-3379-2022 , 2022.

[2] Frémand, A. C., Fretwell, P., Bodart, J., Pritchard, H. D., Aitken, A., Bamber, J. L., ... & Zirizzotti, A.: Antarctic Bedmap data: Findable, Accessible, Interoperable, and Reusable (FAIR) sharing of 60 years of ice bed, surface, and thickness data, Earth Syst. Sci. Data, 15, 2695–2710, https://doi.org/10.5194/essd-15-2695-2023, 2023.

How to cite: Fremand, A., Bodart, J., Jordan, T., Fretwell, P., and Tate, A.: Advancing polar airborne geophysics data management at the UK Polar Data Centre , EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-5115, https://doi.org/10.5194/egusphere-egu24-5115, 2024.

15:15–15:25 | EGU24-8711 | ECS | On-site presentation
Marianna Miola, Daniela Cabiddu, Simone Pittaluga, and Marino Vetuschi Zuccolini

Over the past few decades, geoscientists have progressively exploited and integrated techniques and tools from Applied Mathematics, Statistics, and Computer Science to investigate and simulate natural phenomena. Depending on the situation, the sequence of computational steps may vary, leading to intricate workflows that are difficult to reproduce and/or revisit later. To ensure that such workflows can be repeated for validation, peer review, or further investigation, it is necessary to implement strategies for workflow persistence, that is, the ability to maintain the continuity, integrity, and reproducibility of a workflow over time.

In this context, we propose an efficient strategy to support workflow persistence of Geoscience pipelines (i.e., the Environmental Workflow Persistence methodology, EWoPe). Our approach enables documenting each workflow step, including details about data sources, processing algorithms, parameters, and final and intermediate outputs. Documentation aids in understanding the workflow's methodology, promotes transparency, and ensures replicability.

Our methodology views workflows as hierarchical tree data structures. In this representation, each node describes data, whether it's input data or the result of a computational step, and each arc is a computational step that uses its respective nodes as inputs to generate output nodes. The relationship between input and output can be described as either one-to-one or one-to-many, allowing the flexibility to support either singular or multiple outcomes from a single input.

The approach ensures the persistence of workflows by employing JSON (JavaScript Object Notation) encoding. JSON is a lightweight data interchange format designed for human readability and ease of both machine parsing and generation. Under this persistence scheme, each node within a workflow consists of two elements. One encodes the raw data itself (in one or multiple files). The other is a JSON file that describes, through metadata, the computational step responsible for generating the raw data, including references to the input data and parameters. Such a JSON file serves to trace and certify the source of the data, offering a starting point for retracing the workflow backward to its original input data or to an intermediate result of interest.
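A possible shape of such a node sidecar is sketched below in Python; the field names are illustrative assumptions, not the actual EWoPe schema.

```python
"""Illustrative sketch of a per-node JSON sidecar in the spirit of the scheme
described above (field names are hypothetical, not the actual EWoPe schema)."""
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

# Hypothetical raw data produced by this node (stand-in content so the sketch runs).
raw_output = Path("kriging_variance.dat")
raw_output.write_bytes(b"0.12 0.34 0.56\n")

node_metadata = {
    "node_id": "node-0042",
    "raw_data": {"file": raw_output.name, "sha256": sha256_of(raw_output)},
    "computational_step": {
        "algorithm": "ordinary_kriging",           # hypothetical step name
        "parameters": {"variogram": "spherical", "range_m": 250.0},
        "inputs": ["node-0017", "node-0023"],      # parent nodes this step consumed
    },
}
Path("kriging_variance.json").write_text(json.dumps(node_metadata, indent=2))
```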

Currently, EWoPe methodology has been implemented and integrated into MUSE (Modeling Uncertainty as a Support for Environment) (Miola et al., STAG2022), a computational infrastructure to evaluate spatial uncertainty in multi-scenario applications such as in environmental geochemistry, reservoir geology, or infrastructure engineering. MUSE allows running specific multi-stage workflows that involve spatial discretization algorithms, geostatistics, and stochastic simulations: the usage of EWoPe methodology in MUSE can be seen as an example of its deployment.

By exploiting transparency of input-output relationships and thus ensuring the reproducibility of results, EWoPe methodology offers significant benefits to both scientists and downstream communities involved in utilizing environmental computational frameworks.

Acknowledgements: Funded by the European Union - NextGenerationEU and by the Ministry of University and Research (MUR), National Recovery and Resilience Plan (NRRP), Mission 4, Component 2, Investment 1.5, project “RAISE - Robotics and AI for Socio-economic Empowerment” (ECS00000035) and by PON "Ricerca e Innovazione" 2014-2020, Asse IV "Istruzione e ricerca per il recupero", Azione IV.5 "Dottorati su tematiche green" DM 1061/2021.

How to cite: Miola, M., Cabiddu, D., Pittaluga, S., and Vetuschi Zuccolini, M.: EWoPe: Environmental Workflow Persistence methodology for multi-stage computational processes reproducibility, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-8711, https://doi.org/10.5194/egusphere-egu24-8711, 2024.

15:25–15:35 | EGU24-14011 | On-site presentation
Aidan Heerdegen, Harshula Jayasuriya, Tommy Gatti, Varvara Efremova, Kelsey Druken, and Andy Hogg

It is difficult to reliably build climate models, reproduce results and so replicate scientific findings. Modern software engineering coupled with the right tools can make this easier. 

Some sources of complexity that make this a difficult problem:  

  • Climate models are an imperfect translation of extremely complex scientific understanding into computer code. Imperfect because many assumptions are made to make the problems tractable.  
  • Climate models are typically a number of separate models of different realms of the earth system, which run independently while exchanging information at their boundaries.   
  • Building multiple completely separate models and their many dependencies, all with varying standards of software engineering and architecture. 
  • Computational complexity requires high performance computing (HPC) centres, which contain exotic hardware utilising specially tuned software.  

ACCESS-NRI uses Spack, a build-from-source package manager that targets HPC and gives full build provenance and guaranteed build reproducibility. This makes building climate models easier and more reliable. Continuous integration testing of build correctness and reproducibility, model replicability, and scientific reproducibility eliminates a source of complexity and uncertainty. The model is guaranteed to produce the same results from the same code, or from modified code when those changes should not alter answers.

Scientists can be confident that any variation in their climate model experiments is due to factors under their control, rather than changes in software dependencies, or the tools used to build the model. 

How to cite: Heerdegen, A., Jayasuriya, H., Gatti, T., Efremova, V., Druken, K., and Hogg, A.: RRR: Reliability, Replicability, Reproducibility for Climate Models  , EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-14011, https://doi.org/10.5194/egusphere-egu24-14011, 2024.

15:35–15:45 | EGU24-18724 | ECS | On-site presentation
Alan Correa, Anil Yildiz, and Julia Kowalski

The pursuit of reproducibility in research has long been emphasized. It is even more critical in geohazards research and practice, where model-based decision-making needs to be transparent for trustworthy applications. However, enabling reproducibility in process-based or machine learning workflows requires time, energy, and sometimes manual operations or even unavailable resources. Moreover, the diversity in modern compute environments, both in hardware and software, significantly hinders the path to reproducibility. While many researchers focus on reproducibility, we advocate that reusability holds greater value and inherently requires the former. Reusable datasets and simulations can allow for transparent and reliable decision support, analysis as well as benchmarking studies. Reusable research software can foster composition and faster development of complex projects, while avoiding the reinvention of complicated data structures and algorithms.

Establishing reproducible workflows and compute environments is vital to enable and ensure reusability. Prioritising reproducible workflows is crucial for individual use, while both reproducible compute environments and workflows are essential for broader accessibility and reuse by others. We present herein various challenges faced in coming up with reproducible workflows and compute environments along with solution strategies and recommendations through experiences from two projects in geohazards research. We discuss an object-oriented approach to simulation workflows, automated metadata extraction and data upload, unique identification of datasets (assets) and simulation workflows (processes) through cryptographic hashes. We investigate essential factors, such as software versioning and dependency management, reproducibility across diverse hardware used by researchers, and time to first reproduction/reuse (TTFR), to establish reproducible computational environments. Finally, we shall explore the landscape of reproducibility in compute environments, covering language-agnostic package managers, containers, and language-specific package managers supporting binary dependencies.
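One small piece of this picture, recording the exact software environment alongside a result so it can be reproduced later, can be sketched as follows (package names are examples only; this is not the authors' tooling):

```python
"""Sketch of recording the software environment next to a result, one of the
ingredients of reproducible compute environments discussed above."""
import json
import platform
from importlib import metadata

def _installed(name: str) -> bool:
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False

def environment_snapshot(packages: list[str]) -> dict:
    """Collect interpreter, platform and pinned package versions."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: metadata.version(name) for name in packages if _installed(name)},
    }

snapshot = environment_snapshot(["numpy", "scipy", "xarray"])  # example package names
with open("environment_snapshot.json", "w") as fh:
    json.dump(snapshot, fh, indent=2)
```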

How to cite: Correa, A., Yildiz, A., and Kowalski, J.: Reproducible Workflows and Compute Environments for Reusable Datasets, Simulations and Research Software, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-18724, https://doi.org/10.5194/egusphere-egu24-18724, 2024.

Posters on site: Thu, 18 Apr, 10:45–12:30 | Hall X2

Display time: Thu, 18 Apr, 08:30–Thu, 18 Apr, 12:30
Chairpersons: Valentine Anantharaj, Donatello Elia, Ivonne Anders
Advanced Workflow Strategies in High-Performance Computing for Earth Sciences
X2.5 | EGU24-1042 | ECS
Manuel Giménez de Castro Marciani, Gladys Utrera, Miguel Castrillo, and Mario C. Acosta

Experimenting with modern ESMs inherently requires a workflow organization to handle the multiple steps involved, including, but not limited to, execution, data governance, cleaning, and coordinating multiple machines. For climate experiments, due to the long time scale of the simulations, workflows are even more critical. The community has thoroughly proposed enhancements for reducing the runtime of the models, but has long overlooked the time to response, which also takes into account the queue time. That is what we aim to optimize by wrapping jobs, which would otherwise be submitted individually, into a single one. The intricate three-way interaction of HPC system usage, scheduler policy, and the user's past usage is the main challenge addressed here to analyze the impact of wrapping jobs.

How to cite: Giménez de Castro Marciani, M., Utrera, G., Castrillo, M., and Acosta, M. C.: Assessing Job Wrapping as an Strategy for Workflow Optimization on Shared HPC Platforms, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-1042, https://doi.org/10.5194/egusphere-egu24-1042, 2024.

X2.6 | EGU24-2152
Gilbert Montané Pinto, Eric Ferrer, Miriam Olid, Alejandro Garcia, Genís Bonet, and Amirpasha Mozaffari

At the Earth Sciences department of the Barcelona Supercomputing Center (BSC-ES) a variety of workflows are run for many different purposes like executing climate and atmospheric simulations, data downloading or performance evaluation. This implies having to deal with many different processes and time scales in different environments and machines.

To help conduct all these complex tasks, the Autosubmit workflow manager is used across the whole department as a unique framework. The fact that Autosubmit has been fully developed at the BSC-ES has led to the adoption of a co-design procedure between users, workflow developers and Autosubmit developers to fulfill the day-to-day department needs. The synergy and close collaboration among them allows the workflow engineers to gather the specific user requirements that later become new Autosubmit features available to everyone. Thanks to this continuous interaction at all levels, an efficient and very adaptable system has been achieved, perfectly aligned with the constantly evolving user needs.

Here this collaborative strategy is presented from the workflow development point of view. Some real use cases and practical examples are used to show the positive impact it has had on different operational and research projects, demonstrating how it can help achieve high scientific productivity.

How to cite: Montané Pinto, G., Ferrer, E., Olid, M., Garcia, A., Bonet, G., and Mozaffari, A.: Collaboratively developing workflows at the BSC-ES, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-2152, https://doi.org/10.5194/egusphere-egu24-2152, 2024.

X2.7 | EGU24-2533 | ECS
Aina Gaya-Àvila, Leo Arriola i Meikle, Francesc Roura Adserias, Bruno De Paula Kinoshita, Daniel Beltrán Mora, Rohan Ahmed, Miguel Andrés-Martínez, and Miguel Castrillo

The escalating intricacy of climate models and the demand for high-resolution temporal and spatial data call for the development of advanced workflows to effectively manage the complexities associated with a Climate Digital Twin. The designed workflow, tailored to meet these challenges, is model-agnostic, allowing for simulations across various models, such as IFS-NEMO, IFS-FESOM, and ICON. Notably, its adaptability extends to diverse High-Performance Computing environments, facilitated by the containerization of data consumers.


A user-friendly configuration structure is implemented, providing scientists with a simplified interface that conceals the inherent complexity of the model during simulations. Additionally, the workflow includes immediate and continuous data processing, promoting scalability in temporal and spatial resolution. This approach ensures the efficient handling of intricate climate models, meeting the demands for high-resolution temporal and spatial data, while enhancing user accessibility and adaptability across different computational environments. 


Furthermore, the workflow, which uses Autosubmit as the workflow manager, ensures the traceability and reproducibility of the experiments, allowing for the tracking of processes and ensuring the ability to reproduce results accurately. Finally, the workflow allows for the aggregation of tasks into larger jobs, reducing queue times on shared machines and optimizing resource usage.

How to cite: Gaya-Àvila, A., Arriola i Meikle, L., Roura Adserias, F., De Paula Kinoshita, B., Beltrán Mora, D., Ahmed, R., Andrés-Martínez, M., and Castrillo, M.: A workflow for the Climate Digital Twin, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-2533, https://doi.org/10.5194/egusphere-egu24-2533, 2024.

X2.8 | EGU24-2555
Enabling Reliable Workflow Development with an Advanced Testing Suite
(withdrawn)
Rohan Ahmed, Genis Bonet Garcia, Gilbert Montane Pinto, and Eric Ferrer Escuin
X2.9 | EGU24-6350
Isidora Jankov, Daniel Abdi, Naureen Bharwani, Emily Carpenter, Christopher Harrop, Christina Holt, Paul Madden, Timothy Sliwinski, Duane Rosenberg, and Ligia Bernardet

The NOAA Global Systems Laboratory, Earth Prediction Advancement Division, Scientific Computing Branch team works on approaches to facilitate development of cloud-resolving, Earth system prediction systems suitable for the next generation of exascale high performance computing (HPC), including exploration of machine learning (ML) algorithms within our systems for improved performance and reduced computational cost. 

Our work is divided into two main categories: incremental - shorter term and innovative - longer term challenges. Work related to incremental changes focuses on existing NOAA algorithms and improvement of their performance on different architectures (e.g. adapting existing codes to run on GPUs). The more innovative aspects focus on development and evaluation of new algorithms and approaches to environmental modeling that simultaneously improve prediction accuracy, performance, and portability. For this purpose we have developed the GeoFLuid Object Workflow (GeoFLOW), a C++ framework with convective (and other)  dynamics, high order truncation, quantifiable dissipation, an option to use a variety of 2D and 3D grids, and excellent strong scaling and on-node properties. An evaluation of the use of ML-based emulators for different components of the earth system prediction models also forms an important part of our research. 

Finally, a large portion of our research and development activities involves building federated and unified workflows to facilitate both the effective use of distributed computing resources as well as easy configuration for nontrivial workflow applications in research and operations.

A comprehensive summary of these research and development activities will be presented.



How to cite: Jankov, I., Abdi, D., Bharwani, N., Carpenter, E., Harrop, C., Holt, C., Madden, P., Sliwinski, T., Rosenberg, D., and Bernardet, L.: Research and Development within the Scientific Computing Branch of NOAA’s Global Systems Laboratory, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-6350, https://doi.org/10.5194/egusphere-egu24-6350, 2024.

X2.10 | EGU24-10750
Farahnaz Khosrawi and Lars Hoffmann

Computer performance has increased immensely in recent years, but the capacity to store data has hardly increased at all. The current version of meteorological reanalysis data, ERA5, provided by the European Centre for Medium-Range Weather Forecasts (ECMWF), has increased in volume by a factor of ∼80 compared to its predecessor ERA-Interim. This presents scientists with major challenges, especially if data covering several decades are to be stored on local computer systems. Accordingly, many compression methods have been developed in recent years with which data can be stored either losslessly or lossily. Here we test three of these methods, the two lossy compression methods ZFP and Layer Packing (PCK) and the lossless compressor Zstandard (ZSTD), and investigate how the use of compressed data affects the results of Lagrangian air parcel trajectory calculations with the Lagrangian model for Massive-Parallel Trajectory Calculations (MPTRAC). We analysed 10-day forward trajectories that were globally distributed over the free troposphere and stratosphere. The largest transport deviations were found when using ZFP with the strongest compression. Using a less aggressive compression we could reduce the transport deviation and still achieve significant compression. Since ZSTD is a lossless compressor, we find no transport deviations at all when using these compressed files, but we also do not save much disk space with this compressor (a reduction of ∼20%). The best result concerning compression efficiency and transport deviations is obtained with the layer packing method PCK. The data are compressed by about 50%, while transport deviations do not exceed 40 km in the free troposphere and are even lower in the upper troposphere and stratosphere. Thus, our study shows that the PCK compression method would be valuable for application in the atmospheric sciences and that with compression of meteorological reanalysis data files we can overcome the challenge of the high disk-space demand of these data sets.
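For the lossless case, the kind of comparison described can be illustrated with a short sketch using the zstandard package on a synthetic field (the study itself compresses full ERA5 files with ZFP, PCK and ZSTD; this is only a toy example):

```python
"""Toy illustration of lossless compression of a gridded field with Zstandard.
The synthetic array is a stand-in; it is not ERA5 data."""
import numpy as np
import zstandard as zstd

# Smooth synthetic stand-in for a temperature field (lat x lon), float32.
lat = np.linspace(-90.0, 90.0, 721, dtype="float32")
field = (250.0 + 40.0 * np.cos(np.deg2rad(lat)))[:, None] * np.ones((1, 1440), dtype="float32")
field = field.astype("float32")

raw = field.tobytes()
compressed = zstd.ZstdCompressor(level=3).compress(raw)
restored = np.frombuffer(zstd.ZstdDecompressor().decompress(compressed),
                         dtype="float32").reshape(field.shape)

print(f"compression ratio: {len(raw) / len(compressed):.2f}")
assert np.array_equal(field, restored)  # lossless: the field is bit-identical
```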

How to cite: Khosrawi, F. and Hoffmann, L.: Compression of meteorological reanalysis data files and their application to Lagrangian transport simulations , EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-10750, https://doi.org/10.5194/egusphere-egu24-10750, 2024.

X2.11 | EGU24-11167
Louise Cordrie, Jorge Ejarque, Carlos Sánchez-Linares, Jacopo Selva, Jorge Macías, Steven J. Gibbons, Fabrizio Bernardi, Bernardi Tonini, Rosa M Badia, Sonia Scardigno, Stefano Lorito, Fabrizio Romano, Finn Løvholt, Manuela Volpe, Alessandro D'Anca, Marc de la Asunción, and Valentina Magni

The Urgent Tsunami Computing procedures discussed herein are designed to quantify potential hazards resulting from seismically-induced tsunamis following an earthquake, with a temporal scope ranging from minutes to a few hours. The presented workflow employs comprehensive simulations, encompassing the entire tsunami propagation process, while accounting for uncertainties associated with source parameters, tsunamigenesis and wave propagation dynamics. Within the EuroHPC eFlows4HPC project, we present a High-Performance Computing (HPC) workflow tailored for urgent tsunami computation in which the Probabilistic Tsunami Forecast (PTF) code has been restructured and adapted for seamless integration into a PyCOMPSs framework. This framework enables parallel execution of tasks and includes simulations from Tsunami-HySEA numerical model within a unified computational environment. Of particular significance is the workflow's capability to incorporate new datasets, such as focal mechanism data, seismic records, or real-time tsunami observations. This functionality facilitates an "on-the-fly" update of the PTF, ensuring that the forecasting model remains responsive to the latest information. The development of this workflow involves a systematic exploration of diverse scenarios, realistic simulations, and the assimilation of incoming data. The overarching goal is to rigorously diminish uncertainties, thereby producing updated probabilistic forecasts without compromising precision and enhancing risk mitigation efforts far from the seismic source. Improved risk management, achieved by informing decision-making in emergency situations, underscores the importance of this development. We will showcase the technical advancements undertaken to tailor the workflow for HPC environments, spanning from the developers' perspective to that of the end user. Additionally, we will highlight the scientific enhancements implemented to leverage the full potential of HPC capabilities, aiming to significantly reduce result delivery times while concurrently enhancing the accuracy and precision of our forecasts.
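The ensemble pattern described above can be sketched with PyCOMPSs as follows; the task body is a placeholder, not the PTF or Tsunami-HySEA code, and the scenario parameters are invented for illustration.

```python
"""Minimal PyCOMPSs sketch of the ensemble pattern described above: many
scenario simulations run as parallel tasks, then aggregated into a forecast.
The task body is a placeholder, not the PTF or Tsunami-HySEA codes."""
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

@task(returns=1)
def simulate_scenario(scenario: dict) -> dict:
    # Placeholder for a Tsunami-HySEA run for one source scenario.
    return {"scenario": scenario["id"], "max_wave_height_m": 0.1 * scenario["magnitude"]}

if __name__ == "__main__":
    # Hypothetical source scenarios sampled to cover source uncertainty.
    scenarios = [{"id": i, "magnitude": 7.5 + 0.1 * i} for i in range(10)]
    weights = [1.0] * len(scenarios)

    futures = [simulate_scenario(s) for s in scenarios]  # tasks submitted in parallel
    results = compss_wait_on(futures)                    # synchronise on all results

    # Placeholder probabilistic aggregation: weighted mean of wave heights.
    forecast = sum(w * r["max_wave_height_m"] for w, r in zip(weights, results)) / sum(weights)
    print(f"expected max wave height: {forecast:.2f} m")
```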

How to cite: Cordrie, L., Ejarque, J., Sánchez-Linares, C., Selva, J., Macías, J., Gibbons, S. J., Bernardi, F., Tonini, B., Badia, R. M., Scardigno, S., Lorito, S., Romano, F., Løvholt, F., Volpe, M., D'Anca, A., de la Asunción, M., and Magni, V.: A Dynamic HPC Probabilistic Tsunami Forecast Workflow for Real-time Hazard Assessment, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-11167, https://doi.org/10.5194/egusphere-egu24-11167, 2024.

X2.12 | EGU24-11434 | ECS
Forging collaborative approaches for enhancing Earth Observation product uptake in the Cloud Era.
(withdrawn after no-show)
Vasco Mantas and Andrea Portier
X2.13 | EGU24-15849 | ECS
Rut Blanco-Prieto, Marisol Monterrubio-Velasco, Marta Pienkowska, Jorge Ejarque, Cedric Bhihe, Natalia Zamora, and Josep de la Puente

The Urgent Computing Integrated Services for Earthquakes (UCIS4EQ) introduces a fully automatic seismic workflow centered on rapidly delivering synthetic assessments of the impact of moderate to large earthquakes through physics-based forward simulations. This novel approach links High-Performance Computing (HPC), High-Performance Data Analytics (HPDA), and highly optimized numerical solvers. Its core objective lies in performing numerical simulations either during or right after an earthquake, accomplishing this task within a short timeframe, typically spanning from minutes to a few hours.

During multi-node execution, PyCOMPSs orchestrates UCIS4EQ’s distributed tasks and improves its readiness level towards providing an operational service. UCIS4EQ coordinates the execution of multiple seismic sources to account for input and model uncertainties. Its comprehensive scope provides decision-makers with numerical insights into the potential outcomes of post-earthquake emergency scenarios.

The UCIS4EQ workflow includes a fast inference service based on location-specific pre-trained machine learning models. Such learned models permit a swift analysis and estimation of the potential damage caused by an earthquake. Leveraging advanced AI capabilities endows our workflow with the  ability to rapidly estimate a seismic event's impact. Ultimately it provides valuable support for rapid decision-making during emergencies.

Through the integration of high performance computational techniques and pioneering methodologies, our hope is to see UCIS4EQ emerge as a useful instrument to make agile and well-informed post-event decisions in the face of seismic events.

With this study, we account for UCIS4EQ's continuous development through a number of case studies. These case studies will shed light on the most recent developments and applications of urgent computing seismic workflow, demonstrating its efficacy in providing rapid and precise insights into earthquake scenarios.

How to cite: Blanco-Prieto, R., Monterrubio-Velasco, M., Pienkowska, M., Ejarque, J., Bhihe, C., Zamora, N., and de la Puente, J.: Urgent Computing Integrated Services for Earthquakes , EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-15849, https://doi.org/10.5194/egusphere-egu24-15849, 2024.

X2.14 | EGU24-17007 | ECS
Lukas Kluft and Tobias Kölling

Global kilometer-scale climate models produce vast amounts of output, posing challenges in efficient data utilization. For ICON, we addressed this by creating a consolidated and analysis-ready dataset in the Zarr format, departing from the previous cumbersome directory structure. This new dataset format provides a comprehensive overview of variables and time steps at one glance.

To ensure swift and ergonomic access to the dataset, we employ two key concepts: output hierarchies and multidimensional chunking. We remapped all output onto the HEALPix grid, facilitating hierarchical resolutions, and pre-computed temporal aggregations like daily and monthly averages. This enables users to seamlessly switch between resolutions, reducing computational burdens during post-processing.

Spatial chunking of high-resolution data further allows for efficient extraction of regional subsets, significantly improving the efficiency of common climate science analyses, such as time series and vertical cross-sections. While our efforts primarily integrate established strategies, the synergies achieved in resolution have shown a profound impact on the post-processing efficiency of our global kilometer-scale output.

In summary, our approach, creating a single analysis-ready dataset, pre-computing hierarchies, and employing spatial chunking, addresses challenges in managing and extracting meaningful insights from increasingly large model output. We successfully tested the new analysis-ready datasets during well-attended hackathons, revealing significant usability and performance improvements over a wide range of real-life applications.
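The resulting access pattern might look as follows with xarray; the store paths, zoom levels and variable names are hypothetical, and this is only an illustration of hierarchy plus chunk-aligned selection, not the actual ICON catalogue.

```python
"""Illustrative access pattern for a consolidated, hierarchical Zarr dataset
as described above (store paths, variable and coordinate names are hypothetical
and the stores are assumed to exist already)."""
import xarray as xr

# One store per hierarchy level, same dataset structure at every level.
coarse = xr.open_zarr("icon_output_zoom5.zarr")   # coarse level: global overviews
fine = xr.open_zarr("icon_output_zoom10.zarr")    # fine level: regional detail

# Global, coarse view: a monthly-mean map needs only the small hierarchy level.
monthly_tas = coarse["tas"].resample(time="1MS").mean()

# Regional, high-resolution view: a chunk-aligned selection touches few chunks.
region = fine["tas"].isel(cell=slice(1_000_000, 1_100_000)).sel(time="2020-07")

print(monthly_tas.sizes, region.sizes)
```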

How to cite: Kluft, L. and Kölling, T.: Building a useful dataset for ICON output, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-17007, https://doi.org/10.5194/egusphere-egu24-17007, 2024.

X2.15 | EGU24-20844
Bruno De Paula Kinoshita, Daniel Beltran Mora, Manuel G. Marciani, Luiggi Tenorio Ku, and Miguel Castrillo

The European Digital Twin of the Ocean, EDITO, is an initiative of the European Commission that aims to create a virtual representation of marine and coastal environments around the globe to assess future impacts of climate change and human activities, improving the accessibility of ocean knowledge. EDITO Infra is the backbone of EDITO. It provides the infrastructure where components of the EDITO digital twin are combined and integrated.

In this work, we describe how Autosubmit is integrated and used in EDITO Infra as the back-end component of the Virtual Ocean Model Lab (VOML, a virtual co-working environment). Users of the digital twin can connect to the VOML to customize and build ocean models, run them using cloud and HPC resources, and access applications deployed in EDITO Infra that consume the output of the models. Although Autosubmit is usually run as local software that helps users leverage remote resources, in this case we demonstrate its versatility in a scenario in which it is deployed using Docker and Kubernetes, tools traditionally used in cloud environments.

In this context, Autosubmit acts as middleware between the applications and the HPC and Cloud resources. It manages experiments and workflows, connecting EuroHPC systems such as MareNostrum 5 and Leonardo. It provides integration with different GUIs (via a REST API), GIS systems, and other services.

How to cite: De Paula Kinoshita, B., Beltran Mora, D., G. Marciani, M., Tenorio Ku, L., and Castrillo, M.: Workflows with Autosubmit in EDITO, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-20844, https://doi.org/10.5194/egusphere-egu24-20844, 2024.

X2.16 | EGU24-21138
Ramon Carbonell, Arnau Folch, Antonio Costa, Beata Orlecka-Sikora, Piero Lanucara, Finn Løvholt, Jorge Macías, Sascha Brune, Alice-Agnes Gabriel, Sara Barsotti, Joern Behrens, Jorge Gomez, Jean Schmittbuhl, Carmela Freda, Joanna Kocot, Domenico Giardini, Michael Afanasiev, Helen Glaves, and Rosa Badía

The geophysical research community has developed a relatively large number of numerical codes and scientific methodologies which are able to numerically simulate, through physics, the extreme behavior of Earth systems (for example volcanoes, tsunamis, earthquakes). Furthermore, large volumes of data have by now been acquired, and even near-real-time data streams are accessible. Therefore, Earth scientists currently have at hand the possibility of monitoring these events through sophisticated approaches using the current leading-edge computational capabilities provided by pre-exascale computing infrastructures. The implementation and deployment of 12 Digital Twin Components (DTCs), addressing different aspects of geophysical extreme events, is being carried out by DT-GEO, a project funded under the Horizon Europe programme (2022-2025). Each DTC is intended as a self-contained entity embedding flagship simulation codes, Artificial Intelligence layers, large volumes of (real-time) data streams from and into data lakes, data assimilation methodologies, and overarching workflows, which are executed independently or as coupled DTCs in a centralized HPC and/or virtual cloud computing research infrastructure.

How to cite: Carbonell, R., Folch, A., Costa, A., Orlecka-Sikora, B., Lanucara, P., Løvholt, F., Macías, J., Brune, S., Gabriel, A.-A., Barsotti, S., Behrens, J., Gomez, J., Schmittbuhl, J., Freda, C., Kocot, J., Giardini, D., Afanasiev, M., Glaves, H., and Badía, R.: Digital Twining of Geophysical Extremes, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-21138, https://doi.org/10.5194/egusphere-egu24-21138, 2024.

Enabling reproducibility of results in Earth System Science through improved workflows
X2.17 | EGU24-3070
A workflow solution for coastal hydrodynamic and water quality modelling using Delft3D Flexible Mesh
(withdrawn)
Björn Backeberg, Lőrinc Meszaros, and Sebastian Luna-Valero
X2.18 | EGU24-3295 | ECS
Zhiyi Zhu and Min Chen

Geo-simulation experiments (GSEs) are experiments allowing the simulation and exploration of Earth’s surface (such as hydrological, geomorphological, atmospheric, biological, and social processes and their interactions) with the usage of geo-analysis models (hereafter called ‘models’). Computational processes represent the steps in GSEs where researchers employ these models to analyze data by computer, encompassing a suite of actions carried out by researchers. These processes form the crux of GSEs, as GSEs are ultimately implemented through the execution of computational processes. Recent advancements in computer technology have facilitated sharing models online to promote resource accessibility and environmental dependency rebuilding, the lack of which are two fundamental barriers to reproduction. In particular, the trend of encapsulating models as web services online is gaining traction. While such service-oriented strategies aid in the reproduction of computational processes, they often ignore the association and interaction among researchers’ actions regarding the usage of sequential resources (model-service resources and data resources); documenting these actions can help clarify the exact order and details of resource usage. Inspired by these strategies, this study explores the organization of computational processes, which can be extracted with a collection of action nodes and related logical links (node-link ensembles). The action nodes are the abstraction of the interactions between participant entities and resource elements (i.e., model-service resource elements and data resource elements), while logical links represent the logical relationships between action nodes. In addition, the representation of actions, the formation of documentation, and the reimplementation of documentation are interconnected stages in this approach. Specifically, the accurate representation of actions facilitates the correct performance of these actions; therefore, the operation of actions can be documented in a standard way, which is crucial for the successful reproduction of computational processes based on standardized documentation. A prototype system is designed to demonstrate the feasibility and practicality of the proposed approach. By employing this pragmatic approach, researchers can share their computational processes in a structured and open format, allowing peer scientists to re-execute operations with initial resources and reimplement the initial computational processes of GSEs via the open web.

How to cite: Zhu, Z. and Chen, M.: Reproducing computational processes in service-based geo-simulation experiments, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-3295, https://doi.org/10.5194/egusphere-egu24-3295, 2024.

X2.19 | EGU24-15382 | ECS
Mirko Mälicke, Alexander Dolich, Ashish Manoj Jaseetha, Balazs Bischof, and Lucas Reid

We propose a framework-agnostic specification for contextualizing Docker containers in environmental research. Given a scientific context, containers are especially useful to combine scripts in different languages following different development paradigms. 

The specification standardizes inputs to and outputs from containers to ease the development of new tools, retrace results and add a provenance context to scientific workflows. As of now, we also provide templates for the implementation of new tools developed in Python, R, Octave and NodeJS, two different server applications to run the containers in a local or remote setting, and a Python client to seamlessly include containers in existing workflows. A Flutter template is in development, which can be used as a basis to build use-case-specific applications for Windows, Linux, Mac, the Web, Android and iOS.

We present the specification itself, with a focus on ways of contributing, to align the specification with as many geoscientific use cases as possible in the future. In addition, a few insights into current implementations are given, namely the role of the compliant pre-processing tools in the generation of the CAMELS-DE dataset, as well as the presentation of results from a machine learning application for predicting soil moisture. Both applications are presented at EGU as well. We use these examples to demonstrate how the framework can increase the reproducibility of associated workflows.
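The general container pattern, standardized input/output mounts plus a parameter file, can be sketched with the Docker SDK for Python; the image name and directory layout below are hypothetical and do not reproduce the actual specification.

```python
"""Sketch of the general container pattern described above, using the Docker SDK
for Python: standardized /in and /out mounts plus a parameter file. The image
name and directory layout are hypothetical, not the actual specification."""
import json
from pathlib import Path
import docker  # pip install docker

workdir = Path("run_0001").absolute()
(workdir / "in").mkdir(parents=True, exist_ok=True)
(workdir / "out").mkdir(exist_ok=True)

# Standardized parameter file handed to the containerized tool.
(workdir / "in" / "parameters.json").write_text(json.dumps({"variable": "soil_moisture"}))

client = docker.from_env()
logs = client.containers.run(
    image="example/tool-image:latest",        # hypothetical tool image
    volumes={
        str(workdir / "in"): {"bind": "/in", "mode": "ro"},
        str(workdir / "out"): {"bind": "/out", "mode": "rw"},
    },
    remove=True,
)
print(logs.decode())
print(sorted(p.name for p in (workdir / "out").iterdir()))  # results written by the tool
```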

How to cite: Mälicke, M., Dolich, A., Manoj Jaseetha, A., Bischof, B., and Reid, L.: Using Docker for reproducible workflows, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-15382, https://doi.org/10.5194/egusphere-egu24-15382, 2024.

X2.20 | EGU24-17154 | ECS
Niklas Böing, Johannes Holke, Chiara Hergl, Achim Basermann, and Gregor Gassner

Large-scale earth system simulations produce huge amounts of data. Due to limited I/O bandwidth and available storage space this data often needs to be reduced before writing to disk or storing permanently. Error-bounded lossy compression is an effective approach to tackle the trade-off between accuracy and storage space.

We are exploring and discussing error-bounded lossy compression based on tree-based adaptive mesh refinement (AMR) techniques. According to flexible error criteria, the simulation data are coarsened until a given error bound is reached. This reduces the number of mesh elements and data points significantly.

The error criterion may for example be an absolute or relative point-wise error. Since the compression method is closely linked to the mesh we can additionally incorporate geometry information - for example varying the error by geospatial region.

We implement these techniques as the open source tool cmc, which is based on the parallel AMR library t8code. The compression tool can be linked to and used by arbitrary simulation applications or executed as a post-processing step. As a first example, we couple our compressor with the MESSy and MPTRAC libraries.

We show different results including the compression of ERA5 data. The compressed sample datasets show better results in terms of file size than conventional compressors such as SZ and ZFP. In addition, our method allows for a more fine-grained error control.
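The underlying idea of error-bounded coarsening can be illustrated on a 1-D series with a toy Python sketch; this is not the cmc/t8code AMR algorithm, only the basic principle.

```python
"""Toy illustration of error-bounded coarsening on a 1-D series: neighbouring
pairs are merged into their mean as long as an absolute error bound holds.
This is only the underlying idea, not the cmc/t8code AMR algorithm."""
import numpy as np

def coarsen_once(values: np.ndarray, abs_tol: float) -> np.ndarray:
    out = []
    i = 0
    while i < len(values):
        if i + 1 < len(values) and abs(values[i] - values[i + 1]) / 2.0 <= abs_tol:
            out.append((values[i] + values[i + 1]) / 2.0)  # merge: pointwise error within bound
            i += 2
        else:
            out.append(values[i])                           # keep: bound would be violated
            i += 1
    return np.asarray(out)

data = np.sin(np.linspace(0.0, np.pi, 1024)) + 0.001 * np.random.randn(1024)
coarse = data
for _ in range(4):  # repeated passes accumulate error; a real compressor
    coarse = coarsen_once(coarse, abs_tol=0.01)  # tracks the bound against the original data

print(f"{len(data)} values reduced to {len(coarse)}")
```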

How to cite: Böing, N., Holke, J., Hergl, C., Basermann, A., and Gassner, G.: Adaptive Data Reduction Techniques for Extreme-Scale Atmospheric Models, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-17154, https://doi.org/10.5194/egusphere-egu24-17154, 2024.

X2.21 | EGU24-18057 | ECS
Jack Atkinson, Dominic Orchard, Elliott Kasoar, and Thomas Meltzer

Numerical models across the geophysical sciences make extensive use of physical parameterisations to represent sub-grid processes such as eddies, convection, gravity waves, and microphysics. With the advent of machine learning (ML), big data, and artificial intelligence there are increasing efforts to develop data-driven parameterisation schemes or emulate existing, but computationally-intensive, parameterisations using ML techniques.

Despite their name, the irony is that traditional design approaches often lead to parameterisations which are not themselves parameters, but rather tightly integrated, enmeshed pieces of code in a larger model. Such parameterisations and practices pose a threat to the reproducibility, accuracy, ease-of-(re)-use, and general FAIRness (Findable, Accessible, Interoperable, Reusable) [1] of the schemes being developed.

In contrast, a modular approach to parameterisations (and their receivers, e.g., GCMs), would enable them to be more easily (1) interchangeable and Interoperable, to compare different schemes and assess uncertainty due to their inherent approximate behaviour, and (2) portable and Reusable, between models, to reduce engineering effort, and (3) understandable and Accessible, by being clear about dependencies and the physical meaning of program variables, and (4) testable, to aid verification and correctness testing. 

Achieving this goal in the context of numerical modelling brings a number of scientific, engineering, and computational challenges. In this talk we aim to set out some best-practice principles for achieving modular parameterisation design for geoscience modellers. We then focus on the particular challenges around modern ML parameterisations.

To this end we have developed FTorch [2], a library for easily coupling ML-based parameterisations written in PyTorch to Fortran-based numerical models. By reducing the Fortran-PyTorch Interoperability burden on researchers, this framework should reduce the errors that arise and increase the speed of development compared with other approaches such as re-coding ML models in Fortran. FTorch aims to make emerging ML parameterisation research more Accessible to those who may not have deep expertise in ML, Fortran, and/or computer science. It also means that models developed using PyTorch can leverage its feature-rich libraries and be shared in their native format, maximising Reusability. We discuss the design principles behind FTorch in the context of modular parameterisations, and demonstrate our principles and approach by coupling ML parameterisations to atmospheric and climate models.
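
As a sketch of the PyTorch side of such a coupling, a model can be exported as TorchScript, the form that FTorch is designed to load from Fortran; the toy network and file name below are purely illustrative:

    import torch

    # A toy stand-in for an ML parameterisation (hypothetical architecture).
    class ToyParameterisation(torch.nn.Module):
        def __init__(self, n_in=10, n_out=5):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(n_in, 64), torch.nn.ReLU(),
                torch.nn.Linear(64, n_out))

        def forward(self, column_state: torch.Tensor) -> torch.Tensor:
            return self.net(column_state)

    model = ToyParameterisation().eval()
    # Trace the model with an example input and save it as TorchScript;
    # the saved file is what the Fortran side then loads via FTorch.
    example = torch.zeros(1, 10)
    scripted = torch.jit.trace(model, example)
    scripted.save("toy_parameterisation.pt")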

In general, we present a number of considerations that could be used to make all parameterisation schemes more easily Interoperable and Reusable by their developers.

 

[1] Barker, M., et al.: Introducing the FAIR Principles for research software, Sci. Data 9, 622 (2022), https://doi.org/10.1038/s41597-022-01710-x

[2] FTorch https://github.com/Cambridge-ICCS/FTorch

How to cite: Atkinson, J., Orchard, D., Kasoar, E., and Meltzer, T.: Tools and techniques for modular, portable (Machine Learning) parameterisations, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-18057, https://doi.org/10.5194/egusphere-egu24-18057, 2024.

X2.22
|
EGU24-20519
Charlotte Pascoe, Martina Stockhause, Graham Parton, Ellie Fisher, Molly MacRae, Beate Kreuss, and Lina Sitz

Many of the figures in the WGI contribution to the IPCC Sixth Assessment Report (AR6) are derived from the data of multiple CMIP6 simulations. For instance, a plot showing projections of global temperature change in Figure 2 of Chapter 4 of the IPCC AR6 is based on data from 183 CMIP6 simulation datasets. The figure helpfully tells us which CMIP6 experiments were used as input data, but it does not provide information about the models that ran the simulations. It is possible to deduce the specific input data from supplementary tables in the IPCC assessment report and from the report's annexes. However, these information sources are not machine-accessible, so they are difficult to use for tracing; they are not sufficient to assign credit, as they do not enter indexing services; and they are difficult to find, as they are not part of the printed report. Even if we gather this knowledge to create a navigable provenance network for the figure, we are still left with the unwieldy prospect of rendering 183 data citations for an outwardly simple plot.

We require a compact way to provide traceable provenance for large input-data networks, one that makes transparent the specific input data used to create the CMIP6-based figures in IPCC AR6 and gives credit to modelling centres for the effort of running the simulations. This is the so-called complex citation problem discussed within the RDA Complex Citation Working Group.

We present a pragmatic solution to the complex citation challenge that uses an existing public infrastructure technology, Zenodo. The work establishes traceability by collating references to a figure's input datasets within a Zenodo record, and credit via Zenodo's related-works feature (DataCite relations), which links to existing data objects through Persistent Identifiers (PIDs), in this case the CMIP6 data citations. Whilst a range of PIDs exist to connect objects, DOIs are widely used for citations and are well connected within the wider PID graph landscape, and Zenodo provides a tool to create objects that use the DOI schema provided by DataCite. CMIP6 data citations have sufficient granularity to assign credit, but the granularity is not fine enough for traceability; therefore, Zenodo reference handle groups are used to identify specific input datasets, and Zenodo connected objects provide the join between them.
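
A minimal sketch of this pattern against the public Zenodo REST deposit API is given below; the access token, creator and DOIs are placeholders, and the exact metadata fields should be checked against the current Zenodo documentation:

    import requests

    ZENODO_API = "https://zenodo.org/api/deposit/depositions"
    TOKEN = "YOUR-ZENODO-TOKEN"  # placeholder

    # Hypothetical CMIP6 data-citation DOIs for one figure's input datasets.
    input_dois = ["10.22033/ESGF/CMIP6.XXXX", "10.22033/ESGF/CMIP6.YYYY"]

    metadata = {
        "metadata": {
            "title": "Input dataset references for an AR6 WGI figure (illustrative)",
            "upload_type": "other",
            "description": "Collation of the CMIP6 datasets underlying one AR6 figure.",
            "creators": [{"name": "Doe, Jane"}],
            # DataCite relation types link the record to the cited datasets.
            "related_identifiers": [
                {"identifier": doi, "relation": "cites"} for doi in input_dois
            ],
        }
    }

    resp = requests.post(ZENODO_API, params={"access_token": TOKEN}, json=metadata)
    resp.raise_for_status()
    print("Created deposition", resp.json()["id"])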

There is still work to be done to establish full visibility of credit referenced within the Zenodo records.  However, we hope to engage the community by presenting our pragmatic solution to the complex citation challenge, one that has the potential to provide modelling centres with a route to a more complete picture of the impact of their simulations.

How to cite: Pascoe, C., Stockhause, M., Parton, G., Fisher, E., MacRae, M., Kreuss, B., and Sitz, L.: A pragmatic approach to complex citations, closing the provenance gap between IPCC AR6 figures and CMIP6 simulations, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-20519, https://doi.org/10.5194/egusphere-egu24-20519, 2024.

Posters virtual: Thu, 18 Apr, 14:00–15:45 | vHall X2

Display time: Thu, 18 Apr, 08:30–Thu, 18 Apr, 18:00
Chairpersons: Miguel Castrillo, Karsten Peters-von Gehlen
Advanced Workflow Strategies in High-Performance Computing for Earth Sciences
vX2.2
|
EGU24-11774
Alessandro D'Anca, Sonia Scardigno, Jorge Ejarque, Gabriele Accarino, Daniele Peano, Francesco Immorlano, Davide Donno, Enrico Scoccimarro, Rosa M. Badia, and Giovanni Aloisio

Advances in Earth System Models (ESMs), together with the availability of more powerful computing infrastructures and novel solutions for Big Data and Machine Learning (ML), are pushing research on climate change forward. In this context, workflows are fundamental tools for automating the complex processes of model simulation, data preparation and analysis. Such tools become ever more important as the complexity and heterogeneity of software and computing infrastructures, as well as the data volumes to be handled, grow. However, integrating simulation-centric and data-centric processes into a single workflow can be very challenging due to their different requirements.
This work presents an end-to-end workflow, developed in the context of the eFlows4HPC EuroHPC project, spanning the steps from the numerical ESM simulation run to the analysis of extreme weather events (e.g., heat waves and tropical cyclones). It represents a real case study that requires High-Performance Computing (HPC), Big Data and ML components to carry out the workflow. In particular, the contribution demonstrates how the eFlows4HPC software stack can simplify the development, deployment, orchestration and execution of complex end-to-end workflows for climate science, as well as improve their portability across different computing infrastructures.
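
A minimal sketch of how such an end-to-end chain might be expressed, assuming the PyCOMPSs programming model that underpins the eFlows4HPC software stack; the task names and bodies are placeholders, and the script would be launched through the COMPSs runtime:

    from pycompss.api.task import task
    from pycompss.api.parameter import FILE_IN, FILE_OUT
    from pycompss.api.api import compss_wait_on

    @task(output_nc=FILE_OUT)
    def run_esm_simulation(config, output_nc):
        # Placeholder: launch the ESM run and write its output to output_nc.
        ...

    @task(raw_nc=FILE_IN, prepared_nc=FILE_OUT)
    def prepare_data(raw_nc, prepared_nc):
        # Placeholder: regrid / subset the simulation output.
        ...

    @task(prepared_nc=FILE_IN, returns=1)
    def detect_extremes(prepared_nc):
        # Placeholder: ML-based detection of heat waves / tropical cyclones.
        return {"events": []}

    def main():
        run_esm_simulation("experiment.yaml", "raw.nc")
        prepare_data("raw.nc", "prepared.nc")
        events = detect_extremes("prepared.nc")
        events = compss_wait_on(events)  # synchronise on the final result
        print(events)

    if __name__ == "__main__":
        main()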

How to cite: D'Anca, A., Scardigno, S., Ejarque, J., Accarino, G., Peano, D., Immorlano, F., Donno, D., Scoccimarro, E., Badia, R. M., and Aloisio, G.: An end-to-end workflow for climate data management and analysis integrating HPC, Big Data and Machine Learning, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-11774, https://doi.org/10.5194/egusphere-egu24-11774, 2024.

vX2.3
|
EGU24-21636
Rafael Ferreira da Silva, Ketan Maheshwari, Tyler Skluzacek, Renan Souza, and Sean Wilkinson

The advancement of science is increasingly intertwined with complex computational processes [1]. Scientific workflows are at the heart of this evolution, acting as essential orchestrators for a vast range of experiments. Specifically, these workflows are central to the field of computational Earth Sciences, where they orchestrate a diverse range of activities, from cloud-based data preprocessing pipelines in environmental modeling to intricate multi-facility instrument-to-edge-to-HPC computational frameworks for seismic data analysis and geophysical simulations [2].

The emergence of continuum and cross-facility workflows marks a significant evolution in the computational sciences [3]. Continuum workflows represent the continuous computing access required by analysis pipelines, while cross-facility workflows extend across multiple sites, integrating experimental and computing facilities. Cross-facility workflows, crucial for real-time applications, offer resiliency and stand as solutions to the demands of continuum workflows. Addressing continuum and cross-facility computing requires a focus on data, ensuring that workflow systems are equipped to handle diverse data representations and storage systems.

As we navigate the computing continuum, the pressing needs of contemporary scientific applications in the Earth Sciences call for a dual approach: the recalibration of existing systems and the innovation of new workflow functionalities. This recalibration involves optimizing data-intensive operations and incorporating advanced algorithms for spatial data analysis, while innovation may entail the integration of machine learning techniques for predictive modeling and real-time data processing in the Earth Sciences. We offer a comprehensive overview of cutting-edge advancements in this dynamic realm, with a focus on computational Earth Sciences, including managing the increasing volume and complexity of geospatial data, ensuring the reproducibility of large-scale simulations, and adapting workflows to leverage emerging computational architectures.

 

[1] Ferreira da Silva, R., Casanova, H., Chard, K., Altintas, I., Badia, R. M., Balis, B., Coleman, T., Coppens, F., Di Natale, F., Enders, B., Fahringer, T., Filgueira, R., et al. (2021). A Community Roadmap for Scientific Workflows Research and Development. 2021 IEEE Workshop on Workflows in Support of Large-Scale Science (WORKS), 81–90. DOI: 10.1109/WORKS54523.2021.00016


[2] Badia Sala, R. M., Ayguadé Parra, E., & Labarta Mancho, J. J. (2017). Workflows for science: A challenge when facing the convergence of HPC and big data. Supercomputing frontiers and innovations, 4(1), 27-47. DOI: 10.14529/jsfi170102


[3] Antypas, K. B., Bard, D. J., Blaschke, J. P., Canon, R. S., Enders, B., Shankar, M. A., ... & Wilkinson, S. R. (2021, December). Enabling discovery data science through cross-facility workflows. In 2021 IEEE International Conference on Big Data (Big Data) (pp. 3671-3680). IEEE. DOI: 10.1109/BigData52589.2021.9671421

How to cite: Ferreira da Silva, R., Maheshwari, K., Skluzacek, T., Souza, R., and Wilkinson, S.: Advancing Computational Earth Sciences: Innovations and Challenges in Scientific HPC Workflows, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-21636, https://doi.org/10.5194/egusphere-egu24-21636, 2024.

Enabling reproducibility of results in Earth System Science through improved workflows
vX2.4
|
EGU24-9381
Fabrizio Antonio, Mattia Rampazzo, Ludovica Sacco, Paola Nassisi, and Sandro Fiore

Provenance and reproducibility are two key requirements for analytics workflows in Open Science contexts. Handling provenance at different levels of granularity and during the entire experiment lifecycle becomes key to properly and flexibly managing lineage information related to large-scale experiments as well as enabling reproducibility scenarios, which in turn foster re-usability, one of the FAIR guiding data principles.

This contribution focuses on a multi-level approach applied to climate analytics experiments as a way to manage provenance information in a more structured and multifaceted manner and to navigate and explore the provenance space across multiple dimensions, thus making it possible to obtain coarse- or fine-grained information according to the requested level. Specifically, the yProv multi-level provenance service, a new core component within an Open Science-enabled research data lifecycle, is introduced by highlighting its design, main features and graph-based data model. Moreover, a climate model intercomparison data analysis use case is presented to showcase how to retrieve and visualize fine-grained provenance information, namely micro-provenance, compliant with the W3C PROV specifications.
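
A minimal sketch of recording such lineage with the W3C PROV data model, here using the open-source Python prov package with a hypothetical namespace and identifiers (the yProv service's own API is not shown):

    from prov.model import ProvDocument

    doc = ProvDocument()
    doc.add_namespace("ex", "https://example.org/climate#")  # hypothetical namespace

    # Entities: an input CMIP6 dataset and the derived anomaly field.
    ds_in = doc.entity("ex:tas_cmip6_historical_r1i1p1f1")
    ds_out = doc.entity("ex:tas_anomaly_1981_2010")

    # Activity and agent: the analysis step and the researcher running it.
    analyst = doc.agent("ex:jane_doe")
    step = doc.activity("ex:compute_anomaly")

    # Relations linking inputs, outputs, activity and agent.
    doc.used(step, ds_in)
    doc.wasGeneratedBy(ds_out, step)
    doc.wasDerivedFrom(ds_out, ds_in)
    doc.wasAssociatedWith(step, analyst)

    # Serialise to PROV-JSON for storage or exchange.
    print(doc.serialize(indent=2))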

This work was partially funded by the EU InterTwin project (Grant Agreement 101058386) and the EU Climateurope2 project (Grant Agreement 101056933), and partially under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.4 - Call for tender No. 1031 of 17/06/2022 of the Italian Ministry for University and Research, funded by the European Union – NextGenerationEU (proj. nr. CN_00000013).

How to cite: Antonio, F., Rampazzo, M., Sacco, L., Nassisi, P., and Fiore, S.: A Multi-level Approach for Provenance Management And Exploration in Climate Workflows, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-9381, https://doi.org/10.5194/egusphere-egu24-9381, 2024.

vX2.5
|
EGU24-19437
Arndt Meier, Guillaume Monteil, Ute Karstens, Marko Scholze, and Alex Vermeulen

The Integrated Carbon Observation System (ICOS) and the Lund University Department of Physical Geography and Ecosystem Science have developed the Lund University Modular Inversion Algorithm (LUMIA), which is being used for the inverse modelling of carbon and of isotope-resolved methane. The work is linked with past and present Horizon 2020 projects such as DICE and AVENGERS, with the overarching aim of supporting the goals of the Paris Agreement.

The ICOS Carbon Portal (https://data.icos-cp.eu/portal) collects, maintains and supports a large range of greenhouse gas observations as well as some inventory and model data, each of which has a unique persistent identifier, a key prerequisite for achieving a reproducible workflow.

Here we present a self-documenting, fully reproducible workflow for our inverse carbon model LUMIA, which is based on frameworks discussed in EU initiatives such as Copernicus CAMS and CoCO2, as well as on our own experience with workflows routinely used in Australian court cases against illegal land-use changes. We will give a live demonstration of the system, including its graphical user interfaces and the provenance and reproducibility metadata it creates.
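
As a generic illustration of the self-documenting idea (not the actual LUMIA or ICOS Carbon Portal interface), a run-level provenance record might capture the configuration hash together with the persistent identifiers of all input datasets; the file names and PIDs below are hypothetical:

    import hashlib
    import json
    import time

    def record_run_provenance(config_path, input_pids, outfile="run_provenance.json"):
        """Write a small provenance record: config hash plus input-data PIDs."""
        with open(config_path, "rb") as fh:
            config_hash = hashlib.sha256(fh.read()).hexdigest()
        record = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "config_sha256": config_hash,
            "input_pids": sorted(input_pids),  # e.g. handle-style dataset PIDs
        }
        with open(outfile, "w") as fh:
            json.dump(record, fh, indent=2)
        return record

    # Usage (hypothetical configuration file and PIDs):
    # record_run_provenance("lumia_run.yaml",
    #                       ["https://hdl.handle.net/11676/AAAA",
    #                        "https://hdl.handle.net/11676/BBBB"])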

How to cite: Meier, A., Monteil, G., Karstens, U., Scholze, M., and Vermeulen, A.: A fully automated reproducible self-documenting workflow implemented in our inverse regional carbon model, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-19437, https://doi.org/10.5194/egusphere-egu24-19437, 2024.