ESSI3.5 | Enabling Reproducibility in Earth System Science research through open and FAIR data, workflows and models

Awareness of the importance of the reproducibility of research results has increased considerably in recent years. Knowledge must be robust and reliable in order to serve as a foundation for further progress. Reproducibility is a complex topic that spans technology, research communities, and research culture. In the narrow sense, reproducibility refers to the possibility of another researcher independently achieving the same result with identical data and computational methods. Put simply, one could say that research is either reproducible or not; in practice, however, there is a continuum of reproducibility in which some factors influence results more heavily than others. Replicability (or replication), on the other hand, is a broader term and refers to arriving at consistent findings when a study is repeated with new data or independent methods. A persistent problem is that a large percentage of existing studies cannot be successfully reproduced or replicated. This endangers trust in science.

However, with the increasing complexity, volume and variety of Earth System Science (ESS) data - where data can be of multiple types, such as source code, entire workflows, and observational or model output data - and the continuing push towards compliance with the FAIR data principles, achieving reproducibility is challenging. Dedicated solutions exist only for a subset of implementation factors, and are mostly focused on single institutions or infrastructure providers. Current developments to establish FAIR Digital Objects (FDOs) and corresponding frameworks go one step further towards a globally interoperable data space that enables scientific reproducibility. The adoption of Artificial Intelligence (AI), especially machine learning (ML), and other computationally intensive processes complicates this even further.

This session will explore current practices, methods and tools geared towards enabling reproducible results and workflows in ESS. We invite contributions from the areas of infrastructures, infrastructure requirements, workflow frameworks, software/tools, descriptions of practices, or other aspects (e.g. provenance tracking, quality information management, FDOs, AI/ML) that must be considered in order to achieve and enable reproducibility in Earth System Sciences. Contributions may be generally valid and/or transferable, or may focus on particular areas of application. Finally, best-practice examples (or, as counter-examples, bad practices) are also invited.

Co-organized by CL5/GI1/OS5
Convener: Karsten Peters-von Gehlen | Co-conveners: Christin Henzen (ECS), Rebecca Farrington (ECS), Philippe Bonnet, Klaus Zimmermann, Joan Masó
Orals | Tue, 25 Apr, 10:45–12:30 (CEST) | Room 0.51
Posters on site | Attendance Wed, 26 Apr, 08:30–10:15 (CEST) | Hall X4
Posters virtual | Attendance Wed, 26 Apr, 08:30–10:15 (CEST) | vHall ESSI/GI/NP

Orals: Tue, 25 Apr | Room 0.51

Chairpersons: Karsten Peters-von Gehlen, Klaus Zimmermann, Joan Masó
10:45–10:50
10:50–11:00 | EGU23-2744 | ESSI3.5 | solicited | Virtual presentation
Alessandro Spinuso, Ian van der Neut, Mats Veldhuizen, Christian Pagé, and Daniele Bailo

Scientific progress requires research outputs to be reproducible, or at least persistently traceable and analysable for defects through time. This can be facilitated by coupling analysis tools that are already familiar to scientists, with reproducibility controls designed around common containerisation technologies and formats to represent metadata and provenance. Moreover, modern interactive tools for data analysis and visualisation, such as computational notebooks and visual analytics systems, are built to expose their functionalities through the Web. This facilitates the development of integrated solutions that are designed to support computational research with reproducibility in mind, and that, once deployed onto a Cloud infrastructure, benefit from operations that are securely managed and perform reliably. Such systems should be able to easily accommodate specific requirements concerning, for instance, the deployment of particular scientific software and the collection of tailored, yet comprehensive, provenance recordings about data and processes. By decoupling and generalising the description of the environment where a particular research took place from the underlying implementation, which may become obsolete through time, we obtain better chances to recollect relevant information for the retrospective analysis of a scientific product in the long term, enhancing preservation and reproducibility of results.

In this contribution we illustrate how this is achievable via the adoption of microservice architectures combined with a provenance model that supports metadata standards and templating. We aim to empower scientific data portals with Virtual Research Environments (VREs) and provenance services that are programmatically controlled via high-level functions over the internet. Our system SWIRRL deals, on behalf of the clients, with the complexity of allocating the interactive services for the VREs on a Cloud platform. It runs staging and preprocessing workflows to gather and organise remote datasets, making them accessible collaboratively. We show how Provenance Services manage provenance records about the underlying environment, datasets and analysis workflows, and how these are exploited by researchers to control different reproducibility use cases. Our solutions are currently being implemented in further Earth Science contexts. We will provide an overview of the progress of these efforts for the EPOS and IS-ENES research infrastructures, addressing solid-earth and climate studies, respectively.

Finally, although the reproducibility challenges can be tackled to a large extent by modern technology, this will be further consolidated and made interoperable via the implementation and uptake of the FDOs. To achieve this goal, it is fundamental to establish the conversation between engineers, data-stewards and researchers early in the process of delivering a scientific product. This fosters the definition and implementation of suitable best practices to be adopted by a particular research group. Scientific tools and repositories built around modern FAIR enabling resources can be incrementally refined thanks to this mediated exchange. We will briefly introduce success stories towards this goal in the context of the IPCC Assessment Reports.

How to cite: Spinuso, A., van der Neut, I., Veldhuizen, M., Pagé, C., and Bailo, D.: Provenance powered microservices: a flexible and generic approach fostering reproducible research in Earth Science, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-2744, https://doi.org/10.5194/egusphere-egu23-2744, 2023.

11:00–11:10 | EGU23-8321 | ESSI3.5 | Virtual presentation
Lucy Bastin, Owen Reynolds, Antonio Garcia-Dominguez, and James Sprinks

Evaluating the quality of data is a major concern within the scientific community: before using any dataset for study, a careful judgement of its suitability must be conducted. This requires that the steps followed to acquire, select, and process the data have been thoroughly documented in a methodical manner, in a way that can be clearly communicated to the rest of the community. This is particularly important in the field of citizen science, where a project that can clearly demonstrate its protocols, transformation steps, and quality assurance procedures has a much better chance of achieving social and scientific impact through the use and re-use of its data.

A number of specifications have been created to provide a common set of concepts and terminology, such as ISO 19115-3 or W3C PROV. These define a set of interchange formats, but in themselves, they do not provide tooling to create high-quality dataset descriptions. The existing tools built on these standards (e.g. GeoNetwork, USGS metadata wizard, CKAN) are overly complex for some users (for example, many citizen science project managers) who, despite being experts in their own fields, may be unfamiliar with the structure and context of metadata standards or with semantic modelling. 

In this presentation, we will describe a prototype authoring tool that was created using a model-driven engineering (MDE) software development methodology. The tool was authored using JetBrains Meta Programming System (MPS) to implement a modelling language based on the ISO 19115-3 model. A user is provided with a “text-like” editing environment, which assists with the formal structures needed to produce a machine-parsable document.

This allows a user to easily describe data lineage and generic processing steps while reusing recognised external vocabularies with automated validation, autocompletion, and transformation to external formats (e.g. the XML format 19115-3 or JSON-LD). We will report on the results of user testing aimed at making the tool accessible to citizen scientists (through dedicated projections with simplified structures and dialogue-driven model creation) and evaluating with those users any new possibilities that comprehensive and machine-parsable provenance information may create for data integration and sharing. The prototype will also serve as a test pilot of the integration between ISO 19115-3 and existing/upcoming third-party vocabularies (such as the upcoming ISO data quality measures registry).
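To make the interchange formats mentioned above concrete, the following is a minimal sketch of a W3C PROV-style provenance record in the PROV-JSON layout; all entity, activity and relation identifiers are hypothetical examples, not output of the tool described here.

```python
import json

# Minimal sketch of a W3C PROV-style record (PROV-JSON layout).
# The "ex:" identifiers below are invented for illustration only.
prov_record = {
    "entity": {
        "ex:raw_observations": {"prov:label": "Raw sensor CSV"},
        "ex:cleaned_dataset": {"prov:label": "Quality-controlled dataset"},
    },
    "activity": {
        "ex:qc_step": {"prov:label": "Outlier removal and unit conversion"},
    },
    # cleaned_dataset was generated by the QC activity...
    "wasGeneratedBy": {
        "_:g1": {"prov:entity": "ex:cleaned_dataset",
                 "prov:activity": "ex:qc_step"},
    },
    # ...which in turn used the raw observations as input.
    "used": {
        "_:u1": {"prov:activity": "ex:qc_step",
                 "prov:entity": "ex:raw_observations"},
    },
}

# Serialise for exchange with other tools or transformation to JSON-LD.
serialised = json.dumps(prov_record, indent=2)
print(serialised[:40])
```

Records of this shape are what an authoring tool can validate and transform into the external formats (XML ISO 19115-3, JSON-LD) named in the abstract.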

How to cite: Bastin, L., Reynolds, O., Garcia-Dominguez, A., and Sprinks, J.: Facilitating provenance documentation with a model-driven-engineering approach., EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8321, https://doi.org/10.5194/egusphere-egu23-8321, 2023.

11:10–11:20 | EGU23-16288 | ESSI3.5 | Highlight | On-site presentation
Graham Parton, Barbara Brooks, Ag Stephens, and Wendy Garland

Within the UK, the National Centre for Atmospheric Science (NCAS) operates a suite of observational instruments for atmospheric dynamics, chemistry and composition studies. These are principally made available through two facilities: the Atmospheric Measurement and Observations Facility (AMOF) and the Facility for Airborne Atmospheric Measurements (FAAM). Between these two facilities, instrumentation can be deployed either on campaigns or long-term, in diverse environments (from polar to maritime; surface to high altitude), on a range of platforms (aircraft, ships) or at dedicated atmospheric observatories.

The wide range of instruments, spanning an operational time period from the mid 1990s to present, has traditionally been orientated to specific communities, resulting in a plethora of different operational practices, data standards and workflows. The resulting data management and usage challenges have been further exacerbated over time by changes of staff, instruments and end-user communities and their requirements. This has been accompanied by the wider end-user community seeking greater access to and improved use of the data, with necessary associated improvements in data production to ensure transparency, quality, veracity and, thus, overall reproducibility. Additionally, these enhanced workflows further ensure FAIR data outputs, widening long-term re-use of the data.

Seeking to address these challenges in a more harmonised way across the AMOF and FAAM facilities, NCAS established the NCAS Data Project in 2018, bringing together key players in the data workflows to break down barriers and establish common standards and procedures through improved dialogue. The resulting NCAS ‘Data Pyramid’ approach brings together representatives from the data provider, data archive and end-user communities, alongside supporting software engineers, within a common framework that enables cross-working between all partners. This has led to new data standards and workflows being established to ensure three key objectives: 1) capturing and conveying the necessary metadata to automate data flows and quality control as much as possible, in a timely fashion, ‘from field to end-user’; 2) enhanced transparency and traceability in data production via linked, externally visible documentation, calibration and code repositories; and 3) data products meeting end-user requirements in terms of their content and established quality control. Finally, data workflows are further enhanced thanks to scriptable conformance checking throughout the data production lifecycle, built on the controlled data product and metadata standards.
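Scriptable conformance checking of the kind described could, in outline, look like the following sketch. The required-attribute list and the processing-level rule are hypothetical illustrations, not the actual NCAS standard.

```python
# Sketch of a file-level metadata conformance check; the required
# attributes here are illustrative, not the actual NCAS data standard.
REQUIRED_ATTRS = {"source", "institution", "instrument", "processing_level"}

def check_conformance(metadata: dict) -> list:
    """Return a list of human-readable problems; an empty list means pass."""
    problems = [f"missing attribute: {name}"
                for name in sorted(REQUIRED_ATTRS - metadata.keys())]
    # Example of a controlled-vocabulary rule layered on top.
    if metadata.get("processing_level") not in {"0", "1", "2", "3"}:
        problems.append("processing_level must be one of 0-3")
    return problems

good = {"source": "AMOF", "institution": "NCAS",
        "instrument": "ceilometer-1", "processing_level": "1"}
bad = {"source": "AMOF"}

print(check_conformance(good))  # []
print(check_conformance(bad))
```

Running such a check at every stage of the production lifecycle is what makes conformance enforceable ‘from field to end-user’ rather than a one-off archive gate.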

Thus, through the established workflows of the NCAS Data Project, the necessary details are captured and conveyed by both internal file-level and catalogue-level metadata to ensure that all three corners of the triangle of reproducibility, quality information, and provenance can be achieved in combination.

How to cite: Parton, G., Brooks, B., Stephens, A., and Garland, W.: The UK’s NCAS Data Project: establishing transparent observational data workflows from field to user, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-16288, https://doi.org/10.5194/egusphere-egu23-16288, 2023.

11:20–11:30 | EGU23-15384 | ESSI3.5 | Highlight | On-site presentation
Daniel Nüst, Frank O. Ostermann, and Carlos Granell

The Reproducible AGILE initiative (https://reproducible-agile.github.io/) successfully established a code execution procedure following the CODECHECK principles (https://doi.org/10.12688/f1000research.51738.2) at the AGILE conference series (https://agile-online.org/conference). The AGILE conference is a medium-sized community-led conference in the domains of Geographic Information Science (GIScience), geoinformatics, and related fields. The conference is organised under the umbrella of the Association of Geographic Information Laboratories in Europe (AGILE).

Starting with a series of workshops on reproducibility from 2017 to 2019, a group of Open Science enthusiasts with the support of the AGILE Council (https://agile-online.org/agile-actions/current-initiatives/reproducible-publications-at-agile-conferences) was able to introduce guidelines for sharing reproducible workflows (https://doi.org/10.17605/OSF.IO/CB7Z8) and establish a reproducibility committee that conducts code executions for all accepted full papers.
In this presentation, we provide details of the steps taken and the obstacles encountered on the way to the current state. We revisit the process and abstract a series of actions that similar events, or even journals, may take to introduce a shift towards higher reproducibility of research publications in a specific community of practice.

We discuss the approach taken in light of the challenges for reproducibility in Earth System Sciences (ESS), around four main ideas.
First, Reproducible AGILE’s human-centered process is able to handle the increasingly complex, large and varying data-based workflows in ESS because of the clear guidance on responsibilities (What should the author provide? How far does the reproducibility reviewer need to go?).
Second, the communicative focus of the process is very well suited to, over time, helping to establish a shared practice based on current technical developments, such as FAIR Digital Objects, and to reforming attitudes towards openness, transparency and sharing. A code execution following the CODECHECK principles is a learning experience that may sustainably change researcher behaviour and practice. At the same time, Reproducible AGILE’s approach avoids playing catch-up with technology, and neither limits researcher freedom nor requires researchers to standardise their workflows beyond providing instructions suitable for a human evaluator, similar to academic peer review.
Third, while being agnostic of technology and infrastructures, a supportive framework of tools and infrastructure can of course increase the efficiency of conducting a code execution. We outline how existing infrastructures may serve this need and what is still missing.
Fourth, we list potential candidates among event series or journals that could introduce a code-checking procedure because of their organisational setup or because of steps towards more open scholarship that they have already taken.

How to cite: Nüst, D., Ostermann, F. O., and Granell, C.: A peer review process for higher reproducibility of publications in GIScience can also work for Earth System Sciences, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-15384, https://doi.org/10.5194/egusphere-egu23-15384, 2023.

11:30–11:40 | EGU23-3711 | ESSI3.5 | solicited | Virtual presentation
Gregory Tucker, Albert Kettner, Eric Hutton, Mark Piper, Tian Gan, Benjamin Campforts, Irina Overeem, and Matthew Rossi

The Community Surface Dynamics Modeling System (CSDMS) is a US-based science facility that supports computational modeling of diverse Earth and planetary surface processes, ranging from natural hazards and contemporary environmental change to geologic applications. The facility promotes open, interoperable, and shared software. Here we review approaches and lessons learned in advancing FAIR principles for geoscience modeling. To promote sharing and accessibility, CSDMS maintains an online Model Repository that catalogs over 400 shared codes, ranging from individual subroutines to large and sophisticated integrated models. Thanks to semi-automated search tools, the Repository now includes ~20,000 references to literature describing these models and their applications, giving prospective model users efficient access to information about how various codes have been developed and used. To promote interoperability, CSDMS develops and promotes the Basic Model Interface (BMI): a lightweight, language-agnostic API standard that provides control, query, and data-modification functions. BMI has been adopted by a number of academic, government, and quasi-private institutions for coupled-modeling applications. BMI specifications are provided for common scientific languages, including Python, C, C++, Fortran, and Java. One challenge lies in broader awareness and adoption; for example, self-taught code developers may be unaware of the concept of an API standard, or may not perceive value in designing around such a standard. One way to address this challenge is to provide open-source programming libraries. One such library that CSDMS curates is the Landlab Toolkit: a Python package that includes building blocks for model development (such as grid data structures and I/O functions) while also providing a framework for assembling integrated models from component parts.
We find that Landlab can greatly speed model development, while giving user-developers an incentive to follow common patterns and contribute new components to the library. However, libraries by themselves do not solve the reproducibility challenge. Rather than reinventing the wheel, the CSDMS facility has approached reproducibility by partnering with the Whole Tale initiative, which provides tools and protocols to create reproducible archives of computational research. Finally, we have found that a central challenge to FAIR modeling lies in the level of community knowledge. FAIR is a two-way street that depends in part on the technical skills of the user. Are they fluent in a particular programming language? How familiar are they with the numerical methods used by a given model? How familiar are they with underlying scientific concepts and simplifying assumptions? Are they conversant with modern version control and collaborative-development technology and practices? Although scientists should not need to become software engineers, in our experience there is a basic level of knowledge that can substantially raise the quality and sustainability of research software. To address this, CSDMS offers training programs, self-paced learning materials, and online help resources for community members. The vision is to foster a thriving community of practice in computational geoscience research, equipped with ever-improving modeling tools written by and for the community as a whole.
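To give a flavour of the BMI idea described above, here is a heavily simplified sketch of a component exposing BMI-style control and query functions. It is not a conformant BMI implementation (the real specification defines many more functions, `initialize` takes a configuration file path, and `get_value` fills a caller-provided buffer); the toy decay model is invented for illustration.

```python
# Heavily simplified sketch of a BMI-style component: a framework can
# drive any model through the same initialize/update/finalize contract
# without knowing anything about its internals.
class DecayModelBmi:
    """Toy model: exponential decay of a single scalar state."""

    def initialize(self, config: dict) -> None:
        # Real BMI passes a path to a config file rather than a dict.
        self._value = config.get("initial_value", 1.0)
        self._rate = config.get("decay_rate", 0.5)
        self._time = 0.0

    def update(self) -> None:
        # Advance the model by one time step.
        self._value *= (1.0 - self._rate)
        self._time += 1.0

    def finalize(self) -> None:
        self._value = None

    def get_current_time(self) -> float:
        return self._time

    def get_value(self, name: str) -> float:
        if name != "state":
            raise KeyError(name)
        return self._value

# A coupling framework only sees the interface, never the equations:
model = DecayModelBmi()
model.initialize({"initial_value": 8.0, "decay_rate": 0.5})
for _ in range(3):
    model.update()
print(model.get_value("state"))  # 1.0 after three halvings of 8.0
```

The value of the standard is exactly this uniformity: a coupler can interleave `update()` calls on several such components without bespoke glue code for each model.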

How to cite: Tucker, G., Kettner, A., Hutton, E., Piper, M., Gan, T., Campforts, B., Overeem, I., and Rossi, M.: Lessons in FAIR software from the Community Surface Dynamics Modeling System, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3711, https://doi.org/10.5194/egusphere-egu23-3711, 2023.

11:40–11:50 | EGU23-13108 | ESSI3.5 | Highlight | On-site presentation
Charlotte Pascoe, Lina Sitz, Diego Cammarano, Anna Pirani, Martina Stockhause, Molly MacRae, and Emily Anderson

A new paradigm for Intergovernmental Panel on Climate Change (IPCC) Working Group I (WGI) data publication has been implemented.  IPCC Data Distribution Centre (DDC) partners at the Centre for Environmental Data Analysis (CEDA), the German Climate Computing Centre (DKRZ) and the Spanish Research Council (CSIC) have worked with the IPCC Technical Support Unit (TSU) for WGI to publish figure data from the Sixth Assessment Report (AR6). The work was guided by the IPCC Task Group on Data Support for Climate Change Assessments (TG-Data) recommendations for Open Science and FAIR data (making data Findable, Accessible, Interoperable, and Reusable) with a general aim to enhance the transparency and accessibility of AR6 outcomes.  We highlight the achievement of implementing FAIR for AR6 figure data and discuss the lessons learned on the road to FAIRness in the unique context of the IPCC.

  • Findable - The CEDA catalogue record for each figure dataset enhances findability. Keywords can be easily searched. Records are organised into collections for each AR6 chapter. There is a two-way link between the catalogue record and the figure on the AR6 website. CEDA catalogue records are duplicated on the IPCC-DDC. 
  • Accessible - Scientific language is understandable, acronyms and specific terminology are fully explained. CEDA services provide tools to access and download the data. 
  • Interoperable - Where possible data variables follow standard file format conventions such as CF-netCDF and have standard names, where this is not feasible readme files describe the file structure and content. 
  • Reusable - The data can be reused, shared and adapted elsewhere, with credit, under a Creative Commons Attribution 4.0 licence (CC BY 4.0). Catalogue records link to relevant documentation such as the Digital Object Identifier (DOI) for the code and other supplementary information. The code used to create the figures allows users to reproduce the figures from the report independently. 

CEDA catalogue records provide a platform to acknowledge the specific work of IPCC authors and dataset creators whose work supports the scientific basis of AR6. 

Catalogue records for figure datasets were created at CEDA with data archived in the CEDA repository and the corresponding code stored on GitHub and referenced via Zenodo.  For instances where the data and code were blended in a processing chain that could not be easily separated, we developed criteria to categorise the different blends of data and code and created a decision tree to decide how best to archive them. Key intermediate datasets were also archived at CEDA.

Careful definition of metadata requirements at the beginning of the archival process is important for handling the diversity of IPCC figure data which includes data derived from climate model simulations, historical observations and other sources of climate information. The reality of the implementation meant that processes for gathering data and information from authors were specified later in the preparation of AR6. This presented challenges with data management workflows and the separation of figure datasets from the intermediate data and code that generated them. 

We present recommendations for AR7 and scaling up this work in a feasible way.

How to cite: Pascoe, C., Sitz, L., Cammarano, D., Pirani, A., Stockhause, M., MacRae, M., and Anderson, E.: The reality of implementing FAIR principles in the IPCC context to support open science and provide a citable platform to acknowledge the work of authors., EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13108, https://doi.org/10.5194/egusphere-egu23-13108, 2023.

11:50–12:00 | EGU23-4939 | ESSI3.5 | Highlight | On-site presentation
Shelley Stall and Kristina Vrouwenvelder

Open science is transformative, removing barriers to sharing science and increasing reproducibility and transparency. The benefits of open science are maximized when its principles are incorporated throughout the research process: working collaboratively with community members and sharing data, software, workflows, samples, and other aspects of scientific research openly so that they can be reused, distributed, and reproduced. However, the paths toward Open Science are not always apparent, and there are many concepts, approaches, and tools to learn along the way.

Open Science practices lie along a continuum, and researchers can make incremental adjustments to their research practices that may seem small but can have valuable benefits. Here we will share the first steps of a researcher’s open science journey and how to lead your own research team in adopting Open Science practices.

How to cite: Stall, S. and Vrouwenvelder, K.: Open Science: How Open is Open?, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-4939, https://doi.org/10.5194/egusphere-egu23-4939, 2023.

12:00–12:10 | EGU23-8526 | ESSI3.5 | ECS | On-site presentation
Benjamin Schumacher, Patrick Griffiths, Edzer Pebesma, Jeroen Dries, Alexander Jacob, Daniel Thiex, Matthias Mohr, and Christian Briese

openEO Platform holds a large amount of free and open as well as commercial Earth Observation (EO) data, which can be accessed and analysed with openEO, an open API that enables cloud computing and EO data access in a unified and reproducible way. Additionally, client libraries are available in R, Python and JavaScript. A JupyterLab environment and the Web Editor, a graphical interface, allow direct and interactive development of processing workflows. The platform is developed with a strong user focus, and various use cases have been implemented to illustrate the platform's capabilities. Currently, three federated backends support the analysis of EO data from pixel to continental scale.

The use cases implemented during the platform’s main development phase include a dynamic landcover mapping, an on-demand analysis-ready-data creation for Sentinel-1 GRD, Sentinel-2 MSI and Landsat data, time series-based forest dynamics analysis with prediction functionalities, feature engineering for crop type mapping and large-scale fractional canopy mapping. Additionally, three new use cases are being developed by platform users. These include large scale vessel detection based on Sentinel-1 and Sentinel-2 data, surface water indicators using the ESA World Water toolbox for a user-defined area of interest and monitoring of air quality parameters using Sentinel-5P data. 

The future evolution of openEO Platform in terms of data availability and processing capabilities is closely linked to community requirements, facilitated by feature requests from users who design their workflows for environmental monitoring and reproducible research purposes. This presentation provides an overview of the completed use cases, newly added functionalities such as user code sharing, and user interface updates based on the new use cases and user requests. openEO Platform exemplifies how processing and analysing large amounts of EO data into meaningful information products is becoming easier and largely compliant with the FAIR data principles, supporting the EO community at large.
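An openEO workflow is exchanged with the API as a JSON "process graph", which is itself a shareable, re-runnable artefact. The following sketch builds such a graph as a plain Python structure; the collection identifier, extents and band names are hypothetical examples, not a workflow from the platform's use cases.

```python
import json

# Sketch of an openEO process graph: named nodes, each with a
# process_id and arguments; nodes reference each other via from_node.
# Collection id, extents and bands below are illustrative only.
process_graph = {
    "process_graph": {
        "load1": {
            "process_id": "load_collection",
            "arguments": {
                "id": "SENTINEL2_L2A",
                "spatial_extent": {"west": 16.1, "east": 16.6,
                                   "south": 48.1, "north": 48.3},
                "temporal_extent": ["2022-06-01", "2022-06-30"],
                "bands": ["B04", "B08"],
            },
        },
        "ndvi1": {
            "process_id": "ndvi",
            "arguments": {"data": {"from_node": "load1"}},
            "result": True,  # marks the node whose output is returned
        },
    }
}

# The serialised graph can be archived alongside results, making the
# workflow itself citable and re-executable on any federated backend.
print(json.dumps(process_graph)[:30])
```

Because the graph is backend-agnostic, the same document can be re-submitted later or elsewhere, which is what makes openEO workflows reproducible by construction.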

How to cite: Schumacher, B., Griffiths, P., Pebesma, E., Dries, J., Jacob, A., Thiex, D., Mohr, M., and Briese, C.: openEO Platform – showcasing a federated, accessible platform for reproducible large-scale Earth Observation analysis, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8526, https://doi.org/10.5194/egusphere-egu23-8526, 2023.

12:10–12:20 | EGU23-14845 | ESSI3.5 | On-site presentation
Massimiliano Cannata, Gregory Giuliani, Jens Ingensand, Olivier Ertz, and Maxime Collombin

In the era of cloud computing, big data and the Internet of Things, research is very often data-driven: based on the analysis of data, increasingly available in large quantities and collected by experiments, observations or simulations. These data are often dynamic in space and time and are either continuously expanding (monitoring) or changing (data quality management or surveys). Modern Spatial Data Infrastructures (e.g. swisstopo or INSPIRE) are based on interoperable Web services which expose and serve large quantities of data on the Internet using widely accepted open standards defined by the Open Geospatial Consortium (OGC) and the International Organization for Standardization (ISO). These standards mostly comply with the FAIR principles, but they do not offer any capability to retrieve a dataset as it was at a defined instant, to refer to its status at that specific instant, or to guarantee its immutability. These three gaps hinder the replicability of research based on such services. We discuss this issue and the state of the art, and propose a possible solution to fill the gap, using or extending existing standards where needed and adopting best practices in the fields of sensor data, satellite data and vector data.
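One common building block for the immutability guarantee discussed above is content addressing: publishing a cryptographic digest of the dataset snapshot alongside its citation, so a later retrieval can be verified against the cited version. A minimal sketch (the record layout is a hypothetical example, not part of any OGC standard):

```python
import hashlib
import json

# Sketch: derive a stable fingerprint for a dataset snapshot so that a
# later retrieval can be verified byte-for-byte against the cited state.
def snapshot_digest(records: list) -> str:
    """Serialise records canonically and return a SHA-256 hex digest."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

v1 = [{"station": "A", "value": 3.2}, {"station": "B", "value": 4.1}]
v2 = [{"station": "A", "value": 3.2}, {"station": "B", "value": 4.15}]

digest_v1 = snapshot_digest(v1)
# Any change to the data yields a different digest, so a digest stored
# in a paper's data citation pins the exact snapshot that was analysed.
print(digest_v1 != snapshot_digest(v2))  # True: changed value, new digest
print(digest_v1 == snapshot_digest(list(v1)))  # True: same content, same digest
```

A Web service that returned such digests with time-stamped snapshot identifiers would let a continuously updated dataset still support exact, verifiable re-retrieval.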

How to cite: Cannata, M., Giuliani, G., Ingensand, J., Ertz, O., and Collombin, M.: Open geospatial standards and reproducible research, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14845, https://doi.org/10.5194/egusphere-egu23-14845, 2023.

12:20–12:30 | EGU23-12864 | ESSI3.5 | Highlight | On-site presentation
Lesley Wyborn, Nigel Rees, Jens Klump, Ben Evans, Rebecca Farrington, and Tim Rawling

Reproducible research necessitates full transparency and integrity in the collection (e.g. from observations) or generation of data, and in the further processing and analysis that produce research products. However, Earth and environmental science data are growing in complexity, volume and variety, and today, particularly for large-volume Earth observation and geophysics datasets, achieving this transparency is not easy. It is rare for a published data product to be created in a single processing event by a single author or individual research group. Modern research data processing pipelines/workflows can have quite complex lineages, and it is more likely that an individual research product is generated through multiple levels of processing, starting from raw instrument data at full resolution (L0) followed by successive levels of processing (L1–L4), which progressively convert raw instrument data into more useful parameters and formats. Each individual level of processing can be undertaken by different research groups using a variety of funding sources: rarely are those involved in the early stages of processing or funding properly cited.

The lower levels of processing are where observational data essentially remains at full resolution and is calibrated, georeferenced and processed to sensor units (L1) and then geophysical variables are derived (L2). Historically, particularly where the volumes of the L0-L2 datasets are measured in Terabytes to Petabytes, processing could only be undertaken by a minority of specialised scientific research groups and data providers, as few had the expertise/resources/infrastructures to process them on-premise. Wider availability of colocated data assets and HPC/cloud processing means that the full resolution, less processed forms of observational data can now be processed remotely in realistic timeframes by multiple researchers to their specific processing requirements, and also enables greater exploration of parameter space allowing multiple values for the same inputs to be trialled. The advantage is that better-targeted research products can now be rapidly produced. However, the downside is that far greater care needs to be taken to ensure that there is sufficient machine-readable metadata and provenance information to enable any user to determine what processing steps and input parameters were used in each part of the lineage of any released dataset/data product, as well as be able to reference exactly who undertook any part of the acquisition/processing and identify sources of funding (including instruments/field campaigns that collected the data).

The use of Persistent Identifiers (PIDs) for all component objects (observational data, synthetic data, software, model inputs, people, instruments, grants, organisations, etc.) will be critical. Global and interdisciplinary research teams of the future will rely on software engineers to develop community-driven software environments that aid and enhance the transparency and reproducibility of their scientific workflows and ensure recognition. The advantage of the PID approach is that not only will reproducibility and transparency be enhanced, but, through the use of Knowledge Graphs, it will also be possible to trace the input of any researcher at any level of processing, while funders will be able to determine the impact of each stage, from the raw data capture through to any derivative high-level data product.
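The credit-tracing idea described above can be sketched with a toy lineage graph in which each processing level is a PID-linked provenance record. All PIDs, ORCIDs and grant identifiers below are hypothetical placeholders, and the graph structure is a deliberate simplification of a real knowledge graph:

```python
# Illustrative sketch: tracing contributors and funders through a
# dataset lineage expressed as PID-linked provenance records.
# Every identifier here is an invented placeholder.

# Each record links an output PID to its input PIDs, the responsible
# agent and the grant that funded that processing step.
provenance = {
    "doi:10.1234/l2-product": {
        "inputs": ["doi:10.1234/l1-data"],
        "agent": "orcid:0000-0000-0000-0002",
        "grant": "grant:B",
    },
    "doi:10.1234/l1-data": {
        "inputs": ["doi:10.1234/l0-raw"],
        "agent": "orcid:0000-0000-0000-0001",
        "grant": "grant:A",
    },
    "doi:10.1234/l0-raw": {
        "inputs": [],
        "agent": "orcid:0000-0000-0000-0003",
        "grant": "grant:A",
    },
}

def trace(pid, graph):
    """Walk the lineage graph and collect every agent and grant involved."""
    agents, grants = set(), set()
    stack = [pid]
    while stack:
        node = graph.get(stack.pop())
        if node is None:
            continue  # lineage leaves the local graph (external PID)
        agents.add(node["agent"])
        grants.add(node["grant"])
        stack.extend(node["inputs"])
    return agents, grants

agents, grants = trace("doi:10.1234/l2-product", provenance)
```

Walking the graph from the released L2 product surfaces all three contributors and both grants, which is exactly the "who done it" information a funder or citation system would need.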

 

How to cite: Wyborn, L., Rees, N., Klump, J., Evans, B., Farrington, R., and Rawling, T.: Who Done It? Reproducibility of Data Products Also Requires Lineage to Determine Impact and Give Credit Where Credit is Due., EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-12864, https://doi.org/10.5194/egusphere-egu23-12864, 2023.

Posters on site: Wed, 26 Apr, 08:30–10:15 | Hall X4

Chairpersons: Karsten Peters-von Gehlen, Rebecca Farrington, Philippe Bonnet
X4.183 | EGU23-15391 | ESSI3.5
Swati Gehlot, Karsten Peters-von Gehlen, Andrea Lammert, and Hannes Thiemann

The German climate research initiative PalMod phase II (www.palmod.de) is presented here as an example of a project whose end product is unique scientific paleo-climate data. PalMod-II data products include output from three state-of-the-art coupled climate models of varying complexity and spatial resolution simulating the climate of the past 130,000 years. In addition to the long time series of modelling data, a comprehensive compilation of paleo-observation data has been prepared to facilitate model-model and model-proxy intercomparison and evaluation. As PalMod-II is a large multidisciplinary project, a dedicated RDM (Research Data Management) approach is applied within its cross-cutting working group. The DMP (Data Management Plan), as a living document, is used for documenting the data-workflow framework that defines the details of the paleo-climate data life cycle. The workflow covering the organisation, storage, preservation, sharing and long-term curation of the data has been defined and tested. In order to make the modelling data inter-comparable across the PalMod-II models and easily analyzable by the global paleo-climate community, model data standardization (CMORization) workflows are defined for the individual PalMod models and their sub-models. The CMORization workflows contain the setup, definition, and quality assurance testing of CMIP6 [1] based standardization processes adapted to PalMod-II model simulation output requirements, with the final aim of data publication via ESGF [2]. PalMod-II data publication via ESGF makes the paleo-climate data an asset that is (re-)usable beyond the project lifetime.
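The core of a CMORization step is mapping a model-native variable onto a CMIP-style standard name with the metadata a federated archive expects. A minimal sketch, in which the lookup table and the model/variable names are invented placeholders rather than the actual PalMod-II configuration:

```python
# Minimal sketch of one CMORization step: rename a model-native variable
# to a CMIP-style variable_id and attach standardized metadata.
# The table below is a tiny invented stand-in for a real CMOR table.

CMOR_TABLE = {  # native name -> (CMIP variable_id, units, CF standard_name)
    "temp2": ("tas", "K", "air_temperature"),
    "precip": ("pr", "kg m-2 s-1", "precipitation_flux"),
}

def cmorize(native_name, values, source_id):
    """Return a standardized record, or raise if the variable is unmapped."""
    try:
        out_name, units, std_name = CMOR_TABLE[native_name]
    except KeyError:
        raise ValueError(f"no CMOR mapping for {native_name!r}")
    return {
        "variable_id": out_name,
        "units": units,
        "standard_name": std_name,
        "source_id": source_id,     # which model produced the data
        "data": values,
    }

record = cmorize("temp2", [271.3, 272.9], source_id="palmod-model-x")
```

Rejecting unmapped variables at this stage is what makes downstream model-model intercomparison safe: everything that reaches the archive already shares one vocabulary.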

The PalMod-II RDM infrastructure enables common research data management according to the FAIR [3] data principles across all the working groups of PalMod-II, using common workflows for the exchange of data and information along the process chain. Applying data management planning within PalMod-II ensured that all data-related workflows were defined, continuously updated where needed and made available to the project stakeholders. The end products of PalMod-II, which consist of unique long-term scientific paleo-climate data (model as well as paleo-proxy data), are made available for re-use by the paleo-climate research community as well as other research disciplines (e.g., land-use or socio-economic studies).

[1] Coupled Model Intercomparison Project phase 6 (https://www.wcrp-climate.org/wgcm-cmip/wgcm-cmip6)

[2] Earth System Grid Federation (https://esgf.llnl.gov)

[3] Findable, Accessible, Interoperable, Reusable

How to cite: Gehlot, S., Peters-von Gehlen, K., Lammert, A., and Thiemann, H.: Data Management for PalMod-II – data workflow and re-use strategy, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-15391, https://doi.org/10.5194/egusphere-egu23-15391, 2023.

X4.184 | EGU23-6375 | ESSI3.5
Jochen Klar and Matthias Mengel

The Inter-Sectoral Impact Model Intercomparison Project (ISIMIP) is a community-driven climate impact modeling initiative that aims to contribute to a quantitative and cross-sectoral synthesis of the various impacts of climate change, including associated uncertainties. ISIMIP is organized into simulation rounds for which a simulation protocol defines a set of common scenarios. Participating modeling groups run their simulations according to these scenarios and with a common set of climatic and socioeconomic input data. The model output data are collected by the ISIMIP team at the Potsdam Institute for Climate Impact Research (PIK) and made publicly available in the ISIMIP repository. Currently, the ISIMIP repository at data.isimip.org includes data from over 150 impact models spanning 13 different sectors and comprises over 100 TB of data.

As the world's largest data archive of model-based climate impact data, ISIMIP output data is used by a very diverse audience inside and outside of academia, for all kinds of research and analyses. Special care is taken to enable persistent identification, provenance, and citability. A set of workflows and tools ensures the conformity of the model output data with the protocol and the transparent management of caveats and updates to already published data. Datasets are referenced using unique internal IDs, and hash values are stored for each file in the database.

In recent years, this process has been significantly improved by introducing a machine-readable protocol, which is version-controlled on GitHub and can be accessed over the internet. A set of software tools for quality control and data publication accesses this protocol to enforce consistent data quality and to extract metadata. Some of the tools can be used independently by the modelling groups even before submitting the data. After the data is published in the ISIMIP repository, it can be accessed via the web or using an API (e.g. for access from Jupyter notebooks) using the same controlled vocabularies from the protocol. In order to make the data citable, DOIs for each output sector are registered with DataCite. For each DOI, a precise list of each contained dataset is maintained. If data for a sector is added or replaced, a new, updated DOI is created.
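Protocol-driven quality control of this kind can be sketched as checking submission file names against controlled vocabularies loaded from the machine-readable protocol. The vocabulary entries and the three-facet file-name pattern below are invented for illustration; the real ISIMIP protocol is far richer:

```python
# Sketch: validate a submitted file name against controlled vocabularies
# taken from a (here hard-coded, normally downloaded) protocol document.
protocol = {
    "climate_forcing": {"gfdl-esm4", "ipsl-cm6a-lr"},
    "climate_scenario": {"historical", "ssp126", "ssp585"},
    "variable": {"qtot", "dis"},
}

def check_filename(name):
    """Validate '<forcing>_<scenario>_<variable>.nc'; return a list of errors."""
    stem, _, ext = name.rpartition(".")
    if ext != "nc":
        return ["unexpected file extension: " + ext]
    parts = stem.split("_")
    if len(parts) != 3:
        return ["expected 3 underscore-separated facets"]
    errors = []
    facets = ("climate_forcing", "climate_scenario", "variable")
    for facet, value in zip(facets, parts):
        if value not in protocol[facet]:
            errors.append(f"{value!r} not in controlled vocabulary for {facet}")
    return errors

assert check_filename("gfdl-esm4_ssp126_qtot.nc") == []
```

Because the same vocabularies drive both this pre-submission check and the repository API, a file that validates locally is guaranteed to be findable under the same facet terms after publication.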

While the specific implementation is highly optimized to the peculiarities of ISIMIP, the general ideas should be transferable to other projects. In our presentation, we will discuss the various tools and how they interact to create an integrated curation and publishing workflow.

How to cite: Klar, J. and Mengel, M.: A machine-actionable workflow for the publication of climate impact data of the ISIMIP project, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-6375, https://doi.org/10.5194/egusphere-egu23-6375, 2023.

X4.185 | EGU23-17263 | ESSI3.5
Ivonne Anders, Hannes Thiemann, Martin Bergemann, Christopher Kadow, and Etor Lucio-Eceiza

Some disciplines, e.g. astrophysics or the Earth system sciences, work with large to very large amounts of data. Storing this data, but also processing it, is a challenge for researchers because novel concepts for processing data and workflows have not developed as quickly. This problem will only become more pronounced with the ever-increasing performance of High Performance Computing (HPC) systems.

At the German Climate Computing Center (DKRZ), we analysed the users, their goals and their working methods. DKRZ provides the climate science community with resources such as high-performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC). In analysing users, we distinguish between two main groups: those who need the HPC system to run resource-intensive simulations and then analyse them, and those who reuse, build on and analyse existing data. Each group subdivides into subgroups. We analysed the workflows for each identified user type, found identical parts in an abstracted form, and derived Canonical Workflow Modules from them. In the process, we critically examined the possible use of so-called FAIR Digital Objects (FDOs) and checked to what extent the derived workflows and workflow modules are actually future-proof.

We will show the analysis of the different users, the Canonical workflow and the vision of the FDOs. Furthermore, we will present the framework Freva and further developments and implementations at DKRZ with respect to the reproducibility of simulation-based research in the ESS.

How to cite: Anders, I., Thiemann, H., Bergemann, M., Kadow, C., and Lucio-Eceiza, E.: Towards reproducible workflows in simulation based Earth System Science, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-17263, https://doi.org/10.5194/egusphere-egu23-17263, 2023.

X4.186 | EGU23-4525 | ESSI3.5 | ECS
Zarrar Khan, Chris Vernon, Isaac Thompson, and Pralit Patel

The number of models, as well as of data inputs and outputs, is growing continuously as scientists push the boundaries of the spatial, temporal, and sectoral detail being captured. This study presents the framework being developed to manage the Global Change Intersectoral Modeling System (GCIMS) ecosystem of human-Earth system models. We discuss the challenges of ensuring continuous integration and deployment, reproducibility, interoperability, containerization, and data management for the growing suite of GCIMS models. We investigate the challenges of model version control and of interoperability between models that use different software, operate on different temporal and spatial scales, and focus on different sectors. We also discuss managing transparency of, and accessibility to, models and their corresponding data products throughout our integrated modeling lifecycle.

How to cite: Khan, Z., Vernon, C., Thompson, I., and Patel, P.: GCIMS – Integration: Reproducible, robust, and scalable workflows for interoperable human-Earth system modeling, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-4525, https://doi.org/10.5194/egusphere-egu23-4525, 2023.

X4.187 | EGU23-7427 | ESSI3.5
Florian Spreckelsen, Henrik tom Wörden, Daniel Hornung, Timm Fitschen, Alexander Schlemmer, and Johannes Freitag

The flexible open-source research data management toolkit CaosDB is used in a diversity of fields such as turbulence physics, legal research, maritime research and glaciology. It links research data, makes it findable and retrievable, and keeps it consistent even when the data model changes.

CaosDB is used in the glaciology department at the Alfred Wegener Institute in Bremerhaven for the management of ice core samples and related measurements and analyses. Researchers can use the system to query for ice samples linked to, e.g., specific measurements, which they can then request to borrow for further analyses. This facilitates inter-laboratory collaborative research on the same samples. The system addresses a number of the researchers' needs:
  • A revision system that intrinsically keeps track of changes to the data and of the state samples were in when certain analyses were performed.
  • Automated gathering of information for publication in a metadata repository (Pangaea).
  • Tools for storing, displaying and querying geospatial information and graphical summaries of all the measurements and analyses performed on an ice core.
  • Automatic data extraction and refinement into data records in CaosDB, so that users do not need to enter the data manually.
  • A state machine that guarantees certain workflows, simplifies development and can be extended to trigger additional actions upon transitions.

We demonstrate how CaosDB enables researchers to create and work with semantic data objects. We further show how CaosDB's semantic data structure enables researchers to publish their data as FAIR Digital Objects.

How to cite: Spreckelsen, F., tom Wörden, H., Hornung, D., Fitschen, T., Schlemmer, A., and Freitag, J.: Integrating sample management and semantic research-data management in glaciology, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7427, https://doi.org/10.5194/egusphere-egu23-7427, 2023.

X4.188 | EGU23-13347 | ESSI3.5 | ECS
Alexander Schlemmer and Sinikka Lennartz

In our project we are employing semantic data management with the Open Source research data management system (RDMS) CaosDB [1] to link empirical data and simulation output from Earth System Models [2]. The combined management of these data structures allows us to perform complex queries and facilitates the integration of data and meta data into data analysis workflows.

One particular challenge for analyses of model output is to keep track of all necessary metadata for each simulation during the whole digital workflow. Especially for open science approaches it is of great importance to properly document, in human- and computer-readable form, all the information necessary to completely reproduce the obtained results. Furthermore, we want to be able to feed all relevant data from data analysis back into our data management system, so that we can perform complex queries also on data sets and parameters stemming from data analysis workflows.

A specific aim of this project is to re-analyse existing sets of simulations under different research questions. This endeavour can become very time consuming without proper documentation in an RDMS.

We implemented a workflow, combining semantic research data management with CaosDB and Jupyter notebooks, that keeps track of data loaded into an analysis workspace. Procedures are provided that create snapshots of specific states of the analysis. These snapshots can automatically be interpreted by the CaosDB crawler that is able to insert and update records in the system accordingly. The snapshots include links to the input data, parameter information, the source code and results and therefore provide a high-level interface to the full chain of data processing, from empirical and simulated raw data to the results. For example, input parameters of complex Earth System Models can be extracted automatically and related to model performance. In our use case, not only automated analyses are feasible, but also interactive approaches are supported.
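The snapshot concept can be illustrated with a small self-describing record that bundles input references, parameters, code version and results, plus a checksum a crawler could use to detect changes. This mimics the idea only, using invented field names and plain Python rather than the actual CaosDB/LinkAhead API:

```python
# Sketch: bundle everything needed to reproduce an analysis state into
# one record, with a content checksum. Field names are illustrative.
import hashlib
import json

def make_snapshot(inputs, parameters, code_version, results):
    """Build an analysis snapshot plus a checksum over its content."""
    record = {
        "inputs": sorted(inputs),      # e.g. PIDs of simulation output used
        "parameters": parameters,      # model/analysis parameters
        "code_version": code_version,  # e.g. a git commit hash
        "results": results,
    }
    # canonical JSON serialization makes the checksum deterministic
    payload = json.dumps(record, sort_keys=True).encode()
    record["checksum"] = hashlib.sha256(payload).hexdigest()
    return record

snap = make_snapshot(
    inputs=["doi:10.1234/simulation-run-42"],
    parameters={"ocean_mixing": 0.3},
    code_version="3f2a1bc",
    results={"mean_doc": 42.7},
)
```

Because serialization is canonical, two snapshots of the same analysis state always produce the same checksum, which is what lets a crawler decide whether a record needs inserting or updating.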

  • [1] Fitschen, T.; Schlemmer, A.; Hornung, D.; tom Wörden, H.; Parlitz, U.; Luther, S. CaosDB—Research Data Management for Complex, Changing, and Automated Research Workflows. Data 2019, 4, 83. https://doi.org/10.3390/data4020083
  • [2] Schlemmer, A., Merder, J., Dittmar, T., Feudel, U., Blasius, B., Luther, S., Parlitz, U., Freund, J., and Lennartz, S. T.: Implementing semantic data management for bridging empirical and simulative approaches in marine biogeochemistry, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11766, https://doi.org/10.5194/egusphere-egu22-11766, 2022.

How to cite: Schlemmer, A. and Lennartz, S.: Transparent and reproducible data analysis workflows in Earth System Modelling combining interactive notebooks and semantic data management, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13347, https://doi.org/10.5194/egusphere-egu23-13347, 2023.

X4.189 | EGU23-9852 | ESSI3.5
Filippo Giadrossich, Ilenia Murgia, and Roberto Scotti

NuoroForestrySchool (a study center of the Department of Agriculture, University of Sassari, Italy) has developed and published a ‘data documentation procedure’ (link to NFS-DDP) that improves the FAIRness of any dataset its collector wishes to share as open data. Datasets are frequently shared as spreadsheet files. While this tool is very handy for data preparation and preliminary analysis, its structure and composition are not very effective for storing and sharing consolidated data, unless the data structures are extremely simple. NFS-DDP takes as input a spreadsheet in which data are organized as relational tables, one per sheet, while four additional sheets contain metadata standardized according to the Dublin Core specifications. The procedure outputs an SQLite relational database (including data and metadata) and a PDF file documenting the database structure and contents. A first example application of the proposed procedure was shared by Giadrossich et al. (2022) on the PANGAEA repository, concerning experimental data on erosion in forest soil measured during artificial rainfall. The zip archive that can be downloaded contains the experiment data and metadata processed by NFS-DDP. A test document is available at the following link, in which basic statistics are computed to show how the NFS-DDProcedure facilitates the understanding and correct processing of the shared dataset.
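The core move of such a procedure, relational data tables plus a Dublin Core metadata table packed into one SQLite file, can be sketched with the standard library. The table layout and field names below are invented placeholders, not the actual NFS-DDP schema:

```python
# Sketch: pack a data table and Dublin Core key/value metadata into a
# single SQLite database, the archival unit a procedure like NFS-DDP
# produces. Schema and names here are illustrative only.
import sqlite3

def build_archive(path, rows, metadata):
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE erosion (plot_id TEXT, runoff_mm REAL)")
    con.executemany("INSERT INTO erosion VALUES (?, ?)", rows)
    # Dublin Core terms stored alongside the data they describe
    con.execute("CREATE TABLE dublin_core (term TEXT, value TEXT)")
    con.executemany("INSERT INTO dublin_core VALUES (?, ?)", metadata.items())
    con.commit()
    return con

con = build_archive(
    ":memory:",  # a real archive would use a file path
    rows=[("P1", 12.5), ("P2", 8.1)],
    metadata={"dc:creator": "Giadrossich et al.", "dc:title": "Erosion experiment"},
)
```

Unlike a spreadsheet, the resulting single file carries its own metadata and enforces typed, relational structure, which is what makes the archive more FAIR than a folder of CSVs.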

The NFS-DataDocumentationProcedure provides a simple solution for organizing and archiving data that aims to i) achieve a more FAIR archive, ii) exploit the data consistency and comprehensibility of semantic connections in the relational database, and iii) produce a report documenting the collection and organization of the data, providing an effective and concise overview of the whole with all details at hand.

Giadrossich, F., Murgia, I., Scotti, R. (2022). Experiment of water runoff and soil erosion with and without forest canopy coverage under intense artificial rainfall. PANGAEA. DOI:10.1594/PANGAEA.943451



How to cite: Giadrossich, F., Murgia, I., and Scotti, R.: Proposal of a simple procedure to derive a more FAIR open data archive than a spreadsheet or a set of CSV files, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-9852, https://doi.org/10.5194/egusphere-egu23-9852, 2023.

X4.190 | EGU23-12971 | ESSI3.5
David Schäfer, Bert Palm, Peter Lünenschloß, Lennart Schmidt, and Jan Bumberger

Environmental sensor networks produce ever-growing volumes of time series data with great potential to broaden the understanding of complex spatiotemporal environmental processes. However, this growth also imposes its own set of new challenges. Especially the error-prone nature of sensor data acquisition is likely to introduce disturbances and anomalies into the actual environmental signal. Most applications of such data, whether it is used in data analysis, as input to numerical models or modern data science approaches, usually rely on data that complies with some definition of quality.

To move towards high-standard data products, a thorough assessment of a dataset's quality, i.e., its quality control, is of crucial importance. A common approach when working with time series data is the annotation of single observations with a quality label to convey information such as their reliability. Downstream users and applications are hence able to make informed decisions about whether a dataset as a whole, or at least parts of it, is appropriate for the intended use.

Unfortunately, quality control of time series data is a non-trivial, time-consuming and scientifically undervalued endeavor, and is often neglected or executed with insufficient rigor. The presented software, the System for automated Quality Control (SaQC), provides all basic and many advanced building blocks to bridge the gap between data as it is usually delivered (faulty) and data as it is expected to be (correct) in an accessible, consistent, objective and reproducible way. Its user interfaces address different audiences, ranging from the scientific practitioner with little access to the possibilities of modern software development to the trained programmer. SaQC delivers a growing set of generic algorithms to detect a multitude of anomalies and to process data using resampling, aggregation, and data modeling techniques. One defining component of SaQC, however, is its innovative approach to storing runtime process information. In combination with a flexible quality annotation mechanism, SaQC allows quality labels to be extended with fine-grained provenance information sufficient to fully reproduce the system's output.
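The label-plus-provenance idea can be illustrated in a few lines: each observation receives a flag together with a record of which test, with which parameters, produced it, so the exact check can be re-run later. This is a conceptual sketch only, not SaQC's actual interface, and the sensor values are invented:

```python
# Conceptual sketch of provenance-carrying quality flags: every value
# gets a flag plus enough information to reproduce the flagging decision.
def flag_range(values, lo, hi):
    flags = []
    for v in values:
        ok = lo <= v <= hi
        flags.append({
            "flag": "OK" if ok else "BAD",
            # provenance: which test ran, with which parameters
            "test": "range",
            "params": {"lo": lo, "hi": hi},
        })
    return flags

water_temp = [4.2, 5.1, 41.0, 3.9]   # hypothetical sensor readings (°C)
flags = flag_range(water_temp, lo=0.0, hi=30.0)
bad = [i for i, f in enumerate(flags) if f["flag"] == "BAD"]
```

Storing the test name and parameters next to each flag is what makes the quality control reproducible: a downstream user can re-execute the same check on the same data and obtain the same labels.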

SaQC is proving its usefulness on a daily basis in a range of fully automated data flows for large environmental observatories. We highlight use cases from the TERENO Network, showcasing how reproducible automated quality control can be implemented into real-world, large-scale data processing workflows to provide environmental sensor data in near real-time to data users, stakeholders and decision-makers.

 

How to cite: Schäfer, D., Palm, B., Lünenschloß, P., Schmidt, L., and Bumberger, J.: Reproducible quality control of time series data with SaQC, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-12971, https://doi.org/10.5194/egusphere-egu23-12971, 2023.

X4.191 | EGU23-6726 | ESSI3.5 | ECS
Anna Simson, Anil Yildiz, and Julia Kowalski

A vast amount of in situ cryospheric data has been collected during publicly funded field campaigns to the polar regions over the past decades. Each individual data set yields important insights into local thermo-physical processes, but these data sets need to be assembled into informative data compilations to unlock their full potential for regional or global outcomes in climate change related research. The efficient and sustainable interdisciplinary reuse of such data compilations is of great interest to the scientific community. Yet the creation of such compilations is often challenging, as they have to be composed of heterogeneous data sets from various data repositories. In this contribution we focus on the reuse of data sets while generating extendable data compilations with enhanced reusability.

Data reuse is typically conducted by researchers other than the original data producers, and it is therefore often limited by the metadata and provenance information available. Reuse scenarios include the validation of physics-based process models, the training of data-driven models, or data-integrated predictive simulations. All these use cases heavily rely on a diverse data foundation in form of a data compilation, which depends on high quality information. In addition to metadata, provenance, and licensing conditions, the data set itself must be checked for reusability. Individual data sets containing the same metrics often differ in structure, content, and metadata, which challenges data compilation.

In order to generate data compilations for a specific reuse scenario, we propose to break down the workflow into four steps:
1) Search and selection: Searching, assessing, optimizing search, and selecting data sets.
2) Validation: Understanding and representing data sets in terms of the data collectors including structure, terms used, metadata, and relations between different metrics or data sets.
3) Specification: Defining the format, structure, and content of the data compilation based on the scope of the data sets.
4) Implementation: Integrating the selected data sets into the compilation.
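The validation, specification and implementation steps above can be sketched as a small pipeline that maps each heterogeneous source data set onto one agreed compilation schema. The field names, required metadata and unit conventions below are invented for illustration only:

```python
# Sketch of steps 2-4: validate each source data set, translate it into
# a fixed compilation schema, and integrate the selected sets.
def to_schema(dataset):
    """Validate and translate one source data set into the compilation schema."""
    # step 2 (validation): refuse data sets with missing required fields,
    # rather than guessing what the data collector meant
    for key in ("core_id", "depth_m", "salinity"):
        if key not in dataset:
            raise ValueError(f"cannot validate data set: missing {key!r}")
    # step 3 (specification): the compilation keeps a fixed, minimal layout
    return {
        "core": dataset["core_id"],
        "depth_m": dataset["depth_m"],
        "salinity_psu": dataset["salinity"],
    }

def compile_datasets(selected):
    """Step 4 (implementation): integrate all selected data sets."""
    return [to_schema(d) for d in selected]

compilation = compile_datasets([
    {"core_id": "ICE-01", "depth_m": 0.5, "salinity": 4.1},
    {"core_id": "ICE-02", "depth_m": 1.0, "salinity": 3.2},
])
```

Failing loudly on missing metadata, instead of silently filling gaps, is the point made in the text: ambiguity left to the (re)user's interpretation increases the uncertainty of the whole compilation.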

We present a workflow to create a data compilation from heterogeneous sea ice core data sets following the previously introduced structure. We report on obstacles encountered in the validation of data sets, mainly due to missing or ambiguous metadata. This leaves the (re)user room for subjective interpretation and thus increases the uncertainty of the compilation. Examples are challenges in relating different data repositories associated with the same location or the same campaign, the accuracy of measurement methods, and the processing stage of the data, all of which often require bilateral iteration with the data acquisition team. Our study shows that enriching data reusability through data compilations requires quality-ensured metadata at the individual data set level.

How to cite: Simson, A., Yildiz, A., and Kowalski, J.: Data compilations for enriched reuse of sea ice data sets, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-6726, https://doi.org/10.5194/egusphere-egu23-6726, 2023.

X4.192 | EGU23-7417 | ESSI3.5 | ECS | Highlight
Anil Yildiz and Julia Kowalski

Investigating the mechanics of physical processes involved in various geohazards, e.g. gravitational, flow-like mass movements, shallow landslides or flash floods, predicting their temporal or spatial occurrence, and analysing the associated risks clearly benefit from advanced computational process-based or data-driven models. Reproducibility is needed not only for the integrity of the scientific results, but also as a trust-building element in practical geohazards engineering. Various complex numerical models or pre-trained machine learning algorithms exist in the literature, for example, to determine landslide susceptibility in a region or to predict the run-out of torrential flows in a catchment. These use FAIR datasets with increasing frequency, for example DEM data to set up simulations, or open-access landslide databases for training and validation purposes. However, we maintain that workflow reproducibility is not ensured simply by the FAIRness of input or output datasets. The underlying computational or machine learning model needs to be (re)structured to enable the reproducibility and replicability of every step in the workflow, so that a model can be (re)built to reproduce the same results, or (re)used to elaborate on new cases or new applications. We propose a data-integrated, platform-independent scientific model publication approach combining self-developed Python packages, Jupyter notebooks, version control, FAIR data repositories and high-quality metadata. Developing the model as a Python package guarantees that it can be run by any end user, and defining submodules for analysis or visualisation within the package helps users build their own models upon the one presented. Publishing the manuscript as a data- and model-integrated Jupyter notebook creates a transparent application of the model, and the user can reproduce any result presented in the manuscript or in the datasets.
We demonstrate our workflow with two applications from geohazards research herein while highlighting the shortcomings of the existing frameworks and suggesting improvements for future applications.

How to cite: Yildiz, A. and Kowalski, J.: Data-integrated executable publications for reproducible geohazards research, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7417, https://doi.org/10.5194/egusphere-egu23-7417, 2023.

X4.193 | EGU23-7532 | ESSI3.5
Mattia Santoro, Paolo Mazzetti, and Stefano Nativi

Humankind is facing unprecedented global environmental and social challenges in terms of food, water and energy security, resilience to natural hazards, etc. To address these challenges, international organizations have defined a list of policy actions to be achieved in a relatively short and medium-term timespan (e.g., the UN SDGs). The development and use of knowledge platforms is key in helping the decision-making process to take significant decisions and avoid potentially negative impacts on society and the environment.

Scientific models are key tools for transforming the huge amount of data currently available online into information and knowledge. Executing a scientific model (implemented as analytical software) commonly requires the discovery and use of different types of digital resources (i.e. data, services, and infrastructural resources). In the present geoscience technological landscape, these resources are generally provided by different systems (working independently from one another) utilizing Web technologies (e.g. Internet APIs, Web Services, etc.). In addition, a given scientific model is often designed and developed for execution in a specific computing environment. These are important barriers to the reproducibility, replicability, and reusability of scientific models, which are becoming key interoperability requirements for a transparent decision-making process.

This presentation introduces the Virtual Earth Cloud concept, a multi-cloud framework for the generation of information and knowledge from Big Earth Data analytics. The Virtual Earth Cloud allows the execution of computational models to process and extract knowledge from Big Earth Data in a multi-cloud environment, thus improving their reproducibility, replicability and reusability.

The development and prototyping of the Virtual Earth Cloud is carried out in the context of the GEOSS Platform Plus (GPP) project, funded by the European Union's Horizon 2020 Framework Programme, which aims to contribute to the implementation of the Global Earth Observation System of Systems (GEOSS) by evolving the European GEOSS Platform components to allow access to tailor-made information and actionable knowledge.

How to cite: Santoro, M., Mazzetti, P., and Nativi, S.: Virtual Earth Cloud: a multi-cloud framework for improving replicability of scientific models, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-7532, https://doi.org/10.5194/egusphere-egu23-7532, 2023.

X4.194 | EGU23-4354 | ESSI3.5
Peter Löwe

This is a report, from the chapter editor's perspective, on a high-visibility publication effort to foster the adoption of the FAIR principles (Findable, Accessible, Interoperable, Reusable) by encouraging the adoption of Persistent Identifiers (PIDs) and repository-based workflows in geospatial open source software communities as good practices. Lessons learned are detailed about how to communicate the benefits of PID adoption to software project communities focussing on professional software development and meritocracy. The communication bottleneck patterns encountered, the significance of cross-project multiplicators, remaining challenges, and emerging opportunities for publishers and repository infrastructures are also reported. For the second edition of the Springer Handbook of Geographic Information, a team of scientific domain experts from several software communities was tasked to rewrite a chapter about Open Source Geographic Information Systems (DOI: 10.1007/978-3-030-53125-6_30). For this, a sample of representative geospatial open source projects was selected, based on the range of projects integrated in the OSGeoLive umbrella project (DOI: 10.5281/zenodo.5884859). The chapter's authors worked in close contact with the respective open source software project communities. Since the editing and production process for the Handbook was delayed due to the pandemic, this provided the opportunity to explore, improve and implement good practices for state-of-the-art PID-based citation of software projects and versions, but also of project communities, data and related scientific video resources. This was a learning process for all stakeholders involved in the publication project. By the completion of the project, the majority of the involved software projects had minted Digital Object Identifiers (DOIs) for their codebases.
While the adoption level of software versioning with automated PID-generation and metadata quality remains heterogeneous, the insights gained from this process can simplify and accelerate the adoption of PID-based best software community practices for other open geospatial projects according to the FAIR principles.

How to cite: Löwe, P.: Going FAIR by the book: Accelerating the adoption of PID-enabled good practices in software communities through reference publication., EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-4354, https://doi.org/10.5194/egusphere-egu23-4354, 2023.

X4.195 | EGU23-12443 | ESSI3.5 | Highlight
Eric Hutton and Gregory Tucker

Landlab is an open-source Python package designed to facilitate creating, combining, and reusing 2D numerical models. As a core component of the Community Surface Dynamics Modeling System (CSDMS) Workbench, Landlab can be used to build and couple models from a wide range of domains. We present how Landlab provides a platform that fosters a community of model developers and aids them in creating sustainable and FAIR (Findable, Accessible, Interoperable, Reusable) research software.

Landlab’s core functionality can be split into two main categories: infrastructural tools and community-contributed components. Infrastructural tools address the common needs of building new models (e.g. a gridding engine and numerical utilities for common tasks). Landlab’s library of community-contributed components consists of several dozen components that each model a separate physical process (e.g. routing of shallow water flow across a landscape, calculating groundwater flow, or biologic evolution over a landscape). As these user-contributed components are incorporated into Landlab, they attach to the Landlab infrastructure so that they become both findable and accessible (through, for example, standardized metadata and versioning) and are maintained by the core Landlab developers.

One key aspect of Landlab’s design is its use of a standard programming interface for all components. This ensures that all Landlab components are interoperable with one another and with other software tools, allowing researchers to incorporate Landlab's components into their own workflows and analyses. By separating processes into individual components, they become reusable and allow researchers to combine components in new ways without having to write new components from scratch.
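To make the idea of a standard component interface concrete, here is a minimal sketch of the pattern the paragraph above describes: components share a grid's named fields and expose a uniform `run_one_step` method, so they can be composed without knowing about each other. The class and field names here are invented for illustration and are not Landlab's actual API.

```python
class Grid:
    """A minimal shared grid holding named fields defined on nodes."""
    def __init__(self, n_nodes):
        self.n_nodes = n_nodes
        self.fields = {}

    def add_zeros(self, name):
        self.fields[name] = [0.0] * self.n_nodes
        return self.fields[name]


class Uplifter:
    """Hypothetical component: raises elevation at a constant rate."""
    def __init__(self, grid, rate=1.0):
        self.grid = grid
        self.rate = rate

    def run_one_step(self, dt):
        z = self.grid.fields["topographic__elevation"]
        for i in range(len(z)):
            z[i] += self.rate * dt


class Smoother:
    """Hypothetical component: relaxes each node value toward the mean."""
    def __init__(self, grid, strength=0.5):
        self.grid = grid
        self.strength = strength

    def run_one_step(self, dt):
        z = self.grid.fields["topographic__elevation"]
        mean = sum(z) / len(z)
        for i in range(len(z)):
            z[i] += self.strength * (mean - z[i]) * dt


# Because both components read/write the grid's shared fields and expose the
# same run_one_step signature, a driver can couple them interchangeably:
grid = Grid(n_nodes=4)
grid.add_zeros("topographic__elevation")
components = [Uplifter(grid, rate=2.0), Smoother(grid, strength=0.5)]
for _ in range(10):
    for c in components:
        c.run_one_step(dt=0.1)
```

The design choice worth noting is that coupling lives entirely in the shared field names and the uniform stepping interface, which is what lets independently written process components be recombined without new glue code.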

Overall, Landlab's design and development practices support the principles of FAIR research software, making it easier for scientific research to be shared and built upon. The design also provides a platform onto which model developers can attach their model components, taking advantage of Landlab’s development practices and infrastructure and ensuring that their components also follow FAIR principles.

How to cite: Hutton, E. and Tucker, G.: Landlab: a modeling platform that promotes the building of FAIR research software, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-12443, https://doi.org/10.5194/egusphere-egu23-12443, 2023.

Posters virtual: Wed, 26 Apr, 08:30–10:15 | vHall ESSI/GI/NP

Chairpersons: Klaus Zimmermann, Christin Henzen, Karsten Peters-von Gehlen
vEGN.4 | EGU23-229 | ESSI3.5 | ECS
Yuanqing He, Min Chen, Yongning Wen, and Songshan Yue

Integrated application of geo-analysis models is critical for geo-process research. Because the real world is continuous, a geo-analysis model cannot be applied directly over the entire space. To date, the method of regarding space as a sequence of computing units (i.e. a grid) has been widely used in geographic study. However, differences in the models' division algorithms result in distinct grid data structures. First, researchers must install and set up various software packages to generate the structure-specific grid data required by the models. This localized processing is inconvenient and inefficient. Second, in order to integrate models that use grid data of different structures, researchers need to design a specific conversion method for each integration scenario. Because researchers' development habits differ, such a conversion method is difficult to reuse in another runtime environment. The open and cross-platform character of web services enables users to generate data without installing dedicated software programs. It has the potential to revolutionize the present time-consuming process of grid generation and conversion, hence increasing efficiency.

Based on the standardized model encapsulation technology proposed by OpenGMS group, this paper presents a grid-service method tailored to the specific requirements of open geographic model integration applications, and the research work is carried out in the following three areas:

  • The basic strategy of grid servitization. The heterogeneity of grid generation methods is a major factor preventing them from being invoked in a unified way via web services. To reduce this heterogeneity, this study proposes a standardized description method based on the Model Description Language (MDL).
  • A method for constructing a grid data generation service. A unified representation approach for grid data is proposed in order to standardize the description of heterogeneous grid data; an encapsulation method for grid generation algorithms is proposed; and the grid service is realized by merging these with the main idea of grid servitization.
  • A method for constructing a grid data conversion service. A box-type grid indexing approach is provided to facilitate the retrieval of grid cells from large data volumes; two conversion types, topologically similar and topologically inaccessible grid data conversion, are summarized, along with the related conversion procedures. On this foundation, a grid conversion engine is built using the grid-service strategy as a theoretical guide and integrated with the grid conversion strategy.
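As one way to read the "box-type grid indexing" idea mentioned above, a coarse box grid can map each cell's bounding box to one or more boxes, so that retrieving candidate cells for a point only inspects one box instead of the whole dataset. This sketch is illustrative only; the data model and names are invented, not taken from the OpenGMS implementation.

```python
from collections import defaultdict


class BoxIndex:
    """Toy box-type index: cells are binned by their bounding boxes."""

    def __init__(self, box_size):
        self.box_size = box_size
        self.boxes = defaultdict(list)  # (bx, by) -> list of cell ids

    def _box_of(self, x, y):
        return (int(x // self.box_size), int(y // self.box_size))

    def insert(self, cell_id, xmin, ymin, xmax, ymax):
        # A cell is registered in every box its bounding box overlaps.
        bx0, by0 = self._box_of(xmin, ymin)
        bx1, by1 = self._box_of(xmax, ymax)
        for bx in range(bx0, bx1 + 1):
            for by in range(by0, by1 + 1):
                self.boxes[(bx, by)].append(cell_id)

    def candidates(self, x, y):
        # Only cells sharing the query point's box are exact-test candidates.
        return self.boxes.get(self._box_of(x, y), [])


index = BoxIndex(box_size=10.0)
index.insert("cell-A", 0, 0, 8, 8)    # fits inside one box
index.insert("cell-B", 5, 5, 15, 15)  # spans four boxes
print(index.candidates(2.0, 2.0))     # ['cell-A', 'cell-B']
```

A lookup narrows a potentially huge cell set to a handful of candidates, after which an exact point-in-cell test would run only on those; this is the usual trade-off of coarse spatial binning.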

Based on the grid-service approach proposed in this paper, researchers can generate and convert grid data without the tedious steps of downloading and installing programs. More time can thus be spent on solving geographic problems, increasing efficiency.

How to cite: He, Y., Chen, M., Wen, Y., and Yue, S.: A web-based strategy to reuse grids in geographic modeling, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-229, https://doi.org/10.5194/egusphere-egu23-229, 2023.

vEGN.5 | EGU23-3006 | ESSI3.5
Vincent Fazio

The AuScope 3D Geomodels Portal is a website designed to display a variety of geological models and associated datasets and information from across the Australian continent. The models are imported from publicly available sources, namely Australian government geological surveys and research organisations. Often the models come in the form of downloadable file packages designed to be viewed in specialised geological software applications. They usually contain enough information to view the model’s structural geometry and datasets, and a minimal amount of geological textual information. Seldom do they contain substantial metadata; often they were created before the term ‘FAIR’ was coined or before the importance of metadata had dawned upon many of us. This creates challenges for data providers and aggregators trying to maintain a certain standard of FAIR compliance across all their offerings. How can the standard of FAIR compliance of metadata extracted from these models be improved? How can these models be integrated into existing metadata infrastructure? For the Geomodels Portal, these concerns are addressed within the automated model transformation software, which transforms the source file packages into a format suitable for display in a modern WebGL-compliant browser. Owing to the nature of the model source files, only a very modest amount of metadata can be extracted, so other sources of metadata must be introduced. For example, the dataset provider will often publish a downloadable PDF report or a web-page description associated with the model; automated textual analysis is used to extract further information from these sources. At the end of the transformation process, an ISO-compliant metadata record is created for import into a GeoNetwork catalogue. The GeoNetwork catalogue record can then be used for integration with other applications.
For example, AuScope’s flagship portal, the AuScope Portal, displays information, download links and the geospatial footprint of models on a map. The metadata can also be displayed in the Geomodels Portal.
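The last step of the pipeline described above — assembling sparse, automatically extracted fields into a catalogue-ready record — can be sketched as follows. This is a deliberately simplified, illustrative fragment, not AuScope's actual transformation software: real ISO 19115/19139 records require considerably deeper element nesting, and all field values here are invented.

```python
import xml.etree.ElementTree as ET

# ISO 19139 namespaces (real); the element nesting below is simplified.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def make_record(identifier, title, abstract):
    """Assemble extracted fields into a minimal ISO-style metadata record."""
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    fid = ET.SubElement(root, f"{{{GMD}}}fileIdentifier")
    ET.SubElement(fid, f"{{{GCO}}}CharacterString").text = identifier
    info = ET.SubElement(root, f"{{{GMD}}}identificationInfo")
    # In a full record these would sit several levels deeper
    # (MD_DataIdentification/citation/...); flattened to keep the sketch short.
    ET.SubElement(info, f"{{{GMD}}}title").text = title
    ET.SubElement(info, f"{{{GMD}}}abstract").text = abstract
    return root


record = make_record(
    identifier="geomodel-example-001",  # hypothetical identifier
    title="Example 3D geological model",
    abstract="Summary extracted automatically from the provider's PDF report.",
)
xml_text = ET.tostring(record, encoding="unicode")
```

The point of emitting a standards-shaped XML record, even when the extracted metadata is thin, is that a catalogue such as GeoNetwork can then harvest, index and cross-link the model like any richer dataset.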

How to cite: Fazio, V.: How AuScope 3D Geomodels Portal integrates relatively metadata poor geological models into its metadata infrastructure, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3006, https://doi.org/10.5194/egusphere-egu23-3006, 2023.