ESSI3.3 | Navigating the Realities of FAIR and Open Data in Research Data Repositories: a Balancing Act Between Fostering Open Science vs Meeting Reporting Requirements
PICO | Fri, 28 Apr, 08:30–10:15 (CEST) | PICO spot 2
Convener: Kirsten Elger | Co-conveners: Rebecca Farrington (ECS), Alice Fremand, Kristin Vanderbilt, Melanie Lorenz (ECS)
Research data repositories store a variety of research outputs, ranging from raw observational data derived from monitoring infrastructures (satellites, drones, aircraft, remote sensors, in situ analytical laboratories, etc.) to downstream derived products from scientific projects, the full suite of small and highly variable data from long-tail communities, and sample descriptions. Repositories play an important part in supporting open data and open science, and their curation of datasets, together with the operation and continuous development of access services, is key to making their data assets FAIR. Many data and sample repositories and their data provider communities are continually working on improving dataset identification and attribution, version control, provenance recording and tracking, and the enrichment of relevant metadata records.

Traditionally, data centers monitored their usage by logging access to datasets (number, volume, success/fail rate) and used connection IP addresses for rough geographic differentiation. That usage data, together with user feedback received through other channels, supported data center governance and was sufficient for reporting to funders and other stakeholders. In recent times, however, many repositories are being asked by their sponsors and funding agencies to report in greater detail than was customary on which data and services are used, by whom, and for what purpose: these more detailed requests can conflict with privacy legislation. More sophisticated systems now have to be implemented to enhance usage statistics, and their costs often compete with activities such as improving customer service and user experience.
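
This traditional kind of monitoring can be sketched in a few lines. The log format below is an Apache-style access log, and the dataset paths and IP addresses are illustrative, not those of any particular data center:

```python
import re
from collections import Counter

# Illustrative Apache-style access-log lines; paths and IPs are made up.
LOG_LINES = [
    '192.0.2.10 - - [01/Mar/2023:08:30:01 +0000] "GET /data/ds001.nc HTTP/1.1" 200 1048576',
    '192.0.2.10 - - [01/Mar/2023:08:31:12 +0000] "GET /data/ds002.nc HTTP/1.1" 404 512',
    '198.51.100.7 - - [01/Mar/2023:09:02:44 +0000] "GET /data/ds001.nc HTTP/1.1" 200 1048576',
]

LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "GET (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)

def summarise(lines):
    """Count requests, transferred volume, and success/fail rate per dataset path."""
    requests, volume, ok, fail = Counter(), Counter(), 0, 0
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not GET requests
        requests[m["path"]] += 1
        if m["status"].startswith("2"):
            ok += 1
            volume[m["path"]] += int(m["size"])
        else:
            fail += 1
    return requests, volume, ok, fail

requests, volume, ok, fail = summarise(LOG_LINES)
print(requests["/data/ds001.nc"], ok, fail)  # → 2 2 1
```

The connection IP (the `ip` group) is what repositories would feed into a coarse geolocation lookup; anything finer-grained than this is where the privacy tension described above begins.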

This session will showcase a range of best practices from research data repositories that are working on making data and metadata open, FAIR, and accessible to both humans and machines. Contributions are welcome on the unique identification of datasets and physical samples in a repository, on attribution ensuring that both funders and creators are credited (particularly for raw observational datasets), on provenance tracking, and on making QA/QC assessments of published datasets publicly available and FAIR. Methodologies and systems for implementing user identification and usage tracking are also invited, including protecting privacy and balancing scarce resources against the need to continuously improve the user experience.

PICO: Fri, 28 Apr | PICO spot 2

08:30–08:35
08:35–08:45 | PICO2.1 | EGU23-13223 | ESSI3.3 | ECS | solicited | On-site presentation
Marthe Klöcking, Adrian Sturm, Bärbel Sarbas, Leander Kallas, Stefan Möller-McNett, Jens Nieschulze, Kerstin Lehnert, Kirsten Elger, Wolfram Horstmann, Daniel Kurzawe, Matthias Willbold, and Gerhard Wörner

The GEOROC database is a leading, open-access source of geochemical and isotopic datasets of igneous and metamorphic rocks and minerals. It was established 24 years ago and currently provides access to curated compilations of rock and mineral compositions from >20,600 publications (>32 million single data values). The Digital Geochemical Data Infrastructure (DIGIS) initiative for GEOROC 2.0 is now building a connected platform capable of supporting the diverse demands of digital, data-based geochemical research: including modern solutions to data submission, discovery and access.

One of the challenges of maintaining a high-quality, up-to-date database such as GEOROC is consistent data entry. Historically, data were compiled manually from the academic literature by trained curators. This manual data entry process is slow, resource-intensive, and prone to errors. Exacerbated by the lack of best practices or standards for reporting analytical geochemical data, the quality and completeness of data and metadata compiled in this way are highly variable. A possible solution to this challenge is offered by domain-specific repositories: driven in part by the demands of some funders and publishers to make all research data publicly available, data producers increasingly publish their research datasets, affording repositories a unique opportunity to impose consistent standards and quality. Following these developments, DIGIS established a domain repository with DOI minting capabilities in 2021 to support independent data submission by authors. In principle, these data submissions may comprise new analytical results as well as compilations of previously published data (“expert datasets”). DIGIS also uses its repository for versioning of the GEOROC data compilations and to provide distinct, citable objects to researchers who use GEOROC compilations in their work (so-called “precompiled files”: pre-formatted results of the most popular search queries to the GEOROC database, regularly updated and re-published). However, whilst all data submissions by authors must fall within the scope of the GEOROC database, new analytical data need to meet additional quality requirements: the repository enforces a strict template to ensure consistent reporting of all relevant sample and method/analysis metadata.
These templates can then be automatically harvested from the repository directly into the GEOROC database, with the added guarantee that new data entries are a) approved by the owners of the datasets, and b) follow a consistent data reporting and quality standard.
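
The kind of template check such a strict submission workflow implies can be sketched as follows; the field names and validation rules are hypothetical, not the actual GEOROC template:

```python
# Sketch of a strict-template check before harvesting a submission.
# Field names and rules are illustrative, not the actual GEOROC template.
REQUIRED_FIELDS = {"sample_id", "latitude", "longitude", "material", "method", "value", "unit"}

def validate_row(row: dict) -> list[str]:
    """Return a list of problems for one template row (empty list = accepted)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - row.keys())]
    if "latitude" in row and not -90 <= float(row["latitude"]) <= 90:
        problems.append("latitude out of range")
    return problems

row = {"sample_id": "S-001", "latitude": "47.1", "longitude": "11.3",
       "material": "basalt", "method": "XRF", "value": "49.2", "unit": "wt%"}
print(validate_row(row))  # → []
```

Running such checks at submission time, rather than during later curation, is what gives harvested entries the consistency guarantee described above.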

To encourage user uptake of both the repository and the compilations available in the GEOROC database, DIGIS is working closely with IEDA2 and EarthChem towards developing a common infrastructure for geochemical data. One goal of this collaboration is a single repository submission platform that asserts the same requirements for data and metadata quality of submitted datasets. In addition, DIGIS has also partnered with GFZ Data Services as their trusted domain repository. Finally, through the OneGeochemistry initiative, all three partners are working towards global community-endorsed best practices for geochemical data publication. Ultimately, these efforts will facilitate greater interoperability between globally distributed geochemical data systems, enabling more user-friendly delivery of data publication and compilation services to the research community.

How to cite: Klöcking, M., Sturm, A., Sarbas, B., Kallas, L., Möller-McNett, S., Nieschulze, J., Lehnert, K., Elger, K., Horstmann, W., Kurzawe, D., Willbold, M., and Wörner, G.: Exploiting Curated, Domain-Specific Repositories to Facilitate Globally Interoperable Databases: the GEOROC Use-Case for Global Geochemical Data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13223, https://doi.org/10.5194/egusphere-egu23-13223, 2023.

08:45–08:47 | PICO2.2 | EGU23-10300 | ESSI3.3 | ECS | Virtual presentation
Lucia Profeta, Kerstin Lehnert, Peng Ji, Gokce Ustunisik, Roger Nielsen, Dave Vieglais, Douglas Walker, Karin Block, and Michael Grossberg

IEDA, the Interdisciplinary Earth Data Alliance, is a unique collaborative data infrastructure that provides and continuously evolves a comprehensive ecosystem of data, tools, and services. It supports researchers in the Geosciences in sharing and accessing sample data following the FAIR data principles and ensures open, reproducible, and transparent science practices.

The ‘next generation’ of IEDA - IEDA2 - was funded by the US National Science Foundation in 2022 for 5 years to advance the existing data systems and services of EarthChem (geochemistry data repository, data synthesis, and data access portals), LEPR/TraceDs (synthesis of data from petrological experiments), and SESAR (System for Earth Sample Registration), modernizing system architecture to better support computational and data-driven research, improving usability, and growing a diverse and inclusive user audience through education and engagement.

Our target is a common understanding of existing information, validated through expert data curation, that enables transparent and reproducible (re)use and analysis of data, informs peer review, and guides future research directions. Such a common understanding cannot develop unless the community can access and assess the same information.

This collective vision of the data will improve the ability of the entire community to assess the context and significance of new data and models, allow reviewers to evaluate new models that are calibrated using the data, and facilitate new generations of research.

How to cite: Profeta, L., Lehnert, K., Ji, P., Ustunisik, G., Nielsen, R., Vieglais, D., Walker, D., Block, K., and Grossberg, M.: Open and FAIR sample based data sharing through the IEDA2 facility, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-10300, https://doi.org/10.5194/egusphere-egu23-10300, 2023.

08:47–08:49 | PICO2.3 | EGU23-6831 | ESSI3.3 | On-site presentation
Alessandra Nuzzo, Fabrizio Antonio, Maria Mirto, Paola Nassisi, Sandro Fiore, and Giovanni Aloisio

The Earth System Grid Federation (ESGF) is an international collaboration powering most global climate change research and managing the first-ever decentralized repository for handling climate science data, with multiple petabytes of data at dozens of federated sites worldwide. It is recognized as the leading infrastructure for the management and access of large distributed data volumes for climate change research and supports the Coupled Model Intercomparison Project (CMIP) and the Coordinated Regional Climate Downscaling Experiment (CORDEX), whose protocols enable the periodic assessments carried out by the IPCC, the Intergovernmental Panel on Climate Change.

As a trusted international repository, ESGF hosts and replicates data from a broad range of Earth science domains and communities, strongly supporting standards for connecting data and applying the FAIR data principles to ensure free and open access and interoperability with similar systems in the Earth sciences.

ESGF includes a specific software component, funded by the H2020 projects IS-ENES2 and IS-ENES3, named ESGF Data Statistics, which collects, analyzes, and visualizes data usage metrics and data archive information across the federation.

It provides a distributed and scalable software infrastructure responsible for capturing a set of metrics at both the single-site and federation level. It collects and stores a high volume of heterogeneous metrics, covering coarse- and fine-grained measures such as download and client statistics, as well as aggregated cross-project and project-specific download statistics, thus offering a more user-oriented perspective on the scientific experiments.

This provides strong feedback on how much, how frequently, and how intensively the whole federation is exploited by end-users, as well as on the most downloaded data, which captures the level of community interest in specific data. It also gives feedback on the least accessed data, which on one side can help design larger-scale experiments in the future and on the other can yield insights into the long tail of research. On top of this, a view of the total amount of data published and available through ESGF lets users monitor the status of the data archive of the entire federation.
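
A minimal sketch of the federation-level aggregation described above, with made-up site names and per-dataset download counts (the real Data Statistics service covers far more metrics than this):

```python
from collections import Counter

# Hypothetical per-site download counts per dataset; all names are illustrative.
site_stats = {
    "site-a": Counter({"CMIP6.tas": 120, "CMIP6.pr": 45, "CORDEX.eur11": 3}),
    "site-b": Counter({"CMIP6.tas": 80, "CORDEX.eur11": 2}),
}

# Federation-level totals: summing Counters merges per-site statistics.
federation = sum(site_stats.values(), Counter())

most = federation.most_common(1)[0]                    # most downloaded dataset
least = min(federation.items(), key=lambda kv: kv[1])  # least accessed dataset
print(most, least)  # → ('CMIP6.tas', 200) ('CORDEX.eur11', 5)
```

The "most downloaded" view indicates community interest; the "least accessed" view is what the abstract suggests can inform future experiment design and long-tail analysis.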

This contribution presents an overview of the Data Statistics capabilities as well as the main results in terms of data analysis and visualization.

How to cite: Nuzzo, A., Antonio, F., Mirto, M., Nassisi, P., Fiore, S., and Aloisio, G.: Tracking and reporting peta-scale data exploitation within the Earth System Grid Federation through the ESGF Data Statistics service, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-6831, https://doi.org/10.5194/egusphere-egu23-6831, 2023.

08:49–08:51 | PICO2.4 | EGU23-3574 | ESSI3.3 | On-site presentation
Alberto Accomazzi and the ADS Team

The NASA Astrophysics Data System (ADS) is the primary Digital Library portal for Space Science Researchers. In addition to the scientific literature, the ADS has for a long time included in its database non-traditional scholarly resources such as research proposals, software packages, and high-level data products, making them discoverable and easily citable. Over the next three years, in response to NASA's efforts supporting interdisciplinary research and Open Science initiatives, the ADS will greatly expand its coverage of the literature, and will develop a new portal unifying access to the fields of Astrophysics, Planetary Science, Heliophysics, and Earth Science. It will also cover NASA funded research in Biological and Physical Sciences. The planned system will combine a scalable, discipline-agnostic core with a set of discipline specific knowledge centers which will curate and enrich its content using deep subject matter expertise from the NASA Science divisions. In this talk I will provide an overview of the ADS system, its distinguishing features, and then focus on our efforts to support and promote the FAIR principles as part of NASA's Year of Open Science initiatives.

How to cite: Accomazzi, A. and the ADS Team: Expansion of the NASA Astrophysics Data System to Earth and Space Sciences, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-3574, https://doi.org/10.5194/egusphere-egu23-3574, 2023.

08:51–08:53 | PICO2.5 | EGU23-8170 | ESSI3.3 | On-site presentation
Christian Pagé, Abel Aoun, Alessandro Spinuso, Klaus Zimmermann, and Lars Bärring

Doing high-quality research involves complex workflows and intermediate datasets. Sharing those datasets, software tools, and workflows among researchers, and tracking their provenance and lineage, is an important part of this. Data also need to be stored in a citable permanent repository so that they can be referenced in papers and subsequently reused by other researchers. Properly supporting this research data life cycle is a very challenging objective for research infrastructures, especially given rapidly evolving technologies and the difficulty of sustaining funding and human expertise.

In the climate research infrastructure, many efforts have been made to support end-users and long-tail research. The basic data distribution, the ESGF data nodes, mainly supports specialized researchers in climate science. This basic infrastructure implements quite strict standards to enable proper data sharing in the research community. This is far from full FAIR compliance, but it has proven extremely beneficial for collaborative research. Of course, higher-level components and services can be built on top. This is not an easy task, and a layered approach is preferable to hide the underlying complexity and to prevent technology lock-in and overly complex code. One example is the IS-ENES C4I 2.0 platform (https://dev.climate4impact.eu/), a front-end that greatly eases data access and acts as a bridge between the data nodes and computing services. The C4I platform provides a much enhanced JupyterLab-like interface (SWIRRL), with many services to support sharing of data and common workflows for data staging and preprocessing, as well as the development of new analysis methods in a research context. Advanced tools that can calculate end-user products are also made available, along with example notebooks implementing popular workflows. One of these tools is icclim (https://github.com/cerfacs-globc/icclim), a Python software package. C4I also includes high-level services such as on-the-fly inter-comparisons between climate simulations with ESMValTool (https://github.com/ESMValGroup/ESMValTool). All this work also includes large efforts to standardize and to move closer to FAIR for data, workflows, and software.

Another way of helping researchers is to pre-compute end-user products such as climate indices. This is extremely useful because calculating those products can be really complex and time-consuming. For example, datasets of climate indices pre-computed on CMIP6 simulations would be very valuable to such users. Of course, not all specific needs can be taken into account, but the most general ones can be fulfilled. The European Open Science Cloud (EOSC) provides computing and storage resources through the EGI-ACE project, enabling the computation of several climate indices. In this EGI-ACE Use Case, icclim will be used to compute 49 standard climate indices on a large number of CMIP6 simulations, starting with the most used ones. It could also be extended to ERA5 reanalysis, CORDEX, and CMIP5 datasets.
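
To illustrate the kind of standard index icclim computes, here is a minimal sketch of the SU ("summer days") index, the annual count of days with daily maximum temperature above 25 °C; the data are synthetic, not a real simulation, and this computes the index directly rather than through the icclim API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily maximum temperature (°C) for one year at a single grid
# point: a seasonal cycle plus random weather noise. Illustrative only.
tasmax = 15 + 12 * np.sin(np.linspace(0, 2 * np.pi, 365)) + rng.normal(0, 3, 365)

# SU ("summer days"): annual count of days with tasmax > 25 °C, one of the
# standard ETCCDI climate indices that icclim can compute on CMIP6 output.
su = int(np.sum(tasmax > 25.0))
print("summer days:", su)
```

Pre-computing such indices across many CMIP6 simulations spares each end-user from downloading and processing the full daily fields themselves.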

This project (IS-ENES3) has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement N°824084.

How to cite: Pagé, C., Aoun, A., Spinuso, A., Zimmermann, K., and Bärring, L.: Tools to support climate researchers for long-tail research data in a FAIR context, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8170, https://doi.org/10.5194/egusphere-egu23-8170, 2023.

08:53–08:55 | PICO2.6 | EGU23-11512 | ESSI3.3 | On-site presentation
Marc Schaming, Alice Fremand, Mathieu Turlure, and Jean Schmittbuhl

The CDGP [https://cdgp.u-strasbg.fr], the Data Center for Deep Geothermal Energy, is a research repository for deep geothermal data in Alsace (France) originating from academic observatories or industrial providers. It collects seismological (catalogues, waveforms, focal mechanisms), seismic, hydraulic, geological, and other data related to anthropogenic hazard from the different phases of a geothermal project, mainly from the exploration and development stages at the Soultz-sous-Forêts, Rittershoffen, and Vendenheim EGS geothermal sites. Data are verified, validated, and curated; they are then described with metadata following the ISO 19115/19139 standards and have owner-defined distribution rules associated with them. A persistent Digital Object Identifier (DOI) is associated with the collections, as well as a “how to cite” statement for better citation. Data are converted into standard formats and archived in data warehouses. The CDGP is also a node of the EPOS TCS-AH platform [https://tcs.ah-epos.eu/] and provides Episodes metadata and data.

Metadata are open and harvested; they are also pushed to the TCS-AH platform. Data follow the principle “as open as possible, as closed as necessary”. The CDGP has set up an authentication, authorization, and accounting infrastructure (AAAI) to compare distribution rules with user affiliation. Access to data on the TCS-AH platform is conditional on academic membership, and data are provided on demand. Access controls are made as near to the user as possible (subsidiarity principle). Monitoring and reporting are part of the AAAI, and usage is governed by the terms of use and privacy statement. Reports are sent to data providers every semester. The CDGP carries out regular bibliographic follow-up, e.g. with Google Scholar or by surveying users that downloaded data. For its part, the TCS-AH platform can only provide general statistics on user engagement, origin, etc.
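
The rule-versus-affiliation comparison such an AAAI performs can be sketched as follows; the rule names and the academic-only policy here are illustrative, not the CDGP's actual configuration:

```python
# Sketch of comparing an owner-defined distribution rule with user affiliation,
# as an AAAI might. Rule names and the policy below are illustrative.
def may_access(distribution_rule: str, user: dict) -> bool:
    """Decide whether a user may access data under a given distribution rule."""
    if distribution_rule == "open":
        return True
    if distribution_rule == "academic-only":
        return user.get("affiliation_type") == "academic"
    return False  # unknown or restricted rules default to deny

print(may_access("academic-only", {"affiliation_type": "academic"}))    # → True
print(may_access("academic-only", {"affiliation_type": "commercial"}))  # → False
```

Defaulting to deny for unrecognized rules reflects the "as closed as necessary" half of the principle quoted above.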

How to cite: Schaming, M., Fremand, A., Turlure, M., and Schmittbuhl, J.: Access control and reporting at the Data Center for Deep Geothermal Energy, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-11512, https://doi.org/10.5194/egusphere-egu23-11512, 2023.

08:55–08:57 | PICO2.7 | EGU23-14871 | ESSI3.3 | On-site presentation
Florian Ott, Kirsten Elger, and Simone Frenzel

Implementing the FAIR principles is becoming more and more relevant for the scientific community, and in particular for the Earth System Sciences. It is widely acknowledged that research data not only contain information relevant to their respective field of study but can also open new avenues of research when appropriately re-used and/or combined with other information. Research data repositories represent key data access points here, and domain repositories in particular, with their careful data curation and enrichment of metadata, are increasingly relevant and valuable.

GFZ Data Services, hosted at the GFZ German Research Centre for Geosciences (GFZ), is a domain repository for geoscience data that has assigned digital object identifiers (DOIs) to data and scientific software since 2004 and is the Allocating Agent for the International Generic Sample Number (IGSN), the globally unique persistent identifier for physical samples. The repository provides DOI minting services for several global monitoring networks/observatories in geodesy and geophysics (e.g. INTERMAGNET; the IAG Services ICGEM, IGETS, IGS; GEOFON) and collaborative projects (TERENO, EnMAP, GRACE, CHAMP), and on the other hand has a strong focus on long-tail data. All metadata and data are curated by domain scientists.

In particular, the provision of (i) comprehensive, domain-specific, standardised, and machine-actionable metadata with linked-data vocabularies used in the geosciences and (ii) comprehensive technical data descriptions or DOI-referenced data reports complementing the metadata results in high-quality data publications that are easily discoverable across domains. Furthermore, cross-references through persistent identifiers (DOI, IGSN, ORCID, Fundref, ROR) to related research products (texts, data, software, people, and institutions) further increase the visibility and interoperability of research data.

In addition to curation workflows for data and metadata carried out by domain experts, GFZ Data Services offers detailed and thorough user guidance via its website (https://dataservices.gfz-potsdam.de). This website is the central information and access point for the repository, providing the data and sample catalogues, information on metadata, data formats, and the data publication workflow, an FAQ, links to different versions of our metadata editor, downloadable data description templates, and general information on data management practices.

How to cite: Ott, F., Elger, K., and Frenzel, S.: Because it matters: Benefits of using the domain repository GFZ Data Services for Earth System Sciences data, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-14871, https://doi.org/10.5194/egusphere-egu23-14871, 2023.

08:57–08:59 | PICO2.8 | EGU23-13473 | ESSI3.3 | On-site presentation
Otto Lange, Laurens Samshuijzen, Kirsten Elger, Simone Frenzel, Ronald Pijnenburg, Richard Wessels, Geertje ter Maat, and Martyn Drury

The EPOS Multi-scale Laboratories (MSL) community includes a wide range of world-class solid Earth science laboratory infrastructures and as such provides a multidisciplinary, coherent platform for both virtual access to data and physical access to sophisticated research equipment. The MSL laboratories provide facilities for highly specialized experimental research that results in experimental and analytical data underlying publications about phenomena ranging from the molecular to the continental scale.

From the perspective of the intended FAIRness of these laboratory data, the challenge for the MSL community has been to develop a data management paradigm that on the one hand acknowledges the uniqueness of many of the data collections involved, and on the other maximizes their findability through metadata dissemination via common standards into larger cross-disciplinary communities. Furthermore, besides provenance information about the data themselves, harmonized information about research groups and experimental assets is increasingly important for feeding the network relations that help make sense of scientific impact.

As part of the MSL Data Publication Chain, the MSL community has developed a standardised workflow that allows easy metadata exchange based on common formats (e.g., flavors of DCAT-AP, DataCite 4.x, and ISO19115), whereas at the same time it integrates dedicated ontologies to give access to the richness of specialized terminology with respect to the MSL subdomains (e.g., analogue modelling, paleomagnetism, rock physics, geochemistry). Community developed controlled vocabularies act as the binding agent between data, equipment, and the experiment itself, while at the same time processing tools like a user-friendly metadata editor and a CKAN-based MSL data publication portal provide the building blocks for the chain towards cross-disciplinary sustainable dissemination.

We will demonstrate how the MSL data management paradigm exploits both the strength of controlled terminology and the availability of good agnostic common standards in an approach for managing heterogeneous data coming from long tail communities.

How to cite: Lange, O., Samshuijzen, L., Elger, K., Frenzel, S., Pijnenburg, R., Wessels, R., ter Maat, G., and Drury, M.: Connecting the Long Tail: sharing and describing heterogeneous data via common metadata standards, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13473, https://doi.org/10.5194/egusphere-egu23-13473, 2023.

08:59–09:01 | PICO2.9 | EGU23-13514 | ESSI3.3 | ECS | Virtual presentation
Mareike Wieczorek, Alexander Brauser, Birgit Heim, Simone Frenzel, Linda Baldewein, Ulrike Kleeberg, and Kirsten Elger

The International Generic Sample Number (IGSN) is a unique and persistent identifier for physical objects that was originally developed in the Geosciences. In 2022, after 10 years of service operation and more than 10 million registered samples worldwide, IGSN e.V. and DataCite have agreed on a strategic partnership. As a result, all IGSNs are now registered as DataCite DOIs and the IGSN metadata schema will be mapped to the DataCite Metadata Schema according to agreed guidelines. This will, on the one hand, enrich the very limited mandatory information shared by IGSN allocating agents so far. On the other hand, the DataCite metadata schema is not designed for the comprehensive description of physical objects and their provenance.

The IGSN Metadata Schema is modular: the mandatory Registration Schema includes only information on the IGSN identifier, the minting agent, and a date. It is complemented by the IGSN Description Schema (for data discovery) and additional extensions by the allocating agents to customise the sample description according to their samples’ subdomain.

Within the project “FAIR Workflows to establish IGSN for Samples in the Helmholtz Association (FAIR WISH)”, funded by the Helmholtz Metadata Collaboration Platform (HMC), we

(1) customised the GFZ-specific schema to describe water, soil and vegetation samples and

(2) supported metadata collection by individual researchers with a user-friendly batch registration template in MS Excel.

The information collected with the template can be converted directly to XML files (or JSON in the future) following the IGSN Metadata Schema, which is required to generate IGSN landing pages. The template is also the source for generating DataCite metadata.
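
A minimal sketch of that conversion step, turning one template row into an XML sample description; the element names and values below are illustrative, not the actual IGSN Metadata Schema:

```python
import xml.etree.ElementTree as ET

# One row from a batch-registration template. Element names and values are
# illustrative placeholders, not the actual IGSN Metadata Schema.
row = {
    "igsn": "XX.EXAMPLE.0001",
    "name": "Example core section",
    "sampleType": "Core",
    "collector": "A. Researcher",
}

# Build a flat XML description from the row, one element per template column.
sample = ET.Element("sample")
for key, value in row.items():
    ET.SubElement(sample, key).text = value

xml_doc = ET.tostring(sample, encoding="unicode")
print(xml_doc)
```

The same row dictionary could equally be serialized to JSON, which matches the abstract's note that a JSON output is foreseen alongside XML.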

The integration of linked data vocabularies (RDF, SKOS) in the metadata is an essential step in harmonising information across different research groups and institutions and important for the implementation of the FAIR Principles (Findable, Accessible, Interoperable, Reusable) for sample descriptions. More information on these controlled vocabularies can be found in the FAIR WISH D1 List of identified linked open data vocabularies to be included in IGSN metadata (https://doi.org/10.5281/zenodo.6787200).

The template to register IGSNs for samples should ideally fit various sample types. In a first step, we created templates for surface water and vegetation samples from AWI polar land expeditions (AWI Use Case) and incorporated the two other FAIR WISH use cases, with core material from the Ketzin coring site (Ketzin Use Case) and a wide range of marine biogeochemical samples (Hereon Use Case). The template comprises a few mandatory and many optional variables describing a sample, the sampling activity, the location, and so on. Users can easily create their Excel template, including only the variables needed to describe a sample. A tutorial on how to use the FAIR WISH Sample Description Template (https://doi.org/10.5281/zenodo.7520016) can be found in the FAIR WISH D3 Video Tutorial for the FAIR SAMPLES Template (https://doi.org/10.5281/zenodo.7381390). As our registration template is still a work in progress, we welcome user feedback (https://doi.org/10.5281/zenodo.7377904).

Here we will present the template and discuss its applicability for sample registration.

How to cite: Wieczorek, M., Brauser, A., Heim, B., Frenzel, S., Baldewein, L., Kleeberg, U., and Elger, K.: FAIR WISH project – developing metadata templates for IGSN Registration for various sample types, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-13514, https://doi.org/10.5194/egusphere-egu23-13514, 2023.

09:01–09:03 | PICO2.10 | EGU23-8006 | ESSI3.3 | On-site presentation
Robert Huber, Christelle Pierkot, Marine Vernet, and Angelo Strollo

Although there has been broad acceptance across disciplines for years that research data should be made FAIR (findable, accessible, interoperable, and reusable), there is still no consensus in most communities on how to implement this concretely on a discipline-specific basis. Available metrics are exclusively domain-agnostic, and there are few approaches to formulating binding metrics and tests for particular domains (e.g. the earth and environmental science disciplines) and implementing them in assessment tools.

In this presentation, we will introduce new approaches being developed in the FAIR-IMPACT project, based on domain-specific use case partners and their communities, including those from the earth and environmental sciences (e.g. collaboration with the communities involved in the FAIR-EASE project), to extend and adapt existing FAIR metrics for assessing data objects and the F-UJI FAIR Assessment Tool so that they more fully incorporate the disciplinary ‘geo’ context. A particular focus will be on incorporating geo-specific metadata standards, covering data formats and semantic artefacts within FAIR metrics, and on the detection or verification of these standards by the F-UJI FAIR Assessment Tool.

Finally, we will also report on the collaboration with one of the EIDA Data Centers (as part of the European Infrastructure for seismic waveform data in EPOS), where the F-UJI FAIR Assessment Tool has been further developed to be aware of highly domain-specific standards for data, metadata, and services.

How to cite: Huber, R., Pierkot, C., Vernet, M., and Strollo, A.: Implementing FAIR metrics and assessments for the Earth and Environmental Sciences, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-8006, https://doi.org/10.5194/egusphere-egu23-8006, 2023.

09:03–09:05 | PICO2.11 | EGU23-5047 | ESSI3.3 | On-site presentation
Eileen Hertwig, Andrea Lammert, Heinke Höck, Andrej Fast, and Hannes Thiemann

The World Data Center for Climate (WDCC) provides access to and offers long-term archiving for datasets relevant for climate and Earth System research in a highly standardized manner following the FAIR principles. The focus is on climate simulation data. The WDCC services are aimed at both scientists who produce data (e.g. to fulfill the guidelines of good scientific practice) and scientists who re-use published data for new research.

The WDCC is hosted by the German Climate Computing Center (DKRZ) in Hamburg, Germany. The repository has been an accredited regular member of the World Data System (WDS) since 2003. WDCC is certified as a Trustworthy Data Repository by CoreTrustSeal (https://www.coretrustseal.org).

The WDCC was actively involved in developing mechanisms to publish scientific datasets as citable entities: the first DataCite DOI ever assigned to a dataset was for a WDCC dataset in 2004 (http://dx.doi.org/10.1594/WDCC/EH4_OPYC_SRES_A2). Since then, dataset collections in the WDCC can be published with a DOI. In 2022, in compliance with the FAIR principles, the WDCC also implemented the assignment of persistent identifiers (PIDs) to individual datasets. A PID is a long-lasting reference to a dataset (or other digital object) designed to always provide access to the object, or to a representation of it, even if the object's actual URL changes over time.
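The practical value of such identifiers is that they resolve indirectly: a client asks the resolver, not the (possibly moved) landing page. As a minimal sketch, the DOI above can be resolved via doi.org with DataCite content negotiation to retrieve current machine-readable metadata; the media type used below is DataCite's documented JSON format, but treat the details as assumptions rather than WDCC-specific behavior.

```python
import urllib.request

def build_doi_request(doi: str) -> urllib.request.Request:
    """Build a content-negotiated metadata request for a DataCite DOI."""
    url = "https://doi.org/" + doi  # the resolver, not the landing page itself
    return urllib.request.Request(
        url, headers={"Accept": "application/vnd.datacite.datacite+json"}
    )

if __name__ == "__main__":
    req = build_doi_request("10.1594/WDCC/EH4_OPYC_SRES_A2")
    # Live network call; uncomment to fetch the current metadata record:
    # import json
    # with urllib.request.urlopen(req) as resp:
    #     record = json.load(resp)
```

Because the resolver redirects to wherever the object currently lives, the reference in a paper stays valid even after a repository reorganizes its URLs.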

To meet users' needs, it is essential to ensure high data quality, which means making sure that datasets in the repository are truly findable, accessible, interoperable, and reusable (FAIR). The FAIRness of the WDCC was systematically assessed in Peters-von Gehlen et al. (2022). Furthermore, to monitor the development of FAIRness in the WDCC, an F-UJI test is performed for every new dataset collection that is assigned a DOI.

Datasets are easier for users to find when the corresponding metadata are machine-readable and use a standardized vocabulary. The WDCC has implemented the schema.org standard, embedding machine-actionable metadata in JSON-LD format on the landing pages of WDCC data publications. This embedded structured metadata enhances interoperability across data catalogs and makes the data more discoverable.
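The landing-page metadata described above can be sketched as a small schema.org `Dataset` record wrapped in a JSON-LD script tag. All field values below are illustrative placeholders, not actual WDCC records; the `@context`/`@type` structure is the standard schema.org pattern that harvesters such as search engines parse from the HTML.

```python
import json

def dataset_jsonld(name: str, doi: str, description: str) -> str:
    """Render an illustrative schema.org Dataset record as embeddable JSON-LD."""
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": doi,
        "includedInDataCatalog": {
            "@type": "DataCatalog",
            "name": "World Data Center for Climate (WDCC)",
        },
    }
    # Wrapped in a script tag, this is the block embedded in the landing page:
    return '<script type="application/ld+json">\n{}\n</script>'.format(
        json.dumps(record, indent=2)
    )

if __name__ == "__main__":
    print(dataset_jsonld(
        "Example climate simulation output",
        "https://doi.org/10.1594/WDCC/EXAMPLE",
        "Monthly mean fields from an illustrative model run.",
    ))
```

A real record would carry further properties (creators, license, distribution endpoints); the point here is only the machine-actionable shape of the metadata.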

The WDCC actively participated in the AtMoDat project (https://www.atmodat.de/) and has started to publish datasets following the ATMODAT standard and carrying the EASYDAB label. The ATMODAT standard specifies requirements for rich metadata with controlled vocabularies, structured (human- and machine-readable) landing pages, and the format and structure of the data files.

 

References:

Peters-von Gehlen, K., Höck, H., Fast, A., Heydebreck, D., Lammert, A. and Thiemann, H., 2022. Recommendations for Discipline-Specific FAIRness Evaluation Derived from Applying an Ensemble of Evaluation Tools. Data Science Journal, 21(1), p.7. DOI: https://doi.org/10.5334/dsj-2022-007

How to cite: Hertwig, E., Lammert, A., Höck, H., Fast, A., and Thiemann, H.: WDCC - Improvement of FAIRness of an established repository, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-5047, https://doi.org/10.5194/egusphere-egu23-5047, 2023.

09:05–09:07 | PICO2.12 | EGU23-2941 | ESSI3.3 | On-site presentation
Rahul Ramachandran, Ge Peng, Shelby Bagwell, Abdelhak Marouane, Sumant Jha, and Jerika Christman

Science has entered the era of Big Data, bringing new challenges for data governance, stewardship, and management. Existing data governance practices have not kept pace with what proper data management now requires: governance policies and stewardship best practices tend to be disconnected from operational data management and its enforcement, and exist mainly in well-meaning documents or reports. These policies are, at best, partially implemented and rarely monitored or audited. In addition, they keep adding data management steps that require a human, 'a data steward', in the loop, so the cost of data management can no longer scale proportionately with current and future data volume and complexity.

 

The goal of developing an updated data governance framework is to modernize scientific data governance for the reality of Big Data and to align it with current technology trends such as cloud computing and AI. The framework pursues two aims: first, thoroughness, ensuring that governance adequately covers the entire data life cycle; second, practicality, offering a consistent and repeatable process across different projects. Three core principles ground this framework. First, apply just enough governance and prevent data governance from becoming a roadblock to the scientific process, removing any unnecessary processes and steps. Second, automate data management steps where possible, actively eliminating 'human in the loop' steps so that management remains efficient and scales with increasing data. Third, continually optimize all processes using quantified metrics to streamline the monitoring and auditing workflows.

 

How to cite: Ramachandran, R., Peng, G., Bagwell, S., Marouane, A., Jha, S., and Christman, J.: Modern Scientific Data Governance Framework, EGU General Assembly 2023, Vienna, Austria, 23–28 Apr 2023, EGU23-2941, https://doi.org/10.5194/egusphere-egu23-2941, 2023.

09:07–10:15