ESSI3.1

Best Practices and Realities of Research Data Repositories: Balancing the needs of Repositories, Researchers and Publishers

Convener: Kirsten Elger | Co-conveners: Lesley Wyborn, Kristin Vanderbilt, Amber Budden, Alice Fremand (ECS)

Presentations: Fri, 27 May, 08:30–11:05 (CEST) | Room 0.31/32

Chairpersons: Kirsten Elger, Alice Fremand, Lesley Wyborn

08:30–08:33

Introduction

08:33–08:43

EGU22-12105

solicited

On-site presentation

User Identification and Authentication for Geophysical Data Centers: Exploring a Difficult Transition

Florian Haslinger, Jerry Carter, Helle Pedersen, Jonathan Schaeffer, Robert Casey, Javier Quinteros, and Angelo Strollo

Many geophysical data centers are being asked by their sponsors and funding agencies to provide information on what data and services are used by whom and for what purpose in greater detail than customary in the past, when bulk information about the number of users/accesses and volumes of download were deemed sufficient in most cases. Up to now, data centers generally offer anonymous access to large parts of their holdings, with different approaches to basic monitoring and access logging, e.g. by IP address, as a rough proxy, that allows one to infer geographical user distribution to some detail.
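The kind of coarse, anonymous usage monitoring described above can be sketched as follows. This is a minimal illustration and not any data center's actual practice: the prefix lengths (/24 and /48), the function names, and the sample addresses (taken from the documentation address ranges) are all assumptions. Mapping a network to a country or region would additionally need a geolocation database, which is not shown, and the sketch does not address whether truncation satisfies data protection legislation.

```python
import ipaddress
from collections import Counter


def anonymise(ip: str) -> str:
    """Reduce an address to its enclosing network (assumed /24 for IPv4, /48 for IPv6)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    # strict=False lets us pass a host address and get the enclosing network back
    return str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False))


def count_by_network(ips):
    """Count accesses per anonymised network instead of per individual address."""
    return Counter(anonymise(ip) for ip in ips)


# Hypothetical access log entries (documentation-only address ranges)
sample_log = ["192.0.2.17", "192.0.2.200", "198.51.100.4", "2001:db8:1:2::1"]
usage = count_by_network(sample_log)
for network, hits in usage.items():
    print(network, hits)
```

Counting per network rather than per address keeps the bulk statistics while discarding the part of the address that identifies a single host.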

Already today, access to embargoed or otherwise restricted data, or to advanced functions like personal work spaces and computational resources, is usually protected by user authentication and authorisation. Standardization of the identity management protocols is a requirement for further supporting the federation of data centers and their services, also in light of future integration with cloud services or other integrated services. For example in seismology, federated data retrieval systems follow a specific credential process based on standards for data exchange and web services established and maintained by the International Federation of Digital Seismograph Networks (FDSN).

These new information requirements from funding agencies would, however, require implementing identity management systems and some sort of user identification / authentication to many or all data center services and resources. This is raising concerns within the data centers on a number of aspects: Evidence from other domains demonstrates that requiring authentication reduces the use of data center services; enforcing authentication is often perceived as being not in line with best practices for open science; implementing identity management for usage profiling may lead to significantly increased effort at the data centers, especially with regard to compliance with data protection legislation like GDPR, and it may significantly impede automated (scripted) machine-to-machine access; the level of detail that should be reported back to funding agencies is unclear and there are doubts whether detailed user profiling is a reasonable ‘performance indicator’. Indeed, such knowledge gathering on users needs to be obtained through technical implementations that take into account the impact on user experience, the impact on decades of research tool development, and the resources necessary to implement and operate such systems, whether embedded into the operational services or taking other forms such as surveys and outreach to user groups.

Relevant discussions have now started among representatives of major geophysical data centers so that interim plans can be shared, ideas and experiences exchanged, and standard approaches developed and recommended for consideration by the community. In these discussions we consider both scenarios where identity management is useful and relevant and scenarios where we may consolidate our views and arguments with respect to the general user data reporting requests.

How to cite: Haslinger, F., Carter, J., Pedersen, H., Schaeffer, J., Casey, R., Quinteros, J., and Strollo, A.: User Identification and Authentication for Geophysical Data Centers: Exploring a Difficult Transition, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-12105, https://doi.org/10.5194/egusphere-egu22-12105, 2022.

08:43–08:50

EGU22-10800

Presentation form not yet defined

The Culture of Open Science: Building an Ethos that Feels like Home across the Earth, Space and Environmental Sciences -- One Step at a Time

Shelley Stall and Chris Erdmann

08:50–08:57

EGU22-10546

Virtual presentation

Accelerating Open and FAIR Data Practices Across the Earth, Space, and Environmental Sciences

Christopher Erdmann and Shelley Stall

08:57–09:04

EGU22-13354

ECS

On-site presentation

FID GEO: Promoting Open Science in the Geosciences

Melanie Lorenz, Kirsten Elger, Inke Achterberg, Marcel Meistring, Norbert Pfurr, and Malte Semmler

The change towards Open Science practices is increasingly demanded by science policy and affects the publication culture as well as the information infrastructures. This includes the transition to Open Access for journals and publishers (including the development of new business models), as well as the growing need to make data, scientific software and samples underlying scientific results available for the general public.

The DFG-funded Specialised Information Service for Geoscience FID GEO (Fachinformationsdienst Geowissenschaften) aims at (1) reducing structural deficits in the area of electronic information and (2) promoting Open Science throughout the research life cycle. The service is hosted at the Göttingen State and University Library in Lower Saxony (SUB Göttingen) and the GFZ German Research Centre for Geosciences in Potsdam. The FID GEO team is made up of highly connected librarians, data publishing professionals, and geoscientists. Over the past five years, FID GEO has become a key player for the promotion of Open Science in the geosciences and occupies a central position for connecting researchers, data repositories, information infrastructures, German geosciences societies and publishers. FID GEO actively offers data and text publication services via its associated repositories GFZ Data Services and GEO-LEO e-docs, as well as digitisation on demand of print-only geoscience literature and maps.

FID GEO aims to inform the geoscientific community about all aspects of Open Science and is available for questions and support, e.g., during data publications, the transition to an Open Access model for journals of geosciences societies, or getting a DOI for an article in the Green Open Access model. An online questionnaire in 2021 revealed a high demand for information, particularly on topics such as licenses, persistent identifiers (ORCID, ROR, IGSN) and measures to ensure data quality and integrity in order to enable high-quality, citable data publications. In the first funding phases of the project, workshops and talks have proven to be very successful tools to meet the large need for discussion, as they allow questions or uncertainties regarding practical aspects to be addressed directly. Information events are prepared specifically for the individual target groups: researchers, German geosciences societies and members of infrastructural support units, like libraries (e.g., while societies are more interested in the development of guidelines, librarians have specific interest in licensing and copyright issues). To intensify the open information culture in the geosciences, FID GEO collaborates with strategic (inter)national initiatives, such as NFDI4Earth, COPDESS and OneGeochemistry.

How to cite: Lorenz, M., Elger, K., Achterberg, I., Meistring, M., Pfurr, N., and Semmler, M.: FID GEO: Promoting Open Science in the Geosciences, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13354, https://doi.org/10.5194/egusphere-egu22-13354, 2022.

09:04–09:11

EGU22-7291

ECS

Presentation form not yet defined

British Antarctic Survey’s Aerogeophysics Data: Releasing 25 Years of Gravity, Magnetics, and Radar Datasets over Antarctica

Alice Fremand, Julien Bodart, Tom Jordan, Fausto Ferraccioli, Carl Robinson, Hugh Corr, Helen Peat, Robert Bingham, and David Vaughan

Over the past 50 years, the British Antarctic Survey (BAS) has been one of the major acquirers of airborne geophysical data over Antarctica, providing scientists with gravity, magnetics and radar datasets that have been central to many studies of the past, present, and future evolution of the Antarctic Ice Sheet. Until recently, many of these datasets had not been published in full, restricting further use of the data for glaciological and geophysical applications. Starting in 2020, scientists and data managers at the British Antarctic Survey have worked on standardising and releasing large swaths of aerogeophysical data acquired during the period 1994-2020, including a total of 64 datasets from 24 different surveys, amounting to ~450,000 line-km (or 5.3 million km²) of data across West Antarctica, East Antarctica, and the Antarctic Peninsula. These include the extensive surveys over the fast-changing Pine Island (2004-05) and Thwaites (2018-20) glacier catchments. Considerable effort has been made to standardise these datasets to comply with the FAIR (Findable, Accessible, Interoperable and Re-Usable) data principles, as well as to create a new Polar Airborne Geophysics Data Portal (https://www.bas.ac.uk/project/nagdp/), which serves as a user-friendly interface to interact with and download the newly published data. Here, we review how these datasets were acquired and processed, and present the methods used to standardise them. We then discuss the new data portal infrastructure and interactive tutorials that were created to improve the accessibility of the data. We believe that these newly released data will be a valuable asset to future geophysical and glaciological studies over Antarctica and will significantly extend the life cycle of the data. All datasets included in this data release are now fully accessible at the UK Polar Data Centre, which is certified by CoreTrustSeal: https://data.bas.ac.uk.

How to cite: Fremand, A., Bodart, J., Jordan, T., Ferraccioli, F., Robinson, C., Corr, H., Peat, H., Bingham, R., and Vaughan, D.: British Antarctic Survey’s Aerogeophysics Data: Releasing 25 Years of Gravity, Magnetics, and Radar Datasets over Antarctica, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7291, https://doi.org/10.5194/egusphere-egu22-7291, 2022.

09:11–09:18

EGU22-12563

Virtual presentation

Tools and incentives for the curation of geosciences data. Experiences from GFZ Data Services

Florian Ott, Kirsten Elger, and Simone Frenzel

GFZ Data Services, hosted at the GFZ German Research Centre for Geosciences (GFZ), is a domain repository for geosciences data that has assigned digital object identifiers (DOIs) to data and scientific software since 2004. It is also an Allocating Agent for IGSN, the globally unique persistent identifier for physical samples, and has provided IGSN minting services since 2012. The repository provides DOI minting services for several global monitoring networks/observatories in geodesy and geophysics (e.g. INTERMAGNET; IAG Services ICGEM, IGETS, IGS; GEOFON) and collaborative projects (TERENO, EnMAP, GRACE, CHAMP), as well as the curation of long-tail data by domain specialists.

To increase the interoperability of long-tail data, GFZ Data Services (1) provides comprehensive domain-specific data description via standardised and machine-readable metadata with controlled domain vocabularies, (2) complements the metadata with comprehensive and standardised technical data descriptions or reports, and (3) embeds the research data in a wider context by providing cross-references through Persistent Identifiers (DOI, IGSN, ORCID, Fundref) to related research products (text, data, software) and to the people or institutions involved.
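As an illustration of such cross-references, the sketch below shows a small metadata record and a helper that lists every persistent identifier it carries. The field names are modelled on the DataCite JSON representation (`creators`, `nameIdentifiers`, `relatedIdentifiers`, `fundingReferences`); the record itself, all identifier values except ORCID's published example iD, and the helper `collect_pids` are assumptions made for illustration and do not reproduce GFZ Data Services metadata.

```python
# Placeholder record: the DOIs, IGSN and funder ID below are illustrative only.
record = {
    "doi": "10.1234/example.dataset",
    "creators": [
        {
            "name": "Carberry, Josiah",
            "nameIdentifiers": [
                {
                    # ORCID's published example iD, used here as a stand-in
                    "nameIdentifier": "https://orcid.org/0000-0002-1825-0097",
                    "nameIdentifierScheme": "ORCID",
                }
            ],
        }
    ],
    "relatedIdentifiers": [
        {"relatedIdentifier": "10.1234/example.article",
         "relatedIdentifierType": "DOI", "relationType": "IsSupplementTo"},
        {"relatedIdentifier": "IGSN-PLACEHOLDER",
         "relatedIdentifierType": "IGSN", "relationType": "References"},
    ],
    "fundingReferences": [
        {"funderName": "Example Funder",
         "funderIdentifier": "https://doi.org/10.13039/placeholder",
         "funderIdentifierType": "Crossref Funder ID"},
    ],
}


def collect_pids(rec):
    """Return (scheme, identifier) pairs for every persistent identifier in the record."""
    pids = [("DOI", rec["doi"])]
    for creator in rec.get("creators", []):
        for ni in creator.get("nameIdentifiers", []):
            pids.append((ni["nameIdentifierScheme"], ni["nameIdentifier"]))
    for rel in rec.get("relatedIdentifiers", []):
        pids.append((rel["relatedIdentifierType"], rel["relatedIdentifier"]))
    for fund in rec.get("fundingReferences", []):
        if "funderIdentifier" in fund:
            pids.append((fund["funderIdentifierType"], fund["funderIdentifier"]))
    return pids


for scheme, identifier in collect_pids(record):
    print(f"{scheme}: {identifier}")
```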

For their data and software publication activities, GFZ Data Services provides an XML metadata editor that assists scientists in creating metadata in different international metadata schemas (ISO19115, DataCite), while remaining usable by and understandable for the scientists (Ulbricht et al., 2017, 2020). With the launch of the new GFZ Data Services website in 2022, user guidance has increased significantly and the website has developed from a searchable data portal (only) into an information point for data publications and data management. This includes information on metadata, data formats, the data publication workflow, FAQ, links to different versions of our metadata editor and downloadable data description templates.

How to cite: Ott, F., Elger, K., and Frenzel, S.: Tools and incentives for the curation of geosciences data. Experiences from GFZ Data Services, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-12563, https://doi.org/10.5194/egusphere-egu22-12563, 2022.

09:18–09:25

EGU22-8427

ECS

Presentation form not yet defined

How to publish your data with the EPOS Multi-scale Laboratories data publication chain

Geertje ter Maat and Richard Wessels and the EPOS TCS Multi-scale Laboratories Team

The Multi-scale Laboratories (MSL) are a network of European laboratories bringing together the scientific fields of analogue modeling, paleomagnetism, rock and melt physics, geochemistry and microscopy. MSL is one of nine Thematic Core Services (TCS) of the European Plate Observing System (EPOS) (https://www.epos-eu.org/). The overarching goal of EPOS is to establish a comprehensive multidisciplinary research platform for the Earth sciences in Europe. It aims at facilitating the integrated use of data, models, and facilities, from both existing and new distributed pan-European Research Infrastructures, allowing open access and transparent (re-)use of data.

Laboratory facilities are an integral part of Earth science research. The diversity of methods employed in such infrastructures reflects the multi-scale nature of the Earth system and is essential for understanding its evolution, assessing geo-hazards, and sustainably exploiting geo-resources.

Experimental data from these laboratories provide the backbone for many scientific publications, but are often available only on request from the author, as supplementary information to research articles or in a non-digital form (printed tables, figures), limiting data re-use, re-interpretation and availability. Moreover, the raw data remains often unpublished, inaccessible, and unpreserved for the long term.

The TCS MSL is committed to making Earth science laboratory data Findable, Accessible, Interoperable, and Reusable (FAIR). For this purpose, the TCS MSL encourages the community to share their data via DOI-referenced, citable data publications. To facilitate this and ensure the provision of rich metadata, we offer user-friendly tools, plus the necessary data management expertise, to support all aspects of data publishing for the benefit of individual lab researchers via partner repositories. Data published via TCS MSL are described with the use of sustainable metadata standards enriched with controlled vocabularies used in geosciences. The resulting data publications are also exposed through a designated TCS MSL online portal that brings together DOI-referenced data publications from partner research data repositories (https://epos-msl.uu.nl/). As such, successful efforts have already been made to interconnect new data (metadata exchange) with existing databases such as MagIC (paleomagnetic data in Earthref.org) and, in the future, we expect to broaden and improve this practice with other repositories.

How to cite: ter Maat, G. and Wessels, R. and the EPOS TCS Multi-scale Laboratories Team: How to publish your data with the EPOS Multi-scale Laboratories data publication chain, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-8427, https://doi.org/10.5194/egusphere-egu22-8427, 2022.

09:25–09:32

EGU22-10321

On-site presentation

Findability of laboratory data in the solid Earth sciences: a portal for cross-disciplinary metadata

Laurens Samshuijzen, Otto Lange, Ronald Pijnenburg, Kirsten Elger, Richard Wessels, Geertje ter Maat, Simone Frenzel, and Martyn Drury

The Thematic Core Service Multi-scale Laboratories (TCS MSL) is a community within the European Plate Observing System (EPOS) that includes a wide range of world-class laboratory infrastructures and that provides a cross-disciplinary, though coherent platform for virtual access to data and physical access to solid Earth science labs. The data produced at the participating laboratories are crucial to serving society’s need for geo-resources exploration and for protection against geo-hazards. To model resource formation and system behaviour during exploitation, researchers need an understanding from the molecular to the continental scale, based on experimental and analytical data.

Data coming from the MSL laboratories provide the backbone for scientific publications, but they are often available only as supplementary information to research articles. Moreover, the vast majority of the collected data remain unpublished, inaccessible, and often not sustainably preserved for the long term. To allow reuse of these valuable but often neglected data, the TCS MSL developed a full publication chain to support their FAIR dissemination and sustainable accessibility. This chain consists of a community-driven metadata standard that allows multiple discipline-specific detailed descriptions, a publication tool (metadata editor), and an online community portal that gives access to DOI-referenced data publications at multiple research data repositories related to the TCS MSL context (https://epos-msl.uu.nl/). The portal is built on the CKAN repository toolkit and is driven by the richness of the TCS MSL metadata standard. Besides its importance for the TCS MSL community, it also provides a showcase of how to set up the CKAN environment as a cross-disciplinary catalogue for FAIR metadata exchange through a cascade of infrastructures.
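For readers unfamiliar with CKAN, the sketch below shows how a CKAN catalogue is typically queried through its Action API (`/api/3/action/package_search`). It only builds the request URL and parses a response body, so it runs without network access. The base URL `https://ckan.example.org` and the sample response are placeholders; whether the TCS MSL portal exposes this endpoint publicly is an assumption that is not confirmed by the abstract.

```python
import json
from urllib.parse import urlencode


def package_search_url(base_url, query, rows=10):
    """Build a CKAN Action API search URL for a free-text query."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def titles_from_response(body):
    """Extract dataset titles from the JSON body of a package_search response."""
    payload = json.loads(body)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [pkg.get("title", "") for pkg in payload["result"]["results"]]


# Placeholder host and a hand-written response in the CKAN envelope format
url = package_search_url("https://ckan.example.org/", "rock physics", rows=5)
sample_body = json.dumps({
    "success": True,
    "result": {"count": 1, "results": [{"title": "Example friction experiment data"}]},
})
print(url)
print(titles_from_response(sample_body))
```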

How to cite: Samshuijzen, L., Lange, O., Pijnenburg, R., Elger, K., Wessels, R., ter Maat, G., Frenzel, S., and Drury, M.: Findability of laboratory data in the solid Earth sciences: a portal for cross-disciplinary metadata, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10321, https://doi.org/10.5194/egusphere-egu22-10321, 2022.

09:32–09:39

EGU22-12874

ECS

Presentation form not yet defined

The AuScope Geochemistry Network: Facilitating Better Organisation, Coordination and Ability to Share Data Produced by Australian Geochemistry Laboratories

Bryant Ware, Alexander Prent, Samual Boone, Hayden Dalton, Guillaume Florin, Yoann Greau, Fabian Kohlmann, Moritz Theile, Wayne Noble, Erin Matchan, Barry Kohn, Andrew Gleadow, and Brent McInnes

One of the greatest challenges in the global geochemistry community is to aggregate the large amounts of geochemical data generated by laboratories and make them FAIR [Findable, Accessible, Interoperable, and Reusable] and publicly available. Standardisation and data organisation have often been an individual or voluntary/uncoordinated effort and/or motivated by the likelihood of immediate/near-future publication. Along with the technical challenges of getting laboratory data into a well-structured relational database and linked to samples’ metadata, societal and cultural issues are often present around the standardisation and accessibility of data reporting (e.g. equipment manufacturer, funding body proprietary data outputs, data reduction software accessibility and requirements/“data ownership” of the users/scientists).

In response to a national expression of a need to address the challenges outlined above and for better organisation and coordination of Australian geochemistry laboratories and data, AuScope funded the AuScope Geochemistry Network (AGN) in 2019. The AGN comprises a team of researchers, data scientists, and technical staff from three universities across Australia (Curtin University, the University of Melbourne, and Macquarie University), tasked with coordinating and strategizing the best approach to:

Unite the diverse Australian geochemistry community.
Promote national capability (existing geochemical capability).
Promote investment in infrastructure (new, advanced geochemical infrastructure).
Support increased end user access to laboratory facilities.
Support professional development via online tools, training courses and workshops.
Preserve legacy data sets.

Over the last two years the AGN has worked to organise the geochemistry community and provide solutions to the integration and adoption of international best practices for data management. With the ‘end in mind’, the AGN and collaborator Lithodat have developed the AusGeochem platform, a unique research data platform that services laboratory needs, bridges the gap between sample metadata and analytical data, and strengthens the user-laboratory connection. To establish data reporting tables that fit the community’s needs, yet facilitate FAIR data practices and integrate international best practices for handling geochemistry data, the AGN led and coordinated Expert Advisory Groups composed of geochemical specialists from a number of Australian institutions. Along with the AusGeochem platform, which allows laboratories to upload, archive, disseminate and publish their datasets, the AGN has built LabFinder, a web application that helps geoscience users find and access the right laboratory and analytical technique to solve their research questions. LabFinder aims to continue to support end user access to laboratory facilities, leading to improvement in the capability and capacity of geochemistry laboratories on a national scale. In the coming two years the AGN will continue to build upon these accomplishments by expanding the AGN data partnerships through the onboarding of institutions hosting major geochemistry laboratories, further facilitating collaborations between Australian geochemistry laboratories.

How to cite: Ware, B., Prent, A., Boone, S., Dalton, H., Florin, G., Greau, Y., Kohlmann, F., Theile, M., Noble, W., Matchan, E., Kohn, B., Gleadow, A., and McInnes, B.: The AuScope Geochemistry Network: Facilitating Better Organisation, Coordination and Ability to Share Data Produced by Australian Geochemistry Laboratories, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-12874, https://doi.org/10.5194/egusphere-egu22-12874, 2022.

09:39–09:46

EGU22-5985

On-site presentation

The new emerging DOI biotope within and around the Open Source Geospatial Foundation (OSGeo)

Peter Löwe, Maris Nartišs, Jeff McKenna, and Astrid Emde

09:46–09:53

EGU22-6573

Virtual presentation

DAS Data Management Challenges and Needs

Rob Mellors and Chad Trabant and the DAS RCN Data Management Working Group

Distributed Acoustic Sensing (DAS) is a relatively recent technology that can collect seismic (and other) time series data using optical fibers as sensors. These optical fibers may be custom deployments or re-purposed telecommunication fibers. The range of applications is increasing rapidly, and recent studies include subsurface monitoring, earthquake hazard, geotechnical engineering, and ice flow. As the number of uses and studies increases, it is expected that the need for archiving of the datasets will also increase. Archiving of DAS data faces multiple challenges at present. These include the need for large amounts (hundreds of TB) of storage, associated data transport and processing, and a standardized metadata format. As part of the DAS Research Coordination Network (RCN), a DAS data management working group is constructing a metadata model for DAS data that will address these needs. The objective is to develop a common metadata standard for archival purposes and to guide data collection at experiments. The metadata requirements include: 1) accommodation of most use cases (data collection scenarios); 2) permitting of cloud-based processing; 3) allowing of pre-processing; and 4) reduction of the burden of data transport. Standard metadata principles, such as findability, accessibility, interoperability, reusability (FAIR), and machine-readability, will be adhered to. The purpose of this presentation is to inform potential users of these efforts, encourage adoption of the proposed standard, and invite community input.
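A back-of-the-envelope calculation shows how quickly a DAS experiment can reach the storage scale mentioned above. The acquisition parameters below (10,000 channels, 1 kHz sampling, 4 bytes per sample, uncompressed) are hypothetical values chosen for illustration and are not taken from the abstract.

```python
SECONDS_PER_DAY = 86_400


def das_volume_bytes(channels, sample_rate_hz, bytes_per_sample, seconds):
    """Raw, uncompressed data volume: channels x samples per second x sample size x duration."""
    return channels * sample_rate_hz * bytes_per_sample * seconds


# Hypothetical deployment: 10,000 channels sampled at 1 kHz with 4-byte samples
per_day = das_volume_bytes(10_000, 1_000, 4, SECONDS_PER_DAY)
per_month = per_day * 30

print(f"per day:     {per_day / 1e12:.3f} TB")
print(f"per 30 days: {per_month / 1e12:.2f} TB")
```

Under these assumptions a single interrogator produces about 3.5 TB per day and roughly 104 TB over 30 days, which is why transport and processing location matter as much as storage capacity.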

How to cite: Mellors, R. and Trabant, C. and the DAS RCN Data Management Working Group: DAS Data Management Challenges and Needs, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-6573, https://doi.org/10.5194/egusphere-egu22-6573, 2022.

09:53–10:00

EGU22-4996

Virtual presentation

Why we need to highlight FAIR and Open Data and how to do it with EASYDAB

Anette Ganske, Angelika Heil, Hannes Thiemann, and Andrea Lammert

The FAIR¹ data principles are important for the findability, accessibility, interoperability, and reusability of data. Therefore, many repositories make huge efforts to curate data so that they become FAIR and assign DataCite² DOIs to archived data to increase findability. Nevertheless, recent investigations (Strecker³, 2021) show that many datasets published with a DataCite DOI don’t meet all aspects of the FAIR principles, as their metadata lack information that is important for reuse and interoperability. Further examinations of data from the Earth System Sciences (ESS) reveal that automatic processability in particular is suboptimal, e.g. because of missing persistent identifiers for creators and affiliations, or missing information about geolocations or time ranges of simulations. In addition, many datasets either have no licence information or carry a non-open licence.
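The gaps listed above can be detected automatically. The following is a minimal sketch of such an audit, assuming DataCite-style field names (`creators`, `nameIdentifiers`, `geoLocations`, `dates`, `rightsList` with `rightsIdentifier`). The set of licences treated as open, the function name `audit`, and the rule that a time range is a date value containing `/` are simplifications made for illustration; this is not the method used in the cited investigations.

```python
# Illustrative, non-exhaustive set of open licence identifiers (lower-cased SPDX style)
OPEN_LICENCES = {"cc-by-4.0", "cc0-1.0"}


def audit(record):
    """Return a list of human-readable findings for one metadata record."""
    problems = []
    for creator in record.get("creators", []):
        if not creator.get("nameIdentifiers"):
            problems.append(f"creator without persistent identifier: {creator.get('name', '?')}")
    if not record.get("geoLocations"):
        problems.append("no geolocation")
    # Simplification: treat any date value written as 'start/end' as a time range
    if not any("/" in d.get("date", "") for d in record.get("dates", [])):
        problems.append("no time range")
    licences = {r.get("rightsIdentifier", "").lower() for r in record.get("rightsList", [])}
    if not licences & OPEN_LICENCES:
        problems.append("no open licence")
    return problems


# Hypothetical records: one sparse, one complete
sparse = {"creators": [{"name": "Doe, Jane"}]}
complete = {
    "creators": [{"name": "Doe, Jane",
                  "nameIdentifiers": [{"nameIdentifier": "https://orcid.org/0000-0002-1825-0097",
                                       "nameIdentifierScheme": "ORCID"}]}],
    "geoLocations": [{"geoLocationPlace": "North Sea"}],
    "dates": [{"date": "2000-01-01/2010-12-31", "dateType": "Collected"}],
    "rightsList": [{"rightsIdentifier": "CC-BY-4.0"}],
}

print(audit(sparse))
print(audit(complete))
```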

The question arises of how datasets with open licences⁴ and high-quality metadata can be highlighted so that they stand out from the crowd of published data. One solution is the newly developed branding for FAIR and open ESS data, called EASYDAB⁵ (Earth System Data Branding). It consists of a logo that earmarks landing pages of those datasets with a DataCite DOI, an open licence, open file formats⁶, rich metadata, and which were quality controlled by the responsible repository. The EASYDAB logo is protected and may only be used by repositories that agree to follow the EASYDAB Guidelines⁷. These guidelines define principles on how to achieve high metadata quality of ESS datasets by demanding specific metadata information. Domain-specific quality guidelines define the mandatory metadata for a self-explaining description of the data. One example is a quality guideline for atmospheric model data - the ATMODAT Standard⁸. It prescribes not only the metadata for the files but also for the DOI and the landing page. The atmodat data checker⁹ additionally helps data providers and repositories to check whether data files meet the requirements of the ATMODAT Standard.

The use of the EASYDAB logo is free of charge, but repositories must sign a contract with TIB – Leibniz Information Centre for Science and Technology¹⁰. TIB will control in the future that datasets with landing pages highlighted with EASYDAB indeed follow the EASYDAB Guidelines to ensure that EASYDAB remains a branding for high-quality data.

Using EASYDAB, repositories can indicate their efforts to publish data with high-quality metadata. The EASYDAB logo also indicates to data users that the dataset is quality controlled and can be easily reused.

1: https://www.go-fair.org/fair-principles/

2: https://datacite.org/

3: https://edoc.hu-berlin.de/bitstream/handle/18452/23590/BHR470_Strecker.pdf?sequence=1

4: https://opendefinition.org/licenses/

5: https://www.easydab.de

6: http://opendatahandbook.org/guide/en/appendices/file-formats/

7: https://cera-www.dkrz.de/WDCC/ui/cerasearch/entry?acronym=EASYDAB_Guideline_v1.1

8: https://doi.org/10.35095/WDCC/atmodat_standard_en_v3_0

9: https://github.com/AtMoDat/atmodat_data_checker

10: https://www.tib.eu/en/

How to cite: Ganske, A., Heil, A., Thiemann, H., and Lammert, A.: Why we need to highlight FAIR and Open Data and how to do it with EASYDAB, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-4996, https://doi.org/10.5194/egusphere-egu22-4996, 2022.

Coffee break

Chairpersons: Alice Fremand, Kirsten Elger, Lesley Wyborn

10:20–10:27

EGU22-11046

Virtual presentation

How to Prepare Atmospheric Model Data for Publication with the ATMODAT Standard

Angelika Heil, Andrea Lammert, and Anette Ganske

Atmospheric Models are a relevant element of Climate Research. Access to this atmospheric model data is not only of interest to a wide scientific community but also to public services, companies, politicians and citizens. The state-of-the-art approach to make the data available is to publish them via a data repository that adheres to the FAIR (Findable, Accessible, Interoperable, and Reusable) principles⁽¹⁾. A FAIR publication of research data implies that the data are described with rich metadata and that the data and metadata meet relevant discipline-specific standards.

A very comprehensive data standard has been developed for the climate model output within the Coupled Model Intercomparison Project (CMIP)⁽²⁾, which largely builds upon the Climate and Forecast Metadata Conventions (CF Conventions)⁽³⁾. Nevertheless, there are many areas of atmospheric modelling where data standardisation according to the CMIP standard is not possible or very difficult. To facilitate this task, the ATMODAT standard⁽⁴⁾, a quality guideline for the FAIR and open publication of atmospheric model data, was recently established.

The ATMODAT standard defines a set of requirements that aim at ensuring a high degree of reusability of published atmospheric model data. The requirements include the use of the netCDF file format⁽⁵⁾, the application of the CF conventions⁽³⁾, a data publication with a DataCite DOI⁽⁶⁾, and rich and standardised metadata for the data files, the DOI and on the landing page.

The atmodat data checker⁽⁷⁾ was developed to support data producers in checking if their file metadata comply with the ATMODAT standard.
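To give a flavour of what a file metadata check involves, the sketch below inspects a dictionary of global attributes such as one might read from a netCDF file with a netCDF library (reading the file is not shown). It checks only that the `Conventions` attribute names a CF version and that the descriptive global attributes recommended by the CF Conventions are present. It is not the atmodat data checker and does not reproduce the ATMODAT requirement list; the function name and sample attributes are assumptions.

```python
# Descriptive global attributes recommended by the CF Conventions
CF_DESCRIPTIVE_ATTRS = ("title", "institution", "source", "history", "references", "comment")


def check_global_attrs(attrs):
    """Return findings for a mapping of netCDF global attribute names to values."""
    findings = []
    if "CF-" not in str(attrs.get("Conventions", "")):
        findings.append("Conventions attribute does not name a CF version")
    for name in CF_DESCRIPTIVE_ATTRS:
        if not str(attrs.get(name, "")).strip():
            findings.append(f"missing or empty global attribute: {name}")
    return findings


# Hypothetical attributes of a model output file
example_attrs = {
    "Conventions": "CF-1.8",
    "title": "Example urban climate simulation",
    "institution": "Example Institute",
    "source": "Example model v1.0",
    "history": "2022-01-01 created",
    "references": "",
}

for finding in check_global_attrs(example_attrs):
    print(finding)
```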

We demonstrate the application of the ATMODAT standard to selected datasets from a building-resolving atmospheric model, the "grassroots" AEROCOM MIP, and weather pattern data derived from an atmospheric reanalysis. We explain the practical workflow involved to achieve an ATMODAT-compliant data publication and discuss the various challenges.

⁽¹⁾ https://doi.org/10.1038/sdata.2016.18
⁽²⁾ https://doi.org/10.5194/gmd-13-201-2020
⁽³⁾ https://cfconventions.org/
⁽⁴⁾ https://doi.org/10.35095/WDCC/atmodat_standard_en_v3_0
⁽⁵⁾ https://www.unidata.ucar.edu/software/netcdf/
⁽⁶⁾ https://datacite.org/
⁽⁷⁾ https://github.com/AtMoDat/atmodat_data_checker

How to cite: Heil, A., Lammert, A., and Ganske, A.: How to Prepare Atmospheric Model Data for Publication with the ATMODAT Standard, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11046, https://doi.org/10.5194/egusphere-egu22-11046, 2022.

10:27–10:34

EGU22-11102

Presentation form not yet defined

Joining Geo Data Across Different Providers to Ease Machine Learning Applications

Matthes Rieke, Benedikt Gräler, and Simon Jirka

Data integration and harmonization have long been tedious tasks. The increase of available data in volume and variety has further increased the need for thorough data integration. Furthermore, the application of more and more automatic algorithms stresses the need for a sensible geo data platform to avoid the ‘garbage in, garbage out’ trap and to allow for a meaningful data analysis. We reviewed different projects and learned about various needs and constraints of joint spatial research data infrastructures, from local to cloud-based deployments. Typically, these systems are not designed from scratch, and existing systems need to be integrated or interfaced. As a result, or arising from the need to support the sovereignty of distributed data centers, modern infrastructures need to be capable of supporting federated set-ups. Often these research data infrastructures are used not only to store raw data for scientists, but also to provide results (maps, derived data products, tools and applications) to the public. This goes along with the need for access delegation (e.g. OAuth). A special focus is put on the provision of the joint datasets for machine learning applications. To facilitate efficient learning and prediction, an ML processing environment needs to be aligned with the data infrastructure.
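A toy example of the harmonisation step: two providers deliver the same quantity under different field names and units, and both are mapped to one schema before being merged for downstream use. The providers, field names and values are entirely hypothetical and are not drawn from the projects reviewed in the abstract.

```python
def from_provider_a(row):
    """Hypothetical provider A: temperature in degrees Celsius under 'temp_c'."""
    return {"station": row["station_id"], "time": row["timestamp"],
            "temperature_k": row["temp_c"] + 273.15}


def from_provider_b(row):
    """Hypothetical provider B: temperature already in kelvin under 't_kelvin'."""
    return {"station": row["site"], "time": row["time"],
            "temperature_k": row["t_kelvin"]}


def merge(rows_a, rows_b):
    """Harmonise both sources and keep one record per (station, time); provider A wins ties."""
    merged = {}
    harmonised = [from_provider_a(r) for r in rows_a] + [from_provider_b(r) for r in rows_b]
    for rec in harmonised:
        merged.setdefault((rec["station"], rec["time"]), rec)
    return [merged[key] for key in sorted(merged)]


rows_a = [{"station_id": "S1", "timestamp": "2022-05-27T08:00Z", "temp_c": 20.0}]
rows_b = [
    {"site": "S1", "time": "2022-05-27T08:00Z", "t_kelvin": 293.0},  # duplicate key
    {"site": "S2", "time": "2022-05-27T08:00Z", "t_kelvin": 290.0},
]

for rec in merge(rows_a, rows_b):
    print(rec)
```

Resolving unit and naming differences once, at the infrastructure level, is what keeps such inconsistencies from reaching a machine learning pipeline unnoticed.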

We will present commonalities among these infrastructures and outline typical design patterns. A spatial data infrastructure based on open source software components that can be deployed on the cloud will be introduced. It features open standardized interfaces and services for easy adaptation and connectivity.

How to cite: Rieke, M., Gräler, B., and Jirka, S.: Joining Geo Data Across Different Providers to Ease Machine Learning Applications, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11102, https://doi.org/10.5194/egusphere-egu22-11102, 2022.

10:34–10:41

EGU22-9964

Virtual presentation

Citing large numbers of diverse datasets

James Ayliffe, Martina Stockhause, Shelley Stall, Deb Agarwal, Justin Buck, Caroline Coward, and Chris Erdmann

In the Earth and biological sciences, data are often preserved and publicly available in data repositories where the data are citable by DOIs and published under a Creative Commons CC-BY license. Researchers combine many datasets across disciplines, repositories, and regions to better understand processes, patterns, and drivers. Citing this many datasets is difficult: the large number does not fit into the references section of a paper, yet the licenses of the datasets require that credit be given to their creators.

The Data Citation Community of Practice (CoP) was formed to address such challenges in data citation and other scholarly work, in support of indexing and measuring impact. The CoP identified a container that holds the citations, together with its internal format, as a solution for large numbers of data citations; this container is referred to as a 'reliquary'. Existing dataset collection methods have been gathered and evaluated using concrete citation use cases. Requirements for the reliquary content have been identified and applied to the use cases. In this presentation, we will report on the current progress on an approach to building a reliquary.

Reliquaries are an important part of enabling cross-disciplinary analysis of large amounts of data stored in many repositories. The challenge with a reliquary will be to design a method that works across diverse repositories and domain citation practices and to enhance the indexing system to direct credit to the reliquary content and authors. The CoP is in the process of setting up a Research Data Alliance (RDA) Working Group on Complex Citations in the Earth, Space, and Environmental Sciences to broaden the discussion and to find further use cases for evaluation and interested early adopters.

How to cite: Ayliffe, J., Stockhause, M., Stall, S., Agarwal, D., Buck, J., Coward, C., and Erdmann, C.: Citing large numbers of diverse datasets, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-9964, https://doi.org/10.5194/egusphere-egu22-9964, 2022.

10:41–10:48

EGU22-10143

Virtual presentation

Formalization of experimental setups for integration of heterogeneous data

(withdrawn)

Artem Vladimirov, Taras Vasiliev, Alexander Pashkov, and Nadezda Vasilyeva

10:48–10:55

EGU22-12187

ECS

Virtual presentation

FAIR WISH - FAIR Workflows to establish IGSN for Samples in the Helmholtz Association

Linda Baldewein, Kirsten Elger, Birgit Heim, Alexander Brauser, Simone Frenzel, Ulrike Kleeberg, and Ben Norden

10:55–11:05

Discussion