In recent years, the number of Earth and environmental research data repositories has increased markedly, and so has the range of their maturities and capabilities to integrate into the ecosystem of modern scientific communication. Efforts such as the FAIR Data Principles, the CoreTrustSeal certification for trustworthy research data repositories, and the Enabling FAIR Data Commitment Statement have raised the expectations we have of research data repositories' capabilities. How do we know which repositories meet these benchmarks and future expectations? What are the challenges, and what are appropriate strategies?

This session seeks submissions from any research data repository for Earth and environmental science data. It aims to showcase the range of practices in research data repositories, data publication and the integration of data, software and samples into the scholarly publication process. The session invites repositories to discuss challenges they are facing in meeting these community best practices and expectations for maturity.

Convener: Kirsten Elger | Co-conveners: Helen Glaves, Florian Haslinger
| Attendance Thu, 07 May, 14:00–15:45 (CEST)


Chat time: Thursday, 7 May 2020, 14:00–15:45

D822 |
Alice Fremand

The UK Polar Data Centre (UK PDC, https://www.bas.ac.uk/data/uk-pdc/) is the focal point for Arctic and Antarctic environmental data management in the UK. Part of the Natural Environment Research Council (NERC) and based at the British Antarctic Survey (BAS), the UK PDC coordinates the management of polar data from UK-funded research and supports researchers in complying with national and international data legislation and policy.

Reflecting the multidisciplinary nature of polar science, the datasets handled by the data centre are extremely diverse. Geophysics datasets include bathymetry, aerogravity, aeromagnetics and airborne radar depth soundings. These data provide information about seabed topography, the Earth’s geological structure and ice thickness. The datasets are used in a wide variety of scientific research projects at BAS. For instance, the significant multibeam coverage of the Southern Ocean seabed enables BAS to be a major contributor to international projects such as the International Bathymetric Chart of the Southern Ocean (IBCSO) and Seabed 2030. It is therefore crucial for the UK PDC to develop robust procedures to manage these data.

In the last few months, the procedures to preserve, archive and distribute all these data have been revised and updated to comply with the recommendations of the Standing Committee on Antarctic Data Management (SCADM) and the requirements of CoreTrustSeal, with a view to future certification. The goal is to develop standard ways to publish FAIR (Findable, Accessible, Interoperable and Reusable) data and to set up workflows for long-term preservation of, and access to, UK PDC holdings.

How to cite: Fremand, A.: Geophysics data management at the UK Polar Data Centre, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-1422, https://doi.org/10.5194/egusphere-egu2020-1422, 2020.

D823 |
| Highlight
Ionut Iosifescu-Enescu, Gian-Kasper Plattner, Dominik Haas-Artho, David Hanimann, and Konrad Steffen

EnviDat – www.envidat.ch – is the institutional Environmental Data portal of the Swiss Federal Institute for Forest, Snow and Landscape Research WSL. Launched in 2012 as a small project to explore possible solutions for a generic WSL-wide data portal, it has since evolved into a strategic initiative at the institutional level tackling issues in the broad areas of Open Research Data and Research Data Management. EnviDat demonstrates our commitment to accessible research data in order to advance environmental science.

EnviDat actively implements the FAIR (Findability, Accessibility, Interoperability and Reusability) principles. Core EnviDat research data management services include the registration, integration and hosting of quality-controlled, publication-ready data from a wide range of terrestrial environmental systems, in order to provide unified access to WSL’s environmental monitoring and research data. The registration of research data in EnviDat results in formal publication with persistent identifiers (EnviDat’s own PIDs as well as DOIs) and the assignment of appropriate citation information.

Innovative EnviDat features that contribute to the global system of modern documentation and exchange of scientific information include: (i) a DataCRediT mechanism designed for specifying data authorship (Collection, Validation, Curation, Software, Publication, Supervision), (ii) the ability to enhance published research data with additional resources, such as model codes and software, (iii) in-depth documentation of data provenance, e.g., through a dataset description as well as related publications and datasets, (iv) unambiguous and persistent identifiers for authors (ORCIDs) and, in the medium-term, (v) a decentralized “peer-review” data publication process for safeguarding the quality of available datasets in EnviDat.

More recently, EnviDat development has been moving beyond the set of core features expected from a research data management portal with a built-in publishing repository. This evolution is driven by researchers’ diverse requirements for a specialized environmental data portal that formally cuts across the five WSL research themes (forest, landscape, biodiversity, natural hazards, and snow and ice) and concerns all research units and central IT services.

Examples of such recent requirements for EnviDat include: (i) immediate access to data collected by automatic measurements stations, (ii) metadata and data visualization on charts and maps, with geoservices for large geodatasets, and (iii) progress towards linked open data (LOD) with curated vocabularies and semantics for the environmental domain.

There are many challenges associated with the developments mentioned above. However, they also represent opportunities for further improving the exchange of scientific information in the environmental domain. Geospatial technologies in particular have the potential to become a central element of any specialized environmental data portal, triggering a convergence between publishing repositories and geoportals. Ultimately, these new requirements demonstrate the raised expectations that institutions and researchers have towards the future capabilities of research data portals and repositories in the environmental domain. With EnviDat, we are ready to take up these challenges over the years to come.

How to cite: Iosifescu-Enescu, I., Plattner, G.-K., Haas-Artho, D., Hanimann, D., and Steffen, K.: Towards a Specialized Environmental Data Portal: Challenges and Opportunities, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13237, https://doi.org/10.5194/egusphere-egu2020-13237, 2020.

D824 |
| Highlight
Nikolai Svoboda, Xenia Specka, Carsten Hoffmann, and Uwe Heinrich

The German research initiative BonaRes (“Soil as a sustainable resource for the bioeconomy”, financed by the Federal Ministry of Education and Research, BMBF) was launched in 2015 with a duration of 9 years, with continuation envisaged. BonaRes includes 10 collaborative soil research projects and, additionally, the BonaRes Centre.

Within the BonaRes Data Centre (an important infrastructure in the planned NFDI4Agri), diverse research data, mostly with an agricultural and soil science background, are collected from the BonaRes collaborative projects and external scientists. After a possible embargo expires, all data are made available in a standardized form for free reuse via the BonaRes Repository. With the administrative and technical infrastructure established, the Data Centre provides services for scientists in all aspects of data management. The focus is on the publication of research data (e.g. long-term experiments, field trials, model results) to ensure availability and citability and thus foster scientific reuse. Available data can be accessed via the BonaRes Repository, for instance: https://doi.org/10.20387/BonaRes-BSVY-R418.

Due to the high diversity of agricultural data provided via our repository, we have developed individually tailored strategies to make them citable for (1) finalized data, (2) regularly updated data, and (3) data collections with related tables. The challenge is that authors' rights (CC-BY license) must be preserved while ensuring a user-friendly citation of even large amounts of data. We will present our BonaRes DOI concept by means of use cases and look forward to discussing it with the professional community.

How to cite: Svoboda, N., Specka, X., Hoffmann, C., and Heinrich, U.: Towards publishing soil and agricultural research data: the BonaRes DOI, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13531, https://doi.org/10.5194/egusphere-egu2020-13531, 2020.

D825 |
| Highlight
Kyle Copas

GBIF—the Global Biodiversity Information Facility—and its network of more than 1,500 institutions maintain the world's largest index of biodiversity data (https://www.gbif.org), containing nearly 1.4 billion species occurrence records. This infrastructure offers a model of best practices, both technological and cultural, that other domains may wish to adapt or emulate to ensure that their users have free, FAIR and open access to data.

The availability of community-supported data and metadata standards in the biodiversity informatics community, combined with the adoption (in 2014) of open Creative Commons licensing for data shared with GBIF, established the necessary preconditions for the network's recent growth.

GBIF's development of a data citation system based on the use of DOIs (Digital Object Identifiers) has established an approach for using unique identifiers to create direct links between scientific research and the underlying data on which it depends. The resulting state-of-the-art system tracks uses and reuses of data in research and credits data citations back to individual datasets and publishers, helping to ensure the transparency of biodiversity-related scientific analyses.

In 2015, GBIF began issuing a unique Digital Object Identifier (DOI) for every data download. This system resolves each download to a landing page containing 1) the taxonomic, geographic, temporal and other search parameters used to generate the download; 2) a quantitative map of the underlying datasets that contributed to the download; and 3) a simple citation to be included in works that rely on the data.
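
The information bundled behind each download DOI can be pictured as a simple record. The sketch below is purely illustrative: the field names, DOIs, and values are our own invention for clarity, not GBIF's actual API or landing-page format.

```python
import json

# Illustrative sketch of the information behind a GBIF download DOI.
# All field names and values here are hypothetical; GBIF's real landing
# pages are generated server-side and are not exposed in this form.
download_record = {
    "doi": "10.15468/dl.example",           # one DOI per download (made up)
    "searchParameters": {                   # 1) the query that produced it
        "taxonKey": 212,
        "country": "CH",
        "year": "2000,2019",
    },
    "contributingDatasets": [               # 2) quantitative dataset map
        {"datasetDoi": "10.15468/aaaaaa", "recordCount": 12345},
        {"datasetDoi": "10.15468/bbbbbb", "recordCount": 678},
    ],
    # 3) a ready-made citation for works that rely on the data
    "citation": "GBIF.org (2020) GBIF Occurrence Download "
                "https://doi.org/10.15468/dl.example",
}

# The download's total record count is the sum over contributing datasets.
total = sum(d["recordCount"] for d in download_record["contributingDatasets"])
print(total)                                # 13023
print(json.dumps(download_record["searchParameters"], sort_keys=True))
```

Because each download DOI freezes the query and the list of contributing datasets, a citation of the download DOI implicitly credits every underlying dataset in proportion to its contribution.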

When authors cite these download DOIs, they in effect assert direct links between scientific papers and the underlying data. Crossref registers these links through Event Data, enabling GBIF to track citation counts automatically for each download, dataset and publisher. These counts expand to display a bibliography of all research reuses of the data. This system improves the incentives for institutions to share open data by providing quantifiable measures that demonstrate the value and impact of sharing data for others' research.

GBIF is a mature infrastructure that supports a wide pool of researchers, who publish on average two peer-reviewed journal articles relying on these data every day. That said, the citation-tracking and -crediting system has room for improvement. At present, 21% of papers using GBIF-mediated data provide DOI citations, which represents a 30% increase over 2018. Through outreach to authors and collaboration with journals, GBIF aims to continue this trend.

In addition, members of the GBIF network are seeking to extend citation credits to individuals through tools like Bloodhound Tracker (https://www.bloodhound-tracker.net), using persistent identifiers from ORCID and Wikidata. This approach provides a compelling model for the scientific and scholarly benefits of treating individual specimen data records as micro- or nanopublications: first-class research objects that advance both FAIR data and open science.

How to cite: Copas, K.: Mirror, mirror…is GBIF the FAIRest of them all?, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-17116, https://doi.org/10.5194/egusphere-egu2020-17116, 2020.

D826 |
Shelley Stall

The Enabling FAIR Data project is an international, community-driven effort in the Earth, space, and environmental sciences promoting that the data and software supporting our research be deposited in a community-accepted, trusted repository and cited in the paper. Journals will no longer accept data placed only in the supplemental information of the paper. The supplement is not an archive: it does not provide the necessary information about the data, nor is there any way to discover the data separately from the paper. Repositories provide critical infrastructure in our research ecosystem, managing and preserving data and software for future researchers to discover and use.

As signatories of the Enabling FAIR Data Commitment Statement, repositories agree to comply with its defined tenets. Not all repositories provide the same level of services to researchers or their data holdings, and many researchers find it difficult to select the right repository and understand the process for depositing their data. Through better coordination with repositories, journals can guide researchers to the right repository for deposition. This is a significant benefit to authors, but unintended challenges result. Here we will discuss the Enabling FAIR Data project, its successes, and the continued effort necessary to make sure our data are treated as a “world heritage.”

How to cite: Stall, S.: Enabling FAIR Data - The Importance of our Scientific Repositories, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-17993, https://doi.org/10.5194/egusphere-egu2020-17993, 2020.

D827 |
Alex Prent, Brent McInnes, Andy Gleadow, Suzanne O'Reilly, Samuel Boone, Barry Kohn, Erin Matchan, and Tim Rawling

AuScope is an Australian consortium of Earth science institutes cooperating to develop national research infrastructure. AuScope received federal funding in 2019 to establish the AuScope Geochemistry Laboratory Network (AGN), with the objectives of coordinating FAIR-based open data initiatives, supporting user access to laboratory facilities, and strengthening analytical capability on a national scale.

Activities underway include an assessment of best practices for researchers to register samples using the International Geo Sample Number (IGSN) system, in combination with prescribed minima for metadata collection. Initial activities will focus on testing metadata schemas on high-value datasets such as geochronology (SHRIMP U-Pb, Curtin University), geochemistry (Hf isotopes, Macquarie University) and low-temperature thermochronology analyses (fission track/U-He, University of Melbourne). Collectively, these datasets will lead to a geochemical data repository in the form of an Isotopic Atlas eResearch Platform, available to the public via the AuScope Discovery Portal. Over time, the repository will aggregate a large volume of publicly funded geochemical data, providing a key resource for quantitatively understanding the evolution of the Earth system processes that have shaped the Australian continent and its resources.

How to cite: Prent, A., McInnes, B., Gleadow, A., O'Reilly, S., Boone, S., Kohn, B., Matchan, E., and Rawling, T.: The AuScope Geochemistry Laboratory Network, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-22432, https://doi.org/10.5194/egusphere-egu2020-22432, 2020.

D828 |
Kerstin Lehnert, Lucia Profeta, Annika Johansson, and Lulin Song

Modern scientific research requires open and efficient access to well-documented data to ensure transparency and reproducibility, and to build on existing resources to solve scientific questions of the future. Open access to the results of scientific research - publications, data, samples, code - is now broadly advocated and implemented in policies of funding agencies and publishers because it helps build trust in science, galvanizes the scientific enterprise, and accelerates the pace of discovery and creation of new knowledge. Domain specific data facilities offer specialized services for data curation that are tailored to the needs of scientists in a given domain, ensuring rich, relevant, and consistent metadata for meaningful discovery and reuse of data, as well as data formats and encodings that facilitate data access, data integration, and data analysis for disciplinary and interdisciplinary applications. Domain specific data facilities are uniquely poised to implement best practices that ensure not only the Findability and Accessibility of data under their stewardship, but also their Interoperability and Reusability, which requires detailed data type specific documentation of methods, including data acquisition and processing steps, uncertainties, and other data quality measures. 

The dilemma for domain repositories is that the rigorous implementation of such best practices requires substantial effort and expertise, which becomes a challenge when usage of the repository outgrows its resources. Rigorous implementation of best practices can also frustrate users, who are asked to revise and improve their data submissions, and may drive them to deposit their data in other, often general repositories that do not perform such rigorous review and therefore minimize the burden of data deposition.

We will report on recent experiences of EarthChem, a domain-specific data facility for the geochemical and petrological science community. EarthChem is recommended by publishers as a trusted repository for the preservation and open sharing of geochemical data. With the implementation of the FAIR Data principles at multiple journals that publish geochemical and petrological research over the past year, the number, volume, and diversity of data submitted to the EarthChem Library has grown dramatically and is challenging existing procedures and resources that do not scale to the new level of usage. Curators are challenged to meet users' expectations for immediate data publication and DOI assignment, and to process submissions that include new data types, are poorly documented, or contain code, images, and other digital content that is outside the scope of the repository. We will discuss possible solutions, including tiered data curation support, collaboration with other data repositories, and engagement with publishers and editors to enhance guidance and education of authors.



How to cite: Lehnert, K., Profeta, L., Johansson, A., and Song, L.: Best Practices: The Value and Dilemma of Domain Repositories, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-22533, https://doi.org/10.5194/egusphere-egu2020-22533, 2020.

D829 |
| Highlight
Damian Ulbricht, Kirsten Elger, Boris Radosavljevic, and Florian Ott

Following the FAIR principles, research data should be Findable, Accessible, Interoperable and Reusable. Publishing research output under these principles requires generating machine-readable metadata and using persistent identifiers for cross-linking with descriptive articles, related processing software, or the physical samples from which the data were derived. In addition, research data should be indexed with domain keywords to facilitate discovery. Software solutions are needed to help scientists generate metadata, since metadata models tend to be complex and serialisation into a dissemination format is a difficult task, especially in long-tail communities.

GFZ Data Services is a domain repository for geoscience data, hosted at the GFZ German Research Centre for Geosciences, that has been assigning DOIs to data and scientific software since 2004. The repository focuses on the curation of long-tail data but also provides DOI minting services for several global monitoring networks and observatories in geodesy and geophysics (e.g. INTERMAGNET, the IAG Services ICGEM and IGETS, GEOFON) and collaborative projects (e.g. TERENO, EnMAP, GRACE, CHAMP). Furthermore, GFZ is an allocating agent for IGSN, a globally unique persistent identifier for physical samples that enables discovery of digital sample descriptions via the internet. GFZ Data Services will also contribute to the National Research Data Infrastructure Consortium for Earth System Sciences (NFDI4Earth) in Germany.

GFZ Data Services increases the interoperability of long-tail data by (1) the provision of comprehensive domain-specific data description via standardised and machine-readable metadata complemented with controlled “linked-data” domain vocabularies; (2) complementing the metadata with technical data descriptions or reports; and (3) embedding the research data in wider context by providing cross-references through Persistent Identifiers (DOI, IGSN, ORCID, Fundref) to related research products and people or institutions involved.

A key tool for metadata generation is the GFZ Metadata Editor, which assists scientists in creating metadata in several metadata schemas popular in the Earth sciences (ISO 19115, NASA GCMD DIF, DataCite). Emphasis is placed on removing barriers: the editor is publicly available on the internet without registration, a copy of the metadata can be saved to and loaded from the local hard disk, and scientists are not asked to provide information that can be generated automatically. To improve usability, form fields are labelled in the language of the scientific community, and we offer a facility to search structured vocabulary lists. In addition, multiple geospatial references can be entered via an interactive mapping tool, which helps to minimize problems with differing conventions for providing latitudes and longitudes.

Visibility of the data is established through registration of the metadata with DataCite and dissemination of the metadata via standard protocols. The DOI landing pages embed metadata in Schema.org markup to facilitate discovery through internet search engines such as Google Dataset Search. In addition, we feed links between data and related research products into Scholix, which makes it possible to link data publications and scholarly literature even when the data are published years after the article.
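
Embedding Schema.org metadata in a DOI landing page typically means serialising a small JSON-LD document into a script element. The following is a minimal sketch in Python; all property values (dataset name, DOI, author, ORCID) are invented placeholders for illustration, not taken from an actual GFZ record.

```python
import json

# Minimal Schema.org "Dataset" description as JSON-LD. All values are
# hypothetical placeholders; a real landing page would fill them from the
# dataset's DataCite metadata.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example geomagnetic observatory dataset",
    "identifier": "https://doi.org/10.5880/example",     # made-up DOI
    "creator": [{
        "@type": "Person",
        "name": "Jane Doe",                              # fictitious author
        "@id": "https://orcid.org/0000-0000-0000-0000",  # placeholder ORCID
    }],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# The snippet a landing page would embed for harvesting by search engines
# such as Google Dataset Search:
script_tag = ('<script type="application/ld+json">'
              + json.dumps(dataset_jsonld)
              + '</script>')
print(script_tag[:60])
```

Because the JSON-LD lives inside the HTML of the landing page itself, a crawler can harvest the dataset description without any repository-specific API.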

How to cite: Ulbricht, D., Elger, K., Radosavljevic, B., and Ott, F.: Long-tail data curation in the times of the FAIR Principles and Enabling FAIR Data – challenges and best practices from GFZ Data Services, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-16466, https://doi.org/10.5194/egusphere-egu2020-16466, 2020.

D830 |
Mathieu Turlure, Marc Schaming, Alice Fremand, Marc Grunberg, and Jean Schmittbuhl

The CDGP Repository for Geothermal Data

The Data Center for Deep Geothermal Energy (CDGP – Centre de Données de Géothermie Profonde, https://cdgp.u-strasbg.fr) was launched in 2016 by the LabEx G-EAU-THERMIE PROFONDE (http://labex-geothermie.unistra.fr) to preserve, archive and distribute data acquired at geothermal sites in Alsace. Since the beginning of the project, specific procedures have been followed to meet international requirements for data management. In particular, the FAIR recommendations are used to distribute Findable, Accessible, Interoperable and Reusable data.

Data currently available on the CDGP mainly consist of seismological and hydraulic data acquired at the Soultz-sous-Forêts geothermal pilot plant. Data on the website are gathered into episodes. The 1994, 1995, 1996, and 2010 episodes from Soultz-sous-Forêts have recently been added to those already available on the CDGP (1988, 1991, 1993, 2000, 2003, 2004 and 2005). All data are described with metadata, and interoperability is promoted through the use of open or community-shared data formats (SEED, CSV, PDF, etc.). Each episode has a DOI.

To secure the Intellectual Property Rights (IPR) set by data providers, some of whom come from industry, an Authentication, Authorization and Accounting Infrastructure (AAAI) grants data access according to the distribution rules and the user's affiliation (academic, industrial, etc.).

The CDGP is also a local node of the European Plate Observing System (EPOS) Anthropogenic Hazards platform (https://tcs.ah-epos.eu). The platform provides an environment and facilities (data, services, software) for research on anthropogenic hazards, especially those related to the exploration and exploitation of geo-resources. Some episodes from Soultz-sous-Forêts are already available, and the missing ones will be added to the platform soon.

The next step for the CDGP is to complete the data from Soultz-sous-Forêts: some data are still missing and must be recovered from the industrial partners. Data from the other geothermal sites in Alsace (Rittershoffen, Illkirch, Vendenheim) then need to be collected for distribution. Finally, together with other French data centers, we are on track to apply for CoreTrustSeal certification (ANR Cedre).

The preservation of data can be very challenging and time-consuming. We have had to deal with obsolete tapes and formats, and even incomplete data. Old data are frequently poorly documented, and identifying the owner is sometimes difficult. However, the hard work of retrieving and collecting old geothermal data and making them FAIR is necessary for new analyses and for the valorization of this heritage data. The reuse of the data (e.g. Cauchie et al., 2020) demonstrates the importance of the CDGP.

How to cite: Turlure, M., Schaming, M., Fremand, A., Grunberg, M., and Schmittbuhl, J.: The CDGP Repository for Geothermal Data, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-7534, https://doi.org/10.5194/egusphere-egu2020-7534, 2020.

D831 |
Kristopher Larsen, Kim Kokkonen, Adrian Gehr, Julie Barnum, James Craft, and Chris Pankratz

Now entering its fifth year of on-orbit operations, the Magnetospheric MultiScale (MMS) mission has produced over eleven million data files, totaling nearly 180 terabytes (as of early 2020), that are available to the science team and the heliophysics community. MMS is a constellation of four identical satellites, each with twenty-five instruments across five distinct instrument teams, examining the interaction of the solar wind with Earth’s magnetic field. Each instrument team developed its data products in compliance with standards set by the mission’s long-term data repository, NASA’s Space Physics Data Facility (SPDF). The Science Data Center (SDC) at the Laboratory for Atmospheric and Space Physics at the University of Colorado is responsible for producing and distributing these data products to both the project’s science team and the global scientific community.

This paper will highlight the challenges the MMS SDC has encountered in maintaining a data repository during an extended mission, from overall data volumes that preclude providing access to every version of each data product (currently nearing one petabyte for MMS) to adjusting to changing standards and publication requirements. We will also discuss the critical need for cooperation between a mission’s science team, instrument teams, data production, and repositories to ensure the data meet the needs of the science community both today and in the future, particularly after the end of a given mission.

How to cite: Larsen, K., Kokkonen, K., Gehr, A., Barnum, J., Craft, J., and Pankratz, C.: The evolution of data and practices within a single mission Science Data Center., EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-11998, https://doi.org/10.5194/egusphere-egu2020-11998, 2020.

D832 |
Nicholas Jarboe, Rupert Minnett, Catherine Constable, Anthony Koppers, and Lisa Tauxe

MagIC (earthref.org/MagIC) is an organization dedicated to improving research capacity in the Earth and ocean sciences by maintaining an open community digital data archive for rock and paleomagnetic data, with portals that allow users to archive, search, visualize, download, and combine these versioned datasets. We are a signatory of the Coalition for Publishing Data in the Earth and Space Sciences (COPDESS) Enabling FAIR Data Commitment Statement and an approved repository for the Nature family of journals. We have been collaborating with EarthCube's GeoCodes data search portal, adding schema.org/JSON-LD headers to our dataset landing pages and suggesting extensions to schema.org where needed. Collaboration with the European Plate Observing System (EPOS) Thematic Core Service Multi-scale laboratories (TCS MSL) is ongoing, with MagIC sending the metadata of its contributions to TCS MSL via DataCite records.

Improving and updating our data repository to meet the demands of the quickly changing landscape of data archival, retrieval, and interoperability is a challenging proposition. Most journals now require data to be archived in a "FAIR" repository, but the exact specifications of FAIR are still solidifying. Some journals vet repositories and maintain their own lists of accepted repositories, while others rely on outside organizations to investigate and certify repositories. As part of the COPDESS group at the Earth Science Information Partners (ESIP), we have been, and will continue to be, part of the discussion on the necessary and desired features of acceptable data repositories.

We are actively developing our software and systems to meet the needs of our scientific community. Current issues we are confronting include: developing workflows with journals for publishing an article and its data in MagIC simultaneously; sustaining data repository funding, especially in light of the greater demands placed on repositories by data policy changes at journals; and how best to share and expose metadata about our data holdings to organizations such as EPOS, EarthCube, and Google.

How to cite: Jarboe, N., Minnett, R., Constable, C., Koppers, A., and Tauxe, L.: The Magnetics Information Consortium (MagIC) Data Repository: Successes and Continuing Challenges, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-12088, https://doi.org/10.5194/egusphere-egu2020-12088, 2020.

D833 |
Knut Behrends, Katja Heeschen, Cindy Kunkel, and Ronald Conze

The Drilling Information System (DIS) is a data entry system for field, laboratory and sampling data. The International Continental Scientific Drilling Program (ICDP) provides the system to facilitate data management of drilling projects during fieldwork and afterwards. The legacy DIS client-server application was developed in 1998 and refined over the years, with the most recent version released in 2010. However, the legacy DIS was locked in to very specific versions of the Windows and Office platforms, which are non-free and, more importantly, no longer supported by Microsoft.


We have therefore developed a new, entirely open-source and platform-independent version of the DIS called the mobile DIS, or mDIS. We introduced a basic (beta) version of the mDIS, designed for fieldwork, at EGU 2019. At EGU 2020 we present an extended version designed for core repositories.


The basic or expedition mDIS manages the datasets acquired during the fieldwork of a drilling project. These datasets comprise initial measurements of the recovered rock samples, such as core logs, special on-site sample requests, and drilling engineering data. It supports label printing, including QR codes, and the automatic assignment of unique International Geo Sample Numbers (IGSNs). The data are available online for all project scientists, both on site and off site.
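
The automatic assignment of unique sample identifiers can be pictured with a small sketch. The namespace, numbering scheme, class name, and example core name below are our own assumptions for illustration, not the actual mDIS implementation (real IGSNs are allocated through a registration agent).

```python
# Schematic sketch of automatic sample-identifier assignment. The "ICDP"
# namespace prefix and six-digit counter are assumptions, not mDIS internals.
class SampleRegistry:
    """Hands out unique, IGSN-style identifiers within one namespace."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._counter = 0
        self._samples = {}

    def register(self, core, section):
        """Assign the next identifier and record the sample's metadata."""
        self._counter += 1
        igsn = "%s%06d" % (self.namespace, self._counter)
        self._samples[igsn] = {"core": core, "section": section}
        return igsn


registry = SampleRegistry("ICDP")            # hypothetical namespace
first = registry.register("5068_1_A", 1)     # made-up hole name, section 1
second = registry.register("5068_1_A", 2)
print(first, second)                         # two distinct identifiers
```

Centralising the counter in one registry is what guarantees that every printed label, QR code, and database record refers to exactly one physical sample.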


The curation mDIS, however, satisfies additional requirements of core repositories, which store drill cores for the long term. Additional challenges for the mDIS that occur during long-term sample curation include: (a) the import of large datasets from the expedition mDIS, (b) complex inventory management requirements for physical storage locations, such as shelves, racks, or even buildings, used by the repositories, (c) mass printing of custom labels and custom reports, (d) managing researchers' sample requests, sample curation and sample distribution, (e) providing access to science data according to FAIR principles.

How to cite: Behrends, K., Heeschen, K., Kunkel, C., and Conze, R.: The mobile Drilling Information System (mDIS) for core repositories, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-13663, https://doi.org/10.5194/egusphere-egu2020-13663, 2020.

D834 |
Chiara Sauli, Paolo Diviacco, Alessandro Busato, Alan Cooper, Frank O. Nitsche, Mihai Burca, and Nikolas Potleca

Antarctica is one of the most studied areas on the planet because of its profound effects on the Earth's climate and ocean systems. Antarctic geology keeps records of events that took place in remote times but that can shed light on climate phenomena taking place today. It is therefore of paramount importance to make all data from the area available to the widest scientific community. The remoteness, extreme weather conditions and environmental sensitivity of Antarctica make new data acquisition complicated and existing seismic data very valuable. It is, therefore, critical that existing data are findable, accessible and reusable.

The Antarctic Seismic Data Library System (SDLS) was created in 1991 under the mandates of the Antarctic Treaty System (ATS) and the auspices of the Scientific Committee on Antarctic Research (SCAR) to provide open access to Antarctic multichannel seismic-reflection (MCS) data for use in cooperative research projects. The legal framework of the ATS dictates that all institutions that collect MCS data in Antarctica must submit them to the SDLS within 4 years of collection, and the data remain in the library under SDLS guidelines until 8 years after collection. Thereafter, the data switch to unrestricted use in order to trigger and foster collaborative research within the Antarctic research community as much as possible. To this end, the SDLS developed a web portal (http://sdls.ogs.trieste.it) that implements tools allowing all data to be discovered, browsed, accessed and downloaded directly from the web, while honoring the ATS legal framework and the intellectual property of data owners. The SDLS web portal is based on the SNAP geophysical web-based data access framework developed by Istituto Nazionale di Oceanografia e di Geofisica Sperimentale - OGS, and offers all standard OGC-compliant metadata models and OGC-compliant data access services. It is possible to georeference, preview and even perform some processing on the actual data on the fly. Datasets are assigned DOIs so that they can be referenced from within research papers or other publications. We will present the SDLS web-based system in detail in the light of Open Data and FAIR principles, together with the SDLS's planned future developments.
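The ATS timeline described above (submission within 4 years of collection, restricted use until 8 years after collection, unrestricted thereafter) can be expressed as a small rule, sketched below. This is a simplified illustration of the stated timeline, not the SDLS's actual access-control logic, and the status labels are assumptions.

```python
from datetime import date

SUBMIT_WITHIN_YEARS = 4     # data must reach the SDLS within 4 years
RESTRICTED_UNTIL_YEARS = 8  # unrestricted use 8 years after collection

def access_status(collected: date, today: date) -> str:
    """Classify an MCS dataset under the ATS/SDLS timeline.

    Simplified sketch; the status labels and the exact anniversary
    arithmetic are assumptions, not SDLS policy text.
    """
    def anniversary(years: int) -> date:
        return collected.replace(year=collected.year + years)

    if today < anniversary(SUBMIT_WITHIN_YEARS):
        return "submission window"
    if today < anniversary(RESTRICTED_UNTIL_YEARS):
        return "restricted"
    return "unrestricted"

# Data collected in March 2015, checked in May 2020:
print(access_status(date(2015, 3, 1), date(2020, 5, 7)))  # restricted
```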

How to cite: Sauli, C., Diviacco, P., Busato, A., Cooper, A., Nitsche, F. O., Burca, M., and Potleca, N.: The Antarctic Seismic Data Library System (SDLS): fostering collaborative research through Open Data and FAIR principles, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-18498, https://doi.org/10.5194/egusphere-egu2020-18498, 2020.

D835 |
Karsten Peters, Michael Botzet, Veronika Gayler, Estefania Montoya Duque, Nicola Maher, Sebastian Milinski, Katharina Berger, Fabian Wachsmann, Laura Suarez-Gutierrez, Dirk Olonscheck, and Hannes Thiemann

In a collaborative effort, data management specialists at the German Climate Computing Centre (Deutsches Klimarechenzentrum, DKRZ) and researchers at the Max Planck Institute for Meteorology (MPI-M) are joining forces to achieve long-term and effective global availability of a high-volume flagship climate model dataset: the MPI-M Grand Ensemble (MPI-GE, Maher et al. 20191), which is the largest ensemble of a single state-of-the-art comprehensive climate model (MPI-ESM1.1-LR) currently available. The MPI-GE has formed the basis for a number of scientific publications over the past 4 years2. However, the wealth of data available from the MPI-GE simulations was essentially invisible to potential data users outside of DKRZ and MPI-M.

In this contribution, we showcase the strategy adopted, the experience gained, and the current status of FAIR long-term preservation of the MPI-GE dataset in the World Data Center for Climate (WDCC), hosted at DKRZ. The importance of synergistic cooperation between domain-expert data providers and knowledgeable repository staff will be highlighted.

Recognising the demand for MPI-GE data access outside of its native environment, the development of a strategy to make MPI-GE data globally available began in mid 2018. A two-stage dissemination/preservation process was decided upon.

In a first step, MPI-GE data would be published and made globally available via the Earth System Grid Federation (ESGF) infrastructure. Second, the ESGF-published data would be transferred to DKRZ's long-term and FAIR archiving service, the WDCC. Datasets preserved in the WDCC can be made accessible via the ESGF, so global access via the established system would still be ensured.

To date, the first stage of the above process is completed and data are available via the ESGF3. Data published in the ESGF have to comply with strict data standards in order to ensure efficient data retrieval and interoperability of the dataset. Standardization of the MPI-GE data required the selection of an applicable data standard (CMIP5 in this case) and an appropriate variable subset, the adaptation and application of fit-for-purpose DKRZ-supplied post-processing software and, of course, the post-processing of the data itself. All steps required dedicated communication and collaboration between DKRZ and MPI-M staff and significant time resources. Currently, some 87 TB of standardized MPI-GE data, comprising more than 55,000 records, are available for search and download from the ESGF. Each month, ESGF users download about three to four thousand records with an accumulated volume of several hundred GB.

The long-term archival of the standardized MPI-GE data using DKRZ’s WDCC-service is planned to begin within the first half of 2020. All preparatory work done so far, especially the data standardization, significantly reduces the effort and resources required for achieving FAIR MPI-GE data preservation in the WDCC.

1Maher, N. et al. (2019). J. Adv. Model Earth Sy., 11, 2050–2069. https://doi.org/10.1029/2019MS001639



How to cite: Peters, K., Botzet, M., Gayler, V., Montoya Duque, E., Maher, N., Milinski, S., Berger, K., Wachsmann, F., Suarez-Gutierrez, L., Olonscheck, D., and Thiemann, H.: Facilitating global access to a high-volume flagship climate model dataset: the MPI-M Grand Ensemble experience, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-9811, https://doi.org/10.5194/egusphere-egu2020-9811, 2020.

D836 |
Danny Brooke

For more than a decade, the Dataverse Project (dataverse.org) has provided an open-source platform used to build data repositories around the world. Core to its success is its hybrid development approach, which pairs a core team based at the Institute for Quantitative Social Science at Harvard University with an empowered, worldwide community contributing code, documentation, and other efforts towards open science. In addition to an overview of the platform and how to join the community, we’ll discuss recent and future efforts towards large data support, geospatial data integrations, sensitive data support, integrations with reproducibility tools, access to computation resources, and many other useful features for researchers, journals, and institutions. 

How to cite: Brooke, D.: Community Built Infrastructure: The Dataverse Project, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-12006, https://doi.org/10.5194/egusphere-egu2020-12006, 2020.

D837 |
Stephen Diggs and Danie Kinkade

Finding and integrating geoscience data that are fit for use can alter the scope and even the type of scientific exploration undertaken. Most difficulties in data discovery and use are due to technical incompatibilities among the various data repositories that comprise the data system for a particular scientific problem. We believe these obstacles to be unnecessary attributes of individual data centers that were created more than 20 years ago. This aspirational presentation charts a new way forward for data curators and users alike and, by employing technical advances in adjacent disciplines, promises a new era of scientific discovery enabled by re-envisioned 21st-century data repositories.

How to cite: Diggs, S. and Kinkade, D.: Re-envisioning data repositories for the 21st century, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-20826, https://doi.org/10.5194/egusphere-egu2020-20826, 2020.

D838 |
| Highlight
Lesley Wyborn

Internationally, Earth and environmental science datasets have the potential to contribute significantly to resolving major societal challenges, such as those outlined in the United Nations 2030 Sustainable Development Goals (SDGs). By 2030, we know that leading-edge computational infrastructures (repositories, supercomputers, cloud, etc.) will be exascale and will facilitate realistic resolution of research challenges at scales and resolutions that cannot be undertaken today. Hence, by 2030, the capability of Earth and environmental science researchers to make valued contributions will depend on developing a global capacity to integrate data online from multiple distributed, heterogeneous repositories. Are we on the right path to achieve this?

Today, online data repositories are a growing part of the research infrastructure ecosystem: their number and diversity have been slowly increasing over recent years to meet demands that traditional institutional or other generic repositories can no longer satisfy. Although more specialised repositories are available (e.g., those for petascale-volume datasets and domain-specific, long-tail, complex datasets), funding for these specialised repositories is rarely long term.

Through initiatives such as the Commitment Statement from the Coalition for Publishing Data in the Earth and Space Sciences, publishers are now requiring that datasets that support a publication be curated and stored in a ‘trustworthy’ repository that can provide a DOI and a landing page for that dataset and, if possible, can also provide some domain quality assurance to ensure that datasets are not only Findable and Accessible, but also Interoperable and Reusable. But the demand for suitable domain expertise to provide the “I” and the “R” far exceeds what is available. As a last resort, frustrated researchers are simply depositing the datasets that support their publications into generic repositories such as Figshare and Zenodo, which simply store the data files; domain-specific QA/QC procedures are rarely applied to the data.

These generic repositories do ensure that data are not sitting on inaccessible personal C: drives and USB drives, but the content is rarely interoperable. Interoperability can only be achieved by repositories that have the domain expertise to curate the data properly and to ensure that they meet the minimum community standards and specifications that enable online aggregation into global reference sets. In addition, most researchers deposit only the files that support a particular publication, and as these files can be highly processed and generalised, they are difficult to reuse outside the context of the specific research publication.

To achieve the ambition of Earth and environmental science datasets being reusable and interoperable and making a major contribution to the SDGs by 2030, today we need:

- More effort and coordination in the development of international community standards to enable technical, semantic and legal interoperability of datasets;
- To ensure that publicly funded research data are also available without further manipulation or conversion, to facilitate their broader reuse in scientific research, particularly as by 2030 we will also have greater computational capacity to analyse data at scales and resolutions currently not achievable.


How to cite: Wyborn, L.: Towards World-class Earth and Environmental Science Research in 2030: Will Today’s Practices in Data Repositories Get Us There?, EGU General Assembly 2020, Online, 4–8 May 2020, EGU2020-22478, https://doi.org/10.5194/egusphere-egu2020-22478, 2020.