ESSI3.2 | EDI
Making Geoanalytical Data FAIR: Managing Data from Field to Laboratory to Archive to Publication
Co-organized by GI2/GMPV1
Convener: Marthe Klöcking | Co-conveners: Alexander Prent, Lucia Profeta, Geertje ter Maat, Kirsten Elger

Presentations: Fri, 27 May, 11:05–11:48 (CEST), 13:20–14:49 (CEST) | Room 0.31/32

Chairpersons: Marthe Klöcking, Kirsten Elger
11:05–11:10
11:10–11:20 | EGU22-10726 | solicited | Presentation form not yet defined
Global Data Standards for Geochemistry: Not the ‘One Ring to Rule Them All’, but a set of ‘Olympic Rings’ that Link and Integrate across Continents
Kerstin Lehnert and Lesley Wyborn

As volumes of geoanalytical data grow, research in geochemistry, volcanology, petrology, and other disciplines working with geoanalytical data is evolving toward data-driven and computational approaches that have enormous potential to lead to new scientific discoveries. The application of advanced methods for data mining and analysis, including machine learning and artificial intelligence, as well as the generation of models for simulating natural processes, all require seamless machine-readable access to large interoperable stores of consistently structured and documented geochemical data. Standard protocols, formats, and vocabularies are also critical in order to process, manage, and publish these growing data volumes efficiently with seamless workflows that are supported by interoperable tools.

Today, easy integration of data into Analysis Ready Data stores and the successful and efficient application of new research methodologies to these data stores are hindered by the fragmentation of the international geochemical data landscape, which lacks the technical and semantic standards for interoperability; organizational structures to guide and govern these standards; and a scientific culture that supports and prioritizes a global sustainable data infrastructure. In order to harness the scientific treasures hidden in big volumes of geochemical data, the science community, geochemistry data providers, publishers, funders, and other stakeholders need to come together to develop, implement, and maintain standards and best practices for geochemical data, and commit to changing the current data culture in geochemistry. The benefits will be wide-ranging and will increase the relevance of the discipline.

Although many research data initiatives today focus on the implementation of the FAIR principles for Findable, Accessible, Interoperable, and Reusable data, most data is only human-readable, even though the original purpose of the FAIR principles was to make data machine-actionable. The development of standards today should not focus on spreadsheet templates used to format and compile project-centric databases that are hard to re-purpose: these methods are not scalable. The focus should be on global solutions where any digital data are born connected to agreed machine-readable standards so that researchers can utilize the latest AI and ML techniques.

Global standards for geochemical data should not be perceived as ‘one ring to rule them all’, but rather as a series of interoperable ‘rings’ of data which, like the Olympic rings, will integrate data from all continents and nations.



How to cite: Lehnert, K. and Wyborn, L.: Global Data Standards for Geochemistry: Not the ‘One Ring to Rule Them All’, but a set of ‘Olympic Rings’ that Link and Integrate across Continents, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-10726, https://doi.org/10.5194/egusphere-egu22-10726, 2022.

11:20–11:27 | EGU22-13330 | Virtual presentation
EARThD: an effort to make East African tephra geochemical data available and accessible
Erin DiMaggio, Sara Mana, and Cora VanHazinga

Tephra deposits are excellent chronostratigraphic markers that are prolific and widespread in portions of the East African Rift (EAR). Arguably one of the most powerful applications of tephrochronology is the establishment of regional chronological frameworks, enabling the integrated study of the timescales and interaction of the geosphere, hydrosphere, and biosphere. In order for these disparate disciplines to integrate and fully utilize the growing number of available tephra datasets, infrastructural efforts that centralize and standardize information are required. Of particular importance to these efforts is digitizing and standardizing previously published datasets to make them discoverable in alignment with current FAIR data reporting practices.  

EARThD is an NSF-funded data compilation project that has integrated and standardized geochemical and geochronological data from over 400 published scientific papers investigating tephra deposits of the East African Rift. Our team has trained 15 undergraduate students in spreadsheet data entry and management, data mining, scientific paper comprehension, and East African tephrochronology. We utilize an existing NSF-supported community-based data facility, the Interdisciplinary Earth Data Alliance (IEDA), to store, curate, and provide access to the datasets. We are currently working with IEDA to ensure that data generated from EARThD is ingested into the IEDA Petrological Database (PetDB) and ultimately EarthChem, making it broadly available. Here we demonstrate our data entry process and how a user can locate, retrieve, and utilize EARThD tephra datasets. With this effort we aim to preserve available geochemical data for posterity, fulfilling a crucial data integration role for researchers working in East Africa – especially those working at paleontological and archeological sites where tephra dating and geochemical correlations are critical. The EARThD compilation also enables data synthesis efforts required to address new science questions.

How to cite: DiMaggio, E., Mana, S., and VanHazinga, C.: EARThD: an effort to make East African tephra geochemical data available and accessible, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13330, https://doi.org/10.5194/egusphere-egu22-13330, 2022.

11:27–11:34 | EGU22-11348 | ECS | On-site presentation
GEOROC and EarthChem: Optimizing Data Services for Geochemistry through Collaboration
Marthe Klöcking, Kerstin Lehnert, Lucia Profeta, Bärbel Sarbas, Jan Brase, Sean Cao, Juan David Figueroa, Wolfram Horstmann, Peng Ji, Annika Johansson, Leander Kallas, Stefan Möller-McNett, Mariyam Mukhumova, Jens Nieschulze, Adrian Sturm, Hannah Sweets, Matthias Willbold, and Gerhard Wörner

Geochemical data are fundamental to understanding many planetary and environmental processes – yet in the absence of a community-endorsed data culture that adheres to common data standards, the geochemical data landscape is highly fragmented. The GEOROC and PetDB databases are leading, open-access resources for geochemical and isotopic rock and mineral data that have collaborated for nearly 25 years to provide researchers with access to large volumes of curated and harmonized data collections. PetDB, operated by the EarthChem data facility, is a global synthesis of published chemical, isotopic and mineralogical data for rocks, minerals and melt inclusions, with a focus on igneous and metamorphic rocks from the ocean floor, ophiolites, xenolith samples from the Earth's mantle and lower crust, and tephra. Its counterpart, GEOROC, hosts a collection of published analyses of volcanic and plutonic rocks, minerals and mantle xenoliths, predominantly derived from ocean islands and continental settings. These curated, domain-specific databases are increasingly valuable to data-driven and interdisciplinary research and form the basis of hundreds of new research articles each year across numerous Earth science disciplines.

Over the last two decades, both GEOROC and EarthChem have invested great efforts into operating data infrastructures for findable, accessible, interoperable and reusable data, while working together to develop and maintain the EarthChem Portal (ECP) as a global open data service to the geochemical, petrological, mineralogical and related communities. The ECP provides a single point of access to >30 million analytical values for >1 million samples, aggregated from independently operated databases (PetDB, NAVDAT, GEOROC, USGS, MetPetDB, DARWIN). Yet one crucial element of FAIR data is still largely missing: interoperability across different data systems, which would allow data in separately curated databases, such as GEOROC and PetDB, to be integrated into comprehensive, global geochemical datasets.

Both EarthChem and GEOROC have recently embarked on major new developments and upgrades to improve the interoperability of their data systems. The new Digital Geochemical Data Infrastructure (DIGIS) initiative for GEOROC 2.0 aims to develop a connected platform to meet future challenges of digital data-based research and provide advanced services to the community. EarthChem has been developing an API-driven architecture to align with growing demands for machine-readable, Analysis Ready Data (ARD). This presents an opportunity to make the two data infrastructures more interoperable and complementary. EarthChem and DIGIS have committed to cooperate on system architecture design, data models, data curation, methodologies, best practices and standards for geochemistry. This cooperation will include: (a) joint research projects; (b) optimized coordination and alignment of technologies, procedures and community engagement; and (c) exchange of personnel, data, technology and information. The EarthChem-DIGIS collaboration integrates with the international OneGeochemistry initiative to create a global geochemical data network that facilitates and promotes discovery and access of geochemical data through coordination and collaboration among international geochemical data providers, in close dialogue with the scientific community and with journal publishers.
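To make the idea of machine-readable, analysis-ready access concrete, the sketch below shows how a script might pull comparable records from two independently curated databases through REST-style APIs and merge them on a shared sample identifier. The endpoint URLs, parameters and field names are hypothetical illustrations, not the actual EarthChem or GEOROC interfaces.

    import requests

    def fetch_analyses(base_url, params):
        # Query a geochemical data API and return a list of analysis records.
        response = requests.get(f"{base_url}/analyses", params=params, timeout=30)
        response.raise_for_status()
        return response.json()["results"]

    # Retrieve comparable records from two independently curated databases ...
    earthchem = fetch_analyses("https://example.org/earthchem/api",
                               {"element": "Sr", "rock_type": "basalt"})
    georoc = fetch_analyses("https://example.org/georoc/api",
                            {"element": "Sr", "rock_type": "basalt"})

    # ... and integrate them into one dataset. Shared vocabularies and sample
    # identifiers are what make this merge meaningful across systems.
    combined = {record["sample_id"]: record for record in earthchem + georoc}
    print(f"{len(combined)} unique samples after integration")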

How to cite: Klöcking, M., Lehnert, K., Profeta, L., Sarbas, B., Brase, J., Cao, S., Figueroa, J. D., Horstmann, W., Ji, P., Johansson, A., Kallas, L., Möller-McNett, S., Mukhumova, M., Nieschulze, J., Sturm, A., Sweets, H., Willbold, M., and Wörner, G.: GEOROC and EarthChem: Optimizing Data Services for Geochemistry through Collaboration, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11348, https://doi.org/10.5194/egusphere-egu22-11348, 2022.

11:34–11:41 | EGU22-13457 | Virtual presentation
Closing the gap between related databases: MetBase and the Astromaterials Data System (Astromat) plan for a common future
Dominik C. Hezel and Kerstin A. Lehnert

MetBase is the world’s largest database for meteorite compositions [1], currently hosted in Germany. MetBase began more than 20 years ago as a private collector’s compilation of cosmochemical data. The database contains more than 500,000 individual data values covering, for instance, bulk and component chemical, isotopic and physical properties, and holds more than 90,000 references dating from 1492 to the present. In 2006, the high value of the database was acknowledged by the Meteoritical Society with its Service Award. MetBase has undergone substantial transitions in recent years, from a purely commercial product to a donation-based, free-of-charge database, and its technical foundation has been completely modernised.

More recently, the Astromaterials Data System (AstroMat) has been developed as a data infrastructure to store, curate, and provide access to laboratory data acquired on samples curated in NASA’s Astromaterials Collections. AstroMat is intended to host data from past, present, and future studies. AstroMat is developed and operated by a team that has long-term experience in the development and operation of data systems for geochemical, petrological, mineralogical, and geochronological laboratory data acquired on physical samples – EarthChem and PetDB.

Astromat and MetBase are two initiatives with very different histories – but a shared goal – and therefore plan a common future. As a part of this, we are currently starting a project to make MetBase data fully FAIR (findable, accessible, interoperable and reusable [2]) by implementing the recently established Astromat database schema [3], which is based on the EarthChem data model. Astromat and MetBase are also working on new solutions for long-term, centralized hosting of both databases and a data input backbone.

Both MetBase and Astromat participate in the OneGeochemistry initiative to contribute to the development of community-endorsed and community-governed standards for FAIR laboratory analytical data that will allow seamless data exchange and integration. Access to the MetBase content will be provided both through Astromat and via a front-end that is part of the recently initiated German National Research Data Infrastructure (NFDI), which covers all scientific areas [4].

References: [1] http://www.metbase.org. [2] Stall et al. (2019) Make scientific data FAIR. Nature 570(7759): 27–29. [3] https://www.astromat.org. [4] https://www.dfg.de/en/research_funding/programmes/nfdi/index.htm. [5] https://www.nfdi4earth.de.

How to cite: Hezel, D. C. and Lehnert, K. A.: Closing the gap between related databases: MetBase and the Astromaterials Data System (Astromat) plan for a common future, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13457, https://doi.org/10.5194/egusphere-egu22-13457, 2022.

11:41–11:48 | EGU22-13188 | ECS | Presentation form not yet defined
The UCLA Cosmochemistry Database
Bidong Zhang, Paul H. Warren, Alan E. Rubin, Kerstin Lehnert, Lucia Profeta, Annika Johansson, Peng Ji, Juan David Figueroa-Solazar, and Jennifer Mays

The UCLA Cosmochemistry Database was initiated as a data rescue project aiming to archive a variety of cosmochemical data acquired at the University of California, Los Angeles. The database will ensure that future studies can use and reference these data in the examination, analysis and classification of new extraterrestrial samples.

The database is developed in collaboration with the Astromaterials Data System (AstroMat) that will provide persistent access to and archiving of the database. The database is a project in progress. We will continue to make additions, updates, and improvements to the database.

The database includes elemental compositions of extraterrestrial materials (including iron meteorites, chondrites, Apollo samples, and achondrites) analyzed by John T. Wasson, Paul H. Warren and their coworkers using atomic absorption spectrometry (AAS), neutron activation analysis (NAA), and electron microprobe analysis (EMPA) at UCLA over the last six decades. The team began using instrumental NAA (INAA) to analyze iron meteorites, lunar samples, and stony meteorites in the late 1970s [1]. Some achondrites and lunar samples were analyzed by EMPA. Some of the UCLA data have been published, but most of the data were neither digitized nor stored in a single repository.

Compositional data have been compiled by the UCLA team from publications, unpublished files, and laboratory records into datasets using Astromat spreadsheet templates. These datasets are submitted to the Astromat repository. Astromat curators review the datasets for metadata completeness and correctness, register them with DataCite to obtain a DOI and make them citable, and package them for long-term archiving. To date, we have compiled data from 52 journal articles; each article has its own separate dataset. Data and metadata of these datasets are then incorporated into the Astromat Synthesis database.
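As a rough illustration of the DataCite registration step described above, the sketch below mints a DOI through the public DataCite REST API. The credentials, prefix and metadata values are placeholders, and Astromat's actual curation tooling may do this differently.

    import requests

    payload = {
        "data": {
            "type": "dois",
            "attributes": {
                "prefix": "10.12345",  # placeholder repository prefix
                "event": "publish",
                "creators": [{"name": "Doe, Jane"}],
                "titles": [{"title": "Example UCLA INAA dataset"}],
                "publisher": "Astromaterials Data System",
                "publicationYear": 2022,
                "types": {"resourceTypeGeneral": "Dataset"},
                "url": "https://www.astromat.org/",  # dataset landing page
            },
        }
    }

    response = requests.post(
        "https://api.datacite.org/dois",
        json=payload,
        auth=("REPOSITORY_ID", "PASSWORD"),  # placeholder credentials
        headers={"Content-Type": "application/vnd.api+json"},
        timeout=30,
    )
    print(response.status_code, response.json()["data"]["id"])  # the minted DOI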

The UCLA datasets are publicly accessible at the Astromat Repository, where individual datasets can be searched and downloaded. The UCLA cosmochemical data can also be accessed as part of the Astromat Synthesis database, where they are identified as a special ‘collection’. Users may search, filter, extract, and download customized datasets via the user interface of the Astromat Synthesis database (AstroSearch). Users will also be able to access the UCLA Cosmochemistry Database directly from the home page of AstroMat (https://www.astromat.org/).

We plan to include EMPA data of lunar samples and achondrites, and add scanned PDF files of laboratory notebooks and datasheet binders that are not commonly published in journals. These PDF files contain information on irradiation date, mass, elemental concentrations, and classification for each iron specimen, and John Wasson’s personal notes on meteorites. We will also add backscattered-electron (BSE) images, energy dispersive spectroscopy (EDS) images, and optical microscopy images.

The Astromat team is currently working to develop plotting tools for the interactive tables.

Acknowledgments: We thank John Wasson and his coworkers for collecting cosmochemical data over the last 60 years. Astromat acknowledges funding from NASA (grant no. 80NSSC19K1102).

References: [1] Scott E.R.D et al. (1977) Meteoritics, 12, 425–436.

How to cite: Zhang, B., Warren, P. H., Rubin, A. E., Lehnert, K., Profeta, L., Johansson, A., Ji, P., Figueroa-Solazar, J. D., and Mays, J.: The UCLA Cosmochemistry Database, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13188, https://doi.org/10.5194/egusphere-egu22-13188, 2022.

Lunch break
Chairpersons: Marthe Klöcking, Kirsten Elger
13:20–13:27 | EGU22-9960 | Virtual presentation
The TRR170-DB Data Repository: The Life Cycle of FAIR Planetary Data from Archive to Publication
Elfrun Lehmann and Harry Becker

The TRR170-DB data repository (https://planetary-data-portal.org/) is a re3data.org-referenced repository that manages new machine-readable data and resources from the collaborative research center ‘Late Accretion onto Terrestrial Planets’ (TRR 170) and from other institutions in the planetary science community. Data in the repository reflect the diverse methods and approaches applied in the planetary sciences, including astromaterials data, experimental studies, remote sensing data, images and geophysical modeling data. The TRR170-DB repository follows a data policy and practice that supports Open Science and the FAIR principles (Wilkinson et al., 2016), as promoted by the German National Research Data Infrastructure (www.nfdi.de) and various national and international funding agencies and initiatives. The TRR170-DB framework helps users align their data storage with the data life cycle of data sharing, persistent data citation, and data publishing. The permanent host of TRR170-DB is Freie Universität Berlin. This long-term preservation of, and access to, TRR170-DB’s published data ensures their reuse by researchers and the interested public.

The TRR170-DB repository is operated on the open-source data management software Dataverse (dataverse.org). A web portal provides access to the storage environment of the datasets and guides users through the process of data storage and publication. It also informs users about legal conditions and embargo periods that safeguard the data publication process, as well as about news and training events related to data management and data publication.

Users can search metadata to find specific published data collections and files without logging in to TRR170-DB. A recently integrated tool, the Data Explorer, assists users in advanced searches to browse and find published data content. Data suppliers receive data curation services, a permanent archive and a digital object identifier (DOI) to make their dataset unique and findable. We encourage TRR 170 members and other users to store replication datasets by implementing publishing workflows that link publications to data. These replication datasets are freely available, and no permission is required for reuse and verification of a study. TRR170-DB has a flexible, data-driven metadata system that uses tailored metadata blocks for specific data communities. Once a dataset has been published, its metadata and files can be exported in various open metadata standards and file formats. This ensures that all data published in the repository are generally accessible to other external databases and repositories (“interoperability”).
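As an illustration of that export mechanism, Dataverse installations expose published metadata in several open standards through their native API. The sketch below uses a placeholder DOI; the exporter names are standard Dataverse options, but the exact set offered by TRR170-DB may differ.

    import requests

    BASE = "https://planetary-data-portal.org"  # TRR170-DB web portal
    DOI = "doi:10.12345/EXAMPLE"                # placeholder persistent identifier

    # Dataverse can export the same dataset's metadata in multiple standards,
    # e.g. Dublin Core, DDI and schema.org JSON-LD.
    for exporter in ("oai_dc", "ddi", "schema.org"):
        response = requests.get(
            f"{BASE}/api/datasets/export",
            params={"exporter": exporter, "persistentId": DOI},
            timeout=30,
        )
        print(exporter, response.status_code, len(response.content), "bytes")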

We are currently expanding metadata templates to improve interoperability, findability, preservation, and reuse of geochemical data in TRR170-DB. New geochemical metadata templates will incorporate additional standardized information on samples and materials, analytical methods and additional experimental data.  Advancing metadata templates will be an ongoing process in which the international scientific community and various initiatives (OneGeochemistry, Astromaterials Data System, etc.) need to interact and discuss what is required.

How to cite: Lehmann, E. and Becker, H.: The TRR170-DB Data Repository: The Life Cycle of FAIR Planetary Data from Archive to Publication, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-9960, https://doi.org/10.5194/egusphere-egu22-9960, 2022.

13:27–13:34 | EGU22-12096 | Presentation form not yet defined
Identification and Long-lasting Citability of Dynamic Data Queries on EMSO ERIC Harmonized Data
Ivan Rodero, Andreu Fornós, Raul Bardaji, Stefano Chiappini, and Juanjo Dañobeitia

The European Multidisciplinary Seafloor and water-column Observatory (EMSO) European Research Infrastructure Consortium (ERIC) is a large-scale European Strategy Forum on Research Infrastructures (ESFRI) member with strategically placed sea observatories whose essential scientific objective is real-time, long-term monitoring of environmental processes related to the interaction between the geosphere, biosphere, and hydrosphere. EMSO ERIC collects, curates, and provides high-quality oceanographic measurements from the surface to the deep seafloor to support long-term time series and oceanographic modeling. In addition, EMSO ERIC has developed a set of data services that harmonize its regional facilities’ data workflows, enhancing efficiency and productivity, supporting innovation, and enabling data- and knowledge-based discovery and decision-making. These services are developed in connection with the ESFRI cluster of Environmental Research Infrastructures (ENVRI) through the adoption of the FAIR data principles (findability, accessibility, interoperability, and reusability) and supported by the ENVRI-FAIR H2020 project.

EMSO ERIC’s efforts in adopting the FAIR principles include the use of globally unique and resolvable persistent identifiers (PIDs) in alignment with the ENVRI-FAIR task forces. We present a service for the identification and long-lasting citability of dynamic data queries on harmonized data sets generated by EMSO ERIC users. The service follows the recommendations of the Research Data Alliance (RDA) Working Group on Data Citation and has been integrated into the EMSO ERIC data portal. User-built queries on the data portal are served by the EMSO ERIC Application Programming Interface (API), which retrieves the requested data and provides a Uniform Resource Identifier (URI) for the query, visualizations, and data sets in CSV and NetCDF formats. The data portal allows users to request a PID for the data query by providing mandatory and optional metadata through an online form. The mandatory metadata consists of a description of the data and specific information about the creators, personal or organizational, including their identifiers and affiliations. The optional metadata consists of additional titles and descriptions that the user finds relevant. The service provides a permalink to a web page maintained within the data portal with the PID reference, the metadata, and the URI of the data query. The web pages associated with PIDs also offer the option to request a Digital Object Identifier (DOI) if users are authorized via the EMSO ERIC Authorization and Authentication Infrastructure (AAI) system.
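A conceptual sketch of this flow, following the RDA dynamic-data-citation recommendation, is given below: a user-built query is executed, the service returns a stable query URI, and a PID is then requested with the mandatory metadata. All endpoint paths and field names are hypothetical, not the actual EMSO ERIC API.

    import requests

    API = "https://api.example.org/emso"  # placeholder base URL

    # 1. Run a data query; the service returns the data plus a stable query URI.
    query = {"platform": "OBSEA", "parameter": "sea_water_temperature",
             "start": "2021-01-01", "end": "2021-12-31", "format": "netcdf"}
    result = requests.get(f"{API}/data", params=query, timeout=60).json()
    query_uri = result["query_uri"]

    # 2. Request a PID for that query, supplying the mandatory metadata
    #    (a description plus creator identifiers and affiliations).
    pid_request = {
        "query_uri": query_uri,
        "description": "OBSEA temperature time series, 2021",
        "creators": [{"name": "Jane Doe", "orcid": "0000-0000-0000-0000",
                      "affiliation": "Example Institute"}],
    }
    pid = requests.post(f"{API}/pids", json=pid_request, timeout=30).json()["pid"]
    print("Citable PID:", pid)  # resolves to a landing page with metadata + URI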

How to cite: Rodero, I., Fornós, A., Bardaji, R., Chiappini, S., and Dañobeitia, J.: Identification and Long-lasting Citability of Dynamic Data Queries on EMSO ERIC Harmonized Data, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-12096, https://doi.org/10.5194/egusphere-egu22-12096, 2022.

13:34–13:41 | EGU22-13382 | On-site presentation
Science building on synthesis: From standardized palaeoclimate data to climate model evaluation
Kira Rehfeld

Efforts towards standardizing biogeochemical data from palaeoclimate archives such as speleothems, ice cores, corals, trees or marine sediments make it possible to tackle global-scale changes in palaeoclimate dynamics. These endeavours are sometimes initiated for very specific research questions. One example is the multi-archive, multi-proxy dataset used to characterize changes in temperature variability from the Last Glacial Maximum to the current interglacial [1]. Here, we focused on collecting all published proxy time series for temperature that fulfilled our sampling criteria, but included little metadata.

Another, quite prominent, example is the database that grew out of the working group on Speleothem Isotopes Synthesis and AnaLysis (SISAL) in the Past Global Changes (PAGES) network. Researchers from all over the world collaborated on its construction, producing a quality-controlled data product with rich metadata. SISAL v2 [2] contains data from 691 speleothem records published over the decades; for more than 500 of these, standardized age models were established. The community’s design and data collection made it possible to draw together the metadata and observations needed to reproduce the age modeling process of individual studies. This database serves a rich set of purposes, ranging from the evaluation of monsoon dynamics to that of isotope-enabled climate models [3].

Contrasting these two approaches, I will discuss the challenges that arise when multiple proxies, archives, modeling purposes and community standards need to be considered. I argue that careful design of standardized data products allows for a new type of geoscience work, further catalyzed by digitization, forming a basis for tackling future-relevant palaeoclimatic and palaeoenvironmental questions at the global scale.

 

[1] Rehfeld, K., et al.: Global patterns of declining temperature variability from the Last Glacial Maximum to the Holocene, Nature, 554, 356–359, https://doi.org/10.1038/nature25454, 2018.

[2] Comas-Bru, L., et al. (incl. SISAL Working Group members): SISALv2: a comprehensive speleothem isotope database with multiple age–depth models, Earth Syst. Sci. Data, 12, 2579–2606, https://doi.org/10.5194/essd-12-2579-2020, 2020.

[3] Bühler, J. C. et al: Comparison of the oxygen isotope signatures in speleothem records and iHadCM3 model simulations for the last millennium, Clim. Past, 17, 985–1004, https://doi.org/10.5194/cp-17-985-2021, 2021.

How to cite: Rehfeld, K.: Science building on synthesis: From standardized palaeoclimate data to climate model evaluation, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13382, https://doi.org/10.5194/egusphere-egu22-13382, 2022.

13:41–13:48 | EGU22-11980 | On-site presentation
From Field Application to Publication: An end-to-end Solution for FAIR Geoscience Data
Moritz Theile, Wayne Noble, Romain Beucher, Malcolm McMillan, Samuel Boone, and Fabian Kohlmann

In this abstract we introduce a suite of free applications to produce FAIR, consistent, clean and easily available geoscience data for research and industry alike.

Data creation starts with sample collection in the field and the assignment of a globally unique IGSN sample identifier. Samples are then stored, along with any subsequent analytical data, in our fine-grained and detailed geochemical data models, which allow acquired datasets to be visualised and published. This unique solution has been developed by Lithodat Pty Ltd in conjunction with the AuScope Geochemical Network (AGN) and Australian geochemical laboratories, and can be accessed by the public on the AusGeochem web platform.

Using our fully integrated field application, users can enter and store all sample details on the fly during field collection; the data are stored in the user’s private data collection. Once researchers return from the field, they can log into their account on the browser-based AusGeochem platform and view or edit all collected samples. After running subsequent geochemical analyses on a sample, the results, including all metadata, can be stored in the database and attached to the sample. Once uploaded, data can be visualised within AusGeochem using simple data analytics via technique-specific dashboards and graphs. The data can be shared with collaborators, downloaded in multiple formats and made public, enabling FAIR data for the research community.

Here we show a complete sample workflow example, from collection in the field to the final result as a thermochronology study. Fission-track and (U-Th)/He analyses of a sample, with all associated data, will be uploaded and stored in the AusGeochem platform. Once all analyses are complete, the data will be shared with collaborators and made available to the public. An important step in this process is the integrated IGSN minting option, which gives the sample a globally unique identifier, making it discoverable worldwide.
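The sketch below illustrates this field-to-repository step in miniature: a sample record is captured in the field and an IGSN is attached on registration. The data model fields and the minting helper are illustrative assumptions, not the actual AusGeochem API.

    from dataclasses import dataclass, field

    @dataclass
    class FieldSample:
        name: str
        latitude: float
        longitude: float
        lithology: str
        collector_orcid: str
        igsn: str | None = None          # assigned when the sample is registered
        analyses: list = field(default_factory=list)

    def mint_igsn(sample):
        # Stand-in for the platform's integrated IGSN minting service.
        return f"XYZ{abs(hash(sample.name)) % 10**6:06d}"  # placeholder identifier

    sample = FieldSample(name="VIC-2022-001", latitude=-37.81, longitude=144.96,
                         lithology="granite",
                         collector_orcid="0000-0000-0000-0000")  # placeholder

    sample.igsn = mint_igsn(sample)
    sample.analyses.append({"technique": "(U-Th)/He", "status": "uploaded"})
    print(sample.igsn)  # globally unique; all later data link back to this ID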

Having all data stored in a clean and curated relational database with very detailed and fine-grained data models gives researchers free access to large amounts of structured and normalised data, helping them develop new technologies using machine learning and automated data integration in numerical models. Keeping all data in one place, including metadata such as the ORCIDs of involved researchers, funding sources, grant numbers and laboratories, enables the quantification and quality assessment of research projects over time.

How to cite: Theile, M., Noble, W., Beucher, R., McMillan, M., Boone, S., and Kohlmann, F.: From Field Application to Publication: An end-to-end Solution for FAIR Geoscience Data, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11980, https://doi.org/10.5194/egusphere-egu22-11980, 2022.

13:48–13:55 | EGU22-13429 | ECS | Presentation form not yet defined
AusGeochem: an Australian AuScope Geochemistry Network data platform for laboratories and their users
Alexander M. Prent, Samuel C. Boone, Hayden Dalton, Yoann Gréau, Guillaume Florin, Fabian Kohlmann, Moritz Theile, Wayne Noble, Sally-Ann Hodgekiss, Bryant Ware, David Philips, Barry Kohn, Suzanne O’Reilly, Andrew Gleadow, Brent McInnes, and Tim Rawling

Over the last two years, the Australian AuScope Geochemistry Network (AGN) has developed AusGeochem in collaboration with geoscience data solutions company Lithodat Pty Ltd. This open, cloud-based data platform (https://ausgeochem.auscope.org.au) serves as a geo-sample registry with IGSN minting capability, a geochemical data repository and a data analysis tool. With guidance from geochemistry experts at a number of Australian institutions, and following international standards and best practices, various sample and geochemistry data models were developed that align with the FAIR data principles. AusGeochem currently accepts SIMS U-Pb, fission-track and (U-Th-Sm)/He data, with LA-ICP-MS U-Pb and Lu-Hf and 40Ar/39Ar data models under development. Special attention is paid to implementing streamlined workflows for AGN laboratories to ease the upload of data from analytical sessions. Analytical results can then be shared with users through AusGeochem and, where required, can be kept fully confidential and under embargo for specified periods of time. Once the analytical data on individual samples are finalized, they can be made more widely accessible and, where required, combined into specific datasets that support publications.

How to cite: Prent, A. M., Boone, S. C., Dalton, H., Gréau, Y., Florin, G., Kohlmann, F., Theile, M., Noble, W., Hodgekiss, S.-A., Ware, B., Philips, D., Kohn, B., O’Reilly, S., Gleadow, A., McInnes, B., and Rawling, T.: AusGeochem: an Australian AuScope Geochemistry Network data platform for laboratories and their users, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13429, https://doi.org/10.5194/egusphere-egu22-13429, 2022.

13:55–14:02 | EGU22-7228 | Virtual presentation
French feedback from urban soil geochemical data archive to data sharing: state of mind and intent
Cecile Le Guern, Jean-François Brunet, Philippe Négrel, Sandrine Lemal, Etienne Taffoureau, Sylvain Grellet, Mickael Beaufils, Clément Lattelais, Christine Le Bas, and Hélène Roussel

Urban territories collect many types of geochemical and physico-chemical data relating to, e.g., soil quality or soil functions. Such data may serve various purposes, such as verifying compatibility with current or future uses, defining (pedo)geochemical backgrounds, establishing levels of exposure to soil pollutants, identifying management options for polluted sites or for excavated soils, monitoring the evolution of infiltration ponds, or assessing carbon storage. They may also serve to prioritize soil functions and associated ecosystem services such as soil fertility, surface and groundwater storage or supply, and purification of infiltrated rainwater. Gathering such data in national databases and making them available to stakeholders raises many technical, legal and social issues. Should all of the data be made available, or only selected portions? How can legally sound access to and reuse of the data be ensured? Can statistical and geostatistical methods deal with data of heterogeneous origins, allowing their reuse for purposes other than the original one? In this context, it is necessary to take into account scientific as well as practical considerations and to collect the societal needs of end-users such as urban planners.

 

To illustrate the complexity of these issues and ways to address them, we propose to share the French experience:

  • on gathering urban soil geochemical data in the French national database BDSolU. We will present how this database was created, the choices made in relation to the national context, the difficulties encountered, and the questions that are still open;
  • on a new interrogation system linking the agricultural and urban soil databases (DoneSol and BDSolU), which have different requirements, and the corresponding standards. Such linkage, based on interoperability, is important in the context of changing soil use, with, for example, agricultural soils becoming urbanised soils, or soils from brownfields intended for gardening. It is also necessary to ensure territorial continuity for users.

The objective is to define a robust and standardised methodology for database conceptualisation, sharing and final use by stakeholders, including scientists.

How to cite: Le Guern, C., Brunet, J.-F., Négrel, P., Lemal, S., Taffoureau, E., Grellet, S., Beaufils, M., Lattelais, C., Le Bas, C., and Roussel, H.: French feedback from urban soil geochemical data archive to data sharing: state of mind and intent, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7228, https://doi.org/10.5194/egusphere-egu22-7228, 2022.

14:02–14:09 | EGU22-8262 | On-site presentation
Data Access Made Easy: flexible, on the fly data standardization and processing
Mathias Bavay, Charles Fierz, and Rodica Nitu

Automatic Weather Stations (AWS) deployed in the context of research projects provide very valuable data thanks to the flexibility they offer in terms of measured meteorological parameters, choice of sensors, and quick deployment and redeployment. However, this flexibility is a challenge in terms of metadata and data management. Traditional approaches based on networks of standard stations cannot accommodate these needs, and often no tools are available to manage these research AWS, leading to wasted data periods because of difficult data reuse, low reactivity in identifying potential measurement problems, and a lack of metadata to document what happened.

The Data Access Made Easy (DAME) effort is our answer to these challenges. At its core, it relies on the mature and flexible open-source MeteoIO meteorological pre-processing library, originally developed as a flexible data processing engine for numerical models consuming meteorological data and further developed as a data standardization engine for the Global Cryosphere Watch (GCW) of the World Meteorological Organization (WMO). For each AWS, a single configuration file describes how to read and parse the data, defines a mapping between the available fields and a set of standardized names, and provides relevant Attribute Conventions Dataset Discovery (ACDD) metadata fields, if necessary on a per-input-file basis. Low-level data editing is also available, such as excluding a given sensor, swapping sensors or merging data from another AWS for any given time period. Moreover, an arbitrary number of filters can be applied to each meteorological parameter, restricted to specific time periods if required. This makes it possible to describe the whole history of an AWS within a single configuration file and to deliver a single, consistent, standardized output file, possibly spanning many years, many input data files and many changes in both format and available sensors. Finally, all configuration files are kept in a git repository in order to document their history.
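To give a flavour of the approach, here is a schematic per-station configuration in the spirit of MeteoIO's INI format, parsed with Python's configparser. The section and key names are simplified illustrations, not the exact MeteoIO syntax.

    import configparser

    AWS_CONFIG = """
    [Input]
    format    = CSV
    meteopath = /data/aws/station42
    files     = STATION42_*.csv

    [Editing]
    ; map raw logger fields to standardized parameter names
    rename_air_temp_1 = TA
    ; a sensor was swapped on 2019-07-01: merge the backup logger before that date
    merge_before_2019-07-01 = STATION42_BACKUP

    [Filters]
    ; plausibility limits on air temperature, optionally time-restricted
    TA_filter1 = min_max -40 45

    [ACDD]
    title   = Research AWS station42 (example)
    creator = Example Institute
    """

    config = configparser.ConfigParser()
    config.read_string(AWS_CONFIG)
    print(config["Filters"]["TA_filter1"])  # -> "min_max -40 45"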

A basic email-based interface has been developed that allows users to create new configuration files, modify existing ones, or request data on demand for any time period. Every hour, the data for all available configuration files are regenerated for the last 13 months and stored on a shared drive, so everyone can access current data without even having to submit a request. A table is generated showing all warnings or errors produced during data generation, along with metadata such as the data owner’s email address, so that owners can quickly spot troublesome AWS.

How to cite: Bavay, M., Fierz, C., and Nitu, R.: Data Access Made Easy: flexible, on the fly data standardization and processing, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-8262, https://doi.org/10.5194/egusphere-egu22-8262, 2022.

14:09–14:16 | EGU22-13338 | Virtual presentation
A workflow to standardize collection and management of large-scale data and metadata from environmental observatories
Dylan O'Ryan, Charuleka Varadharajan, Erek Alper, Kristin Boye, Madison Burrus, Danielle Christianson, Shreyas Cholia, Robert Crystal-Ornelas, Joan Damerow, Wenming Dong, Hesham Elbashandy, Boris Faybishenko, Valerie Hendrix, Douglas Johnson, Zarine Kakalia, Roelof Versteeg, Kenneth Williams, Catherine Wong, and Deborah Agarwal

The Watershed Function Scientific Focus Area (WFSFA) is a U.S. Department of Energy research project that seeks to determine how mountainous watersheds retain and release water, carbon, nutrients, and metals. The WFSFA maintains a community field observatory at its primary field site in the East River, Colorado. The WFSFA collects diverse environmental data and has developed a “Field-Data” workflow that standardizes data management across the project, from field collection to laboratory analysis to publication. This workflow enables the WFSFA to address data quality and management challenges that environmental observatories face. 

Through this workflow, the WFSFA has increased the use of the data curated from the project by (1) providing detailed metadata with unique identifiers for samples, locations, and sensors, (2) streamlining the data sharing and publication process through early sharing of data internally within the team and publication of data on the ESS-DIVE repository following curation, and (3) adopting machine-readable and FAIR community data standards (Findability, Accessibility, Interoperability, Reusability). 

We describe an example application of this workflow for geochemical data, which utilizes a community data standard for water-soil-sediment chemistry (https://github.com/ess-dive-community/essdive-water-soil-sed-chem) developed by the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE). This standard covers geochemical data, metadata, and file-level metadata, and was applied to WFSFA geochemical data, including ICP-MS, isotope, ammonia-N, anion, and DIC/NPOC/TDN datasets. It ensures that important metadata is contained within the data file, such as the precision of data analysis, storage and sample processing information, detailed sample names, material information, and unique identifiers associated with the samples (IGSNs). This metadata is essential for understanding and reusing data products, and enables machine readability for future model applications. Detailed examples of the standardized geochemical data types were created and are now being used as templates by WFSFA researchers to standardize their geochemical data. The adoption of this community geochemical data standard and, more broadly, the Field-Data workflow will improve the findability and reusability of WFSFA datasets.
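The snippet below sketches what such a standardized data file can look like, with method, precision, sample handling and IGSN fields travelling alongside each measurement. The column names and values are simplified stand-ins; the authoritative terms are defined in the ESS-DIVE standard linked above.

    import csv

    rows = [
        {"Sample_Name": "ER-PLM-20210614", "IGSN": "IEXYZ0001",  # placeholder IGSN
         "Material": "water", "Analyte": "Mn", "Value": 12.4,
         "Unit": "micrograms_per_liter", "Method": "ICP-MS",
         "Method_Precision": "+/-5%", "Sample_Storage": "4C_dark"},
    ]

    with open("wfsfa_icpms_example.csv", "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

    # File-level metadata (analysis dates, contacts, a locations file, etc.)
    # travels with the data package so both humans and machines can reuse it.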

How to cite: O'Ryan, D., Varadharajan, C., Alper, E., Boye, K., Burrus, M., Christianson, D., Cholia, S., Crystal-Ornelas, R., Damerow, J., Dong, W., Elbashandy, H., Faybishenko, B., Hendrix, V., Johnson, D., Kakalia, Z., Versteeg, R., Williams, K., Wong, C., and Agarwal, D.: A workflow to standardize collection and management of large-scale data and metadata from environmental observatories, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13338, https://doi.org/10.5194/egusphere-egu22-13338, 2022.

14:16–14:23 | EGU22-6332 | On-site presentation
Implementation of a FAIR Compliant Automated Workflow for Infrastructures
Ulrich Bundke, Marcel Kennert, Christoph Mahnke, Susanne Rohs, and Andreas Petzold

The European infrastructure In-service Aircraft for a Global Observing System (IAGOS) (www.iagos.org) has implemented an automated data management workflow that organizes the data flow from the sensor to the central data portal located in Toulouse. The workflow is realized and documented using the web-based Django framework, following a model-based approach in Python.

This workflow performs all necessary data processing and QA/QC tests for the automated upload of near-real-time (NRT) processed data and serves the PI as a basis for approval decisions. This includes repeated cycles for different stages of data maturity. The PI can monitor the status of all tasks through web-based reports produced by the Task Manager. Automated reprocessing is possible because metadata on all steps, as well as the decisions of the PI, are stored. The implementation of this workflow is a major step towards making IAGOS data handling compliant with the FAIR principles (findable, accessible, interoperable, reusable).
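A schematic of the model-based approach is sketched below: each processing step is a database record whose state the Task Manager reports back to the PI. The model and field names are illustrative, not the actual IAGOS implementation.

    from django.db import models

    class Dataset(models.Model):
        flight_id = models.CharField(max_length=32)
        maturity = models.CharField(   # repeated cycles of growing data maturity
            max_length=16,
            choices=[("NRT", "Near real time"), ("PRELIM", "Preliminary"),
                     ("FINAL", "Final")],
        )

    class ProcessingTask(models.Model):
        dataset = models.ForeignKey(Dataset, on_delete=models.CASCADE)
        step = models.CharField(max_length=64)        # e.g. "QA/QC range check"
        status = models.CharField(max_length=16, default="pending")
        log = models.TextField(blank=True)            # kept to enable reprocessing
        pi_approved = models.BooleanField(null=True)  # the PI decision is stored
        finished_at = models.DateTimeField(null=True)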

The workflow is easily adaptable to the needs of other infrastructures or research institutes. We will therefore open the development under the MIT license and invite other data centers to contribute.

Acknowledgments:

This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 824068 and by the Helmholtz STSM grant “DIGITAL EARTH”.

How to cite: Bundke, U., Kennert, M., Mahnke, C., Rohs, S., and Petzold, A.: Implementation of a FAIR Compliant Automated Workflow for Infrastructures, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-6332, https://doi.org/10.5194/egusphere-egu22-6332, 2022.

14:23–14:30 | EGU22-11766 | ECS | Virtual presentation
Implementing semantic data management for bridging empirical and simulative approaches in marine biogeochemistry
Alexander Schlemmer, Julian Merder, Thorsten Dittmar, Ulrike Feudel, Bernd Blasius, Stefan Luther, Ulrich Parlitz, Jan Freund, and Sinikka T. Lennartz

CaosDB is a flexible semantic research data management system, released as open-source software. Its versatile data model and data integration toolkit allow for applications in complex and very heterogeneous scientific workflows across different scientific domains; successful implementations include biomedical physics [1] and glaciology [2]. Here, we present a recent implementation of a use case in marine biogeochemistry with a special focus on bridging experimental work and numerical ocean modelling. CaosDB is used to store, index and link data during different stages of research on the marine carbon cycle: data from experiments and field campaigns are integrated and mapped onto semantic data structures, and then linked to data from numerical ocean simulations. The ocean model, here focused on natural marine carbon sequestration of dissolved organic carbon (DOC), uses the georeferenced data to evaluate model performance. By simultaneously linking empirical data and the sampled model parameter space together with the model output, CaosDB enhances the efficiency of model development. In the current implementation, simulated data are linked to georeferenced DOC concentration data. We plan to expand this to complex data sets including thousands of dissolved organic matter molecular formulae and metagenomes of pelagic microbial communities. The combined management of these heterogeneous data structures with semantic models allows us to perform complex searches and connect seamlessly to automated data analysis pipelines.
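A brief sketch using the CaosDB Python client shows how a georeferenced measurement and a simulation run can be linked and queried semantically. The RecordTypes and property names are invented for illustration and do not reflect the project's actual data model.

    import caosdb as db

    # Register a georeferenced field measurement of DOC concentration ...
    measurement = db.Record().add_parent(name="DOCMeasurement")
    measurement.add_property(name="latitude", value=54.1)
    measurement.add_property(name="longitude", value=7.9)
    measurement.add_property(name="DOC_concentration", value=45.2)  # e.g. umol/L
    measurement.insert()

    # ... and link a model run to the observation it was evaluated against,
    # referencing the measurement by its entity id.
    run = db.Record().add_parent(name="SimulationRun")
    run.add_property(name="parameter_set", value="k_deg=0.02")
    run.add_property(name="evaluated_against", value=measurement.id)
    run.insert()

    # Semantic queries then cut across empirical and simulated data alike.
    results = db.execute_query(
        "FIND RECORD DOCMeasurement WITH DOC_concentration > 40")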


[1] Fitschen, T.; Schlemmer, A.; Hornung, D.; tom Wörden, H.; Parlitz, U.; Luther, S. CaosDB—Research Data Management for Complex, Changing, and Automated Research Workflows. Data 2019, 4, 83. https://doi.org/10.3390/data4020083
[2] Schlemmer, A.; tom Wörden, H.; Freitag, J.; Fitschen, T.; Kerch, J.; Schlomann, Y.; ... & Luther, S. Evaluation of the semantic research data management system CaosDB in glaciology. deRSE 2019. https://doi.org/10.5446/42478

How to cite: Schlemmer, A., Merder, J., Dittmar, T., Feudel, U., Blasius, B., Luther, S., Parlitz, U., Freund, J., and Lennartz, S. T.: Implementing semantic data management for bridging empirical and simulative approaches in marine biogeochemistry, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11766, https://doi.org/10.5194/egusphere-egu22-11766, 2022.

14:30–14:37 | EGU22-11103 | Presentation form not yet defined
Data amounts and reproducibility: How FAIR Digital Objects can revolutionise Research Workflows
Ivonne Anders, Karsten Peters-von Gehlen, Hannes Thiemann, Martin Bergemann, Merret Buurman, Andrej Fast, Christopher Kadow, Marco Kulüke, and Fabian Wachsmann

Some disciplines, especially those that look at the Earth system, work with large to very large amounts of data. Storing these data, but also processing them, places completely new demands on scientific work itself.

Let's take the example of climate research, and specifically climate modelling. In addition to long-term meteorological measurements from the recent past, results from climate models form the main basis for research and statements on past and possible future global, regional and local climate. Climate models are very complex numerical models that require high-performance computing. However, with the current and future increase in the spatial and temporal resolution of the models, the demand for computing resources and storage space is also growing. Previous working methods and processes no longer hold up and need to be rethought.

Taking the German Climate Computing Centre (DKRZ) as an example, we analysed the users, their goals and their working methods. DKRZ provides the climate science community with resources such as high-performance computing (HPC), data storage and specialised services, and hosts the World Data Center for Climate (WDCC). In analysing users, we distinguish between two groups: those who need the HPC system to run resource-intensive simulations and then analyse them, and those who reuse, build on and analyse existing data; each group subdivides further. We analysed the workflows of each identified user type, found identical parts in abstracted form, and derived Canonical Workflow Modules from them.

In the process, we critically examined the possible use of so-called FAIR Digital Objects (FDOs) and checked to what extent the derived workflows and workflow modules are actually future-proof.

The vision is that the global integrated data space is formed by standardised, independent and persistent entities that contain all information about diverse data objects (data, documents, metadata, software, etc.) so that human and, above all, machine agents can find, access, interpret and reuse (FAIR) them in an efficient and cost-saving way. At the same time, these units become independent of technologies and heterogeneous organisation of data, and will contain a built-in mechanism that supports data sovereignty. This will make the handling of data sustainable and secure.

So, each step in a research workflow can be an FDO. In this case, the research is fully reproducible, but parts can also be exchanged and, e.g., experiments can be varied transparently. FDOs can easily be linked to one another. Data redundancy is minimised, which also reduces susceptibility to errors. FDOs open up the possibility of combining data, software or whole parts of workflows in new, simple and at all times comprehensible ways. FDOs will make an important contribution to the reproducibility of research results, but they are also crucial for saving storage space. There are already data that are FDOs, as well as self-contained frameworks that store data by tracking workflows. Similar to the TCP/IP standard, DO interface protocols have already been developed. However, some open points regarding FDOs are still being worked on and defined in order to make them a globally functioning system.
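As a toy illustration of the idea, the sketch below models a workflow step as a minimal, self-describing persistent object whose provenance is a list of input PIDs; swapping one input and re-running yields a transparent variant of the experiment. The structure is schematic, not a ratified FDO schema, and the handle-style PIDs are placeholders.

    from hashlib import sha256

    def make_fdo(pid, object_type, payload_ref, inputs, metadata):
        # Assemble a minimal FAIR-Digital-Object-like record.
        return {
            "pid": pid,               # globally unique, persistent identifier
            "type": object_type,      # data, software, document, ...
            "payload": payload_ref,   # where the bits actually live
            "checksum": sha256(payload_ref.encode()).hexdigest(),
            "inputs": inputs,         # PIDs of upstream FDOs -> provenance
            "metadata": metadata,
        }

    simulation_output = make_fdo(
        pid="21.99999/sim-0001",                    # placeholder handle
        object_type="data",
        payload_ref="https://example.org/store/tas_day_example.nc",
        inputs=["21.99999/code-0042",      # FDO of the model version used
                "21.99999/forcing-0007"],  # FDO of the forcing data
        metadata={"variable": "tas", "experiment": "historical"},
    )
    # Exchanging one input PID and re-running reproduces a transparent variant
    # of the experiment without duplicating the unchanged parts.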

How to cite: Anders, I., Peters-von Gehlen, K., Thiemann, H., Bergemann, M., Buurman, M., Fast, A., Kadow, C., Kulüke, M., and Wachsmann, F.: Data amounts and reproducibility: How FAIR Digital Objects can revolutionise Research Workflows, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11103, https://doi.org/10.5194/egusphere-egu22-11103, 2022.

14:37–14:44 | EGU22-13317 | Presentation form not yet defined
The critical role of unique identification of samples for the geoanalytical data pipeline
Kerstin Lehnert, Jens Klump, Sarah Ramdeen, Kirsten Elger, and Lesley Wyborn

When researchers collect or create physical samples, they usually assign a user-generated number to each sample. Subsequently, that sample can be submitted to a laboratory for analysis of a variety of analytes. However, as geoanalytical laboratories generate ever-increasing volumes of data, most laboratories have automated workflows, and it is no longer feasible for them to use researcher-supplied sample numbers, particularly as there is no guarantee that these numbers are unique with respect to numbers submitted by other users to the same laboratory. To address this issue, a new, laboratory-generated number may be assigned to the sample.

Moreover, as a single laboratory rarely has the capability to offer all analytical techniques, individual samples tend to move from laboratory to laboratory to acquire the desired suite of analytes, and each laboratory may assign yet another number to the sample. At the conclusion of their project, the researcher may submit the same sample to a museum or institutional repository, where it will be assigned yet another institution-generated number to ensure that all samples are uniquely identified in that repository.

Ultimately, by the time the researcher submits an article to a journal and wants to identify samples in the text or tables, they may have a multitude of locally generated numbers to choose from. None of these locally assigned numbers can be guaranteed to be globally unique. It is also unlikely that any of them will be persistent over the longer term (decades), or be resolvable to the location of the identified resource or to information about it elsewhere on the web (metadata, landing page, related services, etc.).

Globally unique, persistent, resolvable identifiers such as the IGSN play a critical role in the unique identification of geoanalytical samples that pass between systems and organisations: they cannot be duplicated by another researcher, laboratory or sample repository. They persistently link to information about the origin of the sample; to the people and organisations involved in its creation (collector, institution, funder); to the laboratory data and their creation (analyst, laboratory, institution, funder, data software); and to the sample curation phase (curator, repository, funder). They connect the phases of a sample’s path from collection in the field, to lab analysis, to the synthesis/research phase, to publication, to the archive. Globally unique sample identifiers also enable cross-linkages to any artefacts derived from that sample (images, analytical data, other articles). Further, identifiers like the IGSN enable subsamples or sample splits to be linked back to their parent sample, creating a holistic picture of any information derived from the initial sample.

Hence, best practice is clearly to assign the globally unique resolvable identifier to the initial resource. Like a birth certificate, the identifier can be carried through the progressive stages of the research ‘life-cycle’ including laboratory analysis, generation of further data, images, publication, and ultimately curation and preservation. Where any subsamples are derived, they, and any data generated on them, can be linked back to the parent identifier.
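A small sketch of this 'birth certificate' idea: subsamples carry their parent IGSN, so any artefact traces back to the original field sample. The IGSN values and artefact DOIs are placeholders.

    from dataclasses import dataclass, field

    @dataclass
    class Sample:
        igsn: str
        parent_igsn: str | None = None   # None for the original field sample
        artefacts: list = field(default_factory=list)  # data, images, articles

    parent = Sample(igsn="ABC000001")                          # field sample
    split = Sample(igsn="ABC000002", parent_igsn=parent.igsn)  # laboratory split
    split.artefacts += ["doi:10.12345/zircon-u-pb-table",      # analytical data
                        "doi:10.12345/bse-image-set"]          # derived images

    def lineage(sample, registry):
        # Walk subsample -> parent links back to the original field sample.
        chain = [sample.igsn]
        while sample.parent_igsn is not None:
            sample = registry[sample.parent_igsn]
            chain.append(sample.igsn)
        return chain

    registry = {s.igsn: s for s in (parent, split)}
    print(lineage(split, registry))  # ['ABC000002', 'ABC000001']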

How to cite: Lehnert, K., Klump, J., Ramdeen, S., Elger, K., and Wyborn, L.: The critical role of unique identification of samples for the geoanalytical data pipeline, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13317, https://doi.org/10.5194/egusphere-egu22-13317, 2022.

14:44–14:49