Who has got what where? FAIR-ly coordinating multiple levels of geophysical data products over distributed Research Infrastructures (RIs) to meet diverse computational needs and capabilities of users.

Lesley Wyborn; Nigel Rees; Jo Croucher; Hannes Hollmann; Rebecca Farrington; Benjamin Evans; Stephan Thiel; Mark Duffett; Tim Rawling

doi:https://doi.org/10.5194/egusphere-egu24-14052

Session ESSI3.5

EGU24-14052, updated on 09 Mar 2024

EGU General Assembly 2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Lesley Wyborn¹, Nigel Rees¹, Jo Croucher¹, Hannes Hollmann¹, Rebecca Farrington², Benjamin Evans¹, Stephan Thiel³, Mark Duffett⁴, and Tim Rawling²

¹National Computational Infrastructure, Acton, Australian National University, Australia (lesley.wyborn@anu.edu.au)
²AuScope, Parkville, Australia
³CSIRO, Adelaide, Australia
⁴Mineral Resources Tasmania, Hobart, Australia

Modern research data processing pipelines and workflows can have complex lineages. A scientific workflow today is likely to rely on multiple Research Infrastructures (RIs), numerous funding agencies and geographically separate organisations to collect, produce, process, analyse and reanalyse primary and derivative datasets. Workflow components can include:

Shared instruments to acquire the data;
Separate research groups processing/calibrating field data and developing additional derived products;
Multiple repository infrastructures to steward, preserve and provide access to the primary data and resultant products sustainably and persistently; and
Different types of software and compute infrastructures that enable multiple ways to access and process the data and products, including in-situ access, distributed web services and simple file downloads.

In these complex workflows, individual research products can be generated through multiple levels of processing (L0-L4), as raw instrument data is collected by remote instruments (satellites, drones, airborne instruments, shared laboratory and field infrastructures) and is converted into more useful parameters and formats to meet multiple use cases. Each individual level of processing can be undertaken by different research groups using a variety of funding sources and RIs, whilst derivative products could be stored in different repositories around the globe.

An additional complexity is that the volumes and resolution of modern earth and environmental datasets are growing exponentially, and many RIs can no longer store and process the volumes of primary data acquired. Specialised hybrid HPC/Cloud infrastructures with co-located datasets that allow for virtual in situ high-volume data access are emerging. But these petascale/exascale infrastructures are not required for all use cases, and traditional small-volume file downloads of evolved data products and images for local processing are all that many users need.

At the core of many of these complex workflows are the primary, often high-resolution observational datasets, which can be on the order of terabytes to petabytes. Hence, for transparent Open Science and to enable attribution to the funders, collectors and repositories that preserve these valuable data assets, all levels of all derivative data products need to be able to trace their provenance back to these source datasets.

Using examples from the recently completed 2030 Geophysics Data Collection project in Australia (co-funded by AuScope, NCI and ARDC), this paper will show how original, primary field-acquired datasets and their derivative products can be made accessible from multiple distributed RIs and government websites. The datasets and products are connected using the FAIR principles, which ensures that, at a minimum, lineage and prehistory are recorded in provenance statements and linked using metadata elements such as ‘isDerivedFrom’ and DOIs. Judicious use of identifiers such as ORCIDs, RORs and DOIs links data at each level of processing with the relevant researchers, research infrastructures, funders, software developers, software, etc. Integrating HPC centres that are co-located with large-volume, high-resolution data infrastructures into complex and configurable research workflows provides key support for next-generation earth and environmental research and enables new and exciting scientific discoveries.
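As an illustration of the provenance linking described above, the following minimal Python sketch shows how derivative products at different processing levels could be traced back to their primary source datasets by following 'isDerivedFrom' links between DOIs. It is not the project's implementation: the `DataProduct` record, the `trace_to_sources` function, the catalogue contents and every DOI, ROR and ORCID value are hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical metadata record for one data product."""
    doi: str                                # placeholder DOI, not a real identifier
    level: str                              # processing level, "L0" to "L4"
    repository_ror: str                     # placeholder ROR of the hosting repository
    creator_orcids: Tuple[str, ...] = ()    # placeholder ORCIDs of contributors
    is_derived_from: Tuple[str, ...] = ()   # DOIs of the parent products


def trace_to_sources(doi: str, catalogue: Dict[str, DataProduct]) -> List[str]:
    """Follow is_derived_from links and return the DOIs of all primary sources.

    A product with no parents is treated as a primary (source) dataset.
    Every referenced DOI is assumed to be present in the catalogue.
    """
    sources = []
    seen = set()
    stack = [doi]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against accidental cycles
        seen.add(current)
        product = catalogue[current]
        if product.is_derived_from:
            stack.extend(product.is_derived_from)
        else:
            sources.append(current)
    return sorted(sources)


# Hypothetical catalogue spread over two repositories (all values are made up).
PRODUCTS = [
    DataProduct("10.99999/example-l0-timeseries", "L0", "https://ror.org/example-a",
                creator_orcids=("0000-0000-0000-0000",)),
    DataProduct("10.99999/example-l0-survey-b", "L0", "https://ror.org/example-b"),
    DataProduct("10.99999/example-l1-calibrated", "L1", "https://ror.org/example-a",
                is_derived_from=("10.99999/example-l0-timeseries",)),
    DataProduct("10.99999/example-l2-processed", "L2", "https://ror.org/example-a",
                is_derived_from=("10.99999/example-l1-calibrated",)),
    DataProduct("10.99999/example-l3-model", "L3", "https://ror.org/example-b",
                is_derived_from=("10.99999/example-l2-processed",
                                 "10.99999/example-l0-survey-b")),
]
CATALOGUE = {product.doi: product for product in PRODUCTS}

# Trace the hypothetical L3 model back to the primary datasets it depends on.
print(trace_to_sources("10.99999/example-l3-model", CATALOGUE))
```

In this sketch, the L3 product held in one repository resolves to two L0 datasets held in different repositories, which is the kind of cross-infrastructure attribution that persistent identifiers are intended to make possible.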

How to cite: Wyborn, L., Rees, N., Croucher, J., Hollmann, H., Farrington, R., Evans, B., Thiel, S., Duffett, M., and Rawling, T.: Who has got what where? FAIR-ly coordinating multiple levels of geophysical data products over distributed Research Infrastructures (RIs) to meet diverse computational needs and capabilities of users, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-14052, https://doi.org/10.5194/egusphere-egu24-14052, 2024.