Who Done It? Reproducibility of Data Products Also Requires Lineage to Determine Impact and Give Credit Where Credit is Due.
- 1National Computational Infrastructure, Australian National University, Canberra, Australia (lesley.wyborn@anu.edu.au)
- 2National Computational Infrastructure, Australian National University, Canberra, Australia (nigel.rees@anu.edu.au)
- 3Mineral Resources, CSIRO, Kensington, Australia (jens.klump@csiro.au)
- 4National Computational Infrastructure, Australian National University, Canberra, Australia (ben.evans@anu.edu.au)
- 5AuScope Limited, Melbourne, Australia (rebecca@auscope.org.au)
- 6AuScope Limited, Melbourne, Australia (tim@auscope.org.au)
Reproducible research requires full transparency and integrity in data collection (e.g. from observations) or data generation, and in the subsequent processing and analysis that generate research products. However, Earth and environmental science data are growing in complexity, volume and variety, and today, particularly for large-volume Earth observation and geophysics datasets, achieving this transparency is not easy. It is rare for a published data product to be created in a single processing event by a single author or individual research group. Modern research data processing pipelines/workflows can have quite complex lineages: an individual research product is more likely to be generated through multiple levels of processing, starting from raw instrument data at full resolution (L0), followed by successive levels of processing (L1-L4) that progressively convert raw instrument data into more useful parameters and formats. Each level of processing can be undertaken by a different research group using a variety of funding sources, and those involved in the early stages of processing/funding are rarely properly cited.
The lower levels of processing are where observational data essentially remain at full resolution: they are calibrated, georeferenced and processed to sensor units (L1), and geophysical variables are then derived (L2). Historically, particularly where the volumes of the L0-L2 datasets are measured in terabytes to petabytes, processing could only be undertaken by a minority of specialised scientific research groups and data providers, as few had the expertise/resources/infrastructures to process them on premises. Wider availability of colocated data assets and HPC/cloud processing means that the full-resolution, less processed forms of observational data can now be processed remotely in realistic timeframes by multiple researchers to their specific processing requirements. It also enables greater exploration of parameter space, allowing multiple values for the same inputs to be trialled. The advantage is that better-targeted research products can now be rapidly produced. The downside is that far greater care needs to be taken to ensure that there is sufficient machine-readable metadata and provenance information for any user to determine what processing steps and input parameters were used in each part of the lineage of any released dataset/data product, to reference exactly who undertook any part of the acquisition/processing, and to identify sources of funding (including for the instruments/field campaigns that collected the data).
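As a minimal sketch of what such machine-readable provenance might look like for a single processing step, the record below captures inputs, software, parameters, people and funding. The class name, field names and all identifiers are hypothetical placeholders chosen for illustration; they do not follow any particular metadata standard and are not taken from the authors' systems.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProcessingStep:
    """One step in a data product's lineage (hypothetical schema)."""
    output_pid: str      # identifier of the dataset this step produced
    level: str           # processing level, e.g. "L1"
    input_pids: list     # identifiers of the datasets consumed
    software_pid: str    # identifier of the software version used
    parameters: dict     # input parameters chosen for this run
    agent_pids: list     # people or groups who ran the step
    funding_pids: list   # grants that paid for the step


# All identifiers below are made-up placeholders, not real PIDs.
step = ProcessingStep(
    output_pid="doi:10.0000/example-l1",
    level="L1",
    input_pids=["doi:10.0000/example-l0"],
    software_pid="doi:10.0000/example-software-v1",
    parameters={"calibration_table": "v3", "georeference": "EPSG:4326"},
    agent_pids=["orcid:0000-0000-0000-0001"],
    funding_pids=["grant:EXAMPLE-001"],
)

# Serialise to JSON so the record can travel with the released dataset.
record_json = json.dumps(asdict(step), indent=2, sort_keys=True)
print(record_json)
```

Storing one such record per step, each pointing at its inputs by identifier, is what allows the full lineage of a released product to be reassembled later.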
The use of Persistent Identifiers (PIDs) for all component objects (observational data, synthetic data, software, model inputs, people, instruments, grants, organisations, etc.) will be critical. Global and interdisciplinary research teams of the future will rely on software engineers to develop community-driven software environments that aid and enhance the transparency and reproducibility of their scientific workflows and ensure recognition. The advantage of the PID approach is that not only will reproducibility and transparency be enhanced but, through the use of Knowledge Graphs, it will also be possible to trace the input of any researcher at any level of processing, whilst funders will be able to determine the impact of each stage, from raw data capture through to any derivative high-level data product.
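The tracing idea can be illustrated with a toy lineage graph, sketched here under stated assumptions: each dataset is keyed by a placeholder identifier and lists what it was derived from, who produced it and who funded it. The graph contents, the `trace_credit` function and the dictionary layout are all hypothetical; a real knowledge graph would use resolvable PIDs and a graph store rather than an in-memory dictionary.

```python
from collections import deque

# Toy lineage: every identifier here is an invented placeholder.
LINEAGE = {
    "pid:l0-raw": {
        "derived_from": [],
        "agents": ["orcid:field-team"],
        "funders": ["grant:instrument"],
    },
    "pid:l1-calibrated": {
        "derived_from": ["pid:l0-raw"],
        "agents": ["orcid:processing-group"],
        "funders": ["grant:infrastructure"],
    },
    "pid:l2-geophysical": {
        "derived_from": ["pid:l1-calibrated"],
        "agents": ["orcid:processing-group"],
        "funders": ["grant:infrastructure"],
    },
    "pid:l3-product": {
        "derived_from": ["pid:l2-geophysical"],
        "agents": ["orcid:researcher"],
        "funders": ["grant:project"],
    },
}


def trace_credit(product_pid, lineage):
    """Walk upstream from a product and collect every contributor and funder."""
    seen = set()
    agents, funders = set(), set()
    queue = deque([product_pid])
    while queue:
        pid = queue.popleft()
        if pid in seen:
            continue  # a dataset can be reached by more than one path
        seen.add(pid)
        node = lineage[pid]
        agents.update(node["agents"])
        funders.update(node["funders"])
        queue.extend(node["derived_from"])
    return {"datasets": seen, "agents": agents, "funders": funders}


credit = trace_credit("pid:l3-product", LINEAGE)
print(sorted(credit["agents"]))
print(sorted(credit["funders"]))
```

Running the same traversal in the opposite direction, from a raw dataset to everything derived from it, would give a funder the downstream view of impact described above.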
How to cite: Wyborn, L., Rees, N., Klump, J., Evans, B., Farrington, R., and Rawling, T.: Who Done It? Reproducibility of Data Products Also Requires Lineage to Determine Impact and Give Credit Where Credit is Due., EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-12864, https://doi.org/10.5194/egusphere-egu23-12864, 2023.