EGU25-2142, updated on 14 Mar 2025
https://doi.org/10.5194/egusphere-egu25-2142
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Poster | Tuesday, 29 Apr, 10:45–12:30 (CEST), Display time Tuesday, 29 Apr, 08:30–12:30 | Hall X4, X4.21
Enhancing Data Provenance in Workflow Management: Integrating FAIR Principles into Autosubmit and SUNSET
Albert Puiggros, Miguel Castrillo, Bruno de Paula Kinoshita, Pierre-Antoine Bretonniere, and Victòria Agudetse

Robust data provenance is essential for transparency, traceability, and reproducibility in climate research. This work presents the integration of FAIR (Findable, Accessible, Interoperable, and Reusable) principles into the workflow management ecosystem through provenance integration in Autosubmit, a workflow manager developed at the Barcelona Supercomputing Center (BSC), and SUNSET (SUbseasoNal to decadal climate forecast post-processing and asSEssmenT suite), an R-based verification workflow also developed at the BSC.

Autosubmit supports the generation of data provenance information based on RO-Crate, facilitating the creation of machine-actionable digital objects that encapsulate detailed metadata about its executions. Autosubmit integrates persistent identifiers (PIDs) and schema.org annotations, making provenance records more accessible and actionable for both humans and machines. However, the provenance metadata that Autosubmit provides through RO-Crate focuses on the workflow process and does not capture the details of the data transformations performed within each job. This is where SUNSET plays a complementary role. SUNSET's approach to provenance is based on the METACLIP (METAdata for CLImate Products) ontologies, which offer a semantic framework for describing climate products and their provenance. This framework enables SUNSET to provide specific, fine-grained provenance metadata for its operations, improving transparency and compliance with FAIR principles. The generated files describe each transformation the data has undergone, together with the data's state, location, structure, and associated source code, all represented in a tree-like structure.

The main contribution of this work is the generation of a comprehensive provenance object by integrating these tools. SUNSET uses Autosubmit to parallelize its data processing tasks, with Autosubmit managing the SUNSET jobs. As part of this process, an RO-Crate describing the overall execution is generated automatically. This object encapsulates detailed provenance metadata for each individual job within the workflow, using METACLIP's semantic framework to represent each SUNSET execution. Certain schema.org entities are introduced to link the RO-Crate created by Autosubmit with the provenance details generated by SUNSET. This integrated approach provides a unified, hierarchical provenance record that spans both the workflow management system and the individual job executions, ensuring that provenance objects are generated automatically for each experiment conducted.
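
As an illustration of the linking idea described above, the following is a minimal sketch, not the authors' implementation, of an RO-Crate 1.1 metadata document in which each job is a schema.org CreateAction whose result points to a per-job provenance file. The abstract does not say which schema.org entities are actually used, so the choice of CreateAction, File, and the "mentions" property is an assumption, as are the function name build_crate, the experiment identifier "a000", the job names, and the file paths.

```python
import json

# Context and profile identifiers defined by the RO-Crate 1.1 specification.
RO_CRATE_CONTEXT = "https://w3id.org/ro/crate/1.1/context"
RO_CRATE_PROFILE = "https://w3id.org/ro/crate/1.1"


def build_crate(experiment_id, jobs):
    """Build an RO-Crate metadata dict for one workflow run.

    jobs is a list of (job_name, provenance_path) pairs. Both the names
    and the paths are hypothetical placeholders for this sketch.
    """
    files = [{"@id": path} for _, path in jobs]
    actions = [{"@id": f"#{name}"} for name, _ in jobs]

    graph = [
        # Metadata descriptor required by RO-Crate.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": RO_CRATE_PROFILE},
            "about": {"@id": "./"},
        },
        # Root dataset: the workflow run as a whole.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": f"Workflow run {experiment_id}",
            "hasPart": files,
            "mentions": actions,  # assumed way to attach actions to the root
        },
    ]

    for name, path in jobs:
        # One File entity per job-level provenance record.
        graph.append(
            {"@id": path, "@type": "File", "name": f"Provenance of {name}"}
        )
        # One CreateAction per job, linked to its provenance file.
        graph.append(
            {
                "@id": f"#{name}",
                "@type": "CreateAction",
                "name": name,
                "result": {"@id": path},
            }
        )

    return {"@context": RO_CRATE_CONTEXT, "@graph": graph}


if __name__ == "__main__":
    crate = build_crate(
        "a000",
        [
            ("a000_job_1", "provenance/job_1.json"),
            ("a000_job_2", "provenance/job_2.json"),
        ],
    )
    print(json.dumps(crate, indent=2))
```

In this sketch the workflow-level record and the job-level records stay in separate files, and the crate only holds references to them, which is one simple way to obtain the hierarchical structure the abstract describes.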

This work demonstrates the practical application of FAIR principles in climate research by advancing provenance tracking within complex workflows. It represents an initial step toward obtaining and sharing metadata about the provenance of the data products that a workflow generates. The integration of RO-Crate and METACLIP not only enhances the reproducibility of climate data products but also fosters greater confidence in their reliability. To our knowledge, this is the first effort in the climate domain to combine different provenance formats into a single object, with the aim of obtaining a complete provenance graph containing all the metadata.

How to cite: Puiggros, A., Castrillo, M., de Paula Kinoshita, B., Bretonniere, P.-A., and Agudetse, V.: Enhancing Data Provenance in Workflow Management: Integrating FAIR Principles into Autosubmit and SUNSET, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-2142, https://doi.org/10.5194/egusphere-egu25-2142, 2025.