EGU23-13347
https://doi.org/10.5194/egusphere-egu23-13347
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Transparent and reproducible data analysis workflows in Earth System Modelling combining interactive notebooks and semantic data management

Alexander Schlemmer1,3,4 and Sinikka Lennartz2
  • 1Research Group Biomedical Physics, Max Planck Institute for Dynamics and Self-Organization, Göttingen, Germany (alexander.schlemmer@ds.mpg.de)
  • 2Institute for Chemistry and Biology of the Marine Environment, University of Oldenburg, Oldenburg, Germany (sinikka.lennartz@uni-oldenburg.de)
  • 3German Center for Cardiovascular Research (DZHK), Partner Site Göttingen, Germany
  • 4IndiScale GmbH, Göttingen, Germany (a.schlemmer@indiscale.com)

In our project, we employ semantic data management with the open-source research data management system (RDMS) CaosDB [1] to link empirical data and simulation output from Earth System Models [2]. Managing these data structures together allows us to perform complex queries and facilitates the integration of data and metadata into data analysis workflows.

One particular challenge in analysing model output is keeping track of all necessary metadata for each simulation throughout the entire digital workflow. For open science approaches in particular, it is essential to document, in both human- and computer-readable form, all the information needed to fully reproduce the results obtained. Furthermore, we want to feed all relevant data from the analysis back into our data management system, so that complex queries can also be run on data sets and parameters that originate from data analysis workflows.

A specific aim of this project is to re-analyse existing sets of simulations under different research questions. Without proper documentation in an RDMS, this endeavour can become very time-consuming.

We implemented a workflow that combines semantic research data management in CaosDB with Jupyter notebooks and keeps track of the data loaded into an analysis workspace. The workflow provides procedures that create snapshots of specific states of the analysis. These snapshots can be interpreted automatically by the CaosDB crawler, which inserts and updates records in the system accordingly. The snapshots include links to the input data, parameter information, the source code and the results, and therefore provide a high-level interface to the full chain of data processing, from empirical and simulated raw data to the results. For example, input parameters of complex Earth System Models can be extracted automatically and related to model performance. Our use case supports interactive approaches as well as automated analyses.
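To make the snapshot idea concrete, the following is a minimal sketch in Python using only the standard library. It is not the implementation used in the project and does not use the CaosDB API. The class `AnalysisWorkspace`, its methods, the JSON layout and the file names are all hypothetical and chosen for illustration. The sketch records each input file with a checksum as it is loaded and writes a snapshot file containing inputs, parameters, source reference and results, which a crawler could then pick up.

```python
import hashlib
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


class AnalysisWorkspace:
    """Hypothetical helper that records what an analysis loads and produces."""

    def __init__(self, source_file: str):
        self.source_file = source_file  # e.g. the name of the notebook
        self.inputs = []       # one entry per loaded input file
        self.parameters = {}   # analysis parameters set by the user
        self.results = {}      # derived quantities to be stored

    def load(self, path) -> bytes:
        """Read an input file and remember its path and SHA-256 checksum."""
        path = Path(path)
        data = path.read_bytes()
        self.inputs.append(
            {"path": str(path), "sha256": hashlib.sha256(data).hexdigest()}
        )
        return data

    def snapshot(self, target) -> dict:
        """Write the current analysis state to a JSON file (assumed layout)."""
        record = {
            "created": datetime.now(timezone.utc).isoformat(),
            "source_file": self.source_file,
            "inputs": self.inputs,
            "parameters": self.parameters,
            "results": self.results,
        }
        Path(target).write_text(json.dumps(record, indent=2))
        return record


def demo() -> dict:
    """Run a tiny end-to-end example in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for simulation output; the content is made up.
        data_file = Path(tmp) / "model_output.csv"
        data_file.write_text("time,value\n0,1.0\n1,3.0\n")

        ws = AnalysisWorkspace(source_file="analysis.ipynb")
        raw = ws.load(data_file).decode()
        values = [float(line.split(",")[1]) for line in raw.splitlines()[1:]]

        ws.parameters["statistic"] = "mean"
        ws.results["mean_value"] = sum(values) / len(values)

        snapshot_file = Path(tmp) / "snapshot.json"
        ws.snapshot(snapshot_file)
        # Read the snapshot back, as a crawler would.
        return json.loads(snapshot_file.read_text())


if __name__ == "__main__":
    print(json.dumps(demo(), indent=2))
```

Storing a checksum alongside each input path is one possible way to tie a snapshot to the exact version of the data that was analysed. Mapping such a file onto records in the data management system would be the task of the crawler and is not shown here.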

  • [1] Fitschen, T.; Schlemmer, A.; Hornung, D.; tom Wörden, H.; Parlitz, U.; Luther, S. CaosDB—Research Data Management for Complex, Changing, and Automated Research Workflows. Data 2019, 4, 83. https://doi.org/10.3390/data4020083
  • [2] Schlemmer, A., Merder, J., Dittmar, T., Feudel, U., Blasius, B., Luther, S., Parlitz, U., Freund, J., and Lennartz, S. T.: Implementing semantic data management for bridging empirical and simulative approaches in marine biogeochemistry, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-11766, https://doi.org/10.5194/egusphere-egu22-11766, 2022.

How to cite: Schlemmer, A. and Lennartz, S.: Transparent and reproducible data analysis workflows in Earth System Modelling combining interactive notebooks and semantic data management, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-13347, https://doi.org/10.5194/egusphere-egu23-13347, 2023.