EGU24-18724, updated on 11 Mar 2024
https://doi.org/10.5194/egusphere-egu24-18724
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Reproducible Workflows and Compute Environments for Reusable Datasets, Simulations and Research Software

Alan Correa, Anil Yildiz, and Julia Kowalski
  • RWTH Aachen University, Chair of Methods for Model-based Development in Computational Engineering, Mechanical Engineering, Germany (alan.correa@rwth-aachen.de)

The pursuit of reproducibility in research has long been emphasized. It is even more critical in geohazards research and practice, where model-based decision-making must be transparent to be trusted. However, making process-based or machine learning workflows reproducible takes time and effort, often involves manual steps, and sometimes depends on resources that are not available. Moreover, the diversity of modern compute environments, in both hardware and software, significantly hinders reproducibility. While many researchers focus on reproducibility, we argue that reusability holds greater value and inherently requires reproducibility. Reusable datasets and simulations allow for transparent and reliable decision support, analysis, and benchmarking studies. Reusable research software fosters composition and faster development of complex projects while avoiding the reinvention of complicated data structures and algorithms.

Establishing reproducible workflows and compute environments is vital to enable and ensure reusability. Reproducible workflows are crucial for individual use, while both reproducible compute environments and workflows are essential for broader accessibility and reuse by others. We present challenges faced in establishing reproducible workflows and compute environments, along with solution strategies and recommendations drawn from two projects in geohazards research. We discuss an object-oriented approach to simulation workflows, automated metadata extraction and data upload, and unique identification of datasets (assets) and simulation workflows (processes) through cryptographic hashes. To establish reproducible computational environments, we investigate essential factors such as software versioning and dependency management, reproducibility across the diverse hardware used by researchers, and time to first reproduction/reuse (TTFR). Finally, we explore the landscape of reproducibility in compute environments, covering language-agnostic package managers, containers, and language-specific package managers that support binary dependencies.
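As an illustration of identifying assets and processes through cryptographic hashes, the following is a minimal Python sketch and not the implementation used in the two projects. It assumes SHA-256 as the hash function and a canonical JSON encoding of the process description; the names `asset_id_from_bytes`, `asset_id_from_file`, and `Process`, as well as the example parameters, are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass, field


def asset_id_from_bytes(data: bytes) -> str:
    """Identify an asset (dataset) by the SHA-256 digest of its content."""
    return hashlib.sha256(data).hexdigest()


def asset_id_from_file(path, chunk_size=1 << 20) -> str:
    """Same identifier as above, streamed so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class Process:
    """Hypothetical description of one simulation workflow (a process)."""
    name: str
    software_version: str
    parameters: dict = field(default_factory=dict)
    input_asset_ids: tuple = ()

    def identifier(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) so that the same
        # description always serialises to the same bytes before hashing.
        payload = json.dumps(
            {
                "name": self.name,
                "software_version": self.software_version,
                "parameters": self.parameters,
                "input_asset_ids": list(self.input_asset_ids),
            },
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Illustrative values only; not taken from the projects described above.
    terrain_id = asset_id_from_bytes(b"example terrain data")
    run = Process(
        name="example_simulation",
        software_version="1.0.0",
        parameters={"friction": 0.2, "resolution_m": 5},
        input_asset_ids=(terrain_id,),
    )
    print("asset id:  ", terrain_id)
    print("process id:", run.identifier())
```

Under these assumptions, identical content always yields the same asset identifier, and any change to the software version, parameters, or input assets yields a different process identifier, so a stored result can be traced back to exactly what produced it.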

How to cite: Correa, A., Yildiz, A., and Kowalski, J.: Reproducible Workflows and Compute Environments for Reusable Datasets, Simulations and Research Software, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-18724, https://doi.org/10.5194/egusphere-egu24-18724, 2024.