Please note that this session was withdrawn and is no longer available in the respective programme. This withdrawal might have been the result of a merge with another session.
SC5.11 | Go with the (work)flow (manager): creating reproducible and scalable data analyses
Convener: Tania Maxwell (ECS) | Co-convener: Lukas Weilguny (ECS)
In data analysis, we often need to rerun analyses, either on the same data (e.g. after updating the methodology) or on new data (e.g. from additional experiments). In some cases this requires re-executing the entire analysis, but often rerunning only certain steps is sufficient to incorporate changes or corrections. Workflow managers are designed to keep track of which parts of a pipeline need to be run again in light of modifications or new data. Additionally, they can be used to parallelize the execution of individual steps, to scale pipelines on HPC systems, and to simplify the provision of reproducible analyses. We will explore different scenarios encountered in geoscience data analysis that could profit from using a workflow manager. Several workflow managers are available; we will present examples using Snakemake, an open-source framework for reproducible and scalable data analysis. Snakemake allows for straightforward automatic parallelization of processes on many platforms, from a local computer to compute clusters. We will focus on how to write efficient Snakemake pipelines for simplified examples of geospatial data analyses that execute a series of interdependent steps.
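
To illustrate the idea of rerunning only the affected steps, here is a minimal Python sketch of timestamp-based staleness checking, in the spirit of make-style tools. It is not Snakemake's implementation or API; the function names (`is_stale`, `run_pipeline`, `backdate`), the file names, and the toy "clean" and "merge" steps are all hypothetical and chosen purely for illustration.

```python
import os
import tempfile
import time
from pathlib import Path


def is_stale(inputs, output):
    """Return True if the output is missing or older than any of its inputs."""
    if not output.exists():
        return True
    out_mtime = output.stat().st_mtime_ns
    return any(p.stat().st_mtime_ns > out_mtime for p in inputs)


def run_pipeline(steps):
    """Run steps in the given order, skipping those whose output is up to date.

    Each step is a tuple (name, inputs, output, action). Steps must be listed
    so that producers come before consumers. Returns the names that were run.
    """
    executed = []
    for name, inputs, output, action in steps:
        if is_stale(inputs, output):
            action(inputs, output)
            executed.append(name)
    return executed


def clean(inputs, output):
    # Toy "cleaning" step: drop empty lines from the single input file.
    lines = [ln for ln in inputs[0].read_text().splitlines() if ln.strip()]
    output.write_text("\n".join(lines) + "\n")


def merge(inputs, output):
    # Toy "merge" step: concatenate all cleaned files.
    output.write_text("".join(p.read_text() for p in inputs))


def backdate(path, seconds):
    """Shift a file's timestamps into the past to keep the demo deterministic."""
    t = time.time() - seconds
    os.utime(path, (t, t))


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        raw_a, raw_b = d / "station_a.txt", d / "station_b.txt"
        clean_a, clean_b = d / "clean_a.txt", d / "clean_b.txt"
        merged = d / "merged.txt"
        raw_a.write_text("1.0\n\n2.0\n")
        raw_b.write_text("3.0\n")
        steps = [
            ("clean_a", [raw_a], clean_a, clean),
            ("clean_b", [raw_b], clean_b, clean),
            ("merge", [clean_a, clean_b], merged, merge),
        ]
        first = run_pipeline(steps)      # nothing exists yet: all steps run
        # Age the files so that the later edit is unambiguously newer.
        for age, path in [(500, raw_a), (500, raw_b),
                          (400, clean_a), (400, clean_b), (300, merged)]:
            backdate(path, age)
        unchanged = run_pipeline(steps)  # nothing changed: no step runs
        raw_b.write_text("3.0\n4.0\n")   # new data arrives for one station
        second = run_pipeline(steps)     # only the affected steps run
        return first, unchanged, second, merged.read_text()


if __name__ == "__main__":
    print(demo())
```

In this sketch, updating `station_b.txt` triggers only `clean_b` and `merge`, while `clean_a` is skipped. A real workflow manager additionally infers the execution order from the declared inputs and outputs and can run independent steps in parallel.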

Comment: This short course will be a live coding session. Previous experience with the command line and with conda environments is recommended but not required. Participants are encouraged to bring their own laptops (with access to a Unix terminal: Linux, macOS, or Windows Subsystem for Linux) to follow along with the course. In advance, we will provide a conda environment for installing all required software, as well as links to a repository hosting the code and data.