Open science is commonly associated with open access publications and FAIR (findable, accessible, interoperable and reusable) data. Open source code is increasingly considered an essential component of open science, too. However, even if all these ingredients are openly available, it is often impossible to reproduce the graphs in a paper from the data and code provided. Which script was run on which part of the data to generate a given plot? Which version of a cited database was used, and which query extracted the presented data points? Moreover, even the most basic steps of a scientific analysis, i.e. the derivations of mathematical equations, are often not traceable. Have you ever come across the famous “it follows that”, where what follows contains variables that were not present in the preceding equations?
Here I present part of a hydrology course based on a framework designed to address many of the above challenges. The framework is built on the open-source RENKU platform and deployed in a JupyterHub instance at https://renkulab.io. RENKU tracks datasets and their versions, and records executions of code together with their input and output files. This produces a knowledge graph of the entire project, which allows the user to re-run exactly those steps needed to update the affected results whenever a data or code file changes. RENKULAB uses Docker to reproduce the computational environment needed to re-execute the analysis. This greatly facilitates collaborative research and learning, as collaborators and students no longer need to recreate the computational environment on their local systems. Integration of GitLab in RENKULAB facilitates student feedback and collaborative problem solving through issue tracking, where students can gain points by submitting meaningful issues and helping others.
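The idea of recording executions and re-running only the affected steps can be illustrated with a minimal Python sketch. This is not RENKU's implementation or API: the step names, the file paths and the helper `stale_steps` are hypothetical, and the recorded steps are assumed to be listed in the order in which they were originally executed.

```python
from collections import deque

# Hypothetical record of executed steps (not RENKU's data model):
# each step lists the files it read and the files it wrote.
# Steps are assumed to be stored in their original execution order.
steps = {
    "clean": {"inputs": ["raw/discharge.csv", "clean.py"],
              "outputs": ["data/clean.csv"]},
    "fit": {"inputs": ["data/clean.csv", "fit.py"],
            "outputs": ["results/params.json"]},
    "plot": {"inputs": ["results/params.json", "data/clean.csv", "plot.py"],
             "outputs": ["figures/hydrograph.png"]},
    "summary": {"inputs": ["raw/rain.csv", "summary.py"],
                "outputs": ["results/rain_summary.csv"]},
}


def stale_steps(steps, changed_file):
    """Return the steps to re-run after changed_file was modified.

    A step is stale if it reads the changed file or any output of
    another stale step. The result keeps the recorded execution order.
    """
    stale = set()
    queue = deque([changed_file])
    while queue:
        current = queue.popleft()
        for name, record in steps.items():
            if current in record["inputs"] and name not in stale:
                stale.add(name)
                # Outputs of a stale step are outdated as well.
                queue.extend(record["outputs"])
    return [name for name in steps if name in stale]


if __name__ == "__main__":
    for changed in ("raw/discharge.csv", "plot.py", "raw/rain.csv"):
        print(changed, "->", stale_steps(steps, changed))
```

In this sketch, a change to the raw discharge data invalidates the cleaning, fitting and plotting steps, whereas a change to the plotting script alone only requires the plot to be regenerated.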
The course also uses an open source package for mathematical derivations (ESSM, https://essm.readthedocs.org), which is based on the Python package SymPy. It facilitates clear definitions of variables, including their dimensions and units, and of dimensionally consistent fundamental equations. These can then be used to deduce derived equations by automatically solving systems of equations for unknown variables, and by applying derivatives, integrals and the many other mathematical operations available in SymPy. The package combines graphical depiction of equations, as seen in papers, with computational reproducibility of derivations and transparent re-use of equations in numerical code.
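A traceable derivation of this kind might look like the following minimal sketch. It uses plain SymPy rather than ESSM's own variable and equation classes, so it carries no units or dimension checks. The catchment water balance, the runoff coefficient and all symbol names are assumptions chosen for illustration, not material taken from the course.

```python
import sympy as sp

# Illustrative symbols (assumed for this sketch), e.g. all in mm per year:
# P: precipitation, ET: evapotranspiration, Q: runoff,
# dS: change in storage, C: runoff coefficient (dimensionless).
P, ET, Q, dS, C = sp.symbols("P ET Q dS C", positive=True)

# Fundamental equations: catchment water balance and the definition
# of the runoff coefficient.
eq_balance = sp.Eq(P, ET + Q + dS)
eq_runoff = sp.Eq(C, Q / P)

# Derived equations: solve the system for the unknowns ET and Q,
# so that every symbol in the result can be traced to an input equation.
solution = sp.solve([eq_balance, eq_runoff], [ET, Q], dict=True)[0]
eq_ET = sp.Eq(ET, solution[ET])

# Further symbolic operations, here the sensitivity of ET to P.
dET_dP = sp.diff(solution[ET], P)

# Transparent re-use of the derived expression in numerical code.
et_numeric = sp.lambdify((P, C, dS), solution[ET])

if __name__ == "__main__":
    print(sp.latex(eq_ET))  # LaTeX form, as it would appear in a paper
    print(dET_dP)
    print(et_numeric(1000.0, 0.4, 50.0))
```

Because the derived equation is produced by `sp.solve` from the stated fundamental equations, a reader can check where each symbol comes from, and the same expression is reused for the numerical evaluation instead of being retyped by hand.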
By employing open science approaches from the start, students become naturally accustomed to reproducible research and can use the skills they learn in any professional environment, as they are not bound to proprietary software for which their future employers and collaborators may or may not have purchased licenses.