SmartLake: A smart datalake for short and long tail data types

Jens Turowski; Gunnar Pruß; Christian Erikson; Tobias Jaeuthe; Hui Tang

doi:https://doi.org/10.5194/egusphere-egu26-3919

Session ITS1.21/ESSI4.5

EGU26-3919, updated on 07 Apr 2026

EGU General Assembly 2026

© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.

Jens Turowski¹, Gunnar Pruß¹, Christian Erikson¹, Tobias Jaeuthe², and Hui Tang¹

¹GFZ Helmholtz Centre for Geosciences, 4.6 Geomorphology, Potsdam, Germany (jens.turowski@gfz.de)
²PERFACCT GmbH, August-Bebel-Straße 27, 14482 Potsdam, Germany

The geosciences are a data-heavy discipline, and a wide range of data types and formats is commonly used, even within a single sub-discipline or working group. For example, in hydrology or geomorphology, geospatial data (e.g., satellite imagery, maps, sample locations) are routinely paired with time-series data (e.g., discharge or precipitation monitoring) and laboratory-derived data from individual samples (e.g., isotope chemistry of water samples). For some data types (e.g., seismic or satellite remote sensing data), widely used community standards exist that stipulate data formats, file types, and relevant metadata. These are known as short-tail data types. For many other data types, such standards either do not exist at all, or several competing standards are used in parallel. These are known as long-tail data types. As a result, research and monitoring data are often not managed and archived according to the FAIR (findable, accessible, interoperable, reusable) principles, or are lost altogether when researchers move between positions. At the same time, many funding agencies require a data management plan and a commitment to open data principles as early as the proposal stage.

We therefore require a flexible digital infrastructure for data management that (1) can handle the entire data management chain from upload to publication, (2) is modular and scalable, in the sense that it can be set up for individual projects, a workgroup or unit, or entire institutes, (3) is customizable, in the sense that it can be set up for different types of data, environments, and tasks, (4) allows for the automation of data management tasks, and (5) can associate rich metadata with individual data files.

Here, we introduce SmartLake, a datalake application that integrates a storage environment with a modular metadata catalog and a workflow engine. We describe the concept and architecture of SmartLake and demonstrate that it can handle a broad range of data management tasks in a flexible way. The workflow engine allows the integration of customizable workflows that retrieve data and metadata, perform quality checks, file type conversions, and standard analyses, transform the data into the form needed for machine learning, and generate data publications. Once set up, SmartLake can, in principle, handle the entire data management pipeline automatically, thereby minimizing the effort required for data management, metadata enrichment, archiving, and publication.
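
The idea of chaining customizable workflow steps over a data file and its metadata record can be illustrated with a minimal sketch. This is not SmartLake's actual API, which the abstract does not describe: `DataFile`, `Step`, `enrich_metadata`, `make_numeric_check`, `csv_to_json`, and `run_workflow` are hypothetical names chosen for illustration, and the discharge CSV is invented example data.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataFile:
    """Hypothetical stand-in for a stored file plus its catalog metadata."""
    name: str
    content: str
    metadata: dict = field(default_factory=dict)


# A workflow step takes a file and returns the (possibly modified) file.
Step = Callable[[DataFile], DataFile]


def _rows(f: DataFile) -> list:
    # Parse the CSV content into a list of dicts keyed by column name.
    return list(csv.DictReader(io.StringIO(f.content)))


def enrich_metadata(f: DataFile) -> DataFile:
    # Metadata enrichment: record column names and the number of records.
    rows = _rows(f)
    f.metadata["columns"] = list(rows[0].keys()) if rows else []
    f.metadata["n_records"] = len(rows)
    return f


def make_numeric_check(column: str) -> Step:
    # Build a quality-check step for one column (assumed rule: values
    # must parse as floats; a real check would be data-type specific).
    def check(f: DataFile) -> DataFile:
        def is_number(value) -> bool:
            try:
                float(value)
                return True
            except (TypeError, ValueError):
                return False

        ok = all(is_number(row.get(column)) for row in _rows(f))
        f.metadata[f"quality_{column}"] = "passed" if ok else "failed"
        return f

    check.__name__ = f"numeric_check_{column}"
    return check


def csv_to_json(f: DataFile) -> DataFile:
    # File type conversion: rewrite the CSV content as a JSON array.
    f.content = json.dumps(_rows(f))
    f.name = f.name.rsplit(".", 1)[0] + ".json"
    f.metadata["format"] = "json"
    return f


def run_workflow(f: DataFile, steps: list) -> DataFile:
    # Apply the steps in order and log each one as simple provenance.
    for step in steps:
        f = step(f)
        f.metadata.setdefault("provenance", []).append(step.__name__)
    return f


# Invented example data: two daily discharge readings.
raw = DataFile("gauge.csv", "time,discharge\n2024-01-01,3.2\n2024-01-02,3.5\n")
result = run_workflow(
    raw, [enrich_metadata, make_numeric_check("discharge"), csv_to_json]
)
```

In this sketch each step is an ordinary function, so a workflow for a different data type would be assembled by swapping or adding steps rather than by changing the engine, which mirrors the modularity the abstract calls for.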

How to cite: Turowski, J., Pruß, G., Erikson, C., Jaeuthe, T., and Tang, H.: SmartLake: A smart datalake for short and long tail data types, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-3919, https://doi.org/10.5194/egusphere-egu26-3919, 2026.