EGU26-3919, updated on 13 Mar 2026
https://doi.org/10.5194/egusphere-egu26-3919
EGU General Assembly 2026
© Author(s) 2026. This work is distributed under
the Creative Commons Attribution 4.0 License.
Oral | Monday, 04 May, 08:35–08:45 (CEST) | Room -2.31
SmartLake: A smart datalake for short and long tail data types
Jens Turowski¹, Gunnar Pruß¹, Christian Erikson¹, Tobias Jaeuthe², and Hui Tang¹
  • ¹ GFZ Helmholtz Centre for Geosciences, 4.6 Geomorphology, Potsdam, Germany (jens.turowski@gfz.de)
  • ² PERFACCT GmbH, August-Bebel-Straße 27, 14482 Potsdam, Germany

The geosciences are a data-heavy discipline, and a wide range of data types and formats are commonly used, even within the same sub-discipline or working group. For example, in hydrology or geomorphology, geospatial data (e.g., satellite imagery, maps, sample locations) are routinely paired with time-series data (e.g., discharge or precipitation monitoring) and laboratory-derived data from individual samples (e.g., isotope chemistry from water samples). For some data types (e.g., seismic or satellite remote sensing data), widely used community standards exist that stipulate data formats, file types, and relevant metadata. These are known as short-tail data types. For many other data types, however, such standards either do not exist at all or several competing standards are used in parallel. These are known as long-tail data types. As a result, research and monitoring data are often not managed and archived according to the FAIR principles, or are even lost as researchers move between positions. At the same time, many funding agencies require a data management plan and a commitment to open data principles as early as the proposal stage. We therefore require a flexible digital infrastructure for data management that (1) can handle the entire data management chain from upload to publication, (2) is modular and scalable in the sense that it can be set up for individual projects, a working group or unit, or entire institutes, (3) is customizable in the sense that it can be set up for different types of data, environments, and tasks, (4) allows for the automation of data management tasks, and (5) can associate rich metadata with individual data files. Here, we introduce SmartLake, a datalake application that integrates a storage environment with a modular metadata catalog and a workflow engine. We describe the concept and architecture of SmartLake, and demonstrate that it can handle a broad range of data management tasks in a flexible way.
The workflow engine supports customizable workflows that retrieve data and metadata, perform quality checks, file type conversions, and standard analyses, transform the data into the form needed for machine learning, and generate data publications. Once set up, SmartLake can, in principle, automatically handle the entire data management pipeline, thereby minimizing the effort required for data management, metadata enrichment, archiving, and publication.
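
The idea of chaining customizable workflow steps over a data file and its metadata can be illustrated with a minimal Python sketch. This is not SmartLake's actual API, which the abstract does not describe: the names DataRecord, quality_check, convert_to_csv, and run_workflow, as well as the rule of discarding negative values, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DataRecord:
    """Hypothetical container: one data file plus its attached metadata."""
    name: str
    values: List[float]
    metadata: Dict[str, str] = field(default_factory=dict)


# A workflow step takes a record and returns a new, updated record.
Step = Callable[[DataRecord], DataRecord]


def quality_check(record: DataRecord) -> DataRecord:
    """Illustrative check: drop negative values and note how many were removed."""
    kept = [v for v in record.values if v >= 0.0]
    removed = len(record.values) - len(kept)
    meta = dict(record.metadata, removed_values=str(removed))
    return DataRecord(record.name, kept, meta)


def convert_to_csv(record: DataRecord) -> DataRecord:
    """Illustrative conversion: rename the file and record the new format."""
    stem = record.name.rsplit(".", 1)[0]
    meta = dict(record.metadata, format="csv")
    return DataRecord(stem + ".csv", record.values, meta)


def run_workflow(record: DataRecord, steps: List[Step]) -> DataRecord:
    """Apply each step in order and log the step names in the metadata."""
    for step in steps:
        record = step(record)
        history = record.metadata.get("history", "")
        updated = f"{history},{step.__name__}" if history else step.__name__
        record = DataRecord(
            record.name, record.values, dict(record.metadata, history=updated)
        )
    return record


# Example: a short discharge series with one placeholder value (-9999.0).
raw = DataRecord("discharge.txt", [1.2, -9999.0, 3.4], {"variable": "discharge"})
result = run_workflow(raw, [quality_check, convert_to_csv])
print(result.name, result.values, result.metadata)
```

In this sketch each step returns a new record rather than modifying its input, so the original upload stays unchanged and the metadata keeps a simple history of the steps applied. A real workflow engine would additionally need storage access, scheduling, and error handling.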

How to cite: Turowski, J., Pruß, G., Erikson, C., Jaeuthe, T., and Tang, H.: SmartLake: A smart datalake for short and long tail data types, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-3919, https://doi.org/10.5194/egusphere-egu26-3919, 2026.