EGU23-12971
https://doi.org/10.5194/egusphere-egu23-12971
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Reproducible quality control of time series data with SaQC

David Schäfer, Bert Palm, Peter Lünenschloß, Lennart Schmidt, and Jan Bumberger
  • Helmholtz Centre for Environmental Research - UFZ

Environmental sensor networks produce ever-growing volumes of time series data with great potential to broaden the understanding of complex spatiotemporal environmental processes. This growth, however, brings its own challenges. In particular, the error-prone nature of sensor data acquisition is likely to introduce disturbances and anomalies into the actual environmental signal. Most applications of such data, whether in data analysis, as input to numerical models, or in modern data science approaches, rely on data that complies with some definition of quality.

To move towards high-standard data products, a thorough assessment of a dataset's quality, i.e., its quality control, is of crucial importance. A common approach when working with time series data is to annotate each observation with a quality label that conveys information such as its reliability. Downstream users and applications can then make informed decisions about whether a dataset as a whole, or at least parts of it, is appropriate for the intended use.
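The per-observation annotation described above can be illustrated with a minimal sketch. It is not SaQC's API: the `Observation` type, the label names `UNFLAGGED` and `BAD`, the function `flag_out_of_range`, and the sample values and thresholds are all hypothetical, chosen only to show how a label travels with each value and lets a downstream user filter on it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical label vocabulary, chosen for illustration only.
UNFLAGGED = "UNFLAGGED"
BAD = "BAD"


@dataclass
class Observation:
    """One time series value together with its quality label."""
    timestamp: datetime
    value: float
    flag: str = UNFLAGGED


def flag_out_of_range(series, lower, upper):
    """Label every observation whose value lies outside [lower, upper] as BAD."""
    for obs in series:
        if not (lower <= obs.value <= upper):
            obs.flag = BAD
    return series


# Made-up readings at 10-minute intervals; two are physically implausible.
start = datetime(2023, 4, 24)
values = [12.1, 12.3, 85.0, 12.2, -40.0]
series = [
    Observation(start + timedelta(minutes=10 * i), v)
    for i, v in enumerate(values)
]

flag_out_of_range(series, lower=-20.0, upper=50.0)

# A downstream user decides which labels are acceptable for the intended use.
usable = [obs.value for obs in series if obs.flag == UNFLAGGED]
```

The raw values are never overwritten; only the label changes, so each user can apply a stricter or looser acceptance rule to the same annotated dataset.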

Unfortunately, quality control of time series data is a non-trivial, time-consuming, and scientifically undervalued endeavor that is often neglected or executed with insufficient rigor. The presented software, the System for automated Quality Control (SaQC), provides all basic and many advanced building blocks to bridge the gap between data that is usually faulty and data that is expected to be correct, in an accessible, consistent, objective, and reproducible way. Its user interfaces address different audiences, ranging from the scientific practitioner with limited access to the possibilities of modern software development to the trained programmer. SaQC delivers a growing set of generic algorithms to detect a multitude of anomalies and to process data using resampling, aggregation, and data modeling techniques. One defining component of SaQC is its innovative approach to storing runtime process information. In combination with a flexible quality annotation mechanism, SaQC allows quality labels to be extended with fine-grained provenance information sufficient to fully reproduce the system's output.
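The idea of attaching provenance to quality labels can be sketched in a few lines. This is an illustration under stated assumptions, not SaQC's implementation or interface: the test functions `flag_range` and `flag_jump`, the `TESTS` registry, the `run` helper, and the record layout are all hypothetical. Each label records which test set it and with which parameters, so the recorded steps can be replayed on the same data.

```python
def flag_range(values, lower, upper):
    """Return True for every value outside [lower, upper]."""
    return [not (lower <= v <= upper) for v in values]


def flag_jump(values, max_step):
    """Return True where the step from the previous value exceeds max_step.

    Deliberately naive: the return from a spike is flagged as well.
    """
    return [
        i > 0 and abs(v - values[i - 1]) > max_step
        for i, v in enumerate(values)
    ]


# Hypothetical registry mapping test names to their implementations.
TESTS = {"flag_range": flag_range, "flag_jump": flag_jump}


def run(values, config):
    """Apply each configured test in order; the first test to fire sets the label."""
    labels = [None] * len(values)
    for step in config:
        hits = TESTS[step["func"]](values, **step["kwargs"])
        for i, hit in enumerate(hits):
            if hit and labels[i] is None:
                # The label carries its own provenance: test name and parameters.
                labels[i] = {
                    "flag": "BAD",
                    "func": step["func"],
                    "kwargs": step["kwargs"],
                }
    return labels


values = [12.1, 12.3, 85.0, 12.2, 19.9]  # made-up readings
config = [
    {"func": "flag_range", "kwargs": {"lower": -20.0, "upper": 50.0}},
    {"func": "flag_jump", "kwargs": {"max_step": 5.0}},
]
labels = run(values, config)

# Recover the executed steps from the labels alone, in order of first appearance.
recorded = []
for label in labels:
    if label is not None:
        step = {"func": label["func"], "kwargs": label["kwargs"]}
        if step not in recorded:
            recorded.append(step)

# Replaying the recovered steps on the same data yields the same labels here.
replayed = run(values, recorded)
```

In this toy setup the order of the recovered steps happens to match the original configuration; a real system would have to store the execution order explicitly rather than infer it from the labels.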

SaQC is proving its usefulness on a daily basis in a range of fully automated data flows for large environmental observatories. We highlight use cases from the TERENO Network, showcasing how reproducible automated quality control can be integrated into real-world, large-scale data processing workflows to provide environmental sensor data in near real-time to data users, stakeholders, and decision-makers.

 

How to cite: Schäfer, D., Palm, B., Lünenschloß, P., Schmidt, L., and Bumberger, J.: Reproducible quality control of time series data with SaQC, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-12971, https://doi.org/10.5194/egusphere-egu23-12971, 2023.