EGU23-7455
https://doi.org/10.5194/egusphere-egu23-7455
EGU General Assembly 2023
© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

An Automated Data Ingestion Workflow for the TOAR Database

Enxhi Kreshpa, Sabine Schröder, Niklas Selke, and Martin Schultz
  • Forschungszentrum Juelich, Juelich Supercomputing Centre, Germany (e.kreshpa@fz-juelich.de)

In recent years, several repositories of curated environmental datasets have been created, giving scientific communities access to large collections of data from various domains. These repositories differ substantially in their level of data harmonisation and FAIRness, technical readiness, and scalability. This restricts opportunities for data exploration and limits scientific analysis with modern data science methods such as machine learning. In the domain of air quality research, we have pioneered a data infrastructure for global observations of surface ozone and other air pollutants that offers rich possibilities for online data analysis. The data in the Tropospheric Ozone Assessment Report (TOAR) database are collected from about 40 different resource providers, ranging from national and international environmental agencies to individual research groups around the world.
One of these data providers is OpenAQ, the world's first open, real-time air quality platform. Because the TOAR database applies higher curation standards, requires harmonised data, and holds enriched metadata, we developed an automated workflow to transfer archived and real-time data from this provider to the TOAR database. The first step is to clean and format all OpenAQ records according to the TOAR database schema and, concurrently, to refine the metadata. The workflow includes data sanity tests and checks whether existing time series and station metadata can be amended or whether new time series or station records must be created. The automation manager triggers the workflow hourly, so that the database provides clean and up-to-date air quality data at any time.
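The steps described above (clean and format, sanity test, then amend or create) can be illustrated with a minimal Python sketch. It is not the TOAR implementation: the record fields (`location`, `parameter`, `value`, `date_utc`, `country`), the in-memory `Store`, the plausibility range, and all function names are hypothetical stand-ins for the real OpenAQ payload and TOAR database schema, and the hourly trigger is assumed to be handled by an external scheduler.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical plausibility range for a measured value; not the TOAR criteria.
VALID_RANGE = (0.0, 500.0)


@dataclass
class Store:
    """In-memory stand-in for the target database (assumption for illustration)."""
    stations: dict = field(default_factory=dict)    # station_id -> metadata dict
    timeseries: dict = field(default_factory=dict)  # (station_id, variable) -> {timestamp: value}


def clean_record(raw):
    """Map a raw provider record to a harmonised one; return None if unusable."""
    try:
        ts = datetime.fromisoformat(raw["date_utc"])
        # Treat naive timestamps as UTC, convert aware ones to UTC.
        ts = ts.replace(tzinfo=timezone.utc) if ts.tzinfo is None else ts.astimezone(timezone.utc)
        return {
            "station_id": str(raw["location"]).strip(),
            "variable": str(raw["parameter"]).strip().lower(),
            "value": float(raw["value"]),
            "timestamp": ts,
            "metadata": {"country": raw.get("country")},
        }
    except (KeyError, TypeError, ValueError):
        return None


def is_sane(record):
    """Sanity test: the value must lie inside the plausibility range (NaN fails)."""
    low, high = VALID_RANGE
    return low <= record["value"] <= high


def ingest(raw_records, store):
    """Clean, test, and store records; report what was created, amended, or rejected."""
    counts = {"new_stations": 0, "new_series": 0, "amended": 0, "rejected": 0}
    for raw in raw_records:
        record = clean_record(raw)
        if record is None or not is_sane(record):
            counts["rejected"] += 1
            continue
        sid = record["station_id"]
        if sid not in store.stations:
            # Unknown station: create a new station record.
            store.stations[sid] = dict(record["metadata"])
            counts["new_stations"] += 1
        else:
            # Known station: only fill in metadata fields that are still missing.
            for key, value in record["metadata"].items():
                if value is not None and store.stations[sid].get(key) is None:
                    store.stations[sid][key] = value
        series_key = (sid, record["variable"])
        if series_key not in store.timeseries:
            # No matching time series yet: create one.
            store.timeseries[series_key] = {}
            counts["new_series"] += 1
        else:
            counts["amended"] += 1
        store.timeseries[series_key][record["timestamp"]] = record["value"]
    return counts
```

A scheduler would call `ingest` once per hour with the latest batch. Running the sanity test before any lookup means that rejected records never create station or time series entries.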
The presentation describes the automated workflow and its design principles, and discusses how such a workflow might be reused in other environmental domains. All TOAR-related code is open source.

How to cite: Kreshpa, E., Schröder, S., Selke, N., and Schultz, M.: An Automated Data Ingestion Workflow for the TOAR Database, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-7455, https://doi.org/10.5194/egusphere-egu23-7455, 2023.