The TOAR data infrastructure: A generalised database infrastructure for environmental time series
- Forschungszentrum Jülich GmbH, Jülich Supercomputing Centre, 52425 Jülich, Germany
In all areas of research, robust, versatile and high-performance data infrastructures are needed.
The Tropospheric Ozone Assessment Report (TOAR) is a global research project that analyzes the spatial distribution and temporal evolution of ozone in the troposphere and provides surface measurements of ozone and its precursors for assessing the impact of ozone on human health, vegetation, and climate.
These observational data are collected from various environmental agencies and programs, universities, and individual researchers, each with different requirements for data formats, metadata standards, and quality control. Using our infrastructure, we harmonize and quality-control the data before they enter the TOAR database. All data in the database are accessible through open, freely available, and well-documented web services. The TOAR data centre team is committed to the FAIR (findable, accessible, interoperable, reusable) principles and aims to achieve the highest standards with respect to data curation, archival, and re-use. We established a common approach for data ingestion to ensure that data from different sources are handled in a defined and consistent way and that all modifications are recorded in a provenance log. Clear rules define how the submitted metadata are mapped to the metadata schema used by the TOAR database. To harmonize the data quality, we employ automated tests of different granularity, based on statistical methods and heuristics, which assign a score to each data point. These scores are then translated into categorical data quality flags.
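The score-then-flag idea described above can be illustrated with a minimal sketch. It is not the TOAR quality control tool: the two tests (a coarse plausible-range check and a finer robust outlier check), the thresholds, the function names, and the flag names are all assumptions chosen for illustration.

```python
from statistics import median

# All thresholds and flag names below are hypothetical, for illustration only.
PLAUSIBLE_LOW, PLAUSIBLE_HIGH = 0.0, 500.0  # assumed plausible value range
OUTLIER_SCALE = 3.5                          # robust z-score treated as "normal"


def range_score(value):
    """Coarse test: 1.0 inside the assumed plausible range, 0.0 outside."""
    return 1.0 if PLAUSIBLE_LOW <= value <= PLAUSIBLE_HIGH else 0.0


def outlier_scores(values):
    """Finer test: score each point by its robust z-score (median and MAD).

    The score is 1.0 up to OUTLIER_SCALE and falls linearly to 0.0 at
    twice that value.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scores = []
    for v in values:
        if mad == 0:
            z = 0.0 if v == med else float("inf")
        else:
            z = 0.6745 * abs(v - med) / mad
        scores.append(min(1.0, max(0.0, 2.0 - z / OUTLIER_SCALE)))
    return scores


def score_to_flag(score):
    """Translate a numeric score into a categorical quality flag."""
    if score >= 0.8:
        return "OK"
    if score >= 0.4:
        return "questionable"
    return "erroneous"


def quality_flags(values):
    """Combine both tests (the worst score wins) and return one flag per point."""
    combined = [
        min(range_score(v), o) for v, o in zip(values, outlier_scores(values))
    ]
    return [score_to_flag(s) for s in combined]


if __name__ == "__main__":
    sample = [30, 32, 31, 29, 33, 30, 42, 31, 900, 32]
    for value, flag in zip(sample, quality_flags(sample)):
        print(value, flag)
```

Taking the minimum of the individual scores is one simple way to combine tests. A weighted average would be an equally plausible design choice.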
The TOAR data infrastructure has proven that it can handle large amounts of data efficiently in operational use. It not only provides standardized REST API access to the underlying database but also allows additional services to be integrated and linked. For example, the results of the quality control tool mentioned above can be accessed as interactive charts with tables of aggregated figures. Furthermore, we offer analysis services that implement a variety of statistical evaluations and metrics, so that users can obtain aggregated, ready-to-use data in a consistent, reproducible, and interoperable manner. These services also support bulk downloads of raw data. It is also possible to invoke a service that calculates air pollution trends using quantile regression.
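To make the idea of a quantile regression trend concrete, the following is a minimal sketch and not the implementation behind the TOAR trend service. It fits a straight line for a chosen quantile by minimizing the pinball (check) loss. The function names are hypothetical, and the brute-force search is only suitable for short series. It relies on the fact that, for a straight-line fit, an optimal line can always be found among the lines passing through two data points.

```python
def pinball_loss(residual, tau):
    """Asymmetric absolute loss minimized by quantile regression."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def quantile_trend(times, values, tau=0.5):
    """Fit values ~ intercept + slope * times for the quantile tau.

    Brute force: try every line through two data points with distinct
    times and keep the one with the smallest total pinball loss.
    Returns (intercept, slope).
    """
    if len(times) != len(values):
        raise ValueError("times and values must have the same length")
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")

    best = None  # (loss, intercept, slope)
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            if times[i] == times[j]:
                continue  # a vertical line has no defined slope
            slope = (values[j] - values[i]) / (times[j] - times[i])
            intercept = values[i] - slope * times[i]
            loss = sum(
                pinball_loss(v - (intercept + slope * t), tau)
                for t, v in zip(times, values)
            )
            if best is None or loss < best[0]:
                best = (loss, intercept, slope)

    if best is None:
        raise ValueError("need at least two distinct time values")
    return best[1], best[2]


if __name__ == "__main__":
    # Synthetic yearly values with one gross outlier in year 3.
    years = list(range(10))
    series = [1.0 + 0.5 * t for t in years]
    series[3] += 100.0
    intercept, slope = quantile_trend(years, series, tau=0.5)
    print(f"median trend: {slope:.3f} per year (intercept {intercept:.3f})")
```

With `tau=0.5` the fit is a median regression, which is far less sensitive to single outliers than a least-squares fit. Other values, such as `tau=0.95`, describe the trend in the upper part of the distribution. For real workloads, a linear-programming solver or an established statistics package would be the more practical choice.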
During development, we emphasized the reusability of the database infrastructure code. Therefore, we believe that the database layout, the related workflows for data ingestion and processing, and the service architecture can be transferred to other types of environmental data and perhaps even to data from other disciplines.
With the TOAR database infrastructure, the TOAR community gains a modern repository and a system of web services that allow for easy-to-use, fast, flexible, and reproducible analyses of air pollution and associated data.
How to cite: Schröder, S., Selke, N., and Schultz, M. G.: The TOAR data infrastructure: A generalised database infrastructure for environmental time series, EGU General Assembly 2023, Vienna, Austria, 24–28 Apr 2023, EGU23-1848, https://doi.org/10.5194/egusphere-egu23-1848, 2023.