- 1. The Cyprus Institute, CARE-C, Nicosia, Cyprus (j.araya@cyi.ac.cy)
- 2. Max Planck Institute for Chemistry, Mainz, Germany
With the advent of machine learning methods and the development of new techniques in data mining, knowledge representation and data extraction, new possibilities have emerged for dealing with imperfect data. In this context, different methods exist for producing synthetic time series, and they vary across goals and disciplines. In certain situations, it can be challenging to obtain the data required to test assumptions about the skill and performance of machine learning models. Synthetic data generation offers a practical alternative by enabling machine learning algorithms to be tested in the absence of real data.
Although data seem ubiquitous these days, a paradox arises where bureaucratic, practical, or technical limitations make it difficult for researchers to obtain the data they need, particularly real measurements (e.g., time series data) for specific purposes.
Our preliminary study features a case in operational meteorology where synthetic data proves particularly useful, addressing challenges associated with limited or inaccessible real measurements. Specifically, we investigate the capability of machine learning algorithms to generate high-quality synthetic time series that can be applied in meteorological data processing and analysis. To achieve this, synthetic datasets were developed based on informed criteria that integrate dynamical features of near-surface temperature data, tailored to the unique geographic and environmental context of Cyprus. These criteria include key characteristics such as trends, extreme values, diurnal cycles and vertical temperature gradients, ensuring a realistic and comprehensive representation of near-surface temperature behavior. This approach facilitates the testing and validation of data-driven models in operational settings, providing a robust framework for evaluating their performance under controlled, yet realistic, conditions.
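To make the listed criteria concrete, the following is a minimal sketch of how a synthetic near-surface temperature series could combine a trend, a seasonal and a diurnal cycle, an extreme event and a vertical gradient. It is not the generator used in the study: the function name `synthetic_temperature`, the two measurement heights (2 m and 10 m) and every numerical parameter are assumptions chosen only for illustration.

```python
import numpy as np

def synthetic_temperature(n_days=365, seed=0):
    """Hourly synthetic temperature (deg C) at two assumed heights, 2 m and 10 m.

    All parameter values are illustrative assumptions. n_days must exceed 5
    so that the 5-day heat wave fits inside the series.
    """
    rng = np.random.default_rng(seed)
    hours = np.arange(n_days * 24)
    days = hours / 24.0
    hour_of_day = hours % 24

    # Slow components: linear trend and annual cycle (peak near day 200, assumed).
    trend = 0.05 * days / 365.0
    seasonal = 9.0 * np.cos(2 * np.pi * (days - 200) / 365.0)

    # Diurnal cycle with its maximum at 14:00 local time (assumed).
    diurnal = 5.0 * np.cos(2 * np.pi * (hour_of_day - 14) / 24.0)

    # Autocorrelated weather noise modelled as an AR(1) process.
    eps = rng.normal(0.0, 0.5, hours.size)
    noise = np.zeros(hours.size)
    for i in range(1, hours.size):
        noise[i] = 0.9 * noise[i - 1] + eps[i]

    # One extreme event: a 5-day heat wave with a smooth onset and decay.
    length = 5 * 24
    start = rng.integers(0, hours.size - length)
    extreme = np.zeros(hours.size)
    extreme[start:start + length] = 6.0 * np.hanning(length)

    t_2m = 19.0 + trend + seasonal + diurnal + noise + extreme

    # Vertical gradient: cooler aloft in the afternoon, weak inversion at night.
    delta = -0.3 * np.cos(2 * np.pi * (hour_of_day - 14) / 24.0)
    t_10m = t_2m + delta
    return hours, t_2m, t_10m

hours, t_2m, t_10m = synthetic_temperature(n_days=60, seed=1)
```

Because every component is generated separately, each one can be switched off or rescaled to test how a data-driven model responds to a specific feature under controlled conditions.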
We characterized the general features of these synthetic datasets and evaluated their utility as benchmarks for data quality control purposes. Our findings underscore the potential value of synthetic datasets in operational meteorology, particularly in supporting the development and evaluation of robust, purpose-specific, machine learning algorithms.
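One way such a benchmark could work is sketched below: known errors are injected into a clean synthetic series, a quality control check is run, and its flags are scored against the known truth. The abstract does not state which checks were benchmarked, so the neighbour-difference spike test, the helper names (`inject_spikes`, `spike_check`, `score`) and all thresholds here are assumptions for illustration only.

```python
import numpy as np

def inject_spikes(series, n_spikes, magnitude, rng):
    """Return a corrupted copy plus a boolean mask of the injected spikes."""
    # Candidate positions are 3 steps apart so that spikes are never adjacent.
    candidates = np.arange(2, series.size - 2, 3)
    idx = rng.choice(candidates, size=n_spikes, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_spikes)
    corrupted = series.copy()
    corrupted[idx] += signs * magnitude
    truth = np.zeros(series.size, dtype=bool)
    truth[idx] = True
    return corrupted, truth

def spike_check(series, threshold):
    """Flag points that jump away from both neighbours in the same direction."""
    d_prev = series[1:-1] - series[:-2]
    d_next = series[1:-1] - series[2:]
    flags = np.zeros(series.size, dtype=bool)
    flags[1:-1] = (
        (np.abs(d_prev) > threshold)
        & (np.abs(d_next) > threshold)
        & (np.sign(d_prev) == np.sign(d_next))
    )
    return flags

def score(flags, truth):
    """Hit rate and false alarm rate of the flags against the known truth."""
    hit_rate = np.sum(flags & truth) / truth.sum()
    false_alarm_rate = np.sum(flags & ~truth) / (~truth).sum()
    return hit_rate, false_alarm_rate

# Clean synthetic series: a diurnal cycle plus small measurement noise (assumed).
rng = np.random.default_rng(42)
hours = np.arange(30 * 24)
clean = 20.0 + 5.0 * np.cos(2 * np.pi * (hours % 24 - 14) / 24.0)
clean = clean + rng.normal(0.0, 0.2, hours.size)

corrupted, truth = inject_spikes(clean, n_spikes=20, magnitude=8.0, rng=rng)
hit_rate, false_alarm_rate = score(spike_check(corrupted, threshold=4.0), truth)
```

Since the positions of the injected errors are known exactly, the same scoring can be repeated for smaller spike magnitudes or other error types to find where a given check starts to fail.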
How to cite: Araya, J., Proestos, Y., and Lelieveld, J.: Use of synthetic time series datasets for quality control of meteorological data, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-5609, https://doi.org/10.5194/egusphere-egu25-5609, 2025.