- University of Vienna, Department of Meteorology and Geophysics, Meteorology, Austria (markus.rosenberger@univie.ac.at)
Automated image analysis, with a broad spectrum of approaches such as pixel-wise statistical evaluation or machine learning, is increasingly used to cope with growing amounts of data and to assist humans in classifying images. Owing to high monetary and personnel costs, some of these automated methods are even intended to replace human annotators. Fields where such methods can be applied include, for example, medical image analysis and the classification of clouds in the sky. Many studies introducing methods for automated image analysis use human annotations as ground truth; however, assessments of the reliability and accuracy of these annotations are rare.
In our work, we investigate the agreement of human cloud classifications conducted according to the WMO SYNOP coding scheme for operational cloud type observations, in which clouds are classified at every observation into one of ten classes in each of three altitude levels. We base our analysis on three experiments, in which we compare: a) non-simultaneous observations of seven observers at the same weather station in Vienna, b) simultaneous observations at three closely spaced stations over the course of more than 50 years, and c) independent reports of five meteorologists who classified clouds from over 350 ground-based RGB images. Experiments a) and b) are designed to find systematic biases in operational on-site observations of single observers or weather stations, while experiment c) directly targets the subjectivity of human cloud classifications.

Results indicate that human cloud observations, both of single observers at the same station and across different stations, are biased towards specific cloud types, which can only partly be attributed to environmental or meteorological influences. Even for classifications based on exactly the same information, i.e. an identical set of images in experiment c), disagreement was found. The accuracy of single observers is around 55–65% when their reports are compared with a gold-standard ground truth, and inter-observer agreement shows similar values. An accuracy of close to 70% can be reached if the reports of four observers are combined via a majority-voting approach and similar cloud categories are merged during post-processing, as sketched below. It can thus be hypothesized that a fraction of the false classifications is due to confusion of visually similar categories, a consequence of the very complex WMO SYNOP classification scheme. In contrast, when compared with operational human on-site observations, the annotations of all-sky images were correct in only 30–40% of cases. The accuracy of image classifications with respect to the ground truth therefore depends strongly on the data set used.
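As an illustration of the combination step, the following minimal Python sketch applies majority voting to the reports of several observers, optionally merging visually similar categories beforehand. The category labels, the merging table, and the function names are hypothetical examples, not the actual implementation used in the study.

from collections import Counter

# Hypothetical table merging visually similar categories into one
# label; the merging rules actually used in the study may differ.
MERGE = {"Cu hum": "Cu", "Cu med": "Cu", "Cu con": "Cu"}

def majority_vote(reports, merge=None):
    # reports: labels assigned to one image, one per observer
    # merge:   optional dict mapping similar categories to a common label
    if merge is not None:
        reports = [merge.get(r, r) for r in reports]
    counts = Counter(reports).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: treated as an abstention in this sketch
    return counts[0][0]

def accuracy(predictions, truth):
    # Fraction of images whose (voted) label matches the ground truth;
    # abstentions (None) count as misses.
    hits = sum(p == t for p, t in zip(predictions, truth))
    return hits / len(truth)

# Four observers reporting on the same image:
print(majority_vote(["Cu hum", "Cu med", "Sc", "Cu hum"], MERGE))  # -> Cu

Ties are treated as abstentions in this sketch; other tie-breaking rules, e.g. preferring the report of a designated lead observer, are equally conceivable.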
Although the WMO classification scheme is well defined, we conclude that cloud classification is subjective to some extent, e.g. because of clouds occurring in transitional stages. Moreover, if the quality of the ground truth is not assessed in future studies, a reliable determination of the accuracy of a newly presented automated method is impossible, since both the new method and the ground truth could be erroneous.
How to cite: Rosenberger, M., Dorninger, M., and Weissmann, M.: Uncertainties of human SYNOP cloud classifications, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-18898, https://doi.org/10.5194/egusphere-egu26-18898, 2026.