Artificial Intelligence (AI) and Machine Learning (ML) applications have attracted a great deal of hype. What does it mean to serve data for AI and ML? EUMETSAT climate reprocessing data records try to meet the following guidelines as far as possible.
ML applications typically combine data from several sources. Training an ML model normally requires a long history of data. Typical environmental ML applications use 1-5 years of historical data, while impact forecasts, for example, often require at least 10 years of history to contain enough samples of extreme weather. ML models are often trained on historical data but applied to near-real-time (NRT) data. Thus, corresponding NRT data should always be available.
The historical data series should be as harmonised as possible. However, the harmonisation does not need to be perfect: small changes in the data do not necessarily affect the performance of an ML model much. Changes in the underlying data should nevertheless be well documented.
Data quality is also a very important aspect, as ML models are only as good as the underlying data. Thus, quality flags should always be available and provided in a way that allows them to be used to filter out bad samples. While providing only good-quality data is a reasonable default, other samples should also be available, as a larger amount of lower-quality data sometimes yields better results than a smaller amount of higher-quality data. Whenever possible, users should also be given the option to access the raw data, since it may open up new ways to apply ML models or to pre-process the data.
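The filtering described above can be sketched as follows. This is a minimal illustration only: the three-level `Quality` scheme, the `Sample` type, the `filter_by_quality` helper and the sample values are all hypothetical and do not describe the flags of any particular EUMETSAT data record.

```python
from dataclasses import dataclass
from enum import IntEnum


class Quality(IntEnum):
    """Hypothetical flag scheme (an assumption): lower value means better quality."""
    GOOD = 0
    DEGRADED = 1
    BAD = 2


@dataclass(frozen=True)
class Sample:
    value: float
    quality: Quality


def filter_by_quality(samples, worst_allowed=Quality.GOOD):
    """Keep samples whose flag is no worse than `worst_allowed`.

    The default keeps only good-quality samples; callers can relax it
    to trade quality for a larger training set.
    """
    return [s for s in samples if s.quality <= worst_allowed]


# Made-up values for illustration, not real observations.
samples = [
    Sample(281.4, Quality.GOOD),
    Sample(279.9, Quality.DEGRADED),
    Sample(-999.0, Quality.BAD),
    Sample(282.1, Quality.GOOD),
]

good_only = filter_by_quality(samples)
relaxed = filter_by_quality(samples, worst_allowed=Quality.DEGRADED)
print(len(good_only), len(relaxed))
```

Keeping the flag next to each sample, rather than dropping flagged samples at the source, lets the user decide where to draw the line.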
Data access should be as fast as possible, and all data should always be served from online data storage. As datasets are almost always combined with each other, data formats should be as well known and widely supported as possible, even if that means a loss of metadata. Typically, it is better to provide metadata alongside the actual data and keep the data as consistent as possible.
Some ML methods, such as Random Forests (RF), are more often used for supervised learning at specific points, while others, e.g. neural networks (NN), are used for images and gridded fields, i.e. tensors. Serving data for point-based applications benefits greatly from an API capable of providing the best representative samples for any given point, so that they are easy to combine with labels. Serving data for grid-based applications, however, benefits from relatively raw interfaces, such as S3, with wide client support. A critical requirement for the interface and the data model is to enable sub-setting and slicing.
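The two access patterns can be illustrated with a small, dependency-free sketch. It assumes a regular latitude/longitude grid held in memory and uses the nearest grid cell as the "representative" sample, which is only one possible choice. The function names and the numbers are hypothetical and do not describe an existing service API.

```python
def nearest_index(coords, value):
    """Index of the coordinate closest to `value`."""
    return min(range(len(coords)), key=lambda i: abs(coords[i] - value))


def sample_at_point(field, lats, lons, lat, lon):
    """Point-based access: the value of the grid cell nearest to (lat, lon)."""
    return field[nearest_index(lats, lat)][nearest_index(lons, lon)]


def subset(field, lats, lons, lat_min, lat_max, lon_min, lon_max):
    """Grid-based access: cut out the cells inside a bounding box (inclusive)."""
    rows = [i for i, la in enumerate(lats) if lat_min <= la <= lat_max]
    cols = [j for j, lo in enumerate(lons) if lon_min <= lo <= lon_max]
    return [[field[i][j] for j in cols] for i in rows]


# Made-up 3 x 4 field on a regular grid, for illustration only.
lats = [60.0, 61.0, 62.0]
lons = [24.0, 25.0, 26.0, 27.0]
field = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]

point_value = sample_at_point(field, lats, lons, lat=60.9, lon=25.2)
tile = subset(field, lats, lons, 61.0, 62.0, 24.0, 25.0)
print(point_value)  # 6.0
print(tile)         # [[5.0, 6.0], [9.0, 10.0]]
```

A point sample of this kind can be joined directly to a label recorded at the same location, whereas the tile keeps its grid structure and can be stacked into a tensor for a neural network.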
Finally, providing well-known and documented reference datasets with ready-made labels would be highly beneficial for ML developers. Such datasets already exist in the general domain, for example the Iris dataset. The meteorological community should publish similar datasets, along with ready-made methods in common libraries for loading them easily.
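As a rough sketch of what such a loading method could look like, the example below loosely follows the way scikit-learn exposes the Iris dataset through `sklearn.datasets.load_iris`: a single call returns features, labels and their names. The `LabelledDataset` type, the `load_toy_frost_dataset` function and all values are hypothetical placeholders, not an existing library or real observations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledDataset:
    data: list[list[float]]   # one row of features per sample
    target: list[int]         # index into target_names per sample
    feature_names: list[str]
    target_names: list[str]
    description: str


def load_toy_frost_dataset() -> LabelledDataset:
    """Hypothetical loader; the values are synthetic placeholders."""
    return LabelledDataset(
        data=[[2.5, 80.0], [-1.0, 95.0], [5.0, 60.0], [-3.5, 90.0]],
        target=[0, 1, 0, 1],
        feature_names=["temperature_c", "relative_humidity_pct"],
        target_names=["no_frost", "frost"],
        description="Synthetic placeholder data for illustration only.",
    )


dataset = load_toy_frost_dataset()
X, y = dataset.data, dataset.target
print(len(X), dataset.feature_names, dataset.target_names)
```

Shipping the feature names, label names and a description together with the arrays keeps the documentation attached to the data the developer actually loads.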
How to cite: Tervo, R. and Grant, M.: Providing AI- and ML-ready data, EMS Annual Meeting 2022, Bonn, Germany, 5–9 Sep 2022, EMS2022-17, https://doi.org/10.5194/ems2022-17, 2022.