EGU22-13193
https://doi.org/10.5194/egusphere-egu22-13193
EGU General Assembly 2022
© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

CliMetLab and Pangeo use case: Machine learning data pipeline for sub-seasonal to seasonal prediction (S2S)

Florian Pinault1, Aaron Spring2, Frederic Vitart1, and Baudouin Raoult1
  • 1ECMWF
  • 2Max Planck Institute for Meteorology

As machine learning algorithms are used more and more prominently in meteorology and climate science, the need for reference datasets has been identified as a priority. Moreover, boilerplate code for data handling is ubiquitous in scientific experiments. To focus on science, climate, meteorology, and data scientists need generic and reusable domain-specific tools. To achieve these goals, we used the plugin-based CliMetLab Python package along with many packages listed by Pangeo.


Our use case consists of providing data for machine learning algorithms in the context of the sub-seasonal to seasonal (S2S) prediction challenge 2021. The data comprise about 2 terabytes of model predictions from three different models. We experimented with providing the data in multiple formats: GRIB, NetCDF, and Zarr. A Pangeo recipe (using the Python package pangeo_forge_recipes) was used to generate the Zarr data, relying heavily on xarray and dask for parallelisation. All three versions of the S2S data are stored in an S3 bucket on the ECMWF European Weather Cloud (ECMWF-EWC).
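
The parallelisation mentioned above rests on chunking: a Zarr array is stored as a regular grid of independent chunks, and dask can schedule one task per chunk. The following is a minimal sketch of that chunk grid using only the Python standard library. The array shape, the dimension names, and the chunk sizes are hypothetical and are not the actual S2S layout; the zarr, xarray, and dask APIs are deliberately not used here.

```python
from itertools import product
from math import ceil, prod


def chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk of a regular chunk grid."""
    per_dim = [
        [slice(start, min(start + size, length)) for start in range(0, length, size)]
        for length, size in zip(shape, chunks)
    ]
    yield from product(*per_dim)


def n_chunks(shape, chunks):
    """Number of chunks in the grid (edge chunks may be smaller)."""
    return prod(ceil(length / size) for length, size in zip(shape, chunks))


# Hypothetical layout: (forecast_time, lead_time, latitude, longitude)
shape = (53, 47, 121, 240)
chunks = (1, 47, 121, 240)  # assumption: one chunk per forecast start date

slices = list(chunk_slices(shape, chunks))
print(n_chunks(shape, chunks), "chunks; first:", slices[0])
```

With this layout each forecast start date can be written or read independently, which is the kind of unit a parallel scheduler can distribute across workers.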


CliMetLab aims to provide a simple interface for accessing climate and meteorological datasets: seamlessly downloading and caching data, converting them to xarray datasets or pandas DataFrames, plotting them, and feeding them into machine learning frameworks such as TensorFlow or PyTorch. CliMetLab is open source and still in beta (https://climetlab.readthedocs.io). Its main target platform is Jupyter notebooks. Additionally, a CliMetLab plugin allows dataset-specific code to be shipped along with a well-defined published dataset. Taking advantage of the CliMetLab tools to minimize boilerplate code, a plugin has been developed for the S2S data as a companion Python package of the dataset.
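
The two ideas in this paragraph, loading a dataset by name from plugin-provided code and caching downloads transparently, can be illustrated with a toy sketch. This is not CliMetLab's implementation or API: the registry, the `cached` helper, the `ToyForecasts` class, and the dataset name `toy-forecasts` are all hypothetical, and the "download" is a stand-in that returns bytes locally.

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical registry; CliMetLab's actual plugin discovery is not reproduced here.
DATASETS = {}


def register(name):
    """Class decorator that makes a dataset class loadable by name."""
    def wrap(cls):
        DATASETS[name] = cls
        return cls
    return wrap


def cached(key, fetch, cache_dir):
    """Return the cached file for `key`, calling `fetch` only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / hashlib.sha256(key.encode()).hexdigest()
    if not path.exists():
        path.write_bytes(fetch(key))
    return path


@register("toy-forecasts")  # hypothetical dataset name
class ToyForecasts:
    """Dataset-specific code that a plugin package could ship."""

    def __init__(self, parameter, cache_dir):
        self.parameter = parameter
        self.cache_dir = cache_dir
        self.downloads = 0

    def _fetch(self, key):
        # Stand-in for a real download from remote storage.
        self.downloads += 1
        return f"data for {key}".encode()

    def to_bytes(self):
        return cached(self.parameter, self._fetch, self.cache_dir).read_bytes()


def load_dataset(name, **kwargs):
    """Look up a registered dataset class and instantiate it."""
    return DATASETS[name](**kwargs)


cache_dir = tempfile.mkdtemp()
ds = load_dataset("toy-forecasts", parameter="t2m", cache_dir=cache_dir)
first = ds.to_bytes()
second = ds.to_bytes()  # served from the cache, no second download
print(first == second, ds.downloads)
```

The user-facing call stays a single line, while the dataset-specific details (where the data live, how they are fetched) sit in the registered class, which is the separation a plugin package is meant to provide.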

How to cite: Pinault, F., Spring, A., Vitart, F., and Raoult, B.: CliMetLab and Pangeo use case: Machine learning data pipeline for sub-seasonal to seasonal prediction (S2S), EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-13193, https://doi.org/10.5194/egusphere-egu22-13193, 2022.
