EGU24-8058, updated on 08 Mar 2024
https://doi.org/10.5194/egusphere-egu24-8058
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Beyond ESGF – Bringing regional climate model datasets to the cloud on AWS S3 using the Pangeo Forge ETL framework

Lars Buntemeyer
Lars Buntemeyer
  • Helmholtz-Zentrum Hereon, GERICS, Germany (lars.buntemeyer@hereon.de)

The Earth System Grid Federation (ESGF) data nodes are usually the first address for accessing climate model datasets from WCRP-CMIP activities. It is currently hosting different datasets in several projects, e.g., CMIP6, CORDEX, Input4MIPs or Obs4MIPs. Datasets are usually hosted on different data nodes all over the world while data access is managed by any of the ESGF web portals through a web-based GUI or the ESGF Search RESTful API. The ESGF data nodes provide different access methods, e.g., https, OPeNDAP or Globus. 

Beyond ESGF, there has been the Pangeo / ESGF Cloud Data Working Group that coordinates efforts related to storing and cataloging CMIP data in the cloud, e.g., in the Google cloud and in the Amazon Web Services Simple Storage Service (S3) where a large part of the WCRP-CMIP6 ensemble of global climate simulations is now available in analysis-ready cloud-optimized (ARCO) zarr format. The availibility in the cloud has significantly lowered the barrier for users with limited resources and no access to an HPC environment to work with CMIP6 datasets and at the same time increases the chance for reproducibility and reusability of scientific results. 

Following the Pangeo strategy, we have adapted parts of the Pangeo Forge software stack for publishing our regional climate model datasets from the EURO-CORDEX initiative on AWS S3 cloud storage. The main tools involved are Xarray, Dask, Zarr, Intake and the ETL tools of pangeo-forge-recipes. Thanks to similar meta data conventions in comparison to the global CMIP6 datasets, the workflows require only minor adaptations. In this talk, we will show the strategy and workflow implemented and orchestrated in GitHub Actions workflows as well as a demonstration of how to access EURO-CORDEX datasets in the cloud.

How to cite: Buntemeyer, L.: Beyond ESGF – Bringing regional climate model datasets to the cloud on AWS S3 using the Pangeo Forge ETL framework, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-8058, https://doi.org/10.5194/egusphere-egu24-8058, 2024.