Storage growth mitigation through data analysis ready climate datasets using HDF5 Virtual Datasets
- Meteorology Group, Instituto de Física de Cantabria (IFCA, CSIC-UC), Santander, Spain
Climate datasets are usually provided as separate files, which facilitates dataset management in climate data distribution systems. In ESGF [1] (Earth System Grid Federation), the time series of a variable is split into smaller pieces in order to reduce file size. Although this improves usability for data management in the ESGF distribution system (e.g. file publishing, download, …), it reduces usability for data analysis. Workflows usually need to pre-process and rearrange multiple files into a single data source in order to obtain a dataset suitable for analysis, which involves rewriting and duplicating data, with the corresponding storage growth.
Storage growth can be mitigated by creating virtual views, which map a number of actual datasets together into a single multidimensional dataset without rewriting data or consuming additional storage. Because of the increasing interest in offering climate researchers single datasets suited to data analysis, several mechanisms have been or are being developed to tackle this issue, such as NcML (netCDF Markup Language), xarray/netCDF-4 Multiple File datasets and HDF5 Virtual Datasets [3] (H5VDS). H5VDS provide researchers with different views of interest of a compound dataset without the cost of duplicating information, facilitating data analysis in an easy and transparent way.
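The idea of a virtual view can be illustrated with a minimal, library-free sketch. It is not how HDF5 implements virtual datasets; the class name `ConcatView` and the sample values are hypothetical, and the sketch covers only one dimension. It shows a global index being translated into a (piece, local index) pair so that the pieces are never copied.

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatView:
    """Read-only view presenting several sequences as one, without copying."""

    def __init__(self, pieces):
        self.pieces = list(pieces)
        # offsets[k] is the global index one past the end of piece k
        self.offsets = list(accumulate(len(p) for p in self.pieces))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Find the piece that holds this global index
        k = bisect_right(self.offsets, index)
        start = self.offsets[k - 1] if k else 0
        return self.pieces[k][index - start]


# Two hypothetical pieces of one time series (made-up values)
jan = [271.3, 272.0, 270.8]
feb = [273.1, 274.5]
series = ConcatView([jan, feb])
```

Here `series[3]` reads the first element of `feb` directly from the original list; only the offsets table is new storage.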
In the climate community and in ESGF, netCDF is the standard data model and format for climate data exchange. The default storage format of netCDF-4 is HDF5, which brings HDF5 features into the netCDF library. These include chunking [2], compression [2], virtual datasets and many other capabilities. H5VDS introduces a new dataset storage type that allows multiple HDF5 (and netCDF-4) datasets to be mapped together into a single sliceable dataset via an interface layer. The datasets can be combined in arbitrary ways, by mapping range selections on the virtual dataset to range selections on the sources. The mapping also allows conversion between different data types and the addition, removal or modification of existing metadata (e.g. dataset attributes), which is often a source of problems when accessing the data.
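The range-to-range mapping described above can be sketched with the h5py library (`VirtualLayout`, `VirtualSource` and `create_virtual_dataset`, available with h5py 2.9 or later and HDF5 1.10 or later). This is an illustrative sketch, not the workflow used in this work: the file names, the variable name `tas`, the array sizes and the `units` attribute are all assumptions, and the source files are plain HDF5 rather than netCDF-4 files with dimension scales.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical layout: three source files, each holding 4 time steps of a
# (time, lat, lon) variable named "tas". Names and sizes are made up.
workdir = tempfile.mkdtemp()
n_files, steps, nlat, nlon = 3, 4, 2, 3
sources = []
for i in range(n_files):
    path = os.path.join(workdir, f"tas_part{i}.h5")
    with h5py.File(path, "w") as f:
        # Fill each piece with its file index so the mapping is easy to check
        f.create_dataset("tas", data=np.full((steps, nlat, nlon), i, dtype="f4"))
    sources.append(path)

# Describe the virtual dataset: full shape, then one mapping per source file
layout = h5py.VirtualLayout(shape=(n_files * steps, nlat, nlon), dtype="f4")
for i, path in enumerate(sources):
    vsource = h5py.VirtualSource(path, "tas", shape=(steps, nlat, nlon))
    # Range on the virtual dataset <- whole extent of the source dataset
    layout[i * steps:(i + 1) * steps, :, :] = vsource

# The virtual file stores only the mapping, not a copy of the data
vds_path = os.path.join(workdir, "tas_virtual.h5")
with h5py.File(vds_path, "w", libver="latest") as f:
    vds = f.create_virtual_dataset("tas", layout, fillvalue=np.nan)
    vds.attrs["units"] = "K"  # metadata set on the view, sources untouched

# Read the view back as if it were an ordinary dataset
with h5py.File(vds_path, "r") as f:
    is_virtual = f["tas"].is_virtual
    full_shape = f["tas"].shape
    data = f["tas"][...]
    first_of_each = data[::steps, 0, 0].tolist()
```

The source paths are recorded in the virtual file, so the source files must remain reachable at those locations when the view is read.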
In this work, H5VDS features are applied to CMIP6 climate simulation datasets from ESGF in order to provide analysis-ready virtual datasets. Examples with common tools and libraries (e.g. netcdf-c, xarray, nco, cdo, …) illustrate the convenience of the proposed approach. Using H5VDS facilitates data analysis workflows by enabling climate researchers to focus on data analysis rather than on data engineering tasks. In addition, since an H5VDS is created at the storage layer, these datasets are transparent to the netCDF-4 library, and existing applications can benefit from this feature.
References
[3] The HDF Group, “Virtual Dataset VDS”, 06-Apr.-2018. [Online]. Available: https://portal.hdfgroup.org/display/HDF5/Virtual+Dataset++-+VDS. [Accessed: 12-Jan.-2022]
Acknowledgements
This work has been developed with support from IS-ENES3, which is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 824084.
How to cite: Cimadevilla, E. and Cofiño, A. S.: Storage growth mitigation through data analysis ready climate datasets using HDF5 Virtual Datasets, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7151, https://doi.org/10.5194/egusphere-egu22-7151, 2022.