EGU22-7151, updated on 28 Mar 2022
https://doi.org/10.5194/egusphere-egu22-7151
EGU General Assembly 2022
© Author(s) 2022. This work is distributed under
the Creative Commons Attribution 4.0 License.

Storage growth mitigation through data analysis ready climate datasets using HDF5 Virtual Datasets

Ezequiel Cimadevilla and Antonio S. Cofiño
  • Meteorology Group, Instituto de Física de Cantabria (IFCA, CSIC-UC), Santander, Spain

Climate datasets are usually provided as separate files to ease dataset management in climate data distribution systems. In ESGF [1] (Earth System Grid Federation), the time series of a variable is split into smaller pieces in order to reduce file size. Although this improves usability for data management in the ESGF distribution system (e.g. file publishing, download, …), it reduces usability for data analysis. Workflows usually need to pre-process and rearrange multiple files into a single data source in order to obtain an analysis-ready dataset, which involves rewriting and duplicating data, with the corresponding storage growth.

Storage growth can be mitigated by creating virtual views, which allow a number of actual datasets to be mapped together into a single multidimensional dataset without rewriting data or consuming additional storage. Given the increasing interest in offering climate researchers suitable single analysis-ready datasets, several mechanisms have been or are being developed to tackle this issue, such as NcML (netCDF Markup Language), xarray/netCDF-4 Multiple File datasets and HDF5 Virtual Datasets [3] (H5VDS). H5VDS provides researchers with different views of interest of a compound dataset without the cost of duplicating information, facilitating data analysis in an easy and transparent way.
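The idea of a virtual view along the time axis can be illustrated with plain index arithmetic. The following is a minimal sketch, not the mechanism used by any of the tools named above: the function names `build_offsets` and `map_selection` and the file lengths are hypothetical, chosen only to show how a selection on the single virtual dataset translates into selections on the underlying files without copying data.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(lengths):
    """Return the global start index of each source, plus the total length."""
    return [0] + list(accumulate(lengths))


def map_selection(offsets, start, stop):
    """Translate the global range [start, stop) into per-source pieces.

    Each piece is (source_index, local_start, local_stop).
    """
    pieces = []
    # Locate the source that contains the first requested index.
    i = bisect_right(offsets, start) - 1
    while start < stop and i < len(offsets) - 1:
        src_begin, src_end = offsets[i], offsets[i + 1]
        upper = min(stop, src_end)
        if upper > start:
            pieces.append((i, start - src_begin, upper - src_begin))
        start = upper
        i += 1
    return pieces


# Hypothetical example: three files with 365, 366 and 365 daily time steps.
offsets = build_offsets([365, 366, 365])
print(map_selection(offsets, 360, 740))
# [(0, 360, 365), (1, 0, 366), (2, 0, 9)]
```

A request for time steps 360 to 739 of the virtual series is served by the tail of the first file, the whole second file and the first nine steps of the third, so no merged copy of the data is needed.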

In the climate community and in ESGF, netCDF is the standard data model and format for climate data exchange. The default storage format of netCDF-4 is HDF5, which brings HDF5 features into the netCDF library, including chunking [2], compression [2], virtual datasets and many other capabilities. H5VDS introduces a new dataset storage type that allows multiple HDF5 (and netCDF-4) datasets to be mapped together into a single sliceable dataset via an interface layer. The datasets can be combined in arbitrary ways, by mapping range selections on the virtual dataset to range selections on the sources. The mapping also allows conversion between different data types and makes it possible to add, remove or modify existing metadata (i.e. dataset attributes), which is commonly an issue when accessing the data.
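A minimal sketch of such a mapping, using the virtual dataset interface of the h5py Python library (`VirtualLayout`, `VirtualSource`, `create_virtual_dataset`). This is an illustration under stated assumptions, not the workflow of this work: the file names, the variable name `tas`, the tiny array sizes and the use of plain HDF5 files instead of real CMIP6 netCDF-4 files are all hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file names and sizes, standing in for per-period source files.
workdir = tempfile.mkdtemp()
sources = [("tas_2000.h5", 3), ("tas_2001.h5", 4)]  # (file name, time steps)
nlat, nlon = 2, 3

# Write two small source files so that the example is self-contained.
value = 0
for name, ntime in sources:
    count = ntime * nlat * nlon
    data = np.arange(value, value + count, dtype="f4").reshape(ntime, nlat, nlon)
    with h5py.File(os.path.join(workdir, name), "w") as f:
        f.create_dataset("tas", data=data)
    value += count

# Describe the virtual dataset: one time axis spanning all source files.
total = sum(ntime for _, ntime in sources)
layout = h5py.VirtualLayout(shape=(total, nlat, nlon), dtype="f4")

offset = 0
for name, ntime in sources:
    vsource = h5py.VirtualSource(
        os.path.join(workdir, name), "tas", shape=(ntime, nlat, nlon)
    )
    # Map a range of the virtual dataset onto the whole source dataset.
    layout[offset:offset + ntime] = vsource
    offset += ntime

# The virtual dataset file stores only the mapping, not a copy of the data.
vds_path = os.path.join(workdir, "tas_vds.h5")
with h5py.File(vds_path, "w", libver="latest") as f:
    f.create_virtual_dataset("tas", layout, fillvalue=np.nan)

# Reading slices the virtual dataset like any regular HDF5 dataset.
with h5py.File(vds_path, "r") as f:
    tas = f["tas"]
    shape, is_virtual = tas.shape, tas.is_virtual
    series = tas[:, 0, 0]  # full time series at one grid point

print(shape, is_virtual)
print(series)
```

The time series read at the end crosses the boundary between the two source files, yet the reading code contains no reference to them.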

In this work, H5VDS features are applied to CMIP6 climate simulation datasets from ESGF in order to provide analysis-ready virtual datasets. Examples with common tools and libraries (e.g. netcdf-c, xarray, nco, cdo, …) illustrate the convenience of the proposed approach. Using H5VDS facilitates data analysis workflows by enabling climate researchers to focus on data analysis rather than on data engineering tasks. In addition, since the H5VDS is created at the storage layer, these datasets are transparent to the netCDF-4 library, and existing applications can benefit from this feature.

References

[1] L. Cinquini et al., “The Earth System Grid Federation: An open infrastructure for access to distributed geospatial data,” Future Generation Computer Systems, vol. 36, pp. 400–417, 2014, doi: 10.1016/j.future.2013.07.002. [Online]. Available: https://linkinghub.elsevier.com/retrieve/pii/S0167739X13001477. [Accessed: 16-Jan-2020]
[2] The HDF Group, “Chunking in HDF5,” 11-Feb-2019. [Online]. Available: https://portal.hdfgroup.org/display/HDF5/Chunking+in+HDF5. [Accessed: 12-Jan-2022]
[3] The HDF Group, “Virtual Dataset VDS,” 06-Apr-2018. [Online]. Available: https://portal.hdfgroup.org/display/HDF5/Virtual+Dataset++-+VDS. [Accessed: 12-Jan-2022]

Acknowledgements

This work has been developed with support from IS-ENES3, which is funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 824084.

How to cite: Cimadevilla, E. and Cofiño, A. S.: Storage growth mitigation through data analysis ready climate datasets using HDF5 Virtual Datasets, EGU General Assembly 2022, Vienna, Austria, 23–27 May 2022, EGU22-7151, https://doi.org/10.5194/egusphere-egu22-7151, 2022.
