- 1University of Reading, NCAS, Meteorology, Reading, United Kingdom (valeriu.predoi@ncas.ac.uk)
- 2Universidad de Cantabria, Spain
- 3University of Bonn, Institute of Geosciences (IfGeo), Germany
Programmatic access to remote high-volume multi-dimensional geophysical data was nearly impossible before the advent of high-speed networks and public cloud storage. Even then, data often had to be made "analysis-ready" before such access was possible. Once analysis-ready data is available, however, remote access becomes possible, with only the bytes needed transferred across the network to the client. In many cases such access will be faster and more energy efficient than downloading the entire dataset that contains the relevant variables (or parts of variables). Moreover, even when remote access is not more efficient than downloading data on a case-by-case basis, it may be the only option when the data cannot be cached locally. The notion of analysis-ready data has therefore become very popular, and it has often been understood to mean "made available on an object store in Zarr format". However, the key aspects of analysis-ready data can be delivered via other interfaces and formats, provided the right software stack is available.

Here we present such a stack in the context of how we expect to enable remote access to NetCDF4 data from the upcoming CMIP7 Assessment Fast Track (and other data to be held in the newly upgraded Earth System Grid Federation, ESGF). The new ESGF will expose data via HTTP servers that support remote range-get requests for portions of files, which provides essentially the same remote access capabilities as an object store. The requirements for using such a stack, both for the new ESGF and for object stores, are that (1) the data to be accessed must be appropriately chunked (partitioned into suitably dimensioned hyperslabs), (2) the chunk indices must be efficiently stored, and (3) the reading software must be fully parallelisable using tools such as Dask.
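The range-get mechanism can be illustrated with a minimal stand-alone sketch (this is not ESGF code): a toy HTTP server that honours the standard `Range` request header, and a client that fetches only the bytes it needs, so that a few bytes rather than the whole file cross the network.

```python
# Minimal sketch of HTTP range-get (RFC 7233 byte ranges).
# The server and payload are illustrative stand-ins, not ESGF software;
# real data servers and object stores speak the same protocol.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import Request, urlopen

PAYLOAD = bytes(range(256))  # stand-in for a large NetCDF4 file


class RangeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        rng = self.headers.get("Range")
        if rng and rng.startswith("bytes="):
            # Parse a simple "bytes=start-end" range (a sketch: no
            # suffix or multi-range support).
            start, end = (int(x) for x in rng[len("bytes="):].split("-"))
            body = PAYLOAD[start:end + 1]
            self.send_response(206)  # Partial Content
            self.send_header("Content-Range",
                             f"bytes {start}-{end}/{len(PAYLOAD)}")
        else:
            body = PAYLOAD
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence request logging
        pass


def fetch_range(url, start, end):
    """Fetch only bytes [start, end] of a remote resource."""
    req = Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urlopen(req) as resp:
        return resp.read()


if __name__ == "__main__":
    server = HTTPServer(("127.0.0.1", 0), RangeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/file.nc"
    chunk = fetch_range(url, 16, 23)  # only 8 bytes cross the network
    print(list(chunk))
    server.shutdown()
```

A reading library built on such requests can resolve any dataset read into a handful of small transfers, which is what makes object-store-like access to plain files over HTTP possible.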
If either of the first two criteria is not met, data access can be impossibly slow even for relatively small problems, and if the third is not met, large problems cannot be addressed efficiently. To address the first two issues, we present: `cmip7-repack`, a tool to ensure that key aspects of the CMIP7 data are chunked appropriately; `pyfive`, a pure-Python thread-safe library for reading HDF data performantly in both serial and parallel applications; and a `pyfive`-enabled version of the `h5netcdf` library for facilitating remote and/or parallel data access using the NetCDF4 API. With these tools we are able to show that reformatting data from the NetCDF4 format preferred by modellers into additional formats such as Zarr, and/or maintaining duplicate copies of chunk indices made by tools such as kerchunk, will no longer be necessary for most workloads.
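Why chunking matters can be seen in a minimal pure-Python sketch (not `pyfive`'s actual internals) of how a reader maps a requested hyperslab onto the chunks it must fetch; each needed chunk then becomes one range-get request, using the offset and length recorded for it in the chunk index.

```python
# Sketch: which chunks of a chunked dataset overlap a requested
# hyperslab [start, stop)? Shapes below are illustrative only.
from itertools import product


def chunks_for_hyperslab(shape, chunk_shape, start, stop):
    """Return grid coordinates of every chunk overlapping the
    hyperslab [start, stop) of a dataset with the given shape."""
    grids = [
        range(s // c, (min(e, n) - 1) // c + 1)
        for n, c, s, e in zip(shape, chunk_shape, start, stop)
    ]
    return list(product(*grids))


shape = (1200, 180, 360)      # e.g. 100 years of monthly 1-degree data
chunk_shape = (12, 180, 360)  # one year of maps per chunk
# A two-year read over the full grid touches only 2 of 100 chunks:
needed = chunks_for_hyperslab(shape, chunk_shape, (24, 0, 0), (48, 180, 360))
print(needed)  # → [(2, 0, 0), (3, 0, 0)]
```

With time-series chunking such as `(1200, 1, 1)` instead, the same two-year map read would touch all 64,800 chunks; hence requirement (1), and hence a repacking tool. Requirement (2) follows because the reader must look up each needed chunk's file offset cheaply, without scanning the whole index.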
How to cite: Hassell, D., Predoi, V., Lawrence, B., Cimadevilla, E., and Mühlbauer, K.: Supporting remote access to HDF5 datasets, EGU General Assembly 2026, Vienna, Austria, 3–8 May 2026, EGU26-21900, https://doi.org/10.5194/egusphere-egu26-21900, 2026.