EGU24-9781, updated on 08 Mar 2024
https://doi.org/10.5194/egusphere-egu24-9781
EGU General Assembly 2024
© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Optimizing NetCDF performance for cloud computing : exploring a new chunking strategy

Flavien Gouillon1, Cédric Pénard2, Xavier Delaunay2, and Florian Wery1
  • 1Centre National d'Études Spatiales, Toulouse, France
  • 2Thales Services Numériques, Toulouse, France

Owing to the increasing number of satellites and to advances in sensor resolution, the volume of scientific data is growing rapidly. NetCDF (Network Common Data Form) is the community standard for storing such data, so efficient solutions for storing and manipulating files in this format are needed.

Object storage, emerging with cloud infrastructures, offers potential solutions for data storage and parallel access challenges. However, NetCDF may not fully harness this technology without appropriate adjustments and fine-tuning. To optimize computing and storage resource utilization, evaluating NetCDF performance on cloud infrastructures is essential. Additionally, exploring how cloud-developed software solutions contribute to enhanced overall performance for scientific data is crucial.

Offering multiple file versions with data split into chunks tailored for each use case incurs significant storage costs. Thus, we investigate methods to read portions of compressed chunks, creating virtual sub-chunks that can be read independently. A novel approach involves indexing data within NetCDF chunks compressed with deflate, enabling extraction of smaller data portions without reading the entire chunk.
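The idea of indexing a deflate-compressed chunk can be illustrated with a minimal sketch. It is not the authors' library: it uses only Python's standard zlib module, and it assumes the chunk is a plain zlib stream with no other filter (such as shuffle) applied. The function names build_index and read_range, the checkpoint spacing span, and the read size step are all hypothetical. The index is a list of checkpoints, each mapping an uncompressed offset to a compressed offset plus a snapshot of the decompressor state, so that a later read can resume from the nearest checkpoint instead of from the start of the chunk.

```python
import random
import zlib


def build_index(compressed, span=4096, step=512):
    """Scan a zlib stream once and record resumable checkpoints.

    Each entry is (uncompressed_offset, compressed_offset, snapshot), where
    snapshot is a copy of the decompressor state at that position.
    `span` (uncompressed bytes between checkpoints) and `step` (compressed
    bytes fed per call) are illustrative values, not tuned ones.
    """
    d = zlib.decompressobj()
    index = [(0, 0, d.copy())]
    out_pos, next_mark = 0, span
    for in_pos in range(0, len(compressed), step):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + step]))
        if d.eof:
            break
        if out_pos >= next_mark:
            index.append((out_pos, in_pos + step, d.copy()))
            next_mark = out_pos + span
    return index


def read_range(compressed, index, start, length, step=512):
    """Return (data, n_compressed_bytes_read) for an uncompressed byte range."""
    # Pick the last checkpoint at or before the requested start offset.
    u_off, c_off, snap = max(
        (entry for entry in index if entry[0] <= start), key=lambda e: e[0]
    )
    d = snap.copy()  # keep the stored snapshot reusable
    buf = bytearray()
    pos = c_off
    needed = start - u_off + length
    while len(buf) < needed and pos < len(compressed):
        buf += d.decompress(compressed[pos:pos + step])
        pos += step
    skip = start - u_off
    return bytes(buf[skip:skip + length]), min(pos, len(compressed)) - c_off


# Synthetic, moderately compressible stand-in for one chunk (not real data).
rng = random.Random(0)
chunk = bytes(rng.choices(range(16), k=200_000))
compressed = zlib.compress(chunk)
index = build_index(compressed)
piece, n_read = read_range(compressed, index, start=150_000, length=100)
```

In a remote setting, only the compressed bytes from the checkpoint's compressed offset onward would have to be fetched, for example with an HTTP range request. The snapshots used here live in memory only; a persistent index would instead need to store, for each checkpoint, the bit position in the stream and the preceding decompression window, as the zran.c example shipped with zlib does.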

This feature is very valuable in use cases such as pixel drilling or extracting small amounts of data from large files with sizable chunks. It also saves reading time, particularly in scenarios of poor network connection, such as those encountered onboard research vessels.
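Why partial reads matter most on a poor link can be seen from a back-of-the-envelope estimate. The sketch below assumes the simplest possible model, in which each request costs one round-trip latency plus the payload size divided by the bandwidth; it ignores packet loss, protocol overhead, and decompression time. The function name transfer_time_s and every number used are illustrative assumptions, not measurements from this study.

```python
def transfer_time_s(n_bytes, bandwidth_bit_s, latency_s, n_requests=1):
    """Estimated time in seconds to fetch n_bytes over a link.

    Simplified model: n_requests round trips plus payload / bandwidth.
    """
    return n_requests * latency_s + 8 * n_bytes / bandwidth_bit_s


# Hypothetical degraded link: 1 Mbit/s bandwidth, 600 ms latency.
BANDWIDTH = 1_000_000
LATENCY = 0.6

# Hypothetical sizes: a whole 8 MiB compressed chunk versus a 64 KiB portion.
whole_chunk = transfer_time_s(8 * 2**20, BANDWIDTH, LATENCY)
sub_chunk = transfer_time_s(64 * 2**10, BANDWIDTH, LATENCY)
```

Under this model the payload term dominates for large chunks, so shrinking the number of bytes fetched reduces the total time almost proportionally until the fixed latency cost takes over.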

We assess the performance of several libraries across a range of use cases in order to recommend the most suitable and efficient library for reading NetCDF data in each situation.

Our tests access remote NetCDF datasets (two files from the SWOT mission) served over the network by a lighttpd server and an S3 server. We also simulate degraded Internet connections with high latency, packet loss, and limited bandwidth.

We evaluate the performance of four Python libraries (netCDF4, Xarray, h5py, and our chunk indexing library) for reading dataset portions through fsspec or fs_s3. We also compare reading performance for the NetCDF, zarr, and nczarr data formats on an S3 server.

Preliminary findings indicate that the h5py library is the most efficient, while Xarray performs poorly when reading NetCDF files. Furthermore, the NetCDF format performs reasonably well on an S3 server, although less well than the zarr or nczarr formats. However, converting petabytes of archived NetCDF files and adapting numerous software libraries would require considerable effort for a performance gain that stays within the same order of magnitude, which raises questions about the practicality of such an endeavor. The benefits therefore depend strongly on the use case.

How to cite: Gouillon, F., Pénard, C., Delaunay, X., and Wery, F.: Optimizing NetCDF performance for cloud computing : exploring a new chunking strategy, EGU General Assembly 2024, Vienna, Austria, 14–19 Apr 2024, EGU24-9781, https://doi.org/10.5194/egusphere-egu24-9781, 2024.