- 1NCAS, Department of Meteorology, Reading, UK
- 2CEDA Atmospheric Science, Centre for Environmental Data Analysis (CEDA), UK
The CF (Climate and Forecast) metadata conventions for netCDF datasets describe means of "compression-by-convention", i.e. methods for compressing and decompressing data according to algorithms that are fully described within the conventions themselves. These algorithms, which can be lossless or lossy, are not applicable to arbitrary data; rather, the data must exhibit certain characteristics to make the compression worthwhile, or even possible.
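As an illustration of the idea, the following numpy sketch mimics one of the CF lossless compression-by-convention methods, "compression by gathering": only the unmasked points of an array are stored, alongside a "list" variable recording each stored point's position along the flattened uncompressed axes. (This is an illustrative sketch, not code from cf-python or the conventions themselves.)

```python
import numpy as np

# A 2-d array in which some points are missing
uncompressed = np.ma.masked_invalid(
    [[1.0, np.nan, 3.0],
     [np.nan, 5.0, np.nan]]
)

# Compress: keep only the valid points, plus their flat indices
# (the CF "list" variable)
list_var = np.flatnonzero(~uncompressed.mask)
compressed = uncompressed.compressed()

# Decompress: scatter the stored points back into a masked array
restored = np.ma.masked_all(uncompressed.size)
restored[list_var] = compressed
restored = restored.reshape(uncompressed.shape)

assert (restored == uncompressed).all()
```

The round trip is exact, which is what makes the method lossless; the saving comes from storing only the valid points and a small index variable.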
Aggregation, available in CF-1.13, makes it possible to view, as a single entity, a dataset that has been partitioned across multiple other independent datasets on disk, whilst occupying very little extra disk space, since the aggregation dataset contains no copies of the component datasets' data. Aggregation can facilitate a range of activities such as data analysis, by avoiding the computational expense of deriving the aggregation at the time of analysis; archive curation, by acting as a metadata-rich archive index; and the post-processing of model simulation outputs, by spanning multiple datasets written at run time that together constitute a more cohesive and useful product. CF aggregation currently has cf-python and xarray implementations.
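The essential mechanism can be sketched in a few lines of Python: an aggregation holds only references to its component datasets (stood in here by in-memory arrays) and instructions for how they fit together, and the combined array is constructed only when the data are accessed. The `LazyAggregation` class below is hypothetical, not the cf-python or xarray implementation.

```python
import numpy as np

class LazyAggregation:
    """A view of several data fragments as one array, combined on access."""

    def __init__(self, fragments, axis=0):
        # References to, not copies of, the component data
        self.fragments = fragments
        self.axis = axis

    def __getitem__(self, index):
        # Only at access time are the fragments actually combined
        return np.concatenate(self.fragments, axis=self.axis)[index]

# Two "files", each holding one partition of a daily timeseries
jan = np.arange(31.0)
feb = np.arange(31.0, 59.0)

year = LazyAggregation([jan, feb])
assert year[40] == 40.0        # spans the partition boundary transparently
```

Until `year[...]` is indexed, no combined copy of the data exists, which is why an aggregation dataset on disk can be so small relative to the data it describes.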
The conceptual CF data model recognises neither compression nor aggregation, choosing to view all CF datasets as if they were uncompressed and contained all of their own data. As a result, the cf-python data analysis library, which is built directly on the CF data model, also presents datasets lazily to the user in this manner, without decompressing or re-combining the data in memory until the user actually accesses the data, at which point this occurs automatically. This approach allows the user to interact with their data in an intuitive and efficient manner, and removes the need for the user to assimilate large parts of the CF conventions or to write their own code for dealing with the compression and aggregation algorithms.
We will introduce compression by ragged arrays (as used by Discrete Sampling Geometry features, such as timeseries and trajectories) and dataset aggregation, with cf-python examples to demonstrate the ease of use that arises from the CF data model interpretation of the data.
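To give a flavour of the ragged-array case, the numpy sketch below mimics a CF contiguous ragged array for Discrete Sampling Geometry features: several timeseries of different lengths are packed end-to-end into one 1-d array, with a "count" variable giving each feature's number of samples, and decompression restores the padded-and-masked 2-d (feature, time) view that the CF data model presents. (Illustrative values only; this is not cf-python code.)

```python
import numpy as np

count = np.array([3, 1, 2])   # samples per timeseries (the "count" variable)
ragged = np.array([280., 281., 279., 275., 290., 291.])

# Decompress into a masked (feature, max_length) array
n, m = len(count), count.max()
uncompressed = np.ma.masked_all((n, m))
starts = np.concatenate(([0], np.cumsum(count)[:-1]))
for i, (s, c) in enumerate(zip(starts, count)):
    uncompressed[i, :c] = ragged[s:s + c]

# First feature has 3 samples; shorter features are padded with
# missing values in the uncompressed view
assert uncompressed[0, 2] == 279.0
assert uncompressed[1, 1] is np.ma.masked
```

In cf-python this decompression is performed automatically and lazily when the data are accessed, so the user only ever sees the uncompressed (feature, time) view.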
How to cite: Hassell, D., Bartholomew, S., Lawrence, B., and Westwood, D.: Compression and Aggregation: a CF data model approach, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-20430, https://doi.org/10.5194/egusphere-egu25-20430, 2025.