EGU25-13394, updated on 15 Mar 2025
https://doi.org/10.5194/egusphere-egu25-13394
EGU General Assembly 2025
© Author(s) 2025. This work is distributed under
the Creative Commons Attribution 4.0 License.
Challenges and perspectives of climate data compression in times of kilometre-scale models and generative machine learning
Milan Klöwer1, Tim Reichelt1, Juniper Tyree2, Ayoub Fatihi3, and Hauke Schulz4
  • 1University of Oxford
  • 2University of Helsinki
  • 3University of Lausanne
  • 4Danish Meteorological Institute

Climate data compression urgently needs new standards. The continuously growing exascale mountain of data requires compressors that are widely used and supported, essentially hiding the compression details from many users. With AI revolutionising scientific computing, we have to set the rules of this game. Minimising information loss and maximising compression factors, at any resolution, grid and dataset size, for all variables, with chunking and random access, while preserving statistics and derivatives, all at reasonable speed: these demands amount to squaring the compression circle. Many promising compressors are hardly used because trust among domain scientists is hard to gain: the large spectrum of research questions and applications using climate data is very difficult to satisfy simultaneously.

Here, we illustrate the motivation behind ClimateBenchPress, a newly defined climate data compression benchmark designed as a quality check across all these dimensions of the problem. Any benchmark will inevitably undersample this space, but we define datasets from atmosphere, ocean, and land, as well as evaluation metrics to pass. Results are presented as score cards, highlighting the strengths and weaknesses of every compressor.

The bitwise real information content offers a systematic approach when no error bounds are known. For the ERA5 reanalysis, errors are estimated and allow us to categorise many variables into linear, log, and beta distributions, with values unbounded, bounded on one side, or bounded on both sides, respectively. This lets us define error thresholds directly from observation and model errors, providing an alternative to the still predominant subjective choices. Most error-bounded compressors come with parameters that can be chosen automatically following this analysis.
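As an illustration of this last point, here is a minimal sketch (with hypothetical numbers and a simple magnitude-based heuristic, not part of ClimateBenchPress itself) of how an absolute error threshold, e.g. derived from an observation error, could be translated into the number of float32 mantissa bits to keep under round-to-nearest bit rounding, after which an ordinary lossless backend compresses the rounded field:

```python
import zlib
import numpy as np

def keepbits_for_error(data, max_abs_error):
    """Choose how many float32 mantissa bits to keep so that round-to-nearest
    bit rounding stays within max_abs_error (magnitude-based heuristic,
    for illustration only)."""
    scale = float(np.max(np.abs(data)))
    # rounding error is at most scale * 2^-(keepbits+1)
    keepbits = int(np.ceil(np.log2(scale / max_abs_error))) - 1
    return int(np.clip(keepbits, 0, 23))  # float32 has 23 mantissa bits

def bitround(data, keepbits):
    """Round-to-nearest on the float32 mantissa, keeping `keepbits` bits."""
    bits = np.ascontiguousarray(data, dtype=np.float32).view(np.uint32)
    drop = 23 - keepbits
    if drop == 0:
        return data.astype(np.float32)
    half = np.uint32(1 << (drop - 1))                # round to nearest
    mask = np.uint32(0xFFFFFFFF - ((1 << drop) - 1)) # zero the dropped bits
    return ((bits + half) & mask).view(np.float32)

# toy field standing in for an ERA5-like temperature variable (made-up numbers)
rng = np.random.default_rng(0)
temperature = (250 + 30 * rng.random((128, 128))).astype(np.float32)

keepbits = keepbits_for_error(temperature, max_abs_error=0.1)  # e.g. 0.1 K error
rounded = bitround(temperature, keepbits)

raw = zlib.compress(temperature.tobytes(), 9)
lossy = zlib.compress(rounded.tobytes(), 9)
print(keepbits, len(raw) / len(lossy))  # gain from discarding unneeded trailing bits
```

Error-bounded compressors expose similar parameters; the heuristic above merely stands in for the error analysis described in the text.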

New data formats are also on the horizon: chunking and hierarchical data structures both allow and force us to adapt compressors to spatially or length-scale-dependent information densities. Extreme events, perhaps counterintuitively, often increase compressibility through higher uncertainties, but they lie on the edge of or outside the training data of machine-learned compressors. This again increases the need for well-tested compressors. Benchmarks like ClimateBenchPress are required to encourage new standards for safe lossy climate data compression.
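A hypothetical sketch of such chunk-wise adaptation (the chunk size and the variability-based rule are assumptions for illustration, not part of the benchmark): each chunk of a field is quantised with its own step, scaled to its local variability, so that the retained precision follows the local signal rather than a global worst case, before a lossless backend compresses it:

```python
import zlib
import numpy as np

def compress_chunked(field, chunk=32):
    """Compress a 2D float32 field chunk by chunk, choosing a per-chunk
    quantisation step from the chunk's local variability (illustrative rule)."""
    ny, nx = field.shape
    out = []
    for j in range(0, ny, chunk):
        for i in range(0, nx, chunk):
            block = field[j:j + chunk, i:i + chunk]
            # step scales with local variability: roughly constant precision
            # relative to the local signal in every chunk
            step = max(float(block.std()) / 100, 1e-6)
            quantised = np.round(block / step).astype(np.int32)
            out.append((j, i, step, zlib.compress(quantised.tobytes(), 9)))
    return out

rng = np.random.default_rng(1)
field = rng.standard_normal((256, 256)).astype(np.float32)
field[:64, :64] *= 10  # a more variable corner, standing in for an extreme event
compressed = compress_chunked(field)
print(field.nbytes / sum(len(payload) for *_, payload in compressed))
```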

How to cite: Klöwer, M., Reichelt, T., Tyree, J., Fatihi, A., and Schulz, H.: Challenges and perspectives of climate data compression in times of kilometre-scale models and generative machine learning, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-13394, https://doi.org/10.5194/egusphere-egu25-13394, 2025.