ESSI2.13 | Data compression and reduction for Earth System Sciences datasets
Co-organized by AS5/CL5/GD10/GI2/NP4
Convener: Clément Bouvier (ECS) | Co-conveners: Karsten Peters-von Gehlen, Juniper Tyree, Oriol Tinto, Sara Faghih-Naini

Recent Earth System Sciences (ESS) datasets, such as those resulting from very high resolution numerical modelling, have increased in both precision and size. These datasets are central to the advancement of ESS for the benefit of all stakeholders, to public policymaking on climate change, and to the performance of modern applications such as Machine Learning (ML) and forecasting.

The storage and shareability of ESS datasets have become an important discussion point in the scientific community. Datasets produced by state-of-the-art applications are becoming so large that even current high-capacity data centres and infrastructures are incapable of storing them, let alone ensuring their usability and processability. The needs of ongoing and upcoming community activities, such as various digital-twin-centred projects or the 7th Phase of the Coupled Model Intercomparison Project (CMIP7), already stretch the abilities of current infrastructures. With future investment in hardware being limited, a viable way forward is to explore the possibilities of data reduction and compression with the needs of stakeholders in mind. Interest in data compression has therefore grown as a way to 1) make data volumes more manageable, 2) reduce data transfer times and resource needs, and 3) do so without reducing the quality of scientific analyses.

Concurrently, replicability is another major concern for ESS and downstream applications. Being able to reproduce the most recent ML and forecasting results, and the analyses thereof, has become essential for developing new methods and integrated workflows for operational settings. However, the data accuracy needed to produce reliable downstream products has not yet been thoroughly investigated. Research on data reduction and prediction interpretability therefore helps to 1) understand the relationship between the datasets and the resulting predictions and 2) increase the stability of those predictions.

This session discusses the latest advances in both data compression and reduction for ESS datasets, focusing on:
1) Approaches and techniques to enhance shareability of high-volume ESS datasets: data compression (lossless and lossy) or reduction approaches.
2) Understanding the effects of reduction and replicability: feature selection, feature fusion, sensitivity to data, active learning.
3) Analyses of the effect of reduced/compressed data on numerical weather prediction and/or machine learning methods.
